# Data Acquisition with Web Scraping

In [55]:
import requests
from bs4 import BeautifulSoup 
import pandas as pd

First, make the request. The response body is the page's raw HTML, returned as one long string.

In [56]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d
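
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, not part of the original notebook: `fetch_html` and its `get` parameter are hypothetical names introduced here, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body as text, raising if the server reports an error.

    `get` defaults to requests.get; it is a parameter only so the function
    can be exercised without touching the network (a hypothetical convenience).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so an error page is never parsed as if it were the real content.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://web-scraping-demo.zgulde.net/news')` would then stand in for the `requests.get(...).text` pair above.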

We can make more sense of that HTML by parsing it with the Beautiful Soup library.

In [57]:
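# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')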
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>News Example Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">News!</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid gap-y-12">
<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">establish last d
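
Once the document is parsed, tags can be reached as attributes of the soup object. A minimal sketch, assuming a small inline snippet (`sample_html` is a stand-in modeled on the page above, not the live response):

```python
from bs4 import BeautifulSoup

# A cut-down stand-in for the news page, so the example runs offline.
sample_html = """
<html><head><title>News Example Page</title></head>
<body><h1 class="my-5 text-4xl text-center">News!</h1></body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.text        # text inside <title>
heading = sample_soup.h1.text              # text inside the first <h1>
heading_classes = sample_soup.h1["class"]  # class is returned as a list of names

print(page_title, heading, heading_classes)
```

Attribute access such as `sample_soup.h1` returns only the first matching tag, which is why `select` is the better fit when a page repeats the same structure many times.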

From here we can switch back and forth between the browser and Python, trying out different selectors to pull out different parts of the HTML document.

Google Chrome's developer tools help here: right-click an element on the page and choose "Inspect" to see its tag and classes, then use those to build a CSS selector for soup.select.

In [60]:
articles = soup.select('.grid.grid-cols-4.gap-x-4.border')
articles

[<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">establish last difference</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1993-10-06 </p>
 <p class="text-right">By Ronald Owen </p>
 </div>
 <p>Writer institution drop remember remember experience particular. Election child try suggest yeah right.
 Push ground indeed to radio himself about test. Person time reduce list down sense half. Kid military everyone trip participant whether while. Society church sister rest stock.</p>
 </div>
 </div>,
 <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">until bank put</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1976-05-20 </p>
 <p class="text
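
The selector above chains several class names with no spaces between them, which matches only elements carrying all of those classes. A space between parts means "descendant of" instead. A small sketch of the difference, assuming a simplified inline snippet (`sample_html` is illustrative, not the real page):

```python
from bs4 import BeautifulSoup

# Simplified stand-in: an outer grid wrapping two article cards.
sample_html = """
<div class="grid gap-y-12">
  <div class="grid grid-cols-4 border"><h2>first</h2></div>
  <div class="grid grid-cols-4 border"><h2>second</h2></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Chained classes: the element must have every listed class.
cards = sample_soup.select(".grid.grid-cols-4.border")

# A single class is less specific and also matches the outer wrapper.
all_grids = sample_soup.select(".grid")

# A space selects descendants: every <h2> inside a .grid-cols-4 element.
titles = [h2.text for h2 in sample_soup.select(".grid-cols-4 h2")]

print(len(cards), len(all_grids), titles)
```

Picking the most specific combination of classes that still matches every article keeps the wrapper and unrelated elements out of the result.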

In [59]:
articles[0].select('.italic')[0].select('p')

[<p> 1993-10-06 </p>, <p class="text-right">By Ronald Owen </p>]
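
The two paragraphs come back with surrounding whitespace and a leading "By ". A hedged sketch of how they might be unpacked and tidied, using the markup shown above as an inline sample (the cleanup steps are a suggestion, not something the original notebook does):

```python
from bs4 import BeautifulSoup

# The byline block from the first article, copied from the output above.
sample_html = """
<div class="grid grid-cols-2 italic">
  <p> 1993-10-06 </p>
  <p class="text-right">By Ronald Owen </p>
</div>
"""
byline = BeautifulSoup(sample_html, "html.parser").select_one(".italic")

# The block holds exactly two <p> tags, so they unpack into two names.
date_tag, author_tag = byline.select("p")

date = date_tag.text.strip()
author = author_tag.text.strip()
# Drop the "By " prefix if it is present.
if author.startswith("By "):
    author = author[len("By "):]

print(date, author)
```

Unpacking into two names also acts as a check: if a byline ever has more or fewer than two paragraphs, the assignment raises a `ValueError` instead of silently mislabeling the data.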

Bringing it all together:

In [5]:
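def process_article(article):
    # Select from the article passed in; selecting from articles[0] here
    # would repeat the first article's date and author on every row.
    date, author = article.select('.italic')[0].select('p')
    return {
        'title': article.h2.text,
        'date': date.text,
        'author': author.text
    }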

pd.DataFrame([process_article(article) for article in articles])

Unnamed: 0,title,date,author
0,address gun different,2020-01-13,By Curtis Watkins
1,image candidate window,2020-01-13,By Curtis Watkins
2,necessary part sense,2020-01-13,By Curtis Watkins
3,major example know,2020-01-13,By Curtis Watkins
4,the both poor,2020-01-13,By Curtis Watkins
5,laugh life spring,2020-01-13,By Curtis Watkins
6,up mention house,2020-01-13,By Curtis Watkins
7,onto grow theory,2020-01-13,By Curtis Watkins
8,wide agree card,2020-01-13,By Curtis Watkins
9,fact successful sense,2020-01-13,By Curtis Watkins
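
Scraped values arrive as strings, so a date column usually needs converting before it can be sorted or filtered. A minimal sketch, assuming two title and date pairs taken from the article markup shown earlier (the `news` and `oldest_first` names are hypothetical):

```python
import pandas as pd

# Titles and dates as they appear in the scraped markup, whitespace included.
rows = [
    {"title": "establish last difference", "date": " 1993-10-06 "},
    {"title": "until bank put", "date": " 1976-05-20 "},
]
news = pd.DataFrame(rows)

# Strip the padding, then parse the strings into real datetime values.
news["date"] = pd.to_datetime(news["date"].str.strip())

# With a datetime column, chronological sorting works as expected.
oldest_first = news.sort_values("date").reset_index(drop=True)
print(oldest_first)
```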


In [69]:
response = requests.get('https://web-scraping-demo.zgulde.net/people')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>Example People Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">People</h1>\n\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n\n<div id="people" class="grid grid-cols-2 gap-x-12 gap-y-16">\n    \n    <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">\n    

In [70]:
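# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')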
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Example People Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">People</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Christopher Wal

In [102]:
people = soup.select(".person")

In [103]:
people

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Christopher Wallace</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Quality-focused needs-based moderator"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">haleywashington@monroe.info</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">3800387110</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 31426 Park Rapids Apt. 728 <br/>
                 Howellburgh, KY 06158
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-80

In [122]:
person = people[-1]  # work with a single person (here, the last one on the page)
person

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Louis Mcclure</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Optimized foreground model"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">estanton@yahoo.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">+1-636-192-4893x21604</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                266 Gordon Ridge <br/>
                Taylormouth, SD 55717
            </p>
</div>
</div>

In [105]:
people[0].select('p')

[<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Quality-focused needs-based moderator"
         </p>,
 <p class="email col-span-8">haleywashington@monroe.info</p>,
 <p class="phone col-span-8">3800387110</p>,
 <p class="col-span-8">
                 31426 Park Rapids Apt. 728 <br/>
                 Howellburgh, KY 06158
             </p>]

In [120]:
person.select('.quote')[0].text.strip()

'"Optimized foreground model"'

In [110]:
person.select('.email')[0].text

'estanton@yahoo.com'

In [111]:
person.select('.phone')[0].text

'+1-636-192-4893x21604'

In [118]:
person.select('.address')[0].text.strip()

'266 Gordon Ridge \n                Taylormouth, SD 55717'
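
Indexing `select(...)[0]` raises an `IndexError` when an element is missing, and the address keeps its internal newline and indentation. One possible alternative, sketched with a hypothetical helper named `text_or_none` and an inline sample based on the markup above:

```python
from bs4 import BeautifulSoup


def text_or_none(tag, selector):
    """Return the cleaned text of the first match, or None if nothing matches."""
    found = tag.select_one(selector)  # select_one returns None when there is no match
    if found is None:
        return None
    # get_text(" ", strip=True) strips each text fragment and joins them with one space.
    return found.get_text(" ", strip=True)


# Trimmed-down copy of one person card, so the example runs offline.
sample_html = """
<div class="person">
  <h2 class="name">Louis Mcclure</h2>
  <p class="email">estanton@yahoo.com</p>
  <div class="address"><i></i><p>
      266 Gordon Ridge <br/>
      Taylormouth, SD 55717
  </p></div>
</div>
"""
sample_person = BeautifulSoup(sample_html, "html.parser").select_one(".person")

email = text_or_none(sample_person, ".email")
address = text_or_none(sample_person, ".address")
fax = text_or_none(sample_person, ".fax")  # no such element in the sample

print(email, address, fax)
```

Whether a missing field should become `None` or stop the scrape with an error depends on how much the downstream analysis relies on that field.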

In [139]:
def process_person(user):
    # Pull each field out of a single person card by its class name.
    return {
        'name' : user.h2.text,
        'quote': user.select('.quote')[0].text.strip(),
        'email': user.select('.email')[0].text,
        'phone': user.select('.phone')[0].text,
        'address' : user.select('.address')[0].text.strip()
    }

users = pd.DataFrame([process_person(user) for user in people])

In [140]:
users

Unnamed: 0,name,quote,email,phone,address
0,Christopher Wallace,"""Quality-focused needs-based moderator""",haleywashington@monroe.info,3800387110,31426 Park Rapids Apt. 728 \n H...
1,Carol Parker,"""Right-sized exuding system engine""",herbert23@fletcher.org,134.342.9906,039 Austin Ridge \n Lake Susanb...
2,Katie Gibson,"""Robust optimizing algorithm""",melissa16@kelly.org,+1-038-709-1027,1194 Joshua Field \n Crystalfur...
3,Doris Nguyen,"""Business-focused empowering focus group""",rclark@gmail.com,+1-021-346-7502x991,7305 Jessica Grove \n New Johnm...
4,William Hammond,"""Organic intangible benchmark""",avalenzuela@gmail.com,(295)405-0944,85318 Robert Oval Apt. 759 \n E...
5,Michael Kennedy,"""Ergonomic mobile workforce""",mcaldwell@hotmail.com,001-243-130-6679x167,4639 Erin Stravenue Apt. 252 \n ...
6,Stacie Marshall,"""Compatible transitional infrastructure""",wendyryan@pratt-rogers.com,+1-942-833-6208x2780,41160 Thompson Trace \n East St...
7,Jeremy Rivas,"""Cross-platform coherent array""",rodneybryant@mcdonald.info,001-318-525-0950x9564,6106 Carrie Mission Suite 275 \n ...
8,Brenda Cannon,"""Synchronized interactive superstructure""",corey67@hotmail.com,(533)401-8612x4843,79654 Jennifer Oval Suite 501 \n ...
9,Louis Mcclure,"""Optimized foreground model""",estanton@yahoo.com,+1-636-192-4893x21604,266 Gordon Ridge \n Taylormouth...


In [142]:
# Collapse the newline and the indentation that follows it into a single space.
users['address'] = users.address.str.replace(r'\s+', ' ', regex=True)

In [143]:
users

Unnamed: 0,name,quote,email,phone,address
0,Christopher Wallace,"""Quality-focused needs-based moderator""",haleywashington@monroe.info,3800387110,31426 Park Rapids Apt. 728 How...
1,Carol Parker,"""Right-sized exuding system engine""",herbert23@fletcher.org,134.342.9906,039 Austin Ridge Lake Susanbur...
2,Katie Gibson,"""Robust optimizing algorithm""",melissa16@kelly.org,+1-038-709-1027,"1194 Joshua Field Crystalfurt,..."
3,Doris Nguyen,"""Business-focused empowering focus group""",rclark@gmail.com,+1-021-346-7502x991,7305 Jessica Grove New Johnmou...
4,William Hammond,"""Organic intangible benchmark""",avalenzuela@gmail.com,(295)405-0944,85318 Robert Oval Apt. 759 Eli...
5,Michael Kennedy,"""Ergonomic mobile workforce""",mcaldwell@hotmail.com,001-243-130-6679x167,4639 Erin Stravenue Apt. 252 E...
6,Stacie Marshall,"""Compatible transitional infrastructure""",wendyryan@pratt-rogers.com,+1-942-833-6208x2780,41160 Thompson Trace East Step...
7,Jeremy Rivas,"""Cross-platform coherent array""",rodneybryant@mcdonald.info,001-318-525-0950x9564,6106 Carrie Mission Suite 275 ...
8,Brenda Cannon,"""Synchronized interactive superstructure""",corey67@hotmail.com,(533)401-8612x4843,79654 Jennifer Oval Suite 501 ...
9,Louis Mcclure,"""Optimized foreground model""",estanton@yahoo.com,+1-636-192-4893x21604,"266 Gordon Ridge Taylormouth, ..."
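
The rendered table collapses runs of spaces, so it can hide whitespace that is still present in the data. A short sketch of the difference between removing only the newline and collapsing all whitespace, using the address string shown earlier (`newline_removed` and `collapsed` are illustrative names):

```python
import pandas as pd

# One raw address as scraped: a newline followed by indentation.
addresses = pd.Series(["266 Gordon Ridge \n                Taylormouth, SD 55717"])

# Removing only the newline leaves the run of indentation spaces behind.
newline_removed = addresses.replace("\\n", "", regex=True)

# Replacing every run of whitespace with one space gives a clean single line.
collapsed = addresses.str.replace(r"\s+", " ", regex=True).str.strip()

print(repr(newline_removed[0]))
print(repr(collapsed[0]))
```

Printing with `repr` makes the leftover spaces visible, which the table display does not.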


In [147]:
pd.options.display.max_colwidth = 100

In [148]:
users

Unnamed: 0,name,quote,email,phone,address
0,Christopher Wallace,"""Quality-focused needs-based moderator""",haleywashington@monroe.info,3800387110,"31426 Park Rapids Apt. 728 Howellburgh, KY 06158"
1,Carol Parker,"""Right-sized exuding system engine""",herbert23@fletcher.org,134.342.9906,"039 Austin Ridge Lake Susanburgh, MT 83927"
2,Katie Gibson,"""Robust optimizing algorithm""",melissa16@kelly.org,+1-038-709-1027,"1194 Joshua Field Crystalfurt, VA 82581"
3,Doris Nguyen,"""Business-focused empowering focus group""",rclark@gmail.com,+1-021-346-7502x991,"7305 Jessica Grove New Johnmouth, NJ 57777"
4,William Hammond,"""Organic intangible benchmark""",avalenzuela@gmail.com,(295)405-0944,"85318 Robert Oval Apt. 759 Elizabethmouth, WI 49510"
5,Michael Kennedy,"""Ergonomic mobile workforce""",mcaldwell@hotmail.com,001-243-130-6679x167,"4639 Erin Stravenue Apt. 252 East Michaelside, WA 26730"
6,Stacie Marshall,"""Compatible transitional infrastructure""",wendyryan@pratt-rogers.com,+1-942-833-6208x2780,"41160 Thompson Trace East Stephenville, CT 52306"
7,Jeremy Rivas,"""Cross-platform coherent array""",rodneybryant@mcdonald.info,001-318-525-0950x9564,"6106 Carrie Mission Suite 275 Jameschester, HI 69132"
8,Brenda Cannon,"""Synchronized interactive superstructure""",corey67@hotmail.com,(533)401-8612x4843,"79654 Jennifer Oval Suite 501 Williamport, IN 69061"
9,Louis Mcclure,"""Optimized foreground model""",estanton@yahoo.com,+1-636-192-4893x21604,"266 Gordon Ridge Taylormouth, SD 55717"
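
Assigning to `pd.options.display.max_colwidth` changes the setting for the rest of the session. Where that is not wanted, `pd.option_context` applies it only inside a `with` block. A minimal sketch, assuming a one-row frame built from an address shown above:

```python
import pandas as pd

frame = pd.DataFrame(
    {"address": ["4639 Erin Stravenue Apt. 252 East Michaelside, WA 26730"]}
)

before = pd.get_option("display.max_colwidth")

# A narrow limit truncates the long string in the printed frame.
with pd.option_context("display.max_colwidth", 20):
    narrow = repr(frame)

# A wide limit shows the full address.
with pd.option_context("display.max_colwidth", 100):
    wide = repr(frame)

# Outside the with blocks the original setting is back in effect.
after = pd.get_option("display.max_colwidth")

print(narrow)
print(wide)
```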
