# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

If the `bs4` import fails, install Beautiful Soup: `$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: returns all elements with the class `class`
- `soup.select_one('.class')`: returns the first element with the class `class`
- `soup.h2`: returns the first `h2` element
- `soup.find_all('h2')`: returns all elements with the tag name `h2`
- `soup('h2')`: shorthand for `soup.find_all('h2')`
- `soup.find('h2')`: returns the first matching element, or `None` if nothing matches
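
As a quick illustration of the methods above, here is a minimal sketch run against a small hand-written HTML string. The markup and the `card` and `note` class names are made up for this example and are not taken from the demo site.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (hypothetical, not from the demo site).
html = """
<div class="card"><h2>first</h2><p class="note">a</p></div>
<div class="card"><h2>second</h2><p class="note">b</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

cards = soup.select('.card')           # every element with class "card"
first_note = soup.select_one('.note')  # first element with class "note"

print(len(cards))                             # 2
print(first_note.text)                        # a
print(soup.h2.text)                           # first
print([h.text for h in soup.find_all('h2')])  # ['first', 'second']
print(soup('h2') == soup.find_all('h2'))      # True: calling soup is find_all
```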

First, make the request. The response body is the page's raw HTML as one long string.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d

We can make more sense of that HTML with the Beautiful Soup library.

In [9]:
# name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python and try out different ways of getting different parts of the HTML document.

We can leverage Google Chrome's developer tools by right-clicking an element and choosing "Inspect". We can then use this HTML document inspector to find the tags and classes to target when scraping.

In [4]:
# Use beautifulsoup methods to extract necessary content from an article

In [20]:
articles = soup.select('.grid-cols-4')
articles

[<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">protect receive base</h2>
 <div class="grid grid-cols-2 italic">
 <p> 2010-09-06 </p>
 <p class="text-right">By Dean Collins </p>
 </div>
 <p>Painting wrong our although skin affect. Want task eat television million.
 Exactly find experience.</p>
 </div>
 </div>,
 <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">create soon year</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1983-03-02 </p>
 <p class="text-right">By Diane Gibson </p>
 </div>
 <p>Or quite born debate.
 Arrive plan turn decision professor. Either network bed follow try. Spend whether machine force without. Current case a

In [75]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">protect receive base</h2>
<div class="grid grid-cols-2 italic">
<p> 2010-09-06 </p>
<p class="text-right">By Dean Collins </p>
</div>
<p>Painting wrong our although skin affect. Want task eat television million.
Exactly find experience.</p>
</div>
</div>

Extract each field from one article, then bring it all together in a function.

In [38]:
headline = article.h2.text
headline

'protect receive base'

In [39]:
date = article.p.text.strip()
date

'2010-09-06'

In [43]:
author = article.select('.text-right')[0].text.strip()[3:]
author

'Dean Collins'

In [76]:
article.select('p')[1]

<p class="text-right">By Dean Collins </p>

In [77]:
content = article.select('p')[-1].text
content

'Painting wrong our although skin affect. Want task eat television million.\nExactly find experience.'

In [None]:
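def parse_news(article):
    # the headline is the article's h2
    headline = article.h2.text
    # the first p holds the date, padded with whitespace
    date = article.p.text.strip()
    # the byline reads "By <name>"; [3:] drops the leading "By "
    author = article.select('.text-right')[0].text.strip()[3:]
    # the last p holds the article body
    content = article.select('p')[-1].text

    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }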

In [None]:
parse_news(articles[0])

In [None]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])
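
To check the parsing logic without a network request, the sketch below runs a variant of `parse_news` on a trimmed copy of the first article's markup shown earlier. The name `parse_news_checked` and the `startswith` guard are illustrative assumptions; the field selectors are the same ones used above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first article's markup from the output above.
article_html = '''
<div class="grid grid-cols-4 gap-x-4 border rounded">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">protect receive base</h2>
<div class="grid grid-cols-2 italic">
<p> 2010-09-06 </p>
<p class="text-right">By Dean Collins </p>
</div>
<p>Painting wrong our although skin affect. Want task eat television million.
Exactly find experience.</p>
</div>
</div>
'''

def parse_news_checked(article):
    """Hypothetical variant of parse_news: drops 'By ' only when it is present."""
    byline = article.select_one('.text-right').text.strip()
    author = byline[3:] if byline.startswith('By ') else byline
    return {
        'headline': article.h2.text,
        'date': article.p.text.strip(),
        'author': author,
        'content': article.select('p')[-1].text,
    }

article = BeautifulSoup(article_html, 'html.parser').select_one('.grid-cols-4')
result = parse_news_checked(article)
print(result['headline'])  # protect receive base
print(result['date'])      # 2010-09-06
print(result['author'])    # Dean Collins
```

The guard matters only if some byline lacks the "By " prefix; on the markup shown here both versions return the same values.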

## Scraping People

In [26]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [27]:
people = soup.select('.person')
people

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Jason Gomez</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Versatile contextually-based framework"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">derek94@yahoo.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">+1-032-098-0412x8960</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 293 Martin Rapid Suite 236 <br/>
                 North Sarahchester, AR 26950
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-80

In [31]:
person = people[0]
person

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Jason Gomez</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Versatile contextually-based framework"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">derek94@yahoo.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">+1-032-098-0412x8960</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                293 Martin Rapid Suite 236 <br/>
                North Sarahchester, AR 26950
            </p>
</div>
</div>

In [32]:
name = person.h2.text
name

'Jason Gomez'

In [57]:
quote = person.select('.quote')[0].text.split()
quote = ' '.join(quote)[1:-1]
quote

'Versatile contextually-based framework'

In [60]:
email = person('p')[1].text
email

'derek94@yahoo.com'

In [62]:
phone = person.select('.phone')[0].text
phone

'+1-032-098-0412x8960'

In [69]:
address = person('p')[-1].text.split()
address = ' '.join(address)
address

'293 Martin Rapid Suite 236 North Sarahchester, AR 26950'

In [70]:
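def parse_person(person):
    name = person.h2.text
    # collapse the whitespace around the quote, then drop the surrounding quote marks
    quote = person.select('.quote')[0].text.split()
    quote = ' '.join(quote)[1:-1]
    # the second p in the card is the email
    email = person('p')[1].text
    phone = person.select('.phone')[0].text
    # the last p is the address; split/join collapses the line break and indentation
    address = person('p')[-1].text.split()
    address = ' '.join(address)

    return {
        'name': name, 'quote': quote, 'email': email,
        'phone': phone,
        'address': address
    }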

In [74]:
# loop through all the people; use a new name so the `people` list is not overwritten
people_df = pd.DataFrame([parse_person(person) for person in people])
people_df

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Jason Gomez | Versatile contextually-based framework | derek94@yahoo.com | +1-032-098-0412x8960 | 293 Martin Rapid Suite 236 North Sarahchester,... |
| 1 | Cathy Choi | Customizable uniform application | amandamartin@hall.com | +1-042-632-0547x9986 | 4216 Thomas Grove Apt. 340 Paulahaven, ND 26102 |
| 2 | Jennifer Ross | Horizontal methodical hardware | jim72@yahoo.com | 6542922892 | 2516 Jonathan Hills Suite 516 Brandonton, TN 3... |
| 3 | Jimmy Meyers | Assimilated 4thgeneration approach | sarah94@yahoo.com | +1-460-915-3671x944 | 0206 Marquez Village Apt. 809 West Sabrina, IN... |
| 4 | Ian Williams | Monitored executive productivity | aaronjones@fleming.com | 001-662-445-7898 | 4206 Monroe Gardens Apt. 989 Rebeccahaven, IN ... |
| 5 | David Cole | Cross-group zero tolerance Graphic Interface | crystal85@nguyen-gutierrez.com | 271-696-6270x5531 | 2669 Barry Walks Jennychester, NH 48030 |
| 6 | Carlos Miller | Managed executive time-frame | careyjames@evans-melton.com | +1-981-102-3959x6180 | 74020 Foster Lake Suite 187 Wuton, IA 75474 |
| 7 | Adam Dunn | Face-to-face bandwidth-monitored synergy | cindyespinoza@gmail.com | 6551697726 | 20589 Mitchell Hill West Gerald, OR 62265 |
| 8 | Terry Hurst | Future-proofed tertiary knowledgebase | williamkrause@jacobs.com | 576-013-3854x45724 | 53745 Young Crossroad Apt. 137 Thompsonshire, ... |
| 9 | Angela Sparks | Public-key asynchronous alliance | joshua40@griffith.com | 748.604.2839x9975 | 96700 Lauren Mountain Suite 792 Lake Yvonne, K... |
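
Positional lookups such as `person('p')[1]` depend on the order of the `p` tags. Since the markup shown earlier also carries `name`, `email`, `phone`, and `address` classes, a class-based variant is possible. The sketch below is one such alternative, run offline on a trimmed copy of the first person's markup; the function name `parse_person_by_class` is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first person's markup from the output above.
person_html = '''
<div class="person border rounded px-3 py-5 grid grid-cols-2">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Jason Gomez</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Versatile contextually-based framework"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">derek94@yahoo.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">+1-032-098-0412x8960</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                293 Martin Rapid Suite 236 <br/>
                North Sarahchester, AR 26950
            </p>
</div>
</div>
'''

def parse_person_by_class(person):
    """Hypothetical variant of parse_person that selects every field by class."""
    quote = ' '.join(person.select_one('.quote').text.split()).strip('"')
    address = ' '.join(person.select_one('.address p').text.split())
    return {
        'name': person.select_one('.name').text,
        'quote': quote,
        'email': person.select_one('.email').text,
        'phone': person.select_one('.phone').text,
        'address': address,
    }

person = BeautifulSoup(person_html, 'html.parser').select_one('.person')
parsed = parse_person_by_class(person)
print(parsed['email'])    # derek94@yahoo.com
print(parsed['address'])  # 293 Martin Rapid Suite 236 North Sarahchester, AR 26950
```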


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [Codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
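
One way to check a `robots.txt` file programmatically is the standard library's `urllib.robotparser`. The sketch below is a minimal example that parses an inlined, made-up set of rules so it runs offline; the rules and the `example.com` URLs are assumptions for illustration, not any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = 'codeup data science germain cohort'
blog_allowed = parser.can_fetch(user_agent, 'https://example.com/blog/')
private_allowed = parser.can_fetch(user_agent, 'https://example.com/private/data')

print(blog_allowed)     # True
print(private_allowed)  # False
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` fetches the real file in place of the inlined rules.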

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article and holding at least that article's title and content.

In [None]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')
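
The next cell calls a `get_soup` helper whose definition is not shown in this section. Here is a minimal sketch of what such a helper might look like, assuming it only wraps the request and the parsing step; the `user_agent` default, the timeout, and the status check are assumptions, not part of the original.

```python
import requests
from bs4 import BeautifulSoup

def get_soup(url, user_agent='Codeup DS Hopper'):
    """Request `url` and return the parsed page (assumed behavior of the helper)."""
    response = requests.get(url, headers={'user-agent': user_agent}, timeout=10)
    response.raise_for_status()  # raise an HTTPError on a 4xx/5xx response
    return BeautifulSoup(response.text, 'html.parser')
```

Keeping the request and parsing in one function means every exercise sends the same descriptive user agent.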

In [None]:
link = 'https://inshorts.com/en/read/sports'
# get_soup is assumed to be a helper that requests the URL and returns a BeautifulSoup object
soup = get_soup(link)
# select every element with class 'news-card'
articles = soup.select('.news-card')
# pick one article to work with (index 1 is the second card in the list)
article = articles[1]
# the title is the element whose itemprop attribute is "headline"
title = article.find(attrs={"itemprop": "headline"}).text
# the content is the element whose itemprop attribute is "articleBody"
content = article.find(attrs={"itemprop": "articleBody"}).text