In [1]:
import requests
import bs4

In [2]:
# make the http request and turn the response into a beautiful soup object
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
soup = bs4.BeautifulSoup(html, 'html.parser')

1. Find the container that holds the information we want (`article_container`)
1. Within the container, select the individual entities we want (here, the articles), which produces a list
1. Process each entity: identify the pieces we want and extract them
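
The three steps above can be sketched on a small inline HTML string, so the example runs without a network request. The markup and the class names `articles` and `article` are invented for illustration and are not the demo site's real classes; `process_item` is likewise a hypothetical helper.

```python
import bs4

# Hypothetical markup, invented for this sketch.
html = """
<div class="articles">
  <div class="article"><h2>first title</h2><p>first summary</p></div>
  <div class="article"><h2>second title</h2><p>second summary</p></div>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# 1. Find the container.
container = soup.select(".articles")[0]

# 2. Within the container, select each entity (this gives a list).
items = container.select(".article")

# 3. Process each entity into a dictionary.
def process_item(item):
    return {
        "title": item.find("h2").text.strip(),
        "summary": item.find("p").text.strip(),
    }

records = [process_item(item) for item in items]
print(records)
```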

In [3]:
article_container = soup.select('.grid.gap-y-12')[0]

In [4]:
articles = article_container.select('.grid.grid-cols-4.gap-x-4.border')

In [5]:
article = articles[0]
# get a pretty printed representation of our element
print(article.prettify())

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
  <h2 class="text-2xl text-green-900">
   tax industry hand
  </h2>
  <div class="grid grid-cols-2 italic">
   <p>
    1976-09-23
   </p>
   <p class="text-right">
    By Robert Clayton
   </p>
  </div>
  <p>
   Away candidate ago laugh public six significant garden. During significant man operation audience thus give. Such walk picture. Case sport amount.
Save herself easy upon war animal doctor pull. Either reason tax according themselves that.
  </p>
 </div>
</div>



`.select` vs `.find` or `.find_all`

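- `.select` accepts a CSS selector and always gives back a list; the list may be empty, or hold one or many elements
- `.find` and `.find_all` accept a *tag name*
- `.find` returns a single element (or `None` if nothing matches)
- `.find_all` returns a list of elements
- With `.select`, each class name has a `.` in front of it, and an element matches as long as it has all of the listed classes (extra classes and their order do not matter)
- With `.find` or `.find_all` you can use a `class_` keyword argument; a single class name matches any element that has that class, but a string of several class names must match the element's `class` attribute exactly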
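
A minimal sketch of these differences, using the date and byline markup from the article above as an inline string (the selector `.does-not-exist` is made up to show the empty case):

```python
import bs4

html = """
<div class="grid grid-cols-2 italic">
  <p>1976-09-23</p>
  <p class="text-right">By Robert Clayton</p>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# .select takes a CSS selector and always returns a list.
matches = soup.select(".grid.italic")         # a subset of the classes is enough
no_matches = soup.select(".does-not-exist")   # empty list, not None

# .find returns the first match (or None); .find_all returns a list.
first_p = soup.find("p")
all_ps = soup.find_all("p")

# class_ with one class name matches any element that has that class.
by_one_class = soup.find("div", class_="italic")

# class_ with several names is compared against the whole class string,
# so a subset or a different order does not match.
exact = soup.find("div", class_="grid grid-cols-2 italic")
reordered = soup.find("div", class_="italic grid")

print(len(matches), no_matches, first_p.text, len(all_ps))
print(by_one_class is not None, exact is not None, reordered)
```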

In [6]:
article.find('div', class_='grid grid-cols-2 italic')

<div class="grid grid-cols-2 italic">
<p> 1976-09-23 </p>
<p class="text-right">By Robert Clayton </p>
</div>

In [7]:
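def process_article(article):
    # the title is the only h2 inside the article
    title = article.find('h2').text
    # the date and the byline are the two <p> tags inside the italic grid div
    date_and_byline_div = article.select('.grid.grid-cols-2.italic')[0]
    date_p, by_p = date_and_byline_div.find_all('p')
    # the summary is the last <p> in the article
    summary = article.find_all('p')[-1].text

    return {
        "title": title,
        "date": date_p.text,
        "by": by_p.text,
        "summary": summary
    }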

In [8]:
process_article(articles[3])

{'title': 'floor Congress citizen',
 'date': ' 2000-06-02 ',
 'by': 'By Grace Matthews ',
 'summary': 'Cause score suggest data TV speak include not. Once then item resource culture sometimes. Individual score free player stage resource huge.\nAnimal above food part road themselves.'}

In [9]:
import pandas as pd


pd.DataFrame([process_article(article) for article in articles])

| | title | date | by | summary |
|---|---|---|---|---|
| 0 | tax industry hand | 1976-09-23 | By Robert Clayton | Away candidate ago laugh public six significan... |
| 1 | factor discussion compare | 2017-01-02 | By Stanley Thomas | Involve a pressure laugh ready rule window amo... |
| 2 | choice church open | 2001-04-05 | By Tony Lee | Behavior blood chance always heart. Those read... |
| 3 | floor Congress citizen | 2000-06-02 | By Grace Matthews | Cause score suggest data TV speak include not.... |
| 4 | sign source if | 2009-10-02 | By Sara Hammond | Attention myself continue item market. Particu... |
| 5 | enjoy leave reflect | 2011-08-27 | By Sandra Walker | Alone shoulder same sometimes serious yourself... |
| 6 | middle medical book | 2018-10-05 | By Dwayne Dixon | Part pass civil long. Create step approach cre... |
| 7 | line what recent | 1993-09-24 | By Michael Young | Whole kid southern answer. Option how short ca... |
| 8 | respond be economy | 1981-12-25 | By Jennifer Ford | Big drive property might scene. Pull another s... |
| 9 | book school visit | 2003-05-24 | By Kathleen Lopez | Civil strong easy ever decide. Bit democratic ... |
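
Because the scraped values are raw strings, some cleanup after building the DataFrame may be useful. The sketch below is one possible approach, not part of the original lesson. It uses two hand-typed records modeled on what `process_article` returns (summaries shortened, surrounding whitespace assumed to match the output shown earlier).

```python
import pandas as pd

# Hand-typed sample records in the shape process_article returns.
records = [
    {"title": "tax industry hand", "date": " 1976-09-23 ",
     "by": "By Robert Clayton ", "summary": "Away candidate ago laugh."},
    {"title": "floor Congress citizen", "date": " 2000-06-02 ",
     "by": "By Grace Matthews ", "summary": "Cause score suggest data."},
]
df = pd.DataFrame(records)

# Strip surrounding whitespace, then convert the date strings to datetimes.
df["date"] = pd.to_datetime(df["date"].str.strip())

# Drop the leading "By " so the column holds only the author's name.
df["by"] = df["by"].str.strip().str.replace(r"^By\s+", "", regex=True)

print(df.dtypes)
print(df[["date", "by"]])
```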


# Exercises

By the end of this exercise, you should have a file named `acquire.py` that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. `acquire_codeup_blog.py` and `acquire_news_articles.py`), but the final functions should still be available from `acquire.py` (that is, `acquire.py` should import `get_blog_articles` from the `acquire_codeup_blog` module).

- 1) Codeup Blog Articles

Scrape the article text from the following pages:

    https://codeup.com/codeups-data-science-career-accelerator-is-here/
    https://codeup.com/data-science-myths/
    https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
    https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
    https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

- 2) News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

    Business
    Sports
    Technology
    Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

# People

In [29]:
response = requests.get('https://web-scraping-demo.zgulde.net/people')
soup = bs4.BeautifulSoup(response.text, 'html.parser')
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Example People Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">People</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Kristen Murphy<

In [30]:
people_div = soup.select('#people')[0]
people = people_div.select('.person')
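
Since an `id` is meant to be unique on a page, `select_one` is a possible shortcut for `select(...)[0]`. The sketch below uses a stripped-down, hand-written version of the people markup; the id `not-there` is made up to show what happens when nothing matches.

```python
import bs4

# Simplified stand-in for the people page markup.
html = '<div id="people"><div class="person">A</div><div class="person">B</div></div>'
soup = bs4.BeautifulSoup(html, "html.parser")

# select_one returns the first match directly instead of a list.
people_div = soup.select_one("#people")
people = people_div.select(".person")

# When nothing matches, select_one returns None rather than raising IndexError.
missing = soup.select_one("#not-there")

print(len(people), missing)
```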

In [31]:
person = people[0]
print(person.prettify())

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">
  Kristen Murphy
 </h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
  "Digitized regional conglomeration"
 </p>
 <div class="grid grid-cols-9">
  <i class="bi bi-envelope-fill text-purple-800">
  </i>
  <p class="email col-span-8">
   derekfrazier@fuentes.com
  </p>
  <i class="bi bi-telephone-fill text-purple-800">
  </i>
  <p class="phone col-span-8">
   +1-796-286-5984x69325
  </p>
 </div>
 <div class="address grid grid-cols-9">
  <i class="bi bi-geo-fill text-purple-800">
  </i>
  <p class="col-span-8">
   1747 Curtis River Apt. 420
   <br/>
   Coleborough, MO 26892
  </p>
 </div>
</div>



In [32]:
def process_person(person):
    return {
        'name': person.find(class_='name').text,
        'quote': person.find(class_='quote').text.strip(),
        'email': person.find(class_='email').text,
        'phone': person.find(class_='phone').text,
        'address': person.find(class_='address').text.strip(),
    }

In [34]:
process_person(people[3])

{'name': 'Heather Walker',
 'quote': '"Diverse 5thgeneration service-desk"',
 'email': 'fsherman@gmail.com',
 'phone': '001-772-695-4312x3816',
 'address': '0040 Katherine Glens Suite 395 \n                New Mindy, KS 12383'}
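
The address still contains a newline and a run of spaces, because the two lines are separated by a `<br/>` and indentation in the HTML. Two possible ways to normalize it are sketched below, using the address markup of the first person shown above as an inline string; the variable names are illustrative.

```python
import bs4

# Address block copied from the prettified person above.
html = """
<div class="address grid grid-cols-9">
  <i class="bi bi-geo-fill text-purple-800"></i>
  <p class="col-span-8">
    1747 Curtis River Apt. 420
    <br/>
    Coleborough, MO 26892
  </p>
</div>
"""
address_div = bs4.BeautifulSoup(html, "html.parser").find(class_="address")

# Option 1: split on any whitespace and rejoin with single spaces.
one_line = " ".join(address_div.text.split())

# Option 2: have BeautifulSoup strip each text fragment and join the
# non-empty ones with a separator of our choosing.
comma_joined = address_div.get_text(separator=", ", strip=True)

print(one_line)
print(comma_joined)
```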

# Turn into a DataFrame

In [35]:
pd.DataFrame([process_person(person) for person in people])

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Kristen Murphy | "Digitized regional conglomeration" | derekfrazier@fuentes.com | +1-796-286-5984x69325 | 1747 Curtis River Apt. 420 \n C... |
| 1 | Jared Rowe | "Public-key scalable concept" | jamesvargas@warren-cruz.com | 521.619.7239x5390 | 854 Barrett Drive Suite 315 \n ... |
| 2 | Sarah Young | "Realigned global core" | codywatson@hotmail.com | 3008748848 | 702 Walker Plain Suite 169 \n A... |
| 3 | Heather Walker | "Diverse 5thgeneration service-desk" | fsherman@gmail.com | 001-772-695-4312x3816 | 0040 Katherine Glens Suite 395 \n ... |
| 4 | Candace Harrison | "Progressive next generation encoding" | jennifer04@wang.com | (020)481-2579x40506 | 7134 Goodman Shore \n Woodsvill... |
| 5 | Daniel Sellers | "Open-architected zero-defect secured line" | fdawson@gmail.com | 001-261-199-0318x6629 | 169 Melissa Mills Suite 097 \n ... |
| 6 | Jonathon Chavez | "Stand-alone client-server utilization" | jocelyn95@hotmail.com | (038)716-5644x4284 | 81295 Nash Harbors Suite 070 \n ... |
| 7 | Derek Russell | "Exclusive well-modulated solution" | rwalters@fuentes.com | 001-285-401-2643x664 | 9510 Travis Drive \n Port Angel... |
| 8 | Bruce Jennings | "Multi-layered zero-defect hardware" | barnestodd@hotmail.com | +1-297-282-3024x87942 | 08863 Harmon Hills Apt. 441 \n ... |
| 9 | Alexander Fernandez | "Advanced didactic conglomeration" | nwinters@clarke.com | (235)900-0467 | 32358 Sloan Brook Apt. 232 \n L... |
