In [1]:
import requests
import bs4

In [2]:
# make the http request and turn the response into a beautiful soup object
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
soup = bs4.BeautifulSoup(html, 'html.parser')

1. Find the container that holds the information we want (`article_container`)
1. Within the container, identify the repeated entities and collect them into a list
1. Process each individual entity (identify the pieces that we want and extract them)
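
The three steps above can be sketched on a tiny inline document. The HTML and the class names `listing` and `item` are made up for illustration and are not taken from the demo site; `process_item` is a hypothetical helper.

```python
import bs4

# Hypothetical HTML standing in for a scraped page (class names are invented)
html = """
<div class="listing">
  <div class="item"><h2>First</h2><p>alpha</p></div>
  <div class="item"><h2>Second</h2><p>beta</p></div>
</div>
<div class="item"><h2>Outside</h2><p>ignored</p></div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# 1. Find the container
container = soup.select(".listing")[0]

# 2. Within the container, collect the repeated entities into a list
items = container.select(".item")

# 3. Process each entity, pulling out the pieces we want
def process_item(item):
    return {"title": item.find("h2").text, "body": item.find("p").text}

records = [process_item(item) for item in items]
print(records)
```

Searching inside the container rather than the whole page keeps the third `.item`, which sits outside the container, out of the results.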

In [3]:
article_container = soup.select('.grid.gap-y-12')[0]

In [4]:
articles = article_container.select('.grid.grid-cols-4.gap-x-4.border')

In [5]:
article = articles[0]
# get a pretty printed representation of our element
print(article.prettify())

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
  <h2 class="text-2xl text-green-900">
   foreign learn cause
  </h2>
  <div class="grid grid-cols-2 italic">
   <p>
    1970-12-18
   </p>
   <p class="text-right">
    By Edward Manning
   </p>
  </div>
  <p>
   Three step hear American. Score edge well enter person future.
Defense week young economic. They imagine window family concern difficult seat. Up green argue push nothing help question.
  </p>
 </div>
</div>



`.select` vs `.find` or `.find_all`

- `.select` always returns a list; the list may be empty, hold a single element, or hold several elements
- `.find` and `.find_all` accept a *tag name*; `.select` accepts a CSS selector
- `.find` returns a single element (or `None` when nothing matches)
- `.find_all` returns a list of elements
- With `.select`, each class name in the selector has a `.` in front of it
- With `.find` or `.find_all` you can use a `class_` keyword argument, but a string containing several classes must match the element's `class` attribute exactly, in the same order
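
A minimal sketch of these differences, run against an inline copy of the date-and-byline markup shown earlier (the variable names are only for illustration):

```python
import bs4

html = (
    '<div class="grid grid-cols-2 italic">'
    '<p>1970-12-18</p><p class="text-right">By Edward Manning</p>'
    '</div>'
)
snippet = bs4.BeautifulSoup(html, "html.parser")

# .select takes a CSS selector and always returns a list (possibly empty)
matches = snippet.select(".grid.italic")        # a subset of the classes is enough
no_matches = snippet.select(".does-not-exist")  # empty list, no error

# .find returns the first matching element, or None when nothing matches
first_p = snippet.find("p")
missing = snippet.find("h2")

# .find_all returns a list of every matching element
all_p = snippet.find_all("p")

# A single class name matches any element that carries that class ...
by_one_class = snippet.find("div", class_="italic")
# ... but a multi-class string has to equal the whole class attribute
exact = snippet.find("div", class_="grid grid-cols-2 italic")
reordered = snippet.find("div", class_="italic grid")  # no match
```

Because `.select(...)[0]` raises an `IndexError` on an empty list while `.find` quietly returns `None`, it is worth checking for both cases before calling `.text`.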

In [6]:
article.find('div', class_='grid grid-cols-2 italic')

<div class="grid grid-cols-2 italic">
<p> 1970-12-18 </p>
<p class="text-right">By Edward Manning </p>
</div>

In [7]:
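def process_article(article):
    """Extract the title, date, byline, and summary from one article element."""
    title = article.find('h2').text
    # The date and byline are the two <p> tags inside the italic grid div
    date_and_byline_div = article.select('.grid.grid-cols-2.italic')[0]
    date_p, by_p = date_and_byline_div.find_all('p')
    # The summary is the last <p> in the article
    summary = article.find_all('p')[-1].text

    return {
        "title": title,
        "date": date_p.text,
        "by": by_p.text,
        "summary": summary
    }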

In [8]:
process_article(articles[3])

{'title': 'local threat space',
 'date': ' 1989-02-11 ',
 'by': 'By Julie Mays MD ',
 'summary': 'Money government board early book discuss. Energy somebody camera doctor. Kind sell toward.\nBe contain table some see. Day white democratic property dark nothing.'}
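
The extracted values still carry stray whitespace, a `By ` prefix, and embedded newlines. One possible cleanup step is sketched below; `clean_article` is a hypothetical helper, not part of the original notebook, and it assumes dates always arrive in `YYYY-MM-DD` form as in the output above.

```python
from datetime import date

def clean_article(record):
    """Return a tidied copy of a dictionary produced by process_article."""
    byline = record["by"].strip()
    # Drop the leading "By " so only the author's name remains
    if byline.startswith("By "):
        byline = byline[len("By "):]
    return {
        "title": record["title"].strip(),
        # Assumes an ISO-formatted date string such as ' 1989-02-11 '
        "date": date.fromisoformat(record["date"].strip()),
        "by": byline,
        # Collapse newlines and repeated spaces into single spaces
        "summary": " ".join(record["summary"].split()),
    }

raw = {
    "title": "local threat space",
    "date": " 1989-02-11 ",
    "by": "By Julie Mays MD ",
    "summary": "Money government board early book discuss.\nBe contain table some see.",
}
cleaned = clean_article(raw)
print(cleaned)
```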

In [9]:
import pandas as pd


pd.DataFrame([process_article(article) for article in articles])

,title,date,by,summary
0,foreign learn cause,1970-12-18,By Edward Manning,Three step hear American. Score edge well ente...
1,against after indeed,1976-01-16,By Laura Briggs,Particular seek career pressure rate dog happe...
2,political then deal,1974-01-03,By John Watts,Particular tax population pressure. Operation ...
3,local threat space,1989-02-11,By Julie Mays MD,Money government board early book discuss. Ene...
4,interest current eat,1988-07-10,By Paula Anderson,Thank mean health fire. Wait already his full....
5,do such floor,2015-09-25,By Michael Thomas,Bad worry that on among alone management. What...
6,event one do,1979-12-18,By Evelyn Leblanc,Toward the reduce girl hear large thank.\nFrie...
7,week down ready,1971-03-31,By Jimmy Allen,Practice fund board. Hit democratic protect en...
8,sport father later,2015-06-24,By Max Medina,Reveal easy certain evening. Laugh not note th...
9,civil real game,2018-07-26,By Dustin White DVM,Often husband knowledge attorney less administ...


# Exercises

By the end of this exercise, you should have a file named `acquire.py` that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. `acquire_codeup_blog.py` and `acquire_news_articles.py`), but the final functions should be available from `acquire.py` (that is, `acquire.py` should import `get_blog_articles` from the `acquire_codeup_blog` module).

- 1) Codeup Blog Articles

Scrape the article text from the following pages:

    https://codeup.com/codeups-data-science-career-accelerator-is-here/
    https://codeup.com/data-science-myths/
    https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
    https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
    https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [10]:
from requests import get
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [11]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"}
response = get(url, headers=headers)
raw_html = response.text
raw_html[0:300]

'<!DOCTYPE html><html lang="en-US"><head >\t<meta charset="UTF-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta name=\'robots\' content=\'index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1\' />\n<style type="text/css" id="nab-alternative-loader-'

In [12]:
# Turn the raw HTML into a BeautifulSoup object.

In [13]:
soup = BeautifulSoup(raw_html, 'html.parser')

In [14]:
## How to use BeautifulSoup
    # soup.find - find the first matching element
    # soup.find_all - find all the matching elements
    # soup.select - find all the elements matching a CSS selector (as a list of tags)

In [15]:
# Title
Title = soup.select('h1.jupiterx-post-title')[0].text

# Body
Content = soup.select('.jupiterx-post-content')[0].text

# Time
Time = soup.time.text

print(Title)
print('-------------------------------------------------')
print(Time)
print('-------------------------------------------------')
print(Content)

Codeup’s Data Science Career Accelerator is Here!
-------------------------------------------------
September 30, 2018
-------------------------------------------------
The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America.
Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio, resulting in an explosion in Data Scientist positions across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen UTSA invest $70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry.
Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scie

In [16]:
# Now wrap these steps in a function.

In [17]:
def get_codeup_blog(url):
    """Scrape one Codeup blog post and return its title, date, and body text."""
    # Send a browser-like User-Agent along with the request
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"}
    response = get(url, headers=headers)

    soup = BeautifulSoup(response.text, 'html.parser')

    # Title
    title = soup.select('h1.jupiterx-post-title')[0].text

    # Body
    content = soup.select('.jupiterx-post-content')[0].text

    # Publication date, taken from the first <time> tag on the page
    published = soup.time.text

    return {'Title': title, 'Time': published, 'Content': content}
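
Indexing with `[0]` raises an `IndexError` whenever a selector matches nothing, for example if a page uses a different template. A more forgiving variant might rely on `select_one`, which returns `None` instead. The helper name `text_or_none` and the inline HTML below are assumptions made for illustration only.

```python
import bs4

def text_or_none(soup, selector):
    """Return the stripped text of the first element matching selector, or None."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None

# Hypothetical page that has a title but no post body
html = '<article><h1 class="post-title"> Hello </h1></article>'
page = bs4.BeautifulSoup(html, "html.parser")

title = text_or_none(page, "h1.post-title")
body = text_or_none(page, ".post-content")
print(title, body)
```

Whether a missing field should become `None` or stop the scrape with an error is a design choice; returning `None` keeps a batch of URLs running when one page is laid out differently.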

In [18]:
get_codeup_blog('https://codeup.com/codeups-data-science-career-accelerator-is-here/')

{'Title': 'Codeup’s Data Science Career Accelerator is Here!',
 'Time': 'September 30, 2018',
 'Content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB

In [19]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

In [20]:
def get_blog_articles(urls):
    # Scrape each post, then collect the resulting dictionaries into a DataFrame
    posts = [get_codeup_blog(url) for url in urls]

    return pd.DataFrame(posts)

In [21]:
get_blog_articles(urls)

,Title,Time,Content
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"October 31, 2018",By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018","By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",Competitor Bootcamps Are Closing. Is the Model...


- 2) News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

    Business
    Sports
    Technology
    Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

In [119]:
def get_article(article, category):
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select("[itemprop='articleBody']")[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output

In [122]:
def get_articles(category):
    """
    Takes a category as a string. The category must be one that is available on inshorts.
    Returns a list of dictionaries, where each dictionary represents a single inshorts article.
    """
    base = "https://inshorts.com/en/read/"
    
    # Concatenate the base URL with the category
    url = base + category
    
    # Set the headers to show as HTTrack on Windows 98, b/c I feel like creating an anomaly in the logs
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output

In [124]:
get_articles("technology")

[{'title': "Kangana Ranaut's Twitter account suspended for violating rules",
  'content': 'Actress Kangana Ranaut\'s Twitter account has been suspended by the micro-blogging website for violating rules. She had made comments on the alleged violence that took place in West Bengal after Assembly election results were declared. A writ petition filed in the Bombay High Court last year had sought suspension of her account for "spreading continuous hatred, disharmony in the country".',
  'category': 'technology'},
 {'title': "Twitter issues statement after permanently suspending Kangana's account",
  'content': 'Twitter has permanently suspended Kangana Ranaut\'s account for repeated violations of rules, specifically its "hateful conduct and abusive behaviour policy", the site said. "We\'ve been clear...we\'ll take strong...action on behaviour that has...potential to lead to offline harm," a spokesperson stated. "We enforce the Twitter rules judiciously and impartially for everyone," the spo
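
The exercise asks for a single `get_news_articles` function covering all four topics. One way to build it on top of a per-category scraper is sketched below. The `fetch` parameter and the `fake_fetch` stand-in are assumptions added so the sketch runs without a network connection; in the notebook, `get_articles` would be passed as `fetch`.

```python
def get_news_articles(categories, fetch):
    """Collect the article dictionaries for every category into one flat list.

    fetch is any function that takes a category string and returns a list of
    dictionaries (for example, the get_articles function defined above).
    """
    articles = []
    for category in categories:
        articles.extend(fetch(category))
    return articles

# Stand-in for a real scraper so the sketch needs no network access
def fake_fetch(category):
    return [{"title": f"{category} headline", "content": "placeholder text", "category": category}]

topics = ["business", "sports", "technology", "entertainment"]
news = get_news_articles(topics, fetch=fake_fetch)
print(len(news))
```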

# People

In [22]:
response = requests.get('https://web-scraping-demo.zgulde.net/people')
soup = bs4.BeautifulSoup(response.text, 'html.parser')
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Example People Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">People</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Paul Jackson</h

In [23]:
people_div = soup.select('#people')[0]
people = people_div.select('.person')

In [24]:
person = people[0]
print(person.prettify())

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">
  Paul Jackson
 </h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
  "De-engineered foreground product"
 </p>
 <div class="grid grid-cols-9">
  <i class="bi bi-envelope-fill text-purple-800">
  </i>
  <p class="email col-span-8">
   paulcisneros@yahoo.com
  </p>
  <i class="bi bi-telephone-fill text-purple-800">
  </i>
  <p class="phone col-span-8">
   388-664-9837x1084
  </p>
 </div>
 <div class="address grid grid-cols-9">
  <i class="bi bi-geo-fill text-purple-800">
  </i>
  <p class="col-span-8">
   21798 Samantha Ways Suite 170
   <br/>
   New Jesus, CT 02209
  </p>
 </div>
</div>



In [25]:
def process_person(person):
    return {
        'name': person.find(class_='name').text,
        'quote': person.find(class_='quote').text.strip(),
        'email': person.find(class_='email').text,
        'phone': person.find(class_='phone').text,
        'address': person.find(class_='address').text.strip(),
    }

In [26]:
process_person(people[3])

{'name': 'William Miller',
 'quote': '"Multi-channeled directional installation"',
 'email': 'wbaker@short-allen.info',
 'phone': '151-300-9818',
 'address': '5530 Hancock Lock \n                East Lisaside, TX 92850'}
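
The address comes back as one string with a newline and a run of spaces where the `<br/>` was. If separate fields are wanted, a sketch like the following could split it. `split_address` and the regular expression are illustrative assumptions; they expect a last line shaped like `City, ST 12345` and fall back to a single `street` string for anything else.

```python
import re

# Assumed shape of the last address line: "City, ST 12345"
CITY_STATE_ZIP = re.compile(r"^(?P<city>.+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$")

def split_address(address):
    """Split a scraped multi-line address into street, city, state, and zip."""
    # Drop blank lines and the indentation left over from the HTML
    lines = [line.strip() for line in address.splitlines() if line.strip()]
    match = CITY_STATE_ZIP.match(lines[-1]) if len(lines) >= 2 else None
    if match is None:
        # Unrecognized layout: keep everything in one field
        return {"street": " ".join(lines), "city": None, "state": None, "zip": None}
    return {"street": " ".join(lines[:-1]), **match.groupdict()}

parts = split_address("5530 Hancock Lock \n                East Lisaside, TX 92850")
print(parts)
```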

# Turn into dataframe

In [27]:
pd.DataFrame([process_person(person) for person in people])

,name,quote,email,phone,address
0,Paul Jackson,"""De-engineered foreground product""",paulcisneros@yahoo.com,388-664-9837x1084,21798 Samantha Ways Suite 170 \n ...
1,Sherry Skinner,"""Centralized intermediate strategy""",anelson@hotmail.com,8561135679,79801 Anthony Bypass Suite 721 \n ...
2,Jeffrey Curry,"""Adaptive 5thgeneration extranet""",mdavis@saunders.info,463.602.2344x038,422 Annette Union Suite 936 \n ...
3,William Miller,"""Multi-channeled directional installation""",wbaker@short-allen.info,151-300-9818,5530 Hancock Lock \n East Lisas...
4,Michael Bonilla,"""Front-line interactive neural-net""",michael76@bishop.net,+1-255-826-3643x314,"4668 Dyer Rapid \n Nancyport, N..."
5,Gerald Knight,"""Diverse national website""",antoniomacias@yahoo.com,+1-064-561-4278,401 Morgan Crossing \n Solomonl...
6,Tyler Chapman,"""Optimized mobile superstructure""",jasonreilly@yahoo.com,(660)943-1322,7623 Cynthia Ferry Apt. 899 \n ...
7,Douglas Reynolds,"""Self-enabling 4thgeneration portal""",jackshepherd@george-pena.com,247.579.4243x46329,962 Troy Hills Suite 491 \n Sou...
8,Alan Williams,"""Configurable zero-defect moratorium""",amber70@yahoo.com,+1-451-301-4004x8102,3325 Brown Islands Suite 123 \n ...
9,Latoya Bradley,"""Public-key encompassing Graphical User Interf...",brandonwashington@stanton.info,411-835-2069x457,537 Odonnell Run \n West Rachel...
