# Intro to Web Scraping
- Use `requests` to download the HTML
- Use `BeautifulSoup` to parse that HTML to get the thing(s) you need

## Process
- Step 1: use the `requests` library to make an HTTP request to the page's URL
- Step 2: use the `response.text` property on the `response` object to get the HTML as text
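
The two steps above can be wrapped in a small helper. This is only a sketch: `fetch_html` and its `timeout` default are names made up for this example, not part of the original notebook. It adds a status check so that an error page is not silently parsed as if it were the real content.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Step 1: make the HTTP request; give up if the server takes too long.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    # Step 2: the decoded body of the response, i.e. the HTML as a string.
    return response.text
```

Calling `fetch_html(url)` with the `url` defined in the next cells should return the same string as `get(url).text`, as long as the request succeeds.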

In [10]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd  # needed later by acquire_articles

In [2]:
url = 'https://site-to-scrape.glitch.me'

In [6]:
get(url).text

'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <title>Site to Scrape!</title>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    \n    <!-- import the webpage\'s stylesheet -->\n    <link rel="stylesheet" href="/style.css">\n    \n    <!-- import the webpage\'s javascript file -->\n    <script src="/script.js" defer></script>\n  </head>  \n  <body>\n    <header>\n      <h1>This is the header!</h1>\n      <hr>\n    </header>\n    \n    <main>\n      <div>\n        <h1 class="first">\n        This is the main\n        </h1>\n        <h2>\n          This is an h2 of main\n        </h2>\n        <h3>\n          H3 inside of first div inside of main\n        </h3>\n      </div>\n      <div>\n        <h3 class="first">\n          H3 inside of second div inside of main.\n        </h3>\n        <p>\n          Here\'s some text content for us to scrape! 👽\n        </p>\n        

In [11]:
url2 = 'https://web-scraping-demo.zgulde.net/news'

In [53]:
soup = BeautifulSoup(get(url).content, 'html.parser')
soup

<!DOCTYPE html>

<html lang="en">
<head>
<title>Site to Scrape!</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- import the webpage's stylesheet -->
<link href="/style.css" rel="stylesheet"/>
<!-- import the webpage's javascript file -->
<script defer="" src="/script.js"></script>
</head>
<body>
<header>
<h1>This is the header!</h1>
<hr/>
</header>
<main>
<div>
<h1 class="first">
        This is the main
        </h1>
<h2>
          This is an h2 of main
        </h2>
<h3>
          H3 inside of first div inside of main
        </h3>
</div>
<div>
<h3 class="first">
          H3 inside of second div inside of main.
        </h3>
<p>
          Here's some text content for us to scrape! 👽
        </p>
<p>
          Here's another paragraph of content! ☠️
        </p>
<a href="https://github.com/ryanorsinger">Click here to visit my portfolio</a>
</div>
</main>
<footer>
<h1>This 

In [54]:
soup.p

<p>
          Here's some text content for us to scrape! 👽
        </p>

In [55]:
soup.find('p')

<p>
          Here's some text content for us to scrape! 👽
        </p>

In [56]:
soup.find_all('p')[1]

<p>
          Here's another paragraph of content! ☠️
        </p>

In [57]:
soup.find_all('p')[1].text.strip()

"Here's another paragraph of content! ☠️"

In [58]:
[thing.text.strip() for thing in soup.select('h1')]

['This is the header!', 'This is the main', 'This is the footer']
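
The cells above use `find`, `find_all`, and `select` interchangeably, so here is a short side-by-side comparison. The HTML string is a made-up stand-in, not the real page, so the example runs without a network connection; `soup_demo` and the other variable names are chosen just for this sketch.

```python
from bs4 import BeautifulSoup

# A small stand-in document (invented for this example).
html = """
<main>
  <h1 class="first"> Main heading </h1>
  <p> First paragraph </p>
  <p> Second paragraph </p>
</main>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

first_p = soup_demo.find('p')          # first match only (same as soup_demo.p)
all_p = soup_demo.find_all('p')        # list of every match
by_css = soup_demo.select('h1.first')  # CSS selector; also returns a list
missing = soup_demo.find('table')      # no match: find returns None

# get_text(strip=True) does the same job as .text.strip() for a tag holding one string.
paragraphs = [p.get_text(strip=True) for p in all_p]
```

Because `find` returns `None` when nothing matches, chaining `.text` onto it raises an `AttributeError`. `find_all` and `select` return an empty list instead.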

In [59]:
soup2 = BeautifulSoup(get(url2).content, 'html.parser')
soup2

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>News Example Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">News!</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid gap-y-12">
<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">visit thousand 

In [61]:
len(soup2.find_all('div'))

38

In [69]:
# soup.find_all('div')

In [70]:
# [thing.text.strip() for thing in soup2.select('div')]

In [74]:
soup2.select('div.grid.grid-cols-4')

[<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">visit thousand important</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1978-03-26 </p>
 <p class="text-right">By Ann Pena </p>
 </div>
 <p>Majority same public small wear only record. Chance send mind place theory out vote which. Gun billion wind member note draw.
 A fill whether remain enter. Now anything each himself well find career. Case carry garden project develop.</p>
 </div>
 </div>,
 <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">writer owner into</h2>
 <div class="grid grid-cols-2 italic">
 <p> 2007-06-15 </p>
 <p class="text-right">By Patrick Harmon </p>
 </div>
 <p>Docto

In [76]:
articles = soup2.select('div.grid.grid-cols-4')

In [94]:
articles[0].h2.find_all('p')

[]
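
The empty list above is expected: `articles[0].h2.find_all('p')` searches only inside the `<h2>` tag, and the `<p>` tags are siblings of the headline rather than its children. The sketch below reproduces this with a trimmed-down imitation of one article card. The markup is an assumption based on the output shown earlier, and `card_html` and `card` are names made up for this example.

```python
from bs4 import BeautifulSoup

# Simplified copy of one article card (assumed structure, shortened text).
card_html = """
<div class="grid grid-cols-4">
  <div class="col-span-3">
    <h2>visit thousand important</h2>
    <div class="grid grid-cols-2 italic">
      <p> 1978-03-26 </p>
      <p class="text-right">By Ann Pena </p>
    </div>
    <p>Majority same public small wear only record.</p>
  </div>
</div>
"""
card = BeautifulSoup(card_html, 'html.parser').select_one('div.grid.grid-cols-4')

inside_h2 = card.h2.find_all('p')   # searches only inside <h2>, which holds no <p>
inside_card = card.find_all('p')    # searches the whole card and finds all three
card_texts = [p.get_text(strip=True) for p in inside_card]
```

Searching from the card itself is what `get_article_content` does in the next cell.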

In [96]:
# [thing.text.strip() for thing in articles]

In [98]:
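def get_article_content(art):
    # Collect the headline, date, author, and body text of one article card.
    output = {}
    output['headline'] = art.find('h2').text.strip()
    # The unpacking assumes each card holds exactly three <p> tags, in this order:
    # date, author, content. Any other count raises a ValueError.
    output['date'], output['author'], output['content'] = [thing.text.strip() for thing in art.find_all('p')]
    return output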

In [103]:
get_article_content(articles[0])

{'headline': 'visit thousand important',
 'date': '1978-03-26',
 'author': 'By Ann Pena',
 'content': 'Majority same public small wear only record. Chance send mind place theory out vote which. Gun billion wind member note draw.\nA fill whether remain enter. Now anything each himself well find career. Case carry garden project develop.'}
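
`get_article_content` relies on every card having an `<h2>` and exactly three `<p>` tags. If a page might not follow that layout, a more forgiving variant could look like the sketch below. `get_article_content_safe` is a hypothetical name, and filling gaps with `None` is one possible design choice, not something the original notebook does.

```python
from bs4 import BeautifulSoup


def get_article_content_safe(art):
    """Variant of get_article_content that tolerates missing pieces (hypothetical)."""
    headline = art.find('h2')
    texts = [p.get_text(strip=True) for p in art.find_all('p')]
    # Pad with None so a card with fewer than three <p> tags does not raise ValueError.
    date, author, content = (texts + [None] * 3)[:3]
    return {
        'headline': headline.get_text(strip=True) if headline is not None else None,
        'date': date,
        'author': author,
        'content': content,
    }


# A deliberately incomplete card (made up) to show the fallback behaviour.
partial_card = BeautifulSoup('<div><h2> A headline </h2><p> 2020-01-01 </p></div>', 'html.parser')
record = get_article_content_safe(partial_card)
```

One trade-off: any `<p>` tags beyond the third are dropped silently, so this version hides layout changes that the original would surface as an error.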

In [109]:
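def acquire_articles(url):
    # Download and parse the page, then turn every article card into one DataFrame row.
    soup = BeautifulSoup(get(url).content, 'html.parser')
    arts = soup.select('div.grid.grid-cols-4')
    final = [get_article_content(art) for art in arts]
    return pd.DataFrame(final)  # requires `import pandas as pd`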

In [110]:
acquire_articles(url2)

|   | headline | date | author | content |
|---|----------|------|--------|---------|
| 0 | test the your | 1970-02-25 | By Lorraine Mccullough | Later meeting management save responsibility r... |
| 1 | work effect eat | 1976-07-07 | By Mary Miller | Move ability specific model rest blood have cr... |
| 2 | under his member | 1978-04-15 | By Richard Kaufman | Watch kitchen on truth.\nThrow give song them.... |
| 3 | fast also design | 2021-02-02 | By Allison Erickson | Ask us firm action. Memory message space meet ... |
| 4 | ground stuff bit | 1988-11-13 | By Blake Porter | Store part just million build owner church. St... |
| 5 | board seek action | 1999-12-06 | By Gina Buchanan | Discussion program career easy national house.... |
| 6 | spring player deal | 1976-07-18 | By Kelly Ross | Our threat lose nor enough yeah month. Involve... |
| 7 | range party staff | 1989-09-13 | By David Bates | Her tend most season fill notice. American tel... |
| 8 | often phone teacher | 2004-06-29 | By Alison Thomas | Sign plan age affect. Party mother cost concer... |
| 9 | read call bed | 2000-05-15 | By Charles Wilson | Piece house charge change PM stuff decide beyo... |
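
The scraped columns are all plain strings: the dates are text and every author starts with "By ". A possible cleanup step is sketched below. It is not part of the original notebook, and it uses two rows copied from the output above as stand-in data so that it runs without hitting the site.

```python
import pandas as pd

# Two rows copied from the scraped output above, used here as stand-in data.
df = pd.DataFrame({
    'date': ['1970-02-25', '1976-07-07'],
    'author': ['By Lorraine Mccullough', 'By Mary Miller'],
})

# Convert the date strings to real datetimes so they can be sorted and filtered.
df['date'] = pd.to_datetime(df['date'])
# Drop the leading "By " that the page puts in front of each author name.
df['author'] = df['author'].str.replace(r'^By\s+', '', regex=True)
```

The same two lines should work on the DataFrame returned by `acquire_articles(url2)`, assuming every row follows the pattern seen here.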
