<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Webscraping with BeautifulSoup and requests

_Authors: Riley Daggle & Jeff Hale_

---

## Learning Objectives

After this lesson students will be able to:
- Get HTML content from websites with requests 
- Parse website content with BeautifulSoup


### Prior knowledge required
- Python and pandas basics
---

# Web scraping issues

## Terms of service ⭐️
Google is your friend. Search for the site's terms of service and see what they say about web scraping.

The law is unresolved, but generally, if the data is publicly available and you are using it for educational purposes, it's unlikely that you will have problems. 

![](./assets/scraping-legal-info.png)

[Source](https://mccarthygarberlaw.com/a-comprehensive-legal-guide-to-web-scraping-in-the-us/)

### robots.txt 🤖

`https://my_site_name_here.com/robots.txt` tells you which pages the site would, and would not, like scrapers/crawlers to visit.

Read more [here](https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/#:~:text=txt%20file%20of%20a%20website%20you're%20trying%20to%20crawl,site%20are%20crawlable%20by%20bots.&text=You%20should%20steer%20clear%20from,txt.).
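A minimal sketch of checking robots.txt rules from Python with the standard library's `urllib.robotparser`. The robots.txt content and the paths below are made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the parsed rules permit the URL.
public_allowed = parser.can_fetch("*", "https://my_site_name_here.com/beer/trending")
private_allowed = parser.can_fetch("*", "https://my_site_name_here.com/private/data")
print(public_allowed, private_allowed)
```

For a live site you would call `set_url(...)` and then `read()` instead of `parse(...)`, so the parser downloads the file itself.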

---

## Let's do some scraping
### Imports

In [None]:
# install if needed
# pip install bs4


In [1]:
import bs4
import pandas as pd
import requests

# import pandas, bs4, and requests

In [2]:
bs4.__version__

'4.10.0'

#### Use the requests library to get the content of a sample webpage

In [4]:
url = 'https://rldaggie.github.io/sample-html/'
response = requests.get(url)


#### What did we get back?

In [5]:
response
# code 200= successful

<Response [200]>

#### Our response object has a lot more in it; we just have to get it out.

## Status codes
Status codes tell you how the target server responded to your request.

#### 200 = OK

#### 300s = Redirection

#### 400s = Client Error
- 400 = Bad Request
- 403 = Forbidden (not authorized)
- 404 = Not Found

#### 500s = Server Error

If your request was successful, you now have the contents of the webpage stored in memory on your machine.
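The families above can be sketched as a small lookup. `describe_status` is a hypothetical helper written for this lesson, not part of requests; it uses the standard library's `http.HTTPStatus` for the official phrases. With requests, the integer lives in `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'code phrase: family' for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        family = "Success"
    elif 300 <= code < 400:
        family = "Redirection"
    elif 400 <= code < 500:
        family = "Client Error"
    elif 500 <= code < 600:
        family = "Server Error"
    else:
        family = "Other"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a code the standard library knows about.
        phrase = "Unknown"
    return f"{code} {phrase}: {family}"


for code in (200, 403, 404, 503):
    print(describe_status(code))
```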

---
#### Let's get the good stuff 🚀

In [6]:
# a bunch of html tags.
response.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <title>The title</title>\n\n    <style media="screen">\n      tbody tr {\n        color: red;\n      }\n    </style>\n  </head>\n  <body>\n    <h1 class="foobar" id="title">This is an h1</h1>\n\n    <div>\n      <h1 class="foobar">This is yet another heading.</h1>\n\n      Something inside the div\n    </div>\n\n    <h3>Todo List</h3>\n    <ol class="todo">\n      <li class="foobar">Take out trash</li>\n      <li>Pay billz</li>\n      <li class="foobar">Feed dog</li>\n    </ol>\n\n    <h3>Completed</h3>\n    <ol class=\'done\'>\n      <li>Mow lawn</li>\n      <li class="foobar"><span>Take out compost</span></li>\n      <li><span>Create scraping lecture</span></li>\n    </ol>\n\n    <p class=\'foobar\'>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo

In [7]:
type(response.text)
# a string of HTML, not JSON, so we can't parse it as JSON :(

str

#### We could parse this by hand 😿

#### But that would be painful and we can instead use a library 😀
### Create a `BeautifulSoup` object

In [8]:
from bs4 import BeautifulSoup

In [9]:
# instantiate and pass it the content of our response
soup = BeautifulSoup(response.content)

### What is it?

In [10]:
type(soup)

bs4.BeautifulSoup
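When no parser is named, bs4 picks one itself and may warn about it. A small sketch that names the parser explicitly, using a hand-written HTML string as a stand-in for `response.content` (the string is an assumption made for illustration):

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for response.content.
sample_html = """
<html>
  <head><title>The title</title></head>
  <body><h1 class="foobar" id="title">This is an h1</h1></body>
</html>
"""

# "html.parser" ships with Python, so no extra install is needed.
demo_soup = BeautifulSoup(sample_html, "html.parser")

page_title = demo_soup.title.text
print(type(demo_soup).__name__, "|", page_title)
```

Different parsers can build slightly different trees from messy HTML, so naming one keeps results the same from machine to machine.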

#### Let's take a look at it

In [11]:
soup

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>The title</title>
<style media="screen">
      tbody tr {
        color: red;
      }
    </style>
</head>
<body>
<h1 class="foobar" id="title">This is an h1</h1>
<div>
<h1 class="foobar">This is yet another heading.</h1>

      Something inside the div
    </div>
<h3>Todo List</h3>
<ol class="todo">
<li class="foobar">Take out trash</li>
<li>Pay billz</li>
<li class="foobar">Feed dog</li>
</ol>
<h3>Completed</h3>
<ol class="done">
<li>Mow lawn</li>
<li class="foobar"><span>Take out compost</span></li>
<li><span>Create scraping lecture</span></li>
</ol>
<p class="foobar">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. <span>Duis aute irure dolor</span> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. <em>Excepteu

# `soup.find()`

### Returns either:

1. A soup object (a `bs4.element.Tag`) for the first match
2. `None` if nothing matches
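Both outcomes can be sketched on a hand-written snippet (the HTML string here is an assumption for illustration). Because `find()` can return `None`, it is worth checking before reaching for `.text`.

```python
from bs4 import BeautifulSoup

sample_html = '<ol class="todo"><li>Take out trash</li><li>Pay billz</li></ol>'
demo_soup = BeautifulSoup(sample_html, "html.parser")

first_li = demo_soup.find("li")     # first match only
no_table = demo_soup.find("table")  # nothing matches, so this is None

# Guard before touching .text; None has no such attribute.
table_text = no_table.text if no_table is not None else "(no table found)"
print(first_li.text, "|", table_text)
```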

In [13]:
soup.find('ol')

<ol class="todo">
<li class="foobar">Take out trash</li>
<li>Pay billz</li>
<li class="foobar">Feed dog</li>
</ol>

In [14]:
type(soup.find('ol'))

bs4.element.Tag

In [15]:
ol = soup.find('ol')

#### Get the text in the tag

In [17]:
# use shift tab here to see all available parts
ol.text

'\nTake out trash\nPay billz\nFeed dog\n'

#### Get the attributes of the tag

In [18]:
ol.attrs
# returned as a dictionary

{'class': ['todo']}
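Since `attrs` is a dictionary, a tag can also be indexed like one. A short sketch on a hand-written `h1` (the snippet is an assumption mirroring the sample page):

```python
from bs4 import BeautifulSoup

sample_html = '<h1 class="foobar" id="title">This is an h1</h1>'
h1 = BeautifulSoup(sample_html, "html.parser").find("h1")

tag_id = h1["id"]          # a plain string
tag_classes = h1["class"]  # class can hold several values, so bs4 gives a list
missing = h1.get("href")   # .get() returns None instead of raising KeyError
print(tag_id, tag_classes, missing)
```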

# ⭐️ ⭐️`soup.find_all()` ⭐️ ⭐️

### Returns a **_LIST_** (technically a `bs4.element.ResultSet`) of soup objects that match your query.

## Behaves differently than `find()`

In [24]:
h1_tags = soup.find_all('h1')
h1_tags

[<h1 class="foobar" id="title">This is an h1</h1>,
 <h1 class="foobar">This is yet another heading.</h1>]

In [22]:
type(h1_tags)
# behaves similarly to a list

bs4.element.ResultSet

In [25]:
h1_tags[0]

<h1 class="foobar" id="title">This is an h1</h1>

In [26]:
type(h1_tags[0])

bs4.element.Tag

In [27]:
h1_tags[0].text

'This is an h1'

In [28]:
h1_tags[0].attrs

{'class': ['foobar'], 'id': 'title'}

#### Make a list comprehension that creates a list containing only the text of the tags
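One possible answer, sketched on a hand-written copy of the two `h1` tags so it runs on its own. `demo_h1_tags` is a stand-in name for `h1_tags`.

```python
from bs4 import BeautifulSoup

sample_html = """
<h1 class="foobar" id="title">This is an h1</h1>
<div><h1 class="foobar">This is yet another heading.</h1></div>
"""
demo_h1_tags = BeautifulSoup(sample_html, "html.parser").find_all("h1")

# Pull the .text out of every tag in the ResultSet.
h1_texts = [tag.text for tag in demo_h1_tags]
print(h1_texts)
```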

#### List comprehension that puts the classes of the h1 tags in a list

In [49]:
# loop version: collect the class list of each h1 tag
lst = []
for i, tag in enumerate(h1_tags):
    print(i, tag['class'])
    lst.append(tag['class'])
print(lst)

0 ['foobar']
1 ['foobar']
[['foobar'], ['foobar']]


In [48]:
# jason's solution
[h1tag.attrs['class'] for h1tag in h1_tags]

[['foobar'], ['foobar']]

In [35]:
len(h1_tags)

2

## Todo List

Find the ordered list items where the class = 'done'

In [51]:
soup.find_all('ol', {'class':'done'})

[<ol class="done">
 <li>Mow lawn</li>
 <li class="foobar"><span>Take out compost</span></li>
 <li><span>Create scraping lecture</span></li>
 </ol>]
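The same filter can be written a few ways. This sketch compares the dictionary form used above with the `class_` keyword and a CSS selector via `select()`, on a hand-written copy of the two lists from the sample page.

```python
from bs4 import BeautifulSoup

sample_html = """
<ol class="todo">
  <li class="foobar">Take out trash</li>
  <li>Pay billz</li>
  <li class="foobar">Feed dog</li>
</ol>
<ol class="done">
  <li>Mow lawn</li>
  <li class="foobar"><span>Take out compost</span></li>
  <li><span>Create scraping lecture</span></li>
</ol>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")

by_dict = demo_soup.find_all("ol", {"class": "done"})     # attrs dictionary
by_keyword = demo_soup.find_all("ol", class_="done")      # class_ avoids the reserved word
done_items = demo_soup.select("ol.done li")               # CSS selector
foobar_items = demo_soup.find_all("li", class_="foobar")  # li tags from both lists
print(len(by_dict), len(by_keyword), len(done_items), len(foobar_items))
```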

#### Get the list item texts from the ol

In [52]:
ol = soup.find('ol', {'class': 'done'})

In [53]:
print(ol.text)


Mow lawn
Take out compost
Create scraping lecture



In [56]:
todo_data = {'todos': ol.text}

In [57]:
todo_data

{'todos': '\nMow lawn\nTake out compost\nCreate scraping lecture\n'}
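`ol.text` gives one string with newlines in it. If you would rather have one entry per list item, a sketch like this works, shown on a hand-written copy of the "done" list (`done_ol` and `todo_items` are illustrative names):

```python
from bs4 import BeautifulSoup

sample_html = """
<ol class="done">
  <li>Mow lawn</li>
  <li class="foobar"><span>Take out compost</span></li>
  <li><span>Create scraping lecture</span></li>
</ol>
"""
done_ol = BeautifulSoup(sample_html, "html.parser").find("ol", {"class": "done"})

# Search inside the ol only, then take the text of each li.
todo_items = [li.text for li in done_ol.find_all("li")]
todo_data_demo = {"todos": todo_items}
print(todo_data_demo)
```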

## Let's scrape a beer reviews website

### TOS

Find the Terms of Service for the website. 

### robots.txt

- `https://my_site_name_here.com/robots.txt` tells you which pages the site would like you to crawl.

#### Get the content

In [119]:
url = 'https://www.beeradvocate.com/beer/trending'

In [120]:
beer_response = requests.get(url)

In [121]:
beer_response.text



In [122]:
beer_soup = BeautifulSoup(beer_response.content)

#### Find the trending beers table with BS4

In [123]:
trending_table = beer_soup.find('table')

In [124]:
print(trending_table)

<table border="0" cellpadding="2" cellspacing="0" width="100%">
<tr>
<td align="left" bgcolor="#F0F0F0" valign="middle" width="5%"> </td>
<td align="left" bgcolor="#F0F0F0" valign="middle" width="65%"><span class="muted">Sorted by and displaying number of recent ratings.</span></td>
<td align="left" bgcolor="#F0F0F0" valign="middle" width="10%">Ratings</td>
<td align="left" bgcolor="#F0F0F0" valign="middle" width="10%">Avg</td>
<td align="left" bgcolor="#F0F0F0" valign="middle" width="10%">You</td>
</tr>
<tr><td align="center" bgcolor="#F7F7F7" class="hr_bottom_light" valign="top"><span style="font-weight:bold;color:#666666;">1</span></td><td align="left" class="hr_bottom_light" valign="top"><a href="/beer/profile/1199/577852/"><b>KBS - Hazelnut</b></a><span class="muted"><br/><a href="/beer/profile/1199/">Founders Brewing Company</a><br/><a href="/beer/top-styles/157/">Stout - American Imperial</a> | 12.00%</span></td><td align="left" class="hr_bottom_light" valign="top"><b>27</b></td

In [125]:
trending_table.find_all('tr')

[<tr>
 <td align="left" bgcolor="#F0F0F0" valign="middle" width="5%"> </td>
 <td align="left" bgcolor="#F0F0F0" valign="middle" width="65%"><span class="muted">Sorted by and displaying number of recent ratings.</span></td>
 <td align="left" bgcolor="#F0F0F0" valign="middle" width="10%">Ratings</td>
 <td align="left" bgcolor="#F0F0F0" valign="middle" width="10%">Avg</td>
 <td align="left" bgcolor="#F0F0F0" valign="middle" width="10%">You</td>
 </tr>,
 <tr><td align="center" bgcolor="#F7F7F7" class="hr_bottom_light" valign="top"><span style="font-weight:bold;color:#666666;">1</span></td><td align="left" class="hr_bottom_light" valign="top"><a href="/beer/profile/1199/577852/"><b>KBS - Hazelnut</b></a><span class="muted"><br/><a href="/beer/profile/1199/">Founders Brewing Company</a><br/><a href="/beer/top-styles/157/">Stout - American Imperial</a> | 12.00%</span></td><td align="left" class="hr_bottom_light" valign="top"><b>27</b></td><td align="left" class="hr_bottom_light" valign="top">

In [126]:
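# the first <b> tag in each row holds the beer name; [1:] skips the header row
trending_beers = [i.find('b').text for i in trending_table.find_all('tr')[1:]]
trending_beers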

['KBS - Hazelnut',
 'Cocomungo',
 'Speed Castle',
 "Double Dale's",
 'Cold Hearted',
 'Brian Boru',
 'Voodoo Ranger Juice Force IPA',
 'Atomic Torpedo',
 'Powder Day IPA',
 'Guinness Nitro Cold Brew Coffee',
 'Utopias Barrel-Aged World Wide Stout',
 'Black Hearted',
 'Voodoo Ranger Juicy Haze IPA',
 'Trappistes Rochefort Triple Extra',
 'Trappist Tripel',
 'Think Piece',
 'Bourbon County Brand Reserve Blanton’s Stout',
 'Hopnosis',
 'Where the Wild Hops Are',
 'Beer:Barrel:Time (2021)',
 'Irish Cream Stout',
 'Brewer’s Reserve 7 Layer Stout',
 'Western Mutant',
 'Guinness Draught',
 'Voodoo Ranger Agent 77 IPA',
 '4 Giants And The Haze Of Destiny',
 'Cinnamon Bun Ale',
 'Buntastic',
 'KBS - Cinnamon Vanilla Cocoa',
 'Frosé Hydra Deuce',
 'Bourbon County Brand Fourteen Stout',
 'Tropical Beer Hug',
 'Unwanted Dead',
 'Trippel',
 'Adult Beverage',
 'Japanese Green Tea IPA',
 "Term Oil S'mores",
 'King Titus',
 'Double Nugget Nectar',
 'Lupus Salictarius Batch #3',
 'Keep It Crunchy',
 'A

In [130]:
# index 2 picks the third <b> tag in each row
trending_beer_scores = [float(i.find_all('b')[2].text) for i in trending_table.find_all('tr')[1:]]

#### Grab all the Trending Beers 

In [132]:
# build a DataFrame and show the ten beers with the highest scores
pd.DataFrame({'beer': trending_beers, 'scores': trending_beer_scores}).nlargest(10, 'scores')
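Indexing `<b>` tags by position breaks as soon as a row is missing a cell. Here is a slightly more defensive sketch of the row-by-row approach. The table below is hypothetical: its layout is only loosely modeled on the one printed earlier, and the beer names and numbers are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical table; names and values are invented for illustration.
sample_table_html = """
<table>
  <tr><td></td><td>Beer</td><td>Ratings</td><td>Avg</td></tr>
  <tr><td>1</td><td><a href="#"><b>Beer A</b></a></td><td><b>27</b></td><td><b>4.5</b></td></tr>
  <tr><td>2</td><td><a href="#"><b>Beer B</b></a></td><td><b>19</b></td><td><b>3.9</b></td></tr>
  <tr><td>3</td><td><a href="#"><b>Beer C</b></a></td><td><b>12</b></td><td><b>4.7</b></td></tr>
</table>
"""
table = BeautifulSoup(sample_table_html, "html.parser").find("table")

records = []
for row in table.find_all("tr")[1:]:  # skip the header row
    bold = row.find_all("b")
    if len(bold) < 3:                 # skip rows that lack the expected cells
        continue
    records.append({
        "beer": bold[0].text,
        "ratings": int(bold[1].text),
        "score": float(bold[2].text),
    })

# Highest scores first, then keep the top two.
top_two = sorted(records, key=lambda r: r["score"], reverse=True)[:2]
print([r["beer"] for r in top_two])
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` if you want to continue in pandas.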

## More Issues
Sometimes the HTML you want isn't in the response right away. Maybe the page builds it with JavaScript, or you need to simulate clicking on buttons.

You can use a headless browser. 

- Selenium with Chromium will do the job. Here's an article on the topic: https://www.scrapingbee.com/blog/selenium-python/

- [Scrapy](https://scrapy.org/) is another option for scraping websites. It makes requests and gets data but is more powerful and complex than requests with BS4.

- Your IP address (or username if logged in) can get blocked if you are deemed to be malicious. 

- DoS (Denial of Service) attacks are real, and if you hit a website with lots and lots of requests in quick succession you might get blocked, regardless of what robots.txt or the terms of use say.

- If you want to scrape repeatedly, make sure the website hasn't changed in a way that breaks how you grab the data!
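One simple way to avoid hammering a site is to pause between requests. `fetch_politely` below is a hypothetical helper, not part of requests; it takes the fetching function as an argument, so the sketch runs without touching the network. The one-second default delay is an arbitrary assumption, not a recommended value for any particular site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between calls (hypothetical helper)."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function so nothing is downloaded.
demo_pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.01,
)
print(demo_pages)
```

In a real scrape you could pass `requests.get` as `fetch` and choose a delay that suits the site.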

## Summary

You've seen how to use requests with BS4 to get HTML and parse it.

Scraping websites is brittle and can be frustrating. But it's pretty cool. 😉

### Check for understanding

- What requests method do you use to grab HTML?
- How do you get HTML content out of the requests object?