<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Webscraping with BeautifulSoup and requests

_Authors: Riley Daggle & Jeff Hale_

---

## Learning Objectives

After this lesson students will be able to:
- Get HTML content from websites with requests 
- Parse website content with BeautifulSoup


### Prior knowledge required
- Python and pandas basics
---

# Web scraping issues

## Terms of service ⭐️
Search for the site's terms of service and check what they say about web scraping.

The law is unresolved, but generally, if the data is publicly available and you are using it for educational purposes, it's unlikely that you will have problems. 

![](./assets/scraping-legal-info.png)

[Source](https://mccarthygarberlaw.com/a-comprehensive-legal-guide-to-web-scraping-in-the-us/)

### robots.txt 🤖

`https://my_site_name_here.com/robots.txt` tells you which pages the site allows scrapers and crawlers to visit, and which pages it asks them to avoid.

Read more [here](https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/#:~:text=txt%20file%20of%20a%20website%20you're%20trying%20to%20crawl,site%20are%20crawlable%20by%20bots.&text=You%20should%20steer%20clear%20from,txt.).
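
As a rough sketch of how a robots.txt file can be checked programmatically, the standard library's `urllib.robotparser` can parse the rules and answer whether a URL may be fetched. The robots.txt content, the `my-scraper` user agent name, and the URLs below are made up for illustration so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the example runs offline.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow that agent to request the URL.
allowed_public = parser.can_fetch("my-scraper", "https://my_site_name_here.com/beer/trending/")
allowed_private = parser.can_fetch("my-scraper", "https://my_site_name_here.com/private/data")

print(allowed_public)   # True: only /private/ is disallowed
print(allowed_private)  # False: the path starts with /private/
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead of parsing an inline string.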

---

## Let's do some scraping
### Imports

In [1]:
# install if needed (the PyPI package is named beautifulsoup4)
# pip install beautifulsoup4
import bs4

In [2]:


# import pandas and requests (bs4 was imported above)
import pandas as pd
import requests

#### Use the requests library to get the content of a sample webpage

In [3]:
bs4.__version__

'4.10.0'

In [5]:
url = 'https://rldaggie.github.io/sample-html/'
response = requests.get(url)

#### What did we get back?

In [6]:
response

<Response [200]>

#### Our response object has a lot more in it; we just have to get it out.

## Status codes
Status codes tell you how the target server responded to your request.

#### 200 = OK

#### 300s = Redirection

#### 400s = Client Error
- 400 = Bad Request
- 403 = Forbidden (the server understood the request but refuses to fulfill it)
- 404 = Not Found

#### 500s = Server Error

If your request was successful, you now have the contents of the webpage stored in memory on your machine.
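
With requests, the numeric code is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses. As a rough illustration of the categories above, the sketch below uses only the standard library. `describe_status` is a hypothetical helper written for this lesson, not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '404 Not Found (client error)'."""
    categories = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    category = categories.get(code // 100, "other")
    # HTTPStatus(code) raises ValueError for codes it does not know about.
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase} ({category})"


for code in (200, 301, 403, 404, 500):
    print(describe_status(code))
```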

---
#### Let's get the good stuff 🚀

In [13]:
response.text.find('<tr>')

1747

In [14]:
response.text[1747]

'<'

In [8]:
type(response.text)

str

In [10]:
# response.json()

#### We could parse this by hand 😿

#### But that would be painful and we can instead use a library 😀
### Create a `BeautifulSoup` object

In [11]:
from bs4 import BeautifulSoup

In [12]:
soup = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
soup

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>The title</title>
<style media="screen">
      tbody tr {
        color: red;
      }
    </style>
</head>
<body>
<h1 class="foobar" id="title">This is an h1</h1>
<div>
<h1 class="foobar">This is yet another heading.</h1>

      Something inside the div
    </div>
<h3>Todo List</h3>
<ol class="todo">
<li class="foobar">Take out trash</li>
<li>Pay billz</li>
<li class="foobar">Feed dog</li>
</ol>
<h3>Completed</h3>
<ol class="done">
<li>Mow lawn</li>
<li class="foobar"><span>Take out compost</span></li>
<li><span>Create scraping lecture</span></li>
</ol>
<p class="foobar">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. <span>Duis aute irure dolor</span> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. <em>Excepteu

### What is it

In [None]:
type(soup)

#### Let's take a look at it

# `soup.find()`

### Returns either:

1. A `bs4.element.Tag` object for the first match
2. `None`
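
Because `find()` can return `None`, code that chains `.text` onto the result fails when nothing matches. A minimal sketch of guarding against that, using a small inline HTML string (an assumption, so the example runs without a network request):

```python
from bs4 import BeautifulSoup

html = '<ol class="todo"><li>Take out trash</li><li>Pay billz</li></ol>'
soup_example = BeautifulSoup(html, "html.parser")

first_item = soup_example.find("li")   # Tag for the first <li>
missing = soup_example.find("table")   # None: this snippet has no <table>

first_text = first_item.text
# Check for None before touching attributes, otherwise an AttributeError is raised.
missing_text = missing.text if missing is not None else None

print(first_text)    # Take out trash
print(missing_text)  # None
```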

In [15]:
soup.find('ol')

<ol class="todo">
<li class="foobar">Take out trash</li>
<li>Pay billz</li>
<li class="foobar">Feed dog</li>
</ol>

In [16]:
type(soup.find('ol'))

bs4.element.Tag

In [17]:
ol = soup.find('ol')

#### Get the text in the tag

In [19]:
print(ol.text)


Take out trash
Pay billz
Feed dog



#### Get the attributes of the tag

In [20]:
ol.attrs

{'class': ['todo']}
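
Besides `.attrs`, a tag can be indexed like a dictionary. The sketch below, built on an inline snippet copied from the sample page, shows one way to read a single attribute and how `.get()` avoids an error when the attribute is absent.

```python
from bs4 import BeautifulSoup

snippet = '<h1 class="foobar" id="title">This is an h1</h1>'
h1 = BeautifulSoup(snippet, "html.parser").find("h1")

h1_classes = h1["class"]   # 'class' can hold several values, so bs4 returns a list
h1_id = h1.get("id")       # a single-valued attribute comes back as a string
h1_href = h1.get("href")   # None, because the tag has no href attribute

print(h1_classes)  # ['foobar']
print(h1_id)       # title
print(h1_href)     # None
```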

# ⭐️ ⭐️`soup.find_all()` ⭐️ ⭐️

### Returns a **_LIST_** (technically a `bs4.element.ResultSet`) of `Tag` objects that match your query.

## Behaves differently than `find()`

In [21]:
h1_tags = soup.find_all('h1')
h1_tags

[<h1 class="foobar" id="title">This is an h1</h1>,
 <h1 class="foobar">This is yet another heading.</h1>]

In [22]:
type(h1_tags)

bs4.element.ResultSet

In [23]:
h1_tags[0]

<h1 class="foobar" id="title">This is an h1</h1>

In [24]:
type(h1_tags[0])

bs4.element.Tag

In [25]:
h1_tags[0].text

'This is an h1'

In [26]:
h1_tags[0].attrs

{'class': ['foobar'], 'id': 'title'}

#### Make a list comprehension that creates a list containing only the text of the tags
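
One possible answer, sketched on an inline copy of the two `h1` tags from the sample page so that it runs on its own (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<h1 class="foobar" id="title">This is an h1</h1>
<div><h1 class="foobar">This is yet another heading.</h1></div>
"""
h1_tags_example = BeautifulSoup(html, "html.parser").find_all("h1")

# Pull the .text out of each Tag in the ResultSet.
h1_texts = [tag.text for tag in h1_tags_example]
print(h1_texts)  # ['This is an h1', 'This is yet another heading.']
```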

#### List comprehension that puts the classes of the h1 tags in a list

In [28]:
[h1tag.attrs['class'] for h1tag in h1_tags]

[['foobar'], ['foobar']]

## Todo List

Find the ordered list whose class is 'done'

In [31]:
soup.find('ol', {'class': 'done'})

<ol class="done">
<li>Mow lawn</li>
<li class="foobar"><span>Take out compost</span></li>
<li><span>Create scraping lecture</span></li>
</ol>

In [32]:
ol = soup.find('ol', {'class': 'done'})

#### Get the list item texts from the ol

In [34]:
print(ol.text)


Mow lawn
Take out compost
Create scraping lecture



In [35]:
todo_data = {'todos': ol.text}
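
`ol.text` returns one string with newlines in it. If a list with one entry per item is more useful, a sketch along these lines may help. It uses an inline copy of the `done` list, and `class_` is bs4's keyword alternative to the `{'class': ...}` dictionary.

```python
from bs4 import BeautifulSoup

html = """
<ol class="done">
<li>Mow lawn</li>
<li class="foobar"><span>Take out compost</span></li>
<li><span>Create scraping lecture</span></li>
</ol>
"""
done_list = BeautifulSoup(html, "html.parser").find("ol", class_="done")

# One string per <li>, with surrounding whitespace stripped.
done_items = [li.get_text(strip=True) for li in done_list.find_all("li")]
todo_items = {"todos": done_items}

print(todo_items)
```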

## Let's scrape a beer reviews website

### TOS

Find the Terms of Service for the website. 

### robots.txt

- `https://my_site_name_here.com/robots.txt` tells you which pages the site allows you to crawl.

#### Get the content

In [43]:
url = 'https://www.beeradvocate.com/beer/trending/'

beer_response = requests.get(url)

#### Parse the page and find the trending table with BS4

In [44]:
beer_soup = BeautifulSoup(beer_response.content, 'html.parser')  # name the parser explicitly

In [46]:
trending_table = beer_soup.find('table')

In [49]:
print(trending_table.find('tr').text)


 
Sorted by and displaying number of recent ratings.
Ratings
Avg
You



#### Grab all the Trending Beers 

In [58]:
trending_beers = [i.find('b').text for i in trending_table.find_all('tr')[1:]]

In [None]:
#scores = ???

In [63]:
trending_beer_scores = [float(i.find_all('b')[2].text) for i in trending_table.find_all('tr')[1:]]

In [71]:
num_reviews = [float(i.find_all('b')[1].text) for i in trending_table.find_all('tr')[1:]]

In [72]:
pd.DataFrame({'beer': trending_beers, 'num_ratings': num_reviews, 'scores': trending_beer_scores}).nlargest(10, 'scores')

| | beer | num_ratings | scores |
|---|---|---|---|
| 19 | Beer:Barrel:Time (2021) | 9.0 | 4.79 |
| 66 | Samuel | 5.0 | 4.72 |
| 69 | Double Double Cask | 5.0 | 4.64 |
| 16 | Bourbon County Brand Reserve Blanton’s Stout | 10.0 | 4.62 |
| 42 | Starry Noche | 6.0 | 4.57 |
| 91 | Gggreennn! | 5.0 | 4.54 |
| 65 | I Will Not Be Afraid | 5.0 | 4.5 |
| 10 | Utopias Barrel-Aged World Wide Stout | 12.0 | 4.43 |
| 36 | Term Oil S'mores | 6.0 | 4.43 |
| 57 | Mega Treat | 6.0 | 4.43 |
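
The three list comprehensions above each walk the table rows separately and fail if any row lacks the expected `<b>` tags. A single loop that skips unexpected rows is one alternative. The HTML below is a made-up stand-in with the same rough shape (three bold cells per row), not the real page markup.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup, used so the sketch runs offline.
html = """
<table>
<tr><td>Header row</td></tr>
<tr><td><b>Beer A</b></td><td><b>9</b></td><td><b>4.79</b></td></tr>
<tr><td><b>Beer B</b></td><td><b>5</b></td><td><b>4.72</b></td></tr>
<tr><td>Row without bold tags</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for tr in table.find_all("tr")[1:]:   # skip the header row
    bold = tr.find_all("b")
    if len(bold) < 3:                 # skip rows that lack the expected cells
        continue
    rows.append({
        "beer": bold[0].text,
        "num_ratings": int(bold[1].text),
        "scores": float(bold[2].text),
    })

print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` should give a frame with the same three columns as above.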


## More Issues
Sometimes the HTML doesn't appear right away. Maybe you need to simulate clicking on buttons.

You can use a headless browser. 

- Selenium with Chromium will do the job. Here's an article on the topic: https://www.scrapingbee.com/blog/selenium-python/

- [Scrapy](https://scrapy.org/) is another option for scraping websites. It makes requests and gets data but is more powerful and complex than requests with BS4.

- Your IP address (or username if logged in) can get blocked if you are deemed to be malicious. 

- DoS (Denial of Service) attacks are real, and if you hit a website with many requests in quick succession you might get blocked, regardless of what robots.txt or the terms of use say.

- If you want to scrape repeatedly, make sure the website doesn't get changed, breaking how you grab the data!
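
One simple way to lower the risk of being blocked is to pause between requests. The sketch below is an illustration under assumptions: `fetch_politely` is a hypothetical helper, and a stand-in fetch function replaces the real network call so the example runs offline.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(urls: Iterable[str], fetch: Callable[[str], str],
                   delay_seconds: float = 1.0) -> Dict[str, str]:
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results


# Stand-in for a real request; in practice this could be
# lambda url: requests.get(url, timeout=10).text
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(list(pages))
```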

## Summary

You've seen how to use requests with BS4 to get HTML and parse it.

Scraping websites is brittle and can be frustrating. But it's pretty cool. 😉

### Check for understanding

- What requests method do you use to grab HTML?
- How do you get HTML content out of the response object?