# Web scraping 1

Announcements
1. Quiz 2 corrections
2. Problem set 3, Quiz 3, and discussion board 1 due on Sunday

Topics
1. pd.read_html
2. Beautiful Soup - simple html
3. Beautiful Soup - Babish Recipes
4. Adnan speaking

#### Last class, we did some simple web scraping using pd.read_html. Today we will talk a little bit about what that function was doing and why it won't work in some cases.

In [None]:
import pandas as pd
import numpy as np
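As a rough illustration of why pd.read_html can come up empty: it only looks for `<table>` elements, so a page that lays out its content with `<div>` tags gives it nothing to parse. The sketch below uses two made-up snippets (not the real recipe page) and Beautiful Soup to count the tables in each.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one page built with a <table>, one built with <div> tags.
table_page = "<table><tr><th>Dish</th></tr><tr><td>Biscuits</td></tr></table>"
div_page = '<div class="item"><a href="/biscuits">Biscuits</a></div>'

for name, html in [("table_page", table_page), ("div_page", div_page)]:
    demo = BeautifulSoup(html, "html.parser")
    n_tables = len(demo.find_all("table"))  # read_html needs at least one of these
    print(name, "has", n_tables, "table element(s)")
```

On a page like the second one, pd.read_html raises a ValueError because it finds no tables, which is when we reach for Beautiful Soup instead.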

We will web scrape using Beautiful Soup

In [None]:
from bs4 import BeautifulSoup # now we get beautiful soup
import requests # need this to talk to a website

#### Before we work with our website, let's see what soup does to some basic html

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup = BeautifulSoup(html_doc,'html.parser') # get the content

In [None]:
type(soup)

bs4.BeautifulSoup

#### The Beautiful Soup object parses the html into a nested tree for us.

In [None]:
print(soup.prettify()) 

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


#### There are some specific attributes of the soup object we can look at quickly, such as the title

In [None]:
print(soup.title)

<title>The Dormouse's story</title>


We can also access tags by name, such as the "a" tag. Note that this only returns the first instance of the a tag

In [None]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [None]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

We can output the body of the html

In [None]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [None]:
type(soup.body)

bs4.element.Tag

#### In that tag, we can pull out specific parts. In this case, specifically the text

In [None]:
print(soup.body.text)


The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



When we use print here, the newline characters (\n) in the string are displayed as line breaks

Without print, we can see it is just a string with the \n characters written out

In [None]:
soup.body.text

"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"

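A small sketch of the difference between printing a string and looking at its raw form. The sample text here is a shortened, made-up stand-in for soup.body.text.

```python
text = "Elsie,\nLacie and\nTillie;"

print(text)        # newline characters are rendered as line breaks
print(repr(text))  # repr shows the raw string, with \n left visible

lines = text.splitlines()  # split the string on its newline characters
print(lines)
```

splitlines is handy when we want to work with the text one line at a time.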
Since it's a string, we can look for certain things in the text. The find method returns the index where the match starts

In [None]:
soup.body.text.find('Lacie')

100

If we search for something and it isn't there, it returns a -1

In [None]:
soup.body.text.find('Garrett')

-1
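If all we need is a yes/no answer, the in operator is a simpler option than comparing against -1. A minimal sketch on a shortened sample string (not the full soup text):

```python
text = "Elsie,\nLacie and\nTillie;"

print(text.find("Lacie"))    # index where the match starts
print(text.find("Garrett"))  # -1 means "not found"

# The `in` operator answers the yes/no question directly.
print("Lacie" in text)
print("Garrett" in text)

# Pitfall: -1 is truthy, so `if text.find(...):` does not test for presence.
if text.find("Garrett"):
    print("this branch runs even though 'Garrett' is absent")
```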

But we can change that with some string methods, like the replace method, which returns a new string where every occurrence of the first input is replaced with the second input :)

In [None]:
print(soup.body.text.replace('Lacie','Garrett'))


The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Garrett and
Tillie;
and they lived at the bottom of a well.
...
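Two details about replace are worth seeing in a quick sketch: it returns a new string and leaves the original alone, and an optional third argument limits how many occurrences are replaced. The sample string is made up for illustration.

```python
text = "Lacie and Tillie; Lacie again"

new_text = text.replace("Lacie", "Garrett")       # replaces every occurrence
first_only = text.replace("Lacie", "Garrett", 1)  # optional count limits it to one

print(new_text)
print(first_only)
print(text)  # the original string is unchanged
```

This is also why the soup object itself still contains 'Lacie' after the cell above.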



Let's get back to the html code.

We can also pull stuff out of the tags

In [None]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [None]:
soup.a.text

'Elsie'

But as stated earlier, we have multiple instances of the a tag

#### Use find_all to pull out all of the tags that are a elements

In [None]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [None]:
soup.find('a') # find by itself just pulls out one instance

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
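It helps to know what find and find_all give back when nothing matches. The sketch below uses a tiny made-up snippet; demo and first_table are illustrative names.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<p><a href="/one">One</a><a href="/two">Two</a></p>', "html.parser")

print(demo.find("a"))          # first match only
print(demo.find_all("a"))      # every match, in document order
print(demo.find("table"))      # None when there is no match
print(demo.find_all("table"))  # empty result when there is no match

# Guard against None before asking for .text
first_table = demo.find("table")
table_text = first_table.text if first_table is not None else ""
```

Calling .text directly on the result of a failed find raises an AttributeError, so the None check matters on messy real pages.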

#### We could have also done this by trying to find all of the class "sister". Note that class is a reserved keyword in Python and cannot be used as an argument name here, so we add an underscore

In [None]:
soup.find_all(class_ = 'sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can also find all based on the href or id

In [None]:
soup.find_all(href = 'http://example.com/elsie')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [None]:
soup.find_all(id = "link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
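These filters can also be combined, which helps when a class is shared by different kinds of tags. A minimal sketch on a shortened copy of the story html, with one extra `<b>` tag added as an assumption to show the difference; select uses CSS selector syntax as an alternative.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<b class="sister">not a link</b>
</p>
"""
demo = BeautifulSoup(html, "html.parser")

# Tag name plus class: only <a> tags with class "sister"
links = demo.find_all("a", class_="sister")

# The attrs dictionary is an alternative to keyword arguments
lacie = demo.find_all("a", attrs={"id": "link2"})

# CSS selectors express the same idea: tag.class and #id
css_links = demo.select("a.sister")
css_lacie = demo.select("#link2")

print(len(demo.find_all(class_="sister")), len(links), len(css_links))
```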

#### We can pull out all of the names by using 'a' element in a for loop


In [None]:
for names in soup.find_all('a'):
  print(names)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [None]:
for names in soup.find_all('a'):
  print(names.text)  

Elsie
Lacie
Tillie


We could also print the link that is present in the html code

Let's add the href, which holds an example url and may be useful

Note that to read an attribute stored inside the opening tag, such as href, we use get

In [None]:
for names in soup.find_all('a'):
  print(names.text)
  print(names.get('href'))

Elsie
http://example.com/elsie
Lacie
http://example.com/lacie
Tillie
http://example.com/tillie
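A short sketch of the ways to read attributes. The snippet is made up, and its second link deliberately has no href so we can see what get does in that case.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" id="link1">Elsie</a><a id="link2">Lacie</a>'
demo = BeautifulSoup(html, "html.parser")

for tag in demo.find_all("a"):
    print(tag.text, tag.get("href"))  # get returns None when the attribute is missing

first, second = demo.find_all("a")
print(first["href"])  # square brackets also work when the attribute exists
print(second.attrs)   # all attributes of the tag as a dictionary

# Collect (name, link) pairs, skipping tags without an href
pairs = [(tag.text, tag["href"]) for tag in demo.find_all("a") if tag.has_attr("href")]
print(pairs)
```

Square brackets raise a KeyError for a missing attribute, so get is the safer choice when looping over many tags.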


#### Now that we have a basic feel for it, let's pass in our babish link

In [None]:
url = 'https://www.bingingwithbabish.com/recipes' # the page we want to scrape
response = requests.get(url) # download the page

#### We can check to see if this worked by checking the status_code. If it's 200, it worked okay

In [None]:
response.status_code

200
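Instead of eyeballing the status code, we can let requests raise an error for us. This is one possible sketch: fetch_soup is a hypothetical helper name, and the 10 second timeout is an assumed value, not something from the class notebook.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Download a page and parse it; raise if the request failed."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return BeautifulSoup(response.content, "html.parser")

# Example usage (needs an internet connection):
# soup = fetch_soup("https://www.bingingwithbabish.com/recipes")
```

Wrapping the steps in a function keeps the download, the check, and the parsing together when we scrape more than one page.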

In [None]:
soup = BeautifulSoup(response.content,'html.parser') # get the content

Let's use prettify to check out the html

In [None]:
print(soup.prettify()) 

<!DOCTYPE doctype html>
<html class="" lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <!-- This is Squarespace. -->
  <!-- andrew-rea-c8g3 -->
  <base href=""/>
  <meta charset="utf-8">
   <title>
    Recipes — Binging With Babish
   </title>
   <link href="https://images.squarespace-cdn.com/content/v1/590be7fd15d5dbc6bf3e22d0/1496092300329-Q82R2259FWA044IZ6Z1T/ke17ZwdGBToddI8pDm48kAPa0Bgm7KpaNbphz84CZZB7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z5QPOohDIaIeljMHgDF5CVlOqpeNLcJ80NK65_fV7S1UYZy7oOZfcwAkStf90BvDJJXmaBZPxlipuWfncWBf81r7zs2yPjc1ECvpa5Zm_kMqw/favicon.ico?format=100w" rel="shortcut icon" type="image/x-icon"/>
   <link href="https://www.bingingwithbabish.com/recipes" rel="canonical"/>
   <meta content="Binging With Babish" proper


That's a lot of html! It should match what we see when we view the page source in the browser

#### Let's look at the html code and figure out what we need to look for to get text about our recipes

In [None]:
recipes = soup.find_all('div')

In [None]:
len(recipes)

1233

Over a thousand! That's a lot

Let's check the type 

In [None]:
print(type(recipes))

<class 'bs4.element.ResultSet'>


Now, let's just try some items in the list. Maybe the second one (index 1)?

In [None]:
print(recipes[1].text) # not a recipe. try another number?





Home


Recipes


Basics




About




About




Equipment List




FAQs




Rankings




Blog







cookware


Cookbook


Health




















Maybe the 51st item?

In [None]:
print(recipes[50].text) # here's a recipe


Biscuits inspired by Ted Lasso



#### Seems like just searching by 'div' is going to take too long.


#### Looks like recipe-title-wrapper seems like a good place to simplify our web scraping

Note that sometimes, searching by class can be tricky, so it's worth playing with things and spending the time to find something that works. In this case, I am not sure why recipe-col doesn't work (perhaps it is not specific enough)
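One way to hunt for a useful class is to count which class names show up on the div tags. The sketch below runs on a tiny made-up page (the real one is much larger), so the counts are only illustrative.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is much larger.
html = """
<div class="nav">Home</div>
<div class="recipe-title-wrapper"><div class="recipe-title"><a href="/a">A</a></div></div>
<div class="recipe-title-wrapper"><div class="recipe-title"><a href="/b">B</a></div></div>
"""
demo = BeautifulSoup(html, "html.parser")

class_counts = Counter()
for div in demo.find_all("div"):
    # "class" comes back as a list because a tag can have several classes
    class_counts.update(div.get("class", []))

print(class_counts.most_common())
```

A class whose count is close to the number of items we expect on the page is usually a good candidate for find_all.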


In [None]:
recipes = soup.find_all(class_="recipe-title-wrapper")

#### Let's see how many recipes come back when searching for that specific class


In [None]:
len(recipes)

232

#### Cool. That seems more realistic

#### Let's take a look at the first one to get a feel for what we have

In [None]:
print(recipes[0].text)



Gotcha Pork Roast inspired by Food Wars (Shokugeki no Soma)


April 12, 2021




In [None]:
recipes[0]

<div class="recipe-title-wrapper">
<div class="recipe-title" data-content-field="title">
<a href="/recipes/chicken-fingers-community">Chicken Fingers inspired by Community</a>
</div>
<p class="date">
<time class="published" datetime="2021-04-02">April  2, 2021</time>
</p>
</div>

Let's pull out the href from the a tag

In [None]:
recipes[0].find('a').get('href')

'/recipes/chicken-fingers-community'
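The href here is a relative path, so it will not work in a browser or in requests.get on its own. A minimal sketch of turning it into a full url with urljoin from the standard library:

```python
from urllib.parse import urljoin

page_url = "https://www.bingingwithbabish.com/recipes"  # the page we scraped
href = "/recipes/chicken-fingers-community"             # relative link from the a tag

full_url = urljoin(page_url, href)
print(full_url)
```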

#### Let's pull out the date, which appears to be under another class

In [None]:
rec1 = recipes[0].find(class_='published') # the time tag holding the date; find returns None if an entry has no such tag
print(rec1)

<time class="published" datetime="2021-04-02">April  2, 2021</time>


Now we can get the date.... cool!

In [None]:
print(rec1.text)

April  2, 2021
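The time tag also carries a datetime attribute, which is easier to turn into a real date than the display text. A small sketch using the tag shown above, copied into a string so it runs on its own:

```python
from datetime import date
from bs4 import BeautifulSoup

html = '<p class="date"><time class="published" datetime="2021-04-02">April  2, 2021</time></p>'
time_tag = BeautifulSoup(html, "html.parser").find(class_="published")

iso_string = time_tag.get("datetime")       # the year-month-day string
published = date.fromisoformat(iso_string)  # a real date object we can sort or compare
print(published.year, published.month, published.day)
```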


#### Let's now go through all of the recipes and try to find some recipes for ourselves

In [None]:
for recipe in soup.find_all(class_='recipe-title-wrapper'): # loop over each recipe entry
  if 'chicken' in recipe.text.lower(): # case-insensitive search
    print(recipe.text)
    print(recipe.a.get('href'))



Chicken Fingers inspired by Community


April  2, 2021


/recipes/chicken-fingers-community


Chicken Kiev inspired by Mad Men


March  2, 2021


/recipes/chicken-kiev-mad-men


Sugar Chicken inspired by Rick & Morty


February 23, 2021


/recipes/sugar-chicken-rick-and-morty


Chicken Paprikash inspired by Captain America: Civil War


April 24, 2018


/recipes/2017/6/27/chickenpaprikash


Fried Chicken & Waffle Breakfast Lasagna inspired by The Boondocks


July  4, 2017


/recipes/2017/7/4/fried-chicken-waffle-breakfast-lasagna-inspired-by-the-boondocks


Fried Chicken inspired by Louie


March 16, 2016


/recipes/2016/3/16/louiefriedchicken
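The same loop can collect the pieces into rows instead of printing them. This is only a sketch on two hand-copied entries in the shape shown above. The datetime attribute of the second entry is assumed by analogy with the first, and rows and df are illustrative names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Two entries in the shape shown above; the real page has many more.
html = """
<div class="recipe-title-wrapper">
<div class="recipe-title"><a href="/recipes/chicken-fingers-community">Chicken Fingers inspired by Community</a></div>
<p class="date"><time class="published" datetime="2021-04-02">April  2, 2021</time></p>
</div>
<div class="recipe-title-wrapper">
<div class="recipe-title"><a href="/recipes/chicken-kiev-mad-men">Chicken Kiev inspired by Mad Men</a></div>
<p class="date"><time class="published" datetime="2021-03-02">March  2, 2021</time></p>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

rows = []
for wrapper in demo.find_all(class_="recipe-title-wrapper"):
    link = wrapper.find("a")
    time_tag = wrapper.find(class_="published")
    rows.append({
        "title": link.get_text(strip=True),  # text without surrounding whitespace
        "href": link.get("href"),
        "date": time_tag.get("datetime"),
    })

df = pd.DataFrame(rows)
df["date"] = pd.to_datetime(df["date"])  # convert the strings to timestamps
print(df)
```

Building a list of dictionaries first and creating the dataframe once at the end tends to be simpler than appending to a dataframe inside the loop.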


#### In this simple demonstration, we mostly tried to pull out text. Data works the same way. Next class, we will try to extract some data from different tables online to build a dataframe that we can analyze.