# DS3000 Day 4

Sep 21/22, 2023

Admin
- Qwickly Attendance (PIN on board)
- Homework 1 should be graded by Friday night
- Homework 2 due by October 10
- Quiz 1 will be posted **next** Tuesday, Sep 26, and must be done by Oct 3 (2 hour time limit)
- Lab will still be next **Tuesday, Sep 26** (to give you more practice before the Quiz)
- Visitor pushed back to **Tuesday October 2**, since I was able to have a Data Scientist at Novartis confirm she can meet us!

Push-Up Tracker
- Section 04: 1
- Section 05: 4
- Section 06: 3

Content:
- Continue/Finish Web Scraping, DS Pipelines

## Homework 2 Tricks
Three things on the homework.
- First, I realize I didn't actually show you how to *change* a timezone in class; I just told you that you couldn't do it with `.localize()`. Luckily there's a straightforward method for it: `.astimezone()` (see below).
- Second, I ended up making the web scraping questions a bit easier than I intended (you're welcome)... you should only have to use `pd.read_html()`; there's no need for Beautiful Soup (we'll use it instead on the Lab in a couple weeks).
- Lastly, for Part 1.1, I mistakenly said that the assert wouldn't work (added a pushup for all sections), when in fact it *should* work if you coded it correctly. You *still* need to write a `change_tz()` function (Part 1.2) for the later parts of the problem to work properly, though!

In [1]:
import pytz
from datetime import datetime
# to change a timezone
rightnow = datetime.now()
# localize to eastern
tz_est = pytz.timezone('US/Eastern')
rightnow_est = tz_est.localize(rightnow)
# what timezone should we change it to?
tz_mali = pytz.timezone('Africa/Timbuktu')
# change it
rightnow_mali = rightnow_est.astimezone(tz_mali)
rightnow_mali

datetime.datetime(2023, 9, 22, 17, 52, 8, 933909, tzinfo=<DstTzInfo 'Africa/Timbuktu' GMT0:00:00 STD>)
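
A quick way to convince yourself what `.astimezone()` does: it changes the clock reading and the attached timezone, but not the actual moment in time. The sketch below uses fixed UTC offsets from the standard library instead of `pytz` (an assumption made only to keep the example self-contained; fixed offsets know nothing about daylight saving rules).

```python
from datetime import datetime, timedelta, timezone

# fixed offsets stand in for real timezones here (no daylight saving logic)
tz_eastern_dst = timezone(timedelta(hours=-4), 'EDT')
tz_utc = timezone.utc

# an aware datetime: 1:52 PM Eastern (daylight time)
dt_eastern = datetime(2023, 9, 22, 13, 52, tzinfo=tz_eastern_dst)

# convert to UTC: the clock reading changes ...
dt_utc = dt_eastern.astimezone(tz_utc)
print(dt_utc.hour)

# ... but both objects describe the same moment
print(dt_eastern == dt_utc)
```

Comparing two timezone-aware datetimes compares the underlying instants, which is why the equality check is a handy sanity test after any conversion.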

## Web Scraping
* Using programs or scripts to pretend to browse websites, examine the content on those websites, retrieve and extract data from those websites
* Why scrape?
    * if an API is available for a service, we will nearly always prefer the API to scraping
    * ... but not all services have APIs or the available APIs are too expensive for our project
    * newly published information might not yet be available through ready datasets
* Downsides of scraping:
    * no reference documentation (unlike APIs)
    * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper)
    * if it violates the terms of service, it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)
    * legal and moral grey zone (even if the ToS does not disallow it, somebody has to pay for the traffic, and when you're scraping you're not looking at ads)
    * ... but everybody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)
* Web scraping pipeline:
    * because the webpages might change their structure it's extra important to keep the crawling/extraction step separate from transformations and loading
    * ETL (Extract-Transform-Load), preceded by a crawl step:
        * **Crawl**: open a given URL using `requests` and get the HTML source
        * **Extract**: extract interesting content from the webpage's source
        * **Transform**: our usual unit conversions, etc
        * **Load**: representing the data in an easy way for storage and analysis
    * **Pro tip**: it's usually a good idea to not only store the transformed data, but also the raw HTML source - because the webpages might change and we might be late to realize we're not extracting right. If we have the original HTML source we can go back to it
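
The pipeline above, including the pro tip about keeping the raw HTML, can be sketched roughly as follows. Everything here is a toy stand-in: the function names (`save_raw`, `extract_headings`, `transform`), the file name, and the hard-coded page are assumptions for illustration, and the regex extraction is only reasonable because the page is tiny.

```python
import re
import tempfile
from pathlib import Path

def save_raw(html, folder, name):
    """Store the untouched HTML so extraction can be redone later."""
    path = Path(folder) / f'{name}.html'
    path.write_text(html, encoding='utf-8')
    return path

def extract_headings(html):
    """Extract: pull the text of every <h1> element (toy regex)."""
    return re.findall(r'<h1>(.*?)</h1>', html)

def transform(headings):
    """Transform: strip whitespace and lowercase each heading."""
    return [h.strip().lower() for h in headings]

# stand-in for the crawl step (normally requests.get(url).text)
raw_html = '<html><body><h1> Heading 1 </h1><h1>Another One</h1></body></html>'

with tempfile.TemporaryDirectory() as folder:
    raw_path = save_raw(raw_html, folder, 'page_2023-09-22')
    # later on, extraction can start again from the stored file
    stored_html = raw_path.read_text(encoding='utf-8')
    records = transform(extract_headings(stored_html))

print(records)
```

Because each step is its own function, a change in the page layout should only force a rewrite of the extract step, not the transform or load code.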
    

## Best case scenario
Some webpages publish their data in the form of simple tables (to make your life easier, HW 2 consists of only these situations). In these (rare) cases we can just use pandas' `pd.read_html()` to scrape the data:

https://www.espn.com/nba/team/stats/_/name/bos

In [2]:
import pandas as pd
# read html extracts all the <table> elements from html and returns a list of DataFrames created from them
tables = pd.read_html('https://www.espn.com/nba/team/stats/_/name/bos')
len(tables)

4

In [3]:
tables[0]
# tables[1]
# tables[2]
# tables[3]

Unnamed: 0,Name
0,Jayson Tatum SF
1,Jaylen Brown SG
2,Malcolm Brogdon PG
3,Derrick White PG
4,Marcus Smart PG
5,Al Horford C
6,Grant Williams PF
7,Robert Williams III C
8,Sam Hauser SF
9,Mike Muscala C *


In [4]:
# "glue" dataframes together (more to come on this later in the semester)
player_stats1 = pd.concat(tables[:2], axis=1)
player_stats1

Unnamed: 0,Name,GP,GS,MIN,PTS,OR,DR,REB,AST,STL,BLK,TO,PF,AST/TO
0,Jayson Tatum SF,74,74.0,36.9,30.1,1.1,7.7,8.8,4.6,1.1,0.7,2.9,2.2,1.6
1,Jaylen Brown SG,67,67.0,35.9,26.6,1.2,5.7,6.9,3.5,1.1,0.4,2.9,2.6,1.2
2,Malcolm Brogdon PG,67,0.0,26.0,14.9,0.6,3.6,4.2,3.7,0.7,0.3,1.5,1.6,2.5
3,Derrick White PG,82,70.0,28.3,12.4,0.6,2.9,3.6,3.9,0.7,0.9,1.2,2.2,3.4
4,Marcus Smart PG,61,61.0,32.1,11.5,0.8,2.4,3.1,6.3,1.5,0.4,2.3,2.8,2.7
5,Al Horford C,63,63.0,30.5,9.8,1.2,5.0,6.2,3.0,0.5,1.0,0.6,1.9,5.1
6,Grant Williams PF,79,23.0,25.9,8.1,1.1,3.5,4.6,1.7,0.5,0.4,1.0,2.4,1.6
7,Robert Williams III C,35,20.0,23.5,8.0,3.0,5.4,8.3,1.4,0.6,1.4,1.0,1.9,1.5
8,Sam Hauser SF,80,8.0,16.1,6.4,0.4,2.1,2.6,0.9,0.4,0.3,0.4,1.2,2.4
9,Mike Muscala C *,20,4.0,16.2,5.9,0.7,2.7,3.4,0.6,0.2,0.3,0.5,1.4,1.3


In [5]:
# include the more advanced stats
player_stats2 = pd.concat([player_stats1, tables[3]], axis=1)
player_stats2

Unnamed: 0,Name,GP,GS,MIN,PTS,OR,DR,REB,AST,STL,...,3PA,3P%,FTM,FTA,FT%,2PM,2PA,2P%,SC-EFF,SH-EFF
0,Jayson Tatum SF,74,74.0,36.9,30.1,1.1,7.7,8.8,4.6,1.1,...,9.3,35.0,7.2,8.4,85.4,6.6,11.8,55.8,1.427,0.54
1,Jaylen Brown SG,67,67.0,35.9,26.6,1.2,5.7,6.9,3.5,1.1,...,7.3,33.5,3.9,5.1,76.5,7.7,13.4,57.6,1.29,0.55
2,Malcolm Brogdon PG,67,0.0,26.0,14.9,0.6,3.6,4.2,3.7,0.7,...,4.4,44.4,2.4,2.7,87.0,3.3,6.5,51.0,1.366,0.57
3,Derrick White PG,82,70.0,28.3,12.4,0.6,2.9,3.6,3.9,0.7,...,4.8,38.1,2.0,2.3,87.5,2.5,4.5,54.8,1.342,0.56
4,Marcus Smart PG,61,61.0,32.1,11.5,0.8,2.4,3.1,6.3,1.5,...,5.6,33.6,1.4,1.9,74.6,2.2,4.3,51.9,1.168,0.51
5,Al Horford C,63,63.0,30.5,9.8,1.2,5.0,6.2,3.0,0.5,...,5.2,44.6,0.2,0.3,71.4,1.3,2.4,53.9,1.286,0.63
6,Grant Williams PF,79,23.0,25.9,8.1,1.1,3.5,4.6,1.7,0.5,...,3.7,39.5,1.2,1.5,77.0,1.3,2.3,54.6,1.347,0.57
7,Robert Williams III C,35,20.0,23.5,8.0,3.0,5.4,8.3,1.4,0.6,...,0.0,0.0,0.7,1.2,61.0,3.6,4.8,75.1,1.641,0.75
8,Sam Hauser SF,80,8.0,16.1,6.4,0.4,2.1,2.6,0.9,0.4,...,4.2,41.8,0.2,0.2,70.6,0.5,0.8,65.6,1.293,0.63
9,Mike Muscala C *,20,4.0,16.2,5.9,0.7,2.7,3.4,0.6,0.2,...,3.3,38.5,0.5,0.7,69.2,0.9,1.2,70.8,1.326,0.61


In [6]:
# baseball instead of basketball?
base_tables = pd.read_html('https://www.baseball-reference.com/teams/BOS/2022.shtml')
len(base_tables)

2

In [7]:
base_tables[0]
# base_tables[1]

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Christian Vázquez,31,84,318,294,33,83,20,...,.327,.432,.759,109,127,7,3,0,3,0
1,2,1B,Bobby Dalbec,27,117,353,317,40,68,9,...,.283,.369,.652,80,117,5,3,0,4,0
2,3,2B,Trevor Story,29,94,396,357,53,85,22,...,.303,.434,.737,102,155,9,3,0,4,4
3,4,SS,Xander Bogaerts,29,150,631,557,84,171,38,...,.377,.456,.833,131,254,14,10,0,7,2
4,5,3B,Rafael Devers*,25,141,614,555,84,164,42,...,.358,.521,.879,141,289,14,6,0,3,11
5,6,LF,Alex Verdugo*,26,152,644,593,75,166,39,...,.328,.405,.732,102,240,14,3,0,6,2
6,7,CF,Enrique Hernández,30,93,402,361,48,80,24,...,.291,.338,.629,75,122,11,3,0,4,0
7,8,RF,Jackie Bradley Jr.*,32,91,290,271,21,57,19,...,.257,.321,.578,60,87,3,0,2,0,0
8,9,DH,J.D. Martinez,34,139,596,533,76,146,43,...,.341,.448,.790,117,239,20,5,0,5,1
9,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB


## Messy Data

Notice that the baseball data are quite a bit messier than the basketball data. In web scraping, you are beholden to the format of the website (.html) and will almost certainly have to clean data (sometimes extensively) after scraping it.
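
One piece of that mess is the header row showing up again as a data row (row 9 above). A minimal cleaning sketch, assuming a small made-up table in place of `base_tables[0]` (the player names and the choice of columns are placeholders):

```python
import pandas as pd

# tiny stand-in for base_tables[0]: the header row is repeated as a data row,
# so in this stand-in every value is a string
df_raw = pd.DataFrame({'Rk': ['1', '2', 'Rk', '3'],
                       'Name': ['Player A', 'Player B', 'Name', 'Player C'],
                       'G': ['84', '117', 'G', '94']})

# keep only rows whose 'Rk' value is not the literal header text
df_clean = df_raw[df_raw['Rk'] != 'Rk'].copy()

# convert the numeric columns back from strings to integers
df_clean['Rk'] = df_clean['Rk'].astype(int)
df_clean['G'] = df_clean['G'].astype(int)
df_clean = df_clean.reset_index(drop=True)

print(df_clean)
```

The real table has many more columns and possibly other quirks, so treat this as a starting point rather than a complete cleaner.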

## Basic HTML
Web pages are written in HTML. The source of https://sapiezynski.com/ds3000/scraping/01.html looks like this:

```html
<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>
```
The keywords in `<>` brackets are called tags. They open with `<tag>` and close with `</tag>`.

In [8]:
## Getting the html content in Python
import requests

response = requests.get('https://sapiezynski.com/ds3000/scraping/01.html')
print(response.text)

<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>




In [9]:
# sometimes this doesn't quite work the way you want (c'est la vie with web scraping)
response2 = requests.get('https://www.nytimes.com/2019/03/10/style/what-is-tik-tok.html')
print(response2.text)

<html><head><title>nytimes.com</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMAXOXFarb2XGwAmyGGOw==','hsh':'499AE34129FA4E4FABC31582C3075D','t':'bv','s':17439,'e':'b433290b8cf5107b685b2b0a38c22a2ba8d27cfa3b395b9f1a2b43a79447b29f','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>
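
Before handing a response to a parser, it can help to check whether we got a real page or a bot-check page like the one above. The helper below is a hypothetical heuristic (the function name and the marker phrases are assumptions drawn from this one response; other sites will word things differently):

```python
def looks_blocked(html_text):
    """Heuristic guess: does this response look like a bot-check page?"""
    # marker phrases are assumptions based on the response shown above
    markers = ['please enable js', 'captcha']
    text = html_text.lower()
    return any(marker in text for marker in markers)

blocked_page = ('<html><body><p id="cmsg">Please enable JS and disable '
                'any ad blocker</p></body></html>')
normal_page = '<html><body><h1>Heading 1</h1></body></html>'

print(looks_blocked(blocked_page))
print(looks_blocked(normal_page))
```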



# Beautiful Soup

Even if the .html does look relatively clean, it's still just a big string. How can we deal with it? Luckily there is a module made for just this purpose: Beautiful Soup (`bs4`). Since `pip` is available as a magic command, we can even install it directly from a Jupyter notebook:

In [10]:
pip install bs4

Note: you may need to restart the kernel to use updated packages.


In [11]:
from bs4 import BeautifulSoup

url = 'https://sapiezynski.com/ds3000/scraping/01.html' 
str_html = requests.get(url).text
soup = BeautifulSoup(str_html)

In [12]:
soup

<html>
<head>
<!-- comments in HTML are marked like this -->
<!-- the head tag contains the meta information not displayed but helps browsers render the page -->
</head>
<body>
<!-- This is the body of the document that contains all the visible elements.-->
<h1>Heading 1</h1>
<h2>This is what heading 2 looks like</h2>
<p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>
<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>
<p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
</body>
</html>

In [13]:
## getting elements by their tag name:
soup.find_all('p')

# find_all returns a list, where each element is an instance of the specified tag

[<p>Text is usually in paragraphs.
             New lines and multiple consecutive whitespace characters are ignored.</p>,
 <p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>,
 <p>Links are created using the "a" tag: 
             <a href="https://www.google.com">Click here to google.</a>
             href is an attirbute of the a tag that specify where the link points to.</p>]

In [14]:
# the bs4 object tracks the tags
type(soup.find_all('p')[0])

bs4.element.Tag

In [15]:
for paragraph in soup.find_all('p'):
    # text is a property of a soup object
    print(paragraph.text) 
    print('------')

Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.
------
Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.
------
Links are created using the "a" tag: 
            Click here to google.
            href is an attirbute of the a tag that specify where the link points to.
------


# `.find_all()` on subtrees of soup object


The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object.  

Consider the site at https://sapiezynski.com/ds3000/scraping/02.html:

```html
<html>
    <body>
        <p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
        
        <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>
```

**Goal**: Grab links from the first paragraph only:

In [39]:
# getting the content of the page
url = 'https://sapiezynski.com/ds3000/scraping/02.html'
response = requests.get(url)
soup = BeautifulSoup(response.text)

# finding all paragraphs:
p_all = soup.find_all('p')
p_all[0]

<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>

In [40]:
# getting the first paragraph
p_first = p_all[0]
# for comparison: the text of the first link in the *second* paragraph
soup.find_all('p')[1].find_all('a')[0].text

'Firefox'

In [18]:
# getting the links from the first paragraph:
links_p_first = p_first.find_all('a')

print(links_p_first)

[<a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a>]


### Some syntactic sugar: 
To get the first tag under a soup object, refer to it as an attribute

In [41]:
# is equivalent to soup.find_all('p')[0]
soup.p

<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>

In [20]:
# so we can condense our code as
plinks = soup.p.find_all('a')
print(plinks)

[<a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a>]


In [21]:
# iterating over tags
for par in soup.find_all('p'):
    print(par.a)

<a href="https://duckduckgo.com">DuckDuckGo</a>
<a href="https://firefox.com">Firefox</a>


In [22]:
# and the first link in the first paragraph can be accessed like this:
link = soup.p.a
print(link)

<a href="https://duckduckgo.com">DuckDuckGo</a>


## Identifying if tags exist

In [44]:
# what if we're trying to access an element that doesn't exist?
# USE THIS ON QUIZ!
header = soup.h3
print(header)

# the line below would raise an AttributeError, because header is None
# header.text

None


AttributeError: 'NoneType' object has no attribute 'text'

We can test if a tag exists in a soup object by looking for the first instance of this tag and comparing it to `None`

In [24]:
if soup.h3 is None:
    print("tag h3 doesnt exist in soup")
else:
    print("tag h3 does exist!")

tag h3 doesnt exist in soup


In [25]:
if soup.p is None:
    print("tag p doesnt exist in soup")
else:
    print("tag p does exist!")

tag p does exist!
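
If you find yourself writing this `None` check over and over, it can be wrapped in a small helper. This is just a sketch: `text_or_default` is a made-up name, and `SimpleNamespace` is used here as a stand-in for a real bs4 tag so the example runs without fetching a page.

```python
from types import SimpleNamespace

def text_or_default(tag, default=''):
    """Return tag.text, or `default` when the lookup returned None."""
    if tag is None:
        return default
    return tag.text

# stand-ins: SimpleNamespace mimics a tag with a .text attribute,
# and None mimics what soup.h3 returns when no <h3> exists
found_tag = SimpleNamespace(text='Heading 1')
missing_tag = None

print(text_or_default(found_tag))
print(text_or_default(missing_tag, default='(no header)'))
```

With a real soup object the calls would look like `text_or_default(soup.h3)`.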


## Finding tags by `class_`

**Tip**: This is often one of the most useful ways to pinpoint a particular part of a web page.

In [45]:
# get soup
url = 'https://www.allrecipes.com/search?q=cheese+fondue'
response = requests.get(url)
soup = BeautifulSoup(response.text)

In [46]:
soup

<!DOCTYPE html>
<html class="comp no-js searchTemplate static-html html mntl-html" data-ab="99,82,99,99,70,99,99,99" data-allrecipes-resource-version="1.111.0" data-mantle-resource-version="3.14.266" data-resource-version="1.111.0" data-tracking-container="true" id="searchTemplate_1-0" lang="en">
<!--
<globe-environment environment="k8s-prod" application="allrecipes" dataCenter="us-east-1"/>
-->
<head class="loc head">
<script type="text/javascript">var Mntl = window.Mntl || {};</script>
<link href="//js-sec.indexww.com" rel="preconnect"/>
<link href="//c.amazon-adsystem.com" rel="preconnect"/>
<link href="//securepubads.g.doubleclick.net" rel="preconnect"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="max-image-preview:large, NOINDEX, FOLLOW, NOODP, NOYDIR" name="robots"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://www.allrecipes.com/search" rel="canonical"/>
<title>[cheese fondue] Result

Our **goal** is to get a list of recipes.  Maybe we should find all the `div` tags? What about `span` tags?

In [47]:
# finding via tag ... problematic as we have too many div tags!
len(soup.find_all('div'))

258

In [48]:
len(soup.find_all('span'))

98

Tags are often labeled with a "class" (a single tag can even belong to several classes, listed space-separated).  For example, in https://www.allrecipes.com/search?q=cheese+fondue the first recipe name is encapsulated in these html tags:

    <span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
    
Note that these are two nested `span` tags, each with one class:
- the outer `span` belongs to `card__title`
- the inner `span` (the one holding the recipe name) belongs to `card__title-text`
    
I suspect only our target recipes belong to the `card__title-text` class.  Let's find them all:

In [49]:
recipe_list = soup.find_all(class_='card__title-text')

len(recipe_list)

24

In [50]:
recipe_list

[<span class="card__title-text">Cheese Fondue</span>,
 <span class="card__title-text">Cheese Fondue</span>,
 <span class="card__title-text">Chef John's Classic Cheese Fondue Is the Ultimate Cheese Lover's Recipe</span>,
 <span class="card__title-text">Best Formula Three-Cheese Fondue</span>,
 <span class="card__title-text">Beer Cheese Fondue</span>,
 <span class="card__title-text">YouTube + Chill: For Serious Cheese-Lovers Only</span>,
 <span class="card__title-text">The 7 Best Fondue Sets for Fun Fondue Dinners at Home</span>,
 <span class="card__title-text">Classic Cheese Fondue</span>,
 <span class="card__title-text">Basic Fondue</span>,
 <span class="card__title-text">Cheese</span>,
 <span class="card__title-text">25 Best Appetizers to Make if You're Obsessed With Cheese</span>,
 <span class="card__title-text">How to Make Cheese Sauce From Scratch</span>,
 <span class="card__title-text">The Most Popular Recipes of the 1970s</span>,
 <span class="card__title-text">What Is Gruyère Ch

In [52]:
recipe_list[20].text

'Quick Fontina Cheese Fondue'

## Finding tags by `id`

Nearly the same as finding by class, but you'll look for `id=` in the html and pass it to the `id` keyword of `soup.find_all()`.

**Goal**: Get the footer from: https://www.scrapethissite.com/



```html
<section id="footer">
        <div class="container">
            <div class="row">
                <div class="col-md-12 text-center text-muted">
                    Lessons and Videos © Hartley Brody 2018
                </div><!--.col-->
            </div><!--.row-->
        </div><!--.container-->
    </section>
```

In [33]:
# get soup from url
url = 'https://www.scrapethissite.com/'
html = requests.get(url).text
soup = BeautifulSoup(html)

In [34]:
soup.find_all(id='footer')

[<section id="footer">
 <div class="container">
 <div class="row">
 <div class="col-md-12 text-center text-muted">
                     Lessons and Videos © Hartley Brody 2023
                 </div><!--.col-->
 </div><!--.row-->
 </div><!--.container-->
 </section>]

Note that you can combine all searches shown above:
- tag
    - p (paragraph)
    - a (link)
    - div, span, ...
- tag class
- tag id

```python
# finds all links (tag type = 'a'), with given class and id
soup.find_all('a', class_='fancy-link', id='blue')

```
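
Here is a small runnable version of that combined search. The HTML is made up for illustration (the class names, ids, and `example.com` links are all hypothetical), and the parser is named explicitly as `'html.parser'`:

```python
from bs4 import BeautifulSoup

# made-up HTML for illustration; class and id names are hypothetical
html = '''
<p><a class="fancy-link" id="blue" href="https://example.com/a">A</a>
<a class="fancy-link" id="red" href="https://example.com/b">B</a>
<a class="plain-link" id="green" href="https://example.com/c">C</a>
<span class="fancy-link">not a link</span></p>
'''
soup = BeautifulSoup(html, 'html.parser')

# tag + class: both fancy links, but not the span with the same class
fancy_links = soup.find_all('a', class_='fancy-link')
print([link.text for link in fancy_links])

# tag + class + id: narrows the search to a single element
blue_links = soup.find_all('a', class_='fancy-link', id='blue')
print([link.attrs['href'] for link in blue_links])
```

Each extra argument acts as another filter, so the result list can only stay the same size or shrink.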

## Practice: Web Scraping Pipeline

**Goal:** Get a list of recipe names from www.allrecipes.com like we did for:

https://www.allrecipes.com/search?q=cheese+fondue

1. Write function `crawl_recipes(query)` which:
    * takes the search phrase (the ingredient) as input argument
    * builds the correct url that leads directly to the page that lists the recipes
    * uses `requests` to get the content of this page and returns the html text of the page
1. Write `extract_recipes(text)` which:
    * takes the text returned by `crawl_recipes` as argument
    * builds a BeautifulSoup object out of that text 
    * finds names of all recipes
        - to identify which tags / classes to `find_all()`, open the page in your browser and "inspect" 
        - start from the recipe object above, and call another `find_all()` to zoom into the recipe name itself
    * returns the list of recipe names


**Note**: one solution to this is in the Day3_PracticeSol notebook on Canvas

A new function that will help if you wish to query multiple words:

`string.replace()`

So, if you wish to turn `cheese fondue` into `cheese+fondue`:

`string = 'cheese fondue'`

`string.replace(" ", "+")`

In [35]:
# the below is useful for suppressing FutureWarnings (i.e. warnings about code that works now, but won't in a future version of a library)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
import json
from datetime import datetime
import pandas as pd
import plotly
import plotly.express as px
from bs4 import BeautifulSoup

In [36]:
string = 'cheese fondue'
string = string.replace(" ", "+")
string

'cheese+fondue'
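
`.replace(" ", "+")` is enough for plain words. If a query might contain other characters that have a special meaning in URLs (such as `&`), the standard library's `urllib.parse.quote_plus` handles those as well. A short sketch, reusing the allrecipes search URL from above:

```python
from urllib.parse import quote_plus

base_url = 'https://www.allrecipes.com/search?q='

# spaces become '+', same result as .replace(' ', '+')
print(base_url + quote_plus('cheese fondue'))

# characters with a special meaning in URLs get percent-encoded
print(quote_plus('mac & cheese'))
```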

In [64]:
def crawl_recipes(query):
    string = query.replace(" ", "+")
    url = f"https://www.allrecipes.com/search?q={string}"
    html = requests.get(url).text
    
    return html
    
def extract_recipes(text):
    soup = BeautifulSoup(text)
    recipe_list = soup.find_all(class_='card__title-text')
    recipe_list = [recipe.text for recipe in recipe_list]
    return recipe_list

In [67]:
ceviche_html = crawl_recipes('ceviche')
new_recipe_list = extract_recipes(ceviche_html)
new_recipe_list

['How to Store Ceviche',
 'Crab Ceviche',
 'Ceviche',
 '11 Fish Ceviche Recipes for Easy, No-Cook Appetizers',
 "Javi's Really Real Mexican Ceviche",
 '9 Shrimp Ceviche Recipes',
 "Jose's Shrimp Ceviche",
 '10 Mexican Ceviche Recipes',
 'Avocado Shrimp Ceviche-Estillo Sarita',
 'Ceviche',
 'Mexican Ceviche',
 'Ceviche Self-Portrait',
 'Mahi Mahi Ceviche',
 'City Ceviche',
 'Juicy and Spicy Ceviche',
 'Ceviche Peruano',
 'Halibut-Mango Ceviche',
 'Shrimp and Pineapple Ceviche',
 'Salmon Ceviche',
 'Easy Shrimp Ceviche',
 'Clamato Shrimp "Ceviche" Style',
 'Tilapia Ceviche',
 'Fresh Tuna Ceviche',
 'Bloody Mary Ceviche']

## Getting info from each recipe's own page:

When we interact with the webpage in the browser, clicking on the header with the recipe name leads us to the actual recipe. Let's have a look at how it's done. Here are the links (`<a>` tags) for the first and third cards of the meatloaf search:

```html
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" 
   data-cta="" 
   data-doc-id="6663943" 
   data-ordinal="1" 
   data-tax-levels="" 
   href="https://www.allrecipes.com/recipe/219171/classic-meatloaf/" 
   id="mntl-card-list-items_1-0">
```

```html
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" 
   data-cta="" 
   data-doc-id="6663443" 
   data-ordinal="3" 
   data-tax-levels="" 
   href="https://www.allrecipes.com/recipe/223381/melt-in-your-mouth-meat-loaf/" 
   id="mntl-card-list-items_1-0-2">
```



In [None]:
meatloaf_html = crawl_recipes('meatloaf')
soup = BeautifulSoup(meatloaf_html)

In [None]:
# get a single recipe with link
recipe = soup.find_all('a', class_='comp mntl-card-list-items mntl-document-card mntl-card card card--no-image')[0]

In [None]:
recipe

`BeautifulSoup` exposes a tag's attributes as a dictionary:

In [None]:
recipe.attrs

In [None]:
recipe.attrs['href']

# Adding `href` to our dataframe of recipes

Let's modify our `extract_recipes()` function so that, rather than returning just the names of the dishes, it returns a DataFrame with a `name` column and an `href` column (the link to each recipe's own page):

## `from_dict`

First, a useful tool: `pd.DataFrame.from_dict()` turns a dictionary into a data frame, where the keys become the columns (features) and each value is a list holding that column's entries (one per row):

In [None]:
example_dict = {'col1': [1,2,3,4,5],
                'col2': [6,7,8,9,10],
                'col3': ['who', 'what', 'when', 'where', 'why']}
pd.DataFrame.from_dict(example_dict)

In [68]:
def extract_recipes(text):
    """ builds dataframe of recipe names and links from allrecipes html
    
    Args:
        text (str): html response from allrecipes.com, see crawl_recipes()
        
    Returns:
        df_recipe (pd.DataFrame): dataframe of recipes, with columns
            'name' and 'href'
    """
    # build soup object from text
    soup = BeautifulSoup(text)
    
    recipe_list = []
    for recipe in soup.find_all(class_='card__title-text'):
        # extract / store recipe
        recipe_name = recipe.text
        recipe_list.append(recipe_name)

    href_list = []
    for recipe in soup.find_all('a', class_='comp mntl-card-list-items mntl-document-card mntl-card card card--no-image'):
        # grab the link from each recipe
        recipe_link = recipe.attrs['href']
        href_list.append(recipe_link)
        
        
    # bundle as a dictionary (then use from_dict)
    recipe_dict = {'name': recipe_list,
                   'href': href_list}
    df_recipe = pd.DataFrame.from_dict(recipe_dict)
        
    return df_recipe

In [None]:
extract_recipes(meatloaf_html)

## String Manipulations
- `.split()` & `.join()`
- `.strip()`
- `.replace()`
- `.upper()` & `.lower()`

Visiting [a specific recipe's page](https://www.allrecipes.com/recipe/219171/classic-meatloaf/) yields data stored in a string.  The methods above allow us to extract this information.

In [None]:
# .strip removes all leading and trailing whitespace (spaces and newlines)
'\n\n\n hello!      \n    hello! \n\n    \n \n'.strip()

In [None]:
# we saw .replace last class:
'cheese fondue'.replace(' ', '+')

In [None]:
"hello fred".replace("fred", "george")

In [None]:
# can use replace to delete parts of the string
'lets forget about it, okay?'.replace(' it', '')

In [None]:
# capitalize everything
'dont shout!'.upper()

In [None]:
# lowercase everything
'BE QuieT'.lower()

In [69]:
# split will split a string on every occurrence of the given string (',' below)
'fat: 54 g, calories: 430 cal, sugar: 10g'.split(',')

['fat: 54 g', ' calories: 430 cal', ' sugar: 10g']

In [70]:
# put disparate strings into a single string, glued together by some other string
'<glue>'.join(['a', 'b', 'c', 'd'])

'a<glue>b<glue>c<glue>d'

In [71]:
''.join(['a', 'b', 'c', 'd'])

'abcd'

In [72]:
name_list = 'last0, first0, last1, first1, last2, first2'.split(',')

','.join(name_list[:2])

'last0, first0'

In [76]:
name_list[2:4]
','.join(name_list[2:4]).strip()

'last1, first1'
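
Putting `.split()`, `.strip()` and a bit of slicing logic together, here is one way to turn that same string into a list of full names (a sketch that assumes the pieces always come in last, first pairs):

```python
names_str = 'last0, first0, last1, first1, last2, first2'

# split on commas, then strip the leftover spaces from each piece
pieces = [piece.strip() for piece in names_str.split(',')]

# walk through the pieces two at a time: (last, first)
full_names = []
for i in range(0, len(pieces), 2):
    last, first = pieces[i], pieces[i + 1]
    full_names.append(f'{first} {last}')

print(full_names)
```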

In [77]:
# visit specific recipe's page
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
html = requests.get(url).text
soup = BeautifulSoup(html)

In [78]:
soup

<!DOCTYPE html>
<html class="comp no-js taxlevel-4 recipeScTemplate html mntl-html" data-ab="99,82,99,99,70,54,99,99,63,99" data-allrecipes-resource-version="1.112.0" data-lazy-offset="200" data-mantle-resource-version="3.14.277" data-resource-version="1.112.0" data-tracking-container="true" id="recipeScTemplate_1-0" lang="en">
<!--
<globe-environment environment="k8s-prod" application="allrecipes" dataCenter="us-east-1"/>
-->
<head class="loc head">
<script type="text/javascript">var Mntl = window.Mntl || {};</script>
<link href="//js-sec.indexww.com" rel="preconnect"/>
<link href="//c.amazon-adsystem.com" rel="preconnect"/>
<link href="//securepubads.g.doubleclick.net" rel="preconnect"/>
<link href="//allrecipes.groceryserver.com" rel="preconnect"/>
<link href="//products.polaris.me" rel="preconnect"/>
<link href="//calvera.allrecipes.com" rel="preconnect"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="max-image-preview:large, NOODP, N

In [79]:
# get prep info from 'mntl-recipe-details__content'
info_str = soup.find_all(class_='mntl-recipe-details__content')[0].text.strip().replace('\n', ' ')
info_str

'Prep Time: 10 mins   Cook Time: 15 mins   Total Time: 25 mins   Servings: 10    Yield: 10 servings'

As a string, this isn't very useful; we'd like to transform it into a dictionary:

```python
prep_info_dict = {'Prep Time': '10 mins',
                  'Cook Time': '15 mins',
                  'Total Time': '25 mins',
                  'Servings': '10',
                  'Yield': '10 servings'}
```

In [84]:
# getting nutrition information
# after some crawling we can find the labels here
soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')[1].text

'Saturated Fat'

In [83]:
# and the values can be found using the .next_sibling attribute
soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')[1].next_sibling

'\n9g\n'

In [85]:
# getting nutrition information
nutr_dict = dict()
nutr_list = soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
for fact in nutr_list:
    nutr_dict[fact.text] = fact.next_sibling.strip()
    
nutr_dict

{'Total Fat': '14g',
 'Saturated Fat': '9g',
 'Cholesterol': '46mg',
 'Sodium': '179mg',
 'Total Carbohydrate': '3g',
 'Total Sugars': '1g',
 'Protein': '13g',
 'Vitamin C': '0mg',
 'Calcium': '461mg',
 'Iron': '0mg',
 'Potassium': '67mg'}

## Lecture Break/Practice
Write two functions: `extract_prep_info()` and `extract_nutrition()`, which both accept the url of a particular recipe (see examples above) and return dictionaries of the prep info and nutritional information, respectively. For example:

```python
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
extract_prep_info(url)
extract_nutrition(url)

```

yields:

```python
prep_info_dict = {'Prep Time': '10 mins',
                  'Cook Time': '15 mins',
                  'Total Time': '25 mins',
                  'Servings': '10',
                  'Yield': '10 servings'}

```

and

```python
nutr_info_dict = {'Total Fat': '14g',
                  'Saturated Fat': '9g',
                  'Cholesterol': '46mg',
                  'Sodium': '179mg',
                  'Total Carbohydrate': '3g',
                  'Total Sugars': '1g',
                  'Protein': '13g',
                  'Vitamin C': '0mg',
                  'Calcium': '461mg',
                  'Iron': '0mg',
                  'Potassium': '67mg'}

```

In [None]:
#this will help
info_str.split("   ")[0].split(':')

In [100]:
def extract_prep_info(url):
    """ returns a dictionary of recipe preparation info 
    
    Args:
        url (str): url of an allrecipes recipe page
        
    Returns:
        prep_info_dict (dict): keys are features ('prep'), 
            vals are str that describe feature ('20 mins')
    """
    html = requests.get(url).text
    soup = BeautifulSoup(html)
    
    prep_str = soup.find_all(class_='mntl-recipe-details__content')[0].text.strip().replace('\n', ' ')
    prep_dict = dict()
    
    for line in prep_str.split('   '):
        line_list = line.split(':')
        prep_dict[line_list[0].strip()] = line_list[1].strip()
    
    return prep_dict
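
The string-to-dictionary step inside `extract_prep_info()` can be tried offline. The sketch below is one possible variation, not the required solution: it copies the scraped string from the earlier output and swaps in `re.split()` so that runs of two or more spaces (note the four spaces before `Yield`) are all treated as separators.

```python
import re

# the string scraped above, copied here so the sketch runs without a request
info_str = ('Prep Time: 10 mins   Cook Time: 15 mins   Total Time: 25 mins   '
            'Servings: 10    Yield: 10 servings')

prep_info_dict = dict()
# split wherever two or more whitespace characters appear in a row
for item in re.split(r'\s{2,}', info_str):
    # split on the first colon only: label on the left, value on the right
    label, value = item.split(':', 1)
    prep_info_dict[label.strip()] = value.strip()

prep_info_dict
```

Splitting on the first colon only (`split(':', 1)`) keeps the value intact even if it happens to contain a colon of its own.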

In [101]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of an allrecipes recipe page
        
    Returns:
        nutr_dict (dict): keys are molecule types ('fat'), 
            vals are str of quantity ('24 g')
    """
    html = requests.get(url).text
    soup = BeautifulSoup(html)
    
    nutr_dict = dict()
    nutr_list = soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
    for fact in nutr_list:
        nutr_dict[fact.text] = fact.next_sibling.strip()
    
    return nutr_dict

In [102]:
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
extract_prep_info(url)

{'Prep Time': '10 mins',
 'Cook Time': '15 mins',
 'Total Time': '25 mins',
 'Servings': '10',
 'Yield': '10 servings'}

In [103]:
extract_nutrition(url)

{'Total Fat': '14g',
 'Saturated Fat': '9g',
 'Cholesterol': '46mg',
 'Sodium': '179mg',
 'Total Carbohydrate': '3g',
 'Total Sugars': '1g',
 'Protein': '13g',
 'Vitamin C': '0mg',
 'Calcium': '461mg',
 'Iron': '0mg',
 'Potassium': '67mg'}

### Grabbing numeric values (float/int) from messy strings

- We have strings which describe recipe nutrition info (`'100 mg'`)
- We want numeric data types (`float, int`) so that we can plot and operate on these values

In [None]:
# float from string
float('123')

In [None]:
# potential problem when dealing with a full string: replacing g also modifies sugar
nutr_val = 'sugars: 40 g'
nutr_val.replace('g', '')

In [None]:
# endswith is a method of strings.  allows us to test if a string ends with another string
s = 'youll never guess whats last'
s.endswith('t')

In [None]:
# startswith does the same for the beginning of the string
s = 'hello asdf!'
s.startswith('hello')
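
Both `startswith` and `endswith` also accept a tuple of strings and return `True` if any one of them matches. A small sketch, using a few example values in the style of the nutrition strings above:

```python
val_list = ['14g', '46mg', '10 servings']

# endswith with a tuple: True if the string ends with ANY of the suffixes
has_unit = dict()
for val in val_list:
    has_unit[val] = val.endswith(('mg', 'g'))

has_unit
```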

In [None]:
# removing the unit in the example above
nutr_val = 'sugars: 40 g'

if nutr_val.endswith('g'):
    # reset nutr_val to exclude the last character
    nutr_val = nutr_val[:-1]

In [None]:
nutr_val

In [None]:
# removing the unit in the example above (programmatically)
nutr_val = 'sugars: 40 g'
s_remove = 'g'
if nutr_val.endswith(s_remove):
    nutr_val = nutr_val[:-len(s_remove)]

In [None]:
nutr_val
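
If you are on Python 3.9 or newer (an assumption about your environment), the built-in string method `removesuffix()` does the same check-and-slice in one call. A minimal sketch:

```python
nutr_val = 'sugars: 40 g'

# removesuffix only removes the text if the string actually ends with it
trimmed = nutr_val.removesuffix('g')

# if the suffix is not there, the string comes back unchanged
unchanged = nutr_val.removesuffix('mg')

trimmed, unchanged
```

Unlike `.replace('g', '')`, this leaves the `g` inside `sugars` alone.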

In [None]:
# removing many units in a loop
nutr_val = 'sugars: 40 Grams'
for s_rm in ['Grams', 'mg', 'g']:
    if nutr_val.endswith(s_rm):
        nutr_val = nutr_val[:-len(s_rm)]

nutr_val.strip()
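
The loop above can be wrapped in a small helper that also converts the result to a float. This is only a sketch: `strip_unit()` is a hypothetical name (it is not part of the homework or any library), and it assumes the input is just the value, like `'46mg'`, with no label in front.

```python
def strip_unit(val_str, unit_list=('Grams', 'mg', 'g')):
    """ removes one trailing unit from a string and returns a float

    Args:
        val_str (str): a quantity with a unit ('46mg')
        unit_list (tuple): units to look for, longer units first

    Returns:
        val (float): the quantity without its unit (46.0)
    """
    val_str = val_str.strip()
    for unit in unit_list:
        if val_str.endswith(unit):
            # drop the unit and stop looking
            val_str = val_str[:-len(unit)]
            break

    # float() ignores leftover whitespace around the number
    return float(val_str)


val_list = [strip_unit('14g'), strip_unit('46mg'), strip_unit('40 Grams')]
val_list
```

The order of `unit_list` matters: if `'g'` were checked before `'mg'`, then `'46mg'` would be cut down to `'46m'`, which `float()` cannot convert.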

## Rest of Class (Go slowly; if we don't finish, we can continue next week)
Complete the `extract_nutrition()` function below such that:

```python
# get / extract a data frame of recipes (only name and href)
str_query = 'boston cream pie'
html_str = crawl_recipes(str_query)
df_recipe = extract_recipes(html_str)

for row_idx in range(df_recipe.shape[0]):
    # get / extract nutrition info for a particular recipe
    recipe_url = df_recipe.loc[row_idx, 'href']
    nutr_dict = extract_nutrition(recipe_url)
    
    # add each new nutrition feature to the dataframe
    # only if there ARE nutrition features
    if len(nutr_dict) != 0:
        for nutr_feat, nutr_val in nutr_dict.items():
            df_recipe.loc[row_idx, nutr_feat] = nutr_val
    else:
        df_recipe = df_recipe.drop(row_idx, axis=0)

```

generates the `df_recipe`:

|    | name                           | href                                              | Total Fat | Saturated Fat | Cholesterol | Sodium | Total Carbohydrate | Dietary Fiber | Total Sugars | Protein | Vitamin C | Calcium | Iron | Potassium |
|----|--------------------------------|---------------------------------------------------|-----------|---------------|-------------|--------|--------------------|---------------|--------------|---------|-----------|---------|------|-----------|
| 0  | Chef John's Boston Cream Pie   | https://www.allrecipes.com/recipe/220942/chef-... | 41        | 17            | 199         | 514    | 72                 | 2             | 46           | 10      | 0         | 168     | 2    | 230       |
| 1  | Boston Cream Pie               | https://www.allrecipes.com/recipe/8138/boston-... | 13        | 6             | 61          | 230    | 47                 | 1             | 34           | 5       | 0         | 101     | 2    | 134       |
| 2  | Boston Cream Pie I             | https://www.allrecipes.com/recipe/8137/boston-... | 15        | 9             | 94          | 223    | 43                 | 1             | 26           | 5       | 0         | 97      | 2    | 95        |
| 3  | Semi-Homemade Boston Cream Pie | https://www.allrecipes.com/recipe/278930/semi-... | 41        | 16            | 219         | 568    | 79                 | 3             | 53           | 11      | 0         | 186     | 3    | 194       |
| 9  | Hot Milk Sponge Cake II        | https://www.allrecipes.com/recipe/8159/hot-mil... | 3         | 2             | 52          | 231    | 34                 | 0             | 20           | 4       | NaN       | 61      | 2    | 60        |
| 17 | Boston Cream Dessert Cups      | https://www.allrecipes.com/recipe/213446/bosto... | 15        | 7             | 44          | 237    | 32                 | 0             | 22           | 3       | 0         | 41      | 1    | 101       |
| 19 | Boston Creme Mini-Cupcakes     | https://www.allrecipes.com/recipe/220809/bosto... | 12        | 4             | 32          | 253    | 34                 | 0             | 24           | 3       | 0         | 62      | 1    | 100       |

In [None]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of an allrecipes recipe page
        
    Returns:
        nutr_dict (dict): keys are molecule types ('fat'), 
            vals are floats of quantity ('24 g' = 24)
    """
    pass

In [None]:
# get / extract a data frame of recipes (only name and href)
str_query = 'boston cream pie'
html_str = crawl_recipes(str_query)
df_recipe = extract_recipes(html_str)

In [None]:
df_recipe

In [None]:
url = 'https://www.allrecipes.com/recipe/220942/chef-johns-boston-cream-pie/'

# get soup from url
html = requests.get(url).text
soup = BeautifulSoup(html)

nutr_dict = dict()
nutr_list = soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
for fact in nutr_list:
    nutr_feat = fact.next_sibling.strip()
    # strip units
    for str_rm in ['mg', 'g']:
        if nutr_feat.endswith(str_rm):
            nutr_feat = nutr_feat[:-len(str_rm)]
            
    nutr_dict[fact.text] = float(nutr_feat)
    
nutr_dict

Some recipes will not have nutrition facts:

In [None]:
url2 = 'https://www.allrecipes.com/gallery/most-popular-dessert-from-each-state/'

# get soup from url
html2 = requests.get(url2).text
soup2 = BeautifulSoup(html2)

nutr_dict2 = dict()
nutr_list2 = soup2.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
for fact in nutr_list2:
    nutr_feat = fact.next_sibling.strip()
    # strip units
    for str_rm in ['mg', 'g']:
        if nutr_feat.endswith(str_rm):
            nutr_feat = nutr_feat[:-len(str_rm)]
            
    nutr_dict2[fact.text] = float(nutr_feat)
    
nutr_dict2

In [None]:
len(nutr_dict2)

In [None]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of an allrecipes recipe page
        
    Returns:
        nutr_dict (dict): keys are molecule types ('fat'), 
            vals are floats of quantity ('24 g' = 24)
    """

    html = requests.get(url).text
    soup = BeautifulSoup(html)

    nutr_dict = dict()
    nutr_list = soup.find_all('span', class_ = 'mntl-nutrition-facts-label__nutrient-name mntl-nutrition-facts-label__nutrient-name--has-postfix')
    for fact in nutr_list:
        nutr_feat = fact.next_sibling.strip()
        # strip units
        for str_rm in ['mg', 'g']:
            if nutr_feat.endswith(str_rm):
                nutr_feat = nutr_feat[:-len(str_rm)]
            
        nutr_dict[fact.text] = float(nutr_feat)
    
    return nutr_dict

In [None]:
extract_nutrition(url)

In [None]:
# get / extract a data frame of recipes (only name and href)
str_query = 'boston cream pie'
html_str = crawl_recipes(str_query)
df_recipe = extract_recipes(html_str)

for row_idx in range(df_recipe.shape[0]):
    # get / extract nutrition info for a particular recipe
    recipe_url = df_recipe.loc[row_idx, 'href']
    nutr_dict = extract_nutrition(recipe_url)
    
    # add each new nutrition feature to the dataframe
    # only if there ARE nutrition features
    if len(nutr_dict) != 0:
        for nutr_feat, nutr_val in nutr_dict.items():
            df_recipe.loc[row_idx, nutr_feat] = nutr_val
    else:
        df_recipe = df_recipe.drop(row_idx, axis=0)

In [None]:
df_recipe
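
A different way to handle recipes without nutrition facts is to collect the rows as plain dictionaries first and skip the empty ones, instead of dropping rows from the dataframe inside the loop. The sketch below runs offline: `fake_nutrition()` is a hypothetical stand-in for `extract_nutrition()`, and the hrefs and numbers in it are made up for illustration.

```python
def fake_nutrition(url):
    """ hypothetical stand-in for extract_nutrition(), returns made up data """
    made_up = {'recipe-a': {'Protein': 10.0, 'Sodium': 514.0},
               'gallery-b': dict(),
               'recipe-c': {'Protein': 5.0}}
    return made_up[url]


# made up search results (name and href only)
recipe_list = [{'name': 'A', 'href': 'recipe-a'},
               {'name': 'B', 'href': 'gallery-b'},
               {'name': 'C', 'href': 'recipe-c'}]

row_list = list()
for recipe in recipe_list:
    nutr_dict = fake_nutrition(recipe['href'])

    # skip pages that have no nutrition features at all
    if len(nutr_dict) == 0:
        continue

    # merge name / href with the nutrition features into one row
    row_list.append({**recipe, **nutr_dict})

row_list
```

Passing `row_list` to `pd.DataFrame()` would then give one row per kept recipe, with `NaN` wherever a recipe is missing a particular nutrient.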

## Putting it all together
- get list of dictionaries corresponding to recipes (done!)
- get dictionary of nutrition info per recipe (done!)
- aggregating info into dataframe (see below)
- scatter plot (up next)

In [None]:
def get_df_recipe(str_query, recipe_limit=None):
    """ searches for recipes and returns dataframe, with nutrition info
    
    Args:
        str_query (str): search string
        recipe_limit (int): if passed, limits recipe (helpful
            to speed up nutrition scraping for teaching!)
        
    Returns:
        df_recipe (pd.DataFrame): dataframe, each row is recipe.
            includes columns href, name, and nutrition facts
    """    
    # get / extract a data frame of recipes (only name and href)
    html_str = crawl_recipes(str_query)
    df_recipe = extract_recipes(html_str)
    
    if recipe_limit is not None:
        # discard all but first few recipes (copy so the .loc
        # assignments below don't write to a slice of the original)
        df_recipe = df_recipe.iloc[:recipe_limit, :].copy()

    for row_idx in range(df_recipe.shape[0]):
        # get / extract nutrition info for a particular recipe
        recipe_url = df_recipe.loc[row_idx, 'href']
        nutr_dict = extract_nutrition(recipe_url)
        
        # add each new nutrition feature to the dataframe
        # only if there ARE nutrition features
        if len(nutr_dict) != 0:
            for nutr_feat, nutr_val in nutr_dict.items():
                df_recipe.loc[row_idx, nutr_feat] = nutr_val
        else:
            df_recipe = df_recipe.drop(row_idx, axis=0)

    return df_recipe

In [None]:
query_list = ['pickles', 'truffles', 'peanut butter']

big_df_recipe = pd.DataFrame()
for str_query in query_list:
    # get recipes
    df_recipe_query = get_df_recipe(str_query)
    
    # record the query used to search for these recipes & aggregate
    df_recipe_query['query'] = str_query
    big_df_recipe = pd.concat([big_df_recipe, df_recipe_query])

In [None]:
big_df_recipe

In [None]:
import plotly.express as px

px.scatter(data_frame=big_df_recipe, x='Calcium', y='Potassium', color='query', hover_data=['name'])