# DS3000 Lecture 7 and 8

### Admin:
- HW 3 due on Monday
- HW 4 posted on Tuesday and due on Friday
- Lecture 9 will be a lab. Finishing the lab will earn 1 extra credit
- Project description will be released this week

### Content:
- OpenWeather API pipeline
- Intro to Web Scraping

# Pipeline: What is it?

A data pipeline is a collection of functions* which divide the work of our data collection and processing into separate steps.

(*can be other structures too, but it may be easier to first understand each as a function)


# Why build a data pipeline?

- Allows pipeline to be run in parts (rather than the whole thing)
- Allows pipeline to be built by different programmers working on different parts in parallel
- Allows us to test each piece of our code separately
- Allows for modification / re-use of different sections

What we call a "Data Pipeline" here is a specific instance of "Factoring" a piece of software, splitting up its functionality into pieces.
    


# OpenWeather API Pipeline Activity

OpenWeather API offers a few different queries (see [here](https://openweathermap.org/api) for details):
- 3-hour Forecast 5 days (which we have access to)
- Air Pollution API
- etc.


**Goal:**

Build a library of functions which can be pieced together to support the collection, cleaning and display of features from OpenWeather into a scatter plot of two features.

### Let's design one together: 

(think: input/outputs -> handwritten notes)

# Plan out a pipeline

Write a few 'empty' functions including little more than the docstring:

```python
def some_fnc(a_string, a_list):
    """ processes a string and a list (somehow)
    
    Args:
        a_string (str): an input string which ...
        a_list (list): a list which describes ...
        
    Returns:
        output (dict): the output dict which is ...
    """
    pass
```

and a script which uses them:

```python
# inputs (not necessarily complete)
lat = -42
lon = 73

some_output = some_fnc(lat, lon)
some_other_output = some_other_fnc(some_output)

```

which would, if the functions worked, produce a graph like this (note: this starts Oct 6, because I made it yesterday):

<img src="https://i.ibb.co/Ct0JtRJ/newplot-1.png" width=500>

# What might these empty functions look like?

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
import json
from datetime import datetime
import pandas as pd
import plotly
import plotly.express as px

def openweather_onecall(latlon_tuple, api_key, units='imperial'):
    """ returns 3-hour forecast data (5 days) from one location
    
    Note: despite the function's name, the url below queries the
    3-hour forecast endpoint, see https://openweathermap.org/forecast5
    
    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'
        
    Returns:
        weather_dict (dict): a nested dictionary (tree) which
            contains weather data
    """
    # build url
    lat, lon = latlon_tuple
    url = f'https://api.openweathermap.org/data/2.5/forecast?lat={lat}&lon={lon}&APPID={api_key}&units={units}'

    # get url as a string
    url_text = requests.get(url).text
    
    # convert json to a nested dict
    weather_dict = json.loads(url_text)
    
    return weather_dict

def get_clean_df_daily(weather_dict):
    """ formats weather_dict to a pandas data frame
    
    see https://openweathermap.org/forecast5 for
    the full response specification
    
    Args:
        weather_dict (dict): nested dict from openweather_onecall(),
            weather_dict['list'] is a list of dictionaries of
            3-hour window weather features
            
    Returns:
        df_hourly (pd.DataFrame): each row is weather from 3-hour window
    """
    # build one row (a flat dict) per 3-hour window
    row_list = []
    for hour_dict in weather_dict['list']:
        # copy 'main' so that weather_dict itself is not modified
        row = dict(hour_dict['main'])
        row['datetime'] = hour_dict['dt_txt']
        row_list.append(row)

    # each dict becomes a row, each key becomes a column
    df_hourly = pd.DataFrame(row_list)
    
    return df_hourly

def scatter_plotly(df, feat_x, feat_y, f_html='scatter.html'):
    """ creates a plotly scatter plot, exports as html 
    
    Args:
        df (pd.DataFrame): pandas dataframe
        feat_x (str): x axis of scatter
        feat_y (str): y axis of scatter
        f_html (str): output html file
        
    Returns:
        f_html (str): output html file
    """
    # create scatter plot
    fig = px.scatter(df, x=feat_x, y=feat_y)

    # export scatter to html
    plotly.offline.plot(fig, filename=f_html)
    
    return f_html

In [2]:
# inputs
feat_x = 'datetime'
feat_y = 'temp_max'
latlon_tuple = -42, 70
units = 'imperial'
api_key = '2afdede234eabfa52612efba55bcc8ac'

# get data
weather_dict = openweather_onecall(latlon_tuple, 
                                   units=units,
                                   api_key=api_key)

In [3]:
# clean weather dict (make dataframe from dict, process timestamps etc)
df_daily = get_clean_df_daily(weather_dict)
df_daily.head()

Unnamed: 0,temp,feels_like,temp_min,temp_max,pressure,sea_level,grnd_level,humidity,temp_kf,datetime
0,46.78,39.09,46.44,46.78,1032,1032,1032,69,0.19,2024-07-15 18:00:00
1,46.76,38.97,46.67,46.76,1032,1032,1032,70,0.05,2024-07-15 21:00:00
2,48.45,41.79,48.45,48.45,1032,1032,1032,64,0.0,2024-07-16 00:00:00
3,48.78,42.35,48.78,48.78,1032,1032,1032,72,0.0,2024-07-16 03:00:00
4,48.97,43.11,48.97,48.97,1033,1033,1033,70,0.0,2024-07-16 06:00:00


In [4]:
# make scatter
f_html = scatter_plotly(df_daily, feat_x=feat_x, feat_y=feat_y)
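
One of the reasons given earlier for building a pipeline is that each piece can be tested separately. As a minimal sketch of that idea, the block below checks a cleaning step against a small hand-written dictionary instead of a live API call. The helper `rows_from_weather_dict()` and the values in `fake_weather_dict` are made up for illustration (they are not part of the lecture code); the sketch only assumes the nested layout used by `get_clean_df_daily()`, where `weather_dict['list']` holds one dictionary per 3-hour window.

```python
def rows_from_weather_dict(weather_dict):
    """ flattens a forecast dict into one row (dict) per 3-hour window

    (hypothetical helper, for illustration only)

    Args:
        weather_dict (dict): nested dict with a 'list' key, each item
            has a 'main' dict and a 'dt_txt' timestamp string

    Returns:
        row_list (list): list of flat dicts, one per 3-hour window
    """
    row_list = []
    for window in weather_dict['list']:
        # copy 'main' so the input dictionary is left unchanged
        row = dict(window['main'])
        row['datetime'] = window['dt_txt']
        row_list.append(row)

    return row_list


# hand-written stand-in for an API response (values are made up)
fake_weather_dict = {
    'list': [
        {'main': {'temp': 46.78, 'humidity': 69},
         'dt_txt': '2024-07-15 18:00:00'},
        {'main': {'temp': 46.76, 'humidity': 70},
         'dt_txt': '2024-07-15 21:00:00'},
    ]
}

row_list = rows_from_weather_dict(fake_weather_dict)

# the cleaning step can be checked without an API key or a network call
assert len(row_list) == 2
assert row_list[0]['datetime'] == '2024-07-15 18:00:00'
assert 'datetime' not in fake_weather_dict['list'][0]['main']
```

Passing `row_list` to `pd.DataFrame()` would then give one data frame row per window.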

## Web Scraping
* Using programs or scripts to pretend to browse websites, examine the content on those websites, and retrieve and extract data from them
* Why scrape?
    * if an API is available for a service, we will nearly always prefer the API to scraping
    * ... but not all services have APIs or the available APIs are too expensive for our project
    * newly published information might not yet be available through ready datasets
* Downsides of scraping:
    * no reference documentation (unlike APIs)
    * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper)
    * if it violates the terms of service it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)
    * legal and moral grey zone (even if the ToS does not disallow it, somebody has to pay for the traffic and when you're scraping you're not looking at ads)
    * ... but everybody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)
* Web scraping pipeline:
    * because the webpages might change their structure it's extra important to keep the crawling/extraction step separate from transformations and loading
    * ETL (Extract-Transform-Load), with a crawl step in front:
        * **Crawl**: open a given URL using requests and get the HTML source
        * **Extract**: extract interesting content from the webpage's source
        * **Transform**: our usual unit conversions, etc
        * **Load**: representing the data in an easy way for storage and analysis
    * **Pro tip**: it's usually a good idea to not only store the transformed data, but also the raw HTML source - because the webpages might change and we might be late to realize we're not extracting right. If we have the original HTML source we can go back to it
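
The pro tip above can be sketched as two small helpers which write the raw HTML to disk before any extraction happens and read it back later. The function names, the file-naming scheme, and the default `raw_html` folder are assumptions made for this example, not something the lecture prescribes.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def save_raw_html(html_str, name, dir_out='raw_html'):
    """ saves raw html to a timestamped file (hypothetical helper)

    Args:
        html_str (str): raw html, e.g. requests.get(url).text
        name (str): short label used in the file name
        dir_out (str): folder where raw html files are stored

    Returns:
        f_html (Path): path of the file which was written
    """
    dir_out = Path(dir_out)
    dir_out.mkdir(parents=True, exist_ok=True)

    # the timestamp keeps snapshots from different runs apart
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    f_html = dir_out / f'{name}_{timestamp}.html'
    f_html.write_text(html_str, encoding='utf-8')

    return f_html


def load_raw_html(f_html):
    """ loads raw html saved by save_raw_html() """
    return Path(f_html).read_text(encoding='utf-8')


# usage: a temporary folder keeps this demo from leaving files behind
with tempfile.TemporaryDirectory() as dir_tmp:
    f_html = save_raw_html('<p>hello</p>', 'demo', dir_out=dir_tmp)
    html_again = load_raw_html(f_html)
```

If the extraction step later turns out to be wrong, it can be re-run on the saved files without crawling the site again.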
    

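## First, is it OK to scrape this site?

Check the site's `robots.txt` file, which states which parts of the site automated crawlers are asked not to visit.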
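
Python's standard library can read the rules in a `robots.txt` file, which a site serves from its root (for example `https://www.example.com/robots.txt`). As a minimal sketch, the rules and URLs below are made-up examples so that the block runs without a network connection; for a real site one would point the parser at the site's own file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# made-up robots.txt content (a real one lives at <site>/robots.txt)
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) is True when the rules allow that crawler
ok_public = parser.can_fetch('*', 'https://www.example.com/recipes/')
ok_private = parser.can_fetch('*', 'https://www.example.com/private/data')

print(ok_public)
print(ok_private)
```

Here the first URL is allowed and the second is not, because only paths under `/private/` are disallowed.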

## Best case scenario
Some webpages publish their data in the form of simple tables. In these (rare) cases we can just use pandas .read_html to scrape this data:

https://www.espn.com/nba/team/stats/_/name/bos

In [7]:
import pandas as pd

In [5]:
table = pd.read_html('https://www.espn.com/nba/team/stats/_/name/bos')
table

[                    Name
 0        Jayson Tatum SF
 1        Jaylen Brown SG
 2       Derrick White PG
 3        Jrue Holiday PG
 4   Kristaps Porzingis C
 5           Al Horford C
 6    Payton Pritchard PG
 7          Sam Hauser SF
 8          Luke Kornet C
 9      Oshae Brissett SF
 10    Xavier Tillman F *
 11       Neemias Queta C
 12     Svi Mykhailiuk SG
 13    Jaden Springer G *
 14        Jordan Walsh G
 15                 Total,
     GP    GS   MIN    PTS   OR    DR   REB   AST  STL  BLK    TO    PF  AST/TO
 0   19  19.0  40.4   25.0  0.9   8.8   9.7   6.3  1.1  0.7   2.6   2.8     2.4
 1   19  19.0  37.2   23.9  1.2   4.8   5.9   3.3  1.2  0.6   2.7   2.7     1.2
 2   19  19.0  35.6   16.7  1.0   3.3   4.3   4.1  0.9  1.2   0.8   2.3     4.8
 3   19  19.0  37.9   13.2  1.9   4.2   6.1   4.4  1.1  0.6   1.5   2.1     2.9
 4    7   4.0  23.6   12.3  0.6   3.9   4.4   1.1  0.7  1.6   0.7   2.1     1.6
 5   19  15.0  30.3    9.2  1.8   5.2   7.0   2.1  0.8  0.8   0.6   1.5     3

In [10]:
len(table)
table[2]

Unnamed: 0,Name
0,Jayson Tatum SF
1,Jaylen Brown SG
2,Derrick White PG
3,Jrue Holiday PG
4,Kristaps Porzingis C
5,Al Horford C
6,Payton Pritchard PG
7,Sam Hauser SF
8,Luke Kornet C
9,Oshae Brissett SF


In [14]:
player_stats = pd.concat(table[:2], axis = 1)
player_stats = pd.concat([player_stats, table[3]], axis = 1)
player_stats

Unnamed: 0,Name,GP,GS,MIN,PTS,OR,DR,REB,AST,STL,...,3PA,3P%,FTM,FTA,FT%,2PM,2PA,2P%,SC-EFF,SH-EFF
0,Jayson Tatum SF,19,19.0,40.4,25.0,0.9,8.8,9.7,6.3,1.1,...,7.3,28.3,6.2,7.2,86.1,6.3,12.3,51.3,1.277,0.48
1,Jaylen Brown SG,19,19.0,37.2,23.9,1.2,4.8,5.9,3.3,1.2,...,5.8,32.7,3.6,5.4,66.0,7.3,12.1,60.7,1.339,0.57
2,Derrick White PG,19,19.0,35.6,16.7,1.0,3.3,4.3,4.1,0.9,...,8.5,40.4,1.8,2.0,92.1,2.3,4.2,55.0,1.32,0.59
3,Jrue Holiday PG,19,19.0,37.9,13.2,1.9,4.2,6.1,4.4,1.1,...,4.6,40.2,1.1,1.2,95.5,3.3,5.6,58.5,1.295,0.59
4,Kristaps Porzingis C,7,4.0,23.6,12.3,0.6,3.9,4.4,1.1,0.7,...,4.1,34.5,2.9,3.1,90.9,2.6,4.4,58.1,1.433,0.55
5,Al Horford C,19,15.0,30.3,9.2,1.8,5.2,7.0,2.1,0.8,...,5.0,36.8,0.4,0.6,63.6,1.6,2.3,72.1,1.261,0.61
6,Payton Pritchard PG,19,0.0,18.7,6.4,0.7,1.2,1.9,2.1,0.2,...,3.2,38.3,0.6,0.6,91.7,1.1,2.4,46.7,1.162,0.53
7,Sam Hauser SF,19,0.0,14.9,5.4,0.4,1.7,2.2,0.6,0.3,...,3.7,38.0,0.2,0.2,100.0,0.5,0.7,69.2,1.226,0.59
8,Luke Kornet C,13,0.0,10.2,3.0,1.5,1.7,3.2,0.5,0.1,...,0.0,0.0,0.8,1.0,84.6,1.1,1.6,66.7,1.857,0.67
9,Oshae Brissett SF,10,0.0,5.5,1.6,0.3,1.1,1.4,0.0,0.3,...,0.2,100.0,0.2,0.4,50.0,0.4,0.9,44.4,1.455,0.64


In [15]:
# baseball instead of basketball?
# https://www.baseball-reference.com/teams/BOS/2022.shtml

base_table = pd.read_html('https://www.baseball-reference.com/teams/BOS/2022.shtml')
len(base_table)


2

In [17]:
base_table[1]

Unnamed: 0,Rk,Pos,Name,Age,W,L,W-L%,ERA,G,GS,...,WP,BF,ERA+,FIP,WHIP,H9,HR9,BB9,SO9,SO/W
0,1,SP,Nick Pivetta,29,10,12,.455,4.56,33,33,...,10,773,92,4.42,1.380,8.8,1.4,3.7,8.8,2.40
1,2,SP,Michael Wacha,30,11,2,.846,3.32,23,23,...,4,515,127,4.14,1.115,7.8,1.3,2.2,7.4,3.35
2,3,SP,Rich Hill*,42,8,7,.533,4.27,26,26,...,0,526,98,3.92,1.303,9.0,1.1,2.7,7.9,2.95
3,4,SP,Nathan Eovaldi,32,6,3,.667,3.87,20,20,...,2,460,109,4.30,1.235,9.5,1.7,1.6,8.5,5.15
4,5,SP,Kutter Crawford,26,3,6,.333,5.47,21,12,...,4,334,77,4.34,1.422,9.4,1.4,3.4,9.0,2.66
5,6,SP,Josh Winckowski,24,5,7,.417,5.89,15,14,...,0,316,72,4.95,1.592,10.9,1.3,3.5,5.6,1.63
6,Rk,Pos,Name,Age,W,L,W-L%,ERA,G,GS,...,WP,BF,ERA+,FIP,WHIP,H9,HR9,BB9,SO9,SO/W
7,7,CL,John Schreiber,28,4,4,.500,2.22,64,0,...,4,257,190,2.50,0.985,6.2,0.4,2.6,10.2,3.89
8,8,RP,Ryan Brasier,34,0,3,.000,5.78,68,0,...,1,263,73,3.61,1.299,9.8,1.3,1.9,9.2,4.92
9,9,RP,Austin Davis*,29,2,1,.667,5.47,50,3,...,1,254,77,3.94,1.564,9.3,0.8,4.8,10.1,2.10


## Messy Data

Notice that the baseball data are quite a bit messier than the basketball data. In web scraping, you are beholden to the format of the website (.html) and will almost certainly have to clean data (sometimes extensively) after scraping it.
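
As a small sketch of the kind of cleaning this implies, the block below removes a repeated header row (like row 6 of the baseball table, where the column names reappear as data) and converts the remaining text to numbers. The tiny data frame is hand-typed from a few of the values shown in that table so that the example runs without scraping; the variable names are only for illustration.

```python
import pandas as pd

# hand-typed stand-in for a scraped table with a repeated header row
df_raw = pd.DataFrame({'Rk': ['6', 'Rk', '7'],
                       'Name': ['Josh Winckowski', 'Name', 'John Schreiber'],
                       'ERA': ['5.89', 'ERA', '2.22']})

# keep only the rows whose 'Rk' entry is not the header text itself
bool_header = df_raw['Rk'] == 'Rk'
df_clean = df_raw.loc[~bool_header, :].reset_index(drop=True)

# scraped values arrive as strings, convert the numeric columns
df_clean['Rk'] = pd.to_numeric(df_clean['Rk'])
df_clean['ERA'] = pd.to_numeric(df_clean['ERA'])

print(df_clean)
```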

## Basic HTML
Web pages are written in HTML. The source of https://sapiezynski.com/ds3000/scraping/01.html looks like this:

```html
<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>
```
The keywords in `<>` brackets are called tags. They open with `<tag>` and close with `</tag>`.

In [20]:
## Getting the html content in Python
import requests

response = requests.get('https://sapiezynski.com/ds3000/scraping/01.html')
print(response.text)

<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>




In [21]:
# sometimes this doesn't quite work the way you want (c'est la vie with web scraping)
response2 = requests.get('https://www.nytimes.com/2019/03/10/style/what-is-tik-tok.html')
print(response2.text)

<html><head><title>nytimes.com</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMA8wYL_R-WW1IAmyGCHw==','hsh':'499AE34129FA4E4FABC31582C3075D','t':'bv','s':17439,'e':'ec0e6e6eeeda672a40dff8fd801bf6a4d75b96fe9135653c4abb4e50d53e3ec5','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>


# Beautiful Soup

Even if the .html does look relatively clean, it's still just a big string. How can we deal with it? Luckily there is a module made for just this purpose (Beautiful Soup, imported as `bs4`), and we can install it directly from a Jupyter notebook with `pip`:

In [12]:
#pip install bs4





In [23]:
from bs4 import BeautifulSoup

url = "https://sapiezynski.com/ds3000/scraping/01.html"
str_html = requests.get(url).text
soup = BeautifulSoup(str_html)

In [24]:
soup

<html>
<head>
<!-- comments in HTML are marked like this -->
<!-- the head tag contains the meta information not displayed but helps browsers render the page -->
</head>
<body>
<!-- This is the body of the document that contains all the visible elements.-->
<h1>Heading 1</h1>
<h2>This is what heading 2 looks like</h2>
<p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>
<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>
<p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
</body>
</html>

In [26]:
soup.find_all('p')

[<p>Text is usually in paragraphs.
             New lines and multiple consecutive whitespace characters are ignored.</p>,
 <p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>,
 <p>Links are created using the "a" tag: 
             <a href="https://www.google.com">Click here to google.</a>
             href is an attirbute of the a tag that specify where the link points to.</p>]

In [28]:
soup.find_all('p')[0]

<p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

In [29]:
type(soup.find_all('p')[0])

bs4.element.Tag

In [30]:
soup.find_all('p')[0].text

'Text is usually in paragraphs.\n            New lines and multiple consecutive whitespace characters are ignored.'

In [31]:
for paragraph in soup.find_all('p'):
    print(paragraph.text)
    print('-------------------')

Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.
-------------------
Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.
-------------------
Links are created using the "a" tag: 
            Click here to google.
            href is an attirbute of the a tag that specify where the link points to.
-------------------
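
The text of the first paragraph still carries the newline and the run of spaces from the HTML source. A common way to collapse these, sketched here on a copy of that string, is to split on whitespace and join the pieces with single spaces:

```python
# text of the first paragraph, copied from the output above
par_text = ('Text is usually in paragraphs.\n            '
            'New lines and multiple consecutive whitespace characters are ignored.')

# .split() with no argument splits on any run of whitespace
par_clean = ' '.join(par_text.split())

print(par_clean)
```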


# `.find_all()` on subtrees of soup object


The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object.  

Consider the site at https://sapiezynski.com/ds3000/scraping/02.html:

```html
<html>
    <body>
        <p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
        
        <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>
```

**Goal**: Grab links from the first paragraph only:

In [32]:
url = "https://sapiezynski.com/ds3000/scraping/02.html"
response = requests.get(url)
soup = BeautifulSoup(response.text)
soup

<html>
<body>
<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
<p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a>.</p>
</body>
</html>

In [33]:
p_all = soup.find_all('p')
p_all

[<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>,
 <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a>.</p>]

### Some syntactic sugar: 
To get the first tag of a given type under a soup object, refer to the tag name as an attribute:

In [34]:
soup.p

<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>

In [35]:
soup.p.find_all('a')

[<a href="https://duckduckgo.com">DuckDuckGo</a>,
 <a href="https://google.com">Google</a>,
 <a href="https://bing.com">Bing</a>]

In [36]:
soup.p.a

<a href="https://duckduckgo.com">DuckDuckGo</a>

In [37]:
for par in soup.find_all('p'):
    print(par.a)

<a href="https://duckduckgo.com">DuckDuckGo</a>
<a href="https://firefox.com">Firefox</a>


## Identifying if tags exist

In [39]:
header = soup.h3
print(header)

None


We can test if a tag exists in a soup object by looking for the first instance of this tag and comparing it to `None`

In [40]:
if soup.h3 is None:
    print('the tag h3 does not exist')
else:
    print('the tag h3 does exist')

the tag h3 does not exist


In [41]:
if soup.p is None:
    print('the tag p does not exist')
else:
    print('the tag p does exist')

the tag p does exist


## Finding tags by `class_`

**Tip**: This is often one of the most useful ways to locate a particular part of a web page.

In [42]:
# get soup
url = 'https://www.allrecipes.com/search?q=cheese+fondue'
responses = requests.get(url)
soup = BeautifulSoup(responses.text)

In [44]:
#soup

Our **goal** is to get a list of recipes.  Maybe we should find all the `div` tags? What about `span` tags?

In [45]:
len(soup.find_all('div'))

303

In [48]:
len(soup.find_all('span'))

106

Tags can belong to one or more "classes".  For example, in https://www.allrecipes.com/search?q=cheese+fondue the first recipe name is encapsulated in this html:

    <span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
    
Here there are two nested span tags, each with its own class:
- the outer span belongs to `card__title`
- the inner span, which holds the recipe name, belongs to `card__title-text`
    
I suspect only our target recipes belong to the `card__title-text` class.  Let's find them all:

In [49]:
recipe_list = soup.find_all(class_ = 'card__title-text')
len(recipe_list)

24

In [50]:
recipe_list

[<span class="card__title-text">Cheese Fondue</span>,
 <span class="card__title-text">Best Formula Three-Cheese Fondue</span>,
 <span class="card__title-text">Beer Cheese Fondue</span>,
 <span class="card__title-text">Classic Cheese Fondue</span>,
 <span class="card__title-text">Chef John's Classic Cheese Fondue Is the Ultimate Cheese Lover's Recipe</span>,
 <span class="card__title-text">Cheese Fondue</span>,
 <span class="card__title-text">Quick Fontina Cheese Fondue</span>,
 <span class="card__title-text">Basic Fondue</span>,
 <span class="card__title-text">YouTube + Chill: For Serious Cheese-Lovers Only</span>,
 <span class="card__title-text">Classic Swiss Fondue</span>,
 <span class="card__title-text">Crab Cheese Fondue</span>,
 <span class="card__title-text">Cheese</span>,
 <span class="card__title-text">25 Best Appetizers to Make if You're Obsessed With Cheese</span>,
 <span class="card__title-text">How to Make Cheese Sauce From Scratch</span>,
 <span class="card__title-text">Wh

In [51]:
recipe_list[23].text

'Why Do Some Cheeses Melt Better Than Others?'

## Finding tags by `id`

Nearly the same as finding by class, but you'll look for `id=` in the html and pass it to the `id` keyword of `soup.find_all()`.

**Goal**: Get the footer from: https://www.scrapethissite.com/



```html
<section id="footer">
        <div class="container">
            <div class="row">
                <div class="col-md-12 text-center text-muted">
                    Lessons and Videos © Hartley Brody 2018
                </div><!--.col-->
            </div><!--.row-->
        </div><!--.container-->
    </section>
```

In [52]:
# get soup from url
url = 'https://www.scrapethissite.com/'
html = requests.get(url)
soup = BeautifulSoup(html.text)

In [53]:
soup.find_all(id = 'footer')

[<section id="footer">
 <div class="container">
 <div class="row">
 <div class="col-md-12 text-center text-muted">
                     Lessons and Videos © Hartley Brody 2023
                 </div><!--.col-->
 </div><!--.row-->
 </div><!--.container-->
 </section>]

Note that you can combine all searches shown above:
- tag
    - p (paragraph)
    - a (link)
    - div, span, ...
- tag class
- tag id

```python
# finds all links (tag type = 'a'), with given class and id
soup.find_all('a', class_='fancy-link', id='blue')

```
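
To see the combined search at work without depending on a live site, here is a minimal sketch on a hand-written HTML string. The class names, ids and URLs are made up (reusing `fancy-link` and `blue` from the snippet above), and `'html.parser'` names Python's built-in parser explicitly.

```python
from bs4 import BeautifulSoup

# hand-written html, not taken from a real site
str_html = """
<p>
  <a class="fancy-link" id="blue" href="https://example.com/a">A</a>
  <a class="fancy-link" id="red" href="https://example.com/b">B</a>
  <a class="plain-link" id="green" href="https://example.com/c">C</a>
</p>
"""
soup = BeautifulSoup(str_html, 'html.parser')

# tag type only: all three links
link_all = soup.find_all('a')

# tag type and class: the two 'fancy-link' links
link_fancy = soup.find_all('a', class_='fancy-link')

# tag type, class and id: only the link which matches all three
link_blue = soup.find_all('a', class_='fancy-link', id='blue')

print(len(link_all), len(link_fancy), len(link_blue))
print(link_blue[0].attrs['href'])
```

Each extra argument narrows the search, so the last call returns a list with a single link.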

## Practice: 

**Goal:** Get a list of recipe names from www.allrecipes.com like we did for:

https://www.allrecipes.com/search?q=cheese+fondue

1. Write function `crawl_recipes(query)` which:
    * takes the search phrase (the ingredient) as input argument
    * builds the correct url that leads directly to the page that lists the recipes
    * uses `requests` to get the content of this page and returns the html text of the page
1. Write `extract_recipes(text)` which:
    * takes the text returned by `crawl_recipes` as argument
    * builds a BeautifulSoup object out of that text 
    * finds names of all recipes
        - to identify which tags / classes to `find_all()`, open the page in your browser and "inspect" 
        - start from the recipe object above, and call another `find_all()` to zoom into the recipe name itself
    * returns the list of recipe names
    

A new function that will help if you wish to query multiple words:

`string.replace()`

So, if you wish to turn `cheese fondue` into `cheese+fondue`:

In [54]:
string = 'cheese fondue'
string = string.replace(" ", "+")
string

'cheese+fondue'
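
`.replace()` handles spaces, but a query may also contain characters such as `&` which have a special meaning in a URL. As an alternative sketch (not what the lecture code uses), the standard library's `urllib.parse.quote_plus()` turns spaces into `+` and escapes such characters too:

```python
from urllib.parse import quote_plus

# spaces become '+', just like .replace(" ", "+")
query = quote_plus('cheese fondue')

# characters with a special meaning in urls are escaped as well
query_special = quote_plus('mac & cheese')

print(query)
print(query_special)
```

The first call gives `cheese+fondue`, the second gives `mac+%26+cheese`.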

In [57]:
def crawl_recipes(query):
    """ gets html from allrecipes.com for a search query
    
    Args:
        query (str): search string
        
    Returns:
        html (str): html response from allrecipes.com
    """
    query = query.replace(" ", "+")
    url = f'https://www.allrecipes.com/search?q={query}'
    html = requests.get(url).text

    return html
    
def extract_recipes(text):
    """ builds list of recipe names from allrecipes html
    
    Args:
        text (str): html response from allrecipes.com, see crawl_recipes()
        
    Returns:
        recipe_list (list): list of recipes
    """
    soup = BeautifulSoup(text)
    recipe_list = []

    for recipe in soup.find_all(class_ = 'card__title-text'):
        recipe = recipe.text
        recipe_list.append(recipe)

    return recipe_list

In [58]:
meatloaf_html = crawl_recipes('meatloaf')

In [61]:
new_recipe_list = extract_recipes(meatloaf_html)

In [62]:
new_recipe_list

['Classic Meatloaf',
 'Melt-In-Your-Mouth Meatloaf',
 'Best Meatloaf',
 'Easy Meatloaf',
 'Creamy Mushroom Meatloaf',
 "Meatloaf that Doesn't Crumble",
 'Meatloaf Cupcakes',
 'Italian Style Turkey Meatloaf',
 'Brown Sugar Meatloaf with Ketchup Glaze',
 "Chris's Incredible Italian Turkey Meatloaf",
 'Turkey and Quinoa Meatloaf',
 'Dill Pickle Meatloaf',
 "Kim's Ultimate Meatloaf",
 "Ellen's Buffalo Meatloaf",
 'Sweet and Sour Meatloaf',
 'Mushroom Meatloaf',
 "Chef John's Meatball-Inspired Meatloaf",
 'Best Turkey Meatloaf',
 'Cottage Meatloaf',
 'Tennessee Meatloaf',
 "Momma's Healthy Meatloaf",
 'Best Ever Meatloaf with Brown Gravy',
 'Smoky Chipotle Meatloaf',
 'Cheeseburger Meatloaf']

## Getting info from each recipe's own page:

When we interact with the webpage in the browser, clicking on the header with the recipe name leads us to the actual recipe. Let's have a look at how it's done. Here are the links (`<a>` tags) for the first and third cards of the meatloaf search:

```html
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" 
   data-cta="" 
   data-doc-id="6663943" 
   data-ordinal="1" 
   data-tax-levels="" 
   href="https://www.allrecipes.com/recipe/219171/classic-meatloaf/" 
   id="mntl-card-list-items_1-0">
```

```html
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" 
   data-cta="" 
   data-doc-id="6663443" 
   data-ordinal="3" 
   data-tax-levels="" 
   href="https://www.allrecipes.com/recipe/223381/melt-in-your-mouth-meat-loaf/" 
   id="mntl-card-list-items_1-0-2">
```



In [64]:
meatloaf_html = crawl_recipes('meatloaf')
soup = BeautifulSoup(meatloaf_html)

In [66]:
recipe = soup.find_all('a', class_ = 'comp mntl-card-list-items mntl-document-card mntl-card card card--no-image')
recipe[0]

<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" data-doc-id="6663943" data-ordinal="1" data-tax-levels="" href="https://www.allrecipes.com/recipe/219171/classic-meatloaf/" id="mntl-card-list-items_1-0">
<div class="loc card__top"><div class="card__media mntl-universal-image card__media universal-image__container"><div class="img-placeholder" style="padding-bottom:66.6%;">
<img alt="close up view of a sliced meatloaf on a white platter" class="lazyload card__img universal-image__image" data-expand="300" data-src="https://www.allrecipes.com/thmb/EfIedrgpookiaFCI_MCUAmQOUTY=/282x188/filters:no_upscale():max_bytes(150000):strip_icc()/219171-classic-meatloaf-DDMFS-4x3-6b2d4a8c103146c1856eb0bf135bbffe.jpg" height="188" width="282"/>
<noscript>
<img alt="close up view of a sliced meatloaf on a white platter" class="img--noscript card__img universal-image__image" height="188" src="https://www.allrecipes.com/thmb/EfIedrgpookiaFCI_MCUAmQOUTY=/282x188/filters

In [67]:
recipe[0].attrs

{'id': 'mntl-card-list-items_1-0',
 'class': ['comp',
  'mntl-card-list-items',
  'mntl-document-card',
  'mntl-card',
  'card',
  'card--no-image'],
 'data-doc-id': '6663943',
 'data-tax-levels': '',
 'href': 'https://www.allrecipes.com/recipe/219171/classic-meatloaf/',
 'data-ordinal': '1'}

In [68]:
recipe[0].attrs['href']

'https://www.allrecipes.com/recipe/219171/classic-meatloaf/'

# Adding `href` to our dataframe of recipes

Let's modify our `extract_recipes()` function so that, rather than returning just the names of the dishes, it returns a data frame with a `name` column and an `href` column (the link to each recipe's own page).

## `from_dict`

First, a useful tool: `pd.DataFrame.from_dict()` turns a dictionary into a data frame. Each key becomes a feature (column) and each value is a list holding that feature's values (one per row):

In [69]:
example_dict = {'col1': [1,2,3,4,5],
                'col2': [6,7,8,9,10],
                'col3': ['who', 'what', 'when', 'where', 'why']}
pd.DataFrame.from_dict(example_dict)

Unnamed: 0,col1,col2,col3
0,1,6,who
1,2,7,what
2,3,8,when
3,4,9,where
4,5,10,why


In [72]:
def extract_recipes(text):
    """ builds data frame of recipe names and links from allrecipes html
    
    Args:
        text (str): html response from allrecipes.com, see crawl_recipes()
        
    Returns:
        df_recipe (pd.DataFrame): dataframe of recipes
    """

    soup = BeautifulSoup(text)
    
    recipe_list = []
    for recipe in soup.find_all(class_ = 'card__title-text'):
        recipe = recipe.text
        recipe_list.append(recipe)

    href_list = []
    for recipe in soup.find_all('a', class_ = 'comp mntl-card-list-items mntl-document-card mntl-card card card--no-image'):
        recipe_link = recipe.attrs['href']
        href_list.append(recipe_link)

    recipe_dict = {'name': recipe_list, 
                  'href': href_list}
    df_recipe = pd.DataFrame.from_dict(recipe_dict)

    return df_recipe

In [73]:
extract_recipes(meatloaf_html)

|   | name                                       | href                                              |
|---|--------------------------------------------|---------------------------------------------------|
| 0 | Classic Meatloaf                           | https://www.allrecipes.com/recipe/219171/class... |
| 1 | Melt-In-Your-Mouth Meatloaf                | https://www.allrecipes.com/recipe/223381/melt-... |
| 2 | Best Meatloaf                              | https://www.allrecipes.com/recipe/74360/the-be... |
| 3 | Easy Meatloaf                              | https://www.allrecipes.com/recipe/16354/easy-m... |
| 4 | Creamy Mushroom Meatloaf                   | https://www.allrecipes.com/recipe/219963/cream... |
| 5 | Meatloaf that Doesn't Crumble              | https://www.allrecipes.com/recipe/79749/meatlo... |
| 6 | Meatloaf Cupcakes                          | https://www.allrecipes.com/recipe/236847/meatl... |
| 7 | Italian Style Turkey Meatloaf              | https://www.allrecipes.com/recipe/234508/itali... |
| 8 | Brown Sugar Meatloaf with Ketchup Glaze    | https://www.allrecipes.com/recipe/238904/brown... |
| 9 | Chris's Incredible Italian Turkey Meatloaf | https://www.allrecipes.com/recipe/31843/chriss... |
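The function above collects names and links in two separate passes, so the two lists only line up if every card has both a title and a link. One possible alternative, sketched below on a small made-up HTML snippet (the markup is a simplified assumption, not the real allrecipes page), is to loop over each card once and read both values from the same tag:

```python
from bs4 import BeautifulSoup

# simplified, made-up html that mimics the card structure
html_fake = """
<a class="mntl-card card" href="https://example.com/recipe/1/">
  <span class="card__title-text">Dish One</span></a>
<a class="mntl-card card" href="https://example.com/recipe/2/">
  <span class="card__title-text">Dish Two</span></a>
<a class="mntl-card card">
  <span class="card__title-text">Card Without Link</span></a>
"""

soup_fake = BeautifulSoup(html_fake, 'html.parser')

row_list = []
for card in soup_fake.find_all('a', class_='mntl-card'):
    title = card.find(class_='card__title-text')
    href = card.attrs.get('href')
    # skip cards which are missing either piece
    if title is None or href is None:
        continue
    row_list.append({'name': title.text.strip(), 'href': href})

print(row_list)
```

Because each name is read from the same card as its link, a card with a missing piece is skipped rather than shifting every later row out of alignment.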


## String Manipulations
- `.split()` & `.join()`
- `.strip()`
- `.replace()`
- `.upper()` & `.lower()`

Visiting [a specific recipe's page](https://www.allrecipes.com/recipe/219171/classic-meatloaf/) gives us its contents as one long string. The methods above help us pull the pieces we want out of that string.

In [74]:
# .strip() removes all leading and trailing whitespace (spaces, tabs, newlines)
'\n\n\n hello!  \n hello! \n \n    \n \n'.strip()

'hello!  \n hello!'

In [75]:
'cheese fondue'.replace(' ', '+')

'cheese+fondue'

In [76]:
'Hello Fred'.replace('Fred', 'George')

'Hello George'

In [77]:
'lets forget about it, okay?'.replace(' it', '')

'lets forget about, okay?'

In [78]:
'dont shout'.upper()

'DONT SHOUT'

In [79]:
'BE Quiet'.lower()

'be quiet'

In [80]:
# .split() breaks a string into a list at every occurrence of the given separator
'fat: 54 g, calories: 430 cal, sugar: 10g'.split(',')

['fat: 54 g', ' calories: 430 cal', ' sugar: 10g']

In [81]:
# .join() glues a list of strings together, using the string it is called on as the glue
''.join(['a', 'b', 'c', 'd'])

'abcd'

In [82]:
'<glue>'.join(['a', 'b', 'c', 'd'])

'a<glue>b<glue>c<glue>d'

In [83]:
namelist = 'last0, first0, last1, first1, last2, first2'

In [86]:
name = namelist.split(",")
','.join(name[:2])

'last0, first0'

In [88]:
','.join(name[2:4]).strip()

'last1, first1'
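These methods are most useful in combination. As a small sketch, the nutrition string from the `.split()` example above can be turned into a dictionary by splitting on commas, splitting each piece on the colon, and stripping the leftover spaces:

```python
raw = 'fat: 54 g, calories: 430 cal, sugar: 10g'

nutr_str_dict = {}
for item in raw.split(','):
    # each item looks like ' calories: 430 cal'
    key, val = item.split(':')
    nutr_str_dict[key.strip()] = val.strip()

print(nutr_str_dict)
# {'fat': '54 g', 'calories': '430 cal', 'sugar': '10g'}
```

Note that the values are still strings at this point; the units have not been separated from the numbers.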

In [67]:
# visit specific recipe's page
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
html = requests.get(url).text
soup = BeautifulSoup(html)

## Exercise
Write two functions, `extract_prep_info()` and `extract_nutrition()`, which each accept the url of a particular recipe (see examples above) and return a dictionary of the prep information or nutritional information, respectively. For example:

```python
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
extract_prep_info(url)
extract_nutrition(url)

```

yields:

```python
prep_info_dict = {'Prep Time': '10 mins',
                  'Cook Time': '15 mins',
                  'Total Time': '25 mins',
                  'Servings': '10',
                  'Yield': '10 servings'}

```

and

```python
nutr_info_dict = {'Total Fat': '14g',
                  'Saturated Fat': '9g',
                  'Cholesterol': '46mg',
                  'Sodium': '179mg',
                  'Total Carbohydrate': '3g',
                  'Total Sugars': '1g',
                  'Protein': '13g',
                  'Vitamin C': '0mg',
                  'Calcium': '461mg',
                  'Iron': '0mg',
                  'Potassium': '67mg'}

```

In [10]:
def extract_prep_info(url):
    """ returns a dictionary of recipe preparation info 
    
    Args:
        url (str): url of a recipe page on allrecipes.com
        
    Returns:
        prep_info_dict (dict): keys are features ('Prep Time'), 
            vals are str that describe feature ('10 mins')
    """
    pass

In [11]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of a recipe page on allrecipes.com
        
    Returns:
        nutr_dict (dict): keys are nutrient names ('Total Fat'), 
            vals are str of quantity ('14g')
    """
    pass

In [12]:
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'

### Grabbing numeric values (float/int) from messy strings

- We have strings which describe recipe nutrition info (`'100 mg'`)
- We want numeric data types (`float, int`) so that we can plot and operate on these values
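One way to do this, sketched below, is to search each string for its first number with a regular expression and convert that to a `float`. The helper name `str_to_float()` is hypothetical (it does not appear elsewhere in these notes), and the sketch assumes plain non-negative numbers without thousands separators:

```python
import re

def str_to_float(s):
    """ pulls the first number out of a string (hypothetical helper)

    Args:
        s (str): string such as '100 mg' or '14g'

    Returns:
        val (float): first number found, or None if there is none
    """
    # one or more digits, optionally followed by a decimal part
    match = re.search(r'\d+(?:\.\d+)?', s)
    if match is None:
        return None
    return float(match.group())

nutr_str_dict = {'Total Fat': '14g', 'Sodium': '179 mg', 'Protein': '1.5 g'}
nutr_dict = {feat: str_to_float(val) for feat, val in nutr_str_dict.items()}
print(nutr_dict)
# {'Total Fat': 14.0, 'Sodium': 179.0, 'Protein': 1.5}
```

The units are discarded here, so this only makes sense when every value of a given feature uses the same unit.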

## Rest of Class (Go slowly; if we don't finish, we can continue next week)
Complete the `extract_nutrition()` function below such that:

```python
# get / extract a data frame of recipes (only name and href)
str_query = 'boston cream pie'
html_str = crawl_recipes(str_query)
df_recipe = extract_recipes(html_str)

for row_idx in range(df_recipe.shape[0]):
    # get / extract nutrition info for a particular recipe
    recipe_url = df_recipe.loc[row_idx, 'href']
    nutr_dict = extract_nutrition(recipe_url)
    
    # add each new nutrition feature to the dataframe
    # only if there ARE nutrition features
    if len(nutr_dict) != 0:
        for nutr_feat, nutr_val in nutr_dict.items():
            df_recipe.loc[row_idx, nutr_feat] = nutr_val
    else:
        df_recipe = df_recipe.drop(row_idx, axis=0)

```

generates the `df_recipe`:

|    | name                           | href                                              | Total Fat | Saturated Fat | Cholesterol | Sodium | Total Carbohydrate | Dietary Fiber | Total Sugars | Protein | Vitamin C | Calcium | Iron | Potassium |
|----|--------------------------------|---------------------------------------------------|-----------|---------------|-------------|--------|--------------------|---------------|--------------|---------|-----------|---------|------|-----------|
| 0  | Chef John's Boston Cream Pie   | https://www.allrecipes.com/recipe/220942/chef-... | 41        | 17            | 199         | 514    | 72                 | 2             | 46           | 10      | 0         | 168     | 2    | 230       |
| 1  | Boston Cream Pie               | https://www.allrecipes.com/recipe/8138/boston-... | 13        | 6             | 61          | 230    | 47                 | 1             | 34           | 5       | 0         | 101     | 2    | 134       |
| 2  | Boston Cream Pie I             | https://www.allrecipes.com/recipe/8137/boston-... | 15        | 9             | 94          | 223    | 43                 | 1             | 26           | 5       | 0         | 97      | 2    | 95        |
| 3  | Semi-Homemade Boston Cream Pie | https://www.allrecipes.com/recipe/278930/semi-... | 41        | 16            | 219         | 568    | 79                 | 3             | 53           | 11      | 0         | 186     | 3    | 194       |
| 9  | Hot Milk Sponge Cake II        | https://www.allrecipes.com/recipe/8159/hot-mil... | 3         | 2             | 52          | 231    | 34                 | 0             | 20           | 4       | NaN       | 61      | 2    | 60        |
| 17 | Boston Cream Dessert Cups      | https://www.allrecipes.com/recipe/213446/bosto... | 15        | 7             | 44          | 237    | 32                 | 0             | 22           | 3       | 0         | 41      | 1    | 101       |
| 19 | Boston Creme Mini-Cupcakes     | https://www.allrecipes.com/recipe/220809/bosto... | 12        | 4             | 32          | 253    | 34                 | 0             | 24           | 3       | 0         | 62      | 1    | 100       |
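Two details of the table above follow from how the loop uses pandas: the gaps in the index (0, 1, 2, 3, 9, ...) come from `drop()`, which keeps the original row labels, and a `NaN` appears wherever a recipe did not report a feature that some other recipe did. A minimal sketch with made-up nutrition dictionaries (standing in for real calls to `extract_nutrition()`) reproduces both effects:

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['A', 'B', 'C']})

# made-up stand-ins for the output of extract_nutrition()
fake_nutr_list = [{'Total Fat': 41.0, 'Vitamin C': 0.0},
                  {},
                  {'Total Fat': 3.0}]

for row_idx in range(df_demo.shape[0]):
    nutr_dict = fake_nutr_list[row_idx]
    if len(nutr_dict) != 0:
        # assigning to a new column name creates that column,
        # filling every other row with NaN
        for nutr_feat, nutr_val in nutr_dict.items():
            df_demo.loc[row_idx, nutr_feat] = nutr_val
    else:
        # row labels of the remaining rows are unchanged
        df_demo = df_demo.drop(row_idx, axis=0)

print(df_demo)
```

Here row `1` is dropped because its dictionary is empty, and row `2` has `NaN` for `Vitamin C` because only row `0` reported it.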

In [109]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of a recipe page on allrecipes.com
        
    Returns:
        nutr_dict (dict): keys are nutrient names ('Total Fat'), 
            vals are floats of quantity ('24 g' becomes 24.0)
    """
    pass

In [13]:
url = 'https://www.allrecipes.com/recipe/220942/chef-johns-boston-cream-pie/'

## Putting it all together
- get dataframe of recipes, with name and href (done!)
- get dictionary of nutrition info per recipe (done!)
- aggregate info into one dataframe (see below)
- scatter plot (up next)

In [15]:
def get_df_recipe(str_query, recipe_limit=None):
    """ searches for recipes and returns dataframe, with nutrition info
    
    Args:
        str_query (str): search string
        recipe_limit (int): if passed, limits the number of recipes
            scraped (helpful to speed up nutrition scraping for teaching!)
        
    Returns:
        df_recipe (pd.DataFrame): dataframe, each row is recipe.
            includes columns href, name, and nutrition facts
    """    
    pass
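As a rough sketch of how the pieces could fit together (one possible design, not the only one), the version below builds one dictionary per recipe and converts the list to a data frame at the end. To keep it runnable without touching the network, `crawl_recipes()`, `extract_recipes()` and `extract_nutrition()` are replaced by fake stand-ins that return made-up data; in the real pipeline the versions from these notes would be used instead.

```python
import pandas as pd

# fake stand-ins (made-up data) for the real scraping functions
def crawl_recipes(str_query):
    return '<html>fake search results</html>'

def extract_recipes(html_str):
    return pd.DataFrame({'name': ['A', 'B', 'C'],
                         'href': ['url_a', 'url_b', 'url_c']})

def extract_nutrition(url):
    fake_nutr = {'url_a': {'Total Fat': 14.0},
                 'url_b': {},
                 'url_c': {'Total Fat': 3.0, 'Iron': 1.0}}
    return fake_nutr[url]

def get_df_recipe(str_query, recipe_limit=None):
    """ searches for recipes, returns dataframe with nutrition info """
    df_recipe = extract_recipes(crawl_recipes(str_query))

    # optionally keep only the first few recipes
    if recipe_limit is not None:
        df_recipe = df_recipe.iloc[:recipe_limit]

    row_list = []
    for _, row in df_recipe.iterrows():
        nutr_dict = extract_nutrition(row['href'])
        # skip recipes without any nutrition features
        if len(nutr_dict) == 0:
            continue
        row_list.append({'name': row['name'],
                         'href': row['href'],
                         **nutr_dict})

    return pd.DataFrame(row_list)

df_all = get_df_recipe('any query')
print(df_all)
```

Unlike the `drop()` approach, this version renumbers the rows from 0, since the data frame is built only from the recipes that were kept.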