# DS3000 Day 3

Sep 17, 2024

Admin
- Homework 1 is due Tuesday, Sep 17 by midnight
- Homework 2 will be posted then, due Oct 8 by midnight
      - Note: you have three weeks to do this, but **do not** put it off! The sooner you complete everything the better.
- Lab 1 scheduled for **next class**, please bring a **charged up** computer with the ability to edit jupyter notebooks.
- **Important**: Depending on how many in-class assignments we end up having, I will drop your lowest one or two scores :)

Push-Up Tracker
- Section 04: 2
- Section 08: 1

Content:
- OpenWeather API pipeline
- Intro to Web Scraping

# Data Pipeline: What is it?

A data pipeline is a collection of functions* that splits our data collection and processing into separate steps.

(*can be other structures too, but it may be easier to first understand each as a function)


# Why build a data pipeline?

- Allows pipeline to be run in parts (rather than the whole thing)
- Allows pipeline to be built by different programmers working on different parts in parallel
- Allows us to test each piece of our code separately
- Allows for modification / re-use of different sections

What we call a "Data Pipeline" here is a specific instance of "Factoring" a piece of software, splitting up its functionality into pieces.
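
As a minimal sketch of this idea, the toy pipeline below splits collection, cleaning, and summarizing into three functions. The function names and the hard-coded readings are hypothetical stand-ins (no API is called); the point is only that each stage can be run and tested on its own.

```python
def collect_readings():
    """ collects raw temperature readings (hard-coded stand-in for an API call)

    Returns:
        raw_list (list): temperature strings in Fahrenheit, possibly messy
    """
    return ['71.5', ' 68.0 ', 'n/a', '75.2']


def clean_readings(raw_list):
    """ converts raw strings to floats, dropping entries that are not numbers

    Args:
        raw_list (list): raw temperature strings

    Returns:
        temp_list (list): temperatures as floats
    """
    temp_list = []
    for raw in raw_list:
        try:
            temp_list.append(float(raw.strip()))
        except ValueError:
            # skip anything that can't be parsed as a number
            continue
    return temp_list


def summarize(temp_list):
    """ computes the min, max and mean of a list of temperatures

    Args:
        temp_list (list): temperatures as floats

    Returns:
        summary (dict): keys 'min', 'max', 'mean'
    """
    return {'min': min(temp_list),
            'max': max(temp_list),
            'mean': sum(temp_list) / len(temp_list)}


# the script: each stage's output feeds the next stage
raw_list = collect_readings()
temp_list = clean_readings(raw_list)
summary = summarize(temp_list)
print(summary)
```

Because `clean_readings` only needs a list of strings, it can be tested with any hand-made list, without running the collection step at all.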
    


# OpenWeather API Pipeline Activity

OpenWeather API offers a few different queries (see [here](https://openweathermap.org/api) for details):
- One Call API (which we have access to)
- Solar Radiation API
- etc.


**Goal:**

Build a library of functions which can be pieced together to support the collection, cleaning and display of features from OpenWeather into a scatter plot of two features.

### Let's design one together:

(think: input/outputs -> handwritten notes)

# Plan out a pipeline

Write a few 'empty' functions including little more than the docstring:

```python
def some_fnc(a_string, a_list):
    """ processes a string and a list (somehow)
    
    Args:
        a_string (str): an input string which ...
        a_list (list): a list which describes ...
        
    Returns:
        output (dict): the output dict which is ...
    """
    pass
```

and a script which uses them:

```python
# inputs (not necessarily complete)
lat = 42
lon = -71  # west longitudes are negative

some_output = some_fnc(lat, lon)
some_other_output = some_other_fnc(some_output)

```

which would, if the functions worked, produce a graph like this (note: this is from an earlier semester; our graph will look different):

<img src="https://i.ibb.co/Ct0JtRJ/newplot-1.png" width="500">

**NOTE:** we haven't talked about creating plots yet, but we will next week! For now, I will provide everything you need in the examples.

# What might these empty functions look like?

In [1]:
def openweather_onecall(latlon_tuple, api_key, units='imperial'):
    """ returns weather data from one location via onecall
    
    https://openweathermap.org/api/one-call-api 
    
    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude            
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'
        
    Returns:
        weather_dict (dict): a nested dictionary (tree) which
            contains weather data
    """
    pass
    
def get_clean_df_daily(daily_dict_list):
    """ formats daily_dict to a pandas dataframe
    
    see https://openweathermap.org/api/one-call-api for
    full daily_dict specification
    
    Args:
        daily_dict_list (list): list of dictionaries of daily
            weather features
            
    Returns:
        df_daily (pd.DataFrame): each row is weather from one
            day
    """
    pass
    
def scatter_plotly(df, feat_x, feat_y, f_html='scatter.html'):
    """ creates a plotly scatter plot, exports as html 
    
    Args:
        df (pd.DataFrame): pandas dataframe
        feat_x (str): x axis of scatter
        feat_y (str): y axis of scatter
        f_html (str): output html file
        
    Returns:
        f_html (str): output html file
    """ 
    pass

When the pipeline above is complete, the following script should plot a daily max temp scatter for Boston:

In [2]:
# this code won't work because the functions above are all empty
# inputs
feat_x = 'date'
feat_y = 'temp_max'
latlon_tuple = 42.3601, -71.0589
units = 'imperial'
api_key = 'd36fa352ac73226b30772f64675f41bb'

# get data
weather_dict = openweather_onecall(latlon_tuple, 
                                   units=units,
                                   api_key=api_key)

# clean weather dict (make dataframe from dict, process timestamps etc)
df_daily = get_clean_df_daily(weather_dict['daily'])

# make scatter
f_html = scatter_plotly(df_daily, feat_x=feat_x, feat_y=feat_y)

TypeError: 'NoneType' object is not subscriptable
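
The error arises because a function whose body is only `pass` returns `None`, so `weather_dict['daily']` tries to index `None`. A minimal sketch that reproduces it, using a hypothetical `empty_fnc` in place of the real pipeline functions:

```python
def empty_fnc():
    """ an 'empty' function: only a docstring and pass """
    pass


result = empty_fnc()

# a function without a return statement returns None
print(result is None)

try:
    result['daily']
except TypeError as err:
    # same kind of error as in the script above
    error_message = str(err)
    print(error_message)
```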

# Let's go **SLOWLY** through this solution

In [3]:
!pip3 install plotly --break-system-packages
# packages for today
# suppresses future warnings (careful here; only use on your own if you know what you're doing)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# for calling API and cleaning data
import requests
import json
from datetime import datetime
# pandas (for data frames) and plotly (for plotting)
import pandas as pd
import plotly
import plotly.express as px

Collecting plotly
  Downloading plotly-5.24.1-py3-none-any.whl.metadata (7.3 kB)
Collecting tenacity>=6.2.0 (from plotly)
  Downloading tenacity-9.0.0-py3-none-any.whl.metadata (1.2 kB)
Downloading plotly-5.24.1-py3-none-any.whl (19.1 MB)
Downloading tenacity-9.0.0-py3-none-any.whl (28 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.24.1 tenacity-9.0.0


In [4]:
def openweather_onecall(latlon_tuple, api_key, units='imperial'):
    """ returns weather data from one location via onecall
    
    https://openweathermap.org/api/one-call-api 
    
    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'
        
    Returns:
        weather_dict (dict): a nested dictionary (tree) which
            contains weather data
    """
    # build url
    lat, lon = latlon_tuple
    url = f'https://api.openweathermap.org/data/3.0/onecall?lat={lat}&lon={lon}&appid={api_key}&units={units}'
    
    # get url as a string
    url_text = requests.get(url).text
    
    # convert json to a nested dict
    weather_dict = json.loads(url_text)

    # another, perhaps cleaner option
    # weather_dict = requests.get(url).json()
    
    return weather_dict

def get_clean_df_daily(daily_dict_list):
    """ formats daily_dict to a pandas dataframe
    
    see https://openweathermap.org/api/one-call-api for
    full daily_dict specification
    
    Args:
        daily_dict_list (list): list of dictionaries of daily
            weather features
            
    Returns:
        df_daily (pd.DataFrame): each row is weather from one
            day
    """
    # format to dataframe
    df_weather = pd.DataFrame()
    for daily_dict in daily_dict_list:
        daily_series = pd.Series(dtype='object')

        # build datetime data (.fromtimestamp() assumes local time zone)
        daily_series['date'] = datetime.fromtimestamp(daily_dict['dt'])
        daily_series['sunrise'] = datetime.fromtimestamp(daily_dict['sunrise'])
        daily_series['sunset'] = datetime.fromtimestamp(daily_dict['sunset'])


        # build temp data
        temp_dict = daily_dict['temp']
        for temp_feat, temp in temp_dict.items():
            daily_series[f'temp_{temp_feat}'] = temp

        # build prob of precipitation
        # NOTE: I did confirm that the rain column appears only if there is rain forecasted in the next 48 hours
        daily_series['pop'] = daily_dict['pop']
                
        # collect row in df_weather
        df_weather = pd.concat([df_weather, daily_series.to_frame().T])
    
    return df_weather     

def scatter_plotly(df, feat_x, feat_y, f_html='scatter.html'):
    """ creates a plotly scatter plot, exports as html 
    
    Args:
        df (pd.DataFrame): pandas dataframe
        feat_x (str): x axis of scatter
        feat_y (str): y axis of scatter
        f_html (str): output html file
        
    Returns:
        f_html (str): output html file
    """
    # create scatter plot
    fig = px.scatter(df, x=feat_x, y=feat_y)

    
    # export scatter to html
    plotly.offline.plot(fig, filename=f_html)
    
    return f_html
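
The f-string in `openweather_onecall` works for these inputs. As an alternative sketch, the query string can be assembled from a dictionary with the standard library's `urllib.parse.urlencode`, which also escapes any special characters. The helper name `build_onecall_url` and the key `'MY_KEY'` are placeholders, not part of the lecture code; `requests.get(url, params=...)` accepts a similar dictionary directly.

```python
from urllib.parse import urlencode


def build_onecall_url(latlon_tuple, api_key, units='imperial'):
    """ builds the One Call query url (hypothetical helper)

    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'

    Returns:
        url (str): full query url
    """
    lat, lon = latlon_tuple
    base_url = 'https://api.openweathermap.org/data/3.0/onecall'

    # one dictionary entry per query parameter
    param_dict = {'lat': lat, 'lon': lon, 'appid': api_key, 'units': units}

    return f'{base_url}?{urlencode(param_dict)}'


url = build_onecall_url((42.3601, -71.0589), api_key='MY_KEY')
print(url)
```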

In [5]:
# inputs
feat_x = 'date'
feat_y = 'temp_max'
latlon_tuple = 42.3601, -71.0589
units = 'imperial'
api_key = 'cf758020c3c57082bbfd8b62d88ca683'

# get data
weather_dict = openweather_onecall(latlon_tuple, 
                                   units=units,
                                   api_key=api_key)

In [6]:
# clean weather dict (make dataframe from dict, process timestamps etc)
df_daily = get_clean_df_daily(weather_dict['daily'])
df_daily

Unnamed: 0,date,sunrise,sunset,temp_day,temp_min,temp_max,temp_night,temp_eve,temp_morn,pop
0,2024-09-16 12:00:00,2024-09-16 06:25:45,2024-09-16 18:52:17,79.84,56.19,80.83,63.16,78.13,56.19,0.0
0,2024-09-17 12:00:00,2024-09-17 06:26:48,2024-09-17 18:50:30,81.3,58.84,84.87,66.06,76.03,58.84,0.0
0,2024-09-18 12:00:00,2024-09-18 06:27:51,2024-09-18 18:48:43,77.95,60.82,78.08,66.45,69.71,60.82,0.0
0,2024-09-19 12:00:00,2024-09-19 06:28:54,2024-09-19 18:46:55,63.55,61.25,66.45,61.92,64.94,63.14,1.0
0,2024-09-20 12:00:00,2024-09-20 06:29:57,2024-09-20 18:45:08,58.33,56.73,62.29,57.7,57.33,62.08,1.0
0,2024-09-21 12:00:00,2024-09-21 06:31:00,2024-09-21 18:43:21,63.07,56.28,63.54,58.21,61.43,56.28,0.26
0,2024-09-22 12:00:00,2024-09-22 06:32:03,2024-09-22 18:41:34,60.84,54.07,61.61,54.07,59.14,55.76,0.29
0,2024-09-23 12:00:00,2024-09-23 06:33:06,2024-09-23 18:39:47,61.16,50.45,63.52,54.82,60.42,50.45,0.0
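
Two details of the table above are worth a note: every row has index 0 (each one-row frame brings its own index into `pd.concat`), and the dates are in whatever time zone the computer running the notebook uses. A sketch of one alternative, assuming it is acceptable to report times in UTC: flatten each daily dictionary into a plain dict with time-zone-aware datetimes, then build the dataframe once from the list of rows. The helper `flatten_daily` is hypothetical, and the sample dictionary is an illustrative stand-in shaped like one `'daily'` entry.

```python
from datetime import datetime, timezone


def flatten_daily(daily_dict, tz=timezone.utc):
    """ flattens one day's nested weather dict (hypothetical helper)

    Args:
        daily_dict (dict): one element of weather_dict['daily']
        tz (datetime.tzinfo): time zone for the datetime columns

    Returns:
        row_dict (dict): flat dictionary, one key per column
    """
    row_dict = {}

    # timestamps are seconds since 1970-01-01 UTC; passing tz makes the
    # result independent of the computer's local time zone
    for feat in ('dt', 'sunrise', 'sunset'):
        key = 'date' if feat == 'dt' else feat
        row_dict[key] = datetime.fromtimestamp(daily_dict[feat], tz=tz)

    # nested temperature features become temp_day, temp_min, ...
    for temp_feat, temp in daily_dict['temp'].items():
        row_dict[f'temp_{temp_feat}'] = temp

    row_dict['pop'] = daily_dict['pop']
    return row_dict


# illustrative sample in the shape of one 'daily' entry
sample_daily = {'dt': 1726502400, 'sunrise': 1726482345, 'sunset': 1726527137,
                'temp': {'day': 79.84, 'min': 56.19, 'max': 80.83},
                'pop': 0.0}

row_dict = flatten_daily(sample_daily)
print(row_dict['date'].isoformat())
```

With pandas available, `pd.DataFrame([flatten_daily(d) for d in daily_dict_list])` would build the whole table in one step, with index 0, 1, 2, ...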


In [7]:
# make scatter
f_html = scatter_plotly(df_daily, feat_x=feat_x, feat_y=feat_y)

## Web Scraping

![i](https://crawlbase.com/blog/best-data-memes/web-scraping-memes.jpg)
* Using programs or scripts that pretend to browse websites, examine their content, and retrieve and extract data from them
* Why scrape?
    * if an API is available for a service, we will nearly always prefer the API to scraping
    * ... but not all services have APIs or the available APIs are too expensive for our project (for example Twitter)
  ![h1](https://pbs.twimg.com/media/Fn84g5DXwAAvLOB.jpg)
  ![h2](https://i.kym-cdn.com/photos/images/original/002/525/430/ee4)
    * newly published information might not yet be available through ready datasets
* Downsides of scraping:
    * no reference documentation (unlike APIs)
    * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper)
    * if it violates the terms of service it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)
    * legal and moral greyzone (even if the ToS does not disallow it, somebody has to pay for the traffic and when you're scraping you're not looking at ads)
    * ... but everybody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)
* Web scraping pipeline:
    * because the webpages might change their structure it's extra important to keep the crawling/extraction step separate from transformations and loading
    * ETL (Extract-Transform-Load), preceded by a crawl step:
        * **Crawl**: open a given URL using requests and get the HTML source
        * **Extract**: pull the interesting content out of the webpage's source
        * **Transform**: our usual unit conversions, etc.
        * **Load**: store the data in a form that is easy to keep and analyze
    * **Pro tip**: it's usually a good idea to not only store the transformed data, but also the raw HTML source - because the webpages might change and we might be late to realize we're not extracting right. If we have the original HTML source we can go back to it
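
A minimal sketch of the pro tip, assuming a local folder is an acceptable place to keep snapshots. The helper names `save_raw_html` and `load_raw_html`, the default folder name, and the file-naming scheme are all illustrative choices; the HTML string would normally come from the crawl step (e.g. `requests.get(url).text`).

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_raw_html(str_html, name, out_dir='raw_html'):
    """ stores a raw html snapshot with a timestamp in its file name

    Args:
        str_html (str): raw html source, as returned by the crawl step
        name (str): short label for the page (used in the file name)
        out_dir (str): folder where snapshots are kept

    Returns:
        f_raw (Path): path of the file that was written
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # the UTC timestamp records when the page was crawled
    stamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    f_raw = out_path / f'{name}_{stamp}.html'
    f_raw.write_text(str_html, encoding='utf-8')
    return f_raw


def load_raw_html(f_raw):
    """ reads a stored snapshot back so extraction can be re-run on it """
    return Path(f_raw).read_text(encoding='utf-8')


# usage: store the source once, re-extract from the file as often as needed
# (a temporary folder keeps this demo from cluttering your project)
str_html = '<html><body><h1>Heading 1</h1></body></html>'
f_raw = save_raw_html(str_html, name='example', out_dir=tempfile.mkdtemp())
print(load_raw_html(f_raw) == str_html)
```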
    

## Best case scenario
Some webpages publish their data as simple HTML tables. In these (rare) cases we can scrape the data directly with pandas' `read_html`:

https://www.espn.com/nba/team/stats/_/name/bos

In [9]:
!pip3 install lxml --break-system-packages
import pandas as pd
# read html extracts all the <table> elements from html and returns a list of DataFrames created from them
tables = pd.read_html('https://www.espn.com/nba/team/stats/_/name/bos')
len(tables)

Collecting lxml
  Downloading lxml-5.3.0-cp312-cp312-macosx_10_9_universal2.whl.metadata (3.8 kB)
Downloading lxml-5.3.0-cp312-cp312-macosx_10_9_universal2.whl (8.2 MB)
Installing collected packages: lxml
Successfully installed lxml-5.3.0


4

In [10]:
tables[0]
# tables[1]
# tables[2]
# tables[3]

Unnamed: 0,Name
0,Jayson Tatum SF
1,Jaylen Brown SG
2,Kristaps Porzingis C
3,Derrick White PG
4,Jrue Holiday PG
5,Payton Pritchard PG
6,Sam Hauser SF
7,Al Horford C
8,Neemias Queta C
9,Luke Kornet C


In [11]:
# "glue" dataframes together (more to come on this later in the semester)
player_stats1 = pd.concat(tables[:2], axis=1)
player_stats1

Unnamed: 0,Name,GP,GS,MIN,PTS,OR,DR,REB,AST,STL,BLK,TO,PF,AST/TO
0,Jayson Tatum SF,74,74.0,35.7,26.9,0.9,7.2,8.1,4.9,1.0,0.6,2.5,2.0,1.9
1,Jaylen Brown SG,70,70.0,33.5,23.0,1.2,4.3,5.5,3.6,1.2,0.5,2.4,2.6,1.5
2,Kristaps Porzingis C,57,57.0,29.6,20.1,1.7,5.5,7.2,2.0,0.7,1.9,1.6,2.7,1.3
3,Derrick White PG,73,73.0,32.6,15.2,0.7,3.5,4.2,5.2,1.0,1.2,1.5,2.1,3.4
4,Jrue Holiday PG,69,69.0,32.8,12.5,1.2,4.2,5.4,4.8,0.9,0.8,1.8,1.6,2.7
5,Payton Pritchard PG,82,5.0,22.3,9.6,0.9,2.4,3.2,3.4,0.5,0.1,0.7,1.3,4.6
6,Sam Hauser SF,79,13.0,22.0,9.0,0.6,2.9,3.5,1.0,0.5,0.3,0.4,1.3,2.6
7,Al Horford C,65,33.0,26.8,8.6,1.3,5.1,6.4,2.6,0.6,1.0,0.7,1.4,3.5
8,Neemias Queta C,28,0.0,11.9,5.5,1.9,2.5,4.4,0.7,0.5,0.8,0.5,1.8,1.5
9,Luke Kornet C,63,7.0,15.6,5.3,1.9,2.3,4.1,1.1,0.4,1.0,0.3,1.2,3.2


In [11]:
# include the more advanced stats
player_stats2 = pd.concat([player_stats1, tables[3]], axis=1)
player_stats2

Unnamed: 0,Name,GP,GS,MIN,PTS,OR,DR,REB,AST,STL,...,3PA,3P%,FTM,FTA,FT%,2PM,2PA,2P%,SC-EFF,SH-EFF
0,Jayson Tatum SF,74,74.0,35.7,26.9,0.9,7.2,8.1,4.9,1.0,...,8.2,37.6,5.6,6.7,83.3,6.0,11.0,54.2,1.393,0.55
1,Jaylen Brown SG,70,70.0,33.5,23.0,1.2,4.3,5.5,3.6,1.2,...,5.9,35.4,3.0,4.3,70.3,6.9,12.1,57.0,1.282,0.56
2,Kristaps Porzingis C,57,57.0,29.6,20.1,1.7,5.5,7.2,2.0,0.7,...,5.1,37.5,4.5,5.3,85.8,4.9,8.1,60.6,1.523,0.59
3,Derrick White PG,73,73.0,32.6,15.2,0.7,3.5,4.2,5.2,1.0,...,6.8,39.6,1.9,2.1,90.1,2.6,4.7,55.5,1.319,0.58
4,Jrue Holiday PG,69,69.0,32.8,12.5,1.2,4.2,5.4,4.8,0.9,...,4.7,42.9,0.9,1.0,83.3,2.8,5.3,52.6,1.248,0.58
5,Payton Pritchard PG,82,5.0,22.3,9.6,0.9,2.4,3.2,3.4,0.5,...,4.7,38.5,0.6,0.7,82.1,1.8,3.1,59.3,1.239,0.58
6,Sam Hauser SF,79,13.0,22.0,9.0,0.6,2.9,3.5,1.0,0.5,...,5.9,42.4,0.2,0.2,89.5,0.7,1.2,55.9,1.276,0.62
7,Al Horford C,65,33.0,26.8,8.6,1.3,5.1,6.4,2.6,0.6,...,4.0,41.9,0.4,0.5,86.7,1.6,2.5,65.8,1.341,0.64
8,Neemias Queta C,28,0.0,11.9,5.5,1.9,2.5,4.4,0.7,0.5,...,0.0,0.0,0.7,1.0,71.4,2.4,3.7,64.4,1.481,0.64
9,Luke Kornet C,63,7.0,15.6,5.3,1.9,2.3,4.1,1.1,0.4,...,0.0,100.0,0.8,0.9,90.7,2.2,3.2,69.8,1.645,0.7


In [12]:
# baseball instead of basketball?
base_tables = pd.read_html('https://www.baseball-reference.com/teams/BOS/2024.shtml')
len(base_tables)

10

In [14]:
base_tables[8]
# base_tables[8]

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Connor Wong,28,116,450,414,53,119,23,...,.340,.442,.782,115,183,7,9,0,2,1
1,2,1B,Dominic Smith*,29,84,278,249,29,59,20,...,.317,.390,.706,95,97,7,4,0,0,0
2,3,2B,Enmanuel Valdez*,25,70,210,188,22,41,12,...,.278,.378,.655,79,71,4,0,1,4,0
3,4,SS,Ceddanne Rafaela,23,143,539,513,68,128,21,...,.277,.398,.675,84,204,7,6,2,3,1
4,5,3B,Rafael Devers*,27,134,585,510,87,142,34,...,.361,.529,.890,143,270,12,3,0,6,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,55,P,Luis Guerrero,23,0,0,,,,,...,,,,,,,,,,
57,56,P,Cooper Criswell,27,0,0,,,,,...,,,,,,,,,,
58,,,Team Totals,27.4,150,5749,5183,710,1323,291,...,.321,.430,.751,106,2230,99,67,6,37,26
59,,,Rank in 15 AL teams,,,,1,3,2,1,...,3,3,2,,2,,7,,8,


## Messy Data

Notice that the baseball data are quite a bit messier than the basketball data: there are empty cells, asterisks attached to some names, and trailing rows for team totals and ranks. In web scraping, you are beholden to the format of the website's HTML and will almost certainly have to clean the data (sometimes extensively) after scraping it.
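
As a small taste of such cleaning, the sketch below splits the basketball `Name` entries (e.g. `'Jayson Tatum SF'`) into a name and a position, and strips the trailing `*` seen on some baseball names. Both helper functions are hypothetical, and what the `*` means on the source site is not examined here; the code only removes it.

```python
def split_name_position(name_pos):
    """ splits 'Jayson Tatum SF' into ('Jayson Tatum', 'SF')

    assumes the position is the last whitespace-separated token

    Args:
        name_pos (str): player name followed by position

    Returns:
        name_tuple (tuple): (name, position)
    """
    name, position = name_pos.rsplit(maxsplit=1)
    return name, position


def strip_marker(name, marker='*'):
    """ removes a trailing marker character, e.g. 'Rafael Devers*' """
    return name.rstrip(marker).strip()


print(split_name_position('Jayson Tatum SF'))
print(strip_marker('Rafael Devers*'))
```

With a dataframe, the same functions could be applied to a whole column via `df['Name'].apply(...)`.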

## Basic HTML
Web pages are written in HTML. The source of https://sapiezynski.com/ds3000/scraping/01.html looks like this:

```html
<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>
```
The keywords in `<>` brackets are called tags. They open with `<tag>` and close with `</tag>`.

In [15]:
## Getting the html content in Python
import requests

response = requests.get('https://sapiezynski.com/ds3000/scraping/01.html')
print(response.text)

<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>




In [16]:
# sometimes this doesn't quite work the way you want (c'est la vie with web scraping)
response2 = requests.get('https://www.nytimes.com/2019/03/10/style/what-is-tik-tok.html')
print(response2.text)

<html><head><title>nytimes.com</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMA10mIHjJ1U0AAmyGCHw==','hsh':'499AE34129FA4E4FABC31582C3075D','t':'bv','s':17439,'e':'6d1cea580718732d3586aff3d3147cc22fc708710fe0a29f90350366f82fd8c7','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>


## Using headers
If you've scraped the web before, you may know that blocked requests can sometimes be circumvented by defining `headers` to pass to the `requests` functions (be they `.get()` or `.post()`). Below is an (unnecessary) example:

In [17]:
# sometimes you can get around it though, if you're a bit clever and know how to use headers
# Example (one of the most common headers is the 'user-agent', but in this case it's not really necessary; as you'll see shortly)
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'}

url = 'https://www.allrecipes.com/search?q=milk+tea'
response3 = requests.get(url, headers = headers)
print(response3.text)


<!DOCTYPE html>
<html id="searchTemplate_1-0" class="comp searchTemplate static-html html mntl-html no-js " data-ab="99,99,99,99,99,99,99,99,77,99,99,99,99,99,99,68,99,99,81,58,78,99,64,62" data-mm-transactional-resource-version="1.13.5" data-mm-ads-resource-version="1.2.120" data-mm-video-resource-version="1.4.5" data-mm-myrecipes-resource-version="1.3.18" data-mantle-resource-version="4.0.596" data-mm-digital-issues-resource-version="1.18.5" lang="en" data-tracking-container="true" data-resource-version="2.104.0" data-allrecipes-resource-version="2.104.0" data-mm-recipes-resource-version="1.1.6"><!--
<globe-environment environment="k8s-prod" application="allrecipes" dataCenter="us-east-1"/>
-->
<head class="loc head">
<link rel="preconnect" href="//js-sec.indexww.com">
<link rel="preconnect" href="//c.amazon-adsystem.com">
<link rel="preconnect" href="//securepubads.g.doubleclick.net">
<link rel="dnsprefetch" href="//www.google-analytics.com">
<meta charset="utf-8">
<meta http-equiv=

In [None]:
# Now there can be cases where the headers are more complicated! Let's look at an example:

headers = {'Host': 'www.yelp.com',
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0',
	'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
	'Accept-Language': 'en-US,en;q=0.5',
	'Accept-Encoding': 'gzip, deflate, br,zstd',
	'Connection': 'keep-alive',
	'Cookie': 'wdi=2|9C05DBA58C8B45A8|0x1.9779231762aedp+30|b25cd5806fb92a56; location=%7B%22city%22%3A+%22Anchorage%22%2C+%22state%22%3A+%22AK%22%2C+%22country%22%3A+%22US%22%2C+%22latitude%22%3A+61.217799%2C+%22longitude%22%3A+-149.898302%2C+%22max_latitude%22%3A+61.2312356%2C+%22min_latitude%22%3A+61.1229806%2C+%22max_longitude%22%3A+-149.775511%2C+%22min_longitude%22%3A+-149.964789%2C+%22zip%22%3A+%22%22%2C+%22address1%22%3A+%22%22%2C+%22address2%22%3A+%22%22%2C+%22address3%22%3A+%22%22%2C+%22neighborhood%22%3A+%22%22%2C+%22borough%22%3A+%22%22%2C+%22provenance%22%3A+%22YELP_GEOCODING_ENGINE%22%2C+%22display%22%3A+%22Anchorage%2C+AK%22%2C+%22unformatted%22%3A+%22Anchorage%2C+AK%22%2C+%22isGoogleHood%22%3A+false%2C+%22usingDefaultZip%22%3A+false%2C+%22accuracy%22%3A+4%2C+%22language%22%3A+null%7D; hl=en_US; spid.d161=5d651ccc-3f8d-4e36-8362-b31965525d02.1709066439.11.1726590790.1724866333.539af496-f766-4c85-a0ba-417bda3ed672.5bf406f6-5923-483d-9359-995065e47b38.0c69bf7e-8153-4ede-b4c7-81a267821d05.1726590785366.2; _ga=GA1.2.9C05DBA58C8B45A8; _ga_K9Z2ZEVC8C=GS1.2.1726590785.14.1.1726590790.0.0.0; xcj=1|p6I8IYDgANRZXHGr0z2T8CbcJigEZxT2B5PgDvA8e8M; _gcl_au=1.1.362654120.1722453424; _scid=276aae79-0de4-411a-a880-6ccdf3622a89; _tt_enable_cookie=1; _ttp=Dc9RTEWvApeN-syNRhqab5lU74R; _sctr=1%7C1726545600000; g_state={"i_p":1723476120544,"i_l":3}; datadome=JnJzrWr81e_folBGLX1xU65izejNA1N_aE9xErbZU5TVTz3zikGDTxL9TYvYR29CoiqQxLv5KnPr0KMldyP2_NS6nBxkHAoLo9JSgNtF38H6EGC9om6VOdoA0nkbA8Zo; _ga_FFVGS59Y99=GS1.1.1722878716.1.0.1722878722.0.0.0; _ga_QGF19QKM8N=GS1.2.1722878716.1.0.1722878716.0.0.0; bse=172c1b05c94f4a50807e82dcdd82e468; bsi=1%7C80aaaf68-2aa9-5b05-9298-fe2c2a55bcd1%7C1726590784533%7C1726590784533%7C1%7C7f29b674117a02f6; spses.d161=*; 
OptanonConsent=isGpcEnabled=0&datestamp=Tue+Sep+17+2024+12%3A33%3A14+GMT-0400+(Eastern+Daylight+Time)&version=202403.1.0&browserGpcFlag=0&isIABGlobal=false&identifierType=Cookie+Unique+Id&hosts=&consentId=fb8c4385-4db2-4fb3-b228-f5397d3e048c&interactionCount=1&isAnonUser=1&landingPath=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fhilton-garden-inn-anchorage-anchorage&groups=BG122%3A1%2CC0003%3A1%2CC0002%3A1%2CC0001%3A1%2CC0004%3A1; _scid_r=HfQnaq55DeQWGuOAbM3zYiqJ4-S7i_vUJ7UAoQ; _uetsid=874dd7a0751211efacabdfbd4eb792f7; _uetvid=789c98e04f7111ef989b33b398d6482b',
	'Upgrade-Insecure-Requests': '1',
	'Sec-Fetch-Dest': 'document',
	'Sec-Fetch-Mode': 'navigate',
	'Sec-Fetch-Site': 'same-origin',
	'Sec-Fetch-User': '?1',
	'DNT': '1',
	'Priority':'u=0, i'}

# However, when I run this, I get gibberish! Let me demo it for you!

![confused](https://images7.memedroid.com/images/UPLOADED243/65853cb72d33d.jpeg)

In [None]:
# I contacted the Beautiful Soup developer, and after some back and forth we found that the issue was in **Accept-Encoding**


headers = {'Host': 'www.yelp.com',
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0',
	'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
	'Accept-Language': 'en-US,en;q=0.5',
	'Accept-Encoding': 'identity',
	'Connection': 'keep-alive',
	'Cookie': 'wdi=2|9C05DBA58C8B45A8|0x1.9779231762aedp+30|b25cd5806fb92a56; location=%7B%22city%22%3A+%22Anchorage%22%2C+%22state%22%3A+%22AK%22%2C+%22country%22%3A+%22US%22%2C+%22latitude%22%3A+61.217799%2C+%22longitude%22%3A+-149.898302%2C+%22max_latitude%22%3A+61.2312356%2C+%22min_latitude%22%3A+61.1229806%2C+%22max_longitude%22%3A+-149.775511%2C+%22min_longitude%22%3A+-149.964789%2C+%22zip%22%3A+%22%22%2C+%22address1%22%3A+%22%22%2C+%22address2%22%3A+%22%22%2C+%22address3%22%3A+%22%22%2C+%22neighborhood%22%3A+%22%22%2C+%22borough%22%3A+%22%22%2C+%22provenance%22%3A+%22YELP_GEOCODING_ENGINE%22%2C+%22display%22%3A+%22Anchorage%2C+AK%22%2C+%22unformatted%22%3A+%22Anchorage%2C+AK%22%2C+%22isGoogleHood%22%3A+false%2C+%22usingDefaultZip%22%3A+false%2C+%22accuracy%22%3A+4%2C+%22language%22%3A+null%7D; hl=en_US; spid.d161=5d651ccc-3f8d-4e36-8362-b31965525d02.1709066439.11.1726590790.1724866333.539af496-f766-4c85-a0ba-417bda3ed672.5bf406f6-5923-483d-9359-995065e47b38.0c69bf7e-8153-4ede-b4c7-81a267821d05.1726590785366.2; _ga=GA1.2.9C05DBA58C8B45A8; _ga_K9Z2ZEVC8C=GS1.2.1726590785.14.1.1726590790.0.0.0; xcj=1|p6I8IYDgANRZXHGr0z2T8CbcJigEZxT2B5PgDvA8e8M; _gcl_au=1.1.362654120.1722453424; _scid=276aae79-0de4-411a-a880-6ccdf3622a89; _tt_enable_cookie=1; _ttp=Dc9RTEWvApeN-syNRhqab5lU74R; _sctr=1%7C1726545600000; g_state={"i_p":1723476120544,"i_l":3}; datadome=JnJzrWr81e_folBGLX1xU65izejNA1N_aE9xErbZU5TVTz3zikGDTxL9TYvYR29CoiqQxLv5KnPr0KMldyP2_NS6nBxkHAoLo9JSgNtF38H6EGC9om6VOdoA0nkbA8Zo; _ga_FFVGS59Y99=GS1.1.1722878716.1.0.1722878722.0.0.0; _ga_QGF19QKM8N=GS1.2.1722878716.1.0.1722878716.0.0.0; bse=172c1b05c94f4a50807e82dcdd82e468; bsi=1%7C80aaaf68-2aa9-5b05-9298-fe2c2a55bcd1%7C1726590784533%7C1726590784533%7C1%7C7f29b674117a02f6; spses.d161=*; 
OptanonConsent=isGpcEnabled=0&datestamp=Tue+Sep+17+2024+12%3A33%3A14+GMT-0400+(Eastern+Daylight+Time)&version=202403.1.0&browserGpcFlag=0&isIABGlobal=false&identifierType=Cookie+Unique+Id&hosts=&consentId=fb8c4385-4db2-4fb3-b228-f5397d3e048c&interactionCount=1&isAnonUser=1&landingPath=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fhilton-garden-inn-anchorage-anchorage&groups=BG122%3A1%2CC0003%3A1%2CC0002%3A1%2CC0001%3A1%2CC0004%3A1; _scid_r=HfQnaq55DeQWGuOAbM3zYiqJ4-S7i_vUJ7UAoQ; _uetsid=874dd7a0751211efacabdfbd4eb792f7; _uetvid=789c98e04f7111ef989b33b398d6482b',
	'Upgrade-Insecure-Requests': '1',
	'Sec-Fetch-Dest': 'document',
	'Sec-Fetch-Mode': 'navigate',
	'Sec-Fetch-Site': 'same-origin',
	'Sec-Fetch-User': '?1',
	'DNT': '1',
	'Priority':'u=0, i'}

# So, after searching on Google, I found that I had to use 'identity' rather than 'gzip, deflate, br,zstd'
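
Why would `Accept-Encoding` matter? That header tells the server which compression schemes the client can decode. A plausible explanation (an assumption, not something confirmed above) is that the server answered with `br` or `zstd`, which `requests` can only decompress when optional packages are installed, so `response.text` showed the still-compressed bytes. The sketch below imitates the effect offline with `gzip` from the standard library; no request is sent.

```python
import gzip

str_html = '<html><body><p>Text is usually in paragraphs.</p></body></html>'

# what a server sends when it compresses the body
compressed = gzip.compress(str_html.encode('utf-8'))

# reading compressed bytes as if they were text gives gibberish
gibberish = compressed.decode('utf-8', errors='replace')
print(gibberish[:20])

# decompressing first recovers the original html
recovered = gzip.decompress(compressed).decode('utf-8')
print(recovered == str_html)
```

Asking for `'identity'` means "send the body uncompressed", which sidesteps the decoding step entirely.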

### Warning
I **strongly** prefer that you only scrape websites that allow you to scrape **without** using headers. While sometimes a website is perfectly happy for you to scrape it and you just need a little fiddling with the headers to make it work, often a website isn't immediately scrapable **for a reason**: for example, to protect intellectual property or to keep too much traffic from flooding the site.

Please try to avoid scraping any sites that require using headers in this class.

# Beautiful Soup

Even if the HTML does look relatively clean, it's still just one big string. How can we deal with it? Luckily, there is a module made for just this purpose, and since `pip` is a magic command in Jupyter, we can install it directly from a notebook:

In [17]:
#pip install bs4

In [19]:
from bs4 import BeautifulSoup

url = 'https://sapiezynski.com/ds3000/scraping/01.html' 
str_html = requests.get(url).text
soup = BeautifulSoup(str_html)

In [20]:
soup

<html>
<head>
<!-- comments in HTML are marked like this -->
<!-- the head tag contains the meta information not displayed but helps browsers render the page -->
</head>
<body>
<!-- This is the body of the document that contains all the visible elements.-->
<h1>Heading 1</h1>
<h2>This is what heading 2 looks like</h2>
<p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>
<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>
<p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
</body>
</html>

In [21]:
## getting elements by their tag name:
soup.find_all('p')

# find_all returns a list, where each element is an instance of the specified tag

[<p>Text is usually in paragraphs.
             New lines and multiple consecutive whitespace characters are ignored.</p>,
 <p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>,
 <p>Links are created using the "a" tag: 
             <a href="https://www.google.com">Click here to google.</a>
             href is an attirbute of the a tag that specify where the link points to.</p>]

In [22]:
# the bs4 object tracks the tags
type(soup.find_all('p')[0])

bs4.element.Tag

In [22]:
for paragraph in soup.find_all('p'):
    # text is a property of a soup object
    print(paragraph.text) 
    print('------')

Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.
------
Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.
------
Links are created using the "a" tag: 
            Click here to google.
            href is an attirbute of the a tag that specify where the link points to.
------
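
Notice the stray newlines and runs of spaces in the printed text: `.text` returns the characters exactly as they appear in the source. A browser would collapse them, and one common way to do the same in Python is to split on whitespace and re-join. The helper name `squash_whitespace` is made up for this sketch.

```python
def squash_whitespace(text):
    """ collapses newlines, tabs and repeated spaces into single spaces

    Args:
        text (str): text as extracted from a tag

    Returns:
        clean_text (str): text with single spaces between words
    """
    # str.split() with no argument splits on any run of whitespace
    return ' '.join(text.split())


raw_text = """Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored."""

print(squash_whitespace(raw_text))
```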


# `.find_all()` on subtrees of soup object


The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object.  

Consider the site at https://sapiezynski.com/ds3000/scraping/02.html:

```html
<html>
    <body>
        <p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
        
        <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>
```

**Goal**: Grab links from the first paragraph only:

In [23]:
# getting the content of the page
url = 'https://sapiezynski.com/ds3000/scraping/02.html'
response = requests.get(url)
soup = BeautifulSoup(response.text)

# finding all paragraphs:
p_all = soup.find_all('p')

In [24]:
# getting the first paragraph
p_first = p_all[0]

In [25]:
# getting the links from the first paragraph:
links_p_first = p_first.find_all('a')

print(links_p_first)

[<a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a>]
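
Often we want the url each link points to rather than the whole tag. A short sketch of how that might look, using an inline copy of the first paragraph (so it runs without a request): a tag's attributes can be read like dictionary entries.

```python
from bs4 import BeautifulSoup

# inline copy of the first paragraph from the page above
html = """
<p>The links in this paragraph point to search engines, like
<a href="https://duckduckgo.com">DuckDuckGo</a>,
<a href="https://google.com">Google</a>,
<a href="https://bing.com">Bing</a></p>
"""
soup = BeautifulSoup(html, 'html.parser')
links_p_first = soup.find_all('p')[0].find_all('a')

# a tag behaves like a dictionary of its attributes
urls = [link['href'] for link in links_p_first]
print(urls)

# .get() returns None (instead of raising a KeyError) when the attribute is missing
print(links_p_first[0].get('title'))
```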


### Some syntactic sugar: 
To get the first tag of a given type under a soup object, refer to the tag name as an attribute:

In [26]:
# equivalent to soup.find_all('p')[0]
soup.p

<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>

In [27]:
# so we can condense our code as
plinks = soup.p.find_all('a')
print(plinks)

[<a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a>]


In [28]:
# iterating over tags
for par in soup.find_all('p'):
    print(par.a)

<a href="https://duckduckgo.com">DuckDuckGo</a>
<a href="https://firefox.com">Firefox</a>


In [29]:
# and the first link in the first paragraph can be accessed like this:
link = soup.p.a
print(link)

<a href="https://duckduckgo.com">DuckDuckGo</a>


## Identifying if tags exist

In [30]:
# what if we're trying to access an element that doesn't exist?
header = soup.h3
print(header)

# won't work: header is None, and None has no .text attribute
# header.text

None


We can test if a tag exists in a soup object by looking for the first instance of this tag and comparing it to `None`

In [31]:
if soup.h3 is None:
    print("tag h3 doesn't exist in soup")
else:
    print("tag h3 does exist!")

tag h3 doesn't exist in soup


In [32]:
if soup.p is None:
    print("tag p doesn't exist in soup")
else:
    print("tag p does exist!")

tag p does exist!
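
In a pipeline it can be handy to wrap this check in a function so a missing tag never crashes the later steps. A minimal sketch, where `first_tag_text` and its `default` argument are hypothetical names made up for this example (they are not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup


def first_tag_text(soup, tag_name, default=''):
    """ gets the text of the first tag with a given name (if it exists)

    Args:
        soup (BeautifulSoup): soup object (or any tag) to search in
        tag_name (str): name of the tag (e.g. 'p' or 'h3')
        default (str): value returned if no such tag exists

    Returns:
        text (str): text of the first matching tag, or default
    """
    # .find() returns the first match, or None if there isn't one
    tag = soup.find(tag_name)
    if tag is None:
        return default
    return tag.text


soup = BeautifulSoup('<html><body><p>hello</p></body></html>', 'html.parser')
print(first_tag_text(soup, 'p'))
print(first_tag_text(soup, 'h3', default='(no h3 tag)'))
```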


## Finding tags by `class_`

**Tip**: This is often one of the most useful ways to locate a particular part of a web page.

In [26]:
# get soup
url = 'https://www.allrecipes.com/search?q=cheese+fondue'
response = requests.get(url)
soup = BeautifulSoup(response.text)

In [27]:
soup

<!DOCTYPE html>
<html class="comp searchTemplate static-html html mntl-html no-js" data-ab="99,99,99,99,99,99,99,99,77,99,99,99,99,99,99,68,99,99,83,58,78,99,64,62" data-allrecipes-resource-version="2.104.0" data-mantle-resource-version="4.0.596" data-mm-ads-resource-version="1.2.120" data-mm-digital-issues-resource-version="1.18.5" data-mm-myrecipes-resource-version="1.3.18" data-mm-recipes-resource-version="1.1.6" data-mm-transactional-resource-version="1.13.5" data-mm-video-resource-version="1.4.5" data-resource-version="2.104.0" data-tracking-container="true" id="searchTemplate_1-0" lang="en"><!--
<globe-environment environment="k8s-prod" application="allrecipes" dataCenter="us-east-1"/>
-->
<head class="loc head">
<link href="//js-sec.indexww.com" rel="preconnect"/>
<link href="//c.amazon-adsystem.com" rel="preconnect"/>
<link href="//securepubads.g.doubleclick.net" rel="preconnect"/>
<link href="//www.google-analytics.com" rel="dnsprefetch"/>
<meta charset="utf-8"/>
<meta content

Our **goal** is to get a list of recipes.  Maybe we should find all the `div` tags? What about `span` tags?

In [35]:
# finding via tag ... problematic as we have too many div tags!
len(soup.find_all('div'))

303

In [36]:
len(soup.find_all('span'))

106

Tags can belong to one or more "classes" (when there are several, they are separated by spaces inside the `class` attribute).  For example, in https://www.allrecipes.com/search?q=cheese+fondue the first recipe title is encapsulated in this html:

    <span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
    
Here there are two nested span tags, each with its own class:
- the outer span belongs to `card__title`
- the inner span (which holds the recipe name) belongs to `card__title-text`
    
I suspect only our target recipes belong to the `card__title-text` class.  Let's find them all:

In [37]:
recipe_list = soup.find_all(class_='card__title-text')

len(recipe_list)

24

In [38]:
recipe_list

[<span class="card__title-text">Cheese Fondue</span>,
 <span class="card__title-text">Best Formula Three-Cheese Fondue</span>,
 <span class="card__title-text">Beer Cheese Fondue</span>,
 <span class="card__title-text">Classic Cheese Fondue</span>,
 <span class="card__title-text">Cheese Fondue</span>,
 <span class="card__title-text">Chef John's Classic Cheese Fondue Is the Ultimate Cheese Lover's Recipe</span>,
 <span class="card__title-text">Quick Fontina Cheese Fondue</span>,
 <span class="card__title-text">Basic Fondue</span>,
 <span class="card__title-text">Classic Swiss Fondue</span>,
 <span class="card__title-text">YouTube + Chill: For Serious Cheese-Lovers Only</span>,
 <span class="card__title-text">Crab Cheese Fondue</span>,
 <span class="card__title-text">Cheese</span>,
 <span class="card__title-text">25 Best Appetizers to Make if You're Obsessed With Cheese</span>,
 <span class="card__title-text">How to Make Cheese Sauce From Scratch</span>,
 <span class="card__title-text">Th

In [40]:
recipe_list[2].text

'Beer Cheese Fondue'
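
To pull the text out of every match at once, a list comprehension works nicely. The sketch below uses a tiny inline html string rather than the live page; the extra `underline` class is made up for illustration, to show that `class_` still matches a tag that also belongs to other classes.

```python
from bs4 import BeautifulSoup

# inline stand-in for the search results (the 'underline' class is hypothetical)
html = """
<span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
<span class="card__title"><span class="card__title-text underline">Beer Cheese Fondue</span></span>
"""
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any tag whose class attribute includes 'card__title-text'
recipe_list = soup.find_all(class_='card__title-text')

# keep only the text of each tag
recipe_names = [recipe.text for recipe in recipe_list]
print(recipe_names)
```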

## Finding tags by `id`

Nearly the same as finding by class, but you'll look for `id=` in the html and pass it to the `id` keyword of `soup.find_all()`.

**Goal**: Get the footer from: https://www.scrapethissite.com/



```html
<section id="footer">
        <div class="container">
            <div class="row">
                <div class="col-md-12 text-center text-muted">
                    Lessons and Videos © Hartley Brody 2018
                </div><!--.col-->
            </div><!--.row-->
        </div><!--.container-->
    </section>
```

In [41]:
# get soup from url
url = 'https://www.scrapethissite.com/'
html = requests.get(url).text
soup = BeautifulSoup(html)

In [42]:
soup.find_all(id='footer')

[<section id="footer">
 <div class="container">
 <div class="row">
 <div class="col-md-12 text-center text-muted">
                     Lessons and Videos © Hartley Brody 2023
                 </div><!--.col-->
 </div><!--.row-->
 </div><!--.container-->
 </section>]
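
Since an `id` is meant to be unique within a page, you may prefer `.find()`, which returns the first matching tag itself (or `None`) instead of a list. A small sketch with an inline, shortened version of the footer above (the `sidebar` id is made up to show the "not found" case):

```python
from bs4 import BeautifulSoup

# shortened inline version of the footer html shown above
html = """
<section id="footer">
    <div class="container">
        <div class="col-md-12 text-center text-muted">
            Lessons and Videos
        </div>
    </div>
</section>
"""
soup = BeautifulSoup(html, 'html.parser')

# .find() gives back a single tag, not a list
footer = soup.find(id='footer')
print(footer.name)
print(footer.text.strip())

# no tag has this (hypothetical) id, so we get None
print(soup.find(id='sidebar'))
```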

Note that you can combine all searches shown above:
- tag
    - p (paragraph)
    - a (link)
    - div, span, ...
- tag class
- tag id

```python
# finds all links (tag type = 'a'), with given class and id
soup.find_all('a', class_='fancy-link', id='blue')
```
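
To see the combined search in action, here is a small runnable sketch. The html, class names and ids are all made up for illustration; the point is that a tag must satisfy every condition to be returned.

```python
from bs4 import BeautifulSoup

# made-up html: only the first link has both the class and the id we ask for
html = """
<a class="fancy-link" id="blue" href="https://example.com/1">first</a>
<a class="fancy-link" id="red" href="https://example.com/2">second</a>
<a class="plain-link" id="green" href="https://example.com/3">third</a>
<p class="fancy-link">not a link</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# tag name AND class AND id must all match
matches = soup.find_all('a', class_='fancy-link', id='blue')
print(len(matches))
print(matches[0].text)

# dropping the id condition widens the search (the p tag is still excluded)
fancy = soup.find_all('a', class_='fancy-link')
print(len(fancy))
```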

## Practice: Rest of Class (if time, if not next time!)

**Goal:** Get a list of recipe names from www.allrecipes.com like we did for:

https://www.allrecipes.com/search?q=cheese+fondue

1. Write function `crawl_recipes(query)` which:
    * takes the search phrase (the ingredient) as input argument
    * builds the correct url that leads directly to the page that lists the recipes
    * uses `requests` to get the content of this page
    * returns the html text of the page
1. Write `extract_recipes(text)` which:
    * takes the text returned by `crawl_recipes` as argument
    * builds a BeautifulSoup object out of that text 
    * finds names of all recipes
        - to identify which tags / classes to `find_all()`, open the page in your browser and "inspect" 
        - start from the recipe object above, and call another `find_all()` to zoom into the recipe name itself
    * returns the list of recipe names
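
One way to start, following the "plan out a pipeline" approach from earlier in these notes, is to write 'empty' functions with little more than the docstring. The sketch below is only a starting skeleton (the docstring wording and the `pass` bodies are placeholders, not a solution):

```python
def crawl_recipes(query):
    """ gets the html of the allrecipes search page for a query

    Args:
        query (str): search phrase (e.g. 'cheese fondue')

    Returns:
        text (str): html text of the search results page
    """
    pass


def extract_recipes(text):
    """ extracts recipe names from the html of a search page

    Args:
        text (str): html text (output of crawl_recipes)

    Returns:
        recipe_list (list): list of recipe names (str)
    """
    pass
```

Once the inputs and outputs look right, fill in each body and test the two functions separately.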
    

A string method that will help if your query contains multiple words:

`string.replace()`

So, if you wish to turn `cheese fondue` into `cheese+fondue`:

`string = 'cheese fondue'`

`string.replace(" ", "+")`

In [43]:
string = 'cheese fondue'
string = string.replace(" ", "+")
string

'cheese+fondue'
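
As an alternative to `.replace()`, Python's standard library has `urllib.parse.quote_plus`, which turns spaces into `+` and also escapes other characters that are not safe in a url. A minimal sketch, where `build_search_url` is a hypothetical helper name and the url pattern is assumed from the cheese fondue example above:

```python
from urllib.parse import quote_plus


def build_search_url(query):
    """ builds an allrecipes search url from a search phrase

    Args:
        query (str): search phrase (e.g. 'cheese fondue')

    Returns:
        url (str): url of the search results page
    """
    # quote_plus replaces spaces with '+' and escapes characters like '&'
    return 'https://www.allrecipes.com/search?q=' + quote_plus(query)


print(build_search_url('cheese fondue'))
```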

In [45]:
# put functions here

In [46]:
meatloaf_html = crawl_recipes('meatloaf')
new_recipe_list = extract_recipes(meatloaf_html)

In [47]:
# new_recipe_list

['Classic Meatloaf',
 'Melt-In-Your-Mouth Meatloaf',
 'Best Meatloaf',
 'Easy Meatloaf',
 'Creamy Mushroom Meatloaf',
 "Meatloaf that Doesn't Crumble",
 'Meatloaf Cupcakes',
 'Italian Style Turkey Meatloaf',
 'Brown Sugar Meatloaf with Ketchup Glaze',
 "Chris's Incredible Italian Turkey Meatloaf",
 'Turkey and Quinoa Meatloaf',
 'Dill Pickle Meatloaf',
 "Kim's Ultimate Meatloaf",
 "Ellen's Buffalo Meatloaf",
 'Mushroom Meatloaf',
 "Chef John's Meatball-Inspired Meatloaf",
 'Best Turkey Meatloaf',
 'Cottage Meatloaf',
 'Tennessee Meatloaf',
 'Sweet and Sour Meatloaf',
 "Momma's Healthy Meatloaf",
 'Best Ever Meatloaf with Brown Gravy',
 'Smoky Chipotle Meatloaf',
 'Cheeseburger Meatloaf']