# DS3000 Lecture 7 and 8

### Admin:
- HW 3 due on Monday
- HW 4 posted on Tuesday and due on Friday
- Lecture 9 will be a lab. Finishing the lab will earn 1 extra credit
- Project description will be released this week

### Content:
- OpenWeather API pipeline
- Intro to Web Scraping

# Pipeline: What is it?

A data pipeline is a collection of functions* which together split up all the functionality of our data collection and processing, with each function handling one step.

(*can be other structures too, but it may be easier to first understand each as a function)


# Why build a data pipeline?

- Allows pipeline to be run in parts (rather than the whole thing)
- Allows pipeline to be built by different programmers working on different parts in parallel
- Allows us to test each piece of our code separately
- Allows for modification / re-use of different sections

What we call a "Data Pipeline" here is a specific instance of "Factoring" a piece of software, splitting up its functionality into pieces.
    


# OpenWeather API Pipeline Activity

OpenWeather API offers a few different queries (see [here](https://openweathermap.org/api) for details):
- 3-hour Forecast 5 days (which we have access to)
- Air Pollution API
- etc.


**Goal:**

Build a library of functions which can be pieced together to support the collection, cleaning and display of features from OpenWeather into a scatter plot of two features.

### Let's design one together:

(think: input/outputs -> handwritten notes)

# Plan out a pipeline

Write a few 'empty' functions including little more than the docstring:

```python
def some_fnc(a_string, a_list):
    """ processes a string and a list (somehow)
    
    Args:
        a_string (str): an input string which ...
        a_list (list): a list which describes ...
        
    Returns:
        output (dict): the output dict which is ...
    """
    pass
```

and a script which uses them:

```python
# inputs (not necessarily complete)
lat = -42
lon = 73

some_output = some_fnc(lat, lon)
some_other_output = some_other_fnc(some_output)

```

which would, if the functions worked, produce a graph like this (note: this starts Oct 6, because I made it yesterday):

<img src="https://i.ibb.co/Ct0JtRJ/newplot-1.png" width=500>

# What might these empty functions look like?

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
import json
from datetime import datetime
import pandas as pd
import plotly
import plotly.express as px

def openweather_onecall(latlon_tuple, api_key, units='imperial'):
    """ returns forecast weather data for one location
    
    note: despite the function name, the url built below queries the
    5 day / 3 hour forecast endpoint, not One Call
    https://openweathermap.org/forecast5
    
    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'
        
    Returns:
        weather_dict (dict): a nested dictionary (tree) which
            contains weather data
    """
    # build url
    lat, lon = latlon_tuple
    url = f'https://api.openweathermap.org/data/2.5/forecast?lat={lat}&lon={lon}&APPID={api_key}&units={units}'

    # get url as a string
    url_text = requests.get(url).text
    
    # convert json to a nested dict
    weather_dict = json.loads(url_text)
    
    return weather_dict

def get_clean_df_daily(weather_dict):
    """ formats weather_dict to a pandas data frame
    
    see https://openweathermap.org/forecast5 for
    full weather_dict specification
    
    Args:
        weather_dict (dict): nested dict; weather_dict['list'] is a
            list of dictionaries of 3-hour window weather features
            
    Returns:
        df_hourly (pd.DataFrame): each row is weather from 3-hour window
    """
    hour_dict = weather_dict['list'][0]['main']
    hour_dict['datetime'] = weather_dict['list'][0]['dt_txt']

    df_hourly = pd.Series(hour_dict)

    df_hourly = pd.DataFrame(df_hourly).transpose()
    
    index = 0
    for hour_index in weather_dict['list']:

        hour_dict = hour_index['main']
        hour_dict['datetime'] = hour_index['dt_txt']

        s_hour = pd.Series(hour_dict)
    
        #df_hourly = df_hourly.append(s_hour, ignore_index=True)
        df_hourly.loc[str(index),:] = s_hour
    
        index = index + 1

    df_hourly = df_hourly.iloc[1:,]   
    
    return df_hourly

def scatter_plotly(df, feat_x, feat_y, f_html='scatter.html'):
    """ creates a plotly scatter plot, exports as html 
    
    Args:
        df (pd.DataFrame): pandas dataframe
        feat_x (str): column of df on x axis of scatter
        feat_y (str): column of df on y axis of scatter
        f_html (str): output html file
        
    Returns:
        f_html (str): output html file
    """
    # create scatter plot
    fig = px.scatter(df, x=feat_x, y=feat_y)

    # export scatter to html
    plotly.offline.plot(fig, filename=f_html)
    
    return f_html


In [2]:
# inputs
feat_x = 'datetime'
feat_y = 'temp_max'
latlon_tuple = -42, 70
units = 'imperial'
api_key = '2afdede234eabfa52612efba55bcc8ac'

# get data
weather_dict = openweather_onecall(latlon_tuple, 
                                   units=units,
                                   api_key=api_key)

In [5]:
# clean weather dict (make dataframe from dict, process timestamps etc)
df_daily = get_clean_df_daily(weather_dict)
df_daily.head()

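|    | temp  | feels_like | temp_min | temp_max | pressure | sea_level | grnd_level | humidity | temp_kf | datetime            |
|----|-------|------------|----------|----------|----------|-----------|------------|----------|---------|---------------------|
| 0  | 43.66 | 34.23      | 43.66    | 43.66    | 1018     | 1018      | 1018       | 64       | 0.0     | 2024-07-13 18:00:00 |
| 1  | 43.72 | 34.72      | 43.72    | 43.81    | 1018     | 1018      | 1018       | 63       | -0.05   | 2024-07-13 21:00:00 |
| 2  | 43.5  | 34.2       | 43.43    | 43.5     | 1019     | 1019      | 1019       | 62       | 0.04    | 2024-07-14 00:00:00 |
| 3  | 43.43 | 34.02      | 43.43    | 43.43    | 1020     | 1020      | 1020       | 57       | 0.0     | 2024-07-14 03:00:00 |
| 4  | 43.56 | 34.18      | 43.56    | 43.56    | 1022     | 1022      | 1022       | 56       | 0.0     | 2024-07-14 06:00:00 |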


In [6]:
# make scatter
f_html = scatter_plotly(df_daily, feat_x=feat_x, feat_y=feat_y)

## Web Scraping
* Using programs or scripts to pretend to browse websites, examine their content, and retrieve and extract data from them
* Why scrape?
    * if an API is available for a service, we will nearly always prefer the API to scraping
    * ... but not all services have APIs or the available APIs are too expensive for our project
    * newly published information might not yet be available through ready datasets
* Downsides of scraping:
    * no reference documentation (unlike APIs)
    * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper)
    * if it violates the terms of service it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)
    * legal and moral grey zone (even if the ToS does not disallow it, somebody has to pay for the traffic and when you're scraping you're not looking at ads)
    * ... but everybody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)
* Web scraping pipeline:
    * because the webpages might change their structure it's extra important to keep the crawling/extraction step separate from transformations and loading
    * ETL (Extract-Transform-Load), preceded by a crawl step:
        * **Crawl**: open a given URL using requests and get the HTML source
        * **Extract**: extract interesting content from the webpage's source
        * **Transform**: our usual unit conversions, etc.
        * **Load**: represent the data in a form that is easy to store and analyze
    * **Pro tip**: it's usually a good idea to not only store the transformed data, but also the raw HTML source - because the webpages might change and we might be late to realize we're not extracting right. If we have the original HTML source we can go back to it
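
The separation of steps described above can be sketched as follows. This is only a minimal illustration, not the course's reference solution: the page content, the file name `raw_page.html`, and all function names are made up, and the crawl step is replaced by a hard-coded string so the example runs without a web request.

```python
from pathlib import Path
import tempfile

from bs4 import BeautifulSoup

# stand-in for the Crawl step: in a real pipeline this string would come
# from requests.get(url).text (hypothetical page, not a real website)
RAW_HTML = """
<html><body>
  <p class="temp">Boston: 68 F</p>
  <p class="temp">Denver: 75 F</p>
</body></html>
"""

def save_raw(html_str, f_raw):
    """ stores the untouched html so we can re-extract from it later """
    Path(f_raw).write_text(html_str, encoding='utf-8')
    return f_raw

def extract(html_str):
    """ Extract: pulls the interesting strings out of the html """
    soup = BeautifulSoup(html_str, 'html.parser')
    return [p.text for p in soup.find_all('p', class_='temp')]

def transform(str_list):
    """ Transform: splits 'City: 68 F' into a city name and a float """
    row_list = []
    for s in str_list:
        city, temp = s.split(':')
        row_list.append({'city': city.strip(),
                         'temp_f': float(temp.replace('F', '').strip())})
    return row_list

# keep the raw html on disk, then extract / transform from the stored copy
f_raw = save_raw(RAW_HTML, Path(tempfile.gettempdir()) / 'raw_page.html')
html_str = Path(f_raw).read_text(encoding='utf-8')

# Load: here simply a list of dicts, ready for storage or a data frame
row_list = transform(extract(html_str))
print(row_list)
```

If the page layout changes, only `extract()` needs to be rewritten, and it can be re-run on the stored raw HTML.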
    

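## First: is it OK to scrape this site?

Check the site's `robots.txt` file (served from the root of the domain, e.g. `https://www.example.com/robots.txt`). It lists which paths automated clients are asked to stay away from.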
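
One way to check these rules from Python is the standard library's `urllib.robotparser`. The sketch below parses a made-up `robots.txt` (the rules and the `example.com` URLs are assumptions for illustration) so that it runs without a web request; for a real site you could instead point the parser at the site's `robots.txt` with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# a made-up robots.txt, standing in for the file a real site would serve
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) tells us if the rules allow visiting the url
ok_public = parser.can_fetch('*', 'https://example.com/recipes/')
ok_private = parser.can_fetch('*', 'https://example.com/private/data')

print(ok_public)    # True
print(ok_private)   # False
```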

## Best case scenario
Some webpages publish their data in the form of simple tables. In these (rare) cases we can just use pandas' `pd.read_html()` to scrape this data:

https://www.espn.com/nba/team/stats/_/name/bos

In [7]:
import pandas as pd

In [1]:
# baseball instead of basketball?
# https://www.baseball-reference.com/teams/BOS/2022.shtml


## Messy Data

Notice that the baseball data are quite a bit messier than the basketball data. In web scraping, you are beholden to the format of the website (.html) and will almost certainly have to clean data (sometimes extensively) after scraping it.

## Basic HTML
Web pages are written in HTML. The source of https://sapiezynski.com/ds3000/scraping/01.html looks like this:

```html
<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>
```
The keywords in `<>` brackets are called tags. They open with `<tag>` and close with `</tag>`.

In [2]:
## Getting the html content in Python
import requests


In [14]:
# sometimes this doesn't quite work the way you want (c'est la vie with web scraping)
response2 = requests.get('https://www.nytimes.com/2019/03/10/style/what-is-tik-tok.html')
print(response2.text)

<html><head><title>nytimes.com</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMAg2FSlyqAvKkASEY_Dw==','hsh':'499AE34129FA4E4FABC31582C3075D','t':'bv','s':17439,'e':'f5bdd29d2b571013e8a759916935edc6c24068c138a78c1ab611e1d898c28a91','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>


# Beautiful Soup

Even if the .html does look relatively clean, it's still just a big string. How can we deal with it? Luckily there is a module made for just this purpose (Beautiful Soup, imported as `bs4`), and we can install it directly from a Jupyter notebook with the `pip` magic command:

In [12]:
#pip install bs4


In [3]:
from bs4 import BeautifulSoup

url = "https://sapiezynski.com/ds3000/scraping/01.html"
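
As a minimal sketch of how a soup object is built and searched, the example below parses a shortened inline copy of the page above rather than downloading it (so `html_str` is an assumption standing in for `requests.get(url).text`). It uses Python's built-in `'html.parser'`.

```python
from bs4 import BeautifulSoup

# shortened inline copy of the example page, so no web request is needed
html_str = """
<html><body>
    <h1>Heading 1</h1>
    <p>Text is usually in paragraphs.</p>
    <p>Links are created using the "a" tag:
        <a href="https://www.google.com">Click here to google.</a></p>
</body></html>
"""

# build the soup object (a tree of tags) from the html string
soup = BeautifulSoup(html_str, 'html.parser')

# .find_all() returns a list of every matching tag
p_list = soup.find_all('p')
a_list = soup.find_all('a')

print(len(p_list))       # number of paragraphs: 2
print(a_list[0].text)    # visible text of the first link
```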


# `.find_all()` on subtrees of soup object


The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object.  

Consider the site at https://sapiezynski.com/ds3000/scraping/02.html:

```html
<html>
    <body>
        <p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
        
        <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>
```

**Goal**: Grab links from the first paragraph only:

In [4]:
url = "https://sapiezynski.com/ds3000/scraping/02.html"
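
A sketch of one way to reach this goal: find the paragraphs first, then call `.find_all()` again on the first paragraph only. The html is pasted inline (with the paragraph text shortened) so the example does not depend on the page being reachable.

```python
from bs4 import BeautifulSoup

# inline, shortened copy of the two-paragraph example page
html_str = """<html><body>
<p>Search engines: <a href="https://duckduckgo.com">DuckDuckGo</a>,
   <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
<p>Browsers: <a href="https://firefox.com">Firefox</a>,
   <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>
</body></html>"""

soup = BeautifulSoup(html_str, 'html.parser')

# searching the whole soup finds the links of both paragraphs
all_links = soup.find_all('a')

# searching the first paragraph (a subtree) finds only its own links
first_p = soup.find_all('p')[0]
link_list = [a.text for a in first_p.find_all('a')]

print(len(all_links))   # 6
print(link_list)        # ['DuckDuckGo', 'Google', 'Bing']
```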

### Some syntactic sugar: 
To get the first tag of a given type under a soup object, refer to it as an attribute (for example, `soup.p` is the first `p` tag).
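
A small sketch of this shortcut, using a made-up one-line html string:

```python
from bs4 import BeautifulSoup

html_str = '<body><p>first <a href="https://google.com">Google</a></p><p>second</p></body>'
soup = BeautifulSoup(html_str, 'html.parser')

# attribute access returns the first matching tag, just like .find()
first_p = soup.p
same_p = soup.find('p')

# attributes can be chained to walk down the tree
first_link = soup.p.a

print(first_p.text)         # first Google
print(first_link['href'])   # https://google.com
```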

## Identifying if tags exist

We can test if a tag exists in a soup object by looking for the first instance of this tag (with `.find()`) and comparing it to `None`.
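
For instance (a minimal sketch on a made-up html string):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>no tables here</p></body>', 'html.parser')

# .find() returns the first matching tag, or None if there is no match
has_p = soup.find('p') is not None
has_table = soup.find('table') is not None

print(has_p)       # True
print(has_table)   # False
```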

## Finding tags by `class_`

**Tip**: This is often one of the most useful ways to localize a particular part of a web page.

In [6]:
# get soup
url = 'https://www.allrecipes.com/search?q=cheese+fondue'
responses = requests.get(url)
soup = BeautifulSoup(responses.text)

Our **goal** is to get a list of recipes.  Maybe we should find all the `div` tags? What about `span` tags?

Tags can have multiple "classes" they belong to.  For example, in https://www.allrecipes.com/search?q=cheese+fondue the first recipe is encapsulated in this html tag:

    <span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
    
Here the outer `span` tag belongs to the class `card__title` and the inner `span` tag, which holds the recipe name, belongs to the class `card__title-text`.

I suspect only our target recipes belong to the `card__title-text` class.  Let's find them all:
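
A sketch of the search on a small inline snippet. Only the first `span` pair is copied from the page; the second recipe name and the extra `span` are made up for illustration, and the live page may have changed its class names since.

```python
from bs4 import BeautifulSoup

# first card copied from the search page; the rest is made up
html_str = """
<div>
  <span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
  <span class="card__title"><span class="card__title-text">Beer Cheese Fondue</span></span>
  <span class="other">not a recipe</span>
</div>
"""
soup = BeautifulSoup(html_str, 'html.parser')

# class_ (with trailing underscore, since class is a python keyword)
# keeps only the tags which belong to the given class
tag_list = soup.find_all('span', class_='card__title-text')
name_list = [tag.text for tag in tag_list]

print(name_list)   # ['Cheese Fondue', 'Beer Cheese Fondue']
```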

## Finding tags by `id`

Nearly the same as finding by class, but you'll look for `id=` in the html and pass it to the `id` keyword of `soup.find_all()`.

**Goal**: Get the footer from: https://www.scrapethissite.com/



```html
<section id="footer">
        <div class="container">
            <div class="row">
                <div class="col-md-12 text-center text-muted">
                    Lessons and Videos © Hartley Brody 2018
                </div><!--.col-->
            </div><!--.row-->
        </div><!--.container-->
    </section>
```

In [40]:
# get soup from url
url = 'https://www.scrapethissite.com/'
html = requests.get(url)
soup = BeautifulSoup(html.text)
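
A sketch of the `id` search on an inline, simplified copy of the footer above (the `hero` section is made up so that there is something to filter out):

```python
from bs4 import BeautifulSoup

# simplified copy of the footer html, plus one made-up section
html_str = """
<body>
  <section id="hero"><h1>Some title</h1></section>
  <section id="footer">
    <div class="container">
      <div class="row">
        <div class="col-md-12 text-center text-muted">
          Lessons and Videos © Hartley Brody 2018
        </div>
      </div>
    </div>
  </section>
</body>
"""
soup = BeautifulSoup(html_str, 'html.parser')

# keep only tags whose id is 'footer'
footer_list = soup.find_all(id='footer')

# .text gathers all text under the tag; .strip() drops the whitespace
footer_text = footer_list[0].text.strip()
print(footer_text)
```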

Note that you can combine all searches shown above:
- tag
    - p (paragraph)
    - a (link)
    - div, span, ...
- tag class
- tag id

```python
# finds all links (tag type = 'a'), with given class and id
soup.find_all('a', class_='fancy-link', id='blue')

```

## Practice: 

**Goal:** Get a list of recipe names from www.allrecipes.com like we did for:

https://www.allrecipes.com/search?q=cheese+fondue

1. Write function `crawl_recipes(query)` which:
    * takes the search phrase (the ingredient) as input argument
    * builds the correct url that leads directly to the page that lists the recipes
    * uses `requests` to get the content of this page and returns the html text of the page
1. Write `extract_recipes(text)` which:
    * takes the text returned by `crawl_recipes` as argument
    * builds a BeautifulSoup object out of that text 
    * finds names of all recipes
        - to identify which tags / classes to `find_all()`, open the page in your browser and "inspect" 
        - start from the recipe object above, and call another `find_all()` to zoom into the recipe name itself
    * returns the list of recipe names
    

A string method that will help if you wish to query multiple words:

`str.replace()`

So, if you wish to turn `cheese fondue` into `cheese+fondue`:
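
A minimal sketch of that replacement, and of building the search url from it (the url pattern is the one from the search link above):

```python
query = 'cheese fondue'

# swap every space for a plus sign
query_plus = query.replace(' ', '+')

# build the search url from the cleaned query
url = f'https://www.allrecipes.com/search?q={query_plus}'

print(query_plus)   # cheese+fondue
print(url)          # https://www.allrecipes.com/search?q=cheese+fondue
```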

In [7]:
def crawl_recipes(query):
    """ gets html from allrecipes.com for a search query
    
    Args:
        query (str): search string
        
    Returns:
        html_str (str): html response from allrecipes.com
    """
    pass
    
def extract_recipes(text):
    """ builds list of recipe names from allrecipes html
    
    Args:
        text (str): html response from allrecipes.com, see crawl_recipes()
        
    Returns:
        recipe_list (list): list of recipes
    """
    
    pass

## Getting info from each recipe's own page:

When we interact with the webpage in the browser, clicking on the header with the recipe name leads us to the actual recipe. Let's have a look at how it's done. Here are the links (`<a>` tags) for the first and third cards of the meatloaf search:

```html
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" 
   data-cta="" 
   data-doc-id="6663943" 
   data-ordinal="1" 
   data-tax-levels="" 
   href="https://www.allrecipes.com/recipe/219171/classic-meatloaf/" 
   id="mntl-card-list-items_1-0">
```

```html
<a class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" 
   data-cta="" 
   data-doc-id="6663443" 
   data-ordinal="3" 
   data-tax-levels="" 
   href="https://www.allrecipes.com/recipe/223381/melt-in-your-mouth-meat-loaf/" 
   id="mntl-card-list-items_1-0-2">
```
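
Attributes of a tag, such as `href`, can be read from a soup tag as if the tag were a dictionary. The sketch below uses a shortened inline version of the two links above; the inner `span` tags and the recipe names inside them are assumptions about how the card is laid out, not copied from the page.

```python
from bs4 import BeautifulSoup

# shortened copies of the two <a> tags; inner spans are assumed
html_str = """
<a class="mntl-card-list-items card"
   href="https://www.allrecipes.com/recipe/219171/classic-meatloaf/">
   <span class="card__title-text">Classic Meatloaf</span></a>
<a class="mntl-card-list-items card"
   href="https://www.allrecipes.com/recipe/223381/melt-in-your-mouth-meat-loaf/">
   <span class="card__title-text">Melt-In-Your-Mouth Meat Loaf</span></a>
"""
soup = BeautifulSoup(html_str, 'html.parser')

recipe_list = []
for a_tag in soup.find_all('a', class_='mntl-card-list-items'):
    # the recipe name is the text of the span inside the link
    name = a_tag.find('span', class_='card__title-text').text
    # tag attributes are accessed like dictionary keys
    recipe_list.append({'name': name, 'href': a_tag['href']})

print(recipe_list[0])
```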



# Adding `href` to our dataframe of recipes

Let's modify our `extract_recipes()` function such that rather than returning just the names of the dishes, it collects a dictionary per recipe with the `name` and `href` fields and returns them as a data frame:

## `from_dict`

First, a useful tool: `pd.DataFrame.from_dict()` turns a dictionary into a data frame, where the keys become the features (columns) and each value is a list holding that feature's values (one per row):

In [53]:
example_dict = {'col1': [1,2,3,4,5],
                'col2': [6,7,8,9,10],
                'col3': ['who', 'what', 'when', 'where', 'why']}
pd.DataFrame.from_dict(example_dict)

|   | col1 | col2 | col3  |
|---|------|------|-------|
| 0 | 1    | 6    | who   |
| 1 | 2    | 7    | what  |
| 2 | 3    | 8    | when  |
| 3 | 4    | 9    | where |
| 4 | 5    | 10   | why   |


In [8]:
def extract_recipes(text):
    """ builds dataframe of recipes (name and href) from allrecipes html
    
    Args:
        text (str): html response from allrecipes.com, see crawl_recipes()
        
    Returns:
        df_recipe (pd.DataFrame): dataframe of recipes
    """
    pass

## String Manipulations
- `.split()` & `.join()`
- `.strip()`
- `.replace()`
- `.upper()` & `.lower()`

Visiting [a specific recipe's page](https://www.allrecipes.com/recipe/219171/classic-meatloaf/) yields data stored in a string.  The methods above allow us to extract this information.
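
A quick sketch of each method on a made-up messy string (`raw_str` is an assumption about what scraped text can look like, not copied from the recipe page):

```python
# made-up example of messy text scraped from a page
raw_str = '  Prep Time:\n10 mins \n'

# .strip() removes leading / trailing whitespace (including newlines)
clean_str = raw_str.strip()          # 'Prep Time:\n10 mins'

# .split() cuts a string into a list of pieces at the given separator
key, val = clean_str.split(':')      # 'Prep Time' and '\n10 mins'
val = val.strip()                    # '10 mins'

# .lower() / .upper() change case
key_lower = key.lower()              # 'prep time'
key_upper = key.upper()              # 'PREP TIME'

# .join() glues a list of strings together with a separator
key_snake = '_'.join(key_lower.split())   # 'prep_time'

# .replace() swaps one substring for another
val_long = val.replace('mins', 'minutes') # '10 minutes'

print({key_snake: val_long})
```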

In [67]:
# visit specific recipe's page
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
html = requests.get(url).text
soup = BeautifulSoup(html)

## Exercise
Write two functions: `extract_prep_info()` and `extract_nutrition()`, which both accept a url of a particular recipe (see examples above) and return dictionaries of the prep info or nutritional information, respectively. For example:

```python
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'
extract_prep_info(url)
extract_nutrition(url)

```

yields:

```python
prep_info_dict = {'Prep Time': '10 mins',
                  'Cook Time': '15 mins',
                  'Total Time': '25 mins',
                  'Servings': '10',
                  'Yield': '10 servings'}

```

and

```python
nutr_info_dict = {'Total Fat': '14g',
                  'Saturated Fat': '9g',
                  'Cholesterol': '46mg',
                  'Sodium': '179mg',
                  'Total Carbohydrate': '3g',
                  'Total Sugars': '1g',
                  'Protein': '13g',
                  'Vitamin C': '0mg',
                  'Calcium': '461mg',
                  'Iron': '0mg',
                  'Potassium': '67mg'}

```

In [10]:
def extract_prep_info(url):
    """ returns a dictionary of recipe preparation info 
    
    Args:
        url (str): url of an allrecipes.com recipe page
        
    Returns:
        prep_info_dict (dict): keys are features ('prep'), 
            vals are str that describe feature ('20 mins')
    """
    pass

In [11]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of an allrecipes.com recipe page
        
    Returns:
        nutr_dict (dict): keys are molecule types ('fat'), 
            vals are str of quantity ('24 g')
    """
    pass

In [12]:
url = 'https://www.allrecipes.com/recipe/283561/classic-cheese-fondue/'

### Grabbing numeric values (float/int) from messy strings

- We have strings which describe recipe nutrition info (`'100 mg'`)
- We want numeric data types (`float, int`) so that we can plot and operate on these values
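
One possible approach, sketched with the standard library's `re` module, is to search for the first number in the string. The helper name `get_number` is made up for this example, and it assumes quantities are non-negative; splitting on whitespace would also work for strings like `'100 mg'` but not for `'14g'`.

```python
import re

def get_number(s):
    """ returns the first number found in a string as a float

    Args:
        s (str): string such as '100 mg' or '14g'

    Returns:
        num (float): first number in s (None if s has no digits)
    """
    # one or more digits, optionally followed by a decimal part
    match = re.search(r'\d+(?:\.\d+)?', s)
    if match is None:
        return None
    return float(match.group())

print(get_number('100 mg'))   # 100.0
print(get_number('14g'))      # 14.0
print(get_number('2.5 g'))    # 2.5
print(get_number('n/a'))      # None
```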

## Rest of Class (Go slowly; if we don't finish, we can continue next week)
Complete the `extract_nutrition()` below such that:

```python
# get / extract a data frame of recipes (only name and href)
str_query = 'boston cream pie'
html_str = crawl_recipes(str_query)
df_recipe = extract_recipes(html_str)

for row_idx in range(df_recipe.shape[0]):
    # get / extract nutrition info for a particular recipe
    recipe_url = df_recipe.loc[row_idx, 'href']
    nutr_dict = extract_nutrition(recipe_url)
    
    # add each new nutrition feature to the dataframe
    # only if there ARE nutrition features
    if len(nutr_dict) != 0:
        for nutr_feat, nutr_val in nutr_dict.items():
            df_recipe.loc[row_idx, nutr_feat] = nutr_val
    else:
        df_recipe = df_recipe.drop(row_idx, axis=0)

```

generates the `df_recipe`:

|    | name                           | href                                              | Total Fat | Saturated Fat | Cholesterol | Sodium | Total Carbohydrate | Dietary Fiber | Total Sugars | Protein | Vitamin C | Calcium | Iron | Potassium |
|----|--------------------------------|---------------------------------------------------|-----------|---------------|-------------|--------|--------------------|---------------|--------------|---------|-----------|---------|------|-----------|
| 0  | Chef John's Boston Cream Pie   | https://www.allrecipes.com/recipe/220942/chef-... | 41        | 17            | 199         | 514    | 72                 | 2             | 46           | 10      | 0         | 168     | 2    | 230       |
| 1  | Boston Cream Pie               | https://www.allrecipes.com/recipe/8138/boston-... | 13        | 6             | 61          | 230    | 47                 | 1             | 34           | 5       | 0         | 101     | 2    | 134       |
| 2  | Boston Cream Pie I             | https://www.allrecipes.com/recipe/8137/boston-... | 15        | 9             | 94          | 223    | 43                 | 1             | 26           | 5       | 0         | 97      | 2    | 95        |
| 3  | Semi-Homemade Boston Cream Pie | https://www.allrecipes.com/recipe/278930/semi-... | 41        | 16            | 219         | 568    | 79                 | 3             | 53           | 11      | 0         | 186     | 3    | 194       |
| 9  | Hot Milk Sponge Cake II        | https://www.allrecipes.com/recipe/8159/hot-mil... | 3         | 2             | 52          | 231    | 34                 | 0             | 20           | 4       | NaN       | 61      | 2    | 60        |
| 17 | Boston Cream Dessert Cups      | https://www.allrecipes.com/recipe/213446/bosto... | 15        | 7             | 44          | 237    | 32                 | 0             | 22           | 3       | 0         | 41      | 1    | 101       |
| 19 | Boston Creme Mini-Cupcakes     | https://www.allrecipes.com/recipe/220809/bosto... | 12        | 4             | 32          | 253    | 34                 | 0             | 24           | 3       | 0         | 62      | 1    | 100       |

In [109]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of an allrecipes.com recipe page
        
    Returns:
        nutr_dict (dict): keys are molecule types ('fat'), 
            vals are floats of quantity ('24 g' = 24)
    """
    pass

In [13]:
url = 'https://www.allrecipes.com/recipe/220942/chef-johns-boston-cream-pie/'

## Putting it all together
- get list of dictionaries corresponding to recipes (done!)
- get dictionary of nutrition info per recipe (done!)
- aggregating info into dataframe (see below)
- scatter plot (up next)

In [15]:
def get_df_recipe(str_query, recipe_limit=None):
    """ searches for recipes and returns dataframe, with nutrition info
    
    Args:
        str_query (str): search string
        recipe_limit (int): if passed, limits the number of recipes
            (helpful to speed up nutrition scraping for teaching!)
        
    Returns:
        df_recipe (pd.DataFrame): dataframe, each row is a recipe.
            includes columns href, name, and nutrition facts
    """    
    pass