# DS3000 Day 3

Sep 19, 2023

Admin
- Qwickly Attendance (PIN on board)
- Homework 1 due tonight, Sep 19 by midnight
- Homework 2 will be posted then, due Oct 10 by midnight
      - Note: you have three weeks to do this, but **do not** put it off! The sooner you complete everything the better.
- Quiz 1 will be posted **next** Tuesday, Sep 25, and must be done by Oct 3 (2 hour time limit)
- Lab and Visitor tentatively scheduled for Sep 25 as well

Push-Up Tracker
- Section 04: 0
- Section 05: 2
- Section 06: 1

Content:
- OpenWeather API pipeline
- Intro to Web Scraping

# Data Pipeline: What is it?

A data pipeline is a collection of functions* which splits the functionality of our data collection and processing into separate steps.

(*can be other structures too, but it may be easier to first understand each as a function)


# Why build a data pipeline?

- Allows pipeline to be run in parts (rather than the whole thing)
- Allows pipeline to be built by different programmers working on different parts in parallel
- Allows us to test each piece of our code separately
- Allows for modification / re-use of different sections

What we call a "Data Pipeline" here is a specific instance of "Factoring" a piece of software, splitting up its functionality into pieces.
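
As a minimal sketch of this idea (the function names and the temperature values are made up for illustration), a tiny pipeline might be factored into three functions, each of which can be run and tested on its own:

```python
def collect():
    """ stand-in for a data collection step (e.g. an API call) """
    return [{'day': 'Mon', 'temp_f': '68'},
            {'day': 'Tue', 'temp_f': '77'}]

def clean(raw_rows):
    """ converts temperature strings to floats """
    return [{'day': row['day'], 'temp_f': float(row['temp_f'])}
            for row in raw_rows]

def summarize(clean_rows):
    """ returns the name of the warmest day """
    return max(clean_rows, key=lambda row: row['temp_f'])['day']

# each piece can be tested alone, with hand-made inputs ...
assert clean([{'day': 'Wed', 'temp_f': '50'}]) == [{'day': 'Wed', 'temp_f': 50.0}]

# ... or the whole pipeline can be run end to end
warmest_day = summarize(clean(collect()))
print(warmest_day)
```

If the data source changes, only `collect()` needs to be rewritten; `clean()` and `summarize()` can be re-used as they are.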
    


# OpenWeather API Pipeline Activity

OpenWeather API offers a few different queries (see [here](https://openweathermap.org/api) for details):
- One Call API (which we have access to)
- Solar Radiation API
- etc.


**Goal:**

Build a library of functions which can be pieced together to support the collection, cleaning and display of features from OpenWeather into a scatter plot of two features.

### Let's design one together:

(think: input/outputs -> handwritten notes)

# Plan out a pipeline

Write a few 'empty' functions including little more than the docstring:

```python
def some_fnc(a_string, a_list):
    """ processes a string and a list (somehow)
    
    Args:
        a_string (str): an input string which ...
        a_list (list): a list which describes ...
        
    Returns:
        output (dict): the output dict which is ...
    """
    pass
```

and a script which uses them:

```python
# inputs (not necessarily complete)
lat = -42
lon = 73

some_output = some_fnc(lat, lon)
some_other_output = some_other_fnc(some_output)

```

which would, if the functions worked, produce a graph like this (note: this starts Oct 6, because I made it last year):

<img src="https://i.ibb.co/Ct0JtRJ/newplot-1.png" width="500">

**NOTE:** we haven't talked about creating plots yet, but we will next week! For now, I will provide everything you need in the examples.

# What might these empty functions look like?

In [None]:
def openweather_onecall(latlon_tuple, api_key, units='imperial'):
    """ returns weather data from one location via onecall
    
    https://openweathermap.org/api/one-call-api 
    
    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude            
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'
        
    Returns:
        weather_dict (dict): a nested dictionary (tree) which
            contains weather data
    """
    pass
    
def get_clean_df_daily(daily_dict_list):
    """ formats daily_dict to a pandas dataframe
    
    see https://openweathermap.org/api/one-call-api for
    full daily_dict specification
    
    Args:
        daily_dict_list (list): list of dictionaries of daily
            weather features
            
    Returns:
        df_daily (pd.DataFrame): each row is weather from one
            day
    """
    pass
    
def scatter_plotly(df, feat_x, feat_y, f_html='scatter.html'):
    """ creates a plotly scatter plot, exports as html 
    
    Args:
        df (pd.DataFrame): pandas dataframe
        feat_x (str): x axis of scatter
        feat_y (str): y axis of scatter
        f_html (str): output html file
        
    Returns:
        f_html (str): output html file
    """ 
    pass

When the pipeline above is complete, the following script should plot a daily max temp scatter for Boston:

In [None]:
# this code won't work because the functions above are all empty
# inputs
feat_x = 'date'
feat_y = 'temp_max'
latlon_tuple = 42.36, -71.06  # approximate latitude, longitude of Boston
units = 'imperial'
api_key = 'd36fa352ac73226b30772f64675f41bb'

# get data
weather_dict = openweather_onecall(latlon_tuple, 
                                   units=units,
                                   api_key=api_key)

# clean weather dict (make dataframe from dict, process timestamps etc)
df_daily = get_clean_df_daily(weather_dict['daily'])

# make scatter
f_html = scatter_plotly(df_daily, feat_x=feat_x, feat_y=feat_y)

# Let's go **SLOWLY** through this solution

In [None]:
import requests
import json
from datetime import datetime
import pandas as pd
import plotly
import plotly.express as px

def openweather_onecall(latlon_tuple, api_key, units='imperial'):
    """ returns weather data from one location via onecall
    
    https://openweathermap.org/api/one-call-api 
    
    Args:
        latlon_tuple (tuple): first element is latitude,
            second is longitude
        api_key (str): API key required to access data
        units (str): 'imperial', 'standard', 'metric'
        
    Returns:
        weather_dict (dict): a nested dictionary (tree) which
            contains weather data
    """
    # build url
    lat, lon = latlon_tuple
    url = f'https://api.openweathermap.org/data/2.5/onecall?lat={lat}&lon={lon}&appid={api_key}&units={units}'
    
    # get url as a string
    url_text = requests.get(url).text
    
    # convert json to a nested dict
    weather_dict = json.loads(url_text)

    # another, perhaps cleaner option
    # weather_dict = requests.get(url).json()
    
    return weather_dict

def get_clean_df_daily(daily_dict_list):
    """ formats daily_dict to a pandas dataframe
    
    see https://openweathermap.org/api/one-call-api for
    full daily_dict specification
    
    Args:
        daily_dict_list (list): list of dictionaries of daily
            weather features
            
    Returns:
        df_daily (pd.DataFrame): each row is weather from one
            day
    """
    # format to dataframe
    df_weather = pd.DataFrame()
    for daily_dict in daily_dict_list:
        daily_series = pd.Series(dtype='object')

        # build datetime data (.fromtimestamp() assumes local time zone)
        # todo: timezone problem (left as HW exercise)
        daily_series['date'] = datetime.fromtimestamp(daily_dict['dt'])
        daily_series['sunrise'] = datetime.fromtimestamp(daily_dict['sunrise'])
        daily_series['sunset'] = datetime.fromtimestamp(daily_dict['sunset'])


        # build temp data
        temp_dict = daily_dict['temp']
        for temp_feat, temp in temp_dict.items():
            daily_series[f'temp_{temp_feat}'] = temp

        # build prob of precipitation
        # NOTE: I did confirm that the rain column appears only if there is rain forecasted in the next 48 hours
        daily_series['pop'] = daily_dict['pop']
                
        # collect row in df_weather
        df_weather = pd.concat([df_weather, daily_series.to_frame().T])
    
    return df_weather     

def scatter_plotly(df, feat_x, feat_y, f_html='scatter.html'):
    """ creates a plotly scatter plot, exports as html 
    
    Args:
        df (pd.DataFrame): pandas dataframe
        feat_x (str): x axis of scatter
        feat_y (str): y axis of scatter
        f_html (str): output html file
        
    Returns:
        f_html (str): output html file
    """
    # create scatter plot
    fig = px.scatter(df, x=feat_x, y=feat_y)

    
    # export scatter to html
    plotly.offline.plot(fig, filename=f_html)
    
    return f_html

In [None]:
# inputs
feat_x = 'date'
feat_y = 'temp_max'
latlon_tuple = 42.36, -71.06  # approximate latitude, longitude of Boston
units = 'imperial'
api_key = 'd36fa352ac73226b30772f64675f41bb'

# get data
weather_dict = openweather_onecall(latlon_tuple, 
                                   units=units,
                                   api_key=api_key)

In [None]:
# clean weather dict (make dataframe from dict, process timestamps etc)
df_daily = get_clean_df_daily(weather_dict['daily'])
df_daily

In [None]:
# make scatter
f_html = scatter_plotly(df_daily, feat_x=feat_x, feat_y=feat_y)
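
Because each step is its own function, the logic inside `get_clean_df_daily` can be tried without an API key. A minimal sketch, assuming a hand-made daily dict (the values are made up) with the same `temp` and `pop` keys used above; `flatten_daily` is a hypothetical helper which mirrors the temperature loop, it is not part of the solution:

```python
def flatten_daily(daily_dict):
    """ flattens the nested 'temp' dict of one day into a flat row

    Args:
        daily_dict (dict): weather features of one day (hand-made here)

    Returns:
        row (dict): flat dictionary with temp_* keys and 'pop'
    """
    row = {}

    # one temp_* entry per temperature feature, as in get_clean_df_daily
    for temp_feat, temp in daily_dict['temp'].items():
        row[f'temp_{temp_feat}'] = temp

    # prob of precipitation
    row['pop'] = daily_dict['pop']

    return row

# made-up values, only the structure matters
fake_daily = {'temp': {'min': 55.0, 'max': 71.5}, 'pop': 0.2}
row = flatten_daily(fake_daily)
print(row)
```

The nested `temp` dictionary becomes flat `temp_min` / `temp_max` entries, which is why `'temp_max'` can be used as `feat_y` in the scatter plot.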

## Web Scraping
* Using programs or scripts to pretend to browse websites, examine the content on those websites, and retrieve and extract data from them
* Why scrape?
    * if an API is available for a service, we will nearly always prefer the API to scraping
    * ... but not all services have APIs or the available APIs are too expensive for our project
    * newly published information might not yet be available through ready datasets
* Downsides of scraping:
    * no reference documentation (unlike APIs)
    * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper)
    * if it violates the terms of service it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)
    * legal and moral greyzone (even if the ToS does not disallow it, somebody has to pay for the traffic and when you're scraping you're not looking at ads)
    * ... but everybody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)
* Web scraping pipeline:
    * because the webpages might change their structure it's extra important to keep the crawling/extraction step separate from transformations and loading
    * ETL (Extract-Transform-Load), with an extra crawl step up front:
        * **Crawl**: open a given URL using requests and get the HTML source
        * **Extract**: extract interesting content from the webpage's source
        * **Transform**: our usual unit conversions, etc.
        * **Load**: represent the data in an easy way for storage and analysis
    * **Pro tip**: it's usually a good idea to not only store the transformed data, but also the raw HTML source - because the webpages might change and we might be late to realize we're not extracting right. If we have the original HTML source we can go back to it
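
The steps above can be sketched as four separate functions. This is only an illustration under stated assumptions: the page source `FAKE_HTML`, the URL and the file names are made up, `crawl()` is stubbed out so the sketch runs offline (a real version would return `requests.get(url).text`), and the extraction uses Python's built-in `html.parser` as a stand-in for a proper scraping library.

```python
import json
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# made-up page source, standing in for requests.get(url).text
FAKE_HTML = '<html><body><p>Boston: 68 F</p><p>Seattle: 59 F</p></body></html>'

def crawl(url):
    """ returns html source of url (stubbed here: always returns FAKE_HTML) """
    return FAKE_HTML

class ParagraphParser(HTMLParser):
    """ collects the text of every <p> tag """
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

def extract(str_html):
    """ returns a list of paragraph strings found in the html """
    parser = ParagraphParser()
    parser.feed(str_html)
    return parser.paragraphs

def transform(paragraph_list):
    """ parses 'City: 68 F' strings, converts Fahrenheit to Celsius """
    row_list = []
    for text in paragraph_list:
        city, temp_text = text.split(': ')
        temp_f = float(temp_text.split()[0])
        row_list.append({'city': city,
                         'temp_c': round((temp_f - 32) * 5 / 9, 1)})
    return row_list

def load(row_list, str_html, out_dir):
    """ stores the clean data and (pro tip) the raw html next to it """
    out_dir = Path(out_dir)
    (out_dir / 'raw.html').write_text(str_html)
    (out_dir / 'clean.json').write_text(json.dumps(row_list))

str_html = crawl('https://example.com/weather')
row_list = transform(extract(str_html))
with tempfile.TemporaryDirectory() as out_dir:
    load(row_list, str_html, out_dir)
    saved_files = sorted(path.name for path in Path(out_dir).iterdir())
print(row_list)
print(saved_files)
```

If the page layout changes, only `extract()` should need rewriting, and the stored `raw.html` lets us re-run the extraction on pages we already crawled.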
    

## Best case scenario
Some webpages publish their data in the form of simple tables. In these (rare) cases we can just use the pandas function `pd.read_html` to scrape this data:

https://www.espn.com/nba/team/stats/_/name/bos

In [None]:
import pandas as pd
# read html extracts all the <table> elements from html and returns a list of DataFrames created from them
tables = pd.read_html('https://www.espn.com/nba/team/stats/_/name/bos')
len(tables)

In [None]:
tables[0]
# tables[1]
# tables[2]
# tables[3]

In [None]:
# "glue" dataframes together (more to come on this later in the semester)
player_stats1 = pd.concat(tables[:2], axis=1)
player_stats1

In [None]:
# include the more advanced stats
player_stats2 = pd.concat([player_stats1, tables[3]], axis=1)
player_stats2

In [None]:
# baseball instead of basketball?
base_tables = pd.read_html('https://www.baseball-reference.com/teams/BOS/2022.shtml')
len(base_tables)

In [None]:
base_tables[0]
# base_tables[1]

## Messy Data

Notice that the baseball data are quite a bit messier than the basketball data. In web scraping, you are beholden to the format of the website (.html) and will almost certainly have to clean data (sometimes extensively) after scraping it.
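
A minimal sketch of what such cleaning can look like. The small table below is made up (it is not the real scraped data); it assumes two kinds of mess that are common in scraped tables: a header row repeated in the middle of the table, and numbers stored as text.

```python
import pandas as pd

# made-up "scraped" table: a repeated header row and numbers stored as text
df_raw = pd.DataFrame({'Name': ['Player A', 'Player B', 'Name', 'Player C'],
                       'HR': ['12', '7', 'HR', '']})

# drop rows which merely repeat the header
df_clean = df_raw[df_raw['Name'] != 'Name'].copy()

# convert text to numbers; anything which can't be parsed becomes NaN
df_clean['HR'] = pd.to_numeric(df_clean['HR'], errors='coerce')

# renumber the rows after dropping some of them
df_clean = df_clean.reset_index(drop=True)
print(df_clean)
```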

## Basic HTML
Web pages are written in HTML. The source of https://sapiezynski.com/ds3000/scraping/01.html looks like this:

```html
<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>
```
The keywords in `<>` brackets are called tags. They open with `<tag>` and close with `</tag>`.

In [None]:
## Getting the html content in Python
import requests

response = requests.get('https://sapiezynski.com/ds3000/scraping/01.html')
print(response.text)

In [None]:
# sometimes this doesn't quite work the way you want (c'est la vie with web scraping)
response2 = requests.get('https://www.nytimes.com/2019/03/10/style/what-is-tik-tok.html')
print(response2.text)

# Beautiful Soup

Even if the .html does look relatively clean, it's still just a big string. How can we deal with it? Luckily there is a module made for just this purpose (Beautiful Soup), and we can even install it directly in a jupyter notebook because `pip` is available as a magic command:

In [None]:
pip install bs4

In [None]:
from bs4 import BeautifulSoup

url = 'https://sapiezynski.com/ds3000/scraping/01.html' 
str_html = requests.get(url).text
soup = BeautifulSoup(str_html)

In [None]:
soup

In [None]:
## getting elements by their tag name:
soup.find_all('p')

# find_all returns a list, where each element is an instance of the specified tag

In [None]:
# each element returned by find_all is a bs4 Tag object
type(soup.find_all('p')[0])

In [None]:
for paragraph in soup.find_all('p'):
    # text is a property of each tag object
    print(paragraph.text) 
    print('------')

# `.find_all()` on subtrees of soup object


The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object.  

Consider the site at https://sapiezynski.com/ds3000/scraping/02.html:

```html
<html>
    <body>
        <p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
        
        <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>
```

**Goal**: Grab links from the first paragraph only:

In [None]:
# getting the content of the page
url = 'https://sapiezynski.com/ds3000/scraping/02.html'
response = requests.get(url)
soup = BeautifulSoup(response.text)

# finding all paragraphs:
p_all = soup.find_all('p')

In [None]:
# getting the first paragraph
p_first = p_all[0]

In [None]:
# getting the links from the first paragraph:
links_p_first = p_first.find_all('a')

print(links_p_first)
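
Printing the links shows whole `a` tags. To pull out just the URL and the visible text, a tag's attributes can be read like dictionary entries. A minimal sketch on a shortened, hand-written copy of the page (named `soup_demo` so it doesn't overwrite `soup`; the explicit `'html.parser'` argument picks Python's built-in parser):

```python
from bs4 import BeautifulSoup

str_html = """
<p>Search engines: <a href="https://duckduckgo.com">DuckDuckGo</a>,
<a href="https://google.com">Google</a></p>
<p>Browsers: <a href="https://firefox.com">Firefox</a></p>
"""
soup_demo = BeautifulSoup(str_html, 'html.parser')

# links of the first paragraph only
first_p = soup_demo.find_all('p')[0]
link_list = first_p.find_all('a')

# href is an attribute of the a tag, .text is what the reader sees
urls = [a_tag['href'] for a_tag in link_list]
names = [a_tag.text for a_tag in link_list]
print(urls)
print(names)
```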

### Some syntactic sugar: 
To get the first tag under a soup object, refer to it as an attribute

In [None]:
# is equivalent to soup.find_all('p')[0]
soup.p

In [None]:
# so we can condense our code as
plinks = soup.p.find_all('a')
print(plinks)

In [None]:
# iterating over tags
for par in soup.find_all('p'):
    print(par.a)

In [None]:
# and the first link in that paragraph can be accessed like this:
link = soup.p.a
print(link)

## Identifying if tags exist

In [None]:
# what if we're trying to access an element that doesn't exist?
header = soup.h3
print(header)

# won't work (raises an AttributeError), because header is None
# header.text

We can test if a tag exists in a soup object by looking for the first instance of this tag and comparing it to `None`

In [None]:
if soup.h3 is None:
    print("tag h3 doesn't exist in soup")
else:
    print("tag h3 does exist!")

In [None]:
if soup.p is None:
    print("tag p doesn't exist in soup")
else:
    print("tag p does exist!")
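
The same `None` check can be wrapped in a small helper so a scraper doesn't crash when a tag is missing. A minimal sketch; `get_tag_text` is a hypothetical helper (not part of bs4) and the html is hand-written:

```python
from bs4 import BeautifulSoup

def get_tag_text(soup, tag_name, default=None):
    """ returns text of first tag_name in soup, or default if missing

    Args:
        soup (BeautifulSoup): soup object (or any subtree of it)
        tag_name (str): name of tag, e.g. 'p' or 'h3'
        default: returned if the tag doesn't exist

    Returns:
        text (str): text of first matching tag (or default)
    """
    # .find() returns the first matching tag, or None if there is none
    tag = soup.find(tag_name)
    if tag is None:
        return default
    return tag.text

soup_demo = BeautifulSoup('<html><body><p>hello</p></body></html>', 'html.parser')
print(get_tag_text(soup_demo, 'p'))
print(get_tag_text(soup_demo, 'h3', default='no h3 found'))
```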

## Finding tags by `class_`

**Tip**: This is often one of the most useful ways to localize a particular part of a web page.

In [None]:
# get soup
url = 'https://www.allrecipes.com/search?q=cheese+fondue'
response = requests.get(url)
soup = BeautifulSoup(response.text)

In [None]:
soup

Our **goal** is to get a list of recipes.  Maybe we should find all the `div` tags? What about `span` tags?

In [None]:
# finding via tag ... problematic as we have too many div tags!
len(soup.find_all('div'))

In [None]:
len(soup.find_all('span'))

Tags can belong to one or more "classes".  For example, in https://www.allrecipes.com/search?q=cheese+fondue the first recipe name is wrapped in these two nested `span` tags:

    <span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
    
Each of these `span` tags belongs to one class:
- outer `span`: `card__title`
- inner `span`: `card__title-text`
    
I suspect only our target recipes belong to the `card__title-text` class.  Let's find them all:

In [None]:
recipe_list = soup.find_all(class_='card__title-text')

len(recipe_list)

In [None]:
recipe_list

In [None]:
recipe_list[1].text
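
Since the live page may change at any time, here is a minimal offline sketch of the same search on hand-written html. The second recipe name and the `badge featured` tag are made up; the last tag shows how a tag with several space-separated classes is matched.

```python
from bs4 import BeautifulSoup

str_html = """
<span class="card__title"><span class="card__title-text">Cheese Fondue</span></span>
<span class="card__title"><span class="card__title-text">Beer Fondue</span></span>
<span class="badge featured">New</span>
"""
soup_demo = BeautifulSoup(str_html, 'html.parser')

# only the inner spans belong to card__title-text
titles = [tag.text for tag in soup_demo.find_all(class_='card__title-text')]
print(titles)

# a tag with several classes matches a search for any one of them
featured = soup_demo.find_all(class_='featured')
print(featured[0]['class'])
```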

## Finding tags by `id`

Nearly the same as finding by class, but you'll look for `id=` in the html and pass it to the `id` keyword of `soup.find_all()`.

**Goal**: Get the footer from: https://www.scrapethissite.com/



```html
<section id="footer">
        <div class="container">
            <div class="row">
                <div class="col-md-12 text-center text-muted">
                    Lessons and Videos © Hartley Brody 2018
                </div><!--.col-->
            </div><!--.row-->
        </div><!--.container-->
    </section>
```

In [None]:
# get soup from url
url = 'https://www.scrapethissite.com/'
html = requests.get(url).text
soup = BeautifulSoup(html)

In [None]:
soup.find_all(id='footer')

Note that you can combine all searches shown above:
- tag
    - p (paragraph)
    - a (link)
    - div, span, ...
- tag class
- tag id

```python
# finds all links (tag type = 'a'), with given class and id
soup.find_all('a', class_='fancy-link', id='blue')

```
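
A minimal sketch of how each extra condition narrows the search, on hand-written html (the class names, ids and links are made up to match the example above):

```python
from bs4 import BeautifulSoup

str_html = """
<a class="fancy-link" id="blue" href="/one">one</a>
<a class="fancy-link" id="red" href="/two">two</a>
<a class="plain-link" href="/three">three</a>
<p class="fancy-link">not a link</p>
"""
soup_demo = BeautifulSoup(str_html, 'html.parser')

# class only: matches two links and one paragraph
by_class = soup_demo.find_all(class_='fancy-link')

# tag and class: the paragraph drops out
fancy_links = soup_demo.find_all('a', class_='fancy-link')

# tag, class and id: a single link is left
blue_links = soup_demo.find_all('a', class_='fancy-link', id='blue')

print(len(by_class), len(fancy_links), len(blue_links))
print(blue_links[0]['href'])
```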

## Practice: Rest of Class (if time, if not next time!)

**Goal:** Get a list of recipe names from www.allrecipes.com like we did for:

https://www.allrecipes.com/search?q=cheese+fondue

1. Write function `crawl_recipes(query)` which:
    * takes the search phrase (the ingredient) as input argument
    * builds the correct url that leads directly to the page that lists the recipes
    * uses `requests` to get the content of this page and returns the html text of the page
1. Write `extract_recipes(text)` which:
    * takes the text returned by `crawl_recipes` as argument
    * builds a BeautifulSoup object out of that text 
    * finds names of all recipes
        - to identify which tags / classes to `find_all()`, open the page in your browser and "inspect" 
        - start from the recipe object above, and call another `find_all()` to zoom into the recipe name itself
    * returns the list of recipe names
    

A new function that will help if you wish to query multiple words:

`string.replace()`

So, if you wish to turn `cheese fondue` into `cheese+fondue`:

`string = 'cheese fondue'`

`string.replace(" ", "+")`

In [None]:
string = 'cheese fondue'
string = string.replace(" ", "+")
string
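
`string.replace()` is all you need for this exercise. As an optional alternative, Python's built-in `urllib.parse` module can do the same conversion and also handles other special characters (like `&`). A minimal sketch; the `q` parameter name is taken from the allrecipes url above:

```python
from urllib.parse import quote_plus, urlencode

query = 'cheese fondue'

# spaces become '+', other special characters are escaped
print(quote_plus(query))
print(quote_plus('mac & cheese'))

# urlencode builds the whole 'key=value' part of the url
params = urlencode({'q': query})
url = f'https://www.allrecipes.com/search?{params}'
print(url)
```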

In [None]:
# put functions here

In [None]:
meatloaf_html = crawl_recipes('meatloaf')
new_recipe_list = extract_recipes(meatloaf_html)

In [None]:
# new_recipe_list