# STA 220 Data & Web Technologies for Data Analysis

### Lecture 10, 10/31/24, Scraping

### Announcements
- 

### Today's topics
 - Web Scraping: 
     - Foodwise
     - Tornado Watch

### Resources
 - [Foodwise](https://foodwise.org/)
 - [Tornado Watch](https://www.tornadohq.com/)

### Writing Scrapers

Let's scrape the wiki table ourselves. Attention: we are using `requests`, so pay attention to the file that is actually returned. In devtools, look for the `<thead>` element of the table, then compare it with the response shown in the Network tab.

In [1]:
import requests
import lxml.html as lx
import pandas as pd

In [2]:
result = requests.get(url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area')
html = lx.fromstring(result.text)

In [3]:
result.text[:100]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la'

In [None]:
tables = html.xpath('//table[2]')
table = tables[0]

In [None]:
table.text_content()

In [None]:
html.xpath('//*[@id="mw-content-text"]/div[1]/table[2]/thead')

In [None]:
html.xpath('//table[2]/tbody/tr[4]//text()')

In [None]:
def retrieve_rows(html): 
    rows = html.xpath('//table[2]/tbody/tr')
    cells = []
    for row in rows: 
        # './td|./th' starts at the current row (not searching the whole doc again) and selects its td OR th children
        cells.append([cell.text_content() for cell in row.xpath('./td|./th')])  # text_content(), not text(): some cell text sits inside <b>
    return cells

In [None]:
retrieve_rows(html)

In [None]:
df = pd.DataFrame(retrieve_rows(html))
df.head(10)
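
The same idea can be tried offline. The following is a minimal sketch on a hypothetical HTML string (the markup and its values are made up for illustration); it assumes that the raw response contains a `<tbody>` but no `<thead>`, as observed in the Network tab.

```python
import lxml.html as lx

# Hypothetical markup standing in for a raw server response.
raw = """
<table>
  <tbody>
    <tr><th>City</th><th>Area</th></tr>
    <tr><td><b>Alpha</b></td><td>10</td></tr>
    <tr><td>Beta</td><td>20</td></tr>
  </tbody>
</table>
"""
doc = lx.fromstring(raw)

# There is no <thead> in the raw markup, so this selects nothing.
thead = doc.xpath('//table/thead')

# './td|./th' is evaluated relative to each row; text_content() also
# collects text nested in child elements such as <b>.
rows = [
    [cell.text_content().strip() for cell in row.xpath('./td|./th')]
    for row in doc.xpath('//table/tbody/tr')
]
print(thead)
print(rows)
```

The parser only sees what the server sent, so an XPath copied from devtools can come back empty if the browser modified the page after loading it.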

### Example: Foodwise

Foodwise, formerly CUESA (Center for Urban Education about Sustainable Agriculture), provides [a chart](https://foodwise.org/eat-seasonally/seasonality-chart-vegetables/) showing when certain vegetables are in season. We want to create this chart ourselves. All the information we need is on `foodwise`, so let's scrape!

First, observe that the search mask (Food type, Month) invokes an API. However, its parameters are complicated to assemble, and the returned object is HTML anyway, so we have to scrape the HTML. Start by checking in devtools that the desired information is actually contained in the returned document (under `doc`).

In [None]:
import requests
import lxml.html as lx
import requests_cache
import time
requests_cache.install_cache("lecture10")

In [None]:
url = "https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable"

In [None]:
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

Here, the server needs the `user-agent` key in the header. 
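
To see which headers would be sent without contacting the server, a request can be prepared locally. This is only a sketch: the browser string below is a shortened placeholder, and how the server treats a particular agent is an assumption that has to be checked against the real site.

```python
import requests

url = "https://foodwise.org/eat-seasonally/seasonality-charts/"
headers = {"user-agent": "Mozilla/5.0 (hypothetical browser string)"}

# What requests announces when we do not set a user agent ourselves.
default_agent = requests.utils.default_user_agent()
print(default_agent)

# Build the request locally; nothing is sent over the network.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])  # header lookup is case-insensitive
```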

In [None]:
response = requests.get(url, headers = headers)
response.raise_for_status()

In [None]:
response.text[:100]

##### First approach

In [None]:
url = "https://foodwise.org/foods/agretti/"
response = requests.get(url)

In [None]:
response.raise_for_status()

In [None]:
response.text # works after executed chunk below, as we use cache

We have to provide the correct header! 

In [None]:
response = requests.get(url, headers = headers)
response.raise_for_status()

In [None]:
response.text[:100]

In [None]:
html = lx.fromstring(response.text) # Parse the HTML
html

In [None]:
html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]

In [None]:
string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
string

In [None]:
from re import sub
sub(r'\W', ' ', string).split() # we are going to talk about RegEx some other time
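
What the substitution does is easiest to see on a small string. The inputs below are hypothetical and only assumed to resemble the text node returned by the XPath.

```python
from re import sub

# Hypothetical text node with surrounding whitespace and commas.
string = "\n\tMarch, April, May\n"

# Every non-word character becomes a space, then split() drops the gaps.
months = sub(r'\W', ' ', string).split()
print(months)

# If a label leaks into the text, it can be removed in the same pass.
labelled = "In Season: March, April, May"
months_labelled = sub(r'(In Season)|\W', ' ', labelled).split()
print(months_labelled)
```

Both calls return the list `['March', 'April', 'May']` for these inputs.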

In [None]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
    month = sub(r'(In Season)|\W', ' ', string).split()
    return month

In [None]:
month = get_months('carrots')
month 

##### How to get the produce in the first place?

In [None]:
url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
response = requests.get(url, headers = headers)
response.raise_for_status()

In [None]:
html = lx.fromstring(response.text) # Parse the HTML
html

In [None]:
produce = html.xpath('//div[@class="card-image-title__text-content"]/h3/text()')
produce   

In [None]:
def get_produce(page):
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
    response = requests.get(url, headers = headers, params = {
        '_food_type': 'vegetable',
        '_paged': page
    })
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    produce = html.xpath('//div[@class="card-image-title__text-content"]/h3/text()')
    return produce

In [None]:
get_produce(1)
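
The `params` dictionary ends up as the query string of the URL. As a rough illustration, the standard library's `urlencode` (swapped in here for what `requests` does internally) produces the same kind of string from the keys used above; the page number 3 is just an example value.

```python
from urllib.parse import urlencode

base = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
params = {'_food_type': 'vegetable', '_paged': 3}

# Key-value pairs are joined with '&' and appended after '?'.
query = urlencode(params)
full_url = base + '?' + query
print(full_url)
```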

In [None]:
produce = [item for sublist in [get_produce(i) for i in range(1,5)] for item in sublist]
produce
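
The nested comprehension flattens a list of per-page lists. An equivalent and arguably more readable alternative is `itertools.chain.from_iterable`; the sketch below uses hypothetical page contents instead of live requests.

```python
from itertools import chain

# Hypothetical per-page results, standing in for get_produce(1), get_produce(2), ...
pages = [['Agretti', 'Artichokes'], ['Carrots'], []]

# Both expressions concatenate the sublists in order.
flat = list(chain.from_iterable(pages))
nested = [item for sublist in pages for item in sublist]
print(flat)
```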

##### Iterate over produce items

In [None]:
year = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 
        'October', 'November', 'December']

In [None]:
month = get_months('Tatsoi')

In [None]:
month

In [None]:
d = {'Produce': 'Tatsoi'}
d.update({item: True if item in month else False for item in year})
d

In [None]:
def assemble_row(produce): 
    print(produce)
    month = get_months(produce)
    d = {'Produce': produce}
    d.update({item: True if item in month else False for item in year})
    return d

In [None]:
assemble_row("Tatsoi")

In [None]:
import pandas as pd

In [None]:
pd.DataFrame([assemble_row(p) for p in produce[:3]])

In [None]:
produce[:3]

In [None]:
df = [assemble_row(i) for i in produce] # runs for 45 secs

In [None]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    try: 
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
        month = sub(r'(In Season)|\W', ' ', string).split() 
    except IndexError:  # fewer than two text nodes: no "In Season" entry found
        month = []
    return month

In [None]:
[assemble_row(i) for i in produce]

Try to catch the error, or check what happened! 
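
One way to check what happened is to catch the specific exception rather than everything. The sketch below builds `Response` objects by hand so that it runs without network access; `status_or_none` is a hypothetical helper, not part of `requests`.

```python
import requests

def status_or_none(response):
    """Return the status code, or None if the response signals an HTTP error."""
    try:
        response.raise_for_status()  # note the parentheses: this is a call
    except requests.HTTPError as err:
        print("request failed:", err)
        return None
    return response.status_code

# Responses built by hand so that nothing is sent over the network.
ok, missing = requests.Response(), requests.Response()
ok.status_code = 200
missing.status_code = 404

print(status_or_none(ok))
print(status_or_none(missing))
```

Catching only `requests.HTTPError` keeps other problems, such as a typo in the XPath, visible instead of silently turning them into an empty result.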

In [None]:
# Not run! 
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    try:
        response.raise_for_status()
    except requests.HTTPError:
        return []  # page missing or request rejected: no months
    html = lx.fromstring(response.text)
    try:
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
    except IndexError:
        return []  # no "In Season" entry on this page
    return sub(r'(In Season)|\W', ' ', string).split()

The displayed produce names do not always match the URL of their page, so we have to work with the actual links instead. Retrieve the `href` attribute from the anchor.
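
A small offline sketch of the idea, on hypothetical listing markup: only the `article` class name and the two URLs are taken from this lecture, while the inner structure and the displayed names are assumptions.

```python
import lxml.html as lx

# Hypothetical listing markup for two produce cards.
listing = """
<div>
  <article class="card-image-title__container">
    <a href="https://foodwise.org/foods/peppers-chile/"><span>Peppers, Chile</span></a>
  </article>
  <article class="card-image-title__container">
    <a href="https://foodwise.org/foods/nettles/"><span>Nettles</span></a>
  </article>
</div>
"""
doc = lx.fromstring(listing)

# '/@href' returns the attribute values as plain strings.
links = doc.xpath('//article[@class="card-image-title__container"]/a/@href')

# The last path segment is the slug, which need not equal the displayed name.
slugs = [link.rstrip('/').rsplit('/', 1)[-1] for link in links]
print(links)
print(slugs)
```

Building the URL from a displayed name such as "Peppers, Chile" would not give `peppers-chile`, which is why following the `href` is the safer route.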

In [None]:
url = 'https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable&_paged=3' #try page 3,4
response = requests.get(url, headers = headers)
response.raise_for_status()
html = lx.fromstring(response.text) # Parse the HTML
produce = html.xpath('//article[@class="card-image-title__container"]/a/@href') #returns href attribute of anchor link
produce

In [None]:
def get_url(i):
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable&_paged=' + str(i)
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    #returns href attribute of anchor link
    produce_link = html.xpath('//article[@class="card-image-title__container"]/a/@href') 
    return produce_link

In [None]:
produce_links = [item for sublist in [get_url(i) for i in range(1,5)] for item in sublist]
produce_links

Let's find the (new) produce name on its own page.

In [None]:
result = requests.get('https://foodwise.org/foods/peppers-chile/', headers = headers)
result.raise_for_status()

In [None]:
html = lx.fromstring(result.text)

In [None]:
html.xpath("//h1/text()")[0]

In [None]:
def get_months(produce_link): 
    time.sleep(0.05)
    response = requests.get(produce_link, headers = headers)
    html = lx.fromstring(response.text)
    try: 
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
    except IndexError:  # no "In Season" text on this page
        return [None, []]
    else: 
        month = sub(r'(In Season)|\W', ' ', string).split() 
        name = html.xpath("//h1/text()")[0]
        return [name, month]

In [None]:
get_months('https://foodwise.org/foods/nettles/')

In [None]:
def assemble_row(produce_link): 
    name, month = get_months(produce_link)
    d = {'Produce': name}
    d.update({item: True if item in month else False for item in year})
    return d

In [None]:
assemble_row('https://foodwise.org/foods/nettles/')

In [None]:
df = [assemble_row(i) for i in produce_links] 
df

In [None]:
tbl = pd.DataFrame(df)
tbl.shape

In [None]:
tbl.head()

In [None]:
tbl.set_index("Produce")
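
Two details are easy to miss here: `set_index` returns a new data frame rather than changing `tbl`, and pages that could not be parsed leave a row whose name is `None`. A minimal sketch with hypothetical rows (shortened to three months, values made up):

```python
import pandas as pd

# Hypothetical rows in the shape produced by assemble_row();
# the second one mimics a page where nothing could be parsed.
rows = [
    {'Produce': 'Nettles', 'January': True, 'February': True, 'March': False},
    {'Produce': None, 'January': False, 'February': False, 'March': False},
]

tbl = pd.DataFrame(rows)

# Drop the unparsed rows, then use the name as index; tbl itself is unchanged.
clean = tbl.dropna(subset=['Produce']).set_index('Produce')
print(clean)
```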

### Tornado Watch 

We are interested in scraping and plotting the locations of all tornado warnings in the last 48 hours. 

In [None]:
import requests
import lxml.html as lx
import time
import pandas as pd

In [None]:
result = requests.get('https://www.tornadohq.com/')
result.raise_for_status()

In [None]:
html = lx.fromstring(result.text) # Parse the HTML

In [None]:
warnings = html.xpath('//pre')
warnings

In [None]:
warning = warnings[0].text
warning

Let's match the latitude-longitude pair after `LAT...LON`.

In [None]:
from re import findall

In [None]:
findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', warning)[0].split()  # raw string, so the backslashes reach the regex engine

Convert the coordinates into a readable format.

In [None]:
coord_list = [findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', warning.text)[0].split() for warning in warnings]

In [None]:
coord = pd.DataFrame(coord_list)
coord.columns = ['N', 'W']
coord = coord.astype(float) / 100 # convert to decimal degrees (applymap is deprecated in recent pandas)
coord['W'] = -coord['W'] # longitude to west is negative
coord.head()
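
The matching and the conversion can be checked on a single string first. The warning text below is hypothetical; only the `LAT...LON` layout and the division by 100 are taken from the steps above.

```python
from re import findall

# Hypothetical warning text with two coordinate pairs after the marker.
warning = "LAT...LON 3568 9712 3560 9700"

# The lookbehind anchors the match directly after 'LAT...LON ',
# so only the first pair is returned.
pattern = r'(?<=LAT\.{3}LON\s)(\d+\s\d+)'
first_pair = findall(pattern, warning)[0].split()

lat = float(first_pair[0]) / 100   # degrees north
lon = -float(first_pair[1]) / 100  # degrees west, hence negative
print(first_pair, lat, lon)
```

For this input the pair is `['3568', '9712']`, which becomes latitude 35.68 and longitude -97.12.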

Plot the results (you may need a [mapbox token](https://studio.mapbox.com/) for this).

In [None]:
import plotly.express as px
import geopandas as gpd

px.set_mapbox_access_token(open("./../keys/mapbox.txt").read().strip())  # strip a trailing newline from the key file
fig = px.scatter_mapbox(coord,
                        lat='N',
                        lon='W',
                        zoom=4)
fig.show()

### Summary 

- Scraping does not necessarily return the desired result, so make use of error handling
- Use devtools to see how the website is structured and what the server actually returns