# STA 141B Data & Web Technologies for Data Analysis

### Lecture 11, 11/3/25, Scraping

### Today's topics
 - Web Scraping: 
     - Foodwise
     - Tornado Watch

### Resources
 - [Foodwise](https://foodwise.org/)
 - [Tornado Watch](https://www.tornadohq.com/)

### Writing Scrapers

Let's scrape the wiki table ourselves. Attention: we are using `requests`, so pay attention to the file that is actually returned. In devtools, check the HTML element `<thead>` and compare it with what is returned in the Network tab.

In [2]:
import requests
import lxml.html as lx
import pandas as pd

In [3]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
result = requests.get(url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area', headers = headers)
result.raise_for_status()
html = lx.fromstring(result.text)

In [4]:
result.text[:100]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la'

In [5]:
tables = html.xpath('//table')
table = tables[0]

In [6]:
table

<Element table at 0x1206649f0>

In [7]:
table.text_content()

'Population tablesof U.S. citiesThe skyline of New York City, the most populous city in the United States\nCities\nPopulationAreaDensityEthnic identityForeign-bornIncomeSpanish speakersCapitalsBy decadeBy stateBy decade/state\n\nUrban areas\nPopulous cities and metropolitan areas\n\nMetropolitan areas\n184 combined statistical areas935 core-based statistical areas393 metropolitan statistical areas542 micropolitan statistical areas\n\nMegaregions\nRelated population listsNorth American metro areasWorld citiesStates and territories.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spa

In [8]:
html.xpath('//*[@id="mw-content-text"]/div[1]/table[2]/thead')

[]

In [9]:
html.xpath('//table[2]/tbody/tr[4]//text()')

['\n',
 'Juneau',
 '\n',
 'AK',
 '\n',
 '2,702.9\n',
 '\n',
 '7,000',
 '\n',
 '555.1\n',
 '\n',
 '1,438',
 '\n',
 '3,258.0\n',
 '\n',
 '8,438',
 '\n',
 '32,255\n']

In [10]:
from re import sub 
def remove(string):
    '''
    Removes everything inside [], a whitespace before that and *'s.
    '''
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**|\n|,|\*', '', string)
        # \s* matches any leading whitespace (incl. space and newline), followed by any text between square brackets and any trailing *'s; OR just \n; OR just a comma; OR just a *
        # * means zero or more occurrences, . means any character (except a newline)
        # this aims to remove the [a]* after Tribune and the \n in the columns
    return string
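
To see what the pattern does, the function can be tried on a few cell values of the kind that appear in the table. The sketch below is self-contained; the sample strings are made up for illustration, and `remove_per_bracket` is a hypothetical variant (not part of the lecture code) that matches each bracket group on its own instead of letting `.*` run greedily to the last `]`.

```python
from re import sub

def remove(string):
    # same pattern as above: bracketed notes, newlines, commas and asterisks
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**|\n|,|\*', '', string)
    return string

def remove_per_bracket(string):
    # hypothetical variant: [^\]]* cannot run past the first closing bracket
    if isinstance(string, str):
        string = sub(r'\s*\[[^\]]*\]\**|\n|,|\*', '', string)
    return string

samples = ['Tribune[a]*\n', '2,702.9\n', 'Juneau', 1438]
cleaned = [remove(s) for s in samples]
print(cleaned)  # non-strings such as 1438 pass through unchanged

two_notes = 'Butte[c] Silver Bow[d]'
greedy_result = remove(two_notes)                   # .* spans from the first [ to the last ]
per_bracket_result = remove_per_bracket(two_notes)  # each [..] group is removed separately
print(greedy_result, '|', per_bracket_result)
```

For cells with a single footnote marker both versions behave the same; the difference only shows up when one cell carries two bracketed notes.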

In [11]:
def retrieve_rows(html): 
    rows = html.xpath('//table[2]/tbody/tr')
    cells = []
    for row in rows: 
        # ./td|./th starts at the current row (not searching the whole doc again) and selects its td OR th children
        cells.append([remove(cell.text_content()) for cell in row.xpath('./td|./th')]) # text_content() rather than text, as some cells are wrapped in <b>
    return cells

In [12]:
retrieve_rows(html)

[['City', 'ST', 'Land area', 'Water area', 'Total area', 'Population(2020)'],
 ['(mi2)', '(km2)', '(mi2)', '(km2)', '(mi2)', '(km2)'],
 ['Sitka',
  'AK',
  '2870.2',
  '7434',
  '1904.3',
  '4932',
  '4774.5',
  '12366',
  '8458'],
 ['Juneau',
  'AK',
  '2702.9',
  '7000',
  '555.1',
  '1438',
  '3258.0',
  '8438',
  '32255'],
 ['Wrangell',
  'AK',
  '2556.1',
  '6620',
  '915.0',
  '2370',
  '3471.1',
  '8990',
  '2127'],
 ['Anchorage',
  'AK',
  '1706.8',
  '4421',
  '237.7',
  '616',
  '1944.5',
  '5036',
  '291247'],
 ['Tribune', 'KS', '778.2', '2016', '0', '0', '778.2', '2016', '1182'],
 ['Jacksonville',
  'FL',
  '747.3',
  '1935',
  '127.1',
  '329',
  '874.5',
  '2265',
  '949611'],
 ['Anaconda ', 'MT', '736.7', '1908', '4.7', '12', '741.4', '1920', '9421'],
 ['Butte ', 'MT', '715.8', '1854', '0.6', '1.6', '716.3', '1855', '34494'],
 ['Houston', 'TX', '640.8', '1660', '31.2', '81', '672.0', '1740', '2304580'],
 ['Oklahoma City',
  'OK',
  '607.0',
  '1572',
  '14.3',
  '37',
  

In [13]:
df = pd.DataFrame(retrieve_rows(html))
df.head(10)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | City | ST | Land area | Water area | Total area | Population(2020) | | | |
| 1 | (mi2) | (km2) | (mi2) | (km2) | (mi2) | (km2) | | | |
| 2 | Sitka | AK | 2870.2 | 7434 | 1904.3 | 4932 | 4774.5 | 12366.0 | 8458.0 |
| 3 | Juneau | AK | 2702.9 | 7000 | 555.1 | 1438 | 3258.0 | 8438.0 | 32255.0 |
| 4 | Wrangell | AK | 2556.1 | 6620 | 915.0 | 2370 | 3471.1 | 8990.0 | 2127.0 |
| 5 | Anchorage | AK | 1706.8 | 4421 | 237.7 | 616 | 1944.5 | 5036.0 | 291247.0 |
| 6 | Tribune | KS | 778.2 | 2016 | 0 | 0 | 778.2 | 2016.0 | 1182.0 |
| 7 | Jacksonville | FL | 747.3 | 1935 | 127.1 | 329 | 874.5 | 2265.0 | 949611.0 |
| 8 | Anaconda | MT | 736.7 | 1908 | 4.7 | 12 | 741.4 | 1920.0 | 9421.0 |
| 9 | Butte | MT | 715.8 | 1854 | 0.6 | 1.6 | 716.3 | 1855.0 | 34494.0 |


In [14]:
df.columns = df.iloc[0]

In [15]:
df = df.iloc[2:]

In [16]:
df

|   | City | ST | Land area | Water area | Total area | Population(2020) | NaN | NaN.1 | NaN.2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | Sitka | AK | 2870.2 | 7434 | 1904.3 | 4932 | 4774.5 | 12366 | 8458 |
| 3 | Juneau | AK | 2702.9 | 7000 | 555.1 | 1438 | 3258.0 | 8438 | 32255 |
| 4 | Wrangell | AK | 2556.1 | 6620 | 915.0 | 2370 | 3471.1 | 8990 | 2127 |
| 5 | Anchorage | AK | 1706.8 | 4421 | 237.7 | 616 | 1944.5 | 5036 | 291247 |
| 6 | Tribune | KS | 778.2 | 2016 | 0 | 0 | 778.2 | 2016 | 1182 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 147 | Toledo | OH | 80.5 | 208 | 3.3 | 8.5 | 83.8 | 217 | 270871 |
| 148 | Jonesboro | AR | 80.2 | 208 | 0.6 | 1.6 | 80.7 | 209 | 78576 |
| 149 | El Reno | OK | 79.6 | 206 | 0.6 | 1.6 | 80.2 | 208 | 16989 |
| 150 | Ellsworth | ME | 79.3 | 205 | 14.6 | 38 | 93.9 | 243 | 8399 |
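
The `NaN` column names above arise because the table has a two-row header: the first row holds six labels, while each data row has nine cells. One way to obtain nine usable names is to combine the two header rows. The sketch below is plain Python and assumes, judging from the output, that the three area columns each span a mi2 and a km2 sub-column; `spanning` and `columns` are names introduced here for illustration only.

```python
header = ['City', 'ST', 'Land area', 'Water area', 'Total area', 'Population(2020)']
units = ['(mi2)', '(km2)', '(mi2)', '(km2)', '(mi2)', '(km2)']

# assumption: these three labels each cover two unit columns
spanning = {'Land area', 'Water area', 'Total area'}

unit_iter = iter(units)
columns = []
for label in header:
    if label in spanning:
        # consume two units per spanning label
        columns.append(f'{label} {next(unit_iter)}')
        columns.append(f'{label} {next(unit_iter)}')
    else:
        columns.append(label)

print(len(columns))
print(columns)
```

The resulting list could then be passed as `columns=` when building the DataFrame from the rows that follow the two header rows.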


### Example: Foodwise

Foodwise, formerly CUESA (Center for Urban Education about Sustainable Agriculture), provides [a chart](https://foodwise.org/eat-seasonally/seasonality-chart-vegetables/) showing when certain vegetables are in season. We want to create this chart ourselves. All the info we need is on `foodwise`, so let's scrape!

First, observe that the search mask (Food type, Month) invokes an API. However, the parameters are complicated to assemble, and the returned object is HTML, so we have to scrape the HTML anyway. Using devtools, first check that the desired information is returned by the API (under `doc`).

In [17]:
import requests
import lxml.html as lx
import requests_cache
import time
requests_cache.install_cache("../output/lecture9")

ModuleNotFoundError: No module named 'requests_cache'

In [18]:
url = "https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable"

In [19]:
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

Here, the server needs the `user-agent` key in the header. 

##### First approach

In [20]:
response = requests.get(url, headers=headers)
response.raise_for_status()

In [21]:
response.text[:100]

'<!doctype html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="w'

In [22]:
url = "https://foodwise.org/foods/corn/"
response = requests.get(url, headers=headers)

We have to provide the correct header! 

In [23]:
response.raise_for_status()

In [24]:
response.text # works after executed chunk below, as we use cache

'<!doctype html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<link rel="profile" href="https://gmpg.org/xfn/11">\n\n\t<!-- refine, and properly optimize and include (using enqueue_script) the following files before production launch -->\n\t<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/all.min.css"\n\t      integrity="sha512-1ycn6IcaQQ40/MKBW2W4Rhis/DbILU74C1vSrLJxCq57o941Ym01SwNsOMqvEBFlcgUa6xLiPY/NS5R+E6ztJQ=="\n\t      crossorigin="anonymous" referrerpolicy="no-referrer"/>\n\n\t<link rel="preconnect" href="https://fonts.googleapis.com">\n\t<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>\n\t<link\n\t\thref="https://fonts.googleapis.com/css2?family=Source+Sans+Pro:ital,wght@0,200;0,300;0,400;0,600;0,700;0,900;1,200;1,300;1,400;1,600;1,700;1,900&family=Waterfall&display=swap"\n\t\trel="stylesheet">\n\n\n\t<meta name=\'robots\' content=\

In [25]:
response = requests.get(url, headers = headers)
response.raise_for_status() # the parentheses matter: without them the method is only referenced, never called

In [26]:
response.text[:100]

'<!doctype html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="w'

In [None]:
# requests_cache.install_cache("../output/lecture10")

In [27]:
response.url

'https://foodwise.org/foods/corn/'

Find the table 'In Season' from the HTML. (Use Inspect!)

In [28]:
html = lx.fromstring(response.text) # Parse the HTML
html

<Element html at 0x120cdf230>

In [29]:
html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]

'\n                    June • July • August • September • October            '

In [30]:
string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
string

'\n                    June • July • August • September • October            '

In [31]:
from re import sub
st = sub(r'\W', ' ', string)
st

'                     June   July   August   September   October            '

In [32]:
sub(r'\W', ' ', st).split() # recall regex: \W is any non-word character, i.e. anything except letters, digits and the underscore. We replace these with spaces and then split on whitespace.

['June', 'July', 'August', 'September', 'October']
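
As a cross-check, the same list can be obtained by matching runs of word characters directly instead of replacing everything else. A minimal sketch using the string from above; `via_sub` and `via_findall` are illustrative names.

```python
from re import sub, findall

string = '\n                    June • July • August • September • October            '

via_sub = sub(r'\W', ' ', string).split()   # replace non-word characters, then split on whitespace
via_findall = findall(r'\w+', string)       # or: collect runs of word characters directly
print(via_sub)
print(via_findall)
```

Both approaches treat the bullet character and the surrounding whitespace as separators, so they agree on this input.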

In [35]:
import time
def get_months(product): 
    time.sleep(0.1)
    url = "https://foodwise.org/foods/" + product + "/"
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    try: # the months sit in the text node at index 1 of the "In Season" section
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
        month = sub(r'(In Season)|\W', ' ', string).split() # remove (In Season) or any non-alphanumeric content
    except IndexError: # the page has no "In Season" section
        month = []
    
    return month

In [36]:
month = get_months('corn')
month 

['June', 'July', 'August', 'September', 'October']

##### How to get the product in the first place? 

Visit https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable
and use Inspect.

In [37]:
url = 'https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable'
response = requests.get(url, headers = headers)
response.raise_for_status()

In [38]:
html = lx.fromstring(response.text) # Parse the HTML
html

<Element html at 0x120cde390>

In [39]:
produce = html.xpath('//div[@class="card-image-title__text-content"]/h3/text()')
produce   

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock',
 'Cabbage',
 'Cactus pads',
 'Cardoons',
 'Carrots',
 'Cauliflower',
 'Celeriac',
 'Celery',
 'Celtuce',
 'Chard',
 'Chickweed']

In [None]:
# [i.text for i in produce]
# N

These are only the very first entries. Click on next page.

In [40]:
def get_produce(page):
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
    response = requests.get(url, headers = headers, params = {
        '_food_type': 'vegetable',
        '_paged': page
    })
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    products = html.xpath('//div[@class="card-image-title__text-content"]/h3/text()')
    return products
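
Passing a `params` dictionary lets `requests` build the query string for us. To see what the resulting URL should look like, the standard library's `urllib.parse.urlencode` can serve as a stand-in; this is only a sketch of the encoding step, not a description of what `requests` does internally.

```python
from urllib.parse import urlencode

base = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
params = {'_food_type': 'vegetable', '_paged': 2}

query = urlencode(params)        # key=value pairs joined by &
full_url = base + '?' + query
print(full_url)
```

With `requests`, the URL that was actually sent can be inspected afterwards via `response.url`.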

In [41]:
get_produce(2)

['Chicory',
 'Collard greens',
 'Corn',
 'Cress',
 'Cresta di Gallo',
 'Cucumbers',
 'Dandelion greens',
 'Eggplant',
 'Endive',
 'Fava beans',
 'Fava greens',
 'Fennel',
 'Garlic',
 'Ginger root',
 'Green beans',
 'Herbs',
 'Horseradish',
 'Jicama',
 'Kale',
 'Kohlrabi']

There are four pages in total.

In [42]:
url = 'https://foodwise.org/eat-seasonally/seasonality-charts'
response = requests.get(url, headers = headers, params = {
    '_food_type': 'vegetable'
})
response.raise_for_status()
html = lx.fromstring(response.text) # Parse the HTML
pages = html.xpath('//div[@class="facetwp-facet facetwp-facet-query_pager facetwp-type-pager"]')
pages[0].text_content()

''

In [43]:
lst = [get_produce(i) for i in range(1,5)]
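
Hard-coding `range(1,5)` works because we counted four pages by hand, and the pager element above came back empty. A more general pattern is to keep requesting pages until one returns no items. The sketch below assumes that a page past the end yields an empty list (not verified for this site) and uses a stand-in `fake_get_produce` with made-up data so that it runs offline; `collect_all` and `max_pages` are hypothetical names.

```python
def collect_all(get_page, max_pages=50):
    """Call get_page(1), get_page(2), ... until a page comes back empty."""
    items = []
    for page in range(1, max_pages + 1):   # max_pages guards against endless loops
        batch = get_page(page)
        if not batch:
            break
        items.extend(batch)
    return items

# stand-in for get_produce: three "pages" of made-up data
fake_pages = {1: ['Artichokes', 'Arugula'], 2: ['Chicory', 'Corn'], 3: ['Salsify']}

def fake_get_produce(page):
    return fake_pages.get(page, [])

all_items = collect_all(fake_get_produce)
print(all_items)
```

In the notebook, `collect_all(get_produce)` would be the corresponding call, provided the assumption about empty pages holds.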

In [None]:
lst

[['Artichokes',
  'Arugula',
  'Asparagus',
  'Beets',
  'Bitter melon',
  'Bok choy',
  'Broccoli',
  'Broccoli rabe',
  'Brussels sprouts',
  'Burdock',
  'Cabbage',
  'Cactus pads',
  'Cardoons',
  'Carrots',
  'Cauliflower',
  'Celeriac',
  'Celery',
  'Celtuce',
  'Chard',
  'Chickweed'],
 ['Chicory',
  'Collard greens',
  'Corn',
  'Cress',
  'Cresta di Gallo',
  'Cucumbers',
  'Dandelion greens',
  'Eggplant',
  'Endive',
  'Fava beans',
  'Fava greens',
  'Fennel',
  'Garlic',
  'Ginger root',
  'Green beans',
  'Herbs',
  'Horseradish',
  'Jicama',
  'Kale',
  'Kohlrabi'],
 ['Komatsuna',
  'Lambsquarters',
  'Leeks',
  'Lettuce',
  'Mushrooms',
  'Mustard greens',
  'Nettles',
  'Okra',
  'Onions',
  'Orach',
  'Parsnips',
  'Pea shoots',
  'Peas',
  'Peppers, chile',
  'Peppers, sweet',
  'Potatoes',
  'Purslane',
  'Radishes',
  'Romanesco',
  'Rutabagas'],
 ['Salsify',
  'Scallions',
  'Shallots',
  'Shelling beans',
  'Spinach',
  'Sprouts',
  'Squash, summer',
  'Squash, 

In [44]:
produce = [item for pages in [get_produce(i) for i in range(1,5)] for item in pages]
produce

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock',
 'Cabbage',
 'Cactus pads',
 'Cardoons',
 'Carrots',
 'Cauliflower',
 'Celeriac',
 'Celery',
 'Celtuce',
 'Chard',
 'Chickweed',
 'Chicory',
 'Collard greens',
 'Corn',
 'Cress',
 'Cresta di Gallo',
 'Cucumbers',
 'Dandelion greens',
 'Eggplant',
 'Endive',
 'Fava beans',
 'Fava greens',
 'Fennel',
 'Garlic',
 'Ginger root',
 'Green beans',
 'Herbs',
 'Horseradish',
 'Jicama',
 'Kale',
 'Kohlrabi',
 'Komatsuna',
 'Lambsquarters',
 'Leeks',
 'Lettuce',
 'Mushrooms',
 'Mustard greens',
 'Nettles',
 'Okra',
 'Onions',
 'Orach',
 'Parsnips',
 'Pea shoots',
 'Peas',
 'Peppers, chile',
 'Peppers, sweet',
 'Potatoes',
 'Purslane',
 'Radishes',
 'Romanesco',
 'Rutabagas',
 'Salsify',
 'Scallions',
 'Shallots',
 'Shelling beans',
 'Spinach',
 'Sprouts',
 'Squash, summer',
 'Squash, winter',
 'Sunchokes',
 'Sweet potatoes',
 'Taro root',
 'Tatsoi',
 'To

##### Iterate over produce items

In [45]:
seasonality_info = [get_months(p) for p in produce]

HTTPError: 404 Client Error: Not Found for url: https://foodwise.org/foods/Peppers,%20chile/

In [46]:
url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
response = requests.get(url, headers = headers, params = {
    '_food_type': 'vegetable',
    '_paged': 3
})
response.raise_for_status()
html = lx.fromstring(response.text) # Parse the HTML


In [47]:
products = html.xpath('//a[@class="card-image-title__outer-link"]/@href')

In [48]:
products

['https://foodwise.org/foods/komatsuna/',
 'https://foodwise.org/foods/lambsquarters/',
 'https://foodwise.org/foods/leeks/',
 'https://foodwise.org/foods/lettuce/',
 'https://foodwise.org/foods/mushrooms/',
 'https://foodwise.org/foods/mustard-greens/',
 'https://foodwise.org/foods/nettles/',
 'https://foodwise.org/foods/okra/',
 'https://foodwise.org/foods/onions/',
 'https://foodwise.org/foods/orach/',
 'https://foodwise.org/foods/parsnips/',
 'https://foodwise.org/foods/pea-shoots/',
 'https://foodwise.org/foods/peas/',
 'https://foodwise.org/foods/peppers-chile/',
 'https://foodwise.org/foods/peppers-sweet/',
 'https://foodwise.org/foods/potatoes/',
 'https://foodwise.org/foods/purslane/',
 'https://foodwise.org/foods/radishes/',
 'https://foodwise.org/foods/romanesco/',
 'https://foodwise.org/foods/rutabagas/']
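
The 404 error above comes from building the URL out of the display name ('Peppers, chile'), whereas the site uses the slug `peppers-chile`. Since the `href` values already contain the slug, it can be read off with the standard library. In the sketch below, `slug_from_href` and `guess_slug` are hypothetical helpers; `guess_slug` only encodes a guess about how the site derives its slugs, so the `href` remains the reliable source.

```python
import re
from urllib.parse import urlparse

def slug_from_href(href):
    # last non-empty path component, e.g. '/foods/peppers-chile/' -> 'peppers-chile'
    return urlparse(href).path.strip('/').split('/')[-1]

def guess_slug(name):
    # guess: lowercase, every run of other characters becomes a single hyphen
    return re.sub(r'[^a-z0-9]+', '-', name.lower()).strip('-')

href = 'https://foodwise.org/foods/peppers-chile/'
slug = slug_from_href(href)
guessed = guess_slug('Peppers, chile')
print(slug, guessed)
```

The guess happens to agree here, but there is no guarantee that it does for every product, which is why the lecture switches to the links themselves.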

In [None]:
products[0].text

'\n\t\t\t\t\t\t\t\t'

In [49]:
def get_products(page):
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
    response = requests.get(url, headers = headers, params = {
        '_food_type': 'vegetable',
        '_paged': page
    })
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    return html.xpath('//a[@class="card-image-title__outer-link"]/@href')

In [50]:
get_products(1)

['https://foodwise.org/foods/artichokes/',
 'https://foodwise.org/foods/arugula/',
 'https://foodwise.org/foods/asparagus/',
 'https://foodwise.org/foods/beets/',
 'https://foodwise.org/foods/bitter-melon/',
 'https://foodwise.org/foods/bok-choy/',
 'https://foodwise.org/foods/broccoli/',
 'https://foodwise.org/foods/broccoli-rabe/',
 'https://foodwise.org/foods/brussels-sprouts/',
 'https://foodwise.org/foods/burdock/',
 'https://foodwise.org/foods/cabbage/',
 'https://foodwise.org/foods/cactus-pads/',
 'https://foodwise.org/foods/cardoons/',
 'https://foodwise.org/foods/carrots/',
 'https://foodwise.org/foods/cauliflower/',
 'https://foodwise.org/foods/celeriac/',
 'https://foodwise.org/foods/celery/',
 'https://foodwise.org/foods/celtuce/',
 'https://foodwise.org/foods/chard/',
 'https://foodwise.org/foods/chickweed/']

In [51]:
lst = [el for p in range(1,5) for el in get_products(p)]

In [52]:
lst

['https://foodwise.org/foods/artichokes/',
 'https://foodwise.org/foods/arugula/',
 'https://foodwise.org/foods/asparagus/',
 'https://foodwise.org/foods/beets/',
 'https://foodwise.org/foods/bitter-melon/',
 'https://foodwise.org/foods/bok-choy/',
 'https://foodwise.org/foods/broccoli/',
 'https://foodwise.org/foods/broccoli-rabe/',
 'https://foodwise.org/foods/brussels-sprouts/',
 'https://foodwise.org/foods/burdock/',
 'https://foodwise.org/foods/cabbage/',
 'https://foodwise.org/foods/cactus-pads/',
 'https://foodwise.org/foods/cardoons/',
 'https://foodwise.org/foods/carrots/',
 'https://foodwise.org/foods/cauliflower/',
 'https://foodwise.org/foods/celeriac/',
 'https://foodwise.org/foods/celery/',
 'https://foodwise.org/foods/celtuce/',
 'https://foodwise.org/foods/chard/',
 'https://foodwise.org/foods/chickweed/',
 'https://foodwise.org/foods/chicory/',
 'https://foodwise.org/foods/collard-greens/',
 'https://foodwise.org/foods/corn/',
 'https://foodwise.org/foods/cress/',
 'ht

In [53]:
def get_months(url): 
    time.sleep(0.1)
#    url = "https://foodwise.org/foods/" + product + "/"
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text)
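    try: # the months sit in the text node at index 1 of the "In Season" section
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
        month = sub(r'(In Season)|\W', ' ', string).split() # remove (In Season) or any non-alphanumeric content
    except IndexError: # the page has no "In Season" section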
        month = []
    
    return month

In [54]:
seasonality_info = [get_months(url) for url in lst]

In [None]:
seasonality_info

[['March',
  'April',
  'May',
  'June',
  'September',
  'October',
  'November',
  'December'],
 ['January',
  'February',
  'March',
  'April',
  'May',
  'June',
  'July',
  'August',
  'September',
  'October',
  'November',
  'December'],
 ['February', 'March', 'April', 'May', 'June'],
 ['January',
  'February',
  'March',
  'April',
  'May',
  'June',
  'July',
  'August',
  'September',
  'October',
  'November',
  'December'],
 ['June', 'July', 'August', 'September', 'October', 'November'],
 ['January',
  'February',
  'March',
  'April',
  'May',
  'June',
  'July',
  'August',
  'September',
  'October',
  'November',
  'December'],
 ['January',
  'February',
  'March',
  'April',
  'May',
  'June',
  'July',
  'August',
  'September',
  'October',
  'November',
  'December'],
 ['January',
  'February',
  'March',
  'April',
  'May',
  'June',
  'September',
  'October',
  'November',
  'December'],
 ['January',
  'February',
  'March',
  'April',
  'May',
  'September',
  '

In [None]:
year = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 
        'October', 'November', 'December']

In [None]:
month = get_months('potatoes')

In [None]:
month

['June', 'July', 'August']

In [None]:
[item in month for item in year]

[False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 False]
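
The list of booleans is easier to read when paired with the month names. A small sketch, using the `year` list and the months shown above; `flags` and `in_season` are illustrative names.

```python
year = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August',
        'September', 'October', 'November', 'December']
month = ['June', 'July', 'August']

flags = [item in month for item in year]   # one boolean per month, in calendar order
in_season = dict(zip(year, flags))         # month name -> True/False
print(in_season['July'], in_season['December'], sum(flags))
```

Because `year` fixes the order, every product ends up with a row of the same length, which is what the table needs later on.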

In [None]:
def assemble_row(produce): 
    months = get_months(produce)
    months = [item in months for item in year]
    months.insert(0, produce)
    return months

In [None]:
assemble_row('potatoes')

['potatoes',
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 False]

In [None]:
produce[:10]

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock']

In [None]:
produce

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock',
 'Cabbage',
 'Cactus pads',
 'Cardoons',
 'Carrots',
 'Cauliflower',
 'Celeriac',
 'Celery',
 'Celtuce',
 'Chard',
 'Chickweed',
 'Chicory',
 'Collard greens',
 'Corn',
 'Cress',
 'Cresta di Gallo',
 'Cucumbers',
 'Dandelion greens',
 'Eggplant',
 'Endive',
 'Fava beans',
 'Fava greens',
 'Fennel',
 'Garlic',
 'Ginger root',
 'Green beans',
 'Herbs',
 'Horseradish',
 'Jicama',
 'Kale',
 'Kohlrabi',
 'Komatsuna',
 'Lambsquarters',
 'Leeks',
 'Lettuce',
 'Mushrooms',
 'Mustard greens',
 'Nettles',
 'Okra',
 'Onions',
 'Orach',
 'Parsnips',
 'Pea shoots',
 'Peas',
 'Peppers, chile',
 'Peppers, sweet',
 'Potatoes',
 'Purslane',
 'Radishes',
 'Romanesco',
 'Rutabagas',
 'Salsify',
 'Scallions',
 'Shallots',
 'Shelling beans',
 'Spinach',
 'Sprouts',
 'Squash, summer',
 'Squash, winter',
 'Sunchokes',
 'Sweet potatoes',
 'Taro root',
 'Tatsoi',
 'To

In [None]:
[assemble_row(i) for i in produce] # throws an error

HTTPError: 404 Client Error: Not Found for url: https://foodwise.org/foods/Peppers,%20chile/

In [None]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    try: 
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
        month = sub(r'(In Season)|\W', ' ', string).split() 
    except IndexError:
        month = []
    return month

In [None]:
[assemble_row(i) for i in produce] # also throws an error

HTTPError: 404 Client Error: Not Found for url: https://foodwise.org/foods/Peppers,%20chile/

In [None]:
# N
# it should be peppers-chile

In [None]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    try: response.raise_for_status()
    except requests.HTTPError:
        month = []
        return month 
    else:
        html = lx.fromstring(response.text)
        try: 
            string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
            month = sub(r'(In Season)|\W', ' ', string).split() 
        except IndexError:
            month = []
            return month 
        return month
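
The nested `try`/`except`/`else` above separates two failure modes: the request fails, or the page lacks an "In Season" section. The second case can be isolated and tested without any network access. In this sketch, `parse_months` is a hypothetical helper that takes the list returned by the XPath query and catches only `IndexError`, so that unrelated bugs are not silently swallowed; the sample inputs are made up.

```python
from re import sub

def parse_months(text_nodes):
    try:
        raw = text_nodes[1]        # fails if the section is missing or has no second text node
    except IndexError:
        return []
    else:
        return sub(r'(In Season)|\W', ' ', raw).split()

found = parse_months(['\n', '\n   June • July • August   '])
missing = parse_months([])
print(found, missing)
```

Keeping the parsing step in its own function also makes it easy to try out on a saved page before sending many requests.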

In [None]:
[assemble_row(i) for i in produce]

[['Artichokes',
  False,
  False,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Arugula',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Asparagus',
  False,
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  False,
  False,
  False,
  False],
 ['Beets',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Bitter melon',
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True,
  True,
  True,
  False],
 ['Bok choy',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Broccoli',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Broccoli rabe',
  True,
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Brussels sprouts',
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  Fal

In [None]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    try: response.raise_for_status()
    except requests.HTTPError:
        return None
    else:
        html = lx.fromstring(response.text)
        try: string = html.xpath('//section[@class="sidebar__section"]')[0].text_content()
        except: print(produce)
        string = html.xpath('//section[@class="sidebar__section"]')[0].text_content()
        month = sub(r'(In Season)|\W', ' ', string).split()
        return month

In [None]:
def assemble_row(produce): 
    months = get_months(produce)
    try: months = [item in months for item in year]
    except: print(produce)
    months.insert(0, produce)
    return months

In [None]:
[assemble_row(i) for i in produce]

The display names do not match the URL slugs, so we have to use the actual links. Retrieve the `href` attribute from the anchor. Again: use __Inspect__.

In [None]:
url = 'https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable&_paged=3' #try page 3,4
response = requests.get(url, headers = headers)
response.raise_for_status()
html = lx.fromstring(response.text) # Parse the HTML
produce = html.xpath('//article[@class="card-image-title__container"]/a/@href') #returns href attribute of anchor link
produce

['https://foodwise.org/foods/komatsuna/',
 'https://foodwise.org/foods/lambsquarters/',
 'https://foodwise.org/foods/leeks/',
 'https://foodwise.org/foods/lettuce/',
 'https://foodwise.org/foods/mushrooms/',
 'https://foodwise.org/foods/mustard-greens/',
 'https://foodwise.org/foods/nettles/',
 'https://foodwise.org/foods/okra/',
 'https://foodwise.org/foods/onions/',
 'https://foodwise.org/foods/orach/',
 'https://foodwise.org/foods/parsnips/',
 'https://foodwise.org/foods/pea-shoots/',
 'https://foodwise.org/foods/peas/',
 'https://foodwise.org/foods/peppers-chile/',
 'https://foodwise.org/foods/peppers-sweet/',
 'https://foodwise.org/foods/potatoes/',
 'https://foodwise.org/foods/purslane/',
 'https://foodwise.org/foods/radishes/',
 'https://foodwise.org/foods/romanesco/',
 'https://foodwise.org/foods/rutabagas/']

In [None]:
def get_url(i):
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/?_paged=' + str(i)
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    #returns href attribute of anchor link
    produce_link = html.xpath('//article[@class="card-image-title__container"]/a/@href') 
    return produce_link

In [None]:
produce_links = [item for sublist in [get_url(i) for i in range(1,5)] for item in sublist]
produce_links

['https://foodwise.org/foods/agretti/',
 'https://foodwise.org/foods/almonds/',
 'https://foodwise.org/foods/amaranth/',
 'https://foodwise.org/foods/apples/',
 'https://foodwise.org/foods/apricots/',
 'https://foodwise.org/foods/apriums/',
 'https://foodwise.org/foods/artichokes/',
 'https://foodwise.org/foods/arugula/',
 'https://foodwise.org/foods/asian-pears/',
 'https://foodwise.org/foods/asparagus/',
 'https://foodwise.org/foods/avocados/',
 'https://foodwise.org/foods/baked-goods/',
 'https://foodwise.org/foods/bee-products/',
 'https://foodwise.org/foods/beets/',
 'https://foodwise.org/foods/bitter-melon/',
 'https://foodwise.org/foods/blackberries/',
 'https://foodwise.org/foods/blueberries/',
 'https://foodwise.org/foods/bok-choy/',
 'https://foodwise.org/foods/boysenberries/',
 'https://foodwise.org/foods/broccoli/',
 'https://foodwise.org/foods/broccoli-rabe/',
 'https://foodwise.org/foods/brown-rice/',
 'https://foodwise.org/foods/brussels-sprouts/',
 'https://foodwise.org

Let's find the (new) produce name on its page.

In [None]:
result = requests.get('https://foodwise.org/foods/peppers-chile/', headers = headers)
result.raise_for_status()

In [None]:
html = lx.fromstring(result.text)

In [None]:
html.xpath("//h1/text()")[0]

'Peppers, chile'

In [None]:
def get_months(produce_link): 
    time.sleep(0.05)
    response = requests.get(produce_link, headers = headers)
    try: response.raise_for_status()
    except requests.HTTPError:
        return [None, []] 
    else:
        html = lx.fromstring(response.text)
        try: 
            string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
        except IndexError:
            return [None, []] 
        else:
            month = sub(r'(In Season)|\W', ' ', string).split() 
            name = html.xpath("//h1/text()")[0]
            return [name, month]

In [None]:
def assemble_row(produce_link): 
    name, months = get_months(produce_link)
    months = [item in months for item in year]
    months.insert(0, name)
    return months

In [None]:
df = [assemble_row(i) for i in produce_links] 
df

[['Agretti',
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Almonds',
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True,
  False],
 ['Amaranth',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Apples',
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True,
  False],
 ['Apricots',
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  False,
  False,
  False,
  False,
  False],
 ['Apriums',
  False,
  False,
  False,
  False,
  True,
  True,
  False,
  False,
  False,
  False,
  False,
  False],
 ['Artichokes',
  False,
  False,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Arugula',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Pears',
  False,
  False,
  False,
  False,
  False,
  False,
  Fa

In [None]:
import pandas as pd
tbl = pd.DataFrame(df)
tbl.shape

(80, 13)

In [None]:
tbl.head()

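|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agretti | False | False | False | False | False | False | False | False | True | True | True | True |
| 1 | Almonds | False | False | False | False | False | False | False | True | True | True | True | False |
| 2 | Amaranth | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | Apples | False | False | False | False | False | False | False | True | True | True | True | False |
| 4 | Apricots | False | False | False | False | True | True | True | False | False | False | False | False |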


In [None]:
columnames = year.copy()
columnames.insert(0, 'Produce')
tbl.columns = columnames

In [None]:
tbl

|   | Produce | January | February | March | April | May | June | July | August | September | October | November | December |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agretti | False | False | False | False | False | False | False | False | True | True | True | True |
| 1 | Almonds | False | False | False | False | False | False | False | True | True | True | True | False |
| 2 | Amaranth | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | Apples | False | False | False | False | False | False | False | True | True | True | True | False |
| 4 | Apricots | False | False | False | False | True | True | True | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | Kale | True | True | True | True | True | True | True | True | True | True | True | True |
| 76 | Kiwi | False | False | False | False | False | False | False | False | True | True | True | True |
| 77 | Kohlrabi | True | True | True | True | False | False | False | False | False | False | True | True |
| 78 | | False | False | False | False | False | False | False | False | False | False | False | False |
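
Rows such as 78 have no name and only `False` entries: they come from pages where `get_months` returned `[None, []]`. One option is to drop them before building the DataFrame. A minimal sketch on made-up rows (shortened to three months for readability); `rows` and `valid_rows` are illustrative names.

```python
rows = [
    ['Kale', True, True, True],
    [None, False, False, False],    # failed request or page without an "In Season" section
    ['Kiwi', False, False, True],
]

valid_rows = [row for row in rows if row[0] is not None]   # keep only rows that have a name
print(valid_rows)
```

Whether to drop such rows or keep them as a record of failed pages depends on what the table is used for.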


### Tornado Watch 

We are interested in scraping and plotting the locations of all tornado warnings in the last 48 hours. 

See the link [here](https://www.tornadohq.com/).

In [None]:
import requests
import lxml.html as lx
import time
import pandas as pd

In [None]:
result = requests.get('https://www.tornadohq.com/')
result.raise_for_status() # call the method; without parentheses no check is performed

In [None]:
html = lx.fromstring(result.text) # Parse the HTML

In [None]:
warnings = html.xpath('//pre')
warnings

In [None]:
warning = warnings[0].text
warning

In [None]:
for w in warnings:
    print(w.text)
    print("\n\n-----THIS IS A NEW WARNING-----\n\n")

Let's match the latitude-longitude pair after `LAT...LON`.

In [None]:
from re import findall

In [None]:
findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', warning) # raw string, so that \. \s \d reach the regex engine unchanged

In [None]:
findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', warning)[0].split()
# (?<=...) is a positive lookbehind: the match must be preceded by LAT...LON and one whitespace character, which are not part of the match.
# The captured group is \d+ (one or more digits), \s (one whitespace character), \d+ (one or more digits).
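
To make the lookbehind concrete, here is the pattern applied to a made-up snippet in the style of a warning text (the coordinates are invented, not taken from the site). The second part is a hypothetical extension that collects every number after `LAT...LON` rather than only the first pair.

```python
from re import findall, search

sample = 'TORNADO WARNING\n\nLAT...LON 3512 9745 3520 9730 3505 9728\n'

first_pair = findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', sample)
print(first_pair)  # only the pair directly after the marker satisfies the lookbehind

# hypothetical extension: capture all whitespace-separated numbers after the marker
block = search(r'LAT\.{3}LON((?:\s+\d+)+)', sample)
numbers = block.group(1).split()
pairs = list(zip(numbers[0::2], numbers[1::2]))   # (lat, lon) tuples
print(pairs)
```

The lecture code keeps only the first pair per warning, which is enough to place one marker per warning on a map.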

Convert the coordinates into a readable format.

In [None]:
coord_list = [findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', warning.text)[0].split() for warning in warnings]

In [None]:
coord = pd.DataFrame(coord_list)
coord.columns = ['N', 'W']
coord = coord.map(lambda x: float(x) / 100) # convert hundredths of a degree into decimal degrees
coord['W'] = -coord['W'] # longitude to west is negative
coord.head()
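
The conversion can also be checked by hand without pandas. The sketch assumes, as the code above does, that the numbers are given in hundredths of a degree and that all longitudes are west; `to_decimal` is a hypothetical helper and the input pair is invented.

```python
def to_decimal(pair):
    lat, lon = (float(value) / 100 for value in pair)
    return lat, -lon          # western longitudes are negative

lat, lon = to_decimal(['3512', '9745'])
print(lat, lon)
```

A quick sanity check is that the results fall inside the continental United States, roughly 25 to 50 degrees north and -125 to -65 degrees east.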

Plot the results (consider using a [mapbox token](https://studio.mapbox.com/) for plotting).

In [None]:
coord

In [None]:
import plotly.express as px

# px.set_mapbox_access_token(open("./../keys/mapbox.txt").read())
fig = px.scatter_mapbox(coord,
                        lat='N',
                        lon='W',
                        zoom=2,
                        mapbox_style="open-street-map")
fig.show()


### Summary 

- Scraping does not necessarily return the desired data, so make use of error handling
- Use devtools to see how the website is structured