# STA 220 Data & Web Technologies for Data Analysis

### Lecture 7, 1/30/24, Scraping

### Announcements
- HW 2 is online! 

### Today's topics
 - Web Scraping: 
     - Foodwise
     - Tornado Watch

### Resources
 - [Foodwise](https://foodwise.org/)
 - [Tornado Watch](https://www.tornadohq.com/)

### Writing Scrapers

Let's scrape the Wikipedia table ourselves. Attention: we are using `requests`, so pay attention to the file that is actually returned. In devtools, inspect the `<thead>` element of the table, then check the Network tab to see whether the raw HTML returned by the server contains it as well. 

In [1]:
import requests
import lxml.html as lx
import pandas as pd

In [None]:
result = requests.get(url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area')
html = lx.fromstring(result.text)

In [3]:
result.text[:100]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la'

In [7]:
tables = html.xpath('//table[2]')
table = tables[0]

In [8]:
table.text_content()

'\n\n\nCity\n\nST\n\nLand area\n\nWater area\n\nTotal area\n\nPopulation(2020)\n\n\n(mi2)\n(km2)\n(mi2)\n(km2)\n(mi2)\n(km2)\n\n\n\n\xa0\n \n \n \n \n \n \n \n \n\n\nSitka\nAK\n2,870.1\n\n7,434\n1,945.1\n\n5,038\n4,815.1\n\n12,471\n8,458\n\n\nJuneau\nAK\n2,704.2\n\n7,004\n550.7\n\n1,426\n3,254.9\n\n8,430\n32,255\n\n\nWrangell\nAK\n2,556.1\n\n6,620\n920.6\n\n2,384\n3,476.7\n\n9,005\n2,127\n\n\nAnchorage\nAK\n1,707.0\n\n4,421\n239.7\n\n621\n1,946.7\n\n5,042\n291,247\n\n\nTribune[a]*\nKS\n778.2\n\n2,016\n0\n\n0\n778.2\n\n2,016\n1,182\n\n\nJacksonville\nFL\n747.3\n\n1,935\n127.2\n\n329\n874.5\n\n2,265\n949,611\n\n\nAnaconda\nMT\n736.7\n\n1,908\n4.7\n\n12\n741.4\n\n1,920\n9,421\n\n\nButte *\nMT\n715.8\n\n1,854\n0.6\n\n1.6\n716.3\n\n1,855\n34,494\n\n\nHouston\nTX\n640.6\n\n1,659\n31.2\n\n81\n671.8\n\n1,740\n2,304,580\n\n\nOklahoma City\nOK\n606.5\n\n1,571\n14.3\n\n37\n620.8\n\n1,608\n681,054\n\n\nPhoenix\nAZ\n518.3\n\n1,342\n1.0\n\n2.6\n519.3\n\n1,345\n1,608,139\n\n\nSan Antonio\nTX\n498.9\n

In [13]:
html.xpath('//*[@id="mw-content-text"]/div[1]/table[2]/thead')

[]

In [22]:
html.xpath('//table[2]/tbody/tr[4]//text()')

['\n',
 'Sitka',
 '\n',
 'AK',
 '\n',
 '2,870.1\n',
 '\n',
 '7,434',
 '\n',
 '1,945.1\n',
 '\n',
 '5,038',
 '\n',
 '4,815.1\n',
 '\n',
 '12,471',
 '\n',
 '8,458\n']

In [23]:
def retrieve_rows(html): 
    rows = html.xpath('//table[2]/tbody/tr')
    cells = []
    for row in rows: 
        # './td|./th' starts at the current row (not searching the whole doc again) and selects its td OR th children
        cells.append([cell.text_content() for cell in row.xpath('./td|./th')]) # text_content() instead of text, as some cells wrap their text in <b>
    return cells
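
To see what the relative XPath inside `retrieve_rows` selects, the following minimal sketch applies the same pattern to a small hand-written table. The HTML snippet is made up for illustration and is not taken from the Wikipedia page.

```python
import lxml.html as lx

# Hypothetical two-row table, written by hand for illustration
snippet = """
<table>
  <tr><th>City</th><th>ST</th></tr>
  <tr><td><b>Sitka</b></td><td>AK</td></tr>
</table>
"""
table = lx.fromstring(snippet)

rows = []
for row in table.xpath('.//tr'):
    # './td|./th' is evaluated relative to the current row
    rows.append([cell.text_content() for cell in row.xpath('./td|./th')])

print(rows)
```

Because `text_content()` collects the text of all descendants, the city name wrapped in `<b>` is still picked up.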

In [24]:
retrieve_rows(html)

[['City\n',
  'ST\n',
  'Land area\n',
  'Water area\n',
  'Total area\n',
  'Population(2020)\n'],
 ['(mi2)', '(km2)', '(mi2)', '(km2)', '(mi2)', '(km2)\n'],
 ['\xa0', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' \n'],
 ['Sitka',
  'AK',
  '2,870.1\n',
  '7,434',
  '1,945.1\n',
  '5,038',
  '4,815.1\n',
  '12,471',
  '8,458\n'],
 ['Juneau',
  'AK',
  '2,704.2\n',
  '7,004',
  '550.7\n',
  '1,426',
  '3,254.9\n',
  '8,430',
  '32,255\n'],
 ['Wrangell',
  'AK',
  '2,556.1\n',
  '6,620',
  '920.6\n',
  '2,384',
  '3,476.7\n',
  '9,005',
  '2,127\n'],
 ['Anchorage',
  'AK',
  '1,707.0\n',
  '4,421',
  '239.7\n',
  '621',
  '1,946.7\n',
  '5,042',
  '291,247\n'],
 ['Tribune[a]*',
  'KS',
  '778.2\n',
  '2,016',
  '0\n',
  '0',
  '778.2\n',
  '2,016',
  '1,182\n'],
 ['Jacksonville',
  'FL',
  '747.3\n',
  '1,935',
  '127.2\n',
  '329',
  '874.5\n',
  '2,265',
  '949,611\n'],
 ['Anaconda',
  'MT',
  '736.7\n',
  '1,908',
  '4.7\n',
  '12',
  '741.4\n',
  '1,920',
  '9,421\n'],
 ['Butte *',
  'MT',
 

In [26]:
df = pd.DataFrame(retrieve_rows(html))
df.head(10)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | City\n | ST\n | Land area\n | Water area\n | Total area\n | Population(2020)\n | | | |
| 1 | (mi2) | (km2) | (mi2) | (km2) | (mi2) | (km2)\n | | | |
| 2 | | | | | | | | | \n |
| 3 | Sitka | AK | 2,870.1\n | 7,434 | 1,945.1\n | 5,038 | 4,815.1\n | 12,471 | 8,458\n |
| 4 | Juneau | AK | 2,704.2\n | 7,004 | 550.7\n | 1,426 | 3,254.9\n | 8,430 | 32,255\n |
| 5 | Wrangell | AK | 2,556.1\n | 6,620 | 920.6\n | 2,384 | 3,476.7\n | 9,005 | 2,127\n |
| 6 | Anchorage | AK | 1,707.0\n | 4,421 | 239.7\n | 621 | 1,946.7\n | 5,042 | 291,247\n |
| 7 | Tribune[a]* | KS | 778.2\n | 2,016 | 0\n | 0 | 778.2\n | 2,016 | 1,182\n |
| 8 | Jacksonville | FL | 747.3\n | 1,935 | 127.2\n | 329 | 874.5\n | 2,265 | 949,611\n |
| 9 | Anaconda | MT | 736.7\n | 1,908 | 4.7\n | 12 | 741.4\n | 1,920 | 9,421\n |
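
The cells are still raw strings with trailing newlines and thousands separators. A possible next step, sketched here under the assumption that all numeric cells look like the ones in the Sitka row, is to strip and convert them. The helper name `to_number` is hypothetical.

```python
def to_number(cell):
    """Strip whitespace and thousands separators, then convert to float."""
    cleaned = cell.strip().replace(',', '')
    return float(cleaned)

# Row copied from the output of retrieve_rows(html)
raw_row = ['Sitka', 'AK', '2,870.1\n', '7,434', '1,945.1\n',
           '5,038', '4,815.1\n', '12,471', '8,458\n']

city, state = raw_row[0], raw_row[1]
numbers = [to_number(cell) for cell in raw_row[2:]]

print(city, state, numbers)
```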


### Example: Foodwise

Foodwise, formerly CUESA (Center for Urban Education about Sustainable Agriculture), provides [a chart](https://foodwise.org/eat-seasonally/seasonality-chart-vegetables/) showing when certain vegetables are in season. We want to recreate this chart ourselves. All the information we need is on `foodwise`, so let's scrape! 

First, observe that the search mask (Food type, Month) invokes an API. However, the parameters are complicated to assemble, and the returned object is HTML anyway, so we have to scrape the HTML. Before doing so, check with devtools that the desired information is contained in the returned document (under `doc`). 

In [46]:
import requests
import lxml.html as lx
import requests_cache
import time
requests_cache.install_cache("lecture10")

In [47]:
url = "https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable"

In [48]:
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

Here, the server expects a `user-agent` key in the request header, so we pass one explicitly. 

In [49]:
response = requests.get(url, headers = headers)
response.raise_for_status()

In [50]:
response.text[:100]

'<!doctype html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="w'

##### First approach

In [55]:
url = "https://foodwise.org/foods/artichokes/"
response = requests.get(url)

In [56]:
response.raise_for_status()

In [53]:
response.text # 403 without the header; once the chunk below has been executed, the cached response is returned instead

'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

We have to provide the correct header! 
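
One way to avoid forgetting the header is to attach it to a `requests.Session`, which then sends it with every request. The sketch below only prepares a request (nothing is sent over the network) to show that the header is merged in. It is an alternative to passing `headers = headers` each time, not what the lecture code does.

```python
import requests

# Same user-agent string as in the headers dict above
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

session = requests.Session()
session.headers.update(headers)  # applied to every request made via this session

# Build the request without sending it, to inspect the outgoing headers
request = requests.Request('GET', 'https://foodwise.org/foods/artichokes/')
prepared = session.prepare_request(request)
print(prepared.headers['user-agent'])
```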

In [54]:
response = requests.get(url, headers = headers)
response.raise_for_status()

In [57]:
response.text[:100]

'<!doctype html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="w'

In [58]:
html = lx.fromstring(response.text) # Parse the HTML
html

<Element html at 0x7fbe487ab900>

In [68]:
html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]

'\n                    March • April • May • June • September • October • November • December            '

In [69]:
string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
string

'\n                    March • April • May • June • September • October • November • December            '

In [70]:
from re import sub
sub(r'\W', ' ', string).split() # we are going to talk about RegEx some other time

['March',
 'April',
 'May',
 'June',
 'September',
 'October',
 'November',
 'December']
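
As a brief preview of what the substitution does: `\W` matches any character that is not a letter, digit, or underscore, so bullets and whitespace are replaced by spaces before splitting. The sketch below, on a shortened copy of the string above, also shows an equivalent `findall` formulation.

```python
import re

string = '\n                    March • April • May • June            '

# Replace every non-word character by a space, then split on whitespace
via_sub = re.sub(r'\W', ' ', string).split()

# Same idea, collecting the runs of word characters directly
via_findall = re.findall(r'\w+', string)

print(via_sub)
print(via_findall)
```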

In [71]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
    month = sub(r'(In Season)|\W', ' ', string).split()
    return month

In [76]:
month = get_months('carrots')
month 

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

##### How do we get the produce names in the first place? 

In [77]:
url = 'https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable'
response = requests.get(url, headers = headers)
response.raise_for_status()

In [78]:
html = lx.fromstring(response.text) # Parse the HTML
html

<Element html at 0x7fbe487b7c20>

In [83]:
produce = html.xpath('//div[@class="card-image-title__text-content"]/h3/text()')
produce   

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock',
 'Cabbage',
 'Cactus pads',
 'Cardoons',
 'Carrots',
 'Cauliflower',
 'Celeriac',
 'Celery',
 'Celtuce',
 'Chard',
 'Chickweed']

In [82]:
[i.text for i in produce] # only works if the XPath does not end in /text(), i.e. if produce holds <h3> elements

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock',
 'Cabbage',
 'Cactus pads',
 'Cardoons',
 'Carrots',
 'Cauliflower',
 'Celeriac',
 'Celery',
 'Celtuce',
 'Chard',
 'Chickweed']

In [84]:
def get_produce(page):
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
    response = requests.get(url, headers = headers, params = {
        '_food_type': 'vegetable',
        '_paged': page
    })
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    produce = html.xpath('//div[@class="card-image-title__text-content"]/h3/text()')
    return produce
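
The `params` dictionary is encoded into the query string by `requests`. To check which URL `get_produce` would request, one can prepare the request without sending it. This sketch only inspects the URL and makes no network call.

```python
import requests

url = 'https://foodwise.org/eat-seasonally/seasonality-charts/'
request = requests.Request('GET', url, params={
    '_food_type': 'vegetable',
    '_paged': 2
})

# prepare() assembles the final URL but does not contact the server
prepared = request.prepare()
print(prepared.url)
```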

In [86]:
get_produce(2)

['Chicory',
 'Collard greens',
 'Corn',
 'Cress',
 'Cresta di Gallo',
 'Cucumbers',
 'Dandelion greens',
 'Eggplant',
 'Endive',
 'Fava beans',
 'Fava greens',
 'Fennel',
 'Garlic',
 'Ginger root',
 'Green beans',
 'Herbs',
 'Horseradish',
 'Jicama',
 'Kale',
 'Kohlrabi']

In [87]:
produce = [item for sublist in [get_produce(i) for i in range(1,5)] for item in sublist]
produce

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock',
 'Cabbage',
 'Cactus pads',
 'Cardoons',
 'Carrots',
 'Cauliflower',
 'Celeriac',
 'Celery',
 'Celtuce',
 'Chard',
 'Chickweed',
 'Chicory',
 'Collard greens',
 'Corn',
 'Cress',
 'Cresta di Gallo',
 'Cucumbers',
 'Dandelion greens',
 'Eggplant',
 'Endive',
 'Fava beans',
 'Fava greens',
 'Fennel',
 'Garlic',
 'Ginger root',
 'Green beans',
 'Herbs',
 'Horseradish',
 'Jicama',
 'Kale',
 'Kohlrabi',
 'Komatsuna',
 'Lambsquarters',
 'Leeks',
 'Lettuce',
 'Mushrooms',
 'Mustard greens',
 'Nettles',
 'Okra',
 'Onions',
 'Orach',
 'Parsnips',
 'Pea shoots',
 'Peas',
 'Peppers, chile',
 'Peppers, sweet',
 'Potatoes',
 'Purslane',
 'Radishes',
 'Romanesco',
 'Rutabagas',
 'Salsify',
 'Scallions',
 'Shallots',
 'Shelling beans',
 'Spinach',
 'Sprouts',
 'Squash, summer',
 'Squash, winter',
 'Sunchokes',
 'Sweet potatoes',
 'Taro root',
 'Tatsoi',
 'To
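
The nested list comprehension flattens a list of pages into a single list. An equivalent and arguably more readable alternative is `itertools.chain.from_iterable`. The sketch uses a few hard-coded names instead of calling `get_produce`.

```python
from itertools import chain

# Stand-in for [get_produce(i) for i in range(1, 5)]
pages = [
    ['Artichokes', 'Arugula'],
    ['Chicory', 'Collard greens'],
    [],  # an empty page contributes nothing
]

flat = list(chain.from_iterable(pages))
print(flat)
```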

##### Iterate over produce items

In [88]:
year = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 
        'October', 'November', 'December']

In [92]:
month = get_months('potatoes')

In [93]:
month

['June', 'July', 'August']

In [94]:
[item in month for item in year]

[False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 False]

In [95]:
def assemble_row(produce): 
    months = get_months(produce)
    months = [item in months for item in year]
    months.insert(0, produce)
    return months

In [97]:
assemble_row('apples')

['apples',
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 False]

In [98]:
produce[:10]

['Artichokes',
 'Arugula',
 'Asparagus',
 'Beets',
 'Bitter melon',
 'Bok choy',
 'Broccoli',
 'Broccoli rabe',
 'Brussels sprouts',
 'Burdock']

In [99]:
[assemble_row(i) for i in produce]

IndexError: list index out of range
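
The `IndexError` most likely comes from a page without an "In Season" section: the XPath then returns an empty list, and indexing it with `[1]` fails. The sketch below reproduces this on two made-up HTML fragments (not the real Foodwise markup) and shows one possible workaround, joining all matched text nodes instead of indexing. The helper name `months_from` is hypothetical.

```python
import re
import lxml.html as lx

XPATH = '//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()'

# Hypothetical fragments, written by hand for illustration
with_season = lx.fromstring(
    '<div><section class="sidebar__section"><h2>In Season</h2> March • April </section></div>'
)
without_season = lx.fromstring(
    '<div><section class="sidebar__section"><h2>Storage</h2> Keep cool </section></div>'
)

def months_from(html):
    nodes = html.xpath(XPATH)   # empty list if no section matches
    text = ' '.join(nodes)      # joining never raises, unlike nodes[1]
    return re.findall(r'[A-Za-z]+', text)

print(months_from(with_season))
print(months_from(without_season))
```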

In [104]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    try: 
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
        month = sub(r'(In Season)|\W', ' ', string).split() 
    except IndexError: # the page has no "In Season" section
        month = []
    return month

In [105]:
[assemble_row(i) for i in produce]

HTTPError: 404 Client Error: Not Found for url: https://foodwise.org/foods/Peppers,%20chile/

In [112]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    try:
        response.raise_for_status()
    except requests.HTTPError:
        return [] # page not found, e.g. the name is not a valid URL slug
    html = lx.fromstring(response.text)
    try:
        string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
    except IndexError:
        return [] # the page has no "In Season" section
    return sub(r'(In Season)|\W', ' ', string).split()

In [113]:
[assemble_row(i) for i in produce]

[['Artichokes',
  False,
  False,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Arugula',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Asparagus',
  False,
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  False,
  False,
  False,
  False],
 ['Beets',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Bitter melon',
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True,
  True,
  True,
  False],
 ['Bok choy',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Broccoli',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Broccoli rabe',
  True,
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Brussels sprouts',
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  Fal

In [None]:
def get_months(produce): 
    time.sleep(0.05)
    url = "https://foodwise.org/foods/" + produce + "/"
    response = requests.get(url, headers = headers)
    try: response.raise_for_status()
    except requests.HTTPError:
        return None
    else:
        html = lx.fromstring(response.text)
        try:
            string = html.xpath('//section[@class="sidebar__section"]')[0].text_content()
        except IndexError:
            print(produce) # report which item has no sidebar section
            raise
        month = sub(r'(In Season)|\W', ' ', string).split()
        return month

In [None]:
def assemble_row(produce): 
    months = get_months(produce)
    try:
        months = [item in months for item in year]
    except TypeError:
        print(produce) # get_months returned None for this item
        raise
    months.insert(0, produce)
    return months

In [None]:
[assemble_row(i) for i in produce]

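The displayed names do not always match the URL (e.g. 'Peppers, chile' leads to a 404), so we cannot build the links from the names. Instead, retrieve the `href` attribute from each anchor.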

In [114]:
url = 'https://foodwise.org/eat-seasonally/seasonality-charts/?_food_type=vegetable&_paged=3' #try page 3,4
response = requests.get(url, headers = headers)
response.raise_for_status()
html = lx.fromstring(response.text) # Parse the HTML
produce = html.xpath('//article[@class="card-image-title__container"]/a/@href') #returns href attribute of anchor link
produce

['https://foodwise.org/foods/komatsuna/',
 'https://foodwise.org/foods/lambsquarters/',
 'https://foodwise.org/foods/leeks/',
 'https://foodwise.org/foods/lettuce/',
 'https://foodwise.org/foods/mushrooms/',
 'https://foodwise.org/foods/mustard-greens/',
 'https://foodwise.org/foods/nettles/',
 'https://foodwise.org/foods/okra/',
 'https://foodwise.org/foods/onions/',
 'https://foodwise.org/foods/orach/',
 'https://foodwise.org/foods/parsnips/',
 'https://foodwise.org/foods/pea-shoots/',
 'https://foodwise.org/foods/peas/',
 'https://foodwise.org/foods/peppers-chile/',
 'https://foodwise.org/foods/peppers-sweet/',
 'https://foodwise.org/foods/potatoes/',
 'https://foodwise.org/foods/purslane/',
 'https://foodwise.org/foods/radishes/',
 'https://foodwise.org/foods/romanesco/',
 'https://foodwise.org/foods/rutabagas/']
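
The links also contain the URL slug of each item, which is what the name-based version of `get_months` would have needed. A small sketch using the standard library's `urllib.parse` (the helper name `slug_from` is hypothetical):

```python
from urllib.parse import urlparse

def slug_from(link):
    """Return the last path component of a produce link."""
    path = urlparse(link).path  # e.g. '/foods/peppers-chile/'
    return path.strip('/').split('/')[-1]

links = ['https://foodwise.org/foods/mustard-greens/',
         'https://foodwise.org/foods/peppers-chile/']
slugs = [slug_from(link) for link in links]
print(slugs)
```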

In [115]:
def get_url(i):
    url = 'https://foodwise.org/eat-seasonally/seasonality-charts/?_paged=' + str(i)
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    #returns href attribute of anchor link
    produce_link = html.xpath('//article[@class="card-image-title__container"]/a/@href') 
    return produce_link

In [116]:
produce_links = [item for sublist in [get_url(i) for i in range(1,5)] for item in sublist]
produce_links

['https://foodwise.org/foods/agretti/',
 'https://foodwise.org/foods/almonds/',
 'https://foodwise.org/foods/amaranth/',
 'https://foodwise.org/foods/apples/',
 'https://foodwise.org/foods/apricots/',
 'https://foodwise.org/foods/apriums/',
 'https://foodwise.org/foods/artichokes/',
 'https://foodwise.org/foods/arugula/',
 'https://foodwise.org/foods/asian-pears/',
 'https://foodwise.org/foods/asparagus/',
 'https://foodwise.org/foods/avocados/',
 'https://foodwise.org/foods/baked-goods/',
 'https://foodwise.org/foods/bee-products/',
 'https://foodwise.org/foods/beets/',
 'https://foodwise.org/foods/bitter-melon/',
 'https://foodwise.org/foods/blackberries/',
 'https://foodwise.org/foods/blueberries/',
 'https://foodwise.org/foods/bok-choy/',
 'https://foodwise.org/foods/boysenberries/',
 'https://foodwise.org/foods/broccoli/',
 'https://foodwise.org/foods/broccoli-rabe/',
 'https://foodwise.org/foods/brown-rice/',
 'https://foodwise.org/foods/brussels-sprouts/',
 'https://foodwise.org

Let's retrieve the produce name from its own page. 

In [117]:
result = requests.get('https://foodwise.org/foods/peppers-chile/', headers = headers)
result.raise_for_status()

In [118]:
html = lx.fromstring(result.text)

In [121]:
html.xpath("//h1/text()")[0]

'Peppers, chile'

In [133]:
def get_months(produce_link): 
    time.sleep(0.05)
    response = requests.get(produce_link, headers = headers)
    try: response.raise_for_status()
    except requests.HTTPError:
        return [None, []] 
    else:
        html = lx.fromstring(response.text)
        try: 
            string = html.xpath('//section[@class="sidebar__section"][h2[contains(text(), "In Season")]]/text()')[1]
        except IndexError:
            return [None, []] 
        else:
            month = sub(r'(In Season)|\W', ' ', string).split() 
            name = html.xpath("//h1/text()")[0]
            return [name, month]

In [134]:
def assemble_row(produce_link): 
    name, months = get_months(produce_link)
    months = [item in months for item in year]
    months.insert(0, name)
    return months

In [135]:
df = [assemble_row(i) for i in produce_links] 
df

[['Agretti',
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Almonds',
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True,
  False],
 ['Amaranth',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Apples',
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  True,
  False],
 ['Apricots',
  False,
  False,
  False,
  False,
  True,
  True,
  True,
  False,
  False,
  False,
  False,
  False],
 ['Apriums',
  False,
  False,
  False,
  False,
  True,
  True,
  False,
  False,
  False,
  False,
  False,
  False],
 ['Artichokes',
  False,
  False,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True],
 ['Arugula',
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True],
 ['Asian pears',
  False,
  False,
  False,
  False,
  False,
  False

In [136]:
import pandas as pd
tbl = pd.DataFrame(df)
tbl.shape

(80, 13)

In [137]:
tbl.head()

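| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agretti | False | False | False | False | False | False | False | False | True | True | True | True |
| 1 | Almonds | False | False | False | False | False | False | False | True | True | True | True | False |
| 2 | Amaranth | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | Apples | False | False | False | False | False | False | False | True | True | True | True | False |
| 4 | Apricots | False | False | False | False | True | True | True | False | False | False | False | False |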


In [138]:
columnames = year.copy()
columnames.insert(0, 'Produce')
tbl.columns = columnames
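
Alternatively, the column names can be passed when the data frame is created. The sketch below uses two rows copied from the output above plus one placeholder row with a missing name, as produced when a page could not be parsed, and drops the latter.

```python
import pandas as pd

year = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
        'August', 'September', 'October', 'November', 'December']

rows = [
    ['Agretti'] + [False] * 8 + [True] * 4,
    ['Almonds'] + [False] * 7 + [True] * 4 + [False],
    [None] + [False] * 12,  # placeholder for a page that could not be parsed
]

tbl = pd.DataFrame(rows, columns=['Produce'] + year)
tbl = tbl.dropna(subset=['Produce'])  # remove rows without a name

# Which items are in season in September?
in_september = tbl.loc[tbl['September'], 'Produce'].tolist()
print(in_september)
```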

In [139]:
tbl

| | Produce | January | February | March | April | May | June | July | August | September | October | November | December |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agretti | False | False | False | False | False | False | False | False | True | True | True | True |
| 1 | Almonds | False | False | False | False | False | False | False | True | True | True | True | False |
| 2 | Amaranth | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | Apples | False | False | False | False | False | False | False | True | True | True | True | False |
| 4 | Apricots | False | False | False | False | True | True | True | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | Kale | True | True | True | True | True | True | True | True | True | True | True | True |
| 76 | Kiwi | False | False | False | False | False | False | False | False | True | True | True | True |
| 77 | Kohlrabi | True | True | True | True | False | False | False | False | False | False | True | True |
| 78 | | False | False | False | False | False | False | False | False | False | False | False | False |


### Tornado Watch 

We are interested in scraping and plotting the locations of all tornado warnings in the last 48 hours. 

In [None]:
import requests
import lxml.html as lx
import time
import pandas as pd

In [None]:
result = requests.get('https://www.tornadohq.com/')
result.raise_for_status()

In [None]:
html = lx.fromstring(result.text) # Parse the HTML

In [None]:
warnings = html.xpath('//pre')
warnings

In [None]:
warning = warnings[0].text
warning

Let's match the first latitude-longitude pair after `LAT...LON`. 

In [None]:
from re import findall

In [None]:
findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', warning)[0].split()
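
To make the pattern concrete, here is a sketch on a made-up warning text (the coordinates are invented, not taken from the site). The lookbehind `(?<=LAT\.{3}LON\s)` requires the literal `LAT...LON` plus one whitespace character directly before the match, and the group captures the first pair of numbers. The helper `first_pair` is hypothetical and returns `None` when the marker is missing instead of raising an `IndexError`.

```python
from re import findall

def first_pair(text):
    """Return the first [lat, lon] pair after 'LAT...LON', or None."""
    matches = findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', text)
    return matches[0].split() if matches else None

# Hypothetical warning text, written by hand for illustration
sample = 'TORNADO WARNING\n\nLAT...LON 3456 8912 3460 8900 3450 8895\nTIME...MOT...LOC 2210Z'

print(first_pair(sample))
print(first_pair('no coordinates here'))
```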

Extract the coordinates of all warnings and convert them into a readable format. 

In [None]:
coord_list = [findall(r'(?<=LAT\.{3}LON\s)(\d+\s\d+)', warning.text)[0].split() for warning in warnings]

In [None]:
coord = pd.DataFrame(coord_list)
coord.columns = ['N', 'W']
coord = coord.astype(float) / 100 # convert location into readable format (decimal degrees)
coord['W'] = -coord['W'] # longitude to west is negative
coord.head()
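
The same conversion for a single pair, written out in plain Python. It assumes, as the code above does, that the values are given in hundredths of a degree and that all longitudes lie west of the prime meridian. The function name is hypothetical and the input values are invented.

```python
def to_decimal_degrees(lat_text, lon_text):
    """Convert e.g. ('3456', '8912') to (34.56, -89.12)."""
    lat = int(lat_text) / 100
    lon = -int(lon_text) / 100  # western longitudes are negative
    return lat, lon

print(to_decimal_degrees('3456', '8912'))
```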

Plot the results! Note that plotting with `scatter_mapbox` may require a [mapbox token](https://studio.mapbox.com/).

In [None]:
import plotly.express as px

# read the mapbox token from a local file
with open("./../keys/mapbox.txt") as f:
    px.set_mapbox_access_token(f.read().strip())
fig = px.scatter_mapbox(coord,
                        lat='N',
                        lon='W',
                        zoom=4)
fig.show()

### Summary 

- Scraping does not necessarily return the desired content, so make use of error handling 
- Use devtools to see how the website is structured
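
The first point can be sketched as a small loop that records failures instead of stopping at the first one. The function names are hypothetical, and `fake_scrape` merely stands in for a real scraper such as `assemble_row`.

```python
def scrape_all(items, scrape_one):
    """Apply scrape_one to every item and collect failures separately."""
    results, failures = [], []
    for item in items:
        try:
            results.append(scrape_one(item))
        except (IndexError, ValueError) as error:
            failures.append((item, type(error).__name__))
    return results, failures

def fake_scrape(item):
    # Stand-in for a real request: names with a comma are treated as invalid
    if ',' in item:
        raise ValueError('not a valid URL slug: ' + item)
    return item.lower()

results, failures = scrape_all(['Carrots', 'Peppers, chile', 'Kale'], fake_scrape)
print(results)
print(failures)
```

Keeping the failed items makes it easy to inspect them in devtools afterwards.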