# Avalanche Data Scraper

Pull all forecast data from the Northwest Avalanche Center (NWAC) for Mount Hood, OR. Keep the critical data along with the 'Forecast Discussion', which provides some context and _may_ be useful for adding nuance to the data.

Unfortunately the aspect, likelihood, and size of each hazard (e.g. 'northwest-facing slopes') are stored in .png images on their website; decoding these may be a pain, so I'm just scraping the URLs for now.
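
Since only the image URLs are kept for now, it helps to store them in one consistent form. In the scraped output further down, the octagon links are site-relative (e.g. `/avalanche-forecast/octagon/problem/7087.png`) while the size and likelihood links are absolute. A minimal sketch using `urllib.parse.urljoin`; the helper name `absolutize` and the `SITE_ROOT` constant are my own, and I'm assuming relative paths resolve against `https://www.nwac.us/`. The `cdn.example.com` URL in the comment is a made-up placeholder.

```python
from urllib.parse import urljoin

# Assumption: relative image paths are relative to the NWAC site root.
SITE_ROOT = "https://www.nwac.us/"


def absolutize(url, base=SITE_ROOT):
    """Return an absolute URL; already-absolute URLs pass through unchanged."""
    return urljoin(base, url)


# Relative octagon path taken from the scraped output:
print(absolutize("/avalanche-forecast/octagon/problem/7087.png"))
# An absolute URL (hypothetical host) is returned as-is:
print(absolutize("https://cdn.example.com/media/size.png"))
```

This could be applied to each list in `row['octagons']` before storing it, so every image column holds downloadable links.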

The goal is to take this information and start to build an automated avalanche forecast based on live weather data. While I will continue to trust NWAC's human-generated forecast, there may be times where the two forecasts diverge; if the automated model performs well, this information could prove relevant.

Additionally, NWAC only issues forecasts during the core of the winter season; an automated forecast (if accurate) could be useful in the early and late season.

In [106]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep

# store in dictionary for simplicity
data = {}

In [134]:
current = "https://www.nwac.us/avalanche-forecast/current/mt-hood/"
r = requests.get(current)
sup = BeautifulSoup(r.content, "html5lib")

In [147]:
# scratch cell for figuring out what tags have relevant information
ps = sup.find(id="problems")
l = ps.find_all(class_='problem')
for e in l:
    print(e.attrs)

{'class': ['problem', 'wind-slab']}
{'class': ['problem', 'loose-wet']}


In [204]:
def extract(r):
    '''Take a response (non-404) and extract the relevant info; return the issue timestamp and a dictionary of data.'''
    row = {}

    soup = BeautifulSoup(r.content, "html5lib")
    elevs = soup.find(id="elevation-levels")
    problem_section = soup.find(id="problems")
    problem_tags = problem_section.find_all(class_="problem", recursive=False)
    discussion = soup.find(id='discussion').contents
    
    problems = []
    sizes = []
    likelihoods = []
    octagons = []
    
    for div in problem_tags:
        # print(div.attrs)
        problems.append(div.attrs['class'][1])
        sizes.append(div.find(class_='problem-sizes').attrs['src'])
        likelihoods.append(div.find(class_='problem-likelihood').attrs['src'])
        octagons.append(div.find(class_='problem-octagon').attrs['src'])

    row['problems'] = problems
    row['sizes'] = sizes
    row['likelihoods'] = likelihoods
    row['octagons'] = octagons
    row['discussion'] = discussion
    
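    # str.strip('Issued: ') removes any of those *characters* from both ends,
    # not the literal label, so split on the label instead.
    issued = elevs.contents[1].contents[1].string.split('Issued:', 1)[-1].strip()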

    for tag in ["treeline-above", "treeline-near", "treeline-below"]:
        el = elevs.find(id=tag)
        danger = el.contents[3].contents[5].h4.string
        # print(tag, danger)
        row[tag] = danger

    print('\r' + str(row['problems']))
    return issued, row

# extract()
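
`extract` stores the danger rating for each elevation band as text ('Moderate', 'Considerable', ...). For modeling, an ordinal number is probably easier to work with. A rough sketch, assuming the labels follow the five-level North American Avalanche Danger Scale; `DANGER_LEVELS` and `danger_to_int` are names I made up, and anything unrecognised (e.g. a missing rating) maps to `None`.

```python
# North American Avalanche Danger Scale, lowest to highest.
DANGER_LEVELS = {
    "low": 1,
    "moderate": 2,
    "considerable": 3,
    "high": 4,
    "extreme": 5,
}


def danger_to_int(label):
    """Return 1-5 for a recognised danger label, or None for anything else."""
    if not isinstance(label, str):
        return None
    # Normalise whitespace and case before the lookup.
    return DANGER_LEVELS.get(label.strip().lower())


print(danger_to_int("Considerable"))
```

Once the data is in a DataFrame, something like `df["treeline-above"].map(danger_to_int)` should convert a whole column.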

In [196]:
ex = extract(requests.get(current))

[['wind-slab', 'loose-wet']]


In [84]:
first_forecast = "https://www.nwac.us/avalanche-forecast/avalanche-region-forecast/73/mt-hood/"
incomplete_forecast = "https://www.nwac.us/avalanche-forecast/avalanche-region-forecast/3623/mt-hood/"
# URL template to format with a forecast ID; IDs seem to run from 73 to roughly 3900 (upper bound is a guess)
url_base = "https://www.nwac.us/avalanche-forecast/avalanche-region-forecast/{}/mt-hood/"

In [209]:
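# This run only probes IDs 0-72, which all return 404 (first_forecast is ID 73).
# The rows saved below presumably came from an earlier run over a wider range.
for i in range(73):
# for i in range(73, 75):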
    url = url_base.format(i)
    r = requests.get(url)
    print('\r' + str(i) + ' ' + str(r.status_code))
    if r.status_code == 404:
        continue
    else:
        # be nice to server
        sleep(.5)

    try:
        issued, row = extract(r)
        data[issued] = row
    except AttributeError:
        continue

0 404
1 404
2 404
3 404
4 404
5 404
6 404
7 404
8 404
9 404
10 404
11 404
12 404
13 404
14 404
15 404
16 404
17 404
18 404
19 404
20 404
21 404
22 404
23 404
24 404
25 404
26 404
27 404
28 404
29 404
30 404
31 404
32 404
33 404
34 404
35 404
36 404
37 404
38 404
39 404
40 404
41 404
42 404
43 404
44 404
45 404
46 404
47 404
48 404
49 404
50 404
51 404
52 404
53 404
54 404
55 404
56 404
57 404
58 404
59 404
60 404
61 404
62 404
63 404
64 404
65 404
66 404
67 404
68 404
69 404
70 404
71 404
72 404
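
The loop above silently skips both missing pages and pages that fail to parse. A possible variant that keeps a record of what was skipped and why, sketched with the network call and the parser passed in as functions so it can be tried without hitting the server. The names `crawl`, `fetch`, and `parse` are hypothetical; `fetch(i)` is assumed to return a `(status_code, body)` pair and `parse(body)` an `(issued, row)` pair like `extract` does.

```python
from time import sleep


def crawl(ids, fetch, parse, delay=0.5):
    """Collect parsed forecasts keyed by issue time; log every skipped ID."""
    results = {}
    skipped = []
    for i in ids:
        status, body = fetch(i)
        if status != 200:
            # Missing page: note the status and move on without sleeping.
            skipped.append((i, status))
            continue
        try:
            issued, row = parse(body)
        except (AttributeError, IndexError) as err:
            # Page exists but lacks the expected tags (incomplete forecast).
            skipped.append((i, type(err).__name__))
        else:
            results[issued] = row
        # Be nice to the server after every real page.
        sleep(delay)
    return results, skipped


# Dry run with stand-in functions instead of real requests:
pages = {73: "ok", 74: "bad"}


def fake_fetch(i):
    return (200, pages[i]) if i in pages else (404, "")


def fake_parse(body):
    if body == "bad":
        raise AttributeError("missing tag")
    return "t-" + body, {"problems": []}


print(crawl(range(72, 75), fake_fetch, fake_parse, delay=0))
```

With the real scraper, `fetch` could wrap `requests.get(url_base.format(i))` and `parse` could wrap `extract`.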


In [206]:
len(data)

666

In [207]:
df = pd.DataFrame.from_dict(data, orient='index')
df.head()

| | treeline-above | treeline-below | discussion | sizes | octagons | treeline-near | likelihoods | problems |
|---|---|---|---|---|---|---|---|---|
| 10:01 AM PST Sunday, March 12, 2017 | Considerable | Moderate | `[  , <div class="forecast-snowpack"...` | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[/avalanche-forecast/octagon/problem/7087.png,...` | Considerable | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[wind-slab, storm-slabs, cornices, loose-wet]` |
| 10:11 AM PST Tuesday, December 15, 2015 | Considerable | Moderate | `[  , <div class="forecast-snowpack"...` | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[/avalanche-forecast/octagon/problem/3605.png,...` | Moderate | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[wind-slab, storm-slabs]` |
| 10:17 PM PST Monday, March 6, 2017 | High | Considerable | `[  , <div class="forecast-snowpack"...` | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[/avalanche-forecast/octagon/problem/6929.png,...` | High | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[wind-slab, storm-slabs]` |
| 10:18 AM PST Monday, March 6, 2017 | Considerable | Moderate | `[  , <div class="forecast-snowpack"...` | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[/avalanche-forecast/octagon/problem/6908.png,...` | Considerable | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[wind-slab, storm-slabs]` |
| 10:19 AM PST Tuesday, March 25, 2014 | Considerable | Moderate | `[  , <div class="forecast-snowpack"...` | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[/avalanche-forecast/octagon/problem/1497.png,...` | Considerable | `[https://d22fgw9k2fjwhz.cloudfront.net/media/i...` | `[loose-wet, wind-slab, storm-slabs]` |
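
The index here is the raw 'Issued' string, so rows sort alphabetically rather than chronologically (a 2017 forecast lands next to one from 2015). A minimal sketch for turning these strings into `datetime` objects. `parse_issued` is my own name; I'm assuming the only zone abbreviations that appear are PST and PDT, that month and weekday names are in English, and I'm keeping the result as naive local time rather than attaching a timezone.

```python
import re
from datetime import datetime


def parse_issued(text):
    """Parse e.g. '10:01 AM PST Sunday, March 12, 2017' into a naive datetime."""
    # Drop the zone abbreviation; strptime's %Z handling of PST/PDT is
    # platform-dependent, so it is simpler to remove it up front.
    cleaned = re.sub(r"\b(PST|PDT)\b\s*", "", text.strip())
    return datetime.strptime(cleaned, "%I:%M %p %A, %B %d, %Y")


stamps = [
    "10:01 AM PST Sunday, March 12, 2017",
    "10:11 AM PST Tuesday, December 15, 2015",
    "10:17 PM PST Monday, March 6, 2017",
]
for s in sorted(stamps, key=parse_issued):
    print(parse_issued(s).isoformat(), "<-", s)
```

Mapping the DataFrame index through `parse_issued` and then sorting should put the forecasts in chronological order.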


In [208]:
df.to_csv('nwac_mount_hood_dec2013-dec2017.csv')
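
One thing to watch when reading this CSV back: the list-valued columns (`problems`, `sizes`, `likelihoods`, `octagons`) are written as their string form, so they come back as text like `"['wind-slab', 'loose-wet']"` rather than lists. A small sketch for recovering them with `ast.literal_eval`; `parse_list_cell` is a hypothetical helper, and it deliberately leaves anything that isn't a plain Python list literal (such as the HTML in `discussion`) untouched.

```python
import ast


def parse_list_cell(cell):
    """Turn a CSV cell such as "['wind-slab', 'loose-wet']" back into a list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal (e.g. raw HTML), so hand it back unchanged.
        return cell
    # Only accept real lists; other literals are returned as the original text.
    return value if isinstance(value, list) else cell


print(parse_list_cell("['wind-slab', 'loose-wet']"))
print(parse_list_cell('[  , <div class="forecast-snowpack">text</div>]'))
```

After `pd.read_csv`, applying this to a column (for example `df['problems'].apply(parse_list_cell)`) should restore the lists.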