# Investigation of Avalanche Tendencies:
## Lucas Crichton, Omer Tahir

## Abstract:

## Introduction
In this investigation, we explore data from North America and Europe to reveal potential trends among avalanche occurrences. The data were provided by Avalanche Canada, the Colorado Avalanche Information Center (CAIC), and the European Avalanche Warning Services (EAWS). First, an exploratory analysis will investigate the relationship between the type of activity performed at the time of an avalanche and the number of deaths it caused. Next, we will investigate whether avalanche deaths are spread relatively evenly throughout the ski season or whether there is a time of year when deadly avalanches are more common.

## Sources:
- “Avalanche.org » Accidents.” Avalanche.org, Colorado Avalanche Information Center, 5 Feb. 2020, https://avalanche.org/avalanche-accidents/.
- “Fatalities.” EAWS, 25 Nov. 2021, https://www.avalanches.org/fatalities/fatalities-20/. 
- “Historical Incidents.” Avalanche Canada, https://www.avalanche.ca/incidents. 


# Preparing the Data:
First, we must prepare the data so that the data frame used in our analysis contains data from all three sources and all necessary variables.

## Importing Necessary Packages:

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


%matplotlib inline

## Extracting Avalanche Canada Data:

In [88]:
# Get data on avalanche forecasts and incidents from Avalanche Canada
# Avalanche Canada has an unstable public API
# https://github.com/avalanche-canada/ac-web
# Since the API might change, this code might break
import json
import os
import urllib.request


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


# Incidents: the API is paginated, so follow the "next" links until none remain
url = "http://incidents.avalanche.ca/public/incidents/?format=json"
result = fetch_json(url)
incident_list = result["results"]
while result["next"] is not None:
    result = fetch_json(result["next"])
    incident_list = incident_list + result["results"]

incidents_brief = pd.DataFrame.from_dict(incident_list, orient="columns")
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
incidents_brief

Unnamed: 0,id,date,location,location_province,group_activity,num_involved,num_injured,num_fatal
0,8bc4720d-498c-4793-81ef-c43db9f36ca4,2021-11-27,"Sunshine Bowl, Hasler Area",BC,Snowmobiling,3.0,0.0,1
1,6a3a4698-d047-4082-bdea-92f4db7e63bf,2021-05-30,Mount Andromeda-Skyladder,AB,Mountaineering,2.0,0.0,2
2,ba14a125-29f7-4432-97ad-73a53207a5e7,2021-04-05,Haddo Peak,AB,Skiing,2.0,0.0,1
3,59023c05-b679-4e9f-9c06-910021318663,2021-03-29,Eureka Peak,BC,Snowmobiling,1.0,0.0,1
4,10774b2d-b7de-42ac-a600-9828cb4e6129,2021-03-04,Reco Mountain,BC,Snowmobiling,1.0,0.0,1
...,...,...,...,...,...,...,...,...
484,101c517b-29a4-4c49-8934-f6c56ddd882d,1840-02-01,Château-Richer,QC,Unknown,,,1
485,b2e1c50a-1533-4145-a1a2-0befca0154d5,1836-02-09,Quebec,QC,Unknown,,,1
486,18e8f963-da33-4682-9312-57ca2cc9ad8d,1833-05-24,Carbonear,NL,Unknown,,0.0,1
487,083d22df-ed50-4687-b9ab-1649960a0fbe,1825-02-04,Saint-Joseph de Lévis,QC,Inside Building,,,5


In [80]:
# Scratch test: extract paragraph text from a web page with urllib and a regular expression
import urllib.request
import re

url = 'http://pythonprogramming.net/parse-website-using-regular-expressions-urllib/'

req = urllib.request.Request(url)
with urllib.request.urlopen(req) as resp:
    respData = resp.read().decode('utf-8')

# Non-greedy match so that each <p>...</p> pair is captured separately
paragraphs = re.findall(r'<p>(.*?)</p>', respData)

for eachP in paragraphs:
    print(eachP)


In [3]:
# incidents
# We can get more information about these incidents e.g. "https://www.avalanche.ca/incidents/37d909e4-c6de-43f1-8416-57a34cd48255"
# this information is also available through the API
def get_incident_details(incident_id):
    url = "http://incidents.avalanche.ca/public/incidents/{}?format=json".format(incident_id)
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    return result


# Local cache file. os.path.isfile is always False for a URL, so the hosted copy at
# https://datascience.quantecon.org/assets/data/avalanche_incidents.csv
# could never act as the cache path.
incidentsfile = "avalanche_incidents.csv"

# To avoid loading the Avalanche Canada servers, we save the incident details locally.
if not os.path.isfile(incidentsfile):
    incident_detail_list = incidents_brief.id.apply(get_incident_details).to_list()
    incidents = pd.DataFrame.from_dict(incident_detail_list, orient="columns")
    incidents.to_csv(incidentsfile)
else:
    # Note: nested columns such as avalanche_obs come back from the CSV as strings
    incidents = pd.read_csv(incidentsfile)

incidents

Unnamed: 0,id,ob_date,location,location_desc,...,weather_comment,snowpack_obs,snowpack_comment,documents
0,8bc4720d-498c-4793-81ef-c43db9f36ca4,2021-11-27,"Sunshine Bowl, Hasler Area",Approx. 17km East of Powder King ski area,...,"Overcast, windy conditions were reported with ...","{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",A snow profile near the avalanche on the follo...,"[{'date': '2021-11-30', 'title': 'Scene photo'..."
1,6a3a4698-d047-4082-bdea-92f4db7e63bf,2021-05-30,Mount Andromeda-Skyladder,Approximately 96km SE of Jasper,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'date': '2021-06-01', 'title': 'Mt Andromeda..."
2,ba14a125-29f7-4432-97ad-73a53207a5e7,2021-04-05,Haddo Peak,Approximately 6km SW of Lake Louise Village,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'date': '2021-04-05', 'title': 'Overview pho..."
3,59023c05-b679-4e9f-9c06-910021318663,2021-03-29,Eureka Peak,Approximately 100km east of Williams Lake,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'date': '2021-04-01', 'title': 'Overview', '..."
4,10774b2d-b7de-42ac-a600-9828cb4e6129,2021-03-04,Reco Mountain,Approximately 13km east of New Denver,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'date': '2021-03-05', 'title': 'Scene Overvi..."
...,...,...,...,...,...,...,...,...,...
484,101c517b-29a4-4c49-8934-f6c56ddd882d,1840-02-01,Château-Richer,,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,[]
485,b2e1c50a-1533-4145-a1a2-0befca0154d5,1836-02-09,Quebec,more details unknown,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,[]
486,18e8f963-da33-4682-9312-57ca2cc9ad8d,1833-05-24,Carbonear,,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'title': 'Carbonear, May 24, 1833', 'source'..."
487,083d22df-ed50-4687-b9ab-1649960a0fbe,1825-02-04,Saint-Joseph de Lévis,Pointe Lévis,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,[]


In [4]:

# Clean up activity names by collapsing the raw labels into five groups.
# Note: run this cell once; on already-grouped labels, "Mountaineering/Climbing"
# and "Non-Leisure Activities" would fall through to "Other/Unknown".
skiings = ['Skiing', 'Skiing/Snowboarding', 'Snowboarding', 'Backcountry Skiing',
           'Ski touring', 'Heliskiing', 'Mechanized Skiing', 'Out-of-bounds Skiing',
           'Lift Skiing Closed', 'Lift Skiing Open', 'Out-of-Bounds Skiing']
mountaineering_and_climbing = ['Mountaineering', 'Snow Biking', 'Snowshoeing',
                               'Ice Climbing', 'Snowshoeing & Hiking']
snowmobiling = ['Snowmobiling']
non_leisure = ['Work', 'At Outdoor Worksite', 'Control Work', 'Inside Building',
               'Car/Truck on Road', 'Inside Car/Truck on Road', 'Outside Building']
# Listed for reference only: anything not matched above ends up in this group
other_or_unknown = ['Other Recreational', 'Hunting/Fishing', 'Unknown']


def activities_can(s):
    if s in skiings:
        return "Skiing"
    elif s in mountaineering_and_climbing:
        return "Mountaineering/Climbing"
    elif s in snowmobiling:
        return "Snowmobiling"
    elif s in non_leisure:
        return "Non-Leisure Activities"
    else:
        return "Other/Unknown"

incidents['group_activity'] = incidents['group_activity'].apply(activities_can)
incidents['group_activity'].unique()


incidents

Unnamed: 0,id,ob_date,location,location_desc,...,weather_comment,snowpack_obs,snowpack_comment,documents
0,8bc4720d-498c-4793-81ef-c43db9f36ca4,2021-11-27,"Sunshine Bowl, Hasler Area",Approx. 17km East of Powder King ski area,...,"Overcast, windy conditions were reported with ...","{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",A snow profile near the avalanche on the follo...,"[{'date': '2021-11-30', 'title': 'Scene photo'..."
1,6a3a4698-d047-4082-bdea-92f4db7e63bf,2021-05-30,Mount Andromeda-Skyladder,Approximately 96km SE of Jasper,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'date': '2021-06-01', 'title': 'Mt Andromeda..."
2,ba14a125-29f7-4432-97ad-73a53207a5e7,2021-04-05,Haddo Peak,Approximately 6km SW of Lake Louise Village,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'date': '2021-04-05', 'title': 'Overview pho..."
3,59023c05-b679-4e9f-9c06-910021318663,2021-03-29,Eureka Peak,Approximately 100km east of Williams Lake,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'date': '2021-04-01', 'title': 'Overview', '..."
4,10774b2d-b7de-42ac-a600-9828cb4e6129,2021-03-04,Reco Mountain,Approximately 13km east of New Denver,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'date': '2021-03-05', 'title': 'Scene Overvi..."
...,...,...,...,...,...,...,...,...,...
484,101c517b-29a4-4c49-8934-f6c56ddd882d,1840-02-01,Château-Richer,,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,[]
485,b2e1c50a-1533-4145-a1a2-0befca0154d5,1836-02-09,Quebec,more details unknown,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,[]
486,18e8f963-da33-4682-9312-57ca2cc9ad8d,1833-05-24,Carbonear,,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,"[{'title': 'Carbonear, May 24, 1833', 'source'..."
487,083d22df-ed50-4687-b9ab-1649960a0fbe,1825-02-04,Saint-Joseph de Lévis,Pointe Lévis,...,,"{'hs': None, 'hn24': None, 'hst': None, 'hst_r...",,[]


In [5]:
from itertools import chain
# Each incident holds a list of avalanche observations; flatten them into one table.
# pd.DataFrame(chain.from_iterable(incidents.avalanche_obs)).replace(r'^\s*$', float('NaN'), regex=True).dropna()
pd.DataFrame(chain.from_iterable(incidents.avalanche_obs))

Unnamed: 0,size,type,trigger,aspect,elevation,slab_width,slab_thickness,observation_date
0,3.0,S,Ma,NE,1700.0,350.0,60.0,
1,2.5,S,Sa,N,3075.0,60.0,75.0,
2,2.0,S,Sa,E,2950.0,40.0,50.0,
3,2.5,CS,Sa,E,2170.0,50.0,,
4,3.0,S,Ma,W,2465.0,125.0,85.0,
...,...,...,...,...,...,...,...,...
484,,,U,,,,,1800-01-01
485,,,U,,,,,1843-12-18
486,,,U,,,,,1840-02-01
487,,,U,,,,,1836-02-09


## Extracting avalanche accidents in the US 

In [6]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

site = "https://avalanche.org/avalanche-accidents/"

# This is done to prevent 'HTTPError: HTTP Error 403: Forbidden'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)

# Prepare soup to access the source code (name the parser explicitly to avoid a warning)
soup = BeautifulSoup(page, "html.parser")

# Inspect the iframe that holds the tables; its source is the CAIC URL read below
soup.find('div', class_='content-area').iframe

# Read the tables from the iframe source and convert them into dataframes
us_tables = pd.read_html('https://avalanche.state.co.us/caic/acc/acc_us.php', parse_dates=True)

# Only select the useful tables
us_tables = us_tables[1::2]

# Clean the tables and merge them into one single dataframe representing cases in the US
def format_date_col(s, year):
    """
    Clean a date string: remove the dagger sign, replace '/' with '-',
    and prepend the year, e.g. '12/17' with 2021 becomes '2021-12-17'.
    Caution: every row of a table gets the same year, so if a table covers a
    winter season, rows dated after New Year are labelled with the previous year.
    """
    month_day = s.replace('†', '').replace('/', '-')
    return str(year) + '-' + month_day

years = (2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009)
for data, yr in zip(us_tables, years):
    data['Date'] = data['Date'].apply(format_date_col, args=(yr,))

us_incidents = pd.concat(us_tables).reset_index(drop=True)

us_incidents

Unnamed: 0,Date,State,Location,Description,Killed
0,2021-12-17,ID,"Ryan Peak, Idaho",1 skier and 1 snowmobiler killed,2
1,2021-12-11,WA,"Silver Basin, closed portion of Crystal Mounta...",6 backcountry tourers caught and 1 killed,1
2,2020-05-13,AK,"Ruth Glacier, Denali National Park and Preserve","2 climbers caught in serac fall, 1 killed",1
3,2020-03-27,AK,Matanuska Glacier,1 heliskier killed,1
4,2020-03-22,CO,Lime Creek south of Edwards,"2 sidecountry skiers caught, 1 buried and killed",1
...,...,...,...,...,...
269,2009-01-06,CO,Battle Mountain - outside Vail Mountain ski area,"1 snowboader caught, partially buried critical...",1
270,2009-01-03,MT,"Scotch Bonnet Mountain, near Lulu Pass","1 Snowmobiler caught, buried, and killed",1
271,2009-01-02,OR,Near Paulina Peak,"1 Snowmobiler caught, buried, and killed",1
272,2009-12-17,ID,"Rock Lake, Cascade, Idaho","2 snowmobilers caught, buried, 1 rescued, 1 ki...",1
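
The dates in the table above mix December rows with January to May rows under the same year, which suggests that each source table covers one winter season rather than one calendar year. A minimal sketch of a season-aware conversion follows. The October cutoff (`SEASON_START_MONTH`) and the function name `season_date` are assumptions made for illustration; the source page should be checked before relying on them.

```python
from datetime import date

SEASON_START_MONTH = 10  # assumption: a season runs from October to the following September

def season_date(s, season_start_year):
    """Convert a 'MM/DD' string (optionally carrying a dagger) into a date.

    Months before SEASON_START_MONTH are assigned to the year after
    season_start_year, because they fall in the second half of the season.
    """
    month, day = (int(part) for part in s.replace('†', '').strip().split('/'))
    year = season_start_year if month >= SEASON_START_MONTH else season_start_year + 1
    return date(year, month, day)

print(season_date('12/17', 2021))   # 2021-12-17
print(season_date('05/13†', 2020))  # 2021-05-13
```

Returning `date` objects rather than strings also makes later grouping by month or day of season more direct.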


In [7]:
# Preview the free-text descriptions; use .unique() (with parentheses) to list distinct values
us_incidents.Description

0                       1 skier and 1 snowmobiler killed
1              6 backcountry tourers caught and 1 killed
2              2 climbers caught in serac fall, 1 killed
3                                     1 heliskier killed
4       2 sidecountry skiers caught, 1 buried and killed
                             ...                        
269    1 snowboader caught, partially buried critical...
270             1 Snowmobiler caught, buried, and killed
271             1 Snowmobiler caught, buried, and killed
272    2 snowmobilers caught, buried, 1 rescued, 1 ki...
273                           Hyalite Avalanche Fatality
Name: Description, Length: 274, dtype: object

In [47]:
# Scratch test of the tokenizer on a sample sentence
import string
import nltk

# Requires the NLTK 'punkt' and 'stopwords' data (downloaded in the next cell)
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords.union(set(string.punctuation))

txt = "Hello, my name is Omer."
tokens = nltk.tokenize.word_tokenize(txt)
tokens = [token for token in tokens if token not in stopwords]


In [54]:
from bs4 import BeautifulSoup
import nltk
import string
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Remove stopwords (the, a, is, etc)
stopwords = set(nltk.corpus.stopwords.words('english'))
# Remove punctuation too (., !, "", etc)
stopwords = stopwords.union(set(string.punctuation))
# Lemmatize words. With the default part of speech (noun), plurals such as
# "climbers" become "climber", while verb forms such as "killed" are left unchanged.
wnl = nltk.WordNetLemmatizer()

def text_prep(txt):
    soup = BeautifulSoup(txt, "lxml")
    for s in soup('style'):  # remove css
        s.extract()
    txt = soup.text  # remove html tags
    txt = txt.lower()
    tokens = nltk.tokenize.word_tokenize(txt)
    tokens = [token for token in tokens if token not in stopwords]
    tokens = [wnl.lemmatize(token) for token in tokens]
    if len(tokens) == 0:
        tokens = ["EMPTYSTRING"]
    return tokens

us_incidents['Description'].apply(text_prep)

[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jupyter/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0                     [1, skier, 1, snowmobiler, killed]
1            [6, backcountry, tourer, caught, 1, killed]
2           [2, climber, caught, serac, fall, 1, killed]
3                                 [1, heliskier, killed]
4      [2, sidecountry, skier, caught, 1, buried, kil...
                             ...                        
269    [1, snowboader, caught, partially, buried, cri...
270             [1, snowmobiler, caught, buried, killed]
271             [1, snowmobiler, caught, buried, killed]
272    [2, snowmobilers, caught, buried, 1, rescued, ...
273                       [hyalite, avalanche, fatality]
Name: Description, Length: 274, dtype: object
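
The token lists above could be turned into the same activity groups used for the Canadian data by matching keywords. The sketch below is one possible approach, not a finished classifier: the keyword lists and the function name `activities_us` are hypothetical and would need extending after inspecting the real descriptions (for example, the misspelling "snowboader" is not caught). A mixed group such as "1 skier and 1 snowmobiler" is assigned the first category that matches.

```python
# Hypothetical keyword lists, matched as substrings of each token
ACTIVITY_KEYWORDS = {
    "Skiing": ["ski", "snowboard", "tourer", "rider"],
    "Snowmobiling": ["snowmobil"],
    "Mountaineering/Climbing": ["climber", "mountaineer", "hiker", "snowshoer"],
    "Non-Leisure Activities": ["worker", "resident", "motorist"],
}

def activities_us(tokens):
    """Return the first category with a keyword contained in any token."""
    for category, keywords in ACTIVITY_KEYWORDS.items():
        if any(keyword in token for token in tokens for keyword in keywords):
            return category
    return "Other/Unknown"

print(activities_us(["1", "heliskier", "killed"]))
print(activities_us(["2", "climber", "caught", "serac", "fall", "1", "killed"]))
print(activities_us(["hyalite", "avalanche", "fatality"]))
```

Applied to the notebook's data, this would read `us_incidents['Description'].apply(text_prep).apply(activities_us)`.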

## Extracting avalanche accidents in Europe

In [9]:
# Make a list of urls to be read
urls = [
    "https://www.avalanches.org/fatalities/",
    "https://www.avalanches.org/fatalities/fatalities-20/",
    "https://www.avalanches.org/fatalities/fatalities-19/",
]

# Scrape each page and keep its first table
eu_tables = [pd.read_html(url, parse_dates=True)[0] for url in urls]

# Concatenate the tables into a single dataframe
# (each table keeps its own index, so index values repeat; pass ignore_index=True to renumber)
eu_incidents = pd.concat(eu_tables)

eu_incidents

Unnamed: 0,ID,Location,Country,Date,...,Group Size,Avalanche Comment,Incident Comment,Type
0,2782,Mentet,Spain,2021-11-28 00:00:00,...,1.0,"""Destructive Avalanche Size of 2.5""","""Completely buried. Fatal result. Re-analisis ...",Mountaineering/Climbing
1,2781,"Val d\'Ayas, Gran Sommettaz",Italy,2021-11-29 12:05:00,...,,,,Off-piste skiing
2,2783,La Thuille,Italy,2021-12-07 13:09:00,...,3.0,,,Backcountry skiing
3,2785,Monte Sorbetta,Italy,2021-12-16 13:04:00,...,2.0,,,Backcountry skiing
0,1814,Großvenediger,Austria,2020-10-10 00:00:00,...,1.0,,,Mountaineering/Climbing
...,...,...,...,...,...,...,...,...,...
35,149,Mont Brûlé,Switzerland,2020-05-08 12:00:00,...,,,,
36,87,Tofana di Rozes - rifugio Giussani,Italy,2020-05-09 09:30:00,...,2.0,,,Off-piste skiing
37,108,Pizzo del Diavolo/Canalone della Malgina,Italy,2020-05-12 10:15:00,...,1.0,,,Off-piste skiing
38,180,Gråfonnfjellet,Norway,2020-05-24 12:00:00,...,3.0,,,Backcountry skiing


In [10]:
eu_incidents.Type.unique()

array(['Mountaineering/Climbing', 'Off-piste skiing',
       'Backcountry skiing', nan, 'Hiking on foot or snowshoeing',
       'Travelling on road', 'On skiruns', 'Snowmobiling', 'Other'],
      dtype=object)
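
To compare activities across regions, the `Type` values listed above need to be expressed in the same groups that `activities_can` produces for the Canadian data. A minimal sketch of such a mapping follows. The assignment of each EU label to a group is our assumption, chosen to mirror the Canadian lists (for example, snowshoeing sits with mountaineering there), and `activities_eu` is a hypothetical name.

```python
# Assumed correspondence between EAWS "Type" labels and the Canadian activity groups
EU_TYPE_TO_ACTIVITY = {
    "Backcountry skiing": "Skiing",
    "Off-piste skiing": "Skiing",
    "On skiruns": "Skiing",
    "Mountaineering/Climbing": "Mountaineering/Climbing",
    "Hiking on foot or snowshoeing": "Mountaineering/Climbing",
    "Snowmobiling": "Snowmobiling",
    "Travelling on road": "Non-Leisure Activities",
    "Other": "Other/Unknown",
}

def activities_eu(t):
    """Map an EAWS Type value to an activity group; NaN and unseen labels become 'Other/Unknown'."""
    return EU_TYPE_TO_ACTIVITY.get(t, "Other/Unknown")

print(activities_eu("Off-piste skiing"))
print(activities_eu(float("nan")))
```

In the notebook this could be applied as `eu_incidents['group_activity'] = eu_incidents['Type'].apply(activities_eu)`.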

## Questions:
Note: besides Avalanche Canada, the two websites we chose to use are https://avalanche.org/avalanche-accidents/ and https://www.avalanches.org/fatalities/fatalities-20/.
1. In the US accident reports, clicking the link in the Location column brings up a list of avalanche details. We can't figure out how to get those details into columns so that we can compare them. Any advice? We are also struggling to scrape the data behind the "details" link under the date on the EU site, so if you have any tips for this as well, please let us know.



<span style="color:red">
For the US, the links to the details run some JavaScript code. Scraping such pages can sometimes require running a full browser with a JavaScript engine; Selenium is the main tool for doing so. Fortunately, there is a simpler way here.

If you look at the source of https://avalanche.org/avalanche-accidents/ , you see that all the needed information is in an iframe that contains the url https://avalanche.state.co.us/caic/acc/acc_us.php . Viewing the source of that url reveals that the details links just run JavaScript that opens urls like https://avalanche.state.co.us/caic/acc/acc_report.php?acc_id=798&amp;accfm=inv 
</span>

In [139]:
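# get urls of details
import re
accidents = BeautifulSoup(urlopen('https://avalanche.state.co.us/caic/acc/acc_us.php').read(), "html.parser")
# Raw string so that \( reaches the regex engine as a literal parenthesis
reporturls = re.findall(r"win=window.open\('([^']+)'", accidents.prettify())
# Entry 50 is dropped by hand (the reason is not recorded in this notebook)
del reporturls[50]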

In [133]:
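def getaccidentdetails(url):
    """Scrape one accident report page into a dict of label/value pairs."""
    url = url.replace('&amp;', '&')  # the page source HTML-escapes the ampersands
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    details = dict()
    for item in soup.find_all("li", class_="acc_rep_list"):
        subitems = item.find_all("li")
        # Use the nested <li> entries when present, otherwise the item itself
        for entry in (subitems if len(subitems) > 0 else [item]):
            # Split on the first colon followed by a space or non-breaking space
            s = re.split(r":[\xa0 ]", entry.text, maxsplit=1)
            if len(s) == 2:  # skip entries without a "label: value" shape
                details[s[0]] = s[1]
    return details

# you might want to organize this differently for throttling and/or caching
details = pd.DataFrame([getaccidentdetails(url) for url in reporturls])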
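
One way to add the throttling and caching mentioned in the comment above is sketched below. It is a sketch under stated assumptions rather than a tested scraper: `fetch_all_cached`, the cache file name `us_details_cache.json`, and the one-second delay are all hypothetical choices. The scraping function is passed in as `fetch`, so the helper itself needs no network access.

```python
import json
import os
import time

def fetch_all_cached(urls, fetch, cache_path="us_details_cache.json", delay=1.0):
    """Call fetch(url) for each URL not yet in the JSON cache, pausing between requests."""
    cache = {}
    if os.path.isfile(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    for url in urls:
        if url in cache:
            continue  # already downloaded on a previous run
        cache[url] = fetch(url)
        # Save after every request so that an interrupted run loses nothing
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
        time.sleep(delay)  # throttle: pause before the next request
    return [cache[url] for url in urls]
```

With the function from this notebook, the call would be `details = pd.DataFrame(fetch_all_cached(reporturls, getaccidentdetails))`. Re-running the cell then only downloads reports that are not yet in the cache file.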


In [140]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
  display(details.head())
details.to_csv('us_cases_extended.csv')

Unnamed: 0,Location,State,Time,Summary Description,Primary Activity,Primary Travel Mode,Location Setting,Caught,"Partially Buried, Non-Critical","Partially Buried, Critical",Fully Buried,Injured,Killed,Type,Trigger,Trigger (subcode),Size - Relative to Path,Size - Destructive Force,Sliding Surface,Slope Aspect,Site Elevation,Slope Angle,Slope Characteristic
0,"Ryan Peak, Idaho",Idaho,\nUnknown\n,1 skier and 1 snowmobiler killed,Hybrid Rider,Snowmobile,--,0,0.0,0.0,0,0,2,--,--,--,--,--,--,--,--,--,--
1,"Silver Basin, closed portion of Crystal Mounta...",Washington,,6 backcountry tourers caught and 1 killed,Backcountry Tourer,Ski,,0,,,3,0,1,SS,AS - Skier,u - An unintentional release,R3,D2.5,--,NE,6600 ft,35 °,--
2,"Ruth Glacier, Denali National Park and Preserve",Alaska,\n5:00 AM\n(Estimated),"2 climbers caught in serac fall, 1 killed",Climber,Foot,,2,,,0,1,1,I,N - Natural,--,--,--,--,--,--,--,--
3,Matanuska Glacier,Alaska,\nUnknown\n,1 heliskier killed,Mechanised Guiding Client,Ski,,0,,,0,0,1,--,--,--,--,--,--,--,--,--,--
4,Lime Creek south of Edwards,Colorado,\n2:00 PM\n(Estimated),"2 sidecountry skiers caught, 1 buried and killed",Sidecountry Rider,Ski,Accessed BC from Ski Area,2,0.0,0.0,1,0,1,SS,AS - Skier,u - An unintentional release,R1,D2,G - At Ground/Ice/Firm,NW,9763 ft,45 °,Gully/Couloir




2. Currently we can't get data similar to incidents.avalanche_obs in the Canadian data set from the other two data sets. That leaves us with only around 150 data points, many of them with missing values. This question is two-fold. Is it still possible to run a prediction model on this smaller amount of data, and if so, which would be best? Additionally, if you have any recommendations for getting similar data from the other two sources, please let us know.
3. Which of the items in the plan below should we include to ensure full marks on this project, and do you have any suggestions for improvement?

4. Is there any specific format we need our citations in?

## Plan:

- Investigate the frequency of accidents per type of activity
    - try to use Beautiful Soup to rename values in the Description column of the US data if we can't get the avalanche details from question 1 sorted
    - use Beautiful Soup to rename values so that the Type column in the EU data matches the activity column in the Canadian data
    - create a bar graph
- Investigate avalanche tendencies by date: see whether more fatal avalanches occur later in the ski season than at the start
    - plot the number of deaths as a function of the date, with the year removed from the date

- Compare the number of deaths, or the probability of a fatal incident, between activities or avalanche sizes (depending on how the data look)
    
- Prediction analysis:
    - use a prediction technique (likely random forests, but we will follow the answer to question 2)
    - as of now we will only use Avalanche Canada data, unless we can scrape the US and EU data with the same variables, which we currently don't know how to do
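
For the date analysis in the plan, removing the year while keeping December next to January requires counting from a season start rather than from 1 January. A minimal sketch follows. The 1 October start (`SEASON_START`) and the function name `day_of_season` are assumptions made for illustration.

```python
from datetime import date

SEASON_START = (10, 1)  # assumption: count days from 1 October

def day_of_season(d):
    """Number of days between d and the most recent 1 October (0 on 1 October itself)."""
    start_year = d.year if (d.month, d.day) >= SEASON_START else d.year - 1
    return (d - date(start_year, *SEASON_START)).days

print(day_of_season(date(2021, 10, 1)))   # 0
print(day_of_season(date(2021, 12, 17)))  # 77
print(day_of_season(date(2021, 3, 4)))    # 154
```

Summing deaths within bins of this value (for example, by week) gives a seasonal profile that pools all years.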

<span style="color:red">
2. You can run a prediction model with a smaller amount of data. Any method should be okay, but you will need to select less complex models (i.e. fewer features / bigger penalty for lasso, less deep trees). Cross-validation should do this automatically.
<br>
3. On the one hand, the prediction analysis would use the most ideas from the course. On the other hand, the other three bullets seem to be of more practical use. Both can lead to full marks. Do whatever you find more interesting. <br>
I know this is a draft, but for the final version, be sure to add markdown cells between the code saying what you're doing and why.
<br>
4. I'm happy with anything that includes a url. If there is no url, be sure to have the author(s), title, journal (if applicable), and year.
</span>
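
The advice on small samples can be illustrated with cross-validation choosing the tree depth. The sketch below uses synthetic data as a stand-in for roughly 150 incidents, so it shows only the mechanics and says nothing about the avalanche data. It assumes scikit-learn is installed, which this notebook does not otherwise use, and the grid values are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: 150 rows, 5 numeric features, binary outcome (not real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=150) > 0).astype(int)

# Candidate complexities: shallow trees and larger leaves are the simpler models
param_grid = {"max_depth": [1, 2, 3, 5], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The selected `max_depth` and `min_samples_leaf` are whichever combination scores best on the held-out folds, which is how cross-validation picks the model complexity automatically.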