## Bergfex Webscraping
<b> Milestone 2</b> 

This notebook scrapes snow level data from the meteocentrale.ch website and builds a Pandas dataframe holding the snow level for 91 weather stations around Switzerland.
The code uses BeautifulSoup to parse the HTML. It iterates over the matching HTML tags and collects the corresponding text in lists. The data is then cleaned.
Finally, we add the GPS coordinates of each weather station.

### Installations

In [1]:
# uncomment and run the line below once
#conda install -c anaconda beautifulsoup4

### Imports

In [2]:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Still to do:

- DONE: check vs. NB III if anything missing
- wip: go through comments
- add coordinates part for snow points
- potentially check out other weather data

# 1 Data Scraping: snow information

In [3]:
# Initializing the future columns of our dataframe with empty lists

snowlevel = []  # height in cm
location = []  # town and elevation of town

# Only scrape one page (no looping over several pages necessary as not so many data points available)
link = 'http://www.meteocentrale.ch/de/wetter/hitlisten/schneehoehen.html'
page = requests.get(link, timeout=10)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")  # bs4.BeautifulSoup object
hitlist = soup.find_all('table', {'class': 'hitlist'})  # bs4.element.ResultSet

# A status code of 200 means the request succeeded

200


# 2 Extracting the necessary information
Village Name, Elevation, Snow Level

In [4]:
# Get location name
location_item = hitlist[0].findAll('a')
location.append([info.get_text().strip() for info in location_item])
loc = location[0]

# Get snowlevel
snowlevel_item = hitlist[0].findAll('td', {'class': 'value'}) 
snowlevel.append([info.get_text().strip() for info in snowlevel_item])
snow = snowlevel[0]

# Combine into DF
heights_df = pd.DataFrame({'location': loc,'snowlevel': snow})
heights_df

| | location | snowlevel |
|---|---|---|
| 0 | Grimsel-Hospiz, 1980 m | 309 cm |
| 1 | Glattalp, 1858 m | 288 cm |
| 2 | Weissfluhjoch, 2690 m | 249 cm |
| 3 | Gütsch/Andermatt, 2282 m | 162 cm |
| 4 | Schwägalp, 1350 m | 118 cm |
| ... | ... | ... |
| 87 | Zermatt, 1638 m | 0 cm |
| 88 | Zollikofen, 553 m | 0 cm |
| 89 | Zürich-Affoltern, 443 m | 0 cm |
| 90 | Zürich-Flughafen, 432 m | 0 cm |
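
The extraction step can also be tried offline on a small HTML fragment. The sketch below is only an illustration: `sample_html` is a hypothetical fragment shaped like the table the scraper expects (a `table` with class `hitlist`, station links in `a` tags, values in `td` cells with class `value`), and the real page markup may differ. It walks the table row by row, so a location and its value always come from the same row.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical fragment mimicking the assumed structure of the hitlist table
sample_html = """
<table class="hitlist">
  <tr><th>Station</th><th>Schneehoehe</th></tr>
  <tr><td><a href="#">Grimsel-Hospiz, 1980 m</a></td><td class="value">309 cm</td></tr>
  <tr><td><a href="#">Glattalp, 1858 m</a></td><td class="value">288 cm</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"class": "hitlist"})

rows = []
for tr in table.find_all("tr"):
    link = tr.find("a")
    value = tr.find("td", {"class": "value"})
    # Skip header rows or rows that lack either piece of information
    if link is None or value is None:
        continue
    rows.append({
        "location": link.get_text().strip(),
        "snowlevel": value.get_text().strip(),
    })

sample_df = pd.DataFrame(rows)
```

Pairing the cells per row avoids a length mismatch between the two lists if the table contains a row without a link or without a value.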


# 3 Data Cleaning

## Clean numerical data

In [5]:
# Remove the ' cm' unit and convert to int
heights_df['snowlevel_in_cm'] = heights_df['snowlevel'].str.replace(' cm', '', regex=False).astype(int)

# Split location into 'village' and 'elevation of village' (one split at the comma) and merge with previous DF
split_loc = heights_df['location'].str.split(',', n=1, expand=True)
merged_df = pd.merge(heights_df, split_loc, left_index=True, right_index=True)

# Drop unused columns, rename final columns, remove unit and surrounding whitespace, sort columns
intermediate_df = merged_df.iloc[:, [2, 3, 4]].copy()
intermediate_df.columns = ['snowlevel_in_cm', 'location', 'height_in_m']
intermediate_df['height_in_m'] = intermediate_df['height_in_m'].str.replace(' m', '', regex=False).str.strip()
snow_level_df = intermediate_df[['location', 'height_in_m', 'snowlevel_in_cm']]
snow_level_df

| | location | height_in_m | snowlevel_in_cm |
|---|---|---|---|
| 0 | Grimsel-Hospiz | 1980 | 309 |
| 1 | Glattalp | 1858 | 288 |
| 2 | Weissfluhjoch | 2690 | 249 |
| 3 | Gütsch/Andermatt | 2282 | 162 |
| 4 | Schwägalp | 1350 | 118 |
| ... | ... | ... | ... |
| 87 | Zermatt | 1638 | 0 |
| 88 | Zollikofen | 553 | 0 |
| 89 | Zürich-Affoltern | 443 | 0 |
| 90 | Zürich-Flughafen | 432 | 0 |
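
As an alternative to splitting and replacing, the same cleaning can be sketched with regular expressions. This is a minimal example on two rows taken from the table above, assuming every location has the form "name, elevation m" and every snow level the form "value cm"; the names `raw`, `parts` and `cleaned` are illustrative only.

```python
import pandas as pd

# Two sample rows in the raw scraped format
raw = pd.DataFrame({
    "location": ["Grimsel-Hospiz, 1980 m", "Zermatt, 1638 m"],
    "snowlevel": ["309 cm", "0 cm"],
})

# Named groups become column names: the station name is everything before
# the last comma, the elevation is the run of digits before the trailing "m"
parts = raw["location"].str.extract(
    r"^(?P<location>.+),\s*(?P<height_in_m>\d+)\s*m$"
)

cleaned = parts.assign(
    height_in_m=parts["height_in_m"].astype(int),
    snowlevel_in_cm=raw["snowlevel"].str.extract(r"(\d+)", expand=False).astype(int),
)
```

Rows that do not match the assumed pattern come back as missing values from `str.extract`, so the integer conversion would fail on them rather than pass silently.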


## Write CSV

In [6]:
# change the file_path to your path if necessary (keep the trailing slash)
file_path = '../data/'
snow_level_df.to_csv(file_path + 'snow_level_test.csv', index=False)
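
Building the output path with `pathlib` avoids depending on a trailing slash in the folder name. A minimal sketch, where `data_dir` and `csv_path` are illustrative names and `../data` is the folder used in this notebook:

```python
from pathlib import Path

# Folder that holds the project data; adjust to your local layout
data_dir = Path("../data")

# The / operator joins path components with the correct separator
csv_path = data_dir / "snow_level_test.csv"
```

`DataFrame.to_csv` accepts a `Path` object, so `snow_level_df.to_csv(csv_path, index=False)` would write to the same file.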

# 4 Adding Coordinates

## Matching the snow level observation station with GPS coordinates

We need the coordinates of each snow observation station, for which we only have the name of the locality.

We use two sources (all coordinates are given in EPSG:4326/WGS84):
1. The World Cities Database, which contains most large cities in Switzerland along with geographic information, saved as swiss_cities.csv
2. Ski resort coordinates, for the smaller but more relevant localities in mountainous areas. These were collected from Wikipedia and skiresort.info and saved as ski.csv

In [None]:
# cleaning of dataframes: this step needs to be adapted to data sources and csv content. 
# Here the files were overwritten with the cleaned up version, so no need to run the following steps

filepath = '../data/' # change to your local folder

cities = pd.read_csv(filepath + 'swiss_cities.csv')
ski_resorts = pd.read_csv(filepath + 'ski.csv')

# Drop unused columns and harmonise the column names of both tables
cities = cities.drop(columns=['iso2', 'country', 'capital', 'population', 'population_proper', 'admin_name'])
ski_resorts = ski_resorts.rename(columns={'X': 'lng', 'Y': 'lat', 'Name': 'city'})
ski_resorts = ski_resorts.drop(columns=['description'])
ski_resorts = ski_resorts[['city', 'lat', 'lng']]


# merging the two dataframes

swiss_cities = pd.concat([cities, ski_resorts], ignore_index=True)


In [None]:
swiss_cities.info() # 146 rows

In [None]:
# saving the dataframe as city_coord
swiss_cities.to_csv(filepath +'city_coord.csv', index = False) # has city, lat long in it
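
The matching step itself is not shown in this notebook. One possible way to attach coordinates to the stations is a left join on the locality name, sketched below. It is only an illustration under stated assumptions: `snow`, `coords` and `matched` are hypothetical names, the coordinate values are rough illustrative figures rather than entries from the CSV files, and the join assumes that station names and city names match exactly.

```python
import pandas as pd

# Stations in the cleaned snow-level format (rows taken from the table above)
snow = pd.DataFrame({
    "location": ["Zermatt", "Zollikofen", "Glattalp"],
    "height_in_m": [1638, 553, 1858],
    "snowlevel_in_cm": [0, 0, 288],
})

# Coordinate table in the city/lat/lng layout (rough illustrative values)
coords = pd.DataFrame({
    "city": ["Zermatt", "Zollikofen"],
    "lat": [46.02, 47.00],
    "lng": [7.75, 7.46],
})

# A left join keeps every station; indicator=True adds a _merge column
# that records whether a coordinate row was found for it
matched = snow.merge(
    coords, how="left", left_on="location", right_on="city", indicator=True
)

# Stations without coordinates, to be looked up or fixed by hand
unmatched = matched.loc[matched["_merge"] == "left_only", "location"].tolist()

matched = matched.drop(columns=["city", "_merge"])
```

Listing the unmatched stations makes it easy to see which names need manual attention, for example compound names such as "Gütsch/Andermatt" that are unlikely to appear verbatim in a city list.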

## Write CSV

In [7]:
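# change the file_path to your path if necessary
file_path = '../data/'
# snow_level_coord (stations matched with coordinates) must be created first; it is not defined in this notebook
snow_level_coord.to_csv(file_path + 'snow_coordinates_test.csv', index=False)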

NameError: name 'snow_level_coord' is not defined

# CONTINUE IN NOTEBOOK III