<html>
<h1> Bergfex web scraping
</h1>
<body>We gather information about outdoor sport itineraries in the Bern region from <a href='https://www.bergfex.com/sommer/bern-region/touren/?isAjax=1&page=1'> Bergfex.com </a>.<br />
    - The website displays 20 tours per page, with details such as length, type of sport, and ratings. <br />
    - Most of the information is provided by users.<br />
<br />
We then clean the data and run a few simple analyses.
</body>
</html>

<html>
<h2> Part 1: Data scraping
</h2>

</html>

In [1]:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

In [2]:
# initializing the future columns of our dataframe with empty lists

title = []  # Title of the tour
difficulty = []  # difficulty (easy, medium, hard)
sport = []  # Sport type (hiking, sledging, snowshoe...)
length = []  # length in km
time = []  # tour time in hours:minutes
climb = []  # positive elevation climb in m
minmax = []  # minimum and maximum altitude of the tour in m
technique = []  # technique difficulty ratings (out of 6)
fitness = []  # fitness difficulty ratings (out of 6)
total_title = [] # will help extracting the ID of the tour

# all ratings are stored together so we will need this along the way:
rating = []  # list to store technique and fitness rating info

In [3]:
# looping over the first 20 pages of Bergfex
# each separate tour on the page is framed by a div tag with class 'touren-detail'

page_number = 20  # number of pages we want to scrape

for p in range(1, (page_number+1)):
    base_link = 'https://www.bergfex.com/sommer/bern-region/touren/?isAjax=1&page='
    link = base_link+str(p)  # going over p pages with numbers appended to the base link
    page = requests.get(link, timeout=5)
    print("scraped page", p)
    soup = BeautifulSoup(page.content, "html.parser")  # bs4.BeautifulSoup object
    tours = soup.findAll('div', {'class': 'touren-detail'})  # checks for the separate tours on the page

    # For each page, we loop over each tour and fill our lists with info
    for i in range(len(tours)):
        tour_1 = tours[i]  # the i-th tour on this page (20 tours per page)

        tour_title = tour_1.findAll('a' ) # this gets the title out
        title.append([info.get_text().strip() for info in tour_title])
        total_title.append(tour_title) # this gets the full tourtitle information
        
        tour_diff = tour_1.findAll('span', {'class': 'tour-difficulty'}) #tour level difficulty
        difficulty.append([info.get_text().strip() for info in tour_diff])

        tour_type = tour_1.findAll('span', {'class': 'tour-type'})  # sport type (hiking, trailrunning, cycling, ...)
        sport.append([info.get_text().strip() for info in tour_type])

        tour_stats = tour_1.findAll('div', {'class': 'tour-stats'})  # 4 values: length, time, climb, min/max altitude
        stat_text = [info.get_text().strip() for info in tour_stats]
        length.append(stat_text[0])
        time.append(stat_text[1])
        climb.append(stat_text[2])
        minmax.append(stat_text[3])
        
        tour_rating = tour_1.find_all("div", {'class': 'tour-rating'})  # getting the rating data
        rating.append(list(tour_rating))  # the rating is encoded in a class name, so get_text() would not return it

scraped page 1
scraped page 2
scraped page 3
scraped page 4
scraped page 5
scraped page 6
scraped page 7
scraped page 8
scraped page 9
scraped page 10
scraped page 11
scraped page 12
scraped page 13
scraped page 14
scraped page 15
scraped page 16
scraped page 17
scraped page 18
scraped page 19
scraped page 20


In [4]:
print("number of tours collected:", len(title))

number of tours collected: 400


In [5]:
### Get the titles and IDs into a list and then a DF
links_title = []

for i in total_title:
    i = str(i)
    i = i.strip('[]')
    print(i)
    links_title.append(i)

id_df3 = pd.DataFrame(links_title)
#df_test

<a class="h2" href="/sommer/bern-region/touren/trailrunning/310007,trailsummit-chrinnenhorn-t3/" title="Trailsummit Chrinnenhorn T3">Trailsummit Chrinnenhorn T3</a>
<a class="h2" href="/sommer/bern-region/touren/wanderung/49274,aus-dem-diemtigtal-zum-seebergsee/" title="Aus dem Diemtigtal zum Seebergsee">Aus dem Diemtigtal zum Seebergsee</a>
<a class="h2" href="/sommer/bern-region/touren/wanderung/131936,panoramaweg-hasliberg/" title="Panoramaweg Hasliberg">Panoramaweg Hasliberg</a>
<a class="h2" href="/sommer/bern-region/touren/wanderung/131157,lauterbrunnen--muerren/" title="Lauterbrunnen - Mürren">Lauterbrunnen - Mürren</a>
<a class="h2" href="/sommer/bern-region/touren/wanderung/81701,von-der-saane-zum-denkmal-der-schlacht-von-laupen/" title="Von der Saane zum Denkmal der Schlacht von Laupen">Von der Saane zum Denkmal der Schlacht von Laupen</a>
<a class="h2" href="/sommer/bern-region/touren/mountainbike/133916,habkern--waldegg--beatenberg/" title="Habkern - Waldegg - Beatenberg">H

In [6]:
### Strip out the ID: after the three splits it sits in column 5 as a string
id_df2 = id_df3[0].str.split(' ',expand = True)
id_df1 = id_df2[2].str.split(',',expand = True)
id_df = id_df1[0].str.split('/',expand = True)
id_df[5][0]

## ---> id_df is holding the IDs

'310007'
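
The three chained splits above depend on where the `href` attribute happens to sit in the tag string. As a hedged alternative, the ID could be pulled from the link path with a single regular expression. This is a minimal sketch: the helper name `extract_tour_id` is hypothetical, and it assumes every tour link ends in `/<numeric id>,<slug>/` as in the printed output above.

```python
import re

# Assumption: tour links end in "/<numeric id>,<slug>/", as in the scraped output
TOUR_ID_PATTERN = re.compile(r"/(\d+),[^/]*/?$")

def extract_tour_id(href):
    """Return the numeric tour ID as a string, or None if the link has no ID."""
    match = TOUR_ID_PATTERN.search(href)
    return match.group(1) if match else None

href = "/sommer/bern-region/touren/trailrunning/310007,trailsummit-chrinnenhorn-t3/"
print(extract_tour_id(href))
```

Returning `None` for links without an ID makes broken entries easy to spot before the download step.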

In [7]:
# Dealing with the ratings for fitness and technique
# the rating value is embedded in a class name (e.g. 'rating-5'), so we extract it from the tag markup
rating[0]

[<div class="tour-rating">
 <div class="tour-rating-label">Technique</div>
 <div class="rating-circles rating-max6"><div class="rating-5"></div></div>
 </div>,
 <div class="tour-rating">
 <div class="tour-rating-label">Fitness</div>
 <div class="rating-circles rating-max6"><div class="rating-5"></div></div>
 </div>]

In [8]:
# converting into string
tour_rating_str = str(rating)

# taking out unnecessary info - each second rating is Technique or Fitness, starting with technique
rating_all_short = tour_rating_str.replace('<div class="tour-rating">\n<div class="tour-rating-label">Technique</div>\n<div class="rating-circles rating-max6"><div class="','').replace('<div class="tour-rating">\n<div class="tour-rating-label">Fitness</div>\n<div class="rating-circles rating-max6"><div class="','').replace('"></div></div>\n</div>',"")
rating_even_shorter = rating_all_short.replace('[','').replace(']','')


## splitting into lists
rating_list = rating_even_shorter.split(", ")

In [9]:
# filling the lists technique = [] and fitness = []  we defined in the beginning

for i, item in enumerate(rating_list):
    if i % 2 == 0:
        technique.append(item)  # items 0, 2, 4, ... are technique ratings
    else:
        fitness.append(item)  # the other items go into fitness
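
The rating is stored in a class name such as `rating-5`, and an unrated tour yields `rating-` with no digit. The sketch below shows one possible way to turn these class names into numbers directly. The helper name `parse_rating` and the sample list are illustrative assumptions, not part of the notebook.

```python
import re

def parse_rating(class_name):
    """Turn a class name such as 'rating-5' into an int; None if no digit follows."""
    match = re.fullmatch(r"rating-(\d+)", class_name.strip())
    return int(match.group(1)) if match else None

# Technique and fitness alternate, technique first (illustrative sample values)
rating_classes = ["rating-5", "rating-5", "rating-", "rating-0"]
technique_ratings = [parse_rating(c) for c in rating_classes[0::2]]
fitness_ratings = [parse_rating(c) for c in rating_classes[1::2]]
print(technique_ratings, fitness_ratings)
```

Using `None` keeps "no rating given" distinct from an explicit rating of 0; whether that distinction matters depends on the later analysis.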

In [10]:
# Forming a dataframe from our lists as columns:

hikes_df = pd.DataFrame(
    {'title': title,
     'difficulty': difficulty,
     'sport': sport,
     'length': length,
     'time': time,
     'climb': climb,
     'minmax': minmax,
     'technique': technique,
     'fitness': fitness
    })

In [11]:
# First draft dataframe: still to fix are the brackets, the units, and the combined minmax column
hikes_df

Unnamed: 0,title,difficulty,sport,length,time,climb,minmax,technique,fitness
0,[Trailsummit Chrinnenhorn T3],[difficult],[Trailrunning],10.06km,02:49h,"1,755hm","1,039 - 2,740m",rating-5,rating-5
1,[Aus dem Diemtigtal zum Seebergsee],[medium],[Hiking],18.63km,06:30h,"1,023hm","1,081 - 1,929m",rating-3,rating-4
2,[Panoramaweg Hasliberg],[medium],[Hiking],8.6km,03:00h,355hm,"979 - 1,263m",rating-2,rating-3
3,[Lauterbrunnen - Mürren],[medium],[Hiking],5.11km,02:30h,846hm,"798 - 1,644m",rating-2,rating-3
4,[Von der Saane zum Denkmal der Schlacht von La...,[],[Hiking],14.11km,03:40h,212hm,482 - 649m,rating-,rating-0
...,...,...,...,...,...,...,...,...,...
395,[Von der Elsigenalp auf das Elsighorn],[medium],[Hiking],10.55km,04:30h,833hm,"1,777 - 2,333m",rating-2,rating-3
396,[In der Auenlandschaft der Aare bei Münsingen],[],[Hiking],11.89km,02:50h,97hm,515 - 565m,rating-,rating-0
397,[Saanen - Saanenmöser - Oeschseite - Zweisimmen],[medium],[Cycling],17.36km,01:30h,270hm,"941 - 1,280m",rating-3,rating-2
398,[Schynige Platte - Oberberghorn - Schynige Pla...,[easy],[Hiking],3.06km,01:15h,179hm,"1,940 - 2,066m",rating-2,rating-2


<html>
<h2> Part 2: Data cleaning
</h2>
<body></body>
</html>

In [12]:
# Converting the text columns to strings for easier handling later

hikes_df[['title', 'difficulty', 'sport']] = hikes_df[['title', 'difficulty', 'sport']].astype('str')

In [13]:
# Removing the brackets

hikes_df['title'] = hikes_df['title'].str.strip('[]')
hikes_df['difficulty'] = hikes_df['difficulty'].str.strip('[]')
hikes_df['sport'] = hikes_df['sport'].str.strip('[]')

In [14]:
# astype('str') stored each list in its printed form, e.g. "['Hiking']",
# so quote characters remain after stripping the brackets; removing those as well

hikes_df['title'] = hikes_df['title'].str.replace("'", '')
hikes_df['difficulty'] = hikes_df['difficulty'].str.replace("'", '')
hikes_df['sport'] = hikes_df['sport'].str.replace("'", '')

In [15]:
# The "string-cleaned" dataframe
hikes_df

Unnamed: 0,title,difficulty,sport,length,time,climb,minmax,technique,fitness
0,Trailsummit Chrinnenhorn T3,difficult,Trailrunning,10.06km,02:49h,"1,755hm","1,039 - 2,740m",rating-5,rating-5
1,Aus dem Diemtigtal zum Seebergsee,medium,Hiking,18.63km,06:30h,"1,023hm","1,081 - 1,929m",rating-3,rating-4
2,Panoramaweg Hasliberg,medium,Hiking,8.6km,03:00h,355hm,"979 - 1,263m",rating-2,rating-3
3,Lauterbrunnen - Mürren,medium,Hiking,5.11km,02:30h,846hm,"798 - 1,644m",rating-2,rating-3
4,Von der Saane zum Denkmal der Schlacht von Laupen,,Hiking,14.11km,03:40h,212hm,482 - 649m,rating-,rating-0
...,...,...,...,...,...,...,...,...,...
395,Von der Elsigenalp auf das Elsighorn,medium,Hiking,10.55km,04:30h,833hm,"1,777 - 2,333m",rating-2,rating-3
396,In der Auenlandschaft der Aare bei Münsingen,,Hiking,11.89km,02:50h,97hm,515 - 565m,rating-,rating-0
397,Saanen - Saanenmöser - Oeschseite - Zweisimmen,medium,Cycling,17.36km,01:30h,270hm,"941 - 1,280m",rating-3,rating-2
398,Schynige Platte - Oberberghorn - Schynige Platte,easy,Hiking,3.06km,01:15h,179hm,"1,940 - 2,066m",rating-2,rating-2


In [16]:
# getting rid of the units (they're always the same anyway)

hikes_df['length'] = hikes_df['length'].str.replace("km", '')
hikes_df['time'] = hikes_df['time'].str.replace("h", '')
hikes_df['climb'] = hikes_df['climb'].str.replace("hm", '')
hikes_df['minmax'] = hikes_df['minmax'].str.replace("m", '')

In [17]:
# splitting minmax into two columns (min and max), and removing the old minmax column

hikes_df[['min','max']] = hikes_df['minmax'].str.split("-",expand=True)
hikes_df = hikes_df.drop(columns=['minmax'])

# moving the ratings back to the end
hikes_df = hikes_df[['title', 'difficulty', 'sport', 'length', 'time', 'climb', 'min', 'max', 'technique', 'fitness']]


In [18]:
# The clean dataframe in string format

hikes_df

Unnamed: 0,title,difficulty,sport,length,time,climb,min,max,technique,fitness
0,Trailsummit Chrinnenhorn T3,difficult,Trailrunning,10.06,02:49,1755,1039,2740,rating-5,rating-5
1,Aus dem Diemtigtal zum Seebergsee,medium,Hiking,18.63,06:30,1023,1081,1929,rating-3,rating-4
2,Panoramaweg Hasliberg,medium,Hiking,8.6,03:00,355,979,1263,rating-2,rating-3
3,Lauterbrunnen - Mürren,medium,Hiking,5.11,02:30,846,798,1644,rating-2,rating-3
4,Von der Saane zum Denkmal der Schlacht von Laupen,,Hiking,14.11,03:40,212,482,649,rating-,rating-0
...,...,...,...,...,...,...,...,...,...,...
395,Von der Elsigenalp auf das Elsighorn,medium,Hiking,10.55,04:30,833,1777,2333,rating-2,rating-3
396,In der Auenlandschaft der Aare bei Münsingen,,Hiking,11.89,02:50,97,515,565,rating-,rating-0
397,Saanen - Saanenmöser - Oeschseite - Zweisimmen,medium,Cycling,17.36,01:30,270,941,1280,rating-3,rating-2
398,Schynige Platte - Oberberghorn - Schynige Platte,easy,Hiking,3.06,01:15,179,1940,2066,rating-2,rating-2


In [19]:
# converting to number data types:

# length
hikes_df['length'] = (hikes_df['length']).astype(float)

# climb
# missing values are shown as "-"; we replace them with 0 so the column converts to int
# (NaN would keep them distinguishable, but a plain int column cannot hold NaN)
hikes_df['climb'] = hikes_df['climb'].str.replace(",", '')  # remove thousands separator
hikes_df['climb'] = hikes_df['climb'].str.replace("-", '0')  # replace - by 0
hikes_df['climb'] = hikes_df['climb'].astype(int)


In [20]:
# min and max
hikes_df['min'] = hikes_df['min'].str.replace(",", '').astype(int)  # remove thousands separator
hikes_df['max'] = hikes_df['max'].str.replace(",", '').astype(int)  # remove thousands separator
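
The unit and separator clean-up for the integer columns is spread over several cells. As a rough sketch, the same parsing could be collected in two small helpers that work on the raw scraped strings. The names `parse_int` and `parse_min_max` are hypothetical, and the helpers assume the formats seen in the draft dataframe (`1,755hm`, `1,039 - 2,740m`, `-` for a missing value).

```python
import re

def parse_int(text):
    """Parse an integer such as '1,755hm' or '97hm'; return None if there are no digits."""
    digits = re.sub(r"\D", "", text)  # drop units, spaces and thousands separators
    return int(digits) if digits else None

def parse_min_max(text):
    """Split an altitude range such as '1,039 - 2,740m' into (1039, 2740)."""
    low, high = text.split("-")
    return parse_int(low), parse_int(high)

print(parse_int("1,755hm"))
print(parse_min_max("1,039 - 2,740m"))
print(parse_int("-"))
```

These helpers only suit integer columns: applied to a decimal value such as `10.06km`, `parse_int` would drop the decimal point, so `length` still needs its own float conversion.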

In [21]:
# converting times to time objects

# '%H' only accepts hours 00-23, so a tour of 24 h or more raises a ValueError;
# in that case the column is left as strings
try:
    hikes_df['time'] = pd.to_datetime(hikes_df['time'], format='%H:%M').dt.time  # duration
except ValueError:
    pass
# TODO: drop the seconds from the displayed times
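
A tour time is a duration rather than a time of day, so one option is to store it as a `timedelta`, which has no 24-hour limit. The following is a minimal sketch using only the standard library; the helper name `parse_duration` is hypothetical, and it assumes every value has the form `HH:MM` once the `h` unit is removed.

```python
from datetime import timedelta

def parse_duration(text):
    """Convert 'HH:MM' (hours may exceed 23) into a timedelta."""
    hours, minutes = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))

short_tour = parse_duration("02:49")
long_tour = parse_duration("26:30")  # hypothetical multi-day tour
print(short_tour, long_tour.total_seconds() / 3600)
```

Durations stored this way can be summed, averaged, or converted to hours for plotting.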

In [22]:
## rating technique
hikes_df['technique'] = hikes_df['technique'].str.replace("rating-", '')
hikes_df['technique'] = pd.to_numeric(hikes_df['technique'], errors='coerce')  # empty rating -> NaN
hikes_df['technique'] = hikes_df['technique'].fillna(0).astype('int')  # missing rating -> 0

## rating fitness
hikes_df['fitness'] = hikes_df['fitness'].str.replace("rating-", '')
hikes_df['fitness'] = pd.to_numeric(hikes_df['fitness'], errors='coerce')  # empty rating -> NaN
hikes_df['fitness'] = hikes_df['fitness'].fillna(0).astype('int')  # missing rating -> 0

In [23]:
# Finally the clean dataframe 

hikes_df

Unnamed: 0,title,difficulty,sport,length,time,climb,min,max,technique,fitness
0,Trailsummit Chrinnenhorn T3,difficult,Trailrunning,10.06,02:49,1755,1039,2740,5,5
1,Aus dem Diemtigtal zum Seebergsee,medium,Hiking,18.63,06:30,1023,1081,1929,3,4
2,Panoramaweg Hasliberg,medium,Hiking,8.60,03:00,355,979,1263,2,3
3,Lauterbrunnen - Mürren,medium,Hiking,5.11,02:30,846,798,1644,2,3
4,Von der Saane zum Denkmal der Schlacht von Laupen,,Hiking,14.11,03:40,212,482,649,0,0
...,...,...,...,...,...,...,...,...,...,...
395,Von der Elsigenalp auf das Elsighorn,medium,Hiking,10.55,04:30,833,1777,2333,2,3
396,In der Auenlandschaft der Aare bei Münsingen,,Hiking,11.89,02:50,97,515,565,0,0
397,Saanen - Saanenmöser - Oeschseite - Zweisimmen,medium,Cycling,17.36,01:30,270,941,1280,3,2
398,Schynige Platte - Oberberghorn - Schynige Platte,easy,Hiking,3.06,01:15,179,1940,2066,2,2


In [24]:
# dataframe summary: column types and non-null counts
hikes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       400 non-null    object 
 1   difficulty  400 non-null    object 
 2   sport       400 non-null    object 
 3   length      400 non-null    float64
 4   time        400 non-null    object 
 5   climb       400 non-null    int64  
 6   min         400 non-null    int64  
 7   max         400 non-null    int64  
 8   technique   400 non-null    int64  
 9   fitness     400 non-null    int64  
dtypes: float64(1), int64(5), object(4)
memory usage: 31.4+ KB


<html>
<h2> Part 3: Downloading GeoData for the hikes
</h2>
<body>Goal: download the GPS track of each of the 400 hikes in a for loop.</body>
</html>

In [25]:
### Imports
import requests
import re
from pykml import parser
from lxml import etree
from lxml import objectify

<html> 
<h3> Download the KML file
    </h3>
<body>We do this by parsing the URL and converting it into the KML link that is used on the website.

- https://www.bergfex.com/downloads/gps/?type=&id=271548&fileType=kml
- Get the file name from the URL
    </body>
</html>

In [26]:
kml_link = ""
url = "https://www.bergfex.com/downloads/gps/"
url3 = "https://www.bergfex.com/downloads/gps/?type=&amp;id=271548&amp;fileType=kml"
if '/' in url3:  # str.find() returns -1 (truthy) when nothing is found, so we test membership instead
    kml_link = url3.rsplit('/', 1)[1]  # everything after the last slash
    print(kml_link)
    new_s = kml_link.replace('&amp;', '&')  # undo the HTML escaping of every '&'
    print(new_s)

conc_kml_link = url + new_s
print(conc_kml_link)

?type=&amp;id=271548&amp;fileType=kml
?type=&id=271548&fileType=kml
https://www.bergfex.com/downloads/gps/?type=&id=271548&fileType=kml
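
Since only the tour ID changes from one download link to the next, the link could also be assembled from its parts instead of being edited as a string. The sketch below uses the standard library for this. The function name `build_kml_url` is hypothetical; the query parameters are the ones visible in the example link above.

```python
from html import unescape
from urllib.parse import urlencode

BASE_URL = "https://www.bergfex.com/downloads/gps/"

def build_kml_url(tour_id):
    """Build the KML download link for one tour from its ID."""
    query = urlencode({"type": "", "id": tour_id, "fileType": "kml"})
    return f"{BASE_URL}?{query}"

print(build_kml_url("271548"))

# html.unescape turns every '&amp;' back into '&' in one call
print(unescape("?type=&amp;id=271548&amp;fileType=kml"))
```

`urlencode` also takes care of escaping, should an ID ever contain characters that are not URL-safe.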


<html> 
<h3>Write the data to a KML file locally
    </h3>
<body> 
- TO DO:
- This is the file we will read our data from and then add to a data frame
    </body>
    </html>

In [27]:
r = requests.get(conc_kml_link, allow_redirects=True)
write_link = new_s + ".kml"  # file name derived from the query string
open(write_link, 'wb').write(r.content)  # returns the number of bytes written

14653

<html> 
<h3>Create loop to download all entries
    </h3>
    <body> DONE </body>
</html>

In [31]:
## get list of all IDs
final_id = id_df.iloc[:,5]
final_id

0      310007
1       49274
2      131936
3      131157
4       81701
        ...  
395    151862
396    101889
397    131604
398    131173
399     84396
Name: 5, Length: 400, dtype: object

In [35]:
test_loop = final_id.iloc[0:3]
test_loop

0    310007
1     49274
2    131936
Name: 5, dtype: object

In [40]:
## For loop to create the KML files: i) all 400 items --> use 'final_id', ii) for testing --> use 'test_loop'
## Saves the files in the same directory as the notebook

for sport_id in test_loop: 
    url_id = f"https://www.bergfex.com/downloads/gps/?type=&id={sport_id}&fileType=kml"
    print(url_id)
    r = requests.get(url_id, allow_redirects=True)
    write_link = sport_id + ".kml"
    with open(write_link, 'wb') as f:  # the file is closed as soon as the block ends
        f.write(r.content)

write_link


https://www.bergfex.com/downloads/gps/?type=&id=310007&fileType=kml
https://www.bergfex.com/downloads/gps/?type=&id=49274&fileType=kml
https://www.bergfex.com/downloads/gps/?type=&id=131936&fileType=kml


'131936.kml'

<html> 
<h3> Read out the first position of each file and add to DF
    </h3>

<body> Still to do </body>
</html>
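
As a starting point for this open step, here is a hedged sketch of how the first track point could be read from a KML document. It uses the standard library's `xml.etree.ElementTree` instead of the `pykml` import above, and it assumes the downloaded files follow the standard KML 2.2 layout with a `<coordinates>` element holding `lon,lat,alt` triples. That layout has not been checked against real Bergfex files; the sample document and the function name `first_position` are illustrative only.

```python
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

# Illustrative sample only; real Bergfex files may be structured differently
SAMPLE_KML = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <LineString>
        <coordinates>
          7.45,46.95,540 7.46,46.96,560
        </coordinates>
      </LineString>
    </Placemark>
  </Document>
</kml>"""

def first_position(kml_text):
    """Return (longitude, latitude) of the first track point, or None if there is none."""
    root = ET.fromstring(kml_text.encode("utf-8"))
    node = root.find(f".//{KML_NS}coordinates")
    if node is None or not node.text or not node.text.strip():
        return None
    lon, lat = node.text.split()[0].split(",")[:2]
    return float(lon), float(lat)

print(first_position(SAMPLE_KML))
```

For the saved files, the same function could be called on the decoded content of each file, and the resulting pairs collected into two new dataframe columns.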

<html>
<h2> Part 4: Get snow data
</h2>
<body>DONE</body>
</html>

In [42]:
### all the imports

import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

In [43]:
### initializing the future columns of our dataframe with empty lists

snowlevel = []  # height in cm
location = []  # town and elevation of town

## no looping over several pages necessary

## only scrape first page
link = 'http://www.meteocentrale.ch/de/wetter/hitlisten/schneehoehen.html'
page = requests.get(link, timeout=10)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")  # bs4.BeautifulSoup object
hitlist = soup.findAll('table', {'class': 'hitlist'})  #bs4.element.ResultSet

200


In [44]:
## get location
location_item = hitlist[0].findAll('a')
location.append([info.get_text().strip() for info in location_item])
loc = location[0]
#len(loc)
#loc

## get snowlevel
snowlevel_item = hitlist[0].findAll('td', {'class': 'value'}) #
snowlevel.append([info.get_text().strip() for info in snowlevel_item])
#len(snowlevel)
snow = snowlevel[0]
#len(snow)
#snow


## combine into DF
heights_df = pd.DataFrame({'location': loc,'snowlevel': snow})
heights_df

Unnamed: 0,location,snowlevel
0,"Weissfluhjoch, 2690 m",58 cm
1,"Corvatsch, 3315 m",45 cm
2,"Säntis, 2490 m",20 cm
3,"Glattalp, 1858 m",15 cm
4,"Schwägalp, 1350 m",4 cm
...,...,...
74,"Zermatt, 1638 m",0 cm
75,"Zollikofen, 553 m",0 cm
76,"Zürich-Affoltern, 443 m",0 cm
77,"Zürich-Flughafen, 432 m",0 cm


In [45]:
### data manipulation: 

##remove cm, convert to 'int'
heights_df['snowlevel_in_cm']=pd.Series(heights_df['snowlevel']).str.replace(" cm", '')
heights_df['snowlevel_in_cm']=pd.Series(heights_df['snowlevel_in_cm']).astype(int)
#type(heights_df['snowlevel_in_cm'][0])
#heights_df

## split location into 'village' and 'elevation of village' and merge with previous DF
split_loc = pd.Series(heights_df['location']).str.split(',',n=2,expand = True)
merged_df = pd.merge(heights_df, split_loc, left_index=True, right_index=True)
#merged_df

## drop unused columns, rename final columns, remove unit, sort columns
intermediate_df1 = merged_df.iloc[:,[2,3,4]].copy()  # copy, so the assignment below does not write to a slice of merged_df
intermediate_df1.columns = ['snowlevel_in_cm', 'location', 'height_in_m']
intermediate_df1['height_in_m'] = intermediate_df1['height_in_m'].str.replace(" m", '')
intermediate_df2 = pd.DataFrame(intermediate_df1, columns = ['location', 'height_in_m', 'snowlevel_in_cm'])


In [47]:
### final dataframe

intermediate_df2

Unnamed: 0,location,height_in_m,snowlevel_in_cm
0,Weissfluhjoch,2690,58
1,Corvatsch,3315,45
2,Säntis,2490,20
3,Glattalp,1858,15
4,Schwägalp,1350,4
...,...,...,...
74,Zermatt,1638,0
75,Zollikofen,553,0
76,Zürich-Affoltern,443,0
77,Zürich-Flughafen,432,0
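
The location strings combine a place name and an elevation, for example `Weissfluhjoch, 2690 m`. The sketch below shows one possible way to split them with a regular expression that anchors on the trailing elevation, so a place name that itself contains a comma would still be kept whole. The helper name `split_location` is hypothetical, and the pattern assumes the `<name>, <height> m` format shown in the table above.

```python
import re

# Assumption: every entry looks like "<name>, <height> m"
LOCATION_PATTERN = re.compile(r"^(?P<name>.+),\s*(?P<height>\d+)\s*m$")

def split_location(text):
    """Return (name, height in metres); height is None if the text does not match."""
    match = LOCATION_PATTERN.match(text.strip())
    if match is None:
        return text, None
    return match.group("name"), int(match.group("height"))

print(split_location("Weissfluhjoch, 2690 m"))
```

Because the height comes back as an `int`, the elevation column would not need a separate unit-stripping step.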


In [None]:
### Export into csv

#intermediate_df2.to_csv(r'/Users/sd/Documents/0-Coding/sarah-dutschke/04-Visualization/Challenge/snow_level.csv', index = False)

<html>
<h2> Part 5: Data analysis 
</h2>
<body>(just to check the dataframe actually works)</body>
</html>

In [None]:
# Number of entries by types of sports
hikes_df['sport'].value_counts().plot(kind='bar')

In [None]:
# Creating a df for "Number of entries by sport type" + average tour length

sporttype = hikes_df.groupby('sport') \
       .agg({'sport':'count', 'length':'mean'}) \
       .rename(columns={'sport':'count','length':'average_length'}) \
       .reset_index()
sporttype

In [None]:
# Difficulty

hikes_df['difficulty'].unique()

In [None]:
hikes_df['difficulty'].value_counts().plot(kind='bar')

In [None]:
# mean length by difficulty:

hikes_df.groupby('difficulty')['length'].mean()

In [None]:
# climb vs. length for every tour; the values must stay paired per tour,
# so we plot the columns directly instead of sorting each one separately
plt.scatter(hikes_df["length"], hikes_df["climb"])
plt.ylabel('climb (m)')
plt.xlabel("length (km)")
plt.title('All activities')

In [None]:
## Creating a df for deeper analysis for Hiking
# Filter for hiking
hiking = hikes_df[hikes_df.sport == "Hiking"]

# calculating the count by Fitness level and the average length 
fitness_length = hiking.groupby('fitness') \
       .agg({'fitness':'count', 'length':'mean'}) \
       .rename(columns={'fitness':'count','length':'average_length'}) \
       .reset_index()

# calculating weighted and normalized length
fitness_length['weighted_ave_length'] = fitness_length["count"] * fitness_length["average_length"]
fitness_length["normalized_weighted_length"] = (fitness_length["weighted_ave_length"] - fitness_length["weighted_ave_length"].min()) \
                                                / (fitness_length["weighted_ave_length"].max() - fitness_length["weighted_ave_length"].min()) * 10
fitness_length


In [None]:
## Plotting Length vs. Fitness rating
#--> Observation: the routes with no fitness rating seem quite long. Investigate further

plt.plot(fitness_length["fitness"], fitness_length["average_length"])
plt.ylabel('average_length')
plt.xlabel("rating_fitness")
plt.title('Hiking: Average Length by Fitness Level Rating')

plt.show()

In [None]:
# quality check: compare the climb with the altitude range (max - min) to see whether we understand the columns

hikes_df["residual"]= hikes_df["climb"] - (hikes_df["max"] - hikes_df["min"])
hikes_df["residual"].plot.density()
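
One possible reading of a non-zero residual, stated here as an assumption and not confirmed by Bergfex, is that `climb` is the cumulative ascent along the route while `max - min` is only the altitude range. Under that reading the two agree only for a route that climbs steadily from its lowest to its highest point. The sketch below illustrates the difference on a made-up elevation profile; the function name `cumulative_ascent` is hypothetical.

```python
def cumulative_ascent(elevations):
    """Sum of all uphill steps along an elevation profile (in metres)."""
    return sum(max(b - a, 0) for a, b in zip(elevations, elevations[1:]))

# Made-up profile: up 300 m, down 200 m, up 400 m
profile = [1000, 1300, 1100, 1500]
climb = cumulative_ascent(profile)       # 300 + 400
span = max(profile) - min(profile)       # 1500 - 1000
residual = climb - span
print(climb, span, residual)
```

With this interpretation, a route with several separate climbs gives a positive residual, and a mostly downhill route gives a negative one.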

<html>
<h2> Part 6: Dashboard
</h2>
<body>(still quite a lot to do :-))</body>
</html>