# Data aggregation for runners data visualization
In this notebook, we create the JSON files used by the website that visualizes the data. Each JSON file corresponds to one runner and contains all the necessary information about that runner. For more details about how each JSON file is created, see [here](#Detailed-aggregation-by-runner).

**Warning:** We assume that two records with the same name (first name and family name) and the same birth year belong to the same person.

* [Load data](#Load-data)
* [Clean data](#Clean-data)
* [Detailed aggregation by runner](#Detailed-aggregation-by-runner)

# Load data
Requirements:
`pip install unidecode`

In [1]:
import numpy as np
import pandas as pd
import json
import re
from unidecode import unidecode
from datetime import timedelta
import math
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context('notebook')

## Load information about each dyad (runner, race)

In [2]:
# It takes a while, but the pickle file can also be loaded from its URL. :fire:
# raw_df = pd.read_pickle('https://drive.google.com/file/d/0BypxDaHZHjhfNG9qbHA0NGJpbU0/view?usp=sharing')
# Or from a local copy:
raw_df = pd.read_pickle('/home/ondine/Desktop/ADA/df_userID.pickle')

In [3]:
raw_df.head()

Unnamed: 0,Race,Date,RaceYear,RaceMonth,Category,Distance,Name,Sex,Year,LivingPlace,Rank,Time,Pace,Place,MinTemp,MaxTemp,Weather,RaceID,UserID
0,Kerzerslauf,sam. 18.03.2000,2000,3,M20,15.0,Abgottspon Peter,M,1974.0,Zermatt,233,01:02:25,00:04:09,Kerzers,,,,http://services.datasport.com/2000/lauf/kerzers,Abgottspon Peter 1974.0
1,Kerzerslauf,sam. 18.03.2000,2000,3,M35,15.0,Abplanalp Michael,M,1964.0,Bern,32,00:55:11.700000,00:03:40,Kerzers,,,,http://services.datasport.com/2000/lauf/kerzers,Abplanalp Michael 1964.0
2,Kerzerslauf,sam. 18.03.2000,2000,3,M50,15.0,Abt Werner,M,1947.0,Spiez,155,01:12:42.900000,00:04:50,Kerzers,,,,http://services.datasport.com/2000/lauf/kerzers,Abt Werner 1947.0
3,Kerzerslauf,sam. 18.03.2000,2000,3,F45,15.0,Ackermann Antoinette,F,1953.0,Alterswil,48,01:22:36.700000,00:05:30,Kerzers,,,,http://services.datasport.com/2000/lauf/kerzers,Ackermann Antoinette 1953.0
4,Kerzerslauf,sam. 18.03.2000,2000,3,F50,15.0,Ackermann Hedy,F,1946.0,Alterswil,42,01:23:29.300000,00:05:33,Kerzers,,,,http://services.datasport.com/2000/lauf/kerzers,Ackermann Hedy 1946.0


## Load extra information about the races

In [4]:
races_info = pd.read_csv('../datasets/races-information.csv',index_col=0).drop('url', axis=1)
races_info.head()

Unnamed: 0,date,name,location,min_temp,max_temp,uv_index,weather_desc,latitude,longitude,weekday,day,month,year
0,sam. 27.03.1999,Männedörfler Waldlauf,Männedorf,,,,,47.2574625,8.6946733,saturday,27,3,1999
1,sam. 20.03.1999,Kerzerslauf,Kerzers,,,,,46.97488999999999,7.1954365,saturday,20,3,1999
2,sam. 24.04.1999,Luzerner Stadtlauf,Luzern,,,,,47.05016819999999,8.3093072,saturday,24,4,1999
3,sam. 24.04.1999,20km de Lausanne,Lausanne,,,,,46.5196535,6.6322734,saturday,24,4,1999
4,sam. 24.04.1999,"Chäsitzerlouf, Kehrsatz",Kehrsatz,,,,,,,saturday,24,4,1999


## Merge both tables, keeping the dyads (runner, race)

In [5]:
df = pd.merge(raw_df, races_info, how='left', left_on=['Race','Date'], right_on=['name','date'])\
    .drop(['date','name','MinTemp','MaxTemp','Weather','RaceYear','RaceMonth','RaceID'],axis=1)
print(df.shape)
df.columns

(1648676, 24)


Index(['Race', 'Date', 'Category', 'Distance', 'Name', 'Sex', 'Year',
       'LivingPlace', 'Rank', 'Time', 'Pace', 'Place', 'UserID', 'location',
       'min_temp', 'max_temp', 'uv_index', 'weather_desc', 'latitude',
       'longitude', 'weekday', 'day', 'month', 'year'],
      dtype='object')

In [6]:
df.head()

Unnamed: 0,Race,Date,Category,Distance,Name,Sex,Year,LivingPlace,Rank,Time,...,min_temp,max_temp,uv_index,weather_desc,latitude,longitude,weekday,day,month,year
0,Kerzerslauf,sam. 18.03.2000,M20,15.0,Abgottspon Peter,M,1974.0,Zermatt,233,01:02:25,...,,,,,46.97488999999999,7.1954365,saturday,18,3,2000
1,Kerzerslauf,sam. 18.03.2000,M35,15.0,Abplanalp Michael,M,1964.0,Bern,32,00:55:11.700000,...,,,,,46.97488999999999,7.1954365,saturday,18,3,2000
2,Kerzerslauf,sam. 18.03.2000,M50,15.0,Abt Werner,M,1947.0,Spiez,155,01:12:42.900000,...,,,,,46.97488999999999,7.1954365,saturday,18,3,2000
3,Kerzerslauf,sam. 18.03.2000,F45,15.0,Ackermann Antoinette,F,1953.0,Alterswil,48,01:22:36.700000,...,,,,,46.97488999999999,7.1954365,saturday,18,3,2000
4,Kerzerslauf,sam. 18.03.2000,F50,15.0,Ackermann Hedy,F,1946.0,Alterswil,42,01:23:29.300000,...,,,,,46.97488999999999,7.1954365,saturday,18,3,2000


# Clean data
## Take care of NaN values

In [7]:
df.loc[df.latitude == 'n', 'latitude'] = np.nan
df.loc[df.longitude == 'a', 'longitude'] = np.nan

## Convert types

In [8]:
df.latitude = df.latitude.apply(float)
df.longitude = df.longitude.apply(float)
df.Distance = df.Distance.apply(round)
df.Year = df.Year.fillna(0).apply(int)
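
A minimal sketch of a more tolerant alternative to the NaN and float-conversion steps above: instead of listing each placeholder string by hand, any value that does not parse as a number becomes NaN. The helper name `to_float_or_nan` and the sample values are illustrative and not part of the notebook. On a pandas column, `pd.to_numeric(column, errors='coerce')` has the same effect.

```python
import math


def to_float_or_nan(value):
    """Return value as a float, or NaN when it cannot be parsed as a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Illustrative values: two valid coordinates, the 'n' / 'a' placeholders and a missing value.
coords = ['46.97488999999999', 'n', 'a', None, 7.1954365]
cleaned = [to_float_or_nan(c) for c in coords]
print(cleaned)
```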

## Clean races and runners names

In [24]:
def clean_name(x):
    return x.replace("/"," ").replace("\\"," ").replace("."," ")

df.Race = df.Race.apply(clean_name)
df.Name = df.Name.apply(clean_name)

## Convert times and paces to seconds

In [10]:
df['time'] = df.Time.apply(timedelta.total_seconds)
# The pace must be computed from the Pace column, not from Time
df['pace'] = df.Pace.apply(timedelta.total_seconds)
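
As a quick sanity check of the conversion to seconds, here is a small sketch that accepts either a `timedelta` or an `HH:MM:SS[.ffffff]` string such as the ones displayed in the tables. The helper name `to_seconds` is hypothetical; the notebook itself assumes the columns already hold `timedelta` values.

```python
from datetime import timedelta


def to_seconds(value):
    """Convert a timedelta or an 'HH:MM:SS[.ffffff]' string to a number of seconds."""
    if isinstance(value, timedelta):
        return value.total_seconds()
    hours, minutes, seconds = value.split(':')
    delta = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return delta.total_seconds()


race_time = to_seconds('00:30:34.600000')      # a race time
race_pace = to_seconds('00:04:51')             # a pace per kilometre
same_time = to_seconds(timedelta(minutes=30, seconds=34.6))
print(race_time, race_pace, same_time)
```

The first value matches the `time` column shown for the same row in the table further down (1834.6 seconds).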

## Only keep valid names

In [25]:
df = df[df.Name != ""]

## Set a double index (Name, Birth year)
We make the assumption that two records with the same name and birth year belong to the same person. The index is not necessarily unique, however, since a person can have taken part in more than one competition. Indeed, it is not unique, as we can see here:

In [26]:
doubleindex_df = df.set_index(['Name','Year'])
doubleindex_df.index.is_unique

False
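
To see why the index is not unique, the following sketch counts results per (name, birth year) key on three rows taken from the table below. It uses plain Python rather than pandas, and the variable names are illustrative.

```python
from collections import Counter

# (Name, Year, Race, Date) tuples copied from the notebook's sample output.
results = [
    ('0Berson Jose', 1976, "Course de l'Avent, Fribourg", 'dim. 23.11.2014'),
    ('1 Sergeant Crespo Reina Juan Antoni', 1971, 'Bieler Lauftage, Biel Bienne', 'ven. 17.06.2005'),
    ('1 Sergeant Crespo Reina Juan Antoni', 1971, 'Bieler Lauftage, Biel Bienne', 'ven. 09.06.2006'),
]

# Count how many results share the same (name, birth year) key.
results_per_runner = Counter((name, year) for name, year, _race, _date in results)

# Keys that appear more than once are what make the index non-unique.
repeated = {key: count for key, count in results_per_runner.items() if count > 1}
print(repeated)
```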

We sort the index so that the creation of the JSON files can be split into several runs.

In [27]:
doubleindex_df.sort_index(inplace=True)

In [28]:
doubleindex_df.head()

Name,Year,Race,Date,Category,Distance,Sex,LivingPlace,Rank,Time,Pace,Place,...,max_temp,uv_index,weather_desc,latitude,longitude,weekday,day,month,year,time
0Berson Jose,1976,"Course de l'Avent, Fribourg",dim. 23.11.2014,P-H30,6,M,Orsonnens,18,00:30:34.600000,00:04:51,Fribourg,...,13.0,0.0,Clear,46.806477,7.161972,sunday,23,11,2014,1834.6
0Dermatt Ernst,1959,Luzerner Stadtlauf,sam. 03.05.2003,K/28,6,M,Hergiswil,275,00:27:42.100000,00:05:02,Luzern,...,,,,47.050168,8.309307,saturday,3,5,2003,1662.1
1 Sergeant Crespo Reina Juan Antoni,1971,"Bieler Lauftage, Biel Bienne",ven. 17.06.2005,100/MP,100,M,Espana 3,5,08:30:58.100000,00:05:06,Biel/Bienne,...,,,,47.136778,7.246791,friday,17,6,2005,30658.1
1 Sergeant Crespo Reina Juan Antoni,1971,"Bieler Lauftage, Biel Bienne",ven. 09.06.2006,100/MP,100,M,Espana 3,8,08:59:51.300000,00:05:23,Biel/Bienne,...,,,,47.136778,7.246791,friday,9,6,2006,32391.3
1 Sergeant Zarza Rodriguez Jose Lui,1966,"Bieler Lauftage, Biel Bienne",ven. 17.06.2005,100/MP,100,M,Espana 1,3,08:24:14.400000,00:05:02,Biel/Bienne,...,,,,47.136778,7.246791,friday,17,6,2005,30254.4


We print the column names of both tables to check which ones are available.

In [29]:
doubleindex_df.columns

Index(['Race', 'Date', 'Category', 'Distance', 'Sex', 'LivingPlace', 'Rank',
       'Time', 'Pace', 'Place', 'UserID', 'location', 'min_temp', 'max_temp',
       'uv_index', 'weather_desc', 'latitude', 'longitude', 'weekday', 'day',
       'month', 'year', 'time'],
      dtype='object')

In [30]:
df.columns

Index(['Race', 'Date', 'Category', 'Distance', 'Name', 'Sex', 'Year',
       'LivingPlace', 'Rank', 'Time', 'Pace', 'Place', 'UserID', 'location',
       'min_temp', 'max_temp', 'uv_index', 'weather_desc', 'latitude',
       'longitude', 'weekday', 'day', 'month', 'year', 'time'],
      dtype='object')

# Detailed aggregation by runner

Steps:
* For each runner, build a runner_dict that contains (hierarchically) the data for that runner, across all the races they have taken part in.

So for each runner we have:

```
runner_dict = {
    'name': Name,
    'birth': Year,
    'sex': Sex,
    'races': {
        'race_1': {
            'race': Race,
            'location': location,
            'latitude': latitude,
            'longitude': longitude,
            'date': {'date_1': {
                        'weekday': weekday,
                        'day': day,
                        'month': month,
                        'year': year,
                        'livingplace': LivingPlace,
                        'categories': {
                            'category_1': {
                                'distance': Distance, 
                                'rank': Rank, 
                                'time': time, 
                                'pace': pace
                                },
                            ...             # 'category_2', etc.
                            }
                      },
                      ...                   # 'date_2', etc.
            },
        }, 
        ...                                 # 'race_2', etc.
    }
}
```
* Each runner_dict is exported to a JSON file whose name is an encoded version of the runner's real name.

* We also build a final dictionary that maps the encoded names (used for the JSON file names) back to the full names of the runners, as follows:
```
names_dict = {
    encoded_name_1 : name_1,
    encoded_name_2 : name_2,
    ...
}
```
* Export names_dict to a JSON file.

Note that since there are many JSON files, they are written directly into the local clone of the website's GitHub organisation repository [hopsuisse.github.io](https://github.com/hopsuisse/hopsuisse.github.io). This avoids moving a large number of files afterwards, which is slow.
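
The encoding step can be sketched as follows. This is an illustration under assumptions, not the notebook's exact code: `unicodedata` from the standard library stands in for `unidecode`, so results can differ for characters that have no accent decomposition (for example `ß`). The helper name `encode_name` and the two accented sample names are made up. The sketch also records collisions, that is, distinct runners that end up with the same file name.

```python
import re
import unicodedata


def encode_name(name, birth_year):
    """Return (encoded_name, runner_id) for a runner's name and birth year."""
    runner_id = f'{name} {birth_year}'
    # Stand-in for unidecode: decompose accented characters, then drop the accents.
    ascii_id = unicodedata.normalize('NFKD', runner_id).encode('ascii', 'ignore').decode('ascii')
    # Keep only letters and digits, in lowercase.
    return re.sub('[^0-9a-zA-Z]+', '', ascii_id).lower(), runner_id


names_dict = {}
collisions = []
runners = [('Abgottspon Peter', 1974), ('Müller José', 1980), ('Muller Jose', 1980)]
for name, year in runners:
    encoded_name, runner_id = encode_name(name, year)
    if encoded_name in names_dict and names_dict[encoded_name] != runner_id:
        # Two different runners would overwrite each other's JSON file.
        collisions.append((names_dict[encoded_name], runner_id))
    names_dict[encoded_name] = runner_id

print(names_dict)
print(collisions)
```

Since the encoded name doubles as the file name, a collision means a later runner silently overwrites an earlier one, so it may be worth logging these cases.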

# Helpers

In [31]:
week_dict = {
    'lun': 'monday',
    'mar': 'tuesday',
    'mer': 'wednesday',
    'jeu': 'thursday',
    'ven': 'friday',
    'sam': 'saturday',
    'dim': 'sunday'
}

def fill_date(dataframe, dictionary):
    weekday = dataframe.weekday.unique()[0]
    day = dataframe.day.unique()[0]
    month = dataframe.month.unique()[0]
    year = dataframe.year.unique()[0]
    if pd.isnull(weekday) or pd.isnull(day) or pd.isnull(month) or pd.isnull(year):
        # compute
        dictionary['weekday'] = dataframe.Date.apply(lambda x: week_dict[x.split('.')[0].strip()]).unique()[0]
        dictionary['day'] = int(dataframe.Date.apply(lambda x: int(x.split('.')[1].strip())).unique()[0])
        dictionary['month'] = int(dataframe.Date.apply(lambda x: int(x.split('.')[2].strip())).unique()[0])
        dictionary['year'] = int(dataframe.Date.apply(lambda x: int(x.split('.')[3].strip())).unique()[0])
    else:
        dictionary['weekday'] = weekday
        dictionary['day'] = int(day)
        dictionary['month'] = int(month)
        dictionary['year'] = int(year)
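
The fallback branch of `fill_date` relies on the `Date` strings having the form `sam. 18.03.2000`. As a standalone sketch, the same parsing can be done in one pass per string. The names `parse_race_date` and `WEEKDAYS` are illustrative, and the cross-check against `datetime.date` is an addition that the notebook does not perform.

```python
from datetime import date

WEEK_DICT = {
    'lun': 'monday', 'mar': 'tuesday', 'mer': 'wednesday', 'jeu': 'thursday',
    'ven': 'friday', 'sam': 'saturday', 'dim': 'sunday',
}
# date.weekday() returns 0 for Monday, which matches the order of WEEK_DICT.
WEEKDAYS = list(WEEK_DICT.values())


def parse_race_date(date_string):
    """Split a string such as 'sam. 18.03.2000' into weekday, day, month and year."""
    prefix, day, month, year = (part.strip() for part in date_string.split('.'))
    return {
        'weekday': WEEK_DICT[prefix],
        'day': int(day),
        'month': int(month),
        'year': int(year),
    }


parsed = parse_race_date('sam. 18.03.2000')
# Cross-check the French weekday prefix against the calendar.
calendar_weekday = WEEKDAYS[date(parsed['year'], parsed['month'], parsed['day']).weekday()]
print(parsed, calendar_weekday)
```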

## Main loop to build the JSON files
### Careful: Very long loop! (>12h)

In [35]:
### ITERATION OVER RUNNERS
names_dict = {}
#out_dict = {}

i = 500000
i_max = doubleindex_df.index.unique().shape[0]

for (runner,birth) in doubleindex_df.index.unique():
    i = i+1
    sub_df_temp = df[df.Name == runner].copy()
    sub_df = sub_df_temp[sub_df_temp.Year == birth].copy()
    runner_dict = {}
    runner_dict['name'] = runner
    runner_dict['birth'] = birth
    runner_dict['sex'] = sub_df.Sex.unique()[0]
    race_wrapper = {}
    
    for race in sub_df.Race.unique():
        race_dict = {}
        race_dict['race'] = race
        subsub_df = sub_df[sub_df.Race == race].copy()
        race_dict['location'] = subsub_df.location.unique()[0]
        race_dict['latitude'] = float(subsub_df.latitude.unique()[0])
        race_dict['longitude'] = float(subsub_df.longitude.unique()[0])
        date_wrapper = {}
        
        for date in subsub_df.Date.unique():
            subsubsub_df = subsub_df[subsub_df.Date == date].copy()
            date_dict = {}
            # Note that for some dates, we don't already have this info and have to compute it
            fill_date(subsubsub_df, date_dict)
            # TODO: weather !
            # TODO: total number of runners !
            date_dict['livingplace'] = subsubsub_df.LivingPlace.unique()[0]
            cat_wrapper = {}
            
            for category in subsubsub_df.Category.unique():
                lastsub_df = subsubsub_df[subsubsub_df.Category == category].copy()
                cat_dict = {}
                cat_dict['distance'] = int(lastsub_df.Distance.unique()[-1])
                cat_dict['rank'] = int(lastsub_df.Rank.unique()[-1])
                cat_dict['time'] = float(lastsub_df.time.unique()[-1])
                cat_dict['pace'] = int(lastsub_df.pace.unique()[-1])
                cat_wrapper[category] = cat_dict

                if lastsub_df.shape[0] != 1:
                    #print()
                    #print()
                    #print('Two runners have the same name, same birth year and run in the same race,',\
                    #      'in the same category.')
                    #print(lastsub_df)
                    #print()
                    #print()
                    break
                        
            date_dict['categories'] = cat_wrapper
            date_wrapper[date] = date_dict
            
        race_dict['date'] = date_wrapper
        race_wrapper[race] = race_dict
        
    runner_dict['races'] = race_wrapper
    runner_id = runner+' '+str(birth)
    encoded_name = re.sub('[^0-9a-zA-Z]+', '', unidecode(runner_id).lower())
    names_dict[encoded_name] = runner_id
    #out_dict[encoded_name] = runner_dict
    
    with open('../../hopsuisse.github.io/runnerdata/' + encoded_name + '.json', 'w') as out_file:
        json.dump(runner_dict, out_file)
    
    if i%1000 == 0:
        print(i,'runners out of',i_max,'have been analysed.')
print('All done.')

with open('../../hopsuisse.github.io/_data/runnersnames.json', 'w') as out_file:
    json.dump(names_dict, out_file)
print('Files saved.')

501000 runners out of 531419 have been analysed.
502000 runners out of 531419 have been analysed.
503000 runners out of 531419 have been analysed.
504000 runners out of 531419 have been analysed.
505000 runners out of 531419 have been analysed.
506000 runners out of 531419 have been analysed.
507000 runners out of 531419 have been analysed.
508000 runners out of 531419 have been analysed.
509000 runners out of 531419 have been analysed.
510000 runners out of 531419 have been analysed.
511000 runners out of 531419 have been analysed.
512000 runners out of 531419 have been analysed.
513000 runners out of 531419 have been analysed.
514000 runners out of 531419 have been analysed.
515000 runners out of 531419 have been analysed.
516000 runners out of 531419 have been analysed.
517000 runners out of 531419 have been analysed.
518000 runners out of 531419 have been analysed.
519000 runners out of 531419 have been analysed.
520000 runners out of 531419 have been analysed.
521000 runners out o
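
A large part of the running time comes from `df[df.Name == runner]`, which scans the whole table once per runner. The sketch below shows the alternative idea of grouping the rows once and then building each nested dictionary from its own group. It uses plain Python dictionaries on three sample rows and only a subset of the fields. The function names are hypothetical and the approach has not been benchmarked on the full dataset. In pandas, `df.groupby(['Name', 'Year'])` plays the same role as `group_by_runner`.

```python
from collections import defaultdict


def group_by_runner(rows):
    """Group result rows by (Name, Year) in a single pass over the data."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row['Name'], row['Year'])].append(row)
    return groups


def build_runner_dict(name, birth, rows):
    """Build the nested race -> date -> category structure for one runner."""
    runner = {'name': name, 'birth': birth, 'sex': rows[0]['Sex'], 'races': {}}
    for row in rows:
        race = runner['races'].setdefault(row['Race'], {'race': row['Race'], 'date': {}})
        race_date = race['date'].setdefault(row['Date'], {'categories': {}})
        race_date['categories'][row['Category']] = {
            'distance': row['Distance'], 'rank': row['Rank'], 'time': row['time'],
        }
    return runner


# Sample rows copied from the notebook's output (subset of the columns).
base = {'Sex': 'M', 'Race': 'Bieler Lauftage, Biel Bienne', 'Category': '100/MP', 'Distance': 100}
rows = [
    dict(base, Name='1 Sergeant Crespo Reina Juan Antoni', Year=1971,
         Date='ven. 17.06.2005', Rank=5, time=30658.1),
    dict(base, Name='1 Sergeant Crespo Reina Juan Antoni', Year=1971,
         Date='ven. 09.06.2006', Rank=8, time=32391.3),
    dict(base, Name='1 Sergeant Zarza Rodriguez Jose Lui', Year=1966,
         Date='ven. 17.06.2005', Rank=3, time=30254.4),
]

groups = group_by_runner(rows)
runner_dicts = {key: build_runner_dict(key[0], key[1], group) for key, group in groups.items()}
crespo = runner_dicts[('1 Sergeant Crespo Reina Juan Antoni', 1971)]
print(sorted(crespo['races']['Bieler Lauftage, Biel Bienne']['date']))
```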

### If splitting the data is needed: here is the merge loop used.

In [40]:
#with open('runnersnames4.json') as json_data4:
#    with open('runnersnames1.json') as json_data1:
#        with open('runnersnames2.json') as json_data2:
#            with open('runnersnames3.json') as json_data3:
#                d1 = json.load(json_data1)
#                d2 = json.load(json_data2)
#                d3 = json.load(json_data3)
#                d4 = json.load(json_data4)
#                d = {}
#                for dd in (d1,d2,d3,d4):
#                    d.update(dd)
#                with open('runnersnames.json', 'w') as out_file:
#                    json.dump(d, out_file)
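
The same merge can be written without nested `with` blocks and for any number of partial files. This is a sketch: the function name `merge_json_dicts` is hypothetical, and the demo writes two small made-up parts to a temporary directory so that it can run on its own.

```python
import json
import tempfile
from pathlib import Path


def merge_json_dicts(paths, out_path):
    """Merge several JSON dictionaries into one file; later files win on duplicate keys."""
    merged = {}
    for path in paths:
        with open(path, encoding='utf-8') as json_file:
            merged.update(json.load(json_file))
    with open(out_path, 'w', encoding='utf-8') as out_file:
        json.dump(merged, out_file)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    parts = [
        {'abtwerner1947': 'Abt Werner 1947'},
        {'ackermannhedy1946': 'Ackermann Hedy 1946'},
    ]
    part_paths = []
    for index, part in enumerate(parts, start=1):
        part_path = tmp_dir / f'runnersnames{index}.json'
        part_path.write_text(json.dumps(part), encoding='utf-8')
        part_paths.append(part_path)

    merged = merge_json_dicts(part_paths, tmp_dir / 'runnersnames.json')
    reloaded = json.loads((tmp_dir / 'runnersnames.json').read_text(encoding='utf-8'))

print(merged)
```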