 # Exploration

 ###### Go to [Reading and preparing the data](https://github.com/pontnou/utmb/blob/master/umtb.1.ipynb "Part 1").

 ## Imports

In [1]:
import os
import numpy as np
import pandas as pd
from IPython.display import display


 ## Reading the data file

In [2]:
file = os.path.join('data', 'utmb.1.csv')  # portable path
df = pd.read_csv(file, low_memory=False)
pd.set_option('display.max_columns', df.shape[1])
df[:3]


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Timediff,ColDeVoza,LaCharme,Delevret,StGervais,Contamines,LaBalme,ContaR,Bellevue,LesHouches,GarePp,Bonhomme,Chapieux,ColSeigne,RefugeElisabetta,LacCombal,MtFavre,Checrouit,Courmayeur,Courmayeur2,Bertone,RifugioElena,Bonatti,Arnouvaz,ColFerret,LaPeule,LaFouly,PrazDeFort,Champex,Martigny,Bovinette,LaGiete,Trient,Tseppes,Catogne,Vallorcine,ColDesMontets,LaTeteAuxVents,LaFlegere,Argentiere,Gardes,1km,Arrivee
0,0,2003,47,SHERPA Dachhiri-Dawa,1,SEH,TEAM FILA,NP,20:05:58,00:00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,1,2003,8,GAYLORD Topher,2,SEH,THE NORTH FACE,US,22:15:17,00:09:19,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2,2003,301,SYBROWSKY Brandon,2,SEH,TEAM MONTRAIL,US,22:15:17,00:09:19,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


 ## Converting column types

 #### The timing columns (the ones labelled crono) start at column index 8. We select them directly by position instead of going through the column map. At the end we check that the values can be operated on correctly by subtracting two arbitrary times.

In [3]:
cronos = list(df)[8:]
for crono in cronos:
    print(crono)
    td = pd.to_timedelta(df[crono])
    df[crono] = td

df.Time[30] - df.Time[29]


Time
Timediff
ColDeVoza
LaCharme
Delevret
StGervais
Contamines
LaBalme
ContaR
Bellevue
LesHouches
GarePp
Bonhomme
Chapieux
ColSeigne
RefugeElisabetta
LacCombal
MtFavre
Checrouit
Courmayeur
Courmayeur2
Bertone
RifugioElena
Bonatti
Arnouvaz
ColFerret
LaPeule
LaFouly
PrazDeFort
Champex
Martigny
Bovinette
LaGiete
Trient
Tseppes
Catogne
Vallorcine
ColDesMontets
LaTeteAuxVents
LaFlegere
Argentiere
Gardes
1km
Arrivee


Timedelta('0 days 00:41:08')
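 #### As a small illustration of the conversion above, the sketch below uses a made-up two-row frame (the name `toy` and its values are assumptions, not data from the file). It shows that `pd.to_timedelta` turns `HH:MM:SS` strings into timedeltas that can be subtracted, and that missing values become `NaT`.

```python
import pandas as pd

# toy data: hypothetical values, only for illustration
toy = pd.DataFrame({
    'Time': ['20:05:58', '22:15:17'],
    'ColDeVoza': ['01:17:37', None],
})

# convert every column from strings to timedeltas
for col in toy.columns:
    toy[col] = pd.to_timedelta(toy[col])

# timedeltas support arithmetic directly
gap = toy.Time[1] - toy.Time[0]
print(gap)
```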

 ## Checks

 #### To begin with, we run some basic checks. We start with the `Arrivee` variable.

In [4]:
df.groupby('Year')['Arrivee'].min().sort_values()


Year
2012   10:32:36
2017   19:01:54
2014   20:11:44
2013   20:34:57
2011   20:36:43
2008   20:56:59
2006   21:06:06
2004   21:06:18
2015   21:09:15
2005   21:11:07
2007   21:31:58
2009   21:33:18
2016   22:00:02
2003        NaT
2010        NaT
Name: Arrivee, dtype: timedelta64[ns]

 #### Three anomalous values stand out:
 1. 2012: the finish time is about half that of the other years with data.
 2. 2003: there is no finish time.
 3. 2010: there is no finish time.
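 #### The same anomalies could also be flagged programmatically. The sketch below works on a hypothetical subset of years; the name `toy` and the 0.75 factor are assumptions chosen for illustration, not values used in this notebook.

```python
import pandas as pd

# toy data: hypothetical subset of years and finish times
toy = pd.DataFrame({
    'Year': [2003, 2011, 2011, 2012, 2012, 2013, 2014],
    'Arrivee': pd.to_timedelta(
        [None, '20:36:43', '20:45:30', '10:32:36', '11:03:19',
         '20:34:57', '20:11:44']),
})

# best finish time per year
best = toy.groupby('Year')['Arrivee'].min()

# years with no finish time at all
missing = best[best.isna()].index.tolist()

# years whose best time is far below the typical one;
# the 0.75 factor is an arbitrary assumption
threshold = 0.75 * best.median()
too_fast = best[best < threshold].index.tolist()

print(missing, too_fast)
```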

 #### We prepare a helper function that returns only the columns that contain data. Because the schema varies so much from year to year, many variables hold no data at all for a given year; this way the analysis carries less noise.

In [5]:
# helper function to retrieve one year, dropping the
# columns whose values are all NaN


def get_year(df, year):
    year_df = df.loc[df.Year == year]
    notnas = year_df.columns[year_df.notna().any()].tolist()
    return year_df[notnas]
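 #### A quick way to see what the helper does is to call it on a tiny frame. The frame `toy` below is hypothetical; the function body is copied from the cell above.

```python
import numpy as np
import pandas as pd


def get_year(df, year):
    # same helper as in the cell above
    year_df = df.loc[df.Year == year]
    notnas = year_df.columns[year_df.notna().any()].tolist()
    return year_df[notnas]


# toy data: two years with different checkpoints (hypothetical values)
toy = pd.DataFrame({
    'Year': [2003, 2004],
    'Time': ['20:05:58', '21:06:18'],
    'ColDeVoza': [np.nan, '01:17:37'],
})

# ColDeVoza is empty for 2003, so it is dropped for that year only
cols_2003 = list(get_year(toy, 2003))
cols_2004 = list(get_year(toy, 2004))
print(cols_2003, cols_2004)
```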


 #### Year 2012

In [6]:
get_year(df, 2012)[:3]


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais,Contamines,LaBalme,ContaR,Bellevue,LesHouches,GarePp,Argentiere,1km,Arrivee
17739,0,2012,1508,D'HAENE François,1,SE H,,FR,10:32:36,01:13:43,01:46:47,02:37:58,03:28:33,04:52:08,06:23:31,06:48:48,08:00:14,09:36:18,NaT,10:32:36
17740,1,2012,1504,BUUD Jonas,2,SE H,IFK MORA,SE,11:03:19,01:15:21,01:47:51,02:39:11,03:30:51,05:01:23,06:40:05,07:07:24,08:26:36,10:09:24,10:59:50,11:03:19
17741,2,2012,11,FOOTE Michael,3,SE H,THE NORTH FACE,US,11:19:00,01:15:36,01:53:13,02:51:49,03:52:54,05:27:29,07:05:02,07:33:01,08:50:13,10:25:40,11:15:35,11:19:00


 #### For 2012 the data are well formed and look plausible, but the checkpoints and the times show that this is a different race. We cannot compare this year with the others. Since we do not even know which race it is, we exclude it from the rest of the analysis. We delete the observations whose year is 2012 and then drop the variables exclusive to that year, which we know will be left with no data.

In [7]:
# exclude the 2012 observations (it is a different race);
# .copy() avoids modifying a view of the original frame
df = df[df.Year != 2012].copy()
# find the variables exclusive to 2012: they are now all NaN
empties = df.columns[df.isna().all()].tolist()
display(empties)
# drop them from the dataframe
df = df.drop(columns=empties)
display(df.shape)


['ContaR', 'Bellevue', 'LesHouches', 'GarePp', '1km']

(30288, 47)
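 #### For reference, the two steps above (filter the rows, then drop the columns left empty) can be reproduced on a toy frame. The sketch also shows `dropna(axis='columns', how='all')` as a one-call alternative. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Year': [2011, 2012, 2013],
    'Arrivee': ['20:36:43', '10:32:36', '20:34:57'],
    '1km': [np.nan, '10:59:50', np.nan],   # only filled in 2012
})

# drop the 2012 rows; .copy() gives an independent frame
kept = toy[toy.Year != 2012].copy()

# explicit version: list the columns that are now entirely missing
empties = kept.columns[kept.isna().all()].tolist()

# one-call alternative: drop columns where every value is missing
slim = kept.dropna(axis='columns', how='all')
print(empties, list(slim))
```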

 #### Year 2003

In [8]:
get_year(df, 2003)[:3]


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Timediff
0,0,2003,47,SHERPA Dachhiri-Dawa,1,SEH,TEAM FILA,NP,20:05:58,00:00:00
1,1,2003,8,GAYLORD Topher,2,SEH,THE NORTH FACE,US,22:15:17,00:09:19
2,2,2003,301,SYBROWSKY Brandon,2,SEH,TEAM MONTRAIL,US,22:15:17,00:09:19


 #### For 2003 there are no intermediate split times, only the total time. Its observations can still contribute to the analysis. We fill the `Arrivee` column with the value of `Time`.

In [9]:
# build a mask for the 2003 rows
mask = df.Year == 2003
# update Arrivee
df.loc[mask, 'Arrivee'] = df.loc[mask, 'Time']
get_year(df, 2003)[:3]


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Timediff,Arrivee
0,0,2003,47,SHERPA Dachhiri-Dawa,1,SEH,TEAM FILA,NP,20:05:58,00:00:00,20:05:58
1,1,2003,8,GAYLORD Topher,2,SEH,THE NORTH FACE,US,22:15:17,00:09:19,22:15:17
2,2,2003,301,SYBROWSKY Brandon,2,SEH,TEAM MONTRAIL,US,22:15:17,00:09:19,22:15:17
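 #### The masked assignment can be checked in isolation on a hypothetical frame (`toy` and its values are assumptions): only the rows selected by the mask are overwritten, and the other years keep their own `Arrivee`.

```python
import pandas as pd

toy = pd.DataFrame({
    'Year': [2003, 2003, 2004],
    'Time': pd.to_timedelta(['20:05:58', '22:15:17', '21:06:18']),
    'Arrivee': pd.to_timedelta([None, None, '21:06:18']),
})

mask = toy.Year == 2003
# only the rows selected by the mask are overwritten
toy.loc[mask, 'Arrivee'] = toy.loc[mask, 'Time']
print(toy)
```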


 #### Year 2010

In [10]:
get_year(df, 2010)[:3]


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais
12876,0,2010,2216,CLEMENT Benoît,1,V1 H,,,0 days,01:53:45,02:52:44
12877,1,2010,3018,AGUETTAZ Christophe,2,V1 H,,,0 days,01:57:42,03:00:38
12878,2,2010,3913,PECH Philippe,3,V1 H,,,0 days,01:55:20,02:52:30


 #### 2010 is not a great year for our analysis: there are only two checkpoint columns and no total time. We keep the observations because they can be useful for analysing participants when times are not needed.

 #### Finally, we go through every year to check data quality, at least in the first rows.

In [11]:
years = np.arange(2003, 2018)
for y in years:
    if y != 2012:
        display(get_year(df, y).head(3))


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Timediff,Arrivee
0,0,2003,47,SHERPA Dachhiri-Dawa,1,SEH,TEAM FILA,NP,20:05:58,00:00:00,20:05:58
1,1,2003,8,GAYLORD Topher,2,SEH,THE NORTH FACE,US,22:15:17,00:09:19,22:15:17
2,2,2003,301,SYBROWSKY Brandon,2,SEH,TEAM MONTRAIL,US,22:15:17,00:09:19,22:15:17


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,ColDeVoza,Contamines,Bonhomme,Chapieux,ColSeigne,MtFavre,Courmayeur,Bertone,RifugioElena,ColFerret,LaFouly,Champex,Bovinette,Trient,Vallorcine,Gardes,Arrivee
67,0,2004,81,DELEBARRE Vincent,1,SE H,LES TRAILERS DU MONT BLANC - UFO - QUECHUA,FR,21:06:18,01:17:37,02:23:00,04:17:21,04:45:35,06:20:15,07:36:22,08:34:56,09:41:45,11:41:35,12:19:04,13:11:18,15:21:42,16:32:35,17:27:51,19:01:47,20:19:16,21:06:18
68,1,2004,1,SHERPA Dachhiri-Dawa,2,SE H,TEAM QUECHUA,NP,23:02:28,01:16:49,02:18:21,04:17:00,04:45:33,06:21:30,07:36:38,08:34:38,09:41:47,11:42:26,12:20:13,13:11:13,15:22:57,16:58:52,17:59:11,20:24:25,22:07:48,23:02:28
69,2,2004,845,PACHE Jean Claude,3,SE H,CARC ROMONT,CH,23:40:08,01:24:13,02:31:44,04:50:35,05:23:26,07:11:56,08:31:20,NaT,10:45:49,12:38:34,13:18:51,14:18:56,16:38:42,18:07:07,19:08:46,21:20:18,22:45:50,23:40:08


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,ColDeVoza,Contamines,LaBalme,Chapieux,ColSeigne,RefugeElisabetta,MtFavre,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaPeule,LaFouly,PrazDeFort,Champex,Bovinette,Trient,Tseppes,Vallorcine,Argentiere,Arrivee
1443,0,2005,2795,JAQUEROD Christophe,1,SE H,TEAM LAFUMA,CH,21:11:07,NaT,02:19:35,03:13:40,04:55:45,06:40:51,06:59:55,07:50:32,08:19:51,08:46:30,09:53:34,11:04:43,11:49:00,12:56:02,13:52:19,13:55:20,14:47:48,NaT,17:12:05,17:58:44,18:44:53,19:33:33,20:16:05,21:11:07
1444,1,2005,1,DELEBARRE Vincent,2,SE H,TRAILERS DU MONT BLANC - QUECHUA,FR,21:25:02,01:17:41,-1 days +23:02:51,03:13:43,04:40:40,06:09:22,06:27:53,07:18:27,07:47:56,08:17:21,09:25:49,10:30:51,11:09:10,12:10:55,12:33:18,13:06:42,14:04:39,NaT,16:45:13,17:42:05,18:31:38,19:26:51,20:18:23,21:25:02
1445,2,2005,2287,OLMO Marco,3,V2 H,M&amp;P RISK AGENCI,IT,21:49:57,NaT,02:29:44,03:28:03,05:10:24,06:50:17,07:16:57,08:13:06,08:47:08,09:17:44,10:27:11,NaT,12:12:08,13:15:53,13:52:27,14:13:50,15:03:12,NaT,NaT,NaT,19:07:58,20:08:18,20:49:49,21:49:57


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,ColDeVoza,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,RefugeElisabetta,MtFavre,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaPeule,LaFouly,PrazDeFort,Champex,Bovinette,Trient,Tseppes,Vallorcine,Argentiere,Arrivee
3365,0,2006,3,OLMO Marco,1,V2 H,M&amp;AMP;P RISK AGENCI,IT,21:06:06,01:22:22,02:30:52,03:29:15,04:33:28,05:07:58,06:39:29,07:05:15,07:52:35,08:24:15,08:52:22,10:03:00,11:08:59,11:51:28,12:50:45,NaT,13:46:58,14:37:35,15:28:22,16:49:09,17:35:33,18:21:28,19:16:38,20:03:38,21:06:06
3366,1,2006,3021,NEMETH Csaba,2,SE H,E.G. MOUNTEX,HU,21:37:32,01:18:17,02:22:05,03:17:56,04:27:33,04:59:44,06:27:18,06:52:40,07:39:38,08:11:31,08:44:09,09:47:44,10:53:21,11:34:51,12:37:19,NaT,13:37:14,14:27:39,15:20:38,16:54:58,17:48:30,18:39:09,19:46:42,20:34:58,21:37:32
3367,2,2006,2,DELEBARRE Vincent,3,SE H,TRAILERS DU MONT BLANC - CMBM - QUECHUA - EMHM...,FR,21:59:29,01:20:53,02:23:30,03:21:58,04:27:07,04:57:36,06:29:55,06:52:30,07:38:55,08:07:19,08:38:15,09:47:29,10:51:20,11:31:32,12:31:55,NaT,13:26:48,14:15:37,15:11:28,17:00:51,18:27:42,07:13:57,20:18:59,21:02:55,21:59:29


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,LaCharme,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,RefugeElisabetta,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,Bovinette,Trient,Catogne,Vallorcine,Argentiere,Arrivee
5900,0,2007,1,OLMO Marco,1,V2 H,,IT,21:31:58,01:30:59,02:04:37,03:09:52,04:08:35,05:12:23,05:44:50,07:17:00,07:42:30,09:00:30,09:30:20,10:35:15,11:44:07,12:24:33,13:22:54,14:19:12,15:55:25,17:19:20,18:05:00,19:11:00,19:46:33,20:28:52,21:31:58
5901,1,2007,4,LUKAS Jens,2,V1 H,LSG KARLSRUHE,DE,22:23:55,01:36:06,02:07:50,03:17:10,04:23:29,NaT,06:10:27,07:50:38,08:18:48,09:42:16,10:17:18,11:29:57,12:30:50,13:11:51,14:09:48,15:08:05,16:46:36,18:17:46,19:05:06,20:06:00,20:40:46,21:23:09,22:23:55
5902,2,2007,8,MERMOUD Nicolas,3,V1 H,,FR,22:30:51,01:24:54,01:52:38,02:53:18,03:53:25,04:56:03,05:22:51,07:04:11,07:27:37,08:44:00,09:12:37,10:22:35,11:29:34,12:07:03,13:10:34,14:09:07,15:54:00,17:37:23,18:27:43,19:45:00,20:22:48,21:15:49,22:30:51


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,LaCharme,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,LacCombal,MtFavre,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,Bovinette,Trient,Catogne,Vallorcine,LaTeteAuxVents,LaFlegere,Arrivee
8214,0,2008,4048,JORNET BURGADA Kilian,1,ES H,SALOMON SANTIVERI,ES,20:56:59,01:17:09,01:48:47,02:45:57,03:44:42,04:41:16,05:14:28,06:37:53,07:05:53,07:46:57,08:16:16,08:44:20,09:41:42,10:34:32,11:10:59,12:06:19,13:00:38,14:42:00,16:05:55,16:47:01,17:43:46,18:14:27,19:52:43,20:14:19,20:56:59
8215,1,2008,5,SHERPA Dachhiri-Dawa,2,SE H,LES TRAILLEURS DU MT BLANC,NP,21:56:52,01:17:37,01:48:46,02:45:57,03:49:11,04:51:20,05:22:44,06:56:49,07:27:47,08:07:20,08:34:27,09:02:38,10:04:10,11:02:24,11:40:40,12:40:38,13:43:28,15:16:07,16:42:46,17:25:26,18:32:43,19:07:51,20:51:30,21:20:24,21:56:52
8216,2,2008,2176,CHORIER Julien,3,SE H,EAC ACTIVASPORT,FR,22:31:35,01:21:06,01:54:59,02:53:42,03:54:31,04:56:55,05:29:51,07:03:23,07:30:39,08:13:21,08:40:17,09:10:45,10:14:50,11:12:35,11:53:26,12:55:27,13:58:46,15:35:18,17:09:09,18:00:29,19:04:51,19:45:44,21:22:24,21:49:14,22:31:35


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,LacCombal,MtFavre,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,Bovinette,Trient,Catogne,Vallorcine,LaTeteAuxVents,LaFlegere,Arrivee
10590,0,2009,1,JORNET BURGADA Kilian,1,ES H,SALOMON SANTIVERI,ES,21:33:18,01:54:13,02:52:33,03:44:24,04:38:33,05:09:42,06:38:03,07:06:04,07:48:39,08:18:03,08:45:11,09:47:01,10:50:32,11:33:20,12:31:14,13:35:03,15:06:03,16:38:00,17:26:16,18:33:25,19:06:33,20:28:50,20:50:11,21:33:18
10591,1,2009,20,CHAIGNEAU Sebastien,2,SE H,,FR,22:36:45,01:56:49,02:59:51,04:02:48,05:05:10,05:39:53,07:17:03,07:44:20,08:28:19,08:58:24,09:30:56,10:37:26,11:40:47,12:19:59,13:16:05,14:18:40,16:03:31,17:35:41,18:21:34,19:28:25,20:03:29,21:34:02,21:59:07,22:36:45
10592,2,2009,4,KABURAKI Tsuyoshi,3,V1 H,THE NORTH FACE JAPAN,JP,22:48:36,02:03:52,03:07:08,04:06:02,05:08:52,05:43:58,07:16:03,07:44:17,08:30:52,08:58:28,09:29:38,11:11:58,12:07:39,12:43:30,13:42:37,14:40:36,16:13:24,17:39:52,18:26:36,19:33:15,20:10:52,21:43:13,22:05:23,22:48:36


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais
12876,0,2010,2216,CLEMENT Benoît,1,V1 H,,,0 days,01:53:45,02:52:44
12877,1,2010,3018,AGUETTAZ Christophe,2,V1 H,,,0 days,01:57:42,03:00:38
12878,2,2010,3913,PECH Philippe,3,V1 H,,,0 days,01:55:20,02:52:30


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,LacCombal,MtFavre,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,Martigny,Trient,Catogne,Vallorcine,Argentiere,Arrivee
15371,0,2011,1501,JORNET BURGADA Kilian,1,SE H,SALOMON,ES,20:36:43,01:17:20,01:54:37,02:52:35,03:48:44,04:46:11,05:17:31,06:44:11,07:09:28,07:47:28,08:12:15,08:36:42,09:30:28,10:19:21,10:54:30,11:48:11,12:52:26,14:30:42,16:14:15,17:34:42,18:35:14,19:05:29,19:45:30,20:36:43
15372,1,2011,1516,KARRERA ARANBURU Iker,2,SE H,SALOMON SANTIVERY,ES,20:45:30,01:17:20,01:54:36,02:52:37,03:48:52,04:46:18,05:17:30,06:44:16,07:09:24,07:47:34,08:12:34,08:36:42,09:30:26,10:19:21,10:54:18,11:48:09,12:52:26,14:30:53,16:14:12,17:37:09,18:35:14,19:07:58,19:45:34,20:45:30
15373,2,2011,1504,CHAIGNEAU Sebastien,3,SE H,THE NORTH FACE,FR,20:55:41,01:20:46,01:57:45,02:55:25,03:51:27,04:48:07,05:17:29,06:44:19,07:09:22,07:47:25,08:12:13,08:36:43,09:32:27,10:23:24,10:56:58,11:49:59,12:54:18,14:30:58,16:14:19,17:40:52,18:40:14,19:12:57,19:55:47,20:55:41


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,LacCombal,MtFavre,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,Bovinette,Trient,Catogne,Vallorcine,LaTeteAuxVents,LaFlegere,Arrivee
20220,0,2013,2212,THEVENARD Xavier,1,SE H,ASICS,FR,20:34:57,01:18:39,01:53:05,02:54:16,03:48:30,04:48:58,05:16:57,NaT,07:03:29,07:44:30,08:07:03,08:31:03,09:32:25,10:23:17,10:59:27,11:52:50,12:48:00,14:23:12,15:42:05,16:27:05,17:26:35,17:58:40,19:30:23,19:48:24,20:34:57
20221,1,2013,2202,HERAS HERNANDEZ Miguel Angel,2,SE H,SALOMON,ES,20:54:08,01:19:31,01:55:57,02:58:16,03:51:09,04:50:22,05:17:54,06:41:36,07:03:39,07:42:44,08:06:59,08:31:02,09:26:25,10:17:47,10:55:06,11:57:34,12:57:48,14:31:31,15:50:58,16:38:53,17:42:47,18:16:33,19:46:10,20:08:29,20:54:08
20222,2,2013,12,DOMINGUEZ LEDO Javier,3,SE H,KIROLAK-CM GAZTEIZ,ES,21:17:38,01:21:12,01:55:29,02:57:39,03:51:42,04:50:49,05:17:29,06:44:33,07:06:52,07:48:16,08:12:43,08:37:33,09:39:46,10:31:40,11:06:59,12:05:16,13:03:01,14:46:51,16:09:10,16:55:21,18:02:45,18:37:01,20:07:22,20:30:39,21:17:38


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,LacCombal,MtFavre,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,LaGiete,Trient,Catogne,Vallorcine,LaTeteAuxVents,LaFlegere,Arrivee
22688,0,2014,2,D'HAENE François,1,SE H,TEAM SALOMON INTERNATIONAL,FR,20:11:44,01:13:25,01:47:16,02:45:30,03:40:58,04:40:32,05:08:16,06:32:03,06:54:40,07:34:48,07:58:31,08:23:25,09:20:28,10:14:05,10:50:43,11:46:29,12:44:01,14:12:17,15:41:13,16:08:03,17:06:50,17:38:40,19:00:02,19:27:48,20:11:44
22689,1,2014,5,KARRERA ARANBURU Iker,2,V1 H,SALOMON INTERNACIONAL,ES,20:55:42,01:13:25,01:47:18,02:45:33,03:42:00,04:40:34,05:08:19,06:32:02,06:54:54,07:34:51,07:58:28,08:23:26,09:20:26,10:14:38,10:50:53,11:46:28,12:44:29,14:22:48,16:06:54,16:35:26,17:38:13,18:11:34,19:44:08,20:12:34,20:55:42
22690,2,2014,8,CASTANER BERNAT Tofol,2,V1 H,SALOMON INT.,ES,20:55:42,01:13:28,01:47:20,02:45:30,03:41:02,04:40:37,05:08:19,06:32:05,06:54:52,07:34:49,07:58:27,08:23:27,09:23:25,10:15:38,10:50:59,11:46:33,12:44:14,14:22:27,16:06:12,16:34:07,17:38:08,18:11:29,19:44:01,20:13:04,20:55:42


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,LacCombal,MtFavre,Checrouit,Courmayeur,Courmayeur2,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,LaGiete,Trient,Catogne,Vallorcine,LaTeteAuxVents,LaFlegere,Arrivee
25119,0,2015,8,THEVENARD Xavier,1,SE H,ASICS,FR,21:09:15,01:14:46,01:49:08,02:47:35,03:44:39,04:44:08,05:12:57,06:37:26,07:29:53,08:09:59,08:35:00,09:00:26,09:03:48,09:58:44,10:48:14,11:23:57,12:18:12,13:09:52,14:36:13,16:11:24,16:38:16,17:43:31,18:17:02,19:55:59,20:26:43,21:09:15
25120,1,2015,3,HERNANDO ALZAGA Luis Alberto,2,SE H,ADIDAS TRAIL RUNNING,ES,21:57:17,01:13:15,01:47:29,02:47:30,03:45:07,04:44:04,05:12:50,06:37:28,07:30:15,08:10:07,08:35:04,09:00:39,09:04:40,10:01:40,10:53:13,11:28:43,12:27:16,13:34:19,15:12:12,17:09:05,17:37:55,18:50:49,19:22:44,20:52:31,21:19:17,21:57:17
25121,2,2015,7,LANEY David,3,SE H,NIKE TRAIL ELITE,US,21:59:42,01:20:10,01:59:04,03:00:51,03:55:41,04:59:39,05:34:13,07:05:04,08:14:41,08:59:37,09:29:26,10:00:40,10:04:33,11:06:09,12:01:57,12:42:49,13:36:57,14:35:46,16:14:07,17:41:46,18:10:51,19:08:05,19:41:16,21:02:48,21:27:03,21:59:42


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,LacCombal,MtFavre,Checrouit,Courmayeur,Courmayeur2,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,LaGiete,Trient,Catogne,Vallorcine,LaTeteAuxVents,LaFlegere,Arrivee
27680,0,2016,25,POMMERET Ludovic,1,V1 H,HOKA - CABB,FR,22:00:02,01:16:24,01:49:50,02:54:04,04:10:14,05:30:18,06:04:46,07:38:37,08:24:38,09:04:12,09:28:30,09:56:43,10:01:23,10:57:27,11:53:41,12:33:15,13:27:27,14:20:36,15:56:23,17:30:57,17:57:55,18:54:44,19:25:32,20:50:37,21:17:39,22:00:02
27681,1,2016,9,GRINIUS Gediminas,2,SE H,TEAM VIBRAM,LT,22:26:05,01:19:40,01:55:12,02:57:34,03:57:47,05:03:24,05:33:19,07:06:39,08:00:44,08:46:55,09:14:33,09:43:33,09:46:58,10:51:46,11:49:16,12:26:30,13:25:55,14:19:50,15:51:13,17:29:39,17:58:21,19:05:42,19:39:49,21:15:17,21:44:08,22:26:05
27682,2,2016,12,TOLLEFSON Tim,3,SE H,NIKE TRAIL,US,22:30:28,01:23:03,02:00:31,03:02:34,04:01:59,05:05:41,05:36:49,07:16:59,08:18:03,09:00:35,09:28:27,09:58:08,10:00:25,11:03:42,12:03:21,12:39:38,13:36:43,14:35:30,16:12:36,17:55:17,18:21:58,19:29:24,20:04:06,21:24:54,21:52:03,22:30:28


Unnamed: 0,Id,Year,Bib,Name,Rank,Category,Team,Nationality,Time,Delevret,StGervais,Contamines,LaBalme,Bonhomme,Chapieux,ColSeigne,LacCombal,MtFavre,Checrouit,Courmayeur,Bertone,Bonatti,Arnouvaz,ColFerret,LaFouly,Champex,LaGiete,Trient,Tseppes,Vallorcine,ColDesMontets,LaFlegere,Arrivee
30234,0,2017,4,D'HAENE François,1,SE H,Salomon,FR,19:01:54,01:11:50,01:45:05,02:41:09,03:33:40,04:28:07,04:53:31,06:18:02,06:37:51,07:15:35,07:39:09,08:02:18,08:54:29,09:44:00,10:17:44,11:11:12,12:04:26,13:24:20,14:55:05,15:24:59,16:06:17,16:51:13,17:20:02,18:23:09,19:01:54
30235,1,2017,2,JORNET BURGADA Kilian,2,SE H,Salomon,ES,19:16:59,01:10:00,01:44:21,02:41:01,03:33:45,04:29:18,04:54:39,06:18:04,06:37:54,07:15:37,07:39:16,08:02:49,08:57:30,09:48:28,10:23:53,11:18:54,12:12:40,13:33:52,15:13:06,15:41:22,16:23:16,17:05:14,17:34:21,18:39:27,19:16:59
30236,2,2017,14,TOLLEFSON Tim,3,SE H,Hoka,US,19:53:00,01:15:24,01:48:38,02:45:17,03:41:50,04:41:04,05:10:05,06:40:51,07:02:40,07:42:45,08:08:05,08:33:53,09:29:48,10:21:27,10:55:21,NaT,12:46:12,14:08:23,15:45:55,16:12:00,16:56:16,17:39:45,18:09:03,19:17:41,19:53:00


 #### New kid in town! Everything looks correct except for a negative split time in 2005 at Contamines for the runner with bib 1. Let's review the negative times.

In [12]:
# retrieve the time columns by position,
# knowing that they are consecutive
cronos = df[list(df)[8:]]

# reusable helper function


def check_less_than_0(cronos):
    # flag times that are negative and not NaN
    less_than_0 = (cronos < pd.Timedelta(0)) & pd.notna(cronos)
    # sum the booleans to count them per column
    totals_lt0 = less_than_0.sum().copy()
    # keep the columns with at least one negative value, sorted by count
    totals_gt0 = totals_lt0[totals_lt0 > 0].sort_values(ascending=False)
    display(totals_gt0)
    return totals_gt0.index.values


cols_lt0 = check_less_than_0(cronos)


Courmayeur    41
Contamines     3
ColDeVoza      2
Tseppes        1
LaFouly        1
dtype: int64
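 #### It may help to recall how pandas prints a negative timedelta: as a negative number of days plus a positive time of day. The sketch below rebuilds the value seen at Contamines from its components to show what it amounts to; the variable name `t` is only illustrative.

```python
import pandas as pd

# -1 day plus 23:02:51, i.e. 57 minutes and 9 seconds before zero
t = pd.Timedelta(days=-1, hours=23, minutes=2, seconds=51)

print(t)                    # printed in pandas' "-1 days +HH:MM:SS" form
print(t.total_seconds())    # signed length in seconds
print(t < pd.Timedelta(0))  # the comparison used by the check above
```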

 #### There are not many, but they should not stay in the dataframe. We set them to `nan` (stored as `NaT` in timedelta columns).

In [13]:
for col in cols_lt0:
    mask = (cronos[col] < pd.Timedelta(0)) & pd.notna(cronos[col])
    df.loc[mask, col] = np.nan

cronos = df[list(df)[8:]]
check_less_than_0(cronos)


Series([], dtype: int64)

array([], dtype=object)
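 #### As an alternative to the loop, the same clean-up can be sketched in one vectorised step with `DataFrame.mask`. This is not what the notebook does; the frame `toy` and its values (given in seconds) are hypothetical.

```python
import pandas as pd

# toy split times in seconds; one Contamines value is negative
toy = pd.DataFrame({
    'Contamines': pd.to_timedelta([8375, -3429, 8984], unit='s'),
    'Courmayeur': pd.to_timedelta([31590, 29841, 33464], unit='s'),
})

# DataFrame.mask replaces the cells where the condition is True
clean = toy.mask(toy < pd.Timedelta(0))

# count the negative values that remain after the clean-up
negatives_left = int((clean < pd.Timedelta(0)).sum().sum())
print(negatives_left)
```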

 #### Now let's look at the quality of the `Nationality` variable.

In [14]:
# convert all values to lower case
df.Nationality = df.Nationality.str.lower()
# group, count and sort the countries
countries = df.groupby('Nationality')[
    'Nationality'].count().sort_values(ascending=False)
# show the first ones
display(countries[:5])
# show the last ones
display(countries[-5:])
# show all the values
display(countries.index.values)


Nationality
      14601
fr     7938
es     1591
it     1116
gb      888
Name: Nationality, dtype: int64

Nationality
by    1
pa    1
cy    1
ne    1
bn    1
Name: Nationality, dtype: int64

array([' ', 'fr', 'es', 'it', 'gb', 'jp', 'de', 'ch', 'us', 'be', 'pl',
       'pt', 'cn', 'hk', 'au', 'at', 'se', 'hu', 'ca', 'nl', 'gr', 'ie',
       'ar', 'dk', 'cz', 'hr', 'ro', 'br', 'si', 'lv', 'fi', 'lu', 'sk',
       'sg', 'tr', 'no', 'bg', 'cl', 'nz', 'is', 'ru', 'mx', 'za', 'ph',
       'kr', 'co', 'my', 'tn', 'ec', 'tw', 'il', 'np', 've', 'in', 'ua',
       'gt', 'pe', 'lt', 'rs', 'id', 'ba', 'uy', 'ma', 'th', 'mk', 'lb',
       'py', 'cr', 'ir', 'dz', 'ad', 'ge', 'sv', 'sm', 'vn', 'ee', 'mt',
       'mu', 'bo', 'me', 'by', 'pa', 'cy', 'ne', 'bn'], dtype=object)

 #### **Careful!** One country code is a single blank space, and it is the most frequent value (14601 rows). We will take this into account later on.
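 #### One possible way to deal with the blank code, sketched here only as an assumption (this notebook does not show how it is handled), is to strip whitespace and treat empty codes as missing. The series `nat` below is hypothetical.

```python
import numpy as np
import pandas as pd

nat = pd.Series(['FR', ' ', 'es', ' ', 'NP'])

# normalise case, strip whitespace, and treat empty codes as missing
nat = nat.str.lower().str.strip().replace('', np.nan)

counts = nat.value_counts()         # missing values are excluded
n_missing = int(nat.isna().sum())
print(counts, n_missing)
```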

 #### We review the categories. In the listings above, some categories contained spaces.

In [15]:
# convert all values to lower case
df.Category = df.Category.str.lower()
# remove the spaces
df.Category = df.Category.str.replace(' ', '')
# group and count the categories (sorted by label, not by count)
categories = df.groupby('Category')['Category'].count()
# show the first ones
display(categories[:5])
# show the last ones
display(categories[-5:])
# show all the values
display(categories.index.values)


Category
esf       5
esh      79
juh       3
sef     947
seh    9645
Name: Category, dtype: int64

Category
v2h    5185
v3f      48
v3h     809
v4f       1
v4h      51
Name: Category, dtype: int64

array(['esf', 'esh', 'juh', 'sef', 'seh', 'v1f', 'v1h', 'v2f', 'v2h',
       'v3f', 'v3h', 'v4f', 'v4h'], dtype=object)

 #### The category variable combines the category and the sex. Let's split them.

In [16]:
# create the Sex column from the last character of Category
df['Sex'] = df.Category.str[2]
display(df['Sex'][:3])
# remove the sex from Category
df['Category'] = df.Category.str[:2]
# for convenience, reorder the columns, inserting Sex at index 8
# so that the timedeltas stay together at the end of the dataframe
cols = list(df)
cols.remove('Sex')
cols.insert(8, 'Sex')
df = df[cols]


0    h
1    h
2    h
Name: Sex, dtype: object
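 #### The normalisation and the split can be tried on a few hypothetical codes (the series `cat` is an assumption, mixing the spellings seen in the listings above):

```python
import pandas as pd

cat = pd.Series(['SE H', 'V1 H', 'SEF', 'v2f'])

# normalise: lower case, no spaces, so every code has 3 characters
cat = cat.str.lower().str.replace(' ', '', regex=False)

sex = cat.str[-1]       # last character
category = cat.str[:2]  # first two characters
print(sex.tolist(), category.tolist())
```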

 #### We save the results to a new CSV to continue the exploration in another notebook.

In [17]:
file = os.path.join('data', 'utmb.2.csv')  # portable path
df.to_csv(file, index=False)
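 #### Note that a CSV stores timedeltas as text, so whoever reads the file back has to convert the time columns again, as was done at the top of this notebook. A minimal sketch with a hypothetical frame and an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'Name': ['A', 'B'],
    'Time': pd.to_timedelta(['20:05:58', None]),
})

# write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)

back = pd.read_csv(buf)
# the column comes back as text, so it has to be converted again
was_timedelta = pd.api.types.is_timedelta64_dtype(back['Time'])
back['Time'] = pd.to_timedelta(back['Time'])
print(was_timedelta, back['Time'].dtype)
```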


 ###### Go to [Visualization](https://github.com/pontnou/utmb/blob/master/umtb.3.ipynb "Part 3").