# **Práctica 2.2: Preprocesado**

<hr>

## **1. Objetivo**
En esta práctica descargaremos un conjunto de datos y lo analizaremos y limpiaremos para utilizar en futuras prácticas. 


### **Librería Pandas**
Se considera la librería más popular de análisis de datos en Python.  
Maneja todas sus operaciones mediante un objeto *"Dataframe"*.

Permite, entre otras operaciones:

* Cargar y almacenar datos en diferentes formatos (csv, tsv, xlsx, txt...).
* Manipular filas, columnas y celdas.
* Filtrar o agrupar contenido.
* Realizar la intersección, concatenación o combinación de varios Dataframes

Para instalarla:


In [None]:
! pip install pandas

<div class="alert alert-block alert-warning">
    <strong>NOTA:</strong> La exclamación antes del código indica a Jupyter que este no es Python y ha de ejecutarse en la terminal. Esto nos permite instalar librerías directamente desde el notebook.
</div>

<hr>

## **2. Análisis exploratorio de datos (EDA)**

Este análisis tiene como objetivos principales:
* Conocer los datos a los que nos vamos a enfrentar.
* Limpiar el conjunto de datos:
  * Eliminar filas o columnas vacías.
  * Eliminar valores incongruentes.

Nuestro conjunto de datos contiene información sobre una carrera de la temporada 2023 de Formula 1.  
A continuación descargaremos los datos y los analizaremos con la librería Pandas.

In [2]:
import pandas as pd

url_data = "https://raw.githubusercontent.com/AIC-Uniovi/Sistemas-Inteligentes/refs/heads/main/datasets/f1_23_monaco.csv"
data = pd.read_csv(url_data)

**Descripción de las columnas del dataset**

| Columna               | Descripción |
|-----------------------|------------|
| `Time`               | Tiempo total transcurrido en la sesión. |
| `Driver`             | Código de tres letras del piloto. |
| `DriverNumber`       | Número del piloto en la carrera. |
| `LapTime`            | Tiempo total de la vuelta. |
| `LapNumber`          | Número de la vuelta en la sesión. |
| `Stint`              | Número de stint actual (período entre paradas en boxes). |
| `PitOutTime`         | Tiempo en el que el piloto salió de boxes. |
| `PitInTime`          | Tiempo en el que el piloto entró en boxes. |
| `Sector1Time`        | Tiempo registrado en el primer sector de la vuelta. |
| `Sector2Time`        | Tiempo registrado en el segundo sector de la vuelta. |
| `Sector3Time`        | Tiempo registrado en el tercer sector de la vuelta. |
| `SpeedI1`           | Velocidad medida en el primer punto de detección. |
| `SpeedI2`           | Velocidad medida en el segundo punto de detección. |
| `SpeedFL`           | Velocidad en la línea de meta. |
| `SpeedST`           | Velocidad máxima en el sector. |
| `IsPersonalBest`    | Indica si la vuelta es la mejor personal del piloto (`True`/`False`). |
| `Compound`          | Tipo de compuesto de neumáticos utilizado. |
| `TyreLife`         | Número de vueltas que lleva el neumático en uso. |
| `FreshTyre`         | Indica si el neumático era nuevo al inicio de la vuelta (`True`/`False`). |
| `Team`              | Nombre del equipo del piloto. |
| `LapStartTime`      | Tiempo de inicio de la vuelta en la sesión. |
| `LapStartDate`      | Fecha y hora exacta del inicio de la vuelta. |
| `TrackStatus`       | Estado de la pista en la vuelta (ej. bandera amarilla, verde, etc.). |
| `Position`          | Posición del piloto al finalizar la vuelta. |
| `Deleted`           | Indica si la vuelta fue eliminada (`True`/`False`). |
| `DeletedReason`     | Razón por la que la vuelta fue eliminada (si aplica). |
| `IsAccurate`        | Indica si los datos de la vuelta son precisos (`True`/`False`). |

### **Operaciones básicas**

In [3]:
# Nombre de columnas
data.columns

Index(['Time', 'Driver', 'DriverNumber', 'LapTime', 'LapNumber', 'Stint',
       'PitOutTime', 'PitInTime', 'Sector1Time', 'Sector2Time', 'Sector3Time',
       'SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST', 'IsPersonalBest',
       'Compound', 'TyreLife', 'FreshTyre', 'Team', 'LapStartTime',
       'LapStartDate', 'TrackStatus', 'Position', 'Deleted', 'DeletedReason',
       'IsAccurate'],
      dtype='object')

In [12]:
# Tipos de las columnas
data.dtypes

Time               object
Driver             object
DriverNumber        int64
LapTime            object
LapNumber         float64
Stint             float64
PitOutTime         object
PitInTime          object
Sector1Time        object
Sector2Time        object
Sector3Time        object
SpeedI1           float64
SpeedI2           float64
SpeedFL           float64
SpeedST           float64
IsPersonalBest       bool
Compound           object
TyreLife          float64
FreshTyre            bool
Team               object
LapStartTime       object
LapStartDate       object
TrackStatus         int64
Position          float64
Deleted              bool
DeletedReason      object
IsAccurate           bool
dtype: object

In [13]:
# Número de columnas
len(data.columns)

27

In [14]:
# Número de filas
len(data)

1513

In [15]:
# Obtener estadisticas básicas de todo el conjunto
data.describe()

Unnamed: 0,DriverNumber,LapNumber,Stint,SpeedI1,SpeedI2,SpeedFL,SpeedST,TyreLife,TrackStatus,Position
count,1513.0,1513.0,1513.0,1391.0,1513.0,1476.0,1513.0,1513.0,1513.0,1513.0
mean,28.438863,38.523463,1.775942,175.464414,173.806345,257.080623,272.918705,19.269002,2.85195,10.262393
std,23.285504,22.125804,0.942474,28.337102,21.097378,7.230135,13.549024,13.872089,4.798371,5.671148
min,1.0,1.0,1.0,90.0,79.0,187.0,157.0,1.0,1.0,1.0
25%,11.0,19.0,1.0,151.0,162.0,256.0,271.0,8.0,1.0,5.0
50%,22.0,38.0,1.0,189.0,184.0,258.0,278.0,16.0,1.0,10.0
75%,44.0,57.0,2.0,197.0,189.0,259.0,280.0,28.0,1.0,15.0
max,81.0,78.0,6.0,212.0,197.0,272.0,288.0,56.0,21.0,20.0


In [16]:
# Buscar columnas con valores inexistentes
data.isnull().any()

Time              False
Driver            False
DriverNumber      False
LapTime            True
LapNumber         False
Stint             False
PitOutTime         True
PitInTime          True
Sector1Time        True
Sector2Time       False
Sector3Time       False
SpeedI1            True
SpeedI2           False
SpeedFL            True
SpeedST           False
IsPersonalBest    False
Compound          False
TyreLife          False
FreshTyre         False
Team              False
LapStartTime      False
LapStartDate      False
TrackStatus       False
Position          False
Deleted           False
DeletedReason      True
IsAccurate        False
dtype: bool

In [17]:
# Mostrar las 5 primeras filas
data.head(5)

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
0,0 days 01:03:27.458000,VER,1,0 days 00:01:24.238000,1.0,1.0,,,,0 days 00:00:37.420000,...,1.0,True,Red Bull Racing,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,1.0,False,,False
1,0 days 01:04:46.825000,VER,1,0 days 00:01:19.367000,2.0,1.0,,,0 days 00:00:20.954000,0 days 00:00:37.366000,...,2.0,True,Red Bull Racing,0 days 01:03:27.458000,2023-05-28 13:04:28.435,1,1.0,False,,True
2,0 days 01:06:05.899000,VER,1,0 days 00:01:19.074000,3.0,1.0,,,0 days 00:00:20.854000,0 days 00:00:37.288000,...,3.0,True,Red Bull Racing,0 days 01:04:46.825000,2023-05-28 13:05:47.802,1,1.0,False,,True
3,0 days 01:07:24.028000,VER,1,0 days 00:01:18.129000,4.0,1.0,,,0 days 00:00:20.835000,0 days 00:00:36.637000,...,4.0,True,Red Bull Racing,0 days 01:06:05.899000,2023-05-28 13:07:06.876,1,1.0,False,,True
4,0 days 01:08:42.047000,VER,1,0 days 00:01:18.019000,5.0,1.0,,,0 days 00:00:20.745000,0 days 00:00:36.734000,...,5.0,True,Red Bull Racing,0 days 01:07:24.028000,2023-05-28 13:08:25.005,1,1.0,False,,True


In [18]:
# Mostrar las 5 últimas filas
data.tail(5)

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
1508,0 days 02:45:26.431000,PIA,81,0 days 00:01:27.443000,73.0,2.0,,,0 days 00:00:23.295000,0 days 00:00:42.212000,...,19.0,True,McLaren,0 days 02:43:58.988000,2023-05-28 14:44:59.965,1,10.0,False,,True
1509,0 days 02:46:52.616000,PIA,81,0 days 00:01:26.185000,74.0,2.0,,,0 days 00:00:22.381000,0 days 00:00:42.112000,...,20.0,True,McLaren,0 days 02:45:26.431000,2023-05-28 14:46:27.408,1,10.0,False,,True
1510,0 days 02:48:18.759000,PIA,81,0 days 00:01:26.143000,75.0,2.0,,,0 days 00:00:22.061000,0 days 00:00:42.272000,...,21.0,True,McLaren,0 days 02:46:52.616000,2023-05-28 14:47:53.593,1,10.0,False,,True
1511,0 days 02:49:44.374000,PIA,81,0 days 00:01:25.615000,76.0,2.0,,,0 days 00:00:21.991000,0 days 00:00:41.496000,...,22.0,True,McLaren,0 days 02:48:18.759000,2023-05-28 14:49:19.736,1,10.0,False,,True
1512,0 days 02:51:09.159000,PIA,81,0 days 00:01:24.785000,77.0,2.0,,,0 days 00:00:21.970000,0 days 00:00:40.944000,...,23.0,True,McLaren,0 days 02:49:44.374000,2023-05-28 14:50:45.351,1,10.0,False,,True


In [19]:
# Acceder a una columna
data["Driver"]

0       VER
1       VER
2       VER
3       VER
4       VER
       ... 
1508    PIA
1509    PIA
1510    PIA
1511    PIA
1512    PIA
Name: Driver, Length: 1513, dtype: object

In [20]:
# Obtener múltiples estadisticas de una columna
data["Stint"].describe()

count    1513.000000
mean        1.775942
std         0.942474
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         6.000000
Name: Stint, dtype: float64

In [21]:
# Operaciones sobre columnas numéricas
data["LapNumber"]+1

0        2.0
1        3.0
2        4.0
3        5.0
4        6.0
        ... 
1508    74.0
1509    75.0
1510    76.0
1511    77.0
1512    78.0
Name: LapNumber, Length: 1513, dtype: float64

In [22]:
# Ver los valores únicos (sin repeticiones) de una columna
data["Team"].unique()

array(['Red Bull Racing', 'Alpine', 'Aston Martin', 'Ferrari', 'Williams',
       'Haas F1 Team', 'AlphaTauri', 'Alfa Romeo', 'McLaren', 'Mercedes'],
      dtype=object)

In [23]:
len(data["Team"].unique())

10

In [24]:
# Acceder a varias columnas
data[["Driver","Team"]]

Unnamed: 0,Driver,Team
0,VER,Red Bull Racing
1,VER,Red Bull Racing
2,VER,Red Bull Racing
3,VER,Red Bull Racing
4,VER,Red Bull Racing
...,...,...
1508,PIA,McLaren
1509,PIA,McLaren
1510,PIA,McLaren
1511,PIA,McLaren


In [25]:
# Obtener una lista de valores para una columna y acceder a un elemento
data["Team"].values[180]

'Red Bull Racing'

In [26]:
# Acceder a la fila 1280 , columna 1 (empezando en cero)
data.iloc[1280, 1]

'SAI'

In [27]:
# Ordenar por la fila "Time"
data.sort_values(["LapTime"])

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
1157,0 days 01:45:31.903000,HAM,44,0 days 00:01:15.650000,33.0,2.0,,,0 days 00:00:19.760000,0 days 00:00:35.647000,...,2.0,True,Mercedes,0 days 01:44:16.253000,2023-05-28 13:45:17.230,1,8.0,False,,True
355,0 days 02:02:23.489000,LEC,16,0 days 00:01:15.773000,46.0,2.0,,,0 days 00:00:19.996000,0 days 00:00:35.718000,...,2.0,True,Ferrari,0 days 02:01:07.716000,2023-05-28 14:02:08.693,1,8.0,False,,True
126,0 days 02:06:14.872000,GAS,10,0 days 00:01:15.831000,49.0,2.0,,,0 days 00:00:19.843000,0 days 00:00:35.692000,...,2.0,True,Alpine,0 days 02:04:59.041000,2023-05-28 14:06:00.018,1,8.0,False,,True
127,0 days 02:07:30.827000,GAS,10,0 days 00:01:15.955000,50.0,2.0,,,0 days 00:00:20.045000,0 days 00:00:35.628000,...,3.0,True,Alpine,0 days 02:06:14.872000,2023-05-28 14:07:15.849,1,8.0,False,,True
359,0 days 02:07:29.057000,LEC,16,0 days 00:01:15.956000,50.0,2.0,,,0 days 00:00:19.730000,0 days 00:00:35.934000,...,6.0,True,Ferrari,0 days 02:06:13.101000,2023-05-28 14:07:14.078,1,7.0,False,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
731,0 days 02:38:48.233000,TSU,22,0 days 00:02:17.012000,68.0,2.0,,,0 days 00:00:25.629000,0 days 00:01:21.925000,...,15.0,True,AlphaTauri,0 days 02:36:31.221000,2023-05-28 14:37:32.198,12,13.0,False,,True
1257,0 days 02:15:32.997000,SAI,55,0 days 00:02:19.381000,55.0,2.0,,0 days 02:15:04.653000,0 days 00:00:27.227000,0 days 00:01:01.073000,...,22.0,True,Ferrari,0 days 02:13:13.616000,2023-05-28 14:14:14.593,12,8.0,False,,False
947,0 days 02:15:34.526000,HUL,27,0 days 00:02:19.643000,54.0,2.0,,0 days 02:15:00.544000,0 days 00:00:27.923000,0 days 00:00:56.020000,...,53.0,True,Haas F1 Team,0 days 02:13:14.883000,2023-05-28 14:14:15.860,12,18.0,False,,False
574,0 days 02:24:30.971000,MAG,20,0 days 00:02:26.648000,58.0,2.0,,,0 days 00:01:08.435000,0 days 00:00:50.544000,...,2.0,True,Haas F1 Team,0 days 02:22:04.323000,2023-05-28 14:23:05.300,12,19.0,False,,True


<div class="alert alert-block alert-warning">
    <strong>NOTA:</strong> Las operaciones anteriores no son 'inplace', es decir, no modifican el DataFrame, solo lo consultan.
</div>

In [28]:
# Obtener el número de vueltas máximo, medio y mínimo que se utilizó un juego de neumáticos
mean_life = data["TyreLife"].mean()
min_life  = data["TyreLife"].min()
max_life  = data["TyreLife"].max()

print(min_life, mean_life ,max_life)

1.0 19.2690019828156 56.0


In [29]:
# Cambiar el tipo de una serie de columnas a int
data[["DriverNumber", "LapNumber", "Stint", "TyreLife", "TrackStatus", "Position"]] = data[["DriverNumber", "LapNumber", "Stint", "TyreLife", "TrackStatus", "Position"]].astype(int)

In [30]:
# Añadir una nueva columna
data["Nueva_columna_uno"] = 1 # Todas las filas tendrán el mismo valor
data["Nueva_columna_dos"] = list(range(len(data))) # Nueva columna a partir de una lista de valores (tantos como filas)
data["Nueva_columna_tres"] = data["Stint"] + 1 # Nueva columna a partir de otra

In [31]:
# Eliminar columnas
data = data.drop(columns=["Nueva_columna_uno", "Nueva_columna_dos", "Nueva_columna_tres"])
# Esto es equivalente a:
# data.drop(columns=["Nueva_columna_uno", "Nueva_columna_dos", "Nueva_columna_tres"], inplace=True)

<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Obtén el número de pilotos existentes.
</div>

In [84]:
# Tu código aquí
len(data["Driver"].unique())

20

<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Cambia el tipo de las columnas: "Time", "LapTime", "PitOutTime", "PitInTime", "Sector1Time", "Sector2Time", "Sector3Time" y "LapStartTime" a <a href="https://pandas.pydata.org/docs/reference/api/pandas.to_timedelta.html"><i>timedelta</i></a>.
</div>

In [105]:
time_columns = ["Time", "LapTime", "PitOutTime", "PitInTime", "Sector1Time", "Sector2Time", "Sector3Time", "LapStartTime"]
# Tu código aquí
for col in time_columns:
    data[col] = pd.to_timedelta(data[col])
    
for col in time_columns:
    print(data[col].dtype)

timedelta64[ns]
timedelta64[ns]
timedelta64[ns]
timedelta64[ns]
timedelta64[ns]
timedelta64[ns]
timedelta64[ns]
timedelta64[ns]


<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Cambia el tipo de la columna: "LapStartDate" a <a href="https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html"><i>datetime</i></a>.
</div>

In [98]:
# Tu código aquí
data["LapStartDate"]= pd.to_datetime(data["LapStartDate"])

print(data["LapStartDate"].dtype)


datetime64[ns]


<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> ¿Qué piloto ha sido el mejor en el primer sector? ¿Y en el segundo?
</div>

In [None]:
# Tu código aquí
data.sort_values("Sector1Time").head(1)[["Driver","Sector1Time"]]

data.sort_values("Sector1Time").iloc[0,1]


Unnamed: 0,Driver,Sector1Time
359,LEC,0 days 00:00:19.730000


In [111]:
data.to_pickle("prueba")

### **Filtrado de datos**

In [36]:
# Obtener el valor de una celda en concreto
data.loc[572, 'Team']

'Haas F1 Team'

In [37]:
# Obtener las vueltas de los pilotos cuyo equipo es "Ferrari"
data_ferrari = data.loc[data["Team"]=="Ferrari"]
data_ferrari

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
310,0 days 01:03:32.328000,LEC,16,0 days 00:01:29.108000,1,1,,,,0 days 00:00:38.983000,...,1,True,Ferrari,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,6,False,,False
311,0 days 01:04:52.877000,LEC,16,0 days 00:01:20.549000,2,1,,,0 days 00:00:21.352000,0 days 00:00:37.692000,...,2,True,Ferrari,0 days 01:03:32.328000,2023-05-28 13:04:33.305,1,6,False,,True
312,0 days 01:06:12.898000,LEC,16,0 days 00:01:20.021000,3,1,,,0 days 00:00:21.051000,0 days 00:00:37.547000,...,3,True,Ferrari,0 days 01:04:52.877000,2023-05-28 13:05:53.854,1,6,False,,True
313,0 days 01:07:32.366000,LEC,16,0 days 00:01:19.468000,4,1,,,0 days 00:00:21.023000,0 days 00:00:37.375000,...,4,True,Ferrari,0 days 01:06:12.898000,2023-05-28 13:07:13.875,1,6,False,,True
314,0 days 01:08:51.203000,LEC,16,0 days 00:01:18.837000,5,1,,,0 days 00:00:20.803000,0 days 00:00:37.198000,...,5,True,Ferrari,0 days 01:07:32.366000,2023-05-28 13:08:33.343,1,6,False,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1276,0 days 02:46:14.565000,SAI,55,0 days 00:01:28.585000,74,3,,,0 days 00:00:24.031000,0 days 00:00:42.481000,...,19,True,Ferrari,0 days 02:44:45.980000,2023-05-28 14:45:46.957,1,8,False,,True
1277,0 days 02:47:40.953000,SAI,55,0 days 00:01:26.388000,75,3,,,0 days 00:00:22.468000,0 days 00:00:41.871000,...,20,True,Ferrari,0 days 02:46:14.565000,2023-05-28 14:47:15.542,1,8,False,,True
1278,0 days 02:49:06.729000,SAI,55,0 days 00:01:25.776000,76,3,,,0 days 00:00:22.362000,0 days 00:00:41.296000,...,21,True,Ferrari,0 days 02:47:40.953000,2023-05-28 14:48:41.930,1,8,False,,True
1279,0 days 02:50:32.308000,SAI,55,0 days 00:01:25.579000,77,3,,,0 days 00:00:22.283000,0 days 00:00:41.078000,...,22,True,Ferrari,0 days 02:49:06.729000,2023-05-28 14:50:07.706,1,8,False,,True


In [38]:
# Obtener todas las vueltas 1,2 de los pilotos 
data.loc[data["LapNumber"]<=2]

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
0,0 days 01:03:27.458000,VER,1,0 days 00:01:24.238000,1,1,,,,0 days 00:00:37.420000,...,1,True,Red Bull Racing,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,1,False,,False
1,0 days 01:04:46.825000,VER,1,0 days 00:01:19.367000,2,1,,,0 days 00:00:20.954000,0 days 00:00:37.366000,...,2,True,Red Bull Racing,0 days 01:03:27.458000,2023-05-28 13:04:28.435,1,1,False,,True
78,0 days 01:03:32.921000,GAS,10,0 days 00:01:29.701000,1,1,,,,0 days 00:00:39.087000,...,1,True,Alpine,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,7,False,,False
79,0 days 01:04:53.557000,GAS,10,0 days 00:01:20.636000,2,1,,,0 days 00:00:21.350000,0 days 00:00:37.855000,...,2,True,Alpine,0 days 01:03:32.921000,2023-05-28 13:04:33.898,1,7,False,,True
156,0 days 01:04:01.410000,PER,11,0 days 00:01:58.190000,1,1,,0 days 01:03:37.245000,,0 days 00:00:45.390000,...,1,True,Red Bull Racing,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,18,False,,False
157,0 days 01:05:21.114000,PER,11,0 days 00:01:19.704000,2,2,0 days 01:04:02.058000,,0 days 00:00:22.935000,0 days 00:00:36.337000,...,1,True,Red Bull Racing,0 days 01:04:01.410000,2023-05-28 13:05:02.387,1,18,False,,False
232,0 days 01:03:28.747000,ALO,14,0 days 00:01:25.527000,1,1,,,,0 days 00:00:37.981000,...,2,False,Aston Martin,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,2,False,,False
233,0 days 01:04:48.393000,ALO,14,0 days 00:01:19.646000,2,1,,,0 days 00:00:20.871000,0 days 00:00:37.656000,...,3,False,Aston Martin,0 days 01:03:28.747000,2023-05-28 13:04:29.724,1,2,False,,True
310,0 days 01:03:32.328000,LEC,16,0 days 00:01:29.108000,1,1,,,,0 days 00:00:38.983000,...,1,True,Ferrari,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,6,False,,False
311,0 days 01:04:52.877000,LEC,16,0 days 00:01:20.549000,2,1,,,0 days 00:00:21.352000,0 days 00:00:37.692000,...,2,True,Ferrari,0 days 01:03:32.328000,2023-05-28 13:04:33.305,1,6,False,,True


In [39]:
# Obtener las vueltas 10 de los pilotos de "Ferrari"
data_ferrari_10 = data.loc[(data["LapNumber"]==10) & (data["Team"]=="Ferrari")]
data_ferrari_10

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
319,0 days 01:15:24.870000,LEC,16,0 days 00:01:18.219000,10,1,,,0 days 00:00:20.645000,0 days 00:00:36.610000,...,10,True,Ferrari,0 days 01:14:06.651000,2023-05-28 13:15:07.628,1,6,False,,True
1212,0 days 01:15:21.840000,SAI,55,0 days 00:01:18.070000,10,1,,,0 days 00:00:20.574000,0 days 00:00:36.868000,...,10,True,Ferrari,0 days 01:14:03.770000,2023-05-28 13:15:04.747,1,4,False,,True


In [40]:
# Obtener las vueltas de los pilotos "SAI" o "LEC"
data.loc[(data["Driver"]=="SAI") | (data["Driver"]=="LEC")]

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
310,0 days 01:03:32.328000,LEC,16,0 days 00:01:29.108000,1,1,,,,0 days 00:00:38.983000,...,1,True,Ferrari,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,6,False,,False
311,0 days 01:04:52.877000,LEC,16,0 days 00:01:20.549000,2,1,,,0 days 00:00:21.352000,0 days 00:00:37.692000,...,2,True,Ferrari,0 days 01:03:32.328000,2023-05-28 13:04:33.305,1,6,False,,True
312,0 days 01:06:12.898000,LEC,16,0 days 00:01:20.021000,3,1,,,0 days 00:00:21.051000,0 days 00:00:37.547000,...,3,True,Ferrari,0 days 01:04:52.877000,2023-05-28 13:05:53.854,1,6,False,,True
313,0 days 01:07:32.366000,LEC,16,0 days 00:01:19.468000,4,1,,,0 days 00:00:21.023000,0 days 00:00:37.375000,...,4,True,Ferrari,0 days 01:06:12.898000,2023-05-28 13:07:13.875,1,6,False,,True
314,0 days 01:08:51.203000,LEC,16,0 days 00:01:18.837000,5,1,,,0 days 00:00:20.803000,0 days 00:00:37.198000,...,5,True,Ferrari,0 days 01:07:32.366000,2023-05-28 13:08:33.343,1,6,False,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1276,0 days 02:46:14.565000,SAI,55,0 days 00:01:28.585000,74,3,,,0 days 00:00:24.031000,0 days 00:00:42.481000,...,19,True,Ferrari,0 days 02:44:45.980000,2023-05-28 14:45:46.957,1,8,False,,True
1277,0 days 02:47:40.953000,SAI,55,0 days 00:01:26.388000,75,3,,,0 days 00:00:22.468000,0 days 00:00:41.871000,...,20,True,Ferrari,0 days 02:46:14.565000,2023-05-28 14:47:15.542,1,8,False,,True
1278,0 days 02:49:06.729000,SAI,55,0 days 00:01:25.776000,76,3,,,0 days 00:00:22.362000,0 days 00:00:41.296000,...,21,True,Ferrari,0 days 02:47:40.953000,2023-05-28 14:48:41.930,1,8,False,,True
1279,0 days 02:50:32.308000,SAI,55,0 days 00:01:25.579000,77,3,,,0 days 00:00:22.283000,0 days 00:00:41.078000,...,22,True,Ferrari,0 days 02:49:06.729000,2023-05-28 14:50:07.706,1,8,False,,True


In [41]:
# Otra opción para lo anterior
data.loc[data["Driver"].isin(["SAI","LEC"])]

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
310,0 days 01:03:32.328000,LEC,16,0 days 00:01:29.108000,1,1,,,,0 days 00:00:38.983000,...,1,True,Ferrari,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,6,False,,False
311,0 days 01:04:52.877000,LEC,16,0 days 00:01:20.549000,2,1,,,0 days 00:00:21.352000,0 days 00:00:37.692000,...,2,True,Ferrari,0 days 01:03:32.328000,2023-05-28 13:04:33.305,1,6,False,,True
312,0 days 01:06:12.898000,LEC,16,0 days 00:01:20.021000,3,1,,,0 days 00:00:21.051000,0 days 00:00:37.547000,...,3,True,Ferrari,0 days 01:04:52.877000,2023-05-28 13:05:53.854,1,6,False,,True
313,0 days 01:07:32.366000,LEC,16,0 days 00:01:19.468000,4,1,,,0 days 00:00:21.023000,0 days 00:00:37.375000,...,4,True,Ferrari,0 days 01:06:12.898000,2023-05-28 13:07:13.875,1,6,False,,True
314,0 days 01:08:51.203000,LEC,16,0 days 00:01:18.837000,5,1,,,0 days 00:00:20.803000,0 days 00:00:37.198000,...,5,True,Ferrari,0 days 01:07:32.366000,2023-05-28 13:08:33.343,1,6,False,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1276,0 days 02:46:14.565000,SAI,55,0 days 00:01:28.585000,74,3,,,0 days 00:00:24.031000,0 days 00:00:42.481000,...,19,True,Ferrari,0 days 02:44:45.980000,2023-05-28 14:45:46.957,1,8,False,,True
1277,0 days 02:47:40.953000,SAI,55,0 days 00:01:26.388000,75,3,,,0 days 00:00:22.468000,0 days 00:00:41.871000,...,20,True,Ferrari,0 days 02:46:14.565000,2023-05-28 14:47:15.542,1,8,False,,True
1278,0 days 02:49:06.729000,SAI,55,0 days 00:01:25.776000,76,3,,,0 days 00:00:22.362000,0 days 00:00:41.296000,...,21,True,Ferrari,0 days 02:47:40.953000,2023-05-28 14:48:41.930,1,8,False,,True
1279,0 days 02:50:32.308000,SAI,55,0 days 00:01:25.579000,77,3,,,0 days 00:00:22.283000,0 days 00:00:41.078000,...,22,True,Ferrari,0 days 02:49:06.729000,2023-05-28 14:50:07.706,1,8,False,,True


In [42]:
# Obtener vueltas de equipos que contengan "Bull"
data.loc[data["Team"].str.contains("Bull")]

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
0,0 days 01:03:27.458000,VER,1,0 days 00:01:24.238000,1,1,,,,0 days 00:00:37.420000,...,1,True,Red Bull Racing,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,1,False,,False
1,0 days 01:04:46.825000,VER,1,0 days 00:01:19.367000,2,1,,,0 days 00:00:20.954000,0 days 00:00:37.366000,...,2,True,Red Bull Racing,0 days 01:03:27.458000,2023-05-28 13:04:28.435,1,1,False,,True
2,0 days 01:06:05.899000,VER,1,0 days 00:01:19.074000,3,1,,,0 days 00:00:20.854000,0 days 00:00:37.288000,...,3,True,Red Bull Racing,0 days 01:04:46.825000,2023-05-28 13:05:47.802,1,1,False,,True
3,0 days 01:07:24.028000,VER,1,0 days 00:01:18.129000,4,1,,,0 days 00:00:20.835000,0 days 00:00:36.637000,...,4,True,Red Bull Racing,0 days 01:06:05.899000,2023-05-28 13:07:06.876,1,1,False,,True
4,0 days 01:08:42.047000,VER,1,0 days 00:01:18.019000,5,1,,,0 days 00:00:20.745000,0 days 00:00:36.734000,...,5,True,Red Bull Racing,0 days 01:07:24.028000,2023-05-28 13:08:25.005,1,1,False,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227,0 days 02:45:47.279000,PER,11,0 days 00:01:29.537000,72,6,,,0 days 00:00:23.628000,0 days 00:00:43.487000,...,2,True,Red Bull Racing,0 days 02:44:17.742000,2023-05-28 14:45:18.719,1,17,False,,True
228,0 days 02:47:14.099000,PER,11,0 days 00:01:26.820000,73,6,,,0 days 00:00:22.267000,0 days 00:00:42.197000,...,3,True,Red Bull Racing,0 days 02:45:47.279000,2023-05-28 14:46:48.256,1,17,False,,True
229,0 days 02:48:39.803000,PER,11,0 days 00:01:25.704000,74,6,,,0 days 00:00:22.036000,0 days 00:00:41.152000,...,4,True,Red Bull Racing,0 days 02:47:14.099000,2023-05-28 14:48:15.076,1,17,False,,True
230,0 days 02:50:13.951000,PER,11,0 days 00:01:34.148000,75,6,,,0 days 00:00:23.990000,0 days 00:00:46.718000,...,5,True,Red Bull Racing,0 days 02:48:39.803000,2023-05-28 14:49:40.780,1,17,False,,True


<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Soluciona los NaT en los tiempos de las columnas "Sector1Time" y "LapTime".
</div>

In [None]:
# Tu código aquí
data["Sector1Time"] = data["Sector1Time"].fillna("00:00.000")
data["LapTime"] = data["LapTime"].fillna("00:00.000")

<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Obtén el tiempo medio por vuelta de los pilotos de "AlphaTauri" entre las vueltas 1 y 20 (inclusive).
</div>

In [81]:
# Tu código aquí
data.loc[(data["LapNumber"].between(1, 20)) & (data["Team"]=="AlphaTauri")]

Unnamed: 0,Time,Driver,DriverNumber,LapTime,LapNumber,Stint,PitOutTime,PitInTime,Sector1Time,Sector2Time,...,TyreLife,FreshTyre,Team,LapStartTime,LapStartDate,TrackStatus,Position,Deleted,DeletedReason,IsAccurate
587,0 days 01:03:36.669000,DEV,21,0 days 00:01:33.449000,1,1,,,00:00.000,0 days 00:00:40.159000,...,1,True,AlphaTauri,0 days 01:02:02.950000,2023-05-28 13:03:03.927,12,12,False,,False
588,0 days 01:04:58.433000,DEV,21,0 days 00:01:21.764000,2,1,,,0 days 00:00:21.649000,0 days 00:00:38.019000,...,2,True,AlphaTauri,0 days 01:03:36.669000,2023-05-28 13:04:37.646,1,12,False,,True
589,0 days 01:06:19.013000,DEV,21,0 days 00:01:20.580000,3,1,,,0 days 00:00:21.456000,0 days 00:00:37.754000,...,3,True,AlphaTauri,0 days 01:04:58.433000,2023-05-28 13:05:59.410,1,12,False,,True
590,0 days 01:07:39.389000,DEV,21,0 days 00:01:20.376000,4,1,,,0 days 00:00:21.328000,0 days 00:00:37.583000,...,4,True,AlphaTauri,0 days 01:06:19.013000,2023-05-28 13:07:19.990,1,12,False,,True
591,0 days 01:08:59.179000,DEV,21,0 days 00:01:19.790000,5,1,,,0 days 00:00:21.181000,0 days 00:00:37.482000,...,5,True,AlphaTauri,0 days 01:07:39.389000,2023-05-28 13:08:40.366,1,12,False,,True
592,0 days 01:10:18.403000,DEV,21,0 days 00:01:19.224000,6,1,,,0 days 00:00:20.916000,0 days 00:00:37.242000,...,6,True,AlphaTauri,0 days 01:08:59.179000,2023-05-28 13:10:00.156,1,12,False,,True
593,0 days 01:11:37.441000,DEV,21,0 days 00:01:19.038000,7,1,,,0 days 00:00:21.023000,0 days 00:00:37.059000,...,7,True,AlphaTauri,0 days 01:10:18.403000,2023-05-28 13:11:19.380,1,12,False,,True
594,0 days 01:12:56.184000,DEV,21,0 days 00:01:18.743000,8,1,,,0 days 00:00:20.786000,0 days 00:00:36.902000,...,8,True,AlphaTauri,0 days 01:11:37.441000,2023-05-28 13:12:38.418,1,12,False,,True
595,0 days 01:14:15.638000,DEV,21,0 days 00:01:19.454000,9,1,,,0 days 00:00:21.067000,0 days 00:00:37.188000,...,9,True,AlphaTauri,0 days 01:12:56.184000,2023-05-28 13:13:57.161,1,12,False,,True
596,0 days 01:15:34.595000,DEV,21,0 days 00:01:18.957000,10,1,,,0 days 00:00:20.932000,0 days 00:00:36.939000,...,10,True,AlphaTauri,0 days 01:14:15.638000,2023-05-28 13:15:16.615,1,12,False,,True


<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> ¿Cuál fué la velocidad máxima en la línea de meta de Alonso? ¿Y de Verstappen? ¿En qué vueltas?.
</div>

In [45]:
# Tu código aquí

### **Agrupación de datos**

In [46]:
# Número de pilotos por equipo
data.groupby("Team")["Driver"].nunique().reset_index()

Unnamed: 0,Team,Driver
0,Alfa Romeo,2
1,AlphaTauri,2
2,Alpine,2
3,Aston Martin,2
4,Ferrari,2
5,Haas F1 Team,2
6,McLaren,2
7,Mercedes,2
8,Red Bull Racing,2
9,Williams,2


In [47]:
# Lista de pilotos por equipo
data.groupby("Team")["Driver"].unique().reset_index()

Unnamed: 0,Team,Driver
0,Alfa Romeo,"[ZHO, BOT]"
1,AlphaTauri,"[DEV, TSU]"
2,Alpine,"[GAS, OCO]"
3,Aston Martin,"[ALO, STR]"
4,Ferrari,"[LEC, SAI]"
5,Haas F1 Team,"[MAG, HUL]"
6,McLaren,"[NOR, PIA]"
7,Mercedes,"[HAM, RUS]"
8,Red Bull Racing,"[VER, PER]"
9,Williams,"[SAR, ALB]"


In [48]:
# Otra opción
data.groupby("Team")["Driver"].apply(lambda x: list(set(x))).reset_index()

Unnamed: 0,Team,Driver
0,Alfa Romeo,"[ZHO, BOT]"
1,AlphaTauri,"[DEV, TSU]"
2,Alpine,"[GAS, OCO]"
3,Aston Martin,"[STR, ALO]"
4,Ferrari,"[LEC, SAI]"
5,Haas F1 Team,"[HUL, MAG]"
6,McLaren,"[NOR, PIA]"
7,Mercedes,"[RUS, HAM]"
8,Red Bull Racing,"[VER, PER]"
9,Williams,"[ALB, SAR]"


In [49]:
# Número de vueltas por piloto ordenado de mayor a menor.
data.groupby("Driver")["LapNumber"].max().sort_values(ascending=False).reset_index()

Unnamed: 0,Driver,LapNumber
0,ALO,78
1,LEC,78
2,HAM,78
3,GAS,78
4,SAI,78
5,VER,78
6,RUS,78
7,OCO,78
8,DEV,77
9,BOT,77


In [50]:
# Otra opción
data.groupby("Driver")["LapNumber"].size().sort_values(ascending=False).reset_index()

Unnamed: 0,Driver,LapNumber
0,ALO,78
1,LEC,78
2,HAM,78
3,GAS,78
4,SAI,78
5,VER,78
6,RUS,78
7,OCO,78
8,DEV,77
9,BOT,77


In [51]:
# Velocidad media en la linea de meta por cada equipo
data.groupby('Team')['SpeedFL'].mean().sort_values(ascending=False).reset_index()

Unnamed: 0,Team,SpeedFL
0,Aston Martin,259.226562
1,Red Bull Racing,259.135135
2,Ferrari,258.217105
3,Williams,257.898649
4,McLaren,256.94702
5,Alfa Romeo,256.662252
6,Haas F1 Team,256.021127
7,Mercedes,255.882353
8,AlphaTauri,255.596026
9,Alpine,255.559211


In [52]:
# Otra opción que permite personalizar el nombre de la nueva columna así como crear varias de una vez
data.groupby('Team').agg(AvgFlSpeed=("SpeedFL", "mean")).sort_values("AvgFlSpeed", ascending=False).reset_index()

Unnamed: 0,Team,AvgFlSpeed
0,Aston Martin,259.226562
1,Red Bull Racing,259.135135
2,Ferrari,258.217105
3,Williams,257.898649
4,McLaren,256.94702
5,Alfa Romeo,256.662252
6,Haas F1 Team,256.021127
7,Mercedes,255.882353
8,AlphaTauri,255.596026
9,Alpine,255.559211


In [53]:
# La Pivot Table o tabla dinámica también permite agrupar datos de forma más compleja.
# En este ejemplo se muestra para cada piloto de cada equipo, el número de vueltas que dió con cada compuesto así como los totales por filas y columnas (margins)
data.pivot_table(index=["Team", "Driver"], columns=["Compound"], values="LapNumber", aggfunc="count", fill_value=0, margins=True, margins_name="Total")


Unnamed: 0_level_0,Compound,HARD,INTERMEDIATE,MEDIUM,SOFT,WET,Total
Team,Driver,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alfa Romeo,BOT,51,26,0,0,0,77
Alfa Romeo,ZHO,51,25,0,1,0,77
AlphaTauri,DEV,0,24,53,0,0,77
AlphaTauri,TSU,0,23,53,0,0,76
Alpine,GAS,47,24,7,0,0,78
Alpine,OCO,22,24,32,0,0,78
Aston Martin,ALO,54,23,1,0,0,78
Aston Martin,STR,51,2,0,0,0,53
Ferrari,LEC,44,23,11,0,0,78
Ferrari,SAI,33,23,22,0,0,78


<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Obtén, para cada piloto, el número de vueltas anuladas. Ordena de mayor a menor.
</div>

In [70]:
# Tu código aquí
# Número de vueltas por piloto ordenado de mayor a menor.
data.groupby("Driver")["LapNumber"].max().sort_values(ascending=False).reset_index()
    

Unnamed: 0,Driver,LapNumber
0,ALO,78
1,LEC,78
2,HAM,78
3,GAS,78
4,SAI,78
5,VER,78
6,RUS,78
7,OCO,78
8,DEV,77
9,BOT,77


<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Obtén, para cada piloto, el número de Pit Stops realizados.
</div>

In [None]:
# Tu código aquí


<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001C455A12B00>

<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Crea una tabla donde se muestre en las filas los equipos y pilotos y en las columnas las 10 primeras vueltas. Se ha de mostrar el tiempo por vuelta de cada piloto en segundos (tiempo.dt.total_seconds()).
</div>

In [75]:
# Tu código aquí
data("Driver","Team")

TypeError: 'DataFrame' object is not callable

### **Limpieza final y almacenamiento del DataFrame**
Para poder utilizar este conjunto en futuras prácticas, vamos a eliminar ciertas filas y columnas que no van a aportar información relevante para los problemas que resolveremos.

<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Elimina todas aquellas filas que no tengan "TrackStatus" igual a 1 y aquellas que se correspondan con una parada en boxes. Estas últimas tendrán un valor en "PitOutTime" o en "PitInTime". Ordena de menor a mayor "Time" y haz un "reset_index(drop=true)" para que se vuelvan a crear los índices de las filas.
</div>

In [None]:
# Tu código aquí


<div class="alert alert-block alert-info">
    <b>Ejercicio:</b> Elimina finalmente las columnas "Deleted", "DeletedReason", "IsAccurate", "TrackStatus", "PitOutTime" y"PitInTime".
</div>

In [58]:
# Tu código aquí

Una vez realizada esta fase de análisis y limpieza del conjunto, almacenaremos el DataFrame de Pandas en un fichero de tipo `Pickle`.
Es posible almacenarlo como `CSV` o `XLSX`, pero estos formatos no guardan los tipos de las columnas. Esto provocaría que al cargarlo en el futuro tendríamos que volver a hacer un casting (`.astype()`) de cada columna. 

In [59]:
data.to_pickle("f1_23_monaco.pkl")