# Data transformation


## Dependencies


The following dependencies are used:


In [1]:
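import warnings
from datetime import datetime

import numpy as np
import pandas as pd

# Show every column and row when displaying DataFrames.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Silence all warnings for the rest of the notebook.
warnings.simplefilter("ignore")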

## Datatype conversion


Each column is converted to its appropriate data type. Time columns are converted to milliseconds, and two derived attributes are created: the driver's current age (`driverAgeToday`) and the driver's age when the race was run (`driverAgeAtRace`), both computed as differences of calendar years. Finally, the string `"\N"` is replaced by a null value so that missing entries in the object-typed columns are recognized as nulls.
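As a minimal sketch of the time conversion described above, the snippet below turns lap-time strings of the form `M:SS.fff` into milliseconds on a toy series. The helper name `lap_time_to_ms` and the sample values are hypothetical, and the regex-based parsing is an illustrative alternative, not the approach used in the notebook cell that follows.

```python
import pandas as pd


def lap_time_to_ms(series: pd.Series) -> pd.Series:
    """Convert 'M:SS.fff' strings to milliseconds; non-matching values become NaN."""
    # Split each string into minutes and (possibly fractional) seconds.
    parts = series.str.extract(r"^(?P<m>\d+):(?P<s>\d+(?:\.\d+)?)$")
    minutes = pd.to_numeric(parts["m"])
    seconds = pd.to_numeric(parts["s"])
    # Round to whole milliseconds to avoid floating-point noise.
    return ((minutes * 60 + seconds) * 1000).round()


# Toy data, not taken from the dataset; "\\N" stands for the null placeholder.
sample = pd.Series(["1:27.452", "1:30.000", "\\N"])
ms = lap_time_to_ms(sample)
```

Values that do not match the pattern, such as the `"\N"` placeholder, end up as `NaN`, which mirrors the effect of `errors="coerce"` in the notebook.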


In [2]:
df = pd.read_csv("../assets/data/processed/data_selection.csv")

dates = ["driverBirth", "raceDate"]
for d in dates:
    df[d] = pd.to_datetime(df[d])

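# Parse each time string and convert it to a duration in milliseconds;
# values that do not match the format (including "\N") become NaN.
h_f, m_f = "%H:%M:%S", "%M:%S.%f"
times = [
    ("fastestLapTime", m_f),
    ("raceTime", h_f),
    ("q1", m_f),
    ("q2", m_f),
    ("q3", m_f),
]
for t, f in times:
    parsed = pd.to_datetime(df[t], format=f, errors="coerce")
    # Subtracting midnight of the parsed date leaves only the time of day.
    df[t] = (parsed - parsed.dt.normalize()).dt.total_seconds() * 1000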

numbers = [
    "carNumber",
    "positionGrid",
    "positionFinal",
    "pointsDriverEarned",
    "lapsCompleted",
    "timeTakenInMillisec",
    "fastestLap",
    "fastestLapRank",
    "maxSpeed",
    "driverNumber",
    "raceYear",
    "raceRound",
    "circuitLat",
    "circuitLng",
    "circuitAlt",
    "driverWins",
    "pointsConstructorEarned",
    "constructorPosition",
    "constructorWins",
]
for n in numbers:
    df[n] = pd.to_numeric(df[n], errors="coerce")

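# Ages are differences of calendar years, so they ignore month and day.
df["driverAgeToday"] = np.int64(datetime.today().year - df["driverBirth"].dt.year)
df["driverAgeAtRace"] = np.int64(df["raceDate"].dt.year - df["driverBirth"].dt.year)
df = df.drop(columns=["driverBirth", "raceDate"])

# Mark the "\N" placeholder as a real missing value in the remaining object columns.
df = df.replace("\\N", np.nan)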

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26080 entries, 0 to 26079
Data columns (total 35 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   carNumber                26074 non-null  float64
 1   positionGrid             26080 non-null  int64  
 2   positionFinal            26080 non-null  int64  
 3   pointsDriverEarned       26080 non-null  float64
 4   lapsCompleted            26080 non-null  int64  
 5   timeTakenInMillisec      7250 non-null   float64
 6   fastestLap               7615 non-null   float64
 7   fastestLapRank           7831 non-null   float64
 8   fastestLapTime           7615 non-null   float64
 9   maxSpeed                 7615 non-null   float64
 10  driverRef                26080 non-null  object 
 11  driverNumber             5823 non-null   float64
 12  driverNationality        26080 non-null  object 
 13  constructorRef           26080 non-null  object 
 14  constructorNationality
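
The non-null counts reported by `df.info()` only reflect missing data once the `"\N"` placeholder has become a real null. A small sketch on a toy frame (hypothetical values, not rows from the dataset) shows the effect, assuming the placeholder is replaced with `np.nan`:

```python
import numpy as np
import pandas as pd

# Toy frame in which "\N" marks a missing entry.
toy = pd.DataFrame(
    {
        "driverRef": ["hamilton", "\\N", "alonso"],
        "driverNumber": ["44", "\\N", "14"],
    }
)

# Replace the placeholder so pandas counts it as missing.
cleaned = toy.replace("\\N", np.nan)

# Numeric columns still need an explicit conversion from text.
cleaned["driverNumber"] = pd.to_numeric(cleaned["driverNumber"], errors="coerce")

# Number of missing values per column.
null_counts = cleaned.isna().sum()
```

Before the replacement, `toy.isna().sum()` would report zero nulls in both columns, because `"\N"` is an ordinary string as far as pandas is concerned.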

Finally, we save the results for use in later sections.


In [3]:
df.to_csv("../assets/data/processed/data_transformation.csv", index=False)
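
Because CSV stores no type information, pandas infers the dtypes again when the file is read back in a later section. The sketch below illustrates such a round trip with an in-memory buffer instead of a file on disk; the toy frame and its values are hypothetical.

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "fastestLapTime": [87452.0, np.nan],  # milliseconds; NaN marks a missing lap time
        "driverRef": ["hamilton", "alonso"],
    }
)

# Write to an in-memory buffer; index=False keeps the row index out of the CSV.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the text back; dtypes are inferred from the content.
buffer.seek(0)
restored = pd.read_csv(buffer)
```

Missing values are written as empty fields and come back as `NaN`, so the null information produced in this section should survive the round trip.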