# Formula 1 Championship Drivers

I'm going to look at data from Formula 1 races from 1950 to the present. I got the data from the [Ergast Developer API](http://ergast.com/mrd/), which covers every race from 1950 onward. The data is spread across several relational tables, so I'll be doing some filtering and joining later.


Some things to note about the data: 

* I last downloaded this data on 2019-04-30, and it includes the 2019 Azerbaijan GP. I don't think the 2019 races are relevant to the analysis, since I'll be looking at past championship winners.
* The tables are not chronologically organized in most cases. Races seem to start around Hamilton's debut with McLaren. Part of my initial cleaning and preparation is to get this re-organized: drivers by date of birth, and races (and everything tied to them) by when they occurred.
* In the standings table, a new entry is not added after a race if the driver does not change standings.
* When a cell is `\N` it means the cell is null, or the value wasn't recorded. This mostly applies to fields for which there was no data in the early years.
* Something is off with the original `raceId`: the 2019 Azerbaijan GP was the 1001st race, but its id in the data is 1014. I assume the numbering jumped ahead somewhere in the middle, but otherwise things seem to line up. **I'll come back to look at this later**.
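
One way to find where the numbering jumps is to compare the ids that are present against a full run from the smallest to the largest. This is only a minimal sketch on made-up ids, not the real table, and `find_id_gaps` is a hypothetical helper name:

```python
import pandas as pd

def find_id_gaps(ids):
    """Return the ids missing from the run min(ids)..max(ids), sorted."""
    present = {int(i) for i in ids}
    full = range(min(present), max(present) + 1)
    return sorted(set(full) - present)

# Toy ids, invented for illustration: 4 and 5 are skipped.
toy_races = pd.DataFrame({"raceId": [1, 2, 3, 6, 7]})
gaps = find_id_gaps(toy_races["raceId"])
print(gaps)
```

Running the same helper on the real `raceId` column should show whether ids were skipped, which would explain a highest id that is larger than the race count.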

## Cleaning the data — part 1

The original data is saved (sans zip) in `data/f1db_raw`. I made a copy of the folder and renamed it to `f1db_excelPrep`.

The first cleaning step is to go through every table in `data/f1db_excelPrep/` and:
* add the headings for each table, using [f1db_schema.txt](../data/f1db_schema.txt) as a guideline
* remove the `\N` wherever it appears
* format dates as yyyy-mm-dd, and create separate columns for month, day, and year
* sort drivers and races chronologically, and create new temporary driver and race ids
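
For reference, the same steps could probably be scripted in pandas as well. The sketch below assumes the raw files have no header row and use `\N` for nulls, as described above. The inline rows are a cut-down sample with only four columns, chosen for illustration, so the real files will need the full column list from the schema.

```python
import io
import pandas as pd

# Cut-down sample in the raw layout (no header row, \N for nulls).
raw = io.StringIO(
    "1,hamilton,44,1985-01-07\n"
    "741,etancelin,\\N,1896-12-28\n"
    "589,chiron,\\N,1899-08-03\n"
)
columns = ["driverId", "driverRef", "number", "dob"]  # assumed subset of the schema

# Add headings and turn \N into proper nulls while reading.
drivers = pd.read_csv(raw, header=None, names=columns, na_values=["\\N"])

# Parse the dates and split them into separate year/month/day columns.
drivers["dob"] = pd.to_datetime(drivers["dob"], format="%Y-%m-%d")
drivers["dob_year"] = drivers["dob"].dt.year
drivers["dob_month"] = drivers["dob"].dt.month
drivers["dob_day"] = drivers["dob"].dt.day

# Sort chronologically and number the rows 1..n as the new temporary id.
drivers = drivers.sort_values("dob", kind="stable").reset_index(drop=True)
drivers["driverId2"] = drivers.index + 1

print(drivers[["driverId", "driverId2", "dob"]])
```

Parsing the dates with an explicit format keeps them unambiguous, which a two-digit year like `5/13/50` is not.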

I'll do this in Excel and save the files in the folder `data/f1db_excelPrep/`.

While doing this, I noticed that the constructor results contain a `D` status whose meaning I don't know. **I will look into this later.**
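
A quick way to start on that would be to count the status values and pull out the rows that carry one. This is a rough sketch on invented toy rows; only the column names follow the constructor results table.

```python
import pandas as pd

# Toy rows, invented for illustration; the real table has many more.
constructor_results = pd.DataFrame({
    "raceId": [18, 18, 36, 37],
    "constructorId": [1, 2, 1, 1],
    "status": [None, None, "D", "D"],
})

# How often does each status (including missing) appear?
status_counts = constructor_results["status"].value_counts(dropna=False)

# Which rows, and which constructors, carry a status at all?
flagged = constructor_results[constructor_results["status"].notna()]
flagged_constructors = sorted(flagged["constructorId"].unique().tolist())

print(status_counts)
print(flagged_constructors)
```

Seeing which constructors and races are flagged should narrow down what the status stands for.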

## Cleaning the data —part 2

In the first part of cleaning, I created new chronological ids for the drivers and races. Now I want to pass those ids to the other tables so that everything is chronological.

What do I need to update for each table?

* circuits: *nothing*
* constructor_results: raceId
* constructor_standings: raceId
* constructors: *nothing*
* driver_standings: raceId, driverId
* driver: *nothing*
* lap_times: raceId, driverId
* pit_stops: raceId, driverId
* qualifying: raceId, driverId
* races: *nothing*
* results: raceId, driverId
* seasons: *nothing*
* status: *nothing*

I'll make direct copies of the tables that don't need updating and put them in `data/f1db_working/`, which will hold the tables I use for analysis.
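
The copying can be done by hand, but a small script works too. The sketch below assumes each table is stored as `<table>.csv`, in line with `driver.csv` and `races.csv`. `copy_unchanged` is a hypothetical helper, and the demo runs in a temporary folder rather than the real data folders.

```python
import shutil
import tempfile
from pathlib import Path

# Tables listed above as needing no id updates.
UNCHANGED = ["circuits", "constructors", "driver", "races", "seasons", "status"]

def copy_unchanged(src_dir, dst_dir, tables=UNCHANGED):
    """Copy <table>.csv for each table from src_dir to dst_dir; return the names copied."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in tables:
        src = src_dir / f"{name}.csv"
        if src.exists():  # files that are missing are skipped, not treated as errors
            shutil.copy2(src, dst_dir / src.name)
            copied.append(name)
    return copied

# Demo in a throwaway folder with a single placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "f1db_excelPrep"
    src.mkdir()
    (src / "circuits.csv").write_text("circuitId,name\n")
    demo_copied = copy_unchanged(src, Path(tmp) / "f1db_working")

print(demo_copied)
```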


In [1]:
import numpy as np
import pandas as pd

In [48]:
drivers = pd.read_csv("../data/f1db_excelPrep/driver.csv")

In [49]:
drivers.head()

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,driverId2,dob_year,dob_month,dob_day
0,741,etancelin,,,Philippe,Étancelin,1896-12-28,French,http://en.wikipedia.org/wiki/Philippe_%C3%89ta...,1,1896,12,28
1,703,legat,,,Arthur,Legat,1898-11-01,Belgian,http://en.wikipedia.org/wiki/Arthur_Legat,2,1898,11,1
2,786,fagioli,,,Luigi,Fagioli,1898-06-09,Italian,http://en.wikipedia.org/wiki/Luigi_Fagioli,3,1898,6,9
3,791,biondetti,,,Clemente,Biondetti,1898-08-18,Italian,http://en.wikipedia.org/wiki/Clemente_Biondetti,4,1898,8,18
4,589,chiron,,,Louis,Chiron,1899-08-03,Monegasque,http://en.wikipedia.org/wiki/Louis_Chiron,5,1899,8,3


In [50]:
drivers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 13 columns):
driverId       847 non-null int64
driverRef      847 non-null object
number         44 non-null float64
code           90 non-null object
forename       847 non-null object
surname        847 non-null object
dob            847 non-null object
nationality    847 non-null object
url            846 non-null object
driverId2      847 non-null int64
dob_year       847 non-null int64
dob_month      847 non-null int64
dob_day        847 non-null int64
dtypes: float64(1), int64(5), object(7)
memory usage: 86.1+ KB


In [51]:
newDriverId = drivers[["driverId","driverId2"]]

In [52]:
newDriverId.head()

Unnamed: 0,driverId,driverId2
0,741,1
1,703,2
2,786,3
3,791,4
4,589,5


In [53]:
newDriverId.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 2 columns):
driverId     847 non-null int64
driverId2    847 non-null int64
dtypes: int64(2)
memory usage: 13.3 KB


In [54]:
races = pd.read_csv("../data/f1db_excelPrep/races.csv")

In [55]:
races.head()

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,raceId2,race_year,race_month,race_day
0,833,1950,1,9,British Grand Prix,5/13/50,,http://en.wikipedia.org/wiki/1950_British_Gran...,1,1950,5,13
1,834,1950,2,6,Monaco Grand Prix,5/21/50,,http://en.wikipedia.org/wiki/1950_Monaco_Grand...,2,1950,5,21
2,835,1950,3,19,Indianapolis 500,5/30/50,,http://en.wikipedia.org/wiki/1950_Indianapolis...,3,1950,5,30
3,836,1950,4,66,Swiss Grand Prix,6/4/50,,http://en.wikipedia.org/wiki/1950_Swiss_Grand_...,4,1950,6,4
4,837,1950,5,13,Belgian Grand Prix,6/18/50,,http://en.wikipedia.org/wiki/1950_Belgian_Gran...,5,1950,6,18


In [56]:
races.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018 entries, 0 to 1017
Data columns (total 12 columns):
raceId        1018 non-null int64
year          1018 non-null int64
round         1018 non-null int64
circuitId     1018 non-null int64
name          1018 non-null object
date          1018 non-null object
time          287 non-null object
url           1018 non-null object
raceId2       1018 non-null int64
race_year     1018 non-null int64
race_month    1018 non-null int64
race_day      1018 non-null int64
dtypes: int64(8), object(4)
memory usage: 95.5+ KB


In [57]:
newRaceId = races[["raceId","raceId2"]]

In [58]:
newRaceId.head()

Unnamed: 0,raceId,raceId2
0,833,1
1,834,2
2,835,3
3,836,4
4,837,5


In [59]:
newRaceId.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018 entries, 0 to 1017
Data columns (total 2 columns):
raceId     1018 non-null int64
raceId2    1018 non-null int64
dtypes: int64(2)
memory usage: 16.0 KB


Now that we have the `newRaceId` and `newDriverId` dataframes, we can inner-join them to the other tables on the `raceId` and `driverId` columns, respectively, to add the new ids.
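
One thing to keep in mind is that an inner join silently drops any row whose id has no match in the mapping. The toy example below sketches how that looks and how a left join with `indicator=True` can surface such rows. The id `999` and the points values are made up for illustration.

```python
import pandas as pd

# Small mapping (old id -> new id) and a table that references it.
id_map = pd.DataFrame({"raceId": [833, 834], "raceId2": [1, 2]})
table = pd.DataFrame({
    "raceId": [833, 833, 834, 999],  # 999 has no entry in the mapping
    "points": [9.0, 6.0, 8.0, 1.0],
})

# Inner join: only matching rows survive. validate raises if the mapping
# ever contains the same raceId twice.
inner = table.merge(id_map, on="raceId", how="inner", validate="many_to_one")

# Left join with an indicator column shows which rows would be dropped.
checked = table.merge(id_map, on="raceId", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(table), len(inner), unmatched["raceId"].tolist())
```

If the row count before and after the inner join is the same, as in the `info()` outputs here, nothing was dropped.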

In [60]:
constructorResults = pd.read_csv("../data/f1db_excelPrep/constructor_results.csv")

In [61]:
constructorResults.head()

Unnamed: 0,constructorResultsId,raceId,constructorId,points,status
0,1,18,1,14.0,
1,2,18,2,8.0,
2,3,18,3,9.0,
3,4,18,4,5.0,
4,5,18,5,2.0,


In [62]:
constructorResults.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11390 entries, 0 to 11389
Data columns (total 5 columns):
constructorResultsId    11390 non-null int64
raceId                  11390 non-null int64
constructorId           11390 non-null int64
points                  11390 non-null float64
status                  17 non-null object
dtypes: float64(1), int64(3), object(1)
memory usage: 445.0+ KB


In [63]:
test = pd.merge(constructorResults, newRaceId, on="raceId")

In [64]:
test.head()

Unnamed: 0,constructorResultsId,raceId,constructorId,points,status,raceId2
0,1,18,1,14.0,,786
1,2,18,2,8.0,,786
2,3,18,3,9.0,,786
3,4,18,4,5.0,,786
4,5,18,5,2.0,,786


In [65]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11390 entries, 0 to 11389
Data columns (total 6 columns):
constructorResultsId    11390 non-null int64
raceId                  11390 non-null int64
constructorId           11390 non-null int64
points                  11390 non-null float64
status                  17 non-null object
raceId2                 11390 non-null int64
dtypes: float64(1), int64(4), object(1)
memory usage: 622.9+ KB


Things worked well: the merged table still has 11390 rows, plus the new `raceId2` column. We can save this dataframe to a csv since we don't need to add anything else to the constructor results.

In [66]:
test.to_csv("../data/f1db_working/constructor_results.csv")
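
A side note on saving: by default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that extra column isn't wanted, passing `index=False` leaves it out. A small sketch with an in-memory buffer standing in for the file:

```python
import io
import pandas as pd

df = pd.DataFrame({"raceId": [18, 19], "raceId2": [786, 787]})

# Default: the index is written too and reappears as "Unnamed: 0".
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the actual columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```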

We can repeat the same process with the `constructor_standings` table.

In [67]:
constructorStandings = pd.read_csv("../data/f1db_excelPrep/constructor_standings.csv")

In [68]:
constructorStandings.head()

Unnamed: 0,constructorStandingsId,raceId,constructorId,points,position,positionText,wins
0,1,18,1,14.0,1,1,1
1,2,18,2,8.0,3,3,0
2,3,18,3,9.0,2,2,0
3,4,18,4,5.0,4,4,0
4,5,18,5,2.0,5,5,0


In [69]:
constructorStandings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12146 entries, 0 to 12145
Data columns (total 7 columns):
constructorStandingsId    12146 non-null int64
raceId                    12146 non-null int64
constructorId             12146 non-null int64
points                    12146 non-null float64
position                  12146 non-null int64
positionText              12146 non-null object
wins                      12146 non-null int64
dtypes: float64(1), int64(5), object(1)
memory usage: 664.3+ KB


In [70]:
constructorStandings2 = constructorStandings.merge(newRaceId, on = "raceId")

In [71]:
constructorStandings2.head()

Unnamed: 0,constructorStandingsId,raceId,constructorId,points,position,positionText,wins,raceId2
0,1,18,1,14.0,1,1,1,786
1,2,18,2,8.0,3,3,0,786
2,3,18,3,9.0,2,2,0,786
3,4,18,4,5.0,4,4,0,786
4,5,18,5,2.0,5,5,0,786


In [72]:
constructorStandings2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12146 entries, 0 to 12145
Data columns (total 8 columns):
constructorStandingsId    12146 non-null int64
raceId                    12146 non-null int64
constructorId             12146 non-null int64
points                    12146 non-null float64
position                  12146 non-null int64
positionText              12146 non-null object
wins                      12146 non-null int64
raceId2                   12146 non-null int64
dtypes: float64(1), int64(6), object(1)
memory usage: 854.0+ KB


In [73]:
constructorStandings2.to_csv("../data/f1db_working/constructor_standings.csv")

Let's move on to the rest. These tables need both `newRaceId` and `newDriverId` merged in:

* driver_standings: raceId, driverId
* lap_times: raceId, driverId
* pit_stops: raceId, driverId
* qualifying: raceId, driverId
* results: raceId, driverId

Can I do both merges in one line?

In [74]:
driverStandings = pd.read_csv("../data/f1db_excelPrep/driver_standings.csv")

In [75]:
driverStandings.head()

Unnamed: 0,driverStandingsId,raceId,driverId,points,position,positionText,wins
0,1,18,1,10.0,1,1,1
1,2,18,2,8.0,2,2,0
2,3,18,3,6.0,3,3,0
3,4,18,4,5.0,4,4,0
4,5,18,5,4.0,5,5,0


In [76]:
driverStandings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32226 entries, 0 to 32225
Data columns (total 7 columns):
driverStandingsId    32226 non-null int64
raceId               32226 non-null int64
driverId             32226 non-null int64
points               32226 non-null float64
position             32226 non-null int64
positionText         32226 non-null object
wins                 32226 non-null int64
dtypes: float64(1), int64(5), object(1)
memory usage: 1.7+ MB


In [77]:
driverStandings2 = driverStandings.merge(newRaceId, on = "raceId").merge(newDriverId, on = "driverId")

In [78]:
driverStandings2.head()

Unnamed: 0,driverStandingsId,raceId,driverId,points,position,positionText,wins,raceId2,driverId2
0,1,18,1,10.0,1,1,1,786,803
1,9,19,1,14.0,1,1,1,787,803
2,27,20,1,14.0,3,3,1,788,803
3,48,21,1,20.0,2,2,1,789,803
4,69,22,1,28.0,3,3,1,790,803


Honest to god, let's make sure that this worked. Lewis Hamilton has `driverId = 1`, so the row in `drivers` with `driverId2 = 803` should be his.

In [80]:
drivers[drivers.driverId2 == 803]

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,driverId2,dob_year,dob_month,dob_day
802,1,hamilton,44.0,HAM,Lewis,Hamilton,1/7/85,British,http://en.wikipedia.org/wiki/Lewis_Hamilton,803,1985,1,7


In [81]:
driverStandings2.head()

Unnamed: 0,driverStandingsId,raceId,driverId,points,position,positionText,wins,raceId2,driverId2
0,1,18,1,10.0,1,1,1,786,803
1,9,19,1,14.0,1,1,1,787,803
2,27,20,1,14.0,3,3,1,788,803
3,48,21,1,20.0,2,2,1,789,803
4,69,22,1,28.0,3,3,1,790,803


In [82]:
driverStandings2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32226 entries, 0 to 32225
Data columns (total 9 columns):
driverStandingsId    32226 non-null int64
raceId               32226 non-null int64
driverId             32226 non-null int64
points               32226 non-null float64
position             32226 non-null int64
positionText         32226 non-null object
wins                 32226 non-null int64
raceId2              32226 non-null int64
driverId2            32226 non-null int64
dtypes: float64(1), int64(7), object(1)
memory usage: 2.5+ MB


This chained merge worked! Let's save this file first.
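
Since the same chained merge gets reused for several tables, it could be wrapped in a small function that also checks that no rows went missing. This is only a sketch: `add_new_ids` is a hypothetical helper, and the three-column frame is a trimmed-down stand-in for the standings table.

```python
import pandas as pd

def add_new_ids(df, race_map, driver_map):
    """Attach raceId2 and driverId2 via two chained inner merges."""
    out = (
        df.merge(race_map, on="raceId", validate="many_to_one")
          .merge(driver_map, on="driverId", validate="many_to_one")
    )
    # An inner merge drops unmatched rows, so a shorter result means missing ids.
    if len(out) != len(df):
        raise ValueError(f"{len(df) - len(out)} rows had no matching id")
    return out

race_map = pd.DataFrame({"raceId": [18, 19], "raceId2": [786, 787]})
driver_map = pd.DataFrame({"driverId": [1], "driverId2": [803]})
standings = pd.DataFrame({
    "raceId": [18, 19],
    "driverId": [1, 1],
    "points": [10.0, 14.0],
})

result = add_new_ids(standings, race_map, driver_map)
print(result)
```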

In [83]:
driverStandings2.to_csv("../data/f1db_working/driver_standings.csv")

Now let's try and do the other tables in the same way.

In [85]:
# Folders to read the Excel-prepped tables from and to write the merged tables to
grab_url = "../data/f1db_excelPrep/"
save_url = "../data/f1db_working/"
tables = ["lap_times","pit_stops", "qualifying","results"]

for table in tables:
    url = grab_url + table + ".csv"
    df1 = pd.read_csv(url)
    # Chain both merges to attach the chronological raceId2 and driverId2
    df2 = df1.merge(newRaceId, on = "raceId").merge(newDriverId, on = "driverId")
    df2.to_csv(save_url + table + ".csv")
    # Print before/after summaries to confirm that no rows were lost
    print("###",table,"###")
    df1.info(verbose=False)
    print("---")
    df2.info(verbose=False)
    print("###############")
    print()

### lap_times ###
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452991 entries, 0 to 452990
Columns: 6 entries, raceId to milliseconds
dtypes: int64(5), object(1)
memory usage: 20.7+ MB
---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 452991 entries, 0 to 452990
Columns: 8 entries, raceId to driverId2
dtypes: int64(7), object(1)
memory usage: 31.1+ MB
###############

### pit_stops ###
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6912 entries, 0 to 6911
Columns: 7 entries, raceId to milliseconds
dtypes: int64(5), object(2)
memory usage: 378.1+ KB
---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6912 entries, 0 to 6911
Columns: 9 entries, raceId to driverId2
dtypes: int64(7), object(2)
memory usage: 540.0+ KB
###############

### qualifying ###
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8014 entries, 0 to 8013
Columns: 9 entries, qualifyId to q3
dtypes: int64(6), object(3)
memory usage: 563.6+ KB
---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8014 entrie

Each of the tables is now prepped. For my analysis I'll be working primarily with the results table, so I'll try to get all the information I need together and start looking at the drivers.