# Formula 1 Championship Drivers

I'm going to look at data from Formula 1 races from 1950 to the present. I got the data from the [Ergast Developer API](http://ergast.com/mrd/), which covers every race from 1950 onward. The data is spread across several relational tables, so I'll be doing some filtering and joining later.


Some things to note about the data: 

* I last downloaded this data on 2019-04-30, and it includes the 2019 Azerbaijan GP. I don't think it's relevant to keep the 2019 races in the analysis, because I'll be looking at past championship winners.
* The tables are not chronologically organized in most cases. Races seem to start around Hamilton's debut with McLaren. Part of my initial cleaning and preparation is to get this re-organized: drivers by date of birth, and everything else by the date of the race.
* In the standings table, a new entry is not added after a race if the driver does not change standings.
* When a cell is `\N` it means the cell is null, or the value wasn't recorded. This largely applies to things for which we had no data in the beginning.
* Looking at the original `raceId`, something happened because Azerbaijan GP 2019 was the 1001st race but their id for the same race was 1014. I assume the counting was pushed ahead somewhere in the middle, but things seem to line up. **I'll come back to look at this later**.
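
One way to chase down the `raceId` mismatch is to list which ids never appear in the table. The following is a minimal sketch on made-up ids; `races_demo` is a hypothetical stand-in for the real races table, and the skipped ids are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the races table: ids 4 and 5 are skipped on purpose.
races_demo = pd.DataFrame({"raceId": [1, 2, 3, 6, 7]})

ids = races_demo["raceId"]
lowest, highest = int(ids.min()), int(ids.max())

# Every id between the smallest and largest that never appears in the table.
missing_ids = sorted(set(range(lowest, highest + 1)) - set(ids.tolist()))

print(missing_ids)          # the ids that were skipped
print(highest - len(ids))   # how far the largest id runs ahead of the row count
```

If the list of missing ids is as long as the gap between the largest id and the row count, the offset comes entirely from skipped ids rather than from duplicated or extra rows.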

## Cleaning the data — part 1

The original data is saved (sans zip) in `data/f1db_raw`. I made a copy of the folder and renamed it to `f1db_excelPrep`.

The first cleaning thing I'm doing is going through every table in `data/f1db_excelPrep/` and:
* adding the headings for each table using the [f1db_schema.txt](../data/f1db_schema.txt) as a guideline
* removing the `\N` wherever it appears
* formatting dates as yyyy-mm-dd, and creating separate columns for year, month, and day
* sorting drivers and races chronologically and creating new temporary driver and race ids

I'll do this in Excel and save the files in the folder `data/f1db_excelPrep/`.
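
The same preparation steps could also be done in pandas. This is only a sketch under assumptions: the raw files are taken to be header-less CSVs that use `\N` for nulls, and the column names and rows in `raw` are a shortened, made-up example rather than the real schema.

```python
import io
import pandas as pd

# Toy raw file: no header row, \N marks a missing value.
raw = io.StringIO(
    "1,hamilton,44,1985-01-07\n"
    "741,etancelin,\\N,1896-12-28\n"
    "589,chiron,\\N,1899-08-03\n"
)

# Headings are supplied by hand; the real ones would come from f1db_schema.txt.
cols = ["driverId", "driverRef", "number", "dob"]
drivers_demo = pd.read_csv(raw, header=None, names=cols, na_values=["\\N"])

# yyyy-mm-dd strings sort chronologically as plain text.
drivers_demo = drivers_demo.sort_values("dob").reset_index(drop=True)
drivers_demo["driverId2"] = range(1, len(drivers_demo) + 1)

# Split the date into separate integer columns without parsing it.
parts = drivers_demo["dob"].str.split("-", expand=True).astype(int)
drivers_demo["dob_year"] = parts[0]
drivers_demo["dob_month"] = parts[1]
drivers_demo["dob_day"] = parts[2]
```

Keeping the dates as text the whole way through avoids any spreadsheet reformatting of the date column.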

While doing this, I noticed a `D` status in the constructor results, and I don't know what it means. **I will look into this later.**

## Cleaning the data —part 2

In the first part of cleaning, I re-organized the `driverId`'s and `raceId`'s chronologically, so now I want to pass that info to the other tables so that everything is chronological.

What do I need to update for each table?

* circuits: *nothing*
* constructor_results: raceId
* constructor_standings: raceId
* constructors: *nothing*
* driver_standings: raceId, driverId
* driver: *nothing*
* lap_times: raceId, driverId
* pit_stops: raceId, driverId
* qualifying: raceId, driverId
* races: *nothing*
* results: raceId, driverId
* seasons: *nothing*
* status: *nothing*

For the tables that don't need to be updated, I'll make direct copies and put them in `data/f1db_working/`, which will hold the tables I use for analysis.
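
The copy step can also be scripted. The helper below is a hypothetical sketch (`copy_tables` and `UNCHANGED` are names made up here), assuming each table is stored as `<table>.csv`.

```python
import shutil
from pathlib import Path


def copy_tables(src_dir, dst_dir, tables):
    """Copy <table>.csv for each named table from src_dir to dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for table in tables:
        target = dst_dir / f"{table}.csv"
        shutil.copy2(src_dir / f"{table}.csv", target)
        copied.append(target)
    return copied


# The tables marked *nothing* in the list above.
UNCHANGED = ["circuits", "constructors", "driver", "races", "seasons", "status"]
```

From the notebook's folder, the call would look like `copy_tables("../data/f1db_excelPrep", "../data/f1db_working", UNCHANGED)`.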


In [1]:
import numpy as np
import pandas as pd

In [2]:
drivers = pd.read_csv("../data/f1db_excelPrep/driver.csv")

In [3]:
drivers.head()

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,driverId2,dob_year,dob_month,dob_day
0,741,etancelin,,,Philippe,Étancelin,1896-12-28,French,http://en.wikipedia.org/wiki/Philippe_%C3%89ta...,1,1896,12,28
1,703,legat,,,Arthur,Legat,1898-11-01,Belgian,http://en.wikipedia.org/wiki/Arthur_Legat,2,1898,11,1
2,786,fagioli,,,Luigi,Fagioli,1898-06-09,Italian,http://en.wikipedia.org/wiki/Luigi_Fagioli,3,1898,6,9
3,791,biondetti,,,Clemente,Biondetti,1898-08-18,Italian,http://en.wikipedia.org/wiki/Clemente_Biondetti,4,1898,8,18
4,589,chiron,,,Louis,Chiron,1899-08-03,Monegasque,http://en.wikipedia.org/wiki/Louis_Chiron,5,1899,8,3


In [4]:
drivers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 13 columns):
driverId       847 non-null int64
driverRef      847 non-null object
number         44 non-null float64
code           90 non-null object
forename       847 non-null object
surname        847 non-null object
dob            847 non-null object
nationality    847 non-null object
url            846 non-null object
driverId2      847 non-null int64
dob_year       847 non-null int64
dob_month      847 non-null int64
dob_day        847 non-null int64
dtypes: float64(1), int64(5), object(7)
memory usage: 86.1+ KB


In [5]:
newDriverId = drivers[["driverId","driverId2"]]

In [6]:
newDriverId.head()

Unnamed: 0,driverId,driverId2
0,741,1
1,703,2
2,786,3
3,791,4
4,589,5


In [7]:
newDriverId.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 2 columns):
driverId     847 non-null int64
driverId2    847 non-null int64
dtypes: int64(2)
memory usage: 13.3 KB


In [8]:
races = pd.read_csv("../data/f1db_excelPrep/races.csv")

In [9]:
races.head()

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,raceId2,race_year,race_month,race_day
0,833,1950,1,9,British Grand Prix,5/13/50,,http://en.wikipedia.org/wiki/1950_British_Gran...,1,1950,5,13
1,834,1950,2,6,Monaco Grand Prix,5/21/50,,http://en.wikipedia.org/wiki/1950_Monaco_Grand...,2,1950,5,21
2,835,1950,3,19,Indianapolis 500,5/30/50,,http://en.wikipedia.org/wiki/1950_Indianapolis...,3,1950,5,30
3,836,1950,4,66,Swiss Grand Prix,6/4/50,,http://en.wikipedia.org/wiki/1950_Swiss_Grand_...,4,1950,6,4
4,837,1950,5,13,Belgian Grand Prix,6/18/50,,http://en.wikipedia.org/wiki/1950_Belgian_Gran...,5,1950,6,18


In [10]:
races.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018 entries, 0 to 1017
Data columns (total 12 columns):
raceId        1018 non-null int64
year          1018 non-null int64
round         1018 non-null int64
circuitId     1018 non-null int64
name          1018 non-null object
date          1018 non-null object
time          287 non-null object
url           1018 non-null object
raceId2       1018 non-null int64
race_year     1018 non-null int64
race_month    1018 non-null int64
race_day      1018 non-null int64
dtypes: int64(8), object(4)
memory usage: 95.5+ KB


In [11]:
newRaceId = races[["raceId","raceId2"]]

In [12]:
newRaceId.head()

Unnamed: 0,raceId,raceId2
0,833,1
1,834,2
2,835,3
3,836,4
4,837,5


In [13]:
newRaceId.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018 entries, 0 to 1017
Data columns (total 2 columns):
raceId     1018 non-null int64
raceId2    1018 non-null int64
dtypes: int64(2)
memory usage: 16.0 KB


Now that we have `newRaceId` and `newDriverId` dataframes we can do inner joins with the other tables on the `raceId` and `driverId` columns, respectively, and add that info in.
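
An inner join silently drops any row whose id is missing from the lookup table, so it can help to make the merge check itself. This is a minimal sketch on toy data; `new_race_id` and `table` are made-up stand-ins for `newRaceId` and one of the other tables.

```python
import pandas as pd

# Toy lookup (old id -> new chronological id) and a toy table to update.
new_race_id = pd.DataFrame({"raceId": [833, 834, 18], "raceId2": [1, 2, 786]})
table = pd.DataFrame({"raceId": [18, 18, 833], "points": [14.0, 8.0, 9.0]})

merged = table.merge(
    new_race_id,
    on="raceId",
    how="left",              # keep every row of the table, even if unmatched
    validate="many_to_one",  # raise if a raceId is duplicated in the lookup
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Any row that is not "both" had no matching raceId in the lookup.
unmatched = merged[merged["_merge"] != "both"]
assert unmatched.empty, "some raceIds have no new id"
merged = merged.drop(columns="_merge")
```

With `how="left"` the row count of the result always equals that of the original table, so a missing id shows up as a failed check instead of a quietly shorter table.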

In [14]:
constructorResults = pd.read_csv("../data/f1db_excelPrep/constructor_results.csv")

In [15]:
constructorResults.head()

Unnamed: 0,constructorResultsId,raceId,constructorId,points,status
0,1,18,1,14.0,
1,2,18,2,8.0,
2,3,18,3,9.0,
3,4,18,4,5.0,
4,5,18,5,2.0,


In [16]:
constructorResults.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11390 entries, 0 to 11389
Data columns (total 5 columns):
constructorResultsId    11390 non-null int64
raceId                  11390 non-null int64
constructorId           11390 non-null int64
points                  11390 non-null float64
status                  17 non-null object
dtypes: float64(1), int64(3), object(1)
memory usage: 445.0+ KB


In [17]:
constructorResultsNew = pd.merge(constructorResults, newRaceId, on="raceId")

In [18]:
constructorResultsNew.head()

Unnamed: 0,constructorResultsId,raceId,constructorId,points,status,raceId2
0,1,18,1,14.0,,786
1,2,18,2,8.0,,786
2,3,18,3,9.0,,786
3,4,18,4,5.0,,786
4,5,18,5,2.0,,786


In [19]:
constructorResultsNew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11390 entries, 0 to 11389
Data columns (total 6 columns):
constructorResultsId    11390 non-null int64
raceId                  11390 non-null int64
constructorId           11390 non-null int64
points                  11390 non-null float64
status                  17 non-null object
raceId2                 11390 non-null int64
dtypes: float64(1), int64(4), object(1)
memory usage: 622.9+ KB


Things worked well. We can save this dataframe to a csv since we don't need to add anything else to the constructor results.

In [20]:
constructorResultsNew.to_csv("../data/f1db_working/constructor_results.csv", index=False)

We can repeat the same process with the constructor_standings table.

In [21]:
constructorStandings = pd.read_csv("../data/f1db_excelPrep/constructor_standings.csv")

In [22]:
constructorStandings.head()

Unnamed: 0,constructorStandingsId,raceId,constructorId,points,position,positionText,wins
0,1,18,1,14.0,1,1,1
1,2,18,2,8.0,3,3,0
2,3,18,3,9.0,2,2,0
3,4,18,4,5.0,4,4,0
4,5,18,5,2.0,5,5,0


In [23]:
constructorStandings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12146 entries, 0 to 12145
Data columns (total 7 columns):
constructorStandingsId    12146 non-null int64
raceId                    12146 non-null int64
constructorId             12146 non-null int64
points                    12146 non-null float64
position                  12146 non-null int64
positionText              12146 non-null object
wins                      12146 non-null int64
dtypes: float64(1), int64(5), object(1)
memory usage: 664.3+ KB


In [24]:
constructorStandings2 = constructorStandings.merge(newRaceId, on = "raceId")

In [25]:
constructorStandings2.head()

Unnamed: 0,constructorStandingsId,raceId,constructorId,points,position,positionText,wins,raceId2
0,1,18,1,14.0,1,1,1,786
1,2,18,2,8.0,3,3,0,786
2,3,18,3,9.0,2,2,0,786
3,4,18,4,5.0,4,4,0,786
4,5,18,5,2.0,5,5,0,786


In [26]:
constructorStandings2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12146 entries, 0 to 12145
Data columns (total 8 columns):
constructorStandingsId    12146 non-null int64
raceId                    12146 non-null int64
constructorId             12146 non-null int64
points                    12146 non-null float64
position                  12146 non-null int64
positionText              12146 non-null object
wins                      12146 non-null int64
raceId2                   12146 non-null int64
dtypes: float64(1), int64(6), object(1)
memory usage: 854.0+ KB


In [27]:
constructorStandings2.to_csv("../data/f1db_working/constructor_standings.csv", index=False)

Let's move on to the rest. We have to merge both the newRaceId and the newDriverId. We have to do it for: 

* driver_standings: raceId, driverId
* lap_times: raceId, driverId
* pit_stops: raceId, driverId
* qualifying: raceId, driverId
* results: raceId, driverId

Can I do it in one line?

In [28]:
driverStandings = pd.read_csv("../data/f1db_excelPrep/driver_standings.csv")

In [29]:
driverStandings.head()

Unnamed: 0,driverStandingsId,raceId,driverId,points,position,positionText,wins
0,1,18,1,10.0,1,1,1
1,2,18,2,8.0,2,2,0
2,3,18,3,6.0,3,3,0
3,4,18,4,5.0,4,4,0
4,5,18,5,4.0,5,5,0


In [30]:
driverStandings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32226 entries, 0 to 32225
Data columns (total 7 columns):
driverStandingsId    32226 non-null int64
raceId               32226 non-null int64
driverId             32226 non-null int64
points               32226 non-null float64
position             32226 non-null int64
positionText         32226 non-null object
wins                 32226 non-null int64
dtypes: float64(1), int64(5), object(1)
memory usage: 1.7+ MB


In [31]:
driverStandings2 = driverStandings.merge(newRaceId, on = "raceId").merge(newDriverId, on = "driverId")

In [32]:
driverStandings2.head()

Unnamed: 0,driverStandingsId,raceId,driverId,points,position,positionText,wins,raceId2,driverId2
0,1,18,1,10.0,1,1,1,786,803
1,9,19,1,14.0,1,1,1,787,803
2,27,20,1,14.0,3,3,1,788,803
3,48,21,1,20.0,2,2,1,789,803
4,69,22,1,28.0,3,3,1,790,803


Honest to god, let's check to make sure that this worked. Lewis Hamilton has `driverId = 1`, and the merged rows give him `driverId2 = 803`, so let's look up 803 in `drivers` and see whether it's him.

In [33]:
drivers[drivers.driverId2 == 803]

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,driverId2,dob_year,dob_month,dob_day
802,1,hamilton,44.0,HAM,Lewis,Hamilton,1/7/85,British,http://en.wikipedia.org/wiki/Lewis_Hamilton,803,1985,1,7


In [34]:
driverStandings2.head()

Unnamed: 0,driverStandingsId,raceId,driverId,points,position,positionText,wins,raceId2,driverId2
0,1,18,1,10.0,1,1,1,786,803
1,9,19,1,14.0,1,1,1,787,803
2,27,20,1,14.0,3,3,1,788,803
3,48,21,1,20.0,2,2,1,789,803
4,69,22,1,28.0,3,3,1,790,803


In [35]:
driverStandings2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32226 entries, 0 to 32225
Data columns (total 9 columns):
driverStandingsId    32226 non-null int64
raceId               32226 non-null int64
driverId             32226 non-null int64
points               32226 non-null float64
position             32226 non-null int64
positionText         32226 non-null object
wins                 32226 non-null int64
raceId2              32226 non-null int64
driverId2            32226 non-null int64
dtypes: float64(1), int64(7), object(1)
memory usage: 2.5+ MB


This chained merge worked! Let's save this file first.
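
In the output above, the merged rows come back grouped by driver rather than in their original order. If keeping the original row order matters, a lookup with `Series.map` is one alternative to merging. This is a sketch on toy data, with all frames made up for illustration.

```python
import pandas as pd

new_race_id = pd.DataFrame({"raceId": [18, 19], "raceId2": [786, 787]})
new_driver_id = pd.DataFrame({"driverId": [1, 2], "driverId2": [803, 768]})
standings = pd.DataFrame({"raceId": [18, 18, 19], "driverId": [1, 2, 1]})

# Turn each lookup table into a Series indexed by the old id.
race_map = new_race_id.set_index("raceId")["raceId2"]
driver_map = new_driver_id.set_index("driverId")["driverId2"]

# map() looks up each old id and leaves the row order untouched.
standings["raceId2"] = standings["raceId"].map(race_map)
standings["driverId2"] = standings["driverId"].map(driver_map)
```

Unlike an inner merge, `map` keeps rows with unknown ids and fills the new column with NaN for them, so a quick `isna().sum()` afterwards reveals any gaps.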

In [36]:
driverStandings2.to_csv("../data/f1db_working/driver_standings.csv", index=False)

Now let's try and do the other tables in the same way.

In [37]:
grab_dir = "../data/f1db_excelPrep/"
save_dir = "../data/f1db_working/"
tables = ["lap_times", "pit_stops", "qualifying", "results"]

for table in tables:
    # read the Excel-prepped table
    df1 = pd.read_csv(grab_dir + table + ".csv")
    # add the new chronological ids with a chained merge
    df2 = df1.merge(newRaceId, on="raceId").merge(newDriverId, on="driverId")
    df2.to_csv(save_dir + table + ".csv", index=False)
    # uncomment the following to compare each table before and after the merge
    # print("###", table, "###")
    # df1.info(verbose=False)
    # print("---")
    # df2.info(verbose=False)
    # print("###############")
    # print()

Each of the tables is now prepped. For my analysis I'll be working primarily with the results table, so I'll try to get all the information I need together.

## Wrangling the results table

Let's create one master results table:

### Getting the results columns

In [38]:
results = pd.read_csv("../data/f1db_working/results.csv")

In [39]:
results.head(10)

Unnamed: 0,resultsId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId,raceId2,driverId2
0,1,18,1,1,22.0,1,1.0,1,1,10.0,58,34:50.6,5690616.0,39.0,2.0,01:27.5,218.3,1,786,803
1,27,19,1,1,22.0,9,5.0,5,5,4.0,56,46.548,5525103.0,53.0,3.0,01:35.5,209.033,1,787,803
2,57,20,1,1,22.0,3,13.0,13,13,0.0,56,,,25.0,19.0,01:35.5,203.969,11,788,803
3,69,21,1,1,22.0,5,3.0,3,3,6.0,66,4.187,5903238.0,20.0,3.0,01:22.0,204.323,1,789,803
4,90,22,1,1,22.0,3,2.0,2,2,8.0,58,3.779,5213230.0,31.0,2.0,01:26.5,222.085,1,790,803
5,109,23,1,1,22.0,3,1.0,1,1,10.0,76,00:42.7,7242742.0,71.0,6.0,01:18.5,153.152,1,791,803
6,147,24,1,1,22.0,1,,R,19,0.0,19,,,4.0,3.0,01:17.5,202.559,4,792,803
7,158,25,1,1,22.0,13,10.0,10,10,0.0,70,54.538,5564783.0,40.0,5.0,01:17.5,205.022,1,793,803
8,169,26,1,1,22.0,4,1.0,1,1,10.0,60,39:09.4,5949440.0,16.0,3.0,01:32.8,199.398,1,794,803
9,189,27,1,1,22.0,1,1.0,1,1,10.0,67,31:20.9,5480874.0,17.0,2.0,01:16.0,216.552,1,795,803


There is an unnamed column at the front. It looks like the DataFrame index rather than real data; `info()` will show whether it counts as a column.

In [40]:
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24277 entries, 0 to 24276
Data columns (total 20 columns):
resultsId          24277 non-null int64
raceId             24277 non-null int64
driverId           24277 non-null int64
constructorId      24277 non-null int64
number             24271 non-null float64
grid               24277 non-null int64
position           13634 non-null float64
positionText       24277 non-null object
positionOrder      24277 non-null int64
points             24277 non-null float64
laps               24277 non-null int64
time               6237 non-null object
milliseconds       6236 non-null float64
fastestLap         5862 non-null float64
rank               6031 non-null float64
fastestLapTime     5862 non-null object
fastestLapSpeed    5862 non-null float64
statusId           24277 non-null int64
raceId2            24277 non-null int64
driverId2          24277 non-null int64
dtypes: float64(7), int64(10), object(3)
memory usage: 3.7+ MB


In [41]:
results = results.sort_values("resultsId")

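`info()` lists only the 20 named columns, so the unnamed one is just the index, not something to drop. The rows are now back in `resultsId` order.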
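
In a pandas display, the unlabeled leftmost column is the index, and sorting carries each row's old index label along with it. This small sketch on made-up data shows the effect and how `reset_index(drop=True)` renumbers the rows.

```python
import pandas as pd

demo = pd.DataFrame({"resultsId": [3, 1, 2], "points": [6.0, 10.0, 8.0]})

# Sorting keeps each row's old index label, so the index is no longer 0, 1, 2.
by_id = demo.sort_values("resultsId")
print(list(by_id.index))   # [1, 2, 0]

# reset_index(drop=True) discards the old labels and renumbers from 0.
by_id = by_id.reset_index(drop=True)
print(list(by_id.index))   # [0, 1, 2]
```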

In [42]:
results.head(10)

Unnamed: 0,resultsId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId,raceId2,driverId2
0,1,18,1,1,22.0,1,1.0,1,1,10.0,58,34:50.6,5690616.0,39.0,2.0,01:27.5,218.3,1,786,803
233,2,18,2,2,3.0,5,2.0,2,2,8.0,58,5.478,5696094.0,41.0,3.0,01:27.7,217.586,1,786,768
417,3,18,3,3,7.0,7,3.0,3,3,6.0,58,8.163,5698779.0,41.0,5.0,01:28.1,216.719,1,786,804
623,4,18,4,4,5.0,11,4.0,4,4,5.0,58,17.181,5707797.0,58.0,7.0,01:28.6,215.464,1,786,786
937,5,18,5,1,23.0,3,5.0,5,5,4.0,58,18.014,5708630.0,43.0,1.0,01:27.4,218.385,1,786,787
1049,6,18,6,3,8.0,13,6.0,6,6,3.0,57,,,50.0,14.0,01:29.6,212.974,11,786,805
1085,7,18,7,5,14.0,17,7.0,7,7,2.0,55,,,22.0,12.0,01:29.5,213.224,5,786,776
1112,8,18,8,6,1.0,15,8.0,8,8,1.0,53,,,20.0,4.0,01:27.9,217.18,5,786,777
1410,9,18,9,2,4.0,2,,R,9,0.0,47,,,15.0,9.0,01:28.8,215.1,4,786,799
1490,10,18,10,7,12.0,18,,R,10,0.0,43,,,23.0,13.0,01:29.6,213.166,3,786,793


In [43]:
results.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24277 entries, 0 to 23056
Data columns (total 20 columns):
resultsId          24277 non-null int64
raceId             24277 non-null int64
driverId           24277 non-null int64
constructorId      24277 non-null int64
number             24271 non-null float64
grid               24277 non-null int64
position           13634 non-null float64
positionText       24277 non-null object
positionOrder      24277 non-null int64
points             24277 non-null float64
laps               24277 non-null int64
time               6237 non-null object
milliseconds       6236 non-null float64
fastestLap         5862 non-null float64
rank               6031 non-null float64
fastestLapTime     5862 non-null object
fastestLapSpeed    5862 non-null float64
statusId           24277 non-null int64
raceId2            24277 non-null int64
driverId2          24277 non-null int64
dtypes: float64(7), int64(10), object(3)
memory usage: 3.9+ MB


In [44]:
list(results.columns)

['resultsId',
 'raceId',
 'driverId',
 'constructorId',
 'number',
 'grid',
 'position',
 'positionText',
 'positionOrder',
 'points',
 'laps',
 'time',
 'milliseconds',
 'fastestLap',
 'rank',
 'fastestLapTime',
 'fastestLapSpeed',
 'statusId',
 'raceId2',
 'driverId2']

Looking at the columns, I'm interested in:

* raceId2
* driverId2
* constructorId
* grid
* position
* positionText
* positionOrder
* points
* statusId

I can come back and add more columns, but let's create a new results table that has just these.

In [45]:
results[["raceId2","driverId2","constructorId","grid","positionText","positionOrder","points","statusId"]].head(20)

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId
0,786,803,1,1,1,1,10.0,1
233,786,768,2,5,2,2,8.0,1
417,786,804,3,7,3,3,6.0,1
623,786,786,4,11,4,4,5.0,1
937,786,787,1,3,5,5,4.0,1
1049,786,805,3,13,6,6,3.0,11
1085,786,776,5,17,7,7,2.0,5
1112,786,777,6,15,8,8,1.0,5
1410,786,799,2,2,R,9,0.0,4
1490,786,793,7,18,R,10,0.0,3


In [46]:
results2 = results[["raceId2","driverId2","constructorId","grid",
                    "positionText","positionOrder","points","statusId"]]

In [47]:
results2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24277 entries, 0 to 23056
Data columns (total 8 columns):
raceId2          24277 non-null int64
driverId2        24277 non-null int64
constructorId    24277 non-null int64
grid             24277 non-null int64
positionText     24277 non-null object
positionOrder    24277 non-null int64
points           24277 non-null float64
statusId         24277 non-null int64
dtypes: float64(1), int64(6), object(1)
memory usage: 1.7+ MB


Same number of rows in all of them. Now, let's sort the rows by `raceId2`.

In [48]:
results2 = results2.sort_values("raceId2")
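
Sorting on `raceId2` alone leaves the rows within a race in no particular order. Passing a list of columns sorts by race first and then by finishing order. This is a sketch on a made-up frame.

```python
import pandas as pd

demo = pd.DataFrame({
    "raceId2": [2, 1, 1, 2],
    "positionOrder": [1, 2, 1, 2],
    "driverId2": [20, 3, 58, 56],
})

# Sort by race first, then by finishing order within each race.
ordered = demo.sort_values(["raceId2", "positionOrder"]).reset_index(drop=True)
print(ordered)
```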

In [49]:
results2.head(8)

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId
21977,1,58,51,4,3,3,4.0,1
21694,1,20,51,1,1,1,9.0,1
22223,1,3,51,2,2,2,6.0,1
21892,1,142,151,10,R,20,0.0,6
22255,1,105,105,20,10,10,0.0,16
22057,1,15,154,6,4,4,3.0,12
22254,1,68,151,12,R,21,0.0,126
21865,1,65,151,10,R,20,0.0,6


### Getting Driver Information

This is good progress. Now I want to connect the driver, constructor, and race data associated with these ids. Let's start with the drivers.

In [50]:
drivers = pd.read_csv("../data/f1db_working/driver.csv")

In [51]:
drivers.head()

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,driverId2,dob_year,dob_month,dob_day
0,741,etancelin,,,Philippe,Étancelin,1896-12-28,French,http://en.wikipedia.org/wiki/Philippe_%C3%89ta...,1,1896,12,28
1,703,legat,,,Arthur,Legat,1898-11-01,Belgian,http://en.wikipedia.org/wiki/Arthur_Legat,2,1898,11,1
2,786,fagioli,,,Luigi,Fagioli,1898-06-09,Italian,http://en.wikipedia.org/wiki/Luigi_Fagioli,3,1898,6,9
3,791,biondetti,,,Clemente,Biondetti,1898-08-18,Italian,http://en.wikipedia.org/wiki/Clemente_Biondetti,4,1898,8,18
4,589,chiron,,,Louis,Chiron,1899-08-03,Monegasque,http://en.wikipedia.org/wiki/Louis_Chiron,5,1899,8,3


Before I get the columns I want, I want to add a column that has the full driver name.

In [52]:
def combine_name(row):
    # join forename and surname into a single display name
    return row.forename + " " + row.surname

In [53]:
drivers["driverName"] = drivers[["forename","surname"]].apply(lambda x: ' '.join(x), axis=1)
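
The row-wise `apply` works, but string columns can also be joined directly, which is usually shorter and faster. This is a sketch on a two-row, made-up frame.

```python
import pandas as pd

demo = pd.DataFrame({
    "forename": ["Louis", "Hans"],
    "surname": ["Chiron", "von Stuck"],
})

# Element-wise string concatenation over whole columns, no apply() needed.
demo["driverName"] = demo["forename"] + " " + demo["surname"]

# Equivalent spelling with the .str accessor.
same = demo["forename"].str.cat(demo["surname"], sep=" ")
```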

In [54]:
drivers.head()

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,driverId2,dob_year,dob_month,dob_day,driverName
0,741,etancelin,,,Philippe,Étancelin,1896-12-28,French,http://en.wikipedia.org/wiki/Philippe_%C3%89ta...,1,1896,12,28,Philippe Étancelin
1,703,legat,,,Arthur,Legat,1898-11-01,Belgian,http://en.wikipedia.org/wiki/Arthur_Legat,2,1898,11,1,Arthur Legat
2,786,fagioli,,,Luigi,Fagioli,1898-06-09,Italian,http://en.wikipedia.org/wiki/Luigi_Fagioli,3,1898,6,9,Luigi Fagioli
3,791,biondetti,,,Clemente,Biondetti,1898-08-18,Italian,http://en.wikipedia.org/wiki/Clemente_Biondetti,4,1898,8,18,Clemente Biondetti
4,589,chiron,,,Louis,Chiron,1899-08-03,Monegasque,http://en.wikipedia.org/wiki/Louis_Chiron,5,1899,8,3,Louis Chiron


Good! Now I want to grab the driverId2, driverRef, driverName, nationality, and dob:

In [55]:
drivers2 = drivers[["driverId2", "driverRef", "driverName", "nationality","dob"]]

In [56]:
drivers2.head()

Unnamed: 0,driverId2,driverRef,driverName,nationality,dob
0,1,etancelin,Philippe Étancelin,French,1896-12-28
1,2,legat,Arthur Legat,Belgian,1898-11-01
2,3,fagioli,Luigi Fagioli,Italian,1898-06-09
3,4,biondetti,Clemente Biondetti,Italian,1898-08-18
4,5,chiron,Louis Chiron,Monegasque,1899-08-03


Let's try merging drivers2 with results2

In [57]:
i = results2.merge(drivers2, on="driverId2")

In [58]:
i.sort_values("raceId2").head(25)

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId,driverRef,driverName,nationality,dob
0,1,58,51,4,3,3,4.0,1,reg_parnell,Reg Parnell,British,7/2/11
287,1,83,126,19,N,13,0.0,62,kelly,Joe Kelly,Irish,3/13/13
52,1,142,151,10,R,20,0.0,6,rolt,Tony Rolt,British,10/16/18
229,1,56,51,3,R,12,0.0,44,fangio,Juan Fangio,Argentine,6/24/11
55,1,105,105,20,10,10,0.0,16,fry,Joe Fry,British,10/26/15
56,1,15,154,6,4,4,3.0,12,cabantous,Yves Cabantous,French,10/8/04
204,1,112,154,21,11,11,0.0,16,claes,Johnny Claes,Belgian,8/11/16
197,1,88,151,13,6,6,0.0,13,gerard,Bob Gerard,British,1/19/14
195,1,130,105,16,9,9,0.0,16,hampshire,David Hampshire,British,12/29/17
289,1,104,105,20,10,10,0.0,16,shawe_taylor,Brian Shawe Taylor,British,1/28/15


Why is dob showing up so weirdly?

In [59]:
drivers2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 5 columns):
driverId2      847 non-null int64
driverRef      847 non-null object
driverName     847 non-null object
nationality    847 non-null object
dob            847 non-null object
dtypes: int64(1), object(4)
memory usage: 33.2+ KB


In [60]:
dobs = drivers2["dob"]

In [61]:
dobs.head(20)

0     1896-12-28
1     1898-11-01
2     1898-06-09
3     1898-08-18
4     1899-08-03
5     1899-10-15
6     1899-10-13
7       12/27/00
8        3/29/00
9        7/19/02
10       4/27/02
11        6/9/03
12       5/23/03
13      11/23/03
14       10/8/04
15       11/5/05
16      12/22/05
17       7/25/05
18      10/12/06
19      10/30/06
Name: dob, dtype: object

In [62]:
dobs2 = pd.to_datetime(dobs)

In [63]:
dobs2.head(20)

0    1896-12-28
1    1898-11-01
2    1898-06-09
3    1898-08-18
4    1899-08-03
5    1899-10-15
6    1899-10-13
7    2000-12-27
8    2000-03-29
9    2002-07-19
10   2002-04-27
11   2003-06-09
12   2003-05-23
13   2003-11-23
14   2004-10-08
15   2005-11-05
16   2005-12-22
17   2005-07-25
18   2006-10-12
19   2006-10-30
Name: dob, dtype: datetime64[ns]

Well, this is fucked up. The two-digit years are interpreted as 20xx instead of 19xx (`12/27/00` became 2000-12-27). Good thing I went and made separate year, month, and day columns. Let's go back to `drivers` and try to fix this there.
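
If the separate year, month, and day columns were not available, the mixed strings could still be repaired by parsing each format explicitly and then shifting the century. This is a sketch only. It relies on the assumption, stated in the code, that nobody in the table was born in 2000 or later, and `dobs_demo` is a made-up sample.

```python
import pandas as pd

dobs_demo = pd.Series(["1896-12-28", "12/27/00", "6/24/11", "1/7/85"])

# Entries in m/d/yy form contain a slash; the rest are already yyyy-mm-dd.
short = dobs_demo.str.contains("/")

# Parse each group with its own explicit format; the other group becomes NaT.
iso = pd.to_datetime(dobs_demo.where(~short), format="%Y-%m-%d")
us = pd.to_datetime(dobs_demo.where(short), format="%m/%d/%y")

# Assumption: nobody in this table was born in 2000 or later, so any
# two-digit year that landed in the 2000s belongs 100 years earlier.
us = us.where(us.dt.year < 2000, us - pd.DateOffset(years=100))

# Combine the two partial results into one column.
parsed = iso.fillna(us)
print(parsed)
```

The cutoff is a judgment call about the data, not something the parser can know, which is why it is written as an explicit step.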

In [64]:
drivers.head(20)

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,driverId2,dob_year,dob_month,dob_day,driverName
0,741,etancelin,,,Philippe,Étancelin,1896-12-28,French,http://en.wikipedia.org/wiki/Philippe_%C3%89ta...,1,1896,12,28,Philippe Étancelin
1,703,legat,,,Arthur,Legat,1898-11-01,Belgian,http://en.wikipedia.org/wiki/Arthur_Legat,2,1898,11,1,Arthur Legat
2,786,fagioli,,,Luigi,Fagioli,1898-06-09,Italian,http://en.wikipedia.org/wiki/Luigi_Fagioli,3,1898,6,9,Luigi Fagioli
3,791,biondetti,,,Clemente,Biondetti,1898-08-18,Italian,http://en.wikipedia.org/wiki/Clemente_Biondetti,4,1898,8,18,Clemente Biondetti
4,589,chiron,,,Louis,Chiron,1899-08-03,Monegasque,http://en.wikipedia.org/wiki/Louis_Chiron,5,1899,8,3,Louis Chiron
5,750,brudes,,,Adolf,Brudes,1899-10-15,German,http://en.wikipedia.org/wiki/Adolf_Brudes,6,1899,10,15,Adolf Brudes
6,760,dusio,,,Piero,Dusio,1899-10-13,Italian,http://en.wikipedia.org/wiki/Piero_Dusio,7,1899,10,13,Piero Dusio
7,717,hans_stuck,,,Hans,von Stuck,12/27/00,German,http://en.wikipedia.org/wiki/Hans_Von_Stuck,8,1900,12,27,Hans von Stuck
8,749,aston,,,Bill,Aston,3/29/00,British,http://en.wikipedia.org/wiki/Bill_Aston,9,1900,3,29,Bill Aston
9,733,miller,,,Chet,Miller,7/19/02,American,http://en.wikipedia.org/wiki/Chet_Miller,10,1902,7,19,Chet Miller


In [65]:
def combine_dob(row):
    year = str(row.dob_year)
    month = str(row.dob_month)
    day = str(row.dob_day)
    return year+"-"+month+"-"+day

In [66]:
dob2 = drivers[["dob_year", "dob_month", "dob_day"]]

In [67]:
dob2.head()

Unnamed: 0,dob_year,dob_month,dob_day
0,1896,12,28
1,1898,11,1
2,1898,6,9
3,1898,8,18
4,1899,8,3


In [68]:
dob2.apply(combine_dob, axis=1).head(20)

0     1896-12-28
1      1898-11-1
2       1898-6-9
3      1898-8-18
4       1899-8-3
5     1899-10-15
6     1899-10-13
7     1900-12-27
8      1900-3-29
9      1902-7-19
10     1902-4-27
11      1903-6-9
12     1903-5-23
13    1903-11-23
14     1904-10-8
15     1905-11-5
16    1905-12-22
17     1905-7-25
18    1906-10-12
19    1906-10-30
dtype: object

So now let's add that to `drivers` as a `dob2` column.

In [69]:
drivers["dob2"] = drivers[["dob_year", "dob_month","dob_day"]].apply(combine_dob, axis=1)
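
As an alternative to building a string first, `pd.to_datetime` can assemble dates directly from a frame whose columns are named `year`, `month`, and `day`. This sketch renames the component columns to those names on a made-up frame.

```python
import pandas as pd

demo = pd.DataFrame({
    "dob_year": [1896, 1900, 1985],
    "dob_month": [12, 3, 1],
    "dob_day": [28, 29, 7],
})

# to_datetime assembles dates from columns named year, month, and day,
# so rename the three component columns to those names first.
parts = demo.rename(
    columns={"dob_year": "year", "dob_month": "month", "dob_day": "day"}
)
demo["dob2"] = pd.to_datetime(parts[["year", "month", "day"]])
```

This yields a datetime column in one step, so no separate string-to-date conversion is needed afterwards.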

In [70]:
drivers.head(10)

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url,driverId2,dob_year,dob_month,dob_day,driverName,dob2
0,741,etancelin,,,Philippe,Étancelin,1896-12-28,French,http://en.wikipedia.org/wiki/Philippe_%C3%89ta...,1,1896,12,28,Philippe Étancelin,1896-12-28
1,703,legat,,,Arthur,Legat,1898-11-01,Belgian,http://en.wikipedia.org/wiki/Arthur_Legat,2,1898,11,1,Arthur Legat,1898-11-1
2,786,fagioli,,,Luigi,Fagioli,1898-06-09,Italian,http://en.wikipedia.org/wiki/Luigi_Fagioli,3,1898,6,9,Luigi Fagioli,1898-6-9
3,791,biondetti,,,Clemente,Biondetti,1898-08-18,Italian,http://en.wikipedia.org/wiki/Clemente_Biondetti,4,1898,8,18,Clemente Biondetti,1898-8-18
4,589,chiron,,,Louis,Chiron,1899-08-03,Monegasque,http://en.wikipedia.org/wiki/Louis_Chiron,5,1899,8,3,Louis Chiron,1899-8-3
5,750,brudes,,,Adolf,Brudes,1899-10-15,German,http://en.wikipedia.org/wiki/Adolf_Brudes,6,1899,10,15,Adolf Brudes,1899-10-15
6,760,dusio,,,Piero,Dusio,1899-10-13,Italian,http://en.wikipedia.org/wiki/Piero_Dusio,7,1899,10,13,Piero Dusio,1899-10-13
7,717,hans_stuck,,,Hans,von Stuck,12/27/00,German,http://en.wikipedia.org/wiki/Hans_Von_Stuck,8,1900,12,27,Hans von Stuck,1900-12-27
8,749,aston,,,Bill,Aston,3/29/00,British,http://en.wikipedia.org/wiki/Bill_Aston,9,1900,3,29,Bill Aston,1900-3-29
9,733,miller,,,Chet,Miller,7/19/02,American,http://en.wikipedia.org/wiki/Chet_Miller,10,1902,7,19,Chet Miller,1902-7-19


In [71]:
drivers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 15 columns):
driverId       847 non-null int64
driverRef      847 non-null object
number         44 non-null float64
code           90 non-null object
forename       847 non-null object
surname        847 non-null object
dob            847 non-null object
nationality    847 non-null object
url            846 non-null object
driverId2      847 non-null int64
dob_year       847 non-null int64
dob_month      847 non-null int64
dob_day        847 non-null int64
driverName     847 non-null object
dob2           847 non-null object
dtypes: float64(1), int64(5), object(9)
memory usage: 99.3+ KB


In [72]:
drivers["dob2"] = pd.to_datetime(drivers["dob2"])
drivers["driverNationality"] = drivers["nationality"]

In [73]:
drivers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 16 columns):
driverId             847 non-null int64
driverRef            847 non-null object
number               44 non-null float64
code                 90 non-null object
forename             847 non-null object
surname              847 non-null object
dob                  847 non-null object
nationality          847 non-null object
url                  846 non-null object
driverId2            847 non-null int64
dob_year             847 non-null int64
dob_month            847 non-null int64
dob_day              847 non-null int64
driverName           847 non-null object
dob2                 847 non-null datetime64[ns]
driverNationality    847 non-null object
dtypes: datetime64[ns](1), float64(1), int64(5), object(9)
memory usage: 106.0+ KB


dob2 seems to be in working order, so let's go back to building drivers2 and retrying the merge, keeping just the id, ref, and name columns for now:

In [74]:
drivers2 = drivers[["driverId2", "driverRef", "driverName"]]

In [75]:
drivers2.head(20)

Unnamed: 0,driverId2,driverRef,driverName
0,1,etancelin,Philippe Étancelin
1,2,legat,Arthur Legat
2,3,fagioli,Luigi Fagioli
3,4,biondetti,Clemente Biondetti
4,5,chiron,Louis Chiron
5,6,brudes,Adolf Brudes
6,7,dusio,Piero Dusio
7,8,hans_stuck,Hans von Stuck
8,9,aston,Bill Aston
9,10,miller,Chet Miller


In [76]:
drivers2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 3 columns):
driverId2     847 non-null int64
driverRef     847 non-null object
driverName    847 non-null object
dtypes: int64(1), object(2)
memory usage: 19.9+ KB


In [77]:
results_pre1 = results2.merge(drivers2, on="driverId2")

In [78]:
results_pre1[results_pre1.raceId2 == 1].sort_values("positionOrder")

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId,driverRef,driverName
7,1,20,51,1,1,1,9.0,1,farina,Nino Farina
44,1,3,51,2,2,2,6.0,1,fagioli,Luigi Fagioli
0,1,58,51,4,3,3,4.0,1,reg_parnell,Reg Parnell
56,1,15,154,6,4,4,3.0,12,cabantous,Yves Cabantous
137,1,16,154,9,5,5,2.0,12,rosier,Louis Rosier
197,1,88,151,13,6,6,0.0,13,gerard,Bob Gerard
180,1,26,151,15,7,7,0.0,13,harrison,Cuth Harrison
183,1,1,154,14,8,8,0.0,15,etancelin,Philippe Étancelin
195,1,130,105,16,9,9,0.0,16,hampshire,David Hampshire
289,1,104,105,20,10,10,0.0,16,shawe_taylor,Brian Shawe Taylor
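
A plain `merge` is an inner join, so any result whose `driverId2` has no match in `drivers2` would be dropped without warning. As a minimal sketch of how that could be checked, the example below uses small toy frames (`results_toy` and `drivers_toy` are hypothetical stand-ins, not the real tables) together with the `how`, `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for results2 and drivers2 (illustrative values only).
results_toy = pd.DataFrame({
    "raceId2": [1, 1, 1],
    "driverId2": [20, 3, 999],   # 999 deliberately has no match below
    "points": [9.0, 6.0, 0.0],
})
drivers_toy = pd.DataFrame({
    "driverId2": [3, 20],
    "driverName": ["Luigi Fagioli", "Nino Farina"],
})

# how="left" keeps every result row, validate raises a MergeError if a
# driverId2 appears more than once in drivers_toy, and indicator=True adds
# a "_merge" column saying where each row came from.
checked = results_toy.merge(
    drivers_toy,
    on="driverId2",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows flagged "left_only" are results that found no driver.
unmatched = checked[checked["_merge"] == "left_only"]
print(len(results_toy), len(checked), len(unmatched))
```

If `unmatched` is empty on the real data, the inner join used in this notebook loses nothing.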


### Getting Race Information

The driver merge seems to work, so let's add race information next:

In [79]:
races = pd.read_csv("../data/f1db_working/races.csv")

In [80]:
races.head()

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,raceId2,race_year,race_month,race_day
0,833,1950,1,9,British Grand Prix,5/13/50,,http://en.wikipedia.org/wiki/1950_British_Gran...,1,1950,5,13
1,834,1950,2,6,Monaco Grand Prix,5/21/50,,http://en.wikipedia.org/wiki/1950_Monaco_Grand...,2,1950,5,21
2,835,1950,3,19,Indianapolis 500,5/30/50,,http://en.wikipedia.org/wiki/1950_Indianapolis...,3,1950,5,30
3,836,1950,4,66,Swiss Grand Prix,6/4/50,,http://en.wikipedia.org/wiki/1950_Swiss_Grand_...,4,1950,6,4
4,837,1950,5,13,Belgian Grand Prix,6/18/50,,http://en.wikipedia.org/wiki/1950_Belgian_Gran...,5,1950,6,18


In [81]:
races.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018 entries, 0 to 1017
Data columns (total 12 columns):
raceId        1018 non-null int64
year          1018 non-null int64
round         1018 non-null int64
circuitId     1018 non-null int64
name          1018 non-null object
date          1018 non-null object
time          287 non-null object
url           1018 non-null object
raceId2       1018 non-null int64
race_year     1018 non-null int64
race_month    1018 non-null int64
race_day      1018 non-null int64
dtypes: int64(8), object(4)
memory usage: 95.5+ KB


In [82]:
def combine_date(row):
    year = str(row.race_year)
    month = str(row.race_month)
    day = str(row.race_day)
    return year+"-"+month+"-"+day

In [83]:
races["prixDate"] = races.apply(combine_date, axis=1)

In [84]:
races.head()

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,raceId2,race_year,race_month,race_day,prixDate
0,833,1950,1,9,British Grand Prix,5/13/50,,http://en.wikipedia.org/wiki/1950_British_Gran...,1,1950,5,13,1950-5-13
1,834,1950,2,6,Monaco Grand Prix,5/21/50,,http://en.wikipedia.org/wiki/1950_Monaco_Grand...,2,1950,5,21,1950-5-21
2,835,1950,3,19,Indianapolis 500,5/30/50,,http://en.wikipedia.org/wiki/1950_Indianapolis...,3,1950,5,30,1950-5-30
3,836,1950,4,66,Swiss Grand Prix,6/4/50,,http://en.wikipedia.org/wiki/1950_Swiss_Grand_...,4,1950,6,4,1950-6-4
4,837,1950,5,13,Belgian Grand Prix,6/18/50,,http://en.wikipedia.org/wiki/1950_Belgian_Gran...,5,1950,6,18,1950-6-18


In [85]:
races["prixDate"] = pd.to_datetime(races["prixDate"])
races["prixName"] = races["name"]

In [86]:
races.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018 entries, 0 to 1017
Data columns (total 14 columns):
raceId        1018 non-null int64
year          1018 non-null int64
round         1018 non-null int64
circuitId     1018 non-null int64
name          1018 non-null object
date          1018 non-null object
time          287 non-null object
url           1018 non-null object
raceId2       1018 non-null int64
race_year     1018 non-null int64
race_month    1018 non-null int64
race_day      1018 non-null int64
prixDate      1018 non-null datetime64[ns]
prixName      1018 non-null object
dtypes: datetime64[ns](1), int64(8), object(5)
memory usage: 111.4+ KB
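
As an aside, the string-building step could probably be skipped: `pd.to_datetime` can assemble dates directly from a frame whose columns are named `year`, `month` and `day`. A minimal sketch, assuming the same `race_year`, `race_month` and `race_day` columns (`races_toy` and `parts` are hypothetical names used only for this example):

```python
import pandas as pd

# Toy stand-in for the races table (two rows taken from the output above).
races_toy = pd.DataFrame({
    "race_year": [1950, 1950],
    "race_month": [5, 6],
    "race_day": [13, 4],
})

# to_datetime expects the component columns to be called year/month/day,
# so rename a temporary copy instead of touching the original columns.
parts = races_toy.rename(
    columns={"race_year": "year", "race_month": "month", "race_day": "day"}
)
races_toy["prixDate"] = pd.to_datetime(parts[["year", "month", "day"]])

print(races_toy["prixDate"])
```

Either way, building the date from the separate columns avoids parsing the original `date` column, whose two-digit years (for example `5/13/50`) can be ambiguous.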


In [87]:
races2 = races[["raceId2", "year", "round", "circuitId", "prixName", "prixDate"]]

In [88]:
races2.head(20)

Unnamed: 0,raceId2,year,round,circuitId,prixName,prixDate
0,1,1950,1,9,British Grand Prix,1950-05-13
1,2,1950,2,6,Monaco Grand Prix,1950-05-21
2,3,1950,3,19,Indianapolis 500,1950-05-30
3,4,1950,4,66,Swiss Grand Prix,1950-06-04
4,5,1950,5,13,Belgian Grand Prix,1950-06-18
5,6,1950,6,55,French Grand Prix,1950-07-02
6,7,1950,7,14,Italian Grand Prix,1950-09-03
7,8,1951,1,66,Swiss Grand Prix,1951-05-27
8,9,1951,2,19,Indianapolis 500,1951-05-30
9,10,1951,3,13,Belgian Grand Prix,1951-06-17


In [89]:
races2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018 entries, 0 to 1017
Data columns (total 6 columns):
raceId2      1018 non-null int64
year         1018 non-null int64
round        1018 non-null int64
circuitId    1018 non-null int64
prixName     1018 non-null object
prixDate     1018 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 47.8+ KB


In [90]:
results_pre2 = results2.merge(drivers2, on="driverId2").merge(races2, on="raceId2")

In [91]:
results_pre2.head()

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId,driverRef,driverName,year,round,circuitId,prixName,prixDate
0,1,58,51,4,3,3,4.0,1,reg_parnell,Reg Parnell,1950,1,9,British Grand Prix,1950-05-13
1,1,20,51,1,1,1,9.0,1,farina,Nino Farina,1950,1,9,British Grand Prix,1950-05-13
2,1,3,51,2,2,2,6.0,1,fagioli,Luigi Fagioli,1950,1,9,British Grand Prix,1950-05-13
3,1,142,151,10,R,20,0.0,6,rolt,Tony Rolt,1950,1,9,British Grand Prix,1950-05-13
4,1,105,105,20,10,10,0.0,16,fry,Joe Fry,1950,1,9,British Grand Prix,1950-05-13


In [92]:
results_pre2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24277 entries, 0 to 24276
Data columns (total 15 columns):
raceId2          24277 non-null int64
driverId2        24277 non-null int64
constructorId    24277 non-null int64
grid             24277 non-null int64
positionText     24277 non-null object
positionOrder    24277 non-null int64
points           24277 non-null float64
statusId         24277 non-null int64
driverRef        24277 non-null object
driverName       24277 non-null object
year             24277 non-null int64
round            24277 non-null int64
circuitId        24277 non-null int64
prixName         24277 non-null object
prixDate         24277 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(9), object(4)
memory usage: 3.0+ MB


In [93]:
results_pre1.head()

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId,driverRef,driverName
0,1,58,51,4,3,3,4.0,1,reg_parnell,Reg Parnell
1,6,58,105,11,R,13,0.0,5,reg_parnell,Reg Parnell
2,11,58,6,9,4,4,3.0,14,reg_parnell,Reg Parnell
3,12,58,66,20,5,5,2.0,15,reg_parnell,Reg Parnell
4,14,58,66,8,W,21,0.0,54,reg_parnell,Reg Parnell


### Getting Constructor Information


In [94]:
constructors = pd.read_csv("../data/f1db_working/constructors.csv")

In [95]:
constructors.head()

Unnamed: 0,constructorId,constructorRef,name,nationality,url
0,1,mclaren,McLaren,British,http://en.wikipedia.org/wiki/McLaren
1,2,bmw_sauber,BMW Sauber,German,http://en.wikipedia.org/wiki/BMW_Sauber
2,3,williams,Williams,British,http://en.wikipedia.org/wiki/Williams_Grand_Pr...
3,4,renault,Renault,French,http://en.wikipedia.org/wiki/Renault_in_Formul...
4,5,toro_rosso,Toro Rosso,Italian,http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso


In [96]:
constructors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 5 columns):
constructorId     209 non-null int64
constructorRef    209 non-null object
name              209 non-null object
nationality       209 non-null object
url               209 non-null object
dtypes: int64(1), object(4)
memory usage: 8.2+ KB


In [97]:
constructors["constructorNationality"] = constructors["nationality"]
constructors["constructorName"] = constructors["name"]

In [98]:
constructors.head()

Unnamed: 0,constructorId,constructorRef,name,nationality,url,constructorNationality,constructorName
0,1,mclaren,McLaren,British,http://en.wikipedia.org/wiki/McLaren,British,McLaren
1,2,bmw_sauber,BMW Sauber,German,http://en.wikipedia.org/wiki/BMW_Sauber,German,BMW Sauber
2,3,williams,Williams,British,http://en.wikipedia.org/wiki/Williams_Grand_Pr...,British,Williams
3,4,renault,Renault,French,http://en.wikipedia.org/wiki/Renault_in_Formul...,French,Renault
4,5,toro_rosso,Toro Rosso,Italian,http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso,Italian,Toro Rosso


In [99]:
constructors2 = constructors[["constructorId","constructorRef", "constructorName"]]

In [100]:
constructors2.head()

Unnamed: 0,constructorId,constructorRef,constructorName
0,1,mclaren,McLaren
1,2,bmw_sauber,BMW Sauber
2,3,williams,Williams
3,4,renault,Renault
4,5,toro_rosso,Toro Rosso
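
Copying `name` into a new `constructorName` column and then selecting it works, but the same result could be reached in one step with `rename`, which leaves the original frame untouched. A small sketch on toy data (`constructors_toy` and `constructors2_toy` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for the constructors table.
constructors_toy = pd.DataFrame({
    "constructorId": [1, 3],
    "constructorRef": ["mclaren", "williams"],
    "name": ["McLaren", "Williams"],
    "nationality": ["British", "British"],
})

# Select the columns needed for the merge, then rename the generic "name"
# column so it cannot clash with "name" columns from other tables.
constructors2_toy = (
    constructors_toy[["constructorId", "constructorRef", "name"]]
    .rename(columns={"name": "constructorName"})
)

print(constructors2_toy.columns.tolist())
```

The same pattern would apply to `prixName` and `circuitName`.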


In [101]:
results_pre3 = results2.merge(drivers2, on="driverId2").merge(races2, on="raceId2").merge(constructors2, on="constructorId")

In [102]:
results_pre3.head()

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId,driverRef,driverName,year,round,circuitId,prixName,prixDate,constructorRef,constructorName
0,1,58,51,4,3,3,4.0,1,reg_parnell,Reg Parnell,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo
1,1,20,51,1,1,1,9.0,1,farina,Nino Farina,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo
2,1,3,51,2,2,2,6.0,1,fagioli,Luigi Fagioli,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo
3,1,56,51,3,R,12,0.0,44,fangio,Juan Fangio,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo
4,6,20,51,2,7,7,0.0,48,farina,Nino Farina,1950,6,55,French Grand Prix,1950-07-02,alfa,Alfa Romeo


### Getting Circuit Information

Now the only thing still missing is the circuit information.

In [103]:
circuits = pd.read_csv("../data/f1db_working/circuits.csv")

In [104]:
circuits.head()

Unnamed: 0,circuitId,circuitRef,name,location,country,lat,lng,alt,url
0,1,albert_park,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1,2,sepang,Sepang International Circuit,Kuala Lumpur,Malaysia,2.76083,101.738,,http://en.wikipedia.org/wiki/Sepang_Internatio...
2,3,bahrain,Bahrain International Circuit,Sakhir,Bahrain,26.0325,50.5106,,http://en.wikipedia.org/wiki/Bahrain_Internati...
3,4,catalunya,Circuit de Barcelona-Catalunya,Montmeló,Spain,41.57,2.26111,,http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4,5,istanbul,Istanbul Park,Istanbul,Turkey,40.9517,29.405,,http://en.wikipedia.org/wiki/Istanbul_Park


In [105]:
circuits["circuitName"] = circuits.name

In [106]:
circuits.head()

Unnamed: 0,circuitId,circuitRef,name,location,country,lat,lng,alt,url,circuitName
0,1,albert_park,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,http://en.wikipedia.org/wiki/Melbourne_Grand_P...,Albert Park Grand Prix Circuit
1,2,sepang,Sepang International Circuit,Kuala Lumpur,Malaysia,2.76083,101.738,,http://en.wikipedia.org/wiki/Sepang_Internatio...,Sepang International Circuit
2,3,bahrain,Bahrain International Circuit,Sakhir,Bahrain,26.0325,50.5106,,http://en.wikipedia.org/wiki/Bahrain_Internati...,Bahrain International Circuit
3,4,catalunya,Circuit de Barcelona-Catalunya,Montmeló,Spain,41.57,2.26111,,http://en.wikipedia.org/wiki/Circuit_de_Barcel...,Circuit de Barcelona-Catalunya
4,5,istanbul,Istanbul Park,Istanbul,Turkey,40.9517,29.405,,http://en.wikipedia.org/wiki/Istanbul_Park,Istanbul Park


In [107]:
circuits2 = circuits[["circuitId", "circuitRef", "circuitName"]]

In [108]:
circuits2.head()

Unnamed: 0,circuitId,circuitRef,circuitName
0,1,albert_park,Albert Park Grand Prix Circuit
1,2,sepang,Sepang International Circuit
2,3,bahrain,Bahrain International Circuit
3,4,catalunya,Circuit de Barcelona-Catalunya
4,5,istanbul,Istanbul Park


In [109]:
results_pre4 = results2.merge(drivers2, on="driverId2").merge(races2, on="raceId2").merge(constructors2, on="constructorId").merge(circuits2, on="circuitId")

In [110]:
results_pre4.head()

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId,driverRef,driverName,year,round,circuitId,prixName,prixDate,constructorRef,constructorName,circuitRef,circuitName
0,1,58,51,4,3,3,4.0,1,reg_parnell,Reg Parnell,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit
1,1,20,51,1,1,1,9.0,1,farina,Nino Farina,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit
2,1,3,51,2,2,2,6.0,1,fagioli,Luigi Fagioli,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit
3,1,56,51,3,R,12,0.0,44,fangio,Juan Fangio,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit
4,12,20,51,3,R,14,1.0,8,farina,Nino Farina,1951,5,9,British Grand Prix,1951-07-14,alfa,Alfa Romeo,silverstone,Silverstone Circuit


### Getting Status IDs

Next, I'll add the status description to each result:

In [111]:
status = pd.read_csv("../data/f1db_working/status.csv")

In [112]:
status.head()

Unnamed: 0,statusId,status
0,1,Finished
1,2,Disqualified
2,3,Accident
3,4,Collision
4,5,Engine


In [113]:
status.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 2 columns):
statusId    135 non-null int64
status      135 non-null object
dtypes: int64(1), object(1)
memory usage: 2.2+ KB


In [114]:
results_pre4.merge(status, on="statusId")

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId,driverRef,driverName,year,round,circuitId,prixName,prixDate,constructorRef,constructorName,circuitRef,circuitName,status
0,1,58,51,4,3,3,4.00,1,reg_parnell,Reg Parnell,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit,Finished
1,1,20,51,1,1,1,9.00,1,farina,Nino Farina,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit,Finished
2,1,3,51,2,2,2,6.00,1,fagioli,Luigi Fagioli,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit,Finished
3,12,56,51,2,2,2,6.00,1,fangio,Juan Fangio,1951,5,9,British Grand Prix,1951-07-14,alfa,Alfa Romeo,silverstone,Silverstone Circuit,Finished
4,29,56,105,4,2,2,6.00,1,fangio,Juan Fangio,1953,6,9,British Grand Prix,1953-07-18,maserati,Maserati,silverstone,Silverstone Circuit,Finished
5,12,204,6,1,1,1,8.00,1,gonzalez,José Froilán González,1951,5,9,British Grand Prix,1951-07-14,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
6,20,139,6,2,1,1,9.00,1,ascari,Alberto Ascari,1952,5,9,British Grand Prix,1952-07-19,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
7,37,204,6,2,1,1,8.14,1,gonzalez,José Froilán González,1954,5,9,British Grand Prix,1954-07-17,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
8,37,316,6,3,2,2,6.14,1,hawthorn,Mike Hawthorn,1954,5,9,British Grand Prix,1954-07-17,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
9,29,139,6,1,1,1,8.50,1,ascari,Alberto Ascari,1953,6,9,British Grand Prix,1953-07-18,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished


In [115]:
results_pre4 = results_pre4.merge(status, on="statusId")

In [116]:
results_pre4.head(10)

Unnamed: 0,raceId2,driverId2,constructorId,grid,positionText,positionOrder,points,statusId,driverRef,driverName,year,round,circuitId,prixName,prixDate,constructorRef,constructorName,circuitRef,circuitName,status
0,1,58,51,4,3,3,4.0,1,reg_parnell,Reg Parnell,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit,Finished
1,1,20,51,1,1,1,9.0,1,farina,Nino Farina,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit,Finished
2,1,3,51,2,2,2,6.0,1,fagioli,Luigi Fagioli,1950,1,9,British Grand Prix,1950-05-13,alfa,Alfa Romeo,silverstone,Silverstone Circuit,Finished
3,12,56,51,2,2,2,6.0,1,fangio,Juan Fangio,1951,5,9,British Grand Prix,1951-07-14,alfa,Alfa Romeo,silverstone,Silverstone Circuit,Finished
4,29,56,105,4,2,2,6.0,1,fangio,Juan Fangio,1953,6,9,British Grand Prix,1953-07-18,maserati,Maserati,silverstone,Silverstone Circuit,Finished
5,12,204,6,1,1,1,8.0,1,gonzalez,José Froilán González,1951,5,9,British Grand Prix,1951-07-14,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
6,20,139,6,2,1,1,9.0,1,ascari,Alberto Ascari,1952,5,9,British Grand Prix,1952-07-19,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
7,37,204,6,2,1,1,8.14,1,gonzalez,José Froilán González,1954,5,9,British Grand Prix,1954-07-17,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
8,37,316,6,3,2,2,6.14,1,hawthorn,Mike Hawthorn,1954,5,9,British Grand Prix,1954-07-17,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
9,29,139,6,1,1,1,8.5,1,ascari,Alberto Ascari,1953,6,9,British Grand Prix,1953-07-18,ferrari,Ferrari,silverstone,Silverstone Circuit,Finished
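
Because `status` is a simple two-column lookup, `Series.map` might be a lighter alternative to a merge here. Unlike the merge, it keeps the rows in their existing order. A minimal sketch on toy data (`results_toy`, `status_toy` and `lookup` are hypothetical names; the id-to-label pairs are the ones visible in the outputs above):

```python
import pandas as pd

# Toy stand-ins for the results and status tables.
results_toy = pd.DataFrame({
    "driverName": ["Nino Farina", "Yves Cabantous", "Luigi Fagioli"],
    "statusId": [1, 12, 1],
})
status_toy = pd.DataFrame({
    "statusId": [1, 2, 12],
    "status": ["Finished", "Disqualified", "+2 Laps"],
})

# Turn the lookup table into a Series indexed by statusId, then map each
# result's statusId to its label. Ids missing from the lookup become NaN.
lookup = status_toy.set_index("statusId")["status"]
results_toy["status"] = results_toy["statusId"].map(lookup)

print(results_toy)
```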


In [117]:
results_pre4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24277 entries, 0 to 24276
Data columns (total 20 columns):
raceId2            24277 non-null int64
driverId2          24277 non-null int64
constructorId      24277 non-null int64
grid               24277 non-null int64
positionText       24277 non-null object
positionOrder      24277 non-null int64
points             24277 non-null float64
statusId           24277 non-null int64
driverRef          24277 non-null object
driverName         24277 non-null object
year               24277 non-null int64
round              24277 non-null int64
circuitId          24277 non-null int64
prixName           24277 non-null object
prixDate           24277 non-null datetime64[ns]
constructorRef     24277 non-null object
constructorName    24277 non-null object
circuitRef         24277 non-null object
circuitName        24277 non-null object
status             24277 non-null object
dtypes: datetime64[ns](1), float64(1), int64(9), object(9)
memory usage

In [118]:
results_final = results_pre4[["raceId2", "prixName", "year", "round", "prixDate", "constructorName", "driverName",
                              "grid","positionText","positionOrder","points","status"]].sort_values(["raceId2","positionOrder"])

In [119]:
results_final.head(10)

Unnamed: 0,raceId2,prixName,year,round,prixDate,constructorName,driverName,grid,positionText,positionOrder,points,status
1,1,British Grand Prix,1950,1,1950-05-13,Alfa Romeo,Nino Farina,1,1,1,9.0,Finished
2,1,British Grand Prix,1950,1,1950-05-13,Alfa Romeo,Luigi Fagioli,2,2,2,6.0,Finished
0,1,British Grand Prix,1950,1,1950-05-13,Alfa Romeo,Reg Parnell,4,3,3,4.0,Finished
16272,1,British Grand Prix,1950,1,1950-05-13,Talbot-Lago,Yves Cabantous,6,4,4,3.0,+2 Laps
16273,1,British Grand Prix,1950,1,1950-05-13,Talbot-Lago,Louis Rosier,9,5,5,2.0,+2 Laps
6569,1,British Grand Prix,1950,1,1950-05-13,ERA,Bob Gerard,13,6,6,0.0,+3 Laps
6568,1,British Grand Prix,1950,1,1950-05-13,ERA,Cuth Harrison,15,7,7,0.0,+3 Laps
18116,1,British Grand Prix,1950,1,1950-05-13,Talbot-Lago,Philippe Étancelin,14,8,8,0.0,+5 Laps
7291,1,British Grand Prix,1950,1,1950-05-13,Maserati,David Hampshire,16,9,9,0.0,+6 Laps
7290,1,British Grand Prix,1950,1,1950-05-13,Maserati,Joe Fry,20,10,10,0.0,+6 Laps


In [120]:
results_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24277 entries, 1 to 20564
Data columns (total 12 columns):
raceId2            24277 non-null int64
prixName           24277 non-null object
year               24277 non-null int64
round              24277 non-null int64
prixDate           24277 non-null datetime64[ns]
constructorName    24277 non-null object
driverName         24277 non-null object
grid               24277 non-null int64
positionText       24277 non-null object
positionOrder      24277 non-null int64
points             24277 non-null float64
status             24277 non-null object
dtypes: datetime64[ns](1), float64(1), int64(5), object(5)
memory usage: 2.4+ MB
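
The index of `results_final` still carries the row labels left over from the merges (1, 2, 0, 16272, ...), which is why `info()` reports it as running from 1 to 20564. If a clean 0-based index is wanted after sorting, `reset_index(drop=True)` is one option. A sketch on toy data (`merged_toy` and `tidy` are hypothetical names; the rows are a few finishers from the outputs above):

```python
import pandas as pd

# Toy frame in the scrambled order a chain of merges can produce.
merged_toy = pd.DataFrame({
    "raceId2": [1, 12, 1, 12, 1],
    "positionOrder": [3, 2, 1, 1, 2],
    "driverName": [
        "Reg Parnell", "Juan Fangio", "Nino Farina",
        "José Froilán González", "Luigi Fagioli",
    ],
})

# Sort by race and finishing order, then discard the old row labels so the
# index runs 0..n-1 in the new order.
tidy = (
    merged_toy
    .sort_values(["raceId2", "positionOrder"])
    .reset_index(drop=True)
)

print(tidy)
```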


In [121]:
results_final.to_csv("../data/f1db_results.csv", index=False)
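
One thing to keep in mind when this file is loaded again: CSV has no date type, so `prixDate` will come back as plain text unless `read_csv` is told to parse it. A minimal sketch of the round trip, using an in-memory buffer instead of a real file (`results_toy`, `buffer`, `plain` and `parsed` are hypothetical names):

```python
import io

import pandas as pd

# Toy frame with a datetime column, like prixDate in results_final.
results_toy = pd.DataFrame({
    "prixName": ["British Grand Prix", "Monaco Grand Prix"],
    "prixDate": pd.to_datetime(["1950-05-13", "1950-05-21"]),
})

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
results_toy.to_csv(buffer, index=False)

# Reading without parse_dates gives the dates back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["prixDate"])

print(plain["prixDate"].dtype, parsed["prixDate"].dtype)
```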

Now let's pull a single season out of this table:

In [122]:
test1 = results_final[results_final.year == 1991]

In [123]:
test1.head()

Unnamed: 0,raceId2,prixName,year,round,prixDate,constructorName,driverName,grid,positionText,positionOrder,points,status
5149,501,United States Grand Prix,1991,1,1991-03-10,McLaren,Ayrton Senna,1,1,1,10.0,Finished
5145,501,United States Grand Prix,1991,1,1991-03-10,Ferrari,Alain Prost,2,2,2,6.0,Finished
5156,501,United States Grand Prix,1991,1,1991-03-10,Benetton,Nelson Piquet,5,3,3,4.0,Finished
5151,501,United States Grand Prix,1991,1,1991-03-10,Tyrrell,Stefano Modena,11,4,4,3.0,Finished
10710,501,United States Grand Prix,1991,1,1991-03-10,Tyrrell,Satoru Nakajima,16,5,5,2.0,+1 Lap


In [124]:
test1.to_csv("../data/f1db_results1991.csv", index=False)

Now that I have the 1991 season in its own file, I'll work with it to figure out the timeline.
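
One possible way to get such a timeline, assuming it means each driver's running points total after every round, is a grouped cumulative sum followed by a pivot. The sketch below uses made-up drivers and points (`season_toy`, `cumPoints` and `timeline` are hypothetical names), and it ignores any season-specific rules about which results count towards the championship:

```python
import pandas as pd

# Toy season: two made-up drivers over three rounds.
season_toy = pd.DataFrame({
    "round": [1, 1, 2, 2, 3, 3],
    "driverName": ["Driver A", "Driver B"] * 3,
    "points": [10.0, 6.0, 6.0, 10.0, 10.0, 0.0],
})

# Order by round so the cumulative sum follows the calendar, then keep a
# running total per driver.
season_toy = season_toy.sort_values(["round", "driverName"])
season_toy["cumPoints"] = season_toy.groupby("driverName")["points"].cumsum()

# One row per round, one column per driver.
timeline = season_toy.pivot(index="round", columns="driverName", values="cumPoints")

print(timeline)
```

A driver who is absent from a round would show up as `NaN` in `timeline`; forward-filling those gaps would be a reasonable next step.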

In [125]:
test2 = results_final[results_final.driverName == "Ayrton Senna"]

In [126]:
test2.head()

Unnamed: 0,raceId2,prixName,year,round,prixDate,constructorName,driverName,grid,positionText,positionOrder,points,status
11436,389,Brazilian Grand Prix,1984,1,1984-03-25,Toleman,Ayrton Senna,16,R,26,0.0,Turbo
6862,390,South African Grand Prix,1984,2,1984-04-07,Toleman,Ayrton Senna,13,6,6,1.0,+3 Laps
16813,391,Belgian Grand Prix,1984,3,1984-04-29,Toleman,Ayrton Senna,19,6,6,1.0,+2 Laps
23103,392,San Marino Grand Prix,1984,4,1984-05-06,Toleman,Ayrton Senna,0,F,28,0.0,Did not qualify
11352,393,French Grand Prix,1984,5,1984-05-20,Toleman,Ayrton Senna,13,R,19,0.0,Turbo


In [127]:
test2.to_csv("../data/f1db_senna.csv", index=False)