# Looking for the best Formula 1 season

For my master's project, I'm working on a piece that tries to answer the question: **Which championship-winning team had the best Formula 1 season?**

To do this, I was working with data provided by the [Ergast Developer API](https://ergast.com/mrd/). I noticed an error in the driver-constructor pairing for the 1950 season and wanted to verify things before moving forward. My original plan was to create a table of the driver-constructor pairs for each race and then compare it with the data I had.

Instead, I've chosen to go straight to the source for F1 information ([formula1.com](https://formula1.com)) and scrape the results for each race. I did this scraping on 2019-06-21 and 2019-06-22, and I'll now be working with that data for my analysis.

Because the data comes from a primary source, I have more confidence in it.

In [1]:
import pandas as pd
import numpy as np

The first thing to do is to import the data:

In [2]:
race_results = pd.read_csv("../formula1-data/results_all.csv")

In [3]:
race_results.head()

Unnamed: 0,raceId,year,raceRound,date,prix,driverFirstName,driverLastName,driverCode,constructor,finishingPosition,positionOrder,laps,time,points
0,1,1950,1,13 May 1950,Great Britain,Nino,Farina,FAR,Alfa Romeo,1,1,70.0,2:13:23.600,9.0
1,1,1950,1,13 May 1950,Great Britain,Luigi,Fagioli,FAG,Alfa Romeo,2,2,70.0,+2.600s,6.0
2,1,1950,1,13 May 1950,Great Britain,Reg,Parnell,PAR,Alfa Romeo,3,3,70.0,+52.000s,4.0
3,1,1950,1,13 May 1950,Great Britain,Yves Giraud,Cabantous,CAB,Talbot-Lago,4,4,68.0,+2 laps,3.0
4,1,1950,1,13 May 1950,Great Britain,Louis,Rosier,ROS,Talbot-Lago,5,5,68.0,+2 laps,2.0


In [4]:
race_results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22598 entries, 0 to 22597
Data columns (total 14 columns):
raceId               22598 non-null int64
year                 22598 non-null int64
raceRound            22598 non-null int64
date                 22598 non-null object
prix                 22598 non-null object
driverFirstName      22598 non-null object
driverLastName       22598 non-null object
driverCode           22598 non-null object
constructor          22572 non-null object
finishingPosition    22598 non-null object
positionOrder        22598 non-null int64
laps                 22365 non-null float64
time                 22590 non-null object
points               22598 non-null float64
dtypes: float64(2), int64(4), object(8)
memory usage: 2.4+ MB


Let's also check how many races we have:

In [5]:
race_results.raceId.max()

1007
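A caveat on this check: `max()` returns the highest `raceId`, which equals the number of races only if the IDs run from 1 upward with no gaps. A minimal sketch of a cross-check, using a toy table (`toy_results` and its values are made up for illustration, not taken from the scraped file):

```python
import pandas as pd

# Toy stand-in for race_results: one row per driver per race.
# The IDs deliberately skip 3 to show the difference.
toy_results = pd.DataFrame({
    "raceId": [1, 1, 2, 2, 4],
    "year": [1950, 1950, 1950, 1950, 1950],
})

highest_id = toy_results.raceId.max()          # largest ID present
distinct_races = toy_results.raceId.nunique()  # number of different races

# The two numbers agree only when the IDs are consecutive from 1.
ids_are_consecutive = highest_id == distinct_races
print(highest_id, distinct_races, ids_are_consecutive)
```

Running the same comparison on `race_results` would confirm whether 1007 is really the race count.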

Things seem to be in good order. For most of the work, I won't need a few of the columns, namely:

* laps
* points
* driverCode
* date

So, let's drop those. I'll also leave out 2019, since that season was still under way when I scraped the data:

In [6]:
results = race_results.copy().drop(columns=["date", "laps", "points", "driverCode"])
results = results[results.year < 2019]

In [7]:
results.head()

Unnamed: 0,raceId,year,raceRound,prix,driverFirstName,driverLastName,constructor,finishingPosition,positionOrder,time
0,1,1950,1,Great Britain,Nino,Farina,Alfa Romeo,1,1,2:13:23.600
1,1,1950,1,Great Britain,Luigi,Fagioli,Alfa Romeo,2,2,+2.600s
2,1,1950,1,Great Britain,Reg,Parnell,Alfa Romeo,3,3,+52.000s
3,1,1950,1,Great Britain,Yves Giraud,Cabantous,Talbot-Lago,4,4,+2 laps
4,1,1950,1,Great Britain,Louis,Rosier,Talbot-Lago,5,5,+2 laps


For most of my analysis, I'm looking only at the teams that won championships, so let's slice the results table and keep only the championship-winning runs. I scraped more of the F1 site to find the team that each season's winning driver was part of, and I'm working with that list.

In [9]:
teams = pd.read_csv("../formula1-data/winners-clean.csv")

In [11]:
teams

Unnamed: 0,year,constructor,constructor_clean
0,1950,Alfa Romeo,Alfa Romeo
1,1951,Alfa Romeo,Alfa Romeo
2,1952,Ferrari,Ferrari
3,1953,Ferrari,Ferrari
4,1954,Mercedes-Benz,Mercedes
5,1955,Mercedes-Benz,Mercedes
6,1956,Ferrari,Ferrari
7,1957,Maserati,Maserati
8,1958,Ferrari,Ferrari
9,1959,Cooper Climax,Cooper


One thing about the teams is that they change names over the years (look at McLaren, for instance), and often their name is the constructor followed by the engine they're using in a particular year.

To account for this, I created a column of clean names that keeps the constructor consistent across the years. To slice my results data, I will merge this table into the results table and keep only the rows that match on both `year` and `constructor`.
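Here is a minimal sketch of that merge-and-filter idea on toy data. The two small tables are illustrative stand-ins for `results` and `teams`, not the scraped files. The inner merge at the end is an alternative I'm assuming would serve the same purpose; the notebook itself uses the left merge with `indicator=True`.

```python
import pandas as pd

# Toy stand-ins for the results and teams tables (illustrative rows only).
toy_results = pd.DataFrame({
    "year": [1950, 1950, 1959],
    "constructor": ["Alfa Romeo", "Talbot-Lago", "Cooper Climax"],
    "driverLastName": ["Farina", "Rosier", "Brabham"],
})
toy_teams = pd.DataFrame({
    "year": [1950, 1959],
    "constructor": ["Alfa Romeo", "Cooper Climax"],
    "constructor_clean": ["Alfa Romeo", "Cooper"],
})

# A left merge keeps every result row; _merge records whether a row
# found a partner in toy_teams ("both") or not ("left_only").
merged = pd.merge(toy_results, toy_teams, how="left",
                  on=["year", "constructor"], indicator=True)
champions_only = merged[merged["_merge"] == "both"]

# An inner merge keeps the same rows without the extra indicator column.
inner = pd.merge(toy_results, toy_teams, how="inner",
                 on=["year", "constructor"])

print(champions_only[["year", "driverLastName", "constructor_clean"]])
```

The left merge with the indicator is handy while checking the data, because the `left_only` rows show exactly which results were dropped.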

In [12]:
comparison = pd.merge(results, teams, how="left", on=["year", "constructor"], indicator=True)

In [13]:
comparison.head()

Unnamed: 0,raceId,year,raceRound,prix,driverFirstName,driverLastName,constructor,finishingPosition,positionOrder,time,constructor_clean,_merge
0,1,1950,1,Great Britain,Nino,Farina,Alfa Romeo,1,1,2:13:23.600,Alfa Romeo,both
1,1,1950,1,Great Britain,Luigi,Fagioli,Alfa Romeo,2,2,+2.600s,Alfa Romeo,both
2,1,1950,1,Great Britain,Reg,Parnell,Alfa Romeo,3,3,+52.000s,Alfa Romeo,both
3,1,1950,1,Great Britain,Yves Giraud,Cabantous,Talbot-Lago,4,4,+2 laps,,left_only
4,1,1950,1,Great Britain,Louis,Rosier,Talbot-Lago,5,5,+2 laps,,left_only


In [21]:
results_slice = comparison[comparison._merge == "both"]
results_slice.head(22)

Unnamed: 0,raceId,year,raceRound,prix,driverFirstName,driverLastName,constructor,finishingPosition,positionOrder,time,constructor_clean,_merge
0,1,1950,1,Great Britain,Nino,Farina,Alfa Romeo,1,1,2:13:23.600,Alfa Romeo,both
1,1,1950,1,Great Britain,Luigi,Fagioli,Alfa Romeo,2,2,+2.600s,Alfa Romeo,both
2,1,1950,1,Great Britain,Reg,Parnell,Alfa Romeo,3,3,+52.000s,Alfa Romeo,both
12,1,1950,1,Great Britain,Juan Manuel,Fangio,Alfa Romeo,NC,13,DNF,Alfa Romeo,both
23,2,1950,2,Monaco,Juan Manuel,Fangio,Alfa Romeo,1,1,3:13:18.700,Alfa Romeo,both
33,2,1950,2,Monaco,Luigi,Fagioli,Alfa Romeo,NC,11,DNF,Alfa Romeo,both
34,2,1950,2,Monaco,Nino,Farina,Alfa Romeo,NC,12,DNF,Alfa Romeo,both
78,4,1950,4,Switzerland,Nino,Farina,Alfa Romeo,1,1,2:02:53.700,Alfa Romeo,both
79,4,1950,4,Switzerland,Luigi,Fagioli,Alfa Romeo,2,2,+0.400s,Alfa Romeo,both
89,4,1950,4,Switzerland,Juan Manuel,Fangio,Alfa Romeo,NC,12,DNF,Alfa Romeo,both


Now that I've done this, I can drop the old `constructor` column and the newly added `_merge` column:

In [22]:
results2 = results_slice.drop(columns=["constructor", "_merge"]) 

In [23]:
results2.head()

Unnamed: 0,raceId,year,raceRound,prix,driverFirstName,driverLastName,finishingPosition,positionOrder,time,constructor_clean
0,1,1950,1,Great Britain,Nino,Farina,1,1,2:13:23.600,Alfa Romeo
1,1,1950,1,Great Britain,Luigi,Fagioli,2,2,+2.600s,Alfa Romeo
2,1,1950,1,Great Britain,Reg,Parnell,3,3,+52.000s,Alfa Romeo
12,1,1950,1,Great Britain,Juan Manuel,Fangio,NC,13,DNF,Alfa Romeo
23,2,1950,2,Monaco,Juan Manuel,Fangio,1,1,3:13:18.700,Alfa Romeo


In [24]:
results2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2311 entries, 0 to 22442
Data columns (total 10 columns):
raceId               2311 non-null int64
year                 2311 non-null int64
raceRound            2311 non-null int64
prix                 2311 non-null object
driverFirstName      2311 non-null object
driverLastName       2311 non-null object
finishingPosition    2311 non-null object
positionOrder        2311 non-null int64
time                 2311 non-null object
constructor_clean    2311 non-null object
dtypes: int64(4), object(6)
memory usage: 198.6+ KB


Now we can head into the analysis.

---

## Idea 01: Wins

The first way to look for an answer to the question is to count the wins each team had in its run.

In [25]:
wins = results2[results2.finishingPosition == "1"]

In [26]:
wins.head()

Unnamed: 0,raceId,year,raceRound,prix,driverFirstName,driverLastName,finishingPosition,positionOrder,time,constructor_clean
0,1,1950,1,Great Britain,Nino,Farina,1,1,2:13:23.600,Alfa Romeo
23,2,1950,2,Monaco,Juan Manuel,Fangio,1,1,3:13:18.700,Alfa Romeo
78,4,1950,4,Switzerland,Nino,Farina,1,1,2:02:53.700,Alfa Romeo
96,5,1950,5,Belgium,Juan Manuel,Fangio,1,1,2:47:26.000,Alfa Romeo
110,6,1950,6,France,Juan Manuel,Fangio,1,1,2:57:52.800,Alfa Romeo


In [27]:
wins.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 532 entries, 0 to 22438
Data columns (total 10 columns):
raceId               532 non-null int64
year                 532 non-null int64
raceRound            532 non-null int64
prix                 532 non-null object
driverFirstName      532 non-null object
driverLastName       532 non-null object
finishingPosition    532 non-null object
positionOrder        532 non-null int64
time                 532 non-null object
constructor_clean    532 non-null object
dtypes: int64(4), object(6)
memory usage: 45.7+ KB


In [28]:
wins.head(22)

Unnamed: 0,raceId,year,raceRound,prix,driverFirstName,driverLastName,finishingPosition,positionOrder,time,constructor_clean
0,1,1950,1,Great Britain,Nino,Farina,1,1,2:13:23.600,Alfa Romeo
23,2,1950,2,Monaco,Juan Manuel,Fangio,1,1,3:13:18.700,Alfa Romeo
78,4,1950,4,Switzerland,Nino,Farina,1,1,2:02:53.700,Alfa Romeo
96,5,1950,5,Belgium,Juan Manuel,Fangio,1,1,2:47:26.000,Alfa Romeo
110,6,1950,6,France,Juan Manuel,Fangio,1,1,2:57:52.800,Alfa Romeo
130,7,1950,7,Italy,Nino,Farina,1,1,2:51:17.400,Alfa Romeo
158,8,1951,1,Switzerland,Juan Manuel,Fangio,1,1,2:07:53.640,Alfa Romeo
213,10,1951,3,Belgium,Nino,Farina,1,1,2:45:46.200,Alfa Romeo
226,11,1951,4,France,Juan Manuel,Fangio,1,1,3:22:11.000,Alfa Romeo
227,11,1951,4,France,Luigi,Fagioli,1,2,SHC,Alfa Romeo


One thing about the earlier F1 races that I've noted in my previous analysis attempts is that there were a lot of shared drives. In this dataset, those rows are easier to find: in the `time` column, they have a value of `SHC`. In my calculations, I'll ignore these rows.
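As a rough sketch of how the shared-drive rows could be excluded before counting, here is a toy version of the `wins` table built from the four 1951 rows shown above. The names `toy_wins`, `solo_wins` and `wins_per_season` are placeholders, and grouping by `year` and `constructor_clean` is my assumption about how the wins would be tallied.

```python
import pandas as pd

# Toy stand-in for the wins table, built from the 1951 rows shown above.
toy_wins = pd.DataFrame({
    "year": [1951, 1951, 1951, 1951],
    "raceRound": [1, 3, 4, 4],
    "driverLastName": ["Fangio", "Farina", "Fangio", "Fagioli"],
    "time": ["2:07:53.640", "2:45:46.200", "3:22:11.000", "SHC"],
    "constructor_clean": ["Alfa Romeo"] * 4,
})

# Drop the shared-drive rows so that each race win is counted once.
solo_wins = toy_wins[toy_wins["time"] != "SHC"]

# Count the remaining wins per championship season and team.
wins_per_season = solo_wins.groupby(["year", "constructor_clean"]).size()
print(wins_per_season)
```

Without the filter, round 4 of 1951 would count as two wins for the same car.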