# Looking for the best Formula 1 season

For my master's project, I'm writing a piece that answers the question: **Which championship-winning team had the best Formula 1 season?**

To answer this question, I'll be checking three definitions of best:

1. most wins in a season
1. most podiums in a season
1. how close the performance was to perfect

To do this I started with data provided by the [Ergast Developer API](https://ergast.com/mrd/). I noticed an error in the driver-constructor pairing for the 1950 season and wanted to verify the data before moving forward. My original plan was to create a table of the driver-constructor pairs for each race and compare it with the data I had.

Instead I went straight to the source for F1 information, [formula1.com](https://formula1.com), and scraped the results for each race from 1950 to 2018. There were some holes in how disqualifications and withdrawals were recorded (or not recorded, in this case) for the earlier seasons.

Now I've gotten data from [statsf1.com](https://www.statsf1.com/), which is tabulated in an easy-to-understand manner, is more complete than the formula1.com data, and doesn't have the issues of the Ergast data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
race_results = pd.read_csv("../data/other/race_results_v2.csv")

In [3]:
race_results.head()

Unnamed: 0,race_id,year,round,race_name,position,order,driver,constructor,team,extra
0,1,1950,1,Britain,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 23.6s ( 146.378 km/h )
1,1,1950,1,Britain,2,2,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 26.2s ( +02.6s )
2,1,1950,1,Britain,3,3,Reg PARNELL,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 14m 15.6s ( +52.0s )
3,1,1950,1,Britain,4,4,Yves GIRAUD-CABANTOUS,Talbot Lago,Talbot Lago Talbot,
4,1,1950,1,Britain,5,5,Louis ROSIER,Talbot Lago,Talbot Lago Talbot,


Quickly check the number of races in this period (should be 997):

In [4]:
race_results.race_id.max()

997
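
Taking the maximum `race_id` only gives the race count if the ids run from 1 to N with no gaps. A minimal sketch of a slightly stricter check, on a toy frame (`toy_results` is a made-up stand-in, not the real data):

```python
import pandas as pd

# Toy stand-in for race_results: one row per driver per race (hypothetical values).
toy_results = pd.DataFrame({
    "race_id": [1, 1, 2, 2, 3],
    "year": [1950, 1950, 1950, 1950, 1950],
    "round": [1, 1, 2, 2, 3],
})

n_races = toy_results["race_id"].nunique()  # number of distinct races
max_id = toy_results["race_id"].max()       # largest id seen

# If ids run 1..N without gaps, the two numbers agree.
ids_are_contiguous = n_races == max_id
print(n_races, max_id, ids_are_contiguous)
```

If the two numbers disagree on the real data, some race ids are missing or duplicated under different numbers.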

I will work with a slice of this `race_results` DataFrame that only includes each team in its championship-winning season. Let's make that slice now:

In [5]:
winning_teams = pd.read_csv("../data/other/winning_teams_statsf1_v2.csv")

In [6]:
winning_teams.head()

Unnamed: 0,year,team,constructor
0,1950,Alfa Romeo Alfa Romeo,Alfa Romeo
1,1951,Alfa Romeo Alfa Romeo,Alfa Romeo
2,1994,Benetton Ford,Benetton
3,1995,Benetton Renault,Benetton
4,1983,Brabham BMW,Brabham


Now we combine this dataframe with `race_results`:

In [7]:
combine = pd.merge(race_results, winning_teams, how="left", on=["year", "constructor"], indicator="keep")

In [8]:
combine[combine.race_id == 1]

Unnamed: 0,race_id,year,round,race_name,position,order,driver,constructor,team_x,extra,team_y,keep
0,1,1950,1,Britain,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 23.6s ( 146.378 km/h ),Alfa Romeo Alfa Romeo,both
1,1,1950,1,Britain,2,2,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 26.2s ( +02.6s ),Alfa Romeo Alfa Romeo,both
2,1,1950,1,Britain,3,3,Reg PARNELL,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 14m 15.6s ( +52.0s ),Alfa Romeo Alfa Romeo,both
3,1,1950,1,Britain,4,4,Yves GIRAUD-CABANTOUS,Talbot Lago,Talbot Lago Talbot,,,left_only
4,1,1950,1,Britain,5,5,Louis ROSIER,Talbot Lago,Talbot Lago Talbot,,,left_only
5,1,1950,1,Britain,6,6,Bob GERARD,ERA,ERA ERA,,,left_only
6,1,1950,1,Britain,7,7,Cuth HARRISON,ERA,ERA ERA,,,left_only
7,1,1950,1,Britain,8,8,Philippe ETANCELIN,Talbot Lago,Talbot Lago Talbot,,,left_only
8,1,1950,1,Britain,9,9,David HAMPSHIRE,Maserati,Maserati Maserati,,,left_only
9,1,1950,1,Britain,10,10,Joe FRY,Maserati,Maserati Maserati,,,left_only


In [9]:
results = combine[combine.keep == "both"]

In [10]:
results[results.year == 1950]

Unnamed: 0,race_id,year,round,race_name,position,order,driver,constructor,team_x,extra,team_y,keep
0,1,1950,1,Britain,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 23.6s ( 146.378 km/h ),Alfa Romeo Alfa Romeo,both
1,1,1950,1,Britain,2,2,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 26.2s ( +02.6s ),Alfa Romeo Alfa Romeo,both
2,1,1950,1,Britain,3,3,Reg PARNELL,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 14m 15.6s ( +52.0s ),Alfa Romeo Alfa Romeo,both
12,1,1950,1,Britain,ab,12,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,Oil line,Alfa Romeo Alfa Romeo,both
25,2,1950,2,Monaco,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,3h 13m 18.7s ( 98.701 km/h ),Alfa Romeo Alfa Romeo,both
35,2,1950,2,Monaco,ab,11,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,Pile-up,Alfa Romeo Alfa Romeo,both
36,2,1950,2,Monaco,ab,12,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,Pile-up,Alfa Romeo Alfa Romeo,both
94,3,1950,3,Indianapolis,nq,41,Johnny MAURO,Alfa Romeo,Alfa Romeo Alfa Romeo,,Alfa Romeo Alfa Romeo,both
127,4,1950,4,Switzerland,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 02m 53.7s ( 149.279 km/h ),Alfa Romeo Alfa Romeo,both
128,4,1950,4,Switzerland,2,2,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 02m 54.1s ( +00.4s ),Alfa Romeo Alfa Romeo,both


And those are all the result records for Alfa Romeo in 1950.
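
The filtering step relies on the `indicator` argument of `pd.merge`, which adds a column saying whether each row matched. Here is a minimal sketch of the same pattern on toy data (driver names and frames are placeholders, not the real files):

```python
import pandas as pd

# Toy race results: two constructors in 1950, one in 1951 (placeholder drivers).
toy_results = pd.DataFrame({
    "year": [1950, 1950, 1951],
    "driver": ["Driver A", "Driver B", "Driver A"],
    "constructor": ["Alfa Romeo", "Talbot Lago", "Alfa Romeo"],
})

# Toy champions table: only the 1950 season is listed here.
toy_champions = pd.DataFrame({
    "year": [1950],
    "constructor": ["Alfa Romeo"],
})

# A left merge keeps every result row; the "keep" column records
# whether the (year, constructor) pair was found in the champions table.
merged = pd.merge(
    toy_results, toy_champions,
    how="left", on=["year", "constructor"], indicator="keep",
)

# Rows marked "both" belong to a team in its championship season.
champion_rows = merged[merged["keep"] == "both"]
print(champion_rows)
```

Only the 1950 Alfa Romeo row survives, because the 1951 season is not in the toy champions table.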

Now we can drop the `keep` and `team_y` columns, rename `team_x` back to `team`, save a copy of this data, and start doing the three analyses.

In [11]:
results = results.drop(columns=["keep", "team_y"]).rename(index=str, columns={"team_x":"team"})

In [12]:
results.to_csv("../data/output/race_results_champions.csv", index=False)

---

## Method 01: Wins

Let's compare championship seasons by how many wins each team got in their season.

We can look for wins by doing one of two things:

* pick all rows where `order == 1`
* pick all rows where `position == "1"`

In terms of wins, there were three races where two drivers shared first: 1951 French GP (Alfa Romeo), 1956 Argentine GP (Ferrari), and 1957 British GP (Vanwall).

For this analysis I care more that the constructor/team finished first than I do about it being a shared drive. By selecting rows using the position column, I also don't have to worry about shared drives.
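
As a defensive check, assuming a shared win could show up as two rows for the same race in some data source, counting distinct `race_id` values instead of rows avoids double counting. This is only a sketch on made-up rows (`toy_wins` and its drivers are hypothetical):

```python
import pandas as pd

# Hypothetical case: race 11 is a shared drive recorded as two winning rows.
toy_wins = pd.DataFrame({
    "race_id": [11, 11, 15],
    "year": [1951, 1951, 1951],
    "constructor": ["Alfa Romeo", "Alfa Romeo", "Alfa Romeo"],
    "driver": ["Driver A", "Driver B", "Driver C"],
    "position": ["1", "1", "1"],
})

by_season = toy_wins.groupby(["year", "constructor"])["race_id"]

rows_per_season = by_season.count()  # counts rows, so the shared win counts twice
races_won = by_season.nunique()      # counts distinct races, so it counts once
print(rows_per_season, races_won, sep="\n")
```

If both counts agree on the real data, shared drives are not inflating the win totals.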


In [13]:
wins = results[results.position == "1"]

In [14]:
wins.head(12)

Unnamed: 0,race_id,year,round,race_name,position,order,driver,constructor,team,extra
0,1,1950,1,Britain,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 23.6s ( 146.378 km/h )
25,2,1950,2,Monaco,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,3h 13m 18.7s ( 98.701 km/h )
127,4,1950,4,Switzerland,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 02m 53.7s ( 149.279 km/h )
150,5,1950,5,Belgium,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 47m 26s ( 177.097 km/h )
164,6,1950,6,France,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 57m 52.8s ( 168.729 km/h )
188,7,1950,7,Italy,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 51m 17.4s ( 176.543 km/h )
222,8,1951,1,Switzerland,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 07m 53.64s ( 143.444 km/h )
309,10,1951,3,Belgium,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 45m 46.2s ( 183.985 km/h )
325,11,1951,4,France,1,1,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,
433,15,1951,8,Spain,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 46m 54.10s ( 158.939 km/h )


In [15]:
wins.info()

<class 'pandas.core.frame.DataFrame'>
Index: 502 entries, 0 to 25185
Data columns (total 10 columns):
race_id        502 non-null int64
year           502 non-null int64
round          502 non-null int64
race_name      502 non-null object
position       502 non-null object
order          502 non-null int64
driver         502 non-null object
constructor    502 non-null object
team           502 non-null object
extra          502 non-null object
dtypes: int64(4), object(6)
memory usage: 43.1+ KB


In [16]:
wins[wins.year== 1951]

Unnamed: 0,race_id,year,round,race_name,position,order,driver,constructor,team,extra
222,8,1951,1,Switzerland,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 07m 53.64s ( 143.444 km/h )
309,10,1951,3,Belgium,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 45m 46.2s ( 183.985 km/h )
325,11,1951,4,France,1,1,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,
433,15,1951,8,Spain,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 46m 54.10s ( 158.939 km/h )


Now that we have the wins, we can group them and start counting:

In [17]:
wins_grouped = wins.groupby(["year", "constructor"])

In [18]:
win_count = wins_grouped.order.count().rename("wins")

In [19]:
win_count.sort_values(ascending=False).head(10)

year  constructor
2016  Mercedes       19
2015  Mercedes       16
2014  Mercedes       16
2002  Ferrari        15
1988  McLaren        15
2004  Ferrari        15
2013  Red Bull       13
1996  Williams       12
2017  Mercedes       12
2011  Red Bull       12
Name: wins, dtype: int64

We can turn this series into a dataframe, which is what we'll need for the later steps.

In [20]:
win_count = win_count.to_frame().reset_index()

In [21]:
win_count.sort_values(by="wins",ascending=False).head(10)

Unnamed: 0,year,constructor,wins
63,2016,Mercedes,19
62,2015,Mercedes,16
61,2014,Mercedes,16
52,2002,Ferrari,15
38,1988,McLaren,15
54,2004,Ferrari,15
60,2013,Red Bull,13
58,2011,Red Bull,12
46,1996,Williams,12
34,1984,McLaren,12


Things match up.

To compare seasons fairly we should also normalize by the number of races in each season. We'll compute the percentage of races won in each season:

In [22]:
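def races_in_season(row):
    # Number of races in the season, taken as the highest round number
    # among the champion team's rows for that year.
    season = results[results.year == int(row.year)]
    return season["round"].max()

def get_win_percentage(row):
    # Share of the season's races won by the team, as a percentage.
    w = float(row.wins)
    total = float(row.races)
    return (w/total)*100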

In [23]:
win_analysis = win_count.copy()

In [24]:
win_analysis["races"] = win_count.apply(races_in_season, axis=1)

In [25]:
win_analysis.head(10)

Unnamed: 0,year,constructor,wins,races
0,1950,Alfa Romeo,6,7
1,1951,Alfa Romeo,4,8
2,1952,Ferrari,7,8
3,1953,Ferrari,7,9
4,1954,Mercedes,4,9
5,1955,Mercedes,5,7
6,1956,Ferrari,5,8
7,1957,Maserati,4,8
8,1958,Ferrari,2,11
9,1959,Cooper,5,9


In [26]:
win_analysis["win_percentage"] = win_analysis.apply(get_win_percentage, axis=1)

In [27]:
win_analysis.sort_values(by="wins", ascending=False).head(10)

Unnamed: 0,year,constructor,wins,races,win_percentage
63,2016,Mercedes,19,21,90.47619
62,2015,Mercedes,16,19,84.210526
61,2014,Mercedes,16,19,84.210526
52,2002,Ferrari,15,17,88.235294
38,1988,McLaren,15,16,93.75
54,2004,Ferrari,15,18,83.333333
60,2013,Red Bull,13,19,68.421053
58,2011,Red Bull,12,19,63.157895
46,1996,Williams,12,16,75.0
34,1984,McLaren,12,16,75.0


In [28]:
win_analysis.sort_values(by="win_percentage", ascending=False).head(10)

Unnamed: 0,year,constructor,wins,races,win_percentage
38,1988,McLaren,15,16,93.75
63,2016,Mercedes,19,21,90.47619
52,2002,Ferrari,15,17,88.235294
2,1952,Ferrari,7,8,87.5
0,1950,Alfa Romeo,6,7,85.714286
62,2015,Mercedes,16,19,84.210526
61,2014,Mercedes,16,19,84.210526
54,2004,Ferrari,15,18,83.333333
3,1953,Ferrari,7,9,77.777778
34,1984,McLaren,12,16,75.0


McLaren's 1988 run is about 3.3 percentage points better than Mercedes's 2016 run (93.75% vs. 90.48%).
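
The same normalization can also be written without row-wise `apply`. This is a sketch on toy frames that reuse the win and race counts from the table above; the frame names are placeholders:

```python
import pandas as pd

# Toy results: only the round numbers matter here.
toy_results = pd.DataFrame({
    "year": [1988, 1988, 1988, 2016, 2016],
    "round": [1, 2, 16, 1, 21],
})

# Win counts taken from the table above.
toy_wins = pd.DataFrame({
    "year": [1988, 2016],
    "constructor": ["McLaren", "Mercedes"],
    "wins": [15, 19],
})

# Highest round number per year stands in for the number of races.
races_per_year = toy_results.groupby("year")["round"].max()

toy_wins["races"] = toy_wins["year"].map(races_per_year)
toy_wins["win_percentage"] = toy_wins["wins"] / toy_wins["races"] * 100

# Gap between the two runs, in percentage points.
gap = toy_wins.loc[0, "win_percentage"] - toy_wins.loc[1, "win_percentage"]
print(toy_wins)
print(round(gap, 2))
```

The gap is a difference between two percentages, so it is best read as percentage points.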

Let's save this analysis for plotting purposes in the piece.

In [29]:
win_analysis.to_csv("../data/output/win_analysis.csv", index=False)

---

## Method 02: Podiums

Looking at the wins is a good start, but there are a lot of factors about the team's performance over a season that it leaves out.

* It only shows a very narrow slice of the team's drivers' performance. If we only know that one of the drivers won, we have no idea how the other driver did.
* It offers a limited amount of comparison. Winning is a binary variable: you win or you don't. When looking at the history of the sport, things are greyer. For example, Keke Rosberg won the drivers' championship in 1982, but he only had one victory that season. Looking only at the number of wins doesn't provide any context about how this happened.

We can dig a little deeper and look at podiums. The podium refers to the drivers who finished first, second, and third in any given race. A team that consistently has both drivers on the podium over a season is doing exceptionally well. For example, Mercedes's dominance is easier to understand when you see Bottas and Hamilton on the podium at almost every race of 2019 so far.

In [30]:
podiums = results[results.position.isin(["1","2", "3", 1, 2, 3])]

In [31]:
podiums.head(23)

Unnamed: 0,race_id,year,round,race_name,position,order,driver,constructor,team,extra
0,1,1950,1,Britain,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 23.6s ( 146.378 km/h )
1,1,1950,1,Britain,2,2,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 26.2s ( +02.6s )
2,1,1950,1,Britain,3,3,Reg PARNELL,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 14m 15.6s ( +52.0s )
25,2,1950,2,Monaco,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,3h 13m 18.7s ( 98.701 km/h )
127,4,1950,4,Switzerland,1,1,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 02m 53.7s ( 149.279 km/h )
128,4,1950,4,Switzerland,2,2,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 02m 54.1s ( +00.4s )
150,5,1950,5,Belgium,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 47m 26s ( 177.097 km/h )
151,5,1950,5,Belgium,2,2,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 47m 40s ( +14.000s )
164,6,1950,6,France,1,1,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 57m 52.8s ( 168.729 km/h )
165,6,1950,6,France,2,2,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 58m 18.5s ( +25.7s )


In [32]:
podium_count = podiums.groupby(["year", "constructor"]).order.count().rename("podiums")

In [33]:
podium_count.sort_values(ascending=False).head(10)

year  constructor
2016  Mercedes       33
2015  Mercedes       32
2014  Mercedes       31
2004  Ferrari        29
2011  Red Bull       27
2002  Ferrari        27
2017  Mercedes       26
2018  Mercedes       25
1988  McLaren        25
2001  Ferrari        24
Name: podiums, dtype: int64

In [34]:
podium_count = podium_count.to_frame().reset_index()

In [35]:
podium_count.sort_values(by="podiums", ascending=False).head(10)

Unnamed: 0,year,constructor,podiums
64,2016,Mercedes,33
63,2015,Mercedes,32
62,2014,Mercedes,31
54,2004,Ferrari,29
52,2002,Ferrari,27
59,2011,Red Bull,27
65,2017,Mercedes,26
38,1988,McLaren,25
66,2018,Mercedes,25
61,2013,Red Bull,24


We'll want to normalize this as well, but a little differently. We want to take into account that each season has a different number of races and each team can have a different number of drivers in each race. Usually there are 2 drivers per team per race, but sometimes a third driver subs in for one of the two, and in the earlier years teams entered more than 2 drivers.

To account for this, we'll look at each race and count the number of unique drivers who raced. We'll ignore the drivers whose `position` value is one of the following:

* nq: not qualified
* npq: not pre-qualified
* exc: excluded
* tf: parade lap
* f: withdrawal

We'll keep in the drivers whose `position` value was:

* a number spot
* ab: retired
* nc: not classified
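
The two lists above can be written down as a small lookup. This is only a sketch: the labels `classified`, `raced`, `excluded`, and `unknown` are names I made up for illustration, and any code outside the two lists is flagged rather than guessed at.

```python
# Codes from the lists above (labels are hypothetical, for illustration only).
RACED_CODES = {"ab", "nc"}                        # took part but no numbered finish
EXCLUDED_CODES = {"nq", "npq", "exc", "tf", "f"}  # ignored when counting entries

def classify_position(position):
    """Label a raw position value from the results table."""
    value = str(position).strip().lower()
    if value.isdigit():
        return "classified"  # a numbered finishing spot
    if value in RACED_CODES:
        return "raced"
    if value in EXCLUDED_CODES:
        return "excluded"
    return "unknown"         # surface unexpected codes instead of hiding them

print([classify_position(p) for p in ["1", 3, "ab", "nq", "xx"]])
```

Running this over the unique `position` values is a quick way to spot codes that neither list covers.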


For each race, we'll take the minimum of 3 and the number of drivers the team had in that race. The reason is that a race has at most 3 podium spots, and if a team only brought two drivers, the best it can do is two podiums.

In [36]:
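# Position codes for drivers who did not take part in the race.
keep_out = ["nq", "npq", "exc", "tf","f"]
race_entries = results[~results.position.isin(keep_out)]

def podium_spots(row):
    # Rows for this team in this season, limited to drivers who raced.
    season = race_entries[race_entries.year == int(row.year)]
    team = season[season.constructor == row.constructor]
    races = team["round"].unique()
    spots = 0

    for race in races:
        # A team can claim at most 3 podium spots, and no more than
        # the number of drivers it had in the race.
        driver_entries = team[team["round"] == race].driver.nunique()
        spots += min(3, driver_entries)

    return spots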
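
The same count can be sketched with a `groupby` instead of a loop. The frame below is made up (drivers A to D are placeholders), and this is an alternative formulation, not a replacement for the function above:

```python
import pandas as pd

# Toy entries: 4 drivers raced in round 1; in round 2 one of three did not qualify.
toy_entries = pd.DataFrame({
    "year": [1950] * 7,
    "constructor": ["Alfa Romeo"] * 7,
    "round": [1, 1, 1, 1, 2, 2, 2],
    "driver": ["A", "B", "C", "D", "A", "B", "C"],
    "position": ["1", "2", "3", "ab", "1", "ab", "nq"],
})

keep_out = ["nq", "npq", "exc", "tf", "f"]
raced = toy_entries[~toy_entries["position"].isin(keep_out)]

# Unique drivers per team per race, capped at the 3 podium spots available.
drivers_per_race = raced.groupby(["year", "constructor", "round"])["driver"].nunique()
spots_per_race = drivers_per_race.clip(upper=3)

# Sum the per-race spots over each team's season.
toy_podium_spots = spots_per_race.groupby(level=["year", "constructor"]).sum()
print(toy_podium_spots)
```

Round 1 contributes 3 spots (4 drivers, capped) and round 2 contributes 2, for 5 in total.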

In [37]:
podium_analysis = podium_count.copy()

In [38]:
podium_analysis["podium_spots"] = podium_count.apply(podium_spots, axis=1)

In [39]:
podium_analysis.head()

Unnamed: 0,year,constructor,podiums,podium_spots
0,1950,Alfa Romeo,12,18
1,1951,Alfa Romeo,9,21
2,1952,Ferrari,17,22
3,1953,Ferrari,16,24
4,1954,Mercedes,7,17


In [40]:
def podium_percentage(row):
    p = float(row.podiums)
    total = float(row.podium_spots)
    return (p/total) * 100

In [41]:
podium_analysis["podium_percentage"] = podium_analysis.apply(podium_percentage, axis=1)

In [42]:
podium_analysis.sort_values(by="podium_percentage", ascending=False).head(10)

Unnamed: 0,year,constructor,podiums,podium_spots,podium_percentage
63,2015,Mercedes,32,38,84.210526
62,2014,Mercedes,31,38,81.578947
54,2004,Ferrari,29,36,80.555556
52,2002,Ferrari,27,34,79.411765
64,2016,Mercedes,33,42,78.571429
38,1988,McLaren,25,32,78.125
2,1952,Ferrari,17,22,77.272727
59,2011,Red Bull,27,38,71.052632
51,2001,Ferrari,24,34,70.588235
43,1993,Williams,22,32,68.75


While Mercedes's 2016 run has the most podiums (and the most wins), its podium percentage is only the fifth highest. Of their 33 podiums, 19 are first-place finishes. The other 14 podiums mean that at most 14 races had both drivers on the podium, so in at least 7 of the season's 21 races they didn't.
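
The arithmetic behind that reading of the 2016 numbers can be checked quickly. The sketch assumes two cars per race, which matches the 42 podium spots over 21 races in the table above, and it gives a bound rather than an exact count:

```python
# Figures for Mercedes 2016, taken from the tables above.
races, wins, podiums = 21, 19, 33

# Let x = races with two podiums and y = races with exactly one.
# Then 2x + y = podiums, and every win needs a podium race, so x + y >= wins.
# Subtracting the two gives x <= podiums - wins.
max_double_podium_races = podiums - wins

# Every remaining race had at most one driver on the podium.
min_races_without_double = races - max_double_podium_races

print(max_double_podium_races, min_races_without_double)
```

The exact number of double podiums needs the per-race rows; the totals alone only pin down the limit.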

Their 2015 and 2014 runs did considerably better on podium percentage (84.2% and 81.6%).

McLaren's 1988 run also ranks lower by podium percentage (sixth) than by win percentage (first). This could be related to their car performance or to driver mistakes costing them podiums.

Ferrari's 2002 run, third by win percentage, drops to fourth by podium percentage, but it still ranks higher than Mercedes 2016 or McLaren 1988. That F2002 was really robust.

We can save this podium analysis now.

In [43]:
podium_analysis.to_csv("../data/output/podium_analysis.csv", index=False)

---

### Putting both podiums and wins together

We can combine the `win_analysis` and `podium_analysis` dataframes to make web loading slightly faster (1 request vs 2 requests, no duplicate columns requested).

In [44]:
analysis = pd.merge(win_analysis, podium_analysis, on=["year","constructor"])

In [45]:
analysis.head()

Unnamed: 0,year,constructor,wins,races,win_percentage,podiums,podium_spots,podium_percentage
0,1950,Alfa Romeo,6,7,85.714286,12,18,66.666667
1,1951,Alfa Romeo,4,8,50.0,9,21,42.857143
2,1952,Ferrari,7,8,87.5,17,22,77.272727
3,1953,Ferrari,7,9,77.777778,16,24,66.666667
4,1954,Mercedes,4,9,44.444444,7,17,41.176471
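
One thing to watch: `pd.merge` defaults to an inner join, so a season that appears in only one of the two frames would be dropped silently. The row labels in the earlier outputs differ (2016 is row 63 in `win_count` but row 64 in `podium_count`), which suggests at least one such season exists. A sketch of how to look for it, using made-up teams:

```python
import pandas as pd

# Hypothetical frames: "Team B" has podiums but no wins.
toy_win = pd.DataFrame({
    "year": [2000, 2001],
    "constructor": ["Team A", "Team A"],
    "wins": [6, 4],
})
toy_podium = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "constructor": ["Team A", "Team A", "Team B"],
    "podiums": [12, 9, 5],
})

# An outer merge keeps unmatched rows; validate guards against duplicate keys.
check = pd.merge(
    toy_win, toy_podium,
    on=["year", "constructor"], how="outer",
    indicator=True, validate="one_to_one",
)

# Rows that exist in only one of the two frames.
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

Whether to keep such seasons (with wins filled in as 0) or drop them is a choice for the piece; the sketch only makes the difference visible.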


Let's also add a column that holds a single label for each run:

In [46]:
def team_run(row):
    return " ".join([row.constructor,str(row.year)])

In [47]:
analysis["run_id"] = analysis.apply(team_run, axis=1)

In [48]:
analysis.head()

Unnamed: 0,year,constructor,wins,races,win_percentage,podiums,podium_spots,podium_percentage,run_id
0,1950,Alfa Romeo,6,7,85.714286,12,18,66.666667,Alfa Romeo 1950
1,1951,Alfa Romeo,4,8,50.0,9,21,42.857143,Alfa Romeo 1951
2,1952,Ferrari,7,8,87.5,17,22,77.272727,Ferrari 1952
3,1953,Ferrari,7,9,77.777778,16,24,66.666667,Ferrari 1953
4,1954,Mercedes,4,9,44.444444,7,17,41.176471,Mercedes 1954
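
The same label can also be built with column arithmetic instead of a row-wise `apply`. A small sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [1950, 1952],
    "constructor": ["Alfa Romeo", "Ferrari"],
})

# Cast the year to a string so it can be concatenated with the constructor name.
toy["run_id"] = toy["constructor"] + " " + toy["year"].astype(str)
print(toy["run_id"].tolist())
```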


In [49]:
analysis.to_csv("../data/output/win_and_podium_analysis.csv", index=False)