# Looking for the best Formula 1 season

For my master's project, I'm making a piece that answers the question: **Which championship-winning team had the best Formula 1 season?**

To answer this question, I'll be checking three definitions of best:

1. most wins in a season
1. most podiums in a season
1. how close the performance was to perfect

To do this I was working with data provided by the [Ergast Developer API](https://ergast.com/mrd/). I noticed an error in the driver-constructor pairing for the 1950 season and wanted to verify things before moving forward. My original plan was to create a table of the driver-constructor pairs for each race and compare it with the data I had.

Instead I went straight to the source for F1 information, [formula1.com](https://formula1.com), and scraped the results for each race from 1950 to 2018. There were some holes in how disqualifications and withdrawals were recorded (or not, in this case) as we went back in time to earlier seasons.

Now I've gotten data from [statsf1.com](https://www.statsf1.com/), which is tabulated in an easy-to-understand manner and is more complete than the formula1.com data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
race_results = pd.read_csv("../data/from_scripts/statsf1_race_results.csv")

In [3]:
race_results.head(30)

Unnamed: 0,race_id,year,round,race_name,position,p0,driver,team,constructor_long,extra
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 23.6s ( 146.378 km/h )
1,1,1950,1,Britain,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 26.2s ( +02.6s )
2,1,1950,1,Britain,3,3.0,Reg PARNELL,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 14m 15.6s ( +52.0s )
3,1,1950,1,Britain,4,4.0,Yves GIRAUD-CABANTOUS,Talbot Lago,Talbot Lago Talbot,
4,1,1950,1,Britain,5,5.0,Louis ROSIER,Talbot Lago,Talbot Lago Talbot,
5,1,1950,1,Britain,6,6.0,Bob GERARD,ERA,ERA ERA,
6,1,1950,1,Britain,7,7.0,Cuth HARRISON,ERA,ERA ERA,
7,1,1950,1,Britain,8,8.0,Philippe ETANCELIN,Talbot Lago,Talbot Lago Talbot,
8,1,1950,1,Britain,9,9.0,David HAMPSHIRE,Maserati,Maserati Maserati,
9,1,1950,1,Britain,10,10.0,Joe FRY,Maserati,Maserati Maserati,


Let's verify that we have the right number of races. Between 1950 and the end of the 2018 season there were 997 races.

In [4]:
race_results.race_id.max()

997
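
A maximum `race_id` of 997 only shows that the largest ID is right. A slightly stricter check, sketched here on a small made-up frame (the `toy` data is an assumption, not the real file), also confirms that no ID is missing in between:

```python
import pandas as pd

# Toy stand-in for race_results: three races with a few rows each (made-up values).
toy = pd.DataFrame({"race_id": [1, 1, 2, 2, 2, 3]})

# Number of distinct races in the data.
n_races = toy.race_id.nunique()

# The IDs are gap-free when they start at 1 and the distinct count equals the maximum.
no_gaps = (toy.race_id.min() == 1) and (n_races == toy.race_id.max())
print(n_races, no_gaps)
```

On the real data the same two lines would run against `race_results` instead of `toy`.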

Before we get to analysis, there is some processing that needs to be done. First I want to fill in the teams.

In [5]:
def update_teams(row):
    prev = race_results.iloc[row.name -1]
    if row.position == "&":
        return prev.team
    else:
        return row.team
    
def update_constructor_long(row):
    prev = race_results.iloc[row.name -1]
    if row.position == "&":
        return prev.constructor_long
    else:
        return row.constructor_long

In [6]:
race_results["team"] = race_results.apply(update_teams, axis=1)
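# Note: this line also applies update_teams, so constructor_long ends up holding
# the short team name (see the output below). Pass update_constructor_long
# instead to keep the long constructor name.
race_results["constructor_long"] = race_results.apply(update_teams, axis = 1)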
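
The same fill can probably be done without a row-wise `apply`. This is a minimal sketch on made-up data, assuming the shared-drive rows should simply inherit the value from the nearest row above them; the `toy` frame and its values are illustrative only:

```python
import pandas as pd

# Toy stand-in: rows marked "&" are shared drives with no team of their own.
toy = pd.DataFrame({
    "position": ["1", "&", "&", "2"],
    "team": ["Ferrari", None, None, "Maserati"],
})

# Boolean mask of the shared-drive rows.
shared = toy.position == "&"

# Blank out the team on shared-drive rows, then carry the last real value down.
toy["team"] = toy.team.mask(shared).ffill()
print(toy)
```

One difference from the `apply` version: if two `&` rows ever appear back to back, the forward fill carries the real team through both of them, while looking only at the row directly above would copy whatever the first `&` row originally held.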

In [7]:
race_results.head(30)

Unnamed: 0,race_id,year,round,race_name,position,p0,driver,team,constructor_long,extra
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 13m 23.6s ( 146.378 km/h )
1,1,1950,1,Britain,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,2h 13m 26.2s ( +02.6s )
2,1,1950,1,Britain,3,3.0,Reg PARNELL,Alfa Romeo,Alfa Romeo,2h 14m 15.6s ( +52.0s )
3,1,1950,1,Britain,4,4.0,Yves GIRAUD-CABANTOUS,Talbot Lago,Talbot Lago,
4,1,1950,1,Britain,5,5.0,Louis ROSIER,Talbot Lago,Talbot Lago,
5,1,1950,1,Britain,6,6.0,Bob GERARD,ERA,ERA,
6,1,1950,1,Britain,7,7.0,Cuth HARRISON,ERA,ERA,
7,1,1950,1,Britain,8,8.0,Philippe ETANCELIN,Talbot Lago,Talbot Lago,
8,1,1950,1,Britain,9,9.0,David HAMPSHIRE,Maserati,Maserati,
9,1,1950,1,Britain,10,10.0,Joe FRY,Maserati,Maserati,


Now we can look at processing the finishing order. While scraping I created a rough version of the final order (the `p0` column), but now I want to refine it.

The position column gives us information about how the driver fared in the race. There are several options:

* If the position is a number (in string form or otherwise) then that is the finishing position of the driver.
* If the position is `&` then that driver record is for a shared drive and the finishing position of that driver is the same as the record directly above it.
* If the position is `ab` then the driver retired. We will try two different interpretations: *leave the order as is*, and *change every retired driver's order to the average retired order for the race*.
* If the position is `nc` the driver was not classified in the final positions, so we can make that the average of the retired drivers as well.
* If the position is `f` then the driver withdrew from the race. They will be ranked in the last possible spot.
* If the position is `np` then the driver did not start the race, but was on the grid. They will be ranked in the last possible spot.
* If the position is `dsq`, the driver was disqualified and their finishing position will be the last possible spot.
* If the position is `npq`, `nq`, or `exc` the driver's order will be ignored.
* If the position is `tf` do nothing.

We'll do it in two parts, first updating everything but the shared drives.

In [8]:
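def p_final(row):
    # All rows that belong to the same race as this row.
    race = race_results[race_results.race_id == row.race_id]
    # Last possible spot: the largest rough position in the race.
    last_place = race.p0.max()
    # Average rough position of the retired and unclassified drivers. It is
    # computed for the second interpretation but not used below, so `ab` and
    # `nc` rows keep their rough order here.
    avg_retire = np.round(race[race.position.isin(["ab", "nc"])].p0.mean())

    # Disqualified, withdrawn, and did-not-start drivers go to the last spot.
    if (row.position == "dsq") or (row.position == "f") or (row.position == "np"):
        return last_place
    else:
        return row.p0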
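
The `p_final` function above returns `row.p0` for retired drivers, which is the *leave the order as is* interpretation. The other interpretation could look something like the following sketch. It runs on a made-up single race, and the vectorised `groupby`/`transform` approach is my own assumption about how to do it, not something already used in this notebook:

```python
import pandas as pd

# Toy race: three finishers, two retirements, one unclassified, one disqualified.
toy = pd.DataFrame({
    "race_id":  [1, 1, 1, 1, 1, 1, 1],
    "position": ["1", "2", "3", "ab", "ab", "nc", "dsq"],
    "p0":       [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, -1.0],
})

# Last possible spot per race, broadcast back to every row of that race.
last_place = toy.groupby("race_id").p0.transform("max")

# Rounded mean rough position of the retired/unclassified drivers in each race.
retired = toy.position.isin(["ab", "nc"])
avg_retire = toy.p0.where(retired).groupby(toy.race_id).transform("mean").round()

# Start from the rough order, then overwrite the special cases.
to_last = toy.position.isin(["dsq", "f", "np"])
toy["p_final"] = toy.p0.mask(to_last, last_place).mask(retired, avg_retire)
print(toy)
```

Here the three retired or unclassified drivers all end up in position 5, and the disqualified driver moves to position 6, the last spot in the race.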

And then updating the shared drives:

In [9]:
shared_drives = race_results.index[race_results.position == "&"].tolist()

def update_p_final(row):
    prev = race_results.iloc[row.name -1]
    if row.name in shared_drives:
        return prev.p_final
    else:
        return row.p_final

In [10]:
race_results["p_final"] = race_results.apply(p_final, axis =1)
race_results["p_final"] = race_results.apply(update_p_final, axis=1)

In [11]:
race_results[race_results.race_id == 273]

Unnamed: 0,race_id,year,round,race_name,position,p0,driver,team,constructor_long,extra,p_final
7174,273,1976,9,Britain,dsq,-1.0,James HUNT,McLaren,McLaren,Started unofficially 1h 43m 27.61s,28.0
7175,273,1976,9,Britain,1,1.0,Niki LAUDA,Ferrari,Ferrari,1h 44m 19.66s ( 183.881 km/h ),1.0
7176,273,1976,9,Britain,2,2.0,Jody SCHECKTER,Tyrrell,Tyrrell,1h 44m 35.84s ( +16.18s ),2.0
7177,273,1976,9,Britain,3,3.0,John WATSON,Penske,Penske,,3.0
7178,273,1976,9,Britain,4,4.0,Tom PRYCE,Shadow,Shadow,,4.0
7179,273,1976,9,Britain,5,5.0,Alan JONES,Surtees,Surtees,,5.0
7180,273,1976,9,Britain,6,6.0,Emerson FITTIPALDI,Copersucar,Copersucar,,6.0
7181,273,1976,9,Britain,7,7.0,Harald ERTL,Hesketh,Hesketh,,7.0
7182,273,1976,9,Britain,8,8.0,Carlos PACE,Brabham,Brabham,,8.0
7183,273,1976,9,Britain,9,9.0,Jean-Pierre JARIER,Shadow,Shadow,,9.0


I will work with a slice of this `race_results` DataFrame that includes only each team's results from its championship-winning season. Let's make that slice now:

In [12]:
winning_teams = pd.read_csv("../data/other/winning_teams_statsf1_v2.csv")
winning_teams.head(15)

Unnamed: 0,year,team
0,1950,Alfa Romeo
1,1951,Alfa Romeo
2,1952,Ferrari
3,1953,Ferrari
4,1954,Mercedes
5,1955,Mercedes
6,1956,Ferrari
7,1957,Maserati
8,1958,Ferrari
9,1959,Cooper


Now we combine this dataframe with the `race_results` one:

In [13]:
combine = pd.merge(race_results, winning_teams, how="left", on=["year", "team"], indicator="keep")

In [14]:
combine[combine.race_id == 1]

Unnamed: 0,race_id,year,round,race_name,position,p0,driver,team,constructor_long,extra,p_final,keep
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 13m 23.6s ( 146.378 km/h ),1.0,both
1,1,1950,1,Britain,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,2h 13m 26.2s ( +02.6s ),2.0,both
2,1,1950,1,Britain,3,3.0,Reg PARNELL,Alfa Romeo,Alfa Romeo,2h 14m 15.6s ( +52.0s ),3.0,both
3,1,1950,1,Britain,4,4.0,Yves GIRAUD-CABANTOUS,Talbot Lago,Talbot Lago,,4.0,left_only
4,1,1950,1,Britain,5,5.0,Louis ROSIER,Talbot Lago,Talbot Lago,,5.0,left_only
5,1,1950,1,Britain,6,6.0,Bob GERARD,ERA,ERA,,6.0,left_only
6,1,1950,1,Britain,7,7.0,Cuth HARRISON,ERA,ERA,,7.0,left_only
7,1,1950,1,Britain,8,8.0,Philippe ETANCELIN,Talbot Lago,Talbot Lago,,8.0,left_only
8,1,1950,1,Britain,9,9.0,David HAMPSHIRE,Maserati,Maserati,,9.0,left_only
9,1,1950,1,Britain,10,10.0,Joe FRY,Maserati,Maserati,,10.0,left_only


In [15]:
results = combine[combine.keep == "both"]
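
To see what the `indicator` argument is doing, here is a small sketch on made-up frames (`results_toy` and `champions_toy` are illustrative names, not files from this project). Rows that match a champion get `both`, and everything else gets `left_only`:

```python
import pandas as pd

# Toy race results: two teams in 1950, one in 1951.
results_toy = pd.DataFrame({
    "year": [1950, 1950, 1951],
    "team": ["Alfa Romeo", "ERA", "Ferrari"],
})

# Toy champions table: one team per year.
champions_toy = pd.DataFrame({
    "year": [1950, 1951],
    "team": ["Alfa Romeo", "Alfa Romeo"],
})

# A left merge keeps every result row; the indicator column records the match status.
merged = results_toy.merge(
    champions_toy, how="left", on=["year", "team"], indicator="keep"
)

# Keep only the rows that matched a champion, then drop the helper column.
kept = merged[merged.keep == "both"].drop(columns="keep")
print(merged)
print(kept)
```

An inner merge (`how="inner"`) would return the matching rows directly. The left merge with an indicator is handy while checking, because the unmatched rows stay visible.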

In [16]:
results[results.year == 1982]

Unnamed: 0,race_id,year,round,race_name,position,p0,driver,team,constructor_long,extra,p_final,keep
9649,358,1982,1,South Africa,2,2.0,Carlos REUTEMANN,Williams,Williams,1h 32m 23.347s ( +14.946s ),2.0,both
9652,358,1982,1,South Africa,5,5.0,Keke ROSBERG,Williams,Williams,1h 32m 54.540s ( +46.139s ),5.0,both
9680,359,1982,2,Brazil,dsq,-1.0,Keke ROSBERG,Williams,Williams,Weight infringement 1h 44m 05.737s,29.0,both
9697,359,1982,2,Brazil,ab,17.0,Carlos REUTEMANN,Williams,Williams,Collision,17.0,both
9711,360,1982,3,USA West,2,2.0,Keke ROSBERG,Williams,Williams,1h 58m 39.978s ( +14.660s ),2.0,both
9729,360,1982,3,USA West,ab,19.0,Mario ANDRETTI,Williams,Williams,Suspension,19.0,both
9756,362,1982,5,Belgium,2,2.0,Keke ROSBERG,Williams,Williams,1h 35m 49.263s ( +07.268s ),2.0,both
9765,362,1982,5,Belgium,ab,10.0,Derek DALY,Williams,Williams,Accident,10.0,both
9792,363,1982,6,Monaco,6,6.0,Derek DALY,Williams,Williams,Accident,6.0,both
9797,363,1982,6,Monaco,ab,11.0,Keke ROSBERG,Williams,Williams,Suspension,11.0,both


Those are the result records for Williams in 1982 (the output above shows only the first ten).

We can drop the `keep` column, save a copy of this data, and start doing the three analyses.

In [17]:
results = results.drop(columns=["keep"])

In [18]:
results.to_csv("../data/other/race_results_champions.csv", index=False)

---

## Method 01: Wins

Let's compare championship seasons by how many wins each team got in their season.

We can look for wins by doing one of two things:

* pick all rows where `p_final == 1`
* pick all rows where `position == "1"`

In terms of wins, there were three races where two drivers shared first: 1951 French GP (Alfa Romeo), 1956 Argentine GP (Ferrari), and 1957 British GP (Vanwall).

For this analysis I care more that the constructor/team finished first than I do about it being a shared drive. Selecting rows by the `position` column also means I don't have to worry about shared drives: the second driver in a shared win is recorded as `&`, so each win is counted once.


In [19]:
wins = results[results.position == "1"]
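
The difference between the two selections shows up on a shared win. In this made-up example (the `toy` frame is an assumption for illustration), race 11 was won by a shared drive, so it has one `1` row and one `&` row, both with `p_final == 1`:

```python
import pandas as pd

# Toy winners: race 11 is a shared win (two rows), race 12 is a normal win.
toy = pd.DataFrame({
    "race_id":  [11, 11, 12],
    "position": ["1", "&", "1"],
    "p_final":  [1.0, 1.0, 1.0],
    "team":     ["Alfa Romeo", "Alfa Romeo", "Alfa Romeo"],
})

# Selecting on p_final counts the shared win twice.
by_p_final = (toy.p_final == 1).sum()

# Selecting on position counts each race once.
by_position = (toy.position == "1").sum()

# Sanity check: no race has more than one row with position "1".
max_wins_per_race = toy[toy.position == "1"].groupby("race_id").size().max()
print(by_p_final, by_position, max_wins_per_race)
```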

In [22]:
wins.head(12)

Unnamed: 0,race_id,year,round,race_name,position,p0,driver,team,constructor_long,extra,p_final
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 13m 23.6s ( 146.378 km/h ),1.0
25,2,1950,2,Monaco,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,3h 13m 18.7s ( 98.701 km/h ),1.0
127,4,1950,4,Switzerland,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 02m 53.7s ( 149.279 km/h ),1.0
150,5,1950,5,Belgium,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 47m 26s ( 177.097 km/h ),1.0
164,6,1950,6,France,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 57m 52.8s ( 168.729 km/h ),1.0
188,7,1950,7,Italy,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 51m 17.4s ( 176.543 km/h ),1.0
222,8,1951,1,Switzerland,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 07m 53.64s ( 143.444 km/h ),1.0
309,10,1951,3,Belgium,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 45m 46.2s ( 183.985 km/h ),1.0
325,11,1951,4,France,1,1.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,,1.0
433,15,1951,8,Spain,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 46m 54.10s ( 158.939 km/h ),1.0


In [23]:
wins[wins.year== 1982]

Unnamed: 0,race_id,year,round,race_name,position,p0,driver,team,constructor_long,extra,p_final
10028,371,1982,14,Switzerland,1,1.0,Keke ROSBERG,Williams,Williams,1h 32m 41.087s ( 196.796 km/h ),1.0


Now that we've verified the wins are correct, let's do the counting:

In [24]:
win_count = wins.groupby(["year", "team"]).p_final.count().rename("wins")

In [25]:
win_count.sort_values(ascending=False).head(10)

year  team    
2016  Mercedes    19
2015  Mercedes    16
2014  Mercedes    16
2002  Ferrari     15
1988  McLaren     15
2004  Ferrari     15
2013  Red Bull    13
1996  Williams    12
2017  Mercedes    12
2011  Red Bull    12
Name: wins, dtype: int64

Let's take it from a series to a dataframe:

In [26]:
win_count = win_count.to_frame().reset_index()
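
The count and the conversion to a dataframe can also be written in one step. This is only a sketch on made-up rows (`wins_toy` is illustrative). It uses `size()`, which counts rows, whereas `p_final.count()` above counts non-null `p_final` values; the two agree as long as `p_final` has no missing values among the wins.

```python
import pandas as pd

# Toy wins table: one row per win.
wins_toy = pd.DataFrame({
    "year": [1988, 1988, 1988, 2002, 2002],
    "team": ["McLaren", "McLaren", "McLaren", "Ferrari", "Ferrari"],
})

# Count rows per (year, team) and name the resulting column in the same call.
win_count_toy = (
    wins_toy.groupby(["year", "team"])
    .size()
    .reset_index(name="wins")
)
print(win_count_toy)
```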

In [27]:
win_count.sort_values(by="wins",ascending=False).head(10)

Unnamed: 0,year,team,wins
63,2016,Mercedes,19
62,2015,Mercedes,16
61,2014,Mercedes,16
52,2002,Ferrari,15
38,1988,McLaren,15
54,2004,Ferrari,15
60,2013,Red Bull,13
58,2011,Red Bull,12
46,1996,Williams,12
34,1984,McLaren,12
