# Looking for the best Formula 1 season

For my master's project, I'm writing a piece that answers the question: **Which championship-winning team had the best Formula 1 season?**

To answer this question, I'll be checking three definitions of best:

1. most wins in a season
1. most podiums in a season
1. how close the performance was to perfect

I started out working with data provided by the [Ergast Developer API](https://ergast.com/mrd/). I noticed an error in the driver-constructor pairing for the 1950 season and wanted to verify the data before moving forward. I was originally going to create a table of the driver-constructor pairs for each race and then compare it with the data I had.

Instead I went straight to the source for F1 information, [formula1.com](https://formula1.com), and scraped the results for each race from 1950 to 2018. There were some holes in how disqualifications and withdrawals were recorded (or not recorded, in this case) in the earlier seasons.

I have since gotten data from [statsf1.com](https://www.statsf1.com/), which is tabulated in an easy-to-understand manner and is more complete than the formula1.com data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
race_results = pd.read_csv("../data/from_scripts/statsf1_race_results.csv")

In [3]:
race_results.head(30)

Unnamed: 0,race_id,year,round,race_name,position,p_prelim,driver,team,constructor_long,extra
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 23.6s ( 146.378 km/h )
1,1,1950,1,Britain,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 13m 26.2s ( +02.6s )
2,1,1950,1,Britain,3,3.0,Reg PARNELL,Alfa Romeo,Alfa Romeo Alfa Romeo,2h 14m 15.6s ( +52.0s )
3,1,1950,1,Britain,4,4.0,Yves GIRAUD-CABANTOUS,Talbot Lago,Talbot Lago Talbot,
4,1,1950,1,Britain,5,5.0,Louis ROSIER,Talbot Lago,Talbot Lago Talbot,
5,1,1950,1,Britain,6,6.0,Bob GERARD,ERA,ERA ERA,
6,1,1950,1,Britain,7,7.0,Cuth HARRISON,ERA,ERA ERA,
7,1,1950,1,Britain,8,8.0,Philippe ETANCELIN,Talbot Lago,Talbot Lago Talbot,
8,1,1950,1,Britain,9,9.0,David HAMPSHIRE,Maserati,Maserati Maserati,
9,1,1950,1,Britain,10,10.0,Joe FRY,Maserati,Maserati Maserati,


Let's verify that we have the right number of races. Between 1950 and the end of the 2018 season there were 997 races.

In [4]:
race_results.race_id.max()

997
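
A slightly stronger check, sketched here on a made-up frame rather than the real file, is to confirm that the IDs are also gap-free: if `race_id` starts at 1 and the number of unique IDs equals the maximum, no race is missing. The `toy` frame and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for race_results: three races, several rows each.
toy = pd.DataFrame({"race_id": [1, 1, 2, 2, 2, 3]})

n_unique = toy.race_id.nunique()   # number of distinct races
max_id = toy.race_id.max()         # highest race ID seen

# If these match (and IDs start at 1), no race is missing from the scrape.
ids_are_contiguous = bool(toy.race_id.min() == 1 and n_unique == max_id)
print(n_unique, max_id, ids_are_contiguous)
```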

Before we get to analysis, there is some processing that needs to be done. First I want to fill in the teams.

In [5]:
def update_teams(row):
    prev = race_results.iloc[row.name -1]
    if row.position == "&":
        return prev.team
    else:
        return row.team
    
def update_constructor_long(row):
    prev = race_results.iloc[row.name -1]
    if row.position == "&":
        return prev.constructor_long
    else:
        return row.constructor_long

In [6]:
race_results["team"] = race_results.apply(update_teams, axis=1)
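# Note: update_teams is applied here too, so constructor_long ends up holding the
# short team name (as the output below shows); update_constructor_long would
# keep the long constructor name instead.
race_results["constructor_long"] = race_results.apply(update_teams, axis = 1)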

In [7]:
race_results.head(30)

Unnamed: 0,race_id,year,round,race_name,position,p_prelim,driver,team,constructor_long,extra
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 13m 23.6s ( 146.378 km/h )
1,1,1950,1,Britain,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,2h 13m 26.2s ( +02.6s )
2,1,1950,1,Britain,3,3.0,Reg PARNELL,Alfa Romeo,Alfa Romeo,2h 14m 15.6s ( +52.0s )
3,1,1950,1,Britain,4,4.0,Yves GIRAUD-CABANTOUS,Talbot Lago,Talbot Lago,
4,1,1950,1,Britain,5,5.0,Louis ROSIER,Talbot Lago,Talbot Lago,
5,1,1950,1,Britain,6,6.0,Bob GERARD,ERA,ERA,
6,1,1950,1,Britain,7,7.0,Cuth HARRISON,ERA,ERA,
7,1,1950,1,Britain,8,8.0,Philippe ETANCELIN,Talbot Lago,Talbot Lago,
8,1,1950,1,Britain,9,9.0,David HAMPSHIRE,Maserati,Maserati,
9,1,1950,1,Britain,10,10.0,Joe FRY,Maserati,Maserati,
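
The same fill can also be done without a row-by-row `apply`. This is only a sketch on made-up rows, assuming that shared-drive rows (`&`) come through with no team of their own and always sit directly below the record they belong to. It swaps in `Series.mask` and `ffill` for the lookup of the previous row, which also covers two `&` rows in a row.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the "&" rows are shared drives with no team recorded.
toy = pd.DataFrame({
    "position": ["1", "&", "&", "2"],
    "team": ["Ferrari", np.nan, np.nan, "Maserati"],
})

# Blank out the team on shared-drive rows, then carry the last real value down.
is_shared = toy["position"] == "&"
toy["team"] = toy["team"].mask(is_shared).ffill()
print(toy["team"].tolist())
```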


Now we can look at processing the finishing order. While scraping I created a rough version of the final order (`p_prelim`), but now I want to refine it.

The position column gives us information about how the driver fared in the race. There are several options:

* If the position is a number (in string form or otherwise) then that is the finishing position of the driver.
* If the position is `&` then that driver record is for a shared drive and the finishing position of that driver is the same as the record directly above it.
* If the position is `ab` then the driver retired during the race. The later they retired, the higher they are ranked.
* If the position is `nc` then the driver was not classified in the final results, but did complete most of the race.
* If the position is `f` then the driver withdrew from the race. They will be ranked in the last possible spot.
* If the position is `np` then the driver did not start the race, but was on the grid. They will be ranked in the last possible spot.
* If the position is `dsq` then the driver was disqualified and their finishing position will be the last possible spot.
* If the position is `npq`, `nq`, `tf` or `exc` then the driver's result will be ignored.

We'll do it in two parts, first updating everything but the shared drives.

In [8]:
def p_final(row):
    race = race_results[race_results.race_id == row.race_id]
    # Last place is the largest preliminary position recorded for this race.
    last_place = race.p_prelim.max()
    
    # Disqualified, withdrawn, and non-starting drivers are sent to last place.
    if row.position in ("dsq", "f", "np"):
        return last_place
    else:
        return row.p_prelim

And then updating the shared drives:

In [9]:
shared_drives = race_results.index[race_results.position == "&"].tolist()

def update_p_final(row):
    prev = race_results.iloc[row.name -1]
    if row.name in shared_drives:
        return prev.p_final
    else:
        return row.p_final

In [10]:
race_results["p_final"] = race_results.apply(p_final, axis =1)
race_results["p_final"] = race_results.apply(update_p_final, axis=1)
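
For reference, here is a vectorized sketch of the same two steps on made-up data. It assumes, as above, that last place is the largest `p_prelim` in a race and that a shared-drive row sits directly below the record it inherits from. The `toy` frame is hypothetical, and this is an alternative formulation rather than the code that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical two-race sample; p_prelim is the rough order from scraping.
toy = pd.DataFrame({
    "race_id":  [1, 1, 1, 1, 2, 2, 2],
    "position": ["1", "&", "ab", "dsq", "1", "2", "np"],
    "p_prelim": [1.0, 1.0, 3.0, -1.0, 1.0, 2.0, 3.0],
})

# Last place in each race is the largest preliminary position in that race.
last_place = toy.groupby("race_id")["p_prelim"].transform("max")

# Disqualified, withdrawn, and non-starting drivers go to last place.
to_last = toy["position"].isin(["dsq", "f", "np"])
toy["p_final"] = toy["p_prelim"].where(~to_last, last_place)

# Shared drives inherit the final position of the record above them.
toy["p_final"] = toy["p_final"].mask(toy["position"] == "&").ffill()
print(toy["p_final"].tolist())
```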

In [11]:
race_results[race_results.race_id == 273]

Unnamed: 0,race_id,year,round,race_name,position,p_prelim,driver,team,constructor_long,extra,p_final
7174,273,1976,9,Britain,dsq,-1.0,James HUNT,McLaren,McLaren,Started unofficially 1h 43m 27.61s,28.0
7175,273,1976,9,Britain,1,1.0,Niki LAUDA,Ferrari,Ferrari,1h 44m 19.66s ( 183.881 km/h ),1.0
7176,273,1976,9,Britain,2,2.0,Jody SCHECKTER,Tyrrell,Tyrrell,1h 44m 35.84s ( +16.18s ),2.0
7177,273,1976,9,Britain,3,3.0,John WATSON,Penske,Penske,,3.0
7178,273,1976,9,Britain,4,4.0,Tom PRYCE,Shadow,Shadow,,4.0
7179,273,1976,9,Britain,5,5.0,Alan JONES,Surtees,Surtees,,5.0
7180,273,1976,9,Britain,6,6.0,Emerson FITTIPALDI,Copersucar,Copersucar,,6.0
7181,273,1976,9,Britain,7,7.0,Harald ERTL,Hesketh,Hesketh,,7.0
7182,273,1976,9,Britain,8,8.0,Carlos PACE,Brabham,Brabham,,8.0
7183,273,1976,9,Britain,9,9.0,Jean-Pierre JARIER,Shadow,Shadow,,9.0


I will work with a slice of this `race_results` DataFrame that includes only each team's championship-winning season. Let's make that slice now:

In [13]:
winning_teams = pd.read_csv("../data/other/winning_teams_statsf1_v2.csv")
winning_teams.head(15)

Unnamed: 0,year,team
0,1950,Alfa Romeo
1,1951,Alfa Romeo
2,1952,Ferrari
3,1953,Ferrari
4,1954,Mercedes
5,1955,Mercedes
6,1956,Ferrari
7,1957,Maserati
8,1958,Ferrari
9,1959,Cooper


Now we combine this dataframe with the `race_results` one:

In [14]:
combine = pd.merge(race_results, winning_teams, how="left", on=["year", "team"], indicator="keep")

In [15]:
combine[combine.race_id == 1]

Unnamed: 0,race_id,year,round,race_name,position,p_prelim,driver,team,constructor_long,extra,p_final,keep
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 13m 23.6s ( 146.378 km/h ),1.0,both
1,1,1950,1,Britain,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,2h 13m 26.2s ( +02.6s ),2.0,both
2,1,1950,1,Britain,3,3.0,Reg PARNELL,Alfa Romeo,Alfa Romeo,2h 14m 15.6s ( +52.0s ),3.0,both
3,1,1950,1,Britain,4,4.0,Yves GIRAUD-CABANTOUS,Talbot Lago,Talbot Lago,,4.0,left_only
4,1,1950,1,Britain,5,5.0,Louis ROSIER,Talbot Lago,Talbot Lago,,5.0,left_only
5,1,1950,1,Britain,6,6.0,Bob GERARD,ERA,ERA,,6.0,left_only
6,1,1950,1,Britain,7,7.0,Cuth HARRISON,ERA,ERA,,7.0,left_only
7,1,1950,1,Britain,8,8.0,Philippe ETANCELIN,Talbot Lago,Talbot Lago,,8.0,left_only
8,1,1950,1,Britain,9,9.0,David HAMPSHIRE,Maserati,Maserati,,9.0,left_only
9,1,1950,1,Britain,10,10.0,Joe FRY,Maserati,Maserati,,10.0,left_only


In [16]:
results = combine[combine.keep == "both"]
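
To make the role of the `keep` column concrete, here is the same merge-and-filter pattern on two tiny made-up frames (`toy_results` and `toy_winners` are hypothetical stand-ins). Passing a string to `indicator` names the column that records whether each row matched.

```python
import pandas as pd

# Hypothetical miniature versions of race_results and winning_teams.
toy_results = pd.DataFrame({
    "year": [1950, 1950, 1951],
    "team": ["Alfa Romeo", "Maserati", "Ferrari"],
    "driver": ["A", "B", "C"],
})
toy_winners = pd.DataFrame({
    "year": [1950, 1951],
    "team": ["Alfa Romeo", "Alfa Romeo"],
})

# indicator="keep" adds a column named "keep" describing where each row matched.
merged = pd.merge(toy_results, toy_winners, how="left",
                  on=["year", "team"], indicator="keep")
champions = merged[merged.keep == "both"].drop(columns=["keep"])
print(merged.keep.tolist())
print(champions.driver.tolist())
```

As long as each `(year, team)` pair appears only once in the winners table, an inner merge should select the same rows; the left merge with an indicator just makes the non-matching rows visible for checking.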

In [17]:
results[results.year == 1982]

Unnamed: 0,race_id,year,round,race_name,position,p_prelim,driver,team,constructor_long,extra,p_final,keep
9649,358,1982,1,South Africa,2,2.0,Carlos REUTEMANN,Williams,Williams,1h 32m 23.347s ( +14.946s ),2.0,both
9652,358,1982,1,South Africa,5,5.0,Keke ROSBERG,Williams,Williams,1h 32m 54.540s ( +46.139s ),5.0,both
9680,359,1982,2,Brazil,dsq,-1.0,Keke ROSBERG,Williams,Williams,Weight infringement 1h 44m 05.737s,29.0,both
9697,359,1982,2,Brazil,ab,17.0,Carlos REUTEMANN,Williams,Williams,Collision,17.0,both
9711,360,1982,3,USA West,2,2.0,Keke ROSBERG,Williams,Williams,1h 58m 39.978s ( +14.660s ),2.0,both
9729,360,1982,3,USA West,ab,19.0,Mario ANDRETTI,Williams,Williams,Suspension,19.0,both
9756,362,1982,5,Belgium,2,2.0,Keke ROSBERG,Williams,Williams,1h 35m 49.263s ( +07.268s ),2.0,both
9765,362,1982,5,Belgium,ab,10.0,Derek DALY,Williams,Williams,Accident,10.0,both
9792,363,1982,6,Monaco,6,6.0,Derek DALY,Williams,Williams,Accident,6.0,both
9797,363,1982,6,Monaco,ab,11.0,Keke ROSBERG,Williams,Williams,Suspension,11.0,both


And those are all the result records for Williams in 1982.

We can drop the `keep` column, save a copy of this data, and start doing the three analyses.

In [18]:
results = results.drop(columns=["keep"])

In [19]:
results.to_csv("../data/other/race_results_champions.csv", index=False)

---

## Method 01: Wins

Let's compare championship seasons by how many wins each team got in their season.

We can look for wins by doing one of two things:

* pick all rows where `p_final == 1`
* pick all rows where `position == "1"`

In terms of wins, there were three races where two drivers shared first: 1951 French GP (Alfa Romeo), 1956 Argentine GP (Ferrari), and 1957 British GP (Vanwall).

For this analysis I care more that the constructor/team finished first than I do about it being a shared drive. By selecting rows using the position column, I also don't have to worry about shared drives.
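
A toy example of why the two selections differ when a win is shared. The rows are made up, but they follow the convention described earlier: the co-driver's record is marked `&` and carries the same `p_final` as the record above it.

```python
import pandas as pd

# Hypothetical shared win: the "&" row is the co-driver of the winning car.
toy = pd.DataFrame({
    "race_id":  [11, 11, 11],
    "position": ["1", "&", "2"],
    "p_final":  [1.0, 1.0, 2.0],
    "team":     ["Alfa Romeo", "Alfa Romeo", "Ferrari"],
})

by_p_final = toy[toy.p_final == 1]       # picks up both drivers of the shared car
by_position = toy[toy.position == "1"]   # picks up one row per winning car

print(len(by_p_final), len(by_position))
```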


In [20]:
wins = results[results.position == "1"]

In [21]:
wins.head(12)

Unnamed: 0,race_id,year,round,race_name,position,p_prelim,driver,team,constructor_long,extra,p_final
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 13m 23.6s ( 146.378 km/h ),1.0
25,2,1950,2,Monaco,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,3h 13m 18.7s ( 98.701 km/h ),1.0
127,4,1950,4,Switzerland,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 02m 53.7s ( 149.279 km/h ),1.0
150,5,1950,5,Belgium,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 47m 26s ( 177.097 km/h ),1.0
164,6,1950,6,France,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 57m 52.8s ( 168.729 km/h ),1.0
188,7,1950,7,Italy,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 51m 17.4s ( 176.543 km/h ),1.0
222,8,1951,1,Switzerland,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 07m 53.64s ( 143.444 km/h ),1.0
309,10,1951,3,Belgium,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 45m 46.2s ( 183.985 km/h ),1.0
325,11,1951,4,France,1,1.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,,1.0
433,15,1951,8,Spain,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 46m 54.10s ( 158.939 km/h ),1.0


In [22]:
wins[wins.year== 1982]

Unnamed: 0,race_id,year,round,race_name,position,p_prelim,driver,team,constructor_long,extra,p_final
10028,371,1982,14,Switzerland,1,1.0,Keke ROSBERG,Williams,Williams,1h 32m 41.087s ( 196.796 km/h ),1.0


Now that we've verified the wins are correct, let's do the counting:

In [23]:
win_count = wins.groupby(["year", "team"]).p_final.count().rename("wins")

In [24]:
win_count.sort_values(ascending=False).head(10)

year  team    
2016  Mercedes    19
2015  Mercedes    16
2014  Mercedes    16
2004  Ferrari     15
1988  McLaren     15
2002  Ferrari     15
2013  Red Bull    13
1984  McLaren     12
2017  Mercedes    12
2011  Red Bull    12
Name: wins, dtype: int64

Let's take it from a series to a dataframe:

In [25]:
win_count = win_count.to_frame().reset_index()

In [26]:
win_count.sort_values(by="wins",ascending=False).head(10)

Unnamed: 0,year,team,wins
66,2016,Mercedes,19
65,2015,Mercedes,16
64,2014,Mercedes,16
38,1988,McLaren,15
54,2004,Ferrari,15
52,2002,Ferrari,15
63,2013,Red Bull,13
34,1984,McLaren,12
67,2017,Mercedes,12
61,2011,Red Bull,12


To better compare things we should also normalize by the number of races in each season. We'll compute the percentage of races won in each season.

In [27]:
def find_num_races(row):
    season = results[results.year == int(row.year)]
    return season["round"].max()

def get_win_percentage(row):
    w = float(row.wins)
    total = float(row.races)
    return (w/total)*100

In [28]:
win_analysis = win_count.copy()

In [29]:
win_analysis["races"] = win_count.apply(find_num_races, axis=1)

In [30]:
win_analysis.sort_values(by="wins",ascending=False).head(10)

Unnamed: 0,year,team,wins,races
66,2016,Mercedes,19,21
65,2015,Mercedes,16,19
64,2014,Mercedes,16,19
38,1988,McLaren,15,16
54,2004,Ferrari,15,18
52,2002,Ferrari,15,17
63,2013,Red Bull,13,19
34,1984,McLaren,12,16
67,2017,Mercedes,12,20
61,2011,Red Bull,12,19


In [31]:
win_analysis["win_percentage"] = win_analysis.apply(get_win_percentage, axis=1)
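
The percentage itself is plain column arithmetic, so it can also be written without `apply`. A small sketch using three rows copied from the table above (the `toy` frame is just for illustration):

```python
import pandas as pd

# Three rows copied from the wins/races table above.
toy = pd.DataFrame({
    "year": [2016, 1988, 2002],
    "team": ["Mercedes", "McLaren", "Ferrari"],
    "wins": [19, 15, 15],
    "races": [21, 16, 17],
})

# Element-wise division; pandas returns floats, so no explicit casting is needed.
toy["win_percentage"] = toy["wins"] / toy["races"] * 100
print(toy.sort_values(by="win_percentage", ascending=False))
```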

In [32]:
win_analysis.sort_values(by="wins", ascending=False).head(10)

Unnamed: 0,year,team,wins,races,win_percentage
66,2016,Mercedes,19,21,90.47619
65,2015,Mercedes,16,19,84.210526
64,2014,Mercedes,16,19,84.210526
38,1988,McLaren,15,16,93.75
54,2004,Ferrari,15,18,83.333333
52,2002,Ferrari,15,17,88.235294
63,2013,Red Bull,13,19,68.421053
34,1984,McLaren,12,16,75.0
67,2017,Mercedes,12,20,60.0
61,2011,Red Bull,12,19,63.157895


And looking at the percentages:

In [33]:
win_analysis.sort_values(by="win_percentage", ascending=False).head(10)

Unnamed: 0,year,team,wins,races,win_percentage
38,1988,McLaren,15,16,93.75
66,2016,Mercedes,19,21,90.47619
52,2002,Ferrari,15,17,88.235294
2,1952,Ferrari,7,8,87.5
0,1950,Alfa Romeo,6,7,85.714286
65,2015,Mercedes,16,19,84.210526
64,2014,Mercedes,16,19,84.210526
54,2004,Ferrari,15,18,83.333333
3,1953,Ferrari,7,9,77.777778
46,1996,Williams,12,16,75.0


McLaren's 1988 run is about 3.3 percentage points better than Mercedes's 2016 run.

Let's save this analysis for plotting purposes in the piece.

In [34]:
win_analysis.to_csv("../data/output/win_analysis.csv", index=False)

---

## Method 02: Podiums

Looking at the wins is a good start, but there are a lot of factors about the team's performance over a season that it leaves out.

* It shows only a narrow slice of the drivers' performance. If we only know that one of the team's drivers won, we have no idea how the other driver did.
* It offers limited room for comparison. Winning is a binary variable: you win or you don't. The history of the sport is greyer than that. For example, Keke Rosberg won the drivers' championship in 1982 with only one victory that season. Looking only at the number of wins gives no context for how that happened.

We can dig a little deeper and look at podiums. The podium refers to the drivers who finished first, second, and third in a given race. A team that consistently has both drivers on the podium over a season is doing amazingly well. (For example, Mercedes's dominance is better understood when you see Bottas and Hamilton on the podium for almost every race of 2019 so far.)

In [35]:
podiums = results[results.position.isin(["1","2", "3", 1, 2, 3])]

In [36]:
podiums.head(21)

Unnamed: 0,race_id,year,round,race_name,position,p_prelim,driver,team,constructor_long,extra,p_final
0,1,1950,1,Britain,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 13m 23.6s ( 146.378 km/h ),1.0
1,1,1950,1,Britain,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,2h 13m 26.2s ( +02.6s ),2.0
2,1,1950,1,Britain,3,3.0,Reg PARNELL,Alfa Romeo,Alfa Romeo,2h 14m 15.6s ( +52.0s ),3.0
25,2,1950,2,Monaco,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,3h 13m 18.7s ( 98.701 km/h ),1.0
127,4,1950,4,Switzerland,1,1.0,Giuseppe FARINA,Alfa Romeo,Alfa Romeo,2h 02m 53.7s ( 149.279 km/h ),1.0
128,4,1950,4,Switzerland,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,2h 02m 54.1s ( +00.4s ),2.0
150,5,1950,5,Belgium,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 47m 26s ( 177.097 km/h ),1.0
151,5,1950,5,Belgium,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,2h 47m 40s ( +14.000s ),2.0
164,6,1950,6,France,1,1.0,Juan Manuel FANGIO,Alfa Romeo,Alfa Romeo,2h 57m 52.8s ( 168.729 km/h ),1.0
165,6,1950,6,France,2,2.0,Luigi FAGIOLI,Alfa Romeo,Alfa Romeo,2h 58m 18.5s ( +25.7s ),2.0


By again using the `position` column we can deal with the shared drives.

In [37]:
podium_count = podiums.groupby(["year", "team"]).p_final.count().rename("podium_spots_claimed")

In [38]:
podium_count.sort_values(ascending=False).head(10)

year  team    
2016  Mercedes    33
2015  Mercedes    32
2014  Mercedes    31
2004  Ferrari     29
2002  Ferrari     27
2011  Red Bull    27
2017  Mercedes    26
2018  Mercedes    25
1988  McLaren     25
2013  Red Bull    24
Name: podium_spots_claimed, dtype: int64

In [39]:
podium_count = podium_count.to_frame().reset_index()

We'll want to normalize this as well, but a little differently. We need to take into account that each season has a different number of races and that a team can have a different number of drivers in each race (usually 2 drivers per team per race, but sometimes a third driver substitutes for one of the two, and in the earlier years teams entered more than 2 drivers).

To account for this, we'll look at each race and count the number of unique drivers who raced. We'll ignore the drivers whose `position` value is one of the following:

* nq: not qualified
* npq: not pre-qualified
* exc: excluded
* tf: parade lap

We'll keep in the drivers whose `position` value was:

* a number spot
* ab: retired
* nc: not classified
* np: not started
* f: withdrawal


For each race, we'll take the minimum of 3 and the number of drivers the team entered in that race. A race has only 3 podium spots, and a team that brought only two drivers can claim at most two of them.

In [40]:
keep_out = ["nq", "npq", "exc", "tf"]
race_entries = results[~results.position.isin(keep_out)]

def podium_spots(row):
    season = race_entries[race_entries.year == row.year]
    team = season[season.team == row.team]
    races = team["round"].unique()
    spots = 0
    
    for race in races:
        driver_entries = team[team["round"] == race].driver.nunique()
        spots += min(3, driver_entries)
    
    return spots

def podium_percentage(row):
    p = float(row.podium_spots_claimed)
    total = float(row.podium_spots_available)
    return (p/total) * 100
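
The available-spots count can also be expressed as a chain of groupby operations. This is a sketch on made-up entries (the `toy` frame and driver labels are hypothetical), using `clip(upper=3)` in place of `min(3, n)`:

```python
import pandas as pd

# Hypothetical entries: one row per driver per race for a single team.
toy = pd.DataFrame({
    "year":   [1950] * 9,
    "team":   ["Alfa Romeo"] * 9,
    "round":  [1, 1, 1, 1, 2, 2, 2, 4, 4],
    "driver": ["A", "B", "C", "D", "A", "B", "C", "A", "B"],
})

# Unique drivers per race, capped at the three podium steps, summed per season.
drivers_per_race = toy.groupby(["year", "team", "round"]).driver.nunique()
spots_available = (drivers_per_race.clip(upper=3)
                   .groupby(level=["year", "team"]).sum())
print(spots_available)
```

Here the three races have 4, 3, and 2 drivers, which are capped to 3, 3, and 2, for 8 available spots.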

In [41]:
podium_analysis = podium_count.copy()

In [42]:
podium_analysis.head(10)

Unnamed: 0,year,team,podium_spots_claimed
0,1950,Alfa Romeo,12
1,1951,Alfa Romeo,9
2,1952,Ferrari,17
3,1953,Ferrari,16
4,1954,Mercedes,7
5,1955,Mercedes,10
6,1956,Ferrari,10
7,1957,Maserati,10
8,1958,Ferrari,14
9,1959,Cooper,13


In [43]:
podium_analysis["podium_spots_available"] = podium_count.apply(podium_spots, axis=1)

In [44]:
podium_analysis.sort_values(by="podium_spots_claimed",ascending=False).head(10)

Unnamed: 0,year,team,podium_spots_claimed,podium_spots_available
66,2016,Mercedes,33,42
65,2015,Mercedes,32,38
64,2014,Mercedes,31,38
54,2004,Ferrari,29,36
52,2002,Ferrari,27,34
61,2011,Red Bull,27,38
67,2017,Mercedes,26,40
38,1988,McLaren,25,32
68,2018,Mercedes,25,42
51,2001,Ferrari,24,34


In [45]:
podium_analysis["podium_percentage"] = podium_analysis.apply(podium_percentage, axis=1)

In [46]:
podium_analysis.sort_values(by="podium_percentage", ascending=False).head(10)

Unnamed: 0,year,team,podium_spots_claimed,podium_spots_available,podium_percentage
65,2015,Mercedes,32,38,84.210526
64,2014,Mercedes,31,38,81.578947
54,2004,Ferrari,29,36,80.555556
52,2002,Ferrari,27,34,79.411765
66,2016,Mercedes,33,42,78.571429
38,1988,McLaren,25,32,78.125
2,1952,Ferrari,17,22,77.272727
61,2011,Red Bull,27,38,71.052632
51,2001,Ferrari,24,34,70.588235
43,1993,Williams,22,32,68.75


While Mercedes's 2016 run has the most podiums (and the most wins), its podium percentage is only the fifth highest. Of their 33 podium spots, 19 are first-place finishes. Since a double podium needs at least one second or third place, the other 14 podium spots mean they had both drivers on the podium in at most 14 races, so they missed a double podium in at least 7 of the season's 21 races.

Their 2015 and 2014 podium percentages were noticeably higher.

McLaren's 1988 run also ranks lower here (sixth) than it did by win percentage (first). This could be related to their car performance or driver mistakes costing them podiums.

Ferrari's 2002 run, third by win percentage, slips to fourth, but it still ranks above Mercedes 2016 and McLaren 1988 -- that F2002 was really robust.

We can save this podium analysis now.

In [47]:
podium_analysis.to_csv("../data/output/podium_analysis.csv", index=False)

---
### Putting together podium and win analysis:

We can combine the `win_analysis` and `podium_analysis` dataframes to make web loading slightly faster (1 request vs 2 requests, no duplicate columns requested).

In [48]:
analysis = pd.merge(win_analysis, podium_analysis, on=["year","team"])

In [49]:
analysis.head()

Unnamed: 0,year,team,wins,races,win_percentage,podium_spots_claimed,podium_spots_available,podium_percentage
0,1950,Alfa Romeo,6,7,85.714286,12,18,66.666667
1,1951,Alfa Romeo,4,8,50.0,9,21,42.857143
2,1952,Ferrari,7,8,87.5,17,22,77.272727
3,1953,Ferrari,7,9,77.777778,16,24,66.666667
4,1954,Mercedes,4,9,44.444444,7,17,41.176471


Let's also add a column to be a single label for each run:

In [50]:
def team_run(row):
    return " ".join([row.team,str(row.year)])

In [51]:
analysis["run_id"] = analysis.apply(team_run, axis=1)
analysis.head()

Unnamed: 0,year,team,wins,races,win_percentage,podium_spots_claimed,podium_spots_available,podium_percentage,run_id
0,1950,Alfa Romeo,6,7,85.714286,12,18,66.666667,Alfa Romeo 1950
1,1951,Alfa Romeo,4,8,50.0,9,21,42.857143,Alfa Romeo 1951
2,1952,Ferrari,7,8,87.5,17,22,77.272727,Ferrari 1952
3,1953,Ferrari,7,9,77.777778,16,24,66.666667,Ferrari 1953
4,1954,Mercedes,4,9,44.444444,7,17,41.176471,Mercedes 1954
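
The label can also be built with string concatenation on whole columns. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

toy = pd.DataFrame({"year": [1950, 1952], "team": ["Alfa Romeo", "Ferrari"]})

# Cast the year to a string so it can be joined to the team name.
toy["run_id"] = toy["team"] + " " + toy["year"].astype(str)
print(toy["run_id"].tolist())
```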


In [52]:
analysis.to_csv("../data/output/win_and_podium_analysis.csv", index=False)

---

## Method 03: Race Averages and Consistency

The best podium finish a team can have is one of their drivers in first and the other in second: a one-two finish. We could look at who had the most one-two finishes each season, but I think it's more interesting to see who got closest overall to a perfect season.

To figure this out, I'll introduce the idea of a race average: for each race I'll average all of the team's finishing positions. The lower the average, the better the team performed in that race. If a team has two drivers, the best possible result is a one-two finish, which is a race average of 1.5.

First let's try to see how many drivers each team had for each race.

In [53]:
keep_out = ["nq", "npq", "exc", "tf"]
race_entries = results[~results.position.isin(keep_out)]

In [54]:
grouped= race_entries.groupby(["year", "team", "round"])

In [55]:
grouped.driver.nunique().head(13)

year  team        round
1950  Alfa Romeo  1        4
                  2        3
                  4        3
                  5        3
                  6        3
                  7        5
1951  Alfa Romeo  1        4
                  3        3
                  4        4
                  5        4
                  6        4
                  7        5
                  8        4
Name: driver, dtype: int64

This is a good starting point. I can go race by race and get the averages by doing the same thing (with, say, `.p_final.mean()` instead of `.driver.nunique()`).

But I also want to look at how shared drives are handled.

In [56]:
grouped2 = race_entries.groupby(["year", "team", "round", "driver"])

In [57]:
grouped2.p_final.count()

year  team        round  driver                 
1950  Alfa Romeo  1      Giuseppe FARINA            1
                         Juan Manuel FANGIO         1
                         Luigi FAGIOLI              1
                         Reg PARNELL                1
                  2      Giuseppe FARINA            1
                         Juan Manuel FANGIO         1
                         Luigi FAGIOLI              1
                  4      Giuseppe FARINA            1
                         Juan Manuel FANGIO         1
                         Luigi FAGIOLI              1
                  5      Giuseppe FARINA            1
                         Juan Manuel FANGIO         1
                         Luigi FAGIOLI              1
                  6      Giuseppe FARINA            1
                         Juan Manuel FANGIO         1
                         Luigi FAGIOLI              1
                  7      Consalvo SANESI            1
                         Giuseppe

In the 1950 Italian GP, Juan Manuel Fangio has two records because of shared driving. When calculating race averages, we'll have to process those.

Let's try to calculate the race average for Ferrari's 1952 run:

In [58]:
f1_1952 = race_entries[race_entries.year == 1952]
grouped = f1_1952.groupby(["year", "team", "round", "driver"])

In [59]:
grouped = grouped.p_final.mean().rename("p_average")
grouped

year  team     round  driver             
1952  Ferrari  1      André SIMON            10.0
                      Giuseppe FARINA        12.5
                      Louis ROSIER           20.0
                      Maurice TRINTIGNANT    25.0
                      Peter HIRT              7.0
                      Piero TARUFFI           1.0
                      Rudi FISCHER            2.0
               2      Alberto ASCARI         31.0
               3      Alberto ASCARI          1.0
                      Charles de TORNACO      7.0
                      Giuseppe FARINA         2.0
                      Louis ROSIER           20.0
                      Piero TARUFFI          18.0
               4      Alberto ASCARI          1.0
                      Franco COMOTTI         12.0
                      Giuseppe FARINA         2.0
                      Louis ROSIER           17.0
                      Luigi VILLORESI        24.0
                      Peter HIRT             11.0
        

In [60]:
g_drivers = grouped.to_frame().reset_index()

In [61]:
g_rounds = g_drivers.groupby(["year","team", "round"]).p_average.mean()
g_rounds

year  team     round
1952  Ferrari  1        11.071429
               2        31.000000
               3         9.600000
               4        11.944444
               5        13.000000
               6         8.571429
               7         8.200000
               8         7.714286
Name: p_average, dtype: float64

In [62]:
g_rounds = g_rounds.to_frame().reset_index()

In [63]:
g_team = g_rounds.groupby(["year","team"]).p_average.mean()
g_team

year  team   
1952  Ferrari    12.637698
Name: p_average, dtype: float64

What we just computed is the average finishing position for Ferrari in 1952. 

We can generalize this process for every team and every season:

1. group results by `["year", "team", "round", "driver"]`.  
    1. Find the average `p_final` for each driver of each team at each round of the season.
    1. Rename computed average to `p_average`
    1. turn the resulting series into a dataframe.
1. group dataframe from previous step by `["year", "team", "round"]`
    1. Find the average `p_average` for each team at each round of the season.
    1. turn the resulting series into a dataframe.
1. group dataframe from previous step by `["year", "team"]`
    1. Find the average `p_average` for each team of each year.
    1. turn the resulting series into a dataframe.
1. Optional: group dataframe from previous step by `["year"]`
    1. Find the average `p_average` for each year of F1.
    1. turn resulting series into a dataframe.
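
The first three steps can be wrapped in one helper. This is a sketch: `race_averages` is a name I'm introducing here, and the `toy` frame is made up so the numbers can be checked by hand. Driver A has two records in round 2, standing in for a shared drive.

```python
import pandas as pd

def race_averages(entries):
    """Return per-driver, per-round, and per-team average finishing positions."""
    # Step 1: average each driver's records within a race (handles shared drives).
    drivers = (entries.groupby(["year", "team", "round", "driver"])
               .p_final.mean().rename("p_average").reset_index())
    # Step 2: average the drivers within each race.
    rounds = (drivers.groupby(["year", "team", "round"])
              .p_average.mean().reset_index())
    # Step 3: average the races within each season.
    teams = (rounds.groupby(["year", "team"])
             .p_average.mean().reset_index())
    return drivers, rounds, teams

# Hypothetical two-race season for one team.
toy = pd.DataFrame({
    "year":    [2000] * 5,
    "team":    ["T"] * 5,
    "round":   [1, 1, 2, 2, 2],
    "driver":  ["A", "B", "A", "A", "B"],
    "p_final": [1.0, 2.0, 3.0, 5.0, 2.0],
})
drivers, rounds, teams = race_averages(toy)
print(rounds)
print(teams)
```

Round 1 averages to 1.5 (a one-two finish). In round 2, driver A's two records average to 4.0, which is then averaged with B's 2.0 to give 3.0, so the season average is 2.25.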

Let's apply this to the `results` dataframe so we can compare the championship runs:

In [64]:
keep_out = ["nq", "npq", "exc", "tf"]
race_entries = results[~results.position.isin(keep_out)]

In [65]:
g_drivers = race_entries.groupby(["year", "team", "round", "driver"]).p_final.mean().rename("p_average").to_frame().reset_index()

In [66]:
g_rounds = g_drivers.groupby(["year", "team", "round"]).p_average.mean().to_frame().reset_index()

In [67]:
g_rounds.head(20)

Unnamed: 0,year,team,round,p_average
0,1950,Alfa Romeo,1,4.5
1,1950,Alfa Romeo,2,8.0
2,1950,Alfa Romeo,4,5.0
3,1950,Alfa Romeo,5,2.333333
4,1950,Alfa Romeo,6,3.333333
5,1950,Alfa Romeo,7,10.8
6,1951,Alfa Romeo,1,3.25
7,1951,Alfa Romeo,3,7.0
8,1951,Alfa Romeo,4,6.75
9,1951,Alfa Romeo,5,6.5


In [68]:
g_team = g_rounds.groupby(["year", "team"]).p_average.mean().to_frame().reset_index()

Now we can sort `g_team` to see who had the best racing average:

In [69]:
g_team.sort_values(by="p_average", ascending=True).head(10)

Unnamed: 0,year,team,p_average
67,2017,Mercedes,3.175
65,2015,Mercedes,3.210526
54,2004,Ferrari,3.305556
61,2011,Red Bull,3.473684
66,2016,Mercedes,3.47619
68,2018,Mercedes,3.857143
38,1988,McLaren,4.125
64,2014,Mercedes,4.210526
52,2002,Ferrari,4.235294
46,1996,Williams,4.8125


I think this is good enough to compare, but I want to check what the perfect performance would have been for each season. We can do this similarly to how we counted available podium spots:

1. Get number of drivers in each race, say *n*.
1. Find the average of the first *n* spots and append it to a list.
1. Return the average of that list of per-race best averages.

In [70]:
race_average_analysis = g_team.copy()

In [71]:
def expected_perfect(row):
    season = race_entries[race_entries.year == row.year]
    team = season[season.team == row.team]
    races = team["round"].unique()
    # Note: this is the highest round number the team entered, which can exceed
    # len(races) when the team skipped a round (e.g. Alfa Romeo has no round 3
    # entry in 1950).
    num_races = team["round"].max()
    my_list = []
    
    for race in races:
        drivers = team[team["round"] == race].driver.nunique()
        # Best possible result: the team's drivers fill positions 1..drivers.
        race_finishes = list(range(1, drivers + 1))
        my_list.append(np.mean(race_finishes))
    
    best_finish = np.sum(my_list)
    perfect = float(best_finish)/float(num_races)
    return perfect

In [72]:
race_average_analysis["perfect"] = race_average_analysis.apply(expected_perfect, axis = 1)
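
As a sanity check on the `perfect` column, here is the arithmetic for Alfa Romeo's 1950 run worked by hand, using the per-race driver counts from the `grouped.driver.nunique()` output earlier in this section. The mean of the positions 1 to *n* is (*n* + 1) / 2, so no inner loop is needed. Note that `expected_perfect` divides the sum by `team["round"].max()`, which is 7 for this run even though Alfa Romeo has entries in only six rounds. Which divisor is preferable is a judgment call that I'm only flagging here.

```python
# Unique Alfa Romeo drivers per race entered in 1950 (rounds 1, 2, 4, 5, 6, 7).
drivers_per_race = [4, 3, 3, 3, 3, 5]

# The mean of the positions 1..n is (n + 1) / 2.
best_per_race = [(n + 1) / 2 for n in drivers_per_race]
total = sum(best_per_race)

print(best_per_race)                   # [2.5, 2.0, 2.0, 2.0, 2.0, 3.0]
print(total / len(drivers_per_race))   # 2.25, averaged over the six races entered
print(total / 7)                       # about 1.928571, dividing by the highest round number
```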

In [73]:
race_average_analysis.head(10)

Unnamed: 0,year,team,p_average,perfect
0,1950,Alfa Romeo,5.661111,1.928571
1,1951,Alfa Romeo,7.642857,2.1875
2,1952,Ferrari,12.637698,3.5625
3,1953,Ferrari,7.745982,3.111111
4,1954,Mercedes,6.083333,1.333333
5,1955,Mercedes,7.284722,2.0
6,1956,Ferrari,8.739286,2.6875
7,1957,Maserati,10.17568,4.1875
8,1958,Ferrari,8.05,2.045455
9,1959,Cooper,9.801339,3.777778


In [74]:
race_average_analysis["delta"] = race_average_analysis["p_average"] - race_average_analysis["perfect"]

In [75]:
race_average_analysis.sort_values(by="delta").head(10)

Unnamed: 0,year,team,p_average,perfect,delta
67,2017,Mercedes,3.175,1.5,1.675
65,2015,Mercedes,3.210526,1.5,1.710526
54,2004,Ferrari,3.305556,1.5,1.805556
61,2011,Red Bull,3.473684,1.5,1.973684
66,2016,Mercedes,3.47619,1.5,1.97619
68,2018,Mercedes,3.857143,1.5,2.357143
38,1988,McLaren,4.125,1.5,2.625
64,2014,Mercedes,4.210526,1.5,2.710526
52,2002,Ferrari,4.235294,1.5,2.735294
46,1996,Williams,4.8125,1.5,3.3125


To this, let's add the variance and standard deviation, which we can use for plotting error bars:

In [76]:
def compute_std(row):
    season = g_rounds[g_rounds.year == row.year]
    run = season[season.team == row.team]
    return run.p_average.std()

def compute_var(row):
    season = g_rounds[g_rounds.year == row.year]
    run = season[season.team == row.team]
    return run.p_average.var()

In [77]:
race_average_analysis["std"] = race_average_analysis.apply(compute_std, axis=1)
race_average_analysis["var"] = race_average_analysis.apply(compute_var, axis=1)
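
Both statistics can also be computed in one pass with a groupby aggregation. A sketch on made-up per-round averages (`toy_rounds` is hypothetical); like `Series.std()` and `Series.var()`, the `"std"` and `"var"` aggregations use the sample definition (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical per-round averages for two runs.
toy_rounds = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "team": ["T"] * 6,
    "p_average": [1.5, 2.5, 3.5, 2.0, 2.0, 2.0],
})

# One row per (year, team) with both spread measures as columns.
spread = (toy_rounds.groupby(["year", "team"]).p_average
          .agg(["std", "var"]).reset_index())
print(spread)
```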

In [78]:
race_average_analysis.sort_values(by="p_average").head(10)

Unnamed: 0,year,team,p_average,perfect,delta,std,var
67,2017,Mercedes,3.175,1.5,1.675,1.779082,3.165132
65,2015,Mercedes,3.210526,1.5,1.710526,3.22023,10.369883
54,2004,Ferrari,3.305556,1.5,1.805556,2.573273,6.621732
61,2011,Red Bull,3.473684,1.5,1.973684,3.207949,10.290936
66,2016,Mercedes,3.47619,1.5,1.97619,4.578417,20.961905
68,2018,Mercedes,3.857143,1.5,2.357143,3.428348,11.753571
38,1988,McLaren,4.125,1.5,2.625,4.627814,21.416667
64,2014,Mercedes,4.210526,1.5,2.710526,3.888429,15.119883
52,2002,Ferrari,4.235294,1.5,2.735294,3.973025,15.784926
46,1996,Williams,4.8125,1.5,3.3125,3.628016,13.1625


We can now save this to a CSV and move on to doing the same for all the teams in every season.

In [79]:
race_average_analysis.to_csv("../data/output/race_average_analysis.csv", index=False)

Like we did earlier, we can add the columns from this analysis to an overall analysis DataFrame:

In [80]:
overall_analysis = pd.merge(analysis, race_average_analysis, on=["year","team"])
overall_analysis.head()

Unnamed: 0,year,team,wins,races,win_percentage,podium_spots_claimed,podium_spots_available,podium_percentage,run_id,p_average,perfect,delta,std,var
0,1950,Alfa Romeo,6,7,85.714286,12,18,66.666667,Alfa Romeo 1950,5.661111,1.928571,3.73254,3.167222,10.031296
1,1951,Alfa Romeo,4,8,50.0,9,21,42.857143,Alfa Romeo 1951,7.642857,2.1875,5.455357,4.028175,16.22619
2,1952,Ferrari,7,8,87.5,17,22,77.272727,Ferrari 1952,12.637698,3.5625,9.075198,7.652586,58.562072
3,1953,Ferrari,7,9,77.777778,16,24,66.666667,Ferrari 1953,7.745982,3.111111,4.634871,2.705494,7.319699
4,1954,Mercedes,4,9,44.444444,7,17,41.176471,Mercedes 1954,6.083333,1.333333,4.75,1.237156,1.530556


In [81]:
overall_analysis.to_csv("../data/output/overall_analysis.csv", index=False)

---

### Do race average analysis for all races

In [82]:
keep_out = ["nq", "npq", "exc", "tf"]
entries = race_results[~race_results.position.isin(keep_out)]

In [83]:
g_d = entries.groupby(["year", "team", "round", "driver"]).p_final.mean().rename("p_average").to_frame().reset_index()

In [84]:
g_d.to_csv("../data/output/race_averages_drivers.csv",index=False)

In [85]:
g_r = g_d.groupby(["year", "team", "round"]).p_average.mean().to_frame().reset_index()

In [86]:
g_r.to_csv("../data/output/race_averages_rounds.csv",index=False)

In [87]:
g_t = g_r.groupby(["year", "team"]).p_average.mean().to_frame().reset_index()

In [88]:
g_t.to_csv("../data/output/race_averages_team.csv",index=False)

In [89]:
g_year = g_t.groupby(["year"]).p_average.mean().to_frame().reset_index()

In [90]:
g_year.to_csv("../data/output/race_averages_year.csv",index=False)