# Analyzing F1 results

The question I'm trying to answer with my visualization project is: **"Who had the best championship season?"** 
To narrow things down, I'm looking at constructors in the year they won a championship. Once I have an idea of that, I'll look at a number of years before and after they won the championship to gauge their championship performance.

To answer this, I'm looking at:

1. Wins in the season
1. Overall Podiums in the season
1. One-Two finishes

Of the three, the One-Two finishes give the best idea of performance because they take into account the performance of the cars, the drivers, and the team at a race.

---

In [1]:
import pandas as pd
import numpy as np

## IDEA 1: Wins in the season

I think this is the roughest way to look at the results. In this one I'm looking at the cases where `positionOrder == 1`.

In [2]:
results = pd.read_csv("../data/working/master_results.csv")

In [3]:
results.head()

Unnamed: 0,raceId,year,round,prixName,constructor,constructorRef,forename,surname,grid,position,positionText,positionOrder,points,status
0,833,1950,1,British Grand Prix,Alfa Romeo,alfa,Nino,Farina,1,1.0,1,1,9.0,Finished
1,833,1950,1,British Grand Prix,Alfa Romeo,alfa,Luigi,Fagioli,2,2.0,2,2,6.0,Finished
2,833,1950,1,British Grand Prix,Alfa Romeo,alfa,Reg,Parnell,4,3.0,3,3,4.0,Finished
3,833,1950,1,British Grand Prix,Talbot-Lago,lago,Yves,Cabantous,6,4.0,4,4,3.0,+2 Laps
4,833,1950,1,British Grand Prix,Talbot-Lago,lago,Louis,Rosier,9,5.0,5,5,2.0,+2 Laps


First let's check that we have the right number of races.

In [4]:
races1 = results.groupby(["year","round"])

In [5]:
len(races1)

1004

In [6]:
races2 = results.groupby("raceId")

In [7]:
len(races2)

1004

Counted two different ways, we end up with the same total: the number of races that have happened in F1.
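The same comparison can also be written as a single assertion, so the notebook fails loudly if the two counts ever drift apart. This is only a sketch: `toy` is a hypothetical stand-in for `results`, with a few rows copied from the `head()` output above.

```python
import pandas as pd

# Hypothetical stand-in for `results` (values copied from the head() output above).
toy = pd.DataFrame({
    "raceId": [833, 833, 834, 835],
    "year": [1950, 1950, 1950, 1950],
    "round": [1, 1, 2, 3],
})

# Count races two ways without keeping the groupby objects around.
n_by_round = toy.groupby(["year", "round"]).ngroups  # distinct (year, round) pairs
n_by_id = toy["raceId"].nunique()                    # distinct race ids

# Fail immediately if the two definitions of "a race" disagree.
assert n_by_round == n_by_id
```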

In [8]:
wins = results[(results.positionOrder == 1)]

In [9]:
wins.head()

Unnamed: 0,raceId,year,round,prixName,constructor,constructorRef,forename,surname,grid,position,positionText,positionOrder,points,status
0,833,1950,1,British Grand Prix,Alfa Romeo,alfa,Nino,Farina,1,1.0,1,1,9.0,Finished
23,834,1950,2,Monaco Grand Prix,Alfa Romeo,alfa,Juan,Fangio,1,1.0,1,1,9.0,Finished
44,835,1950,3,Indianapolis 500,Kurtis Kraft,kurtis_kraft,Johnnie,Parsons,5,1.0,1,1,9.0,Finished
79,836,1950,4,Swiss Grand Prix,Alfa Romeo,alfa,Nino,Farina,2,1.0,1,1,9.0,Finished
97,837,1950,5,Belgian Grand Prix,Alfa Romeo,alfa,Juan,Fangio,2,1.0,1,1,8.0,Finished


In [10]:
wins.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1007 entries, 0 to 24320
Data columns (total 14 columns):
raceId            1007 non-null int64
year              1007 non-null int64
round             1007 non-null int64
prixName          1007 non-null object
constructor       1007 non-null object
constructorRef    1007 non-null object
forename          1007 non-null object
surname           1007 non-null object
grid              1007 non-null int64
position          1007 non-null float64
positionText      1007 non-null object
positionOrder     1007 non-null int64
points            1007 non-null float64
status            1007 non-null object
dtypes: float64(2), int64(5), object(7)
memory usage: 118.0+ KB


There is a discrepancy: we have 1007 winners but only 1004 races.

---

#### Finding: Some early races had two winners

To explore the discrepancy between the number of winners and the number of races, let's build some "full prix names" by combining the `year` and `prixName` columns and see which ones are listed more than once.

In [11]:
fullPrix = wins["year"].map(str) + " " + wins["prixName"]

In [12]:
fullPrix.value_counts() > 1 

1951 French Grand Prix            True
1957 British Grand Prix           True
1956 Argentine Grand Prix         True
1996 Portuguese Grand Prix       False
2011 Australian Grand Prix       False
1953 British Grand Prix          False
1980 Dutch Grand Prix            False
2004 United States Grand Prix    False
2001 French Grand Prix           False
1959 British Grand Prix          False
2010 Korean Grand Prix           False
2000 Belgian Grand Prix          False
2002 Brazilian Grand Prix        False
1982 British Grand Prix          False
2006 British Grand Prix          False
1999 Spanish Grand Prix          False
2008 Belgian Grand Prix          False
1958 Indianapolis 500            False
2006 European Grand Prix         False
2001 Hungarian Grand Prix        False
1972 Spanish Grand Prix          False
1973 Spanish Grand Prix          False
2000 San Marino Grand Prix       False
1980 South African Grand Prix    False
2017 Hungarian Grand Prix        False
1971 Dutch Grand Prix    
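The boolean output above lists every race, which makes the `True` rows easy to miss. One way to show only the races with more than one winner is to filter the counts themselves. A minimal sketch, assuming a hypothetical `toy_wins` frame built from a few rows that appear in this notebook's outputs:

```python
import pandas as pd

# Hypothetical stand-in for `wins` (rows taken from outputs in this notebook).
toy_wins = pd.DataFrame({
    "year": [1950, 1951, 1951, 1956, 1956],
    "prixName": [
        "British Grand Prix",
        "French Grand Prix", "French Grand Prix",
        "Argentine Grand Prix", "Argentine Grand Prix",
    ],
    "surname": ["Farina", "Fangio", "Fagioli", "Musso", "Fangio"],
})

# Same "full prix name" construction as in the cell above.
full_prix = toy_wins["year"].map(str) + " " + toy_wins["prixName"]

# Keep only the names that occur more than once.
counts = full_prix.value_counts()
shared = counts[counts > 1]
print(shared)
```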

Let's look more closely at these three races:

* 1951 French Grand Prix
* 1956 Argentine Grand Prix
* 1957 British Grand Prix

In [13]:
frenchGP51 = wins[(wins.year == 1951) & (wins.prixName == "French Grand Prix")]

In [14]:
frenchGP51.head()

Unnamed: 0,raceId,year,round,prixName,constructor,constructorRef,forename,surname,grid,position,positionText,positionOrder,points,status
228,828,1951,4,French Grand Prix,Alfa Romeo,alfa,Juan,Fangio,7,1.0,1,1,5.0,Finished
229,828,1951,4,French Grand Prix,Alfa Romeo,alfa,Luigi,Fagioli,7,1.0,1,1,4.0,Finished


According to the [Wikipedia page for this race](https://en.wikipedia.org/wiki/1951_French_Grand_Prix), Juan Fangio took over the car that Luigi Fagioli had started in and drove it to the finish, so both drivers are credited with the win.

In [15]:
argentineGP56 = wins[(wins.year == 1956) & (wins.prixName == "Argentine Grand Prix")]
britishGP57 = wins[(wins.year == 1957) & (wins.prixName == "British Grand Prix")]

In [16]:
argentineGP56.head()

Unnamed: 0,raceId,year,round,prixName,constructor,constructorRef,forename,surname,grid,position,positionText,positionOrder,points,status
1210,784,1956,1,Argentine Grand Prix,Ferrari,ferrari,Luigi,Musso,3,1.0,1,1,5.0,Finished
1211,784,1956,1,Argentine Grand Prix,Ferrari,ferrari,Juan,Fangio,3,1.0,1,1,5.0,Finished


In [17]:
britishGP57.head()

Unnamed: 0,raceId,year,round,prixName,constructor,constructorRef,forename,surname,grid,position,positionText,positionOrder,points,status
1489,780,1957,5,British Grand Prix,Vanwall,vanwall,Stirling,Moss,3,1.0,1,1,5.0,Finished
1490,780,1957,5,British Grand Prix,Vanwall,vanwall,Tony,Brooks,3,1.0,1,1,4.0,Finished


In Argentina, Musso and Fangio shared a car, as did Moss and Brooks in the ’57 British Grand Prix. In each of these cases, the drivers split the points for first place. These are the only three races in this data where this happened.

---

To handle these wins, let's create a slice of the wins that doesn't include driver info.

In [18]:
constructorWins = wins[["year","round","prixName", "constructor", "position"]]

In [19]:
constructorWins.head()

Unnamed: 0,year,round,prixName,constructor,position
0,1950,1,British Grand Prix,Alfa Romeo,1.0
23,1950,2,Monaco Grand Prix,Alfa Romeo,1.0
44,1950,3,Indianapolis 500,Kurtis Kraft,1.0
79,1950,4,Swiss Grand Prix,Alfa Romeo,1.0
97,1950,5,Belgian Grand Prix,Alfa Romeo,1.0


Now we can drop the duplicate rows without worry.

In [20]:
constructorWins = constructorWins.drop_duplicates()

In [21]:
constructorWins.duplicated().value_counts()

False    1004
dtype: int64

This corresponds to the number of races we counted at the beginning. Now we can start grouping and summing to see who had the most wins.

In [22]:
groupedWins = constructorWins.groupby(["year","constructor"]).position.sum()

In [23]:
groupedWins = groupedWins.rename("wins").reset_index().sort_values("year")

In [24]:
groupedWins.head(10)

Unnamed: 0,year,constructor,wins
0,1950,Alfa Romeo,6.0
1,1950,Kurtis Kraft,1.0
2,1951,Alfa Romeo,4.0
3,1951,Ferrari,3.0
4,1951,Kurtis Kraft,1.0
5,1952,Ferrari,7.0
6,1952,Kuzma,1.0
7,1953,Ferrari,7.0
8,1953,Kurtis Kraft,1.0
9,1953,Maserati,1.0
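Summing `position` works here only because every remaining row has `position == 1.0`. A variant that counts rows instead does not depend on that and yields integer win counts. This is a sketch on a hypothetical `toy_constructor_wins` frame, not a change to the cells above:

```python
import pandas as pd

# Hypothetical stand-in for the de-duplicated `constructorWins`.
toy_constructor_wins = pd.DataFrame({
    "year": [1950, 1950, 1950, 1951],
    "round": [1, 2, 3, 4],
    "constructor": ["Alfa Romeo", "Alfa Romeo", "Kurtis Kraft", "Alfa Romeo"],
    "position": [1.0, 1.0, 1.0, 1.0],
})

# Count one row per race won; groupby sorts by (year, constructor) by default.
grouped = (
    toy_constructor_wins
    .groupby(["year", "constructor"])
    .size()
    .reset_index(name="wins")
)
print(grouped)
```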


This `groupedWins` DataFrame holds the number of wins each constructor had in a given season, for constructors that won at least one race. For my analysis, I now need to filter this down to the teams whose drivers won championships. I previously compiled `championship_teams.csv` and put it in the `data/working` folder.

In [25]:
championTeams = pd.read_csv("../data/working/championship_teams.csv")

In [26]:
championTeams.head()

Unnamed: 0,year,constructor
0,1950,Alfa Romeo
1,1951,Alfa Romeo
2,1952,Ferrari
3,1953,Ferrari
4,1954,Mercedes


In [27]:
championTeams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 2 columns):
year           69 non-null int64
constructor    69 non-null object
dtypes: int64(1), object(1)
memory usage: 1.2+ KB


Now let's check where the matches are:

In [28]:
comparison = pd.merge(groupedWins, championTeams, on=["year","constructor"], how="left", indicator="Winner")

In [29]:
comparison.head()

Unnamed: 0,year,constructor,wins,Winner
0,1950,Alfa Romeo,6.0,both
1,1950,Kurtis Kraft,1.0,left_only
2,1951,Alfa Romeo,4.0,both
3,1951,Ferrari,3.0,left_only
4,1951,Kurtis Kraft,1.0,left_only


In [30]:
championWins = comparison[comparison.Winner == "both"]

In [31]:
championWins.head()

Unnamed: 0,year,constructor,wins,Winner
0,1950,Alfa Romeo,6.0,both
2,1951,Alfa Romeo,4.0,both
5,1952,Ferrari,7.0,both
7,1953,Ferrari,7.0,both
11,1954,Mercedes,4.0,both


In [32]:
championWins.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 69 entries, 0 to 254
Data columns (total 4 columns):
year           69 non-null int64
constructor    69 non-null object
wins           69 non-null float64
Winner         69 non-null category
dtypes: category(1), float64(1), int64(1), object(1)
memory usage: 2.3+ KB


`championTeams` and `championWins` have the same length, so things seem to be working out. I will save this as a CSV for plotting and further comparison.
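The length check matters because a left merge from `groupedWins` would silently drop any champion team that has no row there, for example a team with zero wins or a constructor name spelled differently in the two files. A hedged sketch of a more explicit check, using an outer merge on hypothetical toy frames (`toy_grouped`, `toy_champions`):

```python
import pandas as pd

# Hypothetical stand-ins for `groupedWins` and `championTeams`.
toy_grouped = pd.DataFrame({
    "year": [1950, 1950, 1951, 1951],
    "constructor": ["Alfa Romeo", "Kurtis Kraft", "Alfa Romeo", "Ferrari"],
    "wins": [6.0, 1.0, 4.0, 3.0],
})
toy_champions = pd.DataFrame({
    "year": [1950, 1951],
    "constructor": ["Alfa Romeo", "Alfa Romeo"],
})

# An outer merge keeps unmatched rows from both sides and labels their origin.
check = pd.merge(
    toy_grouped, toy_champions,
    on=["year", "constructor"], how="outer", indicator="Winner",
)

# "right_only" rows would be champions with no matching wins row.
missing = check[check["Winner"] == "right_only"]
champion_wins = check[check["Winner"] == "both"]

assert missing.empty
assert len(champion_wins) == len(toy_champions)
```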

In [33]:
championWins.to_csv("../data/processed/season_wins.csv", index=False, mode="w+")

Because the number of races in each season changes, we should normalize the number of wins by the number of races in each season.

In terms of implementing this, a function should take a row from `championWins`, take the `year`, and then find the max of rounds from the `results` DataFrame.
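A minimal sketch of that idea follows. The frames `toy_results` and `toy_champion_wins`, the function `win_rate`, and the column names `races` and `winRate` are hypothetical, and the round numbers are illustrative values rather than figures taken from `master_results.csv`.

```python
import pandas as pd

# Hypothetical stand-ins for `results` and `championWins` (illustrative values).
toy_results = pd.DataFrame({
    "year": [1950, 1950, 1950, 1951, 1951],
    "round": [1, 2, 7, 1, 8],
})
toy_champion_wins = pd.DataFrame({
    "year": [1950, 1951],
    "constructor": ["Alfa Romeo", "Alfa Romeo"],
    "wins": [6.0, 4.0],
})

# Highest round number per year, used as the number of races in that season.
races_per_season = toy_results.groupby("year")["round"].max()


def win_rate(row):
    """Return the share of a season's races won, for one row of the wins table."""
    return row["wins"] / races_per_season.loc[row["year"]]


# Look up the season length, then apply the row-wise function described above.
toy_champion_wins["races"] = toy_champion_wins["year"].map(races_per_season)
toy_champion_wins["winRate"] = toy_champion_wins.apply(win_rate, axis=1)
print(toy_champion_wins)
```

Computing `races_per_season` once, outside the function, avoids re-scanning `results` for every row; dividing `wins` by the mapped `races` column directly would give the same numbers without `apply`.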