# Analyzing F1 results

The question I'm trying to answer with my visualization project is: **"Who had the best championship season?"** I'm focusing primarily on the constructors rather than the drivers, although the drivers could be a more interesting angle to dive into.

To answer this, I'm looking at:

1. Wins in the season
1. Overall Podiums in the season
1. One-Two finishes

Of the three, the One-Two finishes give the best idea of performance because a one-two requires both cars, both drivers, and the team to deliver in the same race.
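As a rough illustration of how the three measures could be computed per constructor and season, here is a minimal sketch. It runs on a made-up toy table, not the real data; the column names mirror the results file, while `toy`, `season_summary`, and the intermediate column names are hypothetical. It also assumes one row per finishing position, so it would need adjusting if two rows share a position in a race.

```python
import pandas as pd

# Toy data, invented for illustration: two races, two constructors.
toy = pd.DataFrame({
    "year":          [2000, 2000, 2000, 2000, 2000, 2000],
    "raceId":        [1, 1, 1, 2, 2, 2],
    "constructor":   ["A", "A", "B", "B", "A", "A"],
    "positionOrder": [1, 2, 3, 1, 2, 3],
})

def season_summary(results):
    """Count wins, podiums, and one-two finishes per constructor per season."""
    df = results.assign(
        win=results["positionOrder"].eq(1),
        podium=results["positionOrder"].le(3),
        top_two=results["positionOrder"].le(2),
    )
    # Aggregate per constructor per race first: a one-two means the same
    # constructor holds both of the top two places in that race.
    per_race = df.groupby(["year", "constructor", "raceId"]).agg(
        wins=("win", "sum"),
        podiums=("podium", "sum"),
        top_two=("top_two", "sum"),
    )
    per_race["one_twos"] = per_race["top_two"].eq(2).astype(int)
    # Then roll the per-race counts up to the season level.
    cols = ["wins", "podiums", "one_twos"]
    return per_race.groupby(level=["year", "constructor"])[cols].sum()

summary = season_summary(toy)
print(summary)
```

Aggregating per race before summing per season is what makes the one-two count possible: it is a property of a constructor's result in a single race, not of individual rows.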

---

In [1]:
import pandas as pd
import numpy as np

## IDEA 1: Wins in the season

I think this is the roughest way to look at the results. Here I'm only looking at the rows where `positionOrder == 1`.

In [2]:
results = pd.read_csv("../data/working/master_results.csv")

In [3]:
results.head()

| Unnamed: 0 | raceId | year | round | prixName | constructor | constructorRef | forename | surname | grid | position | positionText | positionOrder | points | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 833 | 1950 | 1 | British Grand Prix | Alfa Romeo | alfa | Nino | Farina | 1 | 1.0 | 1 | 1 | 9.0 | Finished |
| 1 | 833 | 1950 | 1 | British Grand Prix | Alfa Romeo | alfa | Luigi | Fagioli | 2 | 2.0 | 2 | 2 | 6.0 | Finished |
| 2 | 833 | 1950 | 1 | British Grand Prix | Alfa Romeo | alfa | Reg | Parnell | 4 | 3.0 | 3 | 3 | 4.0 | Finished |
| 3 | 833 | 1950 | 1 | British Grand Prix | Talbot-Lago | lago | Yves | Cabantous | 6 | 4.0 | 4 | 4 | 3.0 | +2 Laps |
| 4 | 833 | 1950 | 1 | British Grand Prix | Talbot-Lago | lago | Louis | Rosier | 9 | 5.0 | 5 | 5 | 2.0 | +2 Laps |


First let's check that we have the right number of races.

In [4]:
races1 = results.groupby(["year","round"])

In [5]:
len(races1)

1004

In [7]:
races2 = results.groupby("raceId")

In [9]:
len(races2)

1004

Counting the races two different ways, by (`year`, `round`) and by `raceId`, gives 1004 both times, which is the total number of races that have happened in F1.
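The same cross-check can be done without building the groupby objects and calling `len` on them. The sketch below uses a small invented table (`toy` and the variable names are hypothetical) and additionally counts the distinct (`raceId`, `year`, `round`) triples; if all three counts agree, each `raceId` corresponds to exactly one (`year`, `round`) pair.

```python
import pandas as pd

# Toy data, invented for illustration: three races, five result rows.
toy = pd.DataFrame({
    "raceId": [10, 10, 11, 11, 12],
    "year":   [1950, 1950, 1950, 1950, 1951],
    "round":  [1, 1, 2, 2, 1],
})

# Number of distinct race ids.
n_by_id = toy["raceId"].nunique()

# Number of distinct (year, round) pairs.
n_by_round = toy.groupby(["year", "round"]).ngroups

# Number of distinct (raceId, year, round) triples.
n_triples = len(toy[["raceId", "year", "round"]].drop_duplicates())

print(n_by_id, n_by_round, n_triples)
```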

In [10]:
wins = results[(results.positionOrder == 1)]

In [12]:
wins.head()

| Unnamed: 0 | raceId | year | round | prixName | constructor | constructorRef | forename | surname | grid | position | positionText | positionOrder | points | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 833 | 1950 | 1 | British Grand Prix | Alfa Romeo | alfa | Nino | Farina | 1 | 1.0 | 1 | 1 | 9.0 | Finished |
| 23 | 834 | 1950 | 2 | Monaco Grand Prix | Alfa Romeo | alfa | Juan | Fangio | 1 | 1.0 | 1 | 1 | 9.0 | Finished |
| 44 | 835 | 1950 | 3 | Indianapolis 500 | Kurtis Kraft | kurtis_kraft | Johnnie | Parsons | 5 | 1.0 | 1 | 1 | 9.0 | Finished |
| 79 | 836 | 1950 | 4 | Swiss Grand Prix | Alfa Romeo | alfa | Nino | Farina | 2 | 1.0 | 1 | 1 | 9.0 | Finished |
| 97 | 837 | 1950 | 5 | Belgian Grand Prix | Alfa Romeo | alfa | Juan | Fangio | 2 | 1.0 | 1 | 1 | 8.0 | Finished |


In [16]:
wins.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1007 entries, 0 to 24320
Data columns (total 14 columns):
raceId            1007 non-null int64
year              1007 non-null int64
round             1007 non-null int64
prixName          1007 non-null object
constructor       1007 non-null object
constructorRef    1007 non-null object
forename          1007 non-null object
surname           1007 non-null object
grid              1007 non-null int64
position          1007 non-null float64
positionText      1007 non-null object
positionOrder     1007 non-null int64
points            1007 non-null float64
status            1007 non-null object
dtypes: float64(2), int64(5), object(7)
memory usage: 118.0+ KB


There is a discrepancy: we have 1007 winners but only 1004 races. I don't know what to make of this right now.

In [22]:
wins.duplicated().value_counts()

False    1007
dtype: int64

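None of the rows are exact duplicates, but `duplicated()` with no arguments only flags rows that match in every column, so this doesn't rule out two different drivers being credited with the same race. I'll save them to a csv and see if Bjarni can weigh in on this.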
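One way to narrow down the extra rows might be to look for `raceId` values that appear more than once among the winners. The sketch below uses an invented table (`toy_wins` and the driver names are hypothetical) to show the difference between a full-row check and a check on `raceId` alone. One possibility worth checking, though it is only a guess here, is shared drives, where two drivers are credited with the same car's win.

```python
import pandas as pd

# Toy data, invented for illustration: race 2 has two credited winners.
toy_wins = pd.DataFrame({
    "raceId":        [1, 2, 2, 3],
    "surname":       ["Driver A", "Driver B", "Driver C", "Driver A"],
    "positionOrder": [1, 1, 1, 1],
})

# Full-row check: finds nothing, because the surnames differ.
full_row_dupes = int(toy_wins.duplicated().sum())

# Check on raceId only; keep=False marks every row of a repeated race,
# not just the second and later occurrences.
shared = toy_wins[toy_wins.duplicated(subset="raceId", keep=False)]

# Winner rows beyond one per race; on the real data this would be
# the gap between the number of winners and the number of races.
extra_rows = len(toy_wins) - toy_wins["raceId"].nunique()

print(full_row_dupes, extra_rows)
print(shared)
```

Running the equivalent of `shared` on the real `wins` frame should list exactly the races that account for the extra winners, which would be a useful thing to attach alongside the csv.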

In [None]:
wins.to_csv("../data/working/winners.csv")