# Formula 1 Championship Drivers

I'm going to look at data from Formula 1 races from 1950 to the present. I got the data from the [Ergast Developer API](http://ergast.com/mrd/), which covers every race from 1950 onward. The data is stored across several relational tables, so I'll be doing some filtering and joining later.


Some things to note about the data: 

* I last downloaded this data on 2019-04-30, and it includes the 2019 Azerbaijan GP. I don't think the 2019 races are relevant to the analysis, because I'll be looking at past championship winners.
* Most of the tables are not organized chronologically. Races seem to start around Hamilton's debut with McLaren. Part of my initial cleaning and preparation is to re-organize this: drivers by date of birth, and everything related to races by the time the races occurred.
* In the standings table, a new entry is not added after a race if the driver's standing does not change.
* When a cell is `\N`, it means the cell is null or the value wasn't recorded. This mostly applies to fields for which there was no data in the beginning.
* Something happened with the original `raceId`: the 2019 Azerbaijan GP was the 1001st race, but its id in the data is 1014. I assume the counting was pushed ahead somewhere in the middle, but otherwise things seem to line up. **I'll come back to look at this later**.
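The `\N` markers and the gap in `raceId` noted above can both be checked quickly in pandas. This is a minimal sketch, not the loading code used for this analysis: the three rows are made up, and the column names are placeholders for whatever `f1db_schema.txt` actually lists.

```python
import io
import pandas as pd

# Made-up sample in the raw layout: no header row, `\N` for missing values.
raw_csv = io.StringIO(r"""1,2009,1,2009-03-29,06:00:00
2,2009,2,2009-04-05,09:00:00
5,1950,1,1950-05-13,\N
""")

# Placeholder column names; the real ones come from the schema file.
columns = ["raceId", "year", "round", "date", "time"]

# Treat `\N` as a missing value while reading.
races = pd.read_csv(raw_csv, header=None, names=columns, na_values=[r"\N"])

# List the ids skipped between the smallest and largest raceId.
present = set(races["raceId"])
first_id, last_id = int(races["raceId"].min()), int(races["raceId"].max())
missing_ids = [i for i in range(first_id, last_id + 1) if i not in present]

print(races["time"].isna().sum(), "missing time value(s)")
print("skipped ids:", missing_ids)
```

Run against the real races table, `missing_ids` would show where the numbering jumps, which might explain the difference between 1001 and 1014.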

## Cleaning the data — part 1

The original data is saved (sans zip) in `data/f1db_raw`. I made a copy of the folder and renamed it to `f1db_excelPrep`.

The first cleaning step is to go through every table in `data/f1db_excelPrep/` and:
* add the headings for each table, using [f1db_schema.txt](../data/f1db_schema.txt) as a guideline
* remove the `\N` wherever it appears
* format dates as yyyy-mm-dd, and create separate columns for month, day, and year
* sort drivers and races chronologically, and create new temporary driver and race ids

I'll do this in Excel and save the files in the folder `data/f1db_excelPrep/`.
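For reference, the date and sorting steps could also be done in pandas instead of Excel. The sketch below is an alternative under assumptions, not what was actually run: the rows are made up, and `newRaceId` is a hypothetical name for the temporary id column.

```python
import pandas as pd

# Made-up rows standing in for the races table after headings were added.
races = pd.DataFrame({
    "raceId": [1, 2, 3],
    "name": ["Race C", "Race A", "Race B"],
    "date": ["2009-03-29", "1950-05-13", "1950-05-21"],
})

# Parse the yyyy-mm-dd strings, then split them into year, month, and day.
races["date"] = pd.to_datetime(races["date"], format="%Y-%m-%d")
races["year"] = races["date"].dt.year
races["month"] = races["date"].dt.month
races["day"] = races["date"].dt.day

# Sort chronologically and number the rows from 1 to get a temporary id.
races = races.sort_values("date", kind="stable").reset_index(drop=True)
races["newRaceId"] = races.index + 1

print(races[["raceId", "newRaceId", "date"]])
```

Keeping the original `raceId` next to the new one gives a lookup table for updating the other tables later.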

During this I noticed a `D` status in the constructor results, and I don't know what it means. **I will look into this later.**

## Cleaning the data — part 2

In the first part of cleaning, I re-organized the `driverId` and `raceId` values chronologically. Now I want to pass the new ids to the other tables so that everything is chronological.

What do I need to update for each table?

* circuits: *nothing*
* constructor_results: raceId
* constructor_standings: raceId
* constructors: *nothing*
* driver_standings: raceId, driverId
* driver: *nothing*
* lap_times: raceId, driverId
* pit_stops: raceId, driverId
* qualifying: raceId, driverId
* races: *nothing*
* results: raceId, driverId
* seasons: *nothing*
* status: *nothing*

For the tables that don't need to be updated, I'll make direct copies and put them in `data/f1db_working/`, which will hold the tables I use for analysis.
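One way to pass the new ids to a table that references them is to build an old-to-new lookup and apply it with `Series.map`. This is a minimal sketch with made-up data; `newRaceId` and `newDriverId` are hypothetical column names for the temporary ids created in part 1.

```python
import pandas as pd

# Made-up lookup tables: original id next to the new chronological id.
races = pd.DataFrame({"raceId": [1, 2, 3], "newRaceId": [3, 1, 2]})
drivers = pd.DataFrame({"driverId": [10, 20], "newDriverId": [2, 1]})

# Made-up slice of a table that references both ids, such as results.
results = pd.DataFrame({
    "raceId": [1, 1, 2, 3],
    "driverId": [10, 20, 20, 10],
    "points": [10.0, 8.0, 10.0, 6.0],
})

# Series indexed by the old id, holding the new id.
race_map = races.set_index("raceId")["newRaceId"]
driver_map = drivers.set_index("driverId")["newDriverId"]

# Replace the old ids with the new ones.
results["raceId"] = results["raceId"].map(race_map)
results["driverId"] = results["driverId"].map(driver_map)

# An id with no match in the lookup would become NaN, so check for that.
assert results[["raceId", "driverId"]].notna().all().all()

# Put the table itself in chronological order.
results = results.sort_values(["raceId", "driverId"]).reset_index(drop=True)
print(results)
```

The same two `map` calls would apply to each table in the list above that references `raceId`, `driverId`, or both.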


```python
import numpy as np
import pandas as pd
```