# Cleaning f1db data

This notebook cleans the set of CSV tables from the [Ergast Developer API](https://ergast.com/mrd/db/). I download the raw CSV tables and process them here.

Some things to note about the data:

* The tables are not chronologically organized in most cases. Races seem to start around Hamilton's debut with McLaren. Part of my initial cleaning and preparation is to reorganize them chronologically.
* In the standings table, a new entry is not added after a race if the driver's standing does not change.
* When a cell is `\N`, it means the cell is null, or the value wasn't recorded. This largely applies to fields for which there was no data in the beginning.
* Looking at the original `raceId`, the numbering is offset: the 2019 Azerbaijan GP was the 1001st race, but its id in the data is 1013. I assume the counting was pushed ahead somewhere in the middle, but things are still okay.
* The `drivers` table is actually named `driver.csv`.
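
To get a feel for how widespread the `\N` markers are before replacing them, a rough sketch like the one below counts them per column. The rows and the shortened column list are made up for illustration and are not real records from `driver.csv`. Note that the marker has to be written as `"\\N"` in Python source, since a bare `\N` inside a string literal is read as the start of a named Unicode escape.

```python
import io
import pandas as pd

# Made-up rows in the rough shape of the drivers table (not real records)
raw = io.StringIO(
    "1,driver_a,44,AAA\n"
    "2,driver_b,\\N,\\N\n"
    "3,driver_c,\\N,CCC\n"
)

# Hypothetical, shortened header just for this example
df = pd.read_csv(raw, names=["driverId", "driverRef", "number", "code"])

# Compare every cell (as text) against the literal two-character marker \N
null_counts = df.astype(str).eq("\\N").sum()
print(null_counts)
```

Columns with a non-zero count are the ones that come out as blank cells after the replacement step.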


**I last downloaded this data on 19-06-10.**

In [1]:
import pandas as pd
import numpy as np

The first thing we'll do is add headers to each table based on this [schema text file](http://ergast.com/schemas/f1db_schema.txt). I will also replace the `\N` null markers in the same step.

In [18]:
baseurl = "../data/f1db_csv/"
destination = "../data/working/"

#names correspond to the csv files and not the table names given at the top of the schema text file
tables = ["circuits", "constructor_results", "constructor_standings", "constructors", "driver_standings", 
          "driver", "lap_times", "pit_stops", "qualifying", "races", "results", "seasons", "status"]

# all the headers are grouped by table and put into one big list
headers = [["circuitId", "circuitRef", "name", "location", "country","lat", "lng", "alt", "url"], #circuits
          ["constructorResultsId", "raceId", "constructorId", "points", "status"], #constructor_results
          ["constructorStandingsId", "constructorId", "points", "position", "positionText", "wins"], #constructor_standings
          ["constructorId", "constructorRef", "name", "nationality", "url"], #constructors
          ["driverStandingsId", "raceId", "driverId", "points", "position", "positionText", "wins"], #driver_standings
          ["driverId", "driverRef", "number", "code", "forename", "surname", "dob", "nationality", "url" ], #drivers
          ["raceId", "driverId", "lap", "position", "time", "milliseconds"], #lap_times
          ["raceId", "driverId", "stop", "lap", "time", "duration", "milliseconds"], #pit_stops
          ["qualifyId", "raceId", "driverId", "constructorId", "number",
           "position", "q1", "q2","q3"], #qualifying
          ["raceId", "year", "round", "circuitId", "name", "date", "time", "url"], #races
          ["resultId", "raceId", "driverId", "constructorId", "number", "grid", "position",
           "positionText", "positionOrder", "points", "laps", "time", "milliseconds",
           "fastestLap", "rank", "fastestLapTime", "fastestLapSpeed", "statusId"], #results
          ["year","url"], #seasons
          ["statusId", "status"] #status
          ]

for table, header in zip(tables, headers):
    #pair each table name with its header list
    
    #import the csv file associated with the table and add in the header to it
    df = pd.read_csv(baseurl+table+".csv",names=header)
    #get rid of the null characters
    df = df.replace("\\N","")
    #export back to a csv in the working folder
    df.to_csv(destination+table+".csv", index=False, mode="w+")
    
    #status updates
    print(table,"done", sep=" --> ")

circuits --> done
constructor_results --> done
constructor_standings --> done
constructors --> done
driver_standings --> done
driver --> done
lap_times --> done
pit_stops --> done
qualifying --> done
races --> done
results --> done
seasons --> done
status --> done
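
The loop above reports "done" even if a header list has the wrong number of names, because `pd.read_csv` does not raise when `names` is shorter than the number of columns in the file. A small sanity check, sketched here under the assumption that the raw files have no header row, compares the raw column count against the header length. The helper name `column_count_matches` and the sample row are hypothetical.

```python
import io
import pandas as pd

def column_count_matches(csv_source, header):
    """Return True when a headerless CSV has exactly len(header) columns."""
    # Read one row without names so pandas reports the raw column count
    first_row = pd.read_csv(csv_source, header=None, nrows=1)
    return first_row.shape[1] == len(header)

# Made-up row in the shape of the status table
sample = "1,Finished\n"

ok = column_count_matches(io.StringIO(sample), ["statusId", "status"])
too_short = column_count_matches(io.StringIO(sample), ["statusId"])
print(ok, too_short)
```

In the notebook, the same check could be called with `baseurl+table+".csv"` and `header` inside the loop before the file is written out.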


I could sort the drivers and the races chronologically, as I had done before, but I don't think that matters much for my analysis.
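
If chronological order does turn out to matter, one possible approach for the races table is to sort on `year` and `round`. This is only a sketch: the rows below are illustrative stand-ins rather than an extract of the real `races.csv`, and the `chronological_order` column is a name I made up for the running race count.

```python
import pandas as pd

# Illustrative rows only; the real table has more columns and many more races
races = pd.DataFrame({
    "raceId": [1010, 18, 1013],
    "year": [2019, 2008, 2019],
    "round": [1, 1, 4],
})

# Sort by season, then by round within the season
races_sorted = races.sort_values(["year", "round"]).reset_index(drop=True)

# Hypothetical helper column: position of each race in chronological order
races_sorted["chronological_order"] = range(1, len(races_sorted) + 1)
print(races_sorted)
```

A column like this would also make the offset between `raceId` and the true race count easy to inspect.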

---

If there is anything else I need to clean across the board, I will come back to this notebook.