# Data Handling

The goal of this notebook is to inspect the incoming HTML data files that the application will ingest.
The data will be restructured into a pandas dataframe, which lets us illustrate how various rules can be applied to racers' timing data to deduce their grid positions in the finals.

The grid for a race is not determined solely by lap time data. In total there are three factors.

1. Lap times; shorter lap times relative to other drivers result in a better overall grid position.


2. Overseer's context; the race overseer applies rules or corrections so that the race data reflects more than the raw timing data. Drivers in breach of regulations face a time penalty, which results in a lower overall grid position.


3. Novice drivers; very inexperienced drivers may be required to use the novice grid, which sits behind the main grid for safety. After a few races they can join the main grid, or optionally waive their position on the main grid and stay on the novice grid.


## Contents
1. Import Dependencies
2. Accessing Sample Data
3. Preprocessing Data
4. Preparing Data

## 1. Import Dependencies

In [1]:
import sys
from os import getcwd
from os.path import dirname, join

import pandas as pd
from IPython.display import Code

# Jupyter kernel trick: add the project root to sys.path so local packages and modules can be imported.
project_root = dirname(getcwd())
sys.path.append(project_root)

from kartingData.sample_data_api import load_heat1, load_heat2
from kartingData.prepare_data import prepare_dataframe



## 2. Accessing Sample Data

The data is stored locally in the project folder and is not uploaded to git. A simple access module was created so that heat and qualification data can be loaded with a single call, e.g. `load_heat1()`.

In [2]:
df=load_heat1()

Code("sample_data_api.py")

## 3. Preprocessing Data

You may be wondering about the odd names and the blank values. The original name field was overridden, for fun, to return a Star Wars nickname for each driver, but only after any relevant racing information had been extracted from it.
e.g. OCALLAGHAN ANDREW (N) -> chewbacca

The original name field contains the driver's first and last name, but also other material information. The preprocessing step produces new columns from it and then overrides the original name column:

1. Novice : flagged by '(N)' in the driver's original display name.


2. Race Event : when 'No.' is NaN, 'Name' contains the name of a race event, e.g. 'Warmup Flag', 'Run Stopped' etc.


3. Name : the driver's name is overridden with a Star Wars nickname, e.g. OCALLAGHAN ANDREW (N) -> chewbacca.
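
The three steps listed above can be sketched roughly as follows. This is an illustration only and not the contents of `preprocessing.py`: the function name `preprocess_names`, the nickname mapping and the sample rows (including the placeholder name `DOE JOHN`) are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def preprocess_names(df, nickname_for):
    """Hypothetical sketch: derive Novice and RaceEvent, then mask driver names."""
    out = df.copy()
    # 1. Novice: flagged by a literal '(N)' in the original display name.
    out["Novice"] = out["Name"].str.contains("(N)", regex=False, na=False)
    # 2. Race Event: rows without a kart number carry an event name in 'Name'.
    is_event = out["No."].isna()
    out["RaceEvent"] = np.where(is_event, out["Name"], "")
    # 3. Name: swap the real name for a nickname; event rows get a blank name.
    out["Name"] = np.where(is_event, "", out["Name"].map(nickname_for))
    return out


# Made-up sample rows; only the chewbacca mapping comes from the text above.
sample = pd.DataFrame({
    "No.": [1.0, np.nan, 2.0],
    "Name": ["DOE JOHN", "Warmup Flag", "OCALLAGHAN ANDREW (N)"],
})
nicknames = {"DOE JOHN": "muzzer", "OCALLAGHAN ANDREW (N)": "chewbacca"}
result = preprocess_names(sample, nicknames)
print(result)
```

The novice flag is read before the name is replaced, which mirrors the point made above: the racing information has to be extracted before the original name is discarded.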


In [3]:
Code("preprocessing.py")

In [4]:
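import numpy as np

def head_and_tail(df):
    # First five rows and last five rows.
    return pd.concat([df.head(), df.tail()])

# Not displayed: a notebook cell only renders its last expression.
head_and_tail(df)
# Rows without a kart number ('No.') are race events; their 'Name' holds the event name.
df.assign(RaceEvent=np.where(df['No.'].notnull(), '', df['Name']))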

Unnamed: 0,#,No.,Name,Laps,Lead,Lap Tm,Spd,Elapsed Tm,Passing Tm,Hits,Strength,Noise,Photocell Time,Transponder,Backup Tx,Backup Passing Tm,Class,Deleted,Novice,RaceEvent
0,1,,,,,,,,10:18:54.484,,,,,,,,,,False,
1,2,73.0,rune_haako,0.0,0.0,,0.0,,10:19:02.125,458.0,76.0,26.0,,33879.0,0.0,,Bam,No,False,
2,3,27.0,miraj_scintel,0.0,0.0,,0.0,,10:19:05.192,88.0,52.0,26.0,,2742019.0,0.0,,Bam,No,False,
3,4,12.0,muzzer,0.0,0.0,,0.0,,10:19:07.200,285.0,87.0,26.0,,6754955.0,0.0,,Bam,No,False,
4,5,,,,,,,0.000,10:19:16.949,,,,,,,,,,False,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,63,86.0,si_treemba,7.0,7.0,1:10.979,76.1,8:20.877,10:27:37.826,58.0,133.0,23.0,,3432104.0,0.0,,Bam,No,True,
63,64,44.0,owen_lars,6.0,7.0,1:37.072,55.6,8:27.103,10:27:44.052,126.0,136.0,23.0,,1685552.0,0.0,,Bam,No,True,
64,65,34.0,onimi,7.0,7.0,1:12.352,74.6,8:35.904,10:27:52.853,65.0,128.0,21.0,,91301.0,0.0,,Bam,No,False,
65,66,12.0,muzzer,7.0,7.0,1:16.471,70.6,8:47.762,10:28:04.711,68.0,131.0,21.0,,6754955.0,0.0,,Bam,No,False,


## 4. Preparing Data

Now the data can be assessed and reduced as necessary. 

1. Most columns are logically irrelevant to the grid algorithm and can be removed.

2. Most rows are valid and represent a lap for a driver. We'll keep those.

3. Other rows signify race events such as warmup, start and finish. The NaN values they introduce stop us from casting columns easily, so these rows will be removed.

4. The race overseer has filled in a 'Deleted' column. Hopefully that captures most of the invalid data a driver produces.
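
As a rough sketch of the reduction described in the list above (not the actual `prepare_dataframe`, which is displayed further down), the rules might be expressed as below. The function name `reduce_heat`, the assumption that 'Deleted' holds 'Yes' or 'No', and the deleted sample row are all hypothetical.

```python
import numpy as np
import pandas as pd

# Columns assumed to matter to the grid algorithm.
KEEP = ["No.", "Name", "Laps", "Lap Tm", "Novice"]


def reduce_heat(df):
    """Hypothetical sketch of the reduction; not the real prepare_dataframe."""
    # Rows without a kart number are race events; rows without a lap time are not timed laps.
    laps = df.dropna(subset=["No.", "Lap Tm"])
    # Respect the overseer's 'Deleted' column (assumed here to hold 'Yes' or 'No').
    laps = laps[laps["Deleted"] != "Yes"]
    # Keep only the needed columns, then cast the numeric ones to compact integers.
    return laps[KEEP].astype({"No.": "int16", "Laps": "int16"})


sample = pd.DataFrame({
    "No.": [np.nan, 73.0, 86.0, 44.0],
    "Name": ["", "rune_haako", "si_treemba", "owen_lars"],
    "Laps": [np.nan, 0.0, 7.0, 6.0],
    "Lap Tm": [None, None, "1:10.979", "1:37.072"],
    "Deleted": [None, "No", "No", "Yes"],  # the 'Yes' row is invented for illustration
    "Novice": [False, False, True, True],
})
reduced = reduce_heat(sample)
print(reduced)
```

Dropping the event rows first is what makes the integer cast possible, since a column containing NaN cannot be cast to `int16`.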

Some commands will be used just to gain insight into the data.


In [5]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   #                  67 non-null     int64  
 1   No.                63 non-null     float64
 2   Name               67 non-null     object 
 3   Laps               63 non-null     float64
 4   Lead               63 non-null     float64
 5   Lap Tm             47 non-null     object 
 6   Spd                63 non-null     float64
 7   Elapsed Tm         58 non-null     object 
 8   Passing Tm         67 non-null     object 
 9   Hits               63 non-null     float64
 10  Strength           63 non-null     float64
 11  Noise              63 non-null     float64
 12  Photocell Time     0 non-null      float64
 13  Transponder        63 non-null     float64
 14  Backup Tx          63 non-null     float64
 15  Backup Passing Tm  0 non-null      float64
 16  Class              63 non-nu

In [6]:
df.describe()

Unnamed: 0,#,No.,Laps,Lead,Spd,Hits,Strength,Noise,Photocell Time,Transponder,Backup Tx,Backup Passing Tm
count,67.0,63.0,63.0,63.0,63.0,63.0,63.0,63.0,0.0,63.0,63.0,0.0
mean,34.0,43.492063,3.52381,3.555556,57.136508,139.825397,136.349206,25.365079,,2783072.0,0.0,
std,19.485037,27.694109,2.17654,2.212704,33.939217,97.884649,21.515305,6.350772,,2234730.0,0.0,
min,1.0,6.0,0.0,0.0,0.0,45.0,52.0,16.0,,33879.0,0.0,
25%,17.5,19.5,1.5,1.5,27.8,78.0,131.5,23.0,,888426.5,0.0,
50%,34.0,34.0,3.0,3.0,75.0,107.0,135.0,23.0,,2742019.0,0.0,
75%,50.5,69.5,5.0,5.5,80.4,129.5,137.5,29.0,,4352596.0,0.0,
max,67.0,86.0,7.0,7.0,82.2,458.0,183.0,36.0,,6754955.0,0.0,


In [7]:
df.memory_usage(deep=True)

Index                 128
#                     536
No.                   536
Name                 4466
Laps                  536
Lead                  536
Lap Tm               3695
Spd                   536
Elapsed Tm           4055
Passing Tm           4623
Hits                  536
Strength              536
Noise                 536
Photocell Time        536
Transponder           536
Backup Tx             536
Backup Passing Tm     536
Class                3908
Deleted              3850
Novice                 67
RaceEvent            3988
dtype: int64

In [8]:
# The code is stored in a module for easy reuse in the wider application.
# The processing methods are important to see here, so we display the module (read-only).
Code("prepare_data.py")

In [9]:
df = prepare_dataframe(df)
head_and_tail(df)

Unnamed: 0,No.,Name,Laps,Lap Tm,Novice
18,6,antinnis_tremayne,2,1:07.453,False
19,66,deliah_blue,2,1:08.257,False
20,73,rune_haako,2,1:08.255,False
21,27,miraj_scintel,2,1:09.138,False
22,86,si_treemba,2,1:10.629,True
61,27,miraj_scintel,7,1:06.626,False
62,86,si_treemba,7,1:10.979,True
63,44,owen_lars,6,1:37.072,True
64,34,onimi,7,1:12.352,False
65,12,muzzer,7,1:16.471,False


In [10]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47 entries, 18 to 65
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   No.     47 non-null     int16 
 1   Name    47 non-null     object
 2   Laps    47 non-null     int16 
 3   Lap Tm  47 non-null     object
 4   Novice  47 non-null     bool  
dtypes: bool(1), int16(2), object(2)
memory usage: 6.7 KB


In [11]:
df.memory_usage(deep=True)

Index      376
No.         94
Name      3156
Laps        94
Lap Tm    3055
Novice      47
dtype: int64

In [12]:
#Now do the same with heat2 with one simple command.
df2 = load_heat2().pipe(prepare_dataframe)
head_and_tail(df2)

Unnamed: 0,No.,Name,Laps,Lap Tm,Novice
18,6,antinnis_tremayne,2,1:06.025,False
19,66,deliah_blue,2,1:06.613,False
20,73,rune_haako,2,1:06.459,False
21,27,miraj_scintel,2,1:06.209,False
22,86,si_treemba,2,1:11.078,True
52,73,rune_haako,6,1:06.539,False
53,66,deliah_blue,6,1:06.906,False
54,44,owen_lars,5,1:20.658,True
55,12,muzzer,6,1:12.137,False
56,34,onimi,6,1:12.242,False


### Now we've reduced the heat data to the bare essentials.



## Timing Data
The karts carry transponders that detect when the driver completes a lap by passing a specified section of the race track.
From this, the lap times are easily derived.
The transponder therefore provides a one-dimensional view of the racing data.
In isolation, the transponder timing data is fallible.
For instance, a driver may record artificially low lap times by taking illegal shortcuts.
Alternatively, the driver may drive in breach of regulations; for example, the combined weight of kart and driver may be too low for the weight category. Additional context needs to be superimposed over the timing data to validate or correct it.
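
The 'Lap Tm' column holds lap times as text (e.g. `1:07.453`), so they need to be converted to numbers before drivers can be compared. A minimal sketch follows, assuming every value has the form `minutes:seconds` or plain seconds; the helper name `lap_time_to_seconds` is hypothetical and is not part of the project.

```python
def lap_time_to_seconds(text):
    """Convert a lap time such as '1:07.453' (minutes:seconds) to seconds."""
    minutes, _, seconds = text.rpartition(":")
    # Assumption: a time under a minute may have no minutes part, e.g. '58.120'.
    return (int(minutes) if minutes else 0) * 60 + float(seconds)


# Lap 2 times from heat 1, taken from the prepared dataframe above.
laps = ["1:07.453", "1:08.257", "1:08.255"]
quickest = min(laps, key=lap_time_to_seconds)
print(quickest, lap_time_to_seconds(quickest))
```

Comparing the raw strings would happen to work for these three values, but it breaks as soon as the minute counts have different widths, which is why the conversion is done explicitly.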

## Overseer's Context
The race overseer applies this additional context to the data.
They curate it so that it tells a more complete story of the heats and races.
The overseer receives radio feedback from the marshals and flagman, who may report various racing incidents.
The overseer can then respond to these incidents by applying time penalties, discounting laps or disqualifying drivers.
Furthermore, the overseer can instruct the marshals to use flags to communicate with drivers.
When the karts return, the post-race or post-heat checks are performed.
The karts are weighed, the wheels inspected, the fuel tested and the kart nose cones checked.
If any check fails, the driver is sanctioned and a penalty may be applied to their time.
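
To make the effect of a sanction concrete, here is a small sketch of how time penalties and disqualifications could be layered over lap times. The function name `apply_sanctions`, the 2 second penalty and the choice of sanctioned driver are invented for illustration; the source data does not state actual penalty amounts.

```python
def apply_sanctions(lap_seconds, penalties, disqualified=()):
    """Add penalty seconds to each driver's time and drop disqualified drivers."""
    return {
        driver: seconds + penalties.get(driver, 0.0)
        for driver, seconds in lap_seconds.items()
        if driver not in disqualified
    }


# Lap 2 times from heat 1, in seconds, used as stand-ins for each driver's best lap.
lap_seconds = {"antinnis_tremayne": 67.453, "rune_haako": 68.255, "deliah_blue": 68.257}
penalties = {"antinnis_tremayne": 2.0}  # hypothetical penalty, in seconds
adjusted = apply_sanctions(lap_seconds, penalties)
order = sorted(adjusted, key=adjusted.get)
print(order)
```

With the hypothetical penalty applied, the driver with the quickest raw lap drops behind the other two, which is the sense in which a penalty results in a lower grid position.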

## Gridman's Manual Handling
The curated data is then published and the gridman is responsible for analysing it.
Often they determine each racer's grid position by hand and record it with pen and paper.
This is prone to human error and can leave some drivers disgruntled when they feel they were placed incorrectly.
Furthermore, while the grid is being set up, some drivers will choose to waive their current position and start in the novice grid.

## Correcting for Novice Drivers
Some novice drivers who are transitioning to become experienced drivers can join the main grid.
They take their rightful position on the main grid, like experienced drivers.
Alternatively, they may opt to return to the novice grid and waive their position on the main grid.
In that case, they are still placed appropriately relative to the other novice drivers.
For very new drivers, it is mandatory to join the novice grid while they acquire sufficient experience.
For these reasons, it's useful to model the overall grid as being composed of two sub-grids.

1. The main grid; for experienced drivers


2. The novice grid; for less experienced drivers who need some more breathing space on track.
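
The two sub-grid model can be sketched as below. This is one possible reading rather than the application's algorithm: `build_grid`, the `Seconds` column and the `waived` argument are hypothetical, the `Novice` flag is assumed to mean the driver must start on the novice grid, and the times are lap 2 times from heat 2 used as stand-ins for best laps.

```python
import pandas as pd


def build_grid(best, waived=()):
    """Hypothetical sketch: order the main grid, then the novice grid behind it.

    best: one row per driver with 'Name', 'Seconds' (best lap) and 'Novice'.
    waived: names of drivers who give up their main grid position.
    """
    # A driver starts on the novice grid if flagged as a novice or if they waived.
    on_novice_grid = best["Novice"] | best["Name"].isin(list(waived))
    ranked = best.assign(NoviceGrid=on_novice_grid)
    # False sorts before True, so the main grid comes first; quickest lap first within each grid.
    ranked = ranked.sort_values(["NoviceGrid", "Seconds"]).reset_index(drop=True)
    ranked["Position"] = ranked.index + 1
    return ranked


best = pd.DataFrame({
    "Name": ["deliah_blue", "si_treemba", "antinnis_tremayne", "rune_haako", "miraj_scintel"],
    "Seconds": [66.613, 71.078, 66.025, 66.459, 66.209],
    "Novice": [False, True, False, False, False],
})
# Assume, for illustration, that deliah_blue waives their main grid position.
grid = build_grid(best, waived=["deliah_blue"])
print(grid[["Position", "Name", "NoviceGrid"]])
```

A driver who waives is still ordered by lap time within the novice grid, so in this example they start ahead of the slower novice.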
