# Assignment 7

## Submit as an HTML file

### Print your name below

In [None]:
print("Jason Zhang")

### Import the "pandas" "numpy" and "statsmodels.formula.api" libraries

In [1]:
# Write your answer here:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf


#### In the code chunk below read the CSV file named `results.csv` in the `data` <br> folder and print the first 5 rows of the dataset. Browse the dataset.

In [None]:
df = pd.read_csv("data/results.csv")

print(df.head())


   resultId  raceId  driverId  constructorId number  grid position  \
0         1      18         1              1     22     1        1   
1         2      18         2              2      3     5        2   
2         3      18         3              3      7     7        3   
3         4      18         4              4      5    11        4   
4         5      18         5              1     23     3        5   

  positionText  positionOrder  points  laps         time milliseconds  \
0            1              1    10.0    58  1:34:50.616      5690616   
1            2              2     8.0    58       +5.478      5696094   
2            3              3     6.0    58       +8.163      5698779   
3            4              4     5.0    58      +17.181      5707797   
4            5              5     4.0    58      +18.014      5708630   

  fastestLap rank fastestLapTime fastestLapSpeed  statusId  
0         39    2       1:27.452         218.300         1  
1         41    3 

### (a)  Check Column Types and Data Cleaning

- Use the function .dtypes to get the column types
- Identify which columns have data types that might need conversion
- The 'milliseconds' column contains string values that should be numeric. Create a new column called 'race_time_ms' that:
    - Converts the column to a numeric data type
    - Replaces any non-numeric values with NaN

In [8]:
# Write your answer here
print(df.dtypes)

df['race_time_ms'] = pd.to_numeric(df['milliseconds'], errors='coerce')
print(df[['milliseconds', 'race_time_ms']].head(10))




resultId             int64
raceId               int64
driverId             int64
constructorId        int64
number              object
grid                 int64
position            object
positionText        object
positionOrder        int64
points             float64
laps                 int64
time                object
milliseconds       float64
fastestLap          object
rank                object
fastestLapTime      object
fastestLapSpeed     object
statusId             int64
race_time_ms       float64
dtype: object
   milliseconds  race_time_ms
0     5690616.0     5690616.0
1     5696094.0     5696094.0
2     5698779.0     5698779.0
3     5707797.0     5707797.0
4     5708630.0     5708630.0
5           NaN           NaN
6           NaN           NaN
7           NaN           NaN
8           NaN           NaN
9           NaN           NaN


### (b) Create Categorical Variables

- Create a new column called 'finish_category' that categorizes the race finish positions as follows:
    - Positions 1-3: 'Podium'
    - Positions 4-10: 'Points'
    - Positions 11-20: 'Midfield'
    - Positions >20: 'Backmarker'

Hint: Use the pd.cut() function

In [9]:
# Write your answer here

df['finish_category'] = pd.cut(df['positionOrder'], 
                               bins=[0, 3, 10, 20, float('inf')], 
                               labels=['Podium', 'Points', 'Midfield', 'Backmarker'])

print(df[['positionOrder', 'finish_category']].head(10))





   positionOrder finish_category
0              1          Podium
1              2          Podium
2              3          Podium
3              4          Points
4              5          Points
5              6          Points
6              7          Points
7              8          Points
8              9          Points
9             10          Points


### (c) Calculate Race Duration
- For rows where 'milliseconds' is available, create a new column <br>
'race_duration_minutes' that converts milliseconds to minutes by dividing <br>
by (1000*60).
- Display the average race duration by 'constructorId' for the top 5 <br>
constructors with the shortest average race times

In [None]:
# Write your answer here
df['race_duration_minutes'] = df['milliseconds'] / (1000 * 60)

avg_race_duration = df.groupby('constructorId')['race_duration_minutes'].mean().sort_values().head(5)

print(avg_race_duration)


constructorId
35    76.710777
29    77.604125
41    87.046767
16    89.428828
53    89.658852
Name: race_duration_minutes, dtype: float64


### (d) Driver Performance Analysis

- Calculate the following statistics for each driver, grouped by 'driverId':
    - Average finishing position
    - Total points
    - Number of races completed
    - Best finishing position

- Sort the results by total points in descending order
- Display the top 10 drivers based on total points

In [11]:
# Write your answer here
driver_performance = df.groupby('driverId').agg({
    'positionOrder': 'mean',     # Average finishing position
    'points': 'sum',             # Total points
    'raceId': 'count',           # Number of races completed
    'positionOrder': 'min'       # Best finishing position
}).rename(columns={'positionOrder': 'avg_position', 
                   'points': 'total_points', 
                   'raceId': 'races_completed', 
                   'positionOrder': 'best_position'})  

driver_performance_sorted = driver_performance.sort_values('total_points', ascending=False)

print(driver_performance_sorted.head(10))



          best_position  total_points  races_completed
driverId                                              
1                     1        4396.5              310
20                    1        3098.0              300
4                     1        2061.0              358
830                   1        1983.5              163
8                     1        1873.0              352
822                   1        1778.0              201
3                     1        1594.5              206
30                    1        1566.0              308
817                   1        1307.0              232
18                    1        1235.0              309


### (e) Linear Regression
Create a linear regression model that predicts 'points' based on 'grid' (starting position) and 'laps' completed <br>
Use the following steps:

- Clean the data to remove any non-numeric values and missing values
- Create the regression formula using smf.ols 
- Display the summary of the regression model using model.summary()

What is the predicted points for a driver starting in position 3 and completing 55 laps?

Hint: Use ```.dropna()''' to remove missing values from the points, grid, and laps <br>
variables.

In [12]:
# Write your answer here
import statsmodels.formula.api as smf

df_cleaned = df.dropna(subset=['points', 'grid', 'laps'])

model = smf.ols('points ~ grid + laps', data=df_cleaned).fit()

print(model.summary())

predicted_points = model.predict({'grid': 3, 'laps': 55})

print("\nPredicted Points for Grid 3 & 55 Laps:", predicted_points.iloc[0])



                            OLS Regression Results                            
Dep. Variable:                 points   R-squared:                       0.215
Model:                            OLS   Adj. R-squared:                  0.215
Method:                 Least Squares   F-statistic:                     3530.
Date:                Mon, 24 Mar 2025   Prob (F-statistic):               0.00
Time:                        21:42:11   Log-Likelihood:                -70440.
No. Observations:               25840   AIC:                         1.409e+05
Df Residuals:                   25837   BIC:                         1.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.5841      0.054     48.267      0.0