# Assignment 7

## Submit your .ipynb file to Gradescope **by 10pm** on Thursday, October 30th

##### Import the familiar libraries ``pandas``, ``numpy`` and ``matplotlib.pyplot``

##### In addition, we'll import a submodule from the ``statsmodels`` library. We'll use this in problems 4 and 5.

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

# for problems 4 and 5
import statsmodels.formula.api as smf

##### Run the code chunk below to read the CSV file named `results.csv` in the `data` folder and print the first 5 rows of the dataset (using a quick alternative to `.iloc[:5,:]`). Browse the dataset.

In [None]:
df_results = pd.read_csv("data/results.csv")
print(df_results.head())

### (1)  Check Column Types and Data Cleaning

- Use the ``.dtypes`` attribute to get the data type of each column of the DataFrame. Assign the result to a variable named ``results_data_types``.
- The 'milliseconds' column contains string values that should be numeric. Do the following:
    - Replace all non-numeric values in the milliseconds column with ``np.nan``
    - Add a new column to ``df_results`` called 'race_time_ms' by converting the 'milliseconds' column to a numeric data type using ``pd.to_numeric``
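
As a minimal sketch on a made-up Series (not the assignment data), the conversion step might look like the following. The placeholder string `"\\N"` is only an assumption about what a non-numeric entry looks like, so inspect your own column first. Note that `errors="coerce"` is one possible route: it turns anything that cannot be parsed into `NaN` during the conversion itself.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; "\\N" stands in for a non-numeric marker
toy = pd.Series(["5690616", "\\N", "5696094"])

# errors="coerce" replaces every value that cannot be parsed with NaN
toy_numeric = pd.to_numeric(toy, errors="coerce")

print(toy_numeric.dtype)  # the result is a float column because NaN is a float
print(toy_numeric)
```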

In [None]:
# your code here


### (2) Recoding: Create categorical variables

- The DataFrame has a column called 'position' which records whether a driver finished 1st, 2nd, 3rd, etc.
- Create a new column called 'finish_category' that categorizes the race finish positions as follows:
    - Positions 1-3: 'Podium'
    - Positions 4-10: 'Points'
    - Positions 11-20: 'Midfield'
    - Positions >20: 'Backmarker'

**Hint**: The shortest way to do this is to clean the data in the column and then use the ``pd.cut`` function from the ``pandas`` library. But this is not the only way to do it.
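
To illustrate how ``pd.cut`` bins values, here is a small sketch on invented positions that are already numeric. The variable names are hypothetical, and the real 'position' column still needs cleaning before this approach applies.

```python
import numpy as np
import pandas as pd

# Toy finishing positions, invented for illustration (NaN = no recorded finish)
positions = pd.Series([1, 3, 4, 10, 11, 20, 21, np.nan])

# Bins are right-inclusive by default: (0, 3], (3, 10], (10, 20], (20, inf]
bins = [0, 3, 10, 20, np.inf]
labels = ["Podium", "Points", "Midfield", "Backmarker"]

# Each value gets the label of the bin it falls into; NaN stays NaN
categories = pd.cut(positions, bins=bins, labels=labels)
print(categories)
```

Because the right edge of each bin is included, position 3 lands in 'Podium' and position 10 in 'Points', which matches the ranges listed above.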

In [None]:
# Write your answer here



### (3) Calculate Race Duration
- Create a new column called 'race_duration_minutes' by converting the race time from milliseconds to minutes: divide each millisecond value by 60,000 (which equals $6 \times 10^4$). Equivalently, you can multiply each millisecond value by $1/(6 \times 10^4) \approx 1.667 \times 10^{-5}$.

- Each F1 car is associated with a "constructor", the entity which designs the chassis and engine of the car. Use ``.groupby`` to create a DataFrameGroupBy object, grouping by the "constructorId" column.

- Compute the average race duration in minutes for each constructor. Then print the 5 constructorIds with the fastest (smallest) average times. (Both the constructorId and the corresponding average time in minutes should be visible.)
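
The steps above can be sketched on a tiny invented table. The column names ``team``, ``ms`` and ``minutes`` are hypothetical stand-ins, and ``nsmallest`` is just one way to pick out the lowest averages (sorting and slicing would also work).

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C"],
    "ms": [5_700_000, 5_820_000, 5_400_000, 5_520_000, 5_940_000],
})

# 60,000 milliseconds per minute
toy["minutes"] = toy["ms"] / 60_000

# Group rows by team, then average the minutes within each group
avg_minutes = toy.groupby("team")["minutes"].mean()

# Keep the 2 smallest averages; the index (team) is printed next to each value
fastest = avg_minutes.nsmallest(2)
print(fastest)
```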

In [None]:
# Write your answer here



### (4) Linear Regression

We return to the dataset of car features, in the file `features.csv`. We might guess that there is a simple linear relationship between the weight of a car and its miles per gallon (the heavier the car, the worse mileage it gets).

In particular, we expect that these two variables are related by the approximate equality:

$$ m_i \approx a\cdot w_i + b$$

where $m_i$ is the mpg of car $i$, $w_i$ is the weight of car $i$, and $a$ (the slope) and $b$ (the y-intercept) are the coefficients of the linear model, which we need to determine. In this model, we say that "weight" is the **independent variable** and "mpg" is the **dependent variable**.


- Read in the carfeatures dataset, and assign it to a DataFrame.

- At the top of the notebook, we imported the library ``statsmodels.formula.api`` with the nickname ``smf``. To construct the model, we will use the ``smf.ols`` (Ordinary Least Squares) function as follows:
```python
        model = smf.ols(formula = ... , data = ...)
```
- Replace the ellipses (...) in the arguments as follows:
    - For formula, you should put the **string** "dependent_variable ~ independent_variable", where you should substitute the appropriate DataFrame column names for the two variables. (But keep the tilde (~) there)
    - The data argument corresponds to the DataFrame you created when you read in the .csv file.

- Compute a Pandas ``Series`` containing the computed coefficients $a$ and $b$ from the linear model. You can do this as follows:
```python
        coeffs = model.fit().params
```
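
For a sense of how the pieces fit together, here is a minimal sketch on invented data. The column names ``x`` and ``y`` are hypothetical and are not the columns of the car features file.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, invented for illustration; "x" and "y" are hypothetical column names
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1],
})

# "y ~ x" reads: model y (dependent) as a linear function of x (independent)
toy_model = smf.ols(formula="y ~ x", data=toy)
toy_coeffs = toy_model.fit().params

print(toy_coeffs["Intercept"])  # fitted y-intercept
print(toy_coeffs["x"])          # fitted slope
```

The resulting ``Series`` is indexed by ``"Intercept"`` and by the name of the independent variable used in the formula.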

This problem is continued in Question (5)

In [None]:
# your answer here




### (5) Plotting Linear Regression Model

In Question (4), you created a Pandas ``Series`` containing the coefficients of the linear model

$$ m_i \approx a\cdot w_i + b$$

- The elements of ``coeffs`` can be accessed as ``coeffs["Intercept"]`` (the y-intercept) and ``coeffs["weight"]`` (the slope). Define two separate floating-point variables ``a`` and ``b`` to match the model above: ``a`` is the slope and ``b`` is the y-intercept.

- Create a Pandas ``Series`` called ``predicted_mpg`` based on the formula:
$$\hat{m}_i = a\cdot w_i + b$$

- This means that each element of ``predicted_mpg`` is computed by multiplying ``a`` by the corresponding element in the "weight" column, and then adding ``b``. **Hint:** This works just like arithmetic on NumPy arrays.

- Using ``plt.scatter``, plot weight (on the x-axis) against mpg (on the y-axis).

- Then, using ``plt.plot`` (**not a scatter plot!**), plot weight (on the x-axis) against ``predicted_mpg`` (on the y-axis). Both will appear on the same figure as long as you don't call ``plt.show()`` in between.

- Change the color of either the scatter plot or regular plot, so that they are easily distinguished from each other.

- Label the axes, add a legend, and a title to your plot. For the legend, you might want to call the points in the scatter plot "data", and the linear fit line "best fit line".

- Make sure your plots are displayed in the notebook output before submitting.
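
The overall pattern can be sketched on invented data as follows. To keep the example short, the line is fitted with ``np.polyfit`` rather than the ``statsmodels`` model from Question (4), and the column names ``x`` and ``y`` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.2, 3.9, 6.1, 8.0]})

# Degree-1 least-squares fit; np.polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(toy["x"], toy["y"], 1)

# Element-wise arithmetic on a Series, just as with NumPy arrays
y_hat = slope * toy["x"] + intercept

# Both calls draw on the same figure because nothing shows it in between
plt.scatter(toy["x"], toy["y"], color="tab:blue", label="data")
plt.plot(toy["x"], y_hat, color="tab:red", label="best fit line")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Toy data with least-squares line")
plt.legend()
```

In a notebook the figure is rendered when the cell finishes; in a plain script you would end with ``plt.show()``.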

In [None]:
# your answer here

