# Part 4: Fuel Emissions Regression

In this problem you will use the `FuelConsumptionCo2.csv` file (from your Homework Module on Canvas) to build two candidate models to predict a vehicle's Carbon Dioxide Emissions (`CO2EMISSIONS`).

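In [23]:
import numpy as np   # used by the assert cells below
import pandas as pd

df_fuel = pd.read_csv('FuelConsumptionCo2.csv')
df_fuel.dropna(inplace=True)  # drop rows with missing values
df_fuel.head()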

| Unnamed: 0 | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |


## Part 4.1: Multiple Regression (10 points)

Our first model will be a multiple regression model where we try to predict `CO2EMISSIONS` with `ENGINESIZE`, `CYLINDERS` and `FUELCONSUMPTION_COMB_MPG`. Be sure to complete all the following steps:

#### Part 4.1.1

Create your `X` and `y` arrays. Make sure that:

- You scale the $x$ features **using scale normalization**
- You do **not** include a bias column in `X`

Defining the feature list:

    x_feat_list = ['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB_MPG']

may help. Your `X` should pass the assert statement in the cell before Part 4.1.2. **Note**: if you use any normalization other than scale normalization (i.e. anything other than dividing each feature by its standard deviation), the assert will not pass.
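As a minimal sketch of what scale normalization does, the snippet below applies it to a small made-up array rather than to the homework data. The array `X_raw` and its values are purely illustrative, and the use of NumPy's default population standard deviation (`ddof=0`) is an assumption; the assignment does not say which convention the assert expects.

```python
import numpy as np

# Toy feature matrix (made-up values, NOT the homework data): 4 rows, 2 features.
X_raw = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# Scale normalization: divide each column by its own standard deviation.
# No mean is subtracted and no bias column is added.
# Assumption: population standard deviation (NumPy's default, ddof=0).
X_scaled = X_raw / X_raw.std(axis=0)

print(X_scaled.std(axis=0))   # each column now has standard deviation 1
print(X_scaled.mean(axis=0))  # means are rescaled, but not shifted to zero
```

Because nothing is subtracted, the scaled features keep their sign and relative ordering; only their spread changes.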

In [25]:
# check that X was scaled with scale normalization and has no bias column
assert np.isclose(X[0], np.array([1.41531251, 2.19568706, 4.77566865])).all()

#### Part 4.1.2

Using single-fold cross validation with a 70-30 split, create `Xtrain`, `Xtest`, `ytrain`, and `ytest`.

Fit the model using **your own** `line_of_best_fit` function to `Xtrain` and `ytrain`.

Then pass `Xtest`, `ytest`, and the output from `line_of_best_fit` to your `linreg_predict` function, saving the result to a variable.

Print out the cross-validated $MSE$ and $R^2$ values. You do not have to comment on their values yet, but you will in Part 4.3 when comparing this model with the one from Part 4.2.
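The workflow described above (split, fit on the training set, score on the test set) can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the data are randomly generated, and the least-squares fit via `np.linalg.lstsq` is a hypothetical stand-in for your own `line_of_best_fit` and `linreg_predict` functions, whose exact inputs and outputs may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (NOT the homework data): y depends linearly on two features.
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Single-fold cross validation: one 70-30 train/test split.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical stand-in for a least-squares fit (your line_of_best_fit may differ).
# A bias column is prepended here because X itself carries none.
A_train = np.column_stack([np.ones(len(Xtrain)), Xtrain])
coef, *_ = np.linalg.lstsq(A_train, ytrain, rcond=None)

# Predict on the held-out 30% and compute the residuals.
A_test = np.column_stack([np.ones(len(Xtest)), Xtest])
ypred = A_test @ coef
resids = ytest - ypred

# MSE is the mean squared residual; R^2 compares it to the variance of ytest.
mse = np.mean(resids ** 2)
r2 = 1.0 - np.sum(resids ** 2) / np.sum((ytest - ytest.mean()) ** 2)
print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}")
```

Fixing `random_state` makes the split reproducible, which helps when you later compare two models on the same data.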

#### Part 4.1.3

Now fit the full model using your `line_of_best_fit` function, and generate the residuals using your `linreg_predict` function. Create 5 residual plots in order to check that the assumptions of independence, constant variance, and normality are met for the model you built:

- A plot of the index vs. the residuals
- A plot of `ENGINESIZE` vs. the residuals
- A plot of `CYLINDERS` vs. the residuals
- A plot of `FUELCONSUMPTION_COMB_MPG` vs. the residuals
- A normal probability quantile-quantile plot of the residuals

You do not have to comment on these plots yet, but you will in Part 4.3 when comparing this model with the one from Part 4.2.
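The three kinds of residual plot listed above can be drawn along the following lines. This is a rough sketch, not the required solution: the feature `x_feat` and the residuals `resids` are randomly generated placeholders, and `scipy.stats.probplot` is one of several ways to produce a normal quantile-quantile plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic placeholders (NOT the homework data): one feature and some residuals.
x_feat = rng.uniform(1.0, 8.0, size=150)
resids = rng.normal(scale=2.0, size=150)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Index vs. residuals: trends or runs here would suggest a lack of independence.
axes[0].scatter(np.arange(len(resids)), resids, s=10)
axes[0].axhline(0, color="gray", linestyle="--")
axes[0].set(xlabel="index", ylabel="residual", title="Index vs. residuals")

# Feature vs. residuals: a funnel or curve would suggest non-constant variance
# or a missed nonlinear pattern.
axes[1].scatter(x_feat, resids, s=10)
axes[1].axhline(0, color="gray", linestyle="--")
axes[1].set(xlabel="feature", ylabel="residual", title="Feature vs. residuals")

# Normal Q-Q plot: points close to the straight line are consistent with normality.
stats.probplot(resids, dist="norm", plot=axes[2])
axes[2].set_title("Normal Q-Q plot of residuals")

fig.tight_layout()
```

For a model with several features, the middle panel would be repeated once per feature.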

## Part 4.2: Polynomial Regression (10 points)

Our second model will be a polynomial regression model where we try to predict `CO2EMISSIONS` with `FUELCONSUMPTION_COMB_MPG`. Be sure to complete all the following steps:

#### Part 4.2.1

Use scikit-learn's `PolynomialFeatures` class and its `.fit_transform` method to convert the `FUELCONSUMPTION_COMB_MPG` ($x$) feature into an array (**CALL THIS `X_poly`**) with **four** columns, corresponding to a quartic model for `CO2EMISSIONS` ($y$) of the form $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$. I have started the process for you by defining `X_fuel`, an array containing only this predictor.

**Note** that `.fit_transform` produces **five** columns by default, including the bias column. Your regression functions take arrays without a bias column, so you should remove it.

Your `X_poly` should pass the assert statement in the cell before Part 4.2.2. **Note**: Do *not* scale your features. Scaling is unnecessary here, since there is really only one feature (raised to different powers), and it will cause the assert to fail.
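To illustrate the default bias column and two ways of ending up without it, here is a small sketch on a made-up two-row array. The variable names (`x_toy`, `with_bias`, and so on) are illustrative only; either approach should give the same four columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy single-feature array (made-up values), shaped (n_samples, 1).
x_toy = np.array([2.0, 3.0]).reshape(-1, 1)

# By default the transformer prepends a bias column of ones: 1, x, x^2, x^3, x^4.
with_bias = PolynomialFeatures(degree=4).fit_transform(x_toy)
print(with_bias.shape)

# Option A: drop the first (bias) column by slicing.
no_bias_sliced = with_bias[:, 1:]

# Option B: ask the transformer not to create the bias column at all.
no_bias_direct = PolynomialFeatures(degree=4, include_bias=False).fit_transform(x_toy)

print(no_bias_direct)
```

Each remaining row holds the powers $x, x^2, x^3, x^4$ of the original value, which is the layout the assert in the next cell checks for.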

In [32]:
from sklearn.preprocessing import PolynomialFeatures

X_fuel = np.array(df_fuel['FUELCONSUMPTION_COMB_MPG']).reshape(-1,1)

In [34]:
assert np.isclose(X_poly[0], np.array([33, 1089, 35937, 1185921])).all()

#### Part 4.2.2

Using single-fold cross validation with a 70-30 split, create `Xtrain`, `Xtest`, `ytrain`, and `ytest` (from `X_poly` from Part 4.2.1 and `y` as defined before).

Fit the model using **your own** `line_of_best_fit` function to `Xtrain` and `ytrain`.

Then pass `Xtest`, `ytest`, and the output from `line_of_best_fit` to your `linreg_predict` function, saving the result to a variable.

Print out the cross-validated $MSE$ and $R^2$ values. You do not have to comment on their values yet, but you will in Part 4.3 when comparing this model with the one from Part 4.1.

#### Part 4.2.3

Now fit the full model using your `line_of_best_fit` function, and generate the residuals using your `linreg_predict` function. Create 3 residual plots in order to check that the assumptions of independence, constant variance, and normality are met for the model you built:

- A plot of the index vs. the residuals
- A plot of `FUELCONSUMPTION_COMB_MPG` vs. the residuals
- A normal probability quantile-quantile plot of the residuals

You do not have to comment on these plots yet, but you will in Part 4.3 when comparing this model with the one from Part 4.1.

## Part 4.3: Conclusions (10 points)

**In a markdown cell**, give a *lengthy and **detailed*** discussion of the two candidate models. Discuss the strengths and weaknesses of each (e.g. which model had the better $R^2$? Which had the better $MSE$? Which assumptions were met for each model and which were not?). Then, **make a decision** about which model you would suggest (if you **had** to choose) is most appropriate for predicting a vehicle's Carbon Dioxide Emissions. Do you have any thoughts about improving either or both of these models? **Discuss this as well.**