## DAT303 - Spring 2024 - Module 5.2 Notebook
---
Name:    
Date:

In this notebook, you will fit a number of regression models to the *nyc-rolling-sales.csv* dataset available on Canvas and evaluate the results. There are three sub-sections to this assignment:

- Part I: Data Preparation  
- Part II: Fitting Regression Models
- Part III: Evaluation  

**BE SURE TO READ THE INSTRUCTIONS FOR ALL SECTIONS!!!**

<br>


## Part I: Data Preparation
---

You will first pre-process the dataset. You can use the example code provided [here](https://github.com/jtrive84/DMACC/blob/master/DAT303/Demos/preprocessing-pipeline-demo.ipynb) to give you an idea of how to handle imputation, scaling and one-hot encoding for categorical features. The objective is to implement a number of models to predict the log of SALE PRICE using selected features. Some tasks you will need to handle include:

- Determining which features are categorical and which are continuous.
    - For example, `GROSS SQUARE FEET` is a continuous feature, but because missing values are represented as `"-"` in the dataset, Pandas will read it as an object (string) column. You may want to use the `na_values` argument of `pd.read_csv` to handle this at load time.

- Handling sparsely represented values in categorical features (see Part I #8).

- Imputing missing values (remember that imputation is handled differently for continuous and categorical features).

- Scaling continuous features.

- One-hot encoding categorical features.

- Picking which columns to ignore (for example, the models we are building are not time dependent, so you should exclude SALE DATE).
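
As a small illustration of the `na_values` tip above, the sketch below reads a made-up three-row CSV (not the assignment data) in which `"-"` marks a missing value. The rows are invented for the example; only `pd.read_csv` and its `na_values` argument come from the text.

```python
import io

import pandas as pd

# Toy CSV text standing in for the real file; the values are invented.
raw = "BOROUGH,GROSS SQUARE FEET\n1,1200\n2,-\n3,950\n"

# Without na_values, the "-" entry forces the column to be read as strings.
df_raw = pd.read_csv(io.StringIO(raw))

# With na_values, "-" becomes NaN and the column is parsed as a float column.
df_clean = pd.read_csv(io.StringIO(raw), na_values=["-"])

print(df_raw["GROSS SQUARE FEET"].dtype)
print(df_clean["GROSS SQUARE FEET"].dtype)
print(df_clean["GROSS SQUARE FEET"].isna().sum())
```

If the placeholder in the real file carries surrounding spaces, the string passed to `na_values` would need to match it exactly, so it is worth inspecting the raw values first.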

---

- 1.a Read *nyc-rolling-sales.csv* into a Pandas DataFrame. 
- 1.b Drop any records in which SALE PRICE is NA or 0. 
- 1.c Create a new column named log_SALE_PRICE which represents the natural log of the original SALE PRICE column. This will be our target going forward. 
- 1.d Drop SALE PRICE and SALE DATE.
- 1.e Display the first 5 rows of the DataFrame.

In [None]:

import numpy as np
import pandas as pd

##### YOUR CODE HERE #####


<br>

2. Create a bar plot of the median log_SALE_PRICE by BOROUGH. Ensure the values for BOROUGH on the x-axis are ordered correctly. 

In [None]:

##### YOUR CODE HERE #####


<br>

3. Which borough has the highest median log_SALE_PRICE? The lowest?


YOUR WRITTEN RESPONSE HERE


<br>

4. Plot a histogram of log_SALE_PRICE. Experiment with the number of bins.

In [None]:

##### YOUR CODE HERE #####


<br>

5. Describe the distribution of the target variable in words (symmetric, skewed, etc.). Are outliers apparent? What tests are available in scikit-learn to detect outliers? Name two.

YOUR WRITTEN RESPONSE HERE

<br>

6. Replace 0 values in YEAR BUILT with NA so they will be properly imputed. Investigate whether any other columns should receive similar treatment. 

In [None]:

##### YOUR CODE HERE #####


<br>

7. Inspect the columns of the DataFrame, and create categorical and continuous feature lists. Be sure to remove any redundant columns (for example, consider excluding TOTAL UNITS, since it is a perfect linear combination of COMMERCIAL UNITS and RESIDENTIAL UNITS).

In [None]:

##### YOUR CODE HERE #####


<br>

8. For each feature identified as categorical, consolidate values that appear fewer than 100 times into an "OTHER" group. For example, in BUILDING CLASS CATEGORY, "CONDO WAREHOUSES/FACTORY/INDUS" appears only 10 times, so it should be replaced with "OTHER". 
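
One possible way to consolidate rare categories is sketched below on toy data. The category labels, the `df_toy` frame and the threshold of 3 are invented for illustration; the assignment itself uses a threshold of 100 on the real categorical columns.

```python
import pandas as pd

# Toy data: category labels and counts are invented for illustration.
df_toy = pd.DataFrame(
    {"BUILDING CLASS CATEGORY": ["A"] * 5 + ["B"] * 4 + ["C"] * 2 + ["D"]}
)
min_count = 3  # hypothetical threshold; the assignment uses 100

for col in ["BUILDING CLASS CATEGORY"]:
    counts = df_toy[col].value_counts()
    # Values that occur fewer than min_count times.
    rare = counts[counts < min_count].index
    # Keep frequent values as-is; replace rare ones with "OTHER".
    df_toy[col] = df_toy[col].where(~df_toy[col].isin(rare), "OTHER")

print(df_toy["BUILDING CLASS CATEGORY"].value_counts())
```

Here "C" and "D" fall below the threshold and are merged, so the column ends up with the three levels "A", "B" and "OTHER".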

In [None]:

##### YOUR CODE HERE #####


<br>

9. Implement the preprocessing pipeline. Be sure to create train, validation and test subsets. We will use train and validation sets for modeling, but the test set will not be used until Part III to compare all models on unseen data. Print the number of rows and columns in each split.
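
A three-way split can be obtained by calling `train_test_split` twice, as in the minimal sketch below. The synthetic data, the 60/20/20 proportions, the `random_state` value and the intermediate variable names are all assumptions made for the example, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and target, invented for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
y = pd.Series(rng.normal(size=100), name="log_SALE_PRICE")

# First hold out 20% of the rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(X, y, test_size=0.20, random_state=516)

# Then split the remainder 75/25, giving 60% train and 20% validation overall.
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.25, random_state=516)

print(X_tr.shape, X_va.shape, X_te.shape)
```

Fitting imputers, scalers and encoders on the training split only, and then applying them to the validation and test splits, keeps information from the held-out data from leaking into the model.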

In [None]:

##### YOUR CODE HERE #####






print(f"dftrain.shape: {dftrain.shape}")
print(f"dfvalid.shape: {dfvalid.shape}")
print(f"dftest.shape : {dftest.shape}")



<br>

## Part II: Fitting Regression Models
---
In this section, you will fit four separate regression models and answer additional questions about each one. In particular, you will fit the following models:

- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)
- [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn-tree-decisiontreeregressor)
- Regression model of your choice (list of models available [here](https://scikit-learn.org/stable/supervised_learning.html))

Follow the instructions that accompany each model. Remember that in this section, **we are only working with the training and validation sets, not the test set!**

---


### i. LinearRegression

1. Fit a LinearRegression model. Report/print the following metrics to 5 decimal places:

- Train $R^2$, MSE and MAE.
- Validation $R^2$, MSE and MAE.

Name the LinearRegression model you fit `mdl1`.
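
The sketch below shows one way to compute and print the three metrics to 5 decimal places. It runs on synthetic data with a simple positional split; `demo_model`, `scores` and the data itself are placeholders invented for the example, not the `mdl1` required above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Positional split: first 150 rows for training, last 50 for validation.
X_tr, y_tr = X_demo[:150], y_demo[:150]
X_va, y_va = X_demo[150:], y_demo[150:]

demo_model = LinearRegression().fit(X_tr, y_tr)

scores = {}
for label, X_part, y_part in [("train", X_tr, y_tr), ("valid", X_va, y_va)]:
    pred = demo_model.predict(X_part)
    scores[label] = {
        "r2": r2_score(y_part, pred),
        "mse": mean_squared_error(y_part, pred),
        "mae": mean_absolute_error(y_part, pred),
    }
    # The :.5f format specifier rounds each metric to 5 decimal places.
    print(
        f"{label}: R^2={scores[label]['r2']:.5f}  "
        f"MSE={scores[label]['mse']:.5f}  MAE={scores[label]['mae']:.5f}"
    )
```

The scikit-learn metric functions take the true values first and the predictions second.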



In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


##### YOUR CODE HERE #####

# Fit model on training set. 


# Generate predictions on training set.


# Generate predictions on validation set.


# Compute training metrics.


# Compute validation metrics.


<br>

2. Did each metric increase or decrease from training to validation?

YOUR WRITTEN RESPONSE HERE



<br>

3. Create a DataFrame consisting of the LinearRegression coefficients along with the feature names, and sort it by absolute coefficient value in decreasing order. 
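
A coefficient table of this kind might be built as in the following sketch. The tiny dataset, the feature names `a`, `b`, `c` and the helper column `abs_coefficient` are invented for illustration; the target is an exact linear function of the features so the fitted coefficients are easy to check by eye.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy features; the target is an exact linear combination of them.
X_demo = pd.DataFrame(
    {
        "a": [0.0, 1.0, 2.0, 3.0, 4.0],
        "b": [1.0, 0.0, 1.0, 0.0, 1.0],
        "c": [2.0, 1.0, 0.0, 1.0, 5.0],
    }
)
y_demo = 0.5 * X_demo["a"] - 4.0 * X_demo["b"] + 2.0 * X_demo["c"]

demo_model = LinearRegression().fit(X_demo, y_demo)

coef_df = (
    pd.DataFrame({"feature": X_demo.columns, "coefficient": demo_model.coef_})
    # Sort on the absolute value so large negative coefficients rank high too.
    .assign(abs_coefficient=lambda d: d["coefficient"].abs())
    .sort_values("abs_coefficient", ascending=False)
    .reset_index(drop=True)
)
print(coef_df)
```

Because `b` has the largest coefficient in absolute value, it appears first even though its sign is negative.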

In [1]:

##### YOUR CODE HERE #####



<br>

4. Which three features have the highest absolute coefficient values?


YOUR WRITTEN RESPONSE HERE



<br>


### ii. Lasso

5. Recall that the Lasso model can perform feature selection by shrinking coefficients to 0. For this question, you will fit a number of models while varying the regularization parameter, noting how many coefficients remain non-zero for each regularization value. Specifically, perform the following:

    1. For each alpha in `np.linspace(0, 1, 100)`, do:   
    
        - Fit a Lasso model with that particular alpha.
        - Compute the training and validation MAE. 
        - Count the number of non-zero coefficients estimated by the model. 

    2. Create a DataFrame of your results with columns alpha, nbr_non_zero_coeffs, train_mae and valid_mae. Display the first 50 rows.
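
The loop described above could be organized along the lines of the sketch below. It uses synthetic data and only five alpha values starting at 0.01 (an assumption made to keep the output short and to avoid the warning scikit-learn raises for `alpha=0`); the names `demo_alphas`, `rows` and `results` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

# Synthetic data: only the first two of five features affect the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, y_tr = X_demo[:200], y_demo[:200]
X_va, y_va = X_demo[200:], y_demo[200:]

# Hypothetical, shortened grid of regularization strengths.
demo_alphas = np.linspace(0.01, 1, 5)

rows = []
for alpha in demo_alphas:
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    rows.append(
        {
            "alpha": alpha,
            # Number of coefficients the model did not shrink to exactly 0.
            "nbr_non_zero_coeffs": int(np.count_nonzero(model.coef_)),
            "train_mae": mean_absolute_error(y_tr, model.predict(X_tr)),
            "valid_mae": mean_absolute_error(y_va, model.predict(X_va)),
        }
    )

results = pd.DataFrame(rows)
print(results)
```

Collecting one dictionary per fit and building the DataFrame once at the end is usually simpler than appending to a DataFrame inside the loop.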



In [4]:

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error


alphas = np.linspace(0, 1, 100)


##### YOUR CODE HERE #####



<br>

6. Create a line plot with alpha on the x-axis and nbr_non_zero_coeffs on the y-axis. 

In [None]:

##### YOUR CODE HERE #####


<br>

7. After which value of alpha do train_mae and valid_mae no longer change?


YOUR WRITTEN RESPONSE HERE


<br>

8. Based on your analysis of the 100 models fit in question 5, select an alpha that gives a good trade off between bias and variance, and refit this model on the training data for use in Part III. Name the model you create here `mdl2`.

In [2]:

##### YOUR CODE HERE ##### 


### iii. DecisionTreeRegressor



9. Recall that in Module 5 validation curves were discussed. For this question, you will vary the DecisionTreeRegressor's max_depth hyperparameter, and monitor how train MSE and validation MSE vary for each value. 


    1. For each max_depth in `np.arange(1, 51)`, do:  
    
        - Fit a DecisionTreeRegressor with that particular max_depth.
        - Compute the training and validation MSE. 

    2. Create a DataFrame of your results with columns max_depth, train_mse, and valid_mse. Display all rows.
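
The same sweep pattern applies to tree depth. The sketch below uses synthetic one-dimensional data and depths 1 through 10 only; the data, the `random_state` and the names `demo_depths` and `results` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data, invented for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, y_tr = X_demo[:200], y_demo[:200]
X_va, y_va = X_demo[200:], y_demo[200:]

# Hypothetical, shortened range of depths.
demo_depths = np.arange(1, 11)

rows = []
for depth in demo_depths:
    tree = DecisionTreeRegressor(max_depth=int(depth), random_state=0).fit(X_tr, y_tr)
    rows.append(
        {
            "max_depth": int(depth),
            "train_mse": mean_squared_error(y_tr, tree.predict(X_tr)),
            "valid_mse": mean_squared_error(y_va, tree.predict(X_va)),
        }
    )

results = pd.DataFrame(rows)
print(results)
```

Training MSE typically keeps falling as depth grows, while validation MSE tends to level off or rise once the tree starts fitting noise.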



In [5]:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

max_depths = np.arange(1, 51)

##### YOUR CODE HERE #####


<br>

10. Plot the validation curve comparing train_mse and valid_mse with max_depth on the x-axis. Draw a vertical black line at the value of max_depth where overfitting starts to occur. Be sure to label your axes.

In [None]:

import matplotlib.pyplot as plt

##### YOUR CODE HERE #####


<br>

11. Based on your analysis of the validation curve in question 10, select a max_depth that gives a good trade off between bias and variance, and refit this model on the training data for use in Part III. Name the model you create here `mdl3`.

In [6]:

##### YOUR CODE HERE #####


<br> 

### iv. Model of Your Choice



12. Select a scikit-learn regression model not already covered in this notebook. Select a hyperparameter for that model, and recreate the validation curve created in problem 10. 

In [None]:

##### YOUR CODE HERE #####


<br>

13. Based on your analysis of the validation curve in question 12, select a hyperparameter value that gives a good trade off between bias and variance, and refit this model on the training data for use in Part III. Name the model you create here `mdl4`.

In [7]:

##### YOUR CODE HERE #####


<br>

## Part III: Evaluation
---

If you followed the instructions in terms of naming your models, there should be no coding in this section. Execute the next cell, which computes MSE and MAE on the final test set for the four models created. Recall that:

- `mdl1` = Standard LinearRegression model
- `mdl2` = Selected Lasso model (after identifying preferred alpha)
- `mdl3` = Selected DecisionTreeRegressor (after identifying preferred max_depth)
- `mdl4` = Selected model of your choice (after identifying preferred hyperparameter)

In [None]:

# Run this cell as-is, no updates necessary.

ypred_mdl1 = mdl1.predict(dftest)
ypred_mdl2 = mdl2.predict(dftest)
ypred_mdl3 = mdl3.predict(dftest)
ypred_mdl4 = mdl4.predict(dftest)


# Each entry holds the test-set MSE and MAE for one model.
# scikit-learn metric functions take their arguments as (y_true, y_pred).
metrics = [
    {
        "model": repr(mdl1),
        "mse": mean_squared_error(ytest, ypred_mdl1),
        "mae": mean_absolute_error(ytest, ypred_mdl1)
    },
    {
        "model": repr(mdl2),
        "mse": mean_squared_error(ytest, ypred_mdl2),
        "mae": mean_absolute_error(ytest, ypred_mdl2)
    },
    {
        "model": repr(mdl3),
        "mse": mean_squared_error(ytest, ypred_mdl3),
        "mae": mean_absolute_error(ytest, ypred_mdl3)
    },
    {
        "model": repr(mdl4),
        "mse": mean_squared_error(ytest, ypred_mdl4),
        "mae": mean_absolute_error(ytest, ypred_mdl4)
    },
]


# One row per model; the notebook displays the last expression in the cell.
pd.DataFrame(metrics)


<br>

1. Which model exhibited the best performance in terms of MAE? Which model exhibited the worst performance in terms of MAE? Why do you think the best performing model out-performed the others?


YOUR WRITTEN RESPONSE HERE
