# Homework 7
<p style="text-align:right;color:red;font-weight:bold;font-size:16pt;padding-bottom:20px">Please, copy this notebook before editing!</p>

In this assignment you train multiple regression models on a data set. You use the model implementations from the scikit-learn package:

1. Linear Regression [`sklearn.linear_model.LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
2. Decision Tree Regressor [`sklearn.tree.DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
3. Random Forest Regressor [`sklearn.ensemble.RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

You also use several tools for pre-processing, namely:
1. Encoders: `sklearn.preprocessing.LabelEncoder`, `sklearn.preprocessing.OneHotEncoder`
2. Imputers: `sklearn.impute.SimpleImputer`
3. Scalers: `sklearn.preprocessing.StandardScaler`
4. Polynomial features: `sklearn.preprocessing.PolynomialFeatures`

Finally, you use existing implementations for:
1. Splitting data sets (train and test): `sklearn.model_selection.train_test_split`
2. Calculating evaluation metrics: `sklearn.metrics.accuracy_score`, `sklearn.metrics.mean_squared_error`, and `sklearn.metrics.r2_score`

## Data Preparation


Import packages

In [None]:
%load_ext autoreload
%autoreload 2
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,mean_squared_error, r2_score

import pandas as pd, numpy as np
import sys
import matplotlib.pyplot as plt

Load data set (drop the `ID` column right away...)

In [None]:
df_pd =  pd.read_csv("/data/IFI8410/sess09/predict_home_value.csv") \
    .drop('ID', axis=1) 
print(f"Number of records: {df_pd.shape[0]:,}")
display(df_pd.head(5).T)

The `SALEPRICE` is your target variable. 

In [None]:
plt.hist(df_pd.SALEPRICE, bins=20)
plt.suptitle('Distribution of Sales Price')
plt.xlabel('Sales Prices')
plt.ylabel('Count')
plt.show()

### Split into Train and Test Sets
Keep a portion of your data for testing. These are the **unseen** records that the models never see during training.


In [None]:
test_mask = np.random.random_sample(df_pd.shape[0]) >0.8
df_train = df_pd[~test_mask].copy()
print(f"Number of training records: {df_train.shape[0]:,}")
df_test = df_pd[test_mask].copy()
print(f"Number of test records: {df_test.shape[0]:,}")
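
The random mask above holds out roughly 20% of the rows, and the exact rows change on every run. As an alternative, here is a minimal sketch using `train_test_split`, which is listed in the introduction. The `toy_df` frame and its values are made up for illustration and stand in for `df_pd`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_pd: 100 rows, one feature and the target.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "LOTAREA": rng.integers(5_000, 20_000, size=100),
    "SALEPRICE": rng.integers(100_000, 400_000, size=100),
})

# Hold out exactly 20% of the rows; random_state makes the split reproducible.
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=42)

print(f"Number of training records: {toy_train.shape[0]:,}")
print(f"Number of test records: {toy_test.shape[0]:,}")
```

Fixing `random_state` means that re-running the notebook compares all models on the same test records.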

### Categorical Features

In [None]:
# Defining the categorical columns 
categoricalColumns = df_train.select_dtypes(include=[object]).columns

print(f"There are {len(categoricalColumns)} categorical columns: " )
print(categoricalColumns)

impute_categorical = SimpleImputer(strategy="most_frequent")
onehot_categorical =  OneHotEncoder(handle_unknown='ignore')

categorical_transformer = Pipeline(steps=[('impute',impute_categorical),('onehot',onehot_categorical)])
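
To see what this kind of pipeline does, the sketch below runs the same two steps on a tiny made-up column. The column name `STYLE` and its values are hypothetical and are not taken from the home value data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column with a missing value.
toy_train = pd.DataFrame({"STYLE": ["ranch", "colonial", np.nan, "ranch"]})
# "victorian" never appears in the training data.
toy_test = pd.DataFrame({"STYLE": ["colonial", "victorian"]})

toy_cat = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Fit on the training data only, then reuse the fitted steps on the test data.
train_encoded = toy_cat.fit_transform(toy_train).toarray()
test_encoded = toy_cat.transform(toy_test).toarray()

print(train_encoded)
print(test_encoded)
```

The missing training value is filled with the most frequent category, and with `handle_unknown='ignore'` the unseen test category is encoded as a row of zeros instead of raising an error.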

### Numerical Features

In [None]:
# Defining the numerical columns 
numericalColumns = [col for col in df_train.select_dtypes(include=[float,int]).columns 
                    if col not in ['SALEPRICE']]
print(f"There are {len(numericalColumns)} numerical columns: " )
print(numericalColumns)

scaler_numerical = StandardScaler()

numerical_transformer = Pipeline(steps=[('scale',scaler_numerical)])


### Creating a Pre-processor
For convenience and integrity, you create a pre-processor that performs all the required data transformations. The pre-processor is fitted on the training data only. You then apply the fitted pre-processor to the test data as well, so that no information from the test set influences the transformations.

In [None]:
preprocessorForCategoricalColumns = ColumnTransformer(
    transformers=[('cat', categorical_transformer, categoricalColumns)],
                                            remainder="passthrough")
preprocessorForAllColumns = ColumnTransformer(
    transformers=[('cat', categorical_transformer, categoricalColumns),
                  ('num',numerical_transformer,numericalColumns)],
                                            remainder="passthrough")


# The transformation normally happens inside the pipeline.
# It is done here only to show what the intermediate values look like.

preprocessorForCategoricalColumns.fit(df_train)
df_train_temp = preprocessorForCategoricalColumns.transform(df_train)
df_test_temp = preprocessorForCategoricalColumns.transform(df_test)

preprocessorForAllColumns.fit(df_train)
df_train_temp_2 = preprocessorForAllColumns.transform(df_train)
df_test_temp_2 = preprocessorForAllColumns.transform(df_test)
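
The fit-on-train, transform-on-test pattern can be checked on a small example. The sketch below uses made-up data (the `STYLE` column is hypothetical) and sets `sparse_threshold=0` only so that the result is a plain dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one categorical and one numerical column.
toy_train = pd.DataFrame({
    "STYLE": ["ranch", "colonial", "ranch", "colonial"],
    "LOTAREA": [8_000.0, 10_000.0, 12_000.0, 14_000.0],
})
toy_test = pd.DataFrame({"STYLE": ["ranch"], "LOTAREA": [11_000.0]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["STYLE"]),
        ("num", StandardScaler(), ["LOTAREA"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# The mean and standard deviation are learned from the training rows only.
train_matrix = toy_preprocessor.fit_transform(toy_train)
# The test row is scaled with the training statistics.
test_matrix = toy_preprocessor.transform(toy_test)

print(train_matrix.shape, test_matrix.shape)
print(test_matrix)
```

The test value 11,000 equals the training mean of `LOTAREA`, so its scaled value is 0. This shows that the scaler reuses the training statistics rather than refitting on the test data.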

### Evaluation
Use this evaluation function

In [None]:
def model_metrics(regressor, y_test, y_pred):
    # `regressor` is not used here; it is kept so that existing calls keep working.
    mse = mean_squared_error(y_test, y_pred)
    print(f"Mean squared error: {mse:.2f}")
    r2 = r2_score(y_test, y_pred)
    print(f"R2 score: {r2:.2f}")
    return (mse, r2)


## keep this list to collect results from the various experiments
results = []
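
For reference, the two metrics can also be computed by hand. The sketch below uses four made-up values and compares the manual results with `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

# MSE: the average squared difference between actual and predicted values.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 minus the ratio of the residual sum of squares to the
# total sum of squares around the mean of the actual values.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An R2 of 1 means perfect predictions, 0 means the model is no better than always predicting the mean, and negative values mean it is worse than that.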

### 7.1 Simple Linear Regression

There are 13 numerical independent variables (features). Use each of them in turn to train a simple linear regression model and compare the performance.

**Refresher:**

This is the most basic form of linear regression, in which the variable to be predicted depends on only one other variable. The model is the equation of a straight line:

$y = w_0 + w_1 \times x_1$

In the above equation, $y$ refers to the target variable and $x_1$ refers to the independent variable. $w_1$ is the coefficient that expresses the relationship between $y$ and $x_1$; it is also known as the slope. $w_0$ is the constant coefficient, a.k.a. the intercept. It is the value of $y$ when $x_1$ is zero.
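
As an illustration of the equation above, the least-squares slope and intercept can be computed directly and compared with what `LinearRegression` finds. This is a minimal sketch on five made-up points, not on the assignment data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie roughly on the line y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares slope: covariance of x and y divided by the variance of x.
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# The fitted line passes through the point (mean of x, mean of y).
w0 = y.mean() - w1 * x.mean()

# scikit-learn expects a 2-D feature matrix, hence the reshape.
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(f"manual:  w0={w0:.3f}, w1={w1:.3f}")
print(f"sklearn: w0={model.intercept_:.3f}, w1={model.coef_[0]:.3f}")
```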


In [None]:
from sklearn.linear_model import LinearRegression

### Iterate over each numerical column
for ftr in numericalColumns:
    print(ftr)
    X_train = df_train[ftr].values.reshape(-1, 1)
    y_train = df_train['SALEPRICE']
    slRegressor = LinearRegression()
    
    # train model
    slRegressor.fit(X_train,y_train)

    # inference
    X_test = df_test[ftr].values.reshape(-1, 1)
    y_actual = df_test['SALEPRICE']
    y_pred = slRegressor.predict(X_test)

    # evaluate
    MSE, R2 = model_metrics(slRegressor,y_actual,y_pred)
    results.append({'Model': f'SLR-{ftr}', 'MSE': f"{MSE:.2f}", 'R2': f"{R2:.2f}"})

    

### 7.2 Multiple Linear Regression Model

Multiple linear regression is an extension of simple linear regression. In this setup, the target value depends on more than one variable. The number of variables depends on the use case at hand. Usually a subject matter expert is involved in identifying the fields that contribute to a better prediction of the target.

$y = w_0 + w_1 \times x_1 + w_2 \times x_2 + \dots + w_n \times x_n$


Since multiple linear regression assumes that the output depends on more than one variable, we assume here that it depends on all 30 features. The data is already split into training and test sets. As an experiment, you can remove a few features and check whether the model performs any better.
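
To make the formula concrete, the sketch below generates noise-free synthetic data from known weights and checks that `LinearRegression` recovers them. The weights 1, 2 and 3 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 1 + 2*x1 + 3*x2 (no noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# intercept_ corresponds to w_0, coef_ to (w_1, w_2).
print("w0:", model.intercept_)
print("w1, w2:", model.coef_)
```

On real data the target also contains noise, so the fitted weights are only estimates.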

In [None]:
features = list(numericalColumns) + list(categoricalColumns)

In [None]:
from sklearn.linear_model import LinearRegression

model_name = 'Multiple Linear Regression'

mlRegressor = LinearRegression()

mlr_model = Pipeline(steps=[('preprocessorAll',preprocessorForAllColumns),('regressor', mlRegressor)])

X_train = df_train[features]
y_train = df_train['SALEPRICE']

X_test = df_test[features]
y_actual = df_test['SALEPRICE']

print(X_train.shape, y_train.shape)
mlr_model.fit(X_train, y_train)

print(mlRegressor)

In [None]:
y_pred = mlr_model.predict(X_test)

MSE, R2 = model_metrics(mlRegressor, y_actual, y_pred)
results.append({'Model': f'MLR', 'MSE': f"{MSE:.2f}", 'R2': f"{R2:.2f}"})



### 7.3 Polynomial Linear Regression Model

The prediction generated by simple linear regression is a straight line. When a simple or multiple linear regression does not fit the data points accurately, we can use polynomial linear regression, which adds powers of the input as extra features. For a single input $x_1$ and degree $n$, the model is:

$y = w_0 + w_1 \times x_1 + w_2 \times x_1^2 + \dots + w_n \times x_1^n$

Here we assume that the output depends on `YEARBUILT` and `LOTAREA`. With more than one input, `PolynomialFeatures` also generates the cross terms (products of powers of the inputs) up to the chosen degree. The data is already split into training and test sets.

In [None]:
X_train = df_train[['YEARBUILT', 'LOTAREA']]
y_train = df_train['SALEPRICE']

X_test = df_test[['YEARBUILT', 'LOTAREA']]
y_actual = df_test['SALEPRICE']


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

model_name = 'Polynomial Linear Regression'

polynomial_features= PolynomialFeatures(degree=3)
plRegressor = LinearRegression()

plr_model = Pipeline(steps=[('polyFeature', polynomial_features ),('regressor', plRegressor)])

plr_model.fit(X_train,y_train)



print(plRegressor)

# evaluation
y_pred = plr_model.predict(X_test)
MSE, R2 = model_metrics(plRegressor, y_actual, y_pred)

results.append({'Model': f'PLR-YEARBUILT-LOTAREA', 'MSE': f"{MSE:.2f}", 'R2': f"{R2:.2f}"})



### 7.4 Decision Tree Regressor

The Decision Tree Regressor has multiple hyper-parameters, for example `max_features`, `max_depth`, and `min_samples_leaf`. Experiment with different values.
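
As a rough sketch of how such hyper-parameters shape a tree, the loop below fits trees on synthetic data with different `max_depth` and `min_samples_leaf` values and reports the resulting tree size. The data and the chosen values are assumptions for illustration, not recommendations for the home value data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=200)

summary = []
for max_depth in [2, 5, None]:
    for min_samples_leaf in [1, 10]:
        tree = DecisionTreeRegressor(
            random_state=0,
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf,
        ).fit(X, y)
        # Record the settings together with the actual depth and leaf count.
        summary.append(
            (max_depth, min_samples_leaf, tree.get_depth(), tree.get_n_leaves())
        )

for row in summary:
    print("max_depth=%s, min_samples_leaf=%s -> depth=%s, leaves=%s" % row)
```

Smaller `max_depth` and larger `min_samples_leaf` give smaller trees, which usually fit the training data less closely but may generalize better.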


In [None]:
from sklearn.tree import DecisionTreeRegressor

model_name = "Decision Tree Regressor"


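# Select the feature columns only. Passing the whole DataFrame would let
# SALEPRICE leak into the model through remainder="passthrough".
X_train = df_train[features]
y_train = df_train['SALEPRICE']

X_test = df_test[features]
y_actual = df_test['SALEPRICE']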


for max_features in [10, 20, 30]:
    decisionTreeRegressor = DecisionTreeRegressor(random_state=0, max_features=max_features)
    dtr_model = Pipeline(steps=[('preprocessorAll',preprocessorForAllColumns),
                                ('regressor', decisionTreeRegressor)]) 

    dtr_model.fit(X_train,y_train)

    print(f"Max features = {max_features}")
    y_pred = dtr_model.predict(X_test)
    MSE, R2 = model_metrics(decisionTreeRegressor, y_actual, y_pred)

    results.append({'Model': f'DecisionTree-{max_features}', 'MSE': f"{MSE:.2f}", 'R2': f"{R2:.2f}"})



### 7.5 Random Forest Regression Model

Decision tree algorithms are efficient at ignoring columns that add no value in predicting the output, and in some cases we can even see how a prediction was derived by tracing the path through the tree. However, a single decision tree often does not perform well on its own: large trees tend to overfit the training data and are hard to interpret. Such models are often referred to as weak models. Performance can be improved by averaging the predictions of many decision trees, each trained on a random subset of the training data. This approach is called Random Forest Regression.
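
The averaging step can be verified directly. The sketch below fits a small forest on synthetic data and compares the forest's prediction with the mean of the predictions of its individual trees, which scikit-learn exposes as `estimators_`. The data is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict the first five rows with every individual tree ...
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
# ... and average the per-tree predictions.
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(forest.predict(X[:5]))
```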

In [None]:
from sklearn.ensemble import RandomForestRegressor

model_name = "Random Forest Regressor"

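# Select the feature columns only. Passing the whole DataFrame would let
# SALEPRICE leak into the model through remainder="passthrough".
X_train = df_train[features]
y_train = df_train['SALEPRICE']

X_test = df_test[features]
y_actual = df_test['SALEPRICE']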


### Let's experiment with different hyper-parameters
for n_estimators in [50, 100]:
    for max_depth in [5, 15, 30]:
        randomForestRegressor = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    
        rfr_model = Pipeline(steps=[('preprocessorAll',preprocessorForAllColumns),('regressor', randomForestRegressor)]) 
    
        rfr_model.fit(X_train,y_train)
    
        y_pred = rfr_model.predict(X_test)
        MSE, R2 = model_metrics(randomForestRegressor, y_actual, y_pred)
    
        results.append({
            'Model': f'RF-{n_estimators}-{max_depth}',
            'MSE': f"{MSE:.2f}",
            'R2': f"{R2:.2f}"
        })





# Model Comparison

You may have created duplicate results. Once everything is working properly, restart the kernel and run everything above. Then run the cell below. You are encouraged to analyse and visualize the performance results.

In [None]:
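# MSE and R2 were stored as formatted strings; convert them back to numbers
# so that sorting is numeric rather than alphabetical.
result_df = pd.DataFrame.from_records(results) \
    .astype({'MSE': float, 'R2': float}) \
    .sort_values('R2', ascending=False)
display(result_df.head(30))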
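
One possible starting point for the analysis is to add the root mean squared error (RMSE), which is in the same unit as the sale price and therefore easier to read than the MSE. The sketch below uses three made-up result records in the same shape as the entries of `results`; the model names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical records shaped like the entries of `results`
# (metrics stored as formatted strings).
toy_results = [
    {"Model": "A", "MSE": "2500000000.00", "R2": "0.60"},
    {"Model": "B", "MSE": "900000000.00", "R2": "0.85"},
    {"Model": "C", "MSE": "6400000000.00", "R2": "-0.05"},
]

toy_df = pd.DataFrame.from_records(toy_results).astype({"MSE": float, "R2": float})

# RMSE is the square root of the MSE, expressed in the unit of the target.
toy_df["RMSE"] = np.sqrt(toy_df["MSE"])

# Best model first, judged by R2.
toy_df = toy_df.sort_values("R2", ascending=False).reset_index(drop=True)
print(toy_df)
```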


# Testing
There is no automated testing for this assignment.

# Homework Submission
- This homework is due by 2023-04-03, 5:30PM (EDT)
- **For this assignment you need to submit your notebook**
- All file names on this system are case sensitive. Verify the file names if you copy your work from a local computer to your home directory on ARC.

**Run the cell below to submit your work.**

- You may submit your work multiple times up to the deadline of the assignment.
- Please note that any files that you previously submitted will be overwritten by your current files.

In [None]:
# Run this cell to submit this notebook
#
from submit import submit
submit()