# Week 5 - Regression Models
Since we are trying to predict total price as a continuous value, we will fit a regression model.  There are several different models.  
- **Multiple Linear Regression**
- **Ridge Regression Model** fits the model while keeping the weights as close to zero as possible.  We must scale all numeric features so that features with large values do not dominate the penalty.
  1.  Ridge regression reduces overfitting.
  2.  Ridge regression adds a penalty that prevents the model from relying too much on any one feature, which smooths out the model.

- The **Lasso Regression Model** fits the model while forcing some weights to be exactly zero. We must scale the numeric features.
  1.  Lasso removes useless features, which makes the model simpler.
  2.  It helps when there are too many variables.
  3.  It reduces overfitting like Ridge, but it also performs feature selection.
 
- **Elastic Net Regression**
  1.  Combination of Ridge and Lasso Regression
  2.  Reduces Overfitting while removing useless features
  3.  Works best when many features are correlated

- **Random Forest Regressor**
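  1.  A random forest regressor averages the predictions of many decision trees. It can capture non-linear relationships that the linear models above miss, so it is worth trying when those models fit poorly.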
  
This week, you will create all the models above using the Boston Housing Data. We will also evaluate the models using cross-validation.

In k-fold cross-validation, we divide the training data set into k parts (folds).  We train the model on k-1 of the folds and evaluate it on the remaining fold, which acts as the validation data set.  This is repeated k times, so each model is trained on (k-1)/k of the training data and each fold is used for validation exactly once.  The validation fold shows how well the model fits data it has not seen.  The final step is to average the k metric values to evaluate the model.
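
The steps above can be sketched by hand with `KFold`. This is only an illustration on synthetic data: `X_demo`, `y_demo`, and the choice of 5 folds are assumptions, not part of the assignment. `cross_val_score`, used later in this notebook, performs the same loop for you.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (assumed for illustration): 100 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_rmses = []
for train_idx, val_idx in kfold.split(X_demo):
    # Train on k-1 folds, validate on the fold that was held out
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    y_val_pred = model.predict(X_demo[val_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(y_demo[val_idx], y_val_pred)))

# Average the k validation scores to get one number for the model
mean_rmse = float(np.mean(fold_rmses))
print("RMSE per fold:", np.round(fold_rmses, 3))
print("Mean RMSE:    ", round(mean_rmse, 3))
```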

**How to decide the number of folds**

- **5 or 10 folds is a good default.**
- **Smaller datasets need larger K (e.g., K=10).**
- **Larger datasets can use smaller K (e.g., K=3-5).**
- **Always balance accuracy vs. training time!**

Run the code cell below.

In [1]:
# Load Libraries run this code as is
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import QuantileTransformer 


# Read Data set

# Question 1
1. Read in the training data set.
2. Drop the Unnamed: 0 column from the training data set.
3. Read in the test data set.
4. Drop the Unnamed: 0 column from the testing data set.

In [2]:
# Question 1  - 1)
airBnb = pd.read_csv("train_set.csv")

In [3]:
# Question 1  - 2)
airBnb.drop("Unnamed: 0", axis=1, inplace=True)

In [4]:
# Question 1  - 3)
test_set = pd.read_csv("test_set.csv")

In [5]:
# Question 1  - 4)
test_set.drop("Unnamed: 0", axis=1, inplace=True)

# Separate feature and target value for training and test data sets

# Question 2
1. Create the feature dataframe for training data set.
2. Create the target dataframe for training data set.
3. Create the feature dataframe for testing data set.
4. Create the target dataframe for the testing data set.

In [6]:
# Question 2  - 1)
X = airBnb.drop("total_price", axis=1)

In [7]:
# Question 2  - 2)
y = airBnb['total_price']

In [8]:
# Question 2  - 3)
X_test = test_set.drop("total_price", axis=1)

In [9]:
# Question 2  - 4)
y_test = test_set['total_price']

# Create a pipeline 
For numeric features, the pipeline handles missing values and standardization.  For categorical (string) features, the pipeline handles missing values and one-hot encoding.

```
num_feature_cols = df.select_dtypes(include=[np.number]).columns
cat_feature_cols = df.select_dtypes(include=['object']).columns

num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing numeric values with median
    ('scaler', StandardScaler()),  # Scale the data values (x - mean)/standard deviation
    ('normal', QuantileTransformer(output_distribution='normal'))   # transform data to normal distribution                 
])

cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing categorical values
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical data
])


```
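
To see what these transformers do, here is a small sketch on a made-up table. The column names `bedrooms` and `room_type` and all the values are hypothetical. The `QuantileTransformer` step is left out because a four-row table is too small for it to be meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 4.0],
    "room_type": pd.Series(["private", "entire", "entire", np.nan], dtype="object"),
})

toy_num = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # NaN -> median of the column
    ("scaler", StandardScaler()),                    # (x - mean) / standard deviation
])
toy_cat = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN -> most common value
    ("encoder", OneHotEncoder(handle_unknown="ignore")),   # one 0/1 column per category
])
toy_pre = ColumnTransformer([
    ("num", toy_num, ["bedrooms"]),
    ("cat", toy_cat, ["room_type"]),
])

toy_out = toy_pre.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense to print
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)
print(toy_out.shape)
print(np.round(toy_out, 2))
```

The first output column is the scaled numeric feature. The remaining columns are the one-hot encoding of `room_type`.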


# Question 3
1. Create a list of column names for numeric features and store it in the variable numeric_features.
2. Create a list of column names for categorical features and store it in the variable categorical_features.
3. Create a Pipeline for the numeric data that replaces missing data with the median, applies StandardScaler, and transforms the data to a normal distribution. Assign it to the variable numeric_transformer.
4. Create a Pipeline for the categorical data that replaces missing data with the most_frequent value and applies a one-hot encoder.  Assign it to the variable categorical_transformer.
5. Create the preprocessor ColumnTransformer with the numeric_transformer, categorical_transformer.

In [10]:
# Question 3  - 1)
numeric_features = X.select_dtypes(include=[np.number]).columns

In [11]:
# Question 3  - 2)
categorical_features = X.select_dtypes(include=['object']).columns

In [12]:
# Question 3  - 3)
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing values with the median
    ('scaler', StandardScaler()), # Scale numeric features
    ('normal', QuantileTransformer(output_distribution='normal'))  # Map each feature to a normal distribution
])


In [13]:
# Question 3  - 4)
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing categorical values
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical data
])

In [14]:
# Question 3  - 5)
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])



# Fitting a Multiple Linear Regression
Create the linear regression model pipeline using the code below.

```
# Create a pipeline for a linear regression model

ln_reg =make_pipeline(preprocessor, TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler()))

# fit a linear regression model
ln_reg.fit(X,y)

# predict the linear regression model
y_pred = ln_reg.predict(X)

# Compute the root-mean-squared error (RMSE) on the training set
ln_reg_rmse_train = mean_squared_error(y, y_pred, squared=False)

# predict results using the X-test set
y_pred_test = ln_reg.predict(X_test)

# Compute the root-mean-squared error (RMSE) on the test set
ln_reg_rmse_test = mean_squared_error(y_test, y_pred_test, squared=False)

# print out the multiple regression RMSE for the training and test data sets
print("Training ", ln_reg_rmse_train)
print("Testing  ", ln_reg_rmse_test)

# run cross-validation for training and test data sets
ln_rmses_cv = -cross_val_score(ln_reg, X, y, scoring="neg_root_mean_squared_error", cv=10)
ln_rmses_cv_test = -cross_val_score(ln_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=10)

# print out cross validation results
print("train data set RMSE")
print(pd.Series(ln_rmses_cv).describe())
print("\ntest data set RMSE")
print(pd.Series(ln_rmses_cv_test).describe())

```
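
The pipeline above wraps the regressor in `TransformedTargetRegressor`. A minimal sketch of what that wrapper does, on synthetic data (`X_demo` and `y_demo` are assumptions for illustration): the target is standardized before fitting, and predictions are converted back to the original units automatically.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic target on a price-like scale (assumed for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = 500 + 300 * X_demo[:, 0] + rng.normal(scale=20, size=200)

wrapped = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
wrapped.fit(X_demo, y_demo)
plain = LinearRegression().fit(X_demo, y_demo)

# The inner regressor was trained on the standardized target ...
scaled_pred = wrapped.regressor_.predict(X_demo)
# ... but the wrapper returns predictions in the original units
wrapped_pred = wrapped.predict(X_demo)

print("inner predictions, mean:  ", round(float(scaled_pred.mean()), 6))
print("wrapped predictions, mean:", round(float(wrapped_pred.mean()), 2))
print("same as plain regression: ", np.allclose(wrapped_pred, plain.predict(X_demo)))
```

For ordinary linear regression the wrapper does not change the predictions. Scaling the target matters more for the penalized models, where alpha is applied on the scale of the target.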

# Question 4
1. Create a pipeline for a linear regression model and assign it to the variable lin_reg.
2. Fit the lin_reg model with X and y.
3. Predict with the lin_reg model using X and assign the result to the variable y_pred.
4. Compute the root-mean-squared error with mean_squared_error for the linear regression model and assign it to the variable lin_reg_rmse_train.
5. Predict with the lin_reg model using X_test and assign the result to the variable y_pred_test.
6. Compute the root-mean-squared error with mean_squared_error for the linear regression model and assign it to the variable lin_reg_rmse_test.
7. Print out the Training lin_reg_rmse_train and Testing lin_reg_rmse_test in one code cell.
8. Run the cross-validation for the training and test data set.  Assign the results to lin_rmses_cv and lin_rmses_cv_test.  Be sure to change ln_reg to lin_reg.
9. In one code cell, print out the cross-validation results.

In [15]:
# Question 4  - 1)
lin_reg = make_pipeline(preprocessor, TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler()))

In [16]:
# Question 4  - 2)
lin_reg.fit(X,y)

In [17]:
# Question 4  - 3)
y_pred = lin_reg.predict(X)

In [18]:
# Question 4  - 4)
lin_reg_rmse_train = mean_squared_error(y, y_pred, squared=False)

In [19]:
# Question 4  - 5)
y_pred_test = lin_reg.predict(X_test)

In [20]:
# Question 4  - 6)
lin_reg_rmse_test = mean_squared_error(y_test, y_pred_test, squared=False)

In [21]:
# Question 4  - 7)
print("Training ", lin_reg_rmse_train)
print("Testing  ", lin_reg_rmse_test)

Training  397.8028381275751
Testing   398.53447106323614


In [22]:
# Question 4  - 8)
lin_rmses_cv = -cross_val_score(lin_reg, X, y, scoring="neg_root_mean_squared_error", cv=3)
lin_rmses_cv_test = -cross_val_score(lin_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=3)


In [23]:
# Question 4  - 9)
print("train data set RMSE")
print(pd.Series(lin_rmses_cv).describe())
print("\ntest data set RMSE")
print(pd.Series(lin_rmses_cv_test).describe())

train data set RMSE
count      3.000000
mean     397.954708
std        0.520349
min      397.501275
25%      397.670643
50%      397.840012
75%      398.181424
max      398.522836
dtype: float64

test data set RMSE
count      3.000000
mean     398.918778
std        1.976786
min      396.643869
25%      398.268998
50%      399.894126
75%      400.056232
max      400.218337
dtype: float64


The RMSE of 397.9 is not very good for a total_price that ranges from 60 to 1440. This suggests that the features used to predict the total_price do not provide enough information for a linear model to make a good prediction.  We will look at other regression models to see if there is any improvement.
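
One way to judge whether an RMSE is good is to compare it with a baseline that always predicts the mean of the target. The sketch below uses `DummyRegressor` on made-up prices; the data and the names `X_demo` and `y_demo` are assumptions. On the notebook's data you would pass `X` and `y` instead.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical prices in the same range as total_price
rng = np.random.default_rng(2)
y_demo = rng.uniform(60, 1440, size=1000)
X_demo = rng.normal(size=(1000, 3))  # features are ignored by the baseline

# The baseline always predicts the training mean of the target
baseline = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
baseline_rmse = np.sqrt(mean_squared_error(y_demo, baseline.predict(X_demo)))

# The RMSE of a mean-only prediction equals the standard deviation of the target
print("Baseline RMSE:  ", round(float(baseline_rmse), 2))
print("Std of target:  ", round(float(np.std(y_demo)), 2))
```

A model whose RMSE is close to the baseline RMSE has learned very little from the features.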

# Ridge Regression
Ridge regression keeps the weights close to zero to reduce overfitting. The model has a parameter alpha, set to 0.1 for this example, that controls the strength of the penalty. The solver is set to "sag" (stochastic average gradient), an iterative gradient-based method.

```
# create the pipeline for ridge regression
rid_reg = make_pipeline(preprocessor, TransformedTargetRegressor(Ridge(alpha=0.1, solver="sag"),
                                       transformer=StandardScaler()))

# fit the ridge regression model
rid_reg.fit(X,y)

# predict the training set results
y_pred_rid_reg = rid_reg.predict(X)

# RMSE of the ridge regression on the training set
rid_rmse_train = mean_squared_error(y, y_pred_rid_reg, squared=False)

# predict the test set results
y_pred_test_rid_test = rid_reg.predict(X_test)

# RMSE of the ridge regression on the test set
rid_reg_rmse_test = mean_squared_error(y_test, y_pred_test_rid_test, squared=False)

# print out the metrics for training and test set
print("Training ", rid_rmse_train)
print("Testing  ", rid_reg_rmse_test)

# Cross Validation K-Fold where K=10 fold
rid_rmses_cv = -cross_val_score(rid_reg, X, y, scoring="neg_root_mean_squared_error", cv=10)
rid_rmses_cv_test = -cross_val_score(rid_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=10)

# print the cross validation results
print("Ridge Regression - train data set RMSE")
print(pd.Series(rid_rmses_cv).describe())
print("\nRidge Regression - test data set RMSE")
print(pd.Series(rid_rmses_cv_test).describe())

```
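
To illustrate how alpha pulls the weights toward zero, here is a small sketch on synthetic data. The data and the three alpha values are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumed for illustration): 50 rows, 5 features
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=50)

alphas = [0.1, 10.0, 1000.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Length of the weight vector: smaller means the weights are closer to zero
    coef_norms.append(float(np.linalg.norm(ridge.coef_)))
    print(f"alpha={alpha:>7}: weights={np.round(ridge.coef_, 3)}")

print("Weight vector length per alpha:", np.round(coef_norms, 3))
```

A larger alpha shrinks the weights more. With alpha=0.1, as used in this notebook, the penalty is weak, so the result is expected to stay close to plain linear regression.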

# Question 5
1. Create a pipeline for a ridge regression model and assign it to the variable ridge_reg.
2. Fit the ridge_reg model with X and y.
3. Predict with the ridge_reg model using X and assign the result to the variable y_pred.
4. Compute the root-mean-squared error with mean_squared_error for the ridge regression model and assign it to the variable ridge_reg_rmse_train.
5. Predict with the ridge_reg model using X_test and assign the result to the variable y_pred_test.
6. Compute the root-mean-squared error with mean_squared_error for the ridge regression model and assign it to the variable ridge_reg_rmse_test.
7. Print out the Training ridge_reg_rmse_train and Testing ridge_reg_rmse_test in one code cell.
8. Run the cross-validation for the training and test data set in one code cell. Be sure to change rid_reg to ridge_reg.
9. In one code cell, print out the cross-validation results.  Be sure to change rid_rmses_cv to ridge_rmses_cv and rid_rmses_cv_test to ridge_rmses_cv_test.

In [24]:
# Question 5 - 1)
ridge_reg = make_pipeline(preprocessor, TransformedTargetRegressor(Ridge(alpha=0.1, solver="sag"), 
                                                                   transformer=StandardScaler()))

In [25]:
# Question 5 - 2)
ridge_reg.fit(X,y)

In [26]:
# Question 5 - 3)
y_pred_ridge_reg = ridge_reg.predict(X)

In [27]:
# Question 5 - 4)
ridge_reg_rmse_train = mean_squared_error(y, y_pred_ridge_reg, squared=False)

In [28]:
# Question 5 - 5)
y_pred_test_ridge_test = ridge_reg.predict(X_test)

In [29]:
# Question 5 - 6)
ridge_reg_rmse_test = mean_squared_error(y_test, y_pred_test_ridge_test, squared=False)

In [30]:
# Question 5 - 7)
print("Training ", ridge_reg_rmse_train)
print("Testing  ", ridge_reg_rmse_test)    

Training  397.80286555503164
Testing   398.53362736877745


In [31]:
# Question 5 - 8)
ridge_rmses_cv = -cross_val_score(ridge_reg, X, y, scoring="neg_root_mean_squared_error", cv=3)
ridge_rmses_cv_test = -cross_val_score(ridge_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=3)

In [32]:
# Question 5 - 9)
print("Ridge Regression - train data set RMSE")
print(pd.Series(ridge_rmses_cv).describe())
print("\nRidge Regression - test data set RMSE")
print(pd.Series(ridge_rmses_cv_test).describe())

Ridge Regression - train data set RMSE
count      3.000000
mean     397.953817
std        0.520412
min      397.500380
25%      397.669706
50%      397.839032
75%      398.180535
max      398.522039
dtype: float64

Ridge Regression - test data set RMSE
count      3.000000
mean     398.919106
std        1.975719
min      396.645461
25%      398.269502
50%      399.893542
75%      400.055929
max      400.218315
dtype: float64


In [33]:
ridge_reg = make_pipeline(preprocessor, TransformedTargetRegressor(Ridge(alpha=0.1, solver="sag"), 
                                                                   transformer=StandardScaler()))
ridge_reg.fit(X,y)

y_pred_ridge_reg = ridge_reg.predict(X)
ridge_rmse_train = mean_squared_error(y, y_pred_ridge_reg, squared=False)

y_pred_test_ridge_test = ridge_reg.predict(X_test)
ridge_reg_rmse_test = mean_squared_error(y_test, y_pred_test_ridge_test, squared=False)

print("Training ", ridge_rmse_train)
print("Testing  ", ridge_reg_rmse_test)    

Training  397.80264069484
Testing   398.53420288998655


In [34]:
# Cross-validation, K-Fold with K=3
ridge_rmses_cv = -cross_val_score(ridge_reg, X, y, scoring="neg_root_mean_squared_error", cv=3)
ridge_rmses_cv_test = -cross_val_score(ridge_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=3)

In [35]:
print("Ridge Regression - train data set RMSE")
print(pd.Series(ridge_rmses_cv).describe())
print("\nRidge Regression - test data set RMSE")
print(pd.Series(ridge_rmses_cv_test).describe())

Ridge Regression - train data set RMSE
count      3.000000
mean     397.955322
std        0.520578
min      397.501387
25%      397.671208
50%      397.841030
75%      398.182290
max      398.523549
dtype: float64

Ridge Regression - test data set RMSE
count      3.000000
mean     398.918744
std        1.976443
min      396.644186
25%      398.269302
50%      399.894417
75%      400.056023
max      400.217628
dtype: float64


The ridge regression model RMSE is also 397.9, so it did not do any better in predicting the total_price.  The parameter for the Ridge regression model is alpha = 0.1.  With this value of alpha, the model did not predict the total_price any better than linear regression. Next, we will look at the Lasso Regression model.

# Lasso Regression
Another regression model is the lasso regression model.  This model also has a parameter alpha, set to 0.1 here. It sets the weights of some features to exactly zero, which removes those features from the prediction of the total_price target variable.

```
# Lasso Regression
las_reg = make_pipeline(preprocessor, TransformedTargetRegressor(Lasso(alpha=0.1),
                                       transformer=StandardScaler()))

# fit the Lasso Regression model
las_reg.fit(X,y)

# predict the total_price for the features in the training data set
y_pred_las = las_reg.predict(X)

# find the RMSE for the lasso regression model for the training data set
las_reg_rmse_train = mean_squared_error(y, y_pred_las, squared=False)

# predict the total_price for the features in the test data set
y_pred_test_las = las_reg.predict(X_test)

# find the RMSE for the lasso regression model for the testing data set
las_reg_rmse_test = mean_squared_error(y_test, y_pred_test_las, squared=False)

# print out the RMSE for the training and test data sets
print("Training ", las_reg_rmse_train)
print("Testing  ", las_reg_rmse_test)

```
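
To see the feature selection that lasso performs, here is a sketch on synthetic data where only the first two of six features affect the target. The data and the value alpha=0.5 are assumptions chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumed for illustration): only features 0 and 1 matter
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 6))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Count the weights that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Linear regression weights:", np.round(ols.coef_, 3))
print("Lasso weights:            ", np.round(lasso.coef_, 3))
print("Zero weights (linear, lasso):", n_zero_ols, n_zero_lasso)
```

Linear regression gives every feature a small non-zero weight, while lasso drops the features that do not help. If alpha is too large for the data, lasso can also drop useful features.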
# Question 6
1. Create a pipeline for a lasso regression model and assign it to the variable lasso_reg.
2. Fit the lasso_reg model with X and y.
3. Predict with the lasso_reg model using X and assign the result to the variable y_pred_lasso.
4. Compute the root-mean-squared error with mean_squared_error for the lasso regression model and assign it to the variable lasso_reg_rmse_train.
5. Predict with the lasso_reg model using X_test and assign the result to the variable y_pred_test_lasso.
6. Compute the root-mean-squared error with mean_squared_error for the lasso regression model and assign it to the variable lasso_reg_rmse_test.
7. Print out the Training lasso_reg_rmse_train and Testing lasso_reg_rmse_test in one code cell.
8. Run the cross-validation for the training and test data set in one code cell with cv=10.
9. In one code cell, print out the cross-validation results.

In [36]:
# Question 6 - 1)
lasso_reg = make_pipeline(preprocessor, TransformedTargetRegressor(Lasso(alpha=0.1),
                                       transformer=StandardScaler()))


In [37]:
# Question 6 - 2)
lasso_reg.fit(X,y)

In [38]:
# Question 6 - 3)
y_pred_lasso = lasso_reg.predict(X)

In [39]:
# Question 6 - 4)
lasso_reg_rmse_train = mean_squared_error(y, y_pred_lasso, squared=False)

In [40]:
# Question 6 - 5)
y_pred_test_lasso = lasso_reg.predict(X_test)


In [41]:
# Question 6 - 6)
lasso_reg_rmse_test = mean_squared_error(y_test, y_pred_test_lasso, squared=False)

In [42]:
# Question 6 - 7)
print("Training ", lasso_reg_rmse_train)
print("Testing  ", lasso_reg_rmse_test)

Training  397.84046581776283
Testing   398.5535082869724


In [43]:
# Question 6 - 8)
lasso_rmses_cv = -cross_val_score(lasso_reg, X, y, scoring="neg_root_mean_squared_error", cv=3)
lasso_rmses_cv_test = -cross_val_score(lasso_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=3)

In [44]:
# Question 6 - 9)
print("Lasso Regression - train data set RMSE")
print(pd.Series(lasso_rmses_cv).describe())
print("\nLasso Regression - test data set RMSE")
print(pd.Series(lasso_rmses_cv_test).describe())

Lasso Regression - train data set RMSE
count      3.000000
mean     397.850457
std        0.462001
min      397.459042
25%      397.595648
50%      397.732254
75%      398.046165
max      398.360077
dtype: float64

Lasso Regression - test data set RMSE
count      3.000000
mean     398.540759
std        2.285295
min      395.925833
25%      397.733572
50%      399.541312
75%      399.848222
max      400.155132
dtype: float64


In [45]:
# Lasso Regression
lasso_reg = make_pipeline(preprocessor, TransformedTargetRegressor(Lasso(alpha=0.1),
                                       transformer=StandardScaler()))
lasso_reg.fit(X,y)
y_pred_lasso = lasso_reg.predict(X)

lasso_reg_rmse_train = mean_squared_error(y, y_pred_lasso, squared=False)

y_pred_test_lasso = lasso_reg.predict(X_test)
lasso_reg_rmse_test = mean_squared_error(y_test, y_pred_test_lasso, squared=False)

print("Training ", lasso_reg_rmse_train)
print("Testing  ", lasso_reg_rmse_test)

Training  397.84046581776283
Testing   398.5535082869724


In [46]:
# Cross-validation, K-Fold with K=3
lasso_rmses_cv = -cross_val_score(lasso_reg, X, y, scoring="neg_root_mean_squared_error", cv=3)
lasso_rmses_cv_test = -cross_val_score(lasso_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=3)

In [47]:
print("Lasso Regression - train data set RMSE")
print(pd.Series(lasso_rmses_cv).describe())
print("\nLasso Regression - test data set RMSE")
print(pd.Series(lasso_rmses_cv_test).describe())

Lasso Regression - train data set RMSE
count      3.000000
mean     397.850457
std        0.462001
min      397.459042
25%      397.595648
50%      397.732254
75%      398.046165
max      398.360077
dtype: float64

Lasso Regression - test data set RMSE
count      3.000000
mean     398.540759
std        2.285295
min      395.925833
25%      397.733572
50%      399.541312
75%      399.848222
max      400.155132
dtype: float64


Again the RMSE is about 398, which means the Lasso model is not better at predicting the total_price. Next is the Elastic Net Regression model, which combines the Ridge and Lasso penalties.

# Elastic Net Regression
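The Elastic Net Regression combines the penalties of the Ridge Regression and the Lasso Regression. There are two parameters: alpha sets the overall strength of the penalty, and l1_ratio sets the weight given to the lasso penalty relative to the ridge penalty (l1_ratio=1 is pure lasso and l1_ratio=0 is pure ridge).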

```
# Elastic Net Regression
en_reg = make_pipeline(preprocessor, TransformedTargetRegressor(ElasticNet(alpha=0.1, l1_ratio=0.5),
                                       transformer=StandardScaler()))

# Fit the Elastic Net Regression
en_reg.fit(X,y)

# Predict the target values for training data set
y_pred_en = en_reg.predict(X)

# Determine the RMSE for the training data set
en_reg_rmse_train = mean_squared_error(y, y_pred_en, squared=False)

# Predict the test values for test data set
y_pred_test_en = en_reg.predict(X_test)

# Determine the RMSE for the test data set
en_reg_rmse_test = mean_squared_error(y_test, y_pred_test_en, squared=False)

# Print out the RMSE for the training and testing data set
print("Training ", en_reg_rmse_train)
print("Testing  ", en_reg_rmse_test)

# Cross-validation, K-Fold with K=10
en_rmses_cv = -cross_val_score(en_reg, X, y, scoring="neg_root_mean_squared_error", cv=10)
en_rmses_cv_test = -cross_val_score(en_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=10)

# Print out the RMSE using the cross validation k fold for the training and testing data set
print("Elastic Regression - train data set RMSE")
print(pd.Series(en_rmses_cv).describe())
print("\nElastic Regression - test data set RMSE")
print(pd.Series(en_rmses_cv_test).describe())
```
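
A short sketch of how l1_ratio moves the model between lasso and ridge behavior, on synthetic data (the data and names such as `enet_mix` are assumptions for illustration). With l1_ratio=1.0 the Elastic Net penalty reduces to the lasso penalty, so the two models produce the same weights.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (assumed for illustration): only features 0 and 1 matter
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(200, 6))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)   # pure lasso penalty
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)  # half lasso, half ridge

same_as_lasso = bool(np.allclose(lasso.coef_, enet_l1.coef_))
mix_differs = not np.allclose(lasso.coef_, enet_mix.coef_)

print("Lasso weights:               ", np.round(lasso.coef_, 3))
print("Elastic Net, l1_ratio=1.0:   ", np.round(enet_l1.coef_, 3))
print("Elastic Net, l1_ratio=0.5:   ", np.round(enet_mix.coef_, 3))
print("l1_ratio=1.0 matches lasso:  ", same_as_lasso)
```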

# Question 7
1. Create a pipeline for an Elastic Net regression model and assign it to the variable elastic_reg.
2. Fit the elastic_reg model with X and y.
3. Predict with the elastic_reg model using X and assign the result to the variable y_pred_elastic.
4. Compute the root-mean-squared error with mean_squared_error for the Elastic Net regression model and assign it to the variable elastic_reg_rmse_train.
5. Predict with the elastic_reg model using X_test and assign the result to the variable y_pred_test_elastic.
6. Compute the root-mean-squared error with mean_squared_error for the Elastic Net regression model and assign it to the variable elastic_reg_rmse_test.
7. Print out the Training elastic_reg_rmse_train and Testing elastic_reg_rmse_test in one code cell.
8. Run the cross-validation for the training and test data set in one code cell with cv=10.
9. In one code cell, print out the cross-validation results.

In [48]:
# Question 7 - 1)
elastic_reg = make_pipeline(preprocessor, TransformedTargetRegressor(ElasticNet(alpha=0.1, l1_ratio=0.5),
                                       transformer=StandardScaler()))


In [49]:
# Question 7 - 2)
elastic_reg.fit(X,y)

In [50]:
# Question 7 - 3)
y_pred_elastic = elastic_reg.predict(X)

In [51]:
# Question 7 - 4)
elastic_reg_rmse_train = mean_squared_error(y, y_pred_elastic, squared=False)

In [52]:
# Question 7 - 5)
y_pred_test_elastic = elastic_reg.predict(X_test)

In [53]:
# Question 7 - 6)
elastic_reg_rmse_test = mean_squared_error(y_test, y_pred_test_elastic, squared=False)

In [54]:
# Question 7 - 7)
print("Training ", elastic_reg_rmse_train)
print("Testing  ", elastic_reg_rmse_test)

Training  397.84046581776283
Testing   398.5535082869724


In [55]:
# Question 7 - 8)
elastic_rmses_cv = -cross_val_score(elastic_reg, X, y, scoring="neg_root_mean_squared_error", cv=3)
elastic_rmses_cv_test = -cross_val_score(elastic_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=3)

In [56]:
# Question 7 - 9)
print("Elastic Regression - train data set RMSE")
print(pd.Series(elastic_rmses_cv).describe())
print("\nElastic Regression - test data set RMSE")
print(pd.Series(elastic_rmses_cv_test).describe())

Elastic Regression - train data set RMSE
count      3.000000
mean     397.850457
std        0.462001
min      397.459042
25%      397.595648
50%      397.732254
75%      398.046165
max      398.360077
dtype: float64

Elastic Regression - test data set RMSE
count      3.000000
mean     398.536057
std        2.282220
min      395.925833
25%      397.726520
50%      399.527208
75%      399.841170
max      400.155132
dtype: float64


The RMSE for the Elastic Net Regression, about 398, is pretty much the same. When other models do not work, you can run the Random Forest Regressor model. 

# Random Forest
When other models fail to work, you can run the random forest model to see if it is any better at predicting the total_price.

```
# Random Forest Regression
fr_reg = make_pipeline(preprocessor, TransformedTargetRegressor(RandomForestRegressor(random_state=420),  
                                                                    transformer=StandardScaler()))

# Fit the Random Forest Regression Model
fr_reg.fit(X,y)

# Predict the total_price using the training data set
y_pred_fr = fr_reg.predict(X)

# RMSE for Random Forest for the training data set
fr_reg_rmse_train = mean_squared_error(y, y_pred_fr, squared=False)

# Predict the Random Forest for test data set
y_pred_test_fr = fr_reg.predict(X_test)

# RMSE for Random Forest for test data set
fr_reg_rmse_test = mean_squared_error(y_test, y_pred_test_fr, squared=False)

# print out RMSE for Random Forest for training and test data set
print("Training ", fr_reg_rmse_train)
print("Testing  ", fr_reg_rmse_test)


# Cross-validation, K-Fold with K=10
fr_rmses_cv = -cross_val_score(fr_reg, X, y, scoring="neg_root_mean_squared_error", cv=10)
fr_rmses_cv_test = -cross_val_score(fr_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=10)

# Print Random Forest cross validation RMSE results for training and test data set
print("Random Forest Regression - train data set RMSE")
print(pd.Series(fr_rmses_cv).describe())
print("\nRandom Forest Regression - test data set RMSE")
print(pd.Series(fr_rmses_cv_test).describe())
```
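
A random forest is not guaranteed to do better, but it can help when the relationship between the features and the target is not a straight line. The sketch below compares the two model types on a synthetic curved relationship; the data and the helper `holdout_rmse` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear data (assumed for illustration): y follows a sine curve
rng = np.random.default_rng(6)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = np.sin(2 * X_demo[:, 0]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

def holdout_rmse(model):
    """Fit on the training split and return the RMSE on the held-out split."""
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

linear_rmse = holdout_rmse(LinearRegression())
forest_rmse = holdout_rmse(RandomForestRegressor(random_state=420))

print("Linear regression RMSE:", round(linear_rmse, 3))
print("Random forest RMSE:    ", round(forest_rmse, 3))
```

If the features carry little information about the target, as the earlier results suggest may be the case here, a random forest may not improve the RMSE much either.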

# Question 8
1. Create a pipeline for a Random Forest regression model and assign it to the variable forest_reg.
2. Fit the forest_reg model with X and y.
3. Predict with the forest_reg model using X and assign the result to the variable y_pred_forest.
4. Compute the root-mean-squared error with mean_squared_error for the Random Forest regression model and assign it to the variable forest_reg_rmse_train.
5. Predict with the forest_reg model using X_test and assign the result to the variable y_pred_test_forest.
6. Compute the root-mean-squared error with mean_squared_error for the Random Forest regression model and assign it to the variable forest_reg_rmse_test.
7. Print out the Training forest_reg_rmse_train and Testing forest_reg_rmse_test in one code cell.
8. Run the cross-validation for the training and test data set in one code cell with cv=10.
9. In one code cell, print out the cross-validation results.

In [57]:
# Question 8 - 1)
forest_reg = make_pipeline(preprocessor, TransformedTargetRegressor(RandomForestRegressor(random_state=420),  
                                                                    transformer=StandardScaler()))

In [None]:
# Question 8 - 2)
forest_reg.fit(X,y)

In [None]:
# Question 8 - 3)
y_pred_forest = forest_reg.predict(X)

In [None]:
# Question 8 - 4)
forest_reg_rmse_train = mean_squared_error(y, y_pred_forest, squared=False)


In [None]:
# Question 8 - 5)
y_pred_test_forest = forest_reg.predict(X_test)


In [None]:
# Question 8 - 6)
forest_reg_rmse_test = mean_squared_error(y_test, y_pred_test_forest, squared=False)


In [None]:
# Question 8 - 7)
print("Training ", forest_reg_rmse_train)
print("Testing  ", forest_reg_rmse_test)

In [None]:
# Question 8 - 8)
forest_rmses_cv = -cross_val_score(forest_reg, X, y, scoring="neg_root_mean_squared_error", cv=3)
forest_rmses_cv_test = -cross_val_score(forest_reg, X_test, y_test, scoring="neg_root_mean_squared_error", cv=3)

In [None]:
# Question 8 - 9)
print("Random Forest Regression - train data set RMSE")
print(pd.Series(forest_rmses_cv).describe())
print("\nRandom Forest Regression - test data set RMSE")
print(pd.Series(forest_rmses_cv_test).describe())