# <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:155%; text-align:center; border-radius:10px 10px;">Linear Regression with Grid Search Cross-Validation: Ames Housing Dataset</p>

<div class="alert alert-block alert-info alert">

# <span style=" color:#bf2e98">Table of Contents</span>
    
#### Introduction

#### Explore the Data

#### Formatting Data

#### Create Model
    
#### Grid Search

#### Evaluate the model's performance
</div>

<div class="alert alert-warning alert-info">
    
## <span style=" color:#bf2e98"> Introduction

### Ames Housing Dataset

In its original form, the Ames Housing Dataset includes 81 columns and 2930 rows. The target variable in the dataset is the "SalePrice," representing the sale price of the houses. The independent variables describe various aspects of the residential properties.

In the previous Feature Engineering notebook, I dropped the columns this machine-learning model does not need, dropped some rows with missing values or filled the gaps with suitable values, and converted some categorical variables into numeric ones using one-hot encoding (dummy variables).

See the Feature Engineering Notebook: https://github.com/msevim24/MachineLearning_DeepLearning_Projects/blob/master/Feature%20Engineering_Ames%20Housing%20Dataset_Udemy.ipynb

After all this preprocessing, the dataset has 2925 rows and 263 columns in its final version.

In this project, I will create **a Linear Regression Model with Elastic Net**, train it on the data with the optimal parameters **using a grid search**, and then evaluate the model's capabilities on a test set.</span>

In [1]:
# Read the text file that includes explanations about the data

with open('Ames_Housing_Feature_Description.txt','r') as f: 
    print(f.read())

# "r" means read

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

### Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Explore the Data

In [3]:
df = pd.read_csv("Ames_House_Final_DF.csv", index_col=0)

# to remove "unnamed" index column, use "index_col=0"

In [4]:
df.head()

Unnamed: 0,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,...,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,Sale Condition_AdjLand,Sale Condition_Alloca,Sale Condition_Family,Sale Condition_Normal,Sale Condition_Partial
0,141.0,31770,6,5,1960,1960,112.0,639.0,0.0,441.0,...,0,0,0,0,1,0,0,0,1,0
1,80.0,11622,5,6,1961,1961,0.0,468.0,144.0,270.0,...,0,0,0,0,1,0,0,0,1,0
2,81.0,14267,6,6,1958,1958,108.0,923.0,0.0,406.0,...,0,0,0,0,1,0,0,0,1,0
3,93.0,11160,7,5,1968,1968,0.0,1065.0,0.0,1045.0,...,0,0,0,0,1,0,0,0,1,0
4,74.0,13830,5,5,1997,1998,0.0,791.0,0.0,137.0,...,0,0,0,0,1,0,0,0,1,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2925 entries, 0 to 2924
Columns: 263 entries, Lot Frontage to Sale Condition_Partial
dtypes: float64(11), int64(252)
memory usage: 5.9 MB


<div class="alert alert-block alert-success">
    
## <span style=" color:red">Train | Test Split Procedure 


1. Clean and adjust data as necessary for X and y
2. Split Data in Train/Test for both X and y
3. Fit/Train Scaler on Training X Data
4. Scale X Test Data
5. Create Model
6. Fit/Train Model on X Train Data
7. Evaluate Model on X Test Data (by creating predictions and comparing to y_test)
8. Adjust parameters as necessary and repeat steps 5 and 6
</span>

## Formatting Data
- Separate data as X (independent variables) and y (dependent/target variable)
- Train | Test Split
- Scaling Data 

In [6]:
## CREATE X and y
X = df.drop('SalePrice',axis=1)
y = df['SalePrice']

# TRAIN | TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)

# SCALE DATA (fit the scaler on X_train only; X_test is transformed but never fitted)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train) # X_train is now scaled
# or in one step: X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test) # X_test is now scaled with the training statistics
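
In the cell above, the scaler is fitted once on the whole training set, so during a later grid search every validation fold has already contributed to the scaling statistics. One common alternative, sketched here on synthetic data, is to wrap the scaler and the model in a scikit-learn pipeline so that scaling is refitted whenever the pipeline is fitted. The names `X_demo`, `y_demo`, and `pipe` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption: any numeric X, y works here)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=101)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=101)

# The pipeline fits the scaler on the training data only,
# then applies the same transformation before predicting.
pipe = make_pipeline(StandardScaler(), ElasticNet())
pipe.fit(X_tr, y_tr)
demo_pred = pipe.predict(X_te)

print(demo_pred.shape)
```

If such a pipeline were passed to `GridSearchCV`, the parameter names in the grid would need the step prefix, for example `elasticnet__alpha`.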

## Create Model

Since GridSearchCV will return the best parameters, I choose ElasticNet as my model. ElasticNet combines the Lasso (L1) and Ridge (L2) penalties, and the best value of its "l1_ratio" parameter will show which of the two penalties suits this dataset better.
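
For reference, the objective that scikit-learn documents for ElasticNet can be written as follows, where $\alpha$ is the `alpha` parameter, $\rho$ is `l1_ratio`, $n$ is the number of training samples, and $w$ is the coefficient vector (the intercept is omitted for brevity):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With $\rho = 1$ only the L1 term remains (Lasso), and with $\rho = 0$ only the L2 term remains (a Ridge-style penalty).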

In [7]:
from sklearn.linear_model import ElasticNet

In [8]:
# help(ElasticNet)

In [9]:
base_elastic_model = ElasticNet() # with default parameters

## Grid Search

A search consists of:

* an estimator (regressor or classifier such as sklearn.svm.SVC());
* a parameter space;
* a method for searching or sampling candidates;
* a cross-validation scheme; and
* a score function.

### Create a dictionary to find the best parameters

In [10]:
param_grid = {'alpha':[0.1,1,5,10,50,100],
              'l1_ratio':[.1, .5, .7, .9, .95, .99, 1]}
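
A grid search tries every combination in this dictionary. As a small sketch, `ParameterGrid` from scikit-learn can enumerate the combinations and show how many models will be fitted with 5-fold cross-validation. The variable names below are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook, copied here so the snippet is self-contained
param_grid_demo = {'alpha': [0.1, 1, 5, 10, 50, 100],
                   'l1_ratio': [.1, .5, .7, .9, .95, .99, 1]}

candidates = list(ParameterGrid(param_grid_demo))
n_candidates = len(candidates)   # 6 alphas x 7 l1_ratios
n_fits = n_candidates * 5        # one fit per candidate per fold (cv=5)

print(n_candidates, n_fits)      # 42 210
```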

### Create and fit the grid model

In [11]:
from sklearn.model_selection import GridSearchCV

In [12]:
# verbose controls how much progress information is printed while fitting

grid_model = GridSearchCV(estimator=base_elastic_model,
                          param_grid=param_grid,
                          scoring='neg_mean_squared_error',
                          cv=5,
                          verbose=1)

In [13]:
grid_model.fit(X_train,y_train)

# Keep in mind that only the features are scaled: X_train is scaled but y_train is not.

Fitting 5 folds for each of 42 candidates, totalling 210 fits


  model = cd_fast.enet_coordinate_descent(


These repeated lines are convergence warnings from the coordinate descent solver. To avoid them, we can pass a higher "max_iter" value to the ElasticNet model, for example "base_elastic_model = ElasticNet(max_iter=1000000)". Since I preferred the default values, I did not set any parameter.
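
As a minimal sketch on synthetic data, there are two ways to deal with such warnings: give the solver more iterations, or silence the warning category locally. The data and the names `quiet_model` and `patient_model` are assumptions for illustration; whether a warning is raised at all depends on the data and the parameters.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import ElasticNet

X_demo, y_demo = make_regression(n_samples=100, n_features=20,
                                 noise=5.0, random_state=0)

# Option 1: hide ConvergenceWarning only inside this block.
# The solver still stops after max_iter iterations.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    quiet_model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5)
    quiet_model.fit(X_demo, y_demo)

# Option 2: allow many more iterations so the solver can converge.
patient_model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=1_000_000)
patient_model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually ran
print(quiet_model.n_iter_, patient_model.n_iter_)
```

Silencing the warning does not change the fitted coefficients; raising `max_iter` addresses the cause rather than the message.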

### See the best parameters

In [14]:
grid_model.best_estimator_

In [15]:
grid_model.best_params_

{'alpha': 100, 'l1_ratio': 1}

An "l1_ratio" of 1 means the penalty is pure L1, so the model behaves like Lasso and shrinks some coefficients to exactly zero. This makes sense because one-hot encoding added many features to the dataset, and some of them have only a weak correlation with "SalePrice"; in other words, they have little effect on the target variable.
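
The zeroing effect can be illustrated on synthetic data where only a few features carry signal. This is a sketch under that assumption, not a result from the housing data; in the notebook, the same count could be taken from `grid_model.best_estimator_.coef_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# 30 features, of which only 5 actually influence the target
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 n_informative=5, noise=10.0,
                                 random_state=101)

# l1_ratio=1.0 makes ElasticNet use the L1 penalty only (Lasso behaviour)
lasso_like = ElasticNet(alpha=5.0, l1_ratio=1.0, max_iter=10_000)
lasso_like.fit(X_demo, y_demo)

n_zero = int(np.sum(lasso_like.coef_ == 0))
print(f"{n_zero} of {lasso_like.coef_.size} coefficients are exactly zero")
```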

In [16]:
pd.DataFrame(grid_model.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.117682,0.022226,0.003125,0.00625,0.1,0.1,"{'alpha': 0.1, 'l1_ratio': 0.1}",-696095100.0,-630817700.0,-380756800.0,-412412500.0,-592530000.0,-542522400.0,124079100.0,15
1,0.102013,0.006496,0.0,0.0,0.1,0.5,"{'alpha': 0.1, 'l1_ratio': 0.5}",-683083400.0,-607370800.0,-379454500.0,-408460300.0,-587423900.0,-533158600.0,118409600.0,12
2,0.1011,0.015645,0.0,0.0,0.1,0.7,"{'alpha': 0.1, 'l1_ratio': 0.7}",-675397600.0,-595936500.0,-379463000.0,-407806200.0,-586371200.0,-528994900.0,115111600.0,7
3,0.206676,0.011536,0.0,0.0,0.1,0.9,"{'alpha': 0.1, 'l1_ratio': 0.9}",-667229500.0,-587874400.0,-380892200.0,-411303300.0,-588149900.0,-527089900.0,111213700.0,4
4,0.362481,0.015527,0.0,0.0,0.1,0.95,"{'alpha': 0.1, 'l1_ratio': 0.95}",-666246400.0,-588585900.0,-381502100.0,-414381700.0,-589825500.0,-528108300.0,110431700.0,5
5,0.376511,0.016814,0.0,0.0,0.1,0.99,"{'alpha': 0.1, 'l1_ratio': 0.99}",-669592900.0,-594792900.0,-380934700.0,-418896000.0,-593346100.0,-531512500.0,111579700.0,9
6,0.424513,0.047446,0.0,0.0,0.1,1.0,"{'alpha': 0.1, 'l1_ratio': 1}",-674035400.0,-601048600.0,-380056100.0,-420666900.0,-596335700.0,-534428500.0,113607500.0,14
7,0.049929,0.006131,0.0,0.0,1.0,0.1,"{'alpha': 1, 'l1_ratio': 0.1}",-887059900.0,-980141600.0,-466664500.0,-522936200.0,-767613600.0,-724883200.0,200371400.0,24
8,0.06835,0.006423,0.0,0.0,1.0,0.5,"{'alpha': 1, 'l1_ratio': 0.5}",-797705800.0,-828146200.0,-415987600.0,-467582400.0,-679136200.0,-637711700.0,168335900.0,20
9,0.10365,0.006405,0.0,0.0,1.0,0.7,"{'alpha': 1, 'l1_ratio': 0.7}",-750456600.0,-739724100.0,-395284500.0,-439857900.0,-634588700.0,-591982400.0,148720500.0,19


In [29]:
grid_model.best_index_ # the last index on the table

41

In [28]:
grid_model.best_score_

-515292124.1671179
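
Because the grid search was scored with `neg_mean_squared_error`, `best_score_` is the negated mean squared error averaged over the validation folds. A small sketch converts the value printed above into an RMSE in the same units as SalePrice; the variable names are illustrative.

```python
import numpy as np

best_score = -515292124.1671179   # value of grid_model.best_score_ shown above

# Flip the sign to get the MSE, then take the square root for the RMSE
cv_rmse = np.sqrt(-best_score)

print(round(cv_rmse, 2))
```

This cross-validated RMSE of roughly 22,700 can be compared with the test-set RMSE computed in the next section.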

## Evaluate the model's performance (on the unseen 10% scaled X test set)

In [17]:
y_pred = grid_model.predict(X_test) # scaled X_test

In [18]:
from sklearn.metrics import mean_absolute_error,mean_squared_error

In [19]:
mean_absolute_error(y_test,y_pred)

# Compare the actual y (test) values with the predicted y values in terms of evaluation metrics

14218.352387671823

In [20]:
np.sqrt(mean_squared_error(y_test,y_pred))

20619.576870342884

In [21]:
# Compare the evaluation metrics with the mean of y (SalePrice)
np.mean(df['SalePrice'])

180815.53743589742

We compare the evaluation metrics with the mean SalePrice to see how large the typical prediction error is relative to the average sale price. The MAE is about 8 percent of the mean and the RMSE is about 11 percent.
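
The ratios can be checked directly from the three values printed above. The variable names in this sketch are illustrative.

```python
mae = 14218.352387671823          # mean absolute error on the test set
rmse = 20619.576870342884         # root mean squared error on the test set
mean_price = 180815.53743589742   # mean of SalePrice

mae_ratio = mae / mean_price
rmse_ratio = rmse / mean_price

print(f"MAE / mean:  {mae_ratio:.1%}")    # 7.9%
print(f"RMSE / mean: {rmse_ratio:.1%}")   # 11.4%
```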

#### Compare the Test and Train Set

Comparing the scores on the training set and the test set gives insight into overfitting and underfitting: a model that scores much better on the training data than on the test data is overfitting, while poor scores on both sets suggest underfitting.

In [32]:
# A function to compare y_train and y_test scores 

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def train_val(y_train, y_train_pred, y_test, y_pred, i):
    
    scores = {i+"_train": {"R2" : r2_score(y_train, y_train_pred),
    "mae" : mean_absolute_error(y_train, y_train_pred),
    "mse" : mean_squared_error(y_train, y_train_pred),
    "rmse" : np.sqrt(mean_squared_error(y_train, y_train_pred))},
              
    i+"_test": {"R2" : r2_score(y_test, y_pred),
    "mae" : mean_absolute_error(y_test, y_pred),
    "mse" : mean_squared_error(y_test, y_pred),
    "rmse" : np.sqrt(mean_squared_error(y_test, y_pred))}}
    return pd.DataFrame(scores)

In [33]:
y_pred = grid_model.predict(X_test) # remember it is scaled X_test
y_train_pred = grid_model.predict(X_train)  # it is scaled X_train

In [34]:
train_val(y_train, y_train_pred, y_test, y_pred, "GridSearch")

Unnamed: 0,GridSearch_train,GridSearch_test
R2,0.9400822,0.9184894
mae,13439.4,14218.35
mse,390607800.0,425167000.0
rmse,19763.8,20619.58
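
To put the gap between the two columns into numbers, the sketch below uses the R2 and RMSE values from the table above. How large a gap counts as overfitting is a judgment call, so the figures are only a starting point for that assessment.

```python
# Values copied from the comparison table above
r2_train, r2_test = 0.9400822, 0.9184894
rmse_train, rmse_test = 19763.8, 20619.58

r2_drop = r2_train - r2_test                 # absolute drop in R2
rmse_increase = rmse_test / rmse_train - 1   # relative increase in RMSE

print(f"R2 drop: {r2_drop:.4f}")             # 0.0216
print(f"RMSE increase: {rmse_increase:.1%}") # 4.3%
```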
