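## Best Model: Polynomial Regression (Based on CV Score)
- CV Score: 0.9114985160662464 (printed by `reg_poly_grid.score(X_train, y_train)`; `best_score_` for this search was -2.168230286496948e+20)
- Training Score: 0.9114985160662464
- Test Score: 0.8755530241459522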


## Linear Regression
- CV score:  0.9114426745730106
- Training score: 0.9102975951657948
- Test Score: 0.8746968378820517


## Polynomial Regression
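- Models used with hyperparameters: 'degree': range(1, 2), so only degree 1 was searched
- Best Model parameters: {'degree': 1}
- CV Score: 0.9114985160662464 (printed by `reg_poly_grid.score(X_train, y_train)`; `best_score_` for this search was -2.168230286496948e+20)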
- Training Score: 0.9114985160662464
- Test Score: 0.8755530241459522


## SGD Regressor
- Models used with hyperparameters: {'learning_rate': ['invscaling', 'adaptive'], 'penalty': ['l2', 'l1', 'elasticnet'],
             'alpha': [0.0001, 0.0005, 0.001],
             'l1_ratio': [0.15, 0.20, 0.40, 0.50, 0.60],
             'tol': [1e-4, 1e-3]}
- Best Model parameters: {'alpha': 0.0005, 'l1_ratio': 0.5, 'learning_rate': 'adaptive', 'penalty': 'elasticnet', 'tol': 0.001}
- CV Score: 0.880125237950295
- Training Score: 0.891594730863832
- Test Score: 0.8308356404127091

## Ridge Regressor
 - Models used with hyperparameters :'alpha':[0.001, 0.01,0.05, 0.1, 0.25,0.5,1, 10,100,150]
 - Best Model parameters: 'alpha': 100
 - CV Score:  0.8841976405377137
 - Training Score: 0.9103295889443214
 - Test Score: 0.8731529205790174
 
 
## Lasso Regressor
 - Models used with hyperparameters :'alpha':[0.001, 0.01,0.05, 0.1, 0.25,0.5,1, 10,100,150]
 - Best Model parameters: 'alpha': 150
 - CV Score:  0.8856334588062339
 - Training Score: 0.9110302586543852
 - Test Score: 0.87806158664072
 
## KNN Regressor
 - Models used with hyperparameters: {'weights': ['uniform', 'distance'], 'leaf_size': [5, 10, 20, 30], 'n_neighbors': [5, 10, 15], 'p': [1, 2, 3]}
 - Best Model parameters: {'leaf_size': 5, 'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
 - CV Score: 0.8354252209693938
 - Training Score: 0.9999955968722103
 - Test Score: 0.7898527883440282
 
## Decision Tree Regressor
 - Models used with hyperparameters: {'criterion': ['mse', 'friedman_mse', 'mae'], 'splitter': ['random', 'best'],
    'max_depth': [20, 30], 'min_samples_split': [4, 8, 12, 16], 'min_samples_leaf': [2, 4, 8], 'max_features': ['auto', 'sqrt']}
 - Best Model parameters: {'criterion': 'mse', 'max_depth': 30, 'max_features': 'auto', 'min_samples_leaf': 8, 'min_samples_split': 12, 'splitter': 'random'}
 - CV Score:  0.7980363024025918
 - Training Score: 0.8771368435976774
 - Test Score: 0.8033735005070397
 
## SVM Regressor
 - Models used with hyperparameters: {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'degree': [3, 4, 5], 'gamma': ['auto', 'scale'], 'coef0': [2.0, 4.0, 10.0, 15.0]}
 - Best Model parameters: {'coef0': 10.0, 'degree': 5, 'gamma': 'auto', 'kernel': 'poly'}
 - CV Score:  0.9004499345276106
 - Training Score: 0.9534721185647854
 - Test Score: 0.8741830276841772
 
 
## Multiple Models with One GridSearch
 
 - Models used with hyperparameters: {'regressor': [LinearRegression(), Ridge(), Lasso()]}
 - Best Model parameters: {'regressor': Lasso(alpha=200, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False), 'regressor__alpha': 200}
- CV Score:  0.83
- Train Score: 0.981937071382834
- Test Score: 0.8740776530267179   
 

## Data PreProcessing

In [20]:
from math import sqrt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

pd.set_option('display.max_columns', None)
%matplotlib inline

### Load Datasets

In [21]:
# load the dataset (adjust the path to your local copy of houseprice.csv)
data = pd.read_csv('/Users/tapas/Downloads/Oil/houseprice.csv')


### Types of variables



In [22]:
# we have an Id variable, that we should not use for predictions:

print('Number of House Id labels: ', len(data.Id.unique()))
print('Number of Houses in the Dataset: ', len(data))

Number of House Id labels:  1460
Number of Houses in the Dataset:  1460


#### Find categorical variables

In [23]:
# find categorical variables- hint data type = 'O'

categorical = [var for var in data.columns if data[var].dtype=='O']

print(f'There are {len(categorical)} categorical variables')

There are 43 categorical variables


#### Find temporal variables

In [24]:
# make a list of the numerical variables first (hint: data type != 'O')
numerical = [var for var in data.columns if data[var].dtype!='O']

# list of variables that contain year information (hint: the variable name contains 'Yr' or 'Year')
year_vars = [var for var in numerical if 'Yr' in var or 'Year' in var]

year_vars

['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

#### Find discrete variables

Discrete variables are identified here as numerical variables with fewer than 20 unique values.

In [25]:
# list the discrete variables: numerical, fewer than 20 unique values, and not a year variable
discrete = [var for var in numerical if len(data[var].unique()) < 20 and var not in year_vars]

print(f'There are {len(discrete)} discrete variables')

There are 14 discrete variables


#### Continuous variables

In [26]:
# find continuous variables (hint: numerical variables not in discrete or year_vars)
# Also remove the Id variable and the target variable SalePrice,
# which are both numerical as well

continuous = [var for var in numerical if var not in discrete and var not in [
    'Id', 'SalePrice'] and var not in year_vars]

print('There are {} numerical and continuous variables'.format(len(continuous)))

There are 18 numerical and continuous variables
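The four groups built above (categorical, temporal, discrete, continuous) can be produced by one helper. The sketch below is only an illustration on a made-up frame: `split_variable_types`, its `exclude` and `max_discrete` arguments, and the toy values are hypothetical names and data, not part of the notebook. The threshold of 20 unique values mirrors the rule used above.

```python
import pandas as pd


def split_variable_types(df, exclude=("Id", "SalePrice"), max_discrete=20):
    """Group column names into categorical, temporal, discrete and continuous lists."""
    categorical = [c for c in df.columns if df[c].dtype == "O"]
    numerical = [c for c in df.columns if df[c].dtype != "O"]
    # Year-like columns are recognised by name, as in the cells above
    temporal = [c for c in numerical if "Yr" in c or "Year" in c]
    discrete = [c for c in numerical
                if len(df[c].unique()) < max_discrete
                and c not in temporal and c not in exclude]
    continuous = [c for c in numerical
                  if c not in discrete and c not in temporal and c not in exclude]
    return {"categorical": categorical, "temporal": temporal,
            "discrete": discrete, "continuous": continuous}


# Toy data (hypothetical values); Street is forced to object dtype on purpose
toy = pd.DataFrame({
    "Id": range(1, 31),
    "Street": pd.Series(["Pave", "Grvl"] * 15, dtype=object),
    "YrSold": [2008, 2009, 2010] * 10,
    "Fireplaces": [0, 1, 2] * 10,
    "LotArea": [8000.0 + 37.5 * i for i in range(30)],
    "SalePrice": [150000 + 1000 * i for i in range(30)],
})
groups = split_variable_types(toy)
print(groups)
```

Keeping the exclusions in one place makes it harder to report the length of the wrong list when printing the counts.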


### Separate train and test set

In [27]:
# Let's separate into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.drop(['Id', 'SalePrice'], axis=1),
                                                    data['SalePrice'],
                                                    test_size=0.1,
                                                    random_state=0)

X_train.shape, X_test.shape

((1314, 79), (146, 79))

**Now we will move on and engineer the features of this dataset, the most important part of this course.**

### Create New Variables

Replace 'YearBuilt', 'YearRemodAdd' and 'GarageYrBlt' with the time elapsed between each of them and 'YrSold'.
For example, YearBuilt = YrSold - YearBuilt.

Transform 'YearRemodAdd' and 'GarageYrBlt' the same way.
After the transformation, drop 'YrSold'.
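A quick worked example of the elapsed-time transformation, using two made-up rows (the values are hypothetical and only illustrate the arithmetic). Note that a missing garage year stays missing after the subtraction, which is why 'GarageYrBlt' still needs imputation later.

```python
import pandas as pd

# Two hypothetical houses; the second has no garage year recorded
toy = pd.DataFrame({
    "YearBuilt": [2000, 1995],
    "YearRemodAdd": [2005, 1995],
    "GarageYrBlt": [2000.0, None],
    "YrSold": [2008, 2010],
})

# Replace each year column with the number of years before the sale
for var in ["YearBuilt", "YearRemodAdd", "GarageYrBlt"]:
    toy[var] = toy["YrSold"] - toy[var]

# YrSold has served its purpose and is dropped
toy = toy.drop(columns="YrSold")
print(toy)
```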

In [28]:
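# function to calculate elapsed time

def elapsed_years(df, var):
    # replace a year variable with the number of years between it
    # and the year the house was sold
    df = df.copy()  # work on a copy so the caller's frame is not modified in place
    df[var] = df['YrSold'] - df[var]
    return df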

In [29]:
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    X_train = elapsed_years(X_train, var)
    X_test = elapsed_years(X_test, var)

In [30]:
# drop YrSold
X_train.drop('YrSold', axis=1, inplace=True)
X_test.drop('YrSold', axis=1, inplace=True)

In [31]:
year_vars.remove('YrSold')

In [32]:
# capture the column names for use later in the notebook
final_columns = X_train.columns
final_columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

### Feature Engineering Pipeline

In [33]:
# I will treat discrete variables as if they were categorical
# to treat discrete as categorical using Feature-engine
# we need to re-cast them as object

X_train[discrete] = X_train[discrete].astype('O')
X_test[discrete] = X_test[discrete].astype('O')

In [34]:
# import relevant modules for feature engineering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from feature_engine import missing_data_imputers as mdi
from feature_engine import categorical_encoders as ce
from feature_engine.variable_transformers import YeoJohnsonTransformer
from feature_engine.discretisers import DecisionTreeDiscretiser

In [35]:
house_preprocess = Pipeline([

    # missing data imputation
    ('missing_ind', mdi.AddNaNBinaryImputer(
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),
    ('imputer_num', mdi.MeanMedianImputer(
        imputation_method='mean',
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),
    ('imputer_cat', mdi.CategoricalVariableImputer(variables=categorical)),

    # categorical encoding
    ('rare_label_enc', ce.RareLabelCategoricalEncoder(
        tol=0.01, n_categories=6, variables=categorical + discrete)),
    ('categorical_enc', ce.MeanCategoricalEncoder(
        variables=categorical + discrete)),

    # transforming numerical variables
    ('yjt', YeoJohnsonTransformer(
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),

    # discretisation and encoding
    ('treeDisc', DecisionTreeDiscretiser(
        cv=2, scoring='neg_mean_squared_error',
        regression=True,
        param_grid={'max_depth': [1, 2, 3, 4, 5, 6]})),

    # feature scaling
    ('scaler', StandardScaler()),

])
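To make the two categorical-encoding steps in the pipeline above less opaque, here is a plain-pandas approximation of the idea behind rare-label grouping followed by mean (target) encoding. It is a sketch under assumptions, not Feature-engine's implementation: the toy column, the target values and the `tol=0.2` threshold are made up, and it ignores the `n_categories` argument used above.

```python
import pandas as pd

# Hypothetical training data: one categorical column and the target
X = pd.DataFrame({"Neighborhood": ["A"] * 6 + ["B"] * 3 + ["C"]})
y = pd.Series([100, 110, 120, 130, 140, 150, 200, 210, 220, 500], name="SalePrice")

# 1) Rare-label grouping: categories seen in fewer than `tol` of the rows become "Rare"
tol = 0.2
freq = X["Neighborhood"].value_counts(normalize=True)
frequent = freq[freq >= tol].index
grouped = X["Neighborhood"].where(X["Neighborhood"].isin(frequent), "Rare")

# 2) Mean encoding: replace each category by the mean target learned on the training rows
means = y.groupby(grouped).mean()
encoded = grouped.map(means)

print(means.to_dict())
print(encoded.tolist())
```

Because the category means are learned from the target, the mapping has to be fitted on the training split only and then reused on the test split, which is what fitting the pipeline on `X_train, y_train` and calling `transform` on both sets achieves.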

In [36]:
house_preprocess.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('missing_ind',
                 AddNaNBinaryImputer(variables=['LotFrontage', 'MasVnrArea',
                                                'GarageYrBlt'])),
                ('imputer_num',
                 MeanMedianImputer(imputation_method='mean',
                                   variables=['LotFrontage', 'MasVnrArea',
                                              'GarageYrBlt'])),
                ('imputer_cat',
                 CategoricalVariableImputer(variables=['MSZoning', 'Street',
                                                       'Alley', 'LotShape',
                                                       'LandContour',
                                                       'Utilities', '...
                                                    'Utilities', 'LotConfig',
                                                    'LandSlope', 'Neighborhood',
                                                    'Condition1', 'Condition2',
    

In [37]:
# Apply Transformations
X_train=house_preprocess.transform(X_train)
X_test=house_preprocess.transform(X_test)

## <span class="mark">DO NOT CHANGE STEPS BEFORE THIS POINT</span>

## Regression Models- Tune different models one by one

In [40]:
# Train a linear regression model, report the coefficients and model performance

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error,r2_score

lr = LinearRegression().fit(X_train, y_train)
# cross_val_score clones lr and refits it on each fold; the default scorer is R^2
cv_scores = cross_val_score(lr, X_train, y_train)

# Cross-validation score of each fold
print("CV scores per fold: {}".format(cv_scores))
# R^2 of the fitted model on the data it was trained on
print("Training score: ",lr.score(X_train, y_train))

# Print coefficients
print("lr.coef_:", lr.coef_)
print("lr.intercept_:", lr.intercept_)

# Check training and test set performance

X_train_preds = lr.predict(X_train)
X_test_preds = lr.predict(X_test)

print("Training score",r2_score(y_train,X_train_preds))
print("LR Performance Test: ", lr.score(X_test,y_test))

CV scores per fold: [ 8.68312903e-01 -6.26561684e+21  8.76203279e-01  8.98626806e-01
  8.92290590e-01]
Training score:  0.9114426745730106
lr.coef_: [ 8.85927525e+02  9.63821950e+02  1.55093914e+03  2.38123720e+03
  1.54867121e+03  3.94643841e+02  5.73339946e+02  1.32077662e+03
  1.55946597e+03  2.25919167e+03  1.27504991e+03  1.12929618e+04
  1.31777294e+03  2.04524613e+03  1.21008674e+03 -9.02846098e+02
  1.58511847e+04 -9.82124059e+01 -5.41455300e+03  3.94119726e+03
  3.73336495e+02 -1.17351708e+03  3.44591704e+03 -2.45906679e+03
 -8.92244713e+02  2.64169300e+02  2.55034152e+03  5.77675879e+02
  1.02547905e+01  2.43524833e+03  5.95239944e+02  3.56353271e+03
  1.33859388e+03  5.57728195e+03 -1.43383745e+03  2.11644961e+03
 -2.80153401e+02  7.89568089e+03  2.44320365e+02  1.63894561e+03
  1.03758613e+03 -7.92545735e+02  1.21578274e+04  1.11657512e+04
  4.20710717e+03  5.60685225e+03  2.72048329e+03 -1.21950180e+03
  4.38283844e+03  5.39795630e+03  5.97516292e+02  2.94689148e+03
  3.71875504e+03  3.03
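One of the five fold scores printed for this model is about -6.27e21, so any mean over the folds is dominated by that single value. The short check below reuses the five numbers from the output above and compares the mean with the median. The notebook does not establish why that fold fails; the check only shows how much a single fold moves the average.

```python
import numpy as np

# Per-fold R^2 values copied from the cross_val_score output above
cv_scores = np.array([8.68312903e-01, -6.26561684e+21, 8.76203279e-01,
                      8.98626806e-01, 8.92290590e-01])

print("mean  :", cv_scores.mean())        # dominated by the one failing fold
print("median:", np.median(cv_scores))    # robust to a single outlying fold
print("folds below zero:", int((cv_scores < 0).sum()))
```

Reporting the per-fold values next to the mean makes this kind of instability visible when models are compared by CV score.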

#  Linear Regression with Polynomial Features

In [41]:

from math import sqrt

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


reg_poly_pipe = Pipeline([
    ('Poly',PolynomialFeatures()),
    ('lreg',LinearRegression())
])

# note: range(1,2) contains only degree 1; range(1,3) would also try degree 2
reg_poly_params = {'Poly__degree':range(1,2)}
reg_poly_grid = GridSearchCV(reg_poly_pipe,reg_poly_params,cv=5, n_jobs=-1,return_train_score = True)

reg_poly_grid.fit(X_train,y_train)

X_train_preds = reg_poly_grid.predict(X_train)
X_test_preds = reg_poly_grid.predict(X_test)

# reg_poly_pipe itself is never fitted; the fitted coefficients and intercept
# live on the refit pipeline stored in best_estimator_
#print('Co-efficients',reg_poly_grid.best_estimator_.named_steps['lreg'].coef_)
#print('Intercept ',reg_poly_grid.best_estimator_.named_steps['lreg'].intercept_)
print('best params',reg_poly_grid.best_params_)

# mean R^2 over the held-out folds for the best parameter setting
print('CV Score: ',reg_poly_grid.best_score_)
# R^2 of the refit model on the full training set
print('Training Score: ',reg_poly_grid.score(X_train,y_train))

print('train rmse',sqrt(mean_squared_error(y_train,X_train_preds)))
print("train r2: ", r2_score(y_train,X_train_preds))

print('test rmse',sqrt(mean_squared_error(y_test,X_test_preds)))
print("Test Score/r2: ", r2_score(y_test,X_test_preds))


best params {'Poly__degree': 1}
CV Score:  -2.168230286496948e+20
Training Score:  0.9114985160662464
train rmse 23507.198610424522
train r2:  0.9114985160662464
test rmse 29244.041532318097
Test Score/r2:  0.8755530241459522


## Why do `best_score_` and `score(X_train, y_train)` differ?
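The two numbers answer different questions. `best_score_` is the mean score over the held-out folds for the best parameter setting, while calling `.score(X_train, y_train)` on the fitted search evaluates the refit model on the same rows it was trained on, so it is a training score rather than a CV score. A minimal sketch on synthetic data (the `make_regression` data set and the small `Ridge` grid are assumptions chosen only to keep the example fast):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, unrelated to the house-price data
X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5, return_train_score=True)
grid.fit(X, y)

i = grid.best_index_
mean_cv = grid.cv_results_["mean_test_score"][i]   # average over the 5 held-out folds
train_score = grid.score(X, y)                     # refit model scored on its own training data

print("best_score_       :", grid.best_score_)
print("mean_test_score[i]:", mean_cv)
print("score(X, y)       :", train_score)
```

Only the first two values are cross-validation estimates; the third is the one that was labelled "CV Score" in the summary at the top.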

# SGD Regressor

In [28]:
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error,r2_score
from math import sqrt
from sklearn.model_selection import GridSearchCV, cross_val_score

reg_sgd_pipe = Pipeline([
    ('scaler',MinMaxScaler()),
    ('sgd_reg',SGDRegressor(max_iter=1000,verbose=11,early_stopping=True,validation_fraction=0.2))
])

param_sgd = {'sgd_reg__learning_rate':['invscaling','adaptive'],
            'sgd_reg__penalty':['l2', 'l1', 'elasticnet'],
             'sgd_reg__alpha':[0.0001,0.0005,0.001],
             'sgd_reg__l1_ratio':[0.15,0.20,0.40,0.50,0.60],
             'sgd_reg__tol':[1e-4,1e-3]
            }
grid_sgd = GridSearchCV(reg_sgd_pipe,param_sgd,cv=5,n_jobs=-1,return_train_score=True)

grid_sgd.fit(X_train,y_train)

X_train_preds = grid_sgd.predict(X_train)
X_test_preds = grid_sgd.predict(X_test)



# cross_val_score on a GridSearchCV object reruns the whole search inside each
# outer fold (nested cross-validation), so it is slow and its scores differ from best_score_
scores = cross_val_score(grid_sgd, X_train, y_train)
print("Cross-validation scores: {}".format(scores))

print("Training Score",grid_sgd.score(X_train, y_train)) # refit model scored on the training set
print("Mean CV Score :",grid_sgd.best_score_) # mean held-out score of the best setting


print('train rmse',sqrt(mean_squared_error(y_train,X_train_preds)))
print("train r2: ", r2_score(y_train,X_train_preds))

print('test rmse',sqrt(mean_squared_error(y_test,X_test_preds)))
print("test r2: ", r2_score(y_test,X_test_preds))

print('best params', grid_sgd.best_params_)


'''
Cross-validation scores: [0.85629649 0.85130738 0.89215648 0.87668958 0.90063551]
Training Score 0.891594730863832
Mean CV Score : 0.880125237950295
train rmse 26016.615364442063
train r2:  0.891594730863832
test rmse 34095.703270003876
test r2:  0.8308356404127091
best params {'sgd_reg__alpha': 0.0005, 'sgd_reg__l1_ratio': 0.5, 'sgd_reg__learning_rate': 'adaptive', 'sgd_reg__penalty': 'elasticnet', 'sgd_reg__tol': 0.001}
'''



-- Epoch 1
Norm: 83216.40, NNZs: 81, Bias: 182.261203, T: 1051, Avg. loss: 579703583.510410
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 95552.30, NNZs: 81, Bias: -314.953744, T: 2102, Avg. loss: 363118011.566454
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 106476.63, NNZs: 81, Bias: -1845.385859, T: 3153, Avg. loss: 325498549.654393
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 110542.93, NNZs: 81, Bias: -2935.490991, T: 4204, Avg. loss: 314688786.263990
Total training time: 0.01 seconds.
-- Epoch 5
Norm: 115241.45, NNZs: 81, Bias: -2407.238879, T: 5255, Avg. loss: 301906263.815102
Total training time: 0.01 seconds.
-- Epoch 6
Norm: 122157.79, NNZs: 81, Bias: -3735.300981, T: 6306, Avg. loss: 297603804.171660
Total training time: 0.01 seconds.
-- Epoch 7
Norm: 125669.81, NNZs: 81, Bias: -4476.929021, T: 7357, Avg. loss: 289648614.502727
Total training time: 0.01 seconds.
-- Epoch 8
Norm: 128472.90, NNZs: 81, Bias: -4541.967265, T: 8408, Avg. loss: 288324972.510556


"\ntrain rmse 24427.684457242212\ntrain r2:  0.904431800897606\ntest rmse 30888.23776210912\ntest r2:  0.8611660031516002\nbest partams {'sgd_reg__alpha': 0.001, 'sgd_reg__l1_ratio': 0.15, 'sgd_reg__learning_rate': 'adaptive', 'sgd_reg__penalty': 'elasticnet', 'sgd_reg__tol': 1e-06}\n"

## Are the CV scores different because they use different train/test splits?
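Partly. Passing a `GridSearchCV` object to `cross_val_score` clones the whole search and reruns it inside every outer fold (nested cross-validation), so each outer score comes from a model tuned on only part of the training data and scored on rows the search never saw. `best_score_` comes from a single search over all of `X_train`. In addition, `SGDRegressor` is created without a `random_state` here, so repeated runs are not expected to give identical numbers. The sketch below shows the two quantities side by side on synthetic data; the `Ridge` estimator, the grid and the `KFold` seeds are assumptions made to keep it small and repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=0)   # used to pick alpha
outer = KFold(n_splits=5, shuffle=True, random_state=1)   # used to score the tuned model

grid = GridSearchCV(Ridge(), {"alpha": [0.01, 1.0, 100.0]}, cv=inner)

# Non-nested: one search over all rows; best_score_ is the mean inner-fold score
grid.fit(X, y)
print("best_score_ :", grid.best_score_)

# Nested: the search is repeated inside every outer training split
nested = cross_val_score(grid, X, y, cv=outer)
print("nested folds:", nested)
print("nested mean :", nested.mean())
```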

In [122]:
# Train a Ridge regression model, report the coefficients, the best parameters, and model performance 
'''
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
ridge = Ridge()

#define a list of parameters
param_ridge = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100] }

grid_ridge = GridSearchCV(ridge, param_ridge, cv=5, return_train_score = True)
grid_ridge.fit(X_train, y_train)

# Mean Cross Validation Score
print("Best Mean Cross-validation score: {:.2f}".format(grid_ridge.best_score_))

print()

#find best parameters
print('Ridge parameters: ', grid_ridge.best_params_)

# print co-eff

print("Ridge.coef_:", grid_ridge.best_estimator_.coef_)
print("Ridge.intercept_:", grid_ridge.best_estimator_.intercept_)

# Check test data set performance

print("Ridge Test Performance: ", grid_ridge.score(X_test,y_test))
'''


Best Mean Cross-validation score: 0.88

Ridge parameters:  {'alpha': 100}
Ridge.coef_: [ 6.21659729e+02  1.17460178e+03  1.42941553e+03  2.64606835e+03
  1.51146267e+03  2.62791603e+02  6.67979067e+02  1.18774714e+03
  1.42260925e+03  2.24214684e+03  1.25368077e+03  9.92739531e+03
  1.23361742e+03  1.99010230e+03  1.20327177e+03 -6.63081364e+02
  1.30900812e+04 -4.66276510e+02 -2.83583594e+03  3.23156417e+03
  6.84148916e+02 -6.10003713e+02  1.78250368e+03 -1.24106895e+03
 -8.16647095e+02  8.17430994e+02  3.38262729e+03  6.89414122e+02
 -2.93079656e+02  2.71590898e+03  4.54540556e+02  3.45654634e+03
  1.37124639e+03  5.85727188e+03 -1.21460614e+03  1.83762570e+03
 -2.50335442e+02  7.78490680e+03  1.59374113e+02  1.42274787e+03
  1.23479897e+03 -5.43667577e+02  1.07696364e+04  1.01767254e+04
  3.73457884e+03  6.25609926e+03  2.54341163e+03 -1.10107201e+03
  3.81888633e+03  4.50746959e+03  8.23531957e+02  2.86618489e+03
  4.19642055e+03  3.72923474e+03  2.86150257e+03  1.16162981e+03
  1

# Ridge Code

In [36]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

ridge_pipe = Pipeline([
        ('scaler',StandardScaler()),
        ('ridge_reg',Ridge())])

ridge_params = {'ridge_reg__alpha':[0.001, 0.01,0.05, 0.1, 0.25,0.5,1, 10,100,150]}

grid_ridge = GridSearchCV(ridge_pipe,ridge_params,cv=5,return_train_score = True)

grid_ridge.fit(X_train,y_train)

X_train_preds = grid_ridge.predict(X_train)
X_test_preds = grid_ridge.predict(X_test)


print('best params ',grid_ridge.best_params_)
scores = cross_val_score(grid_ridge, X_train, y_train)
print("Cross-validation scores: {}".format(scores))
print('Cv score ',grid_ridge.best_score_)

print('train rmse',sqrt(mean_squared_error(y_train,X_train_preds)))
print("train r2: ", r2_score(y_train,X_train_preds))

print('test rmse',sqrt(mean_squared_error(y_test,X_test_preds)))
print("test r2: ", r2_score(y_test,X_test_preds))
print("Ridge Test Performance: ", grid_ridge.score(X_test,y_test))


best params  {'ridge_reg__alpha': 100}
Cross-validation scores: [0.87180427 0.87337498 0.88443778 0.89665756 0.89454761]
Cv score  0.8841976405377137
train rmse 23661.930837005166
train r2:  0.9103295889443214
test rmse 29524.697349097773
test r2:  0.8731529205790174
Ridge Test Performance:  0.8731529205790174
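Every search in this notebook is created with `return_train_score=True`, but the stored per-setting train scores are never looked at. A possible way to inspect them is to load `cv_results_` into a DataFrame and compare mean train and mean held-out scores for each `alpha`. This is a sketch on synthetic data (the `make_regression` data set is an assumption; the alpha grid is the one used above):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=25.0, random_state=0)

alphas = [0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 1, 10, 100, 150]
grid = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, return_train_score=True)
grid.fit(X, y)

# One row per alpha: mean score on the training folds vs. on the held-out folds
results = pd.DataFrame(grid.cv_results_)[
    ["param_alpha", "mean_train_score", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score").to_string(index=False))
```

A large gap between `mean_train_score` and `mean_test_score` for a setting is a sign of overfitting at that setting.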



In [42]:
# Train a Lasso regression model, report the coefficients, the best parameters, and model performance 

# YOUR CODE HERE
'''
from sklearn.linear_model import Lasso
lasso = Lasso(random_state=0)

#define a list of parameters
param_lasso = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100] }

grid_lasso = GridSearchCV(lasso, param_lasso, cv=5, return_train_score = True)
grid_lasso.fit(X_train, y_train)

# Mean Cross Validation Score
print("Best Mean Cross-validation score: {:.2f}".format(grid_lasso.best_score_))
print()

#find best parameters
print('Lasso parameters: ', grid_lasso.best_params_)

# print co-eff

print("Lasso.coef_:", grid_lasso.best_estimator_.coef_)
print("Lasso.intercept_:", grid_lasso.best_estimator_.intercept_)

# Check test data set performance
print("Lasso Test Performance: ", grid_lasso.score(X_test,y_test))
'''




Best Mean Cross-validation score: 0.88

Lasso parameters:  {'alpha': 100}
Lasso.coef_: [ 5.98354347e+02  9.93527379e+02  1.31248909e+03  2.45908267e+03
  1.47831122e+03  2.01697431e+02  4.99595526e+02  1.10518248e+03
  1.37589148e+03  2.21586305e+03  1.19134068e+03  1.12945939e+04
  1.13951391e+03  2.01113271e+03  1.11030992e+03 -7.18291866e+02
  1.58637889e+04 -2.48057120e+02 -4.75773676e+03  3.55362072e+03
  3.04826646e+02 -7.97883492e+02  1.41798949e+03 -5.97053298e+02
 -5.00534949e+02  0.00000000e+00  2.49235770e+03  6.28690101e+02
 -0.00000000e+00  2.12280516e+03  3.56997127e+02  3.48335820e+03
  1.03573161e+03  5.80099431e+03 -1.22600059e+03  1.78568161e+03
 -1.93900166e+02  7.72000354e+03  1.02765199e+02  1.33679852e+03
  1.22458458e+03 -4.25128495e+02  1.23009952e+04  1.13683366e+04
  3.99723644e+03  5.84723766e+03  2.84083895e+03 -1.19002753e+03
  3.90739089e+03  5.07525348e+03  5.97604436e+02  2.93758134e+03
  3.76509853e+03  3.03646690e+03  2.87199457e+03  8.70912692e+02
  1

# Lasso Code

In [38]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso(selection='random',max_iter=10000)

param_lasso = {'alpha':[0.001, 0.01,0.05, 0.1, 0.25,0.5,1, 10,100,150]}

grid_lasso = GridSearchCV(lasso,param_lasso,cv=10, return_train_score = True)

grid_lasso.fit(X_train,y_train)

X_train_preds = grid_lasso.predict(X_train)
X_test_preds = grid_lasso.predict(X_test)

print('CV Score',grid_lasso.best_score_)
print('Train RMSE',sqrt(mean_squared_error(y_train,X_train_preds)))
print('Test RMSE',sqrt(mean_squared_error(y_test,X_test_preds)))
print('Test Score/R-Square',grid_lasso.score(X_test,y_test))

CV Score 0.88562191775202
Train RMSE 23569.530430171562
Test RMSE 28943.028567200254
Test Score/R-Square 0.8781017390279917


# KNN Regressor Code

In [50]:
from sklearn.neighbors import KNeighborsRegressor

knn_pipe = Pipeline([
    ('scaler',StandardScaler()),
    ('knn_reg',KNeighborsRegressor(algorithm='auto'))
])

knn_params = {'knn_reg__weights':['uniform','distance'],
             'knn_reg__leaf_size': [5,10,20,30],
             'knn_reg__n_neighbors': [5,10,15],
             'knn_reg__p':[1,2,3]}

knn_reg_grid = GridSearchCV(knn_pipe,knn_params,cv=5,return_train_score = True)

knn_reg_grid.fit(X_train,y_train)

X_train_preds = knn_reg_grid.predict(X_train)
X_test_preds = knn_reg_grid.predict(X_test)

print("CV score ",knn_reg_grid.best_score_)
print("Best Params ",knn_reg_grid.best_params_)

print("train RMSE",sqrt(mean_squared_error(y_train,X_train_preds)))
print("train R2/score",r2_score(y_train,X_train_preds))

print("test RMSE",sqrt(mean_squared_error(y_test,X_test_preds)))
print("test R2/score",r2_score(y_test,X_test_preds))


CV score  0.8354252209693938
Best Params  {'knn_reg__leaf_size': 5, 'knn_reg__n_neighbors': 5, 'knn_reg__p': 1, 'knn_reg__weights': 'distance'}
train RMSE 165.80829186118498
train R2/score 0.9999955968722103
test RMSE 38002.05080871088
test R2/score 0.7898527883440282
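The near-perfect training score is most likely a side effect of `weights='distance'` rather than evidence of a good fit: when a training row is scored, it is its own nearest neighbour at distance zero, so the prediction collapses to its own target. The sketch below reproduces the effect on random data (the data and seed are assumptions). That the notebook's training R2 falls just short of 1 could be due to duplicated feature rows with different prices, but the notebook does not confirm this.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))                  # continuous features, so no duplicated rows
y = X[:, 0] + rng.normal(scale=0.5, size=100)  # noisy target

uniform = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(X, y)
distance = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)

# Scoring on the training rows themselves
print("train R2, uniform :", uniform.score(X, y))
print("train R2, distance:", distance.score(X, y))
```

For this reason the CV and test scores are the meaningful numbers for the distance-weighted model.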


# DecisionTree Regressor code

In [55]:
from sklearn.tree import DecisionTreeRegressor

dreg_pipe = Pipeline([
    ('dtree_reg',DecisionTreeRegressor())
])

dreg_params = {
    'dtree_reg__criterion':['mse','friedman_mse','mae'],
    'dtree_reg__splitter':['random','best'],
    'dtree_reg__max_depth':[20,30],
    'dtree_reg__min_samples_split':[4,8,12,16],
    'dtree_reg__min_samples_leaf' :[2,4,8],
    'dtree_reg__max_features':['auto','sqrt']
}

dtree_reg_grid = GridSearchCV(dreg_pipe,dreg_params,cv=5,return_train_score = True)

dtree_reg_grid.fit(X_train,y_train)

X_train_preds = dtree_reg_grid.predict(X_train)
X_test_preds = dtree_reg_grid.predict(X_test)

print("CV score ",dtree_reg_grid.best_score_)
print("Best Params ",dtree_reg_grid.best_params_)

print("train RMSE",sqrt(mean_squared_error(y_train,X_train_preds)))
print("train R2/score",r2_score(y_train,X_train_preds))

print("test RMSE",sqrt(mean_squared_error(y_test,X_test_preds)))
print("test R2/score",r2_score(y_test,X_test_preds))

'''
CV score  0.7814859520165511
Best Params  {'dtree_reg__criterion': 'friedman_mse', 'dtree_reg__max_depth': 20, 'dtree_reg__max_features': 'auto', 'dtree_reg__min_samples_leaf': 2, 'dtree_reg__min_samples_split': 8, 'dtree_reg__splitter': 'random'}
train RMSE 16912.129029241747
train R2/score 0.9541915743480076
test RMSE 35541.86940384073
test R2/score 0.8161811279310636
'''

CV score  0.7980363024025918
Best Params  {'dtree_reg__criterion': 'mse', 'dtree_reg__max_depth': 30, 'dtree_reg__max_features': 'auto', 'dtree_reg__min_samples_leaf': 8, 'dtree_reg__min_samples_split': 12, 'dtree_reg__splitter': 'random'}
train RMSE 27697.236187058385
train R2/score 0.8771368435976774
test RMSE 36759.216163106896
test R2/score 0.8033735005070397

# SVM Regressor

In [67]:
from sklearn.svm import SVR

svm_reg_pipe = Pipeline([
    ('svm_reg',SVR())
])

svm_reg_params = {
    'svm_reg__kernel':['linear','poly','rbf','sigmoid'],
    'svm_reg__degree':[3,4,5],
    'svm_reg__gamma': ['auto','scale'],
    'svm_reg__coef0':[2.0,4.0,10.0,15.0]
}

svm_reg_grid = GridSearchCV(svm_reg_pipe,svm_reg_params,cv=5,return_train_score = True)

svm_reg_grid.fit(X_train,y_train)

X_train_preds = svm_reg_grid.predict(X_train)
X_test_preds = svm_reg_grid.predict(X_test)

print("CV score ",svm_reg_grid.best_score_)
print("Best Params ",svm_reg_grid.best_params_)

print("train RMSE",sqrt(mean_squared_error(y_train,X_train_preds)))
print("train R2/score",r2_score(y_train,X_train_preds))

print("test RMSE",sqrt(mean_squared_error(y_test,X_test_preds)))
print("test R2/score",r2_score(y_test,X_test_preds))


CV score  0.9004499345276106
Best Params  {'svm_reg__coef0': 10.0, 'svm_reg__degree': 5, 'svm_reg__gamma': 'auto', 'svm_reg__kernel': 'poly'}
train RMSE 17044.420472834856
train R2/score 0.9534721185647854
test RMSE 29404.570032081163
test R2/score 0.8741830276841772


## Tune Multiple Models with one GridSearch

In [95]:
model_gs = Pipeline([("Poly",PolynomialFeatures()),("regressor", LinearRegression())])

In [132]:
model_parm_gd = [
    { 'regressor': [LinearRegression()]},
    
    { 'regressor': [Ridge()],
      'regressor__alpha':[0.001, 0.01, 0.1, 1, 10, 100,200],
    'regressor__solver':['auto']},
    
    { 'regressor': [Lasso()],
      'regressor__alpha':[0.001, 0.01, 0.1, 1, 10, 100,200]},
 
]

In [133]:
grid_search_house_pipe = GridSearchCV(model_gs, model_parm_gd,cv=5)

In [134]:
grid_search_house_pipe.fit(X_train,y_train)



GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('Poly',
                                        PolynomialFeatures(degree=2,
                                                           include_bias=True,
                                                           interaction_only=False,
                                                           order='C')),
                                       ('regressor',
                                        LinearRegression(copy_X=True,
                                                         fit_intercept=True,
                                                         n_jobs=None,
                                                         normalize=False))],
                                verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'regressor': [LinearRegression(copy_X=Tru...
                          'regressor__solver': ['auto']},
 

In [128]:
#print(grid_search_house_pipe.best_params_)

{'regressor': Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001), 'regressor__alpha': 100}


In [135]:
print(grid_search_house_pipe.best_params_)

{'regressor': Lasso(alpha=200, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False), 'regressor__alpha': 200}


In [136]:
# let's get the predictions
X_train_preds = grid_search_house_pipe.predict(X_train)
X_test_preds = grid_search_house_pipe.predict(X_test)

In [130]:
#print("Best Mean Cross-validation score: {:.2f}".format(grid_search_house_pipe.best_score_))

Best Mean Cross-validation score: 0.88


In [137]:
print("Best Mean Cross-validation score: {:.2f}".format(grid_search_house_pipe.best_score_))

Best Mean Cross-validation score: 0.83


In [131]:
'''
# check model performance:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

print('train mse: {}'.format(mean_squared_error(y_train, X_train_preds)))
print('train rmse: {}'.format(sqrt(mean_squared_error(y_train, X_train_preds))))
print('train r2: {}'.format(r2_score(y_train, X_train_preds)))
print()
print('test mse: {}'.format(mean_squared_error(y_test, X_test_preds)))
print('test rmse: {}'.format(sqrt(mean_squared_error(y_test, X_test_preds))))
print('test r2: {}'.format(r2_score(y_test, X_test_preds)))

'''


train mse: 559886970.9352162
train rmse: 23661.93083700517
train r2: 0.9103295889443213

test mse: 871707753.5558217
test rmse: 29524.697349097783
test r2: 0.8731529205790173


In [138]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

print('train mse: {}'.format(mean_squared_error(y_train, X_train_preds)))
print('train rmse: {}'.format(sqrt(mean_squared_error(y_train, X_train_preds))))
print('train r2: {}'.format(r2_score(y_train, X_train_preds)))
print()
print('test mse: {}'.format(mean_squared_error(y_test, X_test_preds)))
print('test rmse: {}'.format(sqrt(mean_squared_error(y_test, X_test_preds))))
print('test r2: {}'.format(r2_score(y_test, X_test_preds)))

train mse: 112781889.48419684
train rmse: 10619.881801799718
train r2: 0.981937071382834

test mse: 865352885.5659168
test rmse: 29416.88096256836
test r2: 0.8740776530267179
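The same MSE, RMSE and R2 block is repeated after every model in this notebook. A small helper could replace those repeats. The sketch below implements the three metrics directly with NumPy so the formulas are visible; `regression_report` is a hypothetical name, and the three sample values are made up.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return MSE, RMSE and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {"mse": mse, "rmse": np.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}


# Hypothetical targets and predictions
report = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(report)
```

With the notebook's variables this would be called as `regression_report(y_train, X_train_preds)` and `regression_report(y_test, X_test_preds)`.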
