## Project: Regression Dataset

### The dataset is downloaded from Kaggle. It consists of various datapoints collected from Airbnb website which can be used to predict the price of the stay. We use the data from Boston city and drop rows of other cities

### __Goal:__ To predict the log price of the stay using various regression methods

In [1]:
import warnings
warnings.filterwarnings('ignore')

__Import required packages__

In [2]:
# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 100
pd.options.display.max_columns = 150

In [42]:
import pickle

__Load the airbnb dataset__

In [3]:
raw_data = pd.read_csv("airbnb.csv")

__Subset the dataset to contain only the instances of Boston city__

In [4]:
# Subset the data to Boston city
boston_data = raw_data[raw_data['city'] == 'Boston']
boston_data.reset_index(inplace=True, drop=True)

## Data Pre-processing

__Inspect the dataset to understand the data and recognize any data inconsistencies__

In [5]:
boston_data.head()

Unnamed: 0,id,log_price,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,description,first_review,host_has_profile_pic,host_identity_verified,host_response_rate,host_since,instant_bookable,last_review,latitude,longitude,name,neighbourhood,number_of_reviews,review_scores_rating,thumbnail_url,zipcode,bedrooms,beds
0,14648556,4.59512,Townhouse,Private room,"{Internet,""Wireless Internet"",""Air conditionin...",2,2.0,Real Bed,strict,True,Boston,This is a nice duplex in a good location.Recen...,2016-07-16,t,t,100%,2014-07-27,f,2017-02-07,42.339194,-71.049672,"Comfy room (C) near T, convention center, down...",South Boston,12,88.0,https://a0.muscache.com/im/pictures/176088bb-3...,2127,1.0,1.0
1,4680055,4.682131,Condominium,Private room,"{TV,Internet,""Wireless Internet"",""Air conditio...",2,1.0,Real Bed,strict,True,Boston,Tourists/Conference-goers great choice! Privat...,2016-03-20,t,t,100%,2013-06-16,t,2017-09-17,42.330628,-71.053148,Private Bedroom Close To Downtown/Subway Red line,South Boston,40,96.0,https://a0.muscache.com/im/pictures/aad0eaa7-a...,2127,1.0,1.0
2,4274462,4.828314,Apartment,Entire home/apt,"{TV,""Wireless Internet"",""Air conditioning"",Kit...",6,1.0,Real Bed,strict,True,Boston,"An Entire 2 bedroom, 600sqft, apartment w/ 4 t...",2017-09-14,t,t,100%,2015-01-25,f,2017-10-02,42.336007,-71.052918,**NEW*Downtown/Convention/Subway/Beach C130,South Boston,5,100.0,https://a0.muscache.com/im/pictures/3d35ea0b-e...,2127,2.0,4.0
3,2278299,4.094345,House,Private room,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",2,1.0,Real Bed,flexible,False,Boston,"This is a beautiful space in a gorgeous, newly...",,t,f,,2013-12-13,f,,42.319265,-71.113246,"Beautiful Home in Jamaica Plain, MA",Jamaica Plain,0,,https://a0.muscache.com/im/pictures/86832250/f...,2130,1.0,1.0
4,16253186,4.962845,Apartment,Entire home/apt,"{TV,""Wireless Internet"",Kitchen,""Family/kid fr...",2,1.0,Real Bed,flexible,False,Boston,Nicely decorated comfortable 1 bedroom in very...,2017-05-21,t,f,,2017-05-05,t,2017-05-27,42.357198,-71.071588,Clean upscale apt and location,Beacon Hill,2,80.0,https://a0.muscache.com/im/pictures/9bf5ae4c-f...,2114,1.0,1.0


__After initial data inspection, removing the attributes which do not contribute towards the regression__

In [6]:
drop_cols = ['id','city', 'description', 'first_review', 'last_review', 'host_since', 'latitude', 'longitude', \
            'name', 'neighbourhood', 'thumbnail_url', 'zipcode']

boston_data.drop(columns=drop_cols, inplace=True)

__Check the datatype of the retained attributes__

In [7]:
boston_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3468 entries, 0 to 3467
Data columns (total 17 columns):
log_price                 3468 non-null float64
property_type             3468 non-null object
room_type                 3468 non-null object
amenities                 3468 non-null object
accommodates              3468 non-null int64
bathrooms                 3462 non-null float64
bed_type                  3468 non-null object
cancellation_policy       3468 non-null object
cleaning_fee              3468 non-null bool
host_has_profile_pic      3468 non-null object
host_identity_verified    3468 non-null object
host_response_rate        2887 non-null object
instant_bookable          3468 non-null object
number_of_reviews         3468 non-null int64
review_scores_rating      2820 non-null float64
bedrooms                  3465 non-null float64
beds                      3466 non-null float64
dtypes: bool(1), float64(5), int64(2), object(9)
memory usage: 437.0+ KB


__Check for attributes with missing values__

In [8]:
boston_data.isnull().sum()

log_price                   0
property_type               0
room_type                   0
amenities                   0
accommodates                0
bathrooms                   6
bed_type                    0
cancellation_policy         0
cleaning_fee                0
host_has_profile_pic        0
host_identity_verified      0
host_response_rate        581
instant_bookable            0
number_of_reviews           0
review_scores_rating      648
bedrooms                    3
beds                        2
dtype: int64

## Initial impressions on data and next steps

### The following columns have missing values:

> bathrooms

> host_response_rate

> review_scores_rating

> bedrooms

> beds

* Based on analysis, missing values will be imputed with appropriate values

* The column __amenities__ has values in the form of JSON. So we will replace it with the count of amenities provided by the host

* Categorical columns need processing to convert them into numerical form.

__host_has_profile_pic__ , __host_identity_verified__ and __instant_bookable__ columns have t for true and f for false. Replacing t with 1 and f with 0

In [9]:
boston_data.replace(to_replace = "t", value = 1,inplace=True) 
boston_data.replace(to_replace = "f", value = 0,inplace=True)

boston_data.replace(to_replace = True, value = 1,inplace=True) 
boston_data.replace(to_replace = False, value = 0,inplace=True)

__Inspect the distinct values of the column `property_type`__

In [10]:
boston_data['property_type'].value_counts()

Apartment             2383
House                  563
Condominium            339
Townhouse               54
Other                   44
Loft                    24
Bed & Breakfast         17
Boat                    10
Guest suite              8
Villa                    6
Hostel                   4
Guesthouse               4
In-law                   4
Dorm                     3
Timeshare                3
Serviced apartment       1
Boutique hotel           1
Name: property_type, dtype: int64

__Since the occurence of many property types is less than 50(or less than 10 in majority cases), we will replace the values of these property types as `Other` to avoid increased dimensionality__

In [11]:
p_types = ['Apartment','House','Condominium','Townhouse']
boston_data.loc[~boston_data.property_type.isin(p_types), 'property_type'] = 'Other'

__Inspect the distinct values of the column `room_type`__

In [12]:
boston_data['room_type'].unique()

array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)

__Inspect the distinct values of the column `bed_type`__

In [13]:
boston_data['bed_type'].unique()

array(['Real Bed', 'Futon', 'Airbed', 'Pull-out Sofa', 'Couch'],
      dtype=object)

__Inspect the distinct values of the column `cancellation_policy`__

In [14]:
boston_data['cancellation_policy'].unique()

array(['strict', 'flexible', 'super_strict_30', 'moderate',
       'super_strict_60'], dtype=object)

__Converting the categorical columns to numeric by using `OneHotEncoder`__

In [15]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
feature_df = pd.DataFrame(encoder.fit_transform(boston_data[['room_type','bed_type','cancellation_policy', 'property_type','cleaning_fee']]).toarray(), 
                        columns = encoder.get_feature_names(['room_type','bed_type','cancellation_policy', 'property_type','cleaning_fee']))
boston_data = pd.merge(boston_data, feature_df, how='left', left_index=True, right_index=True)
boston_data.drop(columns=['room_type','bed_type','cancellation_policy', 'property_type','cleaning_fee'], inplace=True)

__Since the column `amenities` has json values, we fetch the count of amenities provided by each property and replace the json values with the count of amenities.__

__We are not using `OneHotEncoder` for this column since the maximum number of amenities is 78 and it will add to the dimensionality of the dataset significantly.__

__In our opinion, the count of amenities is also a good indicator to predict the price rather than taking into account 78 distinct columns.__

In [16]:
amenities_count = []
for i in boston_data['amenities']:
    amenities_count.append(len(i.split(',')))
boston_data['amenities'] = amenities_count

__Inspecting the values of the column `host_response_rate`__

In [17]:
boston_data['host_response_rate'].unique()

array(['100%', nan, '88%', '96%', '92%', '94%', '99%', '93%', '54%',
       '33%', '80%', '70%', '67%', '81%', '90%', '98%', '25%', '86%',
       '97%', '0%', '50%', '75%', '87%', '60%', '77%', '46%', '55%',
       '59%', '83%', '79%', '89%', '64%', '10%', '73%', '68%', '95%',
       '20%', '56%', '78%'], dtype=object)

__Since the response rates are in string, we convert it to numeric values between 0 to 1, where range between 0 and 1 represent values between 0% to 100%__

In [18]:
boston_data['host_response_rate'] = boston_data['host_response_rate'].str.strip('%').astype(float)
boston_data['host_response_rate'] = boston_data['host_response_rate']/100

__Inspect the datatypes of all the columns after data pre-processing__

In [19]:
boston_data.dtypes

log_price                              float64
amenities                                int64
accommodates                             int64
bathrooms                              float64
host_has_profile_pic                     int64
host_identity_verified                   int64
host_response_rate                     float64
instant_bookable                         int64
number_of_reviews                        int64
review_scores_rating                   float64
bedrooms                               float64
beds                                   float64
room_type_Entire home/apt              float64
room_type_Private room                 float64
room_type_Shared room                  float64
bed_type_Airbed                        float64
bed_type_Couch                         float64
bed_type_Futon                         float64
bed_type_Pull-out Sofa                 float64
bed_type_Real Bed                      float64
cancellation_policy_flexible           float64
cancellation_

__After data pre-processing, we have all the columns in numeric form__

## Missing value treatment

In [20]:
boston_data.describe()

Unnamed: 0,log_price,amenities,accommodates,bathrooms,host_has_profile_pic,host_identity_verified,host_response_rate,instant_bookable,number_of_reviews,review_scores_rating,bedrooms,beds,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,bed_type_Airbed,bed_type_Couch,bed_type_Futon,bed_type_Pull-out Sofa,bed_type_Real Bed,cancellation_policy_flexible,cancellation_policy_moderate,cancellation_policy_strict,cancellation_policy_super_strict_30,cancellation_policy_super_strict_60,property_type_Apartment,property_type_Condominium,property_type_House,property_type_Other,property_type_Townhouse,cleaning_fee_0.0,cleaning_fee_1.0
count,3468.0,3468.0,3468.0,3462.0,3468.0,3468.0,2887.0,3468.0,3468.0,2820.0,3465.0,3466.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0
mean,4.884035,20.046136,3.301615,1.236857,0.997116,0.608131,0.968916,0.33564,25.690311,93.597518,1.343723,1.761685,0.6188,0.367935,0.013264,0.008939,0.001442,0.008074,0.004902,0.976644,0.243368,0.235582,0.510957,0.009516,0.000577,0.68714,0.097751,0.162341,0.037197,0.015571,0.235294,0.764706
std,0.664692,7.910602,2.185942,0.509028,0.053629,0.488238,0.113022,0.472282,45.103616,8.059291,0.881492,1.300588,0.485751,0.482313,0.11442,0.094136,0.037949,0.089504,0.069852,0.151054,0.429177,0.424424,0.499952,0.097097,0.024011,0.463725,0.29702,0.368817,0.189272,0.123826,0.424244,0.424244
min,2.833213,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.382027,15.0,2.0,1.0,1.0,0.0,1.0,0.0,1.0,91.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,4.912655,19.0,2.0,1.0,1.0,1.0,1.0,0.0,7.0,96.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,5.298317,24.0,4.0,1.0,1.0,1.0,1.0,1.0,29.0,99.0,2.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
max,7.244228,78.0,16.0,6.0,1.0,1.0,1.0,1.0,380.0,100.0,10.0,16.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [21]:
missing_columns = boston_data.isnull().sum()
print(missing_columns[missing_columns > 0])

bathrooms                 6
host_response_rate      581
review_scores_rating    648
bedrooms                  3
beds                      2
dtype: int64


__We will remove the rows with missing values since imputation does not makes sense for these features__

In [22]:
boston_data = boston_data.dropna()

In [23]:
boston_data.shape

(2469, 32)

## Data Split and Feature Scaling

In [24]:
x = boston_data.drop(columns='log_price', axis=1)
y = boston_data['log_price']

__Since we have many columns with values 1 and 0, using MinMaxScaler will ensure all the features are in the range of 0 to 1 bringing all the columns to the same scale__

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.3)

scaler = MinMaxScaler()
x_train_mod = scaler.fit_transform(x_train)
x_test_mod = scaler.transform(x_test)

# Regression Methods

In [93]:
scoring = {'MSE': 'neg_mean_squared_error', 'R_squared': 'r2'}

## KNN Regressor

In [94]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

knn = KNeighborsRegressor()
hyperparams = dict(n_neighbors=np.arange(1,50), p=[1,2])

knn_reg = GridSearchCV(knn, param_grid=hyperparams, scoring=scoring, refit='MSE', cv=10, return_train_score=True)

#Fit the model
grid_search = knn_reg.fit(x_train_mod,y_train)

In [119]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation MSE: {:.2f}".format(grid_search.best_score_))

Best parameters: {'n_neighbors': 16, 'p': 1}
Best cross-validation MSE: -0.18


In [120]:
knn_model = grid_search.best_estimator_

In [121]:
knn_pkl = 'knn.pickle'
pickle.dump(knn_model, open(knn_pkl, 'wb'))

## LinearRegression

In [122]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [123]:
print(f'Train MSE: {mean_squared_error(y_train,lin_reg.predict(x_train_mod))}')
print(f'Test MSE: {mean_squared_error(y_test,lin_reg.predict(x_test_mod))}')

Train MSE: 0.9543791824124208
Test MSE: 0.941664709845328


In [124]:
lin_pkl = 'lin_reg.pickle'
pickle.dump(knn_model, open(lin_pkl, 'wb'))

## Ridge

In [126]:
from sklearn.linear_model import Ridge

ridge = Ridge()

ridge_params = []
r1 = np.arange(0,1,0.1)
r2 = np.arange(1,101,1)
ridge_params = np.concatenate((r1,r2))

params = dict(alpha=ridge_params)

ridgeCV = GridSearchCV(ridge, param_grid=params, scoring=scoring, refit='MSE', cv=10, return_train_score=True)

#Fit the model
grid_ridge = ridgeCV.fit(x_train_mod,y_train)



In [127]:
#Print The value of best Hyperparameters
print("Best parameters: {}".format(grid_ridge.best_params_))
print("Best cross-validation MSE: {:.2f}".format(grid_ridge.best_score_))

Best parameters: {'alpha': 1.0}
Best cross-validation MSE: -0.17


In [128]:
ridge_model = grid_ridge.best_estimator_

In [129]:
ridge_pkl = 'ridge.pickle'
pickle.dump(knn_model, open(ridge_pkl, 'wb'))

## Lasso

In [130]:
from sklearn.linear_model import Lasso

lasso = Lasso()

lasso_params = []
r1 = np.arange(0,1,0.1)
r2 = np.arange(1,101,1)
lasso_params = np.concatenate((r1,r2))

params = dict(alpha=lasso_params)

lassoCV = GridSearchCV(lasso, param_grid=params, scoring=scoring, refit='MSE', cv=10, return_train_score=True)

#Fit the model
grid_lasso = lassoCV.fit(x_train_mod,y_train)

In [131]:
#Print The value of best Hyperparameters
print("Best parameters: {}".format(grid_lasso.best_params_))
print("Best cross-validation MSE: {:.2f}".format(grid_lasso.best_score_))

Best parameters: {'alpha': 0.0}
Best cross-validation MSE: -0.17


In [132]:
lasso_model = grid_lasso.best_estimator_

In [133]:
lasso_pkl = 'lasso.pickle'
pickle.dump(lasso_model, open(lasso_pkl, 'wb'))

## Polynomial regression

In [134]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

param_grid = {'polynomialfeatures__degree': np.arange(5)}
grid = GridSearchCV(PolynomialRegression(), param_grid, scoring=scoring, refit='MSE', cv=7)
grid.fit(x_train_mod, y_train)



GridSearchCV(cv=7, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('polynomialfeatures',
                                        PolynomialFeatures(degree=2,
                                                           include_bias=True,
                                                           interaction_only=False,
                                                           order='C')),
                                       ('linearregression',
                                        LinearRegression(copy_X=True,
                                                         fit_intercept=True,
                                                         n_jobs=None,
                                                         normalize=False))],
                                verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'polynomialfeatures__degree': array([0, 1, 2, 3, 4])},
             pre_di

In [135]:
model = grid.best_estimator_

In [136]:
poly_pkl = 'polynomial.pickle'
pickle.dump(model, open(poly_pkl, 'wb'))

In [137]:
print("Best parameters: {}".format(grid.best_params_))
print("Best cross-validation MSE: {:.2f}".format(grid.best_score_))

Best parameters: {'polynomialfeatures__degree': 1}
Best cross-validation MSE: -0.17


## SVM

### LinearSVC

In [139]:
from sklearn.svm import LinearSVR

In [160]:
svr_params = []
r1 = np.arange(0.1,1,0.1)
r2 = np.arange(1,101,1)
svr_params = np.concatenate((r1,r2))

param_grid = {'C':svr_params}

In [161]:
linsvr = GridSearchCV(LinearSVR(),param_grid, scoring=scoring, refit='MSE')
linsvr.fit(x_train_mod,y_train)

GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=LinearSVR(C=1.0, dual=True, epsilon=0.0,
                                 fit_intercept=True, intercept_scaling=1.0,
                                 loss='epsilon_insensitive', max_iter=1000,
                                 random_state=None, tol=0.0001, verbose=0),
             iid='warn', n_jobs=None,
             param_grid={'C': array([  0.1,   0.2,   0.3,   0.4,   0.5,   0.6,   0.7,   0.8,   0.9,
         1. ,   2. ,   3. ,   4. ,   5. ,   6. ,   7. ,   8. ,   9. ,
        10. ,  11. ,  12. ,  13. ,...
        46. ,  47. ,  48. ,  49. ,  50. ,  51. ,  52. ,  53. ,  54. ,
        55. ,  56. ,  57. ,  58. ,  59. ,  60. ,  61. ,  62. ,  63. ,
        64. ,  65. ,  66. ,  67. ,  68. ,  69. ,  70. ,  71. ,  72. ,
        73. ,  74. ,  75. ,  76. ,  77. ,  78. ,  79. ,  80. ,  81. ,
        82. ,  83. ,  84. ,  85. ,  86. ,  87. ,  88. ,  89. ,  90. ,
        91. ,  92. ,  93. ,  94. ,  95. ,  96. ,  97. ,

In [162]:
linsvr_model = linsvr.best_estimator_

In [163]:
print("Best parameters: {}".format(linsvr.best_params_))
print("Best cross-validation MSE: {:.2f}".format(linsvr.best_score_))

Best parameters: {'C': 0.6}
Best cross-validation MSE: -0.17


In [164]:
linsvr_pkl = 'linsvr_pkl.pickle'
pickle.dump(linsvr_model, open(linsvr_pkl, 'wb'))

### rbf

In [138]:
from sklearn.svm import SVR

In [192]:
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf']}

svc_rbf = GridSearchCV(SVR(),param_grid,scoring=scoring, refit='MSE', cv=10)
svc_rbf.fit(x_train_mod,y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='auto_deprecated', kernel='rbf',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['rbf']},
             pre_dispatch='2*n_jobs', refit='MSE', return_train_score=False,
             scoring={'MSE': 'neg_mean_squared_error', 'R_squared': 'r2'},
             verbose=0)

In [193]:
rbf_model = svc_rbf.best_estimator_

In [194]:
print("Best parameters: {}".format(svc_rbf.best_params_))
print("Best cross-validation MSE: {:.2f}".format(svc_rbf.best_score_))

Best parameters: {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
Best cross-validation MSE: -0.16


In [195]:
rbf_pkl = 'svc_rbf.pickle'
pickle.dump(rbf_model, open(rbf_pkl, 'wb'))

### poly

In [196]:
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['poly']}

svc_poly = GridSearchCV(SVR(),param_grid,scoring=scoring, refit='MSE', cv=10)
svc_poly.fit(x_train_mod,y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='auto_deprecated', kernel='rbf',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['poly']},
             pre_dispatch='2*n_jobs', refit='MSE', return_train_score=False,
             scoring={'MSE': 'neg_mean_squared_error', 'R_squared': 'r2'},
             verbose=0)

In [197]:
svcpoly_model = svc_poly.best_estimator_

In [198]:
print("Best parameters: {}".format(svc_poly.best_params_))
print("Best cross-validation MSE: {:.2f}".format(svc_poly.best_score_))

Best parameters: {'C': 10, 'gamma': 0.1, 'kernel': 'poly'}
Best cross-validation score: -0.16


In [199]:
svc_poly = 'svc_poly.pickle'
pickle.dump(svcpoly_model, open(svc_poly, 'wb'))

### linear

In [73]:
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['linear']}

svc_lin = GridSearchCV(SVR(),param_grid,scoring=scoring, refit='MSE', cv=10)
svc_lin.fit(x_train_mod,y_train)

GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='auto_deprecated', kernel='rbf',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['linear']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [74]:
svclin_model = svc_lin.best_estimator_

In [75]:
print("Best parameters: {}".format(svc_lin.best_params_))
print("Best cross-validation MSE: {:.2f}".format(svc_lin.best_score_))

Best parameters: {'C': 1, 'gamma': 1, 'kernel': 'linear'}
Best cross-validation score: 0.58


In [76]:
svc_lin = 'svc_lin.pickle'
pickle.dump(svclin_model, open(svc_lin, 'wb'))

## DesicionTree Regressor

In [77]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()

In [79]:
tree_params = dict(max_depth=np.arange(5,50), max_leaf_nodes=np.arange(50,200))

tree_grid = GridSearchCV(dt, param_grid=tree_params, scoring=scoring, refit='MSE', cv=10)
tree_grid.fit(x_train_mod, y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
                                             max_features=None,
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             presort=False, random_state=None,
                                             splitter='best'),
             iid='warn', n_jobs=None,
             param_gr...
       128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
       141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153,
       154, 155, 156, 157, 158, 159, 160, 

In [80]:
dtree_model = tree_grid.best_estimator_

In [81]:
print("Best parameters: {}".format(tree_grid.best_params_))
print("Best cross-validation MSE: {:.2f}".format(tree_grid.best_score_))

Best parameters: {'max_depth': 5, 'max_leaf_nodes': 80}
Best cross-validation score: 0.57


In [82]:
dtree_pkl = 'dtree_pkl.pickle'
pickle.dump(dtree_model, open(dtree_pkl, 'wb'))