# Wines Points prediction 

Submission Date : 3.6.2023
Task: Predict the wine score given the inputs
Instructions:
 * Use logistic regression as benchmark model
 * Use sklearn pipelines + cv + grid search with sklearn models (e.g. KNNs, RandomForest, etc.)
 * Compare all models on proper metric (your choice)

For DNN course project:
* Use sklearn pipelines with tensorflow models (w/wo embeddings, LSTMs, RNNs, Transformers etc.)
* Compare all models on proper metric (your choice)

In [1]:
%load_ext autoreload
%autoreload 2
import sys; sys.path.append('../')

Here we will try to predict the points a wine will get based on its known characteristics (i.e. features, in ML terminology). The main point at this stage is to establish a simple, ideally very cost-effective, baseline to which models with increased complexity and resource demands will be compared.
In the real world there is a tradeoff between complexity and performance, and the data scientist's job, among other things, is to present a tradeoff table of what performance is achievable at what complexity level.

Complexity should then be translated into cost. For example:
 * Compute cost 
 * Maintenance cost
 * Serving costs (i.e. is new platform needed?) 
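
As a rough illustration of translating complexity into compute cost, the sketch below multiplies the number of grid-search fits by the time per fit and an hourly price. The function name `search_compute_cost`, the hourly rate, and the example timings are all hypothetical and chosen only for illustration; the estimate also ignores the final refit and scoring time.

```python
def search_compute_cost(mean_fit_time_s, n_candidates, n_folds, hourly_rate):
    """Approximate compute cost of one grid search (hypothetical helper).

    cost = (candidates x folds) fits x seconds per fit, converted to hours,
    times the price of one machine-hour.
    """
    n_fits = n_candidates * n_folds
    hours = mean_fit_time_s * n_fits / 3600
    return hours * hourly_rate


HOURLY_RATE = 0.50  # hypothetical price of one machine-hour

# Hypothetical timings: a cheap linear model vs. a slow ensemble.
cheap = search_compute_cost(2.0, n_candidates=5, n_folds=5, hourly_rate=HOURLY_RATE)
expensive = search_compute_cost(300.0, n_candidates=9, n_folds=5, hourly_rate=HOURLY_RATE)

print(f"cheap search:     {cheap:.4f}")
print(f"expensive search: {expensive:.4f}")
```

Maintenance and serving costs are harder to reduce to a formula, so a tradeoff table would typically list them as separate columns next to the compute estimate.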
 

## Loading the data

In [1]:
import pandas as pd
import cufflinks as cf; cf.go_offline()

In [2]:
wine_reviews = pd.read_csv("clean_wine_reviews_data.csv")
wine_reviews.shape

(111931, 16)

In [4]:
wine_reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,wine_category
0,0,Italy,Aromas include tropical fruit broom brimstone ...,Vulkà Bianco,87,39.928286,Sicily & Sardinia,Etna,Unknown,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013.0,White
1,1,Portugal,This ripe fruity wine smooth still structured ...,Avidagos,87,15.0,Douro,Unknown,Unknown,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0,Red
2,2,US,Tart snappy flavors lime flesh rind dominate S...,Unknown,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0,White
3,3,US,Pineapple rind lemon pith orange blossom start...,Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,Unknown,Alexander Peartree,Unknown,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0,White
4,4,US,Much like regular bottling 2012 comes across r...,Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0,Red


# Linear Regression without textual column

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

y = pd.DataFrame(wine_reviews, columns=['points'])
x = wine_reviews.drop('points', axis=1)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=29)

categorical_features = ['country', 'taster_name','variety','designation','province','winery','wine_category','region_1']
numerical_features = ['price','year']
text_feature = 'description'

category_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehotencoding', OneHotEncoder(handle_unknown='ignore'))
])

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('standardscaler', StandardScaler())
])

# text_transformer = Pipeline(steps=[
#     ('countvectorizer', CountVectorizer(max_features = 1500))
# ])

preprocessing = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical_features),
        ('category', category_transformer, categorical_features),
        # ('text', text_transformer, text_feature)
    ], remainder='drop')

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessing),
    ('regressor', LinearRegression())
])

print(pipeline)

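param_grid = [{
    'regressor': [LinearRegression()],
    'regressor__fit_intercept': [True, False],
    # n_jobs only controls parallelism, not the fitted coefficients, so varying it
    # triples the number of fits without changing the scores.
    'regressor__n_jobs':[5,10,15]
}]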

# Perform Grid Search on linear regression
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2',verbose=3)
grid_search.fit(X_train, y_train)



Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['price', 'year']),
                                                 ('category',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Unknown',
                                                                                 strategy='constant')),
                                                                  ('onehotencoding',
                                                                   O

In [6]:
best_model = grid_search.best_estimator_

In [7]:
best_model

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['price', 'year']),
                                                 ('category',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Unknown',
                                                                                 strategy='constant')),
                                                                  ('onehotencoding',
                                                                   O

In [11]:
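# Note: the 'train_r2' column is filled below from mean_test_score, i.e. the mean
# cross-validated R2 of the best candidate, not a score on the training data.
summary_df = pd.DataFrame(columns = ['method', 'mean_fit_time', 'parameters', 'train_r2', 'test_MSE','test_r2'])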

In [9]:
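# GridSearchCV (refit=True by default) has already refit best_estimator_ on all of
# X_train, so this call only repeats that fit.
best_model.fit(X_train, y_train)
prediction = best_model.predict(X_test)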

In [10]:
from sklearn.metrics import r2_score
prediction_mse = mean_squared_error(y_test, prediction)
prediction_r2 = r2_score(y_test, prediction)
print (prediction_mse, prediction_r2)

5.648516085395559 0.3863580195018157
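
To make the two numbers above easier to interpret, the following sketch converts the test MSE into an RMSE, which is expressed in points, and uses the identity R² = 1 − MSE / Var(y) to back out the spread of the test-set scores. Only the two printed values are taken from the run; the variable names are illustrative.

```python
import math

test_mse = 5.648516085395559   # test MSE printed above
test_r2 = 0.3863580195018157   # test R^2 printed above

# RMSE is in the same unit as the target (points).
rmse = math.sqrt(test_mse)

# R^2 = 1 - MSE / Var(y_test), so the test-set variance of points is implied by the pair.
implied_variance = test_mse / (1 - test_r2)
implied_std = math.sqrt(implied_variance)

print(f"RMSE ~ {rmse:.2f} points")
print(f"implied std of test-set points ~ {implied_std:.2f}")
```

Read this way, the baseline is off by roughly 2.4 points on average (in the root-mean-square sense), against a spread of about 3 points in the test-set scores.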


In [11]:
all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

best_result = all_results.iloc[0]
all_results.head()

summary_df.loc[len(summary_df.index)] = ['Linear Regression without text', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

In [12]:
summary_df

Unnamed: 0,method,mean_fit_time,parameters,train_r2,test_MSE,test_r2
0,Linear Regression without text,33.147654,"{'regressor': LinearRegression(n_jobs=5), 'reg...",0.336666,5.648516,0.386358
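
The `train_r2` column above is populated from `mean_test_score`, which is the mean score on the held-out CV folds rather than a training score. If an actual training score is wanted next to it, `GridSearchCV` can record one via `return_train_score=True`. The sketch below shows this on synthetic data from `make_regression`; the data, the `Ridge` estimator, and the grid are stand-ins chosen for speed, not the notebook's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the wine reviews).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=29)

search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1, 10]},
    cv=5,
    scoring="r2",
    return_train_score=True,  # adds mean_train_score to cv_results_
)
search.fit(X_demo, y_demo)

best = search.best_index_
cv_r2 = search.cv_results_["mean_test_score"][best]      # held-out folds
train_r2 = search.cv_results_["mean_train_score"][best]  # training folds
print(f"cv_r2={cv_r2:.3f}  train_r2={train_r2:.3f}")
```

Recording train scores makes each search somewhat slower, since every candidate is also scored on its training folds.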


# Ridge

In [13]:
from sklearn.linear_model import Lasso, Ridge

ridge_pipeline = Pipeline([
    ('preprocessor', preprocessing),
    ('regressor', Ridge())
])


param_grid = {
    'regressor': [Ridge()],
    'regressor__alpha': [0.001,0.01,0.1,1,10]
}

grid_search = GridSearchCV(ridge_pipeline, param_grid, cv = 5, scoring = 'r2', verbose = 42)
grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 1/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.391 total time=   2.4s
[CV 2/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 2/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.382 total time=   2.1s
[CV 3/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 3/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.403 total time=   2.0s
[CV 4/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 4/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.387 total time=   1.9s
[CV 5/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 5/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.397 total time=   1.9s
[CV 1/5; 2/5] START regressor=Ridge(), regressor__alpha=0.01..........

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['price',
                                                                          'year']),
                                                                        ('category',
                                                                         Pipeline(steps=[('imputer',
                                                            

In [14]:
# pipeline.fit(X_train, y_train)

best_model = grid_search.best_estimator_
# Redundant: best_estimator_ is already refit on X_train (refit=True is the default).
best_model.fit(X_train, y_train)
prediction = best_model.predict(X_test)

In [15]:
prediction_mse = mean_squared_error(y_test, prediction)
prediction_r2 = r2_score(y_test, prediction)
print (prediction_mse, prediction_r2)

4.499047774688376 0.5112336505593242


In [16]:
all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

best_result = all_results.iloc[0]
all_results.head()

summary_df.loc[len(summary_df.index)] = ['Ridge without text', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

In [17]:
summary_df

Unnamed: 0,method,mean_fit_time,parameters,train_r2,test_MSE,test_r2
0,Linear Regression without text,33.147654,"{'regressor': LinearRegression(n_jobs=5), 'reg...",0.336666,5.648516,0.386358
1,Ridge without text,1.339875,"{'regressor': Ridge(alpha=1), 'regressor__alph...",0.496035,4.499048,0.511234


# Lasso

In [18]:
from sklearn.linear_model import Lasso, Ridge

lasso_pipeline = Pipeline([
    ('preprocessor', preprocessing),
    ('regressor', Lasso())
])


param_grid = {
    'regressor': [Lasso()],
    'regressor__alpha': [0.01, 0.001, 0.1,1,10]
}

grid_search = GridSearchCV(lasso_pipeline, param_grid, cv = 5, scoring = 'r2', verbose = 42)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5; 1/5] START regressor=Lasso(), regressor__alpha=0.01....................
[CV 1/5; 1/5] END regressor=Lasso(), regressor__alpha=0.01;, score=0.354 total time= 5.5min
[CV 2/5; 1/5] START regressor=Lasso(), regressor__alpha=0.01....................
[CV 2/5; 1/5] END regressor=Lasso(), regressor__alpha=0.01;, score=0.353 total time= 3.5min
[CV 3/5; 1/5] START regressor=Lasso(), regressor__alpha=0.01....................
[CV 3/5; 1/5] END regressor=Lasso(), regressor__alpha=0.01;, score=0.353 total time= 2.9min
[CV 4/5; 1/5] START regressor=Lasso(), regressor__alpha=0.01....................
[CV 4/5; 1/5] END regressor=Lasso(), regressor__alpha=0.01;, score=0.351 total time= 4.1min
[CV 5/5; 1/5] START regressor=Lasso(), regressor__alpha=0.01....................
[CV 5/5; 1/5] END regressor=Lasso(), regressor__alpha=0.01;, score=0.352 total time= 5.5min
[CV 1/5; 2/5] START regressor=Lasso(), regressor__alpha=0.001..............

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['price',
                                                                          'year']),
                                                                        ('category',
                                                                         Pipeline(steps=[('imputer',
                                                            

In [20]:
# pipeline.fit(X_train, y_train)

best_model = grid_search.best_estimator_

# Redundant: best_estimator_ is already refit on X_train (refit=True is the default).
best_model.fit(X_train, y_train)

prediction = best_model.predict(X_test)

In [21]:
prediction_mse = mean_squared_error(y_test, prediction)
prediction_r2 = r2_score(y_test, prediction)
print (prediction_mse, prediction_r2)

5.395524170616086 0.41384248750867847


In [22]:
all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

best_result = all_results.iloc[0]
all_results.head()

summary_df.loc[len(summary_df.index)] = ['Lasso without text', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

In [23]:
summary_df

Unnamed: 0,method,mean_fit_time,parameters,train_r2,test_MSE,test_r2
0,Linear Regression without text,33.147654,"{'regressor': LinearRegression(n_jobs=5), 'reg...",0.336666,5.648516,0.386358
1,Ridge without text,1.339875,"{'regressor': Ridge(alpha=1), 'regressor__alph...",0.496035,4.499048,0.511234
2,Lasso without text,1114.591888,"{'regressor': Lasso(alpha=0.001), 'regressor__...",0.406582,5.395524,0.413842
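
One way to turn the summary above into the tradeoff table mentioned in the introduction is to express each model relative to the linear-regression baseline. The sketch below copies the fit times and test R² values from `summary_df`; the derived column names are illustrative choices.

```python
import pandas as pd

# Values copied from summary_df above (mean fit time is seconds per CV fit).
tradeoff = pd.DataFrame({
    "method": ["Linear Regression without text", "Ridge without text", "Lasso without text"],
    "mean_fit_time_s": [33.147654, 1.339875, 1114.591888],
    "test_r2": [0.386358, 0.511234, 0.413842],
})

baseline = tradeoff.loc[tradeoff["method"] == "Linear Regression without text"].iloc[0]

# Accuracy gained and fit time spent, both relative to the baseline model.
tradeoff["r2_gain_vs_baseline"] = tradeoff["test_r2"] - baseline["test_r2"]
tradeoff["fit_time_ratio_vs_baseline"] = tradeoff["mean_fit_time_s"] / baseline["mean_fit_time_s"]

# Cheapest model first.
tradeoff = tradeoff.sort_values("mean_fit_time_s").reset_index(drop=True)
print(tradeoff.round(3))
```

In this run Ridge is both the fastest to fit and the most accurate on the test set, while Lasso takes roughly 34 times the baseline fit time for a small gain in R².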


# Random forest without textual column

In [25]:

# random_forest_pipeline = Pipeline([
#     ('preprocessor', preprocessing),
#     ('regressor', RandomForestRegressor())
# ])


# param_grid = {
#     'regressor': [RandomForestRegressor()],
#     'regressor__max_depth': [5, 10, 15],
#     'regressor__min_samples_leaf':[10, 20, 30]
# }

# grid_search = GridSearchCV(random_forest_pipeline, param_grid, cv = 5, scoring = 'r2', verbose = 42)
# grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5; 1/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

[CV 1/5; 1/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.374 total time=  29.9s
[CV 2/5; 1/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 2/5; 1/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.375 total time=  29.3s
[CV 3/5; 1/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 3/5; 1/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.376 total time=  32.5s
[CV 4/5; 1/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 4/5; 1/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.368 total time=  29.6s
[CV 5/5; 1/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 5/5; 1/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.373 total time=  29.2s
[CV 1/5; 2/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20
[CV 1/5; 2/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20;, score=0.373 total time=  29.1s
[CV 2/5; 2/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20
[CV 2/5; 2/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20;, score=0.374 total time=  32.0s
[CV 3/5; 2/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20
[CV 3/5; 2/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20;, score=0.376 total time=  29.9s
[CV 4/5; 2/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20
[CV 4/5; 2/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20;, score=0.368 total time=  29.5s
[CV 5/5; 2/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20
[CV 5/5; 2/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=20;, score=0.373 total time=  29.4s
[CV 1/5; 3/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30
[CV 1/5; 3/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30;, score=0.374 total time=  32.6s
[CV 2/5; 3/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30
[CV 2/5; 3/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30;, score=0.375 total time=  29.4s
[CV 3/5; 3/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30
[CV 3/5; 3/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30;, score=0.376 total time=  30.1s
[CV 4/5; 3/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30
[CV 4/5; 3/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30;, score=0.368 total time=  29.9s
[CV 5/5; 3/9] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30
[CV 5/5; 3/9] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=30;, score=0.373 total time=  32.7s
[CV 1/5; 4/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 1/5; 4/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.418 total time= 2.4min
[CV 2/5; 4/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 2/5; 4/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.424 total time= 2.4min
[CV 3/5; 4/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 3/5; 4/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.424 total time= 2.4min
[CV 4/5; 4/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 4/5; 4/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.418 total time= 2.4min
[CV 5/5; 4/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 5/5; 4/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.417 total time= 2.4min
[CV 1/5; 5/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20
[CV 1/5; 5/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20;, score=0.415 total time= 2.3min
[CV 2/5; 5/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20
[CV 2/5; 5/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20;, score=0.421 total time= 2.3min
[CV 3/5; 5/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20
[CV 3/5; 5/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20;, score=0.422 total time= 2.3min
[CV 4/5; 5/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20
[CV 4/5; 5/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20;, score=0.417 total time= 2.3min
[CV 5/5; 5/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20
[CV 5/5; 5/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=20;, score=0.415 total time= 2.3min
[CV 1/5; 6/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30
[CV 1/5; 6/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30;, score=0.413 total time= 2.2min
[CV 2/5; 6/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30
[CV 2/5; 6/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30;, score=0.418 total time= 2.1min
[CV 3/5; 6/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30
[CV 3/5; 6/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30;, score=0.419 total time= 2.2min
[CV 4/5; 6/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30
[CV 4/5; 6/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30;, score=0.414 total time= 2.2min
[CV 5/5; 6/9] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30
[CV 5/5; 6/9] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=30;, score=0.413 total time= 2.2min
[CV 1/5; 7/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10
[CV 1/5; 7/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10;, score=0.440 total time= 5.0min
[CV 2/5; 7/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10
[CV 2/5; 7/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10;, score=0.447 total time= 5.4min
[CV 3/5; 7/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10
[CV 3/5; 7/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10;, score=0.445 total time= 5.9min
[CV 4/5; 7/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10
[CV 4/5; 7/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10;, score=0.441 total time= 5.6min
[CV 5/5; 7/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10
[CV 5/5; 7/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=10;, score=0.439 total time= 5.6min
[CV 1/5; 8/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20
[CV 1/5; 8/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20;, score=0.435 total time= 4.7min
[CV 2/5; 8/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20
[CV 2/5; 8/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20;, score=0.441 total time= 4.7min
[CV 3/5; 8/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20
[CV 3/5; 8/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20;, score=0.438 total time= 4.7min
[CV 4/5; 8/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20
[CV 4/5; 8/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20;, score=0.436 total time= 4.7min
[CV 5/5; 8/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20
[CV 5/5; 8/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=20;, score=0.435 total time= 4.5min
[CV 1/5; 9/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30
[CV 1/5; 9/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30;, score=0.430 total time= 3.9min
[CV 2/5; 9/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30
[CV 2/5; 9/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30;, score=0.436 total time= 3.8min
[CV 3/5; 9/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30
[CV 3/5; 9/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30;, score=0.433 total time= 3.9min
[CV 4/5; 9/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30
[CV 4/5; 9/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30;, score=0.433 total time= 3.9min
[CV 5/5; 9/9] START regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30
[CV 5/5; 9/9] END regressor=RandomForestRegressor(), regressor__max_depth=15, regressor__min_samples_leaf=30;, score=0.430 total time= 3.9min



A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().



GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['price',
                                                                          'year']),
                                                                        ('category',
                                                                         Pipeline(steps=[('imputer',
                                                            

In [26]:

# best_model = grid_search.best_estimator_

# best_model.fit(X_train, y_train)

# prediction = best_model.predict(X_test)


In [27]:
# prediction_mse = mean_squared_error(y_test, prediction)
# prediction_r2 = r2_score(y_test, prediction)
# print (prediction_mse, prediction_r2)

5.0620565177349475 0.4500696572382894


In [28]:
# all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

# best_result = all_results.iloc[0]
# all_results.head()

# summary_df.loc[len(summary_df.index)] = ['Random Forest without text', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

In [29]:
# summary_df

Unnamed: 0,method,mean_fit_time,parameters,train_r2,test_MSE,test_r2
0,Linear Regression without text,33.147654,"{'regressor': LinearRegression(n_jobs=5), 'reg...",0.336666,5.648516,0.386358
1,Ridge without text,1.339875,"{'regressor': Ridge(alpha=1), 'regressor__alph...",0.496035,4.499048,0.511234
2,Lasso without text,1114.591888,"{'regressor': Lasso(alpha=0.001), 'regressor__...",0.406582,5.395524,0.413842
3,Random Forest without text,329.572775,{'regressor': RandomForestRegressor(max_depth=...,0.442481,5.062057,0.45007
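
The repeated "A column-vector y was passed" message in the Random Forest logs is scikit-learn warning that the target has shape `(n_samples, 1)` rather than `(n_samples,)`. Below is a minimal sketch of the reshaping the warning asks for. The small array is a made-up stand-in for `y_train`, not data from this notebook.

```python
import numpy as np

# Hypothetical stand-in for y_train when it is selected as a one-column table.
y_column = np.array([[87], [88], [90], [86]])
print(y_column.shape)  # (4, 1)

# ravel() flattens it to the 1-D shape (n_samples,) that the regressor expects.
y_flat = y_column.ravel()
print(y_flat.shape)  # (4,)
```

If `y_train` is a one-column pandas DataFrame, passing `y_train.values.ravel()` to `fit` should silence the warning in the same way.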


# KNN without textual column

In [30]:
knn_pipeline = Pipeline([
    ('preprocessor', preprocessing),
    ('regressor', KNeighborsRegressor())
])


param_grid = {
    'regressor': [KNeighborsRegressor()],
    'regressor__n_neighbors': [10, 20, 40, 100],
    'regressor__weights': ['uniform', 'distance']
}

grid_search = GridSearchCV(knn_pipeline, param_grid, cv = 5, scoring = 'r2', verbose = 42)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV 1/5; 1/8] START regressor=KNeighborsRegressor(), regressor__n_neighbors=10, regressor__weights=uniform
[CV 1/5; 1/8] END regressor=KNeighborsRegressor(), regressor__n_neighbors=10, regressor__weights=uniform;, score=0.478 total time=  54.8s
[CV 2/5; 1/8] START regressor=KNeighborsRegressor(), regressor__n_neighbors=10, regressor__weights=uniform
[CV 2/5; 1/8] END regressor=KNeighborsRegressor(), regressor__n_neighbors=10, regressor__weights=uniform;, score=0.480 total time=  49.0s
[CV 3/5; 1/8] START regressor=KNeighborsRegressor(), regressor__n_neighbors=10, regressor__weights=uniform
[CV 3/5; 1/8] END regressor=KNeighborsRegressor(), regressor__n_neighbors=10, regressor__weights=uniform;, score=0.486 total time=  49.8s
[CV 4/5; 1/8] START regressor=KNeighborsRegressor(), regressor__n_neighbors=10, regressor__weights=uniform
[CV 4/5; 1/8] END regressor=KNeighborsRegressor(), regressor__n_neighbors=10, regressor__weights=u

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['price',
                                                                          'year']),
                                                                        ('category',
                                                                         Pipeline(steps=[('imputer',
                                                            

In [31]:
# GridSearchCV refits the best pipeline on the full training set by default
# (refit=True), so best_estimator_ can be used for prediction directly.
best_model = grid_search.best_estimator_

prediction = best_model.predict(X_test)

In [32]:
prediction_mse = mean_squared_error(y_test, prediction)
prediction_r2 = r2_score(y_test, prediction)
print (prediction_mse, prediction_r2)

4.504860725595334 0.5106021447526385


In [35]:
all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

best_result = all_results.iloc[0]
all_results.head()

summary_df.loc[len(summary_df.index)] = ['KNN without text', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

Unnamed: 0,method,mean_fit_time,parameters,train_r2,test_MSE,test_r2
0,Linear Regression without text,33.147654,"{'regressor': LinearRegression(n_jobs=5), 'reg...",0.336666,5.648516,0.386358
1,Ridge without text,1.339875,"{'regressor': Ridge(alpha=1), 'regressor__alph...",0.496035,4.499048,0.511234
2,Lasso without text,1114.591888,"{'regressor': Lasso(alpha=0.001), 'regressor__...",0.406582,5.395524,0.413842
3,Random Forest without text,329.572775,{'regressor': RandomForestRegressor(max_depth=...,0.442481,5.062057,0.45007
5,KNN without text,0.532665,{'regressor': KNeighborsRegressor(n_neighbors=...,0.496814,4.504861,0.510602
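
The grid above searches over `weights='uniform'` and `weights='distance'`. To make the difference concrete, here is a pure-Python sketch of a one-dimensional nearest-neighbour regressor. The function `knn_predict` and the (price, points) pairs are hypothetical and only illustrate the idea; this is not how `KNeighborsRegressor` is implemented internally.

```python
def knn_predict(train, query, k, weights="uniform"):
    """Predict a target for `query` from the k nearest 1-D training points."""
    # Sort (x, y) pairs by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    if weights == "uniform":
        # Every neighbour contributes equally.
        return sum(y for _, y in nearest) / len(nearest)
    # "distance": weight each neighbour by the inverse of its distance.
    dists = [abs(x - query) for x, _ in nearest]
    if any(d == 0 for d in dists):
        # Exact matches would get infinite weight, so average only those.
        exact = [y for (_, y), d in zip(nearest, dists) if d == 0]
        return sum(exact) / len(exact)
    inv = [1.0 / d for d in dists]
    return sum(w * y for w, (_, y) in zip(inv, nearest)) / sum(inv)


# Made-up (price, points) pairs, for illustration only.
train = [(10.0, 84.0), (12.0, 85.0), (20.0, 88.0), (40.0, 91.0)]
uniform = knn_predict(train, query=13.0, k=3, weights="uniform")
distance = knn_predict(train, query=13.0, k=3, weights="distance")
print(uniform, distance)
```

With distance weighting the prediction is pulled towards the closest neighbour (the wine priced at 12), so it ends up lower than the plain average of the three neighbours.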


# Linear regression with bag of words

In [38]:

from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words features built from the review text, capped at 1500 terms
text_transformer = Pipeline(steps=[
    ('countvectorizer', CountVectorizer(max_features=1500))
])

preprocessing = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical_features),
        ('category', category_transformer, categorical_features),
        ('text', text_transformer, text_feature)
    ], remainder='drop')

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessing),
    ('regressor', LinearRegression())
])

print(pipeline)

# Note: n_jobs only controls how many cores are used, not the fitted model,
# so the three n_jobs values yield equivalent candidates.
param_grid = [{
    'regressor': [LinearRegression()],
    'regressor__fit_intercept': [True, False],
    'regressor__n_jobs': [5, 10, 15]
}]

# Perform grid search on linear regression
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', verbose=42)
grid_search.fit(X_train, y_train)



Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['price', 'year']),
                                                 ('category',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Unknown',
                                                                                 strategy='constant')),
                                                                  ('onehotencoding',
                                                                   O

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['price',
                                                                          'year']),
                                                                        ('category',
                                                                         Pipeline(steps=[('imputer',
                                                            

In [39]:
# best_estimator_ is already refit on X_train (GridSearchCV's default refit=True)
best_model = grid_search.best_estimator_
prediction = best_model.predict(X_test)

In [40]:
prediction_mse = mean_squared_error(y_test, prediction)
prediction_r2 = r2_score(y_test, prediction)
print (prediction_mse, prediction_r2)

3.0901796494252696 0.6642898893266205


In [44]:
all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

best_result = all_results.iloc[0]
all_results.head()

summary_df.loc[len(summary_df.index)] = ['Linear Regression with bag of words', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

In [53]:
summary_df

Unnamed: 0,method,mean_fit_time,parameters,train_r2,test_MSE,test_r2
0,Linear Regression without text,33.147654,"{'regressor': LinearRegression(n_jobs=5), 'reg...",0.336666,5.648516,0.386358
1,Ridge without text,1.339875,"{'regressor': Ridge(alpha=1), 'regressor__alph...",0.496035,4.499048,0.511234
2,Lasso without text,1114.591888,"{'regressor': Lasso(alpha=0.001), 'regressor__...",0.406582,5.395524,0.413842
3,Random Forest without text,329.572775,{'regressor': RandomForestRegressor(max_depth=...,0.442481,5.062057,0.45007
4,KNN without text,0.532665,{'regressor': KNeighborsRegressor(n_neighbors=...,0.496814,4.504861,0.510602
5,Linear Regression with bag of words,138.784838,"{'regressor': LinearRegression(n_jobs=5), 'reg...",0.634735,3.09018,0.66429
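
The jump in test R2 comes from the bag-of-words columns added by `CountVectorizer(max_features=1500)`. The sketch below shows the idea in pure Python: count tokens, keep only the most frequent ones, and turn each document into a row of counts. The helper names, the three short descriptions, and the alphabetical tie-breaking are assumptions made for this illustration; scikit-learn's implementation is sparse and differs in its details.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens across all documents."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    # Ties are broken alphabetically; this is a choice made for the sketch.
    top = sorted(counts, key=lambda tok: (-counts[tok], tok))[:max_features]
    return sorted(top)


def transform(docs, vocabulary):
    """Turn each document into a list of counts, one per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[tok] for tok in vocabulary])
    return rows


# Made-up descriptions, for illustration only.
docs = [
    "ripe fruit and soft tannins",
    "fresh fruit with crisp acidity",
    "ripe ripe fruit",
]
vocab = fit_vocabulary(docs, max_features=3)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

Each row of `matrix` plays the role of the extra numeric columns that the `'text'` transformer appends to the price, year, and one-hot features.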


# Linear regression with TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF features built from the review text, with English stop words removed
text_transformer = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(stop_words='english'))
])

preprocessing = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical_features),
        ('category', category_transformer, categorical_features),
        ('text', text_transformer, text_feature)
    ], remainder='drop')

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessing),
    ('regressor', LinearRegression())
])

print(pipeline)

param_grid = [{
    'regressor': [LinearRegression()],
    'regressor__fit_intercept': [True, False],
    'regressor__n_jobs':[5,10,15]
}]

# Perform Grid Search on linear regression
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2',verbose=42)
grid_search.fit(X_train, y_train)


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['price', 'year']),
                                                 ('category',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Unknown',
                                                                                 strategy='constant')),
                                                                  ('onehotencoding',
                                                                   O

In [None]:
# best_estimator_ is already refit on X_train (GridSearchCV's default refit=True)
best_model = grid_search.best_estimator_
prediction = best_model.predict(X_test)

In [None]:
prediction_mse = mean_squared_error(y_test, prediction)
prediction_r2 = r2_score(y_test, prediction)
print (prediction_mse, prediction_r2)

In [None]:
all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

best_result = all_results.iloc[0]
all_results.head()

summary_df.loc[len(summary_df.index)] = ['Linear Regression with TFIDF', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

In [None]:
summary_df
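
TF-IDF replaces raw counts with weights that down-weight terms appearing in many documents. The sketch below applies the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is what the scikit-learn documentation describes for the `TfidfVectorizer` defaults. The function name and the document frequencies are made up for illustration.

```python
import math


def tfidf_row(term_counts, doc_freq, n_docs):
    """Weight raw counts by smoothed IDF, then scale the row to unit L2 norm."""
    weighted = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(v * v for v in weighted.values()))
    if norm == 0:
        # An empty document stays all-zero.
        return weighted
    return {term: v / norm for term, v in weighted.items()}


# Made-up document frequencies over a corpus of 3 documents.
n_docs = 3
doc_freq = {"fruit": 3, "ripe": 2, "acidity": 1}
row = tfidf_row({"fruit": 1, "acidity": 1}, doc_freq, n_docs)
print(row)
```

Both terms occur once in the document, but "acidity" receives the larger weight because it appears in fewer documents than "fruit".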

# Ridge with BoW

In [54]:
from sklearn.linear_model import Lasso, Ridge

ridge_pipeline = Pipeline([
    ('preprocessor', preprocessing),
    ('regressor', Ridge())
])


param_grid = {
    'regressor': [Ridge()],
    'regressor__alpha': [0.001,0.01,0.1,1,10]
}

grid_search = GridSearchCV(ridge_pipeline, param_grid, cv = 5, scoring = 'r2', verbose = 42)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 1/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.667 total time=  11.1s
[CV 2/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 2/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.666 total time=  11.4s
[CV 3/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 3/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.673 total time=  11.2s
[CV 4/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 4/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.664 total time=   9.4s
[CV 5/5; 1/5] START regressor=Ridge(), regressor__alpha=0.001...................
[CV 5/5; 1/5] END regressor=Ridge(), regressor__alpha=0.001;, score=0.672 total time=   9.3s
[CV 1/5; 2/5] START regressor=Ridge(), regressor__alpha=0.01..........

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['price',
                                                                          'year']),
                                                                        ('category',
                                                                         Pipeline(steps=[('imputer',
                                                            

In [55]:
# best_estimator_ is already refit on X_train (GridSearchCV's default refit=True)
best_model = grid_search.best_estimator_
prediction = best_model.predict(X_test)

In [56]:
prediction_mse = mean_squared_error(y_test, prediction)
prediction_r2 = r2_score(y_test, prediction)
print (prediction_mse, prediction_r2)

2.3059104010313902 0.7494911222792112


In [57]:
all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

best_result = all_results.iloc[0]
all_results.head()

summary_df.loc[len(summary_df.index)] = ['Ridge with bag of words', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

In [62]:
summary_df

{'regressor': RandomForestRegressor(max_depth=15, min_samples_leaf=10),
 'regressor__max_depth': 15,
 'regressor__min_samples_leaf': 10}
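
The Ridge grid above varies `alpha` from 0.001 to 10. As a rough illustration of what that parameter does, the sketch below uses the closed-form ridge solution for a single centred feature without intercept, `w = sum(x*y) / (sum(x*x) + alpha)`. The data points and the helper `ridge_slope` are hypothetical.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge slope for one centred feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # alpha is added to the denominator, which shrinks the slope towards zero.
    return sxy / (sxx + alpha)


# Made-up centred data with a true slope close to 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

slopes = [ridge_slope(xs, ys, alpha) for alpha in [0.001, 0.01, 0.1, 1, 10]]
for alpha, slope in zip([0.001, 0.01, 0.1, 1, 10], slopes):
    print(alpha, round(slope, 4))
```

On this toy data small values of `alpha` leave the slope almost untouched, while `alpha=10` roughly halves it. With 1500 correlated word-count columns, that shrinkage is one plausible reason Ridge outperforms plain linear regression here.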

# Random forest with bag of words

In [5]:

# random_forest_pipeline = Pipeline([
#     ('preprocessor', preprocessing),
#     ('regressor', RandomForestRegressor())
# ])


# param_grid = {
#     'regressor': [RandomForestRegressor()],
#     'regressor__max_depth': [5, 10],
#     'regressor__min_samples_leaf':[5, 10]
# }

# grid_search = GridSearchCV(random_forest_pipeline, param_grid, cv = 5, scoring = 'r2', verbose = 42)
# grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5; 1/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

[CV 1/5; 1/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5;, score=0.374 total time=  30.5s
[CV 2/5; 1/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5
[CV 2/5; 1/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5;, score=0.375 total time=  30.0s
[CV 3/5; 1/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5
[CV 3/5; 1/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5;, score=0.376 total time=  32.4s
[CV 4/5; 1/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5
[CV 4/5; 1/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5;, score=0.368 total time=  32.3s
[CV 5/5; 1/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5
[CV 5/5; 1/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=5;, score=0.373 total time=  30.3s
[CV 1/5; 2/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 1/5; 2/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.374 total time=  30.2s
[CV 2/5; 2/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 2/5; 2/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.376 total time=  32.9s
[CV 3/5; 2/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 3/5; 2/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.376 total time=  30.7s
[CV 4/5; 2/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 4/5; 2/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.368 total time=  30.6s
[CV 5/5; 2/4] START regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10
[CV 5/5; 2/4] END regressor=RandomForestRegressor(), regressor__max_depth=5, regressor__min_samples_leaf=10;, score=0.373 total time=  31.5s
[CV 1/5; 3/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5
[CV 1/5; 3/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5;, score=0.419 total time= 2.8min
[CV 2/5; 3/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5
[CV 2/5; 3/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5;, score=0.426 total time= 2.4min
[CV 3/5; 3/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5
[CV 3/5; 3/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5;, score=0.426 total time= 2.5min
[CV 4/5; 3/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5
[CV 4/5; 3/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5;, score=0.418 total time= 2.6min
[CV 5/5; 3/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5
[CV 5/5; 3/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=5;, score=0.418 total time= 2.6min
[CV 1/5; 4/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 1/5; 4/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.418 total time= 2.6min
[CV 2/5; 4/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 2/5; 4/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.425 total time= 2.5min
[CV 3/5; 4/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 3/5; 4/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.425 total time= 2.5min
[CV 4/5; 4/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 4/5; 4/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.418 total time= 2.5min
[CV 5/5; 4/4] START regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10
[CV 5/5; 4/4] END regressor=RandomForestRegressor(), regressor__max_depth=10, regressor__min_samples_leaf=10;, score=0.417 total time= 2.5min

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['price',
                                                                          'year']),
                                                                        ('category',
                                                                         Pipeline(steps=[('imputer',
                                                            

In [6]:

# best_model = grid_search.best_estimator_

# best_model.fit(X_train, y_train)

# prediction = best_model.predict(X_test)


In [9]:
# from sklearn.metrics import r2_score
# prediction_mse = mean_squared_error(y_test, prediction)
# prediction_r2 = r2_score(y_test, prediction)
# print (prediction_mse, prediction_r2)

5.253596249084397 0.4292612151072204


In [12]:
all_results = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score',ascending=True)

best_result = all_results.iloc[0]
all_results.head()

summary_df.loc[len(summary_df.index)] = ['Random Forest with bag of words', best_result.mean_fit_time, best_result.params, best_result.mean_test_score, prediction_mse, prediction_r2]

In [25]:
# Row 6 holds the Ridge results (test MSE 2.3059, test R2 0.7495), so it is labelled accordingly
summary_df.loc[6] = ["Ridge with bag of words",4.4,"{'regressor': [Ridge(alpha=10)],'regressor__alpha': [0.001, 0.01, 0.1, 1, 10]}",0.749,2.3059104010313902,0.7494911222792112]

In [27]:
summary_df.to_csv("summary_df.csv")
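
MSE is in squared points, which is hard to read against the points scale. The sketch below takes the test metrics recorded in the outputs above (the Ridge bag-of-words values are rounded to six decimals), converts MSE to RMSE, and ranks the models. The list `results` is typed in by hand for illustration rather than read from `summary_df`.

```python
import math

# (method, test_MSE, test_r2) copied from the outputs recorded above.
results = [
    ("Linear Regression without text", 5.648516, 0.386358),
    ("Ridge without text", 4.499048, 0.511234),
    ("Lasso without text", 5.395524, 0.413842),
    ("Random Forest without text", 5.062057, 0.45007),
    ("KNN without text", 4.504861, 0.510602),
    ("Linear Regression with bag of words", 3.09018, 0.66429),
    ("Ridge with bag of words", 2.305910, 0.749491),
]

# Rank from lowest to highest test error.
ranked = sorted(results, key=lambda r: r[1])
for method, mse, r2 in ranked:
    # RMSE is in the same unit as the target (wine points).
    print(f"{method:<38} RMSE={math.sqrt(mse):.3f}  R2={r2:.3f}")
```

Read this way, the best model so far (Ridge with bag of words) is off by roughly 1.5 points on a typical test wine, versus about 2.4 points for the plain linear regression baseline.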

In [None]:
# ridge with tfidf
