# Selected Topics in Statistics ICA 2 - Patrick Leask

## Task A1

*1) Can we use the sommelier/wine data to create an AI with super-human performance in wine tasting?*


*2) Which components of wine make a wine a good wine?*
- There may be interactions between the components of wine that make it impossible to establish how variation in a single component affects the score without knowledge of the dependencies between the components.
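
To illustrate the point about interactions, here is a minimal sketch on synthetic data (not the wine data). The components `a` and `b` and the `score` formula are hypothetical, chosen only so that the score depends on the product of the two components.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two hypothetical, standardised components (not columns from the wine data).
a = rng.normal(size=n)
b = rng.normal(size=n)

# Hypothetical score driven purely by the interaction of the two components.
score = a * b + 0.1 * rng.normal(size=n)

# Marginally, component `a` looks unrelated to the score...
marginal_corr = np.corrcoef(a, score)[0, 1]

# ...but conditional on the sign of `b` its effect is strong and changes direction.
corr_b_pos = np.corrcoef(a[b > 0], score[b > 0])[0, 1]
corr_b_neg = np.corrcoef(a[b < 0], score[b < 0])[0, 1]

print(f"corr(a, score) overall: {marginal_corr:+.2f}")
print(f"corr(a, score) | b > 0: {corr_b_pos:+.2f}")
print(f"corr(a, score) | b < 0: {corr_b_neg:+.2f}")
```

Looking at `a` alone would suggest it has no effect on the score, even though it matters a great deal once `b` is known.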

*3) Can the AI use the data to create the perfect wine, i.e. wine whose quality exceeds all that we have seen?*
- As in the second question, I expect there to be complex interactions between the components of wine that do not allow extrapolation to regions that the AI does not have data for.
- It is unlikely that the only factors determining the quality of a wine are those in this data set. If we took water and added chemicals to it until we matched the levels found in Château Lafite Rothschild, we would still not have created a wine. Even when starting with wine, rebalancing the quantities measured in the data will not necessarily create a better wine.
- The question asks whether, given the data, the AI can create the perfect wine. This is poorly worded, as an entirely random wine-generating process *can* create the perfect wine. A more precise question is whether the AI would know the ranges of values that result in a rating of 10. We cannot answer this with the data provided, in which the highest observed quality is 9. Even with infinite data, the rating must still be treated as a random variable, so we cannot say with certainty that the wine would receive a higher rating (see the next question).
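
The last point, that a rating is a random variable, can be sketched with a small simulation. Everything here is an assumption made for illustration: the latent qualities, the Gaussian noise with standard deviation 0.8, and the rounding to an integer score are not estimated from the data.

```python
import random

random.seed(0)


def observed_rating(latent_quality, noise_sd=0.8):
    """One taster's integer rating: latent quality plus noise, rounded and clipped to 0-10."""
    noisy = random.gauss(latent_quality, noise_sd)
    return min(10, max(0, round(noisy)))


# Hypothetical latent qualities: wine B is genuinely better than wine A.
latent_a, latent_b = 6.0, 7.0
trials = 20_000

# Count how often the better wine receives the strictly higher rating.
b_wins = sum(
    observed_rating(latent_b) > observed_rating(latent_a) for _ in range(trials)
)
p_b_rated_higher = b_wins / trials
print(f"P(better wine gets the strictly higher rating) ~ {p_b_rated_higher:.2f}")
```

Under these assumptions the better wine is rated strictly higher in only a fraction of tastings, so even a correctly identified "better" recipe does not guarantee a higher score.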

*4) Is human perception of wine entirely subjective? If so, what would it be that AIs could learn from humans?*
- Human tastes are highly subjective.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

pd.set_option('display.precision', 3)  # full option name; the bare 'precision' is ambiguous in newer pandas

wine_types = ['red', 'white']
all_data = pd.concat([pd.read_csv("./winequality/winequality-{0}.csv".format(wine_type), sep=';').assign(colour=wine_type) for wine_type in wine_types])
all_data.head()

numeric_titles = list(all_data)
numeric_titles.remove('colour')
numeric_titles.remove('quality')

## Task A2

In [2]:
data_description = all_data.describe()

display(data_description)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215 | 0.34 | 0.319 | 5.443 | 0.056 | 30.525 | 115.745 | 0.995 | 3.219 | 0.531 | 10.492 | 5.818 |
| std | 1.296 | 0.165 | 0.145 | 4.758 | 0.035 | 17.749 | 56.522 | 0.003 | 0.161 | 0.149 | 1.193 | 0.873 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.987 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.992 | 3.11 | 0.43 | 9.5 | 5.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.995 | 3.21 | 0.51 | 10.3 | 6.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.997 | 3.32 | 0.6 | 11.3 | 6.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.039 | 4.01 | 2.0 | 14.9 | 9.0 |


In [3]:
import plotly
from plotly import tools
import plotly.graph_objs as go

plotly.offline.init_notebook_mode(connected=True)

def col_hist(col_name):
    """
    Plots a histogram for the column.
    """
    col_max = all_data[col_name].max()
    col_min = all_data[col_name].min()

    step = (col_max - col_min) / 15

    trace1 = go.Histogram(
        x = all_data[all_data['colour']=='red'][col_name], 
        name = 'red'.title(),
        opacity = 0.75,
        xbins={
            'start': col_min,
            'end': col_max,
            'size': step
        },
        histnorm='probability', 
        marker={
            'color':'#900020'
        }
    )

    trace2 = go.Histogram(
        x = all_data[all_data['colour']=='white'][col_name],
        name = 'white'.title(),
        opacity = 0.75,
        xbins={
            'start': col_min,
            'end': col_max,
            'size': step
        },
        histnorm='probability', 
        marker={
            'color':'#D1B78F'
        }
    )

    histogram_data = [trace1, trace2]
    layout = go.Layout(
        xaxis={
            'title': col_name.title()
        },
        yaxis={
            'title':'Proportion'
        },
        bargap=0.2,
        bargroupgap=0.1
    )
    this_fig = go.Figure(data=histogram_data, layout=layout)
    plotly.offline.iplot(this_fig)

#hist_plots = [col_hist(col_name) for col_name in list(data_description)]

Some of the histograms indicate that it may be useful to transform the data. By the law of mass action, we should transform all chemical balance ratios with the logarithm, which turns proportional changes into absolute ones. This should improve performance for additive models, which consider absolute, not proportional, change. The exceptions are listed below, as they already appear to be approximately normally distributed, or at least not heavily right-skewed.
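
As a rough illustration of why such a transform helps, the sketch below compares the skewness of a synthetic right-skewed variable before and after a square-root and a log transform. The data are drawn from a log-normal distribution purely for demonstration and the `skewness` helper is my own; nothing here is computed from the wine data. Note that the logarithm needs strictly positive values, which matters for a column such as citric acid, whose minimum in the summary table is 0.0.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the third standardised moment."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return float(np.mean(centred**3) / np.mean(centred**2) ** 1.5)


rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed "concentration" (not the wine data).
concentration = rng.lognormal(mean=0.0, sigma=0.8, size=5_000)

raw_skew = skewness(concentration)
sqrt_skew = skewness(np.sqrt(concentration))
log_skew = skewness(np.log(concentration))

print(f"raw skewness:  {raw_skew:.2f}")
print(f"sqrt skewness: {sqrt_skew:.2f}")
print(f"log skewness:  {log_skew:.2f}")
```

For data of this shape the square root reduces the skew and the logarithm removes almost all of it.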

In [4]:
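# Columns left untransformed; names are case-sensitive and must match the DataFrame.
log_exceptions = [
    'volatile acidity',
    'total sulfur dioxide',
    'density',
    'pH',  # the column is 'pH', not 'ph'
    'alcohol',
    'citric acid'
]

remaining = [title for title in numeric_titles if title not in log_exceptions]

# Note: a square root is applied here, not the logarithm described in the text.
all_data[remaining] = all_data[remaining].apply(np.sqrt)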

#hist_plots = [col_hist(col_name) for col_name in list(data_description)]

In [5]:
def col_scatter(col_name):
    """
    Plots quality against the column as a scatter plot, split by wine colour.
    """
    
    trace1 = go.Scattergl(
        x = all_data[all_data['colour']=='red'][col_name],
        y = all_data[all_data['colour']=='red']['quality'],
        name = 'red'.title(),
        mode = 'markers',
        marker={
            'size': 10,
            'color':'#900020',
            'opacity': 0.1
        }
    )

    trace2 = go.Scattergl(
        x = all_data[all_data['colour']=='white'][col_name],
        y = all_data[all_data['colour']=='white']['quality'],
        name = 'white'.title(),
        mode = 'markers',
        marker={
            'size': 10,
            'color':'#D1B78F', 
            'opacity': 0.1
        }
    )

    histogram_data = [trace1, trace2]
    layout = go.Layout(
        xaxis={
            'title': col_name.title()
        },
        yaxis={
            'title':'Quality'
        },
        bargap=0.2,
        bargroupgap=0.1
    )
    this_fig = go.Figure(data=histogram_data, layout=layout)
    plotly.offline.iplot(this_fig)
    
#scatter_plots = [col_scatter(col_name) for col_name in list(data_description)]

In [6]:
# import matplotlib.pyplot as plt

# axes = pd.plotting.scatter_matrix(all_data, alpha=0.2, figsize=(100, 150))
# plt.tight_layout()
# plt.savefig('scatter_matrix.png')

## Process for tuning and validation
- Split the data into a training and test set
- On the training set, perform cross validation for fitting the models
- Make predictions on the test set
- Evaluate those predictions
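
The four steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data from `make_regression`, with a deliberately tiny hyperparameter grid; the variable names (`X_demo`, `X_tr`, `search`, and so on) are placeholders and not the ones used in the cells below.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the features and target (not the wine data).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

# 1. Hold out a test set that is never touched during tuning.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)

# 2. Cross-validate on the training set only to choose hyperparameters.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "min_samples_split": [2, 4]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)

# 3. Predict on the held-out test set with the refitted best model.
y_pred_demo = search.predict(X_te)

# 4. Evaluate those predictions.
cv_mse = -search.best_score_
test_mse = mean_squared_error(y_te, y_pred_demo)
print("best params:", search.best_params_)
print(f"CV MSE: {cv_mse:.1f}, test MSE: {test_mse:.1f}")
```

Keeping the test set out of step 2 is what makes the final score an honest estimate of performance on unseen wines.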

In [7]:
import math

def print_classification_results(clf, y_true, y_pred):
    display(pd.DataFrame(clf.cv_results_))
    
    from sklearn.metrics import classification_report, accuracy_score
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Use the labels and predictions passed in rather than the globals X_test / y_test.
    print(classification_report(y_true, y_pred))
    print('Accuracy Score:', accuracy_score(y_true, y_pred))
    print()
    print()
    
def print_regression_results(reg, y_test, X_test):
    from sklearn.metrics import mean_squared_error
    print("Best parameters set found on development set:")
    print()
    print(reg.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = reg.cv_results_['mean_test_score']
    stds = reg.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, reg.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()   
    
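    # Scoring is negative MSE, so this is the RMSE of the best CV candidate.
    print('best cv', math.sqrt(-reg.best_score_))
    # This is the test MSE (no square root), so it is on a different scale from the line above.
    print("best test", mean_squared_error(y_test, reg.predict(X_test)))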
    
    
    display(pd.DataFrame({'true':y_test, 'pred': reg.predict(X_test)}))

In [28]:
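from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import scale

# Note: this standardises with the mean and variance of the full data set,
# so rows that later land in the test split also influence the scaling.
all_data[numeric_titles] = scale(all_data[numeric_titles])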

no_colour = all_data.drop(['colour'], axis=1)

def data_split(data):
    y = data['quality']
    X = data.drop("quality", axis=1)
    
    return train_test_split(
        X, y, test_size=0.4, random_state=0
    )

# testing it without colour

X_train, X_test, y_train, y_test = data_split(no_colour)

# Optionally train on a subset by narrowing these slices; [:] keeps every row.
X_train = X_train[:]
y_train = y_train[:]

# # testing just with svm
# svc_pars = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
#                      'C': [1, 10, 100, 1000]}]

# svc = GridSearchCV(SVC(), svc_pars, cv=3)
# svc.fit(X_train, y_train)

# from sklearn.ensemble import RandomForestClassifier

# rfc_pars = [{}]

# rfc = GridSearchCV(RandomForestClassifier(), rfc_pars, cv=3)
# rfc.fit(X_train, y_train)
    
# print_classification_results(svc, y_test, svc.predict(X_test))
# print_classification_results(rfc, y_test, rfc.predict(X_test))

In [54]:
from sklearn.svm import SVR
from sklearn.dummy import DummyRegressor

def fitter(model, params, samples, X_train, y_train):
    model = RandomizedSearchCV(
        model(), params, cv=5,
        scoring='neg_mean_squared_error',
        n_iter=samples,
        n_jobs=8,
        pre_dispatch=8,
        random_state = 0,
        verbose=10,
        return_train_score=False
    )
    model.fit(X_train, y_train)
    return model

# svr_pars = {
#     'kernel': ['poly','sigmoid','rbf'], 
#     'gamma': [2**(x) for x in range(-15, 4)],
#     'C': [2**(x) for x in range(-4, 15)]
# }
# svr = fitter(SVR, svr_pars, 10, X_train, y_train)

# dummy_pars = {}
# dum = fitter(DummyRegressor, dummy_pars, 1, X_train, y_train)

# print_regression_results(svr, y_test, X_test)
# print_regression_results(dum, y_test, X_test)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] kernel=poly, gamma=3.0517578125e-05, C=256 ......................
[CV] kernel=poly, gamma=3.0517578125e-05, C=256 ......................
[CV] kernel=poly, gamma=3.0517578125e-05, C=256 ......................
[CV] kernel=poly, gamma=3.0517578125e-05, C=256 ......................
[CV] kernel=poly, gamma=3.0517578125e-05, C=256 ......................
[CV] kernel=sigmoid, gamma=1, C=32 ...................................
[CV] kernel=sigmoid, gamma=1, C=32 ...................................
[CV] kernel=sigmoid, gamma=1, C=32 ...................................
[CV]  kernel=poly, gamma=3.0517578125e-05, C=256, score=-0.6800890606116397, total=   1.4s
[CV] kernel=sigmoid, gamma=1, C=32 ...................................
[CV]  kernel=poly, gamma=3.0517578125e-05, C=256, score=-0.7684604865082753, total=   2.1s
[CV]  kernel=poly, gamma=3.0517578125e-05, C=256, score=-0.7207684903992321, total=   2.1s
[CV] kernel=sigmoid, gamma=

[Parallel(n_jobs=8)]: Done   5 tasks      | elapsed:    2.4s


[CV]  kernel=sigmoid, gamma=1, C=32, score=-56304451.662957154, total=   5.5s
[CV] kernel=sigmoid, gamma=0.125, C=1024 .............................
[CV]  kernel=sigmoid, gamma=1, C=32, score=-52356521.87841981, total=   7.0s
[CV] kernel=sigmoid, gamma=0.125, C=1024 .............................
[CV]  kernel=sigmoid, gamma=1, C=32, score=-59872149.47617393, total=   7.2s
[CV] kernel=sigmoid, gamma=0.00390625, C=512 .........................
[CV]  kernel=sigmoid, gamma=0.125, C=1024, score=-2799851440.917983, total=   5.9s
[CV] kernel=sigmoid, gamma=0.00390625, C=512 .........................
[CV]  kernel=sigmoid, gamma=1, C=32, score=-54497504.056564055, total=   6.9s
[CV] kernel=sigmoid, gamma=0.00390625, C=512 .........................


[Parallel(n_jobs=8)]: Done  10 tasks      | elapsed:    8.6s


[CV]  kernel=sigmoid, gamma=0.125, C=1024, score=-2771244001.5424128, total=   6.4s
[CV] kernel=sigmoid, gamma=0.00390625, C=512 .........................
[CV]  kernel=sigmoid, gamma=0.125, C=1024, score=-2757858206.6244054, total=   6.7s
[CV] kernel=sigmoid, gamma=0.00390625, C=512 .........................
[CV]  kernel=sigmoid, gamma=1, C=32, score=-56382150.21440843, total=   7.3s
[CV] kernel=sigmoid, gamma=0.0001220703125, C=16384 ..................
[CV]  kernel=sigmoid, gamma=0.125, C=1024, score=-2610987383.3912554, total=   7.0s
[CV] kernel=sigmoid, gamma=0.0001220703125, C=16384 ..................
[CV]  kernel=sigmoid, gamma=0.125, C=1024, score=-2729612370.7931237, total=   6.0s
[CV] kernel=sigmoid, gamma=0.0001220703125, C=16384 ..................
[CV]  kernel=sigmoid, gamma=0.00390625, C=512, score=-145.49173722090822, total=  11.2s
[CV] kernel=sigmoid, gamma=0.0001220703125, C=16384 ..................
[CV]  kernel=sigmoid, gamma=0.00390625, C=512, score=-69.66578333479075, 

[Parallel(n_jobs=8)]: Done  17 tasks      | elapsed:   20.5s


[CV] kernel=sigmoid, gamma=2, C=1 ....................................
[CV]  kernel=sigmoid, gamma=0.00390625, C=512, score=-59.79835427220859, total=  12.1s
[CV] kernel=sigmoid, gamma=2, C=1 ....................................
[CV]  kernel=sigmoid, gamma=0.0001220703125, C=16384, score=-0.5582867353760751, total=  14.9s
[CV] kernel=sigmoid, gamma=2, C=1 ....................................
[CV]  kernel=sigmoid, gamma=2, C=1, score=-66124.30811776505, total=   5.7s
[CV] kernel=sigmoid, gamma=2, C=1 ....................................
[CV]  kernel=sigmoid, gamma=2, C=1, score=-61713.05911494825, total=   6.6s
[CV] kernel=rbf, gamma=0.015625, C=64 ................................
[CV]  kernel=sigmoid, gamma=2, C=1, score=-68292.97982765884, total=   7.1s
[CV] kernel=rbf, gamma=0.015625, C=64 ................................


[Parallel(n_jobs=8)]: Done  24 tasks      | elapsed:   28.0s


[CV]  kernel=sigmoid, gamma=0.0001220703125, C=16384, score=-0.5090954251455305, total=  15.9s
[CV] kernel=rbf, gamma=0.015625, C=64 ................................
[CV]  kernel=sigmoid, gamma=0.0001220703125, C=16384, score=-0.5218591685091672, total=  16.4s
[CV] kernel=rbf, gamma=0.015625, C=64 ................................
[CV]  kernel=sigmoid, gamma=2, C=1, score=-65685.28982616334, total=   7.9s
[CV] kernel=rbf, gamma=0.015625, C=64 ................................
[CV]  kernel=sigmoid, gamma=2, C=1, score=-61865.22559598792, total=   6.8s
[CV] kernel=poly, gamma=0.03125, C=64 ................................
[CV]  kernel=sigmoid, gamma=0.0001220703125, C=16384, score=-0.4629947473369776, total=  15.1s
[CV] kernel=poly, gamma=0.03125, C=64 ................................
[CV]  kernel=sigmoid, gamma=0.0001220703125, C=16384, score=-0.5322388500891814, total=  16.1s
[CV] kernel=poly, gamma=0.03125, C=64 ................................
[CV]  kernel=rbf, gamma=0.015625, C=64, sc

[Parallel(n_jobs=8)]: Done  33 tasks      | elapsed:   41.9s


[CV]  kernel=rbf, gamma=0.015625, C=64, score=-0.41647720007044947, total=  13.0s
[CV] kernel=rbf, gamma=0.015625, C=2 .................................
[CV]  kernel=rbf, gamma=0.015625, C=64, score=-0.4732003619922087, total=  13.7s
[CV] kernel=rbf, gamma=0.015625, C=2 .................................
[CV]  kernel=rbf, gamma=0.015625, C=2, score=-0.49201430571091476, total=   4.8s
[CV] kernel=rbf, gamma=0.015625, C=2 .................................
[CV]  kernel=rbf, gamma=0.015625, C=2, score=-0.4796977694398022, total=   5.4s
[CV] kernel=rbf, gamma=0.015625, C=2 .................................
[CV]  kernel=poly, gamma=0.03125, C=64, score=-0.5918390840970345, total=  14.9s
[CV] kernel=poly, gamma=0.00390625, C=256 ............................
[CV]  kernel=poly, gamma=0.00390625, C=256, score=-0.605006916622956, total=   2.5s
[CV] kernel=poly, gamma=0.00390625, C=256 ............................
[CV]  kernel=rbf, gamma=0.015625, C=2, score=-0.470518578545506, total=   4.9s
[CV] k

[Parallel(n_jobs=8)]: Done  42 tasks      | elapsed:   52.2s


[CV]  kernel=rbf, gamma=0.015625, C=2, score=-0.49174958843554367, total=   4.6s
[CV]  kernel=poly, gamma=0.00390625, C=256, score=-0.5905956628596396, total=   3.0s
[CV]  kernel=poly, gamma=0.03125, C=64, score=-0.5694817265571347, total=  18.7s
[CV]  kernel=poly, gamma=0.00390625, C=256, score=-0.5839246570380889, total=   3.3s
[CV]  kernel=poly, gamma=0.00390625, C=256, score=-0.5466767769151896, total=   2.9s
[CV]  kernel=poly, gamma=0.00390625, C=256, score=-0.6138470498545315, total=   3.3s
[CV]  kernel=poly, gamma=0.03125, C=64, score=-0.7960832425573161, total=  16.3s
[CV]  kernel=poly, gamma=0.03125, C=64, score=-0.5282235344173725, total=  19.5s


[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:   58.5s finished


Best parameters set found on development set:

{'kernel': 'rbf', 'gamma': 0.015625, 'C': 64}

Grid scores on development set:

-0.746 (+/-0.080) for {'kernel': 'poly', 'gamma': 3.0517578125e-05, 'C': 256}
-55882782.614 (+/-4956506.859) for {'kernel': 'sigmoid', 'gamma': 1, 'C': 32}
-2733943318.316 (+/-130952354.238) for {'kernel': 'sigmoid', 'gamma': 0.125, 'C': 1024}
-69.312 (+/-80.677) for {'kernel': 'sigmoid', 'gamma': 0.00390625, 'C': 512}
-0.517 (+/-0.063) for {'kernel': 'sigmoid', 'gamma': 0.0001220703125, 'C': 16384}
-64736.666 (+/-5127.596) for {'kernel': 'sigmoid', 'gamma': 2, 'C': 1}
-0.459 (+/-0.045) for {'kernel': 'rbf', 'gamma': 0.015625, 'C': 64}
-0.605 (+/-0.196) for {'kernel': 'poly', 'gamma': 0.03125, 'C': 64}
-0.472 (+/-0.049) for {'kernel': 'rbf', 'gamma': 0.015625, 'C': 2}
-0.588 (+/-0.046) for {'kernel': 'poly', 'gamma': 0.00390625, 'C': 256}

best cv 0.6775295073864251
best test 0.519414908249


| | pred | true |
|---|---|---|
| 3717 | 5.508 | 6 |
| 3611 | 6.325 | 6 |
| 1919 | 5.779 | 6 |
| 23 | 4.886 | 5 |
| 844 | 6.377 | 8 |
| 1922 | 5.948 | 5 |
| 3612 | 6.187 | 6 |
| 3337 | 5.175 | 6 |
| 1161 | 5.734 | 6 |
| 4394 | 6.137 | 6 |


In [55]:
# from sklearn.ensemble import RandomForestRegressor

# rfr_params = {
#     'n_estimators':[x for x in range(10, 15)],
#     'min_samples_split':[x for x in range(2,5)]
# }

# rfr = fitter(RandomForestRegressor, rfr_params, 10, X_train, y_train)
# print_regression_results(rfr, y_test, X_test)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=11, min_samples_split=2 ............................
[CV] n_estimators=11, min_samples_split=2 ............................
[CV] n_estimators=11, min_samples_split=2 ............................
[CV] n_estimators=11, min_samples_split=2 ............................
[CV] n_estimators=11, min_samples_split=2 ............................
[CV] n_estimators=11, min_samples_split=3 ............................
[CV] n_estimators=11, min_samples_split=3 ............................
[CV] n_estimators=11, min_samples_split=3 ............................
[CV]  n_estimators=11, min_samples_split=3, score=-0.4242244354971628, total=   1.4s
[CV] n_estimators=11, min_samples_split=3 ............................
[CV]  n_estimators=11, min_samples_split=3, score=-0.4255054900006198, total=   1.5s
[CV] n_estimators=11, min_samples_split=3 ............................
[CV]  n_estimators=11, min_samples_split=2, score=-0.422801

[Parallel(n_jobs=8)]: Done   5 tasks      | elapsed:    1.7s


[CV]  n_estimators=13, min_samples_split=3, score=-0.384428845541959, total=   0.9s
[CV] n_estimators=14, min_samples_split=3 ............................
[CV]  n_estimators=13, min_samples_split=3, score=-0.4230616009135662, total=   1.1s
[CV] n_estimators=14, min_samples_split=3 ............................


[Parallel(n_jobs=8)]: Done  10 tasks      | elapsed:    2.8s


[CV]  n_estimators=11, min_samples_split=3, score=-0.47796092268230206, total=   1.7s
[CV] n_estimators=14, min_samples_split=3 ............................
[CV]  n_estimators=13, min_samples_split=3, score=-0.42816699556978505, total=   1.7s
[CV]  n_estimators=11, min_samples_split=3, score=-0.3951163726187066, total=   1.9s
[CV] n_estimators=14, min_samples_split=3 ............................
[CV] n_estimators=14, min_samples_split=4 ............................
[CV]  n_estimators=13, min_samples_split=3, score=-0.4238451234005968, total=   1.8s
[CV] n_estimators=14, min_samples_split=4 ............................
[CV]  n_estimators=14, min_samples_split=3, score=-0.4236610361112001, total=   1.9s
[CV] n_estimators=14, min_samples_split=4 ............................
[CV]  n_estimators=13, min_samples_split=3, score=-0.4673599168761852, total=   2.0s
[CV] n_estimators=14, min_samples_split=4 ............................
[CV]  n_estimators=14, min_samples_split=3, score=-0.420424594

[Parallel(n_jobs=8)]: Done  17 tasks      | elapsed:    4.6s


[CV]  n_estimators=14, min_samples_split=3, score=-0.4476288190714883, total=   1.4s
[CV] n_estimators=14, min_samples_split=2 ............................
[CV]  n_estimators=14, min_samples_split=3, score=-0.38017114701975735, total=   1.8s
[CV] n_estimators=14, min_samples_split=2 ............................
[CV]  n_estimators=14, min_samples_split=4, score=-0.41664495554686554, total=   1.6s
[CV] n_estimators=14, min_samples_split=2 ............................
[CV]  n_estimators=14, min_samples_split=4, score=-0.4200215602577979, total=   1.8s
[CV] n_estimators=14, min_samples_split=2 ............................
[CV]  n_estimators=14, min_samples_split=4, score=-0.3776958984629009, total=   1.8s
[CV] n_estimators=12, min_samples_split=2 ............................
[CV]  n_estimators=14, min_samples_split=4, score=-0.42277812760864425, total=   2.1s
[CV] n_estimators=12, min_samples_split=2 ............................


[Parallel(n_jobs=8)]: Done  24 tasks      | elapsed:    5.8s


[CV]  n_estimators=14, min_samples_split=4, score=-0.4521651993912582, total=   1.7s
[CV] n_estimators=12, min_samples_split=2 ............................
[CV]  n_estimators=14, min_samples_split=2, score=-0.425, total=   1.6s
[CV] n_estimators=12, min_samples_split=2 ............................
[CV]  n_estimators=14, min_samples_split=2, score=-0.43771585557299847, total=   1.6s
[CV] n_estimators=12, min_samples_split=2 ............................
[CV]  n_estimators=14, min_samples_split=2, score=-0.42802851909994766, total=   1.4s
[CV] n_estimators=13, min_samples_split=4 ............................
[CV]  n_estimators=12, min_samples_split=2, score=-0.4337072649572649, total=   1.2s
[CV] n_estimators=13, min_samples_split=4 ............................
[CV]  n_estimators=12, min_samples_split=2, score=-0.4425480769230769, total=   1.5s
[CV] n_estimators=13, min_samples_split=4 ............................
[CV]  n_estimators=14, min_samples_split=2, score=-0.39228733855544784, tot

[Parallel(n_jobs=8)]: Done  33 tasks      | elapsed:    8.0s


[CV]  n_estimators=13, min_samples_split=4, score=-0.41943744785988196, total=   1.6s
[CV] n_estimators=10, min_samples_split=4 ............................
[CV]  n_estimators=13, min_samples_split=4, score=-0.3819970330831619, total=   1.4s
[CV] n_estimators=12, min_samples_split=3 ............................
[CV]  n_estimators=10, min_samples_split=4, score=-0.39946151055629786, total=   1.0s
[CV] n_estimators=12, min_samples_split=3 ............................
[CV]  n_estimators=10, min_samples_split=4, score=-0.43920760358927524, total=   1.3s
[CV]  n_estimators=13, min_samples_split=4, score=-0.4479011286897501, total=   1.6s
[CV] n_estimators=12, min_samples_split=3 ............................
[CV]  n_estimators=13, min_samples_split=4, score=-0.44070600947430394, total=   2.0s
[CV] n_estimators=12, min_samples_split=3 ............................
[CV] n_estimators=12, min_samples_split=3 ............................


[Parallel(n_jobs=8)]: Done  42 tasks      | elapsed:    9.4s


[CV]  n_estimators=10, min_samples_split=4, score=-0.4679290395704709, total=   1.0s
[CV]  n_estimators=10, min_samples_split=4, score=-0.45188998368970096, total=   1.7s
[CV]  n_estimators=10, min_samples_split=4, score=-0.43345435474082794, total=   1.8s
[CV]  n_estimators=12, min_samples_split=3, score=-0.4402121864118392, total=   1.0s
[CV]  n_estimators=12, min_samples_split=3, score=-0.420339471549541, total=   1.0s
[CV]  n_estimators=12, min_samples_split=3, score=-0.44220333820662766, total=   1.0s
[CV]  n_estimators=12, min_samples_split=3, score=-0.3927938570919976, total=   1.1s
[CV]  n_estimators=12, min_samples_split=3, score=-0.42412268604319703, total=   1.4s


[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:   10.9s finished


Best parameters set found on development set:

{'n_estimators': 14, 'min_samples_split': 4}

Grid scores on development set:

-0.426 (+/-0.046) for {'n_estimators': 11, 'min_samples_split': 2}
-0.432 (+/-0.054) for {'n_estimators': 11, 'min_samples_split': 3}
-0.425 (+/-0.053) for {'n_estimators': 13, 'min_samples_split': 3}
-0.420 (+/-0.044) for {'n_estimators': 14, 'min_samples_split': 3}
-0.418 (+/-0.047) for {'n_estimators': 14, 'min_samples_split': 4}
-0.427 (+/-0.039) for {'n_estimators': 14, 'min_samples_split': 2}
-0.428 (+/-0.057) for {'n_estimators': 12, 'min_samples_split': 2}
-0.423 (+/-0.046) for {'n_estimators': 13, 'min_samples_split': 4}
-0.438 (+/-0.046) for {'n_estimators': 10, 'min_samples_split': 4}
-0.424 (+/-0.036) for {'n_estimators': 12, 'min_samples_split': 3}

best cv 0.646422966714551
best test 0.445504528737
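
The "best cv" and "best test" figures printed for each model are on different scales: the helper takes the square root of the cross-validated MSE but prints the raw test MSE. The short calculation below puts both on both scales, using only the numbers reported above. The baseline is an approximation I have added for context: it takes the standard deviation of `quality` from the summary table (0.873, computed on the full data rather than the test split) as the RMSE of always predicting the mean.

```python
import math

# Figures copied from the recorded outputs above.
results = {
    "SVR": {"cv_rmse": 0.6775295073864251, "test_mse": 0.519414908249},
    "RandomForestRegressor": {"cv_rmse": 0.646422966714551, "test_mse": 0.445504528737},
}

# Approximate constant-prediction baseline: std of `quality` in the summary table.
baseline_rmse = 0.873

converted = {}
for name, r in results.items():
    cv_mse = r["cv_rmse"] ** 2            # back to the scale of the grid scores
    test_rmse = math.sqrt(r["test_mse"])  # onto the scale of "best cv"
    converted[name] = {"cv_mse": cv_mse, "test_rmse": test_rmse}
    print(
        f"{name}: CV MSE {cv_mse:.3f} | test MSE {r['test_mse']:.3f} | "
        f"CV RMSE {r['cv_rmse']:.3f} | test RMSE {test_rmse:.3f}"
    )

print(f"baseline: MSE {baseline_rmse ** 2:.3f} | RMSE {baseline_rmse:.3f}")
```

On a common scale, both models have a somewhat higher error on the test set than in cross-validation, and both improve on the approximate baseline.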


| | pred | true |
|---|---|---|
| 3717 | 5.595 | 6 |
| 3611 | 6.214 | 6 |
| 1919 | 5.714 | 6 |
| 23 | 4.857 | 5 |
| 844 | 7.417 | 8 |
| 1922 | 5.740 | 5 |
| 3612 | 6.089 | 6 |
| 3337 | 5.643 | 6 |
| 1161 | 5.565 | 6 |
| 4394 | 6.179 | 6 |


In [None]:
all_data