## Procedure

1. Clean and transform data
2. Exploratory Data Analysis (EDA)
3. **Modeling & evaluation**
4. Conclusion
5. Clean code with classes & functions

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
features_transformed = joblib.load('../work/data/features_transformed')
target = joblib.load('../work/data/target')

## Train-test split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(features_transformed, target, test_size=.3, random_state=88)
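
As a quick sanity check on what the split above does, here is a minimal sketch on made-up data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the project's features and target. It shows that `test_size=.3` holds out roughly 30% of the rows and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 2 features, binary labels.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

# Same arguments as in the notebook, run twice to check reproducibility.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=.3, random_state=88)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=.3, random_state=88)

print("train fraction:", len(a_train) / len(X_demo))
print("test fraction:", len(a_test) / len(X_demo))
print("same split both times:", np.array_equal(a_test, b_test))
```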

### KNN

In [None]:
start = datetime.now()

params = {
    'n_neighbors': list(range(2,30,2))
}

knn = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=ShuffleSplit(random_state=88))
knn.fit(X_train, y_train)
# The fitted grid search is stored in `knn`
display(pd.DataFrame(knn.cv_results_))
print(knn.cv_results_['mean_test_score'])

print("Time: ", datetime.now() - start)
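
To make the `cv=ShuffleSplit(random_state=88)` argument less opaque, the sketch below lists the splits it generates on a small hypothetical array (`X_demo` is made up for illustration). It assumes the documented scikit-learn defaults of 10 random splits, each holding out about 10% of the rows. Unlike k-fold, the hold-out sets of different splits may overlap.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical stand-in data: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)

# Same constructor call as in the grid search above.
splitter = ShuffleSplit(random_state=88)
splits = list(splitter.split(X_demo))

print("number of splits:", len(splits))
train_idx, holdout_idx = splits[0]
print("rows used for fitting:", len(train_idx))
print("rows held out for scoring:", len(holdout_idx))
```

Each candidate value of `n_neighbors` is fitted and scored once per split, so the grid search runs (number of candidates) x (number of splits) fits before refitting the best model.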

In [None]:
print('KNN train score: {}'.format(knn.score(X_train, y_train)))
print('KNN test score: {}'.format(knn.score(X_test, y_test)))

### Decision Tree

In [None]:
start = datetime.now()

params = {
    'max_depth':[1,2,4,8,16,17, None],
    'min_samples_leaf':[1,5,10,15,20],
}

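# Search the grid above with a classifier, matching the imports and the KNN setup.
# return_train_score=True is needed for the train curve plotted further down.
dt = GridSearchCV(DecisionTreeClassifier(), param_grid=params,
                  cv=ShuffleSplit(random_state=88), return_train_score=True)
dt.fit(X_train, y_train)

print("Time: ", datetime.now() - start)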

In [None]:
print('DT train score: {}'.format(dt.score(X_train, y_train)))
print('DT test score: {}'.format(dt.score(X_test, y_test)))

In [None]:
optimal_max_depth = dt.best_estimator_.max_depth
optimal_max_depth

In [None]:
cv_results_df = pd.DataFrame(dt.cv_results_)

# 'mean_train_score' is only present when GridSearchCV was built with return_train_score=True
plt.plot(cv_results_df['param_max_depth'], cv_results_df['mean_test_score'], c='g', label='test')
plt.plot(cv_results_df['param_max_depth'], cv_results_df['mean_train_score'], c='b', label='train')
plt.axvline(optimal_max_depth, c='r')
plt.legend()

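`cv_results_df['mean_test_score']` holds, for each parameter setting, the mean score over the CROSS-VALIDATED (hold-out) splits, which are carved out of the training data. This is why it differs from the `.score` results above: `.score(X_test, y_test)` is a single score computed on the REAL TEST data, which the grid search never saw. `cv_results_df['mean_train_score']` is the corresponding mean over the portions of the training data the models were fitted on.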
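
The distinction can be checked directly. The following is a minimal sketch on synthetic data from `make_classification`, an assumption standing in for the project's features and target. It compares the cross-validated hold-out score of the winning parameter setting with the score on a separate test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the notebook's real data).
X, y = make_classification(n_samples=500, n_features=10, random_state=88)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=88)

gs = GridSearchCV(
    DecisionTreeClassifier(random_state=88),
    param_grid={"max_depth": [1, 2, 4, 8]},
    cv=ShuffleSplit(random_state=88),
    return_train_score=True,  # needed for mean_train_score to exist
)
gs.fit(X_tr, y_tr)

# Mean accuracy over the CV hold-out splits for the winning parameter setting.
cv_score = gs.cv_results_["mean_test_score"][gs.best_index_]
# Accuracy of the refitted best estimator on data the search never saw.
test_score = gs.score(X_te, y_te)

print("best params:", gs.best_params_)
print("CV hold-out mean score:", cv_score)
print("real test score:", test_score)
```

The two numbers are usually close but not identical, because the first is averaged over small slices of the training data while the second comes from one larger, untouched test set.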