# Model Training and Hyperparameter Tuning

In this notebook we will use the preprocessed data to train our models and search for the best set of hyperparameters. I want to investigate how different types of machine learning models perform on this task, such as Logistic Regression, SVM, Random Forests, Gradient Boosting, etc.

In [4]:
import pandas as pd
import numpy as np
import warnings
import pickle


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedShuffleSplit
from sklearn.metrics import f1_score, r2_score
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer

warnings.filterwarnings("ignore")


%load_ext autoreload
%autoreload 2

%run ../src/utils.py

In [5]:
# splitting data into train/test sets

data_sequence = pd.read_hdf('../data/preprocessed/data_sequence_alldata.hdf', key='final_alldata', mode='r')
data_sequence = data_sequence.replace([np.inf, -np.inf], np.nan)

with open('../data/preprocessed/hashs_train.pkl', 'rb') as fp:
    hashs_train = pickle.load(fp)
    
with open('../data/preprocessed/hashs_test.pkl', 'rb') as fp:
    hashs_test = pickle.load(fp)

window_reference = 5

train = data_sequence[data_sequence.hash.isin(hashs_train)]
train = train[train['hour_exit_'+str(window_reference)]>=15]

test  = data_sequence[data_sequence.hash.isin(hashs_test)]

train_train, train_val = train_test_split(train, test_size=0.25, random_state=20)
train_train.shape, train_val.shape

((100547, 1431), (33516, 1431))

In [6]:
with open('../data/preprocessed/original_columns_alldata.pkl', 'rb') as fp: 
    original_cols = pickle.load(fp)
    
drop_cols = list(x for x in original_cols if 'exit' in x)# + grid_cols
drop_cols += ['lat_lon_entry', 'lat_lon_exit']
drop_cols += ['euclidean_distance', 'manhattan_distance', 'harvesine_distance',
              'center_permanency', 'crossed_city', 'velocity', 'leaving_city', 'entering_city']

drop_cols = [col+'_'+str(window_reference) for col in drop_cols]
#drop_cols += [col+'_'+str(i) for col in grid_cols for i in range(0, 5)]
drop_cols += ['hash', f'delta_last_center_permanency_{window_reference}', f'delta_origin_center_permanency_{window_reference}']

features = list(set(train.columns) - set(drop_cols))
target   = ['is_inside_city_exit_'+str(window_reference)]
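
The filter above removes only the exit-related columns of the reference window, presumably because they would leak the target, while the exit columns of earlier windows stay in as features. A minimal sketch of that selection logic follows. The column names are made up, and the layout (each base column repeated for windows 0 through 5) is an assumption for illustration, not the real schema:

```python
window_reference = 5

# Hypothetical base column names; the real list comes from original_columns_alldata.pkl
original_cols = ["x_entry", "y_entry", "x_exit", "y_exit", "is_inside_city_exit"]

# Assumed layout: every base column is repeated once per window, suffixed with the window index
all_columns = ["hash"] + [
    f"{col}_{w}" for w in range(window_reference + 1) for col in original_cols
]

# Drop the exit-related columns of the reference window only, plus the identifier
drop_cols = [f"{col}_{window_reference}" for col in original_cols if "exit" in col]
drop_cols += ["hash"]

# sorted() keeps the feature order reproducible across runs
features = sorted(set(all_columns) - set(drop_cols))
target = [f"is_inside_city_exit_{window_reference}"]

print(len(all_columns), len(features))
```

In this toy layout `x_exit_4` remains a feature while `x_exit_5` and the target column are excluded.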

## 1. Logistic Regression

The first model we will consider is a simple linear model, Logistic Regression. I will use a grid search (`GridSearchCV`) that evaluates every candidate on cross-validation splits of the training data (here, two stratified shuffle splits) to find the best set of hyperparameters.

It is important to note that hyperparameter tuning uses only the training data: the held-out set (`train_val`) is used to assess the parameters that were found, never to choose them.
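
An exhaustive grid search fits one model per parameter combination and per split, so its cost grows multiplicatively with every value added to the grid. A minimal sketch of that count with the standard library follows, using a dictionary shaped like the grid in the next cell (the scaler entries are string placeholders for the actual scaler objects):

```python
from itertools import product

# Same shape as the search space in the next cell; scalers replaced by placeholders
param_grid = {
    "model__C": [0.01, 0.1, 20],
    "model__penalty": ["l1", "l2"],
    "model__class_weight": [None, "balanced"],
    "scaler": ["standard", "minmax"],
}
n_splits = 2  # number of cross-validation splits per candidate

# Every candidate is one combination of values, one per parameter name
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * n_splits
print(len(candidates), total_fits)
```

Three values of `C`, two penalties, two class-weight settings and two scalers give 24 candidates, hence 48 fits over two splits, which matches the log printed by the search.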

In [7]:
# defining parameters search
parameters = {'model__C': [0.01, 0.1, 20],
             'model__penalty':['l1', 'l2'],
             'model__class_weight':[None, 'balanced'],
             'scaler':[StandardScaler(), MinMaxScaler()]}

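# liblinear supports both the 'l1' and 'l2' penalties searched above;
# the default solver in recent scikit-learn releases ('lbfgs') rejects 'l1'
model    = LogisticRegression(random_state=20, solver='liblinear')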
imputer  = SimpleImputer(strategy='constant', fill_value=0)
pipeline = Pipeline(steps=[('imputer', imputer), ('scaler', MinMaxScaler()), ('model', model)])
splitter = StratifiedShuffleSplit(n_splits=2, random_state=2)

clf   = GridSearchCV(pipeline, param_grid=parameters, cv=splitter, verbose=20, n_jobs=2, scoring='f1', refit='f1')

clf.fit(train_train[features], train_train[target])

Fitting 2 folds for each of 24 candidates, totalling 48 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:   30.3s
[Parallel(n_jobs=2)]: Done   2 tasks      | elapsed:   31.1s
[Parallel(n_jobs=2)]: Done   3 tasks      | elapsed:   52.0s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:   52.6s
[Parallel(n_jobs=2)]: Done   5 tasks      | elapsed:  2.7min
[Parallel(n_jobs=2)]: Done   6 tasks      | elapsed:  2.7min
[Parallel(n_jobs=2)]: Done   7 tasks      | elapsed:  3.5min
[Parallel(n_jobs=2)]: Done   8 tasks      | elapsed:  3.5min
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed:  4.0min
[Parallel(n_jobs=2)]: Done  10 tasks      | elapsed:  4.0min
[Parallel(n_jobs=2)]: Done  11 tasks      | elapsed:  4.3min
[Parallel(n_jobs=2)]: Done  12 tasks      | elapsed:  4.4min
[Parallel(n_jobs=2)]: Done  13 tasks      | elapsed:  6.3min
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:  6.3min
[Parallel(n_jobs=2)]: Done  15 tasks      | elapsed:  7.0min
[Parallel(

In [None]:
print("Best Parameters: {}".format(clf.best_params_))

In [None]:
print("Logistic Regression F1-Score on CV data: {}".format(clf.best_score_))

In [None]:
print("Logistic Regression F1-Score on Holdout data: {}".format(clf.score(train_val[features], train_val[target])))

In [None]:
with open('../data/preprocessed/cv_results_logistic_regression.pkl', 'wb') as fp:
    pickle.dump(clf.cv_results_, fp)

## 2. KNN

In [None]:
# defining parameters search
parameters = {'model__n_neighbors': [3, 11, 31, 75],
             'model__weights': ['uniform', 'distance']}

model    = KNeighborsClassifier(n_jobs=-1)
imputer  = SimpleImputer(strategy='constant', fill_value=0)
pipeline = Pipeline(steps=[('imputer', imputer), ('scaler', MinMaxScaler()), ('model', model)])

splitter = StratifiedShuffleSplit(n_splits=2, random_state=2)

clf   = GridSearchCV(pipeline, param_grid=parameters, cv=splitter, verbose=20, n_jobs=-1, scoring='f1', refit='f1')

clf.fit(train_train[features], train_train[target])

In [None]:
print("Best Parameters: {}".format(clf.best_params_))

In [None]:
print("kNN F1-Score on CV data: {}".format(clf.best_score_))

In [None]:
print("kNN F1-Score on Holdout data: {}".format(clf.score(train_val[features], train_val[target])))

In [None]:
with open('../data/preprocessed/cv_results_knn.pkl', 'wb') as fp:
    pickle.dump(clf.cv_results_, fp)

## 3. LGBM

In [None]:
import lightgbm as lgb

# Create parameters to search
gridParams = {
    'num_leaves': [31, 42],
    'random_state' : [20], # Updated from 'seed'
    'colsample_bytree' : [0.8, 1.0],
    'subsample' : [0.5, 0.75, 1.0],
    'max_depth': [7, 15, 25, -1]
    }

# Create classifier to use. Note that parameters have to be input manually
# not as a dict!
clf = lgb.LGBMClassifier()
splitter = StratifiedShuffleSplit(n_splits=2, random_state=2)
# Create the grid
grid = GridSearchCV(clf, gridParams, verbose=20, cv=splitter, n_jobs=-1, scoring='f1', refit='f1')
# Run the grid
grid.fit(train_train[features], train_train[target])

# Print the best parameters found
print(grid.best_params_)
print(grid.best_score_)

In [None]:
print("Best Parameters: {}".format(grid.best_params_))

In [None]:
# the fitted search object is `grid`; `clf` is the unfitted base estimator
print("LightGBM F1-Score on CV data: {}".format(grid.best_score_))

In [None]:
# score with the fitted search object, which uses the refit best estimator and the F1 scorer
print("LightGBM F1-Score on Holdout data: {}".format(grid.score(train_val[features], train_val[target])))

In [17]:
import lightgbm as lgb

best_params_lgb = {'boosting_type': 'gbdt', 
                 'colsample_bytree': 1.0, 
                 'is_unbalance': False, 
                 'max_depth': 7, 
                 'n_estimators': 150, 
                 'num_leaves': 31, 
                 'objective': 'binary', 
                 'random_state': 20, 
                 'reg_alpha': 1, 
                 'subsample': 0.7}

clf = lgb.LGBMClassifier(**best_params_lgb)

clf.fit(train_train[features], train_train[target])
clf.score(train_val[features], train_val[target])

0.9411027568922306
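
The value above comes from `score`, which for a scikit-learn style classifier reports mean accuracy, while the grid searches in this notebook are ranked by F1. The two can differ substantially when the labels are imbalanced. A small pure-Python sketch on made-up labels (not the notebook's data) illustrates the gap:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced labels: 7 negatives, 3 positives, only one positive recovered
y_true = [0] * 7 + [1] * 3
y_pred = [0] * 9 + [1]

acc = accuracy(y_true, y_pred)
f1_value = f1(y_true, y_pred)
print(acc, f1_value)
```

Here eight of ten predictions are correct, yet only one of the three positives is found, so the F1 score is much lower than the accuracy. Holdout numbers from `score` on a bare classifier are therefore not directly comparable with the F1 values reported by the grid searches.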

In [12]:
data_test = data_sequence[data_sequence.hash.isin(hashs_test)]
ids = pd.read_csv('../data/raw/data_test.zip', index_col='Unnamed: 0', low_memory=True)
ids = ids[ids.x_exit.isnull()]
data_test = data_test.merge(ids[['hash', 'trajectory_id']], on='hash')

clf = lgb.LGBMClassifier(**best_params_lgb)

clf.fit(train[features], train[target])
yhat = clf.predict(data_test[features])

pd.Series(yhat).value_counts()

0.0    25012
1.0     8503
dtype: int64

In [13]:
submission = pd.DataFrame(list(zip(data_test['trajectory_id'], yhat)), columns=['id', 'target'])
submission.to_csv('../data/submission92_victor.csv', index=False)

In [None]:
# defining parameters search
parameters = {'model__C': [0.01, 0.1, 10, 20],
             'model__penalty':['l1', 'l2'],
             'model__fit_intercept': [True, False],
             'model__class_weight':[None, 'balanced'],
              'model__n_jobs':[8],
             'scaler':[StandardScaler(), MinMaxScaler()]}

# solver set explicitly so that the 'l1' penalty in the grid is supported
model    = LogisticRegression(random_state=20, solver='liblinear', n_jobs=-1)
imputer  = SimpleImputer(strategy='constant', fill_value=0)
pipeline = Pipeline(steps=[('imputer', imputer), ('scaler', MinMaxScaler()), ('model', model)])
splitter = StratifiedShuffleSplit(n_splits=2, random_state=2)

clf   = GridSearchCV(pipeline, param_grid=parameters, cv=splitter, verbose=20, n_jobs=-1)

clf.fit(train_train[features], train_train[target])

Fitting 3 folds for each of 64 candidates, totalling 192 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:   13.3s
[Parallel(n_jobs=4)]: Done   2 tasks      | elapsed:   15.8s
[Parallel(n_jobs=4)]: Done   3 tasks      | elapsed:   16.3s
[Parallel(n_jobs=4)]: Done   4 tasks      | elapsed:   16.6s
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:   25.6s
[Parallel(n_jobs=4)]: Done   6 tasks      | elapsed:   29.1s
[Parallel(n_jobs=4)]: Done   7 tasks      | elapsed:   49.6s
[Parallel(n_jobs=4)]: Done   8 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done   9 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done  11 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done  12 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done  13 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done  14 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done  15 tasks      | elapsed:  1.5min
[Parallel(

## 4. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create parameters to search
gridParams = {
    'model__n_estimators': [200],
    'model__max_features' : [0.5, 'sqrt'],
    'model__max_depth': [7, 15, None]
    }

# Create classifier to use. Note that parameters have to be input manually
# not as a dict!
model    = RandomForestClassifier(random_state=20)
imputer  = SimpleImputer(strategy='constant', fill_value=0)
pipeline = Pipeline(steps=[('imputer', imputer), ('model', model)])

splitter = StratifiedShuffleSplit(n_splits=2, random_state=2)
# Create the grid
grid = GridSearchCV(pipeline, gridParams, verbose=20, cv=splitter, n_jobs=4, scoring='f1', refit='f1')
# Run the grid
grid.fit(train_train[features], train_train[target])

# Print the best parameters found
print(grid.best_params_)
print(grid.best_score_)

Fitting 2 folds for each of 6 candidates, totalling 12 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:  2.2min
[Parallel(n_jobs=4)]: Done   2 tasks      | elapsed:  2.2min
[Parallel(n_jobs=4)]: Done   3 tasks      | elapsed: 40.1min
[Parallel(n_jobs=4)]: Done   4 tasks      | elapsed: 40.1min
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed: 43.8min
[Parallel(n_jobs=4)]: Done   6 out of  12 | elapsed: 43.8min remaining: 43.8min
[Parallel(n_jobs=4)]: Done   7 out of  12 | elapsed: 75.9min remaining: 54.2min
[Parallel(n_jobs=4)]: Done   8 out of  12 | elapsed: 76.8min remaining: 38.4min
[Parallel(n_jobs=4)]: Done   9 out of  12 | elapsed: 81.3min remaining: 27.1min
[Parallel(n_jobs=4)]: Done  10 out of  12 | elapsed: 82.3min remaining: 16.5min
[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed: 189.6min remaining:    0.0s
[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed: 189.6min finished


## 5. CatBoost

In [None]:
from catboost import CatBoostClassifier

cbc = CatBoostClassifier(verbose=0)
cbc.fit(train_train[features], train_train[target])

In [None]:
# score needs the true labels as well as the features
cbc.score(train_val[features], train_val[target])