**Instructors**: Prof. Keith Chugg (chugg@usc.edu) & Prof. B. Keith Jenkins (jenkins@sipi.usc.edu)

**Notebook**: Written by TA Thanos

Sklearn info on CV:
https://scikit-learn.org/stable/modules/cross_validation.html#repeated-k-fold

![Alt text](grid_search_cross_validation.png)

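With grid search, we fix in advance which hyperparameters to search and the candidate values for each. Every combination of those values is then evaluated with cross-validation.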

In [32]:
from scipy.stats import loguniform
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
# load dataset
dataframe = read_csv('sonar.csv', header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
# model = LogisticRegression()
model = SVC()
# define evaluation
# cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
cv = StratifiedKFold(n_splits=10)

# define search space
space = dict()
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
space['kernel'] = ['linear','rbf','poly']
space['degree'] = [1,3,6]
space['gamma'] = [0.01,0.1,1,4,10]
# space['solver'] = ['newton-cg', 'liblinear']
# space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
# space['C'] = loguniform(1e-5, 100)
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)

# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Best Score: 0.6976190476190476
Best Hyperparameters: {'C': 10, 'degree': 1, 'gamma': 0.1, 'kernel': 'rbf'}
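To get a feel for the cost of the grid above, the following minimal sketch enumerates the combinations using only the standard library. The variable names (candidates, n_fits) are illustrative and not part of the original notebook; sklearn's ParameterGrid should give the same enumeration.

```python
from itertools import product

# Same candidate lists as the grid-search cell above
space = {
    'C': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'degree': [1, 3, 6],
    'gamma': [0.01, 0.1, 1, 4, 10],
}

# Cartesian product of all candidate values: one dict per combination
names = list(space)
candidates = [dict(zip(names, values)) for values in product(*space.values())]

n_splits = 10  # matches StratifiedKFold(n_splits=10)
n_candidates = len(candidates)
n_fits = n_candidates * n_splits

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

Note that many of these combinations are redundant for SVC: degree only affects the poly kernel and gamma is not used by the linear kernel, so a grid split per kernel would need fewer fits.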


RandomizedSearchCV samples a fixed number of hyperparameter settings (n_iter) and uses CV to evaluate each one. We do not have to provide an explicit set of possible values for each hyperparameter; instead, we can provide a statistical distribution from which values are sampled. Hyperparameters given as a list (such as kernel and degree below) are sampled uniformly from that list.

In [33]:
# random search over an SVC model on the sonar dataset
from scipy.stats import loguniform
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
# load dataset
dataframe = read_csv('sonar.csv', header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
# model = LogisticRegression()
model = SVC()
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# cv = StratifiedKFold(n_splits=10)
# define search space
space = dict()
space['C'] = loguniform(1e-3, 100)
# loguniform(a, b): the density is proportional to 1/x on [a, b],
# so the pdf decreases as x grows. Equivalently, log(x) is uniform:
# every decade (e.g. 1e-3..1e-2 and 10..100) is sampled equally often,
# which favors small values of C compared with a plain uniform draw
space['kernel'] = ['linear','rbf','poly']
space['degree'] = [1,3,6]
space['gamma'] = loguniform(1e-4, 10)
# space['solver'] = ['newton-cg', 'liblinear']
# space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
# space['C'] = loguniform(1e-5, 100)
# define search
search = RandomizedSearchCV(model, space, n_iter=500, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Best Score: 0.8893650793650795
Best Hyperparameters: {'C': 14.616410081499913, 'degree': 6, 'gamma': 0.49696110967288326, 'kernel': 'rbf'}
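The effect of the log-uniform distribution used for C and gamma can be checked numerically. The sketch below does not call scipy; it assumes the equivalent construction "exponentiate a uniform draw of log(x)", and the helper name sample_loguniform is hypothetical.

```python
import numpy as np

def sample_loguniform(low, high, size, rng):
    """Draw values whose logarithm is uniform on [log(low), log(high)]."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

rng = np.random.default_rng(0)
low, high = 1e-3, 100.0   # same range as space['C'] in the cell above
samples = sample_loguniform(low, high, size=100_000, rng=rng)

# Decade edges 1e-3, 1e-2, ..., 1e2 and the fraction of samples in each decade
edges = 10.0 ** np.arange(-3, 3)
counts, _ = np.histogram(samples, bins=edges)
fractions = counts / samples.size

for lo, hi, frac in zip(edges[:-1], edges[1:], fractions):
    print(f"[{lo:g}, {hi:g}): {frac:.3f}")   # each fraction should be close to 0.2

# The median sits near sqrt(low * high), far below the midpoint of the range
print("median:", np.median(samples), " sqrt(low*high):", np.sqrt(low * high))
```

Each of the five decades receives roughly the same share of samples, which is why this distribution is a common choice for scale parameters such as C and gamma.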


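If fitting a candidate raises an error, the search does not stop: with the default error_score=np.nan, the failed fit receives a NaN score (and a warning is issued) and the remaining candidates are still evaluated. RandomizedSearchCV reports the best cross-validated score in best_score_ and the corresponding setting in best_params_.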

Sequential search: optimize one hyperparameter at a time while keeping the others fixed. It may not find the best combination, but it needs far fewer fits than grid search. The order in which the hyperparameters are searched also matters.

In [24]:
from sklearn.model_selection import cross_val_score
cv = StratifiedKFold(n_splits=10)
space = dict()
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
space['kernel'] = ['linear','rbf','poly']
space['degree'] = [1,3,6]
space['gamma'] = [0.01,0.1,1,4,10]
for c_par in space['C']:
    clf = SVC(kernel='linear', C=c_par)
    scores = cross_val_score(clf, X, y, cv=10)
    print("Model",clf,":%0.2f accuracy with std of %0.2f" % (scores.mean(), scores.std()))

Model SVC(C=1e-05, kernel='linear') :0.53 accuracy with std of 0.02
Model SVC(C=0.0001, kernel='linear') :0.53 accuracy with std of 0.02
Model SVC(C=0.001, kernel='linear') :0.53 accuracy with std of 0.02
Model SVC(C=0.01, kernel='linear') :0.53 accuracy with std of 0.02
Model SVC(C=0.1, kernel='linear') :0.60 accuracy with std of 0.16
Model SVC(C=1, kernel='linear') :0.64 accuracy with std of 0.19
Model SVC(C=10, kernel='linear') :0.60 accuracy with std of 0.15
Model SVC(C=100, kernel='linear') :0.61 accuracy with std of 0.15


The best mean accuracy is at C = 1 (0.64), so we fix C = 1 while searching the remaining parameters.

In [25]:
for kernel_par in space['kernel']:
    clf = SVC(kernel=kernel_par, C=1)
    scores = cross_val_score(clf, X, y, cv=10)
    print("Model",clf,":%0.2f accuracy with std of %0.2f" % (scores.mean(), scores.std()))

Model SVC(C=1, kernel='linear') :0.64 accuracy with std of 0.19
Model SVC(C=1) :0.64 accuracy with std of 0.14
Model SVC(C=1, kernel='poly') :0.68 accuracy with std of 0.15


The best kernel is polynomial (0.68), so we fix kernel = 'poly'.

In [26]:
for degree_par in space['degree']:
    clf = SVC(kernel='poly', C=1,degree=degree_par)
    scores = cross_val_score(clf, X, y, cv=10)
    print("Model",clf,":%0.2f accuracy with std of %0.2f" % (scores.mean(), scores.std()))

Model SVC(C=1, degree=1, kernel='poly') :0.60 accuracy with std of 0.13
Model SVC(C=1, kernel='poly') :0.68 accuracy with std of 0.15
Model SVC(C=1, degree=6, kernel='poly') :0.66 accuracy with std of 0.11


The best degree is 3 (0.68). Since 3 is the SVC default, degree is omitted from the printed model.

In [27]:
for gamma_par in space['gamma']:
    clf = SVC(kernel='poly', C=1,degree=3,gamma=gamma_par)
    scores = cross_val_score(clf, X, y, cv=10)
    print("Model",clf,":%0.2f accuracy with std of %0.2f" % (scores.mean(), scores.std()))

Model SVC(C=1, gamma=0.01, kernel='poly') :0.53 accuracy with std of 0.02
Model SVC(C=1, gamma=0.1, kernel='poly') :0.59 accuracy with std of 0.16
Model SVC(C=1, gamma=1, kernel='poly') :0.68 accuracy with std of 0.09
Model SVC(C=1, gamma=4, kernel='poly') :0.68 accuracy with std of 0.09
Model SVC(C=1, gamma=10, kernel='poly') :0.68 accuracy with std of 0.09
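The sequential procedure used in the cells above can be written as one generic loop. This is a minimal sketch on a toy score function rather than on the sonar data; sequential_search and toy_score are hypothetical names, and the toy function is chosen only so that the two parameters interact.

```python
def sequential_search(score_fn, space, order, start):
    """Optimize one parameter at a time, keeping the others fixed."""
    best = dict(start)
    n_evals = 0
    for name in order:
        scored = []
        for value in space[name]:
            trial = {**best, name: value}
            scored.append((score_fn(trial), value))
            n_evals += 1
        # keep the value with the highest score (the first one wins on ties)
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best, score_fn(best), n_evals


def toy_score(p):
    # Toy objective with an interaction between x and y; maximum 0 at x = y = 3
    return -(p['x'] - p['y']) ** 2 - (p['x'] - 3) ** 2


space = {'x': [1, 2, 3], 'y': [1, 2, 3]}
start = {'x': 1, 'y': 1}

result_xy = sequential_search(toy_score, space, ['x', 'y'], start)
result_yx = sequential_search(toy_score, space, ['y', 'x'], start)

# Exhaustive grid for comparison: 9 evaluations instead of 6
grid_best = max(toy_score({'x': x, 'y': y}) for x in space['x'] for y in space['y'])

print("x then y:", result_xy)
print("y then x:", result_yx)
print("grid best score:", grid_best)
```

On this toy problem the two search orders end at different settings, and neither reaches the score found by the full grid, which illustrates both caveats stated above.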


When the data carry a date or timestamp, the shuffled sklearn splits used above may not be appropriate, because we may need to preserve the time order of the data. For example, if we have stock market data and want to predict a stock price, evaluating the model on a validation set that pre-dates the training set does not give realistic results (the model has effectively been trained on future data). The following example illustrates this using weather data to predict fires in Algeria.
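The condition described above can be stated as a simple check: no validation date may fall on or before the latest training date. The sketch below uses made-up dates, and has_temporal_leakage is a hypothetical helper, not an sklearn function.

```python
from datetime import date

def has_temporal_leakage(train_dates, val_dates):
    """Return True if any validation date is not strictly after all training dates."""
    return min(val_dates) <= max(train_dates)


days = [date(2012, 6, d) for d in range(1, 11)]   # June 1st to June 10th

# Chronological split: train on the first 8 days, validate on the last 2
chrono_train, chrono_val = days[:8], days[8:]

# Interleaved split (stands in for a random shuffle): every other day
mixed_train, mixed_val = days[0::2], days[1::2]

print(has_temporal_leakage(chrono_train, chrono_val))   # False
print(has_temporal_leakage(mixed_train, mixed_val))     # True
```

A randomly shuffled split behaves like the interleaved case: some validation points are older than some training points, so the validation score is measured partly on the "past".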

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.cm as cm
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score


import datetime

# from plotDecBoundaries import plotDecBoundaries

In [2]:
# alg_fire_df = pd.read_csv('algerian_fires2.csv',header = "infer")  

alg_fire_df_train = pd.read_csv('Fire_train_set.csv',header = "infer")  
alg_fire_df_test = pd.read_csv('Fire_test_set.csv',header = "infer")  
alg_fire_df_train.head()
 

Unnamed: 0,Date,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,...,ws_max,rain_avg,rain_min,rain_max,rh_avg,rh_min,rh_max,temp_avg,temp_min,temp_max
0,2012-06-01,29.0,57.0,18.0,0.0,65.7,3.4,7.6,1.3,3.4,...,18.0,0.35,0.0,0.7,64.0,57.0,71.0,30.5,29.0,32.0
1,2012-06-01,32.0,71.0,12.0,0.7,57.1,2.5,8.2,0.6,2.8,...,18.0,0.35,0.0,0.7,64.0,57.0,71.0,30.5,29.0,32.0
2,2012-06-02,29.0,61.0,13.0,1.3,64.4,4.1,7.6,1.0,3.9,...,18.0,1.5,0.0,4.0,65.5,57.0,73.0,30.0,29.0,32.0
3,2012-06-02,30.0,73.0,13.0,4.0,55.7,2.7,7.8,0.6,2.9,...,18.0,1.5,0.0,4.0,65.5,57.0,73.0,30.0,29.0,32.0
4,2012-06-03,26.0,82.0,22.0,13.1,47.1,2.5,7.1,0.3,2.7,...,22.0,3.516667,0.0,13.1,70.666667,57.0,82.0,29.166667,26.0,32.0


In [3]:
print('Train Start date:',alg_fire_df_train['Date'].min(),'End date:',alg_fire_df_train['Date'].max())
print('Test Start date:',alg_fire_df_test['Date'].min(),'End date:',alg_fire_df_test['Date'].max())

Train Start date: 2012-06-01 End date: 2012-08-31
Test Start date: 2012-09-06 End date: 2012-09-30


The test set data points are dated after the training set: the training set covers June 1st to August 31st, and the test set covers September 6th to September 30th. The goal is to train a model on data points from earlier dates and predict fires on "future" dates.

In [4]:
X_date_train = alg_fire_df_train.drop(columns=['Classes'])
y_date_train = alg_fire_df_train['Classes']
X_date_train.head()

Unnamed: 0,Date,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,...,ws_max,rain_avg,rain_min,rain_max,rh_avg,rh_min,rh_max,temp_avg,temp_min,temp_max
0,2012-06-01,29.0,57.0,18.0,0.0,65.7,3.4,7.6,1.3,3.4,...,18.0,0.35,0.0,0.7,64.0,57.0,71.0,30.5,29.0,32.0
1,2012-06-01,32.0,71.0,12.0,0.7,57.1,2.5,8.2,0.6,2.8,...,18.0,0.35,0.0,0.7,64.0,57.0,71.0,30.5,29.0,32.0
2,2012-06-02,29.0,61.0,13.0,1.3,64.4,4.1,7.6,1.0,3.9,...,18.0,1.5,0.0,4.0,65.5,57.0,73.0,30.0,29.0,32.0
3,2012-06-02,30.0,73.0,13.0,4.0,55.7,2.7,7.8,0.6,2.9,...,18.0,1.5,0.0,4.0,65.5,57.0,73.0,30.0,29.0,32.0
4,2012-06-03,26.0,82.0,22.0,13.1,47.1,2.5,7.1,0.3,2.7,...,22.0,3.516667,0.0,13.1,70.666667,57.0,82.0,29.166667,26.0,32.0


In [5]:
X_date_test = alg_fire_df_test.drop(columns=['Classes'])
y_date_test = alg_fire_df_test['Classes']

In [6]:
X_date_train = X_date_train.drop(columns=['Date'])
X_date_test = X_date_test.drop(columns=['Date'])

In [7]:
C = 0.1
models = (
    svm.SVC(kernel="linear", C=C),
    svm.LinearSVC(C=C, max_iter=10000),
    svm.SVC(kernel="rbf", gamma=0.9, C=C),
    svm.SVC(kernel="poly", degree=1, gamma="auto", C=C),
    svm.SVC(kernel="poly", degree=3, gamma="auto", C=C),
    svm.SVC(kernel="poly", degree=5, gamma="auto", C=C),
    svm.SVC(kernel="poly", degree=8, gamma="auto", C=C),
#     svm.SVC(kernel="sigmoid", gamma=0.7, C=C),

)

Wrong way: the validation set is drawn by randomly sampling the training data, so some validation points pre-date some of the training samples.

In [8]:
for clf in models:
    acc_train_ar = []
    acc_val_ar = []
    # f1_ar = []
    # auc_ar = []
    # prec_ar = []
    # rc_ar = []
    for i_CV in range(5):
        X_train, X_val, y_train, y_val = train_test_split(X_date_train, y_date_train, test_size=0.25, shuffle=True)
        clf.fit(X_train, y_train)
        # clf.fit(X_date_train, y_date_train)
        # y_pred = clf.predict(X_date_test)
        y_pred_train = clf.predict(X_train)
        acc_train_ar.append(accuracy_score(y_pred_train, y_train))
        
        # acc_ar.append(accuracy_score(y_pred, y_date_test))
        # f1_ar.append(f1_score(y_pred, y_date_test))
        # prec_ar.append(precision_score(y_pred, y_date_test))
        # rc_ar.append(recall_score(y_pred, y_date_test))
#         auc_ar.append(roc_auc_score(y_pred, y_date_test))
        val_ac = accuracy_score(clf.predict(X_val), y_val)
        acc_val_ar.append(val_ac)
        # report performance
    print('Model',clf,'Train Accuracy: %.3f, std: (%.3f), Validation: Accuracy: %.3f, std: (%.3f)'\
           % (np.mean(acc_train_ar), np.std(acc_train_ar),np.mean(acc_val_ar), np.std(acc_val_ar)))

    # print('f1: %.3f, std: (%.3f)' % (np.mean(f1_ar), np.std(f1_ar)))
#     print('Auc: %.3f, std: (%.3f)' % (np.mean(auc_ar), np.std(auc_ar)))
    # print('Prec: %.3f, std: (%.3f)' % (np.mean(prec_ar), np.std(prec_ar)))
    # print('rc: %.3f, std: (%.3f)' % (np.mean(rc_ar), np.std(rc_ar)))
#     plot_confusion_matrix(clf, X_date_test, y_date_test) 

Model SVC(C=0.1, kernel='linear') Train Accuracy: 0.994, std: (0.005), Validation: Accuracy: 0.935, std: (0.027)
Model LinearSVC(C=0.1, max_iter=10000) Train Accuracy: 0.996, std: (0.004), Validation: Accuracy: 0.948, std: (0.026)
Model SVC(C=0.1, gamma=0.9) Train Accuracy: 0.636, std: (0.012), Validation: Accuracy: 0.591, std: (0.035)
Model SVC(C=0.1, degree=1, gamma='auto', kernel='poly') Train Accuracy: 0.974, std: (0.010), Validation: Accuracy: 0.939, std: (0.025)
Model SVC(C=0.1, gamma='auto', kernel='poly') Train Accuracy: 1.000, std: (0.000), Validation: Accuracy: 0.939, std: (0.016)
Model SVC(C=0.1, degree=5, gamma='auto', kernel='poly') Train Accuracy: 1.000, std: (0.000), Validation: Accuracy: 0.926, std: (0.017)
Model SVC(C=0.1, degree=8, gamma='auto', kernel='poly') Train Accuracy: 1.000, std: (0.000), Validation: Accuracy: 0.935, std: (0.027)


Right way: create the validation set from the August dates only, and train on the earlier dates (June and July).

In [9]:
final_train_date = '2012-07-31'
val_mask = (alg_fire_df_train['Date'] > final_train_date)
val_set = alg_fire_df_train.loc[val_mask]
val_set

Unnamed: 0,Date,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,...,ws_max,rain_avg,rain_min,rain_max,rh_avg,rh_min,rh_max,temp_avg,temp_min,temp_max
122,2012-08-01,36.0,45.0,14.0,0.0,78.8,4.8,10.2,2.0,4.7,...,18.0,0.000000,0.0,0.0,62.583333,45.0,87.0,34.166667,29.0,38.0
123,2012-08-01,38.0,52.0,14.0,0.0,78.3,4.4,10.5,2.0,4.4,...,18.0,0.000000,0.0,0.0,62.583333,45.0,87.0,34.166667,29.0,38.0
124,2012-08-02,40.0,34.0,14.0,0.0,93.3,10.8,21.4,13.8,10.6,...,17.0,0.033333,0.0,0.4,58.750000,34.0,79.0,35.000000,31.0,40.0
125,2012-08-02,35.0,55.0,12.0,0.4,78.0,5.8,10.0,1.7,5.5,...,17.0,0.033333,0.0,0.4,58.750000,34.0,79.0,35.000000,31.0,40.0
126,2012-08-03,35.0,63.0,14.0,0.3,76.6,5.7,10.0,1.7,5.5,...,17.0,0.058333,0.0,0.4,55.666667,33.0,79.0,35.666667,31.0,40.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179,2012-08-29,35.0,48.0,18.0,0.0,90.1,54.2,220.4,12.5,67.4,...,21.0,0.075000,0.0,0.5,56.166667,37.0,82.0,34.166667,31.0,36.0
180,2012-08-30,34.0,49.0,15.0,0.0,89.2,24.8,159.1,8.1,35.7,...,21.0,0.141667,0.0,0.8,57.583333,37.0,82.0,34.166667,31.0,36.0
181,2012-08-30,35.0,70.0,17.0,0.8,72.7,25.2,180.4,1.7,37.4,...,21.0,0.141667,0.0,0.8,57.583333,37.0,82.0,34.166667,31.0,36.0
182,2012-08-31,28.0,80.0,21.0,16.8,52.5,8.7,8.7,0.6,8.3,...,21.0,1.541667,0.0,16.8,60.833333,37.0,82.0,33.250000,28.0,36.0
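The mask above compares the Date column with a string. Assuming the column was read as plain strings (read_csv was called without parse_dates), this works only because zero-padded ISO dates (YYYY-MM-DD) sort lexicographically in chronological order. The small check below uses illustrative values, not rows from the data set.

```python
from datetime import date

final_train_date = '2012-07-31'
dates = ['2012-06-01', '2012-07-31', '2012-08-01', '2012-08-31']

# String comparison, as in the mask above
by_string = [d > final_train_date for d in dates]

# The same comparison on parsed dates
cutoff = date.fromisoformat(final_train_date)
by_date = [date.fromisoformat(d) > cutoff for d in dates]

print(by_string)              # [False, False, True, True]
print(by_string == by_date)   # True

# Without zero padding the string comparison gives the wrong answer:
# October 1st is after September 30th, but '1' sorts before '9'
print('2012-10-01' > '2012-9-30')   # False
```

If the date format is not guaranteed to be zero-padded, converting the column with pd.to_datetime before masking is the safer option.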


In [10]:
val_set = val_set.drop(columns=['Date'])

The validation set has 62 data points (2 for each of the 31 days of August). Since we are doing 5-fold validation, each validation fold can use floor(62/5) = 12 data points, i.e. 6 days. The 5 folds cover 60 points, so the last 2 points are left out.
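The arithmetic above can be spelled out as row ranges. This sketch only computes the slice boundaries; the names folds and unused are illustrative.

```python
n_val = 62                      # validation points stated above
n_folds = 5
fold_size = n_val // n_folds    # integer division: 12 rows = 6 days

# (start, stop) row positions of each contiguous validation block
folds = [(k * fold_size, (k + 1) * fold_size) for k in range(n_folds)]
unused = n_val - n_folds * fold_size

for k, (start, stop) in enumerate(folds):
    print(f"fold {k}: rows {start}..{stop - 1}")
print("rows not used by any fold:", unused)
```

Keeping each fold as a contiguous block means every validation fold covers consecutive days, rather than a random mix of August dates.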

In [11]:
# random indices for train set template code
np.random.choice(10, replace = False, size = 5)

array([9, 1, 5, 4, 7])

In [13]:
train_mask = (alg_fire_df_train['Date'] <= final_train_date)
X_date_train = alg_fire_df_train.loc[train_mask]
X_date_train.head()


Unnamed: 0,Date,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,...,ws_max,rain_avg,rain_min,rain_max,rh_avg,rh_min,rh_max,temp_avg,temp_min,temp_max
0,2012-06-01,29.0,57.0,18.0,0.0,65.7,3.4,7.6,1.3,3.4,...,18.0,0.35,0.0,0.7,64.0,57.0,71.0,30.5,29.0,32.0
1,2012-06-01,32.0,71.0,12.0,0.7,57.1,2.5,8.2,0.6,2.8,...,18.0,0.35,0.0,0.7,64.0,57.0,71.0,30.5,29.0,32.0
2,2012-06-02,29.0,61.0,13.0,1.3,64.4,4.1,7.6,1.0,3.9,...,18.0,1.5,0.0,4.0,65.5,57.0,73.0,30.0,29.0,32.0
3,2012-06-02,30.0,73.0,13.0,4.0,55.7,2.7,7.8,0.6,2.9,...,18.0,1.5,0.0,4.0,65.5,57.0,73.0,30.0,29.0,32.0
4,2012-06-03,26.0,82.0,22.0,13.1,47.1,2.5,7.1,0.3,2.7,...,22.0,3.516667,0.0,13.1,70.666667,57.0,82.0,29.166667,26.0,32.0


In [15]:
X_date_train = X_date_train.drop(columns=['Date'])

In [16]:
for clf in models:
    acc_train_ar = []
    acc_val_ar = []
    # f1_ar = []
    # auc_ar = []
    # prec_ar = []
    # rc_ar = []
    for i_CV in range(5):
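        # choose random data points for training set
        # NOTE: chosen_idx is never applied below, so each iteration trains on the
        # full June-July set (hence the 0.000 train std in most rows of the output);
        # use X_date_train.iloc[chosen_idx] to train on the random 75% subset instead
        chosen_idx = np.random.choice(X_date_train.shape[0], replace = False, size = round(0.75*X_date_train.shape[0]))
        X_train = X_date_train.drop(columns=['Classes'])
        y_train = X_date_train['Classes']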

        # use consecutive 12-row blocks (6 days each) of August as validation folds
        X_val = val_set.iloc[i_CV*12:(i_CV+1)*12]
        y_val = X_val['Classes']
        X_val = X_val.drop(columns=['Classes'])

        clf.fit(X_train, y_train)
        # clf.fit(X_date_train, y_date_train)
        # y_pred = clf.predict(X_date_test)
        y_pred_train = clf.predict(X_train)
        acc_train_ar.append(accuracy_score(y_pred_train, y_train))
        
        # acc_ar.append(accuracy_score(y_pred, y_date_test))
        # f1_ar.append(f1_score(y_pred, y_date_test))
        # prec_ar.append(precision_score(y_pred, y_date_test))
        # rc_ar.append(recall_score(y_pred, y_date_test))
        val_ac = accuracy_score(clf.predict(X_val), y_val)
        acc_val_ar.append(val_ac)
        # report performance
    print('Model',clf,'Train Accuracy: %.3f, std: (%.3f), Validation: Accuracy: %.3f, std: (%.3f)'\
           % (np.mean(acc_train_ar), np.std(acc_train_ar),np.mean(acc_val_ar), np.std(acc_val_ar)))


Model SVC(C=0.1, kernel='linear') Train Accuracy: 0.992, std: (0.000), Validation: Accuracy: 0.917, std: (0.075)
Model LinearSVC(C=0.1, max_iter=10000) Train Accuracy: 0.997, std: (0.004), Validation: Accuracy: 0.933, std: (0.062)
Model SVC(C=0.1, gamma=0.9) Train Accuracy: 0.525, std: (0.000), Validation: Accuracy: 0.833, std: (0.158)
Model SVC(C=0.1, degree=1, gamma='auto', kernel='poly') Train Accuracy: 0.975, std: (0.000), Validation: Accuracy: 0.950, std: (0.067)
Model SVC(C=0.1, gamma='auto', kernel='poly') Train Accuracy: 1.000, std: (0.000), Validation: Accuracy: 0.583, std: (0.350)
Model SVC(C=0.1, degree=5, gamma='auto', kernel='poly') Train Accuracy: 1.000, std: (0.000), Validation: Accuracy: 0.600, std: (0.339)
Model SVC(C=0.1, degree=8, gamma='auto', kernel='poly') Train Accuracy: 1.000, std: (0.000), Validation: Accuracy: 0.583, std: (0.373)
