<a href="https://colab.research.google.com/github/juanpaat/Machine-Learning-Project-Template/blob/main/Template_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-Validation

## Train-Test-Split

A train-test split is the simplest form of cross-validation: we randomly partition the dataset into a training set and a testing set. The most important parameters are:

*  `X`: The feature set you're looking to split.

*  `y`: The target variable you're looking to split.

*  `test_size`: The size of your testing set. Typically, this is denoted as a fraction such as `0.33`.

*  `random_state`: This is the seed of the random shuffle. I recommend setting a seed so that every time you rerun your notebook, your results stay consistent.

*  `stratify`: This is an optional argument. Passing the target (for example `stratify=y`) keeps the class proportions the same in the training and testing sets, which matters most for imbalanced targets such as fraud labels.




In [None]:
from sklearn.model_selection import train_test_split

features = [
    'amount',
    'oldbalanceOrg',
    'newbalanceOrig',
    'oldbalanceDest',
    'newbalanceDest'
]

X = df[features]
y = df['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=42)
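To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of a stratified split. It is not how scikit-learn implements it; the helper `stratified_split` and the toy labels are assumptions made up for illustration. Each class is shuffled and split separately, so both sets end up with the same fraud rate.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_idx, test_idx) with every class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced target (hypothetical): 90 legitimate rows (0), 10 fraudulent rows (1).
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"fraud rate in train: {train_rate:.2f}, in test: {test_rate:.2f}")
```

With a purely random split of such a small, imbalanced dataset, the test set could easily contain zero or six fraud rows instead of three.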

## K-Fold Cross Validation  

Important parameters to keep in mind:

*  `n_splits`: This is the number of splits we want to make within our dataset.

*  `shuffle`: This tells us whether we should shuffle our data before splitting into folds.

*  `random_state`: This is the random seed we're setting, similar to train-test-split.

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=2, shuffle = True, random_state = 42)
kf.get_n_splits(X)

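folds = {}

# enumerate keeps a running fold number (starting at 1) instead of resetting it on every iteration
for fold_number, (train, test) in enumerate(kf.split(X), start=1):
    # Store the train and test rows for this fold
    folds[fold_number] = (df.iloc[train], df.iloc[test])
    print('train: %s, test: %s' % (df.iloc[train], df.iloc[test]))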

After completing K-Fold Cross-Validation, we typically compute a score for each fold and then take the average.

In [None]:
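import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier()

# One accuracy score per fold, then the average across folds
scores = cross_val_score(model, X, y, scoring='accuracy', cv=kf, n_jobs=-1)

print(np.mean(scores))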
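The fold mechanics can be sketched in plain Python. This is a simplified illustration with no shuffling, not scikit-learn's code; `kfold_indices` and the per-fold scores are hypothetical. Every row is used for testing exactly once, and the final score is the mean of the per-fold scores.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=3))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={train_idx} test={test_idx}")

# Hypothetical per-fold scores, averaged the same way as np.mean(scores).
fold_scores = [0.90, 0.85, 0.95]
mean_score = sum(fold_scores) / len(fold_scores)
print(f"mean score: {mean_score:.2f}")
```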

## Leave-One-Out Cross Validation


In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score

loo = LeaveOneOut()
loo.get_n_splits(X)


all_preds = []

for train_index, test_index in loo.split(X[:100]):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    correct = y_preds[0] == y_test.values[0]

    all_preds.append(correct)

In [None]:
sum(all_preds)/len(all_preds)

## Train-Test-Split Date Split  

In many instances, you don't want to randomly slice your data into training and testing sets, but instead, you want to split it by time. In this case, you'll want to split by date:

In [None]:
DATE = '2021-12-31'

train_df = df[df['date'] < DATE].copy()
test_df = df[df['date'] >= DATE].copy()

X_train = train_df[features]
X_test = test_df[features]

y_train = train_df['isFraud']
y_test = test_df['isFraud']


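from sklearn.metrics import average_precision_score

model = RandomForestClassifier()

model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# average_precision_score expects the true labels first, then the predictions
print(average_precision_score(y_test, y_preds))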
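`average_precision_score` takes the true labels as its first argument and a score as its second. The sketch below shows roughly what the metric measures, under simplifying assumptions: binary labels, at least one positive, and no tied scores. The function `average_precision` is a hypothetical helper, not the library implementation.

```python
def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive example.

    Assumes binary labels (0/1), at least one positive, and no tied scores.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision among the top `rank` rows
    return sum(precisions) / len(precisions)


y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]  # e.g. predicted probability of the positive class
ap = average_precision(y_true_toy, y_score_toy)
print(round(ap, 4))  # 0.8333
```

Because the metric depends on how rows are ranked, passing predicted probabilities (for example the positive-class column of `model.predict_proba(X_test)`) is usually more informative than passing hard 0/1 predictions.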

## Sliding Window/Time Series KFold

In [None]:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit()

all_scores = []

for train_index, test_index in tscv.split(X):
#     print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    # True labels go first, predictions second
    pr_auc = average_precision_score(y_test, y_preds)

    all_scores.append(pr_auc)


print(all_scores)
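A pure-Python sketch of the index pattern may help here. It assumes the rows are already sorted by time and is meant to approximate the default behaviour of `TimeSeriesSplit` (no gap, no cap on the training size) rather than reproduce its implementation; `expanding_time_splits` is a hypothetical helper. The training window always ends before the test window starts, and it grows with each split.

```python
def expanding_time_splits(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) where training rows always precede test rows."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(expanding_time_splits(n_samples=12, n_splits=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```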

## Expanding Window

In [None]:
class ExpandingWindowCV:
    def fit(self, date_col, date_range = None, custom_range = None):
        self.date_col = date_col
        self.date_range = date_range
        self.custom_range = custom_range

        if date_range is not None and custom_range is not None:
            raise ValueError("Specify either date_range or custom_range, not both.")

    def split(self, df):
        if self.date_range is None:
            # Sort so the training window grows in chronological order
            dates = sorted(set(df[self.date_col].astype(str).values))

        if self.date_range is not None:
            dates = pd.date_range(start=self.date_range[0], end=self.date_range[1])
            dates = [str(d.date()) for d in dates]

        if self.custom_range is not None:
            dates = self.custom_range

        for d in dates:
            df_train = df[df[self.date_col].astype(str) <= d].copy()
            df_test = df[df[self.date_col].astype(str) > d].copy()
            yield df_train, df_test

ew = ExpandingWindowCV()
ew.fit(date_col = 'date', date_range = ['2022-01-02','2022-01-08'])
ew.split(df)

In [None]:
all_scores = []

for train_df, test_df in ew.split(df):
    X_train = train_df[features]
    X_test = test_df[features]

    y_train = train_df['isFraud']
    y_test = test_df['isFraud']


    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    # True labels go first, predictions second
    pr_auc = average_precision_score(y_test, y_preds)

    all_scores.append(pr_auc)

all_scores

## Monte Carlo Cross Validation  

Monte Carlo cross-validation randomly selects a subsample of the dataset (without replacement) as the training set and uses the rest as the testing set. Repeating this N times, with a fresh random split each time, creates a distribution of evaluation scores. Because the splits are drawn independently, the same row can land in the testing set of several iterations.

In [None]:
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
rs.get_n_splits(df)

all_scores = []
for train_index, test_index in rs.split(df):
#     print("TRAIN:", train_index, "TEST:", test_index)

    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    # True labels go first, predictions second
    pr_auc = average_precision_score(y_test, y_preds)

    all_scores.append(pr_auc)

In [None]:
all_scores
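The difference from K-Fold is easier to see in a small pure-Python sketch. It is a simplified stand-in for `ShuffleSplit`, and `monte_carlo_splits` is a hypothetical helper. Each iteration reshuffles all rows independently, so a row can be tested several times or not at all.

```python
import random
from collections import Counter


def monte_carlo_splits(n_samples, n_splits=5, test_size=0.25, seed=0):
    """Yield (train_idx, test_idx) from independent random shuffles."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_size)
    for _ in range(n_splits):
        shuffled = list(range(n_samples))
        rng.shuffle(shuffled)
        # The first n_test shuffled rows form the test set, the rest the training set.
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


splits = list(monte_carlo_splits(n_samples=20))
test_counts = Counter(i for _, test_idx in splits for i in test_idx)

for train_idx, test_idx in splits:
    print("test rows:", test_idx)
print("times each row was tested:", dict(sorted(test_counts.items())))
```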

# Models Classification



https://www.datacamp.com/cheat-sheet/machine-learning-cheat-sheet

| Model                 | Regression | Classification |
| :--------------------: | :--------: | :------------: |
| Linear Regression     |      X     |                |
| Logistic Regression   |            |        X       |
| Ridge Regression      |      X     |                |
| Lasso Regression      |      X     |                |
| K-Nearest Neighbours  |      X     |        X       |
| Decision Trees        |      X     |        X       |
| Naïve Bayes           |            |        X       |
| SVM                   |      X     |        X       |
| Random Forest         |      X     |        X       |
| XGBoost               |      X     |        X       |

## Decision Tree

In [None]:
# Load libraries
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [None]:
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.NameOfTheTarget # Target variable

In [None]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

In [None]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

*  **criterion : string, optional (default=”gini”), the attribute selection measure.** This parameter selects the measure used to evaluate candidate splits. Supported criteria include “gini” for the Gini index and “entropy” for information gain.

*  **splitter : string, optional (default=”best”), the split strategy.** This parameter selects the split strategy. Supported strategies are “best” to choose the best split and “random” to choose the best of a set of randomly drawn splits.

*  **max_depth : int or None, optional (default=None), the maximum depth of the tree.** If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples. A larger maximum depth tends to cause overfitting, and a smaller one tends to cause underfitting.
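The two `criterion` options are both impurity measures of the labels in a node. A minimal sketch using the standard formulas is given below; the helper names are made up for illustration. A tree prefers splits whose child nodes have lower impurity than the parent.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the node's samples that belongs to each class."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k))."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


pure = [1, 1, 1, 1]
mixed = [0, 0, 1, 1]
print(gini(pure), entropy(pure))    # a pure node has zero impurity
print(gini(mixed), entropy(mixed))  # a 50/50 node is maximally impure for two classes
```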

In [None]:
# Optimizing Decision Tree Performance

In [None]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
#import cross validation score
from sklearn.model_selection import cross_val_score

## Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state =32)

dt_accuracy = cross_val_score(dt,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
dt_f1 = cross_val_score(dt,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('dt_accuracy: ' +str(dt_accuracy))
print('dt F1 score: '+str(dt_f1))
print('dt_accuracy_avg: ' + str(dt_accuracy.mean()) +'  |  dt_f1_avg: '+str(dt_f1.mean())+'\n')

In [None]:
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Define a parameter grid with distributions of possible parameters to use
DT_param_grid = {
    'criterion' : ['gini', 'entropy'],
    'splitter' : ['best', 'random'],
    'max_depth' : [1, 2, 3, 4, 5, 6,7,8,9,10],
    'min_samples_split' : [2, 3, 4, 5, 6, 7, 8, 9, 10],
}

# Create the cross validation object
KFold_cv = KFold(n_splits=10, shuffle=True, random_state=2309805)

# Instantiate RandomizedSearchCV()
DT_model = RandomizedSearchCV(
    estimator = DecisionTreeClassifier(random_state = 2309805),
    n_iter = 300,
    param_distributions = DT_param_grid,
    cv = KFold_cv,
    verbose = 0,
    scoring = 'recall',
    n_jobs=-1,
    refit=True)

# Fit the object to our data
DT_model.fit(X_train, y_train)
DT_y_pred = DT_model.predict(X_test)

# Print the best parameters and the best cross-validated recall
print("Best parameters found: ", DT_model.best_params_)
print("\nBest recall found: ", DT_model.best_score_)

## Random Forest

In [None]:
# Modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
# Visualizing the Results

# Export the first three decision trees from the forest
for i in range(3):
    tree = rf.estimators_[i]
    dot_data = export_graphviz(tree,
                               feature_names=X_train.columns,
                               filled=True,
                               max_depth=2,
                               impurity=False,
                               proportion=True)
    graph = graphviz.Source(dot_data)
    display(graph)

In [None]:
# Create a series containing feature importances from the fitted model and feature names from the training data
import pandas as pd

feature_importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Plot a simple bar chart
feature_importances.plot.bar();

In [None]:
# Define a parameter grid with distributions of possible parameters to use
RF_param_grid = {'bootstrap': [True, False],
                 'max_depth': range(2,20,2),
                 'max_features': ['log2', 'sqrt'],
                 'min_samples_leaf': [1, 2, 4],
                 'min_samples_split': [2, 5, 10],
                 'criterion' : ['gini', 'entropy'],
                 'n_estimators': [50,100, 200, 300]}

KFold_cv = KFold(n_splits=10, shuffle=True, random_state=2309805)

# Instantiate RandomizedSearchCV() with the estimator and the parameter distributions
RF_model = RandomizedSearchCV(
    estimator = RandomForestClassifier(random_state = 2309805),
    n_iter = 250,
    param_distributions = RF_param_grid,
    random_state = 2309805,
    cv = KFold_cv,
    verbose = 0,
    scoring = 'recall',
    n_jobs=2,
    refit=True)

# Fit the object to our data
RF_model.fit(X_train, y_train)
RF_y_pred = RF_model.predict(X_test)

# Print the best parameters and the best cross-validated recall
print("Best parameters found: ", RF_model.best_params_)
print("Best recall found: ", RF_model.best_score_)

## Logistic Regression

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)

In [None]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16)

# fit the model with data
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

In [None]:
# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
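For binary labels, scikit-learn's confusion matrix is laid out as `[[TN, FP], [FN, TP]]` (rows are true classes, columns are predicted classes). The sketch below counts the same four cells by hand on made-up toy labels and derives the usual metrics from them; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]
tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)

accuracy_toy = (tp + tn) / (tp + tn + fp + fn)
precision_toy = tp / (tp + fp)  # of the rows predicted positive, how many were positive
recall_toy = tp / (tp + fn)     # of the positive rows, how many were found
f1_toy = 2 * precision_toy * recall_toy / (precision_toy + recall_toy)

print([[tn, fp], [fn, tp]])
print(accuracy_toy, precision_toy, recall_toy, f1_toy)
```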

In [None]:
#import cross validation score
from sklearn.model_selection import cross_val_score

## Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=32, max_iter = 2000, class_weight = 'balanced')

lr_accuracy = cross_val_score(lr,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
lr_f1 = cross_val_score(lr,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('lr_accuracy: ' +str(lr_accuracy))
print('lr F1 score: '+str(lr_f1))
print('lr_accuracy_avg: ' + str(lr_accuracy.mean()) +'  |  lr_f1_avg: '+str(lr_f1.mean())+'\n')

In [None]:
# load libraries
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

In [None]:
# Create the grid
LR_param_grid = {'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-4, 4, 20),
    'solver' : ['liblinear']}

# Create the cross validation object
KFold_cv = KFold(n_splits=10, shuffle=True, random_state=2309805)

# Instantiate the grid search object
LR_model = GridSearchCV(
    estimator = LogisticRegression(random_state = 2309805),
    param_grid = LR_param_grid,
    scoring = 'recall',
    n_jobs=2,
    cv = KFold_cv,
    refit = True,
    verbose = 0,
    return_train_score = True)

#Fit the object to our data
LR_model.fit(X_train, y_train)
# Make predictions
LR_y_pred = LR_model.predict(X_test)

# Print the best parameters and the best cross-validated recall
print("Best parameters found: ", LR_model.best_params_)
print("\nBest Recall found: ", LR_model.best_score_)

## KNN

In [None]:
## KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [None]:
# Split the data into features (X) and target (y)
X = df.drop('fraud', axis=1)
y = df['fraud']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
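The reason for calling `fit_transform` on the training set but only `transform` on the test set can be shown with a single column. This is a minimal sketch with made-up values and hypothetical helpers. The mean and standard deviation are learned from the training data alone and then reused, so nothing about the test data influences the scaling.

```python
import statistics


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(column), statistics.pstdev(column)


def transform(column, mean, std):
    """Standardize values with previously learned statistics."""
    return [(value - mean) / std for value in column]


train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 100.0]

mean, std = fit_scaler(train_amounts)               # statistics come from train only
train_scaled = transform(train_amounts, mean, std)
test_scaled = transform(test_amounts, mean, std)    # reuse them on the test set

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction; the test column generally does not, and that is expected.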

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

In [None]:
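# Predict with the fitted KNN model so the metrics below score this model
y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)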
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
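# Using Cross Validation to Get the Best Value of k
from sklearn.pipeline import make_pipeline

k_values = list(range(1, 31))
scores = []

for k in k_values:
    # Scaling inside the pipeline refits the scaler on each training fold,
    # so no information from the validation fold leaks into the scaling
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    score = cross_val_score(knn, X, y, cv=5)
    scores.append(np.mean(score))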


# plot
sns.lineplot(x = k_values, y = scores, marker = 'o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")

In [None]:
#import cross validation score
from sklearn.model_selection import cross_val_score

## KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline, Pipeline

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_accuracy = cross_val_score(knn,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
knn_f1 = cross_val_score(knn,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('knn_accuracy: ' +str(knn_accuracy))
print('knn F1 score: '+str(knn_f1))
print('knn_accuracy_avg: ' + str(knn_accuracy.mean()) +'  |  knn_f1_avg: '+str(knn_f1.mean()))

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Define a parameter grid with distributions of possible parameters to use
KNN_param_grid = {
    "n_neighbors": np.linspace(1, 30, 30).astype(int),
    "algorithm" : ['auto', 'ball_tree', 'kd_tree', 'brute'],
    "leaf_size": np.linspace(1, 50, 6).astype(int),
    "p": [1,2]
}

# Create the cross validation object
KFold_cv = KFold(n_splits=10, shuffle=True, random_state=2309805)

# Instantiate RandomizedSearchCV() with the estimator and the parameter distributions
KNN_model = RandomizedSearchCV(
    estimator = KNeighborsClassifier(),
    n_iter = 200,
    param_distributions =  KNN_param_grid,
    random_state = 2309805,
    cv = KFold_cv,
    verbose = 0,
    scoring = 'recall',
    n_jobs=2,
    refit=True)

# Fit the object to our data
KNN_model.fit(X_train, y_train)
KNN_y_pred = KNN_model.predict(X_test)

# Print the best parameters and the best cross-validated recall
print("Best parameters found: ", KNN_model.best_params_)
print("Best recall found: ", KNN_model.best_score_)

## Naïve Bayes

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=125
)

In [None]:
from sklearn.naive_bayes import GaussianNB

# Build a Gaussian Classifier
model = GaussianNB()

# Model training
model.fit(X_train, y_train)

# Predict Output for a single test row (assumes X_test and y_test are pandas objects)
predicted = model.predict(X_test.iloc[[6]])

print("Actual Value:", y_test.iloc[6])
print("Predicted Value:", predicted[0])

In [None]:
#import Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB

#create classifier object
nb = GaussianNB()

nb_accuracy = cross_val_score(nb,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
nb_f1 = cross_val_score(nb,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('nb_accuracy: ' +str(nb_accuracy))
print('nb F1 score: '+str(nb_f1))
print('nb_accuracy_avg: ' + str(nb_accuracy.mean()) +'  |  nb_f1_avg: '+str(nb_f1.mean()))

## SVM

In [None]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=109) # 70% training and 30% test


In [None]:
#Import svm model
from sklearn import svm

In [None]:
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)


In [None]:
# Define a parameter grid with distributions of possible parameters to use
SVM_param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
    "gamma": [0.00001, 0.0001, 0.001, 0.01, 0.1],
}

# Create the cross validation object
KFold_cv = KFold(n_splits=10, shuffle=True, random_state=2309805)

# Instantiate RandomizedSearchCV() with clf and the parameter grid
SVM_model = RandomizedSearchCV(
    estimator = svm.SVC(random_state = 2309805),
    n_iter = 40,
    param_distributions = SVM_param_grid,
    cv = KFold_cv,
    verbose = 0,
    random_state = 2309805,
    scoring = 'recall',
    n_jobs=2,
    refit=True)

# Fit the object to our data
SVM_model.fit(X_train, y_train)
SVM_y_pred = SVM_model.predict(X_test)

# Print the best parameters and the best cross-validated recall
print("Best parameters found: ", SVM_model.best_params_)
print("\nBest recall found: ", SVM_model.best_score_)

# Models Regression

## Linear Regression

In [None]:
# Load packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

In [None]:
# Set up and fit the linear regressor
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Flatten the predictions and expected values to 1-D arrays so plotly can use them;
# np.ravel works whether y is a Series or a single-column DataFrame
predicted = np.ravel(lin_reg.predict(X_test))
expected = np.ravel(y_test)

In [None]:
%matplotlib inline
# Import plotting package
import plotly.express as px

# Put data to plot in dataframe
df_plot = pd.DataFrame({'expected':expected, 'predicted':predicted})

# Make scatter plot from data
fig = px.scatter(
    df_plot,
    x='expected',
    y='predicted',
    title='Predicted vs. Actual Values')

# Add straight line indicating perfect model
fig.add_shape(type="line",
    x0=0, y0=0, x1=50, y1=50,
    line=dict(
        color="Red",
        width=4,
        dash="dot",
    )
)

# Show figure
fig.show()

In [None]:
# Print the root mean squared error (RMSE)
error = np.sqrt(np.mean((np.array(predicted) - np.array(expected)) ** 2))
print(f"RMSE: {error:.4f}")

r2=r2_score(expected, predicted)
print(f"R2: {round(r2,4)}")
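Both metrics are short formulas, so a hand-rolled version on toy numbers can serve as a sanity check. This is a sketch with made-up values and hypothetical helper names, using the standard definitions: RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math


def rmse(expected, predicted):
    """Root mean squared error."""
    squared_errors = [(p - e) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(expected, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_expected = sum(expected) / len(expected)
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))
    ss_tot = sum((e - mean_expected) ** 2 for e in expected)
    return 1.0 - ss_res / ss_tot


expected_toy = [3.0, 5.0, 7.0, 9.0]
predicted_toy = [2.0, 5.0, 8.0, 9.0]

print(f"RMSE: {rmse(expected_toy, predicted_toy):.4f}")      # RMSE: 0.7071
print(f"R2: {r_squared(expected_toy, predicted_toy):.4f}")   # R2: 0.9000
```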

# Parameter Tuning

## Manual Parameter Tuning

In [None]:
#Knn Model Comparison

#here we will loop through and see which value of k performs the best.

for i in range(1,7):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=i))
    knn_f1 = cross_val_score(knn,X_train,y_train.values.ravel(), cv=3, scoring ='f1')
    print('K ='+(str(i)) + (': ') + str(knn_f1.mean()))

## Randomized Parameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

dt = DecisionTreeClassifier(random_state = 42)

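# Named param_dist so it does not overwrite the earlier `features` column list.
# 'auto' is left out of max_features because recent scikit-learn releases no longer accept it for trees.
param_dist = {'criterion': ['gini','entropy'],
              'splitter': ['best','random'],
              'max_depth': [2,5,10,20,40,None],
              'min_samples_split': [2,5,10,15],
              'max_features': ['sqrt','log2',None]}

rs_dt = RandomizedSearchCV(estimator = dt,
                           param_distributions = param_dist,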
                           n_iter = 100,
                           cv = 3,
                           random_state = 42,
                           scoring = 'f1')

rs_dt.fit(X_train, y_train)

In [None]:
print('best score = ' + str(rs_dt.best_score_))
print('best params = ' + str(rs_dt.best_params_))

## GridsearchCV (Exhaustive Parameter Tuning)

In [None]:
from sklearn.model_selection import GridSearchCV


features_gs = {'criterion': ['entropy'],
            'splitter': ['random'],
           'max_depth': np.arange(30,50,1), #getting more precise within range
           'min_samples_split': [2,3,4,5,6,7,8,9],
           'max_features': [None]}

gs_dt = GridSearchCV(estimator = dt,
                     param_grid = features_gs,
                     cv = 3,
                     scoring ='f1') #we don't need random state because there isn't randomization like before

gs_dt.fit(X_train,y_train)

In [None]:
print('best score = ' + str(gs_dt.best_score_))
print('best params = ' + str(gs_dt.best_params_))
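The practical difference between the exhaustive and randomized searches is how many candidates get evaluated. The sketch below enumerates a small, illustrative parameter space with the standard library only; it does not call scikit-learn, and the space is an assumption chosen to resemble the decision tree grids in this notebook. A grid search tries every combination, while a randomized search draws a fixed budget of them.

```python
import itertools
import random

# Illustrative search space (an assumption, loosely based on the decision tree grids).
param_space = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [2, 5, 10, 20, 40, None],
    "min_samples_split": [2, 5, 10, 15],
}

names = list(param_space)

# Exhaustive grid: every combination of every value.
grid = [dict(zip(names, combo)) for combo in itertools.product(*param_space.values())]

# Randomized search: a fixed budget of combinations drawn without replacement.
rng = random.Random(42)
random_candidates = rng.sample(grid, k=20)

cv_folds = 3
print(len(grid), "grid candidates ->", len(grid) * cv_folds, "model fits")
print(len(random_candidates), "random candidates ->", len(random_candidates) * cv_folds, "model fits")
```

A common workflow is to use the randomized search to find a promising region cheaply and then run a narrower grid search around it.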

## Bayesian Optimization

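Bayesian optimization is an iterative search: it fits a surrogate model of the cross-validated score as a function of the hyperparameters, uses that model to choose the next hyperparameter combination to try, and updates the model with each new result.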

In [None]:
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

In [None]:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# Choose cross validation method
cv = StratifiedKFold(n_splits = 3)


bs_lr = BayesSearchCV(
    dt,
    {'criterion': Categorical(['gini','entropy']),
            'splitter': Categorical(['best','random']),
           'max_depth': Integer(10,50),
           'min_samples_split': Integer(2,15),
           'max_features': Categorical(['sqrt','log2',None])},  # 'auto' is no longer accepted by recent scikit-learn trees
    random_state=42,
    n_iter= 100,
    cv= cv,
    scoring ='f1')

bs_lr.fit(X_train,y_train.values.ravel())

In [None]:
print('best score = ' + str(bs_lr.best_score_))
print('best params = ' + str(bs_lr.best_params_))

# Selecting a model  

We also want to weigh other considerations, such as training time, prediction time, or interpretability, when selecting the best model for our use case.


## Ensemble model  

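Since we have one tuned model, let's see if we can improve it by combining it with a few of the other models we have used. This process is called ensembling. For classification, a common approach is voting: with hard voting the ensemble predicts the class chosen by the majority of the models, and with soft voting it predicts the class with the highest average predicted probability.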

In [None]:
from sklearn.ensemble import VotingClassifier

dt_voting = DecisionTreeClassifier(**{'criterion': 'entropy',
                                      'max_depth': 44,
                                      'max_features': None,
                                      'min_samples_split': 2,
                                      'splitter': 'random'}) # ** allows you to pass in parameters as dict
knn_voting = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
lr_voting = LogisticRegression(random_state=32, max_iter = 2000, class_weight = 'balanced')

ens = VotingClassifier(estimators = [('dt', dt_voting),('knn', knn_voting), ('lr',lr_voting)], voting = 'hard')

In [None]:
voting_accuracy = cross_val_score(ens,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
voting_f1 = cross_val_score(ens,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('voting_accuracy: ' +str(voting_accuracy))
print('voting F1 score: '+str(voting_f1))
print('voting_accuracy_avg: ' + str(voting_accuracy.mean()) +'  |  voting_f1_avg: '+str(voting_f1.mean()))

In [None]:
ens = VotingClassifier(estimators = [('dt', dt_voting), ('knn', knn_voting), ('lr',lr_voting)], voting = 'soft')
voting_accuracy = cross_val_score(ens,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
voting_f1 = cross_val_score(ens,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('voting_accuracy: ' +str(voting_accuracy))
print('voting F1 score: '+str(voting_f1))
print('voting_accuracy_avg: ' + str(voting_accuracy.mean()) +'  |  voting_f1_avg: '+str(voting_f1.mean()))
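The two voting modes can disagree, which is easiest to see on a single sample. Here is a minimal sketch with hypothetical probabilities from three classifiers; `hard_vote` and `soft_vote` are illustrative helpers, not the `VotingClassifier` implementation. Hard voting counts predicted labels, while soft voting averages probabilities, so one very confident model can outweigh two lukewarm ones.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(predicted_probas):
    """Average the per-class probabilities and return the most probable class."""
    n_classes = len(predicted_probas[0])
    averaged = [
        sum(proba[k] for proba in predicted_probas) / len(predicted_probas)
        for k in range(n_classes)
    ]
    return max(range(n_classes), key=lambda k: averaged[k])


# Hypothetical outputs of three classifiers for one sample: [P(class 0), P(class 1)].
probas = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
labels = [max(range(2), key=lambda k: p[k]) for p in probas]  # each model's own prediction

print("hard vote:", hard_vote(labels))  # two of three classifiers say class 0
print("soft vote:", soft_vote(probas))  # the averaged probabilities favour class 1
```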