## Estimating the Cost with Cross-Validation

We mentioned that there are three ways of estimating the cost:

- A domain expert provides the cost
- Balance ratio (we did this in the previous notebook)
- Cross-validation: treat the cost as a hyperparameter and tune it

In this notebook, we will find the cost with a hyperparameter search and cross-validation, passing the candidate costs to the model through `class_weight`.
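
To make the idea of cost concrete: for tree-based models such as the random forest used here, scikit-learn expands a `class_weight` dictionary into per-observation weights, so mistakes on a heavily weighted class count more during training. The minimal sketch below shows that expansion with `compute_sample_weight`. The four-label array `y_toy` is a made-up illustration, not the notebook's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# hypothetical toy labels: three majority (-1) and one minority (1) observation
y_toy = np.array([-1, -1, -1, 1])

# each observation receives the weight assigned to its class
weights = compute_sample_weight(class_weight={-1: 1, 1: 10}, y=y_toy)
print(weights)
```

Searching over several such dictionaries is then the same as searching over candidate misclassification costs for the minority class.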

In [7]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [8]:
# load data
# to speed up the computation, uncomment the next line
# to work with a random sample of 10,000 observations

# data = pd.read_csv('../kdd2004.csv').sample(10000)
data = pd.read_csv('../kdd2004.csv')

data.head()

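|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.0 | 32.69 | 0.3 | 2.5 | 20.0 | 1256.8 | -0.89 | 0.33 | 11.0 | -55.0 | ... | 1595.1 | -1.64 | 2.83 | -2.0 | -50.0 | 445.2 | -0.35 | 0.26 | 0.76 | -1 |
| 1 | 58.0 | 33.33 | 0.0 | 16.5 | 9.5 | 608.1 | 0.5 | 0.07 | 20.5 | -52.5 | ... | 762.9 | 0.29 | 0.82 | -3.0 | -35.0 | 140.3 | 1.16 | 0.39 | 0.73 | -1 |
| 2 | 77.0 | 27.27 | -0.91 | 6.0 | 58.5 | 1623.6 | -1.4 | 0.02 | -6.5 | -48.0 | ... | 1491.8 | 0.32 | -1.29 | 0.0 | -34.0 | 658.2 | -0.76 | 0.26 | 0.24 | -1 |
| 3 | 41.0 | 27.91 | -0.35 | 3.0 | 46.0 | 1921.6 | -1.36 | -0.47 | -32.0 | -51.5 | ... | 2047.7 | -0.98 | 1.53 | 0.0 | -49.0 | 554.2 | -0.83 | 0.39 | 0.73 | -1 |
| 4 | 50.0 | 28.0 | -1.32 | -9.0 | 12.0 | 464.8 | 0.88 | 0.19 | 8.0 | -51.5 | ... | 479.5 | 0.68 | -0.59 | 2.0 | -36.0 | -6.9 | 2.02 | 0.14 | -0.23 | -1 |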


In [9]:
# imbalanced target

data.target.value_counts() / len(data)

target
-1    0.991108
 1    0.008892
Name: count, dtype: float64
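
For reference, the class fractions above imply a majority-to-minority ratio of roughly 111, which is close to the largest minority cost (100) in the grid searched below. The short calculation here is only a sketch: it assumes that "balance ratio" means the majority fraction divided by the minority fraction, and the variable names are illustrative.

```python
# class fractions copied from the value_counts() output above
frac_majority = 0.991108  # target == -1
frac_minority = 0.008892  # target == 1

# assumption: "balance ratio" = majority fraction / minority fraction
balance_ratio = frac_majority / frac_minority
print(f"balance ratio: {balance_ratio:.1f}")

# the corresponding class_weight dictionary, for comparison with the grid
balance_weight = {-1: 1, 1: round(balance_ratio)}
print(balance_weight)
```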

In [10]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # drop the target
    data['target'],  # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((102025, 74), (43726, 74))

In [11]:
# set up initial random forest

rf = RandomForestClassifier(n_estimators=50,
                            random_state=39,
                            max_depth=2,
                            n_jobs=4,
                            class_weight=None)

In [12]:
# set up parameter search grid
# including class weight

param_grid = {
  'n_estimators': [10, 50, 100],
  'max_depth': [None, 2, 3],
  'class_weight': [None, {-1:1, 1:10}, {-1:1, 1:100}],
}

In [13]:
# grid search over the parameter grid with 2-fold
# cross-validation, using ROC-AUC as the metric

search = GridSearchCV(estimator=rf,
                      scoring='roc_auc',
                      param_grid=param_grid,
                      cv=2,
                      ).fit(X_train, y_train)

In [14]:
search.best_score_

0.9815388355224426

In [15]:
search.best_params_

{'class_weight': {-1: 1, 1: 100}, 'max_depth': 3, 'n_estimators': 50}
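
`best_params_` only reports the winning combination. To see how much the cost actually matters, it can help to look at the cross-validated score of every candidate, which `GridSearchCV` stores in `cv_results_`. The sketch below is an illustration under assumptions: it runs on synthetic data from `make_classification` with labels 0 and 1 (not the KDD data), and the names `X_demo`, `y_demo`, `toy_search` and `rows` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic imbalanced data (assumption: stands in for the real dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)

# candidate costs for the minority class; {0: 1, 1: 1} means no extra cost
toy_grid = {'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 10}, {0: 1, 1: 100}]}

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(
        n_estimators=20, max_depth=3, random_state=39),
    param_grid=toy_grid,
    scoring='roc_auc',
    cv=2,
).fit(X_demo, y_demo)

# pair each candidate with its mean cross-validated ROC-AUC, best first
rows = sorted(
    zip(toy_search.cv_results_['mean_test_score'],
        toy_search.cv_results_['params']),
    key=lambda row: row[0],
    reverse=True)

for score, params in rows:
    print(f"{score:.4f}  {params}")
```

If the scores of the different costs are nearly identical, the choice of cost has little effect on ROC-AUC for that dataset, and another metric may be more informative.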

In [16]:
search.best_estimator_

In [17]:
search.score(X_test, y_test)

0.9860577982166434

In [31]:
# evaluate the best model on the test set

from sklearn import metrics

# GridSearchCV (refit=True by default) has already refit the best
# estimator on the whole training set, so no further fit is needed
rf = search.best_estimator_

pred_prob = rf.predict_proba(X_test)[:,1]
preds = rf.predict(X_test)
print(f"Random Forest ROC AUC: {metrics.roc_auc_score(y_test, pred_prob)}")
print(f"Random Forest Precision: {metrics.precision_score(y_test, preds)}")
print(f"Random Forest Recall: {metrics.recall_score(y_test, preds)}")
print(f"Random Forest F1: {metrics.f1_score(y_test, preds)}")

Random Forest ROC AUC: 0.9860577982166434
Random Forest Precision: 0.4355909694555113
Random Forest Recall: 0.8098765432098766
Random Forest F1: 0.5664939550949915
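
ROC-AUC depends only on how the predicted probabilities rank the observations, whereas precision and recall depend on the default 0.5 decision threshold, which is where the class weight tends to show its effect. The sketch below fits the same shallow forest with and without a minority cost of 100 and prints both metrics side by side. It is an illustration under assumptions: the data are synthetic (from `make_classification`, labels 0 and 1, not the KDD data), all names are hypothetical, and the numbers will differ from those above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced data (assumption: stands in for the real dataset)
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=10, weights=[0.95, 0.05], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0)

toy_results = {}
for name, cw in [('no cost', None), ('cost 100', {0: 1, 1: 100})]:
    model = RandomForestClassifier(
        n_estimators=50, max_depth=3, class_weight=cw, random_state=39)
    model.fit(X_tr, y_tr)
    toy_preds = model.predict(X_te)

    # zero_division=0 avoids a warning if no positives are predicted
    toy_results[name] = {
        'precision': precision_score(y_te, toy_preds, zero_division=0),
        'recall': recall_score(y_te, toy_preds),
    }
    print(name, toy_results[name])
```

A higher minority cost typically trades precision for recall, so which setting is preferable depends on how costly a missed positive is relative to a false alarm.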


**HOMEWORK**

Try other machine learning algorithms and the other datasets available in imbalanced-learn.