## Decision Trees for Class Imbalance with Scikit-Learn

The Kaggle Credit Card Fraud Detection dataset can be downloaded here:
https://github.com/nsethi31/Kaggle-Data-Credit-Card-Fraud-Detection/blob/master/creditcard.csv?raw=true

### How Decision Trees work:

Part 1: https://www.youtube.com/watch?v=O__7lAqni7A
    
Part 2: https://www.youtube.com/watch?v=WpA-Xbw4z_A

In [None]:
%%time
import pandas as pd
data = pd.read_csv('creditcard.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data['Class'].value_counts()

In [None]:
# Imbalance ratio: legitimate transactions (class 0) per fraudulent one (class 1)
284315/492

In [None]:
%%time
# Split data into train and test splits

from sklearn.model_selection import train_test_split

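# convert the DataFrame to a NumPy array (kept under a new name so the cell can be re-run)
values = data.values
# split into inputs and target: the first column (Time) is skipped, the last column (Class) is the target
X, y = values[:, 1:-1], values[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4992, stratify=y)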

In [None]:
# Count the samples of each class in the train and test splits

import numpy as np
unique, counts = np.unique(y_train, return_counts=True)
print(np.asarray((unique, counts)).T)

unique, counts = np.unique(y_test, return_counts=True)
print(np.asarray((unique, counts)).T)

### The class weighting can be defined in multiple ways; for example:

* Domain expertise, determined by talking to subject matter experts
* Tuning, determined by a hyperparameter search such as a grid search
* Heuristic, specified using a general best practice

A common best practice is to weight each class by the inverse of its frequency in the training dataset, so that the rare class receives the larger weight.
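
The inverse-frequency heuristic can be reproduced by hand. The sketch below assumes a small toy label array (`y_toy` is hypothetical, not the fraud data) and uses the formula `n_samples / (n_classes * class_count)`, which is what scikit-learn's `'balanced'` mode computes:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 8 negatives and 2 positives
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

classes, counts = np.unique(y_toy, return_counts=True)

# Inverse class frequency: n_samples / (n_classes * count of each class)
manual_weights = len(y_toy) / (len(classes) * counts)

# The same heuristic as computed by scikit-learn
sklearn_weights = compute_class_weight(
    class_weight='balanced', classes=classes, y=y_toy)

print(manual_weights)
print(sklearn_weights)
```

With these toy labels the minority class gets four times the weight of the majority class, mirroring the 8:2 class ratio.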

In [None]:
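# calculate heuristic class weighting
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# calculate class weighting according to training data
# (recent scikit-learn versions require classes and y as keyword arguments)
weighting = compute_class_weight(
    class_weight='balanced', classes=np.unique(y_train), y=y_train)
print(weighting)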

### For a decision tree:
* *Small Weight:* Less importance, lower impact on node purity.
* *Large Weight:* More importance, higher impact on node purity.
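
To see why a larger weight has more impact on node purity, it may help to compute the Gini impurity of a single node with and without weights. This is a minimal sketch: the `weighted_gini` helper and the node counts are illustrative assumptions, not scikit-learn internals, although the tree criterion likewise works on weighted sample counts.

```python
import numpy as np

def weighted_gini(counts, weights):
    """Gini impurity of a node after scaling each class count by its class weight."""
    weighted = np.asarray(counts, dtype=float) * np.asarray(weights, dtype=float)
    p = weighted / weighted.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical node holding 990 majority samples and 10 minority samples
node_counts = [990, 10]

# Equal weights: the node looks almost pure, so there is little incentive to split it
unweighted = weighted_gini(node_counts, [1.0, 1.0])

# Minority class up-weighted by 99: both classes now carry equal total weight
upweighted = weighted_gini(node_counts, [1.0, 99.0])

print('unweighted Gini = %.4f' % unweighted)
print('up-weighted Gini = %.4f' % upweighted)
```

With equal weights the impurity is about 0.02, while the up-weighted node reaches the two-class maximum of 0.5, so the tree is pushed to keep splitting until the minority samples are isolated.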

In [None]:
%%time
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

weights = {0: weighting[0], 1: weighting[1]}
print(weights)
# define model
# to compare against the unweighted baseline, use class_weight=None instead:
# model = DecisionTreeClassifier()
model = DecisionTreeClassifier(class_weight=weights)
# fit a model
model.fit(X_train, y_train)

# predict probabilities
probs = model.predict_proba(X_test)
# keep probabilities for the positive outcome only
pos_probs = probs[:, 1]

auc = roc_auc_score(y_test, pos_probs)

# summarize performance
print(' ROC AUC = %.3f' % auc)

In [None]:
%%time
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# candidate class weightings: the balanced heuristic plus three hand-picked settings
weights = [
    {0: weighting[0], 1: weighting[1]},
    {0: 0.7, 1: 300},
    {0: 22, 1: 105},
    {0: 50, 1: 150},
]
for w in weights:
    # define model with the current class weighting
    model = DecisionTreeClassifier(class_weight=w)
    # fit a model
    model.fit(X_train, y_train)

    # predict probabilities
    probs = model.predict_proba(X_test)
    # keep probabilities for the positive outcome only
    pos_probs = probs[:, 1]

    auc = roc_auc_score(y_test, pos_probs)

    # summarize performance for this weighting
    print('%s ROC AUC = %.3f' % (w, auc))
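
The manual loop over candidate weightings could also be expressed as a cross-validated grid search, which scores each candidate on held-out folds of the training data instead of on the test set. The sketch below is one possible setup, not the notebook's method: it assumes a synthetic stand-in dataset from `make_classification` (so it runs without `creditcard.csv`), and the candidate weights are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fraud data (assumption): roughly 5% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0)

# Illustrative candidates: no weighting, the balanced heuristic, two manual ratios
param_grid = {
    'class_weight': [None, 'balanced', {0: 1, 1: 10}, {0: 1, 1: 100}],
}

# Stratified folds keep the class ratio the same in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid, scoring='roc_auc', cv=cv)
search.fit(X_demo, y_demo)

# Report the mean cross-validated ROC AUC of every candidate
for params, score in zip(search.cv_results_['params'],
                         search.cv_results_['mean_test_score']):
    print(params, 'mean ROC AUC = %.3f' % score)
print('best:', search.best_params_)
```

To apply this to the notebook's data, `X_demo` and `y_demo` would be replaced with `X_train` and `y_train`, leaving the test split untouched for a final evaluation.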