# Dealing with imbalanced classes. 

* using a combination of **hyperopt** and the **XGBoost-Classifier**. To see my experiments with a Neural Network Autoencoder click here. 


* Using the Kaggle **Credit Card Fraud Detection** dataset. 


* Quoting from Kaggle: 

    - The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.
    
    - Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
    
    - Features V1-V28 are anonymized. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount.
    
    
* XGBoost 

    - Is known for its excellent speed & performance (its operations are parallelizable) on a variety of machine learning problems. 
    - Is an ensemble algorithm. It uses CART decision trees as base learners. 
    
    - It is preferred when we have a large number of training samples. As a rule of thumb, the number of training samples should be greater than the number of features. 
    
    - The complete parameter list can be found at https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst
    
    
* Hyperopt

    - Manual search, grid search and random search are common methods used for hyperparameter tuning. 
    - All of these methods are uninformed, i.e. they scan through the entire (grid search) or a randomly selected (random search) set of hyperparameters to select the best one. 
    - Hyperopt is a Bayesian optimization library. It is classified as an 'informed search' method because the scores from previous rounds of the search inform the choice of the next set of hyperparameters to try. 
 
* Another extremely cool advantage of hyperopt is: 
    - There are a lot of machine learning algorithms, and many different configurations for preprocessing / feature engineering. Hyperopt allows us to use an algorithm to search this large set of other algorithms for the best one, without additional input on our part. 
    - The 'algo' parameter is customizable. I will only use the built-in tpe.suggest algorithm (Tree-structured Parzen Estimator). It is known to work well for most use cases. 
    - My dataset does not need much pre-processing. 
    - So I will only try out hyperparameter tuning for XGBoost today. 

# Imports & read file

In [22]:
import pandas as pd
import numpy as np
import plotly_express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
import pickle
import xgboost as xgb

%matplotlib inline
sns.set(style='ticks')
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

data = pd.read_csv("../input/creditcardfraud/creditcard.csv")

# Explore data

In [None]:
data.info()
data.tail()

In [None]:
data.describe()

* 30 feature columns in dataset. 
    - We don't know what V1-V28 are, but we know they have been scaled (the boxplots show a similar range of values, which confirms this).
    - Additionally, 'Time' in seconds, over a period of 2 days, is available. 
    - Majority of transactions are of a smaller 'Amount': the mean is USD 88 and the range is roughly USD 0 - USD 25,691.
    
    
* No null values in dataset. 


* Imbalanced classes can be seen below. 

In [None]:
# Target contains imbalanced classes
data.Class.value_counts()

In [None]:
# Boxplot of variables V1 - V28
plt.figure(figsize=(20,5))
sns.boxplot(data=data.drop(['Time', 'Amount', 'Class'], axis=1))
plt.xticks(rotation=90)
plt.title('Boxplots of features V1 - V28')
plt.xlabel('Feature Name')
plt.ylabel('Value')
plt.show()

#datelist=pd.date_range("00:00:00", "23:59:59", freq="S")
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,5))
ax1.scatter(data.index, data['Time']/(60*60))
ax1.set_xlabel('Index')
ax1.set_ylabel('Time feature (in hours)')
ax1.set_title('Scatterplot of Time feature')
ax2.boxplot(data['Amount'])
ax2.set_xlabel('Amount')
ax2.set_title('Boxplot of Amount feature')

# Preprocessing

In [23]:
# Scaling 'Time' and 'Amount'
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()

data['scaled_amount'] = std_scaler.fit_transform(data['Amount'].values.reshape(-1,1))
data['scaled_time'] = std_scaler.fit_transform(data['Time'].values.reshape(-1,1))

data.drop(['Time','Amount'], axis=1, inplace=True)

In [24]:
X_train, X_rest, y_train, y_rest= train_test_split(data.drop('Class', axis=1), data.Class, 
                                                   test_size=0.5, random_state=123)
X_hold, X_test, y_hold, y_test= train_test_split(X_rest, y_rest, 
                                                   test_size=0.5, random_state=123)
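
The cells above fit the scaler on the full dataset and then split without stratification. A common variation, sketched here on synthetic data, is to stratify the split on the target and fit the scaler on the training part only, so that no statistics from the hold-out rows leak into training. The `demo` frame and its values are hypothetical stand-ins, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical values, ~2% positives).
rng = np.random.default_rng(123)
demo = pd.DataFrame({
    "Amount": rng.exponential(scale=88.0, size=2000),
    "Time": rng.uniform(0, 172800, size=2000),
    "Class": (rng.random(2000) < 0.02).astype(int),
})

# stratify keeps the fraud ratio (nearly) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop("Class", axis=1), demo["Class"],
    test_size=0.5, random_state=123, stratify=demo["Class"])

# Fit the scaler on the training part only, then reuse it on the other part.
scaler = StandardScaler().fit(X_tr[["Amount", "Time"]])
X_tr_scaled = scaler.transform(X_tr[["Amount", "Time"]])
X_te_scaled = scaler.transform(X_te[["Amount", "Time"]])

print(y_tr.mean(), y_te.mean())   # fraud ratio in each part
print(X_tr_scaled.mean(axis=0))   # approximately 0 on the training part
```

With only a few hundred fraud cases in the real data, stratifying also guards against a split that happens to place very few of them in one part.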

# Models

## XGBoost

### Baseline model using XGBClassifier, using common test_train_split

In [25]:
clf = xgb.XGBClassifier(n_estimators=50, seed=123)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_hold)
accuracy = accuracy_score(y_hold, y_pred)
print("accuracy: %f" % (accuracy))
print("As 99.828%% of the data is class 0, an accuracy of %.3f%% is heavily biased and does not tell us much about the quality of our model." % (accuracy * 100))

f1 = f1_score(y_hold, y_pred)
print("f1 score is a better metric: %f" % (f1))

print(confusion_matrix(y_hold, y_pred))
print(classification_report(y_hold, y_pred))

accuracy: 0.999396
As 99.828% of the data is class 0, an accuracy of 99.940% is heavily biased and does not tell us much about the quality of our model.
f1 score is a better metric: 0.833977
[[71051    19]
 [   24   108]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71070
           1       0.85      0.82      0.83       132

    accuracy                           1.00     71202
   macro avg       0.93      0.91      0.92     71202
weighted avg       1.00      1.00      1.00     71202
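
As a worked check, the metrics reported above can be recomputed by hand from the confusion matrix. The sketch below uses only the four counts printed in the baseline output, and also shows what accuracy a model that always predicts class 0 would reach on the same hold-out set.

```python
# Confusion matrix from the baseline output: rows = true class, columns = predicted class.
tn, fp = 71051, 19
fn, tp = 24, 108
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of fraud transactions that were flagged
f1 = 2 * precision * recall / (precision + recall)

# A model that always predicts class 0 still looks very accurate.
always_zero_accuracy = (tn + fp) / total

print(f"accuracy  = {accuracy:.6f}")
print(f"precision = {precision:.6f}")
print(f"recall    = {recall:.6f}")
print(f"f1        = {f1:.6f}")
print(f"always-0 accuracy = {always_zero_accuracy:.6f}")
```

The gap between the model's accuracy and the always-0 accuracy is tiny, which is why precision, recall and f1 for class 1 are the numbers to watch.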



### Sampling & Models : Decreases performance

* **OVERSAMPLING OR UNDERSAMPLING SHOULD ONLY BE APPLIED TO TRAIN.** 

#### Undersampling: Precision very low, Recall high

In [None]:
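# Split the training labels by class
y_0=y_train[y_train==0]
y_1=y_train[y_train==1]
# Randomly keep as many non-fraud rows as there are fraud rows
y_0_under=y_0.sample(n=len(y_1), random_state=123)
# Combine and shuffle, then select the matching feature rows
y_under=pd.concat([y_0_under,y_1]).sample(frac=1, random_state=123)
X_under=X_train.loc[y_under.index]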

In [None]:
# Target is no longer imbalanced. 
pd.DataFrame(y_under)['Class'].value_counts()

In [None]:
clf = xgb.XGBClassifier(n_estimators=50, seed=123)
clf.fit(X_under, y_under)
y_pred = clf.predict(X_hold)
print(confusion_matrix(y_hold, y_pred))
print(classification_report(y_hold, y_pred))

#### Random Oversampling : Precision very low, Recall high

In [None]:
from imblearn.over_sampling import RandomOverSampler, SMOTE
# Fix the seed so the resampling is reproducible
method = RandomOverSampler(random_state=123)
X_over, y_over = method.fit_resample(X_train,y_train)
# Keep the original column names in case fit_resample returns a NumPy array
X_over = pd.DataFrame(X_over, columns=X_train.columns)

In [None]:
# Target is no longer imbalanced. 
pd.DataFrame(y_over)['Class'].value_counts()

In [None]:
clf = xgb.XGBClassifier(n_estimators=50, seed=123)
clf.fit(X_over, y_over)
y_pred = clf.predict(X_hold)
print(confusion_matrix(y_hold, y_pred))
print(classification_report(y_hold, y_pred))

#### Synthetic Minority Oversampling: Precision very low, Recall high.

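* SMOTE is a refinement of Random Oversampling: instead of duplicating minority-class rows, it creates synthetic rows by interpolating between a minority-class row and its nearest minority-class neighbours. It is useful only when the minority-class (fraud) cases are similar to each other.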
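
The core idea of SMOTE can be sketched in a few lines of NumPy: pick a minority-class point, pick one of its nearest minority-class neighbours, and place a new point somewhere on the line between them. This is a simplified illustration with hypothetical points and a hypothetical helper `synthetic_sample`, not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]])

def synthetic_sample(points, k, rng):
    """Create one synthetic point between a random minority point
    and one of its k nearest minority neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip index 0, the point itself
    j = rng.choice(neighbours)
    u = rng.random()                          # interpolation weight in [0, 1)
    return points[i] + u * (points[j] - points[i])

new_points = np.array([synthetic_sample(minority, k=2, rng=rng) for _ in range(5)])
print(new_points)
```

The outlier at (5, 5) shows the limitation: interpolating between it and the cluster near (1, 1) produces points in a region where no real minority case was observed.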

In [None]:
# Fix the seed so the synthetic samples are reproducible
method = SMOTE(random_state=123)
X_over, y_over = method.fit_resample(X_train,y_train)
# Keep the original column names in case fit_resample returns a NumPy array
X_over = pd.DataFrame(X_over, columns=X_train.columns)

In [None]:
# Target is no longer imbalanced. 
pd.DataFrame(y_over)['Class'].value_counts()

In [None]:
clf = xgb.XGBClassifier(n_estimators=50, seed=123)
clf.fit(X_over, y_over)
y_pred = clf.predict(X_hold)
print(confusion_matrix(y_hold, y_pred))
print(classification_report(y_hold, y_pred))

### Vanilla XGBClassifier does significantly better on unaltered data. Under/oversampling techniques decrease performance. 

* This is an interesting result. 

* Random Forests and Logistic Regression are commonly used for imbalanced class data, and they do well with various sampling strategies. Autoencoder neural networks also work very well for modelling imbalanced classes. 

* XGBClassifier does better with the original data. It is faster than most ML algorithms AND needs less preprocessing! 

* Let's improve our XGBClassifier through hyperparameter tuning. 

### Using hyperopt for XGBoostClassifier hyperparameter tuning 

* Parameter spaces can be built by manually entering values or by selecting from one of these distributions.
    - hp.choice(label, options): returns one of the options, which should be a list or tuple.
    - hp.randint(label, upper): returns a random integer in the range [0, upper).
    - hp.uniform(label, low, high): returns a value uniformly between low and high.
    - hp.quniform(label, low, high, q): returns round(uniform(low, high) / q) * q, i.e. a value rounded to a multiple of q. The result is still a float (e.g. 5.0), so it must be cast with int() for integer parameters such as max_depth.
    - hp.normal(label, mu, sigma): returns a real value that is normally distributed with mean mu and standard deviation sigma.
    
    
* The function optimized by hyperopt is always minimized, so to maximize a metric of our choice we return 1 - metric as the loss. The code below maximizes recall, so it returns 1 - recall. 


* An odd feature: parameter names have to be provided twice in the parameter selection space, once as the dictionary key and once as the label passed to the hp function. 


* XGBoost uses a special matrix type called a DMatrix. xgb.XGBClassifier automatically converts the data to a DMatrix, but when using the native API (e.g. xgb.train, or xgb.cv for cross validation), we must explicitly convert the data to a DMatrix first. 


* The Trials() object stores data as the hyperopt algorithm progresses. It allows us to learn a few details about the internal working of the hyperopt algorithm. Using a Trials() object is optional; to record anything, it must be passed to fmin through the trials argument. 
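
Two of the points above are easy to check with plain Python: what the quniform formula produces, and how a score to maximize is turned into a loss. The helper `quniform_like` is a hypothetical stand-in that mimics the documented formula; it is not hyperopt's own code.

```python
import random

def quniform_like(low, high, q, rng):
    """Mimic round(uniform(low, high) / q) * q; like hyperopt, return a float."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(123)
samples = [quniform_like(3, 18, 1, rng) for _ in range(5)]
print(samples)                 # whole-number floats between 3.0 and 18.0

# Cast before passing to an integer parameter such as max_depth.
max_depth = int(samples[0])

# fmin minimizes, so a score to maximize is returned as 1 - score.
recall = 0.8636363636363636    # recall printed by the first tuning trial in this notebook
loss = 1 - recall
print(max_depth, loss)
```

The float result is why the tuning function wraps `space['max_depth']` in `int()` before handing it to the classifier.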



In [None]:
# The code below takes a long time to run (~30 mins with 8 cores, GPU acceleration),
# so it was run once rather than on every commit.
# defining the space for hyperparameter tuning
space={'max_depth': hp.quniform("max_depth", 3, 18, 1),'gamma': hp.uniform ('gamma', 1,9),
       'reg_alpha' : hp.quniform('reg_alpha', 1,180,1),
       'reg_lambda' : hp.uniform('reg_lambda', 0,1),
       'colsample_bytree' : hp.uniform('colsample_bytree', 0.5,1),
       'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
       }

def hyperparameter_tuning(space):
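    # reg_lambda is sampled in the space, so it has to be passed to the model as well.
    clf=xgb.XGBClassifier(max_depth = int(space['max_depth']), gamma = space['gamma'],
                         reg_alpha = int(space['reg_alpha']), reg_lambda = space['reg_lambda'],
                         min_child_weight=space['min_child_weight'],
                         colsample_bytree=space['colsample_bytree'],eta= 0.8, nthread=-1, 
                         scale_pos_weight = np.sqrt(sum(y_train==0)/sum(y_train==1)),
                          n_estimators=50, random_state=123)
    # Early stopping monitors the last entry of eval_set (the hold-out set).
    evaluation = [( X_train, y_train), ( X_hold, y_hold)]
    
    # Note: recent XGBoost releases expect eval_metric and early_stopping_rounds
    # in the XGBClassifier constructor rather than in fit().
    clf.fit(X_train, y_train,
            eval_set=evaluation, eval_metric="rmse",
            early_stopping_rounds=10,verbose=False)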

    y_pred = clf.predict(X_hold)
    recall = recall_score(y_hold, y_pred)
    print(classification_report(y_hold, y_pred))
    # We want to ensure that every fraud case is reported. 
    # Even if we tradeoff a large number of false positives. 
    # So we will maximize recall (ie. minimize 1-recall)
    print ("Recall:", recall)
    return {'loss': 1-recall, 'status': STATUS_OK }


# Run the hyperparameter tuning
trials = Trials()

# Pass the Trials object to fmin, otherwise nothing is recorded in it.
best = fmin(fn=hyperparameter_tuning,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            trials=trials)

print (best)

              precision    recall  f1-score   support  

           0       1.00      1.00      1.00     71070
           1       0.70      0.86      0.77       132

    accuracy                           1.00     71202
   macro avg       0.85      0.93      0.89     71202
weighted avg       1.00      1.00      1.00     71202

Recall:                                                
0.8636363636363636                                     
  1%|          | 1/100 [00:47<1:18:27, 47.55s/trial, best loss: 0.13636363636363635]
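
The tuning function sets scale_pos_weight to the square root of the negative-to-positive ratio. As a rough worked example, the sketch below uses the class counts of the full dataset quoted in this notebook; the tuning code computes the same ratio on y_train, so its exact value will differ slightly.

```python
import math

# Class counts for the full dataset, as quoted in this notebook
# (the tuning code computes the ratio on y_train instead).
n_negative = 284315
n_positive = 492

ratio = n_negative / n_positive   # the plain negative/positive ratio
damped = math.sqrt(ratio)         # the square-root variant used in the tuning code

print(f"ratio = {ratio:.1f}, sqrt(ratio) = {damped:.1f}")
```

The square root gives a much smaller weight than the plain ratio, so errors on fraud cases are still penalized more heavily, but far less aggressively than the raw class ratio would.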

In [None]:
# Best output from hyperparameter tuning.
"""
              precision    recall  f1-score   support                              

           0       1.00      1.00      1.00    142147
           1       0.90      0.80      0.85       257

    accuracy                           1.00    142404
   macro avg       0.95      0.90      0.92    142404
weighted avg       1.00      1.00      1.00    142404

Recall:                                                                            
0.8015564202334631                                                                 
"""

In [None]:
# Model with best parameters (80% recall, 85% f1 score)
clf = xgb.XGBClassifier(colsample_bytree= 0.7356225471858802,
 gamma = 5.0244632204209765,
 max_depth=15,
 min_child_weight=4,
 reg_alpha = 1,
 reg_lambda = 0.49906388627478493, eta= 0.8, nthread=-1, n_estimators=50, seed=123)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_hold)
print(confusion_matrix(y_hold, y_pred))
print(classification_report(y_hold, y_pred))

# Default parameters in the vanilla baseline model (similar recall of ~80%, lower f1 score of ~82%)
"""
colsample_bytree= 1,
 gamma = 0,
 max_depth = 6,
 min_child_weight = 1,
 reg_alpha = 0,
 reg_lambda = 1
"""

* Even though the baseline XGBClassifier has a similar recall and an f1 score only a few percent lower, the tuned model is preferable because: 
    - it uses L1 (alpha) regularization in addition to L2 (lambda) regularization, as well as a higher gamma value (higher gamma = a greater gain threshold for keeping a split, and thus more pruning). 
    - a fraction of about 0.74 of the features (colsample_bytree) is sampled for each tree in the tuned model. We thus make the model run faster (and introduce regularization) without losing out on performance. 
    - The tuned model builds deeper trees (max depth 15 vs 6). A greater number of interactions can be captured by the tuned model. 
    - A higher weight threshold (min_child_weight 4 vs 1) is used to allow/deny subdivision of a node in the tuned model. 
    
    
* Further tuning with a lower learning rate (eta) can be done to narrow down on a better model. 

* Finally, the decision threshold should be changed to ensure better recall, if possible. We see in the current precision-recall curve that there is a sharp drop in precision if we require >80% recall. But we also see that the mean amount of the 492 fraud transactions is higher than the mean of the 284,315 non-fraud transactions. Considering the higher losses, the bank may want to increase recall to ~99% even if it means more calls to customers to check on legitimate transactions flagged as fraud.
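
One way to pick such a threshold is to read it off the precision-recall curve: take the highest threshold that still reaches the desired recall. The sketch below does this on synthetic labels and scores, which are hypothetical; with the real model the scores would be the predicted fraud probabilities on the hold-out set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels and scores (hypothetical): positives tend to score higher.
rng = np.random.default_rng(123)
y_true = (rng.random(5000) < 0.05).astype(int)
scores = np.clip(rng.normal(loc=0.2 + 0.5 * y_true, scale=0.15), 0, 1)

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions/recalls have one more entry than thresholds; drop the last point
# (recall 0, precision 1) so the arrays line up.
target_recall = 0.95
ok = recalls[:-1] >= target_recall
chosen = thresholds[ok].max()   # highest threshold that still meets the target

# Flag a transaction as fraud when its score reaches the chosen threshold.
y_flag = (scores >= chosen).astype(int)
achieved_recall = y_flag[y_true == 1].mean()
print(chosen, achieved_recall)
```

Choosing the highest qualifying threshold keeps precision as high as possible for the recall the bank insists on.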

In [None]:
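# Plot precision recall curve for current best model. 
# Use predicted probabilities for the fraud class; hard 0/1 labels give only a single operating point.
y_scores = clf.predict_proba(X_hold)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_hold, y_scores)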
plt.plot(recalls, precisions, "b-", linewidth=2)
plt.xlabel("Recall", fontsize=16)
plt.ylabel("Precision", fontsize=16)
plt.axis([0, 1, 0, 1])
plt.grid(True)
plt.show()

In [None]:
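# Plot mean Amounts of non-fraud (label 0) and fraud (label 1) transactions
# 'Amount' was dropped from data during preprocessing, so reload the two columns needed here.
raw = pd.read_csv("../input/creditcardfraud/creditcard.csv", usecols=['Amount', 'Class'])
sns.catplot(x='Class', y='Amount', data=raw, kind='bar', ci=None)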
plt.title('Mean Amounts for non-fraud (label 0) and fraud (label 1) transactions' )