# Dealing with imbalanced classes. 

* Using a combination of **hyperopt** and the **XGBoost classifier**. My experiments with a Neural Network Autoencoder are covered separately.


* Using the Kaggle **Credit Card Fraud Detection** dataset. 


* Quoting from Kaggle: 

    - The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.
    
    - Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
    
    - Features V1-V28 are anonymized. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount.
    
    
* XGBoost 

    - Is known for its excellent speed and performance (its operations are parallelizable) on a variety of machine learning problems.
    - Is an ensemble algorithm. It uses CART decision trees as base learners.

    - It is preferred when we have a large number of training samples. As a rule of thumb, the number of training samples should be greater than the number of features.
    
    - The complete parameter list can be found at https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst
    
    
* Hyperopt

    - Manual search, grid search and random search are common methods for hyperparameter tuning.
    - All of these methods are uninformed, i.e. they scan through the entire set of hyperparameter combinations (grid search) or a randomly selected subset of it (random search) without using the scores already obtained.
    - Hyperopt is a Bayesian optimization library. It is classified as an 'informed search' method because the scores from previous rounds of the search inform the choice of the next, hopefully better, set of hyperparameters.
 
* Another extremely cool advantage of hyperopt is: 
    - There are a lot of machine learning algorithms, and many different configurations for preprocessing and feature engineering. Hyperopt allows us to search this large set of algorithms and configurations for the best one, without additional input on our part.
    - The 'algo' parameter is customizable. I will only use the built-in tpe.suggest (Tree-structured Parzen Estimator) algorithm. It is known to work well for most use cases.
    - My dataset does not need much pre-processing. 
    - So I will only try out hyperparameter tuning for XGBoost today. 
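
To make the 'uninformed' label concrete, here is a minimal sketch of random search using only the standard library. The toy objective `toy_loss` and the search range are assumptions made up for illustration; they are not part of the fraud model. Each candidate is drawn without looking at earlier scores, which is exactly the information an informed method such as hyperopt's TPE would reuse.

```python
import random

def toy_loss(x):
    # Hypothetical objective with its minimum at x = 2
    return (x - 2.0) ** 2

def random_search(loss_fn, low, high, n_trials, seed=123):
    """Uninformed search: every candidate is sampled independently."""
    rng = random.Random(seed)
    best_x, best_loss = None, float("inf")
    history = []
    for _ in range(n_trials):
        x = rng.uniform(low, high)   # ignores all previous results
        loss = loss_fn(x)
        history.append((x, loss))
        if loss < best_loss:
            best_x, best_loss = x, loss
    return best_x, best_loss, history

best_x, best_loss, history = random_search(toy_loss, -10.0, 10.0, n_trials=100)
print("best x: %f, best loss: %f" % (best_x, best_loss))
```

The `history` list plays the role that a record of past trials plays in an informed search: random search stores it but never consults it when choosing the next candidate.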

## Imports & read file

In [1]:
import pandas as pd
import numpy as np
import plotly_express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
import pickle
import xgboost as xgb

%matplotlib inline
sns.set(style='ticks')
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

In [2]:
data = pd.read_csv('creditcard.csv')

## Explore data

In [None]:
data.info()
data.tail()

In [None]:
data.describe()

* 30 feature columns in dataset. 
    - We don't know what V1-V28 are, but they appear to have been scaled (the boxplots below show a similar range of values, which supports this).
    - Additionally, 'Time' in seconds, over a period of 2 days, is available.
    - The majority of transactions have a small 'Amount': the mean is about USD 88, and the range is roughly USD 0 to USD 25,691.
    
    
* No null values in dataset. 


* Imbalanced classes can be seen below. 

In [None]:
# Target contains imbalanced classes
data.Class.value_counts()

In [None]:
# Boxplot of variables V1 - V28
plt.figure(figsize=(20,5))
sns.boxplot(data=data.drop(['Time', 'Amount', 'Class'], axis=1))
plt.xticks(rotation=90)
plt.title('Boxplots of features V1 - V28')
plt.xlabel('Feature Name')
plt.ylabel('Value')
plt.show()

In [None]:
#datelist=pd.date_range("00:00:00", "23:59:59", freq="S")
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,5))
ax1.scatter(data.index, data['Time']/(60*60))
ax1.set_xlabel('Index')
ax1.set_ylabel('Time feature (in hours)')
ax1.set_title("Scatterplot of 'Time' feature")
ax2.boxplot(data['Amount'])
ax2.set_xlabel('Amount')
ax2.set_title("Boxplot of 'Amount' feature")

## Preprocessing

In [57]:
# Scaling 'Time' and 'Amount'
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()

data['scaled_amount'] = std_scaler.fit_transform(data['Amount'].values.reshape(-1,1))
data['scaled_time'] = std_scaler.fit_transform(data['Time'].values.reshape(-1,1))

data.drop(['Time','Amount'], axis=1, inplace=True)
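
StandardScaler standardizes a column by subtracting its mean and dividing by its population standard deviation. Below is a minimal standard-library sketch of the same arithmetic, run on a handful of made-up amounts rather than the real data (the helper name `standardize` is hypothetical):

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return [(v - mean) / std for v in values]

amounts = [1.0, 5.0, 20.0, 88.0, 250.0]   # hypothetical transaction amounts
scaled = standardize(amounts)
print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, which puts 'Amount' and 'Time' on a scale comparable to the V columns.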

## Models

### Baseline model using XGBClassifier, using a plain train_test_split

In [58]:
X_train, X_test, y_train, y_test= train_test_split(data.drop('Class', axis=1), data.Class, 
                                                   test_size=0.5, random_state=123)

clf = xgb.XGBClassifier(n_estimators=10, seed=123)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: %f" % (accuracy))
print('As 99.828% of the data is class 0, the 99.945% heavily biased accuracy does not tell us much about the quality of our model .')
f1 = f1_score(y_test, y_pred)
print("f1 score is a better metric: %f" % (f1))

accuracy: 0.999445
As 99.828% of the data is class 0, the 99.945% heavily biased accuracy does not tell us much about the quality of our model .
f1 score is a better metric: 0.840404
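
To see why accuracy is misleading here, consider a hypothetical classifier that never flags fraud. Using the class counts quoted from Kaggle (492 frauds out of 284,807 transactions), this sketch computes its accuracy and F1 by hand:

```python
n_total = 284807          # transactions in the dataset (from the Kaggle description)
n_fraud = 492             # positive class
n_legit = n_total - n_fraud

# A classifier that always predicts class 0
tp = 0                    # frauds caught
fp = 0                    # legitimate transactions flagged
fn = n_fraud              # frauds missed
tn = n_legit              # legitimate transactions passed

accuracy = (tp + tn) / n_total
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print("accuracy: %f" % accuracy)
print("f1 score: %f" % f1)
```

The do-nothing model reaches about 99.83% accuracy with an F1 of 0, which is why F1 is the metric tracked in the rest of this notebook.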


### Sampling & Models : Decreases performance

* **OVERSAMPLING OR UNDERSAMPLING SHOULD ONLY BE APPLIED TO TRAIN.** 

#### Undersampling : Performs extremely badly

In [60]:
X_train, X_test, y_train, y_test= train_test_split(data.drop('Class', axis=1), data.Class,
                                                   test_size=0.5, random_state=123)

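# Split the training target by class (y_1 must exist before it is used below)
y_0=y_train[y_train==0]
y_1=y_train[y_train==1]
# Randomly keep only as many class-0 rows as there are class-1 rows
y_0_under=y_0.sample(n=len(y_1), random_state=123)
# Combine, shuffle, and select the matching feature rows
y_under=pd.concat([y_0_under,y_1]).sample(frac=1, random_state=123)
X_under=X_train.reindex(y_under.index)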

In [61]:
# Target is no longer imbalanced. 
pd.DataFrame(y_under)['Class'].value_counts()

1    235
0    235
Name: Class, dtype: int64

In [62]:
clf = xgb.XGBClassifier(n_estimators=10, seed=123)
clf.fit(X_under, y_under)
y_pred = clf.predict(X_test)
f1 = f1_score(y_test, y_pred)
print("f1 score: %f" % (f1))

f1 score: 0.059856
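
One plausible reading of this score: a model trained on a 50/50 sample tends to flag far more transactions, and on the real test distribution even a small false positive rate swamps the few true frauds. The sketch below works through that arithmetic. The test-set counts are derived from figures in this notebook (492 frauds in total with 235 in train; 284,315 legitimate transactions in total with 142,168 in train). The recall and false positive rate are hypothetical values chosen for illustration, not measurements from the model above.

```python
n_fraud_test = 492 - 235          # frauds left in the test half
n_legit_test = 284315 - 142168    # legitimate transactions in the test half

recall = 0.90                     # hypothetical: 90% of frauds are caught
false_positive_rate = 0.05        # hypothetical: 5% of legitimate rows are flagged

tp = recall * n_fraud_test
fp = false_positive_rate * n_legit_test
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print("precision: %f" % precision)
print("f1 score: %f" % f1)
```

With these assumed rates, precision falls to a few percent and drags F1 down with it, even though recall is high.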


In [70]:
X_train.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'scaled_amount',
       'scaled_time'],
      dtype='object')

#### Random Oversampling : Reduces f1_score.

In [71]:
X_train, X_test, y_train, y_test= train_test_split(data.drop('Class', axis=1), data.Class,
                                                   test_size=0.5, random_state=123)

from imblearn.over_sampling import RandomOverSampler, SMOTE
method = RandomOverSampler()
X_over, y_over = method.fit_resample(X_train,y_train)
# Restore the original column names on the resampled features
X_over = pd.DataFrame(X_over, columns=X_train.columns)



In [67]:
# Target is no longer imbalanced. 
pd.DataFrame(y_over)[0].value_counts()

1    142168
0    142168
Name: 0, dtype: int64

In [72]:
clf = xgb.XGBClassifier(n_estimators=10, seed=123)
clf.fit(X_over, y_over)
y_pred = clf.predict(X_test)
f1 = f1_score(y_test, y_pred)
print("f1 score: %f" % (f1))

f1 score: 0.585635


#### Synthetic Minority Oversampling: f1_score is even worse. 

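* SMOTE is a refinement of random oversampling: instead of duplicating minority-class rows, it synthesizes new ones by interpolating between a minority-class sample and its nearest minority-class neighbours. This is useful only when the minority-class cases are similar to each other.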
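
The core step of SMOTE can be sketched as a linear interpolation between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified standard-library illustration on made-up 2-D points (one synthetic point, a single nearest neighbour, hypothetical helper names), not the imblearn implementation:

```python
import math
import random

def nearest_neighbour(point, others):
    """Return the point in `others` closest to `point` (Euclidean distance)."""
    return min(others, key=lambda o: math.dist(point, o))

def smote_point(point, neighbour, rng):
    """Create a synthetic sample somewhere on the segment point -> neighbour."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return tuple(p + lam * (n - p) for p, n in zip(point, neighbour))

rng = random.Random(123)
minority = [(1.0, 1.0), (1.5, 2.0), (8.0, 9.0)]   # hypothetical fraud samples

base = minority[0]
neighbour = nearest_neighbour(base, [m for m in minority if m != base])
synthetic = smote_point(base, neighbour, rng)
print("base:", base, "neighbour:", neighbour, "synthetic:", synthetic)
```

If the minority samples are spread out or overlap with the majority class, interpolated points can land in regions that really belong to class 0, which is one possible reason for a weaker F1 score.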

In [23]:
X_train, X_test, y_train, y_test= train_test_split(data.drop('Class', axis=1), data.Class,
                                                   test_size=0.5, random_state=123)

In [73]:
method = SMOTE()
X_over, y_over = method.fit_resample(X_train,y_train)
# Restore the original column names on the resampled features
X_over = pd.DataFrame(X_over, columns=X_train.columns)



In [74]:
# Target is no longer imbalanced. 
pd.DataFrame(y_over)[0].value_counts()

1    142168
0    142168
Name: 0, dtype: int64

In [75]:
clf = xgb.XGBClassifier(n_estimators=10, seed=123)
clf.fit(X_over, y_over)
y_pred = clf.predict(X_test)
f1 = f1_score(y_test, y_pred)
print("f1 score: %f" % (f1))

f1 score: 0.282193


### Vanilla XGBClassifier does significantly better on unaltered data. Under/oversampling techniques decrease performance. 

* This is an interesting result. 

* Random forests and bagging ensemble methods are commonly used for imbalanced-class data. Autoencoder neural networks are also reported to work well for modelling imbalanced classes.

* It will be interesting to see how an autoencoder compares to XGBClassifier.

* But first, let's improve XGBClassifier through hyperparameter tuning.

### Using hyperopt for XGBClassifier hyperparameter tuning 

* Parameter spaces can be built by manually entering values or by selecting from one of these distributions.
    - hp.choice(label, options): returns one of the options, which should be a list or tuple.
    - hp.randint(label, upper): returns a random integer in the range [0, upper).
    - hp.uniform(label, low, high): returns a value drawn uniformly between low and high.
    - hp.quniform(label, low, high, q): returns a value round(uniform(low, high) / q) * q, i.e. a uniform draw rounded to the nearest multiple of q. The value is still returned as a float, so it must be cast with int() wherever an integer is required.
    - hp.normal(label, mu, sigma): returns a real value that is normally distributed with mean mu and standard deviation sigma.
    
    
* hyperopt's fmin always minimizes the objective function, so we return 1 - f1_score in order to maximize F1, our metric of choice.


* An odd feature: parameter names have to be provided twice in the parameter selection space, once as the dictionary key and once as the label passed to the hp.* function.


* XGBoost uses a special matrix type called a DMatrix. The scikit-learn wrapper xgb.XGBClassifier converts the data to a DMatrix automatically, but when using the native API (for example xgb.cv for cross validation) or more complicated code, we must explicitly convert the data to a DMatrix first.


* The Trials object stores the result of every evaluation as the hyperopt search progresses. It allows us to inspect a few details of the internal working of the search afterwards. Passing a Trials object to fmin is optional.
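
The hp.quniform formula listed above is easy to check by hand. This standard-library sketch mimics it (the helper name `quniform_sketch` is made up for illustration; it is not hyperopt's implementation) and shows why values such as max_depth are wrapped in int() before being passed to XGBoost:

```python
import random

def quniform_sketch(rng, low, high, q):
    """Mimic round(uniform(low, high) / q) * q, returned as a float."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(123)
draws = [quniform_sketch(rng, 3, 18, 1) for _ in range(5)]
print(draws)                      # floats that are whole multiples of q
print([int(d) for d in draws])    # cast before use as an integer parameter
```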

In [None]:
# Search space for hyperparameter tuning.
# hp.quniform returns floats, so integer parameters are cast with int() below.
space = {'max_depth': hp.quniform('max_depth', 3, 18, 1),
         'gamma': hp.uniform('gamma', 1, 9),
         'reg_alpha': hp.quniform('reg_alpha', 40, 180, 1),
         'reg_lambda': hp.uniform('reg_lambda', 0, 1),
         'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
         'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
         'n_estimators': 180}

def hyperparameter_tuning(space):
    clf = xgb.XGBClassifier(n_estimators=space['n_estimators'],
                            max_depth=int(space['max_depth']),
                            gamma=space['gamma'],
                            reg_alpha=int(space['reg_alpha']),
                            reg_lambda=space['reg_lambda'],  # was sampled but never passed
                            min_child_weight=space['min_child_weight'],
                            colsample_bytree=space['colsample_bytree'],
                            n_jobs=-1)  # use all cores
    evaluation = [(X_train, y_train), (X_test, y_test)]

    # Note: recent XGBoost releases expect eval_metric and early_stopping_rounds
    # in the XGBClassifier constructor rather than in fit().
    clf.fit(X_train, y_train,
            eval_set=evaluation, eval_metric="rmse",
            early_stopping_rounds=10, verbose=False)

    y_pred = clf.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    print("F1 SCORE:", f1)
    # hyperopt minimizes the loss, so return 1 - f1 (change the metric if you like)
    return {'loss': 1 - f1, 'status': STATUS_OK}


# Run the hyperparameter tuning; trials records every evaluation
trials = Trials()

best = fmin(fn=hyperparameter_tuning,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            trials=trials)

print(best)

## Best model from hyperopt

In [None]:
# Best hyperparameters found by hyperopt
params = {'colsample_bytree': 0.5653326096522678,
          'gamma': 1.6283396367528893,
          'max_depth': 4,
          'min_child_weight': 0.0,
          'reg_alpha': 55,
          'reg_lambda': 0.26983029795530716}
clf = xgb.XGBClassifier(**params)
evaluation = [(X_train, y_train), (X_test, y_test)]
clf.fit(X_train, y_train, eval_set=evaluation, eval_metric="rmse",
        early_stopping_rounds=10, verbose=False)
# predict() already returns class labels (0 or 1), so no rounding is needed
y_pred = clf.predict(X_test)
# evaluate predictions
f1 = f1_score(y_test, y_pred)
print(f1)