### Exercise 2 (6 Points): Imbalanced Data

##### As already discussed, our toy data set is quite imbalanced (many more non-returns than returns). In the lecture we discussed the concept of under-sampling the training data. Now implement your own code for under-sampling the training data to balance the classes (i.e., do not use sklearn for this).

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import sklearn.metrics as metrics

In [2]:
transactions = pd.read_csv('data/data.csv')
transactions.drop(['Unnamed: 0'], axis=1, inplace=True)
transactions = transactions[['totalAmount','c_0','c_1','c_2','c_3','c_4','c_5', 'returnLabel']]

In [3]:
X = transactions.drop('returnLabel',axis=1)
y = transactions.returnLabel
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

##### Prepare balanced training data

In [4]:
# Number of returns (minority class) in the training split
count_return = int(y_train.sum())
y_train_pd = y_train.to_frame()
# Randomly keep as many non-returns as there are returns (without replacement)
non_return_indices = y_train_pd[y_train_pd.returnLabel == 0].index
random_indices = np.random.choice(non_return_indices, count_return, replace=False)
return_indices = y_train_pd[y_train_pd.returnLabel == 1].index
# Combine both index sets; all indices come from the training split only
under_sample_indices = np.concatenate([return_indices, random_indices])
under_sample = transactions.loc[under_sample_indices]
print("number of rows:", under_sample.shape[0])
print("number of returns:", count_return)

number of rows: 1310
number of returns: 655
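
The under-sampling steps above could also be wrapped in a small reusable function. The following is only a sketch under some assumptions: the function name `undersample_majority`, the `seed` parameter, and the toy data are hypothetical and not part of the exercise, and the index of `y` is assumed to be unique and shared with `X`.

```python
import numpy as np
import pandas as pd


def undersample_majority(X, y, seed=0):
    """Return a class-balanced subset of (X, y) by dropping rows of the larger classes.

    Hypothetical helper: assumes X and y share the same unique index.
    """
    rng = np.random.default_rng(seed)
    counts = y.value_counts()
    n_minority = counts.min()  # size of the smallest class
    kept = []
    for label in counts.index:
        label_idx = y.index[y == label].to_numpy()
        # Draw n_minority rows per class without replacement
        kept.append(rng.choice(label_idx, size=n_minority, replace=False))
    kept = np.concatenate(kept)
    return X.loc[kept], y.loc[kept]


# Toy data (made up for illustration): 8 non-returns (0) and 2 returns (1)
toy_X = pd.DataFrame({"totalAmount": np.arange(10.0)})
toy_y = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="returnLabel")

bal_X, bal_y = undersample_majority(toy_X, toy_y)
print(bal_y.value_counts().to_dict())
```

Passing a fixed `seed` makes the random selection repeatable, which the cell above does not guarantee because it relies on NumPy's global random state.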


###### Learn a RandomForest model on the balanced training data and compare its accuracy on the test data with the accuracy of a model that was learned on the original data.
###### Remark: The test data must not be balanced (it should always reflect the real distribution of the data).

In [5]:
compare_model = RandomForestClassifier(n_estimators=100)
compare_model.fit(X_train, y_train)

pred_compare = compare_model.predict(X_test)

print("Trained with original data")
print(classification_report(y_test,pred_compare))
print("Accuracy with original data:",metrics.accuracy_score(y_test, pred_compare))

Trained with original data
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      9042
           1       0.33      0.23      0.27       258

   micro avg       0.97      0.97      0.97      9300
   macro avg       0.65      0.61      0.63      9300
weighted avg       0.96      0.97      0.96      9300

Accuracy with original data: 0.9656989247311828


In [6]:
X_train_under = under_sample.drop('returnLabel',axis=1)
y_train_under = under_sample.returnLabel

under_model = RandomForestClassifier(n_estimators=100)
under_model.fit(X_train_under, y_train_under)

pred_full_test = under_model.predict(X_test)

print("Trained with balanced data")
print(classification_report(y_test,pred_full_test))
print("Accuracy with undersampled data:",metrics.accuracy_score(y_test, pred_full_test))

Trained with balanced data
              precision    recall  f1-score   support

           0       0.99      0.80      0.89      9042
           1       0.11      0.84      0.19       258

   micro avg       0.80      0.80      0.80      9300
   macro avg       0.55      0.82      0.54      9300
weighted avg       0.97      0.80      0.87      9300

Accuracy with undersampled data: 0.803763440860215
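
To put the two accuracy values in context, it may help to compare them with a trivial baseline that always predicts "no return". The sketch below only redoes arithmetic on the support values printed in the classification reports above; the variable names are hypothetical.

```python
# Support values copied from the classification reports above
n_non_returns = 9042
n_returns = 258
n_test = n_non_returns + n_returns

# Accuracy of a trivial model that always predicts class 0 ("no return")
baseline_accuracy = n_non_returns / n_test
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")  # prints 0.9723
```

Under this reading, the model trained on the original data (accuracy about 0.966) does not beat the trivial baseline on accuracy and finds only 23% of the returns, while the model trained on the under-sampled data has lower accuracy (about 0.804) but a recall of 0.84 on returns. Accuracy alone therefore appears to be a weak basis for comparing the two models on data this imbalanced.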
