# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: it gives good results in training and bad results when the model is applied to real data. Similarly, we could have a model that is too simplistic to capture the signal accurately (underfitting). Such a model never works well.


In [None]:
import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [None]:
# Your code here
data = pd.read_csv("paysim.csv")

In [None]:
data.head()

In [None]:
data.dtypes

In [None]:
data.isnull().sum()

In [None]:
data.describe().T

In [None]:
data.describe(include='O').T

<font color='blue'>Calculate unique values
</font>

In [None]:
for column in data.columns:
    print(f'----- {column} -----')
    print(data[column].unique())

In [None]:
for column in data.columns:
    print(f'----- {column} -----')
    print(data[column].value_counts(ascending=False).head())

<font color='blue'>Histograms of the numeric columns
</font>

In [None]:
list_histo = ['step', 'amount','oldbalanceOrg','newbalanceOrig',
             'oldbalanceDest','newbalanceDest']
for column in list_histo:
    x = data[column]
    bins = 50
    n, bins, patches = plt.hist(x, bins, facecolor="darkblue", alpha=0.5)
    plt.xlabel(f'Description of {column}')
    plt.title(f'Histogram {column}')
    plt.show()

<font color='blue'>Bar chart
</font>

In [None]:
gr_type = data.groupby('type')['step'].count().reset_index()
gr_type.columns = ['type', 'num']
gr_type.head()

In [None]:
plt.figure()
plt.bar(gr_type.type, gr_type.num)
plt.title('Barchart Type')
plt.xlabel('Type')
plt.show()

<font color='blue'>Correlation
</font>

In [None]:
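# Correlate numeric columns only; string columns (type, nameOrig, nameDest) cannot be correlated
corr = data.select_dtypes(include='number').corr()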

In [None]:
plt.figure(figsize=(16,8))
sns.set(font_scale=0.8)
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=1, square=True, cmap="YlGnBu", linewidths=.5, annot=True)
    

<font color='blue'>**COMMENTS:**<br>
* I think the important features for the outcome will be step, type and amount.
</font>

### What is the distribution of the outcome? 

In [None]:
out = data.groupby('isFraud')['step'].count().reset_index()
out.columns = ['isFraud', 'num']
out.head()

In [None]:
plt.figure()
plt.bar(out['isFraud'].astype(str), out['num'])  # use the computed counts instead of hard-coded numbers
plt.title('Barchart Is Fraud')
plt.xlabel('Is Fraud')
plt.show()

In [None]:
outf = data.groupby('isFlaggedFraud')['step'].count().reset_index()
outf.columns = ['isFlaggedFraud', 'num']
outf.head()

In [None]:
plt.figure()
plt.bar(outf['isFlaggedFraud'].astype(str), outf['num'])  # use the computed counts instead of hard-coded numbers
plt.title('Barchart Is Flagged Fraud')
plt.xlabel('is Flagged Fraud')
plt.show()
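The counts above can also be read as proportions, which makes the imbalance easier to quote. The following is a minimal sketch on a made-up toy frame (`toy` and its numbers are illustrative assumptions, not the PaySim data); on the real data the same calls would be applied to `data['isFraud']`.

```python
import pandas as pd

# Toy stand-in for the real data: 2 fraud rows out of 1000 (made-up numbers).
toy = pd.DataFrame({'isFraud': [0] * 998 + [1] * 2})

# normalize=True returns class shares instead of raw counts.
shares = toy['isFraud'].value_counts(normalize=True)
print(shares)

# Ratio of majority to minority class.
counts = toy['isFraud'].value_counts()
imbalance_ratio = counts.max() / counts.min()
print(f'Majority/minority ratio: {imbalance_ratio:.0f}:1')
```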

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

<font color='blue'>A step is one hour, so we can convert it to days
</font>

In [None]:
data['days'] = data['step'] / 24  # a step is one hour, so this is elapsed time in days

In [None]:
data.head()

In [None]:
plt.hist(data['days'], bins=30)
plt.show()
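Besides elapsed days, the hourly step can also be split into a day index and an hour of the day, which may capture daily patterns better than a single running counter. This is only a sketch on toy values, and it assumes that step 1 corresponds to the first hour of the first day; the column names `day` and `hour_of_day` are made up for illustration.

```python
import pandas as pd

# Toy steps; one unit of `step` is taken to be one hour.
toy = pd.DataFrame({'step': [1, 24, 25, 49, 743]})

# Assumption: step 1 is the first hour of day 0.
toy['day'] = (toy['step'] - 1) // 24          # whole days elapsed
toy['hour_of_day'] = (toy['step'] - 1) % 24   # 0..23 within the day
print(toy)
```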

### Run a logistic regression classifier and evaluate its accuracy.

In [None]:
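# One-hot encode `type`, dropping the alphabetically last category as the reference level.
# (The original .sum(level=0) chain no longer runs on pandas 2.x.)
one_hot_type = pd.get_dummies(data['type'], prefix='type', dtype=int).iloc[:, :-1]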


In [None]:
one_hot_type.head()

In [None]:
data_enc = one_hot_type.join(data)

In [None]:
data_enc.columns

<font color='blue'>Unbalanced Data Model
</font>

In [None]:
X = data_enc[['type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'amount', 'days']]

In [None]:
y = data_enc['isFraud']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)


In [None]:
clf = LogisticRegression(random_state=10, solver='lbfgs')
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
f1_score(y_test, y_pred)

In [None]:
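print(f'F1 score: {f1_score(y_test, y_pred):.4f}')  # call the function; print(f1_score) only shows the function object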

In [None]:
confusion_matrix(y_test, y_pred)


In [None]:
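# ROC AUC should be computed from predicted probabilities, not hard 0/1 labels
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

In [None]:
y_score = clf.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)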
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('LGC unbalanced')
plt.legend(loc="lower right")
plt.show()

<font color='blue'>The accuracy seems very good, but it is misleading: the model does not detect any fraud at all
</font>
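Why a high accuracy can coexist with zero detected fraud is easiest to see with a baseline that always predicts the majority class. The sketch below uses synthetic labels (the 10-in-10,000 ratio is a made-up assumption, not the PaySim ratio) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 10 positives among 10,000 rows (made-up numbers).
y_toy = np.zeros(10_000, dtype=int)
y_toy[:10] = 1
X_toy = np.zeros((10_000, 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_hat)
rec = recall_score(y_toy, y_hat)
print(f'accuracy = {acc:.4f}')  # share of rows labelled correctly
print(f'recall   = {rec:.4f}')  # share of positives that were found
```

Accuracy is dominated by the majority class here, while recall on the rare class shows that nothing was detected. This is why the F1 score and the confusion matrix are more informative than `score` for this task.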

<font color='blue'>Balanced Data
</font>

In [None]:
data_enc_1 = data_enc.loc[data_enc['isFraud'] == 1]

In [None]:
data_enc_1.shape

In [None]:
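# Draw the non-fraud rows at random; .head() would take only the first rows of the file
data_enc_0 = data_enc.loc[data_enc['isFraud'] == 0].sample(n=8213, random_state=10)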

In [None]:
data_enc_0.shape

In [None]:
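# First 7000 rows of each class for training, the rest for testing.
# The test slice starts at 7000 (not 7001) so that no row is silently dropped.
data_enc_train = pd.concat([data_enc_1.iloc[:7000,:],data_enc_0.iloc[:7000,:]], axis=0, join='outer', ignore_index=True)
data_enc_test = pd.concat([data_enc_1.iloc[7000:,:],data_enc_0.iloc[7000:,:]], axis=0, join='outer', ignore_index=True)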
               
                         

In [None]:
data_enc_train.shape

In [None]:
data_enc_test.shape

In [None]:
X_train = data_enc_train[['type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'amount', 'days']]
y_train = data_enc_train['isFraud']

In [None]:
X_test = data_enc_test[['type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'amount', 'days']]
y_test = data_enc_test['isFraud']

In [None]:
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10, shuffle=True)


In [None]:
# The target is binary, so the multi_class option is not needed
clf = LogisticRegression(random_state=10, solver='lbfgs')
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
f1_score(y_test, y_pred)

In [None]:
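print(f'F1 score: {f1_score(y_test, y_pred):.4f}')  # call the function; print(f1_score) only shows the function object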

In [None]:
confusion_matrix(y_test, y_pred)


In [None]:
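# ROC AUC should be computed from predicted probabilities, not hard 0/1 labels
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

In [None]:
y_score = clf.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)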
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('LGC balanced')
plt.legend(loc="lower right")
plt.show()

<font color='blue'>RESAMPLE
</font>

In [None]:
from sklearn.utils import resample

In [None]:
data_enc['isFraud'].value_counts()

In [None]:
# Separate majority and minority classes

df_majority = data_enc[data_enc.isFraud==0]
df_minority = data_enc[data_enc.isFraud==1]

In [None]:
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class
                                 random_state=123) # reproducible results

In [None]:
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [None]:
# Display new class counts
df_upsampled.isFraud.value_counts()

In [None]:
X = df_upsampled[['type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'amount', 'days']]

In [None]:
y = df_upsampled['isFraud']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)


In [None]:
# The target is binary, so the multi_class option is not needed
clf = LogisticRegression(random_state=10, solver='lbfgs')
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
f1_score(y_test, y_pred)

In [None]:
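print(f'F1 score: {f1_score(y_test, y_pred):.4f}')  # call the function; print(f1_score) only shows the function object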

In [None]:
confusion_matrix(y_test, y_pred)


In [None]:
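# ROC AUC should be computed from predicted probabilities, not hard 0/1 labels
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

In [None]:
y_score = clf.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)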
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('LGC upsampled')
plt.legend(loc="lower right")
plt.show()
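One caveat with upsampling before the train/test split is that copies of the same fraud row can land on both sides of the split, so the test score may look better than it would on unseen data. A common alternative is to split first and upsample only the training part. The sketch below shows this order on a synthetic frame (`toy`, its size and its fraud rate are made-up assumptions).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real frame: 1000 rows, 20 of them fraud.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'amount': rng.exponential(100.0, size=1000),
                    'isFraud': [1] * 20 + [0] * 980})

# 1) Split first, stratifying so both parts keep the original fraud rate.
train, test = train_test_split(toy, test_size=0.2, stratify=toy['isFraud'],
                               random_state=10)

# 2) Upsample the minority class inside the training part only.
train_major = train[train['isFraud'] == 0]
train_minor = train[train['isFraud'] == 1]
train_minor_up = resample(train_minor, replace=True,
                          n_samples=len(train_major), random_state=123)
train_balanced = pd.concat([train_major, train_minor_up])

print(train_balanced['isFraud'].value_counts())  # balanced training classes
print(test['isFraud'].value_counts())            # test keeps the original imbalance
```

The test set is left untouched, so its metrics reflect the imbalance the model would face in practice.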

### Now pick a model of your choice and evaluate its accuracy.

<font color='blue'>Random Forest Classifier
</font>

In [None]:
from sklearn.ensemble import RandomForestClassifier


In [None]:
X = data_enc[['type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'amount', 'days']]

In [None]:
y = data_enc['isFraud']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)


In [None]:
rfc = RandomForestClassifier(random_state=10)  # fixed seed for reproducible results


In [None]:
rfc.fit(X_train, y_train)

In [None]:
y_pred = rfc.predict(X_test)

In [None]:
rfc.score(X_train, y_train)

In [None]:
rfc.score(X_test, y_test)

In [None]:
f1_score(y_test, y_pred)

In [None]:
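print(f'F1 score: {f1_score(y_test, y_pred):.4f}')  # call the function; print(f1_score) only shows the function object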

In [None]:
confusion_matrix(y_test, y_pred)


In [None]:
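# ROC AUC should be computed from predicted probabilities, not hard 0/1 labels
roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])

In [None]:
y_score = rfc.predict_proba(X_test)[:, 1]
rfc_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % rfc_roc_auc)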
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('RFC unbalanced')
plt.legend(loc="lower right")
plt.show()

<font color='blue'> Penalized SVM
</font>

In [None]:
from sklearn.svm import SVC

In [None]:
X = data_enc[['type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'amount', 'days']]

In [None]:
y = data_enc['isFraud']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)


In [None]:
svc = SVC(gamma='auto', kernel='linear', 
            class_weight='balanced', # penalize
            probability=True)

In [None]:
svc.fit(X_train, y_train)

In [None]:
y_pred = svc.predict(X_test)

In [None]:
svc.score(X_train, y_train)

In [None]:
svc.score(X_test, y_test)

In [None]:
f1_score(y_test, y_pred)

In [None]:
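print(f'F1 score: {f1_score(y_test, y_pred):.4f}')  # call the function; print(f1_score) only shows the function object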

In [None]:
confusion_matrix(y_test, y_pred)


In [None]:
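# ROC AUC should be computed from predicted probabilities, not hard 0/1 labels
roc_auc_score(y_test, svc.predict_proba(X_test)[:, 1])

In [None]:
y_score = svc.predict_proba(X_test)[:, 1]
svc_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label='SVM (area = %0.2f)' % svc_roc_auc)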
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('SVM unbalanced')
plt.legend(loc="lower right")
plt.show()
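Kernel `SVC` training time grows at least quadratically with the number of samples, which is likely why the fit does not finish on millions of rows. One possible workaround, sketched here as a swapped-in technique rather than the model used above, is `LinearSVC`, which also accepts `class_weight='balanced'` and scales much better for a linear decision boundary. The data below is synthetic (`make_classification`), and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic imbalanced data (about 5% positives) standing in for the real features.
X_toy, y_toy = make_classification(n_samples=5000, n_features=6, n_informative=4,
                                   weights=[0.95, 0.05], random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=10)

# Scaling helps the linear solver converge; class_weight='balanced' penalizes
# mistakes on the rare class more heavily.
lin_svc = make_pipeline(StandardScaler(),
                        LinearSVC(class_weight='balanced', max_iter=5000,
                                  random_state=10))
lin_svc.fit(X_tr, y_tr)

y_hat = lin_svc.predict(X_te)
rare_recall = recall_score(y_te, y_hat)
print('recall on the rare class:', rare_recall)

# LinearSVC has no predict_proba; decision_function gives continuous scores
# that can be passed to roc_curve / roc_auc_score instead.
scores = lin_svc.decision_function(X_te)
```

If the full kernel `SVC` is required, fitting it on a random subsample of the training data is another option, at the cost of discarding information.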

### Which model worked better and how do you know?

<font color='blue'>**COMMENTS**:
* The Logistic Regression Classifier on the unbalanced data does not work at all: its accuracy is high, but it detects no fraud.
* The Logistic Regression Classifier on the balanced data works better, but still predicts no better than chance (about 50%).
* The Random Forest Classifier on the unbalanced data works better than logistic regression.
* Finally, I wasn't able to get the penalized SVM to finish fitting :(
</font>
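To back a comparison like the one above with numbers, it helps to score every model on the same held-out split and to report metrics that focus on the rare class. The sketch below does this on synthetic data; the model settings (including `class_weight='balanced'` for the logistic regression) and the `summary` table are illustrative assumptions, not the results of this lab.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 3% positives).
X_toy, y_toy = make_classification(n_samples=4000, n_features=6, n_informative=4,
                                   weights=[0.97, 0.03], random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                          stratify=y_toy, random_state=10)

models = {
    'logistic (balanced weights)': LogisticRegression(class_weight='balanced',
                                                      max_iter=1000),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=10),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)                # hard labels for precision/recall/F1
    y_score = model.predict_proba(X_te)[:, 1]  # probabilities for ranking metrics
    rows.append({
        'model': name,
        'precision': precision_score(y_te, y_hat, zero_division=0),
        'recall': recall_score(y_te, y_hat),
        'f1': f1_score(y_te, y_hat),
        'avg_precision': average_precision_score(y_te, y_score),
    })

summary = pd.DataFrame(rows).set_index('model')
print(summary.round(3))
```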