# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when data are noisy, a model can end up fitting the noise rather than the 'signal' (the factors that actually determine the outcome) if we are not careful. This is called overfitting: the model scores well on the training data and poorly on data it has not seen. Conversely, a model can be too simple to capture the signal. This is underfitting, and such a model performs poorly on training data and new data alike.


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import pandas as pd
df = pd.read_csv('PS_20174392719_1491204439457_log.csv')

In [None]:
headers = '''step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance recipient before the transaction. Note that there is not information for customers that start with M (Merchants).

newbalanceDest - new balance recipient after the transaction. Note that there is not information for customers that start with M (Merchants).

isFraud - This is the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent behavior of the agents aims to profit by taking control or customers accounts and try to empty the funds by transferring to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.'''


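# Build a {column name: description} lookup from the text above.
headdict = {}
for h in headers.split('\n'):
    name, sep, description = h.partition(' - ')
    if sep:  # blank lines contain no ' - ' separator and are skipped
        headdict[name] = description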

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
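# Restrict to numeric columns; recent pandas versions raise on string columns such as 'type'
df.select_dtypes('number').corr()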

In [None]:
headdict['isFraud']

In [None]:
headdict['isFlaggedFraud']

In [None]:
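# Absolute correlation of each numeric column with the target, weakest first
df.select_dtypes('number').corrwith(df['isFraud']).abs().sort_values()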

In [None]:
# What do you think will be the important features in determining the outcome?

# Amount, isFlaggedFraud, step
# The old and new balances are also highly correlated; in fraudulent transactions the money is usually taken out of the account quickly.

### What is the distribution of the outcome? 

In [None]:
import seaborn as sns
#sns.kdeplot(df['isFraud'])
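# The outcome is binary (0/1): each transaction is a Bernoulli draw, so the fraud count is binomial.
# A KDE is not very informative for a 0/1 variable.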
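
Because the outcome is a 0/1 label, a count of each class says more than a density plot. The sketch below uses a small hypothetical label list (not the PaySim data) to show how the class shares can be computed, along with the accuracy of a model that always predicts the majority class. `class_shares` and `majority_baseline_accuracy` are illustrative names, not library functions.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: fraction of rows} for a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    return max(class_shares(labels).values())

# Hypothetical toy labels: 998 legitimate (0) and 2 fraudulent (1) rows.
toy_labels = [0] * 998 + [1] * 2
print(class_shares(toy_labels))
print(majority_baseline_accuracy(toy_labels))
```

On the real frame, the same shares should be available from `df['isFraud'].value_counts(normalize=True)`. Any model worth keeping has to beat the majority baseline, which is why plain accuracy is a weak yardstick here.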

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df['nameOrig'].nunique()

In [None]:
df['nameDest'].nunique()

In [None]:
df.drop(['nameDest', 'nameOrig'], axis=1, inplace=True)

In [None]:
dumms = pd.get_dummies(df['type'], drop_first=True)

In [None]:
df = pd.concat([df,dumms], axis=1).drop('type', axis=1)

In [None]:
headdict['step']


In [None]:
# The step column could be dropped too.
df.drop('step', axis=1, inplace=True)
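
Instead of dropping `step`, one option is to derive calendar-like features from it. The sketch below assumes, following the column description, that one step is one hour and that step 1 is the first hour of the simulation. The clock time of step 1 is not given, so `hour_of_day` is only defined up to an unknown offset. `step_features` is a hypothetical helper.

```python
import math

def step_features(step):
    """Split an hourly step counter into a day index, an hour of day, and a cyclical encoding."""
    hours_elapsed = step - 1            # assumption: step 1 is the first hour
    day = hours_elapsed // 24           # 0-based simulation day
    hour = hours_elapsed % 24           # 0..23, relative to the unknown start time
    angle = 2 * math.pi * hour / 24
    # sin/cos place hour 23 next to hour 0, which the raw integer does not
    return {
        "day": day,
        "hour_of_day": hour,
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
    }

print(step_features(1))
print(step_features(26))
```

To use this on the frame, the features would have to be computed before `step` is dropped, for example with `(df['step'] - 1) % 24` for the hour of day.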

In [None]:
X = df.drop('isFraud', axis=1)
y = df['isFraud']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=55)
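
With so few positive rows, a purely random split can leave the two halves with noticeably different fraud rates. A minimal plain-Python sketch of a stratified split follows, using hypothetical toy labels. `stratified_split` is an illustrative helper; scikit-learn's `train_test_split` offers the same idea through its `stratify` argument.

```python
import random

def stratified_split(labels, test_frac=0.5, seed=55):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random assignment within each class
        n_test = round(len(indices) * test_frac)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [0] * 96 + [1] * 4                   # hypothetical: 4% positives
train_idx, test_idx = stratified_split(toy_labels)
print(sum(toy_labels[i] for i in train_idx), sum(toy_labels[i] for i in test_idx))
```

In the notebook itself, the equivalent call would be `train_test_split(X, y, test_size=0.5, random_state=55, stratify=y)`.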

### Run a logistic regression classifier and evaluate its accuracy.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, recall_score

In [None]:
# R^2 is a regression metric; accuracy is the corresponding score for a classifier
accuracy_score(y_test, y_pred)

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
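
The numbers in the classification report follow directly from the confusion matrix. As a worked sketch with made-up counts (not results from this dataset), the function below computes accuracy, precision, recall and F1 for the positive class. It shows how accuracy can stay high while recall is poor. `binary_metrics` is an illustrative name.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 9,990 legitimate rows and 10 fraudulent rows, 4 of which are caught.
metrics = binary_metrics(tn=9985, fp=5, fn=6, tp=4)
print(metrics)
```

For a binary 0/1 problem, `confusion_matrix(y_test, y_pred).ravel()` returns the counts in the order `tn, fp, fn, tp`, so the real matrix can be passed straight into a helper like this.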

### Now pick a model of your choice and evaluate its accuracy.

In [None]:
# Upsample the minority class and reuse the same model.

In [None]:
from sklearn.utils import resample
df_majority = df[df.isFraud==0]
df_minority = df[df.isFraud==1]

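df_minority_upsampled = resample(df_minority,
                                 replace=True,  # draw with replacement to create extra minority rows
                                 n_samples=len(df_minority)*10,
                                 random_state=55)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
df_upsampled = df_upsampled.sample(frac=1, random_state=55)  # shuffle the rows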

df_upsampled_X = df_upsampled.drop('isFraud', axis=1)
df_upsampled_y = df_upsampled['isFraud']

In [None]:
X_train_upsampled, X_test_upsampled, y_train_upsampled, y_test_upsampled = train_test_split(df_upsampled_X, df_upsampled_y, test_size=0.5, random_state=55)
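
One caveat with resampling before splitting is that copies of the same minority row can land in both the training and the test half, which flatters the test scores. A common alternative, sketched here in plain Python on hypothetical toy rows, is to split first and oversample only the training rows. `oversample_minority` is an illustrative helper that assumes label 1 is the minority class.

```python
import random

def oversample_minority(rows, labels, factor=10, seed=55):
    """Keep all majority rows and draw `factor` times the minority count, with replacement."""
    rng = random.Random(seed)
    minority = [(r, y) for r, y in zip(rows, labels) if y == 1]
    majority = [(r, y) for r, y in zip(rows, labels) if y == 0]
    resampled = rng.choices(minority, k=len(minority) * factor)
    combined = majority + resampled
    rng.shuffle(combined)                      # avoid blocks of identical labels
    new_rows = [r for r, _ in combined]
    new_labels = [y for _, y in combined]
    return new_rows, new_labels

# Hypothetical training partition: 18 legitimate rows and 2 fraudulent rows.
train_rows = [[float(i)] for i in range(20)]
train_labels = [0] * 18 + [1] * 2
up_rows, up_labels = oversample_minority(train_rows, train_labels)
print(len(up_rows), sum(up_labels))
```

The test partition is left untouched, so it keeps the class balance the model would face in practice.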

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_us_scaled = scaler.fit_transform(X_train_upsampled)
# Reuse the scaling learned from the training data; refitting on the test data would leak its range
X_test_us_scaled = scaler.transform(X_test_upsampled)

In [None]:
model2 = LogisticRegression()

model2.fit(X_train_us_scaled, y_train_upsampled)


In [None]:
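# Apply the already-fitted scaler rather than refitting it on the test set,
# and score with model2, the model trained on the upsampled data
X_test_scaled = scaler.transform(X_test)
y_pred_upsampled = model2.predict(X_test_scaled)
y_upsample_score = model2.predict_proba(X_test_scaled)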

In [None]:
confusion_matrix(y_test, y_pred_upsampled)

In [None]:
import numpy as np
def plot_cm(labels, predictions, p=0.5):
    cm = confusion_matrix(labels, predictions > p)
    plt.figure(figsize=(5,5))
    sns.heatmap(cm, annot=True, fmt="d")
    plt.title('Confusion matrix @{:.2f}'.format(p))
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')

    print('Legitimate Transactions Detected (True Negatives): ', cm[0][0])
    print('Legitimate Transactions Incorrectly Detected (False Positives): ', cm[0][1])
    print('Fraudulent Transactions Missed (False Negatives): ', cm[1][0])
    print('Fraudulent Transactions Detected (True Positives): ', cm[1][1])
    print('Total Fraudulent Transactions: ', np.sum(cm[1]))




In [None]:
df_majority_downsampled = resample(df_majority,
                                   replace=False,  # downsampling: draw without replacement
                                   n_samples=round(len(df_majority)/10),
                                   random_state=55)

df_downsampled = pd.concat([df_minority, df_majority_downsampled])
df_downsampled = df_downsampled.sample(frac=1)

df_downsampled_X = df_downsampled.drop('isFraud', axis=1)
df_downsampled_y = df_downsampled['isFraud']
X_train_downsampled, X_test_downsampled, y_train_downsampled, y_test_downsampled = train_test_split(df_downsampled_X, df_downsampled_y, test_size=0.5, random_state=55)

In [None]:
X_train_ds_scaled = scaler.fit_transform(X_train_downsampled)
X_test_ds_scaled = scaler.transform(X_test_downsampled)  # transform only; do not refit on test data
model3 = LogisticRegression()
model3.fit(X_train_ds_scaled, y_train_downsampled)

In [None]:
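# Rescale the original test set with the current scaler and score it with model3, the downsampled model
X_test_scaled = scaler.transform(X_test)
y_pred_downsampled = model3.predict(X_test_scaled)
y_downsample_score = model3.predict_proba(X_test_scaled)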

In [None]:
plot_cm(y_test, y_pred_upsampled)

In [None]:
plot_cm(y_test, y_pred)

In [None]:
plot_cm(y_test, y_pred_downsampled)

In [None]:
# Test size affects the outcome a lot.

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
from sklearn.svm import SVC

In [None]:
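from sklearn.ensemble import RandomForestClassifier

svc_mod = SVC(kernel='linear',
              class_weight='balanced',  # weight errors inversely to class frequency
              probability=True)
svc_mod.fit(X_train, y_train)
# AUC measures ranking quality, so pass continuous decision scores rather than hard 0/1 predictions
svc_scores = svc_mod.decision_function(X_test)
roc_auc_score(y_test, svc_scores)

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)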

In [None]:
# SVC takes far too long to train on a dataset of this size.
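
`roc_auc_score` is meant to be computed from continuous scores, such as the output of `decision_function` or `predict_proba`; thresholded 0/1 predictions collapse the curve to a single operating point. To make the definition concrete, the sketch below computes AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half. The labels and scores are invented for illustration, and `roc_auc` is a hypothetical helper, not the scikit-learn function.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.4, 0.9]          # hypothetical model scores
hard = [1 if s > 0.5 else 0 for s in scores]     # the same scores thresholded at 0.5

auc_from_scores = roc_auc(labels, scores)
auc_from_hard = roc_auc(labels, hard)
print(auc_from_scores, auc_from_hard)
```

The pairwise loop is quadratic in the number of rows, so it suits a toy example only. On the full test set the library function is the practical choice.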