# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well on the training data but poorly when applied to new data. Conversely, a model that is too simplistic to capture the signal is said to underfit, and it performs poorly on both the training data and new data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [59]:
import itertools
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import sklearn

In [60]:
%%time
fraud = pd.read_csv("../data.csv").sample(100000)

Wall time: 17.7 s


In [61]:
# checking size of df
fraud.shape

(100000, 11)

In [62]:
# checking head and tail
# (a notebook cell only displays its last expression, so only fraud.tail() is shown)
fraud.head()
fraud.tail()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2278414 | 187 | CASH_OUT | 279018.38 | C1326264252 | 0.0 | 0.0 | C733478086 | 1114638.4 | 1393656.78 | 0 | 0 |
| 3039937 | 234 | CASH_OUT | 195584.0 | C1248084880 | 20887.0 | 0.0 | C637098894 | 0.0 | 195584.0 | 0 | 0 |
| 5581975 | 393 | CASH_OUT | 221736.41 | C1518002018 | 0.0 | 0.0 | C1283550676 | 1556931.92 | 1778668.33 | 0 | 0 |
| 3793490 | 281 | CASH_OUT | 220357.98 | C1343031900 | 11418.0 | 0.0 | C917715031 | 1075013.51 | 1295371.49 | 0 | 0 |
| 1378427 | 138 | TRANSFER | 398892.44 | C4753512 | 51006.0 | 0.0 | C1897193207 | 2692547.74 | 3091440.18 | 0 | 0 |


In [63]:
# checking types
fraud.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [64]:
# checking the number of unique values in the columns with dtype object
print('type:' , fraud["type"].nunique())
print('nameOrig:' , fraud["nameOrig"].nunique())
print('nameDest:' , fraud["nameDest"].nunique())

type: 5
nameOrig: 100000
nameDest: 92884


In [65]:
# checking descriptive statistics
round(fraud.describe(), 2)

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 243.87 | 178416.64 | 829814.15 | 851152.15 | 1098032.0 | 1220395.0 | 0.0 | 0.0 |
| std | 142.67 | 594847.58 | 2871606.24 | 2908936.87 | 3326227.0 | 3590567.0 | 0.03 | 0.0 |
| min | 1.0 | 0.93 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13405.31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 74940.52 | 14451.5 | 0.0 | 132039.8 | 213952.7 | 0.0 | 0.0 |
| 75% | 335.0 | 208186.15 | 107833.5 | 144301.16 | 947248.3 | 1114916.0 | 0.0 | 0.0 |
| max | 735.0 | 40891132.65 | 34809311.48 | 34870813.36 | 185641500.0 | 214500700.0 | 1.0 | 1.0 |


In [66]:
# checking NaN's
fraud.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [67]:
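# checking correlations between the numeric columns
# (numeric_only=True skips the object columns, which recent pandas versions no longer drop silently)
round(fraud.corr(numeric_only=True), 2)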

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| step | 1.0 | 0.02 | -0.01 | -0.01 | 0.03 | 0.02 | 0.03 | 0.0 |
| amount | 0.02 | 1.0 | -0.0 | -0.01 | 0.31 | 0.48 | 0.06 | 0.05 |
| oldbalanceOrg | -0.01 | -0.0 | 1.0 | 1.0 | 0.07 | 0.05 | 0.01 | 0.02 |
| newbalanceOrig | -0.01 | -0.01 | 1.0 | 1.0 | 0.07 | 0.05 | -0.01 | 0.02 |
| oldbalanceDest | 0.03 | 0.31 | 0.07 | 0.07 | 1.0 | 0.98 | -0.0 | -0.0 |
| newbalanceDest | 0.02 | 0.48 | 0.05 | 0.05 | 0.98 | 1.0 | 0.0 | -0.0 |
| isFraud | 0.03 | 0.06 | 0.01 | -0.01 | -0.0 | 0.0 | 1.0 | 0.1 |
| isFlaggedFraud | 0.0 | 0.05 | 0.02 | 0.02 | -0.0 | -0.0 | 0.1 | 1.0 |


### What is the distribution of the outcome? 

In [69]:
# isFraud is used as the outcome further down; either way, both columns are very imbalanced
print(fraud["isFlaggedFraud"].value_counts())
print(fraud["isFraud"].value_counts())

0    99999
1        1
Name: isFlaggedFraud, dtype: int64
0    99890
1      110
Name: isFraud, dtype: int64
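
The counts above imply that only about 0.11% of the sampled transactions are fraudulent. As a small worked example (using the counts printed above, which will vary from sample to sample), the share of the fraud class and the accuracy of a trivial model that always predicts "not fraud" can be computed as follows:

```python
# Counts taken from the isFraud value_counts() output above;
# they will differ for a different random sample.
n_legit = 99890
n_fraud = 110
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total

# A model that always predicts 0 ("not fraud") is right on every legitimate row.
baseline_accuracy = n_legit / n_total

print(f"fraud rate: {fraud_rate:.4%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy close to 0.999 should therefore be read against this baseline rather than taken as evidence of a good model.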


### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [71]:
# I'm not sure what the variable step refers to,
# so I will leave it as is for now
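
According to the dataset description on Kaggle, one `step` corresponds to one hour of simulated time, with the simulation spanning 30 days. Under that assumption, one possible way to integrate the time variable is to derive an hour-of-day and a day index from `step`. The sketch below uses a small made-up DataFrame rather than the real sample; the column names `hour_of_day` and `day` are hypothetical, and treating step 1 as the first hour of day 0 is also an assumption.

```python
import pandas as pd

# Toy frame standing in for the real sample; only the step column matters here.
toy = pd.DataFrame({"step": [1, 24, 25, 187, 735]})

# Assumption: 1 step = 1 hour, and step 1 is the first hour of day 0.
toy["hour_of_day"] = (toy["step"] - 1) % 24
toy["day"] = (toy["step"] - 1) // 24

print(toy)
```

A cyclical feature such as the hour of the day may be more informative to a model than the raw, ever-increasing step counter, although that would need to be checked on the data.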

### Run a logistic regression classifier and evaluate its accuracy.

In [72]:
# importing the libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, RobustScaler, StandardScaler 
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, KBinsDiscretizer
from sklearn.preprocessing import MultiLabelBinarizer, Normalizer, OneHotEncoder
from sklearn.metrics import classification_report, accuracy_score

In [73]:
# label-encoding the object columns (type, nameOrig, nameDest) as integers
le = LabelEncoder()
label_cols = ["type", "nameOrig", "nameDest"]
fraud[label_cols] = fraud[label_cols].apply(le.fit_transform)

In [74]:
# splitting the data into target and predictors
# i will use isFraud as the target
X = fraud.drop(labels = "isFraud", axis = 1)
y = fraud["isFraud"]

In [75]:
# dividing train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
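
With only about a hundred fraud cases in the sample, a purely random split can leave very few positives in the test set. A possible variation, shown here on synthetic data rather than the lab's DataFrame, passes `stratify` to `train_test_split` so that both sets keep the same fraud share, and fixes `random_state` for reproducibility (the seed value 42 is arbitrary).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with 1% positives (not the PaySim data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Both splits keep the 1% positive rate.
print(y_tr.mean(), y_te.mean())
```

On the lab's data the equivalent call would pass `stratify=y` alongside `X` and `y`.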

In [76]:
# preparing and fitting the model
lr = LogisticRegression().fit(X_train, y_train)

In [77]:
# accuracy score
acc = lr.score(X_test, y_test)

# score() returns a fraction between 0 and 1, not a percentage
print(f"Logistic Regression Test Accuracy {round(acc, 2)}")

Logistic Regression Test Accuracy 1.0
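
An accuracy that rounds to 1.0 says little here, because almost every row is non-fraud. As a hedged illustration on made-up labels (not the lab's predictions), the sketch below scores a classifier that never predicts fraud: accuracy stays high while recall for the fraud class is zero. It uses `accuracy_score`, `recall_score`, and `confusion_matrix` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up ground truth: 1,000 transactions, 2 of them fraudulent.
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1

# A "model" that always predicts the majority class (no fraud).
y_pred_demo = np.zeros(1000, dtype=int)

acc_demo = accuracy_score(y_true, y_pred_demo)
rec_demo = recall_score(y_true, y_pred_demo)  # recall for the positive (fraud) class
cm = confusion_matrix(y_true, y_pred_demo)    # rows: actual class, columns: predicted class

print("accuracy:", acc_demo)
print("recall:", rec_demo)
print(cm)
```

The same functions could be applied to `y_test` and `lr.predict(X_test)` to see how many of the actual fraud cases the logistic regression catches.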


### Now pick a model of your choice and evaluate its accuracy.

In [78]:
from sklearn.ensemble import RandomForestClassifier 

In [79]:
# Random Forest Prediction Model
# creating features
X = fraud.drop(labels = "isFraud", axis = 1)

# creating labels
y = fraud["isFraud"]

### Which model worked better and how do you know?

In [81]:
# (c) duarteharris way 

# Splitting the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [82]:
# creating a Random Forest classifier
clf = RandomForestClassifier(n_estimators = 100)

# training the model using the training set
clf.fit(X_train, y_train)

# predicting on the test set
y_pred = clf.predict(X_test)

In [83]:
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.99945


In [84]:
# checking the importance of the features to predict
feature_imp = pd.Series(clf.feature_importances_, index = X.columns).sort_values(ascending = False)
feature_imp

oldbalanceOrg     0.240155
newbalanceDest    0.233043
amount            0.145804
step              0.108982
oldbalanceDest    0.066814
type              0.060720
nameDest          0.057234
nameOrig          0.049321
newbalanceOrig    0.034524
isFlaggedFraud    0.003403
dtype: float64

### (c) duarteharris explanation

"""


I'm highly suspicious of both scores. That said:

«Precision is a good measure to select the "best" model, when the costs of False Positive is
high. 

(...)

Recall shall be the model metric we use to select our best model when there is a high cost 
associated with False Negative.

For instance, in fraud detection or sick patient detection. If a fraudulent transaction 
(Actual Positive) is predicted as non-fraudulent (Predicted Negative), the consequence can 
be very bad for the bank.

(...)

F1 Score might be a better measure to use if we need to seek a balance between Precision and 
Recall AND there is an uneven class distribution (large number of Actual Negatives).

(...)

If you cannot decide or thinks that its best to reduce both, False Negatives and False 
Positives then choose F1.»

As such, judging from this, Recall or F1 seem to be the better measures to choose a model 
(and not accuracy or precision).

Source:
https://koopingshung.com/blog/machine-learning-model-selection-accuracy-precision-recall-f1/



"""

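The metrics quoted above can be written down directly from confusion-matrix counts. The following sketch uses hypothetical counts (they are not results from this lab) to show how precision, recall, and F1 are computed; the function name `prf1` is made up for illustration.

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 15 frauds caught, 5 false alarms, 7 frauds missed.
precision, recall, f1 = prf1(tp=15, fp=5, fn=7)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

None of the three formulas uses the true negatives, which is why these metrics are not inflated by the large number of legitimate transactions the way accuracy is.
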