# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when the data are noisy, a careless fit can end up modeling the noise rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model performs well on the training data but poorly when applied to new, real-world data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on training data and new data alike.


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [1]:
# Your code here
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [2]:
df = pd.read_csv("../PS_20174392719_1491204439457_log.csv")
print(df.shape)
df.head()

(6362620, 11)


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [3]:
df.isna().sum()

# there are no missing values

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [4]:
df.dtypes

# According to the documentation, step is a measure of time.
# There are 3 categorical variables: type has a limited number of values and can
# be one-hot encoded; the other two (nameOrig, nameDest) can be dropped.

# isFraud will be the target; isFlaggedFraud holds fraud flags produced by a separate detection system

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [5]:
df.type.value_counts()

CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: type, dtype: int64

In [6]:
df.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0


In [7]:
# the important features in predicting the outcome (isFraud) should be
# (step, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest)

### What is the distribution of the outcome? 

In [8]:
# Your response here
df["isFraud"].value_counts()

# The outcome is binary (Bernoulli) and heavily imbalanced:
# only about 0.13% of transactions are labeled as fraud

0    6354407
1       8213
Name: isFraud, dtype: int64
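
To make the degree of imbalance explicit, the counts above can be turned into class shares and a majority-to-minority ratio. The following is a minimal sketch that hard-codes the two counts from the output above so it runs without the full dataset; the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 6354407, 1: 8213}, name="isFraud")

# Share of each class and the majority-to-minority ratio.
proportions = counts / counts.sum()
imbalance_ratio = counts[0] / counts[1]

print(proportions.round(5))
print(f"About {imbalance_ratio:.0f} legitimate transactions per fraudulent one")
```

On the full frame, `df["isFraud"].value_counts(normalize=True)` should give the same shares directly.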

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [9]:
# Your code here
df.drop(["nameOrig", "nameDest"], axis=1, inplace=True)

In [10]:
encodedtype = pd.get_dummies(df["type"])
df = pd.concat([df,encodedtype], axis=1)
df.drop("type", axis=1, inplace=True)
df

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,1,9839.64,170136.00,160296.36,0.00,0.00,0,0,0,0,0,1,0
1,1,1864.28,21249.00,19384.72,0.00,0.00,0,0,0,0,0,1,0
2,1,181.00,181.00,0.00,0.00,0.00,1,0,0,0,0,0,1
3,1,181.00,181.00,0.00,21182.00,0.00,1,0,0,1,0,0,0
4,1,11668.14,41554.00,29885.86,0.00,0.00,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,339682.13,339682.13,0.00,0.00,339682.13,1,0,0,1,0,0,0
6362616,743,6311409.28,6311409.28,0.00,0.00,0.00,1,0,0,0,0,0,1
6362617,743,6311409.28,6311409.28,0.00,68488.84,6379898.11,1,0,0,1,0,0,0
6362618,743,850002.52,850002.52,0.00,0.00,0.00,1,0,0,0,0,0,1


In [11]:
# I see no issue with the encoding of step.
# Actual dates would add no relevant information to the simulation.
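
If a cyclical view of time turns out to be useful, one option is to derive an hour-of-day and a day index from `step`. This is only a sketch: it assumes, as I read the dataset description, that one step corresponds to one hour of simulated time, and it further assumes that step 1 is the first hour of the first day. The column names `hour_of_day` and `day` are hypothetical and do not appear in the original data.

```python
import pandas as pd

# Toy frame with a few step values; in the notebook this would be df["step"].
toy = pd.DataFrame({"step": [1, 2, 24, 25, 743]})

# Assumption: step 1 is hour 0 of day 0, so shift by one before dividing.
toy["hour_of_day"] = (toy["step"] - 1) % 24
toy["day"] = (toy["step"] - 1) // 24

print(toy)
```

Whether these derived columns help the classifier is an empirical question; the integer `step` alone remains a valid encoding.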

### Run a logistic regression classifier and evaluate its accuracy.

In [12]:
# taking a sample to make it easier on the computer...
df_sample = df.sample(n=100000)

In [13]:
df_sample["isFraud"].unique()

array([0, 1])

In [14]:
# Your code here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x = df_sample.drop(["isFraud", "isFlaggedFraud"], axis=1)
y = df_sample["isFraud"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=45)
model = LogisticRegression().fit(x_train, y_train)
model.score(x_test, y_test)



0.9992
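
An accuracy this high needs a baseline to be interpretable, because with so few fraud cases a model that never predicts fraud is also almost always right. The sketch below illustrates this on synthetic labels with roughly the fraud rate seen above; it does not use the lab data, and `X_dummy`, `y_true`, and `baseline` are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels with roughly the fraud rate seen above (about 0.13%).
rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.0013).astype(int)
X_dummy = np.zeros((y_true.size, 1))  # features are ignored by DummyClassifier

# A "model" that always predicts the majority class (no fraud).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)

print("accuracy:", baseline_accuracy)
print("recall on fraud:", baseline_recall)
print(confusion_matrix(y_true, y_pred))
```

For the fitted model in this notebook, `confusion_matrix(y_test, model.predict(x_test))` would show how many of the fraud cases in the test split are actually caught.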

### Now pick a model of your choice and evaluate its accuracy.

In [15]:
# Your code here
from sklearn.neighbors import KNeighborsClassifier

model_knn = KNeighborsClassifier().fit(x_train, y_train)
model_knn.score(x_test, y_test)

0.9992

### Which model worked better and how do you know?

In [16]:
# Your response here
from sklearn.model_selection import cross_val_score

cv_results = cross_val_score(model, x, y, cv=5)
cv_results



array([0.99960002, 0.99890005, 0.9985    , 0.99904995, 0.99849992])

In [17]:
cv_results_knn = cross_val_score(model_knn, x, y, cv=5) 
cv_results_knn

array([0.99925004, 0.99945003, 0.99915   , 0.99929996, 0.99924996])

In [18]:
if cv_results.mean() > cv_results_knn.mean():
    print("Logistic regression worked better")
else:
    print("KNN worked better")

KNN worked better
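
Both mean accuracies sit above 0.998, so the comparison above rests on a very small difference. With classes this imbalanced, it may be more informative to cross-validate on metrics that focus on the fraud class. The sketch below shows how `cross_val_score` can be asked for recall and average precision through its `scoring` parameter; it runs on synthetic data from `make_classification` rather than the lab data, and `X_demo`, `y_demo`, and `results` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 2% positives) standing in for x and y.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.98, 0.02], random_state=0
)

# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Score the same model under three different metrics.
results = {}
for metric in ("accuracy", "recall", "average_precision"):
    results[metric] = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring=metric)
    print(f"{metric}: mean={results[metric].mean():.3f}")
```

The same `scoring` argument could be passed to the two `cross_val_score` calls above to compare logistic regression and KNN on fraud recall instead of accuracy.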
