# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when the data is noisy and we are not careful, we can end up fitting the model to the noise rather than to the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well on the training data but poorly when it is applied to new data. Conversely, a model that is too simple to capture the signal underfits, and performs poorly on training data and new data alike.


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [1]:
# Your code here

import pandas as pd


In [2]:
fraud = pd.read_csv('fraud.csv')

### What is the distribution of the outcome? 

In [3]:
# The dataset has far more non-fraud cases (0) than fraud cases (1), so the classes are heavily imbalanced

fraud['isFraud'].value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64
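
To put the imbalance in numbers, here is a small worked calculation using the counts printed above. It also shows why accuracy alone is a weak yardstick here: a trivial model that always answers "not fraud" already scores very high. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
n_not_fraud = 6354407
n_fraud = 8213
n_total = n_not_fraud + n_fraud

# Share of transactions labelled as fraud.
fraud_rate = n_fraud / n_total

# A trivial model that always predicts "not fraud" is correct on every non-fraud row.
baseline_accuracy = n_not_fraud / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting 'not fraud': {baseline_accuracy:.4%}")
```

Any model evaluated on the full dataset should be judged against this baseline, and with precision and recall rather than accuracy alone.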

In [4]:
# Your response here

fraud.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [5]:
fraud.shape

(6362620, 11)

In [6]:
# VARIABLES DESCRIPTION


# step - unit of time in the real world. 1 step = 1 hour of time. Total steps 744 (30 days simulation).

# type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

# amount - amount of the transaction in local currency.

# nameOrig - customer who started the transaction

# oldbalanceOrg - initial balance before the transaction

# newbalanceOrig - new balance after the transaction

# nameDest - customer who is the recipient of the transaction

# oldbalanceDest - initial balance recipient before the transaction. No information for customers that start with M (Merchants).

# newbalanceDest - new balance recipient after the transaction. No information for customers that start with M (Merchants).

# isFraud - Transactions made by the fraudulent agents inside the simulation. 
# Their goal is to empty the funds by transferring to another account and then cashing out of the system.

# isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. 
# An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.

In [7]:
fraud.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [8]:
fraud.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [9]:
# What is important for determining fraud or not? 
 
# type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

# amount - amount of the transaction in local currency.

# oldbalanceOrg - initial balance before the transaction

# newbalanceOrig - new balance after the transaction

# isFlaggedFraud - attempt to transfer more than 200.000 in a single transaction


In [10]:
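# I'm not going to use oldbalanceDest and newbalanceDest because a large share of their values are 0,
# which I treat as missing here (about 42.5% for oldbalanceDest and 38.3% for newbalanceDest)

missing = (fraud['oldbalanceDest'] == 0).sum() * 100 / len(fraud)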

print(fraud['oldbalanceDest'].value_counts(), '\n ')

print('Percentage of missing values:', missing)

0.00           2704388
10000000.00        615
20000000.00        219
30000000.00         86
40000000.00         31
                ...   
28209542.84          1
529734.16            1
499717.38            1
531125.34            1
2011.42              1
Name: oldbalanceDest, Length: 3614697, dtype: int64 
 
Percentage of missing values: 42.5043142604776


In [11]:
# Same check for newbalanceDest: share of values equal to 0 (treated as missing)
missing = (fraud['newbalanceDest'] == 0).sum() * 100 / len(fraud)

print(fraud['newbalanceDest'].value_counts(), '\n ') 

print('Percentage of missing values:', missing)

0.00           2439433
10000000.00         53
971418.91           32
19169204.93         29
1254956.07          25
                ...   
2384900.66           1
573230.90            1
444426.88            1
170489.19            1
324704.47            1
Name: newbalanceDest, Length: 3555499, dtype: int64 
 
Percentage of missing values: 38.34007059984723
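
A zero balance is not necessarily a missing value. The variable description says there is no balance information for recipients whose name starts with M (merchants), so one way to probe this is to cross-tabulate zero balances against recipient type. The sketch below uses five rows copied from the `head()` outputs in this notebook as a toy frame; on the real data the same two masks would be built from `fraud` instead. The names `toy`, `is_zero` and `is_merchant` are illustrative.

```python
import pandas as pd

# Toy rows for illustration only; column names match the lab's dataset.
toy = pd.DataFrame({
    "nameDest": ["M1979787155", "M2044282225", "C553264065", "C38997010", "C1664276005"],
    "oldbalanceDest": [0.0, 0.0, 0.0, 21182.0, 573584.16],
})

is_zero = toy["oldbalanceDest"] == 0
is_merchant = toy["nameDest"].str.startswith("M")

# Share of rows with a zero balance.
zero_share = is_zero.mean()

# How the zero balances split between merchant and customer recipients.
zero_by_recipient = pd.crosstab(
    is_merchant, is_zero, rownames=["merchant"], colnames=["zero_balance"]
)

print(f"Share of zero balances: {zero_share:.1%}")
print(zero_by_recipient)
```

If most zeros on the full dataset belong to merchants, they reflect unrecorded balances; zeros on customer accounts may be genuine empty accounts and could carry signal.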


In [12]:
# Your code here
# step: 1 step = 1 hour of simulated time

fraud['step'].value_counts()

19     51352
18     49579
187    49083
235    47491
307    46968
       ...  
725        4
245        4
655        4
112        2
662        2
Name: step, Length: 743, dtype: int64

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?
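
One possible answer to the time question: since 1 step is 1 hour, the raw integer mixes "which day" and "what time of day". A minimal sketch of splitting it into the two parts is shown below on a few toy step values. It assumes step 1 is the first hour of the first day, which the dataset description does not state, so the resulting hour is an offset from the simulation start rather than a confirmed clock time.

```python
import pandas as pd

# Toy step values; in the lab this would be fraud['step'] (1 step = 1 hour).
steps = pd.Series([1, 19, 24, 25, 187, 743], name="step")

# Assumption: step 1 is hour 0 of day 0, so shift to zero-based before dividing.
hour_of_day = (steps - 1) % 24
day = (steps - 1) // 24

time_features = pd.DataFrame({"step": steps, "hour_of_day": hour_of_day, "day": day})
print(time_features)
```

The hour-of-day column could then be one-hot encoded or binned, in the same way `type` is handled with `pd.get_dummies` in this lab.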

In [13]:
fraud['type'].value_counts()


CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: type, dtype: int64

In [14]:
newdf = pd.get_dummies(fraud['type'])

fraud_clean = fraud.join(newdf)

fraud_clean.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,0,0,0,1,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,0,0,0,1,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,0,0,0,1
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0,1,0,0,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,0,0,0,1,0


In [15]:
#fraud_clean.drop(['step','type','nameOrig', 'nameDest','oldbalanceDest','newbalanceDest'], axis = 1, inplace=True) 

fraud_clean.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,0,0,0,1,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,0,0,0,1,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,0,0,0,1
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0,1,0,0,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,0,0,0,1,0


In [16]:
fraud_clean['isFraud'].value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64

In [17]:
# Undersample: keep all 8213 fraud rows and draw about 5 times as many non-fraud rows

notfraud_sample = fraud_clean[fraud_clean['isFraud']==0].sample(n=41000) # 41000 rows

isfraud_sample = fraud_clean[fraud_clean['isFraud']==1]  # 8213 rows

In [18]:
frames = [notfraud_sample,isfraud_sample]

fraud_final = pd.concat(frames)

fraud_final.shape

(49213, 16)

In [19]:
fraud_final.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
5355620,375,PAYMENT,11563.27,C1667931150,385531.25,373967.98,M630731492,0.0,0.0,0,0,0,0,0,1,0
4380086,311,PAYMENT,8909.4,C2048433398,17697.0,8787.6,M1907947457,0.0,0.0,0,0,0,0,0,1,0
5106455,355,PAYMENT,3381.17,C2091951865,0.0,0.0,M2116432572,0.0,0.0,0,0,0,0,0,1,0
714747,37,PAYMENT,12112.76,C1105447760,223612.26,211499.49,M222512406,0.0,0.0,0,0,0,0,0,1,0
1454918,140,CASH_IN,264237.71,C1642094927,3626045.46,3890283.17,C1664276005,573584.16,309346.46,0,0,1,0,0,0,0


### Run a logistic regression classifier and evaluate its accuracy.

In [20]:
# Model with 8213 cases of IS Fraud and 41000 cases of NO fraud

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix,  precision_score, recall_score, f1_score


# split
x = fraud_final[['amount','oldbalanceOrg','newbalanceOrig','isFlaggedFraud', 'CASH_IN','CASH_OUT','DEBIT','PAYMENT' ,'TRANSFER']]
y = fraud_final['isFraud']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# build & fit model

fraud_model = LogisticRegression().fit(X_train, y_train)

# predict

y_pred = fraud_model.predict(X_test)

# evaluate

print('Accuracy:', accuracy_score(y_test,y_pred), '\n')
# Precision = true positives / (true positives + false positives)
print('Precision:', precision_score(y_test, y_pred), '\n')

# Recall = true positives / (true positives + false negatives)
print('Recall/Sensitivity:', recall_score(y_test, y_pred), '\n') # share of actual fraud cases caught

# F1 = harmonic mean of precision and recall
print('F1-Score:', f1_score(y_test, y_pred),  '\n')      
      

print(confusion_matrix(y_test,y_pred))

Accuracy: 0.9129330488672153 

Precision: 0.6582023377670294 

Recall/Sensitivity: 0.9945188794153471 

F1-Score: 0.7921416444336647 

[[7353  848]
 [   9 1633]]
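
As a check on how the four scores relate to the confusion matrix, the worked calculation below recomputes them by hand from the counts printed above. It relies on scikit-learn's layout for `confusion_matrix`, where rows are true classes and columns are predicted classes, both in the order (0, 1).

```python
# Counts read from the confusion matrix above:
# row 0 = actual not fraud, row 1 = actual fraud; columns = predicted 0, predicted 1.
tn, fp = 7353, 848
fn, tp = 9, 1633

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the rows flagged as fraud, how many really are fraud
recall = tp / (tp + fn)      # of the real fraud rows, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
```

Reading the matrix this way makes the trade-off visible: only 9 fraud cases are missed, at the cost of 848 legitimate transactions flagged as fraud.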


In [21]:
# Model with the original dataset 8213 cases of IS Fraud and 6354407 cases of NO fraud


# split
x = fraud_clean[['amount','oldbalanceOrg','newbalanceOrig','isFlaggedFraud', 'CASH_IN','CASH_OUT','DEBIT','PAYMENT' ,'TRANSFER']]
y = fraud_clean['isFraud']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# build & fit model

fraud_model_all_data = LogisticRegression().fit(X_train, y_train)

# predict

y_pred2 = fraud_model_all_data.predict(X_test)

# evaluate

print('Accuracy:', accuracy_score(y_test,y_pred2), '\n')

print('Precision:', precision_score(y_test, y_pred2), '\n')

print('Recall/Sensitivity:', recall_score(y_test, y_pred2), '\n') # share of actual fraud cases caught

print('F1-Score:', f1_score(y_test, y_pred2),  '\n')  

print(confusion_matrix(y_test,y_pred2))

Accuracy: 0.9994782023757509 

Precision: 0.7201442091031997 

Recall/Sensitivity: 0.9737964655697745 

F1-Score: 0.8279792746113989 

[[1270262     621]
 [     43    1598]]


### Now pick a model of your choice and evaluate its accuracy.

In [25]:
# Choose a balanced sample: as many NOT fraud rows as IS fraud rows (8213 each)

notfraud_sample2 = fraud_clean[fraud_clean['isFraud']==0].sample(n=8213) # 8213 rows

isfraud_sample2 = fraud_clean[fraud_clean['isFraud']==1]  # 8213 rows

frames = [notfraud_sample2,isfraud_sample2]

fraud_final2 = pd.concat(frames)

fraud_final2.shape


fraud_final2['isFraud'].value_counts()

1    8213
0    8213
Name: isFraud, dtype: int64

In [26]:
# Model with the balanced sample: 8213 cases of IS Fraud and 8213 cases of NO fraud


# split
x = fraud_final2[['amount','oldbalanceOrg','newbalanceOrig','isFlaggedFraud', 'CASH_IN','CASH_OUT','DEBIT','PAYMENT' ,'TRANSFER']]
y = fraud_final2['isFraud']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# build & fit model

fraud_model_half_half_data = LogisticRegression().fit(X_train, y_train)

# predict

y_pred3 = fraud_model_half_half_data.predict(X_test)

# evaluate

print('Accuracy:', accuracy_score(y_test,y_pred3), '\n')

print('Precision:', precision_score(y_test, y_pred3), '\n')

print('Recall/Sensitivity:', recall_score(y_test, y_pred3), '\n') # share of actual fraud cases caught

print('F1-Score:', f1_score(y_test, y_pred3),  '\n')  

print(confusion_matrix(y_test,y_pred3))

Accuracy: 0.8925745587340231 

Precision: 0.8268941294530858 

Recall/Sensitivity: 0.9951690821256038 

F1-Score: 0.9032611674431351 

[[1285  345]
 [   8 1648]]


### Which model worked better and how do you know?

In [24]:
# Your response here
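
One caveat when comparing the three runs above: each was scored on a test set with a different class mix (about 17% fraud, about 0.13% fraud, and about 50% fraud, as the confusion matrices show), so their precision and accuracy values are not directly comparable. A common way to get a fair comparison is to split first, undersample only the training split, and score every model on the same untouched test split. The sketch below illustrates this on synthetic data from `make_classification`; the sample size, feature count, 2% positive weight and `max_iter=1000` are arbitrary choices for the example, not values from the lab.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data with a rare positive class (arbitrary settings).
X, y = make_classification(
    n_samples=20000, n_features=8, weights=[0.98, 0.02], random_state=0
)
X = pd.DataFrame(X)
y = pd.Series(y, name="isFraud")

# Split first, stratified, so the test set keeps the original class mix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Undersample the majority class in the training split only.
pos_idx = y_train[y_train == 1].index
neg_idx = y_train[y_train == 0].sample(n=len(pos_idx), random_state=0).index
balanced_idx = pos_idx.union(neg_idx)

model = LogisticRegression(max_iter=1000).fit(
    X_train.loc[balanced_idx], y_train.loc[balanced_idx]
)
y_pred = model.predict(X_test)

print("Test positive rate:", y_test.mean())
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred))
```

With this setup, models trained on differently sampled data can all be compared on one test set that reflects the real fraud rate.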