# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well in training and poorly when it is applied to real data. Similarly, a model can be too simplistic to capture the signal (underfitting), which produces a model that never works well.


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [1]:
import pandas as pd
full_size = 6362619

paysim = pd.read_csv('/home/osboxes/Téléchargements/PS_20174392719_1491204439457_log.csv', nrows = round(full_size*0.1))

### What is the distribution of the outcome? 

In [2]:
paysim.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 636262.0 | 636262.0 | 636262.0 | 636262.0 | 636262.0 | 636262.0 | 636262.0 | 636262.0 |
| mean | 16.63756 | 162047.9 | 888489.0 | 908529.7 | 975520.9 | 1137692.0 | 0.000602 | 0.0 |
| std | 6.846997 | 270071.6 | 2962991.0 | 3000055.0 | 2318776.0 | 2473439.0 | 0.024527 | 0.0 |
| min | 1.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12.0 | 12468.23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 16.0 | 76357.3 | 17417.6 | 0.0 | 114436.1 | 208494.5 | 0.0 | 0.0 |
| 75% | 19.0 | 216188.6 | 154320.8 | 194080.6 | 893617.0 | 1172047.0 | 0.0 | 0.0 |
| max | 35.0 | 10000000.0 | 38939420.0 | 38946230.0 | 41482700.0 | 41482700.0 | 1.0 | 0.0 |
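
The mean of `isFraud` in the summary above (0.000602) already hints at how rare the positive class is. A minimal sketch of making that explicit is shown below; `outcome_distribution` is a hypothetical helper and the tiny `toy` frame is made up to stand in for `paysim`.

```python
import pandas as pd


def outcome_distribution(df: pd.DataFrame, target: str = "isFraud") -> pd.DataFrame:
    """Return the count and share of each class in the target column."""
    counts = df[target].value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


# Made-up data standing in for `paysim`: nine legitimate rows and one fraud row.
toy = pd.DataFrame({"isFraud": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})
summary = outcome_distribution(toy)
print(summary)
```

Calling `outcome_distribution(paysim)` on the real frame should show the same kind of table, with the fraud row holding a very small share.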


In [3]:
paysim.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [4]:
paysim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 636262 entries, 0 to 636261
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            636262 non-null  int64  
 1   type            636262 non-null  object 
 2   amount          636262 non-null  float64
 3   nameOrig        636262 non-null  object 
 4   oldbalanceOrg   636262 non-null  float64
 5   newbalanceOrig  636262 non-null  float64
 6   nameDest        636262 non-null  object 
 7   oldbalanceDest  636262 non-null  float64
 8   newbalanceDest  636262 non-null  float64
 9   isFraud         636262 non-null  int64  
 10  isFlaggedFraud  636262 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 53.4+ MB


In [5]:
paysim = paysim.drop(['nameOrig','nameDest'],axis=1)

In [6]:
paysim.type.unique()

array(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN'],
      dtype=object)

In [7]:
# Encode each transaction type as an integer code in a single pass.
# Note: the codes are arbitrary labels, but a linear model will read them as an ordered scale.
type_codes = {'PAYMENT': 0, 'TRANSFER': 1, 'CASH_OUT': 2, 'DEBIT': 3, 'CASH_IN': 4}
paysim['type'] = paysim['type'].map(type_codes)

In [8]:
paysim.type = pd.to_numeric(paysim.type)
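
The integer codes above impose an order on the transaction types that has no real meaning. One common alternative is one-hot encoding; the sketch below illustrates it with `pd.get_dummies` on a small made-up frame (the `toy` name and the amounts are hypothetical, only the type labels come from the data).

```python
import pandas as pd

# Hypothetical mini-sample using the five transaction types seen in the data.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"],
    "amount": [9839.64, 181.0, 181.0, 5337.77, 1000.0],
})

# One indicator column per transaction type; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["type"], prefix="type", dtype=int)
print(encoded.columns.tolist())
```

Each row then has exactly one `type_*` column set to 1, so no type is treated as "larger" than another. This would replace the integer coding rather than follow it, since it needs the original string labels.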

In [9]:
paysim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 636262 entries, 0 to 636261
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            636262 non-null  int64  
 1   type            636262 non-null  int64  
 2   amount          636262 non-null  float64
 3   oldbalanceOrg   636262 non-null  float64
 4   newbalanceOrig  636262 non-null  float64
 5   oldbalanceDest  636262 non-null  float64
 6   newbalanceDest  636262 non-null  float64
 7   isFraud         636262 non-null  int64  
 8   isFlaggedFraud  636262 non-null  int64  
dtypes: float64(5), int64(4)
memory usage: 43.7 MB


In [10]:
# step counts hours; convert hours to nanoseconds since the Unix epoch to get a timestamp
paysim.step = pd.to_datetime(paysim.step * 1_000_000_000 * 60 * 60)

In [11]:
import datetime as dt
# toordinal() keeps only the calendar day, so the hour within each day is discarded
paysim.step = paysim.step.map(dt.datetime.toordinal)
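
Because the ordinal conversion keeps only the day, a 35-hour sample collapses to two distinct values. A possible alternative, sketched here on made-up step values, is to derive an hour-of-day and a day counter directly from the original integer `step` (before it is overwritten). The column names `hour_of_day` and `day` are hypothetical, and treating step 0 as midnight is an assumption.

```python
import pandas as pd

# Hypothetical step values; each step is treated as one simulated hour.
toy = pd.DataFrame({"step": [1, 12, 24, 25, 35]})

# Position within a 24-hour cycle (assumes the simulation starts at midnight).
toy["hour_of_day"] = toy["step"] % 24
# Whole simulated days elapsed since the start.
toy["day"] = toy["step"] // 24

print(toy)
```

Whether hour-of-day helps depends on whether fraud in the simulation follows a daily pattern, which is worth checking with a plot before relying on it.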

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [12]:
paysim.columns

Index(['step', 'type', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
       'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud'],
      dtype='object')

In [13]:
from sklearn.model_selection import train_test_split
model_X = ['step', 'type', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
       'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud']
X = paysim[model_X]
y = paysim.isFraud
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
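
With so few fraud rows, a purely random split can leave the train and test sets with noticeably different fraud rates. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. The data here is synthetic (made-up sizes and a 1% positive rate), not the PaySim sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows, of which 100 (1%) are positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
y = np.zeros(10_000, dtype=int)
y[:100] = 1

# stratify=y preserves the 1% positive share in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())
```

In the notebook, the equivalent change would be adding `stratify=y` to the existing call.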

### Run a logistic regression classifier and evaluate its accuracy.

In [14]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42)

In [15]:
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.9995599317894274
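
To put this accuracy in context, it helps to compare it with a model that always predicts "not fraud". The arithmetic below is a rough sketch: it uses the fraud rate of the whole 10% sample from `paysim.describe()`, which is assumed to be close to the rate in the test split but is not guaranteed to match it exactly.

```python
# Figures copied from the notebook outputs.
fraud_rate = 0.000602                  # mean of isFraud in paysim.describe()
logreg_accuracy = 0.9995599317894274   # clf.score(X_test, y_test)

# A model that always predicts "not fraud" is right on every non-fraud row.
baseline_accuracy = 1 - fraud_rate

print(f"always-negative baseline: {baseline_accuracy:.6f}")
print(f"logistic regression:      {logreg_accuracy:.6f}")
print(f"difference:               {logreg_accuracy - baseline_accuracy:.6f}")
```

The gap between the two is tiny, so accuracy alone says little about whether the classifier detects any fraud at all.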

### Now pick a model of your choice and evaluate its accuracy.

In [25]:
from sklearn import svm
# clf1 = svm.SVC(gamma=1)  # SVC scales poorly to large datasets, so use LinearSVC instead
clf1 = svm.LinearSVC(random_state=42)

In [26]:
clf1.fit(X_train, y_train)



LinearSVC(random_state=42)

In [27]:
clf1.score(X_test, y_test)

0.9995835068721366

In [28]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(max_depth=10, random_state=42)

In [29]:
RFC.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, random_state=42)

In [23]:
RFC.score(X_test, y_test)

0.9995049232631058

### Which model worked better and how do you know?

In [20]:
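# I can't tell from accuracy alone: all three scores are about 0.9995.
# Only about 0.06% of the sampled rows are fraud, so a model that always
# predicts "not fraud" would score roughly the same.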
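
When accuracy cannot separate the models, metrics that focus on the rare class usually can. The sketch below computes a confusion matrix, precision, and recall with `sklearn.metrics`. It runs on synthetic imbalanced data from `make_classification` (roughly 1% positives, made-up feature counts), so the numbers it prints say nothing about the PaySim models; it only shows the evaluation pattern.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the PaySim features.
X, y = make_classification(
    n_samples=20_000, n_features=8, n_informative=4,
    weights=[0.99, 0.01], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = model.score(X_test, y_test)
# Recall: share of actual positives that were caught.
recall = recall_score(y_test, y_pred, zero_division=0)
# Precision: share of predicted positives that were correct.
precision = precision_score(y_test, y_pred, zero_division=0)
# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(cm)
```

Running the same three metric calls on `clf`, `clf1`, and `RFC` with the notebook's `X_test` and `y_test` would give a basis for comparing them that accuracy does not.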