# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes.


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model scores well on the training data but poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), which produces a model that does not work well on either training or real data.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [8]:
data = pd.read_csv("/content/drive/MyDrive/Labs Week 7/Lab5 Imbalance/PS_20174392719_1491204439457_log.csv").sample(n=100000, random_state=42)

In [11]:
data.head(5)

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3737323 | 278 | CASH_IN | 330218.42 | C632336343 | 20866.0 | 351084.42 | C834976624 | 452419.57 | 122201.15 | 0 | 0 |
| 264914 | 15 | PAYMENT | 11647.08 | C1264712553 | 30370.0 | 18722.92 | M215391829 | 0.0 | 0.0 | 0 | 0 |
| 85647 | 10 | CASH_IN | 152264.21 | C1746846248 | 106589.0 | 258853.21 | C1607284477 | 201303.01 | 49038.8 | 0 | 0 |
| 5899326 | 403 | TRANSFER | 1551760.63 | C333676753 | 0.0 | 0.0 | C1564353608 | 3198359.45 | 4750120.08 | 0 | 0 |
| 2544263 | 206 | CASH_IN | 78172.3 | C813403091 | 2921331.58 | 2999503.88 | C1091768874 | 415821.9 | 337649.6 | 0 | 0 |


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 3737323 to 6142173
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            100000 non-null  int64  
 1   type            100000 non-null  object 
 2   amount          100000 non-null  float64
 3   nameOrig        100000 non-null  object 
 4   oldbalanceOrg   100000 non-null  float64
 5   newbalanceOrig  100000 non-null  float64
 6   nameDest        100000 non-null  object 
 7   oldbalanceDest  100000 non-null  float64
 8   newbalanceDest  100000 non-null  float64
 9   isFraud         100000 non-null  int64  
 10  isFlaggedFraud  100000 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 9.2+ MB


In [10]:
display(data.describe())

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 243.70907 | 180581.1 | 836680.4 | 858223.4 | 1104193.0 | 1230055.0 | 0.00141 | 1e-05 |
| std | 142.518613 | 558669.9 | 2901104.0 | 2936799.0 | 3223011.0 | 3475326.0 | 0.037524 | 0.003162 |
| min | 1.0 | 0.92 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13508.21 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 240.0 | 76030.86 | 13938.5 | 0.0 | 138748.2 | 218578.6 | 0.0 | 0.0 |
| 75% | 335.0 | 209113.0 | 107077.1 | 146416.9 | 960596.3 | 1126011.0 | 0.0 | 0.0 |
| max | 736.0 | 36973900.0 | 33593210.0 | 33887090.0 | 236289600.0 | 272404700.0 | 1.0 | 1.0 |


### What is the distribution of the outcome?

In [14]:
data['isFraud'].value_counts()

0    99859
1      141
Name: isFraud, dtype: int64
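
To put these counts in perspective, it may help to compute the fraud rate and the accuracy of a trivial "model" that labels every transaction as not fraud. The sketch below is only an illustration: it copies the two counts from the output above by hand, and the variable names are chosen for this example rather than taken from the notebook.

```python
# Class counts copied from the value_counts() output above
n_legit, n_fraud = 99859, 141
n_total = n_legit + n_fraud

# Share of transactions that are fraudulent
fraud_rate = n_fraud / n_total

# Accuracy of a classifier that always predicts "not fraud"
baseline_accuracy = n_legit / n_total

# Number of legitimate transactions per fraudulent one
imbalance_ratio = n_legit / n_fraud

print(f"fraud rate:        {fraud_rate:.5f}")
print(f"baseline accuracy: {baseline_accuracy:.5f}")
print(f"legit per fraud:   {imbalance_ratio:.0f}")
```

Any accuracy score on this sample should be read against that baseline: a model can reach roughly 99.86% accuracy without detecting a single fraud.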

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [15]:
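# Drop the account identifiers and the rule-based flag; they are not used as features
cleaned_data = data.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)
# One-hot encode the transaction type (CASH_IN is the dropped reference level)
cleaned_data['type'] = cleaned_data['type'].astype('category')
cleaned_data = pd.get_dummies(cleaned_data, columns=['type'], drop_first=True)

# Note: unit='s' treats each step as one second. The step column is converted
# back to the same integers before modelling, so the models see the raw step count.
cleaned_data['step'] = pd.to_datetime(cleaned_data['step'], unit='s')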

X = cleaned_data.drop('isFraud', axis=1)
y = cleaned_data['isFraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
X_train.dtypes

step              datetime64[ns]
amount                   float64
oldbalanceOrg            float64
newbalanceOrig           float64
oldbalanceDest           float64
newbalanceDest           float64
type_CASH_OUT              uint8
type_DEBIT                 uint8
type_PAYMENT               uint8
type_TRANSFER              uint8
dtype: object

In [18]:
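# Convert the datetime column back to integer seconds so scikit-learn can use it
# (this recovers the original step values); 'int64' avoids platform-dependent int sizes
X_train['step'] = X_train['step'].astype('int64') // 10**9
X_test['step'] = X_test['step'].astype('int64') // 10**9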
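
One alternative way to integrate the time variable is to split the step counter into a day index and a position within the day. The sketch below assumes that one step corresponds to one hour of simulated time, as the Kaggle dataset description states. The helper `split_step` is hypothetical and not part of the notebook, and because the source does not say which clock hour step 1 falls on, `hour_of_day` is only a position within a 24-step cycle. The example steps are the five values shown in `data.head(5)`.

```python
def split_step(step, steps_per_day=24):
    """Split an hourly step counter into (day_index, hour_of_day).

    Assumes one step is one hour; the alignment of step 1 with a real
    clock hour is unknown, so hour_of_day is a position in a 24-step cycle.
    """
    return divmod(step, steps_per_day)

# Step values taken from the first five rows of the sample
example_steps = [278, 15, 10, 403, 206]
features = [split_step(s) for s in example_steps]

for step, (day, hour) in zip(example_steps, features):
    print(f"step={step:3d} -> day_index={day:2d}, hour_of_day={hour:2d}")
```

On the original integer column, the same idea can be written in pandas as `cleaned_data['step'] % 24` and `cleaned_data['step'] // 24`. A cyclical feature like this lets a model pick up time-of-day patterns that a monotonically increasing counter cannot express.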

### Run a logistic regression classifier and evaluate its accuracy.

In [20]:
lr_classifier = LogisticRegression(max_iter=1000)
lr_classifier.fit(X_train, y_train)
lr_pred = lr_classifier.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_pred)
print(lr_accuracy)
print(classification_report(y_test, lr_pred))

0.99865
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19972
           1       0.53      0.36      0.43        28

    accuracy                           1.00     20000
   macro avg       0.76      0.68      0.71     20000
weighted avg       1.00      1.00      1.00     20000



### Now pick a model of your choice and evaluate its accuracy.

In [22]:
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)
rf_pred = rf_classifier.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)
print(rf_accuracy)

0.99965


In [25]:
print(classification_report(y_test, rf_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19972
           1       1.00      0.75      0.86        28

    accuracy                           1.00     20000
   macro avg       1.00      0.88      0.93     20000
weighted avg       1.00      1.00      1.00     20000
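
The two classification reports can be cross-checked by recomputing the fraud-class metrics from confusion-matrix counts. The notebook never prints the confusion matrices, so the counts below are inferred: they are integer counts that reproduce the printed accuracies and the rounded precision, recall, and F1 values for a test set with 28 frauds and 19,972 legitimate transactions. Treat them as an assumption, and treat `binary_metrics` as a helper written for this sketch.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts inferred from the printed reports (not output by the notebook itself)
inferred_counts = {
    "logistic_regression": dict(tp=10, fp=9, fn=18, tn=19963),
    "random_forest": dict(tp=21, fp=0, fn=7, tn=19972),
}

results = {name: binary_metrics(**counts) for name, counts in inferred_counts.items()}

for name, metrics in results.items():
    rounded = {key: round(value, 5) for key, value in metrics.items()}
    print(name, rounded)
```

Under these inferred counts, the logistic regression catches 10 of the 28 frauds and raises 9 false alarms, while the random forest catches 21 with no false alarms. The two accuracy scores differ by only 0.001 even though the fraud-class recall roughly doubles.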



### Which model worked better and how do you know?

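**Random forest worked better. Its overall accuracy is higher (0.99965 vs. 0.99865), and for the fraud class (1) its classification report shows better precision (1.00 vs. 0.53), recall (0.75 vs. 0.36), and F1-score (0.86 vs. 0.43). Because only 28 of the 20,000 test transactions are fraudulent, overall accuracy and the class 0 metrics are close to 1.00 for both models, so the fraud-class metrics are the more informative comparison.**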
