# Machine Learning Lab Midterm

<image src="image.png" height=500></image>

## Question 1

This is a _class imbalance_ problem.
It can be addressed by either:
- **Undersampling by removing data** 

or
- **Oversampling by duplicating fraud cases**

However, oversampling may lead to **overfitting on the positive class**, since the same fraud cases are repeated many times.

Hence, we will use **_undersampling_** by random elimination so that both classes are balanced.

- Decreasing the class probability threshold may also help, but it may lead to **more false positives**.
- Regularization will not help here, as it addresses **overfitting**, not **class imbalance**.
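
To illustrate the threshold trade-off from the list above, here is a minimal sketch on made-up scores. The probabilities, labels, and the helper `count_errors` are hypothetical and are not taken from the dataset or the notebook.

```python
import numpy as np

# Hypothetical predicted fraud probabilities and true labels (illustrative only)
probs = np.array([0.10, 0.35, 0.45, 0.55, 0.80, 0.32, 0.60])
labels = np.array([0, 0, 0, 0, 1, 1, 1])

def count_errors(probs, labels, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    preds = (probs >= threshold).astype(int)
    false_positives = int(np.sum((preds == 1) & (labels == 0)))
    false_negatives = int(np.sum((preds == 0) & (labels == 1)))
    return false_positives, false_negatives

for threshold in (0.5, 0.3):
    fp, fn = count_errors(probs, labels, threshold)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

On these toy values, lowering the threshold from 0.5 to 0.3 removes the missed fraud case but raises the number of false positives from 1 to 3.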

We will be using the [Credit Card Fraud dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) from Kaggle.

We will only be using the libraries **NumPy** *(for mathematics)* and **pandas** *(for data import and preprocessing)*.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('creditcard.csv')

In [3]:
df.head()


| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


In [4]:
df_class0 = df[df['Class']==0]
df_class1 = df[df['Class']==1]

In [5]:
print(f"Number of elements in class 0: {df_class0.shape[0]}")
print(f"Number of elements in class 1: {df_class1.shape[0]}")


Number of elements in class 0: 284315
Number of elements in class 1: 492


## Data Balancing

Here we can see that the data is clearly imbalanced, so we will undersample *Class 0 (not fraud)* down to the size of Class 1.

In [6]:
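elemCount = df_class1.shape[0]  # number of fraud rows; target size for class 0
np.random.seed(0)  # fixed seed for reproducibility
# Draw elemCount random row positions from class 0.
# Note: np.random.randint samples with replacement, so a position can repeat.
class_0_indexes = np.random.randint(0, df_class0.shape[0], elemCount)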

In [7]:
balanced_class_0 = df_class0.iloc[class_0_indexes]
balanced_class_0.shape

(492, 31)

In [8]:
df_class1.shape

(492, 31)

Now that both Class 0 and Class 1 have the same shape, we can join them into a single dataframe and create a train/test split.

In [9]:
balanced_df = pd.concat([balanced_class_0,df_class1],axis=0)
balanced_df.shape

(984, 31)
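
Cell 6 draws the class-0 positions with `np.random.randint`, which samples with replacement, so the selected rows are not guaranteed to be distinct. The following is a minimal sketch of a without-replacement alternative, shown on a small made-up DataFrame. The names `toy_df` and `undersample_majority` are hypothetical, and the helper assumes a binary label column.

```python
import numpy as np
import pandas as pd

def undersample_majority(df, label_col="Class", seed=0):
    """Keep all minority rows plus an equally sized, duplicate-free
    random subset of the majority rows (binary labels assumed)."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label]
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no majority row is picked twice
    keep = rng.choice(majority.shape[0], size=minority.shape[0], replace=False)
    return pd.concat([majority.iloc[keep], minority], axis=0)

# Toy data: 16 rows of class 0 and 4 rows of class 1 (illustrative only)
toy_df = pd.DataFrame({"Amount": np.arange(20.0), "Class": [0] * 16 + [1] * 4})
balanced = undersample_majority(toy_df)
print(balanced["Class"].value_counts())
```

Because each majority row can appear at most once, the balanced frame keeps a unique index, which is not guaranteed when positions are drawn with replacement.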

## Train Test Split

In [10]:
# Considering a 70 30 Train Test Split

all_indexes = np.arange(0,balanced_df.shape[0])
train_indexes = np.random.choice(all_indexes,replace=False,size=int(balanced_df.shape[0]*0.7))
test_indexes = np.setdiff1d(all_indexes,train_indexes)

print(f"train indexes shape: {train_indexes.shape}")
print(f"test indexes shape: {test_indexes.shape}")

train indexes shape: (688,)
test indexes shape: (296,)


In [11]:
train_df = balanced_df.iloc[train_indexes]
test_df = balanced_df.iloc[test_indexes]
print(f"train df shape: {train_df.shape}")
print(f"test df shape: {test_df.shape}")

train df shape: (688, 31)
test df shape: (296, 31)


In [12]:
xtrain = train_df.drop('Class',axis=1)
xtest = test_df.drop('Class',axis=1)
ytrain = train_df['Class']
ytest = test_df['Class']
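
The split above passes the raw columns to the model, so features on very different scales (for example `Time` and `Amount` next to the `V1` to `V28` columns) all share one learning rate. A common preprocessing step, which this notebook does not perform, is to standardize each feature using statistics computed on the training set only. A minimal NumPy sketch on made-up arrays follows; `standardize`, `toy_train`, and `toy_test` are hypothetical names.

```python
import numpy as np

def standardize(train, test):
    """Scale both arrays with the mean and standard deviation of the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Toy data: two features on very different scales (illustrative only)
toy_train = np.array([[0.0, 100.0], [1.0, 300.0], [2.0, 500.0]])
toy_test = np.array([[1.0, 300.0]])

train_scaled, test_scaled = standardize(toy_train, toy_test)
print(train_scaled)
print(test_scaled)
```

Using the training statistics for the test set keeps information about the test rows out of the preprocessing step.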

## Model Definition (Logistic Regression)

In [13]:
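class LogisticRegression:
    """Binary logistic regression trained with batch gradient descent (no intercept term)."""

    def __init__(self):
        print("Init")
        self.theta = None  # weight vector, one entry per feature; set in fit()

    def sigmoid(self, z):
        # Logistic function: maps any real value into (0, 1)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y, alpha, epochs=1000):
        # m = number of samples, n = number of features
        m, n = X.shape
        self.theta = np.zeros(n)
        for _ in range(epochs):
            predictions = self.sigmoid(np.dot(X, self.theta))
            # Gradient of the mean log-loss with respect to theta
            gradient = (1 / m) * np.dot(X.T, (predictions - y))
            self.theta -= alpha * gradient

    def predict(self, xtest):
        # Classify as 1 when the predicted probability is at least 0.5
        probability = self.sigmoid(np.dot(xtest, self.theta))
        return (probability >= 0.5).astype(int)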
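
The `fit` method above performs batch gradient descent on the mean log-loss. Written out, with $m$ training rows, feature matrix $X$, labels $y$, weights $\theta$, and learning rate $\alpha$ (the `alpha` argument in the code), the model and loss are:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \sigma(X\theta)
$$

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\right]
$$

The gradient and the update applied once per epoch are:

$$
\nabla_\theta J = \frac{1}{m} X^\top(\hat{y} - y), \qquad \theta \leftarrow \theta - \alpha \, \nabla_\theta J
$$

There is no separate intercept in these formulas because the class does not fit one: $\theta$ has exactly one entry per column of $X$.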
    
    

In [14]:
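# Suppress all warnings for the rest of the notebook.
# Note that this also hides numerical warnings (for example, overflow in np.exp).
import warnings
warnings.filterwarnings('ignore')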

In [15]:
model = LogisticRegression()
model.fit(xtrain,ytrain,alpha=0.001)

Init


In [16]:
predictions = model.predict(xtest)
predictions.shape

(296,)

In [17]:
ytest.shape

(296,)

## Metrics

In [18]:
class Metrics:
    """Confusion-matrix counts and derived metrics for binary 0/1 labels."""

    def __init__(self):
        print("Init")
        self.tp = None
        self.tn = None
        self.fp = None
        self.fn = None

    def calculate_metrics(self, ytrue, ypreds):
        # ytrue is expected to be a pandas Series, ypreds a NumPy array
        if ytrue.shape != ypreds.shape:
            print("Incompatible shapes")
            return -1
        self.fn = 0
        self.tn = 0
        self.tp = 0
        self.fp = 0

        for i in range(ytrue.shape[0]):
            if ytrue.iloc[i] == 1 and ypreds[i] == 1:    # TP
                self.tp += 1
            elif ytrue.iloc[i] == 1 and ypreds[i] == 0:  # FN
                self.fn += 1
            elif ytrue.iloc[i] == 0 and ypreds[i] == 1:  # FP
                self.fp += 1
            elif ytrue.iloc[i] == 0 and ypreds[i] == 0:  # TN
                self.tn += 1

    def accuracy(self):
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)

    def precision(self):
        return self.tp / (self.tp + self.fp)

    def recall(self):
        return self.tp / (self.tp + self.fn)

    def f1(self):
        return (2 * self.precision() * self.recall()) / (self.precision() + self.recall())

    def conf_matrix(self):
        # Rows: actual class (0, 1); columns: predicted class (0, 1)
        return [[self.tn, self.fp],
                [self.fn, self.tp]]
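
The counting loop in `calculate_metrics` can also be expressed with boolean masks. A minimal sketch, assuming both inputs are 0/1 sequences of equal length; `confusion_counts` is a hypothetical helper and not part of the class above.

```python
import numpy as np

def confusion_counts(ytrue, ypreds):
    """Return (tp, fp, tn, fn) for binary 0/1 labels using boolean masks."""
    ytrue = np.asarray(ytrue)
    ypreds = np.asarray(ypreds)
    tp = int(np.sum((ytrue == 1) & (ypreds == 1)))
    fp = int(np.sum((ytrue == 0) & (ypreds == 1)))
    tn = int(np.sum((ytrue == 0) & (ypreds == 0)))
    fn = int(np.sum((ytrue == 1) & (ypreds == 0)))
    return tp, fp, tn, fn

# Toy labels and predictions (illustrative only)
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
tp, fp, tn, fn = confusion_counts(toy_true, toy_pred)
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```

Converting the inputs with `np.asarray` lets the same helper accept lists, NumPy arrays, or pandas Series.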

In [19]:
metrics = Metrics()
metrics.calculate_metrics(ytest,ypreds=predictions)

Init


In [20]:
print(f"Accuracy of the model is: {metrics.accuracy()}")
print(f"precision of the model is: {metrics.precision()}")
print(f"Recall of the model is: {metrics.recall()}")
print(f"F1 Score of the model is: {metrics.f1()}")


Accuracy of the model is: 0.5033783783783784
precision of the model is: 0.5033783783783784
Recall of the model is: 1.0
F1 Score of the model is: 0.6696629213483146


In [21]:
print(f"TP: {metrics.tp}")
print(f"FP: {metrics.fp}")
print(f"TN: {metrics.tn}")
print(f"FN: {metrics.fn}")


TP: 149
FP: 147
TN: 0
FN: 0
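
Reading the counts above, TN = 0 and FN = 0 mean that the model predicted class 1 for all 296 test rows. The reported metrics follow directly from the four counts:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{149 + 0}{149 + 0 + 147 + 0} = \frac{149}{296} \approx 0.5034
$$

$$
\text{Precision} = \frac{TP}{TP + FP} = \frac{149}{296} \approx 0.5034, \qquad \text{Recall} = \frac{TP}{TP + FN} = \frac{149}{149} = 1
$$

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{298}{445} \approx 0.6697
$$

Accuracy and precision coincide here because every prediction is positive, so both reduce to the fraction of test rows that are actually fraud.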
