# Machine Learning Semester Project
## Murtaza Hussain (29449) and Muhammad Asad ur Rehman (29456)

### Class Imbalance Problem

The code below addresses the common problem of imbalanced datasets, where one class greatly outnumbers the other. This is the case for the following Credit Card Transactions dataset, which is used to detect fraudulent transactions. We will evaluate the following methods to resolve class imbalance:
1. Random Under Sampling
2. Algorithmic Methods (Using Random Forest as well as modifying Class Weights)
3. Anomaly Detection Method

For this dataset, we will use the following 5 algorithms to draw a comparison between the different methods:
1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Random Forest
4. Support Vector Machines (SVM)
5. Artificial Neural Network (ANN)

In [17]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, KFold
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, r2_score, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import make_scorer, recall_score
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE, ADASYN
from lazypredict.Supervised import LazyClassifier

pd.options.display.float_format = '{:,.4f}'.format

In [7]:
# Data Loader loads data from CSV Files
def load_dataset():
    dataset = pd.read_csv("./Source.CreditCardFraud.csv")
    return dataset

df = load_dataset()

In [3]:
# This function performs a missing value analysis on each column of the dataset and returns the columns that contain nulls
# It helps you decide what to do in the cleaning process
def null_check(df):
    null_columns = []
    for column in df.columns:
        print("Column Name:", column)
        print("Column DataType:", df[column].dtype)
        if df[column].dtype != 'float64' and df[column].dtype != 'int64':
            print("Column unique values:", df[column].unique())
        has_null = df[column].isnull().any()
        print("Column has null:", has_null)

        if has_null:
            print("Column Null Count:", df[column].isnull().sum())
            null_columns.append(column)
        print("\n")
    return null_columns

null_check(df)

Column Name: Time
Column DataType: int64
Column has null: False


Column Name: V1
Column DataType: float64
Column has null: False


Column Name: V2
Column DataType: float64
Column has null: False


Column Name: V3
Column DataType: float64
Column has null: False


Column Name: V4
Column DataType: float64
Column has null: False


Column Name: V5
Column DataType: float64
Column has null: False


Column Name: V6
Column DataType: float64
Column has null: False


Column Name: V7
Column DataType: float64
Column has null: False


Column Name: V8
Column DataType: float64
Column has null: False


Column Name: V9
Column DataType: float64
Column has null: False


Column Name: V10
Column DataType: float64
Column has null: False


Column Name: V11
Column DataType: float64
Column has null: False


Column Name: V12
Column DataType: float64
Column has null: False


Column Name: V13
Column DataType: float64
Column has null: False


Column Name: V14
Column DataType: float64
Column has null: False


Colum

[]

In [None]:
# This function drops unnecessary columns and handles missing values
# This is where you decide whether to remove NULL rows (which will reduce the size of the dataset) or remove NULL columns entirely. You can also choose a combination of both.
def clean_data(df, drop_columns, missing_value=None):
    # Remove unnecessary columns
    df.drop(drop_columns, axis=1, inplace=True)
    if missing_value is None:
        # Drop rows with any missing values
        df.dropna(inplace=True)
    else:
        # Fill missing values with the given value (0 is a valid fill value, so test against None rather than False)
        df.fillna(missing_value, inplace=True)
    return df
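
The trade-off described in the comments above (dropping rows, dropping columns, or filling values) can be illustrated on a toy frame. This is a minimal sketch; the `toy` DataFrame and its column names are made up for illustration and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two columns contain one missing value each.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

rows_dropped = toy.dropna(axis=0)  # keep only rows without missing values
cols_dropped = toy.dropna(axis=1)  # keep only columns without missing values
filled = toy.fillna(0)             # keep everything, impute a constant

print("rows dropped:", rows_dropped.shape)
print("cols dropped:", cols_dropped.shape)
print("filled      :", filled.shape)
```

Dropping rows shrinks the sample, dropping columns shrinks the feature set, and filling keeps both at the cost of introducing artificial values.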

In [14]:
# Prints a summary of class instances and distribution
def data_summary(df, target=None):
    if isinstance(df, pd.DataFrame) and target is not None:
        a = df[target].value_counts()
    else:
        a = df.value_counts()
    class0 = format(100 * a[0]/sum(a), ".2f")
    class1 = format(100 * a[1]/sum(a), ".2f")

    meta = pd.DataFrame([{ "%": class0, "count": a[0]},
                         { "%": class1, "count": a[1]}])
    print("Class Distribution:\n\n", meta)

data_summary(df,'Class')

Class Distribution:

        %  count
0  99.78  99776
1   0.22    223
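
To make the effect of this distribution concrete, the sketch below scores a hypothetical "model" that always predicts the majority class, using the class counts printed above. It is only an illustration of why accuracy is misleading here, not part of the project pipeline.

```python
import numpy as np

# Class counts taken from the distribution printed above.
n_legit, n_fraud = 99776, 223
y_true = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

# Hypothetical classifier that always predicts class 0 (not fraud).
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall for class 1 = true positives / all actual positives.
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall_fraud = true_positives / n_fraud

print(f"accuracy     = {accuracy:.4f}")
print(f"fraud recall = {recall_fraud:.4f}")
```

The accuracy is close to 1 even though not a single fraudulent transaction is detected, which is why recall on class 1 is a more informative metric for this dataset.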


In [9]:
# Encodes categorical columns as integers and standardizes numerical columns
def transform_data(df):
    # Encode categorical variables
    label_encoder = LabelEncoder()
    print("Categorical columns:", df.select_dtypes(include=['object', 'int64']).columns)
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = label_encoder.fit_transform(df[col])
    
    # Standardize numerical features
    scaler = StandardScaler()
    print("Numerical columns:", df.select_dtypes(include=['float64']).columns)
    numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
    if len(numerical_cols) > 0:
        df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
    return df

df['Class'] = df['Class'].astype(str)
df = transform_data(df)

Categorical columns: Index(['Time', 'Class'], dtype='object')
Numerical columns: Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount'],
      dtype='object')
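
The two steps performed by `transform_data` can be sketched on a toy frame as follows. The column names `merchant` and `amount` are hypothetical, and the sketch lists the columns explicitly so that the target is left out of the scaling step; whether the target would otherwise be scaled depends on the integer dtype the encoder returns, so this is a cautious variant rather than a description of the function above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data with one categorical feature, one numeric feature and a target.
toy = pd.DataFrame({
    "merchant": ["a", "b", "a", "c"],
    "amount": [10.0, 20.0, 30.0, 40.0],
    "Class": [0, 0, 1, 0],
})

categorical_cols = ["merchant"]
numeric_cols = ["amount"]  # the target 'Class' is deliberately excluded

# Encode each categorical column as integer codes.
for col in categorical_cols:
    toy[col] = LabelEncoder().fit_transform(toy[col])

# Standardize numeric features to zero mean and unit variance.
toy[numeric_cols] = StandardScaler().fit_transform(toy[numeric_cols])

print(toy)
```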


In [6]:
# Runs Baseline Model for All 5 Algorithms
def BaselineRunAll(df, target_name, k=10):

    # Separate features and targets
    X = df.drop(target_name, axis=1)
    y = df[target_name]
    results = []

    # Initialize the five classifiers
    lr_classifier = LogisticRegression()
    rf_classifier = RandomForestClassifier()
    knn_classifier = KNeighborsClassifier()
    svm_classifier = SVC()
    ann_classifier = MLPClassifier(max_iter=1000)
    

    # Initialize stratified k-fold cross-validation (k = 10 folds by default)
    # k = 10 is chosen to strike a balance between the number of minority-class samples in the train and test folds
    k_fold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

    # As the majority class makes up 99.78% of the data, accuracy cannot be used as a metric to evaluate performance
    # Define a recall scorer specifically focusing on the minority class
    recall_scorer = make_scorer(recall_score, pos_label=1)

    lr_scores = cross_val_score(lr_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("LR CV Completed")
    rf_scores = cross_val_score(rf_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("RF CV Completed")
    knn_scores = cross_val_score(knn_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("KNN CV Completed")
    svm_scores = cross_val_score(svm_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("SVM CV Completed")
    ann_scores = cross_val_score(ann_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("ANN CV Completed")
    lr_mean_recall = lr_scores.mean()
    rf_mean_recall = rf_scores.mean()
    knn_mean_recall = knn_scores.mean()
    svm_mean_recall = svm_scores.mean()
    ann_mean_recall = ann_scores.mean()
    results.append({'Method': 'Baseline Class 1 Recall',
                    'Logistic Regression': lr_mean_recall,
                    'Random Forest': rf_mean_recall,
                    'K-Nearest Neighbours': knn_mean_recall,
                    'Support Vector Machines': svm_mean_recall,
                    'Artificial Neural Networks': ann_mean_recall})
    
    df = pd.DataFrame(results)
    return df

results = BaselineRunAll(df, 'Class')
print(results)

LR CV Completed
RF CV Completed
KNN CV Completed
SVM CV Completed
ANN CV Completed


                    Method  Logistic Regression  Random Forest  \
0  Baseline Class 1 Recall               0.5648         0.8385

   K-Nearest Neighbours  Support Vector Machines  Artificial Neural Networks
0                0.8298                   0.7265                      0.8435
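
As a rough check on what each fold contains, the sketch below counts the fraud cases that land in each test fold with the same `StratifiedKFold` settings. It assumes the class counts printed earlier (99,776 vs 223) and uses placeholder features, since only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels reconstructed from the class counts printed earlier.
y = np.concatenate([np.zeros(99776, dtype=int), np.ones(223, dtype=int)])
X = np.zeros((len(y), 1))  # placeholder features; the split depends only on y

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Number of class-1 samples in each test fold.
fraud_per_test_fold = [int(y[test_idx].sum()) for _, test_idx in k_fold.split(X, y)]
print(fraud_per_test_fold)
```

With 223 fraud cases and 10 folds, each recall score is computed on only about 22 positives, so a single missed case moves a fold's recall by several percentage points.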


In [15]:
# Performs Undersampling of Majority Class followed by Oversampling of Minority class using SMOTE and tests on all 5 Algorithms
def RandomSamplingSMOTE(df, target_name, k=10):

    # Separate features and targets
    X = df.drop(target_name, axis=1)
    y = df[target_name]
    results = []

    # UnderSample Majority Class
    rus = RandomUnderSampler(sampling_strategy={0: 50000, 1: 223}, random_state=42)
    X, y = rus.fit_resample(X, y)
    print("Class Distribution after Undersampling Majority Class:")
    data_summary(y)

    # OverSample using SMOTE
    smote = SMOTE(random_state=42)
    X, y = smote.fit_resample(X, y)
    print("Class Distribution after Oversampling Minority Class using SMOTE:")
    data_summary(y)

    # Initialize the five classifiers
    lr_classifier = LogisticRegression()
    rf_classifier = RandomForestClassifier()
    knn_classifier = KNeighborsClassifier()
    svm_classifier = SVC()
    ann_classifier = MLPClassifier(max_iter=1000)

    # Initialize stratified k-fold cross-validation (k = 10 folds by default)
    # k = 10 is chosen to strike a balance between the number of minority-class samples in the train and test folds
    k_fold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

    # Define a recall scorer specifically focusing on the minority class
    recall_scorer = make_scorer(recall_score, pos_label=1)

    lr_scores = cross_val_score(lr_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("LR CV Completed")
    rf_scores = cross_val_score(rf_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("RF CV Completed")
    knn_scores = cross_val_score(knn_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("KNN CV Completed")
    svm_scores = cross_val_score(svm_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("SVM CV Completed")
    ann_scores = cross_val_score(ann_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("ANN CV Completed")
    lr_mean_recall = lr_scores.mean()
    rf_mean_recall = rf_scores.mean()
    knn_mean_recall = knn_scores.mean()
    svm_mean_recall = svm_scores.mean()
    ann_mean_recall = ann_scores.mean()
    results.append({'Method': 'SMOTE Class 1 Recall',
                    'Logistic Regression': lr_mean_recall,
                    'Random Forest': rf_mean_recall,
                    'K-Nearest Neighbours': knn_mean_recall,
                    'Support Vector Machines': svm_mean_recall,
                    'Artificial Neural Networks': ann_mean_recall})
    
    df = pd.DataFrame(results)
    return df

results = RandomSamplingSMOTE(df, 'Class')
print(results)

Class Distribution after Undersampling Majority Class:
Class Distribution:

        %  count
0  99.56  50000
1   0.44    223
Class Distribution after Oversampling Minority Class using SMOTE:
Class Distribution:

        %  count
0  50.00  50000
1  50.00  50000
LR CV Completed
RF CV Completed
KNN CV Completed
SVM CV Completed
ANN CV Completed
                 Method  Logistic Regression  Random Forest  \
0  SMOTE Class 1 Recall               0.9483         1.0000   

   K-Nearest Neighbours  Support Vector Machines  Artificial Neural Networks  
0                1.0000                   0.9778                      0.9999  
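
Recall values this close to 1 may be optimistic: when oversampling happens before cross-validation, synthetic points derived from a minority sample can end up in both the training and the test folds. A minimal sketch of the alternative, resampling only inside each training fold, is shown below. It uses synthetic data from `make_classification` and plain random duplication as a stand-in for SMOTE; both are assumptions made for illustration. The same pattern is available with SMOTE through `imblearn.pipeline.Pipeline`, which applies samplers only while fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the fraud data (assumption: roughly 5% positives).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

rng = np.random.default_rng(42)
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_recalls = []

for train_idx, test_idx in k_fold.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]

    # Oversample the minority class in the training fold only
    # (random duplication here, standing in for SMOTE/ADASYN).
    minority = np.flatnonzero(y_train == 1)
    majority = np.flatnonzero(y_train == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    balanced = np.concatenate([majority, minority, extra])

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[balanced], y_train[balanced])

    # The test fold keeps its original, imbalanced distribution.
    y_hat = model.predict(X[test_idx])
    fold_recalls.append(recall_score(y[test_idx], y_hat, pos_label=1))

mean_recall = float(np.mean(fold_recalls))
print(f"mean class-1 recall on untouched test folds: {mean_recall:.4f}")
```

Scores obtained this way are measured on real, unmodified samples only, which makes them directly comparable with the baseline recall.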


In [18]:
# Performs Undersampling of Majority Class followed by Oversampling of Minority class using ADASYN and tests on all 5 Algorithms
def RandomSamplingADASYN(df, target_name, k=10):

    # Separate features and targets
    X = df.drop(target_name, axis=1)
    y = df[target_name]
    results = []

    # UnderSample Majority Class
    rus = RandomUnderSampler(sampling_strategy={0: 50000, 1: 223}, random_state=42)
    X, y = rus.fit_resample(X, y)
    print("Class Distribution after Undersampling Majority Class:")
    data_summary(y)

    # OverSample using ADASYN
    adasyn = ADASYN(random_state=42)
    X, y = adasyn.fit_resample(X, y)
    print("Class Distribution after Oversampling Minority Class using SMOTE:")
    data_summary(y)

    # Initialize the five classifiers
    lr_classifier = LogisticRegression()
    rf_classifier = RandomForestClassifier()
    knn_classifier = KNeighborsClassifier()
    svm_classifier = SVC()
    ann_classifier = MLPClassifier(max_iter=1000)

    # Initialize stratified k-fold cross-validation (k = 10 folds by default)
    # k = 10 is chosen to strike a balance between the number of minority-class samples in the train and test folds
    k_fold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

    # Define a recall scorer specifically focusing on the minority class
    recall_scorer = make_scorer(recall_score, pos_label=1)

    lr_scores = cross_val_score(lr_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("LR CV Completed")
    rf_scores = cross_val_score(rf_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("RF CV Completed")
    knn_scores = cross_val_score(knn_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("KNN CV Completed")
    svm_scores = cross_val_score(svm_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("SVM CV Completed")
    ann_scores = cross_val_score(ann_classifier, X, y, cv=k_fold, scoring=recall_scorer)
    print("ANN CV Completed")
    lr_mean_recall = lr_scores.mean()
    rf_mean_recall = rf_scores.mean()
    knn_mean_recall = knn_scores.mean()
    svm_mean_recall = svm_scores.mean()
    ann_mean_recall = ann_scores.mean()
    results.append({'Method': 'ADASYN Class 1 Recall',
                    'Logistic Regression': lr_mean_recall,
                    'Random Forest': rf_mean_recall,
                    'K-Nearest Neighbours': knn_mean_recall,
                    'Support Vector Machines': svm_mean_recall,
                    'Artificial Neural Networks': ann_mean_recall})
    
    df = pd.DataFrame(results)
    return df

results = RandomSamplingADASYN(df, 'Class')
print(results)

Class Distribution after Undersampling Majority Class:
Class Distribution:

        %  count
0  99.56  50000
1   0.44    223
Class Distribution after Oversampling Minority Class using SMOTE:
Class Distribution:

        %  count
0  50.01  50000
1  49.99  49985
LR CV Completed
RF CV Completed
KNN CV Completed
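
The baseline, SMOTE and ADASYN functions share the same evaluation steps, so they could be factored into one helper. The sketch below is one possible shape for it; `evaluate_all` is a hypothetical name, and the demo uses a small synthetic dataset rather than the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def evaluate_all(X, y, method_name, k=10):
    """Hypothetical helper: mean class-1 recall per classifier under stratified k-fold CV."""
    classifiers = {
        'Logistic Regression': LogisticRegression(),
        'Random Forest': RandomForestClassifier(),
        'K-Nearest Neighbours': KNeighborsClassifier(),
        'Support Vector Machines': SVC(),
        'Artificial Neural Networks': MLPClassifier(max_iter=1000),
    }
    k_fold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    recall_scorer = make_scorer(recall_score, pos_label=1)

    row = {'Method': method_name}
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=k_fold, scoring=recall_scorer)
        row[name] = scores.mean()
    return row


# Small synthetic demo (assumption: 20% positives) to show the call shape.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)
demo_row = evaluate_all(X_demo, y_demo, 'Demo Class 1 Recall', k=3)
print(demo_row)
```

Each resampling strategy would then only need to prepare `X` and `y` and pass them to the helper, and the resulting rows can be collected into a single `pd.DataFrame` for comparison.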


In [None]:
# Plot a Number of Folds vs Accuracy graph for Classification Dataset
def plot_model_accuracy_graph(df):
    plt.figure(figsize=(7, 4))
    
    # Plotting line for Logistic Regression Accuracy
    sns.lineplot(x=df['k'], y=df['Logistic Regression Accuracy'], label='Logistic Regression Accuracy', marker='o')

    # Plotting line for Random Forest Accuracy
    sns.lineplot(x=df['k'], y=df['Random Forest Accuracy'], label='Random Forest Accuracy', marker='o')

    plt.title('Model Accuracy by Number of Folds')
    plt.xlabel('Number of Folds (k)')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)
    plt.show()

In [None]:
# Plot a Number of Folds vs R2 Score graph for Regression Dataset
def plot_model_r2_graph(df):
    plt.figure(figsize=(7, 4))
    
    # Plotting line for Linear Regression R2 Score
    sns.lineplot(x=df['k'], y=df['Linear Regression R2 Score'], label='Linear Regression R2 Score', marker='o')

    # Plotting line for Random Forest R2 Score
    sns.lineplot(x=df['k'], y=df['Random Forest R2 Score'], label='Random Forest R2 Score', marker='o')

    plt.title('Model R2 Score by Number of Folds')
    plt.xlabel('Number of Folds (k)')
    plt.ylabel('R2 Score')
    plt.legend()
    plt.grid(True)
    plt.show()

In [None]:
# Plot a Number of Folds vs MSE Score graph for Regression Dataset
def plot_model_mse_graph(df):
    plt.figure(figsize=(7, 4))
    
    # Plotting line for Linear Regression MSE Score
    sns.lineplot(x=df['k'], y=df['Linear Regression MSE Score'], label='Linear Regression MSE Score', marker='o')

    # Plotting line for Random Forest MSE Score
    sns.lineplot(x=df['k'], y=df['Random Forest MSE Score'], label='Random Forest MSE Score', marker='o')

    plt.title('Model MSE Score by Number of Folds')
    plt.xlabel('Number of Folds (k)')
    plt.ylabel('MSE Score')
    plt.legend()
    plt.grid(True)
    plt.show()

In [None]:
# Loading all Datasets into the required variables
c_cancer, c_mice_expression, c_adult_income, r_life_expectancy, r_appartment_rent, r_song_popularity = load_datasets()

# Classification Datasets

## Dataset 1 : Cancer Detection Dataset (Classification)

In [None]:
c_cancer
# Checking for Null Values
null_check(c_cancer)
# No Null Values present hence Encoding Categorical Data to Numerical
c_cancer = transform_data(c_cancer)
# The target column is 'diagnosis', hence applying Logistic Regression with and without CV.
c_cancer_results = LogisticRegressionCV(c_cancer, 'diagnosis', [0,5,10,20,50,100])
# Displaying the graph of No. of Folds vs Accuracy
plot_model_accuracy_graph(c_cancer_results)

### Interpretation:
As the graph above shows, using CV yielded better results for Logistic Regression but made little difference for Random Forest, as the algorithm is far more complex.
One point of concern is that Logistic Regression has a better accuracy than Random Forest. This may mean that the Logistic Regression model is overfitting, or, equally plausibly, that the problem is better modelled by Logistic Regression than by Random Forest.

## Dataset 2 : Mice Protein Expression Dataset (Classification)

In [None]:
c_mice_expression
# Checking for Null Values
null_check(c_mice_expression)
# Null Values present hence Removing the data
clean_data(c_mice_expression,[])
# Encoding Categorical Data to Numerical
c_mice_expression = transform_data(c_mice_expression)
# The target column is 'class', hence applying Logistic Regression with and without CV.
c_mice_expression_results = LogisticRegressionCV(c_mice_expression, 'class', [0,5,10,20,50,75])
# Displaying the graph of No. of Folds vs Accuracy
plot_model_accuracy_graph(c_mice_expression_results)

### Interpretation:
The graph above shows the impact of using CV on the dataset. For Logistic Regression, accuracy was highest without CV; with CV it first dropped and then improved as the number of folds increased. One possible explanation is that the data has 7 target classes and approximately 500 rows after data cleaning. With a small number of folds, each model is trained on a smaller share of the data, hence the reduced accuracy. As the number of folds increases, the share of data available for training in each split grows, and accuracy increases with it. For Random Forest, the model was overfitting on the training data without CV, hence the reduced accuracy; as CV helps models avoid overfitting, the test accuracy increased as the number of folds increased.
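
The arithmetic behind this explanation can be sketched as follows, assuming roughly 500 rows after cleaning as stated above. The fold counts are the ones passed to the CV call for this dataset (excluding 0, which means no CV).

```python
n_rows = 500  # approximate row count after cleaning, as stated above

# Map each fold count k to (training rows, test rows) for a single split.
sizes = {}
for k in [5, 10, 20, 50, 75]:
    test_size = n_rows // k
    train_size = n_rows - test_size
    sizes[k] = (train_size, test_size)
    print(f"k={k:>2}: train on {train_size} rows, test on {test_size} rows")
```

Going from 5 to 75 folds raises the training share from 400 to 494 rows, but shrinks each test fold to about 6 rows, fewer than the number of target classes, so the per-fold accuracy becomes much noisier.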

## Dataset 3 : Adult Income Dataset (Classification)

In [None]:
c_adult_income
# Checking for Null Values
null_check(c_adult_income)
# Null Values present hence Removing the data
clean_data(c_adult_income,[])
# Encoding Categorical Data to Numerical
c_adult_income = transform_data(c_adult_income)
# The target column is 'income', hence applying Logistic Regression with and without CV.
c_adult_income_results = LogisticRegressionCV(c_adult_income, 'income', [0,5,10,20,50,100])
# Displaying the graph of No. of Folds vs Accuracy
plot_model_accuracy_graph(c_adult_income_results)

### Interpretation:
The graph above shows the impact of using CV on the dataset. For Logistic Regression, the accuracy was somewhat constant because the dataset is fairly large, so the model did not overfit even without cross-validation. For Random Forest, the initial rise in accuracy followed by a decrease can be explained by the algorithm overfitting initially; once CV was used, the generalisability of the model improved and hence accuracy decreased.

# Regression Datasets

## Dataset 4 : Life Expectancy Dataset (Regression)

In [None]:
r_life_expectancy
# Checking for Null Values
null_check(r_life_expectancy)
# Null Values present hence Removing the data
clean_data(r_life_expectancy,[])
# Encoding Categorical Data to Numerical
r_life_expectancy = transform_data(r_life_expectancy)
# The target column is 'Life expectancy ', hence applying Linear Regression with and without CV.
r_life_expectancy_results = LinearRegressionCV(r_life_expectancy, 'Life expectancy ', [0,5,10,20,50,100])
# Displaying the graph of No. of Folds vs R2
plot_model_r2_graph(r_life_expectancy_results)
# Displaying the graph of No. of Folds vs MSE
plot_model_mse_graph(r_life_expectancy_results)

### Interpretation:
The graphs above show the impact of using CV on the dataset. For Linear Regression, the R2 Score increased and then started decreasing; a possible cause is the limited size of the dataset. The MSE, on the other hand, decreased minutely and then remained constant. For Random Forest, the R2 Score increased but then started decreasing, supporting the view that the dataset is small for a large number of folds. The MSE for Random Forest decreased and then became constant.
For reference, a better fit has an R2 Score closer to 1 and an MSE closer to 0.
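
For readers unfamiliar with the two metrics, the sketch below computes them by hand on made-up values (the arrays are illustrative only and unrelated to the datasets above).

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# MSE: mean of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE = {mse:.3f}")
print(f"R2  = {r2:.3f}")
```

Note that MSE is expressed in squared units of the target, so it is only comparable between models evaluated on the same target column, whereas R2 is unitless.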

## Dataset 5 : Apartment Rent Estimation (Regression)

In [None]:
r_appartment_rent
# Checking for Null Values
null_check(r_appartment_rent)
# No Null Values present hence Encoding Categorical Data to Numerical
r_appartment_rent = transform_data(r_appartment_rent)
# The target column is 'price', hence applying Linear Regression with and without CV.
r_appartment_rent_results = LinearRegressionCV(r_appartment_rent, 'price', [0,5,10,20,50,100])
# Displaying the graph of No. of Folds vs R2
plot_model_r2_graph(r_appartment_rent_results)
# Displaying the graph of No. of Folds vs MSE
plot_model_mse_graph(r_appartment_rent_results)

### Interpretation:
The graphs above show the impact of using CV on the dataset. For Linear Regression, the R2 Score decreased drastically and is quite low, illustrating that the algorithm is not a good fit for the problem. Moreover, the drastic rise in MSE supports this hypothesis and indicates that CV does avoid overfitting. For Random Forest, the R2 Score increased and the MSE also rose drastically, showing that the model did overfit without CV.

## Dataset 6 : Song Popularity Estimation (Regression)

In [None]:
r_song_popularity
# Checking for Null Values
null_check(r_song_popularity)
# No Null Values present hence Encoding Categorical Data to Numerical
r_song_popularity = transform_data(r_song_popularity)
# The target column is 'song_popularity', hence applying Linear Regression with and without CV.
r_song_popularity_results = LinearRegressionCV(r_song_popularity, 'song_popularity', [0,5,10,20,50,100])
# Displaying the graph of No. of Folds vs R2
plot_model_r2_graph(r_song_popularity_results)
# Displaying the graph of No. of Folds vs MSE
plot_model_mse_graph(r_song_popularity_results)

### Interpretation:
The graphs above show the impact of using CV on the dataset. For Linear Regression, neither the R2 Score nor the MSE varied much, and the R2 Score stayed low, illustrating that the algorithm is not a good fit for the problem. For Random Forest, the R2 Score increased and the MSE decreased when CV was used, showing that the model fit better with CV.

### Summary:
To summarize the above results, the following can be concluded about CV:
1. CV has many benefits, such as helping to avoid overfitting and improving model fit.
2. In the above interpretations, CV mostly contributed to better model fit and sometimes to higher model accuracy.
3. The case where CV performed poorly was where the dataset was small.
4. As a general observation, the number of folds should not be too high, as CV is time and compute intensive. Increasing the number of folds also does not always help the model fit better.