# Artificial Intelligence for Business

This Jupyter notebook performs Exploratory Data Analysis (EDA) on one of the six synthetic tabular datasets in the Bank Account Fraud (BAF) suite of datasets. The BAF datasets were published at NeurIPS 2022 and are intended to provide a realistic, complete, and robust test bed to evaluate novel and existing methods in machine learning (ML) and fair ML.

# Project Description

The objective of this notebook is to provide an overview of the dataset and prepare it for training and evaluating ML models. The notebook is structured as follows:
- [1. Load the dataset](#1.-Load-the-dataset)
- [2. Explore the dataset](#2.-Explore-the-dataset)

The BAF suite of datasets comprises a total of 6 different synthetic bank account fraud tabular datasets. The datasets are realistic, based on a present-day real-world dataset for fraud detection, and each dataset has distinct controlled types of bias. Additionally, the datasets have an imbalanced setting with an extremely low prevalence of positive class, contain temporal data and observed distribution shifts, and have privacy-preserving features to protect the identity of potential applicants.

In this notebook, we will be exploring one of the datasets in the BAF suite, the Base.csv dataset.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from tabulate import tabulate

from sklearn.utils import shuffle
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, f1_score, accuracy_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Data Understanding and Exploration

### Loading and describing the dataset
We start by loading the dataset into a Pandas DataFrame and displaying its first few rows using the head() function. We then display some basic statistics of the dataset using the describe() function.

In [None]:
try:
    # Read in the data
    df = pd.read_csv('../dataset/Base.csv', header=0)
except FileNotFoundError:
    print("Error: File not found.")
    raise  # re-raise so later cells do not run with an undefined df
except pd.errors.EmptyDataError:
    print("Error: Empty DataFrame.")
    raise
except pd.errors.ParserError:
    print("Error: Parsing error occurred.")
    raise
except Exception as e:
    print(f"An error occurred: {str(e)}")
    raise

df.head()

In [None]:
df.describe()

### Boxplot
We then use the Seaborn library to create boxplots of the numerical columns in the dataset. Boxplots are used to visualize the distribution and outliers of each numerical column.

In [None]:
# get the list of numerical columns
num_cols = df.select_dtypes(include=['float', 'int']).columns.tolist()

# create a grid of subplots using seaborn
n_cols = 3  # number of columns in the grid
n_rows = (len(num_cols) + n_cols - 1) // n_cols  # number of rows in the grid
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(25, 5*n_rows))

# loop through the columns and create a boxplot for each one
for i, col in enumerate(num_cols):
    row_idx = i // n_cols  # row index for this subplot
    col_idx = i % n_cols  # column index for this subplot
    ax = sns.boxplot(data=df[col], ax=axes[row_idx, col_idx])
    ax.set_title(col)

The boxplots reveal one column, device_fraud_count, that holds the same value in every row. A constant column carries no information for analysis or modeling, so we remove it from the dataset before proceeding.

In [None]:
column_to_remove = 'device_fraud_count'

# drop the column from the dataframe
df = df.drop(column_to_remove, axis=1)
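A constant column can also be found programmatically rather than by reading the boxplots. The sketch below is a minimal illustration on a toy dictionary of columns (hypothetical values, not the BAF data); `constant_columns` is a helper name introduced here for illustration only.

```python
# Toy stand-in for a table: column name -> list of values (hypothetical data).
toy_table = {
    "device_fraud_count": [0, 0, 0, 0],
    "income": [0.1, 0.9, 0.5, 0.9],
    "email_is_free": [1, 0, 1, 1],
}


def constant_columns(table):
    """Return the names of columns whose values are all identical."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


constant = constant_columns(toy_table)
print(constant)
```

On a pandas DataFrame, the same check can be expressed by looking for columns where `df.nunique()` equals 1.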

### Histograms
We can also create histograms of the numerical columns to see the distribution of each feature. This can help us identify skewed features that may need to be transformed to bring them closer to a normal distribution.

In [None]:
df.hist(bins=20, figsize=(25, 20))
plt.show()

Judging by the histograms, some columns appear to be approximately normally distributed: zip_count_4w, velocity_6h, velocity_24h, date_of_birth_distinct_emails_4w, and credit_risk_score.

### Count Plot
Finally, we can create a count plot to visualize the distribution of the target variable (fraud). This can help us identify the class imbalance in the dataset.

In [None]:
# Create a count plot of the target variable
sns.countplot(x='fraud_bool', data=df)
plt.title('Target Variable Distribution')
plt.show()
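The imbalance shown by the count plot can also be summarized as two numbers: the prevalence of the positive class and the number of negatives per positive. The sketch below uses made-up labels (not the real BAF proportions) to show the arithmetic.

```python
from collections import Counter

# Hypothetical labels: 1 = fraud, 0 = legitimate (not the real BAF proportions).
labels = [0] * 97 + [1] * 3

counts = Counter(labels)
prevalence = counts[1] / len(labels)       # share of fraud cases
imbalance_ratio = counts[0] / counts[1]    # legitimate cases per fraud case

print(f"fraud prevalence: {prevalence:.1%}, negatives per positive: {imbalance_ratio:.1f}")
```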

### Division of Variables
The variables have been grouped as follows:
- Target variable: Variable of interest in the project
- Continuous variables: These variables represent quantitative measurements
- Categorical variables: These variables represent a finite set of possible values
- Binary variables: These variables have two distinct values, normally representing a yes/no condition.

In [None]:
continuous_cols = ['income', 'name_email_similarity', 'prev_address_months_count', 'current_address_months_count', 'customer_age', 'days_since_request', 'intended_balcon_amount', 'zip_count_4w', 'velocity_6h', 'velocity_24h', 'velocity_4w', 'bank_branch_count_8w', 'date_of_birth_distinct_emails_4w', 'credit_risk_score', 'bank_months_count', 'proposed_credit_limit', 'session_length_in_minutes', 'device_distinct_emails_8w']
binary_cols = ['fraud_bool', 'email_is_free', 'phone_home_valid', 'phone_mobile_valid', 'has_other_cards', 'foreign_request', 'keep_alive_session']
discrete_cols = ['payment_type', 'employment_status', 'housing_status', 'source', 'device_os', 'month']
normal_distribution_cols = ['zip_count_4w', 'velocity_6h', 'velocity_24h', 'date_of_birth_distinct_emails_4w', 'credit_risk_score']
target_col = 'fraud_bool'

# Exploratory Data Analysis (EDA)

### Analyzing Outliers
Outliers can take many different forms in a dataset. In some cases, outliers may be extreme values that fall far outside the expected range of the data, while in other cases, outliers may appear as discontinuities or gaps in the data.

In this particular dataset, it has been observed that there are some columns - customer_age, days_since_request, intended_balcon_amount, and proposed_credit_limit - that contain discontinuous points, based on the boxplots. These points may represent missing data or errors in data collection, or they may be indicative of some other pattern in the data.

In [None]:
columns_to_analyze = ['customer_age', 'days_since_request', 'intended_balcon_amount', 'proposed_credit_limit', 'device_distinct_emails_8w']

# create a grid of subplots using seaborn
n_cols = 2  # number of columns in the grid
n_rows = (len(columns_to_analyze) + n_cols - 1) // n_cols  # number of rows in the grid
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(25, 5*n_rows))

# loop through the columns and create a scatter plot of each one against the target
for i, col in enumerate(columns_to_analyze):
    row_idx = i // n_cols  # row index for this subplot
    col_idx = i % n_cols  # column index for this subplot
    ax = sns.scatterplot(x=col, y=target_col, data=df, ax=axes[row_idx, col_idx])
    ax.set_title(col)

Upon examining the selected columns, we did not find significant differences in the values between the fraud and non-fraud categories. This suggests that outliers in these variables are not particularly informative or indicative of fraud. Instead, they may be a result of random variations or noise in the data.

Considering these findings, it may not be necessary to remove outliers from these variables to improve the accuracy of our analysis or model.
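For reference, the points a boxplot draws as outliers are usually those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to a small made-up sample; `iqr_outliers` is a hypothetical helper, and quartile conventions differ slightly between libraries, so the fences may not match a plotting library exactly.

```python
import statistics


def iqr_outliers(values, whisker=1.5):
    """Return the lower fence, upper fence, and the values outside them."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]


# Made-up sample with one extreme value.
lower, upper, outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(lower, upper, outliers)
```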

### Correlation Matrix
We can calculate the correlation matrix of the continuous features in the dataset to see how they are related to each other. This helps us spot strongly correlated, and therefore potentially redundant, features; the relationship between each feature and the target is examined in the sections that follow.

In [None]:
# Calculate the correlation matrix
corr_matrix = df[continuous_cols].corr()

# Create a heatmap of the correlation matrix
fig, ax = plt.subplots(figsize=(25, 30))
ax = sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', annot_kws={'size': 10})
plt.title('Correlation Matrix')
plt.show()

TODO: Add more details about the correlation matrix

## Correlation between continuous variables and our target (binary) variable

### Visualizing the distributions between fraud and non-fraud categories

In [None]:
# Set the number of columns and rows in the grid
num_cols = 3
num_rows = (len(continuous_cols) + num_cols - 1) // num_cols

# Create a grid of subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10))

# Flatten the axes array
axes = axes.flatten()

# Create box plots for each continuous variable
for i, col in enumerate(continuous_cols):
    ax = axes[i]
    sns.boxplot(x=target_col, y=col, data=df, ax=ax)
    ax.set_title(f"{col} by {target_col}")

# Remove any empty subplots
if len(continuous_cols) < len(axes):
    for j in range(len(continuous_cols), len(axes)):
        fig.delaxes(axes[j])

# Adjust the layout
fig.tight_layout()

# Show the plot
plt.show()

We can see slightly different distributions for the income, date_of_birth_distinct_emails_4w, and proposed_credit_limit columns between the fraud and non-fraud categories, which suggests that these three variables may be useful for predicting fraud. The distributions of the other variables are similar between the two categories, so those variables may be less informative.

#### T-test

The t-test is commonly used to compare the means of two groups or conditions. It is suitable for situations where we have a continuous outcome variable and a binary variable, our target variable.

The t-test assumes the continuous outcome variable is normally distributed and that the variance of the two groups is equal. Therefore, we perform Levene's test to check the equality of variances between the groups.

In [None]:
print('Levene\'s Test')

# Split the data into two groups based on the binary variable
for col in normal_distribution_cols:
    group_1 = df[df[target_col] == 0][col]
    group_2 = df[df[target_col] == 1][col]

    statistic, p_value = stats.levene(group_1, group_2)

    # Print the test results for this column
    print(f"{col}: Test Statistic = {statistic:.4f}, P-value = {p_value:.4f}")

If the p-value from Levene's test is below the chosen significance level (e.g., 0.05), it indicates a statistically significant difference in variances between the groups. This would violate the assumption of equal variances required for the t-test, and thus we would need to explore alternative tests.

Looking at the Levene's test results, we see that the difference in variances between the two groups is statistically significant.

We will therefore use the Mann-Whitney U test, which does not assume equal variances.

### Mann-Whitney U Test

The Mann-Whitney U test is a non-parametric test that does not rely on the assumption of equal variances. It compares the distributions of two independent groups based on the ranks of the observations.

This test is appropriate when the data do not meet the assumptions required for the T-test, such as when the data are non-normally distributed or the variances are unequal.

In [None]:
effect_sizes = []
results_mannwhitneyu = []

for col in continuous_cols:
    group_1 = df[df[target_col] == 0][col]
    group_2 = df[df[target_col] == 1][col]

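    stat, p_value_mannwhitneyu = stats.mannwhitneyu(group_1, group_2)
    results_mannwhitneyu.append((col, stat, p_value_mannwhitneyu))

    # Effect size: U / (n1 * n2), usually called the common-language effect size
    # (stored under the label "Cohen's U3" used throughout this notebook)
    u3 = stat / (len(group_1) * len(group_2))
    effect_sizes.append((col, u3, p_value_mannwhitneyu))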

df_mannwhitneyu = pd.DataFrame(results_mannwhitneyu, columns=['Variable', 'Mann-Whitney U', 'P-value (Mann-Whitney U)'])
effect_sizes_df = pd.DataFrame(effect_sizes, columns=['Variable', "Cohen's U3", 'P-value (Mann-Whitney U)'])

In [None]:
df_mannwhitneyu.plot(x='Variable', y='P-value (Mann-Whitney U)', kind='bar')
plt.title('Mann-Whitney U')
plt.ylabel('P-value')
plt.show()

In [None]:
print(df_mannwhitneyu)

The resulting p-value from the test represents the probability of observing a U statistic as extreme as the one calculated, assuming that the null hypothesis is true (i.e., the distributions of the two groups are identical). A small p-value indicates strong evidence against the null hypothesis and suggests that there is a significant difference between the two groups (fraud and non-fraud).

We will now visualize the magnitude of the differences between the fraud and non-fraud categories for each variable.

In [None]:
variable_names = effect_sizes_df['Variable']
cohens_u3 = effect_sizes_df["Cohen's U3"]
p_values = effect_sizes_df['P-value (Mann-Whitney U)']

bar_width = 0.35

x_indices = np.arange(len(variable_names))

fig, ax = plt.subplots(figsize=(10, 6))
bar1 = ax.bar(x_indices - bar_width / 2, p_values, bar_width, label='P-value')
bar2 = ax.bar(x_indices + bar_width / 2, cohens_u3, bar_width, label="Cohen's U3")

# Center each tick between the two bars of its group
ax.set_xticks(x_indices)
ax.set_xticklabels(variable_names, rotation=90)

ax.set_ylabel('Value')
ax.legend()

plt.tight_layout()
plt.show()

In [None]:
print(effect_sizes_df)

The effect size computed above is U / (n1 * n2), for which 0.5 means the two groups are indistinguishable. A value of 0.6, which we see for date_of_birth_distinct_emails_4w, indicates a noticeable difference between the two groups. The p-values are so close to 0.0 that they are not visible in the plot, which means the evidence for a difference between the two groups is very strong.
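To make the quantity U / (n1 * n2) from the cell above concrete, the sketch below computes a U statistic by brute force on two tiny made-up samples. It assumes U is reported for the first sample, counting 1 for every pair where the first-sample value is larger and 0.5 for ties; `mann_whitney_u` is a helper written for this illustration, not SciPy's implementation.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Made-up values for one feature, split by class.
non_fraud = [1, 2, 2, 3, 4]
fraud = [2, 4, 5, 6]

u = mann_whitney_u(non_fraud, fraud)
effect = u / (len(non_fraud) * len(fraud))
print(u, effect)
```

Read this way, the ratio estimates the probability that a randomly chosen value from the first group exceeds a randomly chosen value from the second, so values far from 0.5 in either direction indicate separation between the groups.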

### Point-Biserial Correlation

TODO

In [None]:
results_correlation = []

for col in continuous_cols:
    correlation, p_value_correlation = stats.pointbiserialr(df[col], df[target_col])
    results_correlation.append((col, correlation, p_value_correlation))

df_correlation = pd.DataFrame(results_correlation, columns=['Variable', 'Correlation', 'P-value (Correlation)'])

In [None]:
df_correlation.plot(x='Variable', y='Correlation', kind='bar')
plt.title('Point-Biserial Correlation: Correlations')
plt.ylabel('Correlation')
plt.show()

In [None]:
print(df_correlation)
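The point-biserial correlation is the Pearson correlation between a continuous variable and a 0/1 variable. As a rough sketch on made-up numbers, the closed-form expression (difference of group means, scaled by the overall standard deviation and the class proportions) can be checked against a hand-written Pearson correlation; both helper functions are illustrative and not part of the notebook.

```python
import math
import statistics


def point_biserial(values, labels):
    """(M1 - M0) / s * sqrt(p * q), with s the population standard deviation."""
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    p = len(ones) / len(values)
    q = len(zeros) / len(values)
    spread = statistics.pstdev(values)
    return (statistics.fmean(ones) - statistics.fmean(zeros)) / spread * math.sqrt(p * q)


def pearson(xs, ys):
    """Plain Pearson correlation, computed from population moments."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))


values = [1, 2, 3, 4, 5, 6]   # made-up continuous feature
labels = [0, 0, 0, 1, 1, 1]   # made-up binary target

r_pb = point_biserial(values, labels)
r_pearson = pearson(values, labels)
print(round(r_pb, 4), round(r_pearson, 4))
```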

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 6))

# Boxplot 1: days_since_request for non-fraud cases
sns.boxplot(x=target_col, y='days_since_request', data=df[df[target_col] == 0][[target_col, 'days_since_request']], ax=axes[0])
axes[0].set_title('days_since_request - non-fraud')
axes[0].set_xlabel(target_col)
axes[0].set_ylabel('days_since_request')

# Boxplot 2: days_since_request for fraud cases
sns.boxplot(x=target_col, y='days_since_request', data=df[df[target_col] == 1][[target_col, 'days_since_request']], ax=axes[1])
axes[1].set_title('days_since_request - fraud')
axes[1].set_xlabel(target_col)
axes[1].set_ylabel('days_since_request')

plt.tight_layout()
plt.show()

### Correlation between categorical variables and our target (binary) variable

### Chi-square test
We want to examine how the categorical features in the dataset are related to the target variable. This can give us some insight into which features are most important for predicting fraud.

(Note: chosen significance level (alpha) = 0.05)

In [None]:
col_names = ["Variable", "Chi-square Statistic", "Degrees of freedom", "P-value"]

data = []

for col in discrete_cols:
    contingency_table = pd.crosstab(df[target_col], df[col])

    chi2_stat, p_val, dof, expected = stats.chi2_contingency(contingency_table)

    data.append([col, chi2_stat, dof, p_val])

print(tabulate(data, headers=col_names, tablefmt="fancy_grid"))

In this table, we have 4 columns:
- **Variable** <br>
Name of the variable being analysed.

- **Chi-square Statistic** <br>
Measures the overall discrepancy between the observed frequencies and the expected frequencies under the assumption of independence. A larger chi-square statistic suggests a greater difference between the observed and expected frequencies.

- **Degrees of freedom** <br>
For a contingency table, the degrees of freedom are (rows - 1) x (columns - 1). Because the target is binary, this equals the number of categories of the variable minus 1. The degrees of freedom determine the distribution of the chi-square statistic and therefore its critical values, which are used to assess the statistical significance of the test.

- **P-value** <br>
Indicates the statistical significance of the association. If the p-value is below the chosen significance level (alpha = 0.05), it suggests a significant association.

The results show that every variable analysed has a p-value below the chosen significance level (alpha = 0.05). A p-value reported as zero (that is, too small to display) means that the observed association is highly unlikely to be due to random chance, so there is a significant relationship between each variable and the target. Because the p-values do not discriminate between the variables, we can only draw conclusions from the chi-square statistic and the degrees of freedom of each variable. The chi-square statistic tends to grow with the degrees of freedom, so the two should be read together. From the table, we can see that some variables, such as "device_os" and "housing_status", have a stronger relation with the target variable.
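As a worked illustration of the statistic and the degrees of freedom, the sketch below computes both by hand for a made-up 2 x 3 contingency table (rows: target classes, columns: categories). It is a simplified version of the calculation and omits the continuity correction that SciPy applies by default to 2 x 2 tables; `chi_square` is a helper name introduced here.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


# Made-up counts: first row non-fraud, second row fraud.
table = [[30, 50, 20],
         [10, 5, 5]]
chi2_stat, dof = chi_square(table)
print(round(chi2_stat, 4), dof)
```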

## Feature Engineering and Selection

In [None]:
# Placeholders for the dataframes we will make available throughout the notebook

df_oversampled = -1
df_undersampled = -1
df_all_features = -1
df_without_most_correlated_features = -1
df_without_least_correlated_features = -1

dataframes = [df_oversampled, df_undersampled, df_all_features, df_without_most_correlated_features, df_without_least_correlated_features]

### Data Encoding
To encode our categorical variables, we reviewed the most commonly used encoding methods. One-Hot Encoding creates one new column per category, which can lead to a large number of columns when a variable has many unique values (high cardinality). Since we want to keep the number of columns small, One-Hot Encoding is not the best choice here. We therefore decided to use Ordinal Encoding, which assigns a unique integer to each category.

In [None]:
dataset = df.copy()

columns_to_encode = ['payment_type', 'employment_status', 'housing_status', 'source', 'device_os']

encoder = OrdinalEncoder()

dataset[columns_to_encode] = encoder.fit_transform(dataset[columns_to_encode])

#print(dataset)
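The idea behind ordinal encoding can be sketched in a few lines: collect the distinct categories, sort them, and replace each value with its position. The category values below are hypothetical and `fit_ordinal` is an illustrative helper, not the scikit-learn implementation.

```python
def fit_ordinal(values):
    """Map each distinct category to an integer, in sorted order."""
    categories = sorted(set(values))
    return {category: index for index, category in enumerate(categories)}


# Hypothetical category values for one column.
payment_type = ["AB", "AA", "AC", "AA"]

mapping = fit_ordinal(payment_type)
encoded = [mapping[value] for value in payment_type]
print(mapping, encoded)
```

One design consequence worth keeping in mind is that the integer codes imply an order between categories, which some models may interpret as meaningful even when the categories have no natural ranking.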

### Selecting relevant features
In this step we select the features that have the most impact on our target variable ("fraud_bool"). 

TODO - Add more details about the feature selection

### Imbalanced Dataset
Dealing with imbalanced datasets is an important aspect of machine learning. In this section, we will explore some techniques for dealing with imbalanced datasets.



In [None]:
def plot_confusion_matrix(report, cm, title):

    plt.figure(figsize=(8, 6))
    plt.text(1, 3.2, str(report), fontsize=12, ha='center')
    sns.heatmap(cm, annot=True, fmt='d', cmap='Reds')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.suptitle(title)
    plt.show()

training_data = dataset.sample(frac=0.1, random_state=1)

# Separate the features and the target variable
X = training_data.drop('fraud_bool', axis=1)
y = training_data['fraud_bool']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1, stratify=y)

# Shuffle the training data

X_train, y_train = shuffle(X_train, y_train, random_state=1)

# Train a Random Forest classifier on the original data
clf_original = RandomForestClassifier(random_state=1)
clf_original.fit(X_train, y_train)


# Evaluate the original model on the test set
y_pred_original = clf_original.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_original)

report = classification_report(y_test, y_pred_original, zero_division=0)
# Plot the confusion matrix
plot_confusion_matrix(report, cm, 'Confusion Matrix - Original Data')

The original dataset without any techniques applied to deal with the imbalanced dataset results in a model that is biased towards predicting non-fraud cases. As we can see from the confusion matrix, the model is not able to predict any fraud cases correctly. The f1-score is 0.00, indicating poor overall performance in identifying fraudulent cases.
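The gap between overall accuracy and minority-class performance can be seen directly from a confusion matrix. The sketch below uses made-up counts laid out as `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predicted labels) for a model that never predicts fraud; `metrics_from_confusion` is a helper written for this illustration.

```python
def metrics_from_confusion(cm):
    """Accuracy, precision, recall and F1 for the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Undefined ratios are reported as 0.0, mirroring zero_division=0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 990 legitimate and 10 fraud cases, all predicted as legitimate.
scores = metrics_from_confusion([[990, 0], [10, 0]])
print(scores)
```

With these counts the accuracy is 99% even though no fraud case is detected, which is why the minority-class recall and F1-score are the more informative numbers here.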

In [None]:

# Apply Random Under-sampling to balance the data
rus = RandomUnderSampler(random_state=42)
X_resampled_rus, y_resampled_rus = rus.fit_resample(X_train, y_train)


# Train a Random Forest classifier on the balanced data using Random Under-sampling
clf_rus = RandomForestClassifier(random_state=42)
clf_rus.fit(X_resampled_rus, y_resampled_rus)

# Evaluate the Random Under-sampling model on the test set
y_pred_rus = clf_rus.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_rus)

report = classification_report(y_test, y_pred_rus, zero_division=0)
# Plot the confusion matrix
plot_confusion_matrix(report, cm, 'Confusion Matrix - Random Under-sampling')

Applying an under-sampling technique, we can see that the model is now able to predict some fraud cases correctly. The recall for the minority class improves significantly (0.75), indicating that the model captures a higher proportion of fraudulent cases. The f1-score improves to 0.08, which is better than before but still not good enough.
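The principle of random under-sampling is to keep every minority-class row and only a random subset of the majority class of the same size. The sketch below shows that idea on made-up data; it assumes label 1 is the minority class and is not the imbalanced-learn implementation.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Made-up data: 100 rows, of which the first 5 are fraud.
rows = list(range(100))
labels = [1 if i < 5 else 0 for i in rows]

X_small, y_small = random_undersample(rows, labels)
print(len(y_small), sum(y_small))
```

The price of the balanced result is that most majority-class rows are discarded, which may help explain why recall rises while precision stays low.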

In [None]:
# Apply Random Over-sampling to balance the data
ros = RandomOverSampler(random_state=42)
X_resampled_ros, y_resampled_ros = ros.fit_resample(X_train, y_train)

# Train a Random Forest classifier on the balanced data using Random Over-sampling
clf_ros = RandomForestClassifier(random_state=42)
clf_ros.fit(X_resampled_ros, y_resampled_ros)

# Evaluate the Random Over-sampling model on the test set
y_pred_ros = clf_ros.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_ros)

report = classification_report(y_test, y_pred_ros, zero_division=0)

# Plot the confusion matrix
plot_confusion_matrix(report, cm, 'Confusion Matrix - Random Over-sampling')

Applying an over-sampling technique, we can see that the model doesn't improve much. The recall for the minority class is 0.01 and the f1-score is 0.03, only a small improvement over the original dataset.

In [None]:
# Apply SMOTE to balance the data
smote = SMOTE(random_state=42)
X_resampled_smote, y_resampled_smote = smote.fit_resample(X_train, y_train)

# Train a Random Forest classifier on the balanced data using SMOTE
clf_smote = RandomForestClassifier(random_state=42)
clf_smote.fit(X_resampled_smote, y_resampled_smote)

# Evaluate the SMOTE model on the test set
y_pred_smote = clf_smote.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_smote)

report = classification_report(y_test, y_pred_smote, zero_division=0)

# Plot the confusion matrix
plot_confusion_matrix(report, cm, 'Confusion Matrix - SMOTE')


Concerning the SMOTE technique, we noticed only a slight improvement over the original dataset.
The precision, recall, and F1-score for the minority class are still quite low compared to the majority class.
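Unlike random over-sampling, which duplicates existing rows, SMOTE creates new minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step on two made-up points; the neighbour search is left out and `smote_like_point` is a hypothetical helper, not the imbalanced-learn implementation.

```python
import random


def smote_like_point(sample, neighbour, rng):
    """Return a point on the segment between sample and neighbour."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(42)

# Two made-up minority-class points with two features each.
fraud_a = [1.0, 10.0]
fraud_b = [3.0, 14.0]

synthetic = smote_like_point(fraud_a, fraud_b, rng)
print(synthetic)
```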

## Main Takeaways from analysis

TODO: Add more details about the main takeaways from analysis and preprocessing phase

Overall, the results confirm that the original dataset suffers from severe class imbalance, leading to poor performance in identifying fraudulent cases. While the applied techniques (under-sampling, over-sampling, and SMOTE) show some improvements in identifying fraudulent cases compared to the original dataset, the performance remains limited. Further analysis and experimentation may be required to develop a more effective model for fraud detection in this dataset.

In [None]:
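# Compute the false positive rate (FPR), true positive rate (TPR), and thresholds
# from the predicted fraud probabilities (hard 0/1 predictions give a single operating point)
fpr_undersampling, tpr_undersampling, thresholds = roc_curve(y_test, clf_rus.predict_proba(X_test)[:, 1])
fpr_smote, tpr_smote, thresholds = roc_curve(y_test, clf_smote.predict_proba(X_test)[:, 1])
fpr_oversampling, tpr_oversampling, thresholds = roc_curve(y_test, clf_ros.predict_proba(X_test)[:, 1])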

auc_undersampling = auc(fpr_undersampling, tpr_undersampling)
auc_smote = auc(fpr_smote, tpr_smote)
auc_oversampling = auc(fpr_oversampling, tpr_oversampling)

# Plot the ROC curve
plt.figure(figsize=(5,5), dpi=100)

plt.plot(fpr_undersampling, tpr_undersampling, color='red', lw=2, label='ROC curve - Undersampling (area = %0.2f)' % auc_undersampling)
plt.plot(fpr_smote, tpr_smote, color='green', lw=2, label='ROC curve - SMOTE (area = %0.2f)' % auc_smote)
plt.plot(fpr_oversampling, tpr_oversampling, color='blue', lw=2, label='ROC curve - Oversampling (area = %0.2f)' % auc_oversampling)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.legend()

plt.show()
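A ROC curve is traced by sweeping a decision threshold over continuous scores, and the area under it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half). The sketch below, on made-up scores, shows how much information is lost when the scores are first thresholded into 0/1 labels; `pairwise_auc` is a helper written for this illustration.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up predicted fraud probabilities.
fraud_scores = [0.3, 0.4, 0.9]
legit_scores = [0.1, 0.2, 0.45]

auc_from_scores = pairwise_auc(fraud_scores, legit_scores)

# The same predictions after applying a 0.5 threshold.
threshold = 0.5
auc_from_labels = pairwise_auc(
    [int(s >= threshold) for s in fraud_scores],
    [int(s >= threshold) for s in legit_scores],
)
print(auc_from_scores, auc_from_labels)
```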

## Statistical Modeling

### Choosing modeling techniques
The modeling techniques chosen are: 
- Naive Bayes
- Decision Tree
- k-Nearest Neighbours (KNN)
- Logistic Regression
- Support Vector Machines (SVM)
- Random Forest

### Splitting the dataset into training and testing sets

In [None]:
X = dataset.drop('fraud_bool', axis=1)  # Features
y = dataset['fraud_bool']  # Labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Applying models and evaluating performance
In this section, we apply the models chosen and evaluate their performance with appropriate metrics.

The metrics chosen are the following:
- Accuracy

#### Naive Bayes
Naive Bayes is a classification algorithm based on Bayes' theorem. It assumes that features are independent and calculates the probability of an instance belonging to a class. It's computationally efficient, works well with high-dimensional data, and performs best when the independence assumption holds.

In [None]:
# Instantiate and train the Naive Bayes classifier
model = GaussianNB()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred, zero_division=0)

precision = precision_score(y_test, y_pred, zero_division=0)

recall = recall_score(y_test, y_pred, zero_division=0)

print('Accuracy:', accuracy)
print('F1 Score:', f1)
print('Precision:', precision)
print('Recall:', recall)

#### Decision Tree
Decision Trees are classification algorithms that create a tree-like model of decisions. Each internal node represents a feature, and each leaf node represents a class label. They split the data based on feature values to create homogeneous subsets. When making predictions, a new instance traverses the tree to a leaf node, and the corresponding class label is assigned. Decision Trees are interpretable and handle categorical and numerical data, capturing complex decision boundaries.

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

#### k-Nearest Neighbours (KNN) 
k-Nearest Neighbors (KNN) is a classification algorithm that predicts the class of an instance based on its k nearest neighbors in the feature space. It assumes that instances with similar features tend to belong to the same class. During training, KNN stores the feature vectors and their corresponding class labels. When making predictions, it finds the k nearest neighbors to the target instance and assigns the majority class among those neighbors as the predicted class. KNN is a simple and versatile algorithm that can handle both linear and non-linear classification tasks, making it useful in various applications.

In [None]:
model = KNeighborsClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
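The k-nearest-neighbours rule described above can be written out in a few lines. The sketch below uses made-up two-dimensional points, Euclidean distance, and a majority vote among the k closest points; `knn_predict` is an illustrative helper, not the scikit-learn implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: three points of class 0 and two of class 1.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = [0, 0, 0, 1, 1]

pred_near_origin = knn_predict(points, labels, (0.05, 0.05))
pred_near_one = knn_predict(points, labels, (1.0, 1.05))
print(pred_near_origin, pred_near_one)
```

Because the prediction depends on raw distances, features measured on larger scales dominate the vote unless the features are scaled first.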

#### Logistic Regression
Logistic Regression is a classification algorithm that predicts the probability of an instance belonging to a specific class. It applies a sigmoid function to a linear combination of the input features to produce a probability between 0 and 1. By fitting a decision boundary during training, Logistic Regression separates the classes. When making predictions, it calculates the probability of an instance belonging to the positive class and applies a threshold for classification. Logistic Regression is a simple and effective algorithm for binary classification tasks; it learns a linear decision boundary, and non-linear patterns can be handled by transforming the features appropriately beforehand.

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
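The prediction step of logistic regression can be sketched as a weighted sum passed through a sigmoid and then compared with a threshold. The weights, bias, and feature values below are made up for illustration and are not taken from the fitted model.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability of the positive class for one instance."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients and one hypothetical instance.
weights = [2.0, -1.0]
bias = -0.5
features = [1.0, 0.5]

probability = predict_probability(weights, bias, features)
predicted_class = int(probability >= 0.5)
print(round(probability, 4), predicted_class)
```

On an imbalanced problem such as fraud detection, the 0.5 threshold is only a default; a different threshold trades precision against recall.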

#### Support Vector Machines (SVM)
Support Vector Machines (SVM) are powerful classifiers that can handle both linear and non-linear classification tasks. They work by finding an optimal hyperplane that maximally separates the data points of different classes. SVMs also offer various kernels (e.g., linear, polynomial, radial basis function) to capture complex relationships between the features.

In [None]:
model = SVC()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

#### Random Forest 
Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. Each decision tree is trained on a random subset of the data, and the final prediction is determined by aggregating the predictions of individual trees. Random Forests are effective in handling complex datasets, capturing non-linear relationships, and mitigating overfitting.

In [None]:
model = RandomForestClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

## Models Evaluation and Validation

### Models performance on testing data

### Fine-tuning

### Applying cross-validation techniques

## Interpretation and Insights