# ML mini project 1: Sonar Data

In this project, we are given a small data set containing sonar scans and their corresponding labels. The task is to build a model to predict those labels.

## Setup: Installing required modules

Throughout this project we will use pandas, scikit-learn and numpy (which is installed automatically because pandas depends on it). The cell below installs these modules.

In [None]:
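import subprocess
import sys

# Requires Python >=3.9 (the built-in generic dict[...] annotation used later needs it)
# Version constraints for the packages used in this notebook
requirements = [
    'numpy>=1.21.0,<2.0',
    'pandas>=1.3.0,<2.0',
    'scikit-learn>=1.0.0,<2.0',
]

# Install into the same interpreter that runs this notebook
subprocess.check_call([sys.executable, '-m', 'pip', 'install', *requirements])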


## Understanding the data

Let's first load the data into the script and see the labels we have.

In [None]:
import pandas as pd

full_data: pd.DataFrame = pd.read_csv('./sonar.csv', header=None)

print("Labels:", str(full_data[60].unique()))
full_data


In [None]:
# Use a float sparse dtype: casting to int would truncate the fractional part of every reading
print("Proportion of non-zero data:", str(full_data.iloc[:, :60].astype(pd.SparseDtype("float", 0.0)).sparse.density))


We now know that the data set contains a total of 207 sonar scans. Each scan consists of 60 readings (floating point numbers) and a corresponding label. Since this is a binary classification problem, Logistic Regression is a reasonable first choice. In addition, because the readings are not very sparse, L1 regularization might not be the best fit, so we keep scikit-learn's default L2 penalty.
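
The choice between L1 and L2 regularization can also be checked empirically instead of being inferred from sparsity alone. The following is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the sonar readings (the 207 x 60 shape mirrors the data set, but the scores say nothing about the real data). It compares both penalties with 5-fold cross-validation using the `liblinear` solver, which supports both.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the sonar readings (assumption: 207 rows, 60 features)
X, y = make_classification(n_samples=207, n_features=60, n_informative=10, random_state=0)

scores = {}
for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, solver="liblinear", random_state=0)
    # Mean accuracy over 5 stratified folds
    scores[penalty] = cross_val_score(clf, X, y, cv=5).mean()

for penalty, score in scores.items():
    print(f"{penalty}: mean CV accuracy = {score:.3f}")
```

On the real data, the same loop could be run on the training split only, so that the test set stays untouched for the final evaluation.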

## Formatting the Data and Training the Model

The first step in training is to separate the readings from the labels.

In [None]:
import numpy as np

label_map: dict[str, np.uint8] = {'R': np.uint8(0), 'M': np.uint8(1)}

data: pd.DataFrame = full_data.drop(columns=[60])
labels: pd.Series = full_data[60].map(label_map)

del full_data, label_map

data.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()
print()
labels.unique()
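
One detail of `Series.map` is worth keeping in mind here: any value that is missing from the mapping dictionary becomes `NaN` rather than raising an error. A small sketch with a hypothetical label column (the value `'X'` is made up for illustration and does not occur in the sonar data as far as this notebook shows):

```python
import pandas as pd

# Hypothetical label column containing an unexpected value 'X'
raw = pd.Series(['R', 'M', 'X'])
mapped = raw.map({'R': 0, 'M': 1})

# Values missing from the mapping become NaN
unmapped = int(mapped.isna().sum())
print(unmapped)
```

Checking that the mapped labels contain no `NaN` is a cheap way to confirm that every label was covered by the mapping.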


Next, we split the data into training and test sets, stratified by label so that both sets keep the same class proportions.

In [None]:
from sklearn.model_selection import train_test_split

# Defining a random seed for reproducible output
SEED: int = 11062024

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.3, stratify=labels, random_state=SEED)

del data, labels

# Checking data integrity
print(train_labels.value_counts(normalize=True), end='\n\n')
print(test_labels.value_counts(normalize=True))
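
The effect of `stratify` can be seen in isolation on a toy example. This is a minimal sketch, assuming a hypothetical label array with a 70/30 class ratio and placeholder features; it only illustrates that both subsets keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 of class 0, 30 of class 1
y = np.array([0] * 70 + [1] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

_, _, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Share of class 1 in each subset
train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```

Without `stratify`, the two shares would generally differ from each other, which matters more the smaller the data set is.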


The data is now properly formatted, so we can proceed to model training. As stated before, the model of choice is Logistic Regression for this scenario.

In [None]:
from sklearn.linear_model import LogisticRegression

model: LogisticRegression = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(train_data, train_labels)

model


And, just like that, the model is ready. Next we will test the fitted model.

## Testing

### Training Data Performance

In [None]:
from sklearn.metrics import classification_report

train_prediction: np.ndarray = model.predict(train_data)

print(classification_report(train_labels, train_prediction, target_names=['R', 'M']))


The model performed well on the training data, with most precision, recall, and F1-score values exceeding 80%. The exception is recall, which differs noticeably between class `R` (75%) and class `M` (87%). The model therefore produces more false negatives for class `R`: about a quarter of the true `R` samples are predicted as `M`. This may suggest that the model is biased toward predicting class `M`.
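
The link between recall and false negatives can be made concrete with a confusion matrix. The sketch below uses hypothetical labels (0 = `R`, 1 = `M`), not the project's actual predictions, and computes per-class recall by hand before cross-checking it against `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels (0 = R, 1 = M), not the project's actual predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Recall per class = correct predictions for the class / all true samples of the class
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(recall_per_class)

# Cross-check against scikit-learn's own per-class recall
print(recall_score(y_true, y_pred, average=None))
```

In this toy case one of the four true `R` samples is predicted as `M`, so recall for `R` is 0.75 while recall for `M` is 1.0.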

### Testing Data Performance

In [None]:
test_prediction: np.ndarray = model.predict(test_data)

print(classification_report(test_labels, test_prediction, target_names=['R', 'M']))


The model displayed moderate performance on the test data, achieving an overall accuracy of 76% and balanced metrics for both classes, with precision, recall, and F1-scores all around 76%. The metrics for class `R` and class `M` are relatively close, but recall differs slightly: 72% for class `R` compared to 79% for class `M`.

This indicates that the model has a marginally higher rate of false negatives for class `R`, so slightly more samples of this class are missed. Given the small size of the test set (30% of the 207 scans), these variations might be influenced by sampling noise.
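
To get a feel for how much sampling noise a test set of this size allows, the accuracy can be treated as a binomial proportion. This is a rough sketch, assuming a test set of 63 samples (about 30% of 207; the exact count is an assumption) and a normal approximation, not a rigorous analysis.

```python
import math

accuracy = 0.76   # test accuracy reported above
n_test = 63       # assumption: about 30% of 207 samples

# Standard error of a proportion under a binomial model
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% interval (normal approximation)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans roughly ten percentage points in each direction, so the difference in recall between the two classes is well within what chance alone could produce.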