# Introduction

This notebook provides you with an opportunity to demonstrate proficiency in meeting course learning goals by applying a support vector machine (SVM) to a classification problem, using widely used ML libraries and a standard ML workflow.


# Mine Detection (Revisited)

In this notebook, you will revisit [a previously seen classification problem](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons) and see if you can build a better classification model that predicts whether a sonar signature comes from a mine or a rock.

<div class="alert alert-block alert-warning">
<b>Tip:</b> We suggest reviewing your Notebook 4: Classification with Perceptrons.
</div>

We'll use a version of the [sonar data set](https://www.openml.org/search?type=data&sort=runs&id=40&status=active) by Gorman and Sejnowski. Take a moment now to [reacquaint yourself with the subject matter of this data set](https://datahub.io/machine-learning/sonar%23resource-sonar), and look at the details of the version of this data set, [Mines vs Rocks, hosted on Kaggle](https://www.kaggle.com/datasets/mattcarter865/mines-vs-rocks).

Similar to [a previous notebook](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons), this notebook expects each student to implement the ML workflow steps. We will get you started by providing the first step, loading the data, and providing some landmarks and tips below. Your process should demonstrate:

1. Loading the data
2. Exploring the data
3. Preprocessing the data
4. Preparing the training and test sets
5. Creating and configuring a sklearn.svm.SVC
6. Training the SVM
7. Validating and Testing the SVM
8. Demonstrating making predictions
9. Evaluating (and improving) the results

Can you train a classifier that can predict whether a sonar signature is from a mine or a rock? "Three trained human subjects were each tested on 100 signals, chosen at random from the set of 208 returns used to create this data set. Their responses ranged between 88% and 97% correct." Can your classifier outperform the human subjects?

Most importantly, how does the performance of the SVM classifier compare to the perceptron results observed in [Notebook 4](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons)?



## Step 1: Load the Data

The first step is to load the Mines vs Rocks dataset into a pandas DataFrame. The dataset is provided as a CSV file without a header row. By loading the CSV file into a DataFrame, we can easily explore the data, preprocess it, and split it into training and test sets.

We start by importing the pandas library, which provides data manipulation and analysis tools.
Next, we define the path to the CSV file containing the sonar data. Adjust the file path if needed.
Using the `pd.read_csv()` function from pandas, we read the CSV file into a DataFrame called `sonar_data`. We set the `header` parameter to `None` to indicate that the CSV file has no header row, so the first line is treated as data rather than as column names.


In [None]:
import pandas as pd

sonar_csv_path = "../input/mines-vs-rocks/sonar.all-data.csv"
sonar_data = pd.read_csv(sonar_csv_path, header=None)

After running this code, we will have the sonar data loaded into the `sonar_data` DataFrame, which we can use for further analysis and preprocessing. 
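
To illustrate what `header=None` does, here is a minimal sketch that reads a tiny, made-up CSV from memory instead of the real file. The `csv_text` contents and the `demo` name are hypothetical stand-ins, not values from the sonar data set.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for a headerless CSV: three features plus a label.
csv_text = "0.02,0.04,0.43,R\n0.05,0.08,0.21,M\n"

# header=None tells pandas that the first line is data, not column names.
demo = pd.read_csv(io.StringIO(csv_text), header=None)

print(demo.shape)          # (2, 4)
print(list(demo.columns))  # [0, 1, 2, 3]
```

Without `header=None`, pandas would treat the first row as column names and the DataFrame would contain one fewer record.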

## Step 2: Explore the Data

Once the data has been loaded into the DataFrame, we can start exploring it to understand its structure and characteristics. 

First, let's check the dimensions of the DataFrame using the `shape` attribute, which returns the number of rows and columns. Next, we inspect the column labels using the `columns` attribute, which returns an index of the column labels; because the file was loaded with `header=None`, pandas assigns integer labels rather than descriptive names. To get a feel for the data, we examine a few sample records using the `head()` method, which displays the first five rows of the DataFrame by default.

Let's begin:


In [None]:
print("Data shape:", sonar_data.shape)

print("Column names:", sonar_data.columns)

print("Sample records:")
print(sonar_data.head())

Running the code above prints three pieces of information:

Data shape: The number of rows and columns in the DataFrame, which gives an overview of the dataset's size.

Column names: The labels of the columns in the DataFrame. Every column except the last holds one feature of the sonar signal, and the last column holds the class label.

Sample records: The first few records in the DataFrame, which let us observe the actual data values.

Together, these outputs describe the structure of the dataset, which we need to understand before preprocessing it.
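
Beyond shape and sample rows, it is often useful to check the class balance and look for missing values. The sketch below shows one way to do this on a small, made-up DataFrame; `toy` and its values are hypothetical and only mimic the layout of `sonar_data` (features first, label in the last column).

```python
import pandas as pd

# Hypothetical miniature frame: two feature columns and a label column in last position.
toy = pd.DataFrame(
    {
        0: [0.02, 0.05, 0.03, 0.07],
        1: [0.40, 0.21, 0.33, 0.18],
        2: ["R", "M", "M", "R"],
    }
)

# Count how many samples belong to each class (the label is the last column).
label_counts = toy.iloc[:, -1].value_counts()
print(label_counts)

# Total number of missing values across the whole frame.
missing_total = int(toy.isnull().sum().sum())
print("Missing values:", missing_total)

# Summary statistics (count, mean, std, min, quartiles, max) for the feature columns.
summary = toy.iloc[:, :-1].describe()
print(summary)
```

The same three calls should work unchanged on `sonar_data`, assuming it has the layout described above.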

## Step 3: Preprocess the Data

Data preprocessing typically involves handling missing values, encoding categorical variables, and scaling or normalizing features where required. For this dataset, the steps applied below are separating the features from the class labels and standardizing the features.

In the code below, we first separate the features `(X)` and class labels `(y)` from the loaded DataFrame. The features are obtained by selecting all columns except the last one, and the class labels are obtained from the last column.

Next, we use the StandardScaler class from scikit-learn to perform z-score normalization on the features. The `fit_transform()` method of the scaler fits the scaler on the data and then applies the transformation to standardize the feature values.

Lastly, we create a new DataFrame `preprocessed_data`, which stores the preprocessed features. This DataFrame will have the same column names as the original data excluding the last column, which contains the class labels. Printing the `preprocessed_data.head()` will show the first few records of the preprocessed data.

In [None]:
from sklearn.preprocessing import StandardScaler

X = sonar_data.iloc[:, :-1] 
y = sonar_data.iloc[:, -1]  

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

preprocessed_data = pd.DataFrame(X_scaled, columns=sonar_data.columns[:-1])
print(preprocessed_data.head())

After running the code, we obtain a DataFrame with the preprocessed features. Z-score normalization rescales each feature to a mean of 0 and a standard deviation of 1. This step benefits algorithms that are sensitive to feature scale, including SVMs, because features with large ranges would otherwise dominate the distance computations.
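
As a quick check of what z-score normalization computes, the sketch below compares `StandardScaler` with the formula `(x - mean) / std` applied column by column. The random matrix `X_demo` is a stand-in with the same shape as the sonar features, not the real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix with the same shape as the sonar features (208 x 60).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(208, 60))

# Standardize with scikit-learn.
X_std = StandardScaler().fit_transform(X_demo)

# Standardize by hand: subtract each column's mean, divide by its
# population standard deviation (NumPy's default, ddof=0).
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print("Matches manual z-score:", np.allclose(X_std, X_manual))
print("Largest absolute column mean:", np.abs(X_std.mean(axis=0)).max())
```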

## Step 4: Prepare the Training and Test Data Sets

In this step, we will split the preprocessed data into training and test sets using the train_test_split function from scikit-learn. The purpose of this step is to have separate data for training the SVM classifier and evaluating its performance on unseen data. 

In the code below, we use the train_test_split function to split the preprocessed features `(X_scaled)` and the class labels `(y)` into training and test sets. We specify the test size as 0.2, where 20% of the data will be allocated for testing, and the remaining 80% will be used for training. The random_state parameter is set to 42 for reproducibility.

The function returns four sets, `X_train`, `X_test`, `y_train`, and `y_test`, representing the training and test sets for the features and class labels. After splitting the data, we print the shapes of the training and test sets to verify the split. The shapes give the number of samples and the number of features in each set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

By default, the `train_test_split` function randomly shuffles the data before splitting it, which helps make the training and test sets representative of the overall dataset. The training set has a shape of (166, 60), where 166 is the number of samples and 60 is the number of features. Similarly, the test set has a shape of (42, 60), indicating 42 samples and 60 features. These shapes confirm that the split worked as intended, so the classifier will be trained and evaluated on the correct subsets of the data.
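
One possible refinement, not used in this notebook, is to stratify the split so both sets keep the same class proportions, and to fit the scaler on the training set only so that no test-set statistics leak into preprocessing. The sketch below uses random stand-in data; `X_raw`, `y_demo`, and the 111/97 class counts are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 208 samples, 60 features, and an assumed 111/97 class split.
rng = np.random.default_rng(0)
X_raw = rng.uniform(size=(208, 60))
y_demo = np.array(["M"] * 111 + ["R"] * 97)

# stratify=y_demo keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training data only, then apply it to both subsets.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print("Training set shape:", X_tr_s.shape)
print("Test set shape:", X_te_s.shape)
print("Mines in test set:", int((y_te == "M").sum()))
```

With this ordering, the test features are scaled using the training means and standard deviations, so their column means will generally be close to, but not exactly, zero.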

## Step 5: Instantiate and Configure an SVM

In this step, we instantiate and configure an SVM classifier using the `SVC` class from the `sklearn.svm` module. According to the documentation, `SVC` exposes several parameters that affect its behavior, such as the kernel type (`kernel`) and the regularization parameter (`C`). The code below creates a classifier with an RBF kernel and `C=1.0`.

In [None]:
from sklearn.svm import SVC

svm_classifier = SVC(kernel='rbf', C=1.0, random_state=42)

After creating the classifier, we print it to review its configuration. Note that `kernel='rbf'` and `C=1.0` are also the default values for `SVC`, so this configuration is close to the default one.

Based on the documentation of the `SVC` class, we can also adjust specific parameters after instantiation using the `set_params` method.

The cell below also fits the classifier to the training data using the `fit` method. Training is discussed in more detail in Step 6.

In [None]:
print("Default SVM classifier configuration:")
print(svm_classifier)

svm_classifier.fit(X_train, y_train)
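
As a small illustration of inspecting and changing the configuration, the sketch below uses `get_params` and `set_params` on a fresh `SVC`. The name `clf` and the new values `C=10.0` and `gamma=0.01` are arbitrary examples, not recommended settings.

```python
from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=1.0, random_state=42)

# get_params() returns a dictionary of every constructor parameter and its value.
params = clf.get_params()
print("kernel:", params["kernel"])
print("C:", params["C"])
print("gamma:", params["gamma"])

# set_params() changes parameters in place; the model must be refit afterwards.
clf.set_params(C=10.0, gamma=0.01)
print("Updated C:", clf.get_params()["C"])
```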

## Step 6: Train the SVM

In this step, we train the SVM classifier by fitting it to the training data with the `fit()` method. During fitting, the SVM learns a decision boundary from the labeled training examples. In the code below, `X_train` contains the preprocessed feature data and `y_train` contains the corresponding class labels.



In [None]:
svm_classifier.fit(X_train, y_train)

During training, the SVM classifier analyzes the training data and learns patterns that distinguish mines from rocks. The `random_state=42` argument sets a random seed, but in scikit-learn's `SVC` it only controls the shuffling used for probability estimates and is ignored when `probability=False` (the default), so it does not affect the fit here.

After running the code, the trained model is stored in `svm_classifier`. It holds the learned decision boundary, which can be used to make predictions on unseen data. Internally, the solver runs an iterative optimization to find the maximum-margin boundary, which may take many iterations over the training data.
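
To see what the fitted model actually stores, one can inspect its support vectors, the training samples that define the decision boundary. The sketch below trains on synthetic data from `make_classification` as a stand-in for the sonar features; the attribute names `n_support_` and `support_vectors_` are part of scikit-learn's `SVC`.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the sonar data (208 samples, 60 features).
X_demo, y_demo = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=0
)

clf = SVC(kernel="rbf", C=1.0).fit(X_demo, y_demo)

# Number of support vectors per class, and the support vectors themselves.
print("Support vectors per class:", clf.n_support_)
print("Support vector matrix shape:", clf.support_vectors_.shape)
```

A large share of training samples ending up as support vectors can be a sign that the boundary is complex relative to the amount of data.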

## Step 7: Validate and Test the SVM

After training the SVM classifier, we need to evaluate its performance on both the training and test sets. This will help us assess how well the classifier has learned the patterns in the training data and how well it can generalize to unseen data.

We use the `score()` method of the SVM classifier to calculate its accuracy on the training set. The method takes the preprocessed training features (`X_train`) and the corresponding class labels (`y_train`) as inputs and returns the fraction of samples classified correctly. In the code below, this value is stored in the `training_accuracy` variable and then printed.

In [None]:
training_accuracy = svm_classifier.score(X_train, y_train)
print("Training Accuracy:", training_accuracy)

After running the code above, we obtain a training accuracy of approximately 98.2%. The training accuracy is the proportion of correctly predicted labels in the training set, and it indicates how closely the SVM classifier fits the training data.

Next, we'll do the same for the test accuracy below. Run the code to see the test accuracy.

In [None]:
test_accuracy = svm_classifier.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

After running this code, we have both the training accuracy and the test accuracy of the SVM classifier. The training accuracy indicates how well the classifier has learned from the training data, while the test accuracy represents its performance on unseen data. In our run, the test accuracy is approximately 90.5%, which corresponds to about 38 of the 42 test samples being classified correctly.
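
Because a single 42-sample test set gives a fairly noisy estimate, cross-validation is a common complement to the check above. The following is a minimal sketch on synthetic stand-in data, assuming a pipeline that refits the scaler inside each fold; `model`, `cv`, and the choice of 5 folds are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the sonar data.
X_demo, y_demo = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=0
)

# The pipeline scales and classifies, so scaling is learned from each training fold only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Five stratified folds, shuffled with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())
```

The spread across folds gives a rough sense of how much the accuracy depends on which samples land in the held-out portion.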



## Step 8: Demonstrate Making Predictions

In this step, we use the trained SVM classifier to make predictions on individual samples and inspect its output.

For this demonstration, we select the first 5 samples from the training set using `X_train[:5]`. Note that these samples were seen during training, so they illustrate how `predict()` works rather than how the classifier handles unseen data; samples from `X_test` would serve that purpose. We then use the `predict()` method of `svm_classifier`, which takes the features as input and returns the predicted class labels.

Finally, we print the predicted class labels in a loop to show the classification output for each sample.

In [None]:
sample_features = X_train[:5]

predictions = svm_classifier.predict(sample_features)

print("Predicted labels:")
for prediction in predictions:
    print(prediction)

After running this code, we obtain the predicted class label for each selected sample. Comparing these predictions with the corresponding true labels in `y_train.iloc[:5]` shows whether each sample was classified correctly.
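
To demonstrate predictions on samples the model has not seen, the sketch below predicts on held-out samples and compares each prediction with its true label, then summarizes all test predictions in a confusion matrix. It runs on synthetic stand-in data; the variable names and the 0/1 labels are assumptions, whereas the sonar labels are `M` and `R`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the sonar data; labels are 0 and 1.
X_demo, y_demo = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)

# Predict the first five held-out samples and compare with the true labels.
preds = clf.predict(X_te[:5])
for predicted, actual in zip(preds, y_te[:5]):
    print(f"predicted={predicted} actual={actual} correct={predicted == actual}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1])
print(cm)
```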

## Step 9: Evaluate (and Improve?)

The SVM classifier was configured with the following parameters: `kernel='rbf'`, `C=1.0`, and `random_state=42`. It achieved a training accuracy of approximately 98.2% and a test accuracy of around 90.5%.

The classifier performs well overall, with high accuracy on both the training and test sets. However, the training accuracy is higher than the test accuracy, which suggests some overfitting: the classifier fits the training data more closely than it generalizes to new, unseen data. Because the test set contains only 42 samples, each misclassified sample changes the test accuracy by about 2.4 percentage points, so the size of this gap should be interpreted with caution.

Compared to the perceptron classifier from Notebook 4, the SVM classifier performs better. The perceptron achieved an accuracy of 79%, while the SVM classifier achieved around 90.5% on the test set, so the SVM outperforms the perceptron in terms of accuracy on this specific problem.

## Conclusion
We loaded the Mines vs Rocks dataset and performed the following steps required by this notebook:

We first explored the data by examining its dimensions, column names, and sample records. Next, we preprocessed the data by separating the features and class labels and then applying z-score normalization using `StandardScaler`.
Then we prepared the training and test sets using the `train_test_split` function.
We instantiated and configured an SVM classifier using the `SVC` class from scikit-learn and trained it by fitting it to the training data. We also evaluated the classifier's accuracy on the training and test sets. Finally, we demonstrated making predictions with the trained SVM classifier.

The SVM classifier could potentially be improved further through hyperparameter tuning, feature selection or engineering, and ensemble methods.

#### Three things that we can incorporate are:

Hyperparameter Tuning: We can experiment with different values for the hyperparameters of the SVM classifier, such as `C` and the kernel type.

Feature Selection or Engineering: We can analyze the dataset and select or engineer informative features that have a stronger correlation with the target variable.

Ensemble Methods: We can combine multiple SVM classifiers with different hyperparameters, or use other classifiers in conjunction with the SVM classifier, to create a more robust and accurate model.
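
As one possible starting point for hyperparameter tuning, the sketch below runs a small grid search over `C` and `gamma` with cross-validation. It uses synthetic stand-in data, and the grid values and step names (`scale`, `svc`) are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the sonar data.
X_demo, y_demo = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

On the real data, the search would be fit on the training set only, keeping the test set aside for a final check of the selected model.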

Future work could involve exploring additional preprocessing techniques, trying different classification algorithms, or gathering more data to further improve the classifier's performance. The SVM classifier shows great results, but there is always room for improvement. 
