# Introduction

This notebook provides you an opportunity to demonstrate proficiency in meeting course learning goals by applying a support vector machine to solve a classification problem using widely-used ML libraries and an ML workflow.


# Mine Detection (Revisited)

In this notebook, you will revisit [a previously seen classification problem](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons) and see if you can build a better classification model that predicts whether a sonar signature is from a mine or a rock.

<div class="alert alert-block alert-warning">
<b>Tip:</b> We suggest reviewing your Notebook 4: Classification with Perceptrons.
</div>

We'll use a version of the [sonar data set](https://www.openml.org/search?type=data&sort=runs&id=40&status=active) by Gorman and Sejnowski. Take a moment now to [reacquaint yourself with the subject matter of this data set](https://datahub.io/machine-learning/sonar%23resource-sonar), and look at the details of the version of this data set, [Mines vs Rocks, hosted on Kaggle](https://www.kaggle.com/datasets/mattcarter865/mines-vs-rocks).

Similar to [a previous notebook](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons), this notebook expects each student to implement the ML workflow steps. We will get you started by providing the first step, loading the data, and providing some landmarks and tips below. Your process should demonstrate:

1. Loading the data
2. Exploring the data
3. Preprocessing the data
4. Preparing the training and test sets
5. Creating and configuring a sklearn.svm.SVC
6. Training the SVM
7. Validating and Testing the SVM
8. Demonstrating making predictions
9. Evaluating (and improving) the results

Can you train a classifier that can predict whether a sonar signature is from a mine or a rock? "Three trained human subjects were each tested on 100 signals, chosen at random from the set of 208 returns used to create this data set. Their responses ranged between 88% and 97% correct." Can your classifier outperform the human subjects?

Most importantly, how does the performance of the SVM classifier compare to the perceptron results observed in [Notebook 4](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons)?



## Step 1: Load the Data

The notebook comes pre-bundled with the [Mines vs Rocks data set](https://www.kaggle.com/datasets/mattcarter865/mines-vs-rocks). Our first step is to create a pandas DataFrame from the CSV file. Note that the CSV file has no header row. Loading the CSV file into a DataFrame will make it easy for us to explore the data, preprocess it, and split it into training and test sets.


In [1]:
import pandas as pd

sonar_csv_path = "../input/mines-vs-rocks/sonar.all-data.csv"
sonar_data = pd.read_csv(sonar_csv_path, header=None)



We now have a pandas DataFrame encapsulating the sonar data, and can proceed with our data exploration.

## Step 2: Explore the Data

Let's take a look at the data using the `.head()` method. This will give us the dimensions, and a good look at the first five rows.

In [2]:
sonar_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,60
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R


We see we have 61 columns, which matches the documentation on the Kaggle page where the data came from. The first 60 columns are the features, and the last one holds the labels. The first five labels are all "R" for rock, but we expect some others to be "M" for mine.

Next, let's use the `.describe()` method to get some statistics on the features and see the range they occupy. We first set a pandas option so it displays all of the columns instead of hiding the middle ones.

In [3]:
pd.set_option('display.max_columns',None)
sonar_data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59
count,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0
mean,0.029164,0.038437,0.043832,0.053892,0.075202,0.10457,0.121747,0.134799,0.178003,0.208259,0.236013,0.250221,0.273305,0.296568,0.320201,0.378487,0.415983,0.452318,0.504812,0.563047,0.60906,0.624275,0.646975,0.672654,0.675424,0.699866,0.702155,0.694024,0.642074,0.580928,0.504475,0.43904,0.41722,0.403233,0.392571,0.384848,0.363807,0.339657,0.3258,0.311207,0.289252,0.278293,0.246542,0.214075,0.197232,0.160631,0.122453,0.091424,0.051929,0.020424,0.016069,0.01342,0.010709,0.010941,0.00929,0.008222,0.00782,0.007949,0.007941,0.006507
std,0.022991,0.03296,0.038428,0.046528,0.055552,0.059105,0.061788,0.085152,0.118387,0.134416,0.132705,0.140072,0.140962,0.164474,0.205427,0.23265,0.263677,0.261529,0.257988,0.262653,0.257818,0.255883,0.250175,0.239116,0.244926,0.237228,0.245657,0.237189,0.24025,0.220749,0.213992,0.213237,0.206513,0.231242,0.259132,0.264121,0.239912,0.212973,0.199075,0.178662,0.171111,0.168728,0.138993,0.133291,0.151628,0.133938,0.086953,0.062417,0.035954,0.013665,0.012008,0.009634,0.00706,0.007301,0.007088,0.005736,0.005785,0.00647,0.006181,0.005031
min,0.0015,0.0006,0.0015,0.0058,0.0067,0.0102,0.0033,0.0055,0.0075,0.0113,0.0289,0.0236,0.0184,0.0273,0.0031,0.0162,0.0349,0.0375,0.0494,0.0656,0.0512,0.0219,0.0563,0.0239,0.024,0.0921,0.0481,0.0284,0.0144,0.0613,0.0482,0.0404,0.0477,0.0212,0.0223,0.008,0.0351,0.0383,0.0371,0.0117,0.036,0.0056,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0008,0.0005,0.001,0.0006,0.0004,0.0003,0.0003,0.0001,0.0006
25%,0.01335,0.01645,0.01895,0.024375,0.03805,0.067025,0.0809,0.080425,0.097025,0.111275,0.12925,0.133475,0.166125,0.175175,0.164625,0.1963,0.20585,0.242075,0.299075,0.350625,0.399725,0.406925,0.450225,0.540725,0.5258,0.544175,0.5319,0.534775,0.4637,0.4114,0.34555,0.2814,0.257875,0.217575,0.179375,0.15435,0.1601,0.174275,0.173975,0.18645,0.1631,0.1589,0.1552,0.126875,0.094475,0.06855,0.06425,0.045125,0.02635,0.01155,0.008425,0.007275,0.005075,0.005375,0.00415,0.0044,0.0037,0.0036,0.003675,0.0031
50%,0.0228,0.0308,0.0343,0.04405,0.0625,0.09215,0.10695,0.1121,0.15225,0.1824,0.2248,0.24905,0.26395,0.2811,0.2817,0.3047,0.3084,0.3683,0.43495,0.5425,0.6177,0.6649,0.6997,0.6985,0.7211,0.7545,0.7456,0.7319,0.6808,0.60715,0.49035,0.4296,0.3912,0.35105,0.31275,0.32115,0.3063,0.3127,0.2835,0.27805,0.2595,0.2451,0.22255,0.1777,0.148,0.12135,0.10165,0.0781,0.0447,0.0179,0.0139,0.0114,0.00955,0.0093,0.0075,0.00685,0.00595,0.0058,0.0064,0.0053
75%,0.03555,0.04795,0.05795,0.0645,0.100275,0.134125,0.154,0.1696,0.233425,0.2687,0.30165,0.33125,0.35125,0.386175,0.452925,0.535725,0.659425,0.67905,0.7314,0.809325,0.816975,0.831975,0.848575,0.872175,0.873725,0.8938,0.9171,0.900275,0.852125,0.735175,0.64195,0.5803,0.556125,0.596125,0.59335,0.556525,0.5189,0.44055,0.4349,0.42435,0.387525,0.38425,0.324525,0.27175,0.23155,0.200375,0.154425,0.1201,0.068525,0.025275,0.020825,0.016725,0.0149,0.0145,0.0121,0.010575,0.010425,0.01035,0.010325,0.008525
max,0.1371,0.2339,0.3059,0.4264,0.401,0.3823,0.3729,0.459,0.6828,0.7106,0.7342,0.706,0.7131,0.997,1.0,0.9988,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.9657,0.9306,1.0,0.9647,1.0,1.0,0.9497,1.0,0.9857,0.9297,0.8995,0.8246,0.7733,0.7762,0.7034,0.7292,0.5522,0.3339,0.1981,0.0825,0.1004,0.0709,0.039,0.0352,0.0447,0.0394,0.0355,0.044,0.0364,0.0439


The first thing we notice about this data is that it is not normalized. The minimum and maximum values are different for each feature: feature 0, for example, never exceeds 0.1371, while many of the mid-range features reach 1.0. Features with larger values can then be treated as more important by the machine learning algorithm simply because of their scale. It is important to normalize the features later on.
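
To make the scale problem concrete, here is a small sketch using two standard deviations copied from the `describe()` table above (features 0 and 19). It assumes the RBF kernel's dependence on squared Euclidean distance, and compares how much a one-standard-deviation difference in each feature would contribute to that distance on the raw data. The variable names are illustrative only.

```python
# Standard deviations copied from the describe() table above.
std_feature_0 = 0.022991
std_feature_19 = 0.262653

# Contribution to the squared Euclidean distance when two samples differ
# by one standard deviation in a single feature.
contribution_0 = std_feature_0 ** 2
contribution_19 = std_feature_19 ** 2

# How many times larger the feature 19 contribution is on the raw scale.
ratio = contribution_19 / contribution_0
print(f"feature 19 contributes about {ratio:.0f}x more than feature 0")
```

After standardization, both contributions would be equal, which is the point of the scaling step in Step 3.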

Now that we have a good understanding of our features space, let's take another look at the labels and their distribution. We know that the labels are in column 60. We will use the `.value_counts()` method to see the distribution of values in the column.

In [4]:
label_distribution = sonar_data[60].value_counts()
print(label_distribution)
print("Total objects: " + str(len(sonar_data)))

M    111
R     97
Name: 60, dtype: int64
Total objects: 208


There are 111 data objects labeled "mine", and 97 data objects labeled "rock". This is not evenly distributed, so balancing the label weights during training might be advantageous. This also means there are 208 total data objects, which we confirm with the len() function above.
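
As a worked example of what "balancing the label weights" means, the sketch below applies the formula scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * count)`, to the counts printed above. Note the assumption: these are the counts for the full data set, whereas a model fitted on a training split would derive its weights from the training labels only.

```python
# Counts taken from the value_counts() output above (full data set).
label_counts = {"M": 111, "R": 97}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# Formula documented for class_weight='balanced':
# n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in label_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label}: {weight:.4f}")
# M: 0.9369
# R: 1.0722
```

The less frequent class ("R") gets a weight slightly above 1 and the more frequent class ("M") slightly below 1, so the imbalance here is mild.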

The next step is to make sure there are no null values I need to resolve. This is very simple with pandas.

In [5]:
sonar_data.isnull().values.any()

False

There are no null values, so it looks like everything is good to go. We are ready to start preprocessing the data.

## Step 3: Preprocess the Data

First we import `StandardScaler` from sklearn, which standardizes each feature to its z-score (z = (x - u) / s, where u is the mean and s is the standard deviation of that feature). Next, we split the features (X) from the labels (y) by slicing the pandas DataFrame. Then we initialize the scaler, fit it to the features and transform them in one call with `fit_transform`, and put the result back into a DataFrame so we can use the `describe()` method.

In [6]:
from sklearn.preprocessing import StandardScaler

X = sonar_data.iloc[:,:60]
y = sonar_data.iloc[:,60]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled)

X_scaled.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59
count,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0
mean,1.708035e-17,6.832142000000001e-17,-1.195625e-16,1.622634e-16,-1.793437e-16,2.049643e-16,1.024821e-16,3.4160710000000005e-17,-3.757678e-16,3.4160710000000005e-17,-3.4160710000000005e-17,0.0,6.832142000000001e-17,-2.562053e-17,-5.5511150000000004e-17,1.2810270000000001e-17,2.604754e-16,-8.540177e-18,-9.394195000000001e-17,0.0,4.440892e-16,-1.195625e-16,-3.074464e-16,3.4160710000000005e-17,-7.771561e-16,5.636517e-16,-2.647455e-16,-8.540177000000001e-17,1.281027e-16,1.537232e-16,2.562053e-17,1.494531e-16,-5.978124e-17,1.708035e-17,1.024821e-16,1.195625e-16,2.732857e-16,-1.216975e-16,5.978124e-17,2.732857e-16,-1.494531e-16,-1.024821e-16,8.540177000000001e-17,3.074464e-16,-3.4160710000000005e-17,1.708035e-16,1.366428e-16,2.39125e-16,-1.366428e-16,-3.074464e-16,3.4160710000000005e-17,1.024821e-16,3.4160710000000005e-17,-1.45183e-16,2.775558e-17,-2.39125e-16,3.4160710000000005e-17,-1.110223e-16,1.345078e-16,7.686159e-17
std,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413,1.002413
min,-1.206158,-1.150725,-1.104253,-1.036115,-1.236093,-1.600493,-1.921613,-1.52211,-1.443689,-1.468833,-1.564472,-1.621794,-1.812688,-1.641094,-1.547347,-1.560974,-1.448751,-1.589952,-1.769499,-1.898502,-2.168994,-2.359782,-2.36674,-2.71968,-2.666087,-2.568134,-2.668895,-2.813072,-2.618893,-2.359606,-2.137348,-1.873982,-1.793647,-1.656081,-1.432336,-1.430241,-1.373417,-1.418414,-1.453707,-1.680436,-1.483614,-1.620067,-1.778046,-1.609948,-1.303898,-1.20219,-1.411667,-1.46827,-1.447799,-1.498229,-1.341343,-1.313126,-1.449472,-1.364897,-1.229092,-1.366868,-1.302971,-1.185113,-1.271603,-1.176985
25%,-0.6894939,-0.6686781,-0.6490624,-0.6359298,-0.6703975,-0.6367565,-0.6626732,-0.6400918,-0.685659,-0.7232644,-0.8064569,-0.835483,-0.7621827,-0.7398484,-0.7591598,-0.7849821,-0.7988564,-0.8058388,-0.7993884,-0.810706,-0.8139075,-0.8514605,-0.7883456,-0.5530684,-0.6123656,-0.6578782,-0.6947312,-0.6730211,-0.7442437,-0.7698181,-0.7444604,-0.741057,-0.7734584,-0.8048124,-0.8247161,-0.8748025,-0.8511364,-0.7784131,-0.7644914,-0.69997,-0.7390302,-0.7093139,-0.6587522,-0.6557863,-0.6793257,-0.6891506,-0.6709773,-0.7435627,-0.7131495,-0.6509652,-0.6380641,-0.6394049,-0.7999231,-0.7642025,-0.7270112,-0.6678488,-0.7138771,-0.6738235,-0.691858,-0.6788714
50%,-0.2774703,-0.2322506,-0.2486515,-0.2120457,-0.2292089,-0.2106432,-0.2400524,-0.2672134,-0.2180558,-0.1928459,-0.08469964,-0.008381,-0.06652752,-0.09427355,-0.1878739,-0.317922,-0.4089954,-0.3220326,-0.2714467,-0.078416,0.03359247,0.1591469,0.2112606,0.1083491,0.1869404,0.2308561,0.1772798,0.1600721,0.1615793,0.1190734,-0.0661685,-0.04437862,-0.1262995,-0.2262096,-0.3087757,-0.2417501,-0.2402772,-0.1268809,-0.2129934,-0.1860318,-0.1742944,-0.1972008,-0.1730277,-0.2735576,-0.3254731,-0.2939871,-0.2398208,-0.2139841,-0.2015434,-0.1851537,-0.181037,-0.2102002,-0.1645716,-0.2252935,-0.2532164,-0.2396997,-0.3240352,-0.3329639,-0.2499546,-0.2405314
75%,0.2784345,0.2893335,0.3682681,0.2285353,0.4524231,0.5012417,0.5232608,0.4096773,0.4692723,0.450741,0.4958032,0.579876,0.554282,0.5461209,0.6476456,0.677489,0.9254848,0.8690373,0.8804084,0.93992,0.8083857,0.813657,0.8077787,0.8364219,0.81159,0.8194722,0.877092,0.8716615,0.8764117,0.7004289,0.6439769,0.6640521,0.6742455,0.8361697,0.7766817,0.6515635,0.6480175,0.4748773,0.5493604,0.6348106,0.5757088,0.6294876,0.5624103,0.433744,0.2268742,0.2974485,0.3685825,0.460536,0.4627081,0.3558478,0.3970293,0.343864,0.5950106,0.4886751,0.3973675,0.4112618,0.4513169,0.3719959,0.3865486,0.4020352
max,4.706053,5.944643,6.836142,8.025419,5.878863,4.710224,4.074573,3.816498,4.274237,3.746234,3.763162,3.26174,3.127477,4.26888,3.317184,2.672727,2.220238,2.099203,1.924052,1.667629,1.519998,1.471889,1.414514,1.372284,1.328397,1.268223,1.215369,1.293121,1.493402,1.902987,2.160531,2.310789,2.828812,2.433911,2.349744,2.334674,2.448005,3.10807,3.322838,3.470167,3.574989,3.245601,3.798951,4.227453,3.346264,4.255258,4.954231,3.894165,4.075316,4.553653,7.039574,5.980752,4.01668,3.330819,5.008027,5.448568,4.795888,5.585599,4.615037,7.450343


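Each feature is now standardized to its own z-score, which represents how far each data point is from the feature mean, measured in standard deviations. Each feature is centered on 0 (the means in the table are zero up to floating-point error), where negative numbers represent points below the mean and positive numbers represent points above. The standard deviation shows as 1.002413 rather than exactly 1 because `StandardScaler` divides by the population standard deviation, while `describe()` reports the sample standard deviation. The table above confirms the scaling, so we are good to go.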
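
The 1.002413 in the `std` row can be reproduced with a short sketch. It uses a synthetic column of 208 values as a stand-in for one sonar feature (an assumption, since any column of that length gives the same result) and compares the population standard deviation (`ddof=0`) with the sample standard deviation (`ddof=1`).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one sonar feature column (208 values in [0, 1]).
x = rng.uniform(0.0, 1.0, size=208)

# z-score using the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

population_std = z.std(ddof=0)  # exactly 1 up to floating-point error
sample_std = z.std(ddof=1)      # the convention used by pandas describe()

print(round(float(population_std), 6))
print(round(float(sample_std), 6))
print(round(float(np.sqrt(208 / 207)), 6))
```

The last two printed values agree: the sample standard deviation of a z-scored column with n = 208 rows is sqrt(n / (n - 1)), which rounds to 1.002413.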

## Step 4: Prepare the Training and Test Data Sets

We will use `train_test_split` from sklearn to split our data into training and testing sets. It splits our data into four different variables, which are described below.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

Now, the data is split into features (labeled "X") for training and testing, and labels (labeled "y") for training and testing. I chose to hold back 20% of the data for testing, because that is a common value to use for the hold-out method. I also set the random state to 0, so my results are reproducible. 
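
One caveat worth noting: in Step 3 the scaler was fitted on all 208 rows before this split, so the test rows influenced the means and standard deviations used for scaling. A possible alternative, sketched below on synthetic stand-in data (not the sonar file), is to split first and let a pipeline fit the scaler on the training rows only. The `stratify` argument is an optional extra that keeps the class ratio the same in both splits. The names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the sonar data (208 x 60).
X_demo, y_demo = make_classification(n_samples=208, n_features=60, random_state=0)

# Split the raw features first; stratify keeps the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when transforming the test rows.
demo_pipeline = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", class_weight="balanced"),
)
demo_pipeline.fit(X_tr, y_tr)

demo_score = demo_pipeline.score(X_te, y_te)
print(X_tr.shape, X_te.shape)
print(demo_score)
```

With 208 rows and `test_size=0.2`, the split produces 166 training rows and 42 test rows, the same sizes as in this notebook.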

## Step 5: Instantiate and Configure an SVM

Here we start by importing the `SVC` class from sklearn and setting some hyperparameters. We start with the default settings:

- `C` is the regularization parameter. The strength of the regularization is inversely proportional to `C`, so a smaller `C` means a simpler, more heavily regularized model.
- `kernel` is the kernel function; the default is rbf (radial basis function).
- `gamma` is the kernel coefficient. `'scale'` calculates it as 1 / (number of features * variance of the features).
- `tol` is the tolerance for the stopping criterion: training stops once the optimization has converged to within this tolerance (an indication that a minimum of the loss function has been reached).
- `class_weight` sets weights for the different classes in the loss function. `'balanced'` creates weights inversely proportional to the class frequencies.
- `max_iter` is a hard limit on the number of solver iterations (-1 means no limit).
- `random_state` seeds the random number generator that shuffles the data when probability estimates are enabled. With the default `probability=False` it is ignored, so setting it here has no effect on the results.

There are other parameters that could be set, but in class these parameters have been shown to be the most important.
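
To make the `gamma` setting concrete, the sketch below computes the `'scale'` value by hand from the formula above, alongside the `'auto'` alternative (1 / number of features). It uses a synthetic array as a stand-in for the standardized training features, so the numbers are illustrative rather than the values used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for standardized training features (166 rows x 60 columns).
X_demo = rng.normal(size=(166, 60))

n_features = X_demo.shape[1]

# gamma='scale': 1 / (n_features * variance over all feature values)
gamma_scale = 1.0 / (n_features * X_demo.var())

# gamma='auto': 1 / n_features
gamma_auto = 1.0 / n_features

print(gamma_scale)
print(gamma_auto)
```

On standardized data the overall variance is close to 1, so the two settings give nearly the same value, roughly 1/60.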

In [8]:
from sklearn.svm import SVC

svm_model = SVC(C=1.0, kernel='rbf', 
                gamma='scale', 
                tol=0.001, class_weight='balanced', 
                max_iter=-1, random_state=0)



We have now instantiated our SVM model!

## Step 6: Train the SVM

Training the model is easy with this implementation. All we need to do is call the `fit` method and provide the training features and labels. 

In [9]:
svm_model.fit(X_train,y_train)


SVC(class_weight='balanced', random_state=0)

With this run, we are now ready to use our trained model and examine the accuracy.

## Step 7: Validate and Test the SVM

We will call on the `score` method and use both the training data and the testing data. This will give us the accuracy the model achieved while training, and how well it does on data it has never seen before in the testing set.


In [10]:
print("Training accuracy: " + str(svm_model.score(X_train,y_train)))
print("Testing accuracy: " + str(svm_model.score(X_test,y_test)))


Training accuracy: 0.9939759036144579
Testing accuracy: 0.8333333333333334


We have a very high training accuracy score (99.4%), but a low testing score (83.3%). This may be an indication of over-fitting in our model.
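
It helps to translate these accuracies back into counts. The sketch below assumes the split sizes implied by `test_size=0.2` on 208 rows (166 training rows and 42 test rows) and recovers how many samples were classified correctly in each set.

```python
n_samples = 208
n_test = 42                  # 0.2 * 208 = 41.6, rounded up by train_test_split
n_train = n_samples - n_test

# Accuracies copied from the output above.
train_accuracy = 0.9939759036144579
test_accuracy = 0.8333333333333334

train_correct = round(train_accuracy * n_train)
test_correct = round(test_accuracy * n_test)

print(f"train: {train_correct}/{n_train} correct, {n_train - train_correct} wrong")
print(f"test:  {test_correct}/{n_test} correct, {n_test - test_correct} wrong")
# train: 165/166 correct, 1 wrong
# test:  35/42 correct, 7 wrong
```

With only 42 test samples, each misclassification moves the test accuracy by about 2.4 percentage points, so small differences between models should be read with caution.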

## Step 8: Demonstrate Making Predictions

It is time to show what the trained model can do. I created a random array of 60 values, all between 0 and 1, which is the same range as the raw (unscaled) features. Passing these features to the model's `predict` method returns a predicted label, which we print.

In [11]:
import random

random_features = pd.DataFrame([random.uniform(0, 1) for i in range(60)]).values.reshape(1,-1)
#random_features = sonar_data.iloc[24,:60].values.reshape(1,-1)

prediction = svm_model.predict(random_features)
print(prediction)

['M']


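Here it predicts "M", which means a mine. I also commented out a line that takes the row at index 24 of the original DataFrame and plugs that back into the model. This was just a test to see a different result. If you uncomment it and run the cell, the model will predict an "R", which means rock. Note that both inputs are on the raw 0-to-1 scale, while the model was trained on standardized features, so these predictions demonstrate the `predict` call rather than a reliable classification. New inputs should be passed through the fitted `scaler` first.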
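
Because the model was trained on standardized features, a new sample should go through the same fitted scaler before prediction. Here is a minimal sketch of that pattern on synthetic stand-in data. The names `demo_scaler` and `demo_model` and the rule used to generate the labels are hypothetical, chosen only so the example runs on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the raw features (values in [0, 1]) and labels.
X_raw = rng.uniform(0.0, 1.0, size=(208, 60))
y_demo = np.where(X_raw[:, 0] > 0.5, "M", "R")

# Fit the scaler once, and train the model on the scaled features.
demo_scaler = StandardScaler().fit(X_raw)
demo_model = SVC(kernel="rbf").fit(demo_scaler.transform(X_raw), y_demo)

# A new raw sample must be transformed with the SAME fitted scaler
# (transform, not fit_transform) before calling predict.
new_sample = rng.uniform(0.0, 1.0, size=(1, 60))
new_scaled = demo_scaler.transform(new_sample)

demo_prediction = demo_model.predict(new_scaled)
print(demo_prediction)
```

In this notebook the equivalent would be to call `scaler.transform(...)` on the random or original row before passing it to `svm_model.predict`.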

## Step 9: Evaluate (and Improve?)

The current testing accuracy is 83.3%. This model is currently underperforming and is not a good classifier. It performed worse than trained humans, who had accuracies from 88% to 97%. We have a high training accuracy, and a low testing accuracy, which indicates overfitting is likely.

There are 60 features, which can lead to the SVM trying to fit the noise instead of the pattern. A dimension reduction technique like PCA could be useful. I think decreasing the regularization parameter "C" would help the model not overfit the data, but I will try increasing it as well. In sklearn, "C" sets the penalty for training points that violate the margin: a smaller "C" regularizes more strongly, giving a simpler decision boundary at the cost of less accurate training results, while a larger "C" lets the model fit the training data more closely. Experimenting with different kernels such as poly or linear might also yield a better decision boundary. It all depends on the data, so we must experiment.
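
One way to run this kind of experiment systematically, without consulting the test set while tuning, is a cross-validated grid search on the training data. The sketch below is an assumption about how that could look, not the procedure used in this notebook. It runs on synthetic stand-in data, and the grid values are examples only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the sonar data (208 x 60).
X_demo, y_demo = make_classification(n_samples=208, n_features=60, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaling lives inside the pipeline so each CV fold is scaled independently.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(class_weight="balanced")),
])

# Example grid over the two hyperparameters discussed above.
param_grid = {
    "svc__C": [0.1, 1.0, 4.2, 10.0],
    "svc__kernel": ["linear", "poly", "rbf"],
}

# 5-fold cross-validation on the training split only.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)        # mean cross-validated accuracy
print(search.score(X_te, y_te))  # single final check on held-out data
```

The test set is touched only once, at the end, so its accuracy remains an honest estimate of performance on unseen data.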

Let's take another crack at the model. I adjusted each of the hyperparameters mentioned above, and this combination provides the best accuracy. 

In [12]:
svm_model = SVC(C=4.2, kernel='rbf', 
                gamma='scale', degree=3,
                tol=0.001, class_weight='balanced', 
                max_iter=-1, random_state=0)

svm_model.fit(X_train,y_train)
print("Training accuracy: " + str(svm_model.score(X_train,y_train)))
print("Testing accuracy: " + str(svm_model.score(X_test,y_test)))

Training accuracy: 1.0
Testing accuracy: 0.9285714285714286


We have achieved 92.9% accuracy on our testing set. That is great! Our model now performs within the range of the trained human subjects (88% to 97%). I increased "C" to 4.2, which weakens the regularization and allows the model to get a bit more complicated. Using the poly kernel, I was able to get about 90.5% accuracy, but ultimately went back to the rbf kernel. No other kernels performed as well. I played around with the other settings, but ultimately the default for each was the best.

The training accuracy is 100%, which worries me. I was able to get both accuracies around the 88% mark with one set of hyperparameters, but because I could not get better than that I changed back. Our model may still be overfitting to this data set. Because I chose the hyperparameters by looking at the testing accuracy, the 92.9% figure is also likely to be optimistic. If I were to continue this experiment, I would use k-fold cross-validation to check that the accuracies hold up when training and testing on different splits of the data. I also think I would use a dimension reduction technique like PCA. I will not experiment with that here because it is beyond the scope of the workbook.
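
For reference, the two follow-up ideas above (k-fold cross-validation and PCA) could be combined roughly as follows. This is only a sketch on synthetic stand-in data: the choice of 5 folds and of keeping 95% of the variance in PCA are assumptions made for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the sonar data (208 x 60).
X_demo, y_demo = make_classification(n_samples=208, n_features=60, random_state=0)

# Scale, reduce dimensions, then classify; every step is refit per fold.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),  # keep 95% of the variance
    SVC(C=4.2, kernel="rbf", class_weight="balanced"),
)

# Stratified 5-fold CV keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

The spread of the five fold scores gives a sense of how much a single train/test split, like the one used in this notebook, can vary by chance.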

## Conclusion


In this workbook I explored the sonar dataset again, used the sklearn SVM to train and test a model that predicts whether readings indicate a rock or a mine, and tried to improve the model accuracy by tuning hyperparameters. I learned how the `StandardScaler` class from sklearn works. I wonder whether different scaling techniques influence the performance of the model. In the previous notebook we used min-max scaling between 0 and 1.

I was able to see that my model was overfitting the data by observing the difference between the training and testing accuracies. I thought I could improve the results by decreasing the regularization parameter "C" to force the model to become simpler. This did improve the testing accuracy to about 88%, but I was surprised that when I increased "C" (weakening the regularization and allowing the model to become more complicated) the testing accuracy increased to 93%. I did not expect that result. I would love to test the model on some new, unseen data.

The support vector machine did achieve a greater accuracy than the perceptron I used in Notebook 4 (85.7%). This is promising, but I still think that with more time to explore the options I have stated above, the accuracy could increase.