#  The Sonar Dataset Project

The purpose of this exercice is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

 The bouncing sonar signals off a metal cylinder were obtained at various angles and under various conditions.

Let's look at the Data as usual

In [1]:
from pandas import read_csv, set_option

In [2]:
filename = 'sonar.csv'
data = read_csv(filename, header=None)
peek = data.head(20)
shape = data.shape
types = data.dtypes

In [3]:
print(peek)

        0       1       2       3       4       5       6       7       8   \
0   0.0200  0.0371  0.0428  0.0207  0.0954  0.0986  0.1539  0.1601  0.3109   
1   0.0453  0.0523  0.0843  0.0689  0.1183  0.2583  0.2156  0.3481  0.3337   
2   0.0262  0.0582  0.1099  0.1083  0.0974  0.2280  0.2431  0.3771  0.5598   
3   0.0100  0.0171  0.0623  0.0205  0.0205  0.0368  0.1098  0.1276  0.0598   
4   0.0762  0.0666  0.0481  0.0394  0.0590  0.0649  0.1209  0.2467  0.3564   
5   0.0286  0.0453  0.0277  0.0174  0.0384  0.0990  0.1201  0.1833  0.2105   
6   0.0317  0.0956  0.1321  0.1408  0.1674  0.1710  0.0731  0.1401  0.2083   
7   0.0519  0.0548  0.0842  0.0319  0.1158  0.0922  0.1027  0.0613  0.1465   
8   0.0223  0.0375  0.0484  0.0475  0.0647  0.0591  0.0753  0.0098  0.0684   
9   0.0164  0.0173  0.0347  0.0070  0.0187  0.0671  0.1056  0.0697  0.0962   
10  0.0039  0.0063  0.0152  0.0336  0.0310  0.0284  0.0396  0.0272  0.0323   
11  0.0123  0.0309  0.0169  0.0313  0.0358  0.0102  0.0182  0.05

In [4]:
print(shape)

(208, 61)


In [5]:
print(types)

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10    float64
11    float64
12    float64
13    float64
14    float64
15    float64
16    float64
17    float64
18    float64
19    float64
20    float64
21    float64
22    float64
23    float64
24    float64
25    float64
26    float64
27    float64
28    float64
29    float64
       ...   
31    float64
32    float64
33    float64
34    float64
35    float64
36    float64
37    float64
38    float64
39    float64
40    float64
41    float64
42    float64
43    float64
44    float64
45    float64
46    float64
47    float64
48    float64
49    float64
50    float64
51    float64
52    float64
53    float64
54    float64
55    float64
56    float64
57    float64
58    float64
59    float64
60     object
Length: 61, dtype: object


Let's do some descriptive statistics

In [6]:
set_option('display.width',100)
set_option('precision',3)
description = data.describe()

In [7]:
print(description)

            0          1        2        3        4        5        6        7        8        9   \
count  208.000  2.080e+02  208.000  208.000  208.000  208.000  208.000  208.000  208.000  208.000   
mean     0.029  3.844e-02    0.044    0.054    0.075    0.105    0.122    0.135    0.178    0.208   
std      0.023  3.296e-02    0.038    0.047    0.056    0.059    0.062    0.085    0.118    0.134   
min      0.002  6.000e-04    0.002    0.006    0.007    0.010    0.003    0.005    0.007    0.011   
25%      0.013  1.645e-02    0.019    0.024    0.038    0.067    0.081    0.080    0.097    0.111   
50%      0.023  3.080e-02    0.034    0.044    0.062    0.092    0.107    0.112    0.152    0.182   
75%      0.036  4.795e-02    0.058    0.065    0.100    0.134    0.154    0.170    0.233    0.269   
max      0.137  2.339e-01    0.306    0.426    0.401    0.382    0.373    0.459    0.683    0.711   

       ...       50         51         52       53         54         55         56       

The class counts of metal and rock is:

In [8]:
class_counts = data.groupby(60).size()
print(class_counts)

60
M    111
R     97
dtype: int64


## The machine learning part

Let's seperate the data

In [9]:
from sklearn.model_selection import train_test_split

array = data.values
X = array[:,:-1]
Y = array[:,-1]
validation_size = 0.2
seed=7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y,
test_size=validation_size, random_state=seed)

Let's import the classification algorithms needed and define the list of models

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [12]:
models = []
models.append(('LR', LogisticRegression(solver = "lbfgs", multi_class="auto")))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma="auto")))

In order to evaluate each model we will use the cross validation methode where the training data is split into folds and and the validation set changes every time and is assigned one of the folds.

For our dataset we will split the dat into 10 folds

Let's import the needed libraries first

In [14]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [15]:
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=10, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print(f'{name}: {cv_results.mean()} {cv_results.std()}')

LR: 0.7694852941176471 0.10051029509664779
LDA: 0.7463235294117647 0.11785367885381073
KNN: 0.8080882352941176 0.06750704820308338
CART: 0.7058823529411764 0.10850260708363019
NB: 0.6488970588235294 0.1418684214516758
SVM: 0.6088235294117647 0.1186560591820866


In this we can see that the KNN algorithm performs the best, but the SVM algorithm performs surprisingly bad, I will try the apply standard scaling to the features to improve the performance. 

In order to do that we will use pipelines and use the standard scaler.

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [23]:
import numpy as np
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()),('LR',
LogisticRegression(solver = "lbfgs", multi_class="auto"))])))
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()),('LDA',
LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', StandardScaler()),('NB',
GaussianNB())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()),('KNN',
KNeighborsClassifier())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()),('CART',
DecisionTreeClassifier())])))
pipelines.append(('ScaledSVC', Pipeline([('Scaler', StandardScaler()),('SVC', SVC(gamma="auto"))])))
results = []
names = []
for name, model in pipelines:
    kfold = KFold(n_splits=10, random_state=seed)
    cv_results = cross_val_score(model, X_train.astype(np.float), Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print(f'{name}: {cv_results.mean()} {cv_results.std()}')

ScaledLR: 0.7404411764705883 0.09466751140841813
ScaledLASSO: 0.7463235294117647 0.11785367885381073
ScaledNB: 0.6488970588235294 0.1418684214516758
ScaledKNN: 0.8257352941176471 0.054511038214266574
ScaledCART: 0.7180147058823529 0.11175563527433832
ScaledSVC: 0.8363970588235293 0.08869747214968386


As we can see the SVM algorithm and the KNN algorithm are the top performers, let's do some tuning

In [24]:
from sklearn.model_selection import GridSearchCV

Let's start with the KNN

In [28]:
# KNN Algorithm tuning
scaler = StandardScaler().fit(X_train.astype(float))
rescaledX = scaler.transform(X_train.astype(float))
k_values = np.array([1,3,5,7,9,11,13,15,17,19,21])
param_grid = dict(n_neighbors=k_values)
model = KNeighborsClassifier()
kfold = KFold(n_splits=10, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kfold,iid=True)
grid_result = grid.fit(rescaledX, Y_train)
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"{mean} ({stdev}) with: {param}")

Best: 0.8493975903614458 using {'n_neighbors': 1}
0.8493975903614458 (0.05988110090069771) with: {'n_neighbors': 1}
0.8373493975903614 (0.06630330382796286) with: {'n_neighbors': 3}
0.8373493975903614 (0.03749969758430342) with: {'n_neighbors': 5}
0.7650602409638554 (0.08950992596280488) with: {'n_neighbors': 7}
0.7530120481927711 (0.08697897949491253) with: {'n_neighbors': 9}
0.7349397590361446 (0.10489007400824805) with: {'n_neighbors': 11}
0.7349397590361446 (0.10583597849547675) with: {'n_neighbors': 13}
0.7289156626506024 (0.07587309410809662) with: {'n_neighbors': 15}
0.7108433734939759 (0.07871598186667497) with: {'n_neighbors': 17}
0.7228915662650602 (0.08455537779041811) with: {'n_neighbors': 19}
0.7108433734939759 (0.10882920638633953) with: {'n_neighbors': 21}


And the SVM

In [32]:
# svm Algorithm tuning
scaler = StandardScaler().fit(X_train.astype(float))
rescaledX = scaler.transform(X_train.astype(float))
C_values = np.array([0.1,0.3,0.5,0.7,0.9,1.0,1.3,1.5,1.7,2.0])
kernel_values = ['linear','poly','rbf','sigmoid']
param_grid = dict(C=C_values, kernel=kernel_values)
model = SVC(gamma='auto')
kfold = KFold(n_splits=10, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kfold,iid=True)
grid_result = grid.fit(rescaledX, Y_train)
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"{mean} ({stdev}) with: {param}")

Best: 0.8674698795180723 using {'C': 1.5, 'kernel': 'rbf'}
0.7590361445783133 (0.09886327405671058) with: {'C': 0.1, 'kernel': 'linear'}
0.5301204819277109 (0.11878006022028104) with: {'C': 0.1, 'kernel': 'poly'}
0.572289156626506 (0.13033853327360725) with: {'C': 0.1, 'kernel': 'rbf'}
0.7048192771084337 (0.0663596226254967) with: {'C': 0.1, 'kernel': 'sigmoid'}
0.7469879518072289 (0.10891253844857184) with: {'C': 0.3, 'kernel': 'linear'}
0.6445783132530121 (0.13229030877107076) with: {'C': 0.3, 'kernel': 'poly'}
0.7650602409638554 (0.09231152338173192) with: {'C': 0.3, 'kernel': 'rbf'}
0.7349397590361446 (0.05463116375770295) with: {'C': 0.3, 'kernel': 'sigmoid'}
0.7409638554216867 (0.08303483150783184) with: {'C': 0.5, 'kernel': 'linear'}
0.6807228915662651 (0.09863764643656386) with: {'C': 0.5, 'kernel': 'poly'}
0.7891566265060241 (0.06431559978182318) with: {'C': 0.5, 'kernel': 'rbf'}
0.7469879518072289 (0.059265219662503005) with: {'C': 0.5, 'kernel': 'sigmoid'}
0.7469879518072289

So the best algorithm is the SVM algorithm with C=1.5 and kernel = 'rbf'

## Prediction Time

In [38]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# prepare the model
scaler = StandardScaler().fit(X_train.astype(float))
rescaledX = scaler.transform(X_train.astype(float))
model = SVC(C=1.5)
model.fit(rescaledX, Y_train)
# estimate accuracy on validation dataset
rescaledValidationX = scaler.transform(X_validation.astype(float))
predictions = model.predict(rescaledValidationX)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.8571428571428571
[[23  4]
 [ 2 13]]
              precision    recall  f1-score   support

           M       0.92      0.85      0.88        27
           R       0.76      0.87      0.81        15

   micro avg       0.86      0.86      0.86        42
   macro avg       0.84      0.86      0.85        42
weighted avg       0.86      0.86      0.86        42

