---
## Binary classification in machine learning
---
**Problem Statement:** (Sonar Mines vs Rocks dataset)<br>The problem is to predict metal or rock objects from sonar return data. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The label associated with each record contains the letter R if the object is a rock and M if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.

### Importing essential libraries

In [None]:
import numpy
from matplotlib import pyplot
from pandas import read_csv
from pandas import set_option
# from pandas.plotting import scatter_matrix  # formerly pandas.tools.plotting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

### Loading the data set

In [None]:
df = read_csv('sonar.all-data', header=None)

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
set_option('display.max_rows', 500)
print(df.dtypes)

In [None]:
set_option('display.precision', 3)
print(df.describe())

**The attributes have differing mean values, so standardizing the data may help.**
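
As a minimal sketch of what standardizing does, the snippet below applies `StandardScaler` to a small made-up array (the values are illustrative and are not taken from the sonar data) and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns with very different means and spreads.
toy = np.array([[0.02, 0.90],
                [0.04, 0.70],
                [0.06, 0.50],
                [0.08, 0.30]])

scaled = StandardScaler().fit_transform(toy)

# After standardizing, every column has mean ~0 and standard deviation ~1.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```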

In [None]:
# class distribution
print(df.groupby(60).size())

**The classes are reasonably balanced: 111 mines (M) and 97 rocks (R).**
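
These counts also give a useful reference point for the accuracies reported later. The short calculation below, a sketch that uses only the two class counts above, works out the accuracy of a trivial classifier that always predicts the majority class.

```python
# Class counts reported above.
n_mines, n_rocks = 111, 97
total = n_mines + n_rocks

# Accuracy of a trivial classifier that always predicts the majority class (M).
baseline_accuracy = max(n_mines, n_rocks) / total
print("baseline accuracy: %.3f" % baseline_accuracy)
```

Any model worth keeping should score clearly above this baseline of roughly 53%.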

### Univariate Data Visualizations<br><br><font color=blue>Graphical representation using histograms

In [None]:
df.hist(sharex=False, sharey=False, xlabelsize=1,ylabelsize=1,figsize=(12,12) );

### <font color=blue> Graphical representation of Density

In [None]:
df.plot(kind='density', subplots=True, layout=(8,8), sharex=False, legend=False, figsize=(12,12), fontsize=1)
pyplot.show()

We observe that many attributes have skewed distributions. A Box-Cox transform can correct the skewness.
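
The notebook does not apply this transform itself. As a rough sketch of the idea, the snippet below runs scikit-learn's `PowerTransformer` with `method='box-cox'` on a synthetic right-skewed feature (an assumption standing in for a sonar attribute) and compares the skewness before and after. Box-Cox requires strictly positive inputs, so attributes that contain zeros would need a small offset or `method='yeo-johnson'` instead.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)
# Hypothetical right-skewed, strictly positive feature (Box-Cox needs x > 0).
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

transformer = PowerTransformer(method='box-cox', standardize=True)
transformed = transformer.fit_transform(skewed)

skew_before = skew(skewed[:, 0])
skew_after = skew(transformed[:, 0])
print("skew before: %.2f, after: %.2f" % (skew_before, skew_after))
```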

### Multivariate Data Visualizations
Visualize the correlation between attributes.

In [None]:
# correlation matrix
fig= pyplot.figure()
ax= fig.add_subplot(111)
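# drop the class label (column 60) so that only the numeric attributes are correlated
cax = ax.matshow(df.drop(columns=60).corr(), vmin=-1, vmax=1, interpolation='none')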
fig.colorbar(cax)
pyplot.show()

The yellow color around the diagonal shows that attributes next to each other are generally more correlated with each other.
The green patches also suggest some moderate negative correlation between attributes that are further apart in the ordering. This makes sense if the order of the attributes refers to the angle of sensors for the sonar chirp.
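
One way to put a number on this visual impression is to average the correlation matrix along its off-diagonals. The sketch below does this on synthetic data (a random walk across 60 columns, an assumption chosen only so that neighbouring columns are related); `mean_corr_at_lag` is a hypothetical helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical stand-in for the 60 sonar attributes: each row is a smooth
# curve (cumulative sum of noise), so neighbouring columns are related.
data = np.cumsum(rng.normal(size=(200, 60)), axis=1)

corr = np.corrcoef(data, rowvar=False)   # 60 x 60 correlation matrix

def mean_corr_at_lag(corr, lag):
    # Average of the entries `lag` steps off the main diagonal.
    return np.diagonal(corr, offset=lag).mean()

lag1 = mean_corr_at_lag(corr, 1)
lag30 = mean_corr_at_lag(corr, 30)
print("mean correlation at lag 1: %.2f, at lag 30: %.2f" % (lag1, lag30))
```

With the real data, the same helper could be applied to `df.drop(columns=60).corr().values`.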

### Validation Dataset<br><br><font color=blue>Splitting out the validation dataset

In [None]:
array = df.values
X = array[:,0:60].astype(float)
Y = array[:, 60]
seed = 7
X_train , X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=0.20, random_state= seed)
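
The split above is purely random. With only 208 records, one optional variation is to pass `stratify` to `train_test_split` so that both parts keep roughly the same M:R ratio. The sketch below illustrates this on placeholder features and made-up labels that merely reuse the class counts; it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same class counts as the sonar data.
y = np.array(['M'] * 111 + ['R'] * 97)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y)

# Class proportions in each part stay close to the overall 111:97 ratio.
frac_m_train = np.mean(y_tr == 'M')
frac_m_val = np.mean(y_val == 'M')
print(len(y_tr), len(y_val), round(frac_m_train, 3), round(frac_m_val, 3))
```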

### <font color=blue>Evaluating Algorithms

In [None]:
num_folds=10
seed = 7 
scoring = 'accuracy'

### <font color=blue> Spot-checking a few algorithms

In [None]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
results = []
names = []
for name, model in models:
    # random_state only takes effect (and is only accepted) when shuffle=True
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

### <font color=blue>Visualizing the distribution of accuracy values of the above algorithms

In [None]:
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

**We see that KNN has low variance and is a good candidate for further analysis, while SVM shows low accuracy. The distribution of the data has an effect on the accuracy, so we now standardize the data.**

### <font color=blue>Standardizing the Dataset

In [None]:
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler',StandardScaler()), ('LR', LogisticRegression())])))
pipelines.append(('ScaledLDA', Pipeline([('Scaler', StandardScaler()), ('LDA',LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler',StandardScaler()), ('KNN',KNeighborsClassifier())])))
pipelines.append(('ScaledCART',Pipeline([('Scaler',StandardScaler()),('CART', DecisionTreeClassifier())])))
pipelines.append(('ScaledNB'  ,Pipeline([('Scaler',StandardScaler()),('NB', GaussianNB())])))
pipelines.append(('ScaledSVM' ,Pipeline([('Scaler',StandardScaler()),('SVM', SVC())])))
results = []
names = []
for name, model in pipelines:
    # random_state only takes effect (and is only accepted) when shuffle=True
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring= scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name , cv_results.mean(), cv_results.std())
    print(msg)

**After standardization, SVM has the highest accuracy of the six algorithms, a clear improvement over its unscaled result above.**
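
A plausible reason SVM gains so much is that its default RBF kernel works on distances between rows, so attributes with a larger numeric range dominate the result. The sketch below illustrates that effect on synthetic data only: `make_classification` and the inflated noise column are assumptions for demonstration and say nothing about the sonar attributes themselves.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: the first 5 columns are informative, the last 5 are noise
# (shuffle=False keeps that column order).
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, n_redundant=0,
                                     shuffle=False, random_state=7)
X_demo[:, -1] *= 1000.0   # blow up the scale of one pure-noise column

demo_kfold = KFold(n_splits=10, shuffle=True, random_state=7)
unscaled_acc = cross_val_score(SVC(), X_demo, y_demo, cv=demo_kfold).mean()
scaled_acc = cross_val_score(
    Pipeline([('Scaler', StandardScaler()), ('SVM', SVC())]),
    X_demo, y_demo, cv=demo_kfold).mean()
print("unscaled: %.3f, scaled: %.3f" % (unscaled_acc, scaled_acc))
```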

### <font color=blue>Plotting the distribution of the accuracy scores

In [None]:
fig = pyplot.figure()
fig.suptitle('Scaled Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

**We will tune the parameters for KNN and SVM, as both have shown good accuracy.**

### Tuning KNN

In [None]:
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
neighbors = [1,3,5,7,9,11,13,15,17,19,21]
param_grid = dict(n_neighbors=neighbors)
model = KNeighborsClassifier()
# random_state only takes effect (and is only accepted) when shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, Y_train)
print("BEST: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean , stdev , param in zip (means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
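
The search above fits the scaler on the whole training set before cross-validating, so each validation fold has already influenced the scaling. A common alternative, sketched below under the assumption that synthetic data from `make_classification` stands in for `X_train` and `Y_train`, is to put the scaler and the classifier in one `Pipeline` and let `GridSearchCV` refit both inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train / Y_train (166 rows, 60 features).
X_demo, y_demo = make_classification(n_samples=166, n_features=60,
                                     n_informative=10, random_state=7)

pipe = Pipeline([('Scaler', StandardScaler()),
                 ('KNN', KNeighborsClassifier())])
# Pipeline parameters are addressed as <step name>__<parameter>.
demo_param_grid = {'KNN__n_neighbors': [1, 3, 5, 7, 9, 11]}
demo_kfold = KFold(n_splits=10, shuffle=True, random_state=7)

demo_grid = GridSearchCV(estimator=pipe, param_grid=demo_param_grid,
                         scoring='accuracy', cv=demo_kfold)
demo_result = demo_grid.fit(X_demo, y_demo)
print("Best: %f using %s" % (demo_result.best_score_, demo_result.best_params_))
```

The same pattern would apply to the SVM search by swapping in `SVC()` and keys such as `SVM__C` and `SVM__kernel`.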

### Tuning SVM

**The grid search below shows that the most accurate configuration was SVM with an RBF kernel and a C value of 1.5. Its accuracy of 86.7470% is seemingly better than what KNN could achieve.**

In [None]:
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
c_values = [0.1,0.3,0.5,0.7,0.9,1.0,1.3, 1.5,1.7,2.0]
kernel_values = ['linear', 'poly','rbf', 'sigmoid']
param_grid = dict(C=c_values, kernel=kernel_values)
model = SVC()
# random_state only takes effect (and is only accepted) when shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean , stdev , param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

### Comparing algorithms

In [None]:
fig= pyplot.figure()
# results and names still hold the scaled-pipeline scores computed earlier
fig.suptitle('Scaled Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

**Finalizing the model: here we use SVM as our final model.**

### Prepare the model

In [None]:
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
model = SVC(C=1.5)
model.fit(rescaledX, Y_train)
# estimate the accuracy on validation dataset
rescaledValidationX = scaler.transform(X_validation)
predictions = model.predict(rescaledValidationX)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation,predictions))
print(classification_report(Y_validation, predictions))

**Accuracy is nearly 86% on the hold-out dataset, in line with the roughly 86% accuracy that the SVM algorithm achieved on the training dataset.**
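
To show how the accuracy figure relates to the confusion matrix printed above, the sketch below builds a confusion matrix from a handful of made-up labels (hypothetical values, not the model's actual predictions) and recovers the accuracy as the sum of the diagonal divided by the total.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (not results from the sonar model).
y_true = ['M', 'M', 'M', 'R', 'R', 'R', 'R', 'M']
y_pred = ['M', 'M', 'R', 'R', 'R', 'M', 'R', 'M']

cm = confusion_matrix(y_true, y_pred, labels=['M', 'R'])
# Rows are true classes, columns are predicted classes, in `labels` order.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```

The off-diagonal entries count the two kinds of error separately: mines predicted as rocks, and rocks predicted as mines.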