## The purpose of this project is to predict whether an object is a rock or a mine from sonar data. For this use case, we will use a Logistic Regression model.

<p style="font-size:18px;">
The submarine uses sonar, which sends out sound signals and receives the echoes that bounce back. The returned signal is then processed to determine whether the object is a mine or a rock.</p>

# These are the steps of the project
- <p style="font-size:18px;"> Collecting sonar data. The sonar sends signals and receives the echoes that bounce back from the object. The data was obtained from rocks and a metal cylinder, and it will be fed to the model to predict whether an object is a rock or a mine. </p> <br>
- <p style="font-size:18px;"> Data preprocessing (analyzing and understanding the data)  </p><br>
- <p style="font-size:18px;"> Splitting the data into training, validation, and test sets  </p><br>
- <p style="font-size:18px;"> Training the model. The model used is Logistic Regression, which is well suited to binary classification problems such as this one  </p><br>
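
The last step can be illustrated with a minimal sketch of how a logistic regression model turns features into a class label: it computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and thresholds that probability at 0.5. The weights, bias, and sample below are made-up values for a 3-feature toy case, not parameters learned from the sonar data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a 3-feature example (not learned from the sonar data)
weights = np.array([2.0, -1.0, 0.5])
bias = -0.25
sample = np.array([0.8, 0.1, 0.3])

score = np.dot(weights, sample) + bias   # linear combination of the features
probability = sigmoid(score)             # estimated probability of the positive class
label = int(probability >= 0.5)          # threshold at 0.5 to get a class label
print(probability, label)
```

In this notebook the positive class (1) is later assigned to mines and the negative class (0) to rocks.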

# 1 - Collecting Dataset

Dataset link to download : 
     https://drive.google.com/file/d/1pQxtljlNVh0DHYg-Ye7dtpDTlFceHVfa/view >>> CSV File

# 2 - Importing Libraries

In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 3 - Data Preprocessing

In [9]:
# Reading Dataset to a pandas Dataframe
sonar_dataset = pd.read_csv('sonar_data.csv')
# Printing The Top 5 Rows of the Dataset
sonar_dataset.head()

Unnamed: 0,0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
0,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
1,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
2,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
3,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R
4,0.0286,0.0453,0.0277,0.0174,0.0384,0.099,0.1201,0.1833,0.2105,0.3039,...,0.0045,0.0014,0.0038,0.0013,0.0089,0.0057,0.0027,0.0051,0.0062,R


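`sonar_dataset.head()` shows the first 5 rows, but the column titles are wrong: the file has no header row, so pandas used the first row of values as the column names. We need to either supply the column names ourselves or tell pandas that the file has no header.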

In [None]:
# If we know the column names, we can pass them as a list through the `names` parameter
# data_columns = ['Col 1 Name', 'Col 2 Name', 'Col 3 Name']
# sonar_dataset = pd.read_csv('sonar_data.csv', names=data_columns)
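
As a rough sketch of the `names` approach, the example below reads a tiny headerless CSV from a string instead of a file. The two rows, the 3-feature layout, and the `feature_i` / `label` names are all invented for illustration; the real file has 60 features plus the label.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as a headerless CSV (3 features + label)
csv_text = "0.02,0.0371,0.0428,R\n0.0453,0.0523,0.0843,M\n"

# Hypothetical column names: feature_0 ... feature_2 plus a label column
column_names = [f"feature_{i}" for i in range(3)] + ["label"]

# Passing names= makes pandas treat the first line as data, not as a header
df = pd.read_csv(io.StringIO(csv_text), names=column_names)
print(df)
```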


In [10]:
# Another solution is to pass header=None when reading the dataset
sonar_dataset = pd.read_csv('sonar_data.csv',header=None)
# Printing The Top 5 Rows of the Dataset
sonar_dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,60
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R


In [13]:
# Number of Rows and Columns in the dataset
sonar_dataset.shape
# The first item in the tuple is number of rows, the second item in the tuple is number of columns

(208, 61)

## Description and statistical summary of the dataset (count, mean, standard deviation, ...)

In [16]:
sonar_dataset.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
count,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,...,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0
mean,0.029164,0.038437,0.043832,0.053892,0.075202,0.10457,0.121747,0.134799,0.178003,0.208259,...,0.016069,0.01342,0.010709,0.010941,0.00929,0.008222,0.00782,0.007949,0.007941,0.006507
std,0.022991,0.03296,0.038428,0.046528,0.055552,0.059105,0.061788,0.085152,0.118387,0.134416,...,0.012008,0.009634,0.00706,0.007301,0.007088,0.005736,0.005785,0.00647,0.006181,0.005031
min,0.0015,0.0006,0.0015,0.0058,0.0067,0.0102,0.0033,0.0055,0.0075,0.0113,...,0.0,0.0008,0.0005,0.001,0.0006,0.0004,0.0003,0.0003,0.0001,0.0006
25%,0.01335,0.01645,0.01895,0.024375,0.03805,0.067025,0.0809,0.080425,0.097025,0.111275,...,0.008425,0.007275,0.005075,0.005375,0.00415,0.0044,0.0037,0.0036,0.003675,0.0031
50%,0.0228,0.0308,0.0343,0.04405,0.0625,0.09215,0.10695,0.1121,0.15225,0.1824,...,0.0139,0.0114,0.00955,0.0093,0.0075,0.00685,0.00595,0.0058,0.0064,0.0053
75%,0.03555,0.04795,0.05795,0.0645,0.100275,0.134125,0.154,0.1696,0.233425,0.2687,...,0.020825,0.016725,0.0149,0.0145,0.0121,0.010575,0.010425,0.01035,0.010325,0.008525
max,0.1371,0.2339,0.3059,0.4264,0.401,0.3823,0.3729,0.459,0.6828,0.7106,...,0.1004,0.0709,0.039,0.0352,0.0447,0.0394,0.0355,0.044,0.0364,0.0439


In [20]:
# Counting how many rocks and mines are in the dataset
# Column 60 (the 61st and last column) holds the target variable: Rock (R) or Mine (M)
sonar_dataset[60].value_counts()

M    111
R     97
Name: 60, dtype: int64

Mines = M <br>
Rocks = R

<p style="font-size:18px;">We check this to make sure that the numbers of rocks and mines are not very different, because a roughly balanced dataset helps the model predict both classes well. <br>
    <b> In general, more data tends to make the model more accurate </b></p>
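
The balance check can also be expressed as proportions. The sketch below rebuilds a label column from the counts reported above (111 mines, 97 rocks) rather than loading the CSV, so it runs on its own.

```python
import pandas as pd

# Rebuild a label column with the counts reported above: 111 mines, 97 rocks
labels = pd.Series(["M"] * 111 + ["R"] * 97)

# normalize=True returns each class as a fraction of all rows
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

With roughly 53% mines and 47% rocks, neither class dominates the dataset.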

### Datatypes of Features

In [11]:
sonar_dataset.dtypes

0     float64
1     float64
2     float64
3     float64
4     float64
       ...   
56    float64
57    float64
58    float64
59    float64
60     object
Length: 61, dtype: object

Here we find that column 60, which is the target (label), is of type object, so we need to convert it to a numeric type like the feature columns.

## Splitting the data into Input and Output

In [47]:
# The input data is every column except the target (column 60)
data_input = sonar_dataset.drop(columns=60)  # columns=... drops columns; index=... would drop rows
data_output = sonar_dataset[60]
data_input.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0033,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0241,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0156,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094


In [48]:
# Encode the labels as integers: R (rock) becomes 0 and M (mine) becomes 1
data_output = data_output.replace({'R': 0, 'M': 1})
data_output.unique()  # Returns the unique values in data_output

array([0, 1], dtype=int64)

### Train Test Split

In [60]:
X, X_test, y, y_test = train_test_split(data_input,
                                        data_output,
                                        test_size=0.1,
                                        stratify=data_output,
                                        random_state=2)
# test_size >> fraction of the data held out as the test set (10% of the whole dataset)
# stratify=data_output >> keeps the proportion of rocks and mines almost the same in both parts of the split
# random_state >> train_test_split shuffles the data randomly before splitting,
# so fixing this number makes the split the same on every run
# ----------------------------------------------------
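
To see what `stratify` does, here is a small sketch on synthetic labels with the same class counts as the sonar data (97 rocks, 111 mines). The features are placeholders, so only the class proportions are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the sonar data: 97 rocks (0), 111 mines (1)
y_demo = np.array([0] * 97 + [1] * 111)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=2
)

# With stratify, the share of mines in the test set stays close to the overall share
print("Mine share overall:", y_demo.mean())
print("Mine share in stratified test set:", y_test_strat.mean())
```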

In [61]:
# Split X, y into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X,
                                                  y,
                                                  test_size=0.33,
                                                  random_state=2)
# test_size >> fraction held out for validation: 33% of X, the data left after removing the test set
# random_state >> fixed so that the split is the same on every run
# ----------------------------------------------------

In [62]:
# Split the X, y to training set and validation set
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=0.33, random_state =2)

In [63]:
print("Shape of The Training input : ", X_train.shape)
print("Shape of The Training Output : ",y_train.shape) # Output of Training
print("Shape of The Validation input : ",X_val.shape)
print("Shape of The Validation Output : ",y_val.shape) # Output of Validation
print("Shape of The Test input : ",X_test.shape)
print("Shape of The Test Output : ",y_test.shape) # Output of Test

Shape of The Training input :  (125, 60)
Shape of The Training Output :  (125,)
Shape of The Validation input :  (62, 60)
Shape of The Validation Output :  (62,)
Shape of The Test input :  (21, 60)
Shape of The Test Output :  (21,)
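
The sizes above can be reproduced with a little arithmetic. This sketch assumes that `train_test_split` rounds a fractional test size up to the next whole sample, which is consistent with the shapes printed above.

```python
import math

n_samples = 208                          # rows in the full dataset

n_test = math.ceil(0.1 * n_samples)      # first split: test_size=0.1
n_remaining = n_samples - n_test         # rows left for training + validation

n_val = math.ceil(0.33 * n_remaining)    # second split: test_size=0.33
n_train = n_remaining - n_val

print(n_train, n_val, n_test)  # 125 62 21
```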


# Model Training >>> Logistic Regression

In [64]:
model = LogisticRegression()

In [65]:
# Training the Logistic Regression with Training Data
model.fit(X_train,y_train)

LogisticRegression()

## Checking the Accuracy Score of Model ( Model Evaluation )

In [67]:
# Accuracy on Training Data
X_train_prediction = model.predict(X_train) # Prediction of model on training data
# Compare the prediction of the model and real value
training_data_accuracy = (accuracy_score(y_train,X_train_prediction))*100 # y_train >> real value of training
print("Accuracy on Training Data = {} %".format(training_data_accuracy))

Accuracy on Training Data = 84.8 %


In [68]:
# Accuracy on Validation Data
y_pred_val = model.predict(X_val) # Prediction of model on validation data
# Compare the prediction of the model and real value
validation_data_accuracy = (accuracy_score(y_val,y_pred_val))*100 # y_val >> real value of validation
print("Accuracy on Validation Data = {} %".format(validation_data_accuracy))

Accuracy on Validation Data = 74.19354838709677 %
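
As a quick illustration of what `accuracy_score` measures, the sketch below computes the fraction of matching labels by hand and compares it with the library result. The labels are made up for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration (0 = rock, 1 = mine)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

manual_accuracy = np.mean(y_true == y_pred)        # fraction of matching labels
library_accuracy = accuracy_score(y_true, y_pred)  # same quantity from scikit-learn

print(manual_accuracy, library_accuracy)
```

By the same definition, the validation accuracy of 74.19% on 62 samples corresponds to 46 correct predictions (46 / 62).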


# Making a Predictive System

## Now we test the model with new data (Test Dataset)

In [69]:
new_model = LogisticRegression()

In [70]:
new_model.fit(X_train,y_train)

LogisticRegression()

In [76]:
y_pred_test = new_model.predict(X_test)
print("The Accuracy of Test set = {} %".format(accuracy_score(y_test,y_pred_test)*100))

The Accuracy of Test set = 66.66666666666666 %


In [84]:
input_data = (0.0286,0.0453,0.0277,0.0174,0.0384,0.0990,0.1201,0.1833,0.2105,0.3039,0.2988,0.4250,0.6343,0.8198,1.0000,0.9988,0.9508,0.9025,0.7234,0.5122,0.2074,0.3985,0.5890,0.2872,0.2043,0.5782,0.5389,0.3750,0.3411,0.5067,0.5580,0.4778,0.3299,0.2198,0.1407,0.2856,0.3807,0.4158,0.4054,0.3296,0.2707,0.2650,0.0723,0.1238,0.1192,0.1089,0.0623,0.0494,0.0264,0.0081,0.0104,0.0045,0.0014,0.0038,0.0013,0.0089,0.0057,0.0027,0.0051,0.0062)
input_data_2 = (0.0629,0.1065,0.1526,0.1229,0.1437,0.1190,0.0884,0.0907,0.2107,0.3597,0.5466,0.5205,0.5127,0.5395,0.6558,0.8705,0.9786,0.9335,0.7917,0.7383,0.6908,0.3850,0.0671,0.0502,0.2717,0.2839,0.2234,0.1911,0.0408,0.2531,0.1979,0.1891,0.2433,0.1956,0.2667,0.1340,0.1073,0.2023,0.1794,0.0227,0.1313,0.1775,0.1549,0.1626,0.0708,0.0129,0.0795,0.0762,0.0117,0.0061,0.0257,0.0089,0.0262,0.0108,0.0138,0.0187,0.0230,0.0057,0.0113,0.0131)
# Changing the Input Data to a Numpy Array
input_data_as_numpy_array = np.asarray(input_data)
input_data_2_as_numpy_array = np.asarray(input_data_2)

# reshape the np array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
input_data_reshaped_2 = input_data_2_as_numpy_array.reshape(1,-1)

prediction = new_model.predict(input_data_reshaped)
prediction2 = new_model.predict(input_data_reshaped_2)

if(prediction[0] == 0):
    print("Prediction of First Input Data : ",prediction," is a Rock")
else:
    print("Prediction of First Input Data : ",prediction," is a Mine")
if(prediction2[0] == 0):
    print("Prediction of Second Input Data : ",prediction2," is a Rock")
else:
    print("Prediction of Second Input Data : ",prediction2," is a Mine")



[1]
Prediction of First Input Data :  [0]  is a Rock
Prediction of Second Input Data :  [1]  is a Mine
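
The prediction steps above could be wrapped in a small helper so that each new reading is handled the same way. This is only a sketch: `classify_reading` is a hypothetical name, and the stand-in model is trained on random numbers purely so the example runs without the CSV file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

N_FEATURES = 60  # the sonar dataset has 60 numeric features per sample

def classify_reading(model, reading):
    """Return 'Rock' or 'Mine' for one sonar reading (hypothetical helper)."""
    features = np.asarray(reading, dtype=float)
    if features.shape != (N_FEATURES,):
        raise ValueError(f"expected {N_FEATURES} values, got shape {features.shape}")
    # reshape(1, -1) because the model expects a 2D array with one row per sample
    prediction = model.predict(features.reshape(1, -1))[0]
    return "Rock" if prediction == 0 else "Mine"

# Stand-in model trained on random data, only so the sketch runs on its own
rng = np.random.default_rng(0)
X_demo = rng.random((40, N_FEATURES))
y_demo = np.array([0, 1] * 20)
demo_model = LogisticRegression().fit(X_demo, y_demo)

print(classify_reading(demo_model, X_demo[0]))
```

In the notebook, the same helper would be called with `new_model` and a tuple such as `input_data`.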
