<a href="https://www.kaggle.com/code/manishkr1754/sonar-rock-vs-mine-prediction?scriptVersionId=142552867" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>SONAR Rock Vs Mine Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

SONAR Rock Vs Mine Prediction is a **classification machine learning problem**. The project aims to develop a machine learning model that can accurately distinguish between metal cylinders (mines) and rocks based on SONAR return data.

## 2) Understanding Data
---

The project is based on SONAR return data. Each data point is a set of 60 numerical values between 0 and 1, each representing the energy within a specific frequency band, integrated over a period of time. The labels are 'M' for mines and 'R' for rocks. The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.


## 3) Getting System Ready
---
Importing required libraries

In [1]:
import numpy as np
import pandas as pd

# for model building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## 4) Data Eyeballing
---

### Loading Data

In [5]:
sonar_data = pd.read_csv('Datasets/Day1_Data_Sonar_Data.csv', header=None)

In [6]:
sonar_data

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.0180 | 0.0084 | 0.0090 | 0.0032 | R |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.0140 | 0.0049 | 0.0052 | 0.0044 | R |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.2280 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0232 | 0.0166 | 0.0095 | 0.0180 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 | R |
| 3 | 0.0100 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0121 | 0.0036 | 0.0150 | 0.0085 | 0.0073 | 0.0050 | 0.0044 | 0.0040 | 0.0117 | R |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.0590 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0031 | 0.0054 | 0.0105 | 0.0110 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 | R |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 203 | 0.0187 | 0.0346 | 0.0168 | 0.0177 | 0.0393 | 0.1630 | 0.2028 | 0.1694 | 0.2328 | 0.2684 | ... | 0.0116 | 0.0098 | 0.0199 | 0.0033 | 0.0101 | 0.0065 | 0.0115 | 0.0193 | 0.0157 | M |
| 204 | 0.0323 | 0.0101 | 0.0298 | 0.0564 | 0.0760 | 0.0958 | 0.0990 | 0.1018 | 0.1030 | 0.2154 | ... | 0.0061 | 0.0093 | 0.0135 | 0.0063 | 0.0063 | 0.0034 | 0.0032 | 0.0062 | 0.0067 | M |
| 205 | 0.0522 | 0.0437 | 0.0180 | 0.0292 | 0.0351 | 0.1171 | 0.1257 | 0.1178 | 0.1258 | 0.2529 | ... | 0.0160 | 0.0029 | 0.0051 | 0.0062 | 0.0089 | 0.0140 | 0.0138 | 0.0077 | 0.0031 | M |
| 206 | 0.0303 | 0.0353 | 0.0490 | 0.0608 | 0.0167 | 0.1354 | 0.1465 | 0.1123 | 0.1945 | 0.2354 | ... | 0.0086 | 0.0046 | 0.0126 | 0.0036 | 0.0035 | 0.0034 | 0.0079 | 0.0036 | 0.0048 | M |


In [7]:
print('The size of Dataframe is: ', sonar_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
sonar_data.info()
print('-'*100)

The size of Dataframe is:  (208, 61)

----------------------------------------------------------------------------------------------------

The Column Name, Record Count and Data Types are as follows: 

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 208 entries, 0 to 207

Data columns (total 61 columns):

 #   Column  Non-Null Count  Dtype  

---  ------  --------------  -----  

 0   0       208 non-null    float64

 1   1       208 non-null    float64

 2   2       208 non-null    float64

 3   3       208 non-null    float64

 4   4       208 non-null    float64

 5   5       208 non-null    float64

 6   6       208 non-null    float64

 7   7       208 non-null    float64

 8   8       208 non-null    float64

 9   9       208 non-null    float64

 10  10      208 non-null    float64

 11  11      208 non-null    float64

 12  12      208 non-null    float64

 13  13      208 non-null    float64

 14  14      208 non-null    float64

 15  15      208 non-null    float64

 16  

In [10]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in sonar_data.columns if sonar_data[feature].dtype != 'O']
categorical_features = [feature for feature in sonar_data.columns if sonar_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 60 numerical features : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]



We have 1 categorical features : [60]
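
The same numeric/categorical split can also be obtained with `DataFrame.select_dtypes`, which does not depend on the exact dtype that string columns are stored under. The following is a minimal sketch on a small hypothetical frame; `demo` and its values are made up for illustration and are not part of the sonar data.

```python
import pandas as pd

# Hypothetical miniature frame: two numeric columns and one label column.
demo = pd.DataFrame({0: [0.02, 0.45], 1: [0.37, 0.52], 2: ["R", "M"]})

# Select columns by dtype family instead of comparing each dtype to 'O'.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```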


In [8]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=sonar_data.isnull().sum().sort_values(ascending=False)
percent=(sonar_data.isnull().sum()/sonar_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

Missing Value Presence in different columns of DataFrame are as follows : 

----------------------------------------------------------------------------------------------------


| | Total | Percent |
|---|---|---|
| 0 | 0 | 0.0 |
| 31 | 0 | 0.0 |
| 33 | 0 | 0.0 |
| 34 | 0 | 0.0 |
| 35 | 0 | 0.0 |
| ... | ... | ... |
| 25 | 0 | 0.0 |
| 26 | 0 | 0.0 |
| 27 | 0 | 0.0 |
| 28 | 0 | 0.0 |


In [11]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
sonar_data.describe()

Summary Statistics of numerical features for DataFrame are as follows:

----------------------------------------------------------------------------------------------------


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | ... | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 |
| mean | 0.029164 | 0.038437 | 0.043832 | 0.053892 | 0.075202 | 0.10457 | 0.121747 | 0.134799 | 0.178003 | 0.208259 | ... | 0.016069 | 0.01342 | 0.010709 | 0.010941 | 0.00929 | 0.008222 | 0.00782 | 0.007949 | 0.007941 | 0.006507 |
| std | 0.022991 | 0.03296 | 0.038428 | 0.046528 | 0.055552 | 0.059105 | 0.061788 | 0.085152 | 0.118387 | 0.134416 | ... | 0.012008 | 0.009634 | 0.00706 | 0.007301 | 0.007088 | 0.005736 | 0.005785 | 0.00647 | 0.006181 | 0.005031 |
| min | 0.0015 | 0.0006 | 0.0015 | 0.0058 | 0.0067 | 0.0102 | 0.0033 | 0.0055 | 0.0075 | 0.0113 | ... | 0.0 | 0.0008 | 0.0005 | 0.001 | 0.0006 | 0.0004 | 0.0003 | 0.0003 | 0.0001 | 0.0006 |
| 25% | 0.01335 | 0.01645 | 0.01895 | 0.024375 | 0.03805 | 0.067025 | 0.0809 | 0.080425 | 0.097025 | 0.111275 | ... | 0.008425 | 0.007275 | 0.005075 | 0.005375 | 0.00415 | 0.0044 | 0.0037 | 0.0036 | 0.003675 | 0.0031 |
| 50% | 0.0228 | 0.0308 | 0.0343 | 0.04405 | 0.0625 | 0.09215 | 0.10695 | 0.1121 | 0.15225 | 0.1824 | ... | 0.0139 | 0.0114 | 0.00955 | 0.0093 | 0.0075 | 0.00685 | 0.00595 | 0.0058 | 0.0064 | 0.0053 |
| 75% | 0.03555 | 0.04795 | 0.05795 | 0.0645 | 0.100275 | 0.134125 | 0.154 | 0.1696 | 0.233425 | 0.2687 | ... | 0.020825 | 0.016725 | 0.0149 | 0.0145 | 0.0121 | 0.010575 | 0.010425 | 0.01035 | 0.010325 | 0.008525 |
| max | 0.1371 | 0.2339 | 0.3059 | 0.4264 | 0.401 | 0.3823 | 0.3729 | 0.459 | 0.6828 | 0.7106 | ... | 0.1004 | 0.0709 | 0.039 | 0.0352 | 0.0447 | 0.0394 | 0.0355 | 0.044 | 0.0364 | 0.0439 |


In [12]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
sonar_data.describe(include= 'object')

Summary Statistics of categorical features for DataFrame are as follows:

----------------------------------------------------------------------------------------------------


| | 60 |
|---|---|
| count | 208 |
| unique | 2 |
| top | M |
| freq | 111 |


In [13]:
sonar_data[60].value_counts()

60
M    111
R     97
Name: count, dtype: int64

- Here **`M`** stands for **Mine** and **`R`** stands for **Rock**

### No Data Cleaning and Preprocessing Needed
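
The conclusion that no cleaning is needed rests on three observations from the cells above: no missing values, all features within the 0 to 1 range, and only two label values. A minimal sketch of turning those observations into explicit checks follows. It runs on a hypothetical stand-in frame (`demo`, filled with random numbers) rather than on the actual CSV, so the same checks would need to be pointed at `sonar_data` to verify the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sonar_data: 60 float columns in [0, 1) plus a label column.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((208, 60)))
demo[60] = ["M"] * 111 + ["R"] * 97

features = demo.drop(columns=60)

# Each entry is True when the corresponding assumption holds.
checks = {
    "no_missing_values": not demo.isnull().any().any(),
    "features_within_0_1": bool(((features >= 0) & (features <= 1)).all().all()),
    "only_expected_labels": set(demo[60].unique()) <= {"M", "R"},
}
print(checks)
```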

## 5) Model Building : Logistic Regression
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [16]:
X = sonar_data.drop(columns=60)      # Feature Matrix (all columns except the label)
X

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0232 | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.0180 | 0.0084 | 0.0090 | 0.0032 |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0125 | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.0140 | 0.0049 | 0.0052 | 0.0044 |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.2280 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0033 | 0.0232 | 0.0166 | 0.0095 | 0.0180 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 |
| 3 | 0.0100 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0241 | 0.0121 | 0.0036 | 0.0150 | 0.0085 | 0.0073 | 0.0050 | 0.0044 | 0.0040 | 0.0117 |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.0590 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0156 | 0.0031 | 0.0054 | 0.0105 | 0.0110 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 203 | 0.0187 | 0.0346 | 0.0168 | 0.0177 | 0.0393 | 0.1630 | 0.2028 | 0.1694 | 0.2328 | 0.2684 | ... | 0.0203 | 0.0116 | 0.0098 | 0.0199 | 0.0033 | 0.0101 | 0.0065 | 0.0115 | 0.0193 | 0.0157 |
| 204 | 0.0323 | 0.0101 | 0.0298 | 0.0564 | 0.0760 | 0.0958 | 0.0990 | 0.1018 | 0.1030 | 0.2154 | ... | 0.0051 | 0.0061 | 0.0093 | 0.0135 | 0.0063 | 0.0063 | 0.0034 | 0.0032 | 0.0062 | 0.0067 |
| 205 | 0.0522 | 0.0437 | 0.0180 | 0.0292 | 0.0351 | 0.1171 | 0.1257 | 0.1178 | 0.1258 | 0.2529 | ... | 0.0155 | 0.0160 | 0.0029 | 0.0051 | 0.0062 | 0.0089 | 0.0140 | 0.0138 | 0.0077 | 0.0031 |
| 206 | 0.0303 | 0.0353 | 0.0490 | 0.0608 | 0.0167 | 0.1354 | 0.1465 | 0.1123 | 0.1945 | 0.2354 | ... | 0.0042 | 0.0086 | 0.0046 | 0.0126 | 0.0036 | 0.0035 | 0.0034 | 0.0079 | 0.0036 | 0.0048 |


In [17]:
y=sonar_data[60]    # Target Variable
y

0      R
1      R
2      R
3      R
4      R
      ..
203    M
204    M
205    M
206    M
207    M
Name: 60, Length: 208, dtype: object

### Train-Test Split

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)

In [20]:
print(X.shape, X_train.shape, X_test.shape)

(208, 60) (166, 60) (42, 60)


In [21]:
print(y.shape, y_train.shape, y_test.shape)

(208,) (166,) (42,)
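
The split above passes `stratify=y`, which asks scikit-learn to keep the class proportions roughly equal in the training and test sets. A minimal sketch of checking this is shown below. It uses hypothetical stand-in labels with the same class counts as the dataset (111 `M`, 97 `R`) and a dummy feature column, so it runs without the CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same class counts as the dataset (111 M, 97 R).
y_demo = pd.Series(["M"] * 111 + ["R"] * 97)
X_demo = pd.DataFrame({"dummy": range(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# Share of each class in the full data, the training split and the test split.
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions.round(3))
```

With a stratified split the three columns should be close to each other; without `stratify`, a test set this small could drift noticeably from the overall class balance.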


### Model Training

In [22]:
model = LogisticRegression()

In [23]:
#training the Logistic Regression model with training data
model.fit(X_train, y_train)

### Model Evaluation

In [24]:
#accuracy on training data
X_train_prediction = model.predict(X_train)
accuracy_training_data = accuracy_score(y_train, X_train_prediction)  # signature is (y_true, y_pred)

In [25]:
print('Accuracy on Training Data : ', accuracy_training_data)

Accuracy on Training Data :  0.8313253012048193


In [27]:
#accuracy on test data
X_test_prediction = model.predict(X_test)
accuracy_test_data = accuracy_score(y_test, X_test_prediction)  # signature is (y_true, y_pred)

In [28]:
print('Accuracy on test data : ',accuracy_test_data )

Accuracy on test data :  0.7857142857142857
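
Accuracy alone does not show which kind of mistake the model makes, and for this problem a mine predicted as a rock is a different error from a rock predicted as a mine. A minimal sketch of a confusion matrix follows. The labels in `y_true` and `y_pred` are hypothetical and chosen only to illustrate the calls; they are not the model's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, chosen only to illustrate the calls.
y_true = ["M", "M", "M", "R", "R", "R"]
y_pred = ["M", "M", "R", "R", "R", "M"]

# scikit-learn metrics take the true labels first, then the predictions.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["M", "R"])
missed_mines = cm[0, 1]   # true M predicted as R
false_alarms = cm[1, 0]   # true R predicted as M

print("accuracy:", acc)
print(cm)
print("missed mines:", missed_mines, "false alarms:", false_alarms)
```

In the notebook, the same calls could be applied to `y_test` and `X_test_prediction`.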


## 6) Model Comparison for Selection of Best Model
---
Model Comparison between **LogisticRegression, SVC, DecisionTreeClassifier** and **RandomForestClassifier**

### Importing necessary libraries

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [42]:
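# Model classes to compare, each instantiated with default hyperparameters
model_classes = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []

# The loop variable is named model_class so it does not overwrite the fitted `model` from section 5.
# Note: the tree-based models are not seeded here, so their scores can vary between runs.
for model_class in model_classes:
    classifier = model_class().fit(X_train, y_train)
    pred = classifier.predict(X_test)
    accuracy_scores.append(accuracy_score(y_true=y_test, y_pred=pred))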

In [43]:
classification_model_df = pd.DataFrame({
    "Model": ['Logistic Regression', 'Support Vector Classifier', 'Decision Tree Classifier',
              'Random Forest Classifier'],
    "Accuracy": accuracy_scores,
})

classification_model_df.set_index('Model', inplace=True)
classification_model_df

| Model | Accuracy |
|---|---|
| Logistic Regression | 0.785714 |
| Support Vector Classifier | 0.785714 |
| Decision Tree Classifier | 0.690476 |
| Random Forest Classifier | 0.809524 |
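
Because the test split holds only 42 samples, each accuracy in the table corresponds to a whole number of correct predictions. The short calculation below converts the reported accuracies back into counts, which helps in judging how large the gaps between models really are.

```python
n_test = 42  # size of the test split reported above

# Accuracies copied from the comparison table.
accuracies = {
    "Logistic Regression": 0.785714,
    "Support Vector Classifier": 0.785714,
    "Decision Tree Classifier": 0.690476,
    "Random Forest Classifier": 0.809524,
}

# Number of correctly classified test samples per model.
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}

for name, n_correct in correct.items():
    print(f"{name}: {n_correct} of {n_test} correct")
```

The Random Forest Classifier and Logistic Regression differ by a single test sample (34 versus 33 correct), so the ranking from this one split is only a weak signal.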


### Inference

- Based on accuracy score alone, the best model is the **Random Forest Classifier**. In practice, however, model selection should not rest solely on accuracy; **other evaluation metrics, business context and model interpretability** also need to be taken into account.
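
One way to make the comparison less dependent on a single train-test split is stratified k-fold cross-validation, which reports a mean and spread of accuracy over several splits. The sketch below is an illustration only. It runs on synthetic data from `make_classification` with the same shape as the sonar features (an assumption made so the block is self-contained), and `max_iter=1000`, `n_splits=5` and the seeds are arbitrary choices, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with the same shape as the sonar features (assumption).
X_demo, y_demo = make_classification(n_samples=208, n_features=60, random_state=0)

# Five stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
}

# Mean and standard deviation of accuracy across the folds, per model.
cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would give cross-validated scores for the actual data.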