Problem Statement: **Data Analytics III**

* Load the **iris.csv** dataset.
* Implement **Naïve Bayes classification** using Python/R.
* Train the model and make predictions.
* Compute the Confusion Matrix.
* Calculate:
  * True Positives (TP)
  * False Positives (FP)
  * True Negatives (TN)
  * False Negatives (FN)
  * Accuracy, Error Rate, Precision, Recall.

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Read the CSV File

In [2]:
df = pd.read_csv('6_Iris.csv')

In [3]:
df.head()

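|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |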


### Understand Data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [5]:
df.describe()

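|       | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|-------|----|---------------|--------------|---------------|--------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |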


### Drop the Id column because it carries no predictive value for the model

In [6]:
df = df.drop('Id', axis=1)

### Separate the features (X) and the target (y) for supervised learning

In [7]:
X = df.drop('Species', axis=1)
y = df['Species']

In [8]:
y

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object

### Split the dataset into training and test sets (80% train, 20% test)

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
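
The split above is purely random, so the test set need not contain the three species in equal proportion (the support column of the classification report further down shows 10, 9, and 11). scikit-learn's `train_test_split` accepts a `stratify` argument for keeping class proportions. The sketch below is not scikit-learn's implementation; it is a plain-Python illustration of the idea, and the helper name `stratified_split` and the assumed 50 samples per species are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Hypothetical helper: split indices so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Assumption: 50 samples per species, 150 rows in total
labels = (["Iris-setosa"] * 50
          + ["Iris-versicolor"] * 50
          + ["Iris-virginica"] * 50)

train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))
print(test_counts)
```

With the notebook's own data, passing `stratify=y` to `train_test_split` should give a comparably balanced test set.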

### Model Building

## Gaussian Naive Bayes (GaussianNB)

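Gaussian Naive Bayes is a probabilistic classification algorithm based on **Bayes' Theorem**.  
It assumes that the features are conditionally independent given the class (the "naive" assumption) and that each feature follows a normal (Gaussian) distribution within each class, which makes it a natural fit for continuous data.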

<h3>Bayes' Theorem:</h3>
<p>
  <strong>P(y | X) = [P(X | y) ⋅ P(y)] / P(X)</strong>
</p>

<h3>Gaussian Likelihood Formula:</h3>
<p>
  <strong>
    P(x<sub>i</sub> | y) = 
    (1 / √(2&pi;&sigma;<sub>y</sub><sup>2</sup>)) &middot; 
    exp[−(x<sub>i</sub> − &mu;<sub>y</sub>)<sup>2</sup> / (2&sigma;<sub>y</sub><sup>2</sup>)]
  </strong>
</p>

<h4>Where:</h4>
<ul>
  <li><strong>x<sub>i</sub></strong>: Value of the i-th feature</li>
  <li><strong>&mu;<sub>y</sub></strong>: Mean of that feature over the training samples of class <strong>y</strong></li>
  <li><strong>&sigma;<sub>y</sub><sup>2</sup></strong>: Variance of that feature over the training samples of class <strong>y</strong></li>
</ul>
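
To make the two formulas above concrete, here is a minimal sketch that evaluates the Gaussian likelihood per feature, multiplies the likelihoods under the independence assumption, and normalizes by P(X) to obtain the posterior. The per-class means and variances are made-up illustrative numbers, not values estimated from this dataset, and the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Gaussian likelihood P(x_i | y) for one feature."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(sample, class_stats, priors):
    """Return P(y | X) per class, treating features as independent given y."""
    joint = {}
    for label, stats in class_stats.items():
        likelihood = 1.0
        for x, (mean, var) in zip(sample, stats):
            likelihood *= gaussian_pdf(x, mean, var)
        joint[label] = priors[label] * likelihood   # P(X | y) * P(y)
    evidence = sum(joint.values())                  # P(X)
    return {label: value / evidence for label, value in joint.items()}

# Hypothetical (mean, variance) pairs for two features: petal length, petal width
class_stats = {
    "setosa": [(1.5, 0.03), (0.25, 0.01)],
    "versicolor": [(4.3, 0.22), (1.3, 0.04)],
}
priors = {"setosa": 0.5, "versicolor": 0.5}

probs = posterior([1.4, 0.2], class_stats, priors)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Practical implementations usually sum log-probabilities instead of multiplying raw densities, which avoids numerical underflow when there are many features.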

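### Use Cases
Gaussian Naive Bayes is commonly used in:
- **Classification of continuous measurements**, such as the sepal and petal dimensions in this dataset
- **Medical diagnosis**, e.g., predicting disease based on numerical test results
- **Real-time prediction systems**, where training and prediction must be fast and the features can reasonably be modeled as Gaussian

Text classification tasks such as spam detection or sentiment analysis are usually handled by the multinomial or Bernoulli variants of Naive Bayes instead, because word counts are discrete rather than continuous.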


In [11]:
from sklearn.naive_bayes import GaussianNB

In [12]:
model = GaussianNB()

In [13]:
model.fit(X_train, y_train)

### Predict on the Test Data

In [14]:
y_pred = model.predict(X_test)

### Evaluate the Results

In [15]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [16]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
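
The problem statement asks for TP, FP, TN, and FN, which for a three-class problem are defined per class in a one-vs-rest fashion. The sketch below derives them from the matrix printed above, assuming rows are true classes and columns are predicted classes (the layout `confusion_matrix` returns). The helper name `one_vs_rest_counts` is hypothetical, and the label order is assumed to be alphabetical.

```python
def one_vs_rest_counts(cm, labels):
    """Derive TP, FP, FN, TN for each class from a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    counts = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]                          # class i predicted as class i
        fn = sum(cm[i]) - tp                   # class i predicted as another class
        fp = sum(row[i] for row in cm) - tp    # other classes predicted as class i
        tn = total - tp - fn - fp              # everything not involving class i
        counts[label] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    return counts

# Confusion matrix copied from the output above
cm = [[10, 0, 0],
      [0, 9, 0],
      [0, 0, 11]]
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

counts = one_vs_rest_counts(cm, labels)
for label, c in counts.items():
    print(label, c)
```

Because the function only uses indexing and `sum`, it should also accept the NumPy array returned by `confusion_matrix` directly.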


In [17]:
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00        11

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30



In [18]:
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy

print(f"Accuracy: {accuracy:.2f}")
print(f"Error Rate: {error_rate:.2f}")

Accuracy: 1.00
Error Rate: 0.00
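
Accuracy, error rate, precision, and recall can also be computed by hand from the confusion matrix, which makes the definitions explicit: precision is TP / (TP + FP) and recall is TP / (TP + FN) for each class. The sketch below applies them to the notebook's matrix and, because a perfect score hides the arithmetic, to a second matrix that is purely hypothetical. The function name is an illustrative assumption.

```python
def metrics_from_confusion_matrix(cm):
    """Compute accuracy, error rate, and per-class precision and recall."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    accuracy = correct / total

    precision, recall = [], []
    for i in range(n):
        tp = cm[i][i]
        predicted_i = sum(row[i] for row in cm)   # TP + FP (column sum)
        actual_i = sum(cm[i])                     # TP + FN (row sum)
        precision.append(tp / predicted_i if predicted_i else 0.0)
        recall.append(tp / actual_i if actual_i else 0.0)

    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
    }

# Matrix from this notebook's test set
notebook_cm = [[10, 0, 0], [0, 9, 0], [0, 0, 11]]
# Hypothetical matrix with a few mistakes, for illustration only
hypothetical_cm = [[10, 0, 0], [0, 8, 1], [0, 2, 9]]

notebook_metrics = metrics_from_confusion_matrix(notebook_cm)
hypothetical_metrics = metrics_from_confusion_matrix(hypothetical_cm)
print(notebook_metrics)
print(hypothetical_metrics)
```

A perfect score on a 30-sample test set is plausible for Iris, but it is a small sample, so cross-validation would give a more reliable estimate.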
