## Data Analytics III  
Implement the **Simple Naïve Bayes classification algorithm** using Python/R on iris.csv.
Compute the **confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall** on the given dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('Iris.csv')

In [3]:
df.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [29]:
flower_counts = df['Species'].value_counts()
print(flower_counts)

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64


In [4]:
df.drop(columns=['Id'], inplace=True)

In [5]:
df.head()

,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [6]:
df.isnull().sum()

SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [7]:
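# Forward-fill missing values. The check above found none, so this changes nothing here;
# the result is only displayed, not assigned back to df.
df.ffill()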

,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [8]:
df.dtypes

SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

In [12]:
# Independent variables (features)
X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
# Dependent variable (target/label)
y = df['Species']  # assuming the species is the label column

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [17]:
# Create and train the model
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
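
For intuition about what a Gaussian Naïve Bayes model estimates when it is fitted, here is a minimal NumPy sketch. It is not scikit-learn's implementation: the class name `TinyGaussianNB`, the variance floor `eps`, and the toy data are assumptions made for illustration. The "naïve" part is the assumption that features are independent given the class, so the per-feature log-likelihoods are simply summed.

```python
import numpy as np

class TinyGaussianNB:
    """Illustrative Gaussian Naive Bayes (hypothetical helper, not sklearn's code)."""

    def fit(self, X, y, eps=1e-9):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Per-class prior, feature means, and feature variances
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # eps is an assumed small floor that avoids division by zero
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + eps
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Shape (n_samples, n_classes, n_features): distance of each sample to each class mean
        diff = X[:, None, :] - self.means_[None, :, :]
        # Sum of per-feature Gaussian log-densities (the independence assumption)
        log_lik = -0.5 * (np.log(2 * np.pi * self.vars_)[None, :, :]
                          + diff ** 2 / self.vars_[None, :, :]).sum(axis=2)
        log_post = np.log(self.priors_)[None, :] + log_lik
        return self.classes_[np.argmax(log_post, axis=1)]

# Toy data (made up): two well-separated classes with two features each
X_toy = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                  [5.0, 5.2], [5.1, 4.9], [4.9, 5.0]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

model = TinyGaussianNB().fit(X_toy, y_toy)
preds = model.predict(np.array([[1.0, 1.0], [5.0, 5.0]]))
print(preds)
```

Each class is summarised by a prior plus one mean and one variance per feature, and prediction picks the class with the highest log-posterior.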

In [23]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,precision_score,recall_score


# Accuracy, Precision, and Recall

## 1. **Accuracy**  
**What it means:**  
Accuracy tells you **how often your model was correct overall**.

- It’s the **percentage of correct predictions** out of all predictions made.
  
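**Formula:**  Accuracy = (Correct Predictions) / (Total Predictions). Multiply by 100 to express it as a percentage; `accuracy_score` returns the fraction (a value between 0 and 1).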
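
As a worked example, the sketch below applies the formula to the confusion matrix that this notebook prints further down (the values are copied by hand, and the variable names are illustrative). It also derives the error rate as the complement of accuracy.

```python
import numpy as np

# Confusion matrix copied from this notebook's output (rows: actual, columns: predicted)
cm = np.array([[19, 0, 0],
               [0, 12, 1],
               [0, 0, 13]])

correct = np.trace(cm)   # diagonal entries are the correct predictions
total = cm.sum()         # every test sample appears exactly once
accuracy = correct / total
error_rate = 1 - accuracy

print(f"Accuracy:   {accuracy:.4f}")
print(f"Error rate: {error_rate:.4f}")
```

With 44 of 45 test flowers classified correctly, the accuracy is 44/45 and the error rate is 1/45.
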
## 2. **Precision**  
**What it means:**  
Precision tells you **how many of the predicted positive results were actually correct**.

- It focuses on how good the model is at not making mistakes when it **predicts a specific class** (like a particular flower species).

**Formula:** Precision = (True Positives (TP)) / (True Positives (TP) + False Positives (FP))
## 3. **Recall**  
**What it means:**  
Recall tells you **how many of the actual positive cases were correctly identified** by the model.

- It focuses on how good the model is at **catching all the positives** (like identifying all the Setosa flowers).

**Formula:**  
Recall = (True Positives (TP)) / (True Positives (TP) + False Negatives (FN))
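
The precision and recall formulas can be checked by counting directly from label lists. The sketch below uses made-up labels and a hypothetical helper `precision_recall`. Returning 0.0 when a denominator is zero is a choice made here for illustration.

```python
# Made-up labels for illustration (not taken from the Iris test split)
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

def precision_recall(y_true, y_pred, positive):
    """Treat `positive` as the positive class and every other label as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for label in ["setosa", "versicolor", "virginica"]:
    p, r = precision_recall(y_true, y_pred, label)
    print(f"{label}: precision={p:.3f}, recall={r:.3f}")
```

In this toy data, one versicolor flower is predicted as virginica, which lowers recall for versicolor (a missed positive) and precision for virginica (a false alarm).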

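## What `average` means in precision and recall
### Suppose you played 3 sports:  
Football (100 matches)  
Tennis (10 matches)  
Cricket (5 matches)  
When working out your overall win rate:  
**Micro**: Pool all wins and all matches together (100 + 10 + 5) and compute one rate.  
**Macro**: Find the win rate for each sport, then take the simple average of the three.  
**Weighted**: Find the win rate for each sport, then average them so that football counts more because you played it far more often.  

In scikit-learn, the sports correspond to classes: `average='micro'` pools the TP, FP, and FN counts across classes, `average='macro'` averages the per-class scores equally, and `average='weighted'` weights each per-class score by its support (the number of true samples of that class).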
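
To make the three averaging modes concrete, the sketch below computes micro, macro, and weighted precision by hand. It uses the confusion matrix this notebook reports for the Iris test split (values copied by hand; the variable names are illustrative).

```python
import numpy as np

# Confusion matrix copied from this notebook's output (rows: actual, columns: predicted)
cm = np.array([[19, 0, 0],
               [0, 12, 1],
               [0, 0, 13]])

tp = np.diag(cm)
predicted = cm.sum(axis=0)   # how often each class was predicted (TP + FP)
support = cm.sum(axis=1)     # how many true samples each class has
per_class = tp / predicted   # precision of each class

micro = tp.sum() / predicted.sum()                  # pool all counts first
macro = per_class.mean()                            # every class counts equally
weighted = np.average(per_class, weights=support)   # larger classes count more

print(f"micro={micro:.4f}, macro={macro:.4f}, weighted={weighted:.4f}")
```

The macro value agrees with the macro precision of about 0.976 that the notebook prints for this split.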

In [28]:
# Use the trained model (gaussian) to predict the species of the flowers in X_test
y_pred = gaussian.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Macro average: compute the metric per class, then take the unweighted mean
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')

print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall : ", recall)

Accuracy:  0.9777777777777777
Precision:  0.9761904761904763
Recall :  0.9743589743589745


### Confusion Matrix
|                    | Predicted Setosa | Predicted Versicolor | Predicted Virginica |
|--------------------|------------------|----------------------|---------------------|
| **Actual Setosa**    | 19               | 0                    | 0                   |
| **Actual Versicolor**| 0                | 12                   | 1                   |
| **Actual Virginica** | 0                | 0                    | 13                  |


**19 flowers were actually Setosa and were correctly predicted as Setosa ✅  
12 flowers were actually Versicolor and were correctly predicted as Versicolor ✅  
13 flowers were actually Virginica and were correctly predicted as Virginica ✅  
1 flower was actually Versicolor but was wrongly predicted as Virginica ❌**  



In [25]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix: \n",cm)

Confusion matrix: 
 [[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]


In [31]:
# Compute TP, FP, FN, TN for each class, treating that class as positive and all others as negative
for i in range(len(cm)):
    TP = cm[i,i]
    FP = np.sum(cm[:,i]) - TP
    FN = np.sum(cm[i,:]) - TP
    TN = np.sum(cm) - (TP + FP + FN)

    print(f"Class {i}:")
    print(f"TP: {TP}, FP: {FP}, FN: {FN}, TN: {TN}\n")

Class 0:
TP: 19, FP: 0, FN: 0, TN: 26

Class 1:
TP: 12, FP: 0, FN: 1, TN: 32

Class 2:
TP: 13, FP: 1, FN: 0, TN: 31
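
The same one-vs-rest counts can be computed for all classes at once, and they also give a per-class accuracy and error rate. This is a sketch with the confusion matrix hard-coded from the output above; names such as `ovr_accuracy` are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[19, 0, 0],
               [0, 12, 1],
               [0, 0, 13]])

total = cm.sum()
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp      # predicted as the class, but actually another class
fn = cm.sum(axis=1) - tp      # actually the class, but predicted as another class
tn = total - (tp + fp + fn)   # everything that does not involve the class at all

# One-vs-rest accuracy and error rate: the class against all others
ovr_accuracy = (tp + tn) / total
ovr_error = (fp + fn) / total

for i in range(len(cm)):
    print(f"Class {i}: accuracy={ovr_accuracy[i]:.4f}, error rate={ovr_error[i]:.4f}")
```

These per-class values answer "is this flower class `i` or not?", so they are a different quantity from the overall multi-class accuracy.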

