1. What assumptions are made about the attributes in Naïve Bayes method of classification and why? 
What is Laplacian correction and why it is necessary?

Ans: The Naïve Bayes classifier is a supervised machine learning algorithm, which is used for classification tasks, like text classification. This method makes a key assumption about the independence of the features or attributes used in the classification. This assumption is "naïve" because it assumes that the presence or absence of a particular feature is independent of the presence or absence of other features, given the class label. It also assumes that all features contribute equally to the outcome. 

For example, in a text classification task, if you're trying to classify emails as spam or not spam, Naïve Bayes assumes that the occurrence of individual words in the email is independent of each other, given the information about whether the email is spam or not.

This assumption simplifies the computation of probabilities and makes the algorithm computationally efficient. However, in real-world data, features are often correlated, and this independence assumption may not hold true. Despite this simplification, Naïve Bayes often performs well in practice, especially in text classification tasks, where the independence assumption may not be too far from reality.

Laplace correction, also known as Laplacian smoothing or add-one smoothing, is a technique used to handle the problem of zero probabilities in Bayesian statistics, particularly in the context of Naive Bayes classifiers. 

In the context of Naive Bayes classifiers, Laplace correction is applied when estimating probabilities from the training data. The issue arises when a particular feature value in the dataset has not been observed with a certain class label. In such cases, the probability estimation for that feature given the class becomes zero, which can lead to problems during classification, especially when combining probabilities in a Naive Bayes model.

Laplace correction involves adding a small, non-zero constant (usually 1) to all the observed counts to avoid zero probabilities. The idea is to "smooth" the probability estimates and prevent the model from being overly confident in its predictions based on limited data. 

### Use Naïve Bayes algorithm from scratch for the following questions.

2. Take the iris data. Use Naïve Bayes algorithm to find the species with the following two observations.

[4.5 3.0 5.6 2.1; 5.4 2.6 4.5 0.0]

Show various steps involved and do as explained in the class.

In [1]:
# Import necessary libraries and load the iris dataset
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [3]:
# Print aspects of the dataset
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [4]:
# Print classes/species
target_names=iris.target_names
print(target_names)

['setosa' 'versicolor' 'virginica']


In [5]:
# Features of the dataset
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [6]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder
label_encoder = LabelEncoder()

# Fit the LabelEncoder on the target variable (y) to learn the mapping
label_encoder.fit(y)

# Transform numerical labels to species names
species_names = label_encoder.inverse_transform(y)

# Print the result
print(species_names)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [7]:
# Mapping each target value in y to its corresponding species name in target_names 
# 0 to setosa, 1 to versicolor and 2 to virginica

species_names = [target_names[i] for i in y]
print(species_names)

['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicol

In [8]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [9]:
(pd.DataFrame(X)).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       150 non-null    float64
 1   1       150 non-null    float64
 2   2       150 non-null    float64
 3   3       150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


In [10]:
pd.DataFrame(X).describe()

Unnamed: 0,0,1,2,3
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [11]:
# Renaming the columns

df = pd.DataFrame(X)
df.columns = ['SL', 'SW', "PL","PW"]

In [12]:
df.describe()

Unnamed: 0,SL,SW,PL,PW
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [13]:
df.isnull().sum()

SL    0
SW    0
PL    0
PW    0
dtype: int64

In [14]:
X_train=df
print(X_train.shape)
print(X_train.head())

(150, 4)
    SL   SW   PL   PW
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2


In [15]:
y_train=y
print(y_train)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


Calculate conditional probabilty

In [16]:
# Define function to calculate conditional probability

def calculate_conditional_probability(obs, data_mean, data_std, target_name):
    """
    Define function to calculate conditional probability
    obs: List[float]
    data_mean: List[float]
    data_std: List[float]
    target_name: String
    """
    probabilities = []
    for i in range(4):
        probability = np.prod((1 / (np.sqrt(2 * np.pi * data_std[i]))) * np.exp(-(obs[i] - data_mean[i]) ** 2 / (2 * data_std[i])))
        probabilities.append(probability)
        print(f"Probability of {iris.feature_names[i]} given the class {target_name}: {probability}")
    return np.prod(probabilities)


def print_class_info(class_name, data_mean, data_std):
    print(f"For class {class_name} the mean values for features are:\n{data_mean}")
    print("\n")
    print(f"For class {class_name} the standard deviation values for features are:\n{data_std}")
    print("\n")

In [17]:
# Calculate prior probabilities of each class

count = pd.DataFrame(y_train).value_counts()
_sum = pd.DataFrame(y_train).value_counts().sum()
setosa_priori_probability = count[0] / _sum
versicolor_priori_probability = count[1] / _sum
virginica_priori_probability = count[2] / _sum

print(f"Prior probability of Setosa: {setosa_priori_probability}")
print(f"Prior probability of Versicolor: {versicolor_priori_probability}")
print(f"Prior probability of Virginica: {virginica_priori_probability}")

Prior probability of Setosa: 0.3333333333333333
Prior probability of Versicolor: 0.3333333333333333
Prior probability of Virginica: 0.3333333333333333


In [18]:
# Calculate mean and standard deviation for each class

setosa_data = X_train[y_train == 0]
setosa_data_mean = setosa_data.mean()
setosa_data_std = setosa_data.std()

versicolor_data = X_train[y_train == 1]
versicolor_data_mean = versicolor_data.mean()
versicolor_data_std = versicolor_data.std()

virginica_data = X_train[y_train == 2]
virginica_data_mean = virginica_data.mean()
virginica_data_std = virginica_data.std()


# Print class information

print_class_info("Setosa", setosa_data_mean, setosa_data_std)
print_class_info("Versicolor", versicolor_data_mean, versicolor_data_std)
print_class_info("Virginica", virginica_data_mean, virginica_data_std)

For class Setosa the mean values for features are:
SL    5.006
SW    3.428
PL    1.462
PW    0.246
dtype: float64


For class Setosa the standard deviation values for features are:
SL    0.352490
SW    0.379064
PL    0.173664
PW    0.105386
dtype: float64


For class Versicolor the mean values for features are:
SL    5.936
SW    2.770
PL    4.260
PW    1.326
dtype: float64


For class Versicolor the standard deviation values for features are:
SL    0.516171
SW    0.313798
PL    0.469911
PW    0.197753
dtype: float64


For class Virginica the mean values for features are:
SL    6.588
SW    2.974
PL    5.552
PW    2.026
dtype: float64


For class Virginica the standard deviation values for features are:
SL    0.635880
SW    0.322497
PL    0.551895
PW    0.274650
dtype: float64




In [19]:
# Observations:
observations = np.array([[4.5, 3.0, 5.6, 2.1], [5.4, 2.6, 4.5, 0.0]])


# Calculate conditional Probability
for obs in observations:
    print("\n\nFor observation", obs, "\n")

    # Calculate conditional probabilities
    probability_setosa = calculate_conditional_probability(obs, setosa_data_mean, setosa_data_std, target_names[0])
    print("\n")
    probability_versicolor = calculate_conditional_probability(obs, versicolor_data_mean, versicolor_data_std, target_names[1])
    print("\n")
    probability_virginica = calculate_conditional_probability(obs, virginica_data_mean, virginica_data_std, target_names[2])

    # Print results
    print(f"\nThe conditional probability for class Setosa = {probability_setosa}")
    print(f"The conditional probability for class Versicolor = {probability_versicolor}")
    print(f"The conditional probability for class Virginica = {probability_virginica}\n")

    # Now multiply prior probability and conditional probability of each class to obtain final probabilities
    print(f"The probability for class Setosa = {setosa_priori_probability * probability_setosa}")
    print(f"The probability for class Versicolor = {versicolor_priori_probability * probability_versicolor}")
    print(f"The probability for class Virginica = {virginica_priori_probability * probability_virginica}\n")




For observation [4.5 3.  5.6 2.1] 

Probability of sepal length (cm) given the class setosa: 0.4673140174355561
Probability of sepal width (cm) given the class setosa: 0.508881338060571
Probability of petal length (cm) given the class setosa: 3.720658979326054e-22
Probability of petal width (cm) given the class setosa: 1.0160666574351192e-07


Probability of sepal length (cm) given the class versicolor: 0.0753378511718843
Probability of sepal width (cm) given the class versicolor: 0.6546032519987927
Probability of petal length (cm) given the class versicolor: 0.08612916635209425
Probability of petal width (cm) given the class versicolor: 0.19725072065531182


Probability of sepal length (cm) given the class virginica: 0.016233371073797832
Probability of sepal width (cm) given the class virginica: 0.7017659587696752
Probability of petal length (cm) given the class virginica: 0.5358897238768604
Probability of petal width (cm) given the class virginica: 0.7536864664865808

The condition

We observe that:
- for the first observation [4.5, 3.0, 5.6, 2.1], the highest probability is for class Virginica
- for the second observation [5.4, 2.6, 4.5, 0.0], the highest probability is for class Versicolor

### Verify your answer with sklearn
iris Data:

In [20]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
# Train the Naïve Bayes model
naive_bayes_model = GaussianNB()
naive_bayes_model.fit(X_train, y_train)

In [22]:
# Observations
observations = [
    [4.5, 3.0, 5.6, 2.1],
    [5.4, 2.6, 4.5, 0.0]
]

In [23]:
# Predict probabilities for each observation
predicted_probabilities = naive_bayes_model.predict_proba(observations)

# Print the predicted probabilities for each species for each observation
for i, probs in enumerate(predicted_probabilities):
    observation = observations[i]
    print(f"For observation {observation}:")
    
    for j, prob in enumerate(probs):
        species = iris.target_names[j]
        print(f"The probability for class {species} = {prob}")

    print()

For observation [4.5, 3.0, 5.6, 2.1]:
The probability for class setosa = 1.419620587145963e-179
The probability for class versicolor = 7.722702329486218e-05
The probability for class virginica = 0.9999227729767053

For observation [5.4, 2.6, 4.5, 0.0]:
The probability for class setosa = 3.9083106837588104e-54
The probability for class versicolor = 0.9992168624477715
The probability for class virginica = 0.0007831375522288574



We observe that:

- for the first observation [4.5, 3.0, 5.6, 2.1], the highest probability is for class Virginica
- for the second observation [5.4, 2.6, 4.5, 0.0], the highest probability is for class Versicolor

****************************************************************************************************

3. Consider the data of play tennis or not. Predict the output for the following cases.

case 1. Outlook: Sunny, Temp: Hot, Humidity: High, Wind= Strong

case 2. Outlook = Overcast, Temp = Mild, humidity = High, Wind = Strong

Find probability of Play = Yes, No

P(Yes) = 9/14
P(No) = 5/14

Calculate probability of Outlook = Sunny, Overcast, Rainy

- when Play = Yes

P(Outlook=Sunny|Play=Yes) = 2/9

P(Outlook=Overcast|Play=Yes) = 4/9

P(Outlook=Rainy|Play=Yes) = 3/9


- when Play = No

P(Outlook=Sunny|Play=No) = 3/5

P(Outlook=Overcast|Play=No) = 0

P(Outlook=Rainy|Play=No) = 2/5

Calculate probability of Temperature = Hot, Mild, Cold

- when Play = Yes

P(Temperature=Hot|Play=Yes) = 2/9

P(Temperature=Mild|Play=Yes) = 4/9

P(Temperature=Cold|Play=Yes) = 3/9


- when Play = No

P(Temperature=Hot|Play=No) = 2/5

P(Temperature=Mild|Play=No) = 2/5

P(Temperature=Cold|Play=No) = 1/5

Calculate probability of Temperature = High, Normal

- when Play = Yes

P(Humidity=High|Play=Yes) = 3/9

P(Humidity=Normal|Play=Yes) = 6/9


- when Play = No

P(Humidity=High|Play=No) = 4/5

P(Humidity=Normal|Play=No) = 4/5

Calculate probability of Wind = Strong, Weak

- when Play = Yes

P(Wind=Strong|Play=Yes) = 3/9

P(Wind=Weak|Play=Yes) = 6/9


- when Play = No

P(Wind=Strong|Play=No) = 3/4

P(Wind=Weak|Play=No) = 2/5

Case 1 Calculation:

X = Outlook: Sunny, Temp: Hot, Humidity: High, Wind= Strong

P(Yes|X) = [P(Sunny|Yes)P(Hot|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) 
         = (2/9)(2/9)(3/9)(3/9)(9/14)
         = 0.00352

P(No|X) = [P(Sunny|No)P(Hot|No)P(High|No)P(Strong|No)]P(Play=No) 
        = (3/5)(2/5)(4/5)(3/5)(5/14)
        = 0.0411

P(No|X) > P(Yes|X)

Hence case 1 prediction is No

Case 2 Calculation:

X = Outlook = Overcast, Temp = Mild, humidity = High, Wind = Strong

P(Yes|X) = [P(Overcast|Yes)P(Mild|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) 
         = (4/9)(4/9)(3/9)(3/9)(9/14)
         = 0.0141

P(No|X) = [P(Overcast|Yes)P(Mild|Yes)P(High|Yes)P(Strong|Yes)]P(Play=No) 
        = 0(2/5)(4/5)(3/5)(5/14)
        = 0

P(Yes|X) > P(No|X)

Hence case 2 prediction is yes

### Verify your answer with sklearn

Tennis Data:

In [24]:
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

In [25]:
# Create a DataFrame from the given data
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}

df = pd.DataFrame(data)

In [26]:
df

Unnamed: 0,Outlook,Temperature,Humidity,Wind,PlayTennis
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rainy,Mild,High,Weak,Yes
4,Rainy,Cool,Normal,Weak,Yes
5,Rainy,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rainy,Mild,Normal,Weak,Yes


In [27]:
# Convert categorical data to numerical using OrdinalEncoder
encoder = OrdinalEncoder()
df_encoded = pd.DataFrame(encoder.fit_transform(df[['Outlook', 'Temperature', 'Humidity', 'Wind']]), columns=['Outlook', 'Temperature', 'Humidity', 'Wind'])
df_encoded['PlayTennis'] = df['PlayTennis']

In [28]:
df_encoded

Unnamed: 0,Outlook,Temperature,Humidity,Wind,PlayTennis
0,2.0,1.0,0.0,1.0,No
1,2.0,1.0,0.0,0.0,No
2,0.0,1.0,0.0,1.0,Yes
3,1.0,2.0,0.0,1.0,Yes
4,1.0,0.0,1.0,1.0,Yes
5,1.0,0.0,1.0,0.0,No
6,0.0,0.0,1.0,0.0,Yes
7,2.0,2.0,0.0,1.0,No
8,2.0,0.0,1.0,1.0,Yes
9,1.0,2.0,1.0,1.0,Yes


In [29]:
# Train the Naive Bayes model
X = df_encoded[['Outlook', 'Temperature', 'Humidity', 'Wind']]
y = df_encoded['PlayTennis']

naive_bayes_model = CategoricalNB()
naive_bayes_model.fit(X, y)

In [30]:
# Store the test case conditions in new_cases
new_cases = pd.DataFrame({'Outlook': ['Sunny', 'Overcast'],
                           'Temperature': ['Hot', 'Mild'],
                           'Humidity': ['High', 'High'],
                           'Wind': ['Strong', 'Strong']})

In [31]:
# Encode the new cases
new_cases_encoded = pd.DataFrame(encoder.transform(new_cases), columns=['Outlook', 'Temperature', 'Humidity', 'Wind'])

In [32]:
new_cases_encoded

Unnamed: 0,Outlook,Temperature,Humidity,Wind
0,2.0,1.0,0.0,0.0
1,0.0,2.0,0.0,0.0


In [33]:
# Make predictions
predictions = naive_bayes_model.predict(new_cases_encoded)

for i in range(len(new_cases)):
    cases = {}
    for column in new_cases.columns:
        cases[column] = new_cases.loc[i, column]
    
    print(f"\nCase {i + 1}: {cases} - \n\tPredicted Play Tennis: {predictions[i]}")



Case 1: {'Outlook': 'Sunny', 'Temperature': 'Hot', 'Humidity': 'High', 'Wind': 'Strong'} - 
	Predicted Play Tennis: No

Case 2: {'Outlook': 'Overcast', 'Temperature': 'Mild', 'Humidity': 'High', 'Wind': 'Strong'} - 
	Predicted Play Tennis: Yes
