# Notebook ICD - 14

### Libraries

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict, Counter

## Naive Bayes from scratch

This section implements a NaiveBayesClassifier class for categorical attributes with two main methods: fit and predict.

The __init__ method initializes the data structures that store the prior probability of each class (self.class_priors) and the conditional probability of each attribute value given a class (self.likelihoods).

The **fit** method estimates the class priors from the relative class frequencies in the training data. It then computes the likelihoods (conditional probabilities) with Laplace smoothing, so that an attribute value never observed together with a class still receives a small nonzero probability.

Finally, the **predict** method takes test instances, computes for each class the product of the prior and the likelihoods (which is proportional to the posterior probability), and assigns each instance the class with the highest value.
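
As a compact restatement of what fit and predict compute, the smoothed estimate and the decision rule can be written as below. The symbols \(N_c\) (number of training instances of class \(c\)), \(N_{c,j,v}\) (those among them where attribute \(j\) takes value \(v\)) and \(V_j\) (number of distinct values of attribute \(j\) in the training data) are notation introduced here for illustration, not names used in the code:

\[
\hat{P}(x_j = v \mid c) = \frac{N_{c,j,v} + 1}{N_c + V_j},
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{j} \hat{P}(x_j \mid c)
\]

For instance, if an attribute with three distinct values never takes value \(v\) among five instances of a class, the estimate is \((0 + 1)/(5 + 3) = 1/8\) rather than zero.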

In [2]:
class NaiveBayesClassifier:
    def __init__(self):
        self.class_priors = {}  # Prior probabilities of the classes
        self.likelihoods = {}   # Conditional probabilities (likelihoods)
        self.classes = None     # Unique classes in the dataset
        self.features = None    # Features (attributes)
    
    def fit(self, X, y):

        # Get the unique classes and features (attributes)
        self.classes = np.unique(y)
        self.features = X.columns
        total_samples = len(y)  # Total number of training instances
        
        # Estimate prior probabilities (relative frequency of each class)
        class_counts = y.value_counts().to_dict()
        self.class_priors = {cls: (class_counts[cls] / total_samples) for cls in self.classes}
        
        # Initialize conditional probabilities (likelihoods)
        self.likelihoods = {cls: {} for cls in self.classes}
        
        # Calculate the likelihoods (conditional probabilities) for each feature
        for cls in self.classes:
            X_cls = X[y == cls]  # Filter instances where the class is 'cls'
            total_cls_samples = len(X_cls)  # Number of instances per class
            
            # Calculate the likelihoods for each attribute and attribute value
            for feature in self.features:
                feature_counts = X_cls[feature].value_counts().to_dict()  # Frequency of each attribute value
                total_feature_values = len(X[feature].unique())  # Total number of possible attribute values
                
                # Apply Laplace smoothing and calculate the likelihoods
                self.likelihoods[cls][feature] = {
                    value: (feature_counts.get(value, 0) + 1) / (total_cls_samples + total_feature_values)
                    for value in X[feature].unique()
                }
    
    def predict(self, X_test):
        
        results = []
        
        # Iterate over each test instance
        for _, x in X_test.iterrows():
            class_probabilities = {}  # Store the posterior probabilities for each class
            
            # Calculate the posterior probability for each class
            for cls in self.classes:
                # Initialize with the prior probability of the class
                prob = self.class_priors[cls]
                
                # Multiply by the likelihoods (conditional probabilities) of each feature
                for feature in self.features:
                    value = x[feature]
                    prob *= self.likelihoods[cls][feature].get(value, 1 / (len(self.likelihoods[cls][feature]) + len(self.features)))
                
                # Store the calculated probability for the class
                class_probabilities[cls] = prob
            
            # Select the class with the highest posterior probability
            predicted_class = max(class_probabilities, key=class_probabilities.get)
            results.append(predicted_class)
        
        return results

### Implementation example

The 'Play Tennis' dataset is loaded to build a Naive Bayes classifier that predicts whether tennis will be played given weather conditions: outlook, temperature, humidity and wind. The 14 available instances serve as training data, and a new instance that is not part of the training set is then classified to illustrate how the model generalizes.

In [3]:
data = pd.read_csv('weather.nominal.csv')

# Define X (features) and y (label)
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]  # Last column (label)

# Train the classifier using the Naive Bayes algorithm with the original column names
nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(X, y)

# Create the instance to test: sunny, cool, high, True
test_instance = pd.DataFrame([{
    'outlook': 'sunny',
    'temperature': 'cool',
    'humidity': 'high',
    'windy': True
}])

# Make the prediction
prediction = nb_classifier.predict(test_instance)
print(f"Prediction for the instance {test_instance.iloc[0].to_dict()}: {prediction[0]}")

Prediction for the instance {'outlook': 'sunny', 'temperature': 'cool', 'humidity': 'high', 'windy': True}: no
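
The prediction can be checked by hand. The sketch below recomputes the two class scores with exact fractions. It assumes that weather.nominal.csv is the standard 14-row Weka weather dataset (9 'yes', 5 'no'); the counts are hard-coded from that assumption because the file contents are not shown here, and the variable names are illustrative.

```python
from fractions import Fraction

# ASSUMPTION: counts from the standard 14-row Weka weather.nominal data
# (9 'yes', 5 'no'); the CSV itself is not printed in this notebook.
class_counts = {"yes": 9, "no": 5}

# For each test value: (count within each class, number of distinct attribute values)
value_counts = {
    "outlook=sunny":    ({"yes": 2, "no": 3}, 3),
    "temperature=cool": ({"yes": 3, "no": 1}, 3),
    "humidity=high":    ({"yes": 3, "no": 4}, 2),
    "windy=True":       ({"yes": 3, "no": 3}, 2),
}

total = sum(class_counts.values())
scores = {}
for cls, n_cls in class_counts.items():
    score = Fraction(n_cls, total)  # prior probability of the class
    for counts, n_values in value_counts.values():
        # Laplace-smoothed likelihood, same formula as in fit
        score *= Fraction(counts[cls] + 1, n_cls + n_values)
    scores[cls] = score

for cls, score in scores.items():
    print(cls, float(score))
print("predicted:", max(scores, key=scores.get))
```

Under these assumed counts the score for 'no' is the larger of the two, which agrees with the output above.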


## Scikit-learn implementation

The Naive Bayes algorithm is a simple and efficient probabilistic classifier that assumes conditional independence between features. While this assumption may not always hold in real-world data, Naive Bayes often performs remarkably well in many applications.

**Naive Bayes Classifier**

The Naive Bayes algorithm is built on Bayes' Theorem, which is expressed as:

\[
P(C|X) = \frac{P(X|C)P(C)}{P(X)}
\]

where:
- \(P(C|X)\) represents the posterior probability of class \(C\) given the data \(X\),
- \(P(X|C)\) is the likelihood of the data given class \(C\),
- \(P(C)\) is the prior probability of class \(C\),
- \(P(X)\) is the probability of the data (which is constant for all classes and can be ignored for classification purposes).
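
To illustrate why \(P(X)\) can be dropped, the following minimal sketch compares the class chosen from the unnormalized scores \(P(X|C)P(C)\) with the class chosen from the normalized posteriors. The priors and likelihoods are made-up numbers for illustration only, not values from the notebook's data.

```python
# Hypothetical numbers for two classes (not taken from the notebook's data)
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {"yes": 0.02, "no": 0.05}  # P(X | C) for one observed X

# Unnormalized scores P(X | C) * P(C)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# P(X) is the sum of the scores over all classes (law of total probability)
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_score, best_by_posterior)
```

Dividing every score by the same positive constant does not change their order, so both selections return the same class.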

**Gaussian Naive Bayes**

In the case of Gaussian Naive Bayes (GaussianNB), the algorithm assumes that the features follow a Gaussian (normal) distribution. The likelihood of a feature \(x_i\) given a class \(C_k\) is calculated using the probability density function of the Gaussian distribution:

\[
P(x_i | C_k) = \frac{1}{\sqrt{2\pi \sigma_{ik}^2}} \exp\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)
\]

where:
- \( \mu_{ik} \) denotes the mean of feature \(x_i\) for class \(C_k\),
- \( \sigma_{ik}^2 \) is the variance of feature \(x_i\) for class \(C_k\),
- \( x_i \) represents the value of the feature for the given instance.
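
As a small worked example of this density, the sketch below estimates the mean and variance of Temperature for the rows with Play == True in the weather.numeric.csv table printed further down, and evaluates the density at a temperature of 60. The helper name gaussian_pdf is illustrative. The population variance (dividing by n) is used; scikit-learn's GaussianNB additionally adds a small smoothing term to the variances, which is ignored here.

```python
import math
from statistics import fmean, pvariance

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Temperature values of the rows with Play == True in weather.numeric.csv
temps_play = [83, 70, 68, 64, 69, 75, 75, 72, 81]

mu = fmean(temps_play)
var = pvariance(temps_play)  # population variance (divides by n)

print(mu, var)
print(gaussian_pdf(60, mu, var))
```

Note that the result is a density value, not a probability, so it may exceed 1 for features with a very small variance.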

### Library

In [4]:
from sklearn.naive_bayes import GaussianNB

### Dataset

In [5]:
df = pd.read_csv(r'weather.numeric.csv')

Show dataset

In [6]:
print(df)

    Day   Outlook  Temperature  Humidity    Wind   Play
0     1     sunny           85        85    weak  False
1     2     sunny           80        90  strong  False
2     3  overcast           83        86    weak   True
3     4      rain           70        96    weak   True
4     5      rain           68        80    weak   True
5     6      rain           65        70  strong  False
6     7  overcast           64        65  strong   True
7     8     sunny           72        95    weak  False
8     9     sunny           69        70    weak   True
9    10      rain           75        80    weak   True
10   11     sunny           75        70  strong   True
11   12  overcast           72        90  strong   True
12   13  overcast           81        75    weak   True
13   14      rain           71        91  strong  False


In [7]:
# defining the dependent and independent variables
X_train = df[['Outlook', 'Temperature', 'Humidity', 'Wind']]
y_train = df[['Play']]

print(X_train.head())
print(y_train.head())

    Outlook  Temperature  Humidity    Wind
0     sunny           85        85    weak
1     sunny           80        90  strong
2  overcast           83        86    weak
3      rain           70        96    weak
4      rain           68        80    weak
    Play
0  False
1  False
2   True
3   True
4   True


### From categorical to numeric

In [8]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

outlook = X_train.iloc[:,0]
outlook_enc = encoder.fit_transform(outlook)
print(outlook.tolist())
print(outlook_enc)

wind = X_train.iloc[:,3]
wind_enc = encoder.fit_transform(wind)
print(wind.tolist())
print(wind_enc)

['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain']
[2 2 0 1 1 1 0 2 2 1 2 0 0 1]
['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'strong']
[1 0 1 1 1 0 0 1 1 1 0 0 1 0]
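
The integer codes above follow the sorted order of the distinct labels. The sketch below reproduces that mapping in plain Python to make it explicit; sorted_label_codes is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
outlook = ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast',
           'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain']
wind = ['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong',
        'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'strong']

def sorted_label_codes(labels):
    """Map each distinct label to its index in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

outlook_map = sorted_label_codes(outlook)
wind_map = sorted_label_codes(wind)

print(outlook_map)
print(wind_map)
print([outlook_map[v] for v in outlook])
print([wind_map[v] for v in wind])
```

Because the same encoder object is refit on the wind column, its classes_ attribute afterwards describes only the wind labels. Keeping one encoder per column would be needed to encode new outlook values later.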


In [9]:
df_outlook = pd.DataFrame(outlook_enc, columns = ['Outlook'])
df_wind = pd.DataFrame(wind_enc, columns = ['Wind'])  # use the encoded wind values, not outlook_enc
X_train_num = pd.concat([df_outlook, X_train.iloc[:,1], X_train.iloc[:,2], df_wind], axis=1)
print(X_train_num)

    Outlook  Temperature  Humidity  Wind
0         2           85        85     1
1         2           80        90     0
2         0           83        86     1
3         1           70        96     1
4         1           68        80     1
5         1           65        70     0
6         0           64        65     0
7         2           72        95     1
8         2           69        70     1
9         1           75        80     1
10        2           75        70     0
11        0           72        90     0
12        0           81        75     1
13        1           71        91     0


### Building the model

Gaussian Naive Bayes. GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian.

In [10]:
# Pass the labels as a 1-D array; y_train is a single-column DataFrame
clf = GaussianNB().fit(X_train_num, y_train.values.ravel())


### Evaluating the model on a new instance

In [11]:
# Outlook sunny: 2, Temperature: 60, Humidity: 65, Wind weak: 1
new_example = [[2, 60, 65, 1]]
X_test = pd.DataFrame(new_example, columns = ['Outlook', 'Temperature', 'Humidity', 'Wind'])
print(X_test)
clf.predict(X_test)

   Outlook  Temperature  Humidity  Wind
0        2           60        65     1


array([ True])
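
As a rough cross-check of this result, the sketch below recomputes the Gaussian Naive Bayes log-scores by hand from the 14 training rows. It assumes the encodings shown earlier (overcast=0, rain=1, sunny=2 and strong=0, weak=1), uses population variances, and leaves out the small variance-smoothing term that GaussianNB adds, so the numbers are an approximation of what the library computes rather than its exact internals.

```python
import math
from statistics import fmean, pvariance

# Rows: (Outlook, Temperature, Humidity, Wind, Play)
# Encodings assumed: overcast=0, rain=1, sunny=2; strong=0, weak=1
rows = [
    (2, 85, 85, 1, False), (2, 80, 90, 0, False), (0, 83, 86, 1, True),
    (1, 70, 96, 1, True),  (1, 68, 80, 1, True),  (1, 65, 70, 0, False),
    (0, 64, 65, 0, True),  (2, 72, 95, 1, False), (2, 69, 70, 1, True),
    (1, 75, 80, 1, True),  (2, 75, 70, 0, True),  (0, 72, 90, 0, True),
    (0, 81, 75, 1, True),  (1, 71, 91, 0, False),
]
new_example = (2, 60, 65, 1)

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

log_scores = {}
for cls in (True, False):
    members = [r[:4] for r in rows if r[4] == cls]
    score = math.log(len(members) / len(rows))  # log prior
    for j, x in enumerate(new_example):
        column = [m[j] for m in members]
        score += log_gaussian(x, fmean(column), pvariance(column))
    log_scores[cls] = score

print(log_scores)
print("predicted:", max(log_scores, key=log_scores.get))
```

Working with sums of logarithms instead of products of densities gives the same ranking of classes and avoids numerical underflow when there are many features.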