<div style="line-height:0.45">
<h1 style="color:#b85e88  "> Naive Bayes 1 </h1>
</div>
<div style="line-height:0.5">
<h4> Naive Bayes Classifier Implementation from scratch with numpy and pandas.
</h4>
</div>
<br>
<div style="margin-top: -10px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline;">Keywords:</h3> Stratified k-fold validation example + double train_test_split 
</span>
</div>

In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

In [2]:
class CustomNaiveBayes:
    """ Custom implementation of a Gaussian Naive Bayes classifier.
    
        Args:
            - X_data: feature matrix of shape (num_samples, num_features) [ndarray]
            - y_labels: target labels of shape (num_samples,) [ndarray]

        Attributes: 
            - num_samples [int]
            - num_features [int]
            - num_classes [int]
            - eps [float]: small constant added to the variances for numerical stability,
                so that dividing by (or taking the log of) a zero variance is avoided

        Details:
            - The shape of the feature matrix is stored in num_samples and num_features.
            - np.unique returns the unique values in the target labels array; their count is num_classes.
            - Labels are assumed to be the integers 0 .. num_classes - 1, because fit() loops over range(num_classes).
        """

    def __init__(self, X_data, y_labels):
        """ Constructor. Initialize the CustomNaiveBayes classifier. """
        self.num_samples, self.num_features = X_data.shape
        self.num_classes = len(np.unique(y_labels))
        self.eps = 1e-6

    def fit(self, X_data, y_labels):
        """ Fit the classifier to the training data.

        Parameters:
            - X_data: feature matrix [ndarray]
            - y_labels: target labels [ndarray]

        Details:
            - Per-class statistics are stored in dictionaries of numpy arrays,
                keyed by the string representation of the class.
            - X_c is the subset of X_data whose samples have y_labels equal to c.
            - The mean of X_c is computed along axis 0, giving one value per feature.
            - The variance of X_c is computed along axis 0, giving one value per feature.
            - The prior probability of the current class is the number of samples in X_c
                divided by the total number of samples in X_data,
                i.e. the relative frequency of the class in the training data.
        """
        self.classes_mean = {}
        self.classes_variance = {}
        self.classes_prior = {}

        for c in range(self.num_classes):
            X_c = X_data[y_labels == c]
            self.classes_mean[str(c)] = np.mean(X_c, axis=0)
            self.classes_variance[str(c)] = np.var(X_c, axis=0)
            self.classes_prior[str(c)] = X_c.shape[0] / X_data.shape[0]

    def predict(self, X_data):
        """ Predict the class labels for the given data.

        Parameters:
            - X_data: samples to classify [ndarray]
        
        Details:
            - Create an array of zeros with shape (number of rows in X_data, self.num_classes).
            - Retrieve the prior probability of the current class (its relative frequency in the training data)
                from the classes_prior dictionary.
            - Compute the Gaussian log-density of each sample in X_data using the mean and variance of the current class.
            - Add the logarithm of the prior to that log-density and store the result in the corresponding column.
            - Return the indices of the maximum values along axis 1, which are the predicted class labels.
        Returns:
            - Predicted class labels [ndarray]
        """
        # Size the score matrix from the data being predicted, not from the constructor's data
        probabilities = np.zeros((X_data.shape[0], self.num_classes))

        for c in range(self.num_classes):
            prior = self.classes_prior[str(c)]
            probabilities_c = self.calculate_density(X_data, self.classes_mean[str(c)], self.classes_variance[str(c)])
            probabilities[:, c] = probabilities_c + np.log(prior)

        return np.argmax(probabilities, 1)

    def calculate_density(self, x, mean, variance):
        """ Calculate the log of the Gaussian probability density, assuming independent features.

        Parameters:
            - x: input data [ndarray]
            - mean: per-feature means for a specific class [ndarray]
            - variance: per-feature variances for a specific class [ndarray]

        Details:
            - Calculate the constant term of the Gaussian log-density.
                1 Normalization factor:
                    -num_features / 2 * log(2 * pi), the log of the (2 * pi)^(-d/2) factor of a d-dimensional Gaussian.
                2 Scaling factor:
                    Add epsilon to each variance, take the natural log, sum over the features and multiply by -0.5.
            - Calculate the element-wise squared difference between samples and mean, divided by variance plus epsilon.
                - This scales the squared differences by the variances.
                - Summing over the features and halving gives the quadratic term.
            - const minus the quadratic term is the log-likelihood of each sample under the specific class.

        Returns:
            - Log-densities of the samples under the Gaussian model [ndarray]
        """
        const = -self.num_features / 2 * np.log(2 * np.pi) - 0.5 * np.sum(np.log(variance + self.eps))
        probabilities = 0.5 * np.sum(np.power(x - mean, 2) / (variance + self.eps), 1)
        return const - probabilities
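
The expression in `calculate_density` is the sum of one univariate Gaussian log-density per feature, which is what the "naive" independence assumption amounts to. The sketch below checks that equivalence on made-up numbers; the function names `log_density_vectorized` and `log_density_per_feature` and the demo arrays are illustrative assumptions, not part of the class above.

```python
import numpy as np

def log_density_vectorized(x, mean, variance, eps=1e-6):
    """Same expression as calculate_density, written as a free function."""
    d = x.shape[1]
    const = -d / 2 * np.log(2 * np.pi) - 0.5 * np.sum(np.log(variance + eps))
    quad = 0.5 * np.sum((x - mean) ** 2 / (variance + eps), axis=1)
    return const - quad

def log_density_per_feature(x, mean, variance, eps=1e-6):
    """Sum of d independent univariate Gaussian log-densities."""
    v = variance + eps
    per_feature = -0.5 * np.log(2 * np.pi * v) - (x - mean) ** 2 / (2 * v)
    return per_feature.sum(axis=1)

# Made-up data: 5 samples, 3 features
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(5, 3))
mean_demo = np.array([0.0, 1.0, -1.0])
var_demo = np.array([1.0, 0.5, 2.0])

vectorized = log_density_vectorized(x_demo, mean_demo, var_demo)
per_feature = log_density_per_feature(x_demo, mean_demo, var_demo)
print(np.allclose(vectorized, per_feature))
```

Working with log-densities turns the product over features into a sum, which avoids underflow when there are many features.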

In [4]:
dataset_smoke = pd.read_csv('./smoke_detection_iot.csv')
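# Rename the unnamed first column (the CSV's row index) to "ID".
dataset_smoke.rename(columns={"Unnamed: 0": "ID"}, inplace=True)
# Assigning into dataset_smoke.columns.values[0] can also work, but it bypasses pandas
# and may leave the column index in an inconsistent state.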

dataset_smoke.head()

Unnamed: 0,ID,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
2,2,1654733333,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2,0
3,3,1654733334,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3,0
4,4,1654733335,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4,0


In [5]:
#dataset_smoke.drop(['Fire Alarm'], axis=1, inplace=True)
X = dataset_smoke.iloc[:,:-1]
y = dataset_smoke['Fire Alarm']
dataset_smoke.head()

Unnamed: 0,ID,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
2,2,1654733333,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2,0
3,3,1654733334,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3,0
4,4,1654733335,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4,0


In [6]:
## Split into Training, Validation and Test sets
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.2 , random_state=12)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=12)
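
The two calls above split at random, so the class ratio can drift between the three sets. As a hedged sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions standing in for the real features and labels), the `stratify` argument of `train_test_split` keeps the ratio fixed through both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 30% positives
rng = np.random.default_rng(12)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 300 + [0] * 700)

# First split: 80% train, 20% held out, preserving the class ratio
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=12
)
# Second split: halve the held-out part into validation and test sets
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=12
)

print(len(y_tr), len(y_va), len(y_te))
print(y_tr.mean(), y_va.mean(), y_te.mean())
```

The result is an 80/10/10 split in which each part has roughly the same share of positives as the full data.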

In [7]:
NB = CustomNaiveBayes(X, y)
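# Train on the full dataset (the train/validation/test splits above are not used here)
NB.fit(X, y)
# Predict on the same rows
y_pred = NB.predict(X)
# Accuracy on the data used for fitting (training accuracy, not a held-out estimate)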
acc = sum(y_pred==y) / X.shape[0]
print(f"The Accuracy of our model is: {acc}")

The Accuracy of our model is: 0.8454095481398691
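
The accuracy above is computed on the same rows the model was fitted on. A held-out estimate fits on one part of the data and scores on rows the model has not seen. The sketch below shows the pattern on synthetic data with a compact Gaussian Naive Bayes; `fit_gaussian_nb`, `predict_gaussian_nb` and the demo arrays are hypothetical names introduced for illustration, not the notebook's class or dataset.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Hypothetical helper: per-class means, variances and log-priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
    return classes, means, variances, log_priors

def predict_gaussian_nb(model, X, eps=1e-6):
    """Hypothetical helper: pick the class with the largest log-joint score."""
    classes, means, variances, log_priors = model
    scores = np.stack([
        lp - 0.5 * np.sum(np.log(2 * np.pi * (v + eps)) + (X - m) ** 2 / (v + eps), axis=1)
        for m, v, lp in zip(means, variances, log_priors)
    ], axis=1)  # shape (n_samples, n_classes)
    return classes[np.argmax(scores, axis=1)]

# Synthetic two-class data standing in for the real features
rng = np.random.default_rng(12)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                    rng.normal(3.0, 1.0, size=(200, 2))])
y_demo = np.array([0] * 200 + [1] * 200)

# Shuffle once, then keep the last 20% of rows aside
order = rng.permutation(len(y_demo))
train_idx, held_idx = order[:320], order[320:]

# Fit on the training rows only, then score both parts
model = fit_gaussian_nb(X_demo[train_idx], y_demo[train_idx])
train_acc = np.mean(predict_gaussian_nb(model, X_demo[train_idx]) == y_demo[train_idx])
held_out_acc = np.mean(predict_gaussian_nb(model, X_demo[held_idx]) == y_demo[held_idx])
print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {held_out_acc:.3f}")
```

The held-out figure is the one that says something about unseen data. The training figure can be optimistic.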


In [26]:
NB = CustomNaiveBayes(X_val, y_val)
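# Train on the validation split only
NB.fit(X_val, y_val)
# Predict on the same rows
y_pred = NB.predict(X_val)
# Accuracy on the data used for fitting (not a held-out estimate)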
acc = sum(y_pred==y_val) / X_val.shape[0]
print(f"The Accuracy of our model is: {acc}")

The Accuracy of our model is: 0.8478365000798339


<h3 style="color:#b85e88  "> Trying the Stratified K-Folds cross-validator to overcome the random sampling issue  </h3>

In [11]:
X.head()

Unnamed: 0,ID,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT
0,0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0
1,1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1
2,2,1654733333,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2
3,3,1654733334,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3
4,4,1654733335,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4


In [22]:
n_splits = 2

# Create an instance of StratifiedKFold
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=12)
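# Inspect what the generator yields: one (train_indices, test_indices) tuple per fold.
# Unpacking into exactly two names works only because n_splits == 2.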
a, b = skf.split(X, y)
a, type(a)


((array([    0,     4,     8, ..., 62624, 62625, 62629]),
  array([    1,     2,     3, ..., 62626, 62627, 62628])),
 tuple)
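
`StratifiedKFold.split` yields one `(train_indices, test_indices)` pair per fold, so with more than two folds the pairs are easier to collect in a list than to unpack by name. A small sketch on toy data (the arrays and the name `skf_demo` are assumptions for illustration) shows the shape of the result and that each test fold keeps the class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 20 rows, 25% positives
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
folds = list(skf_demo.split(X_demo, y_demo))  # list of (train_idx, test_idx) tuples

for i, (train_idx, test_idx) in enumerate(folds):
    # Fold number, train size, test size, share of positives in the test fold
    print(i, len(train_idx), len(test_idx), y_demo[test_idx].mean())
```

Every row appears in exactly one test fold, and each test fold here holds three negatives and one positive.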

In [30]:
""" Perform stratified cross-validation splits.
N.B.
X is a pandas DataFrame and y is a pandas Series, so the "values" attribute is used to get NumPy arrays
that can be indexed directly with the positional fold indices.
"""
## Initialize empty lists to store the split data
X_train_list, X_val_list, X_test_list = [], [], []
y_train_list, y_val_list, y_test_list = [], [], []

for train_index, test_index in skf.split(X.values, y.values):
    X_train, X_val_test = X.values[train_index], X.values[test_index]
    y_train, y_val_test = y.values[train_index], y.values[test_index]
    # Split the held-out fold again into validation and test sets
    X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=12)

    ###### Append this fold's arrays to the lists (one entry per fold)
    X_train_list.append(X_train)
    X_val_list.append(X_val)
    X_test_list.append(X_test)
    y_train_list.append(y_train)
    y_val_list.append(y_val)
    y_test_list.append(y_test)

In [34]:
""" Convert back to DataFrames """

### Concatenate the per-fold arrays along the first axis (rows)
### (np.concatenate also works when the folds differ in length, unlike np.stack)
X_train_array = np.concatenate(X_train_list, axis=0)
X_val_array = np.concatenate(X_val_list, axis=0)
X_test_array = np.concatenate(X_test_list, axis=0)

####### Wrap the 2D arrays in DataFrames
X_train_df = pd.DataFrame(X_train_array)
X_val_df = pd.DataFrame(X_val_array)
X_test_df = pd.DataFrame(X_test_array)
y_train_series = pd.Series(np.concatenate(y_train_list))
y_val_series = pd.Series(np.concatenate(y_val_list))
y_test_series = pd.Series(np.concatenate(y_test_list))

In [35]:
NB = CustomNaiveBayes(X_val_df, y_val_series)
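# Train on the validation rows pooled from all folds
NB.fit(X_val_df, y_val_series)
# Predict on the same rows
y_pred = NB.predict(X_val_df)
# Accuracy on the data used for fitting (not a held-out estimate)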
acc = sum(y_pred==y_val_series) / X_val_df.shape[0]
print(f"The Accuracy of our model is: {acc}")

The Accuracy of our model is: 0.8408430464633562
