## Overview

We will implement a `NaiveBayes` classifier for categorical features in Python (using pandas and NumPy), loosely following the scikit-learn interface. Our `NaiveBayes` class will expose two primary methods:

1. **`fit(X, y, alpha=1)`**  
   - **Goal**: Estimate the class priors $P(Y=i)$ and the conditional likelihoods $P(X_j = v \mid Y = i)$ for each feature $X_j$ and each class $i$.
   - **Class Priors**  
     Let $N(i)$ be the count of samples in class $i$, and $N_{\text{tot}}$ the total number of samples. Then
     $$
       P(Y=i) = \frac{N(i)}{N_{\text{tot}}}.
     $$
   - **Laplace‐smoothed Likelihoods**  
     For each feature $X_j$ taking value $v$ and each class $i$, let
     - $N(v, i)$ = number of times $X_j=v$ among samples with $Y=i$,
     - $k_j$ = number of distinct categories of feature $X_j$,
     - $\alpha$ = smoothing parameter (default $\alpha=1$).  
     
     Then
     $$
       P(X_j = v \mid Y = i)
       = \frac{N(v, i) + \alpha}{N(i) + \alpha\,k_j}.
     $$
     This ensures no probability is zero, so the product of likelihoods remains nonzero.

   All estimated probabilities are stored in dictionaries:
   ```python
   self.prior      # {class_i: P(Y=i)}
   self.likelihood # {class_i: {feature_j: {v: P(X_j=v | Y=i)}}}
   ```

2. **`predict(x_new)`**  
- **Goal**: Compute the posterior score for each class $i$ given a new sample $\mathbf{x} = (x_1, \dots, x_n)$ and return the class with the highest score.  
- **Posterior (log‐space)**  
  Since  
  $$
    P(Y=i \mid \mathbf{x})
    \propto P(Y=i)\,\prod_{j=1}^n P(X_j = x_j \mid Y=i),
  $$  
  we work in log‐space to avoid underflow. Up to an additive constant $C$ that does not depend on $i$:  
  $$
    \log P(Y=i \mid \mathbf{x})
    = \log P(Y=i)
    + \sum_{j=1}^n \log P(X_j = x_j \mid Y=i) + C.
  $$  
- **Decision Rule**  
  Return  
  $$
    \widehat{y} = \arg\max_i \Bigl\{\,\log P(Y=i) + \sum_{j=1}^n \log P(X_j = x_j \mid Y=i)\Bigr\}.
  $$
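
To see why the sum of logs is preferred over the raw product, here is a minimal sketch. The 400 features and the likelihood value 0.01 are made-up numbers chosen only to trigger underflow; they are not taken from the tennis data.

```python
import math

# Hypothetical setting: 400 features, each with likelihood 0.01 under some class
probs = [0.01] * 400

# Direct product: the true value is 1e-800, far below the smallest float64
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity in log-space stays finite
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0 (underflow)
print(log_sum)   # about -1842.07
```

Once the product has collapsed to `0.0` for every class, the classes can no longer be ranked, while the log-sums remain comparable.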

---




## Let's import the necessary libraries

In [40]:
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import CategoricalNB

## Small Dataset:

In [41]:
tennis = pd.read_csv('Bases de datos\\tennis.csv')
encoders = {}

for i in tennis.columns:
    le = LabelEncoder()
    tennis[i] = le.fit_transform(tennis[i])
    encoders[i] = le 

tennis

Unnamed: 0,outlook,temp,humidity,windy,play
0,2,1,0,0,0
1,2,1,0,1,0
2,0,1,0,0,1
3,1,2,0,0,1
4,1,0,1,0,1
5,1,0,1,1,0
6,0,0,1,1,1
7,2,2,0,0,0
8,2,0,1,0,1
9,1,2,1,0,1
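
The integer codes in this table come from `LabelEncoder`, which sorts the distinct values of a column and numbers them from 0. The sketch below mimics that rule in plain Python so the codes can be read back as categories. It assumes the raw column values are the strings used later in this notebook (`sunny`, `rainy`, `overcast`, and so on); `label_codes` is a hypothetical helper, not part of scikit-learn.

```python
# Raw category strings as they appear later in this notebook
# (assumed to match the contents of tennis.csv)
raw_values = {
    'outlook': ['sunny', 'rainy', 'overcast'],
    'temp': ['hot', 'mild', 'cool'],
    'humidity': ['high', 'normal'],
}

def label_codes(values):
    """Mimic LabelEncoder: sort the distinct values, then number them from 0."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

mappings = {col: label_codes(vals) for col, vals in raw_values.items()}
for col, mapping in mappings.items():
    print(col, mapping)
# outlook {'overcast': 0, 'rainy': 1, 'sunny': 2}
# temp {'cool': 0, 'hot': 1, 'mild': 2}
# humidity {'high': 0, 'normal': 1}
```

With the fitted encoders from the cell above, the same information is available from each encoder's `classes_` attribute, for example `encoders['outlook'].classes_`.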


In [42]:
labels = [i for i in tennis.columns if i != 'play']
X = tennis[labels]
y = tennis['play']

## Categorical NaiveBayes

In [43]:
class NaiveBayes():
    def __init__(self):
        pass

    def fit(self , X , y , alpha = 1 , parameters = False):
        """
        Estimate class priors and feature likelihoods for a categorical Naive Bayes model.

        Parameters
        ----------
        X : pandas.DataFrame
            Feature matrix with categorical columns.
        y : pandas.Series
            Target labels corresponding to each row of X.
        alpha : float, default=1
            Laplace smoothing parameter.
        parameters : bool, default=False
            If True, return the raw likelihood, prior, and class counts.

        Attributes set on self
        ----------------------
        likelihood : dict
            Nested dict mapping each class to a dict of feature→{value: P(value|class)}.
        prior : dict
            Mapping from class to P(class).
        cols : int
            Number of feature columns.
        labels : list
            List of original column names from X.
        """

        labels = [i for i in X.columns]
        clases_totales , prior , likelihood= {} ,{} , {}
        combinada = X.copy()
        combinada['y'] = y

        # Build a dict of total counts per class
        for i in y.unique():
            clases_totales[i] = y.eq(i).sum()

        # Compute class priors: count_of_class / total_samples
        for i in clases_totales:
            prior[i] = clases_totales[i]/y.shape[0]
            likelihood[i] = {}

        # Compute likelihoods with Laplace smoothing
        for i in prior:
            # Filter rows belonging to class i
            df1 = combinada[combinada['y'] == i]
            for k in range(X.shape[1]):
                posibles = X.iloc[:, k].unique()
                n_posibles = len(posibles)

                # Count occurrences of each category value within class i
                dic = df1.iloc[:, k].value_counts().to_dict()
                # Apply Laplace smoothing:
                # (count(value, class) + alpha) / (count(class) + alpha * number_of_categories)
                dic_dividido = {valor: (dic.get(valor, 0) + alpha) / (clases_totales[i] + alpha * n_posibles) for valor in posibles}
                likelihood[i][labels[k]] = dic_dividido

        # Store the computed parameters
        self.likelihood = likelihood
        self.prior = prior
        self.cols = X.shape[1]
        self.labels = labels

        # Optionally return raw parameters for inspection
        if parameters:
            return likelihood , prior , clases_totales

    def predict(self , l:list):
        """
        Predict the class label for a single observation using the trained Naive Bayes model.

        Parameters
        ----------
        l : list
            A list of feature values for the new observation. Length must match the number of features used in training.

        Returns
        -------
        tuple
            (predicted_class, class_log_probabilities)
            - predicted_class: the class with the highest posterior log-probability
            - class_log_probabilities: dict mapping each class to its computed log-posterior
        """
        # Check that the input has the expected number of features
        if len(l) != self.cols:
            raise ValueError(
                f'Expected {self.cols} features, got {len(l)}'
            )

        # Probability assigned to a value never seen for a feature during fit,
        # so that its log stays finite
        unseen = 9e-11
        prob_finales = {}

        # Compute the (unnormalized) log-posterior for each class
        for i in self.prior:
            # Sum log-likelihoods for each feature
            probabilidad = 0
            for etiquetas, nuevo in zip(self.labels, l):
                p = self.likelihood[i][etiquetas].get(nuevo, unseen)
                probabilidad += np.log(p)

            # Add log-prior to get log-posterior
            prob_finales[i] = probabilidad + np.log(self.prior[i])

        # Select the class with the highest log-posterior
        final = max(prob_finales, key=prob_finales.get)
        return final, prob_finales

## Let's train the model:

In [44]:
model = NaiveBayes()
verosimilitud , priors , frecuencia = model.fit(X,y , parameters= True)

### First, let's check the frequency of each class

In [45]:
print(frecuencia)

{0: 5, 1: 9}


### Now let's see the prior probabilities of each class

In [46]:
print(priors)

{0: 0.35714285714285715, 1: 0.6428571428571429}


### Finally, let's see the likelihood

In [47]:
verosimilitud

{0: {'outlook': {2: 0.5, 0: 0.125, 1: 0.375},
  'temp': {1: 0.375, 2: 0.375, 0: 0.25},
  'humidity': {0: 0.7142857142857143, 1: 0.2857142857142857},
  'windy': {0: 0.42857142857142855, 1: 0.5714285714285714}},
 1: {'outlook': {2: 0.25, 0: 0.4166666666666667, 1: 0.3333333333333333},
  'temp': {1: 0.25, 2: 0.4166666666666667, 0: 0.3333333333333333},
  'humidity': {0: 0.36363636363636365, 1: 0.6363636363636364},
  'windy': {0: 0.6363636363636364, 1: 0.36363636363636365}}}

## Detailed Interpretation of the Likelihood Dictionary



- **Top-level keys** (`0`, `1`) are the class labels $Y=0$ and $Y=1$.  
- **Feature keys** (`'outlook'`, `'temp'`, `'humidity'`, `'windy'`) list the conditional probabilities for each feature.  
- **Values** inside each feature map represent $P(X_j = v \mid Y=\text{class})$.
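
As a sanity check, the `outlook` entries for class 0 can be reproduced by hand from the Laplace formula. The counts used below (3, 0, and 2) are not printed anywhere in the notebook; they are the values implied by the likelihoods above together with the class count of 5 and the 3 outlook categories. `smoothed` is a hypothetical helper name.

```python
from fractions import Fraction

def smoothed(count, class_total, n_categories, alpha=1):
    """Laplace-smoothed estimate (count + alpha) / (class_total + alpha * k)."""
    return Fraction(count + alpha, class_total + alpha * n_categories)

# Counts of each outlook code among the 5 class-0 samples (implied by the output)
counts_class0 = {2: 3, 0: 0, 1: 2}

likelihood_outlook = {v: smoothed(c, class_total=5, n_categories=3)
                      for v, c in counts_class0.items()}

print({v: float(p) for v, p in likelihood_outlook.items()})
# {2: 0.5, 0: 0.125, 1: 0.375}
print(sum(likelihood_outlook.values()))
# 1
```

The value 0.125 for `outlook=0` comes entirely from the smoothing term, since that category never occurs with class 0 in the training data.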

---

### Class = 0

- **outlook**  
  - 2 → 0.50  
  - 0 → 0.125  
  - 1 → 0.375  
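  _After smoothing, outlook=2 (sunny) has probability 0.5 under class 0; outlook=0 (overcast) never occurs with class 0 in the training data, so its 0.125 comes entirely from smoothing._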

- **temp**  
  - 1 → 0.375  
  - 2 → 0.375  
  - 0 → 0.25  
  _Hot (1) and mild (2) are equally likely under class 0; cool (0) is less likely._

- **humidity**  
  - 0 → 0.714  
  - 1 → 0.286  
  _High humidity (encoded as 0) is strongly associated with class 0: its smoothed probability is about 0.71._

- **windy**  
  - 0 → 0.429  
  - 1 → 0.571  
  _Windy conditions (1) are slightly more likely than calm ones under class 0 (0.57 vs. 0.43)._

---

### Class = 1

- **outlook**  
  - 0 → 0.417  
  - 1 → 0.333  
  - 2 → 0.25  
  _Overcast (0) is most likely; sunny (2) is least likely for class 1._

- **temp**  
  - 2 → 0.417  
  - 0 → 0.333  
  - 1 → 0.25  
  _Mild temperature (2) is the most likely value under class 1._

- **humidity**  
  - 1 → 0.636  
  - 0 → 0.364  
  _Normal humidity (encoded as 1) is clearly more likely than high humidity (0) under class 1._

- **windy**  
  - 0 → 0.636  
  - 1 → 0.364  
  _Not windy favors class 1._

---

By comparing these tables, you can see which feature values most strongly differentiate the two classes. Values whose probabilities differ most between classes, such as **outlook=0** (0.125 vs. 0.417) or **humidity=0** (0.714 vs. 0.364), contribute most to the Naive Bayes decision when you sum log-probabilities for a new observation.
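
One way to make this comparison concrete is the log-ratio $\log\bigl(P(v \mid Y=1) / P(v \mid Y=0)\bigr)$ for every feature value: positive values push the decision toward class 1, negative values toward class 0. The sketch below copies the likelihood dictionary printed earlier; the ranking by absolute log-ratio is an illustrative choice, not something the notebook itself computes.

```python
import math

# Likelihoods copied from the fitted model's output above
likelihood = {
    0: {'outlook': {2: 0.5, 0: 0.125, 1: 0.375},
        'temp': {1: 0.375, 2: 0.375, 0: 0.25},
        'humidity': {0: 0.7142857142857143, 1: 0.2857142857142857},
        'windy': {0: 0.42857142857142855, 1: 0.5714285714285714}},
    1: {'outlook': {2: 0.25, 0: 0.4166666666666667, 1: 0.3333333333333333},
        'temp': {1: 0.25, 2: 0.4166666666666667, 0: 0.3333333333333333},
        'humidity': {0: 0.36363636363636365, 1: 0.6363636363636364},
        'windy': {0: 0.6363636363636364, 1: 0.36363636363636365}},
}

# Log-ratio of class 1 to class 0 for every (feature, value) pair
ratios = {
    (feature, value): math.log(likelihood[1][feature][value]
                               / likelihood[0][feature][value])
    for feature in likelihood[0]
    for value in likelihood[0][feature]
}

# Rank from most to least discriminative
ranked = sorted(ratios.items(), key=lambda item: -abs(item[1]))
for (feature, value), ratio in ranked:
    print(f'{feature}={value}: {ratio:+.3f}')
# first line printed: outlook=0: +1.204
```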


## Let's predict a new sample

In [48]:
model.predict([1,0 , 1 , 1])

(1, {0: -5.209121787743566, 1: -4.102643365036796})

Remember that these values are **unnormalized log-posteriors**: natural logarithms of $P(Y=i)\,\prod_j P(X_j = x_j \mid Y=i)$, which is why they are negative. After exponentiation they do not sum to one across classes, so what matters is **which value is larger** (i.e. closer to zero).  

Given the output:

```python 
(1, {0: -5.209121787743566, 1: -4.102643365036796})
```

- Log-posterior score for class 0: **-5.2091**  
- Log-posterior score for class 1: **-4.1026**  

Since  
$$
-4.1026 > -5.2091
$$
class 1 has the higher log-probability, so the model predicts **1**.
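
If actual posterior probabilities are wanted, the log-scores can be exponentiated and normalized. The sketch below does this for the output above, subtracting the maximum first for numerical stability (the usual log-sum-exp trick). This step is not part of the `NaiveBayes` class as written; it only illustrates how the scores relate to probabilities.

```python
import math

# Log-scores returned by model.predict([1, 0, 1, 1])
log_scores = {0: -5.209121787743566, 1: -4.102643365036796}

# Subtract the maximum before exponentiating to avoid underflow
max_score = max(log_scores.values())
shifted = {c: math.exp(s - max_score) for c, s in log_scores.items()}

# Normalize so the values sum to one
total = sum(shifted.values())
posterior = {c: e / total for c, e in shifted.items()}

print(posterior)  # roughly {0: 0.25, 1: 0.75}
```

The ranking of the classes is unchanged by this normalization, which is why `predict` can work with the raw log-scores.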

## Now let's check whether scikit-learn's built-in model gives the same prediction

In [49]:

modelo = CategoricalNB()
modelo.fit(X,y)
modelo.predict([[1,0 , 1 , 1]])



array([1])

We can see that the scikit-learn model predicts the same class as our manual version.

## Let's make some new predictions

In [58]:
new_data = {
    'outlook':   ['sunny',    'rainy',    'overcast', 'sunny',    'rainy',    'overcast', 'sunny'],
    'temp':      ['hot',      'mild',     'cool',     'mild',     'cool',     'hot',      'cool'],
    'humidity':  ['high',     'high',     'normal',   'normal',   'normal',   'high',     'high'],
    'windy':     [False,      True,       False,      True,       False,      True,       True],
    'play':      ['no',       'yes',      'yes',      'yes',      'yes',      'yes',      'no']}

new_df = pd.DataFrame(new_data)

for col, le in encoders.items():
    if new_df[col].dtype == 'bool':
        new_df[col] = new_df[col].astype(int)
    new_df[col] = le.transform(new_df[col])

new_df


Unnamed: 0,outlook,temp,humidity,windy,play
0,2,1,0,0,0
1,1,2,0,1,1
2,0,0,1,0,1
3,2,2,1,1,1
4,1,0,1,0,1
5,0,1,0,1,1
6,2,0,0,1,0


In [59]:

labels = [i for i in new_df if i != "play"]
X_test = new_df[labels]

predicciones = []
for i in range(X_test.shape[0]):
    pr = model.predict(list(X_test.iloc[i, :]))
    print(pr)
    predicciones.append(pr[0])

new_df['Naive'] = predicciones


(0, {0: -3.8873659477612463, 1: -4.6780075099403575})
(0, {0: -3.8873659477612463, 1: -4.439115601658008})
(1, {0: -6.595416148863456, 1: -3.3198840257871636})
(1, {0: -4.515974607183621, 1: -4.167181886174366})
(1, {0: -5.496803860195347, 1: -3.5430275771013733})
(1, {0: -4.9859782364293554, 1: -4.726797674109789})
(0, {0: -4.005148983417629, 1: -4.9499412254239985})


In [60]:
new_df

Unnamed: 0,outlook,temp,humidity,windy,play,Naive
0,2,1,0,0,0,0
1,1,2,0,1,1,0
2,0,0,1,0,1,1
3,2,2,1,1,1,1
4,1,0,1,0,1,1
5,0,1,0,1,1,1
6,2,0,0,1,0,0


- **Overall accuracy:** 6 out of 7 correct → ~85.7%.  
- **True Negatives (TN):** 2 (rows 0, 6: play=0, predicted=0)  
- **True Positives (TP):** 4 (rows 2–5: play=1, predicted=1)  
- **False Negatives (FN):** 1 (row 1: play=1, predicted=0)  
- **False Positives (FP):** 0  

The only mistake is at index 1, where the model predicted 0 but the true label was 1 (a false negative). All other cases are correct. This tells us the classifier is quite accurate on this small set, with perfect specificity (no false alarms) and a single miss on a positive example.
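
The counts above can be recomputed directly from the `play` and `Naive` columns of the table. The sketch below uses plain Python lists copied from that table and treats class 1 as the positive class, matching the definitions used in the list.

```python
# True labels and predictions copied from the table above
y_true = [0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)          # share of positives that were found
specificity = tn / (tn + fp)     # share of negatives that were found

print(tp, tn, fp, fn)            # 4 2 0 1
print(round(accuracy, 3))        # 0.857
print(recall, specificity)       # 0.8 1.0
```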

## Let's make the same predictions with scikit-learn

In [61]:

modelo = CategoricalNB()
modelo.fit(X,y)
modelo.predict(X_test)

array([0, 0, 1, 1, 1, 1, 0])

## Conclusion

The scikit-learn `CategoricalNB` model produces the exact same predictions as our manual implementation:

```python 
array([0, 0, 1, 1, 1, 1, 0])
```


This match on all seven test samples is consistent with our `NaiveBayes` class reproducing scikit-learn's behavior on this dataset.

## Next Steps

1. **Extend to Numerical Features**  
   - Implement **Gaussian Naive Bayes** for continuous variables.  
   - Or add a **discretization** step to convert numerics into categories before fitting.

2. **Enrich the API**  
   - Add `predict_proba()` to return class probabilities.  
   - Support `score()` for quick accuracy checks.  
   - Expose hyperparameters (`alpha`, `priors`) in the constructor.

3. **Performance and Robustness**  
   - **Vectorize** loops with NumPy for faster training/prediction.  
   - Add **input validation** and handle unseen categories gracefully.  

4. **Pipeline Compatibility**  
   - Subclass `BaseEstimator` and `ClassifierMixin` for full scikit-learn compatibility.  
   - Enable use within `Pipeline` alongside encoders and scalers.
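
As a starting point for the first item, a Gaussian Naive Bayes model replaces the per-category likelihood table with a normal density per feature and class. The sketch below shows only that log-likelihood term; the function name `gaussian_log_likelihood` and the sample values are hypothetical, and a full implementation would also need per-class means and variances for every feature.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log of the normal density N(x; mean, var); var must be positive."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical numeric feature values observed for one class
values = [20.0, 22.0, 25.0, 21.0]
mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)

# Log-likelihood of a new value under this class; it would take the place of
# log P(X_j = x_j | Y = i) in the log-posterior sum
print(gaussian_log_likelihood(23.0, mean, var))
```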

By following these steps, you can evolve this script into a production-ready Naive Bayes classifier that handles both categorical and numerical data, offers a rich API, and fits seamlessly into common data-science workflows.  
