Given $\mathbf{x} = (x_1, ..., x_n)$, an observation with $n$ features, and class labels $C_1, ..., C_K$, Bayes' theorem gives:

$$
p(C_k | \mathbf{x}) = \frac{p(C_k) p(\mathbf{x} | C_k)}{p(\mathbf{x})}
$$
$$
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} 
$$

In other words, we're interested in the probability that the observation belongs to a certain class.
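
As a small numeric illustration of the formula above, here is a minimal sketch with made-up numbers. The two classes and their prior and likelihood values are hypothetical and chosen only to show the arithmetic.

```python
# Hypothetical numbers for a two-class problem, chosen only for illustration.
prior = {"C1": 0.3, "C2": 0.7}        # p(C_k)
likelihood = {"C1": 0.8, "C2": 0.1}   # p(x | C_k) for one fixed observation x

# Evidence: p(x) = sum over k of p(C_k) * p(x | C_k)
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Posterior: p(C_k | x) = prior * likelihood / evidence
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}
print(posterior)
```

Even though `C2` has the larger prior here, the likelihood pulls the posterior toward `C1`.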

"In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features $x_{i}$ are given, so that the denominator is effectively constant."

But the numerator is equivalent to a joint probability model:

$$
p(C_k, x_1, ..., x_n)
$$

which we expand using the probability chain rule:

$$
\begin{align}
    p(C_k, x_1, ..., x_n) &= p(x_1, ..., x_n, C_k) \\
                          &= p(x_1 | x_2, ..., x_n, C_k) p(x_2, ..., x_n, C_k) \\
                          &= ... \\
                          &= p(x_1 | x_2, ..., x_n, C_k) p(x_2 | x_3 , ..., x_n, C_k) ... p(x_{n-1} | x_n, C_k) p(x_n | C_k) p(C_k)
\end{align}
$$

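Each factor in this expansion conditions a feature on the class and on every feature that comes after it.

The "naive" part is the assumption that the features are conditionally independent given the class, so that $p(x_i | x_{i+1}, ..., x_n, C_k) = p(x_i | C_k)$. The joint model then reduces to the class prior multiplied by the probability of each feature given the class:

$$
p(C_k, x_1, ..., x_n) = p(C_k) \prod_{i=1}^{n} p(x_i | C_k)
$$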

This is just the model; the classifier itself simply takes the argmax over all possible classes:

$$
\hat{y} = \underset{k \in \{1, ..., K\}}{\operatorname{arg\,max}} \; p(C_k) \prod_{i=1}^{n} p(x_i | C_k)
$$

Technically, we're supposed to multiply this by a scaling factor $\frac{1}{Z}$, where $Z = p(\mathbf{x})$ is the probability of the feature vector $\mathbf{x}$ (the evidence). But $Z$ is the same for every class, so it doesn't change the $\operatorname{arg\,max}$, and we just omit it.
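
To see that dropping $\frac{1}{Z}$ is harmless, the following sketch compares the argmax of the unnormalized scores with the argmax of the normalized posteriors. The priors and conditional probabilities are hypothetical values for $K = 3$ classes and $n = 2$ features, not numbers from any dataset in this notebook.

```python
import numpy as np

# Hypothetical values for K = 3 classes and n = 2 features of one observation.
priors = np.array([0.5, 0.3, 0.2])        # p(C_k)
cond_probs = np.array([[0.2, 0.5],        # p(x_i | C_k): row k, column i
                       [0.6, 0.4],
                       [0.9, 0.1]])

# p(C_k) * prod_i p(x_i | C_k) for every class at once
unnormalized = priors * cond_probs.prod(axis=1)

# Under the model, Z = p(x) is the sum of the unnormalized scores
Z = unnormalized.sum()
posterior = unnormalized / Z

y_hat = int(np.argmax(unnormalized))
print(y_hat, int(np.argmax(posterior)))   # both give the same (0-based) class index
```

Dividing every score by the same positive constant rescales them but cannot change which one is largest.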

In [2]:
import numpy as np
import pandas as pd

# Binary example

In [15]:
car_example = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]

df = pd.DataFrame(car_example, columns=["Color", "Type", "Origin", "Stolen"])
df

,Color,Type,Origin,Stolen
0,Red,Sports,Domestic,Yes
1,Red,Sports,Domestic,No
2,Red,Sports,Domestic,Yes
3,Yellow,Sports,Domestic,No
4,Yellow,Sports,Imported,Yes
5,Yellow,SUV,Imported,No
6,Yellow,SUV,Imported,Yes
7,Yellow,SUV,Domestic,No
8,Red,SUV,Imported,No
9,Red,Sports,Imported,Yes


In [16]:
# map dataset to integers
color_map = {"Red":0, "Yellow":1}
type_map = {"Sports":0, "SUV":1}
origin_map = {"Domestic":0, "Imported":1}
# stolen_map = {"No":0, "Yes":1}

In [17]:
# replace each category string with its integer code
df["Color"] = df["Color"].map(color_map)
df["Type"] = df["Type"].map(type_map)
df["Origin"] = df["Origin"].map(origin_map)
# df.Stolen = df.Stolen.apply(lambda x: stolen_map[x])

In [18]:
df

,Color,Type,Origin,Stolen
0,0,0,0,Yes
1,0,0,0,No
2,0,0,0,Yes
3,1,0,0,No
4,1,0,1,Yes
5,1,1,1,No
6,1,1,1,Yes
7,1,1,0,No
8,0,1,1,No
9,0,0,1,Yes


In [14]:
# task: classify whether a new observation (not present in dataset) is stolen
# for example, a Red SUV Domestic
obs = (color_map["Red"], type_map["SUV"], origin_map["Domestic"])
obs

(0, 1, 0)

In [20]:
# for the training dataset, calculate the total observations in each class
df.Stolen.value_counts()

# count how often each feature value occurs within each class
# (a notebook only displays the last expression; wrap these in print() to see each one)
df[df.Stolen == "Yes"].Color.value_counts()
df[df.Stolen == "No"].Color.value_counts()
df[df.Stolen == "Yes"].Type.value_counts()
df[df.Stolen == "No"].Type.value_counts()
df[df.Stolen == "Yes"].Origin.value_counts()
df[df.Stolen == "No"].Origin.value_counts()

# show the rows that belong to the "Yes" class
df[df.Stolen == "Yes"]

,Color,Type,Origin,Stolen
0,0,0,0,Yes
2,0,0,0,Yes
4,1,0,1,Yes
6,1,1,1,Yes
9,0,0,1,Yes
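
The per-class counts can also be turned directly into conditional probabilities. The sketch below is one possible way to do that with `pd.crosstab`. It rebuilds the same ten rows under the name `cars` (a name introduced here) and keeps the original strings instead of the integer codes, which is an assumption made for readability.

```python
import pandas as pd

car_example = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]
cars = pd.DataFrame(car_example, columns=["Color", "Type", "Origin", "Stolen"])

# One table per feature: rows are classes, columns are feature values.
# normalize="index" makes each row sum to 1, so an entry estimates
# p(feature value | class) as a relative frequency.
cond_tables = {
    feature: pd.crosstab(cars["Stolen"], cars[feature], normalize="index")
    for feature in ["Color", "Type", "Origin"]
}

for feature, table in cond_tables.items():
    print(table, end="\n\n")
```

For example, `cond_tables["Color"].loc["Yes", "Red"]` is the fraction of stolen cars that are red.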


In [25]:
df.Stolen.value_counts()

Yes    5
No     5
Name: Stolen, dtype: int64
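
Putting the pieces together, here is a minimal sketch of how the Red SUV Domestic observation could be scored with the argmax rule. It is one possible implementation, not necessarily how the rest of this notebook was meant to continue. The names `cars`, `scores`, and `prediction` are introduced here, and the string categories are used directly instead of the integer codes.

```python
import pandas as pd

car_example = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]
cars = pd.DataFrame(car_example, columns=["Color", "Type", "Origin", "Stolen"])

# The new observation to classify
obs = {"Color": "Red", "Type": "SUV", "Origin": "Domestic"}

scores = {}
for label, group in cars.groupby("Stolen"):
    # prior p(C_k): fraction of training rows in this class
    score = len(group) / len(cars)
    for feature, value in obs.items():
        # p(x_i | C_k): fraction of rows in this class with the observed value
        score *= (group[feature] == value).mean()
    scores[label] = score

# the classifier picks the class with the largest unnormalized score
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

With these counts, the "No" score is $0.5 \times 0.4 \times 0.6 \times 0.6 = 0.072$ and the "Yes" score is $0.5 \times 0.6 \times 0.2 \times 0.4 = 0.024$, so the observation is classified as not stolen.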

# Continuous example

Essentially, instead of counting and dividing, calculate the mean and standard deviation of each column, split by class. Then calculate the likelihood of each feature value by plugging it into a Gaussian with those parameters.
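
A minimal sketch of that idea follows. The toy data, the column names `height`, `weight`, and `label`, and the helper `gaussian_pdf` are all hypothetical and exist only to illustrate the steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two continuous features and a binary label.
data = pd.DataFrame({
    "height": [6.0, 5.9, 5.6, 5.0, 5.1, 5.4],
    "weight": [180.0, 190.0, 170.0, 100.0, 120.0, 130.0],
    "label": ["a", "a", "a", "b", "b", "b"],
})

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

grouped = data.groupby("label")
means = grouped.mean()   # per-class mean of each feature
stds = grouped.std()     # per-class sample standard deviation of each feature
priors = data["label"].value_counts(normalize=True)   # p(C_k)

# Hypothetical new observation
obs = pd.Series({"height": 5.8, "weight": 175.0})

# p(C_k) * prod_i N(x_i; mean_ik, std_ik) for each class
scores = {
    label: priors[label] * gaussian_pdf(obs, means.loc[label], stds.loc[label]).prod()
    for label in means.index
}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The structure is the same as in the binary example: only the estimate of $p(x_i | C_k)$ changes, from a relative frequency to a Gaussian density.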