<a href="https://colab.research.google.com/github/ickma2311/mycolab/blob/main/Statistics_Learning/Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np

Prior Probability: $\mathbb{P}(Y=c_k),\ k=1,2,\dots,K$.

Conditional Probability (Likelihood):
$$ \mathbb{P}(X=x \mid Y=c_k) = \mathbb{P}(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k),\ k=1,2,\dots,K
$$

Under the conditional independence assumption, the features are independent of each other given the class:

$$ \mathbb {P} (X=x \mid Y=c_k)= \prod_{j=1}^n \mathbb {P}(X^{(j)}=x^{(j)} \mid Y=c_k)
$$
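
To see why this factorization matters, the sketch below counts how many probabilities have to be estimated with and without the assumption. The sizes (2 classes, 30 features, 5 possible values each) are illustrative assumptions.

```python
from math import prod

# Illustrative sizes (assumed): K classes, n features, S_j values per feature.
n_classes = 2
values_per_feature = [5] * 30

# Without the assumption: one probability per joint configuration per class.
full_joint_cells = n_classes * prod(values_per_feature)

# With conditional independence: one small table per feature per class.
factorized_cells = n_classes * sum(values_per_feature)

print(full_joint_cells, factorized_cells)
```

The factorized model needs only a few hundred estimates, while the full joint table grows exponentially with the number of features.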

Posterior Probability:
$$
\mathbb{P}(Y=c_k \mid X=x) = \frac{\mathbb{P}(X=x \mid Y=c_k)\, \mathbb{P}(Y=c_k)}
 {\sum_{k'} \mathbb{P}(X=x \mid Y=c_{k'})\, \mathbb{P}(Y=c_{k'})}
$$

$$
= \frac{\mathbb{P}(Y=c_k) \cdot \prod_j \mathbb{P}(X^{(j)}=x^{(j)} \mid Y=c_k)}{\mathbb{P}(X=x)}
$$
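
A small numeric check of the posterior formula, as a sketch. The class names, priors, and per-feature conditionals are made-up numbers for illustration only.

```python
from math import prod

# Made-up priors P(Y=c_k) and per-feature conditionals P(X^(j)=x^(j) | Y=c_k)
# for one observed sample x with two features.
priors = {"c1": 0.6, "c2": 0.4}
conditionals = {"c1": [0.5, 0.2], "c2": [0.1, 0.9]}

# Numerator of the posterior: prior times the product of conditionals.
joint = {c: priors[c] * prod(conditionals[c]) for c in priors}

# Denominator P(X=x): the sum of the numerators over all classes.
evidence = sum(joint.values())

posterior = {c: joint[c] / evidence for c in joint}
print(posterior)
```

Because the denominator is the sum of all numerators, the posteriors add up to 1.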

Naive Bayes applies Bayes' theorem:
$$
P(A|B)=\frac {P(B|A) \cdot P(A)} {P(B)}
$$

The denominator is the same for every class $c_k$, so it can be dropped when choosing the most probable class:
$$
y = \arg\max_{c_k} \mathbb{P}(Y=c_k) \cdot \prod_j \mathbb{P}(X^{(j)}=x^{(j)} \mid Y=c_k)
$$
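
In practice the product is usually evaluated as a sum of logarithms, since the logarithm is monotonic and leaves the arg max unchanged. The sketch below uses made-up numbers to show the decision rule in log space and why a long product of small probabilities is risky in floating point.

```python
from math import log, prod

# Made-up priors and per-feature conditionals for one sample.
priors = {"c1": 0.6, "c2": 0.4}
conditionals = {"c1": [0.5, 0.2], "c2": [0.1, 0.9]}


def log_score(c):
    # log P(Y=c) + sum_j log P(X^(j)=x^(j) | Y=c); the denominator is omitted.
    return log(priors[c]) + sum(log(p) for p in conditionals[c])


prediction = max(priors, key=log_score)
print(prediction)

# Many small factors underflow to 0.0 as a product but stay finite as a log sum.
tiny = [0.01] * 200
underflowed = prod(tiny)
log_sum = sum(log(p) for p in tiny)
print(underflowed, log_sum)
```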

In [6]:
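from collections import defaultdict
import numpy as np


class NaiveBayes:
    """Naive Bayes over discrete values with Laplace smoothing.

    Values are pooled across feature positions, as in a bag-of-words model.
    """

    def __init__(self):
        self.class_prob = defaultdict(float)  # prior P(Y=c_k)
        self.feature_prob = defaultdict(lambda: defaultdict(float))  # smoothed score per (class, value)
        self.classes = set()

        self.vocab = set()  # all distinct values seen in training
        self.class_count = defaultdict(int)  # number of samples per class
        self.feature_count = defaultdict(lambda: defaultdict(int))  # value counts per class

    def fix(self, X, y):
        """Estimate priors and smoothed conditionals from samples X and labels y."""
        self.classes = set(y)

        for feature, label in zip(X, y):
            self.class_count[label] += 1
            for value in feature:
                self.feature_count[label][value] += 1
                self.vocab.add(value)

        for label in self.classes:
            self.class_prob[label] = self.class_count[label] / len(y)
            # Loop over the whole vocabulary so values seen only in other
            # classes also get a smoothed, non-zero score.
            for value in self.vocab:
                # Laplace smoothing. Note: the denominator uses the class sample
                # count, so the result is a smoothed score that is not
                # normalized over values.
                self.feature_prob[label][value] = (self.feature_count[label][value] + 1) / (self.class_count[label] + len(self.vocab))

    def predict(self, X):
        """Return the class with the highest log score for a single sample X."""
        prob = {}
        for label in self.classes:
            # Sum logs instead of multiplying probabilities to avoid underflow.
            prob[label] = np.log(self.class_prob[label])
            # Smoothed fallback for values never seen during training.
            unseen = 1 / (self.class_count[label] + len(self.vocab))
            for value in X:
                prob[label] += np.log(self.feature_prob[label].get(value, unseen))
        return max(prob, key=prob.get)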
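
The class above pools values across feature positions, which suits bag-of-words input. For tabular data, where the same value can mean different things in different columns, one possible variant keeps a separate count table per feature index $j$, matching $\mathbb{P}(X^{(j)}=x^{(j)} \mid Y=c_k)$ more directly. The sketch below is only an illustration: the class name `PositionalNaiveBayes` and the `alpha` parameter are hypothetical and not part of the notebook.

```python
from collections import defaultdict
from math import log


class PositionalNaiveBayes:
    """Hypothetical sketch: categorical naive Bayes with one table per feature."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed default)

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.n_samples = len(y)
        n_features = len(X[0])
        self.class_count = defaultdict(int)
        # counts[label][j][value] = times feature j took `value` in class `label`
        self.counts = {c: [defaultdict(int) for _ in range(n_features)]
                       for c in self.classes}
        # values[j] = distinct values observed for feature j
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(X, y):
            self.class_count[label] += 1
            for j, value in enumerate(row):
                self.counts[label][j][value] += 1
                self.values[j].add(value)
        return self

    def predict(self, x):
        def log_score(c):
            score = log(self.class_count[c] / self.n_samples)
            for j, value in enumerate(x):
                num = self.counts[c][j].get(value, 0) + self.alpha
                den = self.class_count[c] + self.alpha * len(self.values[j])
                score += log(num / den)
            return score

        return max(self.classes, key=log_score)


# Tiny made-up dataset: the first feature separates the two classes.
X_toy = [[0, 1], [0, 0], [1, 1], [1, 0]]
y_toy = [0, 0, 1, 1]
model = PositionalNaiveBayes().fit(X_toy, y_toy)
print(model.predict([0, 1]), model.predict([1, 0]))
```

Here each feature contributes exactly one count per sample, so the smoothed estimates for a feature sum to 1 within each class.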





In [1]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X_raw, y = data.data, data.target

In [7]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X_raw, y = data.data, data.target

# Discretize features into 5 bins (convert to integers)
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = discretizer.fit_transform(X_raw).astype(int)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X_binned, y, test_size=0.2, random_state=42)

# Convert to list-of-lists for the NaiveBayes class
X_train_list = [list(row) for row in X_train]
X_test_list = [list(row) for row in X_test]

# Train the classifier
nb = NaiveBayes()
nb.fix(X_train_list, y_train)

# Predict
y_pred = [nb.predict(x) for x in X_test_list]

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9210526315789473


# Normal Distribution PDF (Probability Density Function)
$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
$$
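
A minimal sketch of this density in plain Python. The function name `gaussian_pdf` and the sample values are assumptions for illustration. For continuous features, a density like this is one common choice for the per-feature conditional, with $\mu$ and $\sigma^2$ estimated per class.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2 (sigma2 > 0)."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)


# Estimate mu and sigma^2 from made-up values of one feature within one class.
values = [1.0, 2.0, 3.0, 4.0]
mu = sum(values) / len(values)
sigma2 = sum((v - mu) ** 2 for v in values) / len(values)

print(mu, sigma2, gaussian_pdf(2.5, mu, sigma2))
```

Note that a density can exceed 1 when the variance is small, so it is a relative score rather than a probability.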

In [None]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

In [None]:
X

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])