# Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms
based on applying Bayes’ theorem with the “naive” assumption of
conditional independence between every pair of features given the
value of the class variable. Bayes’ theorem states the following
relationship, given class variable (the target) $y$ and dependent feature
vector (the inputs) $x_1$ through $x_n$:
$$ P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}
                                 {P(x_1, \dots, x_n)} $$
                                 
Using the naive conditional independence assumption that
$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y),$$
for all $i$, this relationship is simplified to
$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}
                                 {P(x_1, \dots, x_n)}$$
Since $P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:

$$ \begin{aligned}
P(y \mid x_1, \dots, x_n) &\propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \\
&\Downarrow \\
\hat{y} &= \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)
\end{aligned} $$
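
The classification rule can be illustrated with a small numeric sketch. The priors and per-feature probabilities below are made-up values for a hypothetical two-class problem with three binary features; they are not taken from any dataset in this notebook. The computation is done in log space, which is a common way to avoid underflow when many small probabilities are multiplied.

```python
import numpy as np

# Hypothetical toy model (assumed values): 2 classes, 3 binary features.
prior = np.array([0.6, 0.4])                 # P(y)
p_feature = np.array([[0.8, 0.1, 0.5],       # P(x_i = 1 | y = 0)
                      [0.3, 0.7, 0.5]])      # P(x_i = 1 | y = 1)

x = np.array([1, 0, 1])                      # sample to classify

# P(x_i | y) for the observed value of each feature, shape (n_classes, n_features)
likelihood = np.where(x == 1, p_feature, 1.0 - p_feature)

# log P(y) + sum_i log P(x_i | y): the log of the numerator in Bayes' theorem
log_joint = np.log(prior) + np.log(likelihood).sum(axis=1)

# The arg max does not need the denominator P(x_1, ..., x_n)
y_hat = int(np.argmax(log_joint))

# Dividing by P(x_1, ..., x_n) (a log-sum-exp in log space) gives the posterior
posterior = np.exp(log_joint - np.logaddexp.reduce(log_joint))

print(y_hat, posterior)
```

Here the unnormalized scores are 0.6 · 0.8 · 0.9 · 0.5 = 0.216 for class 0 and 0.4 · 0.3 · 0.3 · 0.5 = 0.018 for class 1, so class 0 is chosen.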


**Depending on the distribution assumed for the likelihood $P(x_i \mid y)$, we obtain the Gaussian, Multinomial, and other naive Bayes variants.**


In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data it does, see the references below.)

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.

On the flip side, although naive Bayes is known as a decent classifier, it is known to be a poor probability estimator, so the probability outputs from predict_proba should not be taken too seriously.

-----

## 1. Gaussian Naive Bayes

GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:  
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$
The parameters $\sigma_y$ and $ \mu_y $ are estimated using maximum likelihood.



In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
# y_pred = gnb.fit(X_train, y_train).predict(X_test)
gnbmodel = gnb.fit(X_train, y_train)
y_pred = gnbmodel.predict(X_test)

# fit vs. partial_fit -> use partial_fit when the data is too large to be processed in memory at once

print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 75 points : 4


In [2]:
help(GaussianNB)

Help on class GaussianNB in module sklearn.naive_bayes:

class GaussianNB(_BaseNB)
 |  GaussianNB(*, priors=None, var_smoothing=1e-09)
 |  
 |  Gaussian Naive Bayes (GaussianNB)
 |  
 |  Can perform online updates to model parameters via :meth:`partial_fit`.
 |  For details on algorithm used to update feature means and variance online,
 |  see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:
 |  
 |      http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
 |  
 |  Read more in the :ref:`User Guide <gaussian_naive_bayes>`.
 |  
 |  Parameters
 |  ----------
 |  priors : array-like of shape (n_classes,)
 |      Prior probabilities of the classes. If specified the priors are not
 |      adjusted according to the data.
 |  
 |  var_smoothing : float, default=1e-9
 |      Portion of the largest variance of all features that is added to
 |      variances for calculation stability.
 |  
 |      .. versionadded:: 0.20
 |  
 |  Attributes
 |  ----------
 |  c

In [3]:
# train 
import numpy as np
X_train, y_train
unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))
dict(zip(unique, counts/counts.sum()))

{0: 0.38666666666666666, 1: 0.26666666666666666, 2: 0.3466666666666667}

In [4]:
gnb.class_count_, dict(zip(unique, counts))

(array([29., 20., 26.]), {0: 29, 1: 20, 2: 26})

In [5]:
# class_prior_: prior probability of each class
gnb.class_prior_, dict(zip(unique, counts/counts.sum()))

(array([0.38666667, 0.26666667, 0.34666667]),
 {0: 0.38666666666666666, 1: 0.26666666666666666, 2: 0.3466666666666667})

In [6]:
# variance of each feature per class (n_classes, n_features)
# note: this attribute was named sigma_ before scikit-learn 1.0 (sigma_ was removed in 1.2)
gnb.var_

# stdev of each feature per class (n_classes, n_features)
gnb.var_**(1/2)

array([[0.32126386, 0.36342931, 0.12763281, 0.0920115 ],
       [0.50623611, 0.28792361, 0.50524747, 0.21447611],
       [0.62345668, 0.31855815, 0.55949312, 0.21825018]])

In [7]:
# mean of each feature per class (n_classes, n_features)
gnb.theta_

array([[4.97586207, 3.35862069, 1.44827586, 0.23448276],
       [5.935     , 2.71      , 4.185     , 1.3       ],
       [6.77692308, 3.09230769, 5.73461538, 2.10769231]])

In [8]:
one_test = X_test[7]
print(one_test)

gnbmodel.predict([one_test])

[6.7 3.1 4.7 1.5]


array([1])

##### Walking through how the label with the highest probability is found
> Reference: https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

In [9]:
# Original from-scratch implementation
import pandas as pd
from math import sqrt, pi
exdf = pd.DataFrame(X_train)
exdf['target'] = y_train
exdf.head()

exdataset = exdf.values
# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
	separated = dict()
	for i in range(len(dataset)):
		vector = dataset[i]
		class_value = vector[-1]
		if (class_value not in separated):
			separated[class_value] = list()
		separated[class_value].append(vector)
	return separated
 
# Calculate the mean of a list of numbers
def mean(numbers):
	return sum(numbers)/float(len(numbers))
 
# Calculate the standard deviation of a list of numbers
def stdev(numbers):
	avg = mean(numbers)
	variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
	return sqrt(variance)
 
# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
	summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
	del(summaries[-1])
	return summaries
 
# Split dataset by class then calculate statistics for each column
def summarize_by_class(dataset):
	separated = separate_by_class(dataset)
	summaries = dict()
	for class_value, rows in separated.items():
		summaries[class_value] = summarize_dataset(rows)
	return summaries
 
summaries = summarize_by_class(exdataset)
summaries

# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
	exponent = np.exp(-((x-mean)**2 / (2 * stdev**2 )))
	return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
	total_rows = sum([summaries[label][0][2] for label in summaries])
	probabilities = dict()
	for class_value, class_summaries in summaries.items():
		probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
		for i in range(len(class_summaries)):
			mean, stdev, _ = class_summaries[i]
			probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
	return probabilities

x = one_test
probabilities = calculate_class_probabilities(summaries, x)
print(probabilities)

{0.0: 8.516100787267677e-182, 2.0: 0.0015442142261661732, 1.0: 0.022042527983872816}


In [10]:
gnb.classes_
stdev = gnb.var_ ** (1/2)  # var_ was named sigma_ before scikit-learn 1.0
mean = gnb.theta_
x = one_test

summaries = dict()
for i in range(stdev.shape[0]):
    summaries_class = []
    for j in range(stdev.shape[1]):
        summaries_class.append((mean[i,j], stdev[i,j], gnb.class_count_[i]))
    summaries[gnb.classes_[i]] = summaries_class

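# The standard deviations differ slightly from the hand-computed summaries above:
# GaussianNB uses the population variance (divide by n) plus var_smoothing,
# whereas stdev() above divides by n - 1.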
summaries
    
# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
	exponent = np.exp(-((x-mean)**2 / (2 * stdev**2 )))
	return (1 / (sqrt(2 * pi) * stdev)) * exponent

calculate_probability(x , mean, stdev)

# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
	total_rows = sum([summaries[label][0][2] for label in summaries])
	probabilities = dict()
	for class_value, class_summaries in summaries.items():
		probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
		for i in range(len(class_summaries)):
			mean, stdev, _ = class_summaries[i]
			probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
	return probabilities

calculate_class_probabilities(summaries, x)

{0: 2.9146425371463955e-188, 1: 0.021007691878018386, 2: 0.0013468926536311186}
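
As a cross-check on the step-by-step computation, the same quantities can be computed for the whole test set at once and compared with predict_proba. The following is a minimal sketch, assuming that the fitted means and variances (theta_ and var_, the latter called sigma_ in older scikit-learn releases) already include the var_smoothing term; the names `manual_proba` and `manual_pred` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)

# Per-class feature variances: var_ in scikit-learn >= 1.0, sigma_ in older releases
var = gnb.var_ if hasattr(gnb, "var_") else gnb.sigma_
mean = gnb.theta_

# Sum over features of log N(x_i; mu_yi, sigma_yi^2), shape (n_samples, n_classes)
diff = X_test[:, None, :] - mean[None, :, :]
log_lik = -0.5 * (np.log(2.0 * np.pi * var)[None, :, :] + diff**2 / var[None, :, :]).sum(axis=2)

# Add the log prior to get the log of the unnormalized posterior
log_joint = np.log(gnb.class_prior_)[None, :] + log_lik

# Normalize each row in log space, then exponentiate to get posteriors
log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
manual_proba = np.exp(log_post)
manual_pred = gnb.classes_[np.argmax(log_joint, axis=1)]

print(np.allclose(manual_proba, gnb.predict_proba(X_test)))
print((manual_pred == gnb.predict(X_test)).all())
```

Working with log probabilities keeps values such as the class-0 score of about 1e-188 seen above from underflowing when more features are multiplied together.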

## 2. Multinomial Naive Bayes

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice).  

The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$,  
where $n$ is the number of features (in text classification, the size of the vocabulary) and  
$\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.  
  
The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:
$$ \hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha n} $$


where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in the samples of class $y$ in the training set $T$,  
and $N_{y} = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.  
The smoothing prior $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
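
The smoothed estimate can be reproduced by hand on a tiny count matrix. This is a sketch under stated assumptions: the 4 × 3 matrix below is made-up data, and the comparison relies on feature_log_prob_ holding the logarithm of $\hat{\theta}_{yi}$.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents, vocabulary of 3 terms
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 4],
              [0, 2, 3]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing

n = X.shape[1]
theta = np.empty((2, n))
for c in (0, 1):
    N_yi = X[y == c].sum(axis=0)   # count of each feature within class c
    N_y = N_yi.sum()               # total count of all features within class c
    theta[c] = (N_yi + alpha) / (N_y + alpha * n)

mnb = MultinomialNB(alpha=alpha).fit(X, y)
print(theta)
print(np.allclose(theta, np.exp(mnb.feature_log_prob_)))
```

The third term never occurs in class 0, yet its estimate is (0 + 1) / (6 + 3) = 1/9 rather than zero, which is the effect of the smoothing term.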

In [11]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

mnb = MultinomialNB(alpha = 1)
mnbmodel = mnb.fit(X_train, y_train)
y_pred = mnbmodel.predict(X_test)

# fit vs. partial_fit -> use partial_fit when the data is too large to be processed in memory at once
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 75 points : 30


In [12]:
help(MultinomialNB())

Help on MultinomialNB in module sklearn.naive_bayes object:

class MultinomialNB(_BaseDiscreteNB)
 |  MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
 |  
 |  Naive Bayes classifier for multinomial models
 |  
 |  The multinomial Naive Bayes classifier is suitable for classification with
 |  discrete features (e.g., word counts for text classification). The
 |  multinomial distribution normally requires integer feature counts. However,
 |  in practice, fractional counts such as tf-idf may also work.
 |  
 |  Read more in the :ref:`User Guide <multinomial_naive_bayes>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : float, default=1.0
 |      Additive (Laplace/Lidstone) smoothing parameter
 |      (0 for no smoothing).
 |  
 |  fit_prior : bool, default=True
 |      Whether to learn class prior probabilities or not.
 |      If false, a uniform prior will be used.
 |  
 |  class_prior : array-like of shape (n_classes,), default=None
 |      Prior probabilities of the classe

> Reference: https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b

In [27]:
import pandas as pd

columns = ['sent', 'class']
rows = []

rows = [['This is my book', 'stmt'], 
        ['They are novels', 'stmt'],
        ['have you read this book', 'question'],
        ['who is the author', 'question'],
        ['what are the characters', 'question'],
        ['This is how I bought the book', 'stmt'],
        ['I like fictions', 'stmt'],
        ['what is your favorite book', 'question']]

training_data = pd.DataFrame(rows, columns=columns)
training_data

|   | sent | class |
|---|------|-------|
| 0 | This is my book | stmt |
| 1 | They are novels | stmt |
| 2 | have you read this book | question |
| 3 | who is the author | question |
| 4 | what are the characters | question |
| 5 | This is how I bought the book | stmt |
| 6 | I like fictions | stmt |
| 7 | what is your favorite book | question |


In [28]:
from sklearn.feature_extraction.text import CountVectorizer

vec_s = CountVectorizer()
X_s = vec_s.fit_transform(training_data['sent'].values)

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

stmt_docs = [row['sent'] for index,row in training_data.iterrows() if row['class'] == 'stmt']
stmt_docs

# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
tdm = pd.DataFrame(X_s.toarray(), columns=vec_s.get_feature_names_out())
tdm

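|   | are | author | book | bought | characters | favorite | fictions | have | how | is | ... | my | novels | read | the | they | this | what | who | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |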


In [30]:
X = tdm.values
y = training_data['class']

from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X, y)

MultinomialNB()

In [31]:
training_data.iloc[5, :]

sent     This is how I bought the book
class                             stmt
Name: 5, dtype: object

In [32]:
mnb.predict([X[5]])

array(['stmt'], dtype='<U8')

In [39]:
test = "what is that"
test_s = vec_s.transform([test])
test_s.toarray()
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
tdf_test2 = pd.DataFrame(test_s.toarray(), columns=vec_s.get_feature_names_out())
tdf_test2

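|   | are | author | book | bought | characters | favorite | fictions | have | how | is | ... | my | novels | read | the | they | this | what | who | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |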


In [40]:
mnb.predict(tdf_test2)

array(['question'], dtype='<U8')

## 3. Complement Naive Bayes

ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. Specifically, CNB uses statistics from the complement of each class to compute the model’s weights. The inventors of CNB show empirically that the parameter estimates for CNB are more stable than those for MNB. Further, CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks. The procedure for calculating the weights is as follows:  
$$ \begin{aligned}
\hat{\theta}_{ci} &= \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}} \\
w_{ci} &= \log \hat{\theta}_{ci} \\
w_{ci} &= \frac{w_{ci}}{\sum_{j} |w_{cj}|}
\end{aligned} $$
where the summations are over all documents $j$ not in class $c$, $d_{ij}$ is either the count or tf-idf value of term $i$ in document $j$, $\alpha_i$ is a smoothing hyperparameter like that found in MNB, and $\alpha = \sum_{i} \alpha_i$. The second normalization addresses the tendency for longer documents to dominate parameter estimates in MNB. The classification rule is:
$$\hat{c} = \arg\min_c \sum_{i} t_i w_{ci}$$
where $t_i$ is the count or tf-idf value of term $i$ in the document being classified; i.e., a document is assigned to the class that is the poorest complement match.
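
The weight computation and the arg-min rule can be sketched directly in NumPy. The count matrix below is made-up data with three classes, and the same smoothing value is assumed for every term ($\alpha_i = 1$). The optional second normalization of the weights is skipped here, which is assumed to correspond to ComplementNB's default setting (norm=False); the names `w` and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Hypothetical term-count matrix: 6 documents, 3 terms, 3 classes
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 3, 1],
              [0, 2, 2],
              [1, 0, 3],
              [0, 1, 4]])
y = np.array([0, 0, 1, 1, 2, 2])
alpha_i = 1.0                      # per-term smoothing, so alpha = alpha_i * n_terms
classes = np.unique(y)

w = np.empty((len(classes), X.shape[1]))
for idx, c in enumerate(classes):
    comp = X[y != c].sum(axis=0)   # term counts over all documents NOT in class c
    theta_c = (alpha_i + comp) / (alpha_i * X.shape[1] + comp.sum())
    w[idx] = np.log(theta_c)       # weights without the second normalization

# Assign each document to the class whose complement matches it worst
manual_pred = classes[np.argmin(X @ w.T, axis=1)]

cnb = ComplementNB(alpha=alpha_i).fit(X, y)
print(manual_pred, cnb.predict(X))
```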

In [19]:
X = tdm.values
y = training_data['class']

from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()
cnb.fit(X, y)

ComplementNB()

In [20]:
cnb.predict([X[5]])

array(['stmt'], dtype='<U8')

## 4. Bernoulli Naive Bayes

BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).

The decision rule for Bernoulli naive Bayes is based on
$$P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i)$$

which differs from multinomial NB’s rule in that it explicitly penalizes the non-occurrence of a feature $i$ that is an indicator for class $y$, where the multinomial variant would simply ignore a non-occurring feature.

In the case of text classification, **word occurrence vectors (rather than word count vectors)** may be used to train and use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents. It is advisable to evaluate both models, if time permits.
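
To make the penalty for non-occurring features concrete, the decision rule can be evaluated by hand and compared with predict_log_proba. This is a minimal sketch on a made-up binary matrix; it assumes that feature_log_prob_ stores $\log P(i \mid y)$ and class_log_prior_ stores $\log P(y)$.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary occurrence matrix: 6 samples, 4 features
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

bnb = BernoulliNB().fit(X, y)

log_p = bnb.feature_log_prob_           # log P(i | y), shape (n_classes, n_features)
log_not_p = np.log1p(-np.exp(log_p))    # log (1 - P(i | y))

# Present features contribute log P(i | y); absent ones contribute log (1 - P(i | y))
log_joint = bnb.class_log_prior_ + X @ log_p.T + (1 - X) @ log_not_p.T

# Normalize each row in log space to obtain log posteriors
log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)

print(np.allclose(log_post, bnb.predict_log_proba(X)))
```

The `(1 - X) @ log_not_p.T` term is the part that a multinomial model does not have: every absent feature lowers the score of classes in which that feature is usually present.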

In [62]:
training_data

|   | sent | class |
|---|------|-------|
| 0 | This is my book | stmt |
| 1 | They are novels | stmt |
| 2 | have you read this book | question |
| 3 | who is the author | question |
| 4 | what are the characters | question |
| 5 | This is how I bought the book | stmt |
| 6 | I like fictions | stmt |
| 7 | what is your favorite book | question |


In [67]:
# word occurrence vector
from sklearn.feature_extraction.text import CountVectorizer

vec_s = CountVectorizer()
X_s = vec_s.fit_transform(training_data['sent'].values)

wov = X_s.toarray()
wov[wov>1] = 1
wov

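# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
tdf_test = pd.DataFrame(wov, columns=vec_s.get_feature_names_out())
tdf_test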

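|   | are | author | book | bought | characters | favorite | fictions | have | how | is | ... | my | novels | read | the | they | this | what | who | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |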


In [68]:
X = tdf_test.values
y = training_data['class']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(X, y)

BernoulliNB()

In [69]:
print(test, "\n", bnb.predict(tdf_test2))

what is that 
 ['question']


## 5. Categorical Naive Bayes

CategoricalNB implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, which is described by the index $i$, has its own categorical distribution.

For each feature $i$ in the training set $X$, CategoricalNB estimates a categorical distribution conditioned on the class $y$. The index set of the samples is defined as $J = \{ 1, \dots, m \}$, with $m$ as the number of samples.

The probability of category $t$ in feature $i$ given class $c$ is estimated as:
$$P(x_i = t \mid y = c \: ;\, \alpha) = \frac{ N_{tic} + \alpha}{N_{c} +
                                       \alpha n_i},$$
where $N_{tic} = |\{j \in J \mid x_{ij} = t, y_j = c\}|$ is the number of times category $t$ appears in feature $i$ among the samples that belong to class $c$,  
$N_{c} = |\{ j \in J\mid y_j = c\}|$ is the number of samples with class $c$, $\alpha$ is a smoothing parameter and $n_i$ is the number of available categories of feature $i$.  
  
CategoricalNB assumes that the sample matrix $X$ is encoded (for instance with the help of OrdinalEncoder) such that all categories for each feature $i$ are represented with numbers $0, ..., n_i - 1$ where $n_i$ is the number of available categories of feature $i$.
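
The section above has no worked example, so here is a minimal sketch. The weather/wind table is made-up data, and the variable names are illustrative. It encodes string categories with OrdinalEncoder, fits CategoricalNB, and recomputes the smoothed estimate for the first feature by hand, assuming feature_log_prob_[i] holds the log of the estimated probabilities for feature $i$ with shape (n_classes, n_i).

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical data: columns are (weather, wind)
X_raw = np.array([["sunny", "weak"], ["sunny", "strong"],
                  ["rainy", "weak"], ["rainy", "strong"],
                  ["overcast", "weak"], ["overcast", "strong"],
                  ["sunny", "weak"], ["rainy", "weak"]])
y = np.array(["no", "no", "yes", "no", "yes", "yes", "yes", "yes"])

# Encode each feature's categories as integers 0, ..., n_i - 1
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw).astype(int)

alpha = 1.0
clf = CategoricalNB(alpha=alpha).fit(X, y)

# Recompute P(x_i = t | y = c) for feature i = 0 from the counting formula
i = 0
n_i = len(enc.categories_[i])
manual = np.empty((len(clf.classes_), n_i))
for ci, c in enumerate(clf.classes_):
    N_c = np.sum(y == c)                            # samples in class c
    for t in range(n_i):
        N_tic = np.sum((X[:, i] == t) & (y == c))   # category t of feature i in class c
        manual[ci, t] = (N_tic + alpha) / (N_c + alpha * n_i)

print(np.allclose(manual, np.exp(clf.feature_log_prob_[i])))
print(clf.predict(enc.transform([["overcast", "weak"]]).astype(int)))
```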

## 6. Out-of-core naive Bayes model fitting

Naive Bayes models can be used to tackle large scale classification problems for which the full training set might not fit in memory. To handle this case, MultinomialNB, BernoulliNB, and GaussianNB expose a partial_fit method that can be used incrementally, as with other classifiers (see the scikit-learn example "Out-of-core classification of text documents"). All naive Bayes classifiers support sample weighting.

Contrary to the fit method, the first call to partial_fit needs to be passed the list of all the expected class labels.
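
A minimal sketch of incremental fitting follows. The count matrix is randomly generated stand-in data and the batch size of 100 is an arbitrary choice; in a real out-of-core setting each batch would be read from disk or a stream instead of sliced from an in-memory array.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Stand-in data (assumed): 300 samples of 10 count features, 3 classes
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 10))
y = rng.integers(0, 3, size=300)
classes = np.unique(y)   # every label that can ever appear

inc = MultinomialNB()
for start in range(0, len(X), 100):
    batch = slice(start, start + 100)
    if start == 0:
        # The first call must be told the full list of class labels
        inc.partial_fit(X[batch], y[batch], classes=classes)
    else:
        inc.partial_fit(X[batch], y[batch])

# Reference model trained on everything at once
full = MultinomialNB().fit(X, y)

print(np.array_equal(inc.class_count_, full.class_count_))
print(np.allclose(inc.feature_log_prob_, full.feature_log_prob_))
```

Because MultinomialNB only accumulates per-class counts, the incrementally fitted model ends up with the same parameters as the one fitted on the full data.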

For an overview of available strategies in scikit-learn, see also the out-of-core learning documentation.