## Naive Bayes
#### Lew Sears

Naive Bayes is built on Bayes' rule, a simple and powerful result that has itself spawned many branches of mathematical and statistical research. The general principle is stated mathematically as: 
$$ P(A | B) = \frac{P( A\cap B)}{P(B)} = \frac{P(B | A)\cdot P(A)}{P(B)}. $$
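As a quick numerical check of the rule, the sketch below builds a small joint distribution over two binary events and computes $P(A | B)$ once directly from the joint probability and once in the equivalent form $P(B | A)\cdot P(A) / P(B)$. The table values are made up purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(A, B); rows index A (0/1), columns index B (0/1).
joint = np.array([[0.40, 0.10],
                  [0.20, 0.30]])

p_a = joint[1, :].sum()          # P(A=1)
p_b = joint[:, 1].sum()          # P(B=1)
p_a_and_b = joint[1, 1]          # P(A=1, B=1)
p_b_given_a = p_a_and_b / p_a    # P(B=1 | A=1)

# Conditional probability straight from the joint, and the same quantity via Bayes' rule.
direct = p_a_and_b / p_b
via_bayes = p_b_given_a * p_a / p_b

print(direct, via_bayes)
```

Both expressions agree up to floating-point rounding, which is all the rule claims.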
The idea, simply, is that we use a training set to estimate how common each class of the target variable is and how the feature values are distributed within each class. Then we can classify a new data point by computing the probability that it belongs to each class and choosing the class with the highest probability. Rigorously, for data with $n$ features, where $\vec{x}$ represents a vector of observed feature values, the probability of a data point being in class $A_{j} \in \chi$, with $\chi = \{A_1,\ldots,A_k\}$ being all classes in the target, is calculated as:
$$ P(A_{j} | \vec{x}) = \frac{P(A_{j})\cdot P(\vec{x} | A_{j})}{P(\vec{x})}$$
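The denominator $P(\vec{x})$ is the same for every class, so it doesn't help differentiate between classes and can be disregarded. What remains is the numerator, which is the joint probability $P(A_{j}, \vec{x})$. Writing $\vec{x} = (x_1,\ldots, x_n)$ and applying the chain rule for conditional probabilities: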
$$ P(A_{j},x_1,\ldots,x_n) = P(x_{1} | x_2,\ldots,x_n,A_j)\cdots P(x_{n-2} | x_{n-1}, x_n, A_{j}) \cdot P(x_{n-1} | x_n, A_{j})\cdot P(x_n | A_j)\cdot P(A_{j}).$$
If we could calculate this probability for every class $A_{j}$, then we could simply pick the class that maximizes it and easily classify $\vec{x}$! Unfortunately, this is extremely computationally heavy, and the relevant conditional probabilities often cannot be estimated at all because the training data contains few or no examples of a given combination of feature values. This is where we incorporate the *naive* element of Naive Bayes. We assume the features are conditionally independent given the class, so each factor in the product simplifies to 
$$P(x_{i} | x_{i+1},\ldots, x_{n}, A_{j}) \approx P(x_i | A_{j}).$$
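To get a feel for how much the naive assumption saves, consider the simplified case of $n$ binary features (an assumption made here only for illustration). An unrestricted table for $P(x_1,\ldots,x_n | A_j)$ needs $2^n - 1$ free parameters per class, while the naive factorization needs only $n$. The helper names below are hypothetical.

```python
def full_joint_params(n_features: int) -> int:
    """Free parameters per class for an unrestricted distribution over n binary features."""
    return 2 ** n_features - 1


def naive_params(n_features: int) -> int:
    """Free parameters per class under the naive assumption (one Bernoulli per feature)."""
    return n_features


# Compare the two counts as the number of features grows.
for n in (5, 10, 20, 30):
    print(n, full_joint_params(n), naive_params(n))
```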
This naive assumption has its drawbacks, but in practice it works remarkably well, especially considering how fast a prediction can be made once the model is trained. The pros of Naive Bayes are fast real-time predictions, good scaling to large data sets, and good performance on high-dimensional data. The main con, of course, is that the naive assumption of conditional independence rarely holds in real life. Nonetheless, it is a powerful classifier, and the final algorithm to classify a data point $\vec{x} = (x_{1},\ldots,x_{n})$ into one of the classes $\chi = \{A_{1},\ldots,A_{k}\}$ is as follows:
$$ \text{Class}\left(\vec{x}\right) = \underset{A_{j}\in \chi}{\operatorname{argmax}} \left(P(A_{j})\cdot \prod_{i=1}^{n} P(x_i | A_{j})\right)$$
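The decision rule can be sketched directly for a toy problem. The class names, priors, and per-feature likelihoods below are invented for illustration, and the helper name `classify` is hypothetical. The sketch sums log-probabilities instead of multiplying probabilities, a common variation that yields the same argmax while avoiding floating-point underflow when there are many features.

```python
import math

# Hypothetical model: two classes, two categorical features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"short": 0.7, "long": 0.3}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"short": 0.4, "long": 0.6}],
}


def classify(x):
    """Return the class maximizing log P(A_j) + sum_i log P(x_i | A_j), plus all scores."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(likelihoods[label][i][value])
        scores[label] = score
    return max(scores, key=scores.get), scores


label, scores = classify(["yes", "short"])
print(label, scores)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product of probabilities.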

### Implementation of Naive Bayes

In [1]:
import numpy as np
import pandas as pd

In [218]:
class NaiveBayesClassifier:
    '''Naive Bayes algorithm for classifying discrete targets. Be sure to one-hot-encode 
    any categorical variables. The default classification will be '''
    def __init__(self, k):
        #k is stored but is not used by fit or predict
        self.k = k
    
    def fit(self, train_data, train_target):
        '''Calculates summary statistics for the training data by class.'''
        #Make data np arrays
        X = np.array(train_data)
        y = np.array(train_target)
        self.data = X
        self.target = y
        self.class_counts = np.unique(y, return_counts = True)
        
        #Dictionary to separate data already converted to numpy arrays
        class_dict = dict()
        for class_ in self.class_counts[0]:
            class_dict[class_] = X[y == class_]
            
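        #Get summary statistics (mean and standard deviation of each feature) for every class.
        #A tiny constant is added to each standard deviation so that a feature that is 
        #constant within a class does not cause a division by zero in predict.
        class_summary = dict()
        for class_ in class_dict.keys():
            class_summary[class_] = [[np.mean(class_dict[class_][:, i]), 
                                      np.std(class_dict[class_][:, i]) + 1e-9] for i in 
                                     range(class_dict[class_].shape[1])]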
            
        
        self.class_statistics = class_summary 
        return self
    
    
    def predict(self, test_data):
        '''Takes the summary statistics from the training data and classifies a single 
        unseen data point (a sequence of feature values) using gaussian distributions. 
        The result is stored in self.prediction.'''
        
        #Function to calculate the gaussian probability density of x 
        def gaussian_probability(x, mean, std):
            z_score = (x-mean)/std
            return np.exp(-0.5*(z_score**2)) * (1/(std*np.sqrt(2*np.pi)))
        
        class_list = []
        probability_list = []
        for key in self.class_statistics.keys():
            class_list.append(key)
            statistics = self.class_statistics[key]
            
            #Start with the prior probability of the class, then multiply in the 
            #gaussian likelihood of every feature
            probability = len(self.target[self.target == key])/len(self.target)
            for i in range(len(statistics)):
                mean, std = statistics[i]
                probability *= gaussian_probability(test_data[i], mean, std)
            probability_list.append(probability)
        
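        #predict_proba holds unnormalized scores (prior times likelihoods), which do not sum to 1
        self.predict_proba = [[class_list[i], probability_list[i]] for i in range(len(class_list))]
        #Pick the class with the highest score (the first one if there is a tie)
        self.prediction = class_list[int(np.argmax(probability_list))]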
        return self  
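
The `predict` method above scores one data point at a time by multiplying densities. As a point of comparison, here is a minimal vectorized sketch of the same Gaussian computation that scores many rows at once and works in log space. The function names `fit_gaussian_nb` and `predict_gaussian_nb`, the variance floor `eps`, and the synthetic data are all assumptions made for illustration and are not part of the class above.

```python
import numpy as np


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate per-class priors, feature means, and feature variances."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # eps keeps the variance positive when a feature is constant within a class
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances


def predict_gaussian_nb(model, X):
    """Return the most likely class for every row of X."""
    classes, priors, means, variances = model
    X = np.asarray(X, dtype=float)
    # Shape (n_rows, n_classes, n_features): log density of each feature under each class
    diff = X[:, None, :] - means[None, :, :]
    log_density = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                          + diff ** 2 / variances[None, :, :])
    # Log prior plus the sum of per-feature log likelihoods
    log_scores = np.log(priors)[None, :] + log_density.sum(axis=2)
    return classes[np.argmax(log_scores, axis=1)]


# Synthetic, well-separated two-class data
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

model = fit_gaussian_nb(X_train, y_train)
predictions = predict_gaussian_nb(model, [[0.1, -0.2], [4.8, 5.3]])
print(predictions)
```

Summing log densities instead of multiplying raw densities keeps the scores from underflowing to zero as the number of features grows, and broadcasting removes the Python loops over classes and features.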