# Bayesian Algorithms

Naive Bayes classifiers are a family of classification algorithms based on Bayes' theorem. They all share one simplifying assumption: given the class label, every feature is conditionally independent of every other feature. Bayes' theorem relates the two conditional probabilities P(A|B) and P(B|A).
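Written out, with $y$ standing for the class and $x_1, \dots, x_n$ for the feature values (notation introduced here for illustration), Bayes' theorem and the independence assumption lead to the following decision rule:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator $P(x_1, \dots, x_n)$ is the same for every class, so it can be dropped when only the most probable class is needed.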

## Naive Bayes

In [1]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('Datasets/Mushroom Dataset/mushroom.csv')

In [4]:
df.head()

Unnamed: 0,type,cap_shape,cap_surface,cap_color,bruises,odor,gill_attachment,gill_spacing,gill_size,gill_color,...,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [5]:
df.shape

(8124, 23)

In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
le = LabelEncoder()

### Example

In [8]:
le.fit_transform(['a', 'a', 'b', 'a', 'c'])

array([0, 0, 1, 0, 2])

In [10]:
df = df.apply(le.fit_transform)

In [11]:
df.head()

Unnamed: 0,type,cap_shape,cap_surface,cap_color,bruises,odor,gill_attachment,gill_spacing,gill_size,gill_color,...,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
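Applying a single `LabelEncoder` to every column refits it each time, so afterwards it only remembers the mapping of the last column. If the integer codes need to be turned back into the original letters, one option is to keep one encoder per column. The sketch below uses a tiny stand-in frame (`toy`, with made-up rows) rather than the mushroom data; the `encoders` dictionary is a name introduced here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in frame (hypothetical rows, not the mushroom dataset).
toy = pd.DataFrame({"type": ["p", "e", "e"], "odor": ["p", "a", "l"]})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Each encoder keeps its own sorted class list, so codes can be inverted.
decoded_type = encoders["type"].inverse_transform(encoded["type"])
print(encoded)
print(list(decoded_type))
```

`LabelEncoder` assigns codes in sorted order of the labels, which is why `e` becomes 0 and `p` becomes 1 in the `type` column.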


In [12]:
data = df.values  # make numpy array

In [14]:
X = data[:, 1:]
y = data[:, 0]

In [16]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Classifier

**P(y=val)**

In [19]:
def prior_prob(y, label):
    """Estimate P(y = label) as the fraction of examples carrying that label."""
    total_examples = len(y)
    label_examples = (y == label).sum()

    return label_examples/total_examples

**P(X|y=val)**

In [20]:
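def cond_prob(X, y, feature_col, feature_val, label):
    """Estimate P(x[feature_col] = feature_val | y = label) from counts."""
    # keep only the rows that belong to the given class
    x_filtered = X[y==label]
    num = (x_filtered[:, feature_col] == feature_val).sum()
    den = (y==label).sum()

    return num/den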
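A count-based estimate like this returns 0 whenever a feature value never occurs together with a class in the training data, and a single zero factor wipes out the whole product for that class. A common remedy is Laplace (add-alpha) smoothing. The sketch below is one possible variant; `cond_prob_smoothed` and `alpha` are names introduced here, and the toy arrays are made up.

```python
import numpy as np

def cond_prob_smoothed(X, y, feature_col, feature_val, label, alpha=1.0):
    """Laplace-smoothed estimate of P(x[feature_col] = feature_val | y = label)."""
    x_filtered = X[y == label]
    # add alpha pseudo-counts to the numerator ...
    num = (x_filtered[:, feature_col] == feature_val).sum() + alpha
    # ... and alpha per distinct value of this feature to the denominator,
    # so the smoothed probabilities still sum to 1 over all values
    n_values = len(np.unique(X[:, feature_col]))
    den = len(x_filtered) + alpha * n_values
    return num / den

# Toy data: one feature with values {0, 1, 2}; class 0 never sees value 1 or 2.
X_toy = np.array([[0], [0], [1], [2]])
y_toy = np.array([0, 0, 1, 1])

p_unseen = cond_prob_smoothed(X_toy, y_toy, 0, 1, 0)   # (0 + 1) / (2 + 3)
p_seen = cond_prob_smoothed(X_toy, y_toy, 0, 0, 0)     # (2 + 1) / (2 + 3)
print(p_unseen, p_seen)
```

With `alpha=1.0` the unseen value gets probability 0.2 instead of 0, and the three probabilities for class 0 still add up to 1.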

### Compute Posterior Probability

In [45]:
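def predict(X, y, xtest):
    """Return the most probable class label for a single example xtest."""
    classes = np.unique(y)
    n_features = X.shape[1]

    post_probs = []

    for class_ in classes:
        # likelihood: product of the per-feature conditional probabilities
        prob = 1.0
        for f in range(n_features):
            prob *= cond_prob(X, y, f, xtest[f], class_)

        # the prior is estimated from the labels y, not from the feature matrix X
        prior = prior_prob(y, class_)

        # unnormalised posterior: likelihood * prior
        post_probs.append(prob*prior)

    # map the index of the largest posterior back to its class label
    prediction = classes[np.argmax(post_probs)]

    return prediction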
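Multiplying one probability per feature can underflow to 0.0 when there are many features, so naive Bayes is often evaluated in log space, where the product becomes a sum. The following is a minimal self-contained sketch of that idea, not the notebook's own method: `predict_log` and `alpha` are hypothetical names, and add-alpha smoothing is used so that `np.log` never receives 0.

```python
import numpy as np

def predict_log(X, y, x_query, alpha=1.0):
    """Predict the class of x_query by comparing log-posteriors."""
    classes = np.unique(y)
    log_posts = []
    for c in classes:
        Xc = X[y == c]
        log_post = np.log(len(Xc) / len(y))          # log prior
        for f in range(X.shape[1]):
            n_values = len(np.unique(X[:, f]))       # distinct values of feature f
            count = (Xc[:, f] == x_query[f]).sum()
            # smoothed log-likelihood of this feature value given class c
            log_post += np.log((count + alpha) / (len(Xc) + alpha * n_values))
        log_posts.append(log_post)
    return classes[np.argmax(log_posts)]

# Toy data with two binary features (made-up values).
X_toy = np.array([[0, 0], [0, 1], [1, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1, 1])

pred_a = predict_log(X_toy, y_toy, np.array([0, 0]))
pred_b = predict_log(X_toy, y_toy, np.array([1, 1]))
print(pred_a, pred_b)
```

Because the logarithm is monotonic, the class with the largest log-posterior is also the class with the largest posterior.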

In [48]:
predict(X_train, y_train, X_test[2])

1

In [49]:
y_test[2]

1

In [50]:
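def score(X_train, y_train, X_test, y_test):
    """Return the accuracy of predict() on the test set."""
    # classify every test example against the training data
    prediction = []
    for i in range(len(X_test)):
        prediction.append(predict(X_train, y_train, X_test[i]))

    prediction = np.array(prediction)

    # fraction of test examples whose predicted label matches the true label
    return (prediction == y_test).mean()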

In [51]:
score(X_train, y_train, X_test, y_test)

0.9973890339425587
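A hand-written classifier like this can be cross-checked against scikit-learn's `CategoricalNB`, which implements naive Bayes for integer-encoded categorical features with add-alpha smoothing. The sketch below runs on a small made-up array (`X_toy`, `y_toy`) rather than the mushroom split, so it makes no claim about the accuracy on that dataset.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy integer-encoded categorical data (hypothetical values).
X_toy = np.array([[0, 0], [0, 1], [1, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1, 1])

clf = CategoricalNB(alpha=1.0)   # alpha=1.0 is Laplace smoothing
clf.fit(X_toy, y_toy)

preds = clf.predict(np.array([[0, 0], [1, 1]]))
print(preds)
```

On real data, `CategoricalNB` expects every category code in the test set to fall within the range seen during fitting; its `min_categories` parameter can be used to reserve room for codes that only appear later.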

## Gaussian Naive Bayes

Gaussian Naive Bayes assumes that, within each class, every feature follows a Gaussian (normal) distribution. Because of this assumption, it is used when all features are continuous. In the Iris dataset, for example, the features are measurements such as sepal width and petal width. These take real values rather than a small set of categories, so they cannot be modeled by counting occurrences; Gaussian Naive Bayes instead estimates a mean and a variance for each feature within each class.

Considerations

* It assumes the distribution of features is normal
* It is usually used when all our features are continuous
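To make the Gaussian assumption concrete, here is a minimal NumPy sketch that estimates a mean and variance per class and feature and then scores a new point with the normal log-density. The function names, the small variance floor `1e-9`, and the toy measurements are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors plus per-class, per-feature means and variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # small floor keeps the variance strictly positive
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior for one example x."""
    classes, priors, means, variances = model
    # log of the normal density, summed over features (independence assumption)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)
                      + (x - means) ** 2 / variances).sum(axis=1)
    return classes[np.argmax(np.log(priors) + log_lik)]

# Toy continuous data: two well-separated clusters (made-up values).
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = fit_gaussian_nb(X_toy, y_toy)
pred_low = predict_gaussian_nb(model, np.array([1.1, 2.0]))
pred_high = predict_gaussian_nb(model, np.array([5.1, 6.1]))
print(pred_low, pred_high)
```

scikit-learn offers the same model as `sklearn.naive_bayes.GaussianNB`, which is usually the better choice outside of teaching examples.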

## Multinomial Naive Bayes

Multinomial Naive Bayes assumes that the feature counts within each class follow a multinomial distribution. It is used with discrete data, such as movie ratings from 1 to 5, where each rating occurs with a certain frequency. In text classification, the features are the counts of each word in a document, and the classifier uses them to predict the class or label. The algorithm is mostly used for document classification, for example deciding whether a document belongs to the sports, politics, or technology category.

Considerations

* Used with discrete data
* Works well for data which can easily be turned into counts, such as word counts in text.
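The word-count case can be sketched in a few lines of NumPy. The vocabulary, the four tiny "documents", and the function names below are hypothetical; the smoothing parameter `alpha` avoids zero probabilities for words a class has never seen.

```python
import numpy as np

# Rows are documents, columns are counts of the hypothetical
# vocabulary ["goal", "vote", "match"].
X_counts = np.array([[3, 0, 2],
                     [2, 0, 1],
                     [0, 4, 0],
                     [0, 3, 1]])
y_docs = np.array([0, 0, 1, 1])   # 0 = sports, 1 = politics

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # total count of each word per class, plus alpha pseudo-counts
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_theta = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_theta

def predict_multinomial_nb(model, x):
    """Return the most probable class for one count vector x."""
    classes, log_prior, log_theta = model
    # each word contributes its log probability once per occurrence
    return classes[np.argmax(log_prior + log_theta @ x)]

model = fit_multinomial_nb(X_counts, y_docs)
pred_sports = predict_multinomial_nb(model, np.array([2, 0, 1]))
pred_politics = predict_multinomial_nb(model, np.array([0, 3, 0]))
print(pred_sports, pred_politics)
```

For real text, scikit-learn's `CountVectorizer` combined with `MultinomialNB` covers the same steps.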

## Averaged One-Dependence Estimators (AODE)

AODE is a semi-naive Bayesian learning method. It was developed to address the attribute independence problem of the popular naive Bayes classifier. It does so by averaging over all models in which every attribute depends on the class and on a single other attribute. It frequently produces more accurate classifiers than naive Bayes, at the cost of a modest increase in computation.

Considerations

* It is intended for nominal (categorical) data, where it often achieves lower error rates than regular naive Bayes in exchange for somewhat more computation.
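In the usual formulation of AODE, the averaging can be written as follows. The symbols are assumptions introduced here: $F(x_i)$ is the number of training examples containing the value $x_i$, and $m$ is a minimum-frequency threshold below which an attribute value is not used as the extra parent.

```latex
\hat{P}(y \mid x_1, \dots, x_n) \propto
  \sum_{i \,:\, 1 \le i \le n,\; F(x_i) \ge m}
  \hat{P}(y, x_i) \prod_{j=1}^{n} \hat{P}(x_j \mid y, x_i)
```

Each term of the sum is a one-dependence model in which every attribute depends on the class and on the attribute $x_i$; naive Bayes corresponds to the case with no extra parent at all.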

## Bayesian Belief Network (BBN)

A Bayesian belief network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG), together with an associated set of probability tables. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. The simplest example is a single node for a coin toss: the coin can take two values, heads or tails, with a probability of 50% each. We call these probabilities “beliefs” (i.e. our belief that the state coin=heads is 50%).

Considerations

* BBNs enable us to model and reason about uncertainty
* The most important use of BBNs is in revising probabilities in the light of actual observations of events
* They can be used to find the likely cause of an observed problem, or to estimate the probabilities of different effects of an action, in areas such as computational biology and medicine (risk analysis and decision support).
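Revising a belief after an observation can be illustrated with the smallest possible network, a single edge from a disease node to a symptom node. All probabilities below are made-up illustration values, and `posterior_disease` is a hypothetical helper, not part of any library.

```python
# Hypothetical two-node network: Disease -> Symptom.
p_disease = 0.01                      # prior belief P(D = true)
p_symptom_given = {True: 0.90,        # P(S = true | D = true)
                   False: 0.05}       # P(S = true | D = false)

def posterior_disease(symptom_observed):
    """Return P(D = true | S = symptom_observed) by enumerating both states of D."""
    joint = {}
    for d, p_d in ((True, p_disease), (False, 1.0 - p_disease)):
        p_s = p_symptom_given[d]
        # joint probability P(D = d, S = symptom_observed)
        joint[d] = p_d * (p_s if symptom_observed else 1.0 - p_s)
    # normalise over the two possible states of D
    return joint[True] / (joint[True] + joint[False])

belief_with_symptom = posterior_disease(True)
belief_without_symptom = posterior_disease(False)
print(belief_with_symptom, belief_without_symptom)
```

With these numbers, observing the symptom raises the belief in the disease from 1% to roughly 15%, while observing its absence lowers the belief below the prior.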