# Bayes Networks

Sometimes we get so caught up in all the latest in deep nets that we forget to look at some of the more basic approaches.  One of those is Bayes Networks.  In a nutshell, a Bayesian network [is](https://en.wikipedia.org/wiki/Bayesian_network) a "probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG)".

Besides the Wikipedia article, [scikit-learn](http://scikit-learn.org/stable/modules/naive_bayes.html) has some good documentation, and [this article](https://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/) as well as [this one](http://dataaspirant.com/2017/02/20/gaussian-naive-bayes-classifier-implementation-python/) explain some more.  I also like [this GitHub entry](https://github.com/taneresme/ml.naiveBayes/blob/master/Naive-Bayes-Classifier.ipynb).

## Explanation

The Naive Bayes classifier is a straightforward and powerful algorithm for classification tasks. When working with a data set of millions of records and some attributes, Naive Bayes is worth trying because it is computationally cheap compared to other algorithms.  However, Naive Bayes may perform poorly if your training set isn't a representative sample of “the real world”.  To understand the naive Bayes classifier we need to understand Bayes theorem.

### What is Bayes Theorem?
Bayes theorem is named after [Rev. Thomas Bayes](https://en.wikipedia.org/wiki/Thomas_Bayes). It is based on conditional probability, that is, the probability that something will happen given that something else has already occurred. Using conditional probability, we can calculate the probability of an event from prior knowledge about it.

Consider two events, $A$ and $B$. $A \cap B$ is defined as the intersection of $A$ and $B$.
$P(A \mid B)$ is defined as the probability of $A$ given $B$.

![](https://github.com/taneresme/ml.naiveBayes/raw/3a6f361ddab709b6cd9610ef4b9a8146cfccbc82/vennDiagramOfBayesTheorem.png)

When event $B$ has occurred, the sample space is $B$, shown on the right in the figure. Now compute the probability of $A$ also occurring (the conditional probability of $A$). That is, find the probability of $A \cap B$ given that we are in the space of $B$.

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$
We can rewrite $P(A \cap B)$ as $P(A, B)$. Both denote the probability of $A$ and $B$ occurring together. So the new form of the equation is:

$$
P(A \mid B) = \frac{P(A, B)}{P(B)}
$$
For the joint probability of $A$ and $B$, we can deduce the equations below from the figure above.

$$\begin{align}
P(A, B) = P(B, A) = P(A \mid B)P(B) \\
P(A, B) = P(B, A) = P(B \mid A)P(A)
\end{align}$$
Substituting the second form of $P(A, B)$ into the equation gives:

$$
P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}
$$
This equation is known as Bayes Theorem.

- $P(A \mid B)$ : the probability of $A$ when $B$ is given
- $P(B)$ : the marginal probability of $B$
- $P(B \mid A)$ : the probability of $B$ when $A$ is given
- $P(A)$ : the marginal probability of $A$
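To make the formula concrete, here is a small numeric sketch. The three input probabilities are made-up values chosen only for illustration, and the marginal $P(B)$ is obtained with the law of total probability, which the text above does not spell out.

```python
# Hypothetical numbers, chosen only to illustrate the formula.
p_a = 0.01               # P(A): prior probability of event A
p_b_given_a = 0.95       # P(B | A)
p_b_given_not_a = 0.05   # P(B | not A)

# Marginal P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

With these inputs, observing $B$ raises the probability of $A$ from 1% to roughly 16%.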

## Naive Bayes Methods
[Naive Bayes methods](http://scikit-learn.org/stable/modules/naive_bayes.html) are a set of supervised learning algorithms based on applying Bayes theorem with the "naive" assumption of independence between every pair of features. Given a class variable $y$ and a dependent feature vector $\{x_1, ..., x_n\}$, Bayes theorem states the following relationship:

$$
P(y \mid x_1, ..., x_n) = \frac{P(y) P(x_1, ..., x_n \mid y)}{P(x_1, ..., x_n)}
$$

Applying the naive independence assumption that

$$
P(x_i \mid y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i \mid y)
$$

for all $i$, the relationship is simplified to

$$
P(y \mid x_1, ..., x_n) = \frac{P(y) \prod_{i=1}^n P(x_i \mid y)}{P(x_1, ..., x_n)}
$$

Since $P(x_1, ..., x_n)$ is constant given the input, we can simplify the previous equation to this classification rule:

$$
P(y \mid x_1, ..., x_n) \propto P(y) \prod_{i=1}^n P(x_i \mid y)
$$

The predicted class $\hat{y}$ is the one that maximizes this quantity:

$$
\hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i \mid y)
$$

We can use *Maximum A Posteriori (MAP)* estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.
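The classification rule can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the toy data, the helper names `score` and `predict`, and the choice to estimate every probability as a plain relative frequency with no smoothing.

```python
from collections import Counter, defaultdict

# Toy training set (made up): each row is (features, class label).
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

# P(y): relative frequency of each class in the training set.
class_counts = Counter(label for _, label in train)
prior = {y: count / len(train) for y, count in class_counts.items()}

# Value counts for P(x_i | y), one Counter per (feature position, class).
feature_counts = defaultdict(Counter)
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def score(features, y):
    """Unnormalized posterior P(y) * prod_i P(x_i | y)."""
    result = prior[y]
    for i, value in enumerate(features):
        result *= feature_counts[(i, y)][value] / class_counts[y]
    return result

def predict(features):
    """Return the class with the largest unnormalized posterior."""
    return max(prior, key=lambda y: score(features, y))

print(predict(("rainy", "mild")))
```

Without smoothing, a feature value never seen with a class drives that class's score to zero, which is why practical implementations usually add a small pseudo-count.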

[Different naive Bayes classifiers](http://scikit-learn.org/stable/modules/naive_bayes.html) differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid y)$.

In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. For theoretical reasons why naive Bayes works well, and on which types of data it does, see this [reference](http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf).

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
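As a rough illustration of that decoupling, the sketch below estimates one mean and one variance per class and per feature, each independently of the others, and then scores a new point. The data and the helper name `log_posterior` are made up, and a Gaussian form for $P(x_i \mid y)$ is assumed. The products are computed as sums of logs for numerical stability.

```python
import numpy as np

# Tiny made-up dataset: 2 features, 2 classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
# One (mean, variance) pair per class and per feature, estimated independently.
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])
priors = np.array([(y == c).mean() for c in classes])

def log_posterior(x):
    """Unnormalized log P(y | x) for every class, using per-feature Gaussians."""
    log_lik = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)
    return np.log(priors) + log_lik.sum(axis=1)

print(classes[np.argmax(log_posterior(np.array([1.1, 1.9])))])
```

Note that only 2 x 2 means and 2 x 2 variances are estimated here; no joint distribution over both features is ever fitted.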

What I thought would be interesting is to work through a couple of different packages with [this example](http://dataaspirant.com/2017/02/20/gaussian-naive-bayes-classifier-implementation-python/).  First, we use scikit-learn.

In [1]:
# Required Python Machine learning Packages
import pandas as pd
import numpy as np
# For preprocessing the data (label encoding)
from sklearn import preprocessing
# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score


In [2]:
# Get the data
url="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
adult_df=pd.read_csv(url, header = None, delimiter=' *, *', engine='python')

In [3]:
# Assign column names.
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship',
                    'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'income']

In [4]:
display(adult_df.head(5))

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [5]:
# Data cleanup
# Is there any missing data?
print(adult_df.isnull().sum())

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64


In [6]:
# No nulls.  Is there data with "?" or other funny stuff going on?
for value in ['workclass', 'education',
          'marital_status', 'occupation',
          'relationship','race', 'sex',
          'native_country', 'income']:
    print(value,":", sum(adult_df[value] == '?'))

workclass : 1836
education : 0
marital_status : 0
occupation : 1843
relationship : 0
race : 0
sex : 0
native_country : 583
income : 0


In [7]:
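# Keep a second name for the frame. Note this is an alias, not a copy:
# in-place column updates to adult_df below are also visible through save_df.
save_df = adult_df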

# Get summary statistics - look at all those NaNs
sum_stats = adult_df.describe(include= 'all')

Impute the missing data, i.e. the '?' entries, with the most frequent value of each column.  For workclass, we will replace with 'Private' and so forth.

In [8]:
for value in ['workclass', 'education',
          'marital_status', 'occupation',
          'relationship','race', 'sex',
          'native_country', 'income']:
    # The 'top' row of describe() holds the most frequent value of the column
    adult_df[value] = adult_df[value].replace('?', sum_stats.loc['top', value])
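For comparison, here is a self-contained sketch of the same idea on a small made-up frame. Unlike the cell above, it computes the most frequent value after excluding the '?' placeholder, which guards against the case where '?' itself happens to be the most common entry. The frame and its values are hypothetical.

```python
import pandas as pd

# Small made-up frame that mimics the '?' placeholders in the Adult data.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "Sales", "?", "Tech-support"],
})

for column in toy.columns:
    # Most frequent value among the known (non-'?') entries.
    known = toy[column][toy[column] != "?"]
    most_frequent = known.mode().iloc[0]
    toy[column] = toy[column].replace("?", most_frequent)

print(toy)
```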

In [9]:
# Validate...
for value in ['workclass', 'education',
          'marital_status', 'occupation',
          'relationship','race', 'sex',
          'native_country', 'income']:
    print(value,":", sum(adult_df[value] == '?'))

workclass : 0
education : 0
marital_status : 0
occupation : 0
relationship : 0
race : 0
sex : 0
native_country : 0
income : 0


For naive Bayes, we need to convert all the data values to a single numeric format. We encode each categorical label as an integer between 0 and n_classes-1, using LabelEncoder from the scikit-learn library. One-hot encoding is an alternative: it encodes each category as its own binary column.

In [10]:
le = preprocessing.LabelEncoder()
workclass_cat = le.fit_transform(save_df.workclass)
education_cat = le.fit_transform(save_df.education)
marital_cat   = le.fit_transform(save_df.marital_status)
occupation_cat = le.fit_transform(save_df.occupation)
relationship_cat = le.fit_transform(save_df.relationship)
race_cat = le.fit_transform(save_df.race)
sex_cat = le.fit_transform(save_df.sex)
native_country_cat = le.fit_transform(save_df.native_country)

#initialize the encoded categorical columns
adult_df['workclass_cat'] = workclass_cat
adult_df['education_cat'] = education_cat
adult_df['marital_cat'] = marital_cat
adult_df['occupation_cat'] = occupation_cat
adult_df['relationship_cat'] = relationship_cat
adult_df['race_cat'] = race_cat
adult_df['sex_cat'] = sex_cat
adult_df['native_country_cat'] = native_country_cat

#drop the old categorical columns from dataframe
dummy_fields = ['workclass', 'education', 'marital_status', 
                  'occupation', 'relationship', 'race',
                  'sex', 'native_country']
adult_df = adult_df.drop(dummy_fields, axis = 1)
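To see the difference between the two encodings mentioned above, here is a toy comparison on a made-up column. `pd.get_dummies` is used for the one-hot variant purely for brevity; it is one option among several.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up column with three categories.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# LabelEncoder: one integer per category, assigned in sorted order.
label_codes = LabelEncoder().fit_transform(colors)
print(label_codes)

# One-hot: one binary indicator column per category.
one_hot = pd.get_dummies(colors, prefix="color")
print(one_hot)
```

Integer codes imply an ordering (here blue < green < red) that the categories do not really have, which a model treating the feature as continuous will pick up on. One-hot encoding avoids that at the cost of more columns.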

Note the adult_df.head() result below. You will see that the columns are no longer in their original order and should be reindexed.  We want each encoded categorical column back in the position of the column it replaced, with income last.

In [11]:
display(adult_df.head(5))

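| | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | income | workclass_cat | education_cat | marital_cat | occupation_cat | relationship_cat | race_cat | sex_cat | native_country_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | <=50K | 6 | 9 | 4 | 0 | 1 | 4 | 1 | 38 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | <=50K | 5 | 9 | 2 | 3 | 0 | 4 | 1 | 38 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | <=50K | 3 | 11 | 0 | 5 | 1 | 4 | 1 | 38 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | <=50K | 3 | 1 | 2 | 5 | 0 | 2 | 1 | 38 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | <=50K | 3 | 9 | 2 | 9 | 5 | 2 | 0 | 4 |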


In [12]:
# reindex_axis was removed from pandas; reindex(columns=...) does the same job
adult_df = adult_df.reindex(columns=['age', 'workclass_cat', 'fnlwgt', 'education_cat',
                                     'education_num', 'marital_cat', 'occupation_cat',
                                     'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',
                                     'capital_loss', 'hours_per_week', 'native_country_cat',
                                     'income'])

display(adult_df.head(5))

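| | age | workclass_cat | fnlwgt | education_cat | education_num | marital_cat | occupation_cat | relationship_cat | race_cat | sex_cat | capital_gain | capital_loss | hours_per_week | native_country_cat | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 6 | 77516 | 9 | 13 | 4 | 0 | 1 | 4 | 1 | 2174 | 0 | 40 | 38 | <=50K |
| 1 | 50 | 5 | 83311 | 9 | 13 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 13 | 38 | <=50K |
| 2 | 38 | 3 | 215646 | 11 | 9 | 0 | 5 | 1 | 4 | 1 | 0 | 0 | 40 | 38 | <=50K |
| 3 | 53 | 3 | 234721 | 1 | 7 | 2 | 5 | 0 | 2 | 1 | 0 | 0 | 40 | 38 | <=50K |
| 4 | 28 | 3 | 338409 | 9 | 13 | 2 | 9 | 5 | 2 | 0 | 0 | 0 | 40 | 4 | <=50K |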


## Standardization of Data

All the feature columns of our dataframe are now numeric. Next, we put them on a single scale by standardizing each column, using the formula below.

$x_i' = \frac{x_i - \text{mean}(x)}{\sigma(x)}$

In [13]:
num_features = ['age', 'workclass_cat', 'fnlwgt', 'education_cat', 'education_num',
                'marital_cat', 'occupation_cat', 'relationship_cat', 'race_cat',
                'sex_cat', 'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country_cat']

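# Remember each column's mean and std so the scaling can be undone later
scaled_features = {}
for each in num_features:
    mean, std = adult_df[each].mean(), adult_df[each].std()
    scaled_features[each] = [mean, std]
    # Plain column assignment lets the integer column become float
    adult_df[each] = (adult_df[each] - mean) / std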

display(adult_df.head(5))

| | age | workclass_cat | fnlwgt | education_cat | education_num | marital_cat | occupation_cat | relationship_cat | race_cat | sex_cat | capital_gain | capital_loss | hours_per_week | native_country_cat | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.03067 | 2.624257 | -1.063594 | -0.335432 | 1.134721 | 0.92162 | -1.545232 | -0.277801 | 0.393661 | 0.703061 | 0.148451 | -0.216656 | -0.035429 | 0.261366 | <=50K |
| 1 | 0.837096 | 1.721073 | -1.008692 | -0.335432 | 1.134721 | -0.406206 | -0.79008 | -0.900167 | 0.393661 | 0.703061 | -0.145918 | -0.216656 | -2.222119 | 0.261366 | <=50K |
| 2 | -0.042641 | -0.085295 | 0.245075 | 0.181329 | -0.420053 | -1.734032 | -0.286645 | -0.277801 | 0.393661 | 0.703061 | -0.145918 | -0.216656 | -0.035429 | 0.261366 | <=50K |
| 3 | 1.057031 | -0.085295 | 0.425795 | -2.402474 | -1.19744 | -0.406206 | -0.286645 | -0.900167 | -1.962591 | 0.703061 | -0.145918 | -0.216656 | -0.035429 | 0.261366 | <=50K |
| 4 | -0.775756 | -0.085295 | 1.408154 | -0.335432 | 1.134721 | -0.406206 | 0.720225 | 2.211664 | -1.962591 | -1.422309 | -0.145918 | -0.216656 | -0.035429 | -5.352858 | <=50K |
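The same standardization is also available as `StandardScaler` in scikit-learn. The sketch below applies it to a small made-up matrix and checks it against the formula by hand. One detail to be aware of: `StandardScaler` divides by the population standard deviation (ddof=0), while the pandas `.std()` used in the loop above defaults to the sample standard deviation (ddof=1), so the two give slightly different numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric feature matrix (rows = samples, columns = features).
X = np.array([[39.0, 40.0], [50.0, 13.0], [38.0, 40.0],
              [53.0, 40.0], [28.0, 40.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and population standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))

# Manual equivalent of the formula, using the population std (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

A fitted scaler can be reused via `scaler.transform(...)`, which makes it easy to compute the statistics on the training rows only and apply them unchanged to the test rows.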


# Training and Testing Sets

Split the data into training and testing sets.

In [14]:
features = adult_df.values[:,:14]
target = adult_df.values[:,14]
features_train, features_test, target_train, target_test = train_test_split(features,
                                                                            target, test_size = 0.33, random_state = 10)

In [15]:
display(features)

array([[0.03067008638002341, 2.6242573354416034, -1.0635944124434624,
        ..., -0.2166562000280046, -0.03542890292132261,
        0.2613659753386185],
       [0.8370961257882877, 1.7210732167696725, -1.0086915112165729, ...,
        -0.2166562000280046, -2.2221189981594556, 0.2613659753386185],
       [-0.04264137174800061, -0.08529502057418954, 0.24507474139091864,
        ..., -0.2166562000280046, -0.03542890292132261,
        0.2613659753386185],
       ...,
       [1.4235877908124799, -0.08529502057418954, -0.3587719044411823,
        ..., -0.2166562000280046, -0.03542890292132261,
        0.2613659753386185],
       [-1.215624701796385, -0.08529502057418954, 0.11095818060267934,
        ..., -0.2166562000280046, -1.6551993438384582,
        0.2613659753386185],
       [0.9837190420443357, 0.8178890980977415, 0.9298782967974407, ...,
        -0.2166562000280046, -0.03542890292132261, 0.2613659753386185]],
      dtype=object)

In [16]:
display(target)

array(['<=50K', '<=50K', '<=50K', ..., '<=50K', '<=50K', '>50K'],
      dtype=object)

# Gaussian Naive Bayes

We build a GaussianNB classifier with Scikit-Learn. The fit() method trains it on the training data. Once trained, the model is ready to make predictions: predict() takes the test set features as its parameter and returns the predicted labels.

In [17]:
import time

start_time = time.time()
clf = GaussianNB()
clf.fit(features_train, target_train)
target_pred = clf.predict(features_test)
print("Naive Bayes process time = {}".format(time.time()-start_time))

Naive Bayes process time = 0.13645672798156738
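Here is the same fit/predict pattern on a tiny made-up data set, which also shows `predict_proba` for the per-class probabilities behind each prediction. The data and labels are invented for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up, well-separated training data: one feature, two classes.
X_train = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]])
y_train = np.array(["low", "low", "low", "high", "high", "high"])

model = GaussianNB()
model.fit(X_train, y_train)          # estimates per-class mean and variance

X_new = np.array([[1.2], [8.8]])
labels = model.predict(X_new)        # most probable class per row
probs = model.predict_proba(X_new)   # one column per class, in model.classes_ order
print(model.classes_, labels, probs.round(3))
```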


## Accuracy of our Gaussian Naive Bayes model
It’s time to test the quality of our model. We have made some predictions. Let’s compare the model’s prediction with actual target values for the test set. By following this method, we are going to calculate the accuracy of our model.

In [18]:
accuracy_score(target_test, target_pred, normalize = True)

0.8014144798064397

Gaussian Naive Bayes in Scikit-Learn predicts the income bracket with an accuracy of about 80% on the test set.
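Accuracy is simply the fraction of predictions that match the true labels, and it is worth comparing against the trivial baseline of always predicting the most frequent class. The sketch below shows both computations on made-up labels; the numbers say nothing about the Adult data itself.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, only to illustrate the computation.
y_true = np.array(["<=50K", "<=50K", "<=50K", ">50K",
                   ">50K", "<=50K", "<=50K", ">50K"])
y_pred = np.array(["<=50K", "<=50K", ">50K", ">50K",
                   "<=50K", "<=50K", "<=50K", ">50K"])

# Accuracy: fraction of predictions equal to the true label.
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

# Baseline: always predict the most frequent class.
values, counts = np.unique(y_true, return_counts=True)
baseline_accuracy = counts.max() / counts.sum()

print(manual_accuracy, library_accuracy, baseline_accuracy)
```

When one class dominates, as <=50K does in this data set, a model's accuracy only means something relative to that baseline.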

# Bernoulli Naive Bayes

Let's try the Bernoulli Naive Bayes method.

In [19]:
from sklearn.naive_bayes import BernoulliNB

start_time = time.time()
bnb = BernoulliNB()
bnb.fit(features_train, target_train)
target_pred = bnb.predict(features_test)
print("Bernoulli Naive Bayes process time = {}".format(time.time()-start_time))

accuracy_score(target_test, target_pred, normalize = True)
 

Bernoulli Naive Bayes process time = 0.19118714332580566


0.802252000744463

That is hardly an improvement.
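One thing to keep in mind when reading this result: by default `BernoulliNB` binarizes every feature at `binarize=0.0` before fitting. On standardized features that reduces each column to "above the mean or not". The sketch below, on synthetic data invented for the example, shows that the default model matches one trained on hand-binarized inputs.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
# Made-up standardized-looking features; the label is loosely tied to the first one.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Default: BernoulliNB thresholds every feature at binarize=0.0 before fitting.
default_model = BernoulliNB().fit(X, y)

# Equivalent: binarize by hand and tell the model not to threshold again.
X_binary = (X > 0.0).astype(int)
manual_model = BernoulliNB(binarize=None).fit(X_binary, y)

same = np.array_equal(default_model.predict(X), manual_model.predict(X_binary))
print(same)
```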

# Decision Tree

[Decision trees](http://scikit-learn.org/stable/modules/tree.html) are another option for classification.

In [20]:
from sklearn import tree

start_time = time.time()
tclf = tree.DecisionTreeClassifier()
tclf = tclf.fit(features_train, target_train)
target_pred = tclf.predict(features_test)
print("Decision Tree process time = {}".format(time.time()-start_time))

accuracy_score(target_test, target_pred, normalize = True)

Decision Tree process time = 0.3126842975616455


0.8126744835287549

# Random Forest

We've gone this far - let's try a [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [21]:
from sklearn.ensemble import RandomForestClassifier

start_time = time.time()
rfclf = RandomForestClassifier(max_depth=2, random_state=0)
rfclf.fit(features_train, target_train)
target_pred = rfclf.predict(features_test)
print("Random Forest process time = {}".format(time.time()-start_time))

accuracy_score(target_test, target_pred, normalize = True)

Random Forest process time = 0.2171158790588379


0.794714312302252

That was unexpected.  Previously, I was under the impression that Random Forests are one of the best classifiers.  But that is a blanket statement that shouldn't be made in general.  Indeed, some methods will do better than others on different problems.

Or perhaps, we should just increase the depth?

In [22]:
start_time = time.time()
rfclf = RandomForestClassifier(max_depth=4, random_state=0)
rfclf.fit(features_train, target_train)
target_pred = rfclf.predict(features_test)
print("Random Forest process time = {}".format(time.time()-start_time))

accuracy_score(target_test, target_pred, normalize = True)

Random Forest process time = 0.28003907203674316


0.8401265587195236

Try once more...

In [23]:
start_time = time.time()
rfclf = RandomForestClassifier(max_depth=8, random_state=0)
rfclf.fit(features_train, target_train)
target_pred = rfclf.predict(features_test)
print("Random Forest process time = {}".format(time.time()-start_time))

accuracy_score(target_test, target_pred, normalize = True)

Random Forest process time = 0.34059715270996094


0.854178298901917

I did suspect that increasing the depth would only go so far.
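Rather than editing the cell by hand for each depth, the sweep can be written as a loop. This sketch uses synthetic data from `make_classification` as a stand-in so that it runs on its own; the depths tried and the data set parameters are arbitrary choices, and the scores it prints are not comparable to the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the Adult features from above could be used instead.
X, y = make_classification(n_samples=1000, n_features=14, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=10)

depth_scores = {}
for depth in [2, 4, 8, 16, None]:   # None lets each tree grow without a depth limit
    forest = RandomForestClassifier(max_depth=depth, random_state=0)
    forest.fit(X_train, y_train)
    depth_scores[depth] = forest.score(X_test, y_test)   # mean accuracy on the test set

for depth, score in depth_scores.items():
    print(depth, round(score, 3))
```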

# AdaBoost

Let's try boosting with [AdaBoost](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) and see what that does.

In [24]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

start_time = time.time()
adaclf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
adaclf.fit(features_train, target_train)
target_pred = adaclf.predict(features_test)
print("AdaBoost process time = {}".format(time.time()-start_time))

accuracy_score(target_test, target_pred, normalize = True)


AdaBoost process time = 13.648805856704712


0.8542713567839196

Better, but very slow.  And in reality, not that great of an improvement over Naive Bayes.  Out of curiosity, let's switch the algorithm from SAMME to SAMME.R, keeping the depth and the number of estimators the same.

In [25]:
start_time = time.time()
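# Note: recent scikit-learn releases deprecated and then dropped the "SAMME.R"
# option, so this cell needs an older version to run as written.
adaclf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME.R",
                         n_estimators=200)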
adaclf.fit(features_train, target_train)
target_pred = adaclf.predict(features_test)
print("AdaBoost process time = {}".format(time.time()-start_time))

accuracy_score(target_test, target_pred, normalize = True)

AdaBoost process time = 14.740965127944946


0.8652521868602271

The claim was that SAMME.R [converges faster](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) and has smaller test error.  We did get an improvement.  I didn't see faster convergence, though - it still seemed slow.  One more variation, this time with deeper trees and more estimators.

In [26]:
start_time = time.time()
adaclf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                         algorithm="SAMME.R",
                         n_estimators=400)
adaclf.fit(features_train, target_train)
target_pred = adaclf.predict(features_test)
print("AdaBoost process time = {}".format(time.time()-start_time))

accuracy_score(target_test, target_pred, normalize = True)

AdaBoost process time = 38.720948457717896


0.8597617718220734

Not better.
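Wall-clock time for a fixed number of estimators does not say much about convergence. A more direct check is to track test accuracy after each boosting round with `staged_score`. The sketch below does that on synthetic stand-in data, leaving the `algorithm` argument at its default so that it does not depend on which options a given scikit-learn version supports. The data set and the choice of 50 rounds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only to keep the example self-contained.
X, y = make_classification(n_samples=1000, n_features=14, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=10)

# Decision stumps as weak learners; the algorithm argument is left at its default.
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, random_state=0)
booster.fit(X_train, y_train)

# staged_score yields the test accuracy after each boosting round.
staged = list(booster.staged_score(X_test, y_test))
best_round = max(range(len(staged)), key=staged.__getitem__) + 1
print(len(staged), best_round, round(staged[-1], 3))
```

If the best round comes well before the last one, adding more estimators is unlikely to help, which would be consistent with the last experiment above.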