# Classification using Bayesian Classifier 

### Bayesian Classifiers
Bayesian classifiers are probabilistic classifiers based on Bayes' theorem. They are highly scalable and are often used when the dimensionality of the inputs is high.

### Types
1. Naïve Bayes
2. Bayesian Belief Network
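
For reference, the rule behind the first type can be written out as below. This is the standard textbook notation rather than anything defined in this notebook: $y$ is the class label and $x_1, \dots, x_n$ are the features (here $n = 3$ for B, G, R). The first expression is Bayes' theorem; the second is the Naïve Bayes prediction, which additionally assumes the features are conditionally independent given the class.

```latex
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```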

### Problem Statement

UCI dataset: Skin Segmentation Data Set (https://archive.ics.uci.edu/ml/machine-learning-databases/00229/).
The Skin Segmentation dataset is constructed over the B, G, R color space. The skin and non-skin samples are generated using skin textures from face images of people of diverse age, gender, and race. The task is to identify whether a given BGR combination is a skin color or not.


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

The following code reads a text file into a dataframe. Modify the code below to split the values of B, G, R and the class label into different columns. Each column must have the column name specified below:

```
Column No.   Expected Column Name
1            BLUE
2            GREEN
3            RED
4            RESULT     
```

In [2]:
df = pd.read_csv(sep='\t', header=None, names=['BLUE', 'GREEN', 'RED', 'RESULT'],
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/00229/Skin_NonSkin.txt')
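
The same `read_csv` arguments can be tried offline on a few rows before pointing them at the full file. The sketch below is only an illustration: the in-memory string `sample` and the name `sample_df` are made up for the example, and the three rows are copied from the `df.head()` output shown further down.

```python
import io

import pandas as pd

# Three tab-separated rows with no header line, mimicking the layout of the file
sample = "74\t85\t123\t1\n73\t84\t122\t1\n72\t83\t121\t1\n"

cols = ['BLUE', 'GREEN', 'RED', 'RESULT']

# header=None tells pandas the first line is data; names supplies the column labels
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=cols)

print(sample_df)
```

Without `header=None`, pandas would consume the first data row as column names, so the frame would be one row short.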

In [3]:
df.head()

   BLUE  GREEN  RED  RESULT
0    74     85  123       1
1    73     84  122       1
2    72     83  121       1
3    70     81  119       1
4    70     81  119       1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245057 entries, 0 to 245056
Data columns (total 4 columns):
BLUE      245057 non-null int64
GREEN     245057 non-null int64
RED       245057 non-null int64
RESULT    245057 non-null int64
dtypes: int64(4)
memory usage: 7.5 MB


Write some code to define X and y, with the B, G, R components in X and the class label in y. These are then split into training and test sets. The held-out test set is used to calculate the accuracy of the classification model.

In [5]:
# Write your code here
X = df[['BLUE', 'GREEN', 'RED']]
y = df['RESULT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
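
The two classes in this dataset are unbalanced (the reports below show roughly one class 1 sample for every four class 2 samples in the test set). As an optional variation, not used in this notebook, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both splits. The sketch below uses made-up data (`X_demo`, `y_demo`) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 20% labelled 1 and 80% labelled 2
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 200 + [2] * 800)

# stratify=y_demo preserves the 20/80 ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

train_ratio = np.mean(y_tr == 1)
test_ratio = np.mean(y_te == 1)
print(train_ratio, test_ratio)
```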

The code in the next cell classifies the test data by following the steps below:

1. Create a Gaussian Naïve Bayes classifier (imported in the first cell)
2. Fit the model with the training data (X: attributes and y: labels)
3. Use the trained model to predict labels of the test data (X_test)
4. Calculate the accuracy score using the actual labels (y_test) and the predicted labels (y_pred)

In [6]:
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, y_pred)

0.9239206724883702
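
To make the fit and predict steps less of a black box, here is a minimal from-scratch sketch of the Gaussian Naïve Bayes idea in NumPy. It is not scikit-learn's implementation: the helper names `fit_gaussian_nb` and `predict_gaussian_nb`, the `1e-9` variance offset, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors plus per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small offset keeps variances strictly positive (illustrative choice)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class maximising log P(y) + sum_i log N(x_i | mean, var)."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]            # shape (n, k, d)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None]
                      + diff ** 2 / variances[None]).sum(axis=2)
    return classes[np.argmax(np.log(priors)[None, :] + log_lik, axis=1)]

# Toy data (made up): two well-separated clusters in a 3-channel space
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal([80, 90, 130], 10, size=(50, 3)),
                   rng.normal([200, 200, 60], 10, size=(50, 3))])
y_toy = np.array([1] * 50 + [2] * 50)

model = fit_gaussian_nb(X_toy, y_toy)
pred_toy = predict_gaussian_nb(model, X_toy)
toy_accuracy = np.mean(pred_toy == y_toy)
print(toy_accuracy)
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow.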

Write some code to calculate Precision, Recall and F-score of the results obtained using the given GaussianNB classification model.


In [7]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print('Precision: %.03f' % precision)
print('Recall: %.03f' % recall)
print('F1: %.03f' % f1)

Precision: 0.874
Recall: 0.738
F1: 0.800
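
These three numbers refer to class 1, because `precision_score`, `recall_score` and `f1_score` treat the label given by `pos_label` (default `1`) as the positive class. The sketch below shows how they follow from the true-positive, false-positive and false-negative counts. The helper `binary_prf` and the toy labels are hypothetical and only illustrate the definitions.

```python
def binary_prf(y_true, y_pred, pos_label=1):
    """Precision, recall and F1 for one positive class, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels (made up): 3 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
y_pred_toy = [1, 1, 1, 2, 1, 2, 2, 2, 2, 2]

p, r, f = binary_prf(y_true_toy, y_pred_toy)
print('Precision: %.03f  Recall: %.03f  F1: %.03f' % (p, r, f))
```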


In [8]:
# Might be easier to read with less code & more information
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          1       0.87      0.74      0.80     12636
          2       0.93      0.97      0.95     48629

avg / total       0.92      0.92      0.92     61265
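
The `avg / total` row is a support-weighted average of the per-class rows, which is why it sits much closer to the class 2 values. A quick check using the rounded precision figures from the report above (the variable names are illustrative):

```python
# Per-class precision and support, copied from the report above
support = {1: 12636, 2: 48629}
precision_by_class = {1: 0.87, 2: 0.93}

total = sum(support.values())
weighted_precision = sum(precision_by_class[c] * support[c] for c in support) / total

print(total, round(weighted_precision, 2))
```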



Write some code to train a Multinomial Naive Bayes classifier or a Bernoulli Naive Bayes classifier on X_train and y_train. Calculate the accuracy, precision, recall and F1-score values for the trained model. Use the scikit-learn library for this task.

In [9]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB


mnb = MultinomialNB()
y_pred_mnb = mnb.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, y_pred_mnb)

0.9345792867052967

In [10]:
bnb = BernoulliNB()
y_pred_bnb = bnb.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, y_pred_bnb)

0.7937484697625071

In [11]:
print('Classification report for MultinomialNB: ')
print(classification_report(y_test, y_pred_mnb))

Classification report for MultinomialNB: 
             precision    recall  f1-score   support

          1       0.76      1.00      0.86     12636
          2       1.00      0.92      0.96     48629

avg / total       0.95      0.93      0.94     61265



In [12]:
print('Classification report for BernoulliNB: ')
print(classification_report(y_test, y_pred_bnb))

Classification report for BernoulliNB: 
             precision    recall  f1-score   support

          1       0.00      0.00      0.00     12636
          2       0.79      1.00      0.89     48629

avg / total       0.63      0.79      0.70     61265





### Analysis

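Multinomial Naive Bayes achieved the highest accuracy (0.935, versus 0.924 for Gaussian Naive Bayes and 0.794 for Bernoulli Naive Bayes) and the best support-weighted precision, recall and F1 score. It is not better on every per-class metric, though: for class 1 its recall is 1.00 but its precision is 0.76, lower than the 0.87 obtained with Gaussian Naive Bayes. Bernoulli Naive Bayes is designed for binary/boolean features. scikit-learn's `BernoulliNB` binarizes its inputs with a default threshold of 0 (`binarize=0.0`), so almost every color value maps to 1 and the features carry very little information. The model therefore predicts class 2 for the test samples (class 1 recall is 0.00), and its accuracy of 0.794 is simply the share of class 2 in the test set (48629 of 61265). Gaussian Naive Bayes assumes that each feature is normally distributed within each class, which may not hold for the color channels in this dataset.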
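
The Bernoulli result is easy to reproduce in isolation. `BernoulliNB` thresholds its inputs at `binarize=0.0` by default, and the sketch below applies the same comparison to made-up channel values in the range 1 to 255. Zero is left out here on the assumption that most real channel values are above 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up B, G, R values between 1 and 255 (hypothetical stand-in for the dataset)
pixels = rng.integers(1, 256, size=(1000, 3))

# Same rule as the default binarize=0.0: values above the threshold become 1
binarized = (pixels > 0.0).astype(int)

# Every row collapses to the same binary vector, so the classes look identical
unique_rows = np.unique(binarized, axis=0)
print(unique_rows)
```

With identical feature vectors for every sample, only the class prior can drive the prediction, which favors the majority class. A non-zero threshold passed through the `binarize` parameter would keep some information, at the cost of reducing each channel to a single bit.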