# Naive Bayes

Naive Bayes is a classification technique based on Bayes’ Theorem that assumes the predictors are independent of one another. In simple terms, a Naive Bayes classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.

For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other, the classifier treats each of them as contributing independently to the probability that the fruit is an apple. That simplification is why the method is called ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, it can be competitive with, and on some problems outperform, more sophisticated classification methods.

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to estimate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) * P(d2|h) * P(d3|h).
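
The factorisation can be made concrete with a toy example. The sketch below uses a small, hypothetical table of fruit (the feature values and counts are invented for illustration and are not taken from any data set in this notebook) and scores each hypothesis h as P(h) * P(d1|h) * P(d2|h), with every probability estimated from simple counts.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (colour, shape) -> label. Invented for illustration.
rows = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("green", "round"), "apple"),
    (("yellow", "long"), "banana"),
    (("yellow", "long"), "banana"),
    (("green", "long"), "banana"),
]

def fit_counts(rows):
    """Count class frequencies and per-feature value frequencies within each class."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (feature index, label) -> Counter of values
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def naive_score(features, label, class_counts, value_counts):
    """Unnormalised posterior: P(h) times the product of P(d_i | h)."""
    total = sum(class_counts.values())
    score = class_counts[label] / total  # prior P(h)
    for i, value in enumerate(features):
        score *= value_counts[(i, label)][value] / class_counts[label]  # P(d_i | h)
    return score

class_counts, value_counts = fit_counts(rows)
scores = {h: naive_score(("red", "round"), h, class_counts, value_counts)
          for h in class_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because this sketch applies no smoothing, a feature value never seen with a class drives that class's whole product to zero. Practical implementations usually add a small count to every value (Laplace smoothing) to avoid this.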

## Bayes' Theorem
Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) * P(h)) / P(d)

Where

###### P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.
###### P(d|h) is the probability of data d given that the hypothesis h is true. This is called the likelihood.
###### P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.
###### P(d) is the probability of the data (regardless of the hypothesis).
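
As a worked example of the formula, the sketch below computes a posterior from a prior and two likelihoods. The numbers are hypothetical and chosen only to illustrate the arithmetic. P(d) is obtained from the law of total probability over h and not-h.

```python
def posterior(prior_h, likelihood_d_given_h, likelihood_d_given_not_h):
    """Return P(h|d) from P(h), P(d|h) and P(d|not h) via Bayes' Theorem."""
    # P(d) by the law of total probability over h and not-h.
    evidence = (likelihood_d_given_h * prior_h
                + likelihood_d_given_not_h * (1 - prior_h))
    return likelihood_d_given_h * prior_h / evidence

# Hypothetical numbers: 1% of customers buy a product (the prior),
# 90% of buyers hold a savings account, and 20% of non-buyers do.
p = posterior(prior_h=0.01, likelihood_d_given_h=0.9, likelihood_d_given_not_h=0.2)
print(round(p, 4))
```

Even with a likelihood of 0.9, the posterior stays small here because the prior is small. This is the effect that P(h) has in the numerator of the theorem.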

### Useful Libraries

#### Load Dataset. Use "bank-data.csv"

In [21]:
# import dataset
import pandas as pd
df = pd.read_csv("bank-data.csv")
print(df)

          id  age     sex      region   income married  children  car  \
0    ID12101   48  FEMALE  INNER_CITY  17546.0      NO         1   NO   
1    ID12102   40    MALE        TOWN  30085.1     YES         3  YES   
2    ID12103   51  FEMALE  INNER_CITY  16575.4     YES         0  YES   
3    ID12104   23  FEMALE        TOWN  20375.4     YES         3   NO   
4    ID12105   57  FEMALE       RURAL  50576.3     YES         0   NO   
..       ...  ...     ...         ...      ...     ...       ...  ...   
474  ID12575   31  FEMALE        TOWN  22678.1      NO         1  YES   
475  ID12576   33  FEMALE        TOWN  12178.5     YES         2   NO   
476  ID12577   43    MALE       RURAL  26106.7      NO         1   NO   
477  ID12578   40    MALE  INNER_CITY  27417.6     YES         0   NO   
478  ID12579   47    MALE        TOWN  23337.2     YES         2   NO   

    save_act current_act mortgage  pep  
0         NO          NO       NO  YES  
1         NO         YES      YES   NO  


#### Preprocess the data

In [2]:
# import library for preprocessing
from sklearn import preprocessing

In [22]:
# Transform the data: map string attributes to numeric codes with replace(),
# then standardize the predictors with StandardScaler().fit_transform()
df = df.replace({"sex": {"MALE": 0, "FEMALE": 1},
                 "region": {"INNER_CITY": 0, "SUBURBAN": 1, "TOWN": 2, "RURAL": 3},
                 "married": {"NO": 0, "YES": 1}, 
                 "car": {"NO": 0, "YES": 1}, 
                 "save_act": {"NO": 0, "YES": 1}, 
                 "current_act": {"NO": 0, "YES": 1}, 
                 "mortgage": {"NO": 0, "YES": 1}, 
                 "pep": {"NO": 0, "YES": 1}})

# normalizing relevant attributes
df[["age", "sex", "region", "income", "married", "children",
    "car", "save_act", "current_act", "mortgage"]] = \
preprocessing.StandardScaler().fit_transform(df[["age", "sex", "region", "income", "married",
                                                 "children", "car", "save_act", "current_act", "mortgage"]])
print(df)

          id       age       sex   region    income   married  children  \
0    ID12101  0.365673  1.018969 -0.97830 -0.783389 -1.385905 -0.007880   
1    ID12102 -0.183625 -0.981384  0.73194  0.190099  0.721550  1.879330   
2    ID12103  0.571660  1.018969 -0.97830 -0.858742  0.721550 -0.951485   
3    ID12104 -1.350884  1.018969  0.73194 -0.563725  0.721550  1.879330   
4    ID12105  0.983634  1.018969  1.58706  1.780957  0.721550 -0.951485   
..       ...       ...       ...      ...       ...       ...       ...   
474  ID12575 -0.801586  1.018969  0.73194 -0.384952 -1.385905 -0.007880   
475  ID12576 -0.664261  1.018969  0.73194 -1.200101  0.721550  0.935725   
476  ID12577  0.022362 -0.981384  1.58706 -0.118769 -1.385905 -0.007880   
477  ID12578 -0.183625 -0.981384 -0.97830 -0.016995  0.721550 -0.951485   
478  ID12579  0.297011 -0.981384  0.73194 -0.333782  0.721550  0.935725   

          car  save_act  current_act  mortgage  pep  
0   -0.965117 -1.452718    -1.799705 -0.71820
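
To show what the standardization step does to a single column, here is a minimal sketch in plain Python. It assumes the usual z-score definition (subtract the mean, divide by the population standard deviation), which is what `StandardScaler` computes by default. The five ages are the first five rows of the table above, so the results will not match the scaled column printed above, which was fitted on every row.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score a column: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)  # population std (divides by n, not n - 1)
    return [(v - mean) / std for v in values]

ages = [48, 40, 51, 23, 57]  # first five ages from the bank data
z = standardize(ages)
print([round(v, 3) for v in z])
```

After the transformation the column has mean 0 and standard deviation 1, which puts attributes measured on very different scales (such as age and income) on a common footing.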

#### Select independent variables and target column

In [26]:
# Select the independent variables and the target attribute
X = df[["age", "sex", "region", "income", "married", "children",
    "car", "save_act", "current_act", "mortgage"]].values
Y = df["pep"].values
print(X)
print(Y)

[[ 0.36567334  1.01896902 -0.97830006 ... -1.45271801 -1.79970499
  -0.71820804]
 [-0.18362507 -0.9813841   0.73193983 ... -1.45271801  0.55564662
   1.39235423]
 [ 0.57166024  1.01896902 -0.97830006 ...  0.68836484  0.55564662
  -0.71820804]
 ...
 [ 0.02236183 -0.9813841   1.58705977 ... -1.45271801  0.55564662
  -0.71820804]
 [-0.18362507 -0.9813841  -0.97830006 ...  0.68836484  0.55564662
   1.39235423]
 [ 0.29701104 -0.9813841   0.73193983 ...  0.68836484  0.55564662
   1.39235423]]
[1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0
 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 1 1 1 0 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0
 1 1 0 1 0 1 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1
 0 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1 0 0
 0 0 1 1 1 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 0
 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 1
 1 0 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0

#### Import Naive Bayes Classifier library 

In [28]:
# import Classifier library
from sklearn.naive_bayes import GaussianNB

In [29]:
# Instantiate the classifier
naive = GaussianNB()
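
For intuition about what a Gaussian Naive Bayes classifier estimates, the following is a from-scratch sketch in plain Python. It is not scikit-learn's implementation: the function names, the toy data, and the small constant added to the variances are assumptions made for this illustration. Each class gets a prior plus one mean and one variance per feature, and prediction picks the class with the highest log posterior.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps variances strictly positive (an assumption of
        # this sketch; scikit-learn uses its own var_smoothing scheme).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_one(model, row):
    """Pick the class maximising log P(h) + sum of log N(x_i; mean, var)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny hypothetical data set: two well-separated clusters.
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [5.0, 6.1], [5.2, 5.9], [4.8, 6.0]]
y_toy = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(model, [1.1, 2.0]), predict_one(model, [5.1, 6.0]))
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when there are many features.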

#### Predict the target column and find the performance of the model

In [36]:
# Divide the dataset into training and testing partition
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# fit the data
naive.fit(X_train, Y_train)
Y_pred = naive.predict(X_test)

In [37]:
# Print Number of mislabeled points
print((Y_pred != Y_test).sum())

59


### Prediction and Evaluation

In [39]:
# import required libraries
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

In [40]:
# Calculate and print the confusion matrix and other performance measures (refer to the previous lab sheet)
print(classification_report(Y_test, Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test, Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.54      0.84      0.65        67
           1       0.72      0.38      0.50        77

    accuracy                           0.59       144
   macro avg       0.63      0.61      0.58       144
weighted avg       0.64      0.59      0.57       144

Confusion Matrix
[[56 11]
 [48 29]]

 Accuracy
0.5902777777777778
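
The figures in the classification report can be recovered from the confusion matrix by hand. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, which agrees with the support column above (row sums of 67 and 77). The helper name is illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [[56, 11],
      [48, 29]]

def per_class_metrics(cm, k):
    """Precision and recall for class k from a square confusion matrix."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    return true_pos / predicted_k, true_pos / actual_k

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
for k in range(len(cm)):
    precision, recall = per_class_metrics(cm, k)
    print(k, round(precision, 2), round(recall, 2))
print(round(accuracy, 4))
```

Read this way, the matrix shows that the model labels most test points as class 0: it recovers 56 of the 67 true class 0 points but only 29 of the 77 true class 1 points.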


#### Q1: Consider "current_act" as an irrelevant attribute. Remove it and find the accuracy of Naive Bayes classifier

In [41]:
# drop the "current_act" column and display the first 5 rows of the dataframe
df = df.drop(["current_act"], axis = 1)
print(df.head())

        id       age       sex   region    income   married  children  \
0  ID12101  0.365673  1.018969 -0.97830 -0.783389 -1.385905 -0.007880   
1  ID12102 -0.183625 -0.981384  0.73194  0.190099  0.721550  1.879330   
2  ID12103  0.571660  1.018969 -0.97830 -0.858742  0.721550 -0.951485   
3  ID12104 -1.350884  1.018969  0.73194 -0.563725  0.721550  1.879330   
4  ID12105  0.983634  1.018969  1.58706  1.780957  0.721550 -0.951485   

        car  save_act  mortgage  pep  
0 -0.965117 -1.452718 -0.718208    1  
1  1.036143 -1.452718  1.392354    0  
2  1.036143  0.688365 -0.718208    0  
3 -0.965117 -1.452718 -0.718208    0  
4 -0.965117  0.688365 -0.718208    0  


In [43]:
# Selecting the independent variables
X = df[["age", "sex", "region", "income", "married", "children",
    "car", "save_act", "mortgage"]].values
print(X)

[[ 0.36567334  1.01896902 -0.97830006 ... -0.96511741 -1.45271801
  -0.71820804]
 [-0.18362507 -0.9813841   0.73193983 ...  1.03614337 -1.45271801
   1.39235423]
 [ 0.57166024  1.01896902 -0.97830006 ...  1.03614337  0.68836484
  -0.71820804]
 ...
 [ 0.02236183 -0.9813841   1.58705977 ... -0.96511741 -1.45271801
  -0.71820804]
 [-0.18362507 -0.9813841  -0.97830006 ... -0.96511741  0.68836484
   1.39235423]
 [ 0.29701104 -0.9813841   0.73193983 ... -0.96511741  0.68836484
   1.39235423]]


In [44]:
# selecting only the target labeled column
Y = df["pep"].values
print(Y)

[1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0
 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 1 1 1 0 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0
 1 1 0 1 0 1 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1
 0 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1 0 0
 0 0 1 1 1 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 0
 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 1
 1 0 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0
 0 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0
 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 0 1 0 0
 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 1 1 0 1 0 1 1 0 0 0 1 1 1 0 0 0
 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 1 0 0 1 1 0 1 0 0 1 1 1 1
 1 1 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 0 0 1
 1 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 0 0]


In [45]:
# Apply the classifier and Print Number of mislabeled points
naive = GaussianNB()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
naive.fit(X_train, Y_train)
Y_pred = naive.predict(X_test)
print((Y_pred != Y_test).sum())

58


In [46]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test, Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test, Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.54      0.87      0.67        67
           1       0.76      0.36      0.49        77

    accuracy                           0.60       144
   macro avg       0.65      0.61      0.58       144
weighted avg       0.66      0.60      0.57       144

Confusion Matrix
[[58  9]
 [49 28]]

 Accuracy
0.5972222222222222


#### Q2: Write your observation

In [47]:
# The accuracy went up slightly after removing "current_act":
# from 0.590 to 0.597 (59 vs. 58 mislabeled points out of 144).
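
One rough way to judge whether a change of this size matters is to compare it with the sampling noise of an accuracy estimate on 144 test points. The sketch below uses the binomial standard error, sqrt(p * (1 - p) / n). This is only a back-of-the-envelope yardstick under an independence assumption. Both models were scored on the same test partition, so a paired test or cross-validation would be a more careful comparison.

```python
import math

n_test = 144                  # size of the test partition reported above
acc_before = 85 / n_test      # 59 mislabeled points with "current_act"
acc_after = 86 / n_test       # 58 mislabeled points without it

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
std_err = math.sqrt(acc_before * (1 - acc_before) / n_test)
difference = acc_after - acc_before
print(round(difference, 4), round(std_err, 4))
```

The difference is one test point, well inside one standard error, so on this evidence removing the attribute neither clearly helps nor clearly hurts.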

### Load "car.csv" dataset. 

#### Q3: Apply Naive Bayes classifier on this dataset

In [76]:
# Load the data
df = pd.read_csv("car.csv", header = None)
# shuffle the DataFrame rows
df = df.sample(frac = 1)
print(df)

          0      1  2  3      4     5      6
746    high    med  6  4    big  high    acc
179   vhigh   high  4  4    big  high  unacc
1521    low    med  2  4  small   low  unacc
694    high    med  3  6  small   med  unacc
1374    low  vhigh  4  6    big   low  unacc
...     ...    ... .. ..    ...   ...    ...
848    high    low  6  4  small  high    acc
1706    low    low  6  2    med  high  unacc
484    high  vhigh  3  6    big   med  unacc
449    high  vhigh  2  4    big  high  unacc
1266    med    low  4  6    big   low  unacc

[1728 rows x 7 columns]


In [77]:
# Preprocess and transform the data using the "fit_transform(attribute)" function
# converting string attributes to numeric attributes
df = df.replace({0: {"low": 0, "med": 1, "high": 2, "vhigh": 3},
                1: {"low": 0, "med": 1, "high": 2, "vhigh": 3},
                4: {"small": 0, "med": 1, "big": 2},
                5: {"low": 0, "med": 1, "high": 2},
                6: {"unacc": 0, "acc": 1, "good": 2, "vgood": 3}})

# normalizing relevant attributes
df[df.columns[0:6]] = preprocessing.StandardScaler().fit_transform(df[df.columns[0:6]])
print(df)

             0         1         2         3         4         5  6
746   0.447214 -0.447214  1.521278  0.000000  1.224745  1.224745  1
179   1.341641  0.447214  0.169031  0.000000  1.224745  1.224745  0
1521 -1.341641 -0.447214 -1.183216  0.000000 -1.224745 -1.224745  0
694   0.447214 -0.447214 -0.507093  1.224745 -1.224745  0.000000  0
1374 -1.341641  1.341641  0.169031  1.224745  1.224745 -1.224745  0
...        ...       ...       ...       ...       ...       ... ..
848   0.447214 -1.341641  1.521278  0.000000 -1.224745  1.224745  1
1706 -1.341641 -1.341641  1.521278 -1.224745  0.000000  1.224745  0
484   0.447214  1.341641 -0.507093  1.224745  1.224745  0.000000  0
449   0.447214  1.341641 -1.183216  0.000000  1.224745  1.224745  0
1266 -0.447214 -1.341641  0.169031  1.224745  1.224745 -1.224745  0

[1728 rows x 7 columns]


In [78]:
# Select the independent variables and the target attribute
X = df[df.columns[0:6]]
Y = df[df.columns[6]]
print(X)
print(Y)

             0         1         2         3         4         5
746   0.447214 -0.447214  1.521278  0.000000  1.224745  1.224745
179   1.341641  0.447214  0.169031  0.000000  1.224745  1.224745
1521 -1.341641 -0.447214 -1.183216  0.000000 -1.224745 -1.224745
694   0.447214 -0.447214 -0.507093  1.224745 -1.224745  0.000000
1374 -1.341641  1.341641  0.169031  1.224745  1.224745 -1.224745
...        ...       ...       ...       ...       ...       ...
848   0.447214 -1.341641  1.521278  0.000000 -1.224745  1.224745
1706 -1.341641 -1.341641  1.521278 -1.224745  0.000000  1.224745
484   0.447214  1.341641 -0.507093  1.224745  1.224745  0.000000
449   0.447214  1.341641 -1.183216  0.000000  1.224745  1.224745
1266 -0.447214 -1.341641  0.169031  1.224745  1.224745 -1.224745

[1728 rows x 6 columns]
746     1
179     0
1521    0
694     0
1374    0
       ..
848     1
1706    0
484     0
449     0
1266    0
Name: 6, Length: 1728, dtype: int64


In [79]:
# Apply the classifier
naive = GaussianNB()

In [80]:
# Divide the dataset into training and testing partition
# predictions for testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
naive.fit(X_train, Y_train)
Y_pred = naive.predict(X_test)

In [81]:
# Print Number of mislabeled points
print((Y_pred != Y_test).sum())

185


In [82]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test, Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test, Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.81      0.85      0.83       343
           1       0.47      0.11      0.18       133
           2       0.69      0.39      0.50        23
           3       0.17      1.00      0.29        20

    accuracy                           0.64       519
   macro avg       0.54      0.59      0.45       519
weighted avg       0.69      0.64      0.63       519

Confusion Matrix
[[290  12   0  41]
 [ 64  15   4  50]
 [  3   5   9   6]
 [  0   0   0  20]]

 Accuracy
0.6435452793834296


#### Q4: Find the correlation between the attributes of the dataset.

In [90]:
# Find the pairwise correlation of attributes and arrange in ascending order
df.corr().unstack().sort_values()

6  0   -2.827504e-01
0  6   -2.827504e-01
1  6   -2.324215e-01
6  1   -2.324215e-01
2  1   -3.350586e-17
1  2   -3.350586e-17
4  1   -7.966878e-18
1  4   -7.966878e-18
   5   -1.541976e-18
5  1   -1.541976e-18
0  1   -1.156482e-18
1  0   -1.156482e-18
3  4   -7.709882e-19
5  3   -7.709882e-19
1  3   -7.709882e-19
3  5   -7.709882e-19
   1   -7.709882e-19
4  3   -7.709882e-19
5  4   -2.569961e-19
4  5   -2.569961e-19
   2    7.067392e-19
2  4    7.067392e-19
   3    7.709882e-19
3  2    7.709882e-19
0  4    1.798972e-18
4  0    1.798972e-18
2  5    5.718163e-18
5  2    5.718163e-18
3  0    5.910910e-18
0  3    5.910910e-18
5  0    7.709882e-18
0  5    7.709882e-18
2  0    8.063252e-18
0  2    8.063252e-18
2  6    5.984170e-02
6  2    5.984170e-02
4  6    1.579317e-01
6  4    1.579317e-01
   3    3.417068e-01
3  6    3.417068e-01
6  5    4.393373e-01
5  6    4.393373e-01
0  0    1.000000e+00
3  3    1.000000e+00
4  4    1.000000e+00
2  2    1.000000e+00
1  1    1.000000e+00
5  5    1.000
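
The attribute-to-attribute correlations above are all on the order of 1e-17, which is zero up to floating-point error. One possible explanation, offered here as an assumption rather than something the notebook states, is that the data set enumerates every combination of attribute levels (4 x 4 x 4 x 3 x 3 x 3 = 1728, matching the row count). The sketch below builds such a grid with hypothetical integer codes and computes a Pearson correlation by hand.

```python
from itertools import product
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Assumed design: every combination of 4, 4, 4, 3, 3 and 3 attribute levels.
grid = list(product(range(4), range(4), range(4), range(3), range(3), range(3)))
columns = list(zip(*grid))
print(len(grid), pearson(columns[0], columns[1]))
```

In a balanced grid like this, every level of one attribute appears equally often with every level of another, so the predictors are uncorrelated by construction. Only the correlations with the target column (6) carry information.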

#### Q5: Remove one of the highly correlated attributes and apply Naive Bayes classifier

In [91]:
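# Drop attribute 1 (its correlation with the target, column 6, is about -0.23).
# Note: the attribute-to-attribute correlations above are all effectively zero,
# so no pair of predictors is highly correlated with each other.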
df = df.drop([1], axis = 1)
print(df.head())

             0         2         3         4         5  6
746   0.447214  1.521278  0.000000  1.224745  1.224745  1
179   1.341641  0.169031  0.000000  1.224745  1.224745  0
1521 -1.341641 -1.183216  0.000000 -1.224745 -1.224745  0
694   0.447214 -0.507093  1.224745 -1.224745  0.000000  0
1374 -1.341641  0.169031  1.224745  1.224745 -1.224745  0


In [93]:
# Apply the classifier
# Divide the dataset into training and testing partition
# predictions for testing partition
# Print Number of mislabeled points
X = df[df.columns[0:5]]
Y = df[df.columns[5]]
naive = GaussianNB()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
naive.fit(X_train, Y_train)
Y_pred = naive.predict(X_test)
print((Y_pred != Y_test).sum())

199


In [94]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test, Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test, Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.77      0.84      0.80       343
           1       0.48      0.09      0.15       133
           2       0.00      0.00      0.00        23
           3       0.17      1.00      0.29        20

    accuracy                           0.62       519
   macro avg       0.35      0.48      0.31       519
weighted avg       0.64      0.62      0.58       519

Confusion Matrix
[[288  11   0  44]
 [ 71  12   0  50]
 [ 15   2   0   6]
 [  0   0   0  20]]

 Accuracy
0.6165703275529865




#### Q6: Write your observation below on the performance of the model in Q3 and Q5

In [95]:
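# The accuracy dropped slightly after removing attribute 1:
# from 0.644 to 0.617 (185 vs. 199 mislabeled points out of 519).
# Class 2 is no longer predicted at all; its recall fell from 0.39 to 0.00.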
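
Overall accuracy hides which classes were affected. The sketch below recomputes per-class recall from the two confusion matrices printed in Q3 and Q5, again assuming rows are true classes and columns are predictions. The helper name is illustrative.

```python
# Confusion matrices printed for Q3 (all attributes) and Q5 (attribute 1 dropped).
cm_q3 = [[290, 12, 0, 41], [64, 15, 4, 50], [3, 5, 9, 6], [0, 0, 0, 20]]
cm_q5 = [[288, 11, 0, 44], [71, 12, 0, 50], [15, 2, 0, 6], [0, 0, 0, 20]]

def recalls(cm):
    """Recall per class: diagonal entry divided by its row sum."""
    return [row[k] / sum(row) for k, row in enumerate(cm)]

for k, (before, after) in enumerate(zip(recalls(cm_q3), recalls(cm_q5))):
    print(k, round(before, 2), round(after, 2))
```

Most of the loss falls on class 2, which the second model never predicts. Classes 0 and 3 are almost unchanged.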