# Program 6 - Naive Bayesian (Doc)

We use the Multinomial Naive Bayes classifier from the `scikit-learn` library.

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

Load the labelled data set from a local CSV file

In [2]:
msg = pd.read_csv('prog6_dataset.csv', names = ['message', 'label'])
print("Total instances in the data set:\n", msg.shape[0])

Total instances in the data set:
 18


`x` contains messages (feature)

`y` contains label number (target)

In [3]:
# msg['labelnum'] holds the numeric target (1 = pos, 0 = neg) that the NB classifier is trained on

msg['labelnum'] = msg.label.map({
    'pos': 1,
    'neg': 0
})
x = msg.message
y = msg.labelnum
x, y

(0                      I love this sandwich
 1                  This is an amazing place
 2        I feel very good about these beers
 3                      This is my best work
 4                      What an awesome view
 5             I do not like this restaurant
 6                  I am tired of this stuff
 7                    I can't deal with this
 8                      He is my sworn enemy
 9                       My boss is horrible
 10                 This is an awesome place
 11    I do not like the taste of this juice
 12                          I love to dance
 13        I am sick and tired of this place
 14                     What a great holiday
 15           That is a bad locality to stay
 16           We will have good fun tomorrow
 17         I went to my enemy's house today
 Name: message, dtype: object, 0     1
 1     1
 2     1
 3     1
 4     1
 5     0
 6     0
 7     0
 8     0
 9     0
 10    1
 11    0
 12    1
 13    0
 14    1
 15    0
 16    1
 17    

The first 5 messages are printed with their labels

In [4]:
x5, y5 = x[0:5], msg.label[0:5]
for x1, y1 in zip(x5, y5):
    print(x1, ',', y1)

I love this sandwich , pos
This is an amazing place , pos
I feel very good about these beers , pos
This is my best work , pos
What an awesome view , pos


Split the data set into training and testing data (by default, `train_test_split` holds out 25% of the rows for testing)

In [5]:
xtrain, xtest, ytrain, ytest = train_test_split(x, y)
print('Total training instances: ', xtrain.shape[0])
print('Total testing instances: ', xtest.shape[0])

Total training instances:  13
Total testing instances:  5


`CountVectorizer` is used for feature extraction

The output of the count vectorizer is a sparse matrix

In [13]:
count_vec = CountVectorizer()
xtrain_dtm = count_vec.fit_transform(xtrain) # sparse matrix
xtest_dtm = count_vec.transform(xtest)
print("Total features extracted using CountVectorizer: ", xtrain_dtm.shape[1])

print("Features for first 5 training instances are listed below\n")
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df = pd.DataFrame(xtrain_dtm.toarray(), columns = count_vec.get_feature_names_out())
print(df[:5])
# print(xtrain_dtm) # same as above, but sparse matrix representation

Total features extracted using CountVectorizer:  47
Features for first 5 training instances are listed below

   about  am  amazing  an  awesome  bad  beers  best  boss  can  ...   tired  \
0      0   0        0   0        0    0      0     0     0    0  ...       0   
1      0   0        0   1        1    0      0     0     0    0  ...       0   
2      0   0        0   0        0    0      0     0     0    0  ...       0   
3      0   0        0   0        0    0      0     0     1    0  ...       0   
4      0   0        0   0        0    0      0     0     0    0  ...       0   

   to  tomorrow  very  view  we  what  will  with  work  
0   0         0     0     0   0     0     0     0     0  
1   0         0     0     1   0     1     0     0     0  
2   0         0     0     0   0     0     0     0     0  
3   0         0     0     0   0     0     0     0     0  
4   0         1     0     0   1     0     1     0     0  

[5 rows x 47 columns]
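
The feature table above has no column for one-letter words such as "I" or "a". That is consistent with `CountVectorizer`'s defaults, which lowercase the text and keep only tokens of two or more word characters. The following is a minimal pure-Python sketch of the same bag-of-words idea, not scikit-learn's implementation; the helper names `tokenize`, `build_vocabulary`, and `to_count_rows` are hypothetical, and it returns dense lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_rows(docs, vocab):
    """Return one row of token counts per document (dense, for readability)."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


docs = ["I love this sandwich", "What an awesome view"]
vocab = build_vocabulary(docs)
rows = to_count_rows(docs, vocab)

print(list(vocab))
for row in rows:
    print(row)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows. This is why `CountVectorizer` returns a sparse matrix instead of a dense one.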


## Training Naive Bayes Classifier

In [11]:
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)
predicted

array([1, 0, 0, 1, 0], dtype=int64)
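
To make the training step less of a black box, here is a rough pure-Python sketch of the multinomial Naive Bayes idea: estimate a prior per class and a smoothed word probability per class, then pick the class with the highest log score. It assumes Laplace smoothing with `alpha=1.0` and priors learned from class frequencies, which match `MultinomialNB`'s defaults, but it is an illustration and not the library's code. The function names and the tiny training set are hypothetical.

```python
import math
from collections import Counter


def train_multinomial_nb(token_docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in token_docs for tok in doc})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(token_docs, labels) if y == c]
        # Prior: fraction of training documents that belong to class c.
        log_prior[c] = math.log(len(class_docs) / len(token_docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        # Laplace smoothing: add alpha to every word count.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((counts[tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical, already-tokenized training data (1 = pos, 0 = neg).
train_docs = [
    ["love", "this", "sandwich"],
    ["what", "an", "awesome", "view"],
    ["my", "boss", "is", "horrible"],
    ["do", "not", "like", "this", "restaurant"],
]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
print(predict(["awesome", "sandwich"], log_prior, log_likelihood))
print(predict(["horrible", "boss"], log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.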

Classification results of testing samples are:

In [8]:
for doc, p in zip(xtest, predicted):
    pred = 'pos' if p == 1 else 'neg'
    print('%s -> %s' % (doc, pred))

I love this sandwich -> pos
I went to my enemy's house today -> neg
I am sick and tired of this place -> neg
What a great holiday -> pos
I do not like this restaurant -> neg


## Metrics

In [9]:
print("Accuracy: ", metrics.accuracy_score(ytest, predicted))
print("Recall: ", metrics.recall_score(ytest, predicted))
print("Precision: ", metrics.precision_score(ytest, predicted))

Accuracy:  1.0
Recall:  1.0
Precision:  1.0


Confusion Matrix

In [10]:
print(metrics.confusion_matrix(ytest, predicted))

[[3 0]
 [0 2]]
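
The three scores reported earlier can be recomputed by hand from this matrix. In scikit-learn's layout the rows are the true labels and the columns are the predicted labels, both in sorted order (0 = neg, 1 = pos). The sketch below simply hard-codes the matrix printed above.

```python
# Confusion matrix copied from the output above:
# rows = true label, columns = predicted label, order [0 (neg), 1 (pos)].
cm = [[3, 0],
      [0, 2]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were correct

print("Accuracy: ", accuracy)
print("Recall: ", recall)
print("Precision: ", precision)
```

With no off-diagonal entries, all three values come out as 1.0, matching the output of the Metrics cell.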
