***Machine Learning Exp. 6 / 15th Sep 2022***

Naive Bayesian Classifier Model: Given a set of documents to be classified, use the naive Bayesian classifier model to perform the classification. Built-in Java classes/API can be used to write the program; this notebook uses Python with scikit-learn instead. Calculate the accuracy, precision, and recall for the data set.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

path = "/content/drive/MyDrive/naivetext.csv"

In [None]:
msg = pd.read_csv(path, names=['message', 'label'])

In [None]:
msg.head()

                              message  label
0                I love this sandwich    pos
1            This is an amazing place    pos
2  I feel very good about these beers    pos
3                This is my best work    pos
4                What an awesome view    pos


In [None]:
msg.tail()

                              message  label
13  I am sick and tired of this place    neg
14               What a great holiday    pos
15     That is a bad locality to stay    neg
16     We will have good fun tomorrow    pos
17   I went to my enemy's house today    neg


In [None]:
msg['labelnum'] = msg.label.map({'pos':1, 'neg':0})
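
`Series.map` returns NaN for any label that is missing from the dictionary, so a stray value such as `Pos` would pass silently into training. A minimal sketch of a check, using a small hypothetical `labels` series rather than the real CSV:

```python
import pandas as pd

# Hypothetical labels, including one that the mapping does not cover.
labels = pd.Series(["pos", "neg", "Pos"])
labelnum = labels.map({"pos": 1, "neg": 0})

# Unmapped labels become NaN; collect them before training.
unmapped = labels[labelnum.isna()].tolist()
print(unmapped)
```

If the list is non-empty, the offending rows can be normalized (for example, lowercased) or dropped before the split.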

In [None]:
msg.head()

                              message  label  labelnum
0                I love this sandwich    pos         1
1            This is an amazing place    pos         1
2  I feel very good about these beers    pos         1
3                This is my best work    pos         1
4                What an awesome view    pos         1


In [None]:
msg.tail()

                              message  label  labelnum
13  I am sick and tired of this place    neg         0
14               What a great holiday    pos         1
15     That is a bad locality to stay    neg         0
16     We will have good fun tomorrow    pos         1
17   I went to my enemy's house today    neg         0


In [None]:
X = msg.message
X

0                      I love this sandwich
1                  This is an amazing place
2        I feel very good about these beers
3                      This is my best work
4                      What an awesome view
5             I do not like this restaurant
6                  I am tired of this stuff
7                    I can't deal with this
8                      He is my sworn enemy
9                       My boss is horrible
10                 This is an awesome place
11    I do not like the taste of this juice
12                          I love to dance
13        I am sick and tired of this place
14                     What a great holiday
15           That is a bad locality to stay
16           We will have good fun tomorrow
17         I went to my enemy's house today
Name: message, dtype: object

In [None]:
y = msg.labelnum
y

0     1
1     1
2     1
3     1
4     1
5     0
6     0
7     0
8     0
9     0
10    1
11    0
12    1
13    0
14    1
15    0
16    1
17    0
Name: labelnum, dtype: int64

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)
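
With only 18 messages, an unseeded split selects a different 5-message test set on each run, so the metrics reported later can change between runs. A sketch of a reproducible, class-balanced split, assuming it is acceptable to fix `random_state` (the value 42 is arbitrary) and to stratify on the label; the `X_demo` and `y_demo` lists are stand-ins for `X` and `y`:

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 18 items with 9 positive and 9 negative labels,
# the same class balance as the data set above.
X_demo = [f"doc {i}" for i in range(18)]
y_demo = [1] * 9 + [0] * 9

def split(seed):
    # stratify keeps the pos/neg ratio similar in the train and test sets.
    return train_test_split(X_demo, y_demo, test_size=0.25,
                            random_state=seed, stratify=y_demo)

x_tr, x_te, y_tr, y_te = split(42)
print(len(x_tr), len(x_te), sum(y_te))
```

Calling `split` twice with the same seed returns the same partition, which makes the reported scores repeatable.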

In [None]:
# CountVectorizer counts how often each word occurs in each document
count_vect = CountVectorizer()

In [None]:
# Document-term matrix (DTM): learn the vocabulary from the training set, then reuse it for the test set
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm = count_vect.transform(xtest)

In [None]:
print("Words present in text document")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count_vect.get_feature_names_out().tolist())

Words present in text document
['about', 'am', 'amazing', 'an', 'and', 'awesome', 'bad', 'beers', 'best', 'boss', 'can', 'deal', 'do', 'enemy', 'feel', 'fun', 'good', 'great', 'have', 'he', 'holiday', 'horrible', 'is', 'juice', 'like', 'locality', 'love', 'my', 'not', 'of', 'place', 'sandwich', 'sick', 'stay', 'sworn', 'taste', 'that', 'the', 'these', 'this', 'tired', 'to', 'tomorrow', 'very', 'view', 'we', 'what', 'will', 'with', 'work']




In [None]:
df = pd.DataFrame(xtrain_dtm.toarray(), columns=count_vect.get_feature_names_out())

In [None]:
df.head()

   about  am  amazing  an  and  awesome  bad  beers  best  boss  ...  tired  to  tomorrow  very  view  we  what  will  with  work
0      0   0        1   1    0        0    0      0     0     0  ...      0   0         0     0     0   0     0     0     0     0
1      0   0        0   0    0        0    0      0     0     0  ...      0   0         1     0     0   1     0     1     0     0
2      0   0        0   0    0        0    0      0     0     0  ...      0   0         0     0     0   0     0     0     0     0
3      0   0        0   0    0        0    0      0     0     0  ...      0   0         0     0     0   0     0     0     1     0
4      0   1        0   0    1        0    0      0     0     0  ...      1   0         0     0     0   0     0     0     0     0


In [None]:
mclf = MultinomialNB().fit(xtrain_dtm, ytrain)
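
`MultinomialNB` applies additive (Laplace) smoothing with `alpha=1.0` by default. The following from-scratch sketch shows the same idea for illustration only; it is not scikit-learn's implementation, and `train_nb`, `predict_nb`, and the tiny tokenized corpus are hypothetical:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Return log priors and smoothed log likelihoods per class."""
    vocab = sorted({w for doc in docs for w in doc})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        # Additive smoothing: unseen (word, class) pairs keep a nonzero probability.
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Pick the class with the highest log posterior; skip out-of-vocabulary words."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical pre-tokenized corpus: 1 = pos, 0 = neg.
docs = [["love", "this", "sandwich"], ["awesome", "place"],
        ["boss", "is", "horrible"], ["tired", "of", "this", "place"]]
labels = [1, 1, 0, 0]

log_prior, log_lik = train_nb(docs, labels)
print(predict_nb(["love", "this", "place"], log_prior, log_lik))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied.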

In [None]:
predicted = mclf.predict(xtest_dtm)
predicted

array([0, 1, 1, 0, 0])

In [None]:
ytest

5     0
12    1
10    1
6     0
17    0
Name: labelnum, dtype: int64

In [None]:
print("Confusion matrix")
print(metrics.confusion_matrix(ytest, predicted))

Confusion matrix
[[3 0]
 [0 2]]


In [None]:
print("Accuracy ", metrics.accuracy_score(ytest, predicted))

Accuracy  1.0


In [None]:
print("Precision ", metrics.precision_score(ytest, predicted))

Precision  1.0


In [None]:
print("Recall ", metrics.recall_score(ytest, predicted))

Recall  1.0
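
For reference, the three scores can be derived by hand from the confusion matrix. scikit-learn lays it out with true labels as rows and predicted labels as columns, so for labels 0 and 1 it reads [[TN, FP], [FN, TP]]. A sketch using the `ytest` and `predicted` values printed above (the helper name `scores_from_labels` is hypothetical):

```python
def scores_from_labels(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

y_true = [0, 1, 1, 0, 0]      # ytest from the run above
y_pred = [0, 1, 1, 0, 0]      # predicted from the run above
result = scores_from_labels(y_true, y_pred)
print(result)
```

With only five test messages, a single misclassification changes accuracy by 0.2, so perfect scores on this split say little about how the model generalizes.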


In [None]:
newText = ["my boss is best"]
newText

['my boss is best']

In [None]:
newText_dtm = count_vect.transform(newText)

In [None]:
newText_predicted = mclf.predict(newText_dtm)

In [None]:
print("Predicted result of newText is ", newText_predicted)

Predicted result of newText is  [0]
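
Under the `labelnum` encoding, `[0]` corresponds to `neg`. To see how confident a model is in such a prediction, `predict_proba` can be inspected. The sketch below is a variant under stated assumptions, not the notebook's exact model: it fits a pipeline on all 18 messages listed earlier instead of the 13-message training split, so its probabilities may differ from those of the fitted `mclf`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The 18 messages and labels shown earlier in this notebook (1 = pos, 0 = neg).
messages = [
    "I love this sandwich", "This is an amazing place",
    "I feel very good about these beers", "This is my best work",
    "What an awesome view", "I do not like this restaurant",
    "I am tired of this stuff", "I can't deal with this",
    "He is my sworn enemy", "My boss is horrible",
    "This is an awesome place", "I do not like the taste of this juice",
    "I love to dance", "I am sick and tired of this place",
    "What a great holiday", "That is a bad locality to stay",
    "We will have good fun tomorrow", "I went to my enemy's house today",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Vectorizer and classifier chained so raw text can be passed directly.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

proba = model.predict_proba(["my boss is best"])
# Columns follow model[-1].classes_, i.e. [P(neg), P(pos)].
print(model[-1].classes_, proba)
```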
