## **Lab 10**

Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.

## Description

This program builds a text classification model using the Naive Bayes algorithm. It classifies text messages into two categories, "positive" and "negative" sentiment, based on the words present in the messages.

Here's a more detailed breakdown:

**Data Preparation**: The program starts by loading a dataset of text messages and their corresponding labels (positive or negative). It then converts the text labels into numerical representations (0 for negative, 1 for positive) and splits the data into training and testing sets. Because train_test_split is called without a test size, scikit-learn's default applies and 25% of the rows are held out for testing.
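The label mapping and the split can be sketched on a small in-memory sample. The eight rows below are a stand-in for the CSV file, and `test_size`, `stratify`, and `random_state` are assumptions added here to make the split reproducible and class-balanced; the program itself relies on the defaults.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical in-memory sample standing in for the CSV file.
msg = pd.DataFrame({
    "message": [
        "I love this sandwich", "This is an amazing place",
        "What an awesome view", "I love to dance",
        "I do not like this restaurant", "I am tired of this stuff",
        "My boss is horrible", "He is my sworn enemy",
    ],
    "label": ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"],
})

# Map the text labels to integers: 1 = positive, 0 = negative.
msg["labelnum"] = msg["label"].map({"pos": 1, "neg": 0})

# Hold out 25% of the rows for testing. stratify keeps both classes in
# each split; random_state fixes the shuffle (both are assumptions).
xtrain, xtest, ytrain, ytest = train_test_split(
    msg["message"], msg["labelnum"],
    test_size=0.25, stratify=msg["labelnum"], random_state=0,
)

print("train rows:", len(xtrain), "test rows:", len(xtest))
```

On a dataset this small, an unstratified random split can leave the test set with only one class, which makes precision or recall undefined for the other.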

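**Feature Extraction**: To let the machine learning algorithm work with text, the program uses scikit-learn's CountVectorizer, which implements a bag-of-words representation. It builds a vocabulary from the training messages and converts each message into a numerical vector in which each element is the number of times a particular word occurs in that message. Stacking these vectors row by row gives a document-term matrix, which is stored as a sparse matrix. The vocabulary is learned from the training set only and then reused to transform the test set.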

**Model Training**: The core of the program is training a multinomial Naive Bayes classifier (MultinomialNB). From the training data, the classifier estimates how often each class occurs and how likely each word is within each class, and uses these estimates to relate the word counts to the labels (positive or negative).
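To make "learning" concrete, the sketch below recomputes the per-class word likelihoods by hand and compares them with what MultinomialNB stores in `feature_log_prob_`. The four toy documents are made up for illustration, and `alpha=1.0` (Laplace smoothing) is scikit-learn's default, written out explicitly here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = positive, 0 = negative.
docs = ["love this place", "awesome place", "horrible place", "not good"]
labels = np.array([1, 1, 0, 0])

vect = CountVectorizer()
X = vect.fit_transform(docs).toarray()
clf = MultinomialNB(alpha=1.0).fit(X, labels)

# Smoothed likelihood, computed by hand for each class c:
#   P(word | c) = (count of word in c + alpha)
#                 / (total word count in c + alpha * vocabulary size)
alpha = 1.0
manual = []
for c in clf.classes_:
    counts = X[labels == c].sum(axis=0)
    manual.append(np.log((counts + alpha) / (counts.sum() + alpha * X.shape[1])))
manual = np.array(manual)

print("matches scikit-learn:", np.allclose(manual, clf.feature_log_prob_))
print("class priors:", np.exp(clf.class_log_prior_))
```

The smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero at prediction time.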

**Prediction and Evaluation**: Once the model is trained, it is used to predict the labels of the text messages in the testing set. The program then evaluates the performance of the model by comparing its predictions with the true labels, using metrics like accuracy, precision, and recall.
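The three metrics follow directly from the confusion matrix. The sketch below uses hypothetical true and predicted labels (not output from the program) and checks the hand-computed values against scikit-learn's functions.

```python
from sklearn import metrics

# Hypothetical labels for eight test messages: 1 = positive, 0 = negative.
ytrue = [1, 1, 1, 0, 0, 1, 0, 0]
ypred = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in metrics.confusion_matrix(ytrue, ypred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found

print(accuracy, metrics.accuracy_score(ytrue, ypred))
print(precision, metrics.precision_score(ytrue, ypred))
print(recall, metrics.recall_score(ytrue, ypred))
```

Precision and recall here are reported for the positive class (label 1), which is what `precision_score` and `recall_score` do by default for binary labels.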

**Inference and Purpose**:

The main purpose of this program is to build a model that can automatically classify new, unseen text messages into positive or negative sentiment categories. By analyzing the words present in a message, the model infers the sentiment expressed by the author.

**Here's what the program infers:**

**Sentiment Classification**: Given a new text message, the program uses the trained Naive Bayes model to predict whether the sentiment expressed in the message is positive or negative.

**Word Importance**: CountVectorizer only counts words; it does not rank them. The Naive Bayes model is what learns, for each class, how likely each word is. Words that are much more likely under one class than the other carry more weight in the model's decision.

**Applications**:

This type of text classification program has various applications, including:

**Sentiment Analysis**: Analyzing customer reviews, social media posts, and other textual data to understand public opinion about products, services, or brands.

**Spam Detection**: Identifying spam emails or messages based on the language used.

**Topic Classification**: Categorizing news articles or documents into different topics.

In essence, this program demonstrates a fundamental machine learning task, text classification using the Naive Bayes algorithm. It provides a basic framework for building models that can automatically infer meaning and sentiment from textual data.

import pandas as pd
msg=pd.read_csv('/content/sample_data/document.csv',names=['message','label'])
print('The dimensions of the dataset',msg.shape)
msg['labelnum']=msg.label.map({'pos':1,'neg':0})
X=msg.message
y=msg.labelnum
print(X)
print(y)

#splitting the dataset into train and test data
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(X,y)
print('\n The total number of Training Data :',ytrain.shape)
print('\n The total number of Test Data :',ytest.shape)
# The output of CountVectorizer is a sparse document-term matrix

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm=count_vect.transform(xtest)
print('\n The words or tokens in the text documents \n')
# get_feature_names_out() replaces get_feature_names(), which was removed from scikit-learn
print(count_vect.get_feature_names_out())
# Dense view of the training document-term matrix, one column per token
df=pd.DataFrame(xtrain_dtm.toarray(),columns=count_vect.get_feature_names_out())
# Training Naive Bayes (NB) classifier on training data.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(xtrain_dtm,ytrain)
predicted = clf.predict(xtest_dtm)
# Printing accuracy, confusion matrix, precision and recall
from sklearn import metrics
print('\n Accuracy of the classifier is',metrics.accuracy_score(ytest,predicted))
print('\n Confusion matrix')
print(metrics.confusion_matrix(ytest,predicted))
print('\n The value of Precision',metrics.precision_score(ytest,predicted))
print('\n The value of Recall',metrics.recall_score(ytest,predicted))

The dimensions of the dataset (18, 2)
0                      I love this sandwich
1                  This is an amazing place
2        I feel very good about these beers
3                      This is my best work
4                      What an awesome view
5             I do not like this restaurant
6                  I am tired of this stuff
7                    I can't deal with this
8                      He is my sworn enemy
9                       My boss is horrible
10                 This is an awesome place
11    I do not like the taste of this juice
12                          I love to dance
13        I am sick and tired of this place
14                     What a great holiday
15           That is a bad locality to stay
16           We will have good fun tomorrow
17         I went to my enemy's house today
Name: message, dtype: object
0     1
1     1
2     1
3     1
4     1
5     0
6     0
7     0
8     0
9     0
10    1
11    0
12    1
13    0
14    1
15    0
16    1
17   