# **Text Classification with Naive Bayes: A Complete Guide from Data Preparation to Model Evaluation**


This notebook guides you through text classification using Naive Bayes. It covers data cleaning, feature extraction with CountVectorizer, model training, and evaluation. Learn how to preprocess text data, build a Bag of Words model, and assess model performance with confusion matrices and accuracy scores.

# Natural Language Processing

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human language. The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

The dataset consists of 1,000 **restaurant reviews** with two columns: Review and Liked. The Review column contains textual feedback from customers, while the Liked column indicates sentiment (1 for positive, 0 for negative). The goal is to predict the Liked label from the review text.

In [2]:
# quoting = 3 is csv.QUOTE_NONE: quote characters inside reviews are read as ordinary text
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
dataset

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


In [9]:
print(dataset.iloc[5])

Review    Now I am getting angry and I want my damn pho.
Liked                                                  0
Name: 5, dtype: object


## Cleaning the texts

In [10]:
import re
import nltk
nltk.download('stopwords')                                      # Download stop words list
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
  review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  all_stopwords = stopwords.words('english')
  all_stopwords.remove('not')
  review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
  review = ' '.join(review)
  corpus.append(review)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



1. `review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])`

    This line uses a regular expression to clean the review text. It replaces every character that is not an ASCII letter (a-z, A-Z) with a single space. Numbers, punctuation, special characters, and other non-alphabetic symbols are removed, leaving only letters and spaces. Because the replacement is one space per character, a run of non-letters becomes a run of spaces; the extra spaces are discarded later by `split()`.

    * Numbers:
    Example: "2024" → "    " (four spaces)

    * Punctuation marks:
    Commas, periods, exclamation marks, question marks, etc.
    Example: "Great service!" → "Great service "

    * Special characters:
    Symbols such as @, #, $, %, &, etc.
    Example: "Price: $10" → "Price     " (five spaces)

    * Whitespace (tabs, newlines):
    Tabs and newlines are not letters, so each one is replaced with a space. Existing spaces are left as they are.
    Example: "\tHello\nWorld!" → " Hello World "

    * Other characters:
    Non-English letters (e.g., accented letters), emojis, and any other symbols that don't match [a-zA-Z].
    Example: "Café ☕️" → "Caf" followed by spaces
    <br>

2. `review = review.lower()`

    This step normalizes the text so that words like "Great" and "great" are treated as the same token, regardless of capitalization. Punctuation has already been replaced with spaces in step 1.

    Example:
    * Before: "Great Service "
    * After: "great service " <br><br>


3. `review = review.split()`

    This breaks the text into a list of individual words so that each one can be filtered and stemmed separately. Called with no arguments, `split()` splits on any run of whitespace, so the extra spaces left by step 1 disappear.

    * Before: "great service at the restaurant"
    * After: ['great', 'service', 'at', 'the', 'restaurant']<br><br>

4. `ps = PorterStemmer()`

    PorterStemmer implements the Porter stemming algorithm, which strips common suffixes to reduce a word to its stem. Variations of the same word then map to a single token. The stem is not always a dictionary word, as "fli" below shows.

    Example:
    * Word: "running"
    * Stemmed Word: "run"
    * Word: "flies"
    * Stemmed Word: "fli"<br><br>

5. `all_stopwords = stopwords.words('english')`

    Stopwords are very common words that are usually filtered out in text processing because they carry little information on their own. They include articles, prepositions, conjunctions, and pronouns, such as "the," "is," "and," and "of."

    The list includes words like:

    * i, me, my, the, and, is, to, of <br><br>

6. `all_stopwords.remove('not')`

    NLTK's English stopword list includes "not". For sentiment analysis that is a problem, because "not" reverses the meaning of the words it modifies: "not good" has the opposite sentiment of "good." Removing "not" from the stopword list keeps it in the cleaned reviews.

    Example, using the sentence "The movie was not good." after stopword filtering:

    * With 'not' still in the stopword list: ['movie', 'good'], which reads as positive.

    * With 'not' removed from the stopword list: ['movie', 'not', 'good'], so the negation is preserved. <br><br>


7. `review = [ps.stem(word) for word in review if not word in set(all_stopwords)]`

    This list comprehension drops every word that is in the stopword list and stems the words that remain. The result keeps the informative words and merges different forms of the same word.

    * Start with a list of words

        - Example: `['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']`

    * Remove common words (stopwords)

        - Remove words like 'the' and 'over'.

        - What’s left: `['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']`

    * Turn words into their base forms (stemming)

        - 'quick' stays 'quick'
        - 'jumps' becomes 'jump'
        - 'lazy' becomes 'lazi'
        - Final list: `['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']`
        <br><br>

8. `review = ' '.join(review)`

    This joins the list of words back into a single space-separated string, which is the form CountVectorizer expects for each document.

    * words = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

    * sentence = 'the quick brown fox jumps over the lazy dog'
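
The eight steps above can be collected into one reusable function. The sketch below is only an illustration that uses the standard library: `clean_review`, `SAMPLE_STOPWORDS`, and the `stem` parameter are hypothetical names, the stopword set is a small hand-picked stand-in for NLTK's list (with "not" already left out), and stemming defaults to a no-op so the block runs without NLTK.

```python
import re

# Hypothetical stand-in for stopwords.words('english') with 'not' removed.
# The notebook uses the full NLTK list instead of this short sample.
SAMPLE_STOPWORDS = frozenset(
    {"the", "was", "is", "and", "a", "of", "to", "over", "i", "my", "am", "now"}
)

def clean_review(text, stopwords=SAMPLE_STOPWORDS, stem=lambda word: word):
    """Clean one review and return it as a space-separated string."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)   # step 1: non-letters -> spaces
    words = letters_only.lower().split()            # steps 2-3: lowercase, tokenize
    kept = [stem(w) for w in words if w not in stopwords]  # step 7: filter, then stem
    return " ".join(kept)                           # step 8: back to one string

print(clean_review("Crust is not good."))         # crust not good
print(clean_review("The movie was NOT good!!!"))  # movie not good
```

To reproduce the notebook's behavior, one could pass `stopwords` built from the NLTK list and `stem=PorterStemmer().stem`, assuming NLTK is installed. Building the stemmer and the stopword set once, outside the loop, also avoids recreating them for every review.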

In [20]:
for sentence in corpus[:20]:
    print(sentence)


wow love place
crust not good
not tasti textur nasti
stop late may bank holiday rick steve recommend love
select menu great price
get angri want damn pho
honeslti tast fresh
potato like rubber could tell made ahead time kept warmer
fri great
great touch
servic prompt
would not go back
cashier care ever say still end wayyy overpr
tri cape cod ravoli chicken cranberri mmmm
disgust pretti sure human hair
shock sign indic cash
highli recommend
waitress littl slow servic
place not worth time let alon vega
not like


## Creating the Bag of Words model

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

**max_features=1500**: This parameter caps the vocabulary at 1,500 tokens. CountVectorizer keeps the 1,500 tokens with the highest total count across the corpus and ignores the rest. If the corpus contains fewer distinct tokens than that, all of them are kept.
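
To make the idea concrete, here is a simplified standard-library sketch of a Bag of Words model with a vocabulary cap. It is not scikit-learn's implementation: the function name `bag_of_words` is hypothetical, tokens are split on whitespace only, and CountVectorizer defaults such as lowercasing and dropping one-character tokens are ignored.

```python
from collections import Counter

def bag_of_words(corpus, max_features=None):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    # Count how often each token occurs across the whole corpus.
    totals = Counter(token for doc in corpus for token in doc.split())
    # Keep only the most frequent tokens; None keeps every token.
    kept = [token for token, _ in totals.most_common(max_features)]
    vocabulary = sorted(kept)
    column = {token: i for i, token in enumerate(vocabulary)}

    matrix = []
    for doc in corpus:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in column:          # tokens outside the vocabulary are ignored
                row[column[token]] += 1
        matrix.append(row)
    return vocabulary, matrix

sample = ["wow love place", "crust not good", "not like place"]
vocab, X_small = bag_of_words(sample, max_features=2)
print(vocab)    # ['not', 'place']
print(X_small)  # [[0, 1], [1, 0], [1, 1]]
```

Each row is one review and each column is one vocabulary token, which is the same layout as the `X` array produced in the cell above.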

In [13]:
X[:10]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [14]:
y[:10]

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1])

## Splitting the dataset into the Training set and Test set

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

## Training the Naive Bayes model on the Training set

In [16]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

## Predicting the Test set results

In [17]:
y_pred = classifier.predict(X_test)
# Print predictions next to the true labels: column 0 is predicted, column 1 is actual
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[1 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [0 0]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [0 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]

## Making the Confusion Matrix

In [18]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[55 42]
 [12 91]]


0.73
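
The accuracy can be recomputed by hand from the confusion matrix. The sketch below copies the printed matrix into a plain list and relies on scikit-learn's layout, where rows are true labels and columns are predicted labels in sorted class order (0, then 1). Precision and recall are added here for illustration and are not computed in the notebook.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = [[55, 42],
      [12, 91]]

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all test reviews classified correctly
precision = tp / (tp + fp)      # share of predicted positives that are truly positive
recall = tp / (tp + fn)         # share of true positives that were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.73 precision=0.68 recall=0.88
```

The matrix also shows where the errors fall: 42 of the 97 negative reviews in the test set are predicted as positive, while only 12 of the 103 positive reviews are predicted as negative.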

END