<a href="https://colab.research.google.com/github/kilos11/Data_Science/blob/main/Text_Classification(Na%C3%AFve_Bayes)_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification with Naïve Bayes

In [1]:

%pylab inline
# IPython magic command: loads numpy and matplotlib into the namespace and renders plots inline in the notebook

import IPython
import sklearn as sk
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_20newsgroups
# This import statement allows us to fetch the 20 Newsgroups dataset, a popular dataset for text classification tasks

from sklearn.model_selection import cross_val_score, KFold
# These imports provide functions for cross-validation and splitting data into folds

from scipy.stats import sem
# This import provides a function for calculating the standard error of a distribution

from sklearn.naive_bayes import MultinomialNB
# This import provides the Multinomial Naive Bayes classifier, commonly used for text classification tasks

from sklearn.pipeline import Pipeline
# This import provides a class for defining a pipeline of data processing steps and machine learning models

from sklearn import metrics
# This import provides various metrics for evaluating the performance of machine learning models

from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
# These imports provide classes for transforming text data into numerical features using various techniques (TF-IDF, hashing, etc.)

Populating the interactive namespace from numpy and matplotlib


In [5]:

# Fetch the 20 Newsgroups dataset
# The 'subset' parameter is set to 'all' to fetch all the documents from the dataset
news = fetch_20newsgroups(subset='all')

In [6]:

# Retrieve the keys of the 'news' dictionary
news.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [7]:

# Print the types of the 'data', 'target', and 'target_names' attributes of the 'news' object
print(type(news.data), type(news.target), type(news.target_names))

<class 'list'> <class 'numpy.ndarray'> <class 'list'>


In [8]:
print(news.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [9]:
print(len(news.data))

18846


In [10]:
print(len(news.target))

18846


If you look at, say, the first instance, you will see the content of a newsgroup message, and you can get its corresponding category:

In [11]:
print(news.data[0])
print(news.target[0], news.target_names[news.target[0]])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


10 rec.sport.hockey


Let's build the training and testing datasets:

In [12]:

SPLIT_PERC = 0.75
# Set the split percentage to 0.75, indicating that 75% of the data will be used for training

split_size = int(len(news.data)*SPLIT_PERC)
# Calculate the split size, which is the total length of the data multiplied by the split percentage
# The split size represents the index at which the data will be split into training and testing sets

X_train = news.data[:split_size]
# Assign the first 'split_size' elements of the 'news.data' list to 'X_train'
# This forms the training data, which will be used to train a machine learning model

X_test = news.data[split_size:]
# Assign the elements from 'split_size' onwards of the 'news.data' list to 'X_test'
# This forms the testing data, which will be used to evaluate the performance of the trained model

y_train = news.target[:split_size]
# Assign the first 'split_size' elements of the 'news.target' array to 'y_train'
# This forms the corresponding target labels for the training data

y_test = news.target[split_size:]
# Assign the elements from 'split_size' onwards of the 'news.target' array to 'y_test'
# This forms the corresponding target labels for the testing data
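
As a quick check of the arithmetic, the sketch below reproduces the split sizes for the 18,846 documents reported earlier. It uses a stand-in list rather than the real dataset, and the names `n_documents` and `documents` are illustrative only.

```python
# Stand-in for news.data: only the length matters for this check.
n_documents = 18846
documents = list(range(n_documents))

SPLIT_PERC = 0.75
# int() truncates 14134.5 down to 14134.
split_size = int(len(documents) * SPLIT_PERC)

train = documents[:split_size]
test = documents[split_size:]

print(split_size, len(train), len(test))  # 14134 14134 4712
```

A positional split like this is only sensible when the documents are in random order; `fetch_20newsgroups` shuffles them by default (`shuffle=True`).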

The following function performs and evaluates a cross-validation:

In [13]:

def evaluate_cross_validation(clf, X, y, K):
    # Define a function to evaluate a classifier using cross-validation

    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # Create a KFold object to generate indices for K-fold cross-validation
    # 'n_splits=K' is the number of folds
    # 'shuffle=True' shuffles the indices before splitting into folds
    # 'random_state=0' sets the random seed for reproducibility

    scores = cross_val_score(clf, X, y, cv=cv)
    # Perform cross-validation for the classifier 'clf' using the generated indices
    # 'X' is the input data
    # 'y' is the target labels
    # 'cv' is the cross-validation object

    print(scores)
    # Print the individual scores obtained from each fold of cross-validation

    print(f"Mean score: {np.mean(scores):.3f} (+/-{sem(scores):.3f})")
    # Print the mean score and the standard error of the mean across all folds,
    # both rounded to three decimal places
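
To make the idea of shuffled K-fold splitting concrete, here is a simplified pure-Python sketch. It is not scikit-learn's implementation, and `k_fold_indices` is a hypothetical helper written only for illustration. It shows that every sample lands in exactly one test fold.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled K-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    # Each fold holds out 2 samples and trains on the remaining 8.
    print(sorted(test_idx), len(train_idx))
```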

In [16]:

# Create three different classifier pipelines

clf_1 = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB()),])
# Pipeline 1:
# - Uses the CountVectorizer to convert text documents into a matrix of token counts
# - Applies the MultinomialNB classifier for classification

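clf_2 = Pipeline([('vect', HashingVectorizer(alternate_sign=False)), ('clf', MultinomialNB())])
# Pipeline 2:
# - Uses the HashingVectorizer to hash tokens into a fixed-size sparse matrix of (normalized) token occurrences
# - 'alternate_sign=False' keeps all feature values non-negative, which MultinomialNB requires
# - Applies the MultinomialNB classifier for classification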

clf_3 = Pipeline([('vect', TfidfVectorizer()), ('clf', MultinomialNB()),])
# Pipeline 3:
# - Uses the TfidfVectorizer to convert text documents into a matrix of TF-IDF features
# - Applies the MultinomialNB classifier for classification
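
The difference between raw counts and TF-IDF weights can be sketched by hand on a toy corpus. The example assumes the smoothed IDF formula that scikit-learn documents as its default (`smooth_idf=True`, `norm='l2'`). It uses plain whitespace tokenization as a simplification, and the three sentences are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the puck hit the post",
    "the goalie stopped the puck",
    "the rocket reached orbit",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    counts = Counter(tokens)  # raw term counts (the count-vectorizer view)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalized row


print(Counter(tokenized[0]))
print({t: round(w, 3) for t, w in tfidf(tokenized[0]).items()})
```

Terms that occur in every document, such as "the", receive the lowest IDF, while terms confined to a single document, such as "post", are weighted up.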

In [None]:

clfs = [clf_1, clf_2, clf_3]
# Create a list 'clfs' containing the three classifier pipelines 'clf_1', 'clf_2', and 'clf_3'

for clf in clfs:
    # Iterate over each classifier pipeline in the 'clfs' list

    evaluate_cross_validation(clf, news.data, news.target, 5)
    # Call the 'evaluate_cross_validation' function to evaluate the classifier
    # Pass the current classifier pipeline 'clf', the input data 'news.data', the target labels 'news.target',
    # and the number of folds '5' as arguments

We will keep the TF-IDF vectorizer but use a different regular expression to perform tokenization. The default regular expression, r"\b\w\w+\b", considers alphanumeric characters and the underscore. Also considering the hyphen and the dot might improve the tokenization, so that strings such as Wi-Fi and site.com are kept as single tokens. The new regular expression could be r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b". If you have questions about how to define regular expressions, please refer to the Python re module documentation. Let's try our new classifier:

In [23]:
clf_4 = Pipeline([('vect', TfidfVectorizer(token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")),
                  ('clf', MultinomialNB())])
# Pipeline 4:
# - Uses the TfidfVectorizer to convert text documents into a matrix of TF-IDF features
# - Applies the MultinomialNB classifier for classification

# TfidfVectorizer parameters:
# - token_pattern: Specifies the regular expression pattern used to tokenize the input text.
#   In this case, the pattern \b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b matches tokens of at least three characters
#   made of lowercase letters, digits, underscores, hyphens, and dots, ensuring that the token contains
#   at least one letter.

# MultinomialNB classifier:
# - A Naive Bayes classifier that is commonly used for text classification tasks.
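
The effect of the custom token pattern can be checked in isolation with the standard `re` module. This is a small sketch on a made-up sentence. The input is already lowercase, mirroring the lowercasing that the vectorizer applies by default before tokenizing.

```python
import re

default_pattern = re.compile(r"\b\w\w+\b")
custom_pattern = re.compile(r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")

text = "connect to wi-fi at site.com"

default_tokens = default_pattern.findall(text)
custom_tokens = custom_pattern.findall(text)

print(default_tokens)  # ['connect', 'to', 'wi', 'fi', 'at', 'site', 'com']
print(custom_tokens)   # ['connect', 'wi-fi', 'site.com']
```

The custom pattern requires at least three characters, so two-letter tokens such as "to" and "at" are dropped as a side effect.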

In [None]:

evaluate_cross_validation(clf_4, news.data, news.target, 5)
# Call the 'evaluate_cross_validation' function to evaluate the classifier pipeline 'clf_4'
# Pass the classifier pipeline 'clf_4', the input data 'news.data', the target labels 'news.target',
# and the number of folds '5' as arguments

[ 0.86100796  0.8718493   0.86203237  0.87291059  0.8588485 ]
Mean score: 0.865 (+/-0.003)
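
The summary line above can be reproduced from the per-fold scores using only the standard library. The sketch assumes the standard error is computed as `scipy.stats.sem` does by default: the sample standard deviation (n - 1 denominator) divided by the square root of the number of folds.

```python
import math
import statistics

# Per-fold accuracies copied from the output above.
scores = [0.86100796, 0.8718493, 0.86203237, 0.87291059, 0.8588485]

mean_score = statistics.mean(scores)
# Standard error of the mean: sample standard deviation / sqrt(number of folds).
std_error = statistics.stdev(scores) / math.sqrt(len(scores))

print(f"Mean score: {mean_score:.3f} (+/-{std_error:.3f})")
# Mean score: 0.865 (+/-0.003)
```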


In [None]:

def get_stop_words():
    # Define a function named 'get_stop_words' that retrieves a set of stop words

    result = set()
    # Create an empty set named 'result' to store the stop words

    with open('data/stopwords_en.txt', 'r') as stop_words_file:
        # Open 'stopwords_en.txt'; the 'with' block closes the file automatically

        for line in stop_words_file:
            # Iterate over each line in the file

            result.add(line.strip())
            # Remove leading and trailing whitespace from each line and add it to the 'result' set

    return result
    # Return the set of stop words stored in 'result'

In [None]:
stop_words = get_stop_words()
print(stop_words)

set(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 

In [None]:

clf_5 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=list(stop_words),  # recent scikit-learn versions expect a list here
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])
# Pipeline 5:
# - Uses the TfidfVectorizer to convert text documents into a matrix of TF-IDF features
# - Specifies additional parameters for the TfidfVectorizer:
#     - stop_words: Specifies a set of stop words to be excluded from the tokenization process.
#       This variable 'stop_words' should be defined prior to this pipeline.
#     - token_pattern: Specifies the regular expression pattern used to tokenize the input text.
#       In this case, the pattern \b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b matches lowercase alphabets, numbers, underscores,
#       hyphens, and dots, ensuring that the token has at least one alphabet character.
#       This pattern is used to identify valid tokens in the text.

# MultinomialNB classifier:
# - A Naive Bayes classifier that is commonly used for text classification tasks.

In [None]:
evaluate_cross_validation(clf_5, news.data, news.target, 5)

[ 0.88116711  0.89519767  0.88325816  0.89227912  0.88113558]
Mean score: 0.887 (+/-0.003)


Try to improve by adjusting the alpha parameter on the MultinomialNB classifier:

In [None]:
clf_7 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=list(stop_words),  # recent scikit-learn versions expect a list here
                token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])

In [None]:
evaluate_cross_validation(clf_7, news.data, news.target, 5)

[ 0.9204244   0.91960732  0.91828071  0.92677103  0.91854603]
Mean score: 0.921 (+/-0.002)


The mean cross-validation score rose from 0.887 to 0.921, a clear improvement. At this point, we could continue experimenting with different values of alpha or with further changes to the vectorizer. In Chapter 4, Advanced Features, we will show you practical utilities to try many different configurations and keep the best one.
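
To get a feel for what alpha controls, the following sketch applies additive (Lidstone) smoothing, the kind of smoothing that MultinomialNB's `alpha` parameter governs, to a single per-class word probability. The counts and vocabulary size are hypothetical, and `smoothed_prob` is an illustrative helper, not a scikit-learn function.

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Additive (Lidstone) smoothing of a per-class word probability."""
    return (count + alpha) / (total + alpha * vocab_size)


# Hypothetical class: 1,000 word occurrences over a 10,000-word vocabulary.
total, vocab_size = 1000, 10000

for alpha in (1.0, 0.01):
    seen = smoothed_prob(count=50, total=total, vocab_size=vocab_size, alpha=alpha)
    unseen = smoothed_prob(count=0, total=total, vocab_size=vocab_size, alpha=alpha)
    print(f"alpha={alpha}: seen={seen:.5f} unseen={unseen:.7f} "
          f"ratio={seen / unseen:.0f}")
```

With a large vocabulary, alpha=1.0 spreads a lot of probability mass over words the class has never seen. A smaller alpha keeps the estimates closer to the observed frequencies, which is one plausible reason the score improves here.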

If we decide that we have made enough improvements in our model, we are ready to evaluate its performance on the testing set.

In [None]:
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):

    clf.fit(X_train, y_train)

    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))

    y_pred = clf.predict(X_test)

    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

In [None]:
train_and_evaluate(clf_7, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.996957690675
Accuracy on testing set:
0.917869269949
Classification Report:
             precision    recall  f1-score   support

          0       0.95      0.88      0.91       216
          1       0.85      0.85      0.85       246
          2       0.91      0.84      0.87       274
          3       0.81      0.86      0.83       235
          4       0.88      0.90      0.89       231
          5       0.89      0.91      0.90       225
          6       0.88      0.80      0.84       248
          7       0.92      0.93      0.93       275
          8       0.96      0.98      0.97       226
          9       0.97      0.94      0.96       250
         10       0.97      1.00      0.98       257
         11       0.97      0.97      0.97       261
         12       0.90      0.91      0.91       216
         13       0.94      0.95      0.95       257
         14       0.94      0.97      0.95       246
         15       0.90      0.96      0.93     

As we can see, we obtained very good results and, as we would expect, the accuracy on the training set (0.997) is considerably higher than on the testing set (0.918). On new, unseen instances, we may therefore expect an accuracy of around 0.92.
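
The per-class columns of the classification report follow the usual definitions of precision, recall, and F1. The sketch below computes them from hypothetical counts for a single class; the numbers are not taken from the report above.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for one class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.90 recall=0.75 f1=0.82
```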

If we look inside the vectorizer, we can see which tokens have been used to create our dictionary:

In [None]:
clf_7.named_steps['vect'].get_feature_names_out()

[u'0-.66d8wt',
 u'0-04g55',
 u'0-100mph',
 u'0-13-117441-x--or',
 u'0-3mb',
 u'0-40mb',
 u'0-40volts',
 u'0-5mb',
 u'0-60mph',
 u'0-8.3mb',
 u'0-a00138',
 u'0-byte',
 u'0-defects',
 u'0-e8',
 u'0-for-4',
 u'0-hc',
 u'0-ii',
 u'0-uw',
 u'0-uw0',
 u'0-uw2',
 u'0-uwa',
 u'0-uwt',
 u'0-uwt7',
 u'0-uww',
 u'0-uww7',
 u'0.-w0',
 u'0..x-1',
 u'0.00...nice',
 u'0.02cents',
 u'0.0cb',
 u'0.1-ports',
 u'0.15mb',
 u'0.2d-_',
 u'0.5db',
 u'0.6-micron',
 u'0.65mb',
 u'0.97pl4',
 u'0.b34s_',
 u'0.c0rgo5kj7pp0',
 u'0.c4',
 u'0.jy',
 u'0.s_',
 u'0.tprv6ekj7r',
 u'0.tt',
 u'0.txa_',
 u'0.txc',
 u'0.vpp',
 u'0.vpsll2',
 u'00-index.txt',
 u'000-foot',
 u'000-kg',
 u'000-man',
 u'000-maxwell',
 u'000-strong',
 u'000000.active.spx',
 u'000062david42',
 u'000100255pixel',
 u'0005111312na1em',
 u'0005111312na3em',
 u'000hz',
 u'000iu',
 u'000mg',
 u'000mi',
 u'000miles',
 u'000puq9',
 u'000rpm',
 u'000th',
 u'000ug',
 u'000usd',
 u'0010580b.0b6r49',
 u'0010580b.vma7o9',
 u'0010580b.vmcbrt',
 u'001200201pixel

In [None]:
print(len(clf_7.named_steps['vect'].get_feature_names_out()))

145771
