In this practical exercise, we will build a Naive Bayes model that classifies news articles into categories. The dataset contains 20 different categories, of which we will use 4.

We are new to text processing, which falls under Natural Language Processing (NLP). Before the text can be fed to the Naive Bayes classifier, it must be converted into numerical vectors; we will use a TF-IDF vectorizer for this. The TF-IDF vectorizer is described later in this notebook.

Your main task in this notebook is to choose the appropriate type of Naive Bayes classifier (from the three types we learned about) and evaluate it properly. The remaining cells are already provided; all you need to do is fill in the missing pieces marked \<write your code here\>.


The dataset is described at the link below. We will use the copy provided by the Scikit-Learn library, so there is no need to download it manually.

https://www.kaggle.com/datasets/crawford/20-newsgroups/data




### Importing libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Text classification with Naive Bayes Classifier

Naive Bayes classifiers are commonly used for text classification tasks such as spam detection.



In [2]:
# Data Loading
from sklearn.datasets import fetch_20newsgroups

# Preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# Model/estimator
# !!!!!! <write your code here>

# Model evaluation
from sklearn.metrics import confusion_matrix

# Plotting library
import matplotlib.pyplot as plt

## Dataset

We will use the 20 Newsgroups dataset for classification.

As a first step, let's download the 20 Newsgroups dataset with the `fetch_20newsgroups` API.

In [3]:
data = fetch_20newsgroups()

In [4]:
data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

There are **20 categories** in the dataset. For simplicity, we will select 4 of these categories and download training and test sets.

In [5]:
categories = ['talk.religion.misc', 'soc.religion.christian',
             'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [6]:
# let's look at a sample training document.
print(train.data[5])

From: dmcgee@uluhe.soest.hawaii.edu (Don McGee)
Subject: Federal Hearing
Originator: dmcgee@uluhe
Organization: School of Ocean and Earth Science and Technology
Distribution: usa
Lines: 10


Fact or rumor....?  Madalyn Murray O'Hare an atheist who eliminated the
use of the bible reading and prayer in public schools 15 years ago is now
going to appear before the FCC with a petition to stop the reading of the
Gospel on the airways of America.  And she is also campaigning to remove
Christmas programs, songs, etc from the public schools.  If it is true
then mail to Federal Communications Commission 1919 H Street Washington DC
20054 expressing your opposition to her request.  Reference Petition number

2493.



## Data preprocessing and modelling

`TfidfVectorizer` is a Scikit-Learn API that converts text input into a vector of numerical values.

- We will use `TfidfVectorizer` as a preprocessing step to obtain a feature vector for each text document.
- If you are interested in Natural Language Processing, you can read more about `TfidfVectorizer` in the following blog post: https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a
- The documentation of `TfidfVectorizer` is available on the Scikit-Learn website: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
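
To make the idea of a TF-IDF feature vector more concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which corresponds to the default `smooth_idf=True` and `norm='l2'` settings of `TfidfVectorizer`. The function name `tfidf_vectors`, the whitespace tokenisation, and the three toy sentences are illustrative assumptions; the real vectorizer tokenises differently, so its numbers will not match on arbitrary text.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalisation."""
    # Simplified tokenisation (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts within this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])  # scale to unit length
    return vocab, vectors

# Hypothetical toy corpus, not taken from the 20 Newsgroups data.
docs = ["the rocket reached orbit",
        "the screen shows the image",
        "the rocket image"]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, a term that appears in every document ("the") receives a lower weight than a term that appears in only one document ("orbit"), which is the behaviour TF-IDF is designed to produce.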

In [9]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB


model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [10]:
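# Standalone vectorization, shown for illustration only;
# the pipeline above fits its own TfidfVectorizer internally.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train.data)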

Here you need to create and train the Naive Bayes model. Make sure to select the appropriate variant.

In [11]:
# !!!!!! <write your code here>
# !!!!!! <write your code here>
model.fit(train.data, train.target)
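
For intuition about what a multinomial Naive Bayes model estimates during `fit`, the following is a rough pure-Python sketch on word counts. It assumes Laplace smoothing with `alpha=1.0` (the default smoothing value of `MultinomialNB`); the helper names `train_multinomial_nb` and `predict`, and the four toy documents, are hypothetical and are not part of Scikit-Learn or of this notebook's data.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_docs = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # word counts per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Laplace smoothing: add alpha to every word count in the vocabulary.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                      for w in vocab}
    return log_prior, log_lik

def predict(text, log_prior, log_lik):
    """Return the class with the highest log score; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in text.split() if w in log_lik[c])
    return max(scores, key=scores.get)

# Hypothetical toy training set.
toy_docs = ["rocket orbit launch", "orbit satellite launch",
            "pixel screen image", "image render pixel"]
toy_labels = ["space", "space", "graphics", "graphics"]
log_prior, log_lik = train_multinomial_nb(toy_docs, toy_labels)

print(predict("satellite launch", log_prior, log_lik))   # space
print(predict("render the image", log_prior, log_lik))   # graphics
```

The sketch works on integer counts for readability. With TF-IDF features the same formula is applied to fractional weights instead of counts.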

## Model evaluation

Let's first predict the labels for the test set and then calculate the confusion matrix for the test set.

In [14]:
# !!!!!! <write your code here>
# !!!!!! <write your code here>
# !!!!!! <write your code here>
pred = model.predict(test.data)
conf_matrix = confusion_matrix(test.target, pred)
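
As a reading aid for the confusion matrix, here is a small sketch of how such a matrix is tallied. It follows the Scikit-Learn convention that rows correspond to true labels and columns to predicted labels. The function `confusion_counts` and the label lists are made up for illustration and do not come from the test set.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels for four classes (0..3).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 3, 2, 2, 3, 1]

toy_matrix = confusion_counts(y_true, y_pred, n_classes=4)
for row in toy_matrix:
    print(row)
```

Off-diagonal entries show which classes are confused with each other: here one sample of class 1 is predicted as class 3, and one sample of class 3 is predicted as class 1.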

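Now we have a tool to classify statements into one of these four classes.
> Make use of the `predict` function on the model. If you call the classifier directly, the input must first be transformed with the fitted TF-IDF vectorizer; the `model` pipeline defined above performs this step automatically.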

In [15]:
def predict_category(s, train=train, model=model):
    # !!!!!! <write your code here>
    # !!!!!! <write your code here>
    pred = model.predict([s])
    return train.target_names[pred[0]]

#### Compute the model accuracy

In [17]:
# !!!!!! <write your code here>
# !!!!!! <write your code here>
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test.target, pred)
print(f"Accuracy= {accuracy:.4f}")


Accuracy= 0.8017
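
Accuracy can also be read directly from a confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of samples. The sketch below illustrates this, together with per-class recall, on a hypothetical 4x4 matrix; the helper names and the numbers are assumptions for illustration and are unrelated to the accuracy reported above.

```python
def accuracy_from_matrix(matrix):
    """Fraction of samples on the diagonal (true class == predicted class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall_per_class(matrix):
    """For each true class (row), the share of its samples predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
toy_matrix = [[2, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 2, 0],
              [0, 1, 0, 1]]

print(accuracy_from_matrix(toy_matrix))  # 0.75
print(recall_per_class(toy_matrix))
```

Per-class recall is a useful complement to overall accuracy, because a single accuracy figure can hide a class that is frequently misclassified.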


#### Let's run some tests on new samples.

In [18]:
predict_category('sending a payload to the ISS')

'sci.space'

In [19]:
predict_category('discussing islam vs atheism')

'soc.religion.christian'

In [20]:
predict_category('determining the screen resolution')

'comp.graphics'