
# Text Classification

## Problem description
In this task, we categorize documents by assigning each one to a newsgroup. We use the 20newsgroups dataset from sklearn, a collection of messages posted to 20 newsgroups such as 'sci.electronics', 'sci.crypt', 'comp.os.ms-windows.misc' and 'rec.motorcycles'. This notebook works with these four groups only.

We use simple word-count features, either term frequency (tf) or tf-idf, and **Multinomial Naïve Bayes** as the classification method.
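To make the difference between tf and tf-idf concrete, here is a minimal sketch that computes tf-idf weights by hand on a toy corpus. The corpus and the helper names (`idf`, `tfidf_vector`) are made up for illustration. The smoothed idf formula and the L2 normalisation follow the defaults of scikit-learn's `TfidfVectorizer`; the whitespace tokenisation is a simplification.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the engine of the motorcycle",
    "the cipher key",
    "the motorcycle key",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed idf: idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc):
    tf = Counter(doc)  # raw term counts (tf)
    weights = {t: c * idf(t) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalised

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term:12s} {weight:.3f}")
```

A term that appears in every document ("the") gets the minimum idf of 1, while rarer terms are weighted up. On real text the numbers will differ from `TfidfVectorizer`, which by default also lowercases and drops single-character tokens.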

We can formulate a text classification problem in the following steps.


*   The 20newsgroups dataset is split into two subsets: one for training (development) and one for testing (performance evaluation). First, we load the 'train' subset as training data.
*   We convert the training data into tf-idf or tf vectors.
*   Next, we train a Multinomial Naïve Bayes classifier on these vectors.
*   We load the test subset, convert it into tf-idf features with the same fitted vectorizer, and classify it.
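The steps above can be sketched end to end on a tiny corpus. This is only an illustration under stated assumptions: the texts, labels and `label_names` are made up so the example runs without downloading the dataset, and `make_pipeline` is used for brevity rather than because the notebook uses it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus standing in for the 20newsgroups posts.
train_texts = [
    "my motorcycle engine needs new oil",
    "riding the bike on the highway",
    "encryption key and cipher algorithm",
    "the clipper chip encryption debate",
]
train_labels = [0, 0, 1, 1]              # 0 = motorcycles, 1 = crypt
label_names = ["motorcycles", "crypt"]   # hypothetical names

# Steps 2 and 3: vectorise the training texts, then fit the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Step 4: unseen text goes through the same fitted vectoriser.
test_texts = ["which cipher uses a longer key", "oil for my bike engine"]
predicted = model.predict(test_texts)
predicted_names = [label_names[i] for i in predicted]
print(predicted_names)
```

Wrapping both stages in one pipeline guarantees that test documents are transformed with the vocabulary learned from the training set, which is the same thing the notebook does by calling `fit_transform` on the training data and `transform` on the test data.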


## Implementation and Results

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Get the training dataset for the specified categories
categories = ['sci.electronics', 'sci.crypt',
              'comp.os.ms-windows.misc', 'rec.motorcycles']
training_data = fetch_20newsgroups(subset='train', categories=categories)

In [None]:
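# Create the tf-idf vectorizer and fit it on the training documents
tfidf = TfidfVectorizer(use_idf=True)
# Alternative: normalized term frequencies only, without idf weighting
# tfidf = TfidfVectorizer(use_idf=False)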
training_tfidf = tfidf.fit_transform(training_data.data)
print(training_tfidf.shape)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(training_tfidf, training_data.target)

(2375, 64073)
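What the classifier does with such a document-term matrix can be sketched in a few lines. The following is a simplified illustration of Multinomial Naive Bayes scoring with Laplace smoothing (`alpha = 1`, the `MultinomialNB` default), not scikit-learn's implementation. The vocabulary, counts and priors are invented for the example.

```python
import math

# Hypothetical per-class term counts over a 3-word vocabulary.
vocab = ["key", "engine", "the"]
class_counts = {
    "crypt":       [8, 0, 12],
    "motorcycles": [1, 6, 13],
}
class_prior = {"crypt": 0.5, "motorcycles": 0.5}
alpha = 1.0  # Laplace smoothing

def log_likelihoods(counts):
    # P(w | c) = (count(w, c) + alpha) / (total(c) + alpha * |V|)
    total = sum(counts) + alpha * len(counts)
    return [math.log((c + alpha) / total) for c in counts]

def score(doc_counts, label):
    # log P(c) + sum over words of x_w * log P(w | c)
    ll = log_likelihoods(class_counts[label])
    return math.log(class_prior[label]) + sum(
        x * l for x, l in zip(doc_counts, ll)
    )

doc = [2, 0, 1]  # "key" twice, "the" once
scores = {label: score(doc, label) for label in class_counts}
best = max(scores, key=scores.get)
print(best, scores)
```

The smoothing term keeps a word that never occurred in a class (here "engine" in "crypt") from forcing the probability to zero. With tf-idf input, the counts `x_w` are replaced by fractional weights and the same formula applies.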


In [None]:
from sklearn import metrics

testing_data = fetch_20newsgroups(subset='test', categories=categories)
testing_tfidf = tfidf.transform(testing_data.data)
predictions = classifier.predict(testing_tfidf)
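Note that `predict` returns integer class indices, not names. Those indices refer to `testing_data.target_names`, which `fetch_20newsgroups` keeps in alphabetical order regardless of the order of the `categories` argument. A minimal sketch of the mapping, using plain Python and a made-up prediction list:

```python
categories = ['sci.electronics', 'sci.crypt',
              'comp.os.ms-windows.misc', 'rec.motorcycles']

# The dataset loader keeps class names in alphabetical order, so index 0
# is not categories[0]. sorted() reproduces that order here.
target_names = sorted(categories)

# Hypothetical output of classifier.predict(...), for illustration only.
predictions = [3, 0, 2, 2, 1]
predicted_names = [target_names[i] for i in predictions]

for idx, name in enumerate(target_names):
    print(idx, name)
print(predicted_names)
```

For this reason, any code that turns indices into labels should index into `target_names` rather than into the `categories` list.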

In [None]:
# Indices of the test posts whose predicted class differs from the true class
errors = [i for i in range(len(predictions)) if predictions[i] != testing_data.target[i]]

# Show the first five misclassified posts as "true class --> predicted class"
for post_id in errors[:5]:
  print("------------------------------------------------------------------")
  print("%s --> %s\n" % (testing_data.target_names[testing_data.target[post_id]],
                         testing_data.target_names[predictions[post_id]]))
  print(testing_data.data[post_id])

------------------------------------------------------------------
sci.electronics --> rec.motorcycles

From: jeffj@cbnewsm.cb.att.com (jeffrey.n.jones)
Subject: SPICE for XT with no co-processer?
Organization: AT&T
Distribution: usa
Lines: 10

I want to run SPICE on my XT so I can learn more about amplifiers
and oscilators. Is there a version of this that will run on my XT
with no math co-processer, if so where can I get it? Thanks for any
and all help!

Jeff
-- 
 Jeff Jones  AB6MB         |  OPPOSE THE NORTH AMERICAN FREE TRADE AGREEMENT!
 jeffj@seeker.mystic.com   |  Canada/USA Free Trade cost Canada 400,000 jobs. 
 Infolinc BBS 415-778-5929 |  Want to guess how many we'll lose to Mexico?

------------------------------------------------------------------
sci.electronics --> sci.crypt

From: me170pjd@emba-news.uvm.edu.UUCP (Peter J Demko)
Subject: Re: PC parallel I (!= I/O)
Originator: me170pjd@freehold.emba.uvm.edu
Organization: University of Vermont -- Division of EMBA Computer Fa
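Reading individual posts is useful, but a tally of (true class, predicted class) pairs shows where the errors concentrate. The sketch below uses made-up label lists and a hypothetical helper, `confusion_counts`; with the notebook's variables one would pass `testing_data.target`, `predictions` and `testing_data.target_names` instead.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, names):
    """Count how often each true class is predicted as each class."""
    pairs = Counter(zip(y_true, y_pred))
    return {(names[t], names[p]): n for (t, p), n in pairs.items()}

names = ["comp.os.ms-windows.misc", "rec.motorcycles",
         "sci.crypt", "sci.electronics"]
# Hypothetical true and predicted class indices.
y_true = [3, 3, 3, 2, 1, 0, 3]
y_pred = [2, 3, 2, 2, 1, 0, 1]

counts = confusion_counts(y_true, y_pred, names)
# Keep only the off-diagonal entries, i.e. the misclassifications.
mistakes = {k: v for k, v in counts.items() if k[0] != k[1]}
for (true_name, pred_name), n in sorted(mistakes.items()):
    print(f"{true_name} --> {pred_name}: {n}")
```

scikit-learn offers the same information as a matrix through `metrics.confusion_matrix`.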

In [None]:
# Row labels must follow the integer class order, i.e. testing_data.target_names
# (sorted alphabetically), not the order of the `categories` list.
print(metrics.classification_report(testing_data.target, predictions,
                                    target_names=testing_data.target_names))

                         precision    recall  f1-score   support

comp.os.ms-windows.misc       0.98      0.83      0.90       394
        rec.motorcycles       0.95      0.97      0.96       398
              sci.crypt       0.69      0.98      0.81       396
        sci.electronics       0.98      0.69      0.81       393

               accuracy                           0.87      1581
              macro avg       0.90      0.87      0.87      1581
           weighted avg       0.90      0.87      0.87      1581



## Discussions
In this task, we classified messages into four groups. We took the 20newsgroups dataset from sklearn, represented the text with simple word-count features (tf and tf-idf), and used Multinomial Naïve Bayes for classification.

From the metrics above, the accuracy of the model on the 1581 test messages is approximately 87%, which is reasonably good for such a simple model. The four classes are 'comp.os.ms-windows.misc', 'rec.motorcycles', 'sci.crypt' and 'sci.electronics'. The rows of the report follow `testing_data.target_names`, which `fetch_20newsgroups` returns in alphabetical order rather than in the order of the `categories` list, so the row labels must be taken from `target_names`.

For comp.os.ms-windows.misc, the precision is 0.98, the recall is 0.83 and the f1-score is 0.90. This means that 98% of the messages predicted as comp.os.ms-windows.misc really belong to that group, while the remaining 2% belong to a different group. Similarly, 83% of the messages from comp.os.ms-windows.misc were predicted as comp.os.ms-windows.misc, and the rest were assigned to other groups.

For rec.motorcycles, 95% of the messages predicted as rec.motorcycles were correct and the remaining 5% were false predictions. 97% of the messages from rec.motorcycles were predicted as rec.motorcycles, and the rest were assigned to other groups.

For sci.crypt, only 69% of the messages predicted as sci.crypt were correct, which is quite low: the classifier over-predicts this group. On the other hand, 98% of the messages from sci.crypt were predicted as sci.crypt.

For sci.electronics, 98% of the messages predicted as sci.electronics were correct, but only 69% of the messages from sci.electronics were predicted as sci.electronics. Nearly a third were assigned to other groups, and the misclassified posts printed above are of this kind. Taken together with the low precision of sci.crypt, this suggests that many of these messages were assigned to sci.crypt.

Looking at the f1-score, rec.motorcycles has the highest value, 0.96, so classification performs best for that group. sci.crypt and sci.electronics have the lowest value, 0.81.
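As a reminder of how these metrics relate, here is a short sketch. The functions are plain Python written for this illustration; the counts passed to `precision_recall` are hypothetical, while the precision and recall pairs fed to `f1` are the rounded values from the report.

```python
def precision_recall(tp, fp, fn):
    # precision: share of predicted positives that are correct
    # recall:    share of actual positives that are found
    return tp / (tp + fp), tp / (tp + fn)

def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one class.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f1(p, r), 2))

# Consistency check against the (precision, recall) pairs in the report.
for prec, rec in [(0.98, 0.83), (0.95, 0.97), (0.69, 0.98), (0.98, 0.69)]:
    print(prec, rec, round(f1(prec, rec), 2))
```

Because the f1-score is a harmonic mean, it is pulled toward the lower of the two values, which is why a class with 0.98 on one metric and 0.69 on the other ends up at 0.81.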