# Lab 6. Natural Language Processing
# Task 6.2 Text Classification
## Problem Descriptions
This task builds a text classifier that assigns each newsgroup message to one of several predefined topic categories. The problem can be formulated as follows:
1. Collection of texts with class labels:
 * Choose rec.motorcycles, talk.politics.guns, comp.windows.x, and sci.med from the scikit-learn 20 newsgroups dataset as the text collection for modelling.
2. Data preprocessing:
 * Tokenization: Split each message into individual tokens (words).
 * Stop-word removal: Remove words that carry little information about the topic of the message, such as "is", "may", or "the".
 * Lowercasing: Convert all words to lowercase to ensure uniformity.
 * Stemming or lemmatization: Reduce each word to its root form.
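
The preprocessing steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer or lemmatizer, so its output will not match a library stemmer in general.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "may", "the", "be", "a", "an", "of", "and", "to", "in"}


def crude_stem(token):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(message):
    """Lowercase, tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The doctors may be treating infected patients"))
# ['doctor', 'treat', 'infect', 'patient']
```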


3. Extract the word count features: Convert the preprocessed newsgroup messages into a bag-of-words representation so that they are suitable for the machine learning model. Either of the two feature extraction approaches below can be used.
 * tf feature: tf stands for term frequency. It measures how often a term occurs in a document (message), relative to the length of that document.

          tf(t, d) = (count of term t in document d)
                     / (total number of words in document d)

 * tf-idf feature: A word with a high tf-idf score is frequent in a particular document (message) but uncommon in the newsgroup dataset as a whole. The tf-idf score is calculated with the formulas below.

          idf(t, D) = log[(total number of messages in newsgroup dataset D)
                          / (number of messages containing term t)]

          tfidf(t, d, D) = tf(t, d) * idf(t, D)
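
As a worked example of the formulas above, the following sketch computes tf, idf, and tf-idf on a toy corpus. The corpus and the function names are made up for illustration, and the natural logarithm is assumed for `log`.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing term).

    Assumes the term occurs in at least one document of the corpus.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus of three already-tokenized messages (made up for illustration).
corpus = [
    ["helmet", "bike", "ride", "bike"],
    ["doctor", "patient", "ride"],
    ["gun", "law", "ride"],
]

print(tfidf("bike", corpus[0], corpus))  # tf = 2/4, idf = log(3/1)
print(tfidf("ride", corpus[0], corpus))  # idf = log(3/3) = 0, so the score is 0.0
```

"bike" appears only in the first message, so it scores high there, while "ride" appears in every message and scores zero. Note that scikit-learn's TfidfVectorizer does not follow these formulas exactly with its default settings: it uses raw term counts for tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector, so its values differ from this sketch.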
             
4. Build the naive Bayes classifier from the training set:
 * A multinomial naive Bayes model is used because it is designed for discrete features such as word counts, which makes it a common choice for text classification. In practice it also works with fractional tf-idf weights.
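
To make the idea behind the multinomial model concrete, here is a minimal from-scratch sketch on raw word counts with Laplace (add-one) smoothing. It is not the scikit-learn implementation; the function names, the toy documents, and the choice to skip unseen words at prediction time are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Denominator: all word occurrences in class c plus the smoothing mass.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; unseen words are skipped."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training set (made up for illustration).
docs = [
    ["bike", "helmet", "ride"],
    ["bike", "engine"],
    ["doctor", "patient", "dose"],
    ["patient", "symptom"],
]
labels = ["rec.motorcycles", "rec.motorcycles", "sci.med", "sci.med"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(["patient", "dose", "ride"], log_prior, log_likelihood))  # sci.med
```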


5. Apply the classifier on the testing data set to classify new messages into their category.    

6. Evaluate the performance of the model in handling testing data by using metrics like accuracy, precision, recall, F1-score, and confusion matrices.
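
The metrics listed in step 6 can all be read off a confusion matrix. The sketch below shows one way to do this in plain Python; the labels and predictions are made up and the helper names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_scores(matrix, k):
    """Precision, recall and F1 for class k, read off the confusion matrix."""
    tp = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum: TP + FP
    actual_k = sum(matrix[k])                    # row sum: TP + FN
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up true and predicted labels for three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = sum(cm[k][k] for k in range(3)) / len(y_true)

print(cm)                       # [[2, 1, 0], [0, 2, 0], [1, 0, 2]]
print(accuracy)                 # 0.75
print(per_class_scores(cm, 0))  # precision, recall and F1 are all 2/3 for class 0
```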



## Implementation and Results

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Get the training dataset for the specified categories
categories = ['rec.motorcycles', 'talk.politics.guns',
              'comp.windows.x', 'sci.med']
training_data = fetch_20newsgroups(subset='train', categories=categories)

In [None]:
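# Create the tf-idf vectorizer (it tokenizes and lowercases the text by default)
tfidf = TfidfVectorizer(use_idf=True)
# Alternative: tf-only features, without idf weighting
# tfidf = TfidfVectorizer(use_idf=False)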
training_tfidf = tfidf.fit_transform(training_data.data)
print(training_tfidf.shape)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(training_tfidf, training_data.target)


(2331, 40367)


In [None]:
from sklearn import metrics

testing_data = fetch_20newsgroups(subset='test', categories=categories)
testing_tfidf = tfidf.transform(testing_data.data)
predictions = classifier.predict(testing_tfidf)
# Label the rows with testing_data.target_names: fetch_20newsgroups sorts the
# classes alphabetically, which differs from the order of `categories`.
print(metrics.classification_report(testing_data.target, predictions,
                                    target_names=testing_data.target_names))


                    precision    recall  f1-score   support

    comp.windows.x       0.98      0.94      0.96       395
   rec.motorcycles       0.96      0.98      0.97       398
           sci.med       0.98      0.91      0.94       396
talk.politics.guns       0.90      0.99      0.94       364

          accuracy                           0.95      1553
         macro avg       0.95      0.95      0.95      1553
      weighted avg       0.95      0.95      0.95      1553



In [None]:
# Indices of the test messages that were misclassified
errors = [i for i in range(len(predictions)) if predictions[i] != testing_data.target[i]]

# Show the first five misclassified messages as "true label --> predicted label"
for post_id in errors[:5]:
    print("------------------------------------------------------------------")
    print("%s --> %s\n" % (testing_data.target_names[testing_data.target[post_id]],
                          testing_data.target_names[predictions[post_id]]))
    print(testing_data.data[post_id])


------------------------------------------------------------------
sci.med --> comp.windows.x

From: andrewm@bio.uts.EDU.AU (Andrew Mears)
Subject: sheep models in cardiology
Organization: University of Technology, Sydney
Lines: 18
Distribution: world
NNTP-Posting-Host: iris.bio.uts.edu.au
Keywords: sheep ovine arrhythmias


Dear news readers,

Is there anyone using sheep models for cardiac research, specifically
concerned with arrhythmias, pacing or defibrillation? I would like
to hear from you.

Many thanks,

Andrew Mears
*************** PLEASE EMAIL ME *************
-- 
*************************************************************************
**  *   Andrew Mears                            h: 61-2-9774245         *
* **    CRC for Cardiac Technology, UTS         w: 61-2-3304091	        *
* **    Westbourne St, GORE HILL                F: 61-2-3304003         *
**  *   N.S.W  2065               email: <andrewm@iris.bio.uts.edu.au>  *
**************************************************

## Discussion
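The performance of the text classification model is evaluated in terms of precision, recall, and F1-score (per class and as macro and weighted averages), overall accuracy, and support (the number of test messages per class). The explanation of each metric can be found in lab 6.1. The rows below follow the order of `testing_data.target_names`, which `fetch_20newsgroups` sorts alphabetically; this is the order in which `classification_report` lists the classes.

                          precision    recall  f1-score   support

    comp.windows.x             0.98      0.94      0.96       395
    rec.motorcycles            0.96      0.98      0.97       398
    sci.med                    0.98      0.91      0.94       396
    talk.politics.guns         0.90      0.99      0.94       364

    accuracy                                       0.95      1553
    macro avg                  0.95      0.95      0.95      1553
    weighted avg               0.95      0.95      0.95      1553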


Every per-class precision, recall, and F1-score in the table is at least 0.90, so the model performs well on all four categories. The lowest precision is 0.90 and the lowest recall is 0.91, which means that for every class the model finds most of the true instances while producing few false positives; the F1-scores of 0.94 to 0.97 reflect this balance. The overall accuracy of 0.95 indicates that the model is generally reliable for classifying texts into these categories. The macro and weighted averages are identical to two decimal places (0.95) because the four classes are of similar size (364 to 398 test messages) and are classified about equally well, so no single category dominates the result. In conclusion, this model is reliable and consistent for classifying messages into the newsgroups rec.motorcycles, talk.politics.guns, comp.windows.x, and sci.med.
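
The relationship between the macro and weighted averages can be checked with a short calculation on the F1-scores and supports from the table. Because the inputs are the rounded values printed in the report, the results are approximate.

```python
# Per-class F1-scores and supports as printed in the classification report.
f1_scores = [0.96, 0.97, 0.94, 0.94]
supports = [395, 398, 396, 364]

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.95 0.95
```

The two averages only diverge noticeably when the classes differ strongly in size and in score, which is not the case here.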