#Use the news dataset from sklearn

We will use the same dataset as last week and re-use the same three categories for demonstration purposes.

This is a news dataset for text classification: given a news post as the input, predict its category (one of the twenty pre-defined categories) as the output. The details are here https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

In [10]:
from sklearn.datasets import fetch_20newsgroups

#we use the same random seed for reproducibility
RANDOM_STATE = 0
#this dataset has almost 20k instances. For quick demonstration,
#we will only use a sub-sample from the three categories below
categories = ['comp.graphics', 'sci.med', 'talk.politics.guns']
#now let's retrieve all the data
data_all = fetch_20newsgroups(
        subset='all',
        shuffle=True,
        categories= categories,
        random_state=RANDOM_STATE,
    )



#Re-generate the training, validation, and testing sets

In [11]:
#We used the same training, validation, and test set as last week
#Step 1: calculate the train, validation, test set size

train_set_size = int(len(data_all.data)*0.7) #70% of the data for training
valid_set_size = int(len(data_all.data)*0.1) #10% of the data for validation
test_set_size = len(data_all.data) - train_set_size - valid_set_size #the remaining data for testing

print('Training set size:', train_set_size)
print('Valid set size:', valid_set_size)
print('Testing set size:', test_set_size)

from sklearn.model_selection import train_test_split

#Step 2: first split the testing set out. The remaining data will be used for training and validation
#We use train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

#data_all.data: they are the input instances that we want our model to learn
#data_all.target: they are the output class labels that we want our model to predict
X_trainvalid, X_test, y_trainvalid, y_test = train_test_split(data_all.data,
                                                              data_all.target,
                                                              test_size=test_set_size,
                                                              random_state=RANDOM_STATE)

#Step 3: now it's your turn. We further split the remaining data into train and valid
X_train, X_valid, y_train, y_valid = train_test_split(X_trainvalid, y_trainvalid, test_size=valid_set_size, random_state=RANDOM_STATE)

Training set size: 2011
Valid set size: 287
Testing set size: 575


#Redo text representation

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

#we use the tfidf to generate representation from the training set
X_train_vector = vectorizer.fit_transform(X_train)

#then we apply the fitted vectorizer to the validation set
#note here we use transform instead of fit_transform because the vocabulary and
#idf weights were already learned from the training set
X_valid_vector = vectorizer.transform(X_valid)

#Exercise 1 Train a Naive Bayes model

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Create Naive Bayes model with default hyperparameters
model = MultinomialNB()

# Now it's your turn. Train model using training set tfidf vectors and labels
# model.fit(XX, XX)
model.fit(X_train_vector, y_train)

# Now it's your turn. Predict on validation set
# y_predict_valid = model.predict(XX)
y_predict_valid = model.predict(X_valid_vector)
print ('accuracy in the validation set', accuracy_score(y_valid, y_predict_valid))

accuracy in the validation set 0.9825783972125436


#Examine the performance

#Exercise 2 Calculate precision, recall, and f1-score in addition to accuracy

In [14]:
#The results show that a simple Naive Bayes model with default settings already
#achieves an accuracy of ~0.98 on the validation set! Before celebrating,
#we need to carefully examine the performance

#Recall that accuracy can be dominated by the majority class
#Let's first calculate other metrics to get a more balanced view

#We can calculate precision, recall, and f-score introduced in the class
#It's your turn
#You can call the classification_report function via https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
from sklearn.metrics import classification_report
print(classification_report(y_valid, y_predict_valid, target_names=categories))

                    precision    recall  f1-score   support

     comp.graphics       0.98      0.98      0.98       104
           sci.med       0.99      0.98      0.98        91
talk.politics.guns       0.98      0.99      0.98        92

          accuracy                           0.98       287
         macro avg       0.98      0.98      0.98       287
      weighted avg       0.98      0.98      0.98       287



In [15]:
#precision, recall, and f-score also show that the basic model did
#a reasonable job of classifying all the classes.

#Now, let's further examine the confusion matrix

#We can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_valid, y_predict_valid))

[[102   1   1]
 [  1  89   1]
 [  1   0  91]]
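
The confusion matrix can also be summarized numerically. The sketch below copies the matrix printed above into a NumPy array (rather than recomputing it) and counts the errors per true class; the variable names are only illustrative.

```python
import numpy as np

# confusion matrix copied from the output above
# rows: true class, columns: predicted class
cm = np.array([[102, 1, 1],
               [1, 89, 1],
               [1, 0, 91]])
class_names = ['comp.graphics', 'sci.med', 'talk.politics.guns']

correct = np.trace(cm)   # the diagonal holds the correct predictions
total = cm.sum()         # all validation instances
# misclassified instances of each true class = row sum minus the diagonal entry
errors_per_class = cm.sum(axis=1) - np.diag(cm)

print('total errors:', total - correct)
print('accuracy:', correct / total)
for name, n_errors in zip(class_names, errors_per_class):
    print(name, 'errors:', n_errors)
```

The accuracy computed this way should match the accuracy_score reported earlier, since both are derived from the same predictions.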


In [16]:
#The model made only 5 errors in total on the 287 validation instances

#Now let's examine what features are most important for the model

import numpy as np
def show_topN_features(classifier, vectorizer, categories, N):
    #the method is adapted from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html
    #get the tokens represented in the tfidf vector
    feature_names = vectorizer.get_feature_names_out()
    #for each class, print its N highest-weighted tokens (in ascending order of weight)
    for i, category in enumerate(categories):
        topN = np.argsort(classifier.feature_log_prob_[i])[-N:]
        print("%s: %s" % (category, " ".join(feature_names[topN])))

show_topN_features(model, vectorizer, categories, 10)

comp.graphics: edu graphics it in for is and of to the
sci.med: you edu that it in and is of to the
talk.politics.guns: com they is you that in and of to the


These important features seem concerning. Tokens such as 'edu' and 'com' do not look like they come from the main content; they look like fragments of email addresses. In other words, the model may be picking up noise rather than features from the actual content. Also, many of the top features are stop words. This reminds us of the importance of text processing. For now, we focus on the model and evaluation. Next time we will build a complete pipeline.

In [17]:
#let's also examine a few instances manually
#recall that we should do this at the very beginning
#this is for demonstration when we have suspicious results

for data, label in zip(X_train[:5], y_train[:5]):
  print(label)
  print(data)

1
From: bai@msiadmin.cit.cornell.edu (Dov Bai-MSI Visitor)
Subject: Re: Earwax
Organization: Mathematical Sciences Institute (MSI)-Cornell University
Lines: 14
NNTP-Posting-Host: msiadmin.cit.cornell.edu

In article <lu2defINNac7@news.bbn.com> levin@bbn.com (Joel B Levin) writes:
>bobm@Ingres.COM (Bob McQueer) writes:
>|One question I do have - a doctor who flushed out my ears once also advocated
>|a drop of rubbing alcohol in them afterwards to flush out any remaining
>|trapped water - said he told swimmers to do this after swimming, too.  It
>|works, but it stings like the devil, so I've always been content to let any
>|water in my ears from swimming or flushing them out figure out how to get
>|out by itself if shaking my head a few times won't do the trick.  Any
>|comments?

Perhaps diluting the rubbing alcohol in some water, until you
feels comfortable will do the trick ?



2
From: fcrary@ucsu.Colorado.EDU (Frank Crary)
Subject: Re: Gun Control (was Re: We're Mad as Hell at the TV

The examples also show that tokens such as 'edu' and 'com' come from email addresses and headers, not from the body of the posts. This further suggests that the model may be overfitting to this metadata. For more detail, you can print(data_all.DESCR).

The fetch_20newsgroups function provides a remove parameter to strip headers, footers, or quotes so that the model learns from the actual content.

If you have time, you are encouraged to re-implement the pipeline using the suggested evaluation approach (remove=('headers', 'footers', 'quotes')). This case study reminds us of the importance of text processing and thorough evaluation. We need to understand the data and the problem well.
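
A minimal sketch of how the data could be re-loaded with the metadata stripped is shown below. The helper name fetch_body_only and the constants are hypothetical and only for illustration; the remove parameter itself is part of fetch_20newsgroups. The rest of the pipeline (split, tfidf, Naive Bayes) would stay the same.

```python
from sklearn.datasets import fetch_20newsgroups

CATEGORIES = ['comp.graphics', 'sci.med', 'talk.politics.guns']
# parts of each post to strip before returning the text
REMOVE_PARTS = ('headers', 'footers', 'quotes')

def fetch_body_only(categories=None, random_state=0):
    # hypothetical helper: same call as in the first cell, plus the remove parameter
    # note: this downloads the dataset the first time it is called
    if categories is None:
        categories = CATEGORIES
    return fetch_20newsgroups(
        subset='all',
        shuffle=True,
        categories=categories,
        random_state=random_state,
        remove=REMOVE_PARTS,
    )

# usage (needs network access on first use):
# data_clean = fetch_body_only()
# print(data_clean.data[0])
```

Expect the validation accuracy to be lower with this setting, since the model can no longer rely on header tokens; that lower number is likely a more realistic estimate of how well the model classifies the content itself.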

In [18]:
print(data_all.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

Classes                     20
Samples total            18846
Dimensionality               1
Features                  text

#Common ways to reduce the chance of overfitting:

1. As mentioned in our class, the text processing phase may account for over 80% of the time in a text mining project. During this phase, you need to examine the training set thoroughly and make sure you know the data and the problem well.

2. Use n-fold cross-validation and monitor the variance of the performance

3. During the evaluation phase, as we did above, perform error analysis, examine top features, and manually check a few samples and predictions

4. Apply methods that are designed to improve generalization capability; for instance, random forest https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
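
As an illustration of point 2, the sketch below runs 3-fold cross-validation on a tfidf + Naive Bayes pipeline and reports the mean and standard deviation of the accuracy. The toy corpus and its labels are made up so that the example runs without downloading anything; in this notebook you would pass X_trainvalid and y_trainvalid instead. Wrapping the vectorizer in a pipeline means it is re-fitted on the training part of each fold only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy corpus, made up for illustration (0 = graphics, 1 = medicine)
toy_texts = [
    "render the image with a shader",
    "the polygon mesh needs texture mapping",
    "image rendering uses pixel shaders",
    "draw the polygon on the screen",
    "texture and pixel formats for image files",
    "the shader draws each pixel",
    "the doctor prescribed a medicine",
    "patient symptoms include fever",
    "the medicine reduced the fever",
    "the doctor examined the patient",
    "symptoms of the disease and treatment",
    "treatment for the patient with the disease",
]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# the pipeline fits the vectorizer and the classifier together inside each fold
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3, scoring='accuracy')

print('accuracy per fold:', scores)
print('mean: %.3f, std: %.3f' % (scores.mean(), scores.std()))
```

A large standard deviation across folds is a hint that the performance depends heavily on which instances end up in the training set, which is worth investigating before trusting a single validation score.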