#  NLP unassessed exercises: This notebook is based on sklearn's tutorial 'Working with Text Data' with some extras and exercises

In [None]:
import sklearn

In [None]:
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

In [None]:
#Loading the 20 Newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
from matplotlib import pyplot as plt

In [None]:
# with a random seed, always keep it the same number each time
# for reproducibility (here 42 (=the meaning of life...))
twenty_train = fetch_20newsgroups(subset='train',categories=categories, 
                                  shuffle=True, random_state=42)

In [None]:
#fetch_20newsgroups puts the data in the .data attribute
len(twenty_train.data)

In [None]:
print(twenty_train.data[0])

In [None]:
# Let's have a closer look at just the first few lines of the first text
print("\n".join(twenty_train.data[0].split("\n")[:3]))

In [None]:
# Extracting features from text data
# Make sure you read the part of the tutorial/lecture about the bags of words
# representation
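
To make the bag-of-words idea concrete before using sklearn, here is a minimal sketch in plain Python. The toy corpus and the helper names (`tokenize`, `bag_of_words`) are assumptions made up for illustration. The regular expression is meant to mirror the default token pattern of `CountVectorizer` (lower-cased tokens of two or more word characters), but this is not how sklearn is implemented internally.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration, not taken from 20 Newsgroups)
toy_docs = ["The cat sat on the mat", "The dog sat"]

def tokenize(text):
    """Lower-case the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index
vocab = {word: idx for idx, word in enumerate(
    sorted({tok for doc in toy_docs for tok in tokenize(doc)}))}

def bag_of_words(text):
    """Return a count vector with one position per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocab]

# One row per document, one column per vocabulary word
toy_matrix = [bag_of_words(doc) for doc in toy_docs]
print(vocab)
print(toy_matrix)
```

Word order is thrown away: each document becomes a row of counts, which is the same shape of object (#instances * #features) that the vectorizer produces, only dense and tiny.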

In [None]:
# A vectorizer is used to extract features from each item in the dataset
from sklearn.feature_extraction.text import CountVectorizer

# create a count vectorizer, which by default does some pre-processing
# tokenize (into single words/unigrams) + lower-casing
# to change these default settings look at the sklearn documentation
count_vect = CountVectorizer(min_df=1)
X_train_counts = count_vect.fit_transform(twenty_train.data)

In [None]:
# let's see how many features we extracted (vocab size) using the CountVectorizer
print(len(count_vect.get_feature_names_out()))

In [None]:
# let's see what is at position 15000 in the global vocab/feature vector
print(count_vect.get_feature_names_out()[15000])

In [None]:
type(X_train_counts)

In [None]:
# CountVectorizer has extracted all the features for all the docs from the data
# putting them into a matrix of dimensions #instances * #features
X_train_counts.shape

In [None]:
# To see the index of a specific word, you can use the following
# (.get returns None if the word is not in the vocabulary)
count_vect.vocabulary_.get('furnace')

In [None]:
# Let's look at what's in the first row/document (see printout above)
# This should be the bag of words representation for the instance
feature_names = count_vect.get_feature_names_out()  # look the names up once
first_row = X_train_counts[0].toarray()[0]
for i, count in enumerate(first_row):
    # only look at elements that are non-0
    if count > 0:
        # print out the index of the feature, the feature name (i.e. the word), count
        print(i, feature_names[i], count)

# Naive Bayes

In [None]:
# Training a multinomial NB classifier ('multinomial' refers to the word-count
# model of the features; the classifier handles any number of classes)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, twenty_train.target)

In [None]:
twenty_train.target

In [None]:
# Testing on a toy dataset
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)

In [None]:
predicted = clf.predict(X_new_counts)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
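
To show roughly what a multinomial Naive Bayes classifier computes, here is a minimal sketch in plain Python. The toy training data and the function names (`train_nb`, `predict_nb`) are assumptions for illustration only. It uses add-one (Laplace) smoothing with `alpha=1.0`, which is also the default `alpha` of `MultinomialNB`, but it is not sklearn's implementation.

```python
import math
from collections import Counter, defaultdict

# Toy training data (made up for illustration)
train_docs = [
    ("god love faith", "religion"),
    ("faith church god", "religion"),
    ("gpu render fast", "graphics"),
    ("opengl gpu image", "graphics"),
]

def train_nb(labelled_docs, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    class_doc_counts = Counter(label for _, label in labelled_docs)
    word_counts = defaultdict(Counter)
    for text, label in labelled_docs:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labelled_docs))
                 for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        # smoothed denominator: class word total + alpha for every vocab word
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total)
                             for w in vocab}
    return log_prior, log_likelihood, vocab

def predict_nb(text, log_prior, log_likelihood, vocab):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in vocab)
    return max(scores, key=scores.get)

model = train_nb(train_docs)
print(predict_nb("god is love", *model))
print(predict_nb("opengl on the gpu is fast", *model))
```

Words that never occurred in training ("is", "on", "the" here) contribute nothing to the score, which corresponds to the vectorizer dropping out-of-vocabulary words when `transform` is called on new documents.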

In [None]:
# A Pipeline is an object that can carry out count extraction and classification
# (and, if you add such a step, weighting) all in one go. Be careful that you
# know what each part does
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB()),
                    ])

In [None]:
# Proper testing on the full 20newsgroups test set
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', categories=categories, 
                                 shuffle=True, random_state=42)
docs_test = twenty_test.data
text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)

# Evaluation

In [None]:
# Using the metrics package
from sklearn import metrics

# Get a classification report to see overall and per-class performance 
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))

In [None]:
# Confusion matrix
metrics.confusion_matrix(twenty_test.target, predicted)

In [None]:
# a function to make the confusion matrix readable and pretty
def confusion_matrix_heatmap(y_test, preds, labels):
    """Function to plot a confusion matrix"""
    cm = metrics.confusion_matrix(y_test, preds)
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Confusion matrix of the classifier')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    # write the count into each cell (rows = true class, columns = predicted class)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, cm[i, j],
                    ha="center", va="center", color="w")

    plt.xlabel('Predicted')
    plt.ylabel('True')

    # workaround for a bug in some older matplotlib releases that cut off the
    # top/bottom rows of heatmaps; elsewhere it only adds a little padding
    b, t = plt.ylim() # discover the values for bottom and top
    b += 0.5 # Add 0.5 to the bottom
    t -= 0.5 # Subtract 0.5 from the top
    plt.ylim(b, t) # update the ylim(bottom, top) values
    plt.show()

In [None]:
confusion_matrix_heatmap(twenty_test.target, predicted, twenty_test.target_names)

## Interpreting the confusion matrix
A perfect classification of this test set would show the lightest colour in every cell on the diagonal and the darkest colour (zero confusion/errors) everywhere else. In practice that won't happen with NLP applications worth studying.

Here several off-diagonal cells contain moderate counts. Notice that many alt.atheism documents were classified as soc.religion.christian, which lowers the recall for alt.atheism and the precision for soc.religion.christian. Quite a few sci.med documents were also classified as soc.religion.christian, which further lowers the precision of soc.religion.christian and slightly lowers the recall of sci.med.
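
The link between off-diagonal cells and precision/recall can be sketched in plain Python. The matrix below is made up for illustration (it is not the output of this notebook), and `per_class_scores` is a hypothetical helper: recall for a class divides the diagonal cell by its row total, precision divides it by its column total.

```python
# Made-up confusion matrix: rows = true class, columns = predicted class.
# These numbers are illustrative only, not the notebook's actual results.
labels = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]
cm = [
    [60,  2,  3, 35],
    [ 1, 95,  2,  2],
    [ 2,  4, 84, 10],
    [ 1,  0,  1, 98],
]

def per_class_scores(cm, labels):
    """Return {label: (precision, recall)} computed from a confusion matrix."""
    scores = {}
    for k, label in enumerate(labels):
        true_pos = cm[k][k]
        row_total = sum(cm[k])                 # documents truly in class k
        col_total = sum(row[k] for row in cm)  # documents predicted as class k
        recall = true_pos / row_total if row_total else 0.0
        precision = true_pos / col_total if col_total else 0.0
        scores[label] = (precision, recall)
    return scores

scores = per_class_scores(cm, labels)
for label, (precision, recall) in scores.items():
    print(f"{label:25s} precision={precision:.2f} recall={recall:.2f}")
```

In this toy matrix the 35 alt.atheism documents predicted as soc.religion.christian sit in the alt.atheism row (lowering its recall to 0.60) and in the soc.religion.christian column (lowering its precision to about 0.68).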

In [None]:
# Print out some predictions against the labels
n = 20  # number of test documents to show
# reuse the predictions already computed for the whole test set
for doc, label_idx, pred_idx in zip(docs_test[:n], twenty_test.target, predicted):
    label = twenty_test.target_names[label_idx]
    prediction = twenty_test.target_names[pred_idx]
    print('{0} => {1}, ground truth = {2}'.format(doc, prediction, label))
    print('*'*50)
    print()

# Exercise 1: Error analysis of False Positives

Performing error analyses is a key part of improving your NLP applications. 

Iterate over `twenty_test.data` and, using the list of predictions and labels, print out every false positive for each class (a false positive for a class is an instance that the classifier assigns to that class when its ground truth label is a different one). Format the print-out so that the correct label and the incorrect prediction are as clear as possible for each wrongly classified text.

HINT: This may be achieved most easily by editing the cell above beginning with the comment `# Print out some predictions against the labels`.

For each class that is predicted as a False Positive, think about which features could be added to reduce the number of these errors, and write a summary of the patterns you see (e.g. when alt.atheism is wrongly predicted). The idea is to understand where and why the classifier assigns a class to a text that does not belong to it. Then think about ways to remove these errors using extra features (meta-features such as document length, different types of pre-processing, different feature extraction, etc.).
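
As one possible starting point for meta-features, here is a small sketch in plain Python. The function name `meta_features` and the particular features chosen are hypothetical; whether any of them helps on this dataset is something to check in your own error analysis.

```python
def meta_features(text):
    """Return a few simple document-level features (hypothetical choices)."""
    lines = text.split("\n")
    tokens = text.split()
    return {
        # document length in whitespace-separated tokens
        "n_tokens": len(tokens),
        # document length in lines
        "n_lines": len(lines),
        # lines quoting an earlier post, marked with '>' in newsgroup replies
        "n_quoted_lines": sum(1 for line in lines
                              if line.lstrip().startswith(">")),
        # whether the text carries a Subject: header line
        "has_subject_header": any(line.startswith("Subject:")
                                  for line in lines),
    }

# Made-up sample text in the style of a newsgroup post
sample = "Subject: Re: test\n\n> quoted reply\nMy own answer here"
print(meta_features(sample))
```

Features like these would have to be turned into numeric columns and combined with the word counts before being passed to the classifier.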

# Exercise 2: Error analysis of False Negatives
Do the same as in Exercise 1 but for False Negatives. Note that the set of incorrect predictions is the same as in Exercise 1, but each error is now counted against its ground truth class (the class that was missed) rather than against the predicted class.

For each class that has False Negatives, think about which features could be added to reduce the number of these errors. The idea is to understand where and why the classifier fails to recognise a text as belonging to a certain class.
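
The relationship between the two kinds of error can be sketched on toy labels in plain Python. All the data below is made up; in the notebook the corresponding objects would be `twenty_test.target_names`, `twenty_test.target` and `predicted`. Each misclassified instance is a false positive for the predicted class and, at the same time, a false negative for the true class.

```python
from collections import defaultdict

# Toy labels (made up for illustration)
names = ["alt.atheism", "sci.med"]
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1]

false_positives = defaultdict(list)  # class -> indices wrongly predicted as it
false_negatives = defaultdict(list)  # class -> indices of its members missed

for idx, (true, pred) in enumerate(zip(y_true, y_pred)):
    if true != pred:
        # the same error is counted once per view
        false_positives[names[pred]].append(idx)
        false_negatives[names[true]].append(idx)

print(dict(false_positives))
print(dict(false_negatives))
```

This is only a sketch of the grouping step; printing each text alongside its correct label and incorrect prediction is left to the exercises.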