# Text Classification

This lab explores a new dataset for text classification tasks using naïve Bayes and logistic regression.

### Outcomes
* Train and test NB and LR classifiers using an established library.
* Apply evaluation metrics to the classifiers and display examples of misclassifications.
* Examine learned model parameters to explain how each classifier makes a decision.

### Overview

The first part of the notebook loads a new Twitter dataset, which is described in [this paper](https://arxiv.org/pdf/2010.12421.pdf), then extracts feature vectors from each sample.
The next part involves implementing and evaluating the classifiers using Scikit-learn.

# 1. Preparing the Data 

In [1]:
from datasets import load_dataset

cache_dir = "./data_cache"

train_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

print(f"Training dataset with {len(train_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

print(f"Test dataset with {len(test_dataset)} instances loaded")


Downloading and preparing dataset tweet_eval/sentiment (download: 6.17 MiB, generated: 6.62 MiB, post-processed: Unknown size, total: 12.79 MiB) to ./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343...


Dataset tweet_eval downloaded and prepared to ./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.
Training dataset with 45615 instances loaded


Reusing dataset tweet_eval (./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Test dataset with 12284 instances loaded


In [2]:
train_dataset[0]

{'text': '"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"',
 'label': 2}

In [6]:
# Put the data into lists ready for the next steps...
from tqdm import tqdm
train_tweets = []
train_labels = []
for i in tqdm(range(len(train_dataset))):
    train_tweets.append(train_dataset[i]['text'])
    train_labels.append(train_dataset[i]['label'])    
print(train_tweets[2])

100%|██████████| 45615/45615 [00:05<00:00, 8456.10it/s]

Sorry bout the stream last night I crashed out but will be on tonight for sure. Then back to Minecraft in pc tomorrow night.





In [7]:
test_tweets = []
test_labels = []
for i in tqdm(range(len(test_dataset))):
    test_tweets.append(test_dataset[i]['text'])
    test_labels.append(test_dataset[i]['label'])
print(test_tweets[2])

100%|██████████| 12284/12284 [00:01<00:00, 8336.03it/s]

@user @user That's coming, but I think the victims are going to be Medicaid recipients.





The next step is to convert the text of each tweet to a feature vector that we can use as input to a classifier. The feature vector needs to be a numerical vector of a fixed size. For the bag-of-words representation, the feature vector for a tweet holds the number of occurrences in that tweet of each word in the vocabulary.
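
As a rough illustration of the bag-of-words idea, the sketch below builds count vectors by hand using plain whitespace tokenisation. The helper names `build_vocabulary` and `to_count_vector` and the toy sentences are made up for this example, and the tokenisation is deliberately simpler than what a library vectoriser does.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercased token in `texts` to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(text, vocabulary):
    """Return a fixed-size list of token counts; tokens outside the vocabulary are skipped."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy data, not taken from the tweet dataset.
toy_train = ["good good day", "bad day"]
toy_vocab = build_vocabulary(toy_train)

print(toy_vocab)
print(to_count_vector("good good day", toy_vocab))
print(to_count_vector("good night", toy_vocab))
```

Every vector has one entry per vocabulary word, so its length does not depend on the length of the text. A word that never appeared in the toy training texts ("night" here) has no column and is dropped.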

For this, we can use the CountVectorizer class: [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

QUESTION: Why do we need to fit the CountVectorizer on the training set?

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

In [9]:
print(vectorizer.vocabulary_)



# 2. Naive Bayes Classifier

The code above has obtained the feature vectors and lists of labels. The data is now ready for use
with scikit-learn's classifiers.

TODO: Train a classifier using the [MultinomialNB class.](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)

In [28]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, train_labels)

MultinomialNB()

TODO: obtain predictions on the test set.

In [14]:
y_pred = clf.predict(X_test)

TODO: compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics library.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

In [19]:
from sklearn import metrics
accuracy = metrics.accuracy_score(test_labels, y_pred)
precision = metrics.precision_score(test_labels, y_pred, average='weighted')
recall = metrics.recall_score(test_labels, y_pred, average='weighted')
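# Note: this is the harmonic mean of the weighted precision and recall, which is not
# in general the same value as metrics.f1_score(test_labels, y_pred, average='weighted').
f1 = 2 * (precision * recall) / (precision + recall)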
print('accuracy:  ', accuracy)
print('precision: ', precision)
print('recall:    ', recall)
print('f1:        ', f1)

accuracy:   0.5814881146206448
precision:  0.5869304903583864
recall:     0.5814881146206448
f1:         0.5841966274715278
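
The F1 value above is the harmonic mean of the weighted precision and the weighted recall. A support-weighted F1, as returned by `metrics.f1_score` with `average='weighted'`, instead averages the per-class F1 scores, and the two numbers can differ. The sketch below computes both from scratch on made-up labels; the function names and the toy data are assumptions for illustration only.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} computed from scratch."""
    scores = {}
    support = Counter(y_true)
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores


def weighted_average(scores, position):
    """Average one of the per-class scores, weighting each class by its support."""
    total = sum(s[3] for s in scores.values())
    return sum(s[position] * s[3] for s in scores.values()) / total


# Hypothetical toy labels, not taken from the tweet dataset.
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 2, 2]
toy_scores = per_class_scores(toy_true, toy_pred)

weighted_precision = weighted_average(toy_scores, 0)
weighted_recall = weighted_average(toy_scores, 1)
weighted_f1 = weighted_average(toy_scores, 2)
harmonic_mean = (
    2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
)

print("weighted F1:", round(weighted_f1, 4))
print("harmonic mean of weighted P and R:", round(harmonic_mean, 4))
```

On this toy data the weighted F1 is about 0.678 while the harmonic mean is about 0.706, so it is worth stating which definition a reported F1 score uses.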


TODO: print out the ten features with the strongest association with each class. Hint: use the `feature_log_prob_` attribute of the MultinomialNB object.

In [34]:
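import numpy as np

num = 10
# Invert the vocabulary once (column index -> token) instead of searching it for every feature.
index_to_token = {idx: token for token, idx in vectorizer.vocabulary_.items()}
for i in range(clf.feature_log_prob_.shape[0]):
    log_probs = clf.feature_log_prob_[i, :]
    # Pick the indices of the `num` largest log-probabilities, then sort them in ascending order.
    ind = np.argpartition(log_probs, -num)[-num:]
    sorted_feats = ind[np.argsort(log_probs[ind])]
    top_feats = [index_to_token[feat] for feat in sorted_feats]
    print(top_feats)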

['may', 'it', 'on', 'of', 'is', 'and', 'in', 'user', 'to', 'the']
['at', 'is', 'for', 'of', 'and', 'on', 'in', 'user', 'to', 'the']
['is', 'you', 'of', 'for', 'on', 'in', 'and', 'user', 'to', 'the']
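
The lists above are dominated by words such as "the" and "to", which are frequent in every class, so the raw log-probabilities say little about which words separate the classes. One possible alternative, sketched below under stated assumptions, is to rank features by how much higher their log-probability is in one class than the average over the other classes. The function name `top_features_by_log_ratio`, the scoring rule, and the toy probabilities are choices made for this illustration, not part of the lab's reference solution.

```python
import numpy as np


def top_features_by_log_ratio(log_prob, feature_names, k):
    """For each class, return the k features whose log-probability most exceeds
    the mean log-probability of the same feature in the other classes."""
    log_prob = np.asarray(log_prob)
    tops = []
    for c in range(log_prob.shape[0]):
        others = np.delete(log_prob, c, axis=0).mean(axis=0)
        scores = log_prob[c] - others
        order = np.argsort(scores)[::-1][:k]  # highest score first
        tops.append([feature_names[j] for j in order])
    return tops


# Hypothetical toy model: two classes, three features.
toy_names = ["the", "awful", "great"]
toy_log_prob = np.log([[0.5, 0.3, 0.2],
                       [0.5, 0.1, 0.4]])

# Highest raw log-probability per class: the shared frequent word wins in both.
raw_top = [toy_names[int(np.argmax(row))] for row in toy_log_prob]
ratio_top = top_features_by_log_ratio(toy_log_prob, toy_names, k=1)

print(raw_top)
print(ratio_top)
```

With the notebook's objects, the same function could be called on `clf.feature_log_prob_` together with a list of feature names ordered by column index.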


TODO: print out an example of a misclassified tweet along with its predicted and true labels.

In [None]:
# WRITE YOUR CODE HERE
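
One possible starting point is sketched below on made-up data. The helper name `find_misclassified` and the toy texts are assumptions for illustration, and the label names assume the tweet_eval sentiment convention (0 negative, 1 neutral, 2 positive).

```python
def find_misclassified(texts, true_labels, predicted_labels):
    """Return (text, true, predicted) for every item whose prediction is wrong."""
    return [
        (text, true, pred)
        for text, true, pred in zip(texts, true_labels, predicted_labels)
        if true != pred
    ]


# Assumed label names for the sentiment task.
label_names = {0: "negative", 1: "neutral", 2: "positive"}

# Hypothetical toy data, not taken from the tweet dataset.
toy_texts = ["love it", "so so", "hate it"]
toy_true = [2, 1, 0]
toy_pred = [2, 0, 0]

errors = find_misclassified(toy_texts, toy_true, toy_pred)
text, true, pred = errors[0]
print(f"tweet: {text!r}  true: {label_names[true]}  predicted: {label_names[pred]}")
```

In the notebook, the same helper could be applied to `test_tweets`, `test_labels` and `y_pred`.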

# 3. Logistic Regression Classifier

TODO: Train a classifier using the [LogisticRegression class.](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [None]:
# WRITE YOUR CODE HERE
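
As a hedged sketch of the workflow on made-up data, the block below fits a `LogisticRegression` on count features, predicts, and reads the highest-weighted token per class from `coef_`. The toy texts, labels and variable names are assumptions for illustration, and `max_iter=1000` is an arbitrary choice meant to give the solver room to converge.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with three classes, not taken from the tweet dataset.
toy_texts = ["awful mess", "awful film", "okay then", "okay film", "great fun", "great film"]
toy_labels = [0, 0, 1, 1, 2, 2]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)

toy_lr = LogisticRegression(max_iter=1000)
toy_lr.fit(toy_X, toy_labels)
toy_lr_pred = toy_lr.predict(toy_X)

# With three classes, coef_ has one row of feature weights per class.
index_to_token = {idx: tok for tok, idx in toy_vectorizer.vocabulary_.items()}
top_tokens = []
for class_label, weights in zip(toy_lr.classes_, toy_lr.coef_):
    best = int(np.argmax(weights))
    top_tokens.append(index_to_token[best])
    print(class_label, index_to_token[best])
```

Unlike the naive Bayes log-probabilities, these weights are learned jointly, so a word shared by all classes ("film" in the toy data) tends to receive little weight for any one of them.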

TODO: obtain predictions on the test set.

In [None]:
# WRITE YOUR CODE HERE

TODO: compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics library.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

In [None]:
# WRITE YOUR CODE HERE

TODO: print out the ten features with the highest weights for each class. Hint: use the `coef_` attribute of the LogisticRegression object.

In [None]:
# WRITE YOUR CODE HERE

TODO: print out an example of a misclassified tweet along with its predicted and true labels.

In [None]:
# WRITE YOUR CODE HERE

# 4. Extension: n-grams

TODO: Use bigram features instead of unigrams (single tokens). To do this, change the `ngram_range` parameter of the CountVectorizer, then try running the best classifier again.

In [None]:
# WRITE YOUR CODE HERE
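
To see what `ngram_range` changes, the sketch below fits two vectorisers on a single made-up sentence and lists the resulting vocabularies. The sentence and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentence, not taken from the tweet dataset.
toy_texts = ["not good at all"]

# (2, 2) keeps bigrams only.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(toy_texts)
bigram_vocab = sorted(bigram_vectorizer.vocabulary_)

# (1, 2) keeps unigrams and bigrams together.
mixed_vectorizer = CountVectorizer(ngram_range=(1, 2))
mixed_vectorizer.fit(toy_texts)
mixed_vocab = sorted(mixed_vectorizer.vocabulary_)

print(bigram_vocab)
print(mixed_vocab)
```

A bigram such as "not good" can carry a different sentiment from either word alone, at the cost of a much larger and sparser vocabulary on a real dataset.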