## Lab6-Assignment: Topic Classification

Use the same training, development, and test partitions of the the 20 newsgroups text dataset as in Lab6.4-Topic-classification-BERT.ipynb 

* Fine-tune and examine the performance of another transformer-based pretrained language models, e.g., RoBERTa, XLNet

* Compare the performance of this model to the results achieved in Lab6.4-Topic-classification-BERT.ipynb and to a conventional machine learning approach (e.g., SVM, Naive Bayes) using bag-of-words or other engineered features of your choice. 
Describe the differences in performance in terms of Precision, Recall, and F1-score evaluation metrics.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import sklearn
from sklearn.metrics import classification_report
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import matplotlib.pyplot as plt 
import seaborn as sn 
from collections import Counter
from sklearn.datasets import fetch_20newsgroups

### importing dataset of 20 newsgroups with 18K posts on 20 topics

In [2]:
# load only a sub-selection of the categories (4 in our case)
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'sci.space'] 

# remove the headers, footers and quotes (to avoid overfitting)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories, random_state=42)

# Data Exploration

In [3]:
Counter(newsgroups_train.target)

Counter({2: 594, 3: 593, 1: 584, 0: 480})

In [4]:
Counter(newsgroups_test.target)

Counter({2: 396, 3: 394, 1: 389, 0: 319})

### converting to pandas dataframe

In [5]:
train = pd.DataFrame({'text': newsgroups_train.data, 'labels': newsgroups_train.target})

In [6]:
print(len(train))
train.head(5)

2251


Unnamed: 0,text,labels
0,WHile we are on the subject of the shuttle sof...,3
1,There is a program called Graphic Workshop you...,1
2,,2
3,My girlfriend is in pain from kidney stones. S...,2
4,I think that's the correct spelling..\n\tI am ...,2


In [7]:
test = pd.DataFrame({"text": newsgroups_test.data, "labels": newsgroups_test.target})

In [8]:
print(len(test))
test.head(5)

1498


Unnamed: 0,text,labels
0,\nAnd guess who's here in your place.\n\nPleas...,1
1,Does anyone know if any of Currier and Ives et...,1
2,=FLAME ON\n=\n=Reading through the posts about...,2
3,\nBut in this case I said I hoped that BCCI wa...,0
4,\nIn the kind I have made I used a Lite sour c...,2


### Development Set

from sklearn.model_selection import train_test_split

train, dev = train_test_split(train, test_size=0.1, random_state=0, 
                               stratify=train[['labels']])

In [9]:
from sklearn.model_selection import train_test_split

train, dev = train_test_split(
    train, test_size=0.1, random_state=0, stratify=train[["labels"]]
)

In [10]:
print(len(train))
print("train:", train[["labels"]].value_counts(sort=False))
train.head(3)

2025
train: labels
0         432
1         525
2         534
3         534
Name: count, dtype: int64


Unnamed: 0,text,labels
559,I wonder how many atheists out there care to s...,0
2060,We are interested in purchasing a grayscale pr...,1
1206,"Dear Binary Newsers,\n\nI am looking for Quick...",1


In [11]:
print(len(dev))
print("dev:", dev[["labels"]].value_counts(sort=False))
dev.head(3)

226
dev: labels
0         48
1         59
2         60
3         59
Name: count, dtype: int64


Unnamed: 0,text,labels
1570,I'd dump him. Rude is rude and it seems he en...,2
1761,Hi Everyone ::\n\nI am looking for some soft...,1
455,A friend of mine has been diagnosed with Psori...,2


# RoBERTa model

In [23]:
# Import necessary modules and classes
from simpletransformers.classification import (
    ClassificationModel,
    MultiLabelClassificationModel,
    ClassificationArgs,
)
from transformers import RobertaConfig, RobertaModel

# Model configuration
model_args = ClassificationArgs()
model_args.overwrite_output_dir = True  # overwrite existing saved models in the same directory
model_args.evaluate_during_training = True  # perform evaluation while training the model
model_args.num_train_epochs = 1  # number of epochs -- 10 
model_args.train_batch_size = 32  # batch size -- 32
model_args.learning_rate = 4e-6  # learning rate
model_args.max_seq_length = 256  # maximum sequence length -- 256

# Early stopping to combat overfitting
model_args.use_early_stopping = True
model_args.early_stopping_delta = 0.01  # improvement over best_eval_loss necessary to count as a better checkpoint
model_args.early_stopping_metric = "eval_loss"
model_args.early_stopping_metric_minimize = True
model_args.early_stopping_patience = 2
model_args.evaluate_during_training_steps = 32  # how often to run validation in terms of training steps -- 32

In [24]:
# Checking steps per epoch
steps_per_epoch = int(np.ceil(len(train) / float(model_args.train_batch_size)))
print(
    "Each epoch will have {:,} steps.".format(steps_per_epoch)
)  # 64 steps = validating 2 times per epoch

Each epoch will have 64 steps.


In [27]:
model = ClassificationModel(
    "roberta", "roberta-base", num_labels=4, args=model_args, use_cuda=False
)  # CUDA is enabled

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
_, history = model.train_model(train, eval_df=dev, average="micro")

  0%|          | 0/4 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 1 of 1:   0%|          | 0/64 [00:00<?, ?it/s]

0it [00:00, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

In [None]:
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(dev)
result