In [1]:
import pandas as pd
import datasets
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import DataCollatorWithPadding, Trainer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

## Assignment: Document Classification

#### by Matthew Lucich

First we load the pre-processed news article data. The data is a subset of the POLUSA dataset, a compilation of roughly 0.9 million English-language political news articles published from 2017 to 2019. My pre-processing, done as part of my Master's capstone project, filtered out market news and removed instances with null values or undefined political leanings.

In [2]:
df_polusa = pd.read_csv("polusa_2019_5k.csv")

In [3]:
df_polusa.shape

(5000, 12)

In [4]:
set(df_polusa["outlet"])

{'ABC News',
 'Breitbart',
 'CBS News',
 'Fox News',
 'HuffPost',
 'Los Angeles Times',
 'NBC News',
 'NPR',
 'National Review',
 'PBS',
 'Reuters',
 'Slate',
 'The Daily Caller',
 'The Guardian',
 'The Nation',
 'The New York Times',
 'The State',
 'USA Today',
 'Yahoo! News'}

In [5]:
set(df_polusa["political_leaning"])

{'CENTER', 'LEFT', 'RIGHT'}

Below we can see that the classes are imbalanced: the smallest minority class, RIGHT, makes up approximately 16% of the instances. According to [Google ML researchers](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data), this counts as a moderate imbalance, though it is not far from the range considered only mild. Therefore, we will move forward without downsampling. If the model never predicts RIGHT, or performs exceptionally poorly on that class, we may reconsider.

In [6]:
df_polusa['political_leaning'].value_counts() / len(df_polusa)

CENTER    0.6094
LEFT      0.2326
RIGHT     0.1580
Name: political_leaning, dtype: float64
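
As a rough check on the reasoning above, the sketch below buckets the smallest class share into severity bands. The thresholds (20-40% mild, 1-20% moderate, under 1% extreme) are assumed from the linked Google guide, and `imbalance_degree` is a hypothetical helper, so treat this as illustrative rather than definitive.

```python
# Class shares copied from the value_counts() output above.
class_shares = {"CENTER": 0.6094, "LEFT": 0.2326, "RIGHT": 0.1580}


def imbalance_degree(minority_share):
    """Bucket a minority-class share into a severity band.

    Thresholds are an assumption taken from the Google imbalanced-data
    guide: 20-40% mild, 1-20% moderate, below 1% extreme.
    """
    if minority_share >= 0.40:
        return "not flagged"
    if minority_share >= 0.20:
        return "mild"
    if minority_share >= 0.01:
        return "moderate"
    return "extreme"


minority_class = min(class_shares, key=class_shares.get)
minority_share = class_shares[minority_class]
print(minority_class, imbalance_degree(minority_share))

# Distance from the assumed 20% "mild" boundary, in percentage points.
gap_to_mild = (0.20 - minority_share) * 100
print(round(gap_to_mild, 1))
```

Under these assumed bands, RIGHT sits about four percentage points below the mild range, which is the sense in which it is "not far" from a mild imbalance.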

In [7]:
df_polusa = df_polusa.drop(columns=["Unnamed: 0"])
df_polusa.head(2)

Unnamed: 0,id,date_publish,outlet,headline,lead,body,authors,domain,url,political_leaning,head_lead_body
0,4328691,2019-08-27 09:37:19,Breitbart,Sweden: Woman Was Cradling Baby When She Was F...,A woman in her 30s was fatally shot in the hea...,A woman in her 30s was fatally shot in the hea...,Chris Tomlinson,www.breitbart.com,https://www.breitbart.com/europe/2019/08/27/sw...,RIGHT,Sweden: Woman Was Cradling Baby When She Was F...
1,3883779,2019-08-27 09:38:34,HuffPost,Mormon Leaders Ban Guns In Church,A new church policy prohibits all parishioners...,The Church of Jesus Christ of Latter-day Saint...,Senior Reporter,www.huffpost.com,https://www.huffpost.com/entry/mormon-lds-chur...,LEFT,Mormon Leaders Ban Guns In Church###A new chur...


### Preprocess Dataset

Convert the text categories, "LEFT", "CENTER", and "RIGHT", to numeric categories, 0, 1, and 2, respectively.

In [8]:
df_transformer_format = df_polusa[["head_lead_body", "political_leaning"]]
convert_to_num = {"LEFT": 0, "CENTER": 1, "RIGHT": 2}
df_transformer_format = df_transformer_format.replace({"political_leaning": convert_to_num})
df_transformer_format = df_transformer_format.rename(columns={"head_lead_body":"text", "political_leaning":"label"})
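
Since the model will output numeric ids, it can help to keep the inverse of this mapping around for reading predictions. A minimal sketch follows; `convert_to_label` and the example ids are hypothetical and not part of the notebook.

```python
# Same mapping as in the cell above.
convert_to_num = {"LEFT": 0, "CENTER": 1, "RIGHT": 2}

# Hypothetical inverse mapping, used to turn predicted ids back into names.
convert_to_label = {num: name for name, num in convert_to_num.items()}

example_ids = [1, 1, 0, 2]  # illustrative ids, not actual model output
decoded = [convert_to_label[i] for i in example_ids]
print(decoded)  # ['CENTER', 'CENTER', 'LEFT', 'RIGHT']
```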

Create the train, dev-test, and test sets. We take an approach similar to the one suggested for project 3. The news articles (rows) are sorted by time, so the classes should be approximately evenly distributed throughout the set.

In [9]:
df_train = df_transformer_format[:4000]
df_dev_test = df_transformer_format[-1000:-500]
df_test = df_transformer_format[-500:]

In [10]:
print(len(df_train))
print(len(df_dev_test))
print(len(df_test))

4000
500
500
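
The assumption that classes are spread evenly across the time-ordered rows is easy to check per split. The sketch below uses toy labels in place of the real `label` column; `chronological_split` and `class_shares` are hypothetical helpers that mirror the slicing above.

```python
from collections import Counter


def chronological_split(rows, n_dev=500, n_test=500):
    """Split time-ordered rows into train, dev-test, and test slices."""
    n_train = len(rows) - n_dev - n_test
    return (
        rows[:n_train],
        rows[n_train:n_train + n_dev],
        rows[n_train + n_dev:],
    )


def class_shares(labels):
    """Return each label's share of the given list."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in sorted(counts.items())}


# Toy labels standing in for the real, time-sorted "label" column.
toy_labels = [1, 1, 0, 1, 2, 1, 0, 1, 1, 2] * 500  # 5000 rows
train, dev_test, test = chronological_split(toy_labels)

for name, part in [("train", train), ("dev_test", dev_test), ("test", test)]:
    print(name, len(part), class_shares(part))
```

On the real data, comparing the three printed distributions would show whether a split is skewed toward one class.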


Convert the pandas DataFrames into Hugging Face `Dataset` objects, the format preferred by the Hugging Face training tools.

In [11]:
train_dataset = datasets.Dataset.from_dict(df_train)
dev_test_dataset = datasets.Dataset.from_dict(df_dev_test)
test_dataset = datasets.Dataset.from_dict(df_test)
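
`Dataset.from_dict` works on a column-oriented mapping: one key per column, each holding a list of values. The sketch below shows that shape with plain Python; `rows_to_columns` and the toy rows are hypothetical and only illustrate the layout.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns


# Toy rows using the same column names as df_transformer_format.
toy_rows = [
    {"text": "Headline A###Lead A###Body A", "label": 2},
    {"text": "Headline B###Lead B###Body B", "label": 0},
]
columns = rows_to_columns(toy_rows)
print(columns["label"])  # [2, 0]
```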

In [35]:
test_dataset["text"][1][:90]

'$58,000 reward offered after more than 40 wild burros found shot dead in the Mojave Desert'

Tokenize the text of the articles (headline, lead, and body) into token IDs. We enable padding so the sequences in a batch have a uniform length, and truncation so that unusually long articles are cut off at the model's maximum input length instead of making the vectors unnecessarily long.

In [13]:
def preprocess_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)
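
To make the effect of the two flags concrete, here is a simplified, pure-Python sketch of truncating and then padding a batch. It is not the Hugging Face implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer also preserves special tokens when truncating.

```python
def pad_and_truncate(token_id_lists, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (width - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ids only; a short and a long "article".
batch = pad_and_truncate(
    [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 3720, 102]],
    max_length=5,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The short sequence is padded with zeros and the long one is cut to five ids, so both rows end up the same width.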

Load the distilbert-base-uncased tokenizer. DistilBERT is a smaller, faster distilled version of the BERT transformer model and was trained on English Wikipedia and BookCorpus.

In [14]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [15]:
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_dev_test = dev_test_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)




In [16]:
tokenized_train = tokenized_train.remove_columns(['text'])
tokenized_dev_test = tokenized_dev_test.remove_columns(['text'])
tokenized_test = tokenized_test.remove_columns(['text'])

### Classification Model: distilbert-base-uncased (transformer)

In [22]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

Define our evaluation metrics: overall accuracy, plus F1, precision, and recall reported separately for each class.

In [23]:
def compute_metrics(pred):
    """Sourced from: https://huggingface.co/transformers/v3.0.2/training.html"""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, labels=[0, 1, 2])
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
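
For readers unfamiliar with per-class scores, the sketch below computes precision, recall, and F1 for each class from scratch on toy data. It is meant to illustrate what the per-class arrays contain, not to replace the scikit-learn call; `per_class_scores` and the toy labels are hypothetical.

```python
def per_class_scores(labels, preds, classes=(0, 1, 2)):
    """Compute precision, recall, and F1 separately for each class."""
    scores = {}
    for c in classes:
        pairs = list(zip(labels, preds))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy data: 0 = LEFT, 1 = CENTER, 2 = RIGHT.
toy_labels = [0, 0, 1, 1, 1, 2, 2, 1]
toy_preds = [0, 1, 1, 1, 1, 2, 1, 1]
scores = per_class_scores(toy_labels, toy_preds)

for c, s in scores.items():
    print(c, {name: round(value, 3) for name, value in s.items()})
```

In the toy example, the two minority classes have perfect precision but only 50% recall, because their misclassified articles are all absorbed by class 1.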

In [24]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Define our trainer, specifying the model, the training set, the dev-test set (used for evaluation), the tokenizer, the data collator, and the evaluation metrics.

In [25]:
trainer = Trainer(
    model=model,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Fine-tune (train) our model for three epochs, the `Trainer` default. This took roughly six hours to complete.

In [26]:
trainer.train()

***** Running training *****
  Num examples = 4000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1500


Step,Training Loss
500,0.4612
1000,0.2582
1500,0.1288


Saving model checkpoint to tmp_trainer/checkpoint-500
Configuration saved in tmp_trainer/checkpoint-500/config.json
Model weights saved in tmp_trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in tmp_trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in tmp_trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to tmp_trainer/checkpoint-1000
Configuration saved in tmp_trainer/checkpoint-1000/config.json
Model weights saved in tmp_trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in tmp_trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in tmp_trainer/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to tmp_trainer/checkpoint-1500
Configuration saved in tmp_trainer/checkpoint-1500/config.json
Model weights saved in tmp_trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in tmp_trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved in tmp_traine

TrainOutput(global_step=1500, training_loss=0.28275184122721353, metrics={'train_runtime': 21473.7349, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.07, 'total_flos': 1589637132288000.0, 'train_loss': 0.28275184122721353, 'epoch': 3.0})
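
The step count and runtime in the log follow directly from the dataset size and batch size. A quick sanity check of that arithmetic, with all values copied from the output above:

```python
import math

# Values copied from the training log above.
num_examples = 4000
batch_size = 8
num_epochs = 3
train_runtime_seconds = 21473.7349

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
hours = train_runtime_seconds / 3600
seconds_per_step = train_runtime_seconds / total_steps

print(total_steps)                 # 1500
print(round(hours, 2))             # 5.96
print(round(seconds_per_step, 1))  # 14.3
```

At about 14 seconds per optimization step, the runtime matches the reported `train_steps_per_second` of 0.07.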

The accuracy on the dev-test set is approximately 92%, about 31 percentage points above the majority-class rate. The per-class F1 scores are similarly high, ranging from 0.82 to 0.96. According to F1, precision, and recall, CENTER is the best-predicted class. As somewhat expected, the smallest minority class, RIGHT, has the lowest F1, precision, and recall values.

In [27]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 500
  Batch size = 8


Trainer is attempting to log a value of "[0.87804878 0.95503876 0.81675393]" of type <class 'numpy.ndarray'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[0.88888889 0.94769231 0.82978723]" of type <class 'numpy.ndarray'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[0.86746988 0.9625     0.80412371]" of type <class 'numpy.ndarray'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 0.35529208183288574,
 'eval_accuracy': 0.916,
 'eval_f1': array([0.87804878, 0.95503876, 0.81675393]),
 'eval_precision': array([0.88888889, 0.94769231, 0.82978723]),
 'eval_recall': array([0.86746988, 0.9625    , 0.80412371]),
 'eval_runtime': 263.2604,
 'eval_samples_per_second': 1.899,
 'eval_steps_per_second': 0.239,
 'epoch': 3.0}
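
The comparison with the majority-class rate can be reproduced from these numbers. The sketch assumes the CENTER share of the full dataset (0.6094) also holds for the dev-test set, which the notebook does not verify, and adds unweighted macro averages as one-number summaries of the per-class arrays.

```python
# Values copied from the trainer.evaluate() output above.
dev_metrics = {
    "accuracy": 0.916,
    "f1": [0.87804878, 0.95503876, 0.81675393],
    "precision": [0.88888889, 0.94769231, 0.82978723],
    "recall": [0.86746988, 0.9625, 0.80412371],
}

# Assumption: the full-dataset CENTER share applies to the dev-test set.
majority_share = 0.6094

# Unweighted (macro) mean over LEFT, CENTER, RIGHT.
macro = {
    name: sum(dev_metrics[name]) / len(dev_metrics[name])
    for name in ("f1", "precision", "recall")
}
lift_points = (dev_metrics["accuracy"] - majority_share) * 100

print(round(macro["f1"], 3))  # 0.883
print(round(lift_points, 1))  # 30.7
```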

In [28]:
preds_data = trainer.predict(tokenized_test)

***** Running Prediction *****
  Num examples = 500
  Batch size = 8


On the test set, performance is slightly lower on average: accuracy drops from 91.6% to 90.4%. LEFT nevertheless improves on F1, precision, and recall, while RIGHT declines on the same three metrics. Overall, we have a modest preference for precision: labeling an article as biased in a particular direction without high confidence would be poor form, whereas missing some biased articles is less costly.

In [29]:
preds_data.metrics

{'test_loss': 0.39563611149787903,
 'test_accuracy': 0.904,
 'test_f1': array([0.90502793, 0.9519833 , 0.7607362 ]),
 'test_precision': array([0.92045455, 0.92307692, 0.80519481]),
 'test_recall': array([0.89010989, 0.98275862, 0.72093023]),
 'test_runtime': 261.4704,
 'test_samples_per_second': 1.912,
 'test_steps_per_second': 0.241}
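
The shift between the dev-test and test sets is easier to read as per-class differences. The sketch below subtracts the dev-test values from the test values, with all numbers copied from the two outputs above.

```python
LABELS = ["LEFT", "CENTER", "RIGHT"]

# Per-class values copied from trainer.evaluate() (dev-test set).
dev = {
    "f1": [0.87804878, 0.95503876, 0.81675393],
    "precision": [0.88888889, 0.94769231, 0.82978723],
    "recall": [0.86746988, 0.9625, 0.80412371],
}
# Per-class values copied from preds_data.metrics (test set).
test = {
    "f1": [0.90502793, 0.9519833, 0.7607362],
    "precision": [0.92045455, 0.92307692, 0.80519481],
    "recall": [0.89010989, 0.98275862, 0.72093023],
}

# Positive delta means the test set scored higher than the dev-test set.
deltas = {
    metric: {
        label: test[metric][i] - dev[metric][i]
        for i, label in enumerate(LABELS)
    }
    for metric in dev
}

for metric, by_label in deltas.items():
    print(metric, {label: f"{d:+.3f}" for label, d in by_label.items()})
```

LEFT gains on all three metrics and RIGHT loses on all three, with the largest drop in RIGHT recall.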

In [30]:
trainer.save_model(output_dir="news-model")

Saving model checkpoint to news-model
Configuration saved in news-model/config.json
Model weights saved in news-model/pytorch_model.bin
tokenizer config file saved in news-model/tokenizer_config.json
Special tokens file saved in news-model/special_tokens_map.json
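
The stated preference for precision could be applied at inference time by assigning LEFT or RIGHT only when the model is confident. The sketch below is one hypothetical way to do that, not something this notebook does: the 0.8 threshold, the fallback to CENTER, and the example logits are all assumptions.

```python
import math

LABELS = ["LEFT", "CENTER", "RIGHT"]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, threshold=0.8, fallback="CENTER"):
    """Assign LEFT or RIGHT only when the top probability is high enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if LABELS[best] != fallback and probs[best] < threshold:
        return fallback
    return LABELS[best]


# Illustrative logits, not actual model output.
print(predict_with_threshold([2.0, 1.5, 0.1]))  # CENTER (LEFT not confident)
print(predict_with_threshold([4.0, 0.5, 0.2]))  # LEFT
print(predict_with_threshold([0.0, 0.2, 5.0]))  # RIGHT
```

Raising the threshold trades recall on LEFT and RIGHT for higher precision, so any chosen value would need to be tuned on the dev-test set.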


### References

*Natural Language Processing with Python* by Steven Bird, Ewan Klein, and Edward Loper <br> 
Text Classification, Huggingface: https://huggingface.co/docs/transformers/tasks/sequence_classification <br>
Fine-tuning a model with the Trainer API, Huggingface: https://huggingface.co/course/chapter3/3?fw=pt <br>
Loading a Metric, Huggingface: https://huggingface.co/docs/datasets/v1.0.1/loading_metrics.html <br>
The POLUSA Dataset by Lukas Gebhard, Felix Hamborg: https://arxiv.org/abs/2005.14024 <br>
Imbalanced Data: https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data <br>
Multi-Label Classification on Unhealthy Comments, YouTube: https://www.youtube.com/watch?v=vNKIg8rXK6w <br>
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011 <br>
Convert pandas dataframe to datasetDict: https://stackoverflow.com/questions/71618974/convert-pandas-dataframe-to-datasetdict