#### BERT Classification for Determining News Publication

In this notebook we will fine-tune a pre-trained BERT model (using DistilBERT for efficiency) to determine the publication of a news article based on its clean text (in the `clean_article` column)

In [14]:
!pip install datasets
!pip install transformers
!pip install evaluate



## Part 1: Setup and Mounting Google Drive

First, mount your Google Drive so that we can load the dataset files from your Drive.

In [15]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Part 2: Loading and Preprocessing the Data
`date, year, month, day, author, title, url, section, publication, split, clean_article`

In our case, we want to train on the `clean_article` column and predict the publication. To do this, we need to:

1. Load the dataset.
2. Map the `publication` column (a string) to an integer label.
3. Tokenize the `clean_article` text.

In [16]:
from datasets import load_dataset

# Path to the single CSV file containing all the data.
data_file = '/content/drive/MyDrive/NLP/NLP_Group_Project/Data/all-the-news-2-1-SMALL-CLEANED.csv'

# Load the CSV file into a dataset.
# Note: When loading a single CSV file, Hugging Face places all data into the 'train' split by default.
news_data = load_dataset('csv', data_files=data_file)

# Now, use the 'split' column to filter the dataset into training and test sets.
# (Assumes that the 'split' column contains the strings 'train' or 'test')
train_data = news_data['train'].filter(lambda example: example['split'].lower() == 'train')
test_data  = news_data['train'].filter(lambda example: example['split'].lower() == 'test')

# You can print the number of examples in each split to verify.
print("Number of training examples:", len(train_data))
print("Number of test examples:", len(test_data))

Number of training examples: 90000
Number of test examples: 10000


In [17]:
# Extract the training split and get unique publication names
unique_pubs = list(set(train_data['publication']))
unique_pubs.sort()
print("Unique publications:", unique_pubs)

# Create a mapping from publication names to integer labels
pub2label = {pub: i for i, pub in enumerate(unique_pubs)}
print("Mapping (publication -> label):", pub2label)

Unique publications: ['Buzzfeed News', 'CNN', 'Economist', 'Fox News', 'People', 'Politico', 'Reuters', 'The Hill', 'The New York Times', 'Vice']
Mapping (publication -> label): {'Buzzfeed News': 0, 'CNN': 1, 'Economist': 2, 'Fox News': 3, 'People': 4, 'Politico': 5, 'Reuters': 6, 'The Hill': 7, 'The New York Times': 8, 'Vice': 9}


In [18]:
# Define a function that adds a new field "label" based on publication
def encode_labels(example):
    example['labels'] = pub2label[example['publication']]
    return example

# Apply the mapping to both train and test splits
train_data = train_data.map(encode_labels)
test_data = test_data.map(encode_labels)

# Check one example from the training data
print(train_data[0])
print(test_data[0])

{'date': '2016-04-29 00:00:00', 'year': 2016, 'month': 4.0, 'day': 29, 'author': 'Catherine Rampell', 'title': '‘Bayside! The Musical!,’ a Parody From Bob and Tobly McSmith', 'url': 'http://www.nytimes.com/2013/10/09/theater/reviews/bayside-the-musical-parody-from-bob-and-tobly-mcsmith.html', 'section': 'theater', 'publication': 'The New York Times', 'split': 'train', 'clean_article': ' a love of [NAME] and slap bracelets, [NAME] share an almost aggressive nostalgia for the harebrained sitcom “Saved by the Bell.” So perhaps it was inevitable: “Bayside! The Musical!” — a bawdy, ridiculous, unauthorized parody of the show — is now playing to feverishly enthusiastic, standing-room-only crowds in the East Village. Attending “Bayside!” can seem like a midnight screening of “The Rocky Horror Picture Show”: there are many inside jokes and familiar call-and-response cues. And plenty of audience members dress up like their favorite characters. (Though the sartorial twinning at “Bayside!” is mor

# Part 3: Tokenization
We will use the DistilBERT tokenizer. Note that we tokenize the content from the `clean_article` column instead of a generic `text` field.


In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Define a preprocessing (tokenization) function
def preprocess_function(examples):
    # Tokenize the clean_article field, use truncation,
    # and optionally pad to a maximum length (here 512 tokens).
    return tokenizer(examples["clean_article"], truncation=True, padding="max_length", max_length=512)

# %%
# Apply the tokenization function to the dataset splits (using batched processing)
tokenized_train = train_data.map(preprocess_function, batched=True)
tokenized_test = test_data.map(preprocess_function, batched=True)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [20]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [21]:
import numpy as np
from evaluate import load

def compute_metrics(eval_pred):
    accuracy_metric = load("accuracy")
    f1_metric = load("f1")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # Use the weighted F1 score for multi-class classification
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')["f1"]
    return {"accuracy": accuracy, "f1": f1}

In [22]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = len(unique_pubs)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=num_labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NLP/Results/BERT_publication",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,  # You can increase this if needed
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to="none"  # disables logging to external platforms like Weights & Biases
)

In [24]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

  trainer = Trainer(


In [25]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.1968,0.181792,0.9414,0.941965


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.79k [00:00<?, ?B/s]

TrainOutput(global_step=5625, training_loss=0.3569199937608507, metrics={'train_runtime': 4572.3182, 'train_samples_per_second': 19.684, 'train_steps_per_second': 1.23, 'total_flos': 1.1923766784e+16, 'train_loss': 0.3569199937608507, 'epoch': 1.0})

### Results

I only got to print accuracy before colab kicked me off :(

In [26]:
eval_results = trainer.evaluate()
print("Evaluation results:", eval_results)

Evaluation results: {'eval_loss': 0.18179155886173248, 'eval_accuracy': 0.9414, 'eval_f1': 0.9419654109868348, 'eval_runtime': 158.2161, 'eval_samples_per_second': 63.205, 'eval_steps_per_second': 3.95, 'epoch': 1.0}


In [27]:
from sklearn.metrics import classification_report
import pandas as pd

# Run predictions
predictions = trainer.predict(tokenized_test)

# Get predicted and true labels
y_true = predictions.label_ids
y_pred = predictions.predictions.argmax(axis=1)

# Assume your label2id and id2label mappings look something like this:
# e.g., label2id = {'BuzzFeed News': 0, 'Fox News': 1, ...}
# Make sure you have:
id2label = model.config.id2label

# Generate classification report
report = classification_report(y_true, y_pred, target_names=[id2label[i] for i in range(len(id2label))], output_dict=True)

# Convert to DataFrame
df_report = pd.DataFrame(report).T.reset_index()
df_report = df_report.rename(columns={'index': 'Publication', 'precision': 'Precision', 'recall': 'Recall', 'f1-score': 'F1'})

# Filter out the "accuracy" row for individual rows first, we’ll handle top-line accuracy separately
df_per_class = df_report[df_report['Publication'].isin(id2label.values())][['Publication', 'Precision', 'Recall', 'F1']]

# Format and add Top-Line row
top_line = {
    "Publication": "**Top‑Line**",
    "Precision": report["weighted avg"]["precision"],
    "Recall": report["weighted avg"]["recall"],
    "F1": report["weighted avg"]["f1-score"]
}
df_per_class = df_per_class.append(top_line, ignore_index=True)

# Format as markdown table
print(df_per_class.to_markdown(index=False))


AttributeError: DistilBertTokenizerFast has no attribute label2id