# BERT Classification for Determining News Publication

In this notebook, we fine-tune a pre-trained DistilBERT model (a smaller, faster variant of BERT) to predict which publication a news article came from, based on its cleaned text (the `clean_article` column). The dataset is a single CSV with about 14k training and 1k test articles for each of 10 publications.

In [1]:
!pip install datasets
!pip install transformers
!pip install evaluate



## Part 1: Setup and Mounting Google Drive

First, mount your Google Drive so that we can load the dataset files from your Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Part 2: Loading and Preprocessing the Data
The dataset has the following columns:

`date, year, month, day, author, title, url, section, publication, split, clean_article`

In our case, we want to train on the `clean_article` column and predict the publication. To do this, we need to:

1. Load the dataset.
2. Map the `publication` column (a string) to an integer label.
3. Tokenize the `clean_article` text.

In [3]:
from datasets import load_dataset

# Path to the single CSV file containing all the data.
data_file = '/content/drive/MyDrive/NLP/NLP_Group_Project/Data/all-the-news-2-1-SMALL-CLEANED.csv'

# Load the CSV file into a dataset.
# Note: When loading a single CSV file, Hugging Face places all data into the 'train' split by default.
news_data = load_dataset('csv', data_files=data_file)

# Now, use the 'split' column to filter the dataset into training and test sets.
# (Assumes that the 'split' column contains the strings 'train' or 'test')
train_data = news_data['train'].filter(lambda example: example['split'].lower() == 'train')
test_data  = news_data['train'].filter(lambda example: example['split'].lower() == 'test')

# You can print the number of examples in each split to verify.
print("Number of training examples:", len(train_data))
print("Number of test examples:", len(test_data))

Number of training examples: 140000
Number of test examples: 10000
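
Beyond the total counts, it can help to confirm that each split is balanced across publications. The following is a minimal sketch on hypothetical toy rows (the `rows` list and the `count_by_publication` helper are illustrative assumptions, not part of the notebook); it applies the same case-insensitive `split` comparison as the filter above.

```python
from collections import Counter

# Toy rows standing in for the real CSV records (hypothetical data).
rows = [
    {"publication": "CNN", "split": "train"},
    {"publication": "CNN", "split": "Train"},
    {"publication": "Reuters", "split": "train"},
    {"publication": "Reuters", "split": "train"},
    {"publication": "CNN", "split": "test"},
    {"publication": "Reuters", "split": "TEST"},
]

def count_by_publication(records, split_name):
    """Count records per publication within one split (case-insensitive)."""
    return Counter(
        r["publication"] for r in records if r["split"].lower() == split_name
    )

train_counts = count_by_publication(rows, "train")
test_counts = count_by_publication(rows, "test")
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

On the real data, something like `Counter(train_data['publication'])` should give the equivalent per-publication counts.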


In [4]:
# Collect the unique publication names from the training split,
# sorted so that the label order is stable across runs.
unique_pubs = sorted(set(train_data['publication']))
print("Unique publications:", unique_pubs)

# Create a mapping from publication names to integer labels
pub2label = {pub: i for i, pub in enumerate(unique_pubs)}
print("Mapping (publication -> label):", pub2label)

Unique publications: ['Buzzfeed News', 'CNN', 'Economist', 'Fox News', 'People', 'Politico', 'Reuters', 'The Hill', 'The New York Times', 'Vice']
Mapping (publication -> label): {'Buzzfeed News': 0, 'CNN': 1, 'Economist': 2, 'Fox News': 3, 'People': 4, 'Politico': 5, 'Reuters': 6, 'The Hill': 7, 'The New York Times': 8, 'Vice': 9}
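
To turn integer predictions back into publication names later, the mapping can be inverted. This is a small sketch that copies the mapping printed above; the name `label2pub` is an assumption introduced here for illustration.

```python
# Mapping copied from the output above.
pub2label = {
    'Buzzfeed News': 0, 'CNN': 1, 'Economist': 2, 'Fox News': 3, 'People': 4,
    'Politico': 5, 'Reuters': 6, 'The Hill': 7, 'The New York Times': 8, 'Vice': 9,
}

# Invert it so that an integer prediction can be turned back into a name.
label2pub = {label: pub for pub, label in pub2label.items()}

# Sanity check: the mapping must be one-to-one for the inversion to be lossless.
assert len(label2pub) == len(pub2label)

print(label2pub[8])
```

If desired, dictionaries like these can also be passed to `from_pretrained` as `id2label` and `label2id` so that the saved model config carries the publication names.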


In [5]:
# Define a function that adds a new field "labels" based on publication
def encode_labels(example):
    example['labels'] = pub2label[example['publication']]
    return example

# Apply the mapping to both train and test splits
train_data = train_data.map(encode_labels)
test_data = test_data.map(encode_labels)

# Check one example from each split
print(train_data[0])
print(test_data[0])

{'date': '2016-04-29 00:00:00', 'year': 2016, 'month': 4.0, 'day': 29, 'author': 'Catherine Rampell', 'title': '‘Bayside! The Musical!,’ a Parody From Bob and Tobly McSmith', 'url': 'http://www.nytimes.com/2013/10/09/theater/reviews/bayside-the-musical-parody-from-bob-and-tobly-mcsmith.html', 'section': 'theater', 'publication': 'The New York Times', 'split': 'train', 'clean_article': 'Theater Review Along with a love of Lisa Frank Trapper Keepers and slap bracelets, Gen Y-ers share an almost aggressive nostalgia for the harebrained sitcom “Saved by the Bell.” So perhaps it was inevitable: “Bayside! The Musical!” — a bawdy, ridiculous, unauthorized parody of the show — is now playing to feverishly enthusiastic, standing-room-only crowds in the East Village. Attending “Bayside!” can seem like a midnight screening of “The Rocky Horror Picture Show”: there are many inside jokes and familiar call-and-response cues. And plenty of audience members dress up like their favorite characters. (Th

## Part 3: Tokenization
We will use the DistilBERT tokenizer. Note that we tokenize the content from the `clean_article` column instead of a generic `text` field.
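
The tokenization call in this part truncates long articles and pads short ones to a fixed length. To make that behavior concrete, here is a toy sketch. It uses a whitespace split and fake ids, so it is a simplified stand-in and not the WordPiece algorithm DistilBERT actually uses; `toy_encode` is a hypothetical helper.

```python
def toy_encode(text, max_length, pad_id=0):
    """Whitespace 'tokenizer' that truncates or pads to exactly max_length.

    Simplified stand-in for illustration; not DistilBERT's WordPiece tokenizer.
    """
    # Assign each word a fake id (its character length) purely for illustration.
    ids = [len(word) for word in text.split()]
    ids = ids[:max_length]                 # truncation
    attention_mask = [1] * len(ids)        # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding              # pad up to max_length
    attention_mask += [0] * padding        # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}

short = toy_encode("markets rallied today", max_length=5)
long = toy_encode("the senate voted late on tuesday night", max_length=5)
print(short)
print(long)
```

The short text gains two padding positions with a zero attention mask, while the long text loses everything after its fifth word.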


In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Define a preprocessing (tokenization) function
def preprocess_function(examples):
    # Tokenize the clean_article field, truncating to 512 tokens
    # (DistilBERT's maximum input length) and padding shorter articles to that length.
    return tokenizer(examples["clean_article"], truncation=True, padding="max_length", max_length=512)

# Apply the tokenization function to the dataset splits (using batched processing)
tokenized_train = train_data.map(preprocess_function, batched=True)
tokenized_test = test_data.map(preprocess_function, batched=True)


In [7]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
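
By default, `DataCollatorWithPadding` pads each batch to the length of its longest sequence (dynamic padding). The idea can be sketched in plain Python as follows; `pad_batch` and the token ids are hypothetical and only illustrate the concept, not the library's implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token-id sequences of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_batch(batch)
print(padded)
```

Because the tokenization step in this notebook already pads every article to 512 tokens, the collator has nothing left to pad here. Dropping `padding="max_length"` from the tokenizer call would let the collator pad per batch instead, which usually reduces wasted computation on short articles.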

In [8]:
import numpy as np
from evaluate import load

# Load the metrics once, rather than on every evaluation call.
accuracy_metric = load("accuracy")
f1_metric = load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # Use the weighted F1 score for multi-class classification
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')["f1"]
    return {"accuracy": accuracy, "f1": f1}
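
To show what the weighted F1 score measures, here is a hand-rolled sketch on made-up labels: it computes F1 per class and averages the scores weighted by each class's number of true examples. The `weighted_f1` helper is an illustrative assumption, not the `evaluate` library's code.

```python
from collections import Counter

def weighted_f1(predictions, references):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(references)  # number of true examples per class
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(p == cls and r == cls for p, r in zip(predictions, references))
        n_pred = sum(p == cls for p in predictions)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(references)

# Made-up labels for three classes.
refs = [0, 0, 0, 1, 1, 2]
preds = [0, 0, 1, 1, 1, 2]

accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
f1 = weighted_f1(preds, refs)
print(f"accuracy={accuracy:.4f}, weighted f1={f1:.4f}")
```

With roughly balanced classes, as in this dataset, the weighted F1 should stay close to the unweighted (macro) average.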

In [9]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = len(unique_pubs)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=num_labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NLP/Results/BERT_publication",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,  # You can increase this if needed
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    report_to="none"  # disables logging to external platforms like Weights & Biases
)
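
As a rough guide to how long training will run, the number of optimizer steps follows from the settings above and the training split size printed earlier. This back-of-the-envelope sketch assumes a single GPU and the default of no gradient accumulation; both are assumptions about the runtime, not something the notebook states.

```python
import math

num_train_examples = 140_000  # training split size printed earlier
batch_size = 16               # per_device_train_batch_size
num_epochs = 3                # num_train_epochs
num_devices = 1               # assumption: a single GPU
grad_accum_steps = 1          # assumption: TrainingArguments default

effective_batch = batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
```

Since `save_strategy="epoch"` is set, a checkpoint would be written to `output_dir` after each block of `steps_per_epoch` steps.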



In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [15]:
trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
eval_results = trainer.evaluate()
print("Evaluation results:", eval_results)
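
Once a trained model produces logits, they can be converted into a predicted publication name. The sketch below uses a made-up row of logits (not real model output) and relies on the fact that labels were assigned by position in the sorted publication list.

```python
import math

# Publication names in label order, copied from the output in Part 2.
unique_pubs = ['Buzzfeed News', 'CNN', 'Economist', 'Fox News', 'People',
               'Politico', 'Reuters', 'The Hill', 'The New York Times', 'Vice']

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one article (illustration only).
logits = [0.1, 0.3, -1.2, 0.0, -0.5, 0.2, 2.4, 0.9, 1.1, -0.7]

probs = softmax(logits)
predicted_label = max(range(len(probs)), key=probs.__getitem__)
predicted_pub = unique_pubs[predicted_label]
print(predicted_pub, round(probs[predicted_label], 3))
```

In a completed run, the logits would typically come from `trainer.predict(tokenized_test).predictions`, one row per test article.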