Models available on the Hugging Face hub can serve as a good starting point, but they may not have been trained specifically for the task we are focused on, nor have they necessarily encountered data similar to ours. While these models possess general language knowledge, they may lack the specific task knowledge required for our purposes. Therefore, fine-tuning these models on our own data is an effective way to adapt them to our needs.

<br>
<a target="_blank" href="https://colab.research.google.com/drive/1QogIeR13OmhBs9fif_cycNACscxb5Bt9#scrollTo=EjyJfZJhQ1Uc">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [2]:
!pip install -q  scikit-learn datasets tokenizers sentencepiece protobuf accelerate transformers torch

In [3]:
# set random seed for reproducibility
import numpy as np

SEED_GLOBAL = 42
np.random.seed(SEED_GLOBAL)

In [4]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/luissattelmayer/intro-css/refs/heads/main/data/tweets_annotations.csv")

df = df[["id", "text", "label"]]

df

# Create a label_text column: 1 means the tweet is about the economy, 0 means it is not

df["label_text"] = df["label"].apply(lambda x: "economy" if x == 1 else "other")

df.value_counts("label_text")

label_text  count
other       547
economy     178


## Split into training and test set

As in classical supervised learning, fine-tuning a pre-trained model requires splitting the dataset into training and test sets, which we do with train_test_split from the sklearn library. Stratifying on the label keeps the share of economy tweets roughly the same in both sets.

In [5]:
# Split into training and test set

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, random_state=SEED_GLOBAL, stratify=df['label'])

# label distribution train set
print("Train set label distribution:\n", df_train.label_text.value_counts(), "\n")
# label distribution test set
print("Test set label distribution:\n", df_test.label_text.value_counts())

Train set label distribution:
 label_text
other      438
economy    142
Name: count, dtype: int64 

Test set label distribution:
 label_text
other      109
economy     36
Name: count, dtype: int64
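
As a quick sanity check on the stratified split, the sketch below recomputes the share of economy tweets in each set. The counts are copied by hand from the outputs above, and the helper name `economy_share` is only illustrative.

```python
# Counts copied from the outputs above (full data, train split, test split).
counts = {
    "full":  {"other": 547, "economy": 178},
    "train": {"other": 438, "economy": 142},
    "test":  {"other": 109, "economy": 36},
}

def economy_share(split_counts):
    """Return the fraction of tweets labelled 'economy' in one split."""
    total = sum(split_counts.values())
    return split_counts["economy"] / total

shares = {name: economy_share(c) for name, c in counts.items()}
for name, share in shares.items():
    print(f"{name:>5}: {share:.3f}")
```

All three shares should come out close to one quarter, which is what stratification is meant to guarantee.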


## Load a model

The next step is to load a pre-trained model from the Hugging Face Hub, which we will fine-tune for our classification task using our labeled dataset. For this, we do not use the pipeline but instead load the model and its associated tokenizer. Here, I load DeBERTa, a BERT-style encoder, but it could be any suitable model available on the platform.

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch
import numpy as np

## load a model and its tokenizer
model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# Define the label mapping explicitly so it matches the integer "label" column
label2id = {
    'Other': 0,    # tweets that are not about the economy
    'Economy': 1,  # tweets about the economy
}

# Create the inverse mapping
id2label = {v: k for k, v in label2id.items()}

# Print the mappings to verify
print("Label to ID mapping:", label2id)  # This should print: {'Other': 0, 'Economy': 1}
print("ID to Label mapping:", id2label)  # This should print: {0: 'Other', 1: 'Economy'}

config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id))
print("\n", label2id, "\n")

# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True)

# use GPU (cuda) if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
model.to(device)




Label to ID mapping: {'Other': 0, 'Economy': 1}
ID to Label mapping: {0: 'Other', 1: 'Economy'}

 {'Other': 0, 'Economy': 1} 




Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device: cuda


DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=768, out_features=768, bias=True)
              (key_proj): Linear(in_features=768, out_features=768, bias=True)
              (value_proj): Linear(in_features=768, out_features=768, bias=True)
              (pos_dropout): Dropout(p=0.1, inplace=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): Layer

## Tokenize the data

Transformer models expect text that has been tokenized with the model's own tokenizer, which is what we do below for both our training and test sets. We cap inputs at 512 tokens, the usual limit for BERT-style models, so any longer text is truncated. Since we are working with tweets, this does not pose a problem.

In [7]:
# convert pandas dataframes to Hugging Face dataset object to facilitate pre-processing

import datasets

dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train),
    "test": datasets.Dataset.from_pandas(df_test)
})


# tokenize
def tokenize(examples):
  return tokenizer(examples["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)


Map:   0%|          | 0/580 [00:00<?, ? examples/s]

Map:   0%|          | 0/145 [00:00<?, ? examples/s]
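
To check the claim that truncation does not affect tweets, one could count how many texts exceed the limit before truncation. The sketch below is one way to do that. The helper `truncation_report` is an illustrative name, and `whitespace_tokenizer` is a hypothetical stand-in that only exists so the example runs without downloading a model; it does not split text the way DeBERTa's tokenizer does.

```python
def truncation_report(texts, tokenizer, max_length=512):
    """Count how many texts exceed max_length tokens when tokenized without truncation."""
    encoded = tokenizer(texts)  # no truncation here, so the true lengths are visible
    lengths = [len(ids) for ids in encoded["input_ids"]]
    too_long = sum(length > max_length for length in lengths)
    return {"max_tokens": max(lengths), "n_truncated": too_long}


def whitespace_tokenizer(texts):
    """Hypothetical stand-in tokenizer: splits on whitespace only."""
    return {"input_ids": [text.split() for text in texts]}


sample = ["Inflation is rising again.", "Great game last night!"]
report = truncation_report(sample, whitespace_tokenizer, max_length=512)
print(report)
```

With the real objects from this notebook, the call would be something like truncation_report(df["text"].tolist(), tokenizer), since the Hugging Face tokenizer also returns an "input_ids" entry with one list per text.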

## Training the model

Before training the model, we need to specify several key training arguments using the TrainingArguments class. Creating this object does not train the model; it only sets the configuration for how the training process will be conducted. These arguments control various aspects of training, such as the number of epochs, batch sizes, learning rate, and evaluation strategy. Once the training arguments are defined, they are passed to the Trainer, which runs the training.

- Epochs: Number of times the model will iterate over the entire training set. More epochs can improve performance up to a point, but they increase training time and the risk of overfitting.

- Learning rate: Step size used in optimization. A smaller learning rate makes training more stable but slower, while a larger one may speed up convergence at the risk of unstable training.

- per_device_train_batch_size: Number of training examples processed per device at each step. A larger batch size speeds up training but requires more memory. Reduce it if you encounter memory issues.

- per_device_eval_batch_size: Batch size for evaluation. It affects only speed and memory use, not the results, so reduce it if you run into memory problems during evaluation.


In [8]:
import transformers
from transformers import TrainingArguments, Trainer, logging

training_directory = "BERT-demo"

# Overview of all training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
# Hugging Face tips to increase training speed and decrease out-of-memory (OOM) issues: https://huggingface.co/transformers/performance.html?
train_args = TrainingArguments(
    num_train_epochs=4,  # this can be increased, but higher values increase training time.
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # if you get an out-of-memory error, reduce this value to 8 or 4 and restart the runtime. Higher values increase training speed, but also increase memory requirements. Ideal values here are always a multiple of 8.
    per_device_eval_batch_size=64,  # if you get an out-of-memory error, reduce this value, e.g. to 40 and restart the runtime
    #gradient_accumulation_steps=2, # Can be used in case of memory problems: accumulates gradients over X steps before each optimizer update, so a smaller per-device batch keeps the same effective batch size. decreases memory usage, but also slightly speed. (!adapt/halve batch size accordingly)
    warmup_ratio=0.06,  # share of training steps used to ramp the learning rate up from 0; 0.06 is a common default for BERT-base-sized models
    weight_decay=0.1,
    seed=SEED_GLOBAL,
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # must match a key returned by the compute_metrics function defined below
    #fp16=fp16_bool,  # Can speed up training and reduce memory consumption, but only makes sense at batch-size > 8. loads two copies of model weights, which creates overhead. https://huggingface.co/transformers/performance.html?#fp16
    #fp16_full_eval=fp16_bool,
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"; recent transformers releases name this argument eval_strategy
    #eval_steps=10_000,  # evaluate every n steps if evaluation_strategy=='steps'. defaults to logging_steps
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch"
    #save_steps=10_000,              # Number of updates steps before two checkpoint saves.
    #save_total_limit=10,             # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
    #logging_strategy="steps",
    report_to="all",  # "all"  # logging
    #push_to_hub=False,
    #push_to_hub_model_id=f"{model_name}-finetuned-{task}",
    output_dir=f'./results/{training_directory}',
    logging_dir=f'./logs/{training_directory}',
)
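
To see what these settings imply in practice, the sketch below works out the number of optimizer steps and the learning rate over time. It assumes a single device, no gradient accumulation, and the Trainer's default linear schedule (linear warmup followed by linear decay); the function name `linear_schedule` is only illustrative, and 580 is the training set size from the split above.

```python
import math

# Values from this notebook; single device and no gradient accumulation are assumptions.
n_train = 580
batch_size = 16
epochs = 4
peak_lr = 2e-5
warmup_ratio = 0.06

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch of each epoch is smaller
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def linear_schedule(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

print(f"steps per epoch: {steps_per_epoch}, total: {total_steps}, warmup: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule(step))
```

Halving the batch size roughly doubles the number of steps, which is one reason batch size and learning rate are usually tuned together.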



Before training a model, it’s crucial to define how the model’s performance will be evaluated during the training process. This typically involves specifying which evaluation metrics will be used to monitor the model's progress. Evaluation metrics are important because they provide feedback on how well the model is learning, guiding decisions such as when to stop training or adjust hyperparameters.


In [9]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import warnings
import numpy as np

def compute_metrics_standard(eval_pred):
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")

        labels = eval_pred.label_ids
        pred_logits = eval_pred.predictions
        preds_max = np.argmax(pred_logits, axis=1)  # argmax over each row (axis=1) gives the predicted class id

        # Calculate metrics
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds_max, average='weighted')  # 'weighted' averages the per-class scores by class frequency
        accuracy = accuracy_score(labels, preds_max)

        # Return desired metrics
        metrics = {
            'accuracy': accuracy,
            'f1': f1,
            'precision': precision,
            'recall': recall,
        }

        return metrics
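
To make the inputs of this function concrete, here is a small NumPy-only sketch on made-up logits. It reproduces the argmax step and contrasts the weighted F1 used above with the macro F1, which gives both classes equal weight. The logits, labels, and the helper `f1_for_class` are illustrative assumptions, not values from the model.

```python
import numpy as np

# Made-up logits for 6 examples and 2 classes (0 = Other, 1 = Economy).
logits = np.array([
    [ 2.0, -1.0],
    [ 1.5,  0.2],
    [ 0.3,  0.9],
    [ 1.0, -0.5],
    [-0.4,  1.2],
    [ 0.8,  0.1],
])
labels = np.array([0, 0, 0, 0, 1, 1])

# Same step as in compute_metrics_standard: pick the class with the largest logit.
preds = np.argmax(logits, axis=1)

def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, computed from true positives, false positives and false negatives."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

classes = np.unique(labels)
f1_per_class = np.array([f1_for_class(labels, preds, c) for c in classes])
support = np.array([np.sum(labels == c) for c in classes])

f1_macro = f1_per_class.mean()                            # every class counts equally
f1_weighted = np.average(f1_per_class, weights=support)   # classes weighted by frequency
accuracy = np.mean(preds == labels)

print(f"accuracy={accuracy:.3f}  f1_weighted={f1_weighted:.3f}  f1_macro={f1_macro:.3f}")
```

On imbalanced data such as ours, the weighted score is pulled toward the majority class, so the macro score is the stricter of the two for the economy class.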


Now we can train the model by creating a Trainer object, which bundles all the components necessary for training. This includes the model (which we loaded from Hugging Face), its tokenizer, the training arguments we defined earlier, the training and test datasets, and the evaluation metrics. Once everything is set up, we launch the training process by calling trainer.train().

In [None]:
# training
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics_standard
)

trainer.train()


The model is then evaluated on the test set with trainer.evaluate(), which uses the eval_dataset passed to the Trainer.

In [None]:
results = trainer.evaluate()
print(results)
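
When reading these results, it helps to compare them with a trivial baseline. The sketch below computes what a classifier that always predicts "other" would score on the test set, using the counts from the split output above. It is a point of reference only, not an output of the model.

```python
# Test set counts from the split above.
n_other, n_economy = 109, 36
n_total = n_other + n_economy

# Baseline: always predict "other".
baseline_accuracy = n_other / n_total

# F1 for "other": every "other" tweet is found (recall 1), but all economy tweets are false positives.
f1_other = 2 * n_other / (2 * n_other + n_economy)
# F1 for "economy": never predicted, so it is 0.
f1_economy = 0.0

baseline_f1_weighted = (n_other * f1_other + n_economy * f1_economy) / n_total
baseline_f1_macro = (f1_other + f1_economy) / 2

print(f"accuracy={baseline_accuracy:.3f}  "
      f"f1_weighted={baseline_f1_weighted:.3f}  f1_macro={baseline_f1_macro:.3f}")
```

A fine-tuned model is only useful if it clearly beats these numbers, especially on F1.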

## Inference

In [None]:
from transformers import pipeline

economy_classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    device=device,
)

In [None]:
# Import tweet data

full_corpus = pd.read_csv("https://www.dropbox.com/s/dpu5m3xqz4u4nv7/tweets_house_rep_party.csv?dl=1")
full_corpus

In [None]:
from tqdm import tqdm # Import library to have progress bars

def apply_model_to_text(df, text_column, classifier):
    # Initialize tqdm for progress bar
    tqdm.pandas()  # This allows the progress_apply method to work

    # Apply the classifier to each text, with a progress bar; the result is a Series of {'label', 'score'} dicts
    results = df[text_column].progress_apply(lambda x: classifier(x)[0])  # classifier(x) returns a list with one dict per input

    # Create new columns for the label and score
    df['label'] = results.apply(lambda x: x['label'])  # Extract the label
    df['score'] = results.apply(lambda x: x['score'])  # Extract the score

    return df

# Example usage
full_corpus_classified = apply_model_to_text(full_corpus, 'text', economy_classifier)
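
The label and score columns come from the model's logits. As a rough sketch of that last step, and assuming the pipeline applies a softmax over the two classes (its usual behaviour for single-label classification), the conversion looks like this. The logits are made up and `logits_to_prediction` is an illustrative helper, not part of the transformers API.

```python
import numpy as np

id2label = {0: "Other", 1: "Economy"}  # same mapping as defined when loading the model

def logits_to_prediction(logits):
    """Turn one row of logits into a {'label', 'score'} dict via softmax."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return {"label": id2label[best], "score": float(probs[best])}

example = logits_to_prediction([0.2, 1.3])  # made-up logits
print(example)
```

The score is the probability of the predicted class only, so with two classes it always lies between 0.5 and 1; values near 0.5 mark tweets the model is unsure about.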
