<a href="https://colab.research.google.com/github/ilsilfverskiold/smaller-models-docs/blob/main/nlp/cook/fine-tune/albert_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification with Transformers (ALBERT)

This script helps you fine-tune a pre-trained encoder model (ALBERT) for text classification, using a dataset from Hugging Face.

The use case is binary classification: the model learns to tell clickbait from factual content, using a synthetic dataset found [here](https://huggingface.co/datasets/ilsilfverskiold/clickbait_titles_synthetic_data). This script follows a tutorial that you can find here.

You may use any encoder model such as BERT, RoBERTa and DeBERTa instead.

In [None]:
!pip install -U datasets
!pip install -U accelerate
!pip install -U transformers
!pip install -U huggingface_hub

In [None]:
!pip install scikit-learn

In [None]:
!pip install sentencepiece

In [14]:
!pip install protobuf

Collecting protobuf
  Downloading protobuf-5.27.3-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Downloading protobuf-5.27.3-cp38-abi3-manylinux2014_x86_64.whl (309 kB)
Installing collected packages: protobuf
Successfully installed protobuf-5.27.3

Import the dataset you'll be training on. This dataset has a 'text' field and a 'label' field. Be sure to tweak the script if your column names differ.

In [1]:
import pandas as pd

In [2]:
# Load the dataset
set_1_df = pd.read_csv('set_1_df.csv')  # Adjust the file path as needed

# Display the first few rows of the dataset
set_1_df.head()

Unnamed: 0,text,label
0,cpl.org.pe,0
1,faithandreason.com,0
2,kimhauser.ch,0
3,the national association for honesty in medici...,1
4,lokoml.cz,0


In [3]:
# Load the dataset
set_2_df = pd.read_csv('set_2_df.csv')  # Adjust the file path as needed

# Display the first few rows of the dataset
set_2_df.head()

Unnamed: 0,text,label
0,CommentaryIt is time to refinance!Your credit ...,1
1,https://unigeol.quip.com/2JtcASZCsaZa/CLICK-HE...,1
2,canada update the following commercial have si...,0
3,eselworkshop.com,0
4,california update 3 / 27 / 01 executive summar...,0
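
Since the rest of the notebook assumes a string `text` column and a 0/1 `label` column, a quick schema check before converting to a `Dataset` can catch problems early. The following is a minimal sketch: the `check_schema` helper and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

REQUIRED_COLUMNS = ("text", "label")  # column names the notebook relies on


def check_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only rows with string text and a 0/1 label."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    is_text = df["text"].map(lambda value: isinstance(value, str))
    is_binary = df["label"].isin([0, 1])
    return df.loc[is_text & is_binary, list(REQUIRED_COLUMNS)].reset_index(drop=True)


# Made-up rows for illustration only
sample = pd.DataFrame(
    {"text": ["example.org", None, "Click here to win!"], "label": [0, 1, 1]}
)
clean = check_schema(sample)
print(clean)
```

The row with a missing `text` value is dropped, which is the same condition the notebook filters on later before tokenizing.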


In [4]:
from datasets import load_dataset, DatasetDict, Dataset
from sklearn.model_selection import train_test_split

# dataset = load_dataset('json', data_files='df_sampled.json')
# dataset = load_dataset("ealvaradob/phishing-dataset", "combined_reduced", trust_remote_code=True)

# Convert dataset to a pandas DataFrame
# df = dataset['train'].to_pandas()

# Split the DataFrame into train and test sets
# train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert the train and test DataFrames back to Hugging Face Datasets
train_dataset = Dataset.from_pandas(set_1_df, preserve_index=False)
test_dataset = Dataset.from_pandas(set_2_df, preserve_index=False)

# Combine the train and test Datasets into a DatasetDict
dataset_split = DatasetDict({
    'train': train_dataset,
    'test': test_dataset
})



In [5]:
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20268
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5076
    })
})
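
If you have a single file rather than two pre-split CSVs, the commented-out `train_test_split` call in the cell above is the alternative. Here is a small sketch of that route under stated assumptions: the DataFrame `df` is made up, and the `stratify` argument is an addition (not in the original cell) that keeps the class ratio the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 5 per class
df = pd.DataFrame({
    "text": [f"sample text {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# 80/20 split; stratify keeps the label proportions equal across splits
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)
print(len(train_df), len(test_df))
print(test_df["label"].value_counts().to_dict())
```

The resulting frames can be passed to `Dataset.from_pandas(..., preserve_index=False)` exactly as in the cell above.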

Decide on your pre-trained model along with your new model's name.

In [6]:
model_name = "microsoft/mdeberta-v3-base"
your_path = 'scenario_3'

Look over the distribution of your labels (optional).

In [7]:
from collections import Counter

train_label_distribution = Counter(train_dataset['label'])
test_label_distribution = Counter(test_dataset['label'])

print("Training Label Distribution:", train_label_distribution)
print("Test Label Distribution:", test_label_distribution)

Training Label Distribution: Counter({0: 10290, 1: 9978})
Test Label Distribution: Counter({1: 2589, 0: 2487})
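
To make the counts easier to judge, they can be turned into class shares. This sketch reuses the counts printed above; the `class_shares` helper is a hypothetical name introduced for illustration.

```python
from collections import Counter

# Counts copied from the output above
train_counts = Counter({0: 10290, 1: 9978})
test_counts = Counter({1: 2589, 0: 2487})


def class_shares(counts: Counter) -> dict:
    """Return each label's fraction of the split."""
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)
print({label: round(share, 3) for label, share in train_shares.items()})
print({label: round(share, 3) for label, share in test_shares.items()})
```

Both splits are close to a 50/50 balance, so plain accuracy is a reasonable headline metric here; with a heavily skewed split you would lean more on the per-label numbers.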


Create a label encoder that converts categorical labels to a standardized numerical format. Labels in their original categorical form (e.g., 'clickbait', 'factual') must be converted to integers before the model can use them. The labels in this dataset are already 0 and 1, so the encoder leaves them unchanged.

In [8]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

label_encoder.fit(train_dataset['label'])

def encode_labels(example):
    return {'encoded_label': label_encoder.transform([example['label']])[0]}

for split in dataset_split:
    dataset_split[split] = dataset_split[split].map(encode_labels, batched=False)

Map:   0%|          | 0/20268 [00:00<?, ? examples/s]

Map:   0%|          | 0/5076 [00:00<?, ? examples/s]
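
To see what the encoder does when labels really are strings, here is a small sketch using the hypothetical 'clickbait' and 'factual' labels mentioned above (the sample list is made up).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, as in the clickbait example mentioned above
raw_labels = ["factual", "clickbait", "factual", "clickbait"]

encoder = LabelEncoder()
encoder.fit(raw_labels)

encoded = encoder.transform(raw_labels)        # strings -> integer ids
decoded = encoder.inverse_transform(encoded)   # integer ids -> strings

print(list(encoder.classes_))  # classes are stored in sorted order
print(encoded.tolist())
print(decoded.tolist())
```

Because `classes_` is sorted, the integer assigned to a label depends on alphabetical order, not on the order in which labels first appear in the data.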

The `id2label` and `label2id` mappings in the model config tell the model which name belongs to each class ID, so inference returns the actual label names rather than numeric IDs.

In [10]:
from transformers import AutoConfig

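# Collect the distinct label values from the training split
# (iterating the DatasetDict itself would only yield the split names)
unique_labels = sorted(set(dataset_split['train']['encoded_label']))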
# id2label = {i: label for i, label in enumerate(unique_labels)}
# label2id = {label: i for i, label in enumerate(unique_labels)}

id2label = {0: "benign", 1: "phishing"}
label2id = {"benign": 0, "phishing": 1}

config = AutoConfig.from_pretrained(model_name)
config.id2label = id2label
config.label2id = label2id

# Verify the correct labels
print("ID to Label Mapping:", config.id2label)
print("Label to ID Mapping:", config.label2id)

config.json:   0%|          | 0.00/579 [00:00<?, ?B/s]

ID to Label Mapping: {0: 'benign', 1: 'phishing'}
Label to ID Mapping: {'benign': 0, 'phishing': 1}
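
The two dictionaries must be exact inverses of each other. One way to guarantee that, sketched here with a hypothetical `label_names` list, is to derive both from a single ordered list of names.

```python
label_names = ["benign", "phishing"]  # position in the list = class id

id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

# Sanity check: mapping an id to a name and back returns the same id
round_trip_ok = all(label2id[id2label[i]] == i for i in id2label)

print(id2label)
print(label2id)
print(round_trip_ok)
```

With more than two classes, only `label_names` needs to change.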


The next cell loads a tokenizer and a model from the Hugging Face Transformers library. This run uses the checkpoint set in `model_name` (`microsoft/mdeberta-v3-base`) together with `DebertaV2ForSequenceClassification`. For another model such as ALBERT, use `AutoTokenizer` and `AutoModelForSequenceClassification`, which pick the matching tokenizer and model class for the checkpoint.

In [18]:
from transformers import AutoTokenizer, AutoModel, AutoConfig, DebertaV2ForSequenceClassification

# tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
# tokenizer("Hello world")["input_ids"]

# tokenizer(" Hello world")["input_ids"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = DebertaV2ForSequenceClassification.from_pretrained(model_name, config=config)



pytorch_model.bin:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The next cell drops any rows whose `text` is not a string, then tokenizes the remaining text (padded and truncated to 256 tokens) so the dataset is ready for training.

In [19]:
def filter_invalid_content(example):
    return isinstance(example['text'], str)

dataset = dataset_split.filter(filter_invalid_content, batched=False)

# def encode_data(batch):
#     tokenized_inputs = tokenizer(batch["text"], padding=True, truncation=True, max_length=256)
#     tokenized_inputs["labels"] = batch["encoded_label"]
#     return tokenized_inputs

# dataset_encoded = dataset.map(encode_data, batched=True)
# # dataset_encoded


def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=256)

# Map the tokenization function over the filtered dataset
dataset_encoded = dataset.map(tokenize_function, batched=True)



Filter:   0%|          | 0/20268 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5076 [00:00<?, ? examples/s]

Map:   0%|          | 0/20268 [00:00<?, ? examples/s]

Map:   0%|          | 0/5076 [00:00<?, ? examples/s]

In [20]:
# Set the format for PyTorch
dataset_encoded.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])



In [21]:
# Print the dataset to check the splits
print(dataset_encoded)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 20268
    })
    test: Dataset({
        features: ['text', 'label', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5076
    })
})
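
To illustrate what `padding='max_length'`, `truncation=True` and `max_length` do in the tokenization step, here is a toy sketch in plain Python. It is a simplification under stated assumptions: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer also handles special tokens, which this version ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of padding='max_length' combined with truncation=True.

    Returns (input_ids, attention_mask), both exactly max_length long.
    """
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)                   # padding still needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is why the `input_ids` and `attention_mask` columns above can be stacked directly into tensors.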


The DataCollatorWithPadding ensures that all input sequences in a batch are padded to the same length, using the padding logic defined by the tokenizer.

In [22]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
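
Padding per batch ("dynamic padding") can be sketched in plain Python as follows. The `pad_batch` helper and the token ids are hypothetical; the real collator delegates to the tokenizer's own padding logic.

```python
def pad_batch(batch, pad_id=0):
    """Toy dynamic padding: pad every sequence to the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[5, 6, 7], [8], [9, 10]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Note that the tokenization step in this notebook already pads every example to 256 tokens, so the collator has nothing left to pad here. Dynamic padding only saves compute if you tokenize without `padding='max_length'`.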

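Next we set up a `LabelEncoder` and define a function that computes per-label accuracy from a confusion matrix. During training we want to see accuracy per label as well as the averaged metrics, which matters most when you have more than two labels and one is underperforming. Each per-label metric is named after the class the encoder was fitted on; in the training log recorded in this notebook the columns appear as `Accuracy Label Test` (class 0) and `Accuracy Label Train` (class 1), because that run's encoder was fitted on the sorted split names rather than on the label values.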

In [23]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix
import numpy as np

label_encoder = LabelEncoder()
label_encoder.fit(unique_labels)

def per_label_accuracy(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    correct_predictions = cm.diagonal()
    label_totals = cm.sum(axis=1)
    per_label_acc = np.divide(correct_predictions, label_totals, out=np.zeros_like(correct_predictions, dtype=float), where=label_totals != 0)
    return dict(zip(labels, per_label_acc))
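
A small worked example may help show what this function computes. The sketch below rebuilds the confusion matrix by hand with NumPy on made-up labels, rather than calling scikit-learn, so each step is visible.

```python
import numpy as np

# Made-up labels: three examples of class 0, two of class 1
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1])
labels = [0, 1]

# Confusion matrix: rows are true classes, columns are predicted classes
cm = np.zeros((len(labels), len(labels)), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[labels.index(t), labels.index(p)] += 1

correct = cm.diagonal()   # correctly classified examples per class
totals = cm.sum(axis=1)   # number of true examples per class
per_label_acc = {label: correct[i] / totals[i] for i, label in enumerate(labels)}

print(cm.tolist())
print(per_label_acc)
```

Dividing the diagonal by the row sums means the "per-label accuracy" here is the same quantity as per-class recall.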

Next we set up our compute metrics. Here I've set up several, but you may reduce them if need be. You can read more on these metrics [here](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9).

In [24]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    decoded_labels = label_encoder.inverse_transform(labels)
    decoded_preds = label_encoder.inverse_transform(preds)

    precision = precision_score(decoded_labels, decoded_preds, average='weighted')
    recall = recall_score(decoded_labels, decoded_preds, average='weighted')
    f1 = f1_score(decoded_labels, decoded_preds, average='weighted')
    acc = accuracy_score(decoded_labels, decoded_preds)

    labels_list = list(label_encoder.classes_)
    per_label_acc = per_label_accuracy(decoded_labels, decoded_preds, labels_list)

    per_label_acc_metrics = {}
    for label, accuracy in per_label_acc.items():
        label_key = f"accuracy_label_{label}"
        per_label_acc_metrics[label_key] = accuracy

    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall,
        **per_label_acc_metrics
    }
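
The `average='weighted'` setting weights each class's score by its number of true examples. A dependency-free sketch on made-up labels shows one consequence: weighted recall always equals accuracy, which is why the Accuracy and Recall columns in this notebook's training log are identical.

```python
from collections import Counter

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

support = Counter(y_true)  # number of true examples per class


def recall_for(cls):
    """Fraction of true `cls` examples that were predicted as `cls`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / support[cls]


# average='weighted': per-class score weighted by class support
weighted_recall = sum(recall_for(c) * support[c] for c in support) / len(y_true)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(weighted_recall, accuracy)
```

Weighted precision and F1 do not collapse this way, so they carry the extra information in the metrics dictionary.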

Lastly, we set up the training arguments and train the model. I'm following the paper ["How to Fine-Tune BERT for Text Classification?"](https://arxiv.org/abs/1905.05583) for epochs, batch size and learning rate, but do play around with them if you want to.

While it is training, keep an eye on training loss and validation loss. Both should decrease consistently. If validation loss keeps increasing while training loss keeps falling, you may be overfitting your model, and you can try decreasing the number of epochs.

In [25]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=your_path,
    num_train_epochs=3,
    warmup_steps=500,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=100,
    learning_rate=2e-5,
    save_steps=1000,
    gradient_accumulation_steps=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_encoded['train'],
    eval_dataset=dataset_encoded['test'],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()



Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Accuracy Label Test,Accuracy Label Train
100,0.5918,0.540752,0.758668,0.745036,0.820086,0.758668,0.528347,0.979915
200,0.0971,0.057711,0.984634,0.984635,0.984813,0.984634,0.993969,0.975666
300,0.0549,0.065681,0.983058,0.983059,0.983413,0.983058,0.996381,0.970259
400,0.0279,0.052318,0.988377,0.988378,0.988521,0.988377,0.996783,0.980301
500,0.0613,0.047961,0.985619,0.985613,0.985911,0.985619,0.972658,0.998069
600,0.0427,0.062442,0.98621,0.986211,0.986488,0.98621,0.99799,0.974894
700,0.0322,0.018487,0.995863,0.995863,0.995865,0.995863,0.994773,0.99691
800,0.0398,0.082428,0.982467,0.982468,0.982969,0.982467,0.998392,0.967169
900,0.01,0.033574,0.993302,0.993302,0.99334,0.993302,0.997587,0.989185
1000,0.0241,0.021316,0.994681,0.994681,0.994687,0.994681,0.996381,0.993048
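
To check the validation loss for a sustained upward trend, the logged values can be inspected programmatically. This is a sketch using the numbers from the log above; the `consecutive_increases` helper is a hypothetical name.

```python
# (step, validation_loss) pairs copied from the log above
eval_log = [
    (100, 0.540752), (200, 0.057711), (300, 0.065681), (400, 0.052318),
    (500, 0.047961), (600, 0.062442), (700, 0.018487), (800, 0.082428),
    (900, 0.033574), (1000, 0.021316),
]

# Evaluation step with the lowest validation loss
best_step, best_loss = min(eval_log, key=lambda pair: pair[1])


def consecutive_increases(log):
    """Longest run of evaluations where validation loss rose step over step."""
    longest = current = 0
    for (_, prev), (_, curr) in zip(log, log[1:]):
        current = current + 1 if curr > prev else 0
        longest = max(longest, current)
    return longest


print(best_step, best_loss)
print(consecutive_increases(eval_log))
```

In the logged steps the validation loss fluctuates but never rises for more than one evaluation in a row, so there is no sustained overfitting trend in this part of the run.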


TrainOutput(global_step=1899, training_loss=0.07372614171337817, metrics={'train_runtime': 836.8511, 'train_samples_per_second': 72.658, 'train_steps_per_second': 2.269, 'total_flos': 7993457212661760.0, 'train_loss': 0.07372614171337817, 'epoch': 2.997632202052092})
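
The reported `global_step=1899` can be reproduced from the training arguments. The arithmetic below is a sketch that assumes a single device, so that one dataloader batch holds 16 examples.

```python
import math

num_examples = 20268              # training rows, from the DatasetDict above
per_device_batch_size = 16
gradient_accumulation_steps = 2
num_train_epochs = 3
warmup_steps = 500

# Assumption: a single device, so one dataloader batch = 16 examples
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
# One optimizer update happens every `gradient_accumulation_steps` batches
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

print(batches_per_epoch, updates_per_epoch, total_updates)
print(effective_batch_size, round(warmup_steps / total_updates, 3))
```

Two things follow from this: the effective batch size is 32 rather than 16, and the 500 warmup steps cover roughly a quarter of all optimizer updates, which is worth keeping in mind if you shorten training.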

Once you're finito, you can evaluate the results, save your model and the state.

In [26]:
trainer.evaluate()
trainer.save_model(your_path)
trainer.save_state()

If you want to test it out, you can run the pipeline directly with the model. I just used some new example titles to see how it did.

In [27]:
from transformers import pipeline
pipe = pipeline('text-classification', model='scenario_3')

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [28]:
example_titles = [
    "Sex up ur mobile with a FREE sexy pic of Aho! Just text BABE to 88600. Then every wk get a sexy celeb! PocketBabe.co.uk 4 more pics. 16 \u00a33/wk 087016248",
    "Pity.  Reading that woman's ad and knowing Rohit for years, they sound like a match made in heaven.  But why, oh, why, keep that shaved-head photo on prominent display???  There are lots of photos of Rohit looking rather dashing, and with the crucial hair feature enabled!R",
]

for title in example_titles:
    result = pipe(title)
    print(f"Title: {title}")
    print(f"Output: {result[0]['label']}")

Title: Sex up ur mobile with a FREE sexy pic of Aho! Just text BABE to 88600. Then every wk get a sexy celeb! PocketBabe.co.uk 4 more pics. 16 £3/wk 087016248
Output: phishing
Title: Pity.  Reading that woman's ad and knowing Rohit for years, they sound like a match made in heaven.  But why, oh, why, keep that shaved-head photo on prominent display???  There are lots of photos of Rohit looking rather dashing, and with the crucial hair feature enabled!R
Output: benign
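
Besides `label`, each pipeline result also carries a `score`. Roughly, for a single-label classifier the pipeline applies a softmax to the model's logits and looks up the winning class in `id2label`. The sketch below imitates that step in plain Python; the logits are hypothetical, and `to_prediction` is an illustrative helper rather than a library function.

```python
import math

id2label = {0: "benign", 1: "phishing"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Pick the highest-probability class and report its label and score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits; real values come from the model
prediction = to_prediction([-1.2, 2.3])
print(prediction)
```

Printing `result[0]['score']` alongside the label in the loop above is a cheap way to spot predictions the model is unsure about.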


If you're satisfied, you can log in to HuggingFace with a token (you'll find these in your account under Settings - make sure it has write access).

In [31]:
!huggingface-cli login --token YOUR_HF_TOKEN

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
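
Since a token typed into a notebook cell is saved with the notebook, one option is to read it from an environment variable instead. This is a minimal sketch: `read_hf_token` and the `DEMO_HF_TOKEN` variable name are hypothetical, and the placeholder value is set here only so the example runs.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in an environment variable, or raise if unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token


# Placeholder value for demonstration only; set your real token outside the notebook
os.environ["DEMO_HF_TOKEN"] = "hf_example_placeholder"
token = read_hf_token("DEMO_HF_TOKEN")
print(token.startswith("hf_"))
# The token could then be passed on, e.g. huggingface_hub.login(token=token)
```

This keeps the secret out of the saved notebook and out of any repository the notebook is pushed to.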


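Push the model to the Hub. Note that the first argument of `trainer.push_to_hub` is the commit message, not the repository name: the trainer pushes to a repository named after the `output_dir` you set when you trained it (or `hub_model_id`, if set), so whatever you put there doesn't change the name. `tokenizer.push_to_hub`, by contrast, takes the repository ID.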

In [32]:
tokenizer.push_to_hub("jordan2889/scenario_3")
trainer.push_to_hub("jordan2889/scenario_3")

tokenizer.json:   0%|          | 0.00/16.4M [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.11k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/jordan2889/scenario_3/commit/2431ad2e38089e35a651368502c6e3005adebcd5', commit_message='jordan2889/scenario_3', commit_description='', oid='2431ad2e38089e35a651368502c6e3005adebcd5', pr_url=None, pr_revision=None, pr_num=None)

Now you're done. You've got your text classifier.