<a href="https://colab.research.google.com/github/jmhalawi/boom.ai/blob/main/Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Setting up

Installing the libraries (uncomment if needed)

In [17]:
#! pip install datasets transformers

To share your model on the Hugging Face Hub later, log in first (install huggingface_hub if needed)

In [18]:
#! pip install huggingface_hub

In [19]:
from huggingface_hub import notebook_login

notebook_login()


In [20]:
#!apt install git-lfs

Set up the list of GLUE benchmark tasks

In [21]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

Set the three parameters: the task, the model checkpoint, and the batch size

In [22]:
task = "cola" ###select a task from the list
model_checkpoint = "distilbert-base-uncased" ###the model name
batch_size = 16 ###can adjust batch size to avoid out-of-memory errors

#Load the dataset

In [23]:
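#Load the datasets library functions
#Note: load_metric is deprecated in recent datasets releases; the evaluate library's evaluate.load is its replacement
from datasets import load_dataset, load_metric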

In [None]:
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task) #"glue" is the name of dataset for this case
metric = load_metric('glue', actual_task)
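
The `mnli-mm` entry in `GLUE_TASKS` is not loaded under its own name: the cell above maps it back to `mnli`, and the mismatched validation split is selected later when the trainer is built. A minimal sketch of that name resolution, using a hypothetical helper `resolve_glue_task` (an illustration only, not part of the notebook or of the `datasets` library) that also rejects unknown task names:

```python
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

def resolve_glue_task(task):
    """Return the GLUE configuration name to load for a task in GLUE_TASKS.

    Hypothetical helper for illustration; the notebook does this inline.
    """
    if task not in GLUE_TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {GLUE_TASKS}")
    # "mnli-mm" is loaded as "mnli"; only the validation split chosen later differs.
    return "mnli" if task == "mnli-mm" else task

print(resolve_glue_task("mnli-mm"))  # mnli
print(resolve_glue_task("cola"))     # cola
```

Failing early on a misspelled task name avoids a less obvious error when the dataset is requested.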

Define a function that displays a random sample of examples from a dataset

In [25]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
#Call the function on the training split
show_random_elements(dataset["train"])

,sentence,label,idx
0,I wonder what who bought?,unacceptable,455
1,Which version did they recommend?,acceptable,4746
2,"We'll go together, us.",unacceptable,1845
3,Mary has more than two friends.,acceptable,5548
4,John placed him busy.,unacceptable,3892
5,The monkeys seem want to leave the meeting.,unacceptable,3658
6,The king kept put his gold under the bathtub.,unacceptable,3875
7,The issue was dealt with promptly.,acceptable,4676
8,Fanny stopped talking when in came Aunt Norris.,unacceptable,6778
9,They ignored the suggestion that Lee made.,acceptable,4990
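
The loop in `show_random_elements` keeps drawing indices until it has `num_examples` distinct ones. The standard library's `random.sample` draws without replacement in a single call, so the same selection can be sketched as below. The helper name `pick_unique_indices` and its `seed` parameter are assumptions added for illustration; 8551 is the size of the CoLA training split reported by this notebook.

```python
import random

def pick_unique_indices(n_items, num_examples=10, seed=None):
    """Pick num_examples distinct indices from range(n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, so a fixed seed gives repeatable picks
    return rng.sample(range(n_items), num_examples)

picks = pick_unique_indices(8551, num_examples=10, seed=0)
print(len(picks), len(set(picks)))  # 10 10
```

The resulting list could be passed to `dataset[picks]` in the same way as the list built by the original loop.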


Compute the metric on random predictions to check that it works. Note that load_metric has already loaded the metric associated with your task (Matthews correlation for CoLA)

In [26]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'matthews_correlation': -0.054187192118226604}
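
Random predictions against random labels give a Matthews correlation close to 0, which is what the output above shows. To make the number less opaque, here is a from-scratch sketch of the coefficient for binary labels. It is not the implementation behind `load_metric`; the function name and the choice to return 0.0 when the denominator is zero are assumptions made for this example.

```python
import math

def matthews_correlation(predictions, references):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # true negatives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

labels = [1, 0, 1, 1, 0, 0]
print(matthews_correlation(labels, labels))                   # 1.0
print(matthews_correlation([1 - y for y in labels], labels))  # -1.0
```

A score of 1.0 means perfect agreement, -1.0 means every prediction is inverted, and values near 0 mean the predictions carry no information about the labels.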

#Preprocessing the data

##Tokenize the data

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [28]:
#Names of the columns containing the sentence/s

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

In [29]:
#Show the first training example, then define the tokenization function

sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

Sentence: Our friends won't buy this analysis, let alone the next one we propose.


In [30]:
#Apply the function to all the sentences

encoded_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/8551 [00:00<?, ? examples/s]

Map:   0%|          | 0/1043 [00:00<?, ? examples/s]

Map:   0%|          | 0/1063 [00:00<?, ? examples/s]
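
With `batched=True`, `map` calls `preprocess_function` on a dictionary that maps each column name to a list of values, and the lists it returns become new columns. The sketch below imitates that contract without downloading anything. `toy_tokenize` is a hypothetical stand-in for the real tokenizer (whitespace splitting plus truncation), and the final dictionary merge only mimics what `map` does with the returned columns.

```python
def toy_tokenize(sentences, max_length=8):
    """Hypothetical stand-in for the tokenizer: lowercase, split on spaces, truncate."""
    return {"tokens": [s.lower().split()[:max_length] for s in sentences]}

def preprocess_batch(examples, sentence1_key="sentence"):
    # `examples` is a dict of columns, so examples[sentence1_key] is a list of strings.
    return toy_tokenize(examples[sentence1_key])

# Two rows taken from the sample table in this notebook
# (0 = unacceptable, 1 = acceptable).
batch = {
    "sentence": ["I wonder what who bought?", "Mary has more than two friends."],
    "label": [0, 1],
}
encoded = preprocess_batch(batch)
merged = {**batch, **encoded}  # mimic map() adding the returned columns

print(merged["tokens"][0])  # ['i', 'wonder', 'what', 'who', 'bought?']
```

Because the function works on whole lists, one call handles many rows at once, which is what lets the fast tokenizer process the dataset quickly.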

#Fine-tuning the Model

##Downloading the pretrained model

In [32]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2 ###always 2 except for mnli and stsb
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classi
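
The chained conditional that sets `num_labels` is compact but easy to misread. A sketch of the same rule as a function, where the name `num_labels_for` is an assumption made for this example:

```python
def num_labels_for(task):
    """Number of output units the classification head needs for a GLUE task."""
    if task.startswith("mnli"):
        return 3  # three-way classification (covers both "mnli" and "mnli-mm")
    if task == "stsb":
        return 1  # regression: a single similarity score
    return 2      # every other task in GLUE_TASKS is binary

for name in ("cola", "mnli-mm", "stsb"):
    print(name, num_labels_for(name))
# cola 2
# mnli-mm 3
# stsb 1
```

The single output for `stsb` is why `compute_metrics` later reads `predictions[:, 0]` for that task instead of taking an argmax.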

##Instantiate a trainer

Define the training arguments

In [43]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)

Define the function that computes the metrics from the model's predictions

In [44]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
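
The model returns one row of raw scores (logits) per example, so `compute_metrics` has to turn each row into a single prediction before scoring it. A dependency-free sketch of that conversion, written with plain lists instead of NumPy arrays; the function name and the sample logits are made up for illustration:

```python
def logits_to_predictions(logits, task="cola"):
    """Convert per-example model outputs into predictions for the metric."""
    if task == "stsb":
        # Regression: the single output value is the prediction itself.
        return [row[0] for row in logits]
    # Classification: index of the largest score (first one wins on ties).
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

print(logits_to_predictions([[0.2, 1.5], [2.0, -1.0]]))      # [1, 0]
print(logits_to_predictions([[3.7], [1.2]], task="stsb"))    # [3.7, 1.2]
```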

##Passing the model, arguments, and encoded dataset to the Trainer

In [45]:
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

RepositoryNotFoundError: ignored

##Fine-tune the model

In [38]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.5214,0.534495,0.412288
2,0.3461,0.498686,0.502552
3,0.2352,0.532655,0.532088
4,0.1806,0.731957,0.538383
5,0.131,0.796613,0.538018


Several commits (2) will be pushed upstream.
Several commits (3) will be pushed upstream.
Several commits (4) will be pushed upstream.
Several commits (5) will be pushed upstream.


TrainOutput(global_step=2675, training_loss=0.2721077771053136, metrics={'train_runtime': 270.5307, 'train_samples_per_second': 158.041, 'train_steps_per_second': 9.888, 'total_flos': 229309863736728.0, 'train_loss': 0.2721077771053136, 'epoch': 5.0})
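
The `global_step=2675` in the output can be checked by hand: 8551 training examples in batches of 16 give 535 steps per epoch (the last batch is partial), and 5 epochs give 2675 steps. The sketch below does that arithmetic. It assumes a single device, no gradient accumulation, and that the final partial batch is kept; the function name is hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # partial last batch counts
    return steps_per_epoch * num_epochs

print(total_training_steps(8551, 16, 5))  # 2675
```

If the batch size is lowered to avoid out-of-memory errors, the step count rises accordingly, for example 1069 steps per epoch at a batch size of 8.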

Evaluate the model

In [39]:
trainer.evaluate()

{'eval_loss': 0.7319574952125549,
 'eval_matthews_correlation': 0.5383825234212567,
 'eval_runtime': 1.5909,
 'eval_samples_per_second': 655.605,
 'eval_steps_per_second': 41.486,
 'epoch': 5.0}
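
The evaluation loss above (0.7320) matches epoch 4 in the training table rather than epoch 5. That is consistent with `load_best_model_at_end=True` and `metric_for_best_model` set to the Matthews correlation: the checkpoint with the highest score is reloaded after training. A sketch of that selection over the logged values, where the `history` list is transcribed from the table and `best_epoch` is a hypothetical helper, not a `Trainer` API:

```python
# Per-epoch results transcribed from the training table above.
history = [
    {"epoch": 1, "eval_loss": 0.534495, "matthews_correlation": 0.412288},
    {"epoch": 2, "eval_loss": 0.498686, "matthews_correlation": 0.502552},
    {"epoch": 3, "eval_loss": 0.532655, "matthews_correlation": 0.532088},
    {"epoch": 4, "eval_loss": 0.731957, "matthews_correlation": 0.538383},
    {"epoch": 5, "eval_loss": 0.796613, "matthews_correlation": 0.538018},
]

def best_epoch(rows, metric_name, greater_is_better=True):
    """Return the row with the best value of metric_name."""
    choose = max if greater_is_better else min
    return choose(rows, key=lambda row: row[metric_name])

best = best_epoch(history, "matthews_correlation")
print(best["epoch"], best["eval_loss"])  # 4 0.731957
```

Selecting on validation loss instead would pick epoch 2, so the choice of `metric_for_best_model` changes which checkpoint ends up being evaluated and pushed.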

In [42]:
trainer.push_to_hub()

Several commits (7) will be pushed upstream.
The progress bars may be unreliable.
batch response: Authorization error.
error: failed to push some refs to 'https://user:<REDACTED_TOKEN>@huggingface.co/jmhalawi/FineTuning_TextClassification'

Error pushing update to the model card. Please read logs and retry.

