<a href="https://colab.research.google.com/github/jpcoleman1/Udacity-GenAI/blob/main/udacity_genai_project_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
import numpy as np
from datasets import load_metric
from transformers import Trainer
from peft import LoraModel, LoraConfig
from peft import get_peft_model, TaskType

# Load the dataset: AG News from the Hugging Face Hub
dataset = load_dataset("ag_news", split={'train': 'train', 'test': 'test'})

splits = ["train", "test"]




# Prepare foundation model

## Tokenize dataset

In [2]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")




## Split dataset

In [3]:
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(5000))
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

## Load Pre-trained model

In [4]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4,  # AG News has 4 labels
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train foundational model

In [5]:
# Set up training arguments

training_args = TrainingArguments(
    output_dir="./data/ag_news",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
)

In [6]:
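# Define the evaluation metric.
# Note: load_metric is deprecated in recent releases of `datasets`;
# the `evaluate` library (evaluate.load("accuracy")) is its replacement.

metric = load_metric("accuracy")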

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
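
As a minimal sketch of what `compute_metrics` does with the model outputs, the snippet below applies the same `np.argmax(..., axis=-1)` step to a small array and computes accuracy by hand. The logits and labels are made-up illustrative values, not outputs from this notebook.

```python
import numpy as np

# Hypothetical logits for 3 examples over the 4 AG News classes (illustrative values only)
logits = np.array([
    [2.1, 0.3, -1.0, 0.2],
    [0.1, 0.2, 3.5, 0.4],
    [0.0, 1.2, 0.3, 1.1],
])
labels = np.array([0, 2, 3])  # hypothetical reference labels

# Same reduction as in compute_metrics: pick the highest-scoring class per row
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the references
accuracy = float((predictions == labels).mean())
print(predictions.tolist(), accuracy)
```

The third example is misclassified because its largest logit sits at index 1 while the reference label is 3, so two of the three predictions count as correct.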


In [7]:
# Create trainer instance

trainer = Trainer(
    model=model,  # base DistilBERT classifier; all weights are trainable
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)



In [8]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.345617,0.89
2,0.395100,0.322644,0.897


TrainOutput(global_step=626, training_loss=0.3628858651596898, metrics={'train_runtime': 172.0857, 'train_samples_per_second': 58.111, 'train_steps_per_second': 3.638, 'total_flos': 1324721233920000.0, 'train_loss': 0.3628858651596898, 'epoch': 2.0})
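
The `global_step=626` reported above can be reproduced from the training settings. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept, none of which the notebook states explicitly.

```python
import math

num_train_examples = 5000          # size of train_dataset after select(range(5000))
per_device_train_batch_size = 16   # from TrainingArguments
num_train_epochs = 2               # from TrainingArguments

# Assumption: one device, no gradient accumulation, partial final batch kept
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```

This also explains the `No log` entry for epoch 1: the `Trainer` logs the training loss every `logging_steps` steps, which defaults to 500, and the first epoch ends before that many steps have run.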

In [9]:
evaluation_results = trainer.evaluate()
print(evaluation_results)

{'eval_loss': 0.32264429330825806, 'eval_accuracy': 0.897, 'eval_runtime': 4.5243, 'eval_samples_per_second': 221.031, 'eval_steps_per_second': 3.536, 'epoch': 2.0}


Fine-tuning all weights of the base model for two epochs reaches 89.7% accuracy on the 1,000-example evaluation subset. This result serves as the benchmark for LoRA fine-tuning.

## Apply LoRA fine tuning

In [10]:
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_lin", "k_lin","v_lin"],
    lora_dropout=0.01,
    task_type=TaskType.SEQ_CLS # Sequence classification task
)

# Loading the model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4,
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
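
To illustrate what `r=8` and `lora_alpha=32` mean for one adapted linear layer, here is a minimal NumPy sketch of the LoRA forward pass. It is not PEFT's implementation: the weights are random stand-ins, dropout is omitted, and the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 8, 32   # DistilBERT hidden size, LoRA rank, LoRA alpha

W = rng.normal(size=(d, d))          # frozen pretrained weight (random stand-in)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero at initialisation

x = rng.normal(size=(2, d))          # a batch of 2 hypothetical hidden states
scale = alpha / r                    # scaling applied to the low-rank update

# Frozen path plus scaled low-rank update: x W^T + scale * x A^T B^T
y = x @ W.T + scale * (x @ A.T @ B.T)

# With B initialised to zero, the adapted layer starts out identical to the frozen one
starts_unchanged = bool(np.allclose(y, x @ W.T))
print(scale, y.shape, A.size + B.size, starts_unchanged)
```

Only `A` and `B` are trained, which is 12,288 values per adapted layer here, compared with 589,824 in the frozen 768 x 768 weight.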


In [11]:
model = get_peft_model(model, config)
model.config.id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"} # ensure custom labels carry through
model.config.label2id = {"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3}


model.print_trainable_parameters()

trainable params: 814,852 || all params: 67,771,400 || trainable%: 1.202353795258767
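
The trainable-parameter count printed above can be checked by hand. The sketch below assumes that, in addition to the LoRA matrices on `q_lin`, `k_lin` and `v_lin`, the classification head (`pre_classifier` and `classifier`) is also trained. That assumption is inferred from the numbers matching, not stated in the notebook.

```python
hidden_size = 768       # DistilBERT hidden dimension
num_layers = 6          # transformer blocks in distilbert-base-uncased
rank = 8                # r in LoraConfig
num_labels = 4          # AG News classes
targets_per_layer = 3   # q_lin, k_lin, v_lin

# Each adapted linear layer gets A (rank x hidden) and B (hidden x rank)
lora_per_module = rank * hidden_size + hidden_size * rank
lora_params = lora_per_module * targets_per_layer * num_layers

# Assumption: the head is trainable too (weights plus biases)
pre_classifier_params = hidden_size * hidden_size + hidden_size
classifier_params = hidden_size * num_labels + num_labels
head_params = pre_classifier_params + classifier_params

trainable = lora_params + head_params
all_params = 67_771_400  # total reported by print_trainable_parameters()
trainable_pct = 100 * trainable / all_params

print(lora_params, head_params, trainable, round(trainable_pct, 4))
```

Under this reading, the LoRA matrices account for 221,184 of the trainable parameters and the classification head for the remaining 593,668.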


# Perform lightweight tuning

In [12]:
# Set up training arguments

training_args = TrainingArguments(
    output_dir="./data/ag_news",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
)


In [13]:
# Create trainer instance

trainer = Trainer(
    model=model,  # PEFT-wrapped model returned by get_peft_model
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)




## Train the model

In [14]:
trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.488027,0.873
2,0.744400,0.400649,0.877


Checkpoint destination directory ./data/ag_news/checkpoint-313 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/ag_news/checkpoint-626 already exists and is non-empty. Saving will proceed but saved results may be invalid.


TrainOutput(global_step=626, training_loss=0.6773065865611116, metrics={'train_runtime': 127.9557, 'train_samples_per_second': 78.152, 'train_steps_per_second': 4.892, 'total_flos': 1349753487360000.0, 'train_loss': 0.6773065865611116, 'epoch': 2.0})

## Evaluate trained model

In [15]:
evaluation_results = trainer.evaluate()
print(evaluation_results)

{'eval_loss': 0.4006485641002655, 'eval_accuracy': 0.877, 'eval_runtime': 4.7929, 'eval_samples_per_second': 208.643, 'eval_steps_per_second': 3.338, 'epoch': 2.0}


After LoRA fine-tuning, the model reaches 87.7% accuracy on the 1,000-example evaluation subset. This is 2 percentage points below the 89.7% achieved by training the full base model, while updating only about 1.2% of the parameters. The LoRA result could likely be improved with hyperparameter tuning and more epochs.
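
To put the gap between the two accuracies in context, the rough sketch below estimates the sampling uncertainty of each figure with a binomial standard error. It treats the two evaluations as independent, which is a simplifying assumption: both models were scored on the same 1,000 examples, so a paired comparison would be more appropriate.

```python
import math

def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)

n = 1000                               # size of the evaluation subset
full_acc, lora_acc = 0.897, 0.877      # eval_accuracy values reported above

se_full = accuracy_standard_error(full_acc, n)
se_lora = accuracy_standard_error(lora_acc, n)

gap = full_acc - lora_acc
# Assumption: independent estimates, so the variances add
se_gap = math.sqrt(se_full**2 + se_lora**2)

print(round(gap, 3), round(se_full, 4), round(se_lora, 4), round(se_gap, 4))
```

With an evaluation set this small, each accuracy carries an uncertainty of roughly one percentage point, so the observed gap is of the same order as the noise under these assumptions.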

## Save trained model

In [16]:
model.save_pretrained("./results/ag_news_fine_tuned")
tokenizer.save_pretrained("./results/ag_news_fine_tuned")


('./results/ag_news_fine_tuned/tokenizer_config.json',
 './results/ag_news_fine_tuned/special_tokens_map.json',
 './results/ag_news_fine_tuned/vocab.txt',
 './results/ag_news_fine_tuned/added_tokens.json',
 './results/ag_news_fine_tuned/tokenizer.json')

# Load fine-tuned model (if necessary)

In [17]:
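# Load model. The directory holds the LoRA adapter saved by PEFT rather than
# full model weights, so the base checkpoint named in the adapter config is
# loaded first and the adapter is then applied on top of it.
model_path = "./results/ag_news_fine_tuned"

model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=4,
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
)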
tokenizer = AutoTokenizer.from_pretrained(model_path)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Inference

In [18]:
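import torch

def predict(text, model, tokenizer):
    # Tokenize a string (or list of strings) into PyTorch tensors
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
    model.eval()  # make sure dropout is disabled
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    # Pick the highest-scoring class per input and map it to its label name
    predictions = outputs.logits.argmax(-1).tolist()
    return [model.config.id2label[prediction] for prediction in predictions]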


sample_text = "The stock market closed lower today after a volatile trading session."
print(predict(sample_text, model, tokenizer))


['Business']
