# Overview of Large Language Models (LLMs)

### What are LLMs?

Large Language Models (LLMs) are a class of machine learning models designed to understand and generate human-like text. These models, such as GPT-4, Llama-3, BERT, and T5, are deep neural networks built primarily on the Transformer architecture. They are trained on vast amounts of text data, which makes them capable of a wide range of natural language processing (NLP) tasks such as text completion, translation, and summarization.

### Real-World Applications

- **Text Generation**: Creating coherent and contextually relevant text for chatbots, virtual assistants, and content creation.
- **Machine Translation**: Translating text from one language to another with high accuracy.
- **Text Classification**: Categorizing text into predefined labels, useful in sentiment analysis, spam detection, etc.
- **Question Answering**: Providing answers to user queries based on context.

### Evolution of LLMs

- **Early Models (Pre-2017)**: RNNs and LSTMs dominated the NLP landscape but struggled with long-range dependencies.
- **Transformers (2017 Onwards)**: Vaswani et al. introduced the Transformer architecture in the now-famous paper "Attention Is All You Need", which overcame many limitations of earlier models.
- **Recent Advances**: Large-scale pre-training of foundation models, followed by fine-tuning and specialized architectures for specific downstream tasks.

### Challenges and Opportunities

- **Challenges**: High computational cost, energy consumption, biases in training data, limited interpretability, hallucinations.
- **Opportunities**: Interaction with applications through human language, speed-up of labour-intensive tasks.


### Overview: Fine-Tuning a Pretrained Model Using Hugging Face

Let's walk through the usual steps of fine-tuning a model.

In [1]:
# imports
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

In [2]:
# Load dataset
# We'll use the IMDb dataset, loaded from a shared directory instead of downloading it from Hugging Face
dataset = load_dataset("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/stanfordnlp--imdb")

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [3]:
# On VSC use this path instead:
# dataset = load_dataset("/gpfs/data/fs70824/LLMs_models_datasets/datasets/stanfordnlp--imdb")

In [4]:
# Split into train and test sets
train_dataset = dataset['train']
test_dataset = dataset['test']

In [5]:
# Load pretrained tokenizer
# Using a BERT-based model for sequence classification
model_dir = "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_dir)

In [6]:
# On VSC use this path instead:
# model_dir = "/gpfs/data/fs70824/LLMs_models_datasets/models/google--bert-base-uncased"
# tokenizer = AutoTokenizer.from_pretrained(model_dir)

In [7]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]
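
To make the effect of `padding="max_length"` and `truncation=True` more concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token IDs. The helper `pad_and_truncate`, the pad ID of 0, and the tiny `max_length` are illustrative assumptions, not part of the notebook. The real tokenizer does this internally, uses the model's own maximum length, and also takes care of special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(input_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it to exactly max_length.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    ids = list(input_ids[:max_length])   # truncation: drop everything past max_length
    attention_mask = [1] * len(ids)      # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill up to max_length
    attention_mask += [0] * n_pad        # 0 marks padding the model should ignore
    return {"input_ids": ids, "attention_mask": attention_mask}


# A short sequence gets padded, a long one gets truncated
short = pad_and_truncate([101, 2023, 102], max_length=6)
long = pad_and_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Because every example ends up with the same length, the tokenized dataset can be stacked into fixed-size tensors, at the cost of spending compute on padding for short reviews.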

In [8]:
# Remove unnecessary columns
train_dataset = train_dataset.remove_columns(["text"])
test_dataset = test_dataset.remove_columns(["text"])

In [9]:
# Set format for PyTorch
train_dataset.set_format("torch")
test_dataset.set_format("torch")

In [10]:
train_dataset

Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 25000
})

In [11]:
# Load pretrained model
model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=2)

# Define training arguments and trainer
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.




In [12]:
# Fine-tune the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2779        | 0.261601        |
| 2     | 0.1151        | 0.276983        |


TrainOutput(global_step=6250, training_loss=0.23531618347167968, metrics={'train_runtime': 1426.4942, 'train_samples_per_second': 35.051, 'train_steps_per_second': 4.381, 'total_flos': 1.3155552768e+16, 'train_loss': 0.23531618347167968, 'epoch': 2.0})
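
The numbers reported in `TrainOutput` above can be cross-checked with a little arithmetic from the dataset size and the training arguments. The sketch below assumes a single device and no gradient accumulation (neither is stated explicitly in the notebook); the variable names are only illustrative.

```python
import math

num_train_examples = 25_000   # rows in the IMDb train split
per_device_batch_size = 8     # per_device_train_batch_size
num_epochs = 2                # num_train_epochs
train_runtime_s = 1426.4942   # 'train_runtime' from the TrainOutput above

# Assumption: one device, no gradient accumulation
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

# Throughput derived from the reported runtime
samples_per_second = num_train_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the step count and throughput agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output.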

In [13]:
# Evaluating the model
trainer.evaluate()

{'eval_loss': 0.2769826352596283,
 'eval_runtime': 148.0398,
 'eval_samples_per_second': 168.873,
 'eval_steps_per_second': 21.109,
 'epoch': 2.0}
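
The evaluation output above contains only the loss, because no metric function was passed to the `Trainer`. One possible way to also report accuracy is a `compute_metrics` function, sketched below. The function body and the dummy logits and labels are assumptions made up for illustration, not results from the model trained here.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # pick the class with the highest logit
    return {"accuracy": float((predictions == labels).mean())}


# Stand-alone check with made-up logits for four reviews (not real model output)
dummy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
dummy_labels = np.array([0, 1, 0, 0])
print(compute_metrics((dummy_logits, dummy_labels)))
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate()`.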

## Conclusion

In this notebook, we briefly introduced the concept of Large Language Models, their applications, and their evolution. We then walked through the usual steps of fine-tuning a pretrained BERT model for sentiment classification on the IMDb dataset with Hugging Face: loading the data, tokenizing it, and training and evaluating the model with the `Trainer` API.

In [14]:
# Shut down the kernel
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

{'status': 'ok', 'restart': False}