# Large Language Models (LLMs)
Large language models (LLMs) are a type of neural network. They use deep learning techniques, specifically transformer architectures, to process and generate human-like text at scale. Here's how they relate to neural networks:

## Architecture:
LLMs are built using neural network architectures, typically based on transformers. Transformers consist of multiple layers of neural network components, including self-attention mechanisms and feed-forward neural networks, that process sequential data such as text.
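
To make the self-attention component concrete, here is a minimal single-head sketch in NumPy. The toy sizes, the random weight matrices, and the function names are my own assumptions for illustration; a real transformer adds multiple heads, masking, residual connections, and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; x has shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8           # toy sizes chosen for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.sum(axis=-1))
```

Each output row is a weighted mix of all value vectors, which is how every token gets to "look at" every other token in the sequence.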

## Training:
LLMs are trained using deep learning techniques, which involve optimizing the parameters of the neural network model to minimize a loss function. During training, the model learns the statistical patterns and structures of natural language by processing vast amounts of text data.
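
The text above does not name a specific loss function. A common choice for language models is cross-entropy over the vocabulary, so the sketch below assumes that; the tiny vocabulary and the logits are made up for illustration.

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the target token at each position."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

vocab_size = 5
targets = np.array([2, 0, 3])

# A model with no preference: every token is equally likely.
uniform_logits = np.zeros((3, vocab_size))

# A model that puts most of its probability on the correct tokens.
confident_logits = np.full((3, vocab_size), -5.0)
confident_logits[np.arange(3), targets] = 5.0

loss_uniform = cross_entropy(uniform_logits, targets)
loss_confident = cross_entropy(confident_logits, targets)
print(loss_uniform, loss_confident)
```
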

## Representation Learning:
LLMs learn to represent and encode textual information in distributed vector representations, often referred to as embeddings. These embeddings capture semantic and syntactic information about words and phrases in the input text, enabling the model to understand and generate coherent text.
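
As a rough illustration of what "capturing semantic information" means, the sketch below compares hand-picked toy vectors with cosine similarity. The three-word vocabulary and the vector values are invented for this example; a trained model would learn much higher-dimensional vectors.

```python
import numpy as np

vocab = {"good": 0, "great": 1, "terrible": 2}

# Hand-picked toy vectors, one row per word (hypothetical values).
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [-0.7, 0.1, 0.3],
])

def embed(word):
    # An embedding lookup is just row indexing into the embedding matrix.
    return embeddings[vocab[word]]

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(embed("good"), embed("great"))
sim_far = cosine_similarity(embed("good"), embed("terrible"))
print(sim_close, sim_far)
```

Words with similar meaning end up with a similarity near 1, while words with opposite sentiment point in different directions.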

## Fine-tuning:
Similar to other neural network models, LLMs can be fine-tuned on specific tasks or domains to improve their performance. Fine-tuning involves updating the parameters of the pre-trained LLM on a smaller dataset relevant to the target task, allowing the model to adapt its representations and predictions accordingly.

## Scalability:
One of the key advantages of LLMs is their scalability, which is achieved through parallelization and distributed computing techniques. LLMs can be trained on massive datasets using distributed training across multiple GPUs or even multiple machines, allowing them to capture complex patterns in natural language effectively.
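
One common form of distributed training is data parallelism, where each worker computes gradients on its own shard of a batch and the gradients are then averaged. The sketch below simulates that on a single machine with a small linear model; the model, the data, and the two "workers" are assumptions made purely for illustration.

```python
import numpy as np

def mse_gradient(w, x, y):
    """Gradient of mean squared error for a linear model y_hat = x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

# Gradient computed on the full batch by a single worker.
full_grad = mse_gradient(w, x, y)

# Split the batch evenly across two simulated workers, then average their gradients.
shards = zip(np.split(x, 2), np.split(y, 2))
worker_grads = [mse_gradient(w, xs, ys) for xs, ys in shards]
averaged_grad = np.mean(worker_grads, axis=0)

print(np.allclose(full_grad, averaged_grad))
```

With equally sized shards the averaged gradient matches the full-batch gradient, which is why the work can be spread across devices without changing the update.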

Here I will show how to fine-tune Google's BERT for sentiment analysis. This will be implemented on the IMDB Movie Review Dataset, a binary classification task with two labels: positive and negative.

In [1]:
!pip install transformers numpy torch scikit-learn
!pip install "accelerate>=0.21.0"



In [2]:
import torch
from transformers.utils import is_tf_available, is_torch_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split

In [3]:
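# use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")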

In [4]:
def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)

We set a seed to ensure reproducibility.
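
A quick way to see what seeding buys us: the sketch below (the `draw` helper is my own, not part of the notebook) reseeds `random` and `numpy` and checks that the same seed gives the same draws.

```python
import random
import numpy as np

def draw(seed):
    # Reseed both generators, then take a few samples from each.
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3).tolist()

first = draw(1)
second = draw(1)
different = draw(2)
print(first == second, first == different)
```

Note that seeding alone may not make GPU training bit-for-bit identical, since some CUDA operations are nondeterministic.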

In [5]:
# the model we are going to train: base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512

BERT accepts at most 512 tokens per input, including the special `[CLS]` and `[SEP]` tokens, so any review longer than that will be truncated.
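
To show how the limit interacts with the special tokens, here is a plain-Python sketch. The helper name and the word-level "tokens" are hypothetical stand-ins; the real tokenizer works on WordPiece tokens and handles this internally when `truncation=True`.

```python
def truncate_with_special_tokens(tokens, max_length=512):
    """Keep at most max_length tokens in total, including [CLS] and [SEP]."""
    budget = max_length - 2  # reserve two slots for the special tokens
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]

long_review = ["word"] * 600   # stand-in for a tokenized 600-token review
short_review = ["word"] * 10

long_encoded = truncate_with_special_tokens(long_review)
short_encoded = truncate_with_special_tokens(short_review)
print(len(long_encoded), len(short_encoded))
```

Only 510 tokens of actual review text fit, so the tail of a very long review never reaches the model.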

In [6]:
# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

In [7]:
import os
import tarfile
import urllib.request

def download_and_extract_imdb(data_dir='aclImdb'):
    """Download and extract the IMDb dataset.

    Args:
        data_dir (str): Path to store the dataset.

    Returns:
        None
    """
    url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    dataset_path = os.path.join(data_dir, "aclImdb_v1.tar.gz")

    # Create directory if it doesn't exist
    os.makedirs(data_dir, exist_ok=True)

    # Download the dataset
    if not os.path.exists(dataset_path):
        print("Downloading IMDb dataset...")
        urllib.request.urlretrieve(url, dataset_path)
        print("Download completed.")

    # Extract the dataset
    print("Extracting IMDb dataset...")
    with tarfile.open(dataset_path, 'r:gz') as tar:
        tar.extractall(data_dir)
    print("Extraction completed.")

# Download and extract IMDb dataset
download_and_extract_imdb()

# The next cell loads and preprocesses the extracted dataset.

Extracting IMDb dataset...
Extraction completed.


Here we load the IMDB dataset, which ships with its own train and test folders, and split the training portion further into training and validation sets.

In [8]:
import os
from sklearn.model_selection import train_test_split

def read_imdb_data(data_dir='aclImdb'):
    """Read IMDb data from the provided directory.

    Args:
        data_dir (str): Path to the IMDb dataset directory.

    Returns:
        tuple: A tuple of training texts, training labels, testing texts, testing labels.
    """
    train_texts = []
    train_labels = []
    test_texts = []
    test_labels = []

    for category in ['pos', 'neg']:
        # the archive unpacks into its own top-level "aclImdb" folder inside data_dir
        train_path = os.path.join(data_dir, 'aclImdb', 'train', category)
        test_path = os.path.join(data_dir, 'aclImdb', 'test', category)

        # Read training data
        for fname in os.listdir(train_path):
            with open(os.path.join(train_path, fname), 'r', encoding='utf-8') as f:
                train_texts.append(f.read())
                train_labels.append(1 if category == 'pos' else 0)

        # Read testing data
        for fname in os.listdir(test_path):
            with open(os.path.join(test_path, fname), 'r', encoding='utf-8') as f:
                test_texts.append(f.read())
                test_labels.append(1 if category == 'pos' else 0)

    return train_texts, train_labels, test_texts, test_labels

# Load IMDb dataset
train_texts, train_labels, test_texts, test_labels = read_imdb_data()

# Split the training data into training and validation sets
train_texts, valid_texts, train_labels, valid_labels = train_test_split(train_texts, train_labels, test_size=0.2)

# Print size of the datasets
print(f"Training size: {len(train_texts)}")
print(f"Validation size: {len(valid_texts)}")
print(f"Testing size: {len(test_texts)}")


Training size: 20000
Validation size: 5000
Testing size: 25000


In [9]:
# tokenize the dataset, truncating anything longer than `max_length`
# and padding shorter samples with 0's up to the longest sequence in each call
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
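
The padding step can be pictured with a small plain-Python sketch. The `pad_batch` helper and the token IDs are made up for illustration (apart from 0 as the padding ID); the tokenizer produces the same two fields, `input_ids` and `attention_mask`, on its own.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model tell real tokens apart from the zeros added for padding.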

In [10]:
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = IMDBDataset(train_encodings, train_labels)
valid_dataset = IMDBDataset(valid_encodings, valid_labels)

In [11]:
# Define target names based on your dataset
target_names = ["negative", "positive"]

# load the model and move it to `device`
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }

We will be judging our model's performance through accuracy.
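
For a feel of what `compute_metrics` does, here is a tiny worked example with made-up logits for four reviews. The helper name and the numbers are mine; the logic mirrors the argmax-then-compare step.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    # The predicted class is the index of the largest logit in each row.
    preds = np.asarray(logits).argmax(axis=-1)
    return float((preds == np.asarray(labels)).mean())

# Columns are (negative, positive) scores for each review.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
labels = [0, 1, 0, 0]

acc = accuracy_from_logits(logits, labels)
print(acc)  # 3 of the 4 predictions match, so 0.75
```

Accuracy is a reasonable choice here because the IMDB dataset has an even split of positive and negative reviews.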

In [16]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # reload the best checkpoint when training finishes (default metric is loss);
    # set the `metric_for_best_model` argument to select on accuracy or another metric instead
    logging_steps=400,               # log every 400 steps
    save_steps=400,                  # save a checkpoint every 400 steps
    evaluation_strategy="steps",     # evaluate every `eval_steps`, which defaults to `logging_steps`
    # (recent transformers releases rename this argument to `eval_strategy`)
)
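
To see what `warmup_steps=500` does to the learning rate, here is a sketch of linear warmup followed by linear decay. It assumes the Trainer defaults that apply when nothing else is set, as far as I know: a base learning rate of 5e-5 and a linear schedule. The total of 2500 steps comes from 20000 training samples at a batch size of 8 for one epoch. The function name is my own.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=2500):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 250, 500, 1500, 2500):
    print(step, linear_schedule_lr(step))
```

With only 2500 steps in total, the warmup covers a fifth of training, so the model spends relatively little time at the peak learning rate.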

In [17]:
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

In [18]:
# train the model
trainer.train()

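| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 400  | 0.4438        | 0.342545        | 0.898    |
| 800  | 0.3776        | 0.271921        | 0.9118   |
| 1200 | 0.3409        | 0.354372        | 0.9016   |
| 1600 | 0.2943        | 0.29791         | 0.915    |
| 2000 | 0.2985        | 0.317626        | 0.909    |
| 2400 | 0.2326        | 0.277936        | 0.9262   |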


TrainOutput(global_step=2500, training_loss=0.3277172790527344, metrics={'train_runtime': 2941.8441, 'train_samples_per_second': 6.798, 'train_steps_per_second': 0.85, 'total_flos': 5262221107200000.0, 'train_loss': 0.3277172790527344, 'epoch': 1.0})

We can see that validation accuracy improves slowly and not at every step, going from 0.898 at step 400 to 0.9262 at step 2400.

Side note: this is an extremely computationally expensive process; a single epoch took about 49 minutes here (2941.8 seconds). That's why I limited it to only 1 epoch, when it is traditionally 3.
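
The numbers reported by the training run allow a back-of-the-envelope estimate of what 3 epochs would cost. This sketch assumes runtime scales linearly with the number of epochs, which ignores any fixed overhead; the variable names are mine.

```python
train_examples = 20_000       # training set size printed earlier
batch_size = 8                # per_device_train_batch_size
train_runtime_s = 2941.8441   # 'train_runtime' from the training output

steps_per_epoch = train_examples // batch_size
seconds_per_step = train_runtime_s / steps_per_epoch
three_epoch_hours = 3 * train_runtime_s / 3600

print(steps_per_epoch, round(seconds_per_step, 2), round(three_epoch_hours, 2))
```

That works out to 2500 steps at roughly 1.18 seconds each, so three epochs would take about two and a half hours on the same hardware.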

---



In [19]:
# evaluate the current model after training
trainer.evaluate()

{'eval_loss': 0.2719208598136902,
 'eval_accuracy': 0.9118,
 'eval_runtime': 164.0067,
 'eval_samples_per_second': 30.487,
 'eval_steps_per_second': 1.524,
 'epoch': 1.0}

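Our model ended with a validation accuracy of 91.18% (the held-out test set was not scored here). Because `load_best_model_at_end=True` selects on validation loss, this is the checkpoint from step 800 rather than the one from step 2400, which reached 92.62% accuracy. Great success! Onto the next ML model.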
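
As a small check, the evaluation result can be matched against the training log. The sketch assumes the checkpoint is chosen by lowest validation loss, which is the default when `load_best_model_at_end=True` and no `metric_for_best_model` is set. The tuples are copied from the logged table; the variable names are my own.

```python
# (step, validation_loss, accuracy) copied from the training log
eval_log = [
    (400, 0.342545, 0.8980),
    (800, 0.271921, 0.9118),
    (1200, 0.354372, 0.9016),
    (1600, 0.297910, 0.9150),
    (2000, 0.317626, 0.9090),
    (2400, 0.277936, 0.9262),
]

best_by_loss = min(eval_log, key=lambda row: row[1])
best_by_accuracy = max(eval_log, key=lambda row: row[2])

print(best_by_loss[0], best_by_accuracy[0])
```

Selecting on loss picks step 800, while selecting on accuracy would pick step 2400, so passing `metric_for_best_model="accuracy"` to `TrainingArguments` would have kept a different checkpoint.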