# Llama 3.2 1B Model Training

This notebook demonstrates how to set up and run a training pipeline for the Llama 3.2 1B model. It covers loading the dataset, defining the model configuration, and running training iterations to observe improvements in model performance. Each section explains the functionality and parameters used, so the results can be reproduced by running the cells in sequence.

### Importing Required Libraries

This cell imports the libraries needed for model loading and fine-tuning: `torch`, `transformers`, `peft`, and `huggingface_hub`. The `datasets` library is imported later, when the data is loaded.

Here, we define a quantization configuration that stores the model weights in 4 bits, which reduces memory use during training. The pretrained weights of the Llama 3.2 1B model are downloaded from Hugging Face. To do so, you need a personal read token from your Hugging Face account, and you must request access to the Llama 3.2 1B model on Hugging Face. Once access is granted, replace the login token with your own and run the code.
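To give a rough sense of why 4-bit storage helps, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count is an assumed round figure chosen for illustration, not the exact size of the model, and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gib(n_params: int, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed round figure for illustration, not the model's exact parameter count.
ASSUMED_PARAMS = 1_000_000_000

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    gib = weight_memory_gib(ASSUMED_PARAMS, bits)
    print(f"{label:>8}: {gib:.2f} GiB")
```

Under these assumptions, 4-bit storage needs one quarter of the memory of 16-bit storage for the same weights.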

In [1]:
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

login(token="replace with your HuggingFace token")
model_name = 'meta-llama/Llama-3.2-1B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


In [2]:
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map='auto', trust_remote_code=True)
model.resize_token_embeddings(len(tokenizer))
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias='none', task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
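The LoRA settings in the cell above (`r=16`, `lora_alpha=32`) can be read as follows: each adapted linear layer gains two small matrices of rank `r`, and their product is scaled by `lora_alpha / r` before being added to the frozen weight's output. The sketch below works through that arithmetic for one layer. The layer dimensions are hypothetical values chosen for illustration; which layers are actually adapted depends on the PEFT defaults for the model and is not covered here.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


# Hypothetical layer size, for illustration only.
D_IN, D_OUT = 2048, 2048

full = D_IN * D_OUT
extra = lora_extra_params(D_IN, D_OUT, r=16)
print(f"frozen weights in layer: {full}")
print(f"trainable LoRA weights:  {extra} ({100 * extra / full:.2f}% of the layer)")
print(f"scaling factor:          {lora_scaling(32, 16)}")
```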

### Loading and Preprocessing the Dataset

This cell loads the dataset used for training with the Hugging Face `datasets` library. For illustration, we use the open-source WikiText-103 dataset hosted on Hugging Face. The following cell preprocesses the data for the model: it tokenizes the text, pads or truncates each example to 128 tokens, and copies the input IDs into `labels` for causal language-model training.

In [3]:
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset('Salesforce/wikitext', 'wikitext-103-v1')

In [4]:
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    batch = tokenizer(examples["text"], padding="max_length", 
        max_length=128, truncation=True)
    batch["labels"] = batch["input_ids"]
    return batch

tokenized_dataset = dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 1801350/1801350 [02:49<00:00, 10607.04 examples/s]
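Because the tokenization cell copies `input_ids` into `labels` unchanged, padded positions carry ordinary token IDs as labels. A common alternative, sketched below, is to set the labels at padded positions to `-100`, the value that Hugging Face causal language models ignore when computing the loss. The helper names are hypothetical, and the attention mask is used instead of the pad token ID because this notebook reuses the end-of-sequence token for padding. Whether the trainer's data collator later rewrites the labels depends on the library version, so treat this as an illustration, not a description of what the run above did.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


def add_masked_labels(batch):
    """Hypothetical variant of the labeling step for a batched tokenizer output."""
    batch["labels"] = [
        mask_padding_labels(ids, mask)
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch


# Toy batch: two real tokens followed by two padding tokens.
toy = {"input_ids": [[11, 12, 99, 99]], "attention_mask": [[1, 1, 0, 0]]}
print(add_masked_labels(toy)["labels"])
```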


## Setting Up and Logging into Weights and Biases (W&B)

This cell initializes Weights and Biases for experiment tracking. W&B helps monitor metrics like training loss and accuracy over time, making it easy to compare runs and visualize progress. If this is your first time using W&B, ensure you have an account at [wandb.ai](https://wandb.ai/) and run `wandb.login()` to authenticate. You may be prompted to enter an API key, which you can find in your W&B account settings.

In [5]:
import wandb

wandb.login()  # Log in directly without setting env variable
wandb.init(project='llama-training', entity='replace with your team name')

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mfjiang7[0m ([33mfjiang7-ucsd[0m). Use [1m`wandb login --relogin`[0m to force relogin


## Training Setup with Hugging Face Trainer

In this section, we configure the training parameters and initialize the Hugging Face Trainer for our model. The `Trainer` class provides a high-level API that simplifies the training loop and integrates various components needed for model training and evaluation. 

### Key Components of the Setup:

- **Training Arguments**: We define `TrainingArguments`, which includes essential parameters such as:
  - `output_dir`: Directory where model predictions and checkpoints are saved.
  - `num_train_epochs`: Total number of training epochs; we use 3 here to match Llama.
  - `per_device_train_batch_size`: Batch size per device (GPU/TPU) during training.
  - `learning_rate`: Initial learning rate for the optimizer.
  - `evaluation_strategy`: When to evaluate during training. It is set to `"steps"` here, with `eval_steps=20` so that evaluation runs every 20 steps; it can also be set to `"epoch"` to evaluate after each epoch.
  - `weight_decay`: Strength of the weight decay applied during training to reduce overfitting.
  - `report_to`: Where to report loss and other metrics; set to `"wandb"` here so that Weights & Biases can plot them in real time.
  - `logging_dir`: Directory for storing logs.
  - `fp16`: Enables mixed-precision (16-bit floating point) training to reduce memory use and improve throughput.

- **Trainer Initialization**: We create a trainer instance by passing the model, training arguments, and datasets. The code below uses `SFTTrainer` from the `trl` library, which builds on the Hugging Face `Trainer` class. The trainer handles the training loop, loss computation, gradient updates, and evaluation automatically, making it easier to focus on model development rather than the intricacies of training.

We chose the Hugging Face Trainer because it simplifies the training workflow by abstracting repetitive tasks, supports mixed-precision training and distributed training across multiple GPUs with minimal configuration, and provides built-in evaluation and logging for monitoring model performance. This setup lets us train the language model efficiently with Hugging Face's libraries and tools.

In [10]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="steps",
    eval_steps=20,
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=3,
    weight_decay=0.01,
    gradient_accumulation_steps=6,
    report_to="wandb",
    logging_dir='./logs',  
    logging_steps=50,
    save_steps=500,
    dataloader_num_workers=4,
    fp16=True,
)
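The batch settings above combine into an effective batch size per optimizer step, which in turn determines how many optimizer steps make up one epoch. The sketch below works through that arithmetic. It assumes a single device, which the notebook does not state, and it assumes that the example count shown in the `Map` progress output is the size of the training split.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def optimizer_steps_per_epoch(num_examples: int, eff_batch: int) -> int:
    """Approximate optimizer steps needed to see every example once."""
    return math.ceil(num_examples / eff_batch)


TRAIN_EXAMPLES = 1_801_350  # count from the Map progress output (assumed train split)

eff = effective_batch_size(per_device=128, grad_accum=6)  # single device assumed
steps = optimizer_steps_per_epoch(TRAIN_EXAMPLES, eff)
print(f"effective batch size: {eff}")
print(f"approx. optimizer steps per epoch: {steps}")
print(f"fraction of an epoch after 1600 steps: {1600 / steps:.2f}")
```

Under these assumptions, 1600 optimizer steps cover roughly two thirds of an epoch, which is close to the `train/epoch` value logged at the end of the run.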



In [11]:
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [12]:
train_stats = trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

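| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20 | No log | 3.185857 |
| 40 | No log | 3.116699 |
| 60 | 3.165800 | 3.046834 |
| 80 | 3.165800 | 2.985568 |
| 100 | 3.007700 | 2.930325 |
| 120 | 3.007700 | 2.882129 |
| 140 | 3.007700 | 2.840423 |
| 160 | 2.889200 | 2.80826 |
| 180 | 2.889200 | 2.78548 |
| 200 | 2.825400 | 2.768691 |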



KeyboardInterrupt: 

In [13]:
wandb.finish()

| Metric | History |
|--------|---------|
| eval/loss | █▆▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| eval/runtime | ▂▄▄▂▃▂▂▂▃▂▂▂▂▂█▂▄▃▃▁▁▃▃▃▁▁▂▂▁▁▁▂▂▃▂▁▂▂▁▃ |
| eval/samples_per_second | ▅▁▄▆▅▆▆▅▄▆▇▆▅▆▆▆▆▄▇▇▆▅▆▅█▅▇▆▆▇▇▃▆▇█▅▇▇▆▇ |
| eval/steps_per_second | ▇▆▁▆▆▅▇▆▇▇▅▇▇▆▆▆▅▆█▆▇▆█▆▇▇▇▄▇█▆▇▇▆▇█▇▇▆▅ |
| train/epoch | ▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▆▆▇▇▇▇▇▇▇▇▇████ |
| train/global_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇█ |
| train/grad_norm | ▆▄▃▁▁▅▁▁▂▃▁▂▁▂▃▂▂▄▃▂▃▃▄▆▄▆▇▆█▆▅▇ |
| train/learning_rate | ███▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁▁ |
| train/loss | █▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

| Metric | Value |
|--------|-------|
| eval/loss | 2.61732 |
| eval/runtime | 31.3313 |
| eval/samples_per_second | 139.094 |
| eval/steps_per_second | 1.117 |
| train/epoch | 0.68211 |
| train/global_step | 1600.0 |
| train/grad_norm | 0.54705 |
| train/learning_rate | 2e-05 |
| train/loss | 2.6621 |
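
The losses reported above are mean per-token cross-entropy values in nats, so they can be converted to perplexity by exponentiation. The sketch below applies this to the first validation loss in the training table and to the final `eval/loss` in the run summary. These figures reflect this notebook's tokenizer, 128-token truncation, and padding scheme, so they are likely not comparable to published WikiText-103 perplexities.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


first_eval_loss = 3.185857  # validation loss at step 20
last_eval_loss = 2.61732    # eval/loss from the run summary

print(f"perplexity at step 20:   {perplexity(first_eval_loss):.2f}")
print(f"perplexity at end of run: {perplexity(last_eval_loss):.2f}")
```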
