### Training and Evaluation of the LaMoE Language Model

> This notebook walks through the training and evaluation of the LaMoE model using the train and validation/test data.

**!!Primary Note!!:** Make sure **train.h5**, **val.h5** and **Tokenizer.json** have been created using *Dataset.ipynb*.

**!!Caution Note!!:** Training and evaluation use **mlflow** for logging. Before starting the training, start the mlflow tracking server by running **mlflow ui** in a terminal opened in the current directory.

*Note*: To change the model, training or evaluation configuration, edit *config.py*.
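
The caution note above depends on the tracking server being reachable before training starts. A minimal sketch of such a reachability check is shown here; the helper name `tracking_server_is_up` and the 2-second timeout are assumptions made for illustration and are not part of the notebook, while the URL is the one used later in `train`.

```python
import urllib.error
import urllib.request


def tracking_server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url` with status 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.getcode() == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


# The address used by `mlflow ui` by default in this notebook.
print(tracking_server_is_up("http://127.0.0.1:5000/"))
```

Using a timeout keeps the check from hanging when nothing is listening on the port.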

In [1]:
from collections import defaultdict
from tqdm import tqdm
from pathlib import Path
import os
import urllib.request
import glob
import sys

import numpy as np
import torch
import torch.nn as nn
from torchsummary import summary
import mlflow

In [2]:
cd ..

d:\Envs\Projects\Transformer_Decoder\MOE




In [None]:
from lamoe.transformer import Transformer
from lamoe.config import ModelArgs, TrainEvalArgs
from lamoe.utils import get_data, get_vocab_size, get_batch

_CudaDeviceProperties(name='NVIDIA GeForce RTX 3060 Laptop GPU', major=8, minor=6, total_memory=6143MB, multi_processor_count=30, uuid=eafe3e1a-82bb-fec8-2948-094f6277e5f8, L2_cache_size=3MB)


In [None]:
# Perplexity is used to evaluate the quality of text generation. It is computed as the exponential of the cross-entropy loss.
# The lower the perplexity, the better the model predicts the next token.
def perplexity_score(loss: torch.Tensor) -> float:
    return torch.exp(loss).item() 
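
To make the relation between cross-entropy and perplexity concrete, here is a small sketch in plain Python (no torch, so it runs anywhere). It uses the vocabulary size of 29627 reported later in this notebook; the function name `perplexity` is chosen for illustration only. A model that spreads probability uniformly over the vocabulary has a loss of ln(29627) ≈ 10.30 and a perplexity equal to the vocabulary size, which is a useful baseline when reading the step-0 numbers in the training log.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)


vocab_size = 29627  # vocabulary size reported later in this notebook

# A model that assigns probability 1/V to every token has loss ln(V).
uniform_loss = math.log(vocab_size)
print(f"uniform loss: {uniform_loss:.2f}")
print(f"uniform perplexity: {perplexity(uniform_loss):.0f}")

# A model that always gives the correct token probability 0.5 has perplexity 2:
# it is as uncertain as a fair choice between two tokens.
print(f"perplexity at p=0.5: {perplexity(-math.log(0.5)):.1f}")
```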

In [None]:
# Function to evaluate the model on train and validation data.
@torch.no_grad()
def estimate_model(train_data: np.ndarray,
                   val_data: np.ndarray,
                   model: Transformer,
                   model_args: ModelArgs, 
                   train_eval_args: TrainEvalArgs, 
                   loss_criterion: nn.CrossEntropyLoss) -> defaultdict:
    
    out = defaultdict(dict)
    data = {"train": train_data, "val": val_data}
    model.eval()
    for key, split in data.items():
        losses = torch.zeros(train_eval_args.max_eval_iter)
        perplexities = torch.zeros(train_eval_args.max_eval_iter)
        for k in range(train_eval_args.max_eval_iter):

            x, y = get_batch(split, model_args.max_seq_length, model_args.max_batch_size)
            logits, aux_loss = model(x.to(model_args.device))
            task_loss = loss_criterion(logits.view(-1, logits.size(2)), y.view(-1).to(model_args.device))

            perplexity = perplexity_score(task_loss)
            perplexities[k] = perplexity

            total_loss = task_loss + aux_loss if aux_loss is not None else task_loss
            losses[k] = total_loss.item()

        # Store plain Python floats for both metrics so they format and log consistently.
        out[key] = {"loss": losses.mean().item(), "perplexity": perplexities.mean().item()}

    return out
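
Two details of `estimate_model` are worth keeping in mind when reading its output. The reported loss is the mean of the total loss (task loss plus auxiliary loss, when present), while the perplexity is computed from the task loss only. The reported perplexity is also the mean of per-batch perplexities, which is never smaller than the exponential of the mean loss (Jensen's inequality). The sketch below illustrates the second point with made-up loss values; it is not taken from a real run.

```python
import math

# Made-up per-batch cross-entropy values, for illustration only.
batch_losses = [2.0, 3.0, 4.0]

# What estimate_model reports: the average of exp(loss) over batches.
mean_of_perplexities = sum(math.exp(loss) for loss in batch_losses) / len(batch_losses)

# The alternative convention: exp of the average loss.
perplexity_of_mean = math.exp(sum(batch_losses) / len(batch_losses))

print(f"mean of perplexities: {mean_of_perplexities:.3f}")
print(f"perplexity of mean loss: {perplexity_of_mean:.3f}")
```

The two values coincide only when every batch has the same loss, so perplexities from this notebook are best compared with numbers computed the same way.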

In [None]:
# Function to train the model.
def train(model_args: ModelArgs,
          train_eval_args: TrainEvalArgs,
          model: Transformer,
          train_data: np.ndarray,
          val_data: np.ndarray,
          ckpt_dir: Path) -> None:
    
    try:
        # Fail fast if the MLflow tracking server is not reachable.
        urllib.request.urlopen("http://127.0.0.1:5000/").getcode()
        run_name = f"Experiment-{np.random.randint(10**6)}"
        # Set the tracking URI first so that the experiment is created on that server.
        mlflow.set_tracking_uri(uri = "http://127.0.0.1:5000/")
        mlflow.set_experiment("MoE Training")
        print("Mlflow current run name inside MoE Training: ", run_name, "\n")
    except Exception:
        print("MLflow tracking server is not running. Start the server before starting the training.")
        print("To start the tracking server, open a terminal in the parent directory and run: mlflow ui\n")
        sys.exit()

    optimizer = torch.optim.AdamW(model.parameters(), lr = train_eval_args.lr, weight_decay = train_eval_args.weight_decay, eps = 1e-8)
    loss_criterion = nn.CrossEntropyLoss()

    val_temp = 0
    print(".......................................Executing training of the model.......................................\n")
    with mlflow.start_run(run_name = run_name):
        params = {"Num_layers": model_args.n_layers, "Num_Q_heads": model_args.n_heads, "Num_KV_heads": model_args.n_kv_heads,
                  "Num_experts": model_args.num_experts, "Top_experts": model_args.k, "Vocab_size": model_args.vocab_size,
                  "Dimension": model_args.dim, "batch_size": model_args.max_batch_size , "context_length" : model_args.max_seq_length, 
                  "Max_iters": train_eval_args.max_train_iter, "eval_interval": train_eval_args.eval_interval,  
                  "Device": model_args.device, "eval_iters": train_eval_args.max_eval_iter, "aux_loss_coeff": model_args.aux_loss_coeff, 
                  "optimizer": "AdamW", "learning_rate": train_eval_args.lr, "weight_decay": train_eval_args.weight_decay}
        mlflow.log_params(params) # Logging of params

        for iter in tqdm(range(train_eval_args.max_train_iter)):

            x, y = get_batch(train_data, model_args.max_seq_length, model_args.max_batch_size)
            model.train()
            logits, aux_loss = model(x.to(model_args.device))
            task_loss = loss_criterion(logits.view(-1, logits.size(2)), y.view(-1).to(model_args.device))
            mlflow.log_metric("Task_Loss", task_loss.item(), step = iter)

            perplexity = perplexity_score(task_loss)
            mlflow.log_metric("Perplexity", perplexity, step = iter)

            if aux_loss is not None:
                total_loss = task_loss + aux_loss 
                mlflow.log_metric("Aux_Loss", aux_loss.item(), step = iter)
            else:
                total_loss = task_loss

            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

            if (iter % train_eval_args.eval_interval == 0) or (iter == train_eval_args.max_train_iter - 1):
                estimates = estimate_model(train_data, val_data, model, model_args, train_eval_args, loss_criterion)
                print(f"\nStep {iter}: Train Loss - {estimates['train']['loss']:.4f}, Train Perplexity - {estimates['train']['perplexity']:.4f}, Val Loss - {estimates['val']['loss']:.4f}, Val Perplexity -  {estimates['val']['perplexity']:.4f}")
                
                # Save a checkpoint at step 0 and whenever the validation loss improves.
                is_first = iter == 0
                if is_first or (estimates['val']['loss'] < val_temp):
                    # Keep only the best checkpoint: delete any existing ones first.
                    for file in glob.glob(os.path.join(ckpt_dir, "*.pth")):
                        os.remove(file)
                    if not is_first:
                        print(f"Val loss improved from {val_temp:.4f} to {estimates['val']['loss']:.4f}.")
                    val_temp = estimates['val']['loss']
                    training_state = {'ckpt_state_dict': model.state_dict(),
                                      'optimizer_state': optimizer.state_dict()}
                    ckpt_name = f"checkpoint-{iter}-{val_temp:.3f}.pth"
                    print(f"Saving {'first ' if is_first else ''}checkpoint in {os.path.join(ckpt_dir, ckpt_name)}")
                    torch.save(training_state, os.path.join(ckpt_dir, ckpt_name))


                metrics = {"Train_Loss": float(estimates['train']['loss']), "Train_Perplexity": float(estimates['train']['perplexity']),
                           "Val_Loss": float(estimates['val']['loss']), "Val_Perplexity": float(estimates['val']['perplexity'])}
                mlflow.log_metrics(metrics, step = iter)

        print()
        model_name = "MoE-LM.pth"
        torch.save(model.state_dict(), os.path.join(Path(ckpt_dir).parent, model_name))
        print(f"Training is completed successfully. Final model is saved in {os.path.join(Path(ckpt_dir).parent, model_name)}")
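
The checkpointing policy inside `train` keeps a single file on disk: the checkpoint with the best validation loss seen so far. The sketch below isolates that policy using only the standard library. The helper `save_best_checkpoint` is hypothetical, the file write is a stand-in for `torch.save`, and the third validation loss (8.2000) is made up to show a step that does not improve.

```python
import glob
import os
import tempfile


def save_best_checkpoint(ckpt_dir, step, val_loss, best_val_loss, payload=b""):
    """Write a checkpoint only if val_loss improved; return the best loss so far."""
    if best_val_loss is not None and val_loss >= best_val_loss:
        return best_val_loss  # no improvement: keep the existing checkpoint
    # Remove older checkpoints so that only the best one remains on disk.
    for old in glob.glob(os.path.join(ckpt_dir, "*.pth")):
        os.remove(old)
    path = os.path.join(ckpt_dir, f"checkpoint-{step}-{val_loss:.3f}.pth")
    with open(path, "wb") as f:
        f.write(payload)  # stand-in for torch.save(training_state, path)
    return val_loss


with tempfile.TemporaryDirectory() as tmp:
    best = None
    for step, val_loss in [(0, 10.1594), (10, 8.0461), (20, 8.2000)]:
        best = save_best_checkpoint(tmp, step, val_loss, best)
    remaining = sorted(os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*.pth")))

print(remaining)  # ['checkpoint-10-8.046.pth']
```

Because older files are deleted before the new one is written, an interruption between the two steps would leave no checkpoint at all; writing the new file first and deleting afterwards is a possible alternative.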

In [7]:
model_dir = os.path.join("Saved", "model")
checkpoint_dir = os.path.join(model_dir, "checkpoints")
os.makedirs(checkpoint_dir, exist_ok = True)

In [None]:
# Loading the saved train and validation/test tokens.
train_data = get_data(os.path.join('Saved', 'train.h5'))
val_data = get_data(os.path.join('Saved', 'val.h5'))
print(f"Tokens in training data: {len(train_data)}. Tokens in validation/test data: {len(val_data)}.")

Tokens in training data: 2286222. Tokens in validation/test data: 616263.


In [9]:
vocab_size = get_vocab_size(os.path.join('Saved', 'Tokenizer.json'))

'Saved\Tokenizer.json' exists. Loading dictionary values from 'Saved\Tokenizer.json'.
Size of Vocabulary:  29627


In [10]:
model_args = ModelArgs()
model_args.vocab_size = vocab_size
print("Model Args: ", model_args)

Model Args:  ModelArgs(dim=512, ffn_hidden_dim=2048, n_layers=4, n_heads=8, n_kv_heads=4, vocab_size=29627, norm_eps=1e-05, num_experts=8, k=2, eos='<eos>', pad='<pad>', unk='<unk>', aux_loss=True, aux_loss_coeff=0.01, inference=False, cache=False, max_batch_size=32, max_seq_length=300, device=device(type='cuda'))


In [11]:
model = Transformer(model_args)
model.to(model_args.device)

Num_of_parameters = sum(p.numel() for p in model.parameters())
print("Model Parameters : {:.3f} M".format(Num_of_parameters / 1e6)) # Prints Total number of Model Parameters.

Model Parameters : 134.189 M


In [12]:
x, y = get_batch(train_data, model_args.max_seq_length, model_args.max_batch_size)
print("Input shape: ", x.shape, "Output shape: ", y.shape)
print()
print("Summary of the model:\n", summary(model, [(x)], device = model_args.device, verbose = 0))

Input shape:  torch.Size([32, 300]) Output shape:  torch.Size([32, 300])

Summary of the model:
Layer (type:depth-idx)                        Output Shape              Param #
├─Embedding: 1-1                              [-1, 300, 512]            15,169,024
├─ModuleList: 1                               []                        --
|    └─Block: 2-1                             [-1, 300, 512]            --
|    |    └─RMSNorm: 3-1                      [-1, 300, 512]            512
|    |    └─MHA: 3-2                          [-1, 300, 512]            787,456
|    |    └─RMSNorm: 3-3                      [-1, 300, 512]            512
|    |    └─MoE: 3-4                          [-1, 300, 512]            25,174,032
|    └─Block: 2-2                             [-1, 300, 512]            --
|    |    └─RMSNorm: 3-5                      [-1, 300, 512]            512
|    |    └─MHA: 3-6                          [-1, 300, 512]            787,456
|    |    └─RMSNorm: 3-7                     
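
The per-layer counts in the summary can be reconciled with the reported total of 134.189 M by simple arithmetic. The sketch below takes the embedding and block numbers from the summary and the `ModelArgs` printout. Because the summary is cut off, the last two terms are assumptions: one final RMSNorm after the last block and an untied, bias-free output projection to the vocabulary. The fact that the total then matches suggests, but does not prove, that this is the actual layout.

```python
vocab_size, dim, n_layers = 29627, 512, 4  # from the ModelArgs printout

embedding = vocab_size * dim               # matches "Embedding: 1-1" in the summary
block = 512 + 787_456 + 512 + 25_174_032   # RMSNorm + MHA + RMSNorm + MoE, from the summary
final_norm = dim                           # ASSUMPTION: one RMSNorm after the last block
output_head = vocab_size * dim             # ASSUMPTION: untied, bias-free output projection

total = embedding + n_layers * block + final_norm + output_head
print(f"{total:,} parameters = {total / 1e6:.3f} M")  # 134,188,608 parameters = 134.189 M
```

Note that the MoE layers account for roughly three quarters of all parameters (4 × 25,174,032), although only `k=2` of the 8 experts are used per token.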

In [13]:
train_eval_args = TrainEvalArgs()
train(model_args, train_eval_args, model, train_data, val_data, checkpoint_dir)

2025/04/21 13:44:25 INFO mlflow.tracking.fluent: Experiment with name 'MoE Training' does not exist. Creating a new experiment.


Mlflow current run name inside MoE Training:  Experiment-273889 

.......................................Executing training of the model.......................................



  0%|          | 0/30 [00:00<?, ?it/s]


Step 0: Train Loss - 10.1466, Train Perplexity - 25499.7773, Val Loss - 10.1594, Val Perplexity -  25829.4590
Saving first checkpoint in Saved\model\checkpoints\checkpoint-0-10.159.pth


 33%|███▎      | 10/30 [01:25<01:59,  5.98s/it]


Step 10: Train Loss - 7.9938, Train Perplexity - 2947.1538, Val Loss - 8.0461, Val Perplexity -  3101.9954
Val loss improved from 10.1594 to 8.0461.
Saving checkpoint in Saved\model\checkpoints\checkpoint-10-8.046.pth


 67%|██████▋   | 20/30 [02:45<01:03,  6.36s/it]


Step 20: Train Loss - 7.8404, Train Perplexity - 2530.9663, Val Loss - 7.9065, Val Perplexity -  2701.8994
Val loss improved from 8.0461 to 7.9065.
Saving checkpoint in Saved\model\checkpoints\checkpoint-20-7.906.pth


 97%|█████████▋| 29/30 [04:01<00:06,  6.46s/it]


Step 29: Train Loss - 7.7290, Train Perplexity - 2266.8250, Val Loss - 7.7793, Val Perplexity -  2382.4126
Val loss improved from 7.9065 to 7.7793.
Saving checkpoint in Saved\model\checkpoints\checkpoint-29-7.779.pth


100%|██████████| 30/30 [04:23<00:00,  8.78s/it]



Training is completed successfully. Final model is saved in Saved\model\MoE-LM.pth
🏃 View run Experiment-273889 at: http://127.0.0.1:5000/#/experiments/818474842340393126/runs/c0703b1bf90c4642898e9ef032caac4b
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/818474842340393126
