<a href="https://colab.research.google.com/github/argonne-lcf/ai-science-training-series/blob/main/04_intro_to_llms/IntroLLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parallel and distributed Deep Learning

## Author: Marieme Ngom, Argonne National Laboratory
(combining and adapting materials/discussion evolved over time by Huihuo Zheng, Bethany Lusch, Asad Khan, Prasanna Balaprakash, Taylor Childers, Corey Adams, Kyle Felker, Varuni Sastry, Sam Foreman, Archit Vasan, Carlo Graziani, Tanwi Mallick, and Venkat Vishwanath)
## Outline 
1. Day 1
    - Evolution of computing systems
    - Parallel computing
    - Introduction to Deep Learning
    - Parallel computing in AI


2. ***Day 2***
    - Brief Introduction to LLMs
    - Hands-on LLM training


# Brief introduction to LLMs

![llms](images/llms.gif)
*Source: [Hannibal046/Awesome-LLM](https://github.com/Hannibal046/Awesome-LLM)*

![emergent](images/emergent-abilities.gif)


## Training LLMs
![evolllms](images/evolution.gif)


![ithungers](images/it_hungers.png)

## Life-cycle of an LLM
1. Data collection + preprocessing
2. ***Pre-training***
    - Architecture decisions, model size, etc.
3. Supervised Fine-Tuning
    - Instruction Tuning
    - Alignment
4. Deploy (+ monitor, re-evaluate, etc.)

![gptcycle](images/gpt3-training-step-back-prop.gif)
*Source: Figure from [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)*

## Life-cycle of an LLM
1. Data collection + preprocessing
2. Pre-training
    - Architecture decisions, model size, etc.
3. ***Supervised Fine-Tuning***
    - Instruction Tuning
    - Alignment
4. Deploy (+ monitor, re-evaluate, etc.)

![gptcycle](images/gpt3-fine-tuning.gif)
*Source: Figure from [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)*

## Forward pass
![fwdpass](images/hf_assisted_generation.mov)
*Source: [Generation with LLMs](https://huggingface.co/docs/transformers/main/en/llm_tutorial)*

## Generating text
![fwdpass](images/hf_assisted_generation2.mov)
*Source: [Generation with LLMs](https://huggingface.co/docs/transformers/main/en/llm_tutorial)*

# Hands-on LLM Training



***Good practice***: Create and activate a conda (or virtual) environment:
```conda create -n env_mlss_dnn python=3.9```
Then, in Jupyter, open a terminal (New -> Terminal) and run:

```
 conda activate env_mlss_dnn
 pip install ipykernel 
 python -m ipykernel install --user --name env_mlss_dnn
```

Then go back to your .ipynb file and change the kernel to env_mlss_dnn.



In [None]:
!git clone https://github.com/karpathy/nanoGPT.git

In [None]:
%pwd

# change to /path/to/your/folder
%cd nanoGPT

# confirm
%pwd

In [None]:
#!pip install torch numpy transformers datasets tiktoken wandb tqdm

In [None]:
!python3 data/shakespeare_char/prepare.py
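
The preparation step turns raw text into integer token ids at the character level. The sketch below illustrates that idea on a toy string. It is not the nanoGPT script itself: the sample text, the helper names `encode`/`decode`, and the 90/10 train/validation split are assumptions made for illustration.

```python
# Minimal character-level tokenizer sketch (toy text, not the real dataset).
text = "to be, or not to be"

# Vocabulary: every distinct character, sorted so ids are assigned stably.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)

# Assumed 90/10 split into training and validation tokens.
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]

print("vocab size:", len(chars))
print("train tokens:", len(train_ids), "val tokens:", len(val_ids))
```

With a character vocabulary the model only needs a small embedding table, which is one reason this dataset trains quickly on modest hardware.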

In [None]:
!python3 train.py config/train_shakespeare_char.py --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=500 --lr_decay_iters=2000 --dropout=0.0
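
To get a feel for how small this model is, the flags above can be plugged into a common rule of thumb for transformer size. This is a rough sketch, not the count that the training script prints: the function `approx_block_params` is hypothetical, and it ignores embeddings, biases, and layer norms.

```python
def approx_block_params(n_layer, n_embd):
    """Rough parameter count of the transformer blocks only."""
    attn = 4 * n_embd * n_embd  # query, key, value, and output projections
    mlp = 8 * n_embd * n_embd   # two linear layers with a 4x hidden width
    return n_layer * (attn + mlp)

# Values taken from the command-line flags above.
n_params = approx_block_params(n_layer=4, n_embd=128)

# Tokens seen in one forward/backward pass: batch_size * block_size.
tokens_per_pass = 12 * 64

print(f"approx. block parameters: {n_params:,}")
print(f"tokens per forward/backward pass: {tokens_per_pass}")
```

Under these assumptions the blocks hold well under a million parameters, which is why the run is feasible without a large GPU.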

In [None]:
!pip install git+https://github.com/openai/whisper.git

In [None]:
import tiktoken

In [None]:
!python3 sample.py --out_dir=out-shakespeare-char
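
Sampling turns the model's output logits into a next token, typically with a temperature and a top-k cutoff. The following is a minimal pure-Python sketch of that step under stated assumptions: `next_token_probs` is a hypothetical helper, the logits are made up, and ties at the cutoff value are all kept.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Temperature-scaled softmax, optionally restricted to the top_k logits."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        # Everything below the k-th largest logit gets probability zero.
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits over a 4-token vocabulary.
probs = next_token_probs([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=2)

# Draw one token id according to those probabilities.
random.seed(0)
token_id = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, token_id)
```

Lower temperatures concentrate probability on the most likely token, while a smaller `top_k` removes unlikely tokens from consideration entirely.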

In [None]:
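# `!export` runs in a temporary subshell and does not persist; %env sets the
# variable for the kernel and for commands launched from later cells.
%env NCCL_DEBUG=INFO
%env NCCL_DEBUG_SUBSYS=ALL
%env NCCL_DEBUG_FILE=nccl_trace.log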

# Running on one node of NVIDIA T4 Tensor Core GPUs (4 GPUs/node)
# Bamba


In [None]:
# %env (unlike `!export`) keeps the setting for the torchrun command below
%env CUDA_VISIBLE_DEVICES=0,1,2,3

# launch 4 processes → 4 GPUs
!torchrun \
  --nproc_per_node=4 \
  train.py \
    config/train_shakespeare_char.py \
    --batch_size=12 \
    --gradient_accumulation_steps=40 \
    --compile=False 2>&1 | tee full_train.log

# Note: this cell was run on a personal laptop, so the launch ends in an error there.
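
When the same command is launched across several GPUs, the gradient accumulation steps are shared among the processes, so the number of tokens per optimizer step stays fixed. The sketch below works through that arithmetic. It assumes that the accumulation steps are divided evenly by the number of processes and that the config uses `block_size=256`; check both against your copy of `train.py` and the config file. The function name is hypothetical.

```python
def tokens_per_iteration(batch_size, block_size, grad_accum_steps, world_size):
    """Tokens consumed per optimizer step under data parallelism."""
    # Assumption: accumulation steps are split evenly across processes.
    assert grad_accum_steps % world_size == 0
    per_process_steps = grad_accum_steps // world_size
    return per_process_steps * world_size * batch_size * block_size

# Flags from the torchrun command above; block_size=256 is an assumed config value.
tokens = tokens_per_iteration(
    batch_size=12, block_size=256, grad_accum_steps=40, world_size=4
)
print(f"tokens per iteration: {tokens:,}")
```

Each of the 4 processes then runs 10 micro-batches per step instead of 40, which is where the speedup from adding GPUs comes from.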

# Running on N nodes of NVIDIA T4 Tensor Core GPUs (4N GPUs)
Because each participant has only one node, we set --nnodes=1. With multiple nodes, you need to do the following on each node, giving each node its own --node_rank:


In [None]:
import socket
ip=socket.gethostbyname(socket.gethostname())
print(ip)

In [None]:
# %env persists for the torchrun command below; $ip expands the Python variable
%env CUDA_VISIBLE_DEVICES=0,1,2,3
%env MASTER_ADDR=$ip
%env MASTER_PORT=29500

!torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --nproc_per_node=4 \
  --master_addr=$ip \
  --master_port=29500 \
  train.py \
    config/train_shakespeare_char.py \
    --batch_size=12 \
    --gradient_accumulation_steps=40
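
In a multi-node launch, every process gets a global rank that is usually derived from its node rank and its local rank on that node. The sketch below shows this common layout for a hypothetical 2-node, 4-GPU-per-node job. The function `global_rank` is illustrative only; in a real job, torchrun exposes the actual values through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables.

```python
def global_rank(node_rank, nproc_per_node, local_rank):
    """Typical rank layout: processes are numbered node by node."""
    return node_rank * nproc_per_node + local_rank

# Hypothetical job: 2 nodes with 4 processes (GPUs) each.
nnodes, nproc_per_node = 2, 4
world_size = nnodes * nproc_per_node

layout = {
    (node, local): global_rank(node, nproc_per_node, local)
    for node in range(nnodes)
    for local in range(nproc_per_node)
}

for (node, local), rank in sorted(layout.items()):
    print(f"node {node}, local rank {local} -> global rank {rank}")
```

The process with global rank 0 lives on the node whose address is passed as `--master_addr`, which is why every node must be given the same address and port.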
