<a href="https://colab.research.google.com/github/laurauguc/llama_grading/blob/main/Llama_Grading_Model_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Llama Models

## Models Info

Llama-3.1-8B-Instruct.

Model card: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

Llama models GitHub: https://github.com/meta-llama/llama-models

Meta release (see performance): https://ai.meta.com/blog/meta-llama-3-1/

Selected because it is multilingual but not multimodal. It is the smallest of the medium-sized variants and fits on Colab Pro's GPUs, leaving memory for fine-tuning.

Start with a lightweight model (Llama-3.2-1B-Instruct) for experimentation. See performance here: https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

Performance note: Llama 3.1 8B reportedly outperforms GPT-3.5 Turbo on some benchmarks. Fine-tuning may improve it further.

Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Also: Deploying the Llama 3 70B model is much more challenging. At 16-bit precision the weights alone need roughly 140 GB, more than the VRAM of a single commonly available GPU, so you will need to provision a multi-GPU instance (e.g. a g5.48xlarge instance on AWS). This is much more expensive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Model memory needs


| Model                 | Inference Speed | Memory  |
|-----------------------|-----------------|---------|
| Llama-3.2-1B-Instruct | 53 tokens/sec   | 5 GB    |
| Llama-3.2-3B-Instruct | 32.5 tokens/sec | 13 GB   |
| Llama-3.1-8B-Instruct | 28.5 tokens/sec | 32.1 GB |

Note: Inference speeds were estimated on an NVIDIA A100-SXM4-40GB GPU and count output tokens only (input tokens are excluded).
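The memory column is close to what the weights alone would occupy at 32-bit precision. The following is a rough back-of-the-envelope sketch, not a measurement: the parameter counts (1.23B, 3.21B, 8.03B) are approximate assumptions, and activations and the KV cache are ignored.

```python
# Rough estimate of the memory needed to hold model weights only.
# Parameter counts are approximate assumptions, not measured values.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

APPROX_PARAMS = {
    "Llama-3.2-1B-Instruct": 1.23e9,
    "Llama-3.2-3B-Instruct": 3.21e9,
    "Llama-3.1-8B-Instruct": 8.03e9,
}

def weight_memory_gb(n_params, dtype="float32"):
    """Return the weight footprint in decimal GB (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name, n_params in APPROX_PARAMS.items():
    fp32 = weight_memory_gb(n_params, "float32")
    bf16 = weight_memory_gb(n_params, "bfloat16")
    print(f"{name}: {fp32:.1f} GB (float32), {bf16:.1f} GB (bfloat16)")
```

Under the same assumptions, `weight_memory_gb(70e9, "bfloat16")` gives about 140 GB, which is consistent with the 70B model needing a multi-GPU instance.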

## Colab Pro hardware options

| Accelerator     | High-RAM | CPU Cores | GPU Model  | GPU Memory | RAM         | Disk Space | GPU Capability Score |
|-----------------|----------|-----------|------------|------------|-------------|------------|----------------------|
| **CPU Only**    | No       | 2-4       | None       | N/A        | ~12 GB      | ~100 GB    | N/A                  |
| **CPU Only**    | Yes      | 2-4       | None       | N/A        | ~25 GB      | ~150 GB    | N/A                  |
| **A100 GPU**    | No       | 8         | NVIDIA A100 | 40 GB     | ~12 GB      | ~100 GB    | 9                    |
| **A100 GPU**    | Yes      | 8         | NVIDIA A100 | 40 GB     | ~25 GB      | ~150 GB    | 9                    |
| **L4 GPU**      | No       | 4-8       | NVIDIA L4  | 24 GB      | ~12 GB      | ~100 GB    | 8                    |
| **L4 GPU**      | Yes      | 4-8       | NVIDIA L4  | 24 GB      | ~25 GB      | ~150 GB    | 8                    |
| **T4 GPU**      | No       | 4-8       | NVIDIA T4  | 16 GB      | ~12 GB      | ~100 GB    | 7                    |
| **T4 GPU**      | Yes      | 4-8       | NVIDIA T4  | 16 GB      | ~25 GB      | ~150 GB    | 7                    |
| **TPU v2-8**    | No       | -         | TPU v2-8   | 64 GB      | ~12 GB      | ~100 GB    | 7.5                  |
| **TPU v2-8**    | Yes      | -         | TPU v2-8   | 64 GB      | ~25 GB      | ~150 GB    | 7.5                  |

Notes

- **RAM**: With High-RAM selected, Colab increases available RAM from around 12 GB to approximately 25 GB or more.
- **CPU Cores**: Actual CPU core count can vary depending on availability; typically, Colab Pro allocates between 2-8 cores depending on the selected hardware.
- **Disk Space**: Disk space also varies based on High-RAM selection and can fluctuate slightly based on Colab's backend resources.
- **GPU Capability Score**: Estimated relative score based on GPU performance for deep learning tasks; the scores are approximations for comparison purposes only.
- **TPU v2-8**: TPUs are not GPUs; they are Tensor Processing Units, accelerators designed specifically for deep learning workloads.

The **High-RAM option** generally adds more memory (RAM) and disk space, but does not impact GPU memory or capability score.


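Changes noticed (compared with the table above):

* CPU only, High-RAM: ~50 GB RAM, 8 CPU cores, 220 GB total disk space.

* A100 GPU: 83 GB RAM, 235 GB disk space. GPU capability score: 8. Note that `torch.cuda.get_device_capability()` reports (8, 0), which is the CUDA compute capability rather than a performance score.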

## Modeling Strategy: Use Criterion-Specific LLMs

To streamline fine-tuning, consider building a separate LLM for each individual grading criterion. Since each criterion may appear across multiple rubrics, organizing fine-tuning by criterion rather than by rubric simplifies the process, addressing each assessment dimension independently.

Fine-tuning needs vary depending on the complexity of the writing criterion:

* **Basic tasks** (e.g., grammar, sentence structure) can be managed by smaller models.
* **Context-sensitive tasks** (e.g., audience appropriateness) require medium-sized models.
* **High-level assessments** (e.g., cohesiveness, content quality) are best suited for larger, fine-tuned models.

Also, justifying an assessment can be more complex than generating a score alone. Decide whether to handle justification separately from scoring or together with it, and consider which approach adapts more easily to custom grading rubrics.

### Writing Assessment Alone

| Writing Criteria           | Model Complexity (Parameters)     | Fine-Tuning Required? |
|---------------------------------------|-----------------------------------|------------------------|
| Grammar                               | Low (100M - 500M)                | No                    |
| Citation and Referencing              | Low (100M - 500M)                | No                    |
| Adherence to Prompt                   | Low (100M - 500M)                | No                    |
| Sentence Structure                    | Medium (500M - 1B)               | No                    |
| Language Appropriateness for Audience | Medium (500M - 1B)               | Yes                   |
| Vocabulary Usage                      | Medium (500M - 1B)               | Yes                   |
| Writing Cohesiveness                  | High (1B - 6B)                   | Yes                   |
| Argumentation Quality                 | High (1B - 6B)                   | Yes                   |
| Creativity and Originality            | High (1B - 6B)                   | Yes                   |
| Content Quality                       | Very High (6B+)                  | Yes                   |

### Writing Assessment Justification

| Writing Criteria                     | Model Complexity (Parameters)     | Fine-Tuning Required? |
|--------------------------------------|-----------------------------------|------------------------|
| Grammar                              | Medium (500M - 1B)               | No                     |
| Citation and Referencing             | Medium (500M - 1B)               | No                     |
| Adherence to Prompt                  | Medium (500M - 1B)               | No                     |
| Sentence Structure                   | Medium-High (1B - 3B)            | Yes                    |
| Language Appropriateness for Audience | High (3B - 6B)                   | Yes                    |
| Vocabulary Usage                     | High (3B - 6B)                   | Yes                    |
| Writing Cohesiveness                 | Very High (6B+)                   | Yes                    |
| Argumentation Quality                | Very High (6B+)                   | Yes                    |
| Creativity and Originality           | Very High (6B+)                   | Yes                    |
| Content Quality                      | Very High (6B+)                   | Yes                    |
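The two tables above can be kept as simple lookup structures so that later steps can query them per criterion. This is only a sketch: the dictionary and function names are hypothetical, and the values are transcribed from the tables.

```python
# Lookup tables transcribed from the two tables above.
# Each entry: criterion -> (model size tier, fine-tuning required?)
SCORE_ONLY = {
    "Grammar": ("Low (100M - 500M)", False),
    "Citation and Referencing": ("Low (100M - 500M)", False),
    "Adherence to Prompt": ("Low (100M - 500M)", False),
    "Sentence Structure": ("Medium (500M - 1B)", False),
    "Language Appropriateness for Audience": ("Medium (500M - 1B)", True),
    "Vocabulary Usage": ("Medium (500M - 1B)", True),
    "Writing Cohesiveness": ("High (1B - 6B)", True),
    "Argumentation Quality": ("High (1B - 6B)", True),
    "Creativity and Originality": ("High (1B - 6B)", True),
    "Content Quality": ("Very High (6B+)", True),
}

JUSTIFICATION = {
    "Grammar": ("Medium (500M - 1B)", False),
    "Citation and Referencing": ("Medium (500M - 1B)", False),
    "Adherence to Prompt": ("Medium (500M - 1B)", False),
    "Sentence Structure": ("Medium-High (1B - 3B)", True),
    "Language Appropriateness for Audience": ("High (3B - 6B)", True),
    "Vocabulary Usage": ("High (3B - 6B)", True),
    "Writing Cohesiveness": ("Very High (6B+)", True),
    "Argumentation Quality": ("Very High (6B+)", True),
    "Creativity and Originality": ("Very High (6B+)", True),
    "Content Quality": ("Very High (6B+)", True),
}

def model_requirements(criterion, with_justification=False):
    """Return the size tier and fine-tuning flag for one criterion."""
    table = JUSTIFICATION if with_justification else SCORE_ONLY
    if criterion not in table:
        raise KeyError(f"Unsupported criterion: {criterion}")
    tier, needs_fine_tuning = table[criterion]
    return {"criterion": criterion, "tier": tier, "fine_tuning": needs_fine_tuning}

print(model_requirements("Grammar"))
print(model_requirements("Vocabulary Usage", with_justification=True))
```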



## Steps

0. Set-Up

1. Prompting

2. Fine-tuning: https://www.llama.com/docs/how-to-guides/fine-tuning/

3. Orchestrate. Use LangChain to orchestrate LLM calls.

4. Deploy: https://nlpcloud.com/how-to-install-and-deploy-llama-3-into-production.html?



## 0. Set-Up

In [None]:
# Set device
import torch
# Hugging Face pipelines take an integer device: -1 means CPU, >= 0 is the CUDA device ordinal.
device = 0 if torch.cuda.is_available() else -1
print("Device: ", device)

Device:  0


In [None]:
import psutil
import torch
import os

def computing_info():
  #print(f"PyTorch version: {torch.__version__}")
  #print(f"CUDA version: {torch.version.cuda}")
  # Get RAM information
  mem = psutil.virtual_memory()
  print("CPU")
  print(f"Total RAM: {round(mem.total / (1024.0 ** 3), 2)} GB")
  print(f"Available RAM: {round(mem.available / (1024.0 ** 3), 2)} GB")
  NUM_WORKERS = os.cpu_count()
  print(f"Number of cores: {NUM_WORKERS}")

  # Get GPU information (if available).
  # Note: VRAM is converted with 1e-9 (decimal GB), while RAM and disk use 1024 ** 3 (GiB).
  if torch.cuda.is_available():
    total_free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
    print("\nGPU")
    print(f"Total VRAM: {round(total_gpu_memory * 1e-9, 3)} GB")
    print(f"Available VRAM: {round(total_free_gpu_memory * 1e-9, 3)} GB")
    # get_device_capability() returns the CUDA compute capability as (major, minor)
    GPU_SCORE = torch.cuda.get_device_capability()
    print(f"GPU capability score: {GPU_SCORE}")
  else:
    print("No GPU available.")

  # Get disk space information
  disk = psutil.disk_usage('/')
  print("\nDisk")
  print(f"Total space: {round(disk.total / (1024.0 ** 3), 2)} GB")
  print(f"Available space: {round(disk.free / (1024.0 ** 3), 2)} GB")
computing_info()

CPU
Total RAM: 83.48 GB
Available RAM: 79.91 GB
Number of cores: 12

GPU
Total VRAM: 42.482 GB
Available VRAM: 36.881 GB
GPU capability score: (8, 0)

Disk
Total space: 235.68 GB
Available space: 201.09 GB


In [None]:
!nvidia-smi

Mon Oct 28 17:31:55 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              48W / 400W |   5341MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## 1. Prompting

In [None]:
#!huggingface-cli login

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

model_name = "meta-llama/Llama-3.2-1B-Instruct"

pipe = pipeline("text-generation",
                model=model_name,
                device = device,
                max_new_tokens = 1000)

In [None]:
computing_info()

CPU
Total RAM: 83.48 GB
Available RAM: 79.25 GB
Number of cores: 12

GPU
Total VRAM: 42.482 GB
Available VRAM: 32.001 GB
GPU capability score: (8, 0)

Disk
Total space: 235.68 GB
Available space: 201.09 GB


In [None]:
import time
from transformers import AutoTokenizer

user_prompts = ["Who are you?",
                "Explain why 8 + 8 = 16.",
                "Write a poem about LLMs.",
                "Explain the impact of generative AI on corporate carbon goals."]
outputs = []

start_time = time.time()
for user_prompt in user_prompts:
  messages = [
      {"role": "user", "content": user_prompt},
  ]
  # Greedy decoding: the defaults are temperature = 0.6 and top_p = 0.9,
  # which need to be unset when do_sample = False.
  output = pipe(messages,
                do_sample=False, temperature=None, top_p=None,
                pad_token_id=pipe.tokenizer.eos_token_id)
  # Index 1 is the assistant's reply (index 0 is the user message)
  outputs.append(output[0]['generated_text'][1]['content'])

end_time = time.time()

# Approximate the number of generated tokens by re-tokenizing the joined outputs
tokenizer = AutoTokenizer.from_pretrained(model_name)
n_output_tokens = len(tokenizer.encode(". ".join(outputs)))

# Calculate inference speed
elapsed_time = end_time - start_time
inference_speed = n_output_tokens / elapsed_time  # tokens per second
print(inference_speed)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


56.36671471830743
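The measurement above can be wrapped in a small reusable helper so the same timing logic applies to each model. This is a minimal sketch under stated assumptions: `tokens_per_second`, `generate` and `count_tokens` are hypothetical names, and the stand-in generator below only echoes the prompt so the example runs without a model.

```python
import time

def tokens_per_second(generate, count_tokens, prompts):
    """Time `generate` over `prompts` and return output tokens per second.

    `generate` maps a prompt string to the generated text and
    `count_tokens` maps a text to its token count. Both are
    hypothetical callables supplied by the caller.
    """
    start = time.perf_counter()
    outputs = [generate(p) for p in prompts]
    elapsed = time.perf_counter() - start
    n_tokens = sum(count_tokens(text) for text in outputs)
    return n_tokens / elapsed if elapsed > 0 else float("inf")

# Stand-ins so the sketch runs without a model: echo the prompt and
# count whitespace-separated words instead of real tokens.
def fake_generate(prompt):
    time.sleep(0.01)
    return prompt + " answer"

speed = tokens_per_second(fake_generate, lambda t: len(t.split()), ["Who are you?"])
print(f"{speed:.1f} tokens/sec")
```

With the pipeline used in this notebook, `generate` could wrap the `pipe(...)` call and `count_tokens` could be `lambda t: len(pipe.tokenizer.encode(t, add_special_tokens=False))`, which avoids counting special tokens and the `". "` separators.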


## Next

* Start tackling small evaluation tasks that do not require fine-tuning.
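As a starting point, a prompt-only evaluation task could be expressed in the same chat-message format used in the prompting step. The sketch below is illustrative only: the function name, the system prompt wording and the 1-5 scale are assumptions, not a tested grading prompt.

```python
def build_grading_messages(criterion, scale, student_text):
    """Build chat messages asking the model to score one criterion."""
    system = (
        "You are a writing assessor. "
        f"Score the text on '{criterion}' using the scale {scale}. "
        "Reply with the score only."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": student_text},
    ]

messages = build_grading_messages(
    "Grammar", "1-5", "Their going to the store tomorrow."
)
for m in messages:
    print(m["role"], ":", m["content"])
```

The resulting list could then be passed to `pipe(messages, ...)` in place of the single user message used earlier.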

## 2. Fine-Tuning

## 3. Orchestration

Will likely have a variety of models to use in different circumstances (e.g. a particular grading rubric will require a set of models). Note: need to consider how to handle "unsupported criteria" (likely through a proprietary model until supported) and how to handle custom grading rubrics that define the criteria levels differently.
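One way to picture the routing described above is a registry that maps each criterion to a model and falls back to a default for unsupported criteria. This is a plain-Python sketch, independent of any orchestration library; the registry contents and model identifiers are placeholders, not real models.

```python
# Hypothetical registry: criterion -> identifier of the model that handles it.
CRITERION_MODELS = {
    "Grammar": "grammar-scorer",
    "Sentence Structure": "sentence-structure-scorer",
}

# Placeholder name for the model used until a criterion is supported.
FALLBACK_MODEL = "proprietary-fallback"

def plan_rubric(criteria):
    """Map each rubric criterion to a model, using the fallback if unsupported."""
    return {c: CRITERION_MODELS.get(c, FALLBACK_MODEL) for c in criteria}

plan = plan_rubric(["Grammar", "Argumentation Quality"])
print(plan)
```

Custom rubrics that define criteria levels differently would need an extra mapping from each rubric's levels to the model's scale, which this sketch does not cover.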

## 4. Deployment