# Llama 3 "Hello World"

In this notebook, we demonstrate a simple "Hello World" example that runs [Llama 3](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), a large language model built by Meta, in the BioData Catalyst (BDC) environment. Llama 3 handles natural language understanding and generation, so users can build text-based applications with minimal setup. A key benefit of running Llama 3 inside the BDC security perimeter is that the model operates entirely within the secure environment, so it can work with sensitive data the user is authorized to view. No data is sent back to external providers such as OpenAI or Meta, which helps maintain data privacy and compliance. This notebook walks through the setup and generates text from a simple input prompt, as a starting point for working with Llama 3 securely.

Currently, this notebook works in the BDC Powered by [Terra](https://terra.biodatacatalyst.nhlbi.nih.gov/) notebook environment. It does not yet work in the BDC Powered by Seven Bridges notebook environment because of CUDA driver limitations. In the future we will work with Velsera to support it there.

## Requirements

To run this demo, you will need to do the following:

* Set up your [Terra](https://terra.biodatacatalyst.nhlbi.nih.gov/) account, including a billing group (you can apply for startup credits in BDC if needed). You do not need access to any controlled-access BDC data for this demo.
* Set up a [Hugging Face](https://huggingface.co/) account and generate an [access token](https://huggingface.co/settings/tokens) with at least READ permissions.
* Apply for access to the [Llama 3](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model(s). Approval can take minutes to hours, so do this early.
* Start up a Jupyter environment on Terra using the settings below.
* Upload this notebook file to your running Jupyter environment and then execute each cell.

We confirmed that the following host environment settings work:

<img src="https://raw.githubusercontent.com/nhlbidatastage/bdc-ai-tiger-team/25e338665f3a327d8263a5d74baecf09b2f92f54/objective_1.1/env.png" alt="The host env settings" width="500"/>

## Dependency Install

The `pip install` below installs the minimal dependencies.

In [1]:
! pip install --upgrade transformers torch
# !pip install 'accelerate>=0.26.0'



## Confirm CUDA & Check GPUs

The following cell checks whether CUDA (GPU acceleration) is working and lists the GPUs that PyTorch and TensorFlow can see. Make sure the first line of output is `True` before moving on.

In [2]:
# Check CUDA support
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.cuda.current_device())
print(torch.cuda.get_device_name())

# check GPUs
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

True
12.4
0
Tesla V100-SXM2-16GB


2024-10-22 18:32:51.261317: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-22 18:32:51.390844: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-22 18:32:54.177642: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8269998018210698186
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 15510405120
locality {
  bus_id: 1
  links {
    link {
      device_id: 1
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 10206793880554098797
physical_device_desc: "device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, compute capability: 7.0"
xla_global_id: 416903419
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 15510405120
locality {
  bus_id: 1
  links {
    link {
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 7677092801118402115
physical_device_desc: "device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0"
xla_global_id: 2144165316
]


2024-10-22 18:32:56.863940: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-10-22 18:32:56.864306: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-10-22 18:32:56.873115: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-10-22 18:32:56.873454: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-10-22 18:32:56.873738: I tensorflow/compiler/xla/stream_executo
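
The checks above call `torch.cuda.current_device()` directly, which raises an error when no GPU is visible. As a hedged alternative, the sketch below reports each visible GPU and its total memory and returns an empty list instead of failing. The helper name `describe_gpus` is hypothetical and not part of the original notebook.

```python
def describe_gpus():
    """Return (index, name, total memory in GiB) for each visible CUDA GPU.

    Returns an empty list when PyTorch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        gpus.append((index, props.name, round(props.total_memory / 2**30, 1)))
    return gpus


gpus = describe_gpus()
if gpus:
    for index, name, memory_gib in gpus:
        print(f"GPU {index}: {name}, {memory_gib} GiB")
else:
    print("No CUDA GPU visible; check the environment settings before continuing.")
```

On the environment shown in the output above, this would be expected to list the two Tesla V100 devices.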

## Log in to Hugging Face

The next step is to log in to Hugging Face so you can pull the Llama 3 model. Hugging Face acts as a model repository, providing all the model files that this notebook needs to launch and use the model.

Execute the cell below and it will present a token form. Paste your Hugging Face READ access token into the form and click Login. You can uncheck the "add token as git credential" option.

**Ensure you have applied for access to use the [Llama 3](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) models before moving past this step**

In [3]:
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Create Transformer Pipeline

This code creates a text generation pipeline using the Hugging Face `transformers` library. The `pipeline` function is configured for text generation by passing `"text-generation"` as the task type. The `model` parameter names the pre-trained model to load, and `model_kwargs` sets the model's data type to `float16` for reduced memory usage and faster computation. The `device` parameter specifies whether the model runs on a CPU or a GPU; here it is set to the GPU when CUDA is available and to the CPU otherwise.

In [4]:
import transformers
import torch
model = "meta-llama/Meta-Llama-3.1-8B"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Release cached GPU memory held by PyTorch
    print(torch.cuda.memory_summary())  # Only meaningful when a GPU is present

pipeline = transformers.pipeline(
    #"text-generation", model=model, device=device
    #"text-generation", model=model, device=device, device_map="balanced"
    #"text-generation", model=model, device_map="balanced"
    #"text-generation", model=model, model_kwargs={"torch_dtype": torch.float16}, device_map="auto"
    # possibly reduces memory, see https://medium.com/@rohanvermaAI/llama-3-what-we-know-and-how-to-use-it-in-free-collab-24ec5d6058ff
    # "text-generation", model=model, model_kwargs={"torch_dtype": torch.bfloat16, "load_in_4bit": True}, device=device
    "text-generation", model=model, model_kwargs={"torch_dtype": torch.float16}, device=device
)

|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
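
To see why the `float16` setting matters on a 16 GB GPU, the following back-of-the-envelope sketch estimates the memory needed for the model weights alone. It assumes roughly 8.03 billion parameters for the 8B model, which is an approximation and not a value taken from this notebook. The helper name `weight_memory_gib` is hypothetical, and the estimate ignores activations and other runtime overhead.

```python
# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

# Assumption: approximate parameter count of the 8B model.
APPROX_PARAMS = 8.03e9


def weight_memory_gib(num_params, dtype):
    """Estimate the memory (in GiB) needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gib(APPROX_PARAMS, dtype):.1f} GiB")
```

Under this assumption the `float16` weights come to roughly 15 GiB, which is in line with the 15316 MiB reported as allocated by the memory summary in the text generation cell, while `float32` would need about twice that and would not fit on a single 16 GB V100.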

## Text Generation

This cell generates text with the pre-trained model. It first initializes a tokenizer with the `AutoTokenizer` class from the Hugging Face library. The tokenizer converts input text into a format the model can process, and here it also supplies the end-of-sequence token ID. The pipeline is then called with a prompt asking what meal can be made with tomatoes, basil, and cheese. `do_sample=True` enables random sampling during generation, so the output is not deterministic. `top_k=10` restricts sampling to the 10 most likely tokens at each step, and `num_return_sequences=1` requests a single generated sequence. `eos_token_id` tells the model to stop generating when it produces the end-of-sequence token. `max_length=400` caps the total length (prompt plus generated text) at 400 tokens, and `truncation=True` truncates input that exceeds the maximum length. Finally, the code prints the generated text, labeled as the result.

In [5]:
from transformers import AutoTokenizer  # Only the tokenizer is needed here

print(torch.cuda.memory_summary())

tokenizer = AutoTokenizer.from_pretrained(model)
sequences = pipeline(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  15316 MiB |  15316 MiB |  15316 MiB |      0 B   |
|       from large pool |  15316 MiB |  15316 MiB |  15316 MiB |      0 B   |
|       from small pool |      0 MiB |      0 MiB |      0 MiB |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |  15316 MiB |  15316 MiB |  15316 MiB |      0 B   |
|       from large pool |  15316 MiB |  15316 MiB |  15316 MiB |      0 B   |
|       from small pool |      0 MiB |      0 MiB |      0 MiB |      0 B   |
|---------------------------------------------------------------

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Result: I have tomatoes, basil and cheese at home. What can I cook for dinner?
I have tomatoes, basil and cheese at home. What can I cook for dinner?
Tomatoes, basil and cheese are the three main ingredients of one of the most popular Italian dishes: the Caprese salad! The Caprese salad is a typical dish of the Campania region, in particular of the island of Capri. It consists of a simple mix of cherry tomatoes, mozzarella cheese and basil, seasoned with a drizzle of extra virgin olive oil and a pinch of salt. The Caprese salad is usually served as an appetizer or as a side dish to meat dishes, but it can also be served as a main course.
Tomatoes, basil and cheese are the three main ingredients of one of the most popular Italian dishes: the Caprese salad! The Caprese salad is a typical dish of the Campania region, in particular of the island of Capri. It consists of a simple mix of cherry tomatoes, mozzarella cheese and basil, seasoned with a drizzle of extra virgin olive oil and a pin
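
The `do_sample=True` and `top_k=10` settings used above can be illustrated with a small pure-Python sketch of top-k sampling. This is a simplified illustration of the idea, not the `transformers` implementation. The function name `top_k_sample` and the toy scores are hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one token from the k highest-scoring entries of `logits`.

    `logits` maps each candidate token to an unnormalized score.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Keep only the k most likely candidates.
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    # Softmax over the surviving scores (shifted by the max for stability).
    max_score = top[0][1]
    weights = [math.exp(score - max_score) for _, score in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy scores for the next token (hypothetical values).
toy_logits = {"salad": 2.0, "pizza": 1.5, "soup": 0.5, "cake": -1.0, "tea": -3.0}
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]
print(set(samples))
```

With `k=2`, only the two highest-scoring tokens can ever be drawn, however many samples are taken. Setting `k=1` reduces the procedure to always picking the single most likely token.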