### Specific Model Loading vs. AutoModel Loading

* This notebook compares model-specific loading via `{ModelName}ForCausalLM.from_pretrained()` with `AutoModelForCausalLM.from_pretrained()` to see whether either approach causes problems, specifically with the Falcon model.
* We use the 1B-parameter Falcon model (`tiiuae/falcon-rw-1b`) because this is being run on a computer with a small GPU.

#### Check CUDA Availability

* First, we need to confirm that CUDA is available. We can start with the `nvidia-smi` shell tool.

In [1]:
!nvidia-smi

Tue Jan 16 21:00:41 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1070        Off | 00000000:01:00.0 Off |                  N/A |
| 27%   32C    P8               6W / 151W |     16MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
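
* The same information can be pulled into Python, which is handy for logging it next to later results. The sketch below assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` CSV output; the helper names `parse_gpu_csv` and `query_gpus` are made up for this notebook and are not part of any library.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["name", "memory.total", "memory.used"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma inside the GPU name cannot shift columns.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append(
            {
                "name": name,
                "memory_total_mib": int(total),
                "memory_used_mib": int(used),
            }
        )
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (hypothetical helper)."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"nvidia-smi is not usable here: {err}")
```

* With `nounits`, the memory columns come back as plain MiB integers, so they can be compared directly with the `MiB` figures in the table above.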
                                                                    

#### Check Torch Capability

* We need a working Torch installation.
* The README-BOOT file at https://github.com/pwdel/gpu-jupyter-tensorflow/tree/main should help you get this going.
* In short: install torch with the `-U` flag to get the right version. You may need to restart the kernel afterwards.

In [4]:
!pip install -U torch



* The version we're looking for is `2.1.2+cu121`

import torch
print(torch.__version__)
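
* The `+cu121` suffix is what signals a CUDA 12.1 build. A minimal sketch for checking it programmatically follows; `split_torch_version` and `is_cuda_build` are hypothetical helper names, and the check assumes the wheel carries a local build tag at all.

```python
def split_torch_version(version):
    """Split a version string like '2.1.2+cu121' into (release, build_tag)."""
    release, _, build_tag = version.partition("+")
    # partition() yields an empty tag when there is no '+'; report that as None.
    return release, build_tag or None


def is_cuda_build(version):
    """Return True when the build tag looks like a CUDA tag (starts with 'cu')."""
    _, build_tag = split_torch_version(version)
    return build_tag is not None and build_tag.startswith("cu")


if __name__ == "__main__":
    try:
        import torch

        print(split_torch_version(torch.__version__))
        print(is_cuda_build(torch.__version__))
    except ImportError:
        print("torch is not installed in this environment")
```

* A missing tag does not by itself prove a CPU-only build; `torch.version.cuda` is the more direct signal and is `None` when PyTorch was built without CUDA.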

* Moreover, the following cell should show the GPU and its memory capacity.

In [10]:
import torch
print(torch.__version__)

if torch.cuda.is_available():
    print("GPU is available.")

    # Only query the device once CUDA is confirmed; these calls raise otherwise.
    current_device = torch.cuda.current_device()
    print(f"Current GPU device: {current_device}")

    # Pass a different index here to inspect another GPU.
    device_properties = torch.cuda.get_device_properties(current_device)
    print(f"GPU Name: {device_properties.name}")
    print(f"GPU Memory Capacity: {device_properties.total_memory / 1e9} GB")
else:
    print("GPU is not available. Check your CUDA installation.")

2.1.2+cu121
GPU is available.
Current GPU device: 0
GPU Name: NVIDIA GeForce GTX 1070
GPU Memory Capacity: 8.501919744 GB
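
* Note that the cell above divides by `1e9` (decimal GB), while `nvidia-smi` reports MiB, so the two figures are not in the same unit. The sketch below converts the byte count implied by the output above into each unit; the helper names are made up for illustration.

```python
def bytes_to_gb(num_bytes):
    """Decimal gigabytes (1 GB = 10**9 bytes), as used in the cell above."""
    return num_bytes / 1e9


def bytes_to_gib(num_bytes):
    """Binary gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


def bytes_to_mib(num_bytes):
    """Binary mebibytes (1 MiB = 2**20 bytes), the unit nvidia-smi prints."""
    return num_bytes / 2**20


# Byte count implied by the printed value 8.501919744 GB.
total_memory = 8_501_919_744

print(f"{bytes_to_gb(total_memory):.2f} GB")
print(f"{bytes_to_gib(total_memory):.2f} GiB")
print(f"{bytes_to_mib(total_memory):.0f} MiB")
```

* In MiB, the value reported by torch comes out slightly below the 8192MiB shown by `nvidia-smi`, so the unit difference explains only part of the gap between the two tools.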


#### Automatic Model Loading

* First, we try automatic model loading, which we have previously used successfully.
* We start by importing `AutoModelForCausalLM` and `AutoTokenizer` from the transformers library.

In [11]:
# imports - Auto{Stuff}
from transformers import AutoModelForCausalLM, AutoTokenizer



* Having imported the right tools, we load from https://huggingface.co/tiiuae/falcon-rw-1b, which is a 1B-parameter model and should not take up much memory.

In [13]:
model_identifier = "tiiuae/falcon-rw-1b"
# Tokenizers are lightweight and live in RAM. They convert text into token IDs that the model can consume.
tokenizer = AutoTokenizer.from_pretrained(model_identifier)
# from_pretrained loads the weights onto the CPU by default; the model is moved to the GPU in a later cell.
model = AutoModelForCausalLM.from_pretrained(model_identifier, trust_remote_code=True)

Downloading tokenizer_config.json: 100%|██████████| 234/234 [00:00<00:00, 44.2kB/s]
Downloading vocab.json: 100%|██████████| 798k/798k [00:00<00:00, 5.49MB/s]
Downloading merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 13.3MB/s]
Downloading tokenizer.json: 100%|██████████| 2.11M/2.11M [00:00<00:00, 8.07MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 99.0/99.0 [00:00<00:00, 47.6kB/s]
Downloading config.json: 100%|██████████| 1.05k/1.05k [00:00<00:00, 846kB/s]
Downloading (…)figuration_falcon.py: 100%|██████████| 6.70k/6.70k [00:00<00:00, 3.10MB/s]
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-rw-1b:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading modeling_falcon.py: 100%|██████████| 56.9k/56.9k [00:00<00:00, 2.00MB/s]
A new version of the following files was downloaded from https://hug
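
* The warning above appears because `trust_remote_code=True` runs Python files fetched from the model repository, and it suggests pinning a revision. `from_pretrained` accepts a `revision` argument (a branch name, tag, or commit hash) for this. The sketch below is only an illustration: `load_pinned` is a hypothetical helper, and `"main"` is a placeholder that does not actually pin anything, since the real commit hash is not given here.

```python
MODEL_ID = "tiiuae/falcon-rw-1b"
# Placeholder only: replace with a specific commit hash from the model
# repository to pin the downloaded code and weights.
REVISION = "main"


def load_pinned(model_id=MODEL_ID, revision=REVISION):
    """Load tokenizer and model at a fixed revision (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        revision=revision,
        trust_remote_code=True,
    )
    return tokenizer, model
```

* Calling `tokenizer, model = load_pinned()` triggers the same download as the cell above, so it is best run once the desired revision has been chosen.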

* The `from_pretrained` method in the Hugging Face Transformers library does not automatically move the model to a GPU device.
* Hence, given that CUDA was shown to be available above, we should move the model to the GPU explicitly.

In [14]:
# Move model to GPU if available
if torch.cuda.is_available():
    model = model.to('cuda:0')  # Move model to the first GPU device
else:
    print("No GPU available, using CPU.")

* We may then verify that the model has been moved to the device.

In [15]:
device = next(model.parameters()).device
print(f"Model is on device: {device}")

Model is on device: cuda:0


* The above demonstrates that AutoModel loading works and that Falcon can be moved to the GPU. At this point the model could be deleted from the device.
* Before doing so, we can go one step further and look at how much GPU memory the model is actually taking up.
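* A rough expectation for that figure can be computed from the model itself by summing the storage of its parameters. This is a sketch under the assumption that every parameter exposes `numel()` and `element_size()`, as PyTorch tensors do; `parameter_bytes` and `describe_footprint` are hypothetical helper names.

```python
def parameter_bytes(parameters):
    """Sum the storage, in bytes, of tensors exposing numel() and element_size()."""
    return sum(p.numel() * p.element_size() for p in parameters)


def describe_footprint(parameters):
    """Format the total parameter storage in MiB."""
    total = parameter_bytes(parameters)
    return f"{total / 2**20:.1f} MiB"


# In the notebook, with `model` loaded by the cells above:
# print(describe_footprint(model.parameters()))
```

* Transformers models also expose `get_memory_footprint()`, which returns a comparable byte count. Either number covers the weights only, so a driver-level reading will normally be higher because it also includes the CUDA context and any other processes using the GPU.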

In [17]:
!pip install pynvml

Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 524.9 kB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0


In [19]:
import pynvml

# Initialize the NVML library once before querying any device
pynvml.nvmlInit()

def log_gpu_stats():
    try:
        gpu_count = pynvml.nvmlDeviceGetCount()
        print(f"Number of GPUs: {gpu_count}")

        for i in range(gpu_count):
            print(i)
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            temperature = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)

            print(f"GPU {i + 1} - Name: {name}, Temperature: {temperature}°C,"
                        f" Memory Used: {memory_info.used / 1024 / 1024} MB,"
                        f" GPU Utilization: {utilization.gpu}%, Memory Utilization: {utilization.memory}%")
    except Exception as e:
        print(f"Error logging GPU stats: {e}")

In [20]:
log_gpu_stats()

Number of GPUs: 1
0
GPU 1 - Name: NVIDIA GeForce GTX 1070, Temperature: 32°C, Memory Used: 5190.4375 MB, GPU Utilization: 0%, Memory Utilization: 0%
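
* With the memory reading recorded, the model can be removed from the GPU. A minimal sketch follows, assuming the notebook holds no other references to the model; `release_cuda_cache` is a hypothetical helper name.

```python
import gc


def release_cuda_cache():
    """Collect garbage, then ask PyTorch to release its cached CUDA memory.

    Returns True if the CUDA cache was emptied, False if torch or CUDA
    is not available in this environment.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


# In the notebook, drop the reference first, then release the cache:
# del model
# release_cuda_cache()
```

* Running `log_gpu_stats()` again afterwards should show whether the memory was actually returned. In Jupyter, output history can keep extra references alive, so the drop may not be complete until the kernel is restarted.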

