<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/onnx/Phi3__ONNX_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ONNX

ONNX Runtime is a cross-platform machine-learning model accelerator with a flexible interface for integrating hardware-specific libraries. It can be used with models from PyTorch, TensorFlow/Keras, TFLite, scikit-learn, and other frameworks.

ONNX Runtime inference powers machine learning models in key Microsoft products and services across Office, Azure, and Bing, as well as dozens of community projects.

Example use cases for ONNX Runtime inferencing include:

- Improve inference performance for a wide variety of ML models
- Run on different hardware and operating systems
- Train in Python but deploy into a C#/C++/Java app
- Train and perform inference with models created in different frameworks


https://onnxruntime.ai/docs/get-started/with-python.html

https://github.com/onnx/onnx/blob/main/docs/Versioning.md

# Run generative AI models with ONNX Runtime
https://onnxruntime.ai/docs/genai/

# ONNXRuntime-genai-cuda
https://onnxruntime.ai/docs/genai/howto/install

```
pip install numpy
pip install onnxruntime-genai-cuda --pre --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```
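After installing, it can help to confirm which distributions actually landed in the environment without importing them. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative, and the package names are the ones from the commands above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the install commands above.
for name in ("numpy", "onnxruntime-genai-cuda"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```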

# cuDNN
https://docs.nvidia.com/deeplearning/cudnn/latest/installation/linux.html

https://github.com/Hardware-Alchemy/cuDNN-sample/tree/master

# Phi3

https://github.com/microsoft/Phi-3CookBook/tree/main?tab=readme-ov-file

https://onnxruntime.ai/docs/genai/tutorials/phi3-python.html#run-with-nvidia-cuda


https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
! pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/


Looking in indexes: https://pypi.org/simple, https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
Collecting onnxruntime-gpu
  Downloading https://aiinfra.pkgs.visualstudio.com/2692857e-05ef-43b4-ba9c-ccf1c22c437c/_packaging/9387c3aa-d9ad-4513-968c-383f6f7f53b8/pypi/download/onnxruntime-gpu/1.18.1/onnxruntime_gpu-1.18.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (201.5 MB)
Collecting coloredlogs (from onnxruntime-gpu)
  Downloading https://aiinfra.pkgs.visualstudio.com/2692857e-05ef-43b4-ba9c-ccf1c22c437c/_packaging/9387c3aa-d9ad-4513-968c-383f6f7f53b8/pypi/download/coloredlogs/15.0.1/coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->on

In [7]:
! pip install onnx==1.14.1 transformers==4.33.1 psutil pandas py-cpuinfo py3nvml coloredlogs wget netron sympy protobuf -q

  Preparing metadata (setup.py) ... done
  Building wheel for wget (setup.py) ... done


# Install onnxruntime-genai-cuda with CUDA 12 support

In [8]:
! pip install numpy
! pip install onnxruntime-genai-cuda --pre --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/

Looking in indexes: https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
Collecting onnxruntime-genai-cuda
  Downloading https://aiinfra.pkgs.visualstudio.com/2692857e-05ef-43b4-ba9c-ccf1c22c437c/_packaging/9387c3aa-d9ad-4513-968c-383f6f7f53b8/pypi/download/onnxruntime-genai-cuda/0.3/onnxruntime_genai_cuda-0.3.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (200.0 MB)
Installing collected packages: onnxruntime-genai-cuda
Successfully installed onnxruntime-genai-cuda-0.3.0


# Install cuDNN 9 with CUDA 12 support

In [9]:
! sudo apt-get -y install zlib1g cudnn9-cuda-12 --quiet

Reading package lists...
Building dependency tree...
Reading state information...
zlib1g is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
The following additional packages will be installed:
  cudnn9-cuda-12-5 libcudnn9-cuda-12 libcudnn9-dev-cuda-12
  libcudnn9-static-cuda-12
The following NEW packages will be installed:
  cudnn9-cuda-12 cudnn9-cuda-12-5 libcudnn9-cuda-12 libcudnn9-dev-cuda-12
  libcudnn9-static-cuda-12
0 upgraded, 5 newly installed, 0 to remove and 45 not upgraded.
Need to get 762 MB of archives.
After this operation, 1,886 MB of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcudnn9-cuda-12 9.2.0.82-1 [380 MB]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcudnn9-dev-cuda-12 9.2.0.82-1 [34.1 kB]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcudnn9-static-cuda-12 9.2.0.82-1 [383 MB]
Get:4 https://developer.downl

In [113]:
import torch
import onnx
import onnxruntime
import transformers
import sys
print("pytorch:", torch.__version__)
print("onnxruntime:", onnxruntime.__version__)
print("onnx:", onnx.__version__)
print("transformers:", transformers.__version__)

pytorch: 2.3.0+cu121
onnxruntime: 1.18.1
onnx: 1.14.1
transformers: 4.33.1


In [114]:
!{sys.executable} -m onnxruntime.transformers.machine_info --silent

2024-07-04 21:04:09.891116: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-04 21:04:09.891169: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-04 21:04:09.892659: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-04 21:04:12,542 - numexpr.utils - INFO: NumExpr defaulting to 12 threads.
{
  "gpu": {
    "driver_version": "535.104.05",
    "devices": [
      {
        "memory_total": 24152899584,
        "memory_available": 11734482944,
        "name": "NVIDIA L4"
      }
    ]
  },
  "cpu": {
    "brand": "Intel(R) Xeon(R) CPU @ 2.20GHz",
    "cores": 6,
    "logic

In [40]:
from huggingface_hub import snapshot_download

In [44]:
snapshot_location = snapshot_download(repo_id="microsoft/Phi-3-mini-4k-instruct-onnx", local_dir="/content/drive/MyDrive/models/onnx_2")

Fetching 75 files:   0%|          | 0/75 [00:00<?, ?it/s]

In [45]:
snapshot_location

'/content/drive/MyDrive/models/onnx_2'

In [14]:
! huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include "cuda/cuda-int4-rtn-block-32/*" --local-dir /content/drive/MyDrive/models/onnx


Fetching 10 files: 100% 10/10 [00:04<00:00,  2.23it/s]
/content/drive/MyDrive/models/onnx
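Before loading a downloaded folder, a quick check that the expected files are present can save a confusing load error. This is a sketch under assumptions: the config file name `genai_config.json` is assumed here, and the helper name `looks_like_genai_model_dir` is made up for illustration.

```python
from pathlib import Path


def looks_like_genai_model_dir(folder):
    """Return True if the folder holds a GenAI config and at least one .onnx file."""
    path = Path(folder)
    if not path.is_dir():
        return False
    # Assumption: the config file is named genai_config.json.
    has_config = (path / "genai_config.json").is_file()
    has_model = any(path.glob("*.onnx"))
    return has_config and has_model


# Folder that the CLI download above is expected to populate.
model_dir = "/content/drive/MyDrive/models/onnx/cuda/cuda-int4-rtn-block-32"
print(model_dir, "->", looks_like_genai_model_dir(model_dir))
```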


In [115]:
import onnxruntime_genai as og
import datetime

In [116]:
model = og.Model("/content/drive/MyDrive/models/onnx/cuda/cuda-int4-rtn-block-32")

In [118]:
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

search_options = {"max_length": 2048, "temperature": 0.3}
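In general, `temperature` rescales a model's next-token scores before they are turned into probabilities: lower values concentrate probability on the top candidates. The sketch below illustrates that idea in plain Python with made-up logits. It is not the runtime's implementation, and whether the runtime applies temperature at all depends on its sampling settings, which this notebook does not set explicitly.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.3, 0.9):      # the two temperatures used in this notebook
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```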

In [119]:
params = og.GeneratorParams(model)
params.try_use_cuda_graph_with_max_batch_size(1)
params.set_search_options(**search_options)

prompt = "<|user|>Who are you not allowed to marry in the UK?<|end|><|assistant|>"
input_tokens = tokenizer.encode(prompt)
params.input_ids = input_tokens

generator = og.Generator(model, params)
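The prompt string above wraps the user's question in the chat markers `<|user|>`, `<|end|>` and `<|assistant|>`. A small helper keeps that wrapping in one place. The function name `build_prompt` is illustrative, and the layout simply mirrors the string used in this notebook.

```python
def build_prompt(user_message):
    """Wrap a user message in the chat markers used by the prompt above."""
    return f"<|user|>{user_message}<|end|><|assistant|>"


prompt = build_prompt("Who are you not allowed to marry in the UK?")
print(prompt)
```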

In [120]:
time1 = datetime.datetime.now()
out = model.generate(params)
tokenizer.decode(out[0])
time2 = datetime.datetime.now()
print(time2-time1)

0:00:02.323243
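For timing, `time.perf_counter` is a monotonic clock intended for measuring intervals, so it can be a more robust choice than subtracting wall-clock datetimes. A minimal sketch: the `timed` helper is hypothetical, and the `sum(...)` call is only a stand-in for `model.generate(params)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Measure the elapsed time of the enclosed block and store it in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("generate", timings):
    total = sum(range(100_000))  # stand-in workload

print(f"generate took {timings['generate']:.4f} s")
```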


In [121]:
tokenizer.decode(out[0])

"Who are you not allowed to marry in the UK?In the United Kingdom, the law does not prohibit marriage between individuals based on their nationality or citizenship. As of my knowledge cutoff in 2023, there are no restrictions on marriage between a British citizen and a foreign national. However, it is important to note that while the law does not prevent such unions, there may be practical considerations regarding visa status and residency rights that could affect a foreign national'dependent' partner's ability to live and work in the UK.\n\nIt is also worth mentioning that while the UK does not have laws that prevent interracial or interethnic marriages, there have been historical instances of discrimination and prejudice. However, these societal attitudes have significantly diminished over time, and the UK is now considered to be a diverse and inclusive society.\n\nIf you are considering marriage in the UK, it is advisable to consult with legal experts or immigration advisors to unde

In [122]:
out = ""
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    t = tokenizer_stream.decode(new_token)
    out = out + t
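The loop above collects only newly generated tokens, whereas the decoded result of `model.generate` earlier starts with the prompt text. One way to get the completion alone from a full token sequence is to slice off the prompt by length. This is a sketch with plain lists standing in for token arrays; the helper name and the token values are made up.

```python
def completion_only(output_ids, prompt_ids):
    """Return the tokens that follow the prompt in a generated sequence."""
    prompt_len = len(prompt_ids)
    # Guard: only slice when the output really starts with the prompt.
    if list(output_ids[:prompt_len]) != list(prompt_ids):
        raise ValueError("output does not start with the prompt tokens")
    return list(output_ids[prompt_len:])


prompt_ids = [101, 7, 42]             # hypothetical prompt tokens
output_ids = [101, 7, 42, 9, 15, 3]   # hypothetical full output
print(completion_only(output_ids, prompt_ids))  # prints [9, 15, 3]
```

Applied to this notebook, the same slicing would presumably take the array returned by `model.generate` and the `input_tokens` passed in, with the result handed to `tokenizer.decode`.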

In [123]:
out

" In the United Kingdom, the law does not prohibit marriage between individuals based on their nationality or citizenship. As of my knowledge cutoff in 2023, there are no restrictions on marriage between a British citizen and a foreign national. However, it is important to note that while the law does not prevent such unions, there may be practical considerations regarding visa status and residency rights that could affect a foreign national'dependent' partner's ability to live and work in the UK.\n\nIt is also worth mentioning that while the UK does not have laws that prevent interracial or interethnic marriages, there have been historical instances of discrimination and prejudice. However, these societal attitudes have significantly diminished over time, and the UK is now considered to be a diverse and inclusive society.\n\nIf you are considering marriage in the UK, it is advisable to consult with legal experts or immigration advisors to understand the implications for both parties, 

In [124]:
onnxruntime.get_available_providers()

['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
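When creating a session for a plain ONNX model, a common pattern is to request only the providers you want, filtered by what is available, with the CPU provider as a fallback. A minimal sketch of the filtering step; the helper `pick_providers` and its default preference order are illustrative choices.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that are available, in preference order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} found in {list(available)}")
    return chosen


# The list printed by onnxruntime.get_available_providers() in this notebook.
available = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
providers = pick_providers(available)
print(providers)  # prints ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

The resulting list can be passed as the `providers` argument of `onnxruntime.InferenceSession`.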

In [61]:
!python3 -m onnxruntime_genai.models.builder --help

usage: builder.py [-h] [-m MODEL_NAME] [-i INPUT] -o OUTPUT -p {int4,fp16,fp32} -e
                  {cpu,cuda,dml,web} [-c CACHE_DIR] [--extra_options KEY=VALUE [KEY=VALUE ...]]

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        Model name in Hugging Face. Do not use if providing an input path to a Hugging Face directory in -i/--input.
  -i INPUT, --input INPUT
                        Input model source. Currently supported options are:
                            hf_path: Path to folder on disk containing the Hugging Face config, model, tokenizer, etc.
                            gguf_path: Path to float16/float32 GGUF file on disk containing the GGUF model
  -o OUTPUT, --output OUTPUT
                        Path to folder to store ONNX model and additional files (e.g. GenAI config, external data files, etc.)
  -p {int4,fp16,fp32}, --precision {int4,fp16,fp32}
                        Precision of model
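The flags in the help text can be assembled into an argument list for `subprocess.run`, which avoids shell quoting problems with paths. A sketch under assumptions: the helper name `builder_command` is illustrative, and only the `-m`, `-o`, `-p`, `-e` and `-c` flags listed in the usage line are used.

```python
import sys


def builder_command(model_name, output_dir, precision, execution_provider, cache_dir=None):
    """Build the argument list for the onnxruntime_genai model builder."""
    cmd = [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", model_name,
        "-o", output_dir,
        "-p", precision,
        "-e", execution_provider,
    ]
    if cache_dir is not None:
        cmd += ["-c", cache_dir]
    return cmd


cmd = builder_command(
    "microsoft/Phi-3-mini-128k-instruct",
    "/content/drive/MyDrive/models/onnx/Phi-3-mini-128k-instruct",
    "int4",
    "cuda",
    cache_dir="/tmp",
)
print(" ".join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```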
 

In [69]:
%pip install flash_attn einops timm mlflow pyngrok transformers accelerate --quiet

  Preparing metadata (setup.py) ... done
  Building wheel for flash_attn (setup.py) ... done


In [71]:
%pip install -U transformers accelerate --quiet

In [None]:
# microsoft/Phi-3-mini-128k-instruct

In [67]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write)

In [72]:
!python3 -m onnxruntime_genai.models.builder -m 'microsoft/Phi-3-mini-128k-instruct' -o '/content/drive/MyDrive/models/onnx/Phi-3-mini-128k-instruct' -p int4 -e cuda -c '/tmp'

Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {}
GroupQueryAttention (GQA) is used in this model.
model.safetensors.index.json: 100% 16.3k/16.3k [00:00<00:00, 63.8MB/s]
Downloading shards:   0% 0/2 [00:00<?, ?it/s]
model-00001-of-00002.safetensors:   0% 0.00/4.97G [00:00<?, ?B/s]
model-00001-of-00002.safetensors:   1% 31.5M/4.97G [00:00<00:16, 297MB/s]
model-00001-of-00002.safetensors:   2% 83.9M/4.97G [00:00<00:13, 360MB/s]
model-00001-of-00002.safetensors:   3% 126M/4.97G [00:00<00:14, 346MB/s]
model-00001-of-00002.safetensors:   3% 168M/4.97G [00:00<00:13, 357MB/s]
model-00001-of-00002.safetensors:   4% 220M/4.97G [00:00<00:12, 383MB/s]
model-00001-of-00002.safetensors:   5% 273M/4.97G [00:00<00:11, 401MB/s]
model-00001-of-00002.safetensors:   6% 315M/4.97G [00:00<00:12, 384MB/s]
model-00001-of-00002.safetensors:   7% 357M/4.97G [00:00<00:12, 385MB/s]
model-000
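The builder's first output line lists the precision and execution provider pairs it accepts. Checking a pair before starting a multi-gigabyte download can save time. The sketch below hard-codes the combinations printed above; the helper name is made up, and the set may differ in other versions of the builder.

```python
# Combinations copied from the builder output above (may differ across versions).
VALID_COMBINATIONS = {
    ("fp32", "cpu"), ("fp32", "cuda"),
    ("fp16", "cuda"), ("fp16", "dml"),
    ("int4", "cpu"), ("int4", "cuda"), ("int4", "dml"),
}


def is_valid_combination(precision, execution_provider):
    """Return True if the precision / execution provider pair is in the list."""
    return (precision.lower(), execution_provider.lower()) in VALID_COMBINATIONS


print(is_valid_combination("int4", "cuda"))  # prints True
print(is_valid_combination("fp16", "cpu"))   # prints False
```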

In [126]:
del model

In [127]:
import torch
import gc
gc.collect()
torch.cuda.empty_cache()


In [128]:
model2 = og.Model("/content/drive/MyDrive/models/onnx/Phi-3-mini-128k-instruct")

In [129]:
tokenizer2 = og.Tokenizer(model2)
tokenizer_stream2 = tokenizer2.create_stream()

search_options = {"max_length": 512, "temperature": 0.9}

In [130]:
params = og.GeneratorParams(model2)
params.try_use_cuda_graph_with_max_batch_size(1)
params.set_search_options(**search_options)

prompt = "<|user|>Tell me a joke about taxidrivers<|end|><|assistant|>"
input_tokens = tokenizer2.encode(prompt)
params.input_ids = input_tokens

generator2 = og.Generator(model2, params)

In [131]:
out2 = model2.generate(params)

In [132]:
print(tokenizer2.decode(out2[0]))

Tell me a joke about taxidriversCon-tribute Answer: The question is- ising with the knowledge with--(outside of-know- 3 to respond about a new-

Think: (al: the taxi:
comet', for--tell-erici, in Italian,oss,i ions of not) and this  peuvent- oislock—with theft

\"\'\ custom\ rice\t\tpic\r-lo the, <<:teh) Call-  (6)$\fword tori\tagas) the has exceed?
using  ]ersersophentrees.  for 8

 "The Osbubba_Rushland_R_ and - and Reali, &_ andre.Ter_ and t is of the course of day has, kt::
