From baa747d3cdd2a0e9f8957c27daeda73c6e8612fe Mon Sep 17 00:00:00 2001
From: Jay Rodge
Date: Tue, 5 Aug 2025 09:53:24 -0700
Subject: [PATCH 1/4] Add NVIDIA TensorRT-LLM optimization guide for GPT-OSS models

---
 articles/run-nvidia.md | 123 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 articles/run-nvidia.md

diff --git a/articles/run-nvidia.md b/articles/run-nvidia.md
new file mode 100644
index 0000000000..fd6f278521
--- /dev/null
+++ b/articles/run-nvidia.md
@@ -0,0 +1,123 @@
+# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM
+
+This notebook provides a step-by-step guide on how to optimize `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
+
+
+TensorRT-LLM supports both models:
+- `gpt-oss-20b`
+- `gpt-oss-120b`
+
+In this guide, we will run `gpt-oss-20b`. If you want to try the larger model or need more customization, refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog) deployment guide.
+
+Note: In general, your input prompts must follow the [harmony response](http://cookbook.openai.com/articles/openai-harmony) format or the model will not function correctly; this guide, however, does not require you to format prompts manually.
+
+## Prerequisites
+
+### Hardware
+To run the 20B model and the TensorRT-LLM build process, you will need an NVIDIA GPU with at least 16 GB of VRAM.
+
+> Recommended GPUs: NVIDIA RTX 50 Series (e.g., RTX 5090), NVIDIA H100, or L40S.
+
+### Software
+- CUDA Toolkit 12.8 or later
+- Python 3.12 or later
+
+## Installing TensorRT-LLM
+
+There are various ways to install TensorRT-LLM. In this guide, we will use the pre-built Docker container from NVIDIA NGC and also show how to build it from source.
+
+## Using NGC
+
+Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC.
+This is the easiest way to get started and ensures all dependencies are included.
+
+```bash
+docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
+docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
+```
+
+## Using Docker (build from source)
+
+Alternatively, you can build the TensorRT-LLM container from source.
+This is useful if you want to modify the source code or use a custom branch.
+
+See the official instructions here: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker
+
+The following commands will install required dependencies, clone the repository,
+check out the GPT-OSS feature branch, and build the Docker container:
+
+```bash
+# Update package lists and install required system packages
+sudo apt-get update && sudo apt-get -y install git git-lfs build-essential cmake
+
+# Initialize Git LFS (Large File Storage) for handling large model files
+git lfs install
+
+# Clone the TensorRT-LLM repository
+git clone https://github.com/NVIDIA/TensorRT-LLM.git
+cd TensorRT-LLM
+
+# Check out the branch with GPT-OSS support
+git checkout feat/gpt-oss
+
+# Initialize and update submodules (required for build)
+git submodule update --init --recursive
+
+# Pull large files (e.g., model weights) managed by Git LFS
+git lfs pull
+
+# Build the release Docker image
+make -C docker release_build
+
+# Run the built Docker container
+make -C docker release_run
+```
+
+TensorRT-LLM will also be available through pip soon.
+
+> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch.
+
+# Verifying TensorRT-LLM Installation
+
+```python
+from tensorrt_llm import LLM, SamplingParams
+```
+
+# Utilizing the TensorRT-LLM Python API
+
+In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to:
+1. Download the specified model weights from Hugging Face.
+2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.
+3. Load the model and prepare it for inference.
+4. Run a simple text generation example to verify everything is working.
+
+**Note**: The first run may take several minutes as it downloads the model and builds the engine.
+Subsequent runs will be much faster, as the engine will be cached.
+
+```python
+llm = LLM(model="openai/gpt-oss-20b")
+```
+
+```python
+prompts = ["Hello, my name is", "The capital of France is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+for output in llm.generate(prompts, sampling_params):
+    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
+```
+
+# Conclusion and Next Steps
+Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.
+
+In this notebook, you have learned how to:
+- Set up your environment with the necessary dependencies.
+- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.
+- Automatically build a high-performance TensorRT engine tailored to your GPU.
+- Run inference with the optimized model.
+
+
+You can explore more advanced features to further improve performance and efficiency:
+
+- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time, as in the quick sketch below.
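+
+  As a rough, informal illustration (not a substitute for `trtllm-bench`), you can time a batch of generations directly with the API used above. The prompt list and batch size below are placeholders; the sketch reuses the `llm` and `SamplingParams` objects created earlier in this guide:
+
+  ```python
+  import time
+
+  # Placeholder batch of prompts; substitute your own workload here.
+  prompts = ["Hello, my name is"] * 32
+  sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+  # Wall-clock timing of a single batched generate() call.
+  start = time.perf_counter()
+  outputs = llm.generate(prompts, sampling_params)
+  elapsed = time.perf_counter() - start
+
+  print(f"Processed {len(prompts)} prompts in {elapsed:.2f} s ({len(prompts) / elapsed:.2f} prompts/s)")
+  ```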
+
+- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (like INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware.
+
+- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.
\ No newline at end of file

From 1f7a931beebe00e3843edb924629957bb9a91e46 Mon Sep 17 00:00:00 2001
From: Jay Rodge
Date: Tue, 5 Aug 2025 10:06:52 -0700
Subject: [PATCH 2/4] Convert NVIDIA TensorRT guide to Jupyter notebook format

---
 articles/run-nvidia.ipynb | 219 ++++++++++++++++++++++++++++++++++++++
 articles/run-nvidia.md    | 123 ---------------------
 authors.yaml              |   5 +
 registry.yaml             |   9 ++
 4 files changed, 233 insertions(+), 123 deletions(-)
 create mode 100644 articles/run-nvidia.ipynb
 delete mode 100644 articles/run-nvidia.md

diff --git a/articles/run-nvidia.ipynb b/articles/run-nvidia.ipynb
new file mode 100644
index 0000000000..47b983b70f
--- /dev/null
+++ b/articles/run-nvidia.ipynb
@@ -0,0 +1,219 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook provides a step-by-step guide on how to optimize `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.\n",
+    "\n",
+    "\n",
+    "TensorRT-LLM supports both models:\n",
+    "- `gpt-oss-20b`\n",
+    "- `gpt-oss-120b`\n",
+    "\n",
+    "In this guide, we will run `gpt-oss-20b`, if you want to try the larger model or want more customization refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog) deployment guide."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Hardware\n", + "To run the 20B model and the TensorRT-LLM build process, you will need an NVIDIA GPU with at least 20 GB of VRAM.\n", + "\n", + "> Recommended GPUs: NVIDIA RTX 50 Series (e.g.RTX 5090), NVIDIA H100, or L40S.\n", + "\n", + "### Software\n", + "- CUDA Toolkit 12.8 or later\n", + "- Python 3.12 or later\n", + "- Access to the Orangina model checkpoint from Hugging Face" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Installling TensorRT-LLM" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using NGC\n", + "\n", + "Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC.\n", + "This is the easiest way to get started and ensures all dependencies are included.\n", + "\n", + "`docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`\n", + "`docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`\n", + "\n", + "## Using Docker (build from source)\n", + "\n", + "Alternatively, you can build the TensorRT-LLM container from source.\n", + "This is useful if you want to modify the source code or use a custom branch.\n", + "See the official instructions here: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker\n", + "\n", + "The following commands will install required dependencies, clone the repository,\n", + "check out the GPT-OSS feature branch, and build the Docker container:\n", + " ```\n", + "#Update package lists and install required system packages\n", + "sudo apt-get update && sudo apt-get -y install git git-lfs build-essential cmake\n", + "\n", + "# Initialize Git LFS (Large File Storage) for handling large model files\n", + "git lfs install\n", + "\n", + "# Clone the TensorRT-LLM repository\n", + "git clone https://github.com/NVIDIA/TensorRT-LLM.git\n", + "cd TensorRT-LLM\n", + "\n", + "# Check out the branch with GPT-OSS support\n", + "git checkout feat/gpt-oss\n", + "\n", + "# Initialize and update submodules (required for build)\n", + "git submodule update --init --recursive\n", + "\n", + "# Pull large files (e.g., model weights) managed by Git LFS\n", + "git lfs pull\n", + "\n", + "# Build the release Docker image\n", + "make -C docker release_build\n", + "\n", + "# Run the built Docker container\n", + "make -C docker release_run \n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "TensorRT-LLM will be available through pip soon" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Verifying TensorRT-LLM Installation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from tensorrt_llm import LLM, SamplingParams" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Utilizing TensorRT-LLM Python API" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to:\n", + "1. Download the specified model weights from Hugging Face (using your HF_TOKEN for authentication).\n", + "2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.\n", + "3. Load the model and prepare it for inference.\n", + "4. Run a simple text generation example to verify everything is working.\n", + "\n", + "**Note**: The first run may take several minutes as it downloads the model and builds the engine.\n", + "Subsequent runs will be much faster, as the engine will be cached." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "llm = LLM(model=\"openai/gpt-oss-20b\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompts = [\"Hello, my name is\", \"The capital of France is\"]\n", + "sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n", + "for output in llm.generate(prompts, sampling_params):\n", + " print(f\"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Conclusion and Next Steps\n", + "Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.\n", + "\n", + "In this notebook, you have learned how to:\n", + "- Set up your environment with the necessary dependencies.\n", + "- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.\n", + "- Automatically build a high-performance TensorRT engine tailored to your GPU.\n", + "- Run inference with the optimized model.\n", + "\n", + "\n", + "You can explore more advanced features to further improve performance and efficiency:\n", + "\n", + "- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time.\n", + "\n", + "- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (like INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. 
This is a powerful feature for deploying models on resource-constrained hardware.\n", + "\n", + "- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using the [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/articles/run-nvidia.md b/articles/run-nvidia.md deleted file mode 100644 index fd6f278521..0000000000 --- a/articles/run-nvidia.md +++ /dev/null @@ -1,123 +0,0 @@ -# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM - -This notebook provides a step-by-step guide on how to optimizing `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in performant way. - - -TensorRT-LLM supports both models: -- `gpt-oss-20b` -- `gpt-oss-120b` - -In this guide, we will run `gpt-oss-20b`, if you want to try the larger model or want more customization refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog) deployment guide. - -Note: It’s important to ensure that your input prompts follow the [harmony response](http://cookbook.openai.com/articles/openai-harmony) format as the model will not function correctly otherwise, not needed in this guide. - -## Prerequisites - -### Hardware -To run the 20B model and the TensorRT-LLM build process, you will need an NVIDIA GPU with at least 16GB+ of VRAM. - -> Recommended GPUs: NVIDIA RTX 50 Series (e.g. RTX 5090), NVIDIA H100, or L40S. - -### Software -- CUDA Toolkit 12.8 or later -- Python 3.12 or later - -## Installling TensorRT-LLM - -There are various ways to install TensorRT-LLM, in this guide, we will using pre-built docker container from NVIDIA NGC and build it from source. - -## Using NGC - -Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC. -This is the easiest way to get started and ensures all dependencies are included. - -```bash -docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev -docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev -``` - -## Using Docker (build from source) - -Alternatively, you can build the TensorRT-LLM container from source. -This is useful if you want to modify the source code or use a custom branch. 
-See the official instructions here: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker - -The following commands will install required dependencies, clone the repository, -check out the GPT-OSS feature branch, and build the Docker container: - -```bash -#Update package lists and install required system packages -sudo apt-get update && sudo apt-get -y install git git-lfs build-essential cmake - -# Initialize Git LFS (Large File Storage) for handling large model files -git lfs install - -# Clone the TensorRT-LLM repository -git clone https://github.com/NVIDIA/TensorRT-LLM.git -cd TensorRT-LLM - -# Check out the branch with GPT-OSS support -git checkout feat/gpt-oss - -# Initialize and update submodules (required for build) -git submodule update --init --recursive - -# Pull large files (e.g., model weights) managed by Git LFS -git lfs pull - -# Build the release Docker image -make -C docker release_build - -# Run the built Docker container -make -C docker release_run -``` - -TensorRT-LLM will be available through pip soon - -> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch. - -# Verifying TensorRT-LLM Installation - -```python -from tensorrt_llm import LLM, SamplingParams -``` - -# Utilizing TensorRT-LLM Python API - -In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to: -1. Downloads the specified model weights from Hugging Face -2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist. -3. Load the model and prepare it for inference. -4. Run a simple text generation example to verify everything is working. - -**Note**: The first run may take several minutes as it downloads the model and builds the engine. -Subsequent runs will be much faster, as the engine will be cached. - -```python -llm = LLM(model="openai/gpt-oss-20b") -``` - -```python -prompts = ["Hello, my name is", "The capital of France is"] -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) -for output in llm.generate(prompts, sampling_params): - print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}") -``` - -# Conclusion and Next Steps -Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API. - -In this notebook, you have learned how to: -- Set up your environment with the necessary dependencies. -- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub. -- Automatically build a high-performance TensorRT engine tailored to your GPU. -- Run inference with the optimized model. - - -You can explore more advanced features to further improve performance and efficiency: - -- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time. 
- -- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (like INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware. - -- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using the [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving. \ No newline at end of file diff --git a/authors.yaml b/authors.yaml index d84aa18f87..901d2987dd 100644 --- a/authors.yaml +++ b/authors.yaml @@ -2,6 +2,11 @@ # You can optionally customize how your information shows up cookbook.openai.com over here. # If your information is not present here, it will be pulled from your GitHub profile. +jayrodge: + name: "Jay Rodge" + website: "https://www.linkedin.com/in/jayrodge/" + avatar: "https://developer-blogs.nvidia.com/wp-content/uploads/2024/05/Jay-Rodge.png" + rajpathak-openai: name: "Raj Pathak" website: "https://www.linkedin.com/in/rajpathakopenai/" diff --git a/registry.yaml b/registry.yaml index 7e9cf0b1b9..041d9ab6dc 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,15 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: Using NVIDIA TensorRT-LLM to run the 20B model + path: examples/articles/run-nvidia.ipynb + date: 2025-08-05 + authors: + - jayrodge + tags: + - nvidia + - tensorrt-llm + - title: Temporal Agents with Knowledge Graphs path: examples/partners/temporal_agents_with_knowledge_graphs/temporal_agents_with_knowledge_graphs.ipynb date: 2025-07-22 From c4f665d824dd482ddfafb5becf2df12df0890740 Mon Sep 17 00:00:00 2001 From: Jay Rodge Date: Tue, 5 Aug 2025 10:09:36 -0700 Subject: [PATCH 3/4] Update registry.yaml for NVIDIA notebook --- registry.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/registry.yaml b/registry.yaml index 041d9ab6dc..66f0b84eec 100644 --- a/registry.yaml +++ b/registry.yaml @@ -10,8 +10,8 @@ authors: - jayrodge tags: - - nvidia - - tensorrt-llm + - gpt-oss + - open-models - title: Temporal Agents with Knowledge Graphs path: examples/partners/temporal_agents_with_knowledge_graphs/temporal_agents_with_knowledge_graphs.ipynb From b5b50b8b77228aca79bf61d850a548942d391f62 Mon Sep 17 00:00:00 2001 From: Jay Rodge Date: Tue, 5 Aug 2025 13:48:22 -0700 Subject: [PATCH 4/4] Improve guide formatting and add NVIDIA Brev integration --- articles/run-nvidia.ipynb | 74 +++++++++++++++++---------------------- registry.yaml | 3 +- 2 files changed, 33 insertions(+), 44 deletions(-) diff --git a/articles/run-nvidia.ipynb b/articles/run-nvidia.ipynb index 47b983b70f..7de45f035f 100644 --- a/articles/run-nvidia.ipynb +++ b/articles/run-nvidia.ipynb @@ -18,7 +18,21 @@ "- `gpt-oss-20b`\n", "- `gpt-oss-120b`\n", "\n", - "In this guide, we will run `gpt-oss-20b`, if you want to try the larger model or want more customization refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog) deployment guide." 
+ "In this guide, we will run `gpt-oss-20b`, if you want to try the larger model or want more customization refer to [this](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md) deployment guide.\n", + "\n", + "Note: Your input prompts should use the [harmony response](http://cookbook.openai.com/articles/openai-harmony) format for the model to work properly, though this guide does not require it." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Launch on NVIDIA Brev\n", + "You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.\n", + "\n", + "Once deployed, click on the \"Open Notebook\" button to get start with this guide\n", + "\n", + "[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-30i1YjHsRWT109HL6eYxLUeHIwF)" ] }, { @@ -33,69 +47,45 @@ "metadata": {}, "source": [ "### Hardware\n", - "To run the 20B model and the TensorRT-LLM build process, you will need an NVIDIA GPU with at least 20 GB of VRAM.\n", + "To run the gpt-oss-20b model, you will need an NVIDIA GPU with at least 20 GB of VRAM.\n", "\n", - "> Recommended GPUs: NVIDIA RTX 50 Series (e.g.RTX 5090), NVIDIA H100, or L40S.\n", + "Recommended GPUs: NVIDIA Hopper (e.g., H100, H200), NVIDIA Blackwell (e.g., B100, B200), NVIDIA RTX PRO, NVIDIA RTX 50 Series (e.g., RTX 5090).\n", "\n", "### Software\n", "- CUDA Toolkit 12.8 or later\n", - "- Python 3.12 or later\n", - "- Access to the Orangina model checkpoint from Hugging Face" + "- Python 3.12 or later" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Installling TensorRT-LLM" + "## Installing TensorRT-LLM\n", + "\n", + "There are multiple ways to install TensorRT-LLM. In this guide, we'll cover using a pre-built Docker container from NVIDIA NGC as well as building from source.\n", + "\n", + "If you're using NVIDIA Brev, you can skip this section." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Using NGC\n", + "## Using NVIDIA NGC\n", "\n", - "Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC.\n", + "Pull the pre-built [TensorRT-LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for GPT-OSS from [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/).\n", "This is the easiest way to get started and ensures all dependencies are included.\n", "\n", - "`docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`\n", - "`docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`\n", + "```bash\n", + "docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev\n", + "docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev\n", + "```\n", "\n", - "## Using Docker (build from source)\n", + "## Using Docker (Build from Source)\n", "\n", "Alternatively, you can build the TensorRT-LLM container from source.\n", - "This is useful if you want to modify the source code or use a custom branch.\n", - "See the official instructions here: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker\n", - "\n", - "The following commands will install required dependencies, clone the repository,\n", - "check out the GPT-OSS feature branch, and build the Docker container:\n", - " ```\n", - "#Update package lists and install required system packages\n", - "sudo apt-get update && sudo apt-get -y install git git-lfs build-essential cmake\n", - "\n", - "# Initialize Git LFS (Large File Storage) for handling large model files\n", - "git lfs install\n", - "\n", - "# Clone the TensorRT-LLM repository\n", - "git clone https://github.com/NVIDIA/TensorRT-LLM.git\n", - "cd TensorRT-LLM\n", - "\n", - "# Check out the branch with GPT-OSS support\n", - "git checkout feat/gpt-oss\n", - "\n", - "# Initialize and update submodules (required for build)\n", - "git submodule update --init --recursive\n", - "\n", - "# Pull large files (e.g., model weights) managed by Git LFS\n", - "git lfs pull\n", - "\n", - "# Build the release Docker image\n", - "make -C docker release_build\n", - "\n", - "# Run the built Docker container\n", - "make -C docker release_run \n", - "```" + "This approach is useful if you want to modify the source code or use a custom branch.\n", + "For detailed instructions, see the [official documentation](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker)." ] }, { diff --git a/registry.yaml b/registry.yaml index d6eb0bc184..e2cbd25578 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,8 +4,7 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. - -- title: Using NVIDIA TensorRT-LLM to run the 20B model +- title: Using NVIDIA TensorRT-LLM to run gpt-oss-20b path: examples/articles/run-nvidia.ipynb date: 2025-08-05 authors: