# End-to-End LLM

---

## Overview  

The End-to-End LLM (Large Language Model) Bootcamp is designed from a real-world perspective and follows the data processing, development, and deployment pipeline paradigm. Attendees walk through the workflow of preprocessing the SQuAD (Stanford Question Answering Dataset) for the Question Answering (QA) task, training a model on it using BERT (Bidirectional Encoder Representations from Transformers), and applying prompt-learning strategies using NVIDIA® NeMo™ and NVIDIA Megatron, a transformer-based language model. Attendees will also learn to optimize an LLM using NVIDIA TensorRT™, an SDK for high-performance deep learning inference; to guard the LLM's prompts and responses using NeMo Guardrails; and to deploy the AI pipeline using NVIDIA Triton™ Inference Server, open-source software that standardizes AI model deployment and execution across every workload. Two activity notebooks are also included to test your understanding of the material and solidify your experience in the QA domain.
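As a rough illustration of what the SQuAD preprocessing step involves, here is a minimal sketch that flattens data in the public SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas` → `answers`) into one record per question/answer pair and checks that each answer span matches its context. The function name `flatten_squad` and the `SAMPLE` record are hypothetical and are not taken from the bootcamp notebooks; the labs' actual preprocessing may differ.

```python
import json
from typing import Dict, List


def flatten_squad(squad: Dict) -> List[Dict]:
    """Flatten a SQuAD v1.1-style dict into one record per (question, answer) pair."""
    records = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Questions with an empty "answers" list produce no records.
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    # Sanity check: the character span must reproduce the answer text.
                    if context[start:end] != answer["text"]:
                        raise ValueError(f"Answer span mismatch for question {qa['id']}")
                    records.append({
                        "id": qa["id"],
                        "question": qa["question"],
                        "context": context,
                        "answer": answer["text"],
                        "answer_start": start,
                        "answer_end": end,
                    })
    return records


# Tiny hand-written sample in the SQuAD v1.1 layout (hypothetical, not real SQuAD data).
SAMPLE = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Triton Inference Server is open-source software.",
            "qas": [{
                "id": "q1",
                "question": "What kind of software is Triton Inference Server?",
                "answers": [{"text": "open-source software", "answer_start": 27}],
            }],
        }],
    }],
}

if __name__ == "__main__":
    for record in flatten_squad(SAMPLE):
        print(json.dumps(record))
```

Validating `answer_start` against the context up front catches misaligned spans before they reach tokenization, where such errors are much harder to spot.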


### Why End-to-End LLM?

Solving real-world problems in the AI domain requires a set of tools (software stacks and frameworks), and the solution process typically follows the `data processing`, `development`, and `deployment` pattern. This material aims to:
- help AI hackathon participants learn and apply this knowledge to solve their tasks using NVIDIA software stacks and frameworks
- enable bootcamp attendees to solve real-world problems using an end-to-end approach (data processing --> development --> deployment)




The table of contents below walks you through each phase of the workflow, and the included activities test your understanding of the concepts.

### Table of Contents

The following contents will be covered:

1. Megatron-GPT
    1. [NeMo Fundamentals](jupyter_notebook/nemo/NeMo_Primer.ipynb)
    1. [Question Answering](jupyter_notebook/nemo/Question_Answering.ipynb)
    1. [Lab Activity 1](jupyter_notebook/nemo/Activity1.ipynb)
    1. [Prompt Tuning/P-Tuning](jupyter_notebook/nemo/Multitask_Prompt_and_PTuning.ipynb) 
    1. [Lab Activity 2](jupyter_notebook/nemo/Activity2.ipynb)
    1. [Megatron-GPT 1.3B: Language Model Inferencing](jupyter_notebook/nemo/demo.ipynb)
1. TensorRT-LLM and Triton Deployment with Llama 2 7B Model
    1. [Llama 2 7B Inference using TensorRT-LLM](jupyter_notebook/trt-llm/TRT-LLM-Part1.ipynb)
    1. [Llama 2 7B Deployment using Triton Inference Server](jupyter_notebook/trt-llm/TRT-LLM-Part2.ipynb)
1. NeMo Guardrails
    1. [NeMo Guardrails Topical Rails](jupyter_notebook/nemo-guardrails/guardrails/workspace/examples/topical_rail/topical_rail.ipynb)
    1. [NeMo Guardrails Jailbreak Rails](jupyter_notebook/nemo-guardrails/guardrails/workspace/examples/jailbreak_check/jailbreak_check.ipynb)
    1. [NeMo Guardrails Grounding Rails](jupyter_notebook/nemo-guardrails/guardrails/workspace/examples/grounding_rail/grounding_rail.ipynb)
    1. [NeMo Guardrails Moderation Rails](jupyter_notebook/nemo-guardrails/guardrails/workspace/examples/moderation_rail/moderation_rail.ipynb)
    1. [NeMo Guardrails LangChain and Prompt Templates](jupyter_notebook/nemo-guardrails/guardrails/workspace/examples/custom_prompt_context/custom_prompt_context.ipynb)


### Check your GPU

Execute the cell below to run the `nvidia-smi` command, which displays information about the CUDA driver and the GPUs available on the server. To run the cell, give it focus (click on it with your mouse) and press `Ctrl-Enter`, or click the play button in the toolbar above. If all goes well, you should see output returned below the grey cell.

```
!nvidia-smi
```
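If you prefer to check the GPUs from a plain Python script rather than a notebook shell escape, the sketch below calls `nvidia-smi` through `subprocess` using its `--query-gpu` and `--format=csv,noheader` options and parses the result. The helper names `parse_gpu_csv` and `list_gpus` are hypothetical and not part of the bootcamp material, and the chosen query fields are an assumption you can adjust.

```python
import shutil
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi (an assumption; adjust as needed).
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(csv_text: str) -> List[Dict[str, str]]:
    """Parse `nvidia-smi --format=csv,noheader` output into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def list_gpus() -> List[Dict[str, str]]:
    """Return GPU info, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    gpus = list_gpus()
    if not gpus:
        print("No NVIDIA GPU detected (nvidia-smi not found or failed).")
    for i, gpu in enumerate(gpus):
        print(f"GPU {i}: {gpu['name']} | driver {gpu['driver_version']} | {gpu['memory.total']}")
```

Returning an empty list instead of raising keeps the check usable on machines without a GPU, where the labs themselves would not run.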

### Tutorial Duration

The material is presented in three labs totaling 8 hours 45 minutes, as follows:
- Megatron-GPT Lab: `4 hrs 30 mins`
- TensorRT-LLM and Triton Deployment with Llama 2 7B Model Lab: `1 hr 10 mins`
- NeMo Guardrails Lab: `3 hrs 5 mins`

### Content Level
Beginner to Advanced

### Target Audience and Prerequisites
The target audience for these labs is researchers, graduate students, and developers interested in an end-to-end approach to solving LLM tasks on GPUs. Attendees should have a background in Python programming.

### Acknowledgments

The `Megatron-GPT`, `TensorRT-LLM and Triton Deployment with Llama 2 7B Model`, and `NeMo Guardrails` labs were adapted from the [NVIDIA NeMo](https://github.com/NVIDIA/NeMo), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), and [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails) repositories, respectively.

---
## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.