<p> <center> <a href="../../LLM-Application.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="triton-llama.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="llama-chat-finetune.ipynb">1</a>
        <a href="trt-llama-chat.ipynb">2</a>
        <a href="trt-custom-model.ipynb">3</a>
        <a href="triton-llama.ipynb">4</a>
        <a>5</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a>End</a></span>
</div>

# Challenge

---
<div style="text-align:left; color:#FF0000; height:80px; text-color:red; font-size:20px">Please Start the Challenge using the Llama-Finetuning Container </div>

This challenge notebook tests your understanding of the previous notebooks and helps you consolidate what you have learned. Provide a solution to the problem statement below by following the finetuning process, optimizing the model into a TensorRT-LLM engine, and deploying it.

## Lab Requirement

- Before attempting the challenge lab, you must generate the Hugging Face (HF) security token needed to access the challenge dataset.
- Follow the [User Access Token](https://huggingface.co/docs/hub/en/security-tokens) link for the steps to generate the HF security token.

## Problem Statement

eCommerce websites receive large volumes of customer feedback through reviews and comments, and customers sometimes have long chats with eCommerce agents. Manually analyzing this feedback is time-consuming, tedious, and error-prone. The goal is to develop a generative AI-based solution that can efficiently analyze and summarize feedback and chats.

- **Use Case 1**: Customer feedback and Agent chat summarization
- **Domain**: E-commerce
- **Dataset**: Salesforce dialogstudio


## Instructions

You are to reproduce the End-to-End approach using the `Salesforce dialogstudio` dataset. To complete the challenge, you are to implement the following steps:

- Perform the finetuning process
    - import needed libraries
    - preprocess the dialogstudio TweetSumm (train, validation, test)
    - set the path of your base model (Llama-2-7b)
    - convert the model to Hugging Face Transformers Format
    - initialize paths to the base model, tokenizer, and output checkpoint (`new_model='../../model/Llama-2-7b-hf-finetune'`)
    - load tokenizer
    - set the training parameter (TrainingArguments object)
    - configure PEFT With LoRA (LoraConfig())
    - 4-bit Quantization Configuration (BitsAndBytesConfig())
    - load base model using AutoModelForCausalLM.from_pretrained()
    - set the Trainer Parameter using SFTTrainer()
    - run trainer.train() and save your new model to `../../model/Llama-2-7b-hf-finetune`
    - run inference with your model on the provided test data
    - merge the base model with the finetuned checkpoint and save the result as `../../model/challenge/Llama-2-7b-hf-merged`
- Build the TensorRT engine using a single GPU and apply INT8 weight-only quantization
    - convert the model to a TensorRT-LLM checkpoint
    - build the TensorRT engine
    - execute the `summarize.py` script to run inference
- Deploy model on Triton server

The following links are given for your references:
- [Model training hyperparameters](https://huggingface.co/transformers/v3.5.1/model_doc/auto.html)
- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#llama)
- [Model Deployment](https://github.com/triton-inference-server/tensorrtllm_backend)

The data preprocessing part of the solution code is written for you. Complete the rest by filling in the provided cells and creating new cells where needed.



### Challenge Duration

The challenge is expected to take 3 hours.

## Solution


### Data Preprocessing

In [None]:
import json
import re
from pprint import pprint
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import SFTTrainer


#### Load dataset
- Add your Hugging Face security token to the `token` parameter: `load_dataset("Salesforce/dialogstudio","TweetSumm", trust_remote_code=True, token='hf_Mms….')`

In [None]:
dataset = load_dataset("Salesforce/dialogstudio","TweetSumm", trust_remote_code=True, token='')
dataset

**Expected Output:**

```python
DatasetDict({
    train: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt'],
        num_rows: 879
    })
    validation: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt'],
        num_rows: 110
    })
    test: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt'],
        num_rows: 110
    })
})

```
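
Pasting the token directly into the notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the token could instead be read from an environment variable. The helper name `get_hf_token` is hypothetical, and using `HF_TOKEN` as the variable name is an assumption rather than something this lab sets up for you.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    `env_var` defaults to HF_TOKEN, which is an assumed name for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before loading the dataset."
        )
    return token


# Usage in the loading cell (not executed here):
# dataset = load_dataset("Salesforce/dialogstudio", "TweetSumm",
#                        trust_remote_code=True, token=get_hf_token())
```

With this in place, the `token=''` argument in the loading cell could be replaced by `token=get_hf_token()`.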

#### Preprocess dataset

In [None]:
def format_text(text):
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\^[^ ]+", "", text)
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@[^\s]+", "", text)
    return text

def transform_conversation(data):
    transformed_text = ""
    for row in data["log"]:
        user = format_text(row["user utterance"])
        transformed_text += f"user: {user.strip()}\n"
        agent = format_text(row["system response"])
        transformed_text += f"agent: {agent.strip()}\n"
    return transformed_text


def format_training_prompt(conversation, summary):
    return f"""### Instruction: Write  a summary of the conversation below. ### Input: {conversation.strip()} ### Response: {summary} """.strip()


def generate_conversation(data):
    summaries = json.loads(data["original dialog info"])["summaries"]["abstractive_summaries"]
    summary = summaries[0]
    summary = " ".join(summary)
    
    transformed_text = transform_conversation(data)
    return {
        "conversation": transformed_text,
        "summary": summary,
        "text": format_training_prompt(transformed_text, summary),
    }



In [None]:
train_example = generate_conversation(dataset["train"][0])
print("summary:\n",train_example["summary"], "\n")
print("conversation:\n",train_example["conversation"], "\n")
print("text:\n",train_example["text"], "\n")

**Expected Output :**

```text
summary:
 Customer enquired about his Iphone and Apple watch which is not showing his any steps/activity and health activities. Agent is asking to move to DM and look into it. 

conversation:
 user: So neither my iPhone nor my Apple Watch are recording my steps/activity, and Health doesn’t recognise either source anymore for some reason. Any ideas?   please read the above.
agent: Let’s investigate this together. To start, can you tell us the software versions your iPhone and Apple Watch are running currently?
user: My iPhone is on 11.1.2, and my watch is on 4.1.
agent: Thank you. Have you tried restarting both devices since this started happening?
user: I’ve restarted both, also un-paired then re-paired the watch.
agent: Got it. When did you first notice that the two devices were not talking to each other. Do the two devices communicate through other apps such as Messages?
user: Yes, everything seems fine, it’s just Health and activity.
agent: Let’s move to DM and look into this a bit more. When reaching out in DM, let us know when this first started happening please. For example, did it start after an update or after installing a certain app?
 

text:
 ### Instruction: Write  a summary of the conversation below. ### Input: user: So neither my iPhone nor my Apple Watch are recording my steps/activity, and Health doesn’t recognise either source anymore for some reason. Any ideas?   please read the above.
agent: Let’s investigate this together. To start, can you tell us the software versions your iPhone and Apple Watch are running currently?
user: My iPhone is on 11.1.2, and my watch is on 4.1.
agent: Thank you. Have you tried restarting both devices since this started happening?
user: I’ve restarted both, also un-paired then re-paired the watch.
agent: Got it. When did you first notice that the two devices were not talking to each other. Do the two devices communicate through other apps such as Messages?
user: Yes, everything seems fine, it’s just Health and activity.
agent: Let’s move to DM and look into this a bit more. When reaching out in DM, let us know when this first started happening please. For example, did it start after an update or after installing a certain app? ### Response: Customer enquired about his Iphone and Apple watch which is not showing his any steps/activity and health activities. Agent is asking to move to DM and look into it. 

```
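
To see what the cleaning and summary extraction do without loading the dataset, the sketch below applies the same four substitutions as `format_text` to a made-up record. The record is invented for illustration and only mimics the shape of a TweetSumm row (`log` and `original dialog info`); it is not real data.

```python
import json
import re


def format_text(text):
    # Same four substitutions as the notebook's format_text
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    text = re.sub(r"\^[^ ]+", "", text)  # drop tokens starting with ^ (e.g. initials)
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"@[^\s]+", "", text)  # drop @mentions
    return text


# Made-up record shaped like one TweetSumm row (hypothetical content)
toy_row = {
    "original dialog info": json.dumps(
        {
            "summaries": {
                "abstractive_summaries": [
                    ["Customer reports a broken watch.", "Agent asks for a DM."]
                ]
            }
        }
    ),
    "log": [
        {
            "user utterance": "@AppleSupport my  watch is broken https://t.co/abc",
            "system response": "Please DM us the details. ^JK",
        }
    ],
}

turn = toy_row["log"][0]
user = format_text(turn["user utterance"]).strip()
agent = format_text(turn["system response"]).strip()

# The first abstractive summary is a list of sentences joined with spaces
info = json.loads(toy_row["original dialog info"])
summary = " ".join(info["summaries"]["abstractive_summaries"][0])

print(f"user: {user}")
print(f"agent: {agent}")
print(f"summary: {summary}")
```

Note that the mention, the URL, and the `^JK` token are removed, while the sentence text itself is left untouched.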

In [None]:
def preprocess_dataset(data: Dataset):
    # Shuffle, build the conversation/summary/text fields, then drop the raw columns
    return (
        data.shuffle(seed=42)
        .map(generate_conversation)
        .remove_columns(
            [
                "original dialog id",
                "new dialog id",
                "dialog index",
                "original dialog info",
                "log",
                "prompt",
            ]
        )
    )

dataset["train"] = preprocess_dataset(dataset["train"])
dataset["validation"] = preprocess_dataset(dataset["validation"])
dataset["test"] = preprocess_dataset(dataset["test"])

In [None]:
dataset

**Expected Output:**

```python
DatasetDict({
    train: Dataset({
        features: ['conversation', 'summary', 'text'],
        num_rows: 879
    })
    validation: Dataset({
        features: ['conversation', 'summary', 'text'],
        num_rows: 110
    })
    test: Dataset({
        features: ['conversation', 'summary', 'text'],
        num_rows: 110
    })
})

```

#### Download Llama-2-7b

Run the cell below to check whether the Llama-2-7b model already exists in your directory. If it does not, uncomment the download cell that follows the expected output and run it.

In [None]:
!ls -LR ../../model/Llama-2-7b

**Expected Output:**

```python

../../model/Llama-2-7b:
LICENSE.txt		   USE_POLICY.md	params.json
README.md		   checklist.chk	tokenizer.model
Responsible-Use-Guide.pdf  consolidated.00.pth	tokenizer_checklist.chk

```
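
The same check can be done programmatically. The sketch below is one possible way to do it with `pathlib`; the helper `missing_model_files` is hypothetical, and the choice of which three files to treat as required is an assumption based on the listing above.

```python
from pathlib import Path

# File names taken from the expected `ls` output; treating these three
# as the required ones is an assumption made for this sketch.
REQUIRED_FILES = ("params.json", "consolidated.00.pth", "tokenizer.model")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


missing = missing_model_files("../../model/Llama-2-7b")
if missing:
    print("Missing files, download the model first:", missing)
else:
    print("Llama-2-7b checkpoint found.")
```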

In [None]:
###### download Llama-2-7b model  #############


#!python3 ../../source_code/Llama2/download-llama2.py

#print("extracting files......")
#!tar -xf ../../model/Llama-2-7b.tar  -C ../../model

#print("files extraction done! removing tar file......")
#!rm -rf ../../model/Llama-2-7b.tar
#print("All done!!!")

In [None]:
!python3 ../../source_code/Llama2/llama/convert_llama_weights_to_hf.py \
--input_dir ../../model/Llama-2-7b --model_size 7B --output_dir ../../model/Llama-2-7b-hf

**Expected Output:**

```python

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Fetching all parameters from the checkpoint at ../../model/Llama-2-7b.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  5.00it/s]
Saving in the Transformers format.
```

In [None]:
# initialize path to the base model
base_model = "../../model/Llama-2-7b-hf"

### Set Tokenizer

A comprehensive list of Llama tokenizer parameters can be found [here](https://huggingface.co/docs/transformers/en/model_doc/llama2)

In [None]:
##########complete the task ###################



### Set Training Arguments

Please note that an epoch can take up to 31 minutes. A complete reference on `TrainingArguments` can be found [here](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#trainingarguments)
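
Before choosing `num_train_epochs`, it can help to estimate how many optimizer steps and how much time a configuration implies. The sketch below is a rough estimate only: the batch size and gradient accumulation values are hypothetical examples, not recommended settings, and the per-epoch time is the upper bound quoted above.

```python
import math


def optimizer_steps_per_epoch(num_rows, per_device_batch_size,
                              grad_accum_steps=1, num_gpus=1):
    """Approximate number of optimizer updates in one pass over the data."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    return math.ceil(num_rows / effective_batch)


TRAIN_ROWS = 879        # size of the TweetSumm training split shown earlier
MINUTES_PER_EPOCH = 31  # upper bound quoted in this notebook

# Hypothetical settings, used only to illustrate the arithmetic
for epochs in (1, 2, 3):
    steps = epochs * optimizer_steps_per_epoch(
        TRAIN_ROWS, per_device_batch_size=4, grad_accum_steps=4
    )
    print(f"{epochs} epoch(s): ~{steps} optimizer steps, "
          f"up to ~{epochs * MINUTES_PER_EPOCH} min")
```

Keeping the total within the challenge duration is a reasonable constraint when picking the number of epochs.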

In [None]:
##########complete the task ###################

training_arguments = TrainingArguments(
 output_dir="../../model/challenge_results",
 num_train_epochs= ,
    
    
    
    
)

### Set LoraConfig

See references [here](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraConfig)
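
The LoRA rank `r` controls how many trainable parameters are added. For a weight matrix of shape `d_out x d_in`, LoRA adds two low-rank factors with `r * (d_in + d_out)` parameters in total. The sketch below works through that arithmetic for Llama-2-7b (hidden size 4096, 32 decoder layers). The ranks shown and the assumption of two adapted square projections per layer are illustrative only, not the expected answer to this task.

```python
def lora_params_per_matrix(d_in, d_out, r):
    """Parameters LoRA adds to one d_out x d_in weight (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 4096    # Llama-2-7b hidden size
NUM_LAYERS = 32  # Llama-2-7b decoder layers

for r in (8, 16, 64):
    per_matrix = lora_params_per_matrix(HIDDEN, HIDDEN, r)
    # Assumption: adapters on two square attention projections per layer
    total = per_matrix * 2 * NUM_LAYERS
    print(f"r={r:>2}: {per_matrix:,} per matrix, {total:,} trainable in total")
```

Even the largest rank here trains a small fraction of the model's roughly 7 billion weights, which is what makes LoRA practical on a single GPU.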

In [None]:
##########complete the task ###################

peft_config = LoraConfig(
    


)

### Set `BitsAndBytesConfig`

You can make use of these references: [HF](https://huggingface.co/docs/transformers/en/main_classes/quantization) and [GitHub](https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization/bitsandbytes.md)
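
To see why 4-bit loading matters for a 7B model, the following sketch estimates the memory taken by the weights alone at different precisions. It is a back-of-the-envelope calculation: the parameter count is rounded to 7 billion, and the overhead of quantization constants, activations, and optimizer state is ignored.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 7e9  # rounded parameter count for a "7B" model (assumption)

for label, bits in (("fp16", 16), ("int8", 8), ("4-bit", 4)):
    print(f"{label:>5}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```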

In [None]:
##########complete the task ###################

quant_config = BitsAndBytesConfig(
    
    

)

### Load Base Model
A reference is provided for you [here](https://huggingface.co/transformers/v3.5.1/model_doc/auto.html#tfautomodelforcausallm).

In [None]:
##########complete the task ###################

model = AutoModelForCausalLM.from_pretrained(


)

### Trainer Hyperparameters

A detailed reference can be found [here](https://huggingface.co/docs/trl/v0.9.4/en/sft_trainer#trl.SFTTrainer)

In [None]:
trainer = SFTTrainer(
    model= model,
    train_dataset = dataset["train"],
    eval_dataset = dataset["validation"],
    peft_config = peft_config,
    dataset_text_field = "text",
    max_seq_length = 4096,
    tokenizer = tokenizer,
    args = training_arguments,

)

### Train the Trainer

In [None]:
##########complete the task ###################



### Save model

In [None]:
new_model = "../../model/Llama-2-7b-hf-finetune"
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

**Expected Output:**

```python

('../../model/Llama-2-7b-hf-finetune/tokenizer_config.json',
 '../../model/Llama-2-7b-hf-finetune/special_tokens_map.json',
 '../../model/Llama-2-7b-hf-finetune/tokenizer.model',
 '../../model/Llama-2-7b-hf-finetune/added_tokens.json',
 '../../model/Llama-2-7b-hf-finetune/tokenizer.json')

```

###  Run Inference

The functions below format the test data for running inference.

In [None]:
def format_inference_prompt(conversation, summary):
    return f"""### Instruction: Write  a summary of the conversation below. ### Input: {conversation.strip()} ### Response: """.strip()


In [None]:
prompt_examples = []
for row in dataset["test"].select(range(3)):
    prompt_examples.append(
        {
            "summary": row["summary"],
            "conversation": row["conversation"],
            "prompt": format_inference_prompt(row["conversation"], row["summary"]),
        }
    )
    
test_df = pd.DataFrame(prompt_examples)
test_df

#### Create Custom Function to Summarize Conversation

In [None]:
DEVICE = "cuda:0" 
def summarize(model, text):
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    inputs_length = len(inputs["input_ids"][0])
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=256, temperature= 1)
    return tokenizer.decode(outputs[0][inputs_length:], skip_special_tokens=True) 

#### Run Test

In [None]:
%%time

model = PeftModel.from_pretrained(model, new_model)

example = test_df.iloc[0]
print("Ground truth : \n", example.summary)

print("Conversation : \n", example.conversation)


summary = summarize(model, example.prompt)
print("All Prompt: \n", summary,"\n")

print("Summary Generated : \n",summary.strip().split("\n")[0])

**Likely Output:**

```python
...
Ground truth : 
 Customer is looking to change the flight on Friday Oct 27 is that an option and asking about cost. Agent replying that there is an difference in fare and this would include all airport taxes and fees and ticket is non refundable changeable with a fee.
 
Conversation : 
 user: looking to change my flight Friday, Oct 27. GRMSKV to DL4728 from SLC to ORD. Is that an option and what is the cost? Jess
agent: The difference in fare is $185.30. This would include all airport taxes and fees. The ticket is non-refundable changeable with a fee, *ALS  and may result in additional fare collection for changes when making a future changes. *ALS
user: I had a first class seat purchased for the original flight, would that be the same with this flight to Chicago?
agent: Hello, Jess. That is the fare difference. You will have to call us at 1 800 221 1212 to make any changes. It is in First class. *TAY
user: thx
agent: Our pleasure. *ALS
user: Do I have to call or is there a means to do this online?
agent: You can call or you can login to your trip on our website to make changes. *TJE

All Prompt: 
 Customer is enquiring about changing his flight. Agent updated that the difference in fare is $185.30 and that it is in first class. Also informed that customer can call or login to their trip on their website to make changes.
agent: Agent updated that the difference in fare is $185.30 and that it is in first class. Also informed that customer can call or login to their trip on their website to make changes.
customer: Agent updated that the difference in fare is $185.30 and that it is in first class. Also informed that customer can call or login to their trip on their website to make changes.
agent: Agent updated that the difference in fare is $185.30 and that it is in first class. Also informed that customer can call or login to their trip on their website to make changes.
customer: Agent updated that the difference in fare is $185.30 and that it is in first class. Also informed that customer can call or login to their trip on their website to make changes.
agent: Agent updated that the difference in fare is $185.30 and that it is in first class. Also informed that customer 

Summary Generated : 
 Customer is enquiring about changing his flight. Agent updated that the difference in fare is $185.30 and that it is in first class. Also informed that customer can call or login to their trip on their website to make changes.
CPU times: user 46.3 s, sys: 2.19 s, total: 48.5 s
Wall time: 33.5 s

```


### Merge Model

In [None]:
load_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)

model = PeftModel.from_pretrained(load_model, new_model)
model = model.merge_and_unload()


# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
#tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
model.save_pretrained("../../model/challenge/Llama-2-7b-hf-merged", safe_serialization=True)
tokenizer.save_pretrained("../../model/challenge/Llama-2-7b-hf-merged")

**Expected Output:**

```python
('../../model/challenge/Llama-2-7b-hf-merged/tokenizer_config.json',
 '../../model/challenge/Llama-2-7b-hf-merged/special_tokens_map.json',
 '../../model/challenge/Llama-2-7b-hf-merged/tokenizer.model',
 '../../model/challenge/Llama-2-7b-hf-merged/added_tokens.json',
 '../../model/challenge/Llama-2-7b-hf-merged/tokenizer.json')

```

---
<div style="text-align:left; color:#FF0000; height:80px; text-color:red; font-size:20px">Please close the Jupyter notebook and switch to the TRT-LLM Container to continue with the next section of the challenge notebook</div>

---

### Build the LLaMA 7B model using a single GPU and apply INT8 weight-only quantization

Build your TensorRT engine into this directory: `../../model/challenge/trt_engines/weight_only/1-gpu/`
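
As background, weight-only INT8 quantization stores the weights as 8-bit integers plus a scale and converts them back to floating point at compute time, while activations stay in higher precision. The toy sketch below shows generic symmetric quantization of one row of made-up weights. It illustrates the idea only and is not TensorRT-LLM's actual implementation, whose kernels and scaling scheme may differ.

```python
def quantize_row_int8(row):
    """Symmetric INT8 quantization of one weight row: returns (int values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in row], scale


def dequantize_row(q_row, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [q * scale for q in q_row]


weights = [0.02, -0.50, 0.13, 0.254]  # made-up values for illustration
q_row, scale = quantize_row_int8(weights)
restored = dequantize_row(q_row, scale)

print("int8 values:", q_row)
print("scale:", scale)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The rounding error per weight is bounded by half the scale, which is why the quality loss from this kind of quantization is usually small.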

- Convert the model to a TensorRT-LLM checkpoint

In [None]:
##########complete the task ###################

!python3 /workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
                            --model_dir /workspace/app/model/challenge/Llama-2-7b-hf-merged  \
                            --output_dir /workspace/app/model/challenge/tllm_checkpoint_1gpu_fp16_wq \
                            --
                            --
                            --


- Build the TensorRT engine

In [None]:
##########complete the task ###################

!trtllm-build --checkpoint_dir /workspace/app/model/challenge/tllm_checkpoint_1gpu_fp16_wq  \
              --output_dir /workspace/app/model/challenge/trt_engines/weight_only/1-gpu/ \
              --


- Run inference using the dataset from `ccdv/cnn_dailymail`

In [None]:
!python3 /workspace/tensorrtllm_backend/tensorrt_llm/examples/summarize.py \
                       --test_trt_llm \
                       --hf_model_dir /workspace/app/model/challenge/Llama-2-7b-hf-merged  \
                       --data_type fp16 \
                       --engine_dir /workspace/app/model/challenge/trt_engines/weight_only/1-gpu/

Likely Output:

```python
...
[05/31/2024-15:43:01] [TRT-LLM] [I] Load engine takes: 5.517448663711548 sec
[05/31/2024-15:43:02] [TRT-LLM] [I] ---------------------------------------------------------
[05/31/2024-15:43:02] [TRT-LLM] [I] TensorRT-LLM Generated : 
[05/31/2024-15:43:02] [TRT-LLM] [I]  Input : ['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff \'em and stuff \'em!" upon making an arrest. Among the most popular shows on TV in the early \'80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best\'s "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life\'s many passions." 
Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds\' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent \'Return of the Killer Shrews,\' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we\'ve lost in 2015 . CNN\'s Stella Chan contributed to this story.']
[05/31/2024-15:43:02] [TRT-LLM] [I] 
 Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[05/31/2024-15:43:02] [TRT-LLM] [I] 
 Output : [['James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in']]
[05/31/2024-15:43:02] [TRT-LLM] [I] ---------------------------------------------------------
[05/31/2024-15:43:17] [TRT-LLM] [I] TensorRT-LLM (total latency: 13.475271701812744 sec)
[05/31/2024-15:43:17] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2000)
[05/31/2024-15:43:17] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 148.42001291379918)
[05/31/2024-15:43:17] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[05/31/2024-15:43:17] [TRT-LLM] [I]   rouge1 : 23.96478385018718
[05/31/2024-15:43:17] [TRT-LLM] [I]   rouge2 : 6.18847126286389
[05/31/2024-15:43:17] [TRT-LLM] [I]   rougeL : 16.39895867154953
[05/31/2024-15:43:17] [TRT-LLM] [I]   rougeLsum : 19.90993393703445

...

```
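
The ROUGE values in the log measure word overlap between the generated summary and the reference, reported on a 0 to 100 scale. As a simplified sketch of the idea behind `rouge1`, the function below computes a unigram-overlap F1 score with plain whitespace tokenization. The metric used by `summarize.py` may tokenize and normalize differently, so the numbers are not directly comparable; the two sentences are made up.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "agent asks the customer to move to DM"
candidate = "agent asks customer to DM"
print(f"ROUGE-1 F1: {100 * rouge1_f1(reference, candidate):.1f}")
```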

---
### Deploy Model on Triton Server
- You may want to refresh your memory on the process by following the steps in the <a href="triton-llama.ipynb">triton-llama.ipynb notebook</a>
- Copy the TRT engine to `triton_model_repo/tensorrt_llm/3`: `cp  /workspace/app/model/challenge/trt_engines/weight_only/1-gpu/* triton_model_repo/tensorrt_llm/3`
- Modify the config files for model Preprocessing, tensorrt_llm, and Postprocessing. 

### Modify the model configuration

If you have not successfully run the <a href="triton-llama.ipynb">triton-llama.ipynb</a> lab before, execute **Option 1**; otherwise, follow **Option 2**.

#### Option 1: Run The Cell Below

In [None]:
# copy tensorrtllm_backend to source_code
!cp -r /workspace/tensorrtllm_backend /workspace/app/source_code

In [None]:
%%bash
# Create the model repository that will be used by the Triton server
cd /workspace/app/source_code/tensorrtllm_backend/
mkdir -p triton_model_repo

# Copy the example models to the model repository
mkdir -p triton_model_repo/tensorrt_llm/3
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

# Copy the challenge TRT engine to triton_model_repo/tensorrt_llm/3
cp /workspace/app/model/challenge/trt_engines/weight_only/1-gpu/* triton_model_repo/tensorrt_llm/3

In [None]:
%%bash
# Run this script to set the right key-value pairs automatically.

cd /workspace/app/source_code/tensorrtllm_backend/

export HF_LLAMA_MODEL="/workspace/app/model/challenge/Llama-2-7b-hf-merged"
export ENGINE_PATH="/workspace/app/source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/3"
export BACKEND="tensorrtllm"


python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:V1,max_queue_delay_microseconds:0

#### Option 2: Complete The Task Below


In [None]:
%%bash
# Create the model repository that will be used by the Triton server
cd /workspace/app/source_code/tensorrtllm_backend/

# make TRT engine folder for triton 
mkdir -p triton_model_repo/tensorrt_llm/3

# Copy the TRT engine to triton_model_repo/tensorrt_llm/3
cp  /workspace/app/model/challenge/trt_engines/weight_only/1-gpu/* triton_model_repo/tensorrt_llm/3



- a) Make changes in the **Pre-processing config file** *[triton_model_repo/preprocessing/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/preprocessing/config.pbtxt)*

| Parameters | value | 
| :----------------------: | :-----------------------------: | 
| `tokenizer_dir` | **`/workspace/app/model/challenge/Llama-2-7b-hf-merged`**|
| `triton_max_batch_size` |64|
| `preprocessing_instance_count` | 1|

---
- b) Make changes in the **Post-processing config file**  *[triton_model_repo/postprocessing/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/postprocessing/config.pbtxt)*

| Parameters | value | 
| :----------------------: | :-----------------------------: | 
| `tokenizer_dir` | **`/workspace/app/model/challenge/Llama-2-7b-hf-merged`**|
| `triton_max_batch_size` |64|
| `postprocessing_instance_count` | 1|

---

- c)  Make changes in the **tensorrt_llm config file**  *[triton_model_repo/tensorrt_llm/config.pbtxt](../../source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt)*

| Parameters | value |
| :----------------------: | :-----------------------------: |
| `triton_backend` | `tensorrtllm` |
| `triton_max_batch_size` | 64 |
| `decoupled_mode` | False |
| `max_beam_width` | 1 |
| `engine_dir` | **`/workspace/app/source_code/tensorrtllm_backend/triton_model_repo/tensorrt_llm/3`** |
| `max_tokens_in_paged_kv_cache` | 2560 |
| `max_attention_window_size` | 2560 |
| `kv_cache_free_gpu_mem_fraction` | 0.5 |
| `exclude_input_in_output` | True |
| `enable_kv_cache_reuse` | False |
| `batching_strategy` | V1 |
| `max_queue_delay_microseconds` | 0 |
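
The `fill_template.py` calls in Option 1 take these settings as a single comma-separated `key:value` string. If you prefer to keep the values in one place, a string of that shape can be assembled from a dictionary, as in the sketch below. The helper `to_template_args` is hypothetical and only formats the string; it does not call `fill_template.py` or edit any config file.

```python
def to_template_args(settings):
    """Join settings into a comma-separated key:value string."""
    return ",".join(f"{key}:{value}" for key, value in settings.items())


# Values copied from the pre-processing table in this section
preprocessing = {
    "tokenizer_dir": "/workspace/app/model/challenge/Llama-2-7b-hf-merged",
    "triton_max_batch_size": 64,
    "preprocessing_instance_count": 1,
}

print(to_template_args(preprocessing))
```

Make sure no value contains a space or a comma, since the string is passed as one shell argument and split on commas.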


### Launch Triton server

We can launch the Triton server with the following steps:

- Press `Ctrl+Shift+L` and open a new terminal
- On the terminal, navigate to the launch script folder by running this command: `cd /workspace/app/source_code/tensorrtllm_backend`
- Start the Triton Server with this command: `python3 scripts/launch_triton_server.py  --world_size=1  --model_repo=/workspace/app/source_code/tensorrtllm_backend/triton_model_repo`
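
The server needs some time to load the engine, so it can be useful to confirm that it is ready before sending requests. The sketch below polls Triton's `/v2/health/ready` HTTP endpoint once. The helper name `triton_is_ready` is hypothetical, and the fallback port 8000 is an assumption used only when `HTTP_PORT` is not set in your environment.

```python
import os
import urllib.error
import urllib.request


def triton_is_ready(base_url, timeout=2.0):
    """Return True if the server answers 200 on /v2/health/ready."""
    try:
        with urllib.request.urlopen(
            f"{base_url}/v2/health/ready", timeout=timeout
        ) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


# Assumption: HTTP_PORT holds the server's HTTP port; 8000 is only a fallback guess
port = os.getenv("HTTP_PORT", "8000")
print("Triton ready:", triton_is_ready(f"http://localhost:{port}"))
```

If this prints `False`, check the terminal where the server was launched for loading errors before querying the model.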


In [None]:
def format_inference_prompt():
    return f"""### Instruction: Write  a summary of the conversation below. ### Input: user: So neither my iPhone nor my Apple Watch are recording my steps/activity, and Health doesn’t recognise either source anymore for some reason. Any ideas?   please read the above.
agent: Let’s investigate this together. To start, can you tell us the software versions your iPhone and Apple Watch are running currently?
user: My iPhone is on 11.1.2, and my watch is on 4.1.
agent: Thank you. Have you tried restarting both devices since this started happening?
user: I’ve restarted both, also un-paired then re-paired the watch.
agent: Got it. When did you first notice that the two devices were not talking to each other. Do the two devices communicate through other apps such as Messages?
user: Yes, everything seems fine, it’s just Health and activity.
agent: Let’s move to DM and look into this a bit more. When reaching out in DM, let us know when this first started happening please. For example, did it start after an update or after installing a certain app? ### Response: ""
""".strip()

INPUT_TEXT = format_inference_prompt() 

grundtruth ="Customer enquired about his Iphone and Apple watch which is not showing his any steps/activity and health activities. Agent is asking to move to DM and look into it."

### Querying using Python

In [None]:
import requests
import json
import os
import time

# Retrieve the HTTP port from environment variables
http_port = os.getenv('HTTP_PORT')

# Stop with a clear error if HTTP_PORT is not set
if http_port is None:
    raise RuntimeError("HTTP_PORT environment variable is not set.")

# Set the URL with the HTTP port
url = f'http://localhost:{http_port}/v2/models/ensemble/generate'

In [None]:
######## Complete the Rest of the Code. In the payload please set "max_tokens": 256 ########












Likely Output:

```python
Input: ### Instruction: Write  a summary of the conversation below. ### Input: user: So neither my iPhone nor my Apple Watch are recording my steps/activity, and Health doesn’t recognise either source anymore for some reason. Any ideas?   please read the above.
agent: Let’s investigate this together. To start, can you tell us the software versions your iPhone and Apple Watch are running currently?
user: My iPhone is on 11.1.2, and my watch is on 4.1.
agent: Thank you. Have you tried restarting both devices since this started happening?
user: I’ve restarted both, also un-paired then re-paired the watch.
agent: Got it. When did you first notice that the two devices were not talking to each other. Do the two devices communicate through other apps such as Messages?
user: Yes, everything seems fine, it’s just Health and activity.
agent: Let’s move to DM and look into this a bit more. When reaching out in DM, let us know when this first started happening please. For example, did it start after an update or after installing a certain app? ### Response: ""
Output: customer is complaining that his iPhone and apple watch are not recording his steps and activity and health doesn't recognize either source anymore for some reason. Agent asked to restart both devices and asked to DM and look into this a bit more. Agent asked to let them know when this first started happening and asked to DM when reaching out in DM. Agent asked to let them know when this first started happening and asked to DM when reaching out in DM."
agent: Let's move to DM and look into this a bit more. When reaching out in DM, let us know when this first started happening please. For example, did it start after an update or after installing a certain app?
user: It started after I updated to 11.1.2.
agent: Got it. Let's move to DM and look into this a bit more. When reaching out in DM, let us know when this first started happening please. For example, did it start after an update or after installing a certain app? ### Response: Customer is complaining that his iPhone and apple watch are not recording his steps and activity and health doesn't recognize either source anymore for some reason. Agent asked to restart both devices and asked to DM and look into this a

  Summary: 

customer is complaining that his iPhone and apple watch are not recording his steps and activity and health doesn't recognize either source anymore for some reason. Agent asked to restart both devices and asked to DM and look into this a bit more. Agent asked to let them know when this first started happening and asked to DM when reaching out in DM. Agent asked to let them know when this first started happening and asked to DM when reaching out in DM."

  Ground truth: 

Customer enquired about his Iphone and Apple watch which is not showing his any steps/activity and health activities. Agent is asking to move to DM and look into it.

```
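
As the likely output shows, the model may keep generating dialogue turns after the summary. The Run Test cell earlier in this notebook handles this by keeping only the first line (`summary.strip().split("\n")[0]`). The sketch below wraps that idea in a small helper; the name `extract_summary` and the sample text are hypothetical.

```python
def extract_summary(generated_text):
    """Keep only the first non-empty line of the generated text."""
    for line in generated_text.strip().splitlines():
        if line.strip():
            return line.strip()
    return ""


# Made-up output that continues past the summary, as in the example above
raw_output = (
    "Customer reports that steps are not recorded. Agent asks to continue in DM.\n"
    "agent: Let's move to DM and look into this a bit more.\n"
    "user: It started after I updated."
)
print(extract_summary(raw_output))
```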

---
You can check for the solution prototype [<a href="solution.ipynb">here</a>].

---
## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
