### NOTEBOOK FOR TESTING LLAMA-3-8B-INSTRUCT ###
#### TESTING 3 APPROACHES FOR LOADING MODEL ####
#### 1) Using pipeline
#### 2) Loading model using AutoModelForCausalLM
#### 3) Loading model using AutoModelForCausalLM with 4-bit quantization (BitsAndBytesConfig)

In [None]:
import transformers
import torch

## APPROACH 2
#### Create the model

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", load_in_4bit = True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto")  # device_map="auto" needs the accelerate package (pip install accelerate)
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# question_to_model = "Who is the CEO of Tesla"
# model_inputs = tokenizer([question_to_model], return_tensors="pt").to("cuda")
# model_inputs = tokenizer([question_to_model], return_tensors="pt")

## Prompt the model and print output

In [None]:
question_to_model = "Who is the CEO of Tesla?"
# tokenize the question and move the tensors to the device the model was placed on
model_inputs = tokenizer([question_to_model], return_tensors="pt").to(model.device)
# calculate the number of tokens in model_inputs
num_tokens = model_inputs['input_ids'].shape[1]
# print the number of tokens
print(num_tokens)
# generate output using the model; sampling is enabled because greedy decoding cannot return more than one sequence
output = model.generate(**model_inputs, max_new_tokens=50, do_sample=True, num_return_sequences=5)
# decode the output
output = tokenizer.batch_decode(output, skip_special_tokens=True)
# print the output
print(output)

## APPROACH 1
#### Create the pipeline

In [1]:
# Use a pipeline as a high-level helper
import os
import torch
from transformers import pipeline
# read the Hugging Face access token from the environment instead of hardcoding it in the notebook
llama3_hf_token = os.environ.get("HF_TOKEN")
# pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", token = llama3_hf_token)
# create pipeline with cuda
# pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", token = llama3_hf_token, device=0)
# pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", token = llama3_hf_token, device=0, model_kwargs={
  #      "torch_dtype": torch.float16,
   #     "quantization_config": {"load_in_4bit": True}},)
# without device
#pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", token = llama3_hf_token,device_map = "cuda:0" , model_kwargs={
 #       "torch_dtype": torch.float16,
  #      "quantization_config": {"load_in_4bit": True}},)
pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", token = llama3_hf_token,device_map = "auto" , model_kwargs={
        "torch_dtype": torch.float16,
        # "quantization_config": {"load_in_4bit": True}},
        "quantization_config": {"load_in_4bit": True,"bnb_4bit_compute_dtype":torch.float16}})

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [2]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## PROMPT THE PIPELINE

#### Directly send the prompt as a sentence

In [None]:
sequences = pipe(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)

In [None]:
# print the generated sequences
for sequence in sequences:
    print(sequence['generated_text'])

In [None]:
question_to_model = "Who is the CEO of Tesla"
# prompt the llama 3 model with the question and print the answer
output = pipe(question_to_model, max_length=100)

In [None]:
print(output[0]['generated_text'])

#### Send the prompt using message template to the pipeline

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
output = pipe(messages, max_new_tokens=128)

In [None]:
type(output[0]['generated_text'])

In [None]:
print(output[0]['generated_text'][-1]['content'])
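
When the pipeline receives a list of chat messages, the cells above index `generated_text` as a list of role/content dicts and take the last entry as the new assistant turn. The helper below is a minimal sketch of that lookup. `extract_reply` and `mock_output` are hypothetical names, and the mock only imitates the shape of the pipeline output seen in this notebook.

```python
def extract_reply(pipeline_output):
    """Return the text of the last message in the first returned sequence."""
    conversation = pipeline_output[0]["generated_text"]
    last_message = conversation[-1]
    if last_message["role"] != "assistant":
        raise ValueError("the last message is not an assistant turn")
    return last_message["content"]


# Mock object with the same shape as the chat output printed in this notebook.
mock_output = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a friendly chatbot."},
            {"role": "user", "content": "What is cluster analysis?"},
            {"role": "assistant", "content": "A way to group similar data points."},
        ]
    }
]
print(extract_reply(mock_output))
```

The role check makes a missing reply fail loudly instead of silently returning the user message.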

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in a factual manner and answer in a single sentence",
    },
    {"role": "user", "content": "What is cluster analysis?"},
]
print(pipe(messages, max_new_tokens=20)[0]['generated_text'][-1])

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in a factual manner and answer in a single sentence",
    },
    
]
print(pipe(messages, max_new_tokens=20)[0]['generated_text'][-1])

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in a factual manner and answer in a single sentence. In addition to your knowledge, you will also use the following information on hydraulic hose types ---\
            Class 1:Same as 30R2, Type 1, per SAE J30 (latest issue). Reinforced between tube and cover with one ply of braided, knit, spiral or woven fabric. \
                Class 2:Same as 30R2, Type 2, per SAE J30 (latest issue). Reinforced between tube and cover with two braided plies of woven fabric. \
                    Class 3:Same as 30R2, Type 3, per SAE J30 (latest issue). Reinforced between tube and cover with one braided ply of textile yarn. \
                        Class 4:Same as 100R4, per SAE J517 (latest issue). Usually used for vacuum application.Reinforced between tube and cover with a ply or plies of woven or braided textile"
    },
    {"role": "user", "content": "What is cluster analysis?"},
    {'role': 'assistant', 'content': 'Cluster analysis is a type of unsupervised machine learning technique used to group similar objects or data points'},
    {"role": "user", "content": "What are its benefits?"},
    {'role': 'assistant', 'content': 'Cluster analysis helps to identify patterns, relationships, and structures in data, enables data visualization, and facilitates'},
    {"role": "user", "content": "What is class 1?"},

]
print(pipe(messages, max_new_tokens=20)[0]['generated_text'][-1])
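
The multi-turn cell above writes the earlier assistant answers into `messages` by hand. A small sketch of how that bookkeeping could be automated is shown below. `ask` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in so the example runs without loading the model.

```python
def ask(messages, question, generate_fn):
    """Append a user turn, generate a reply, and append it to the history."""
    messages.append({"role": "user", "content": question})
    reply = generate_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_generate(messages):
    # Stand-in for the real model call; it only reports which turn it saw.
    return f"(reply to: {messages[-1]['content']})"


history = [{"role": "system", "content": "Answer in a single sentence."}]
ask(history, "What is cluster analysis?", fake_generate)
ask(history, "What are its benefits?", fake_generate)
print([m["role"] for m in history])
```

With the real pipeline, `generate_fn` could be something like `lambda msgs: pipe(msgs, max_new_tokens=20)[0]['generated_text'][-1]['content']`, assuming the output shape used elsewhere in this notebook.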

In [6]:
# text_file = "/home/vp899/projects/llm_pilot/Data/MAT1130_Context.txt"
text_file = "/home/vp899/projects/llm_pilot/Data/MAT1130_Context - shorter.txt"
# read the contents of text_file into str
with open(text_file, 'r') as file:
    input_text = file.read()
# print(text)
system_prompt = "You are a helpful digital assistant.\
    You will provide clear answers in 3 sentences. Your main reference is the text included within triple quotes.At the beginning of each answer,\
        you will include the main json tag from the portion of the reference text that you are using for the answer.''' " +  input_text + " '''"
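
A backslash at the end of a line inside a string literal continues the string, so the indentation of the following lines becomes part of the prompt. The sketch below builds the same kind of prompt from adjacent string literals, which avoids the stray whitespace. `build_system_prompt` is a hypothetical helper and the reference text is a placeholder.

```python
def build_system_prompt(reference_text):
    """Join the instructions without stray indentation and wrap the reference text in triple quotes."""
    instructions = (
        "You are a helpful digital assistant. "
        "You will provide clear answers in 3 sentences. "
        "Your main reference is the text included within triple quotes. "
        "At the beginning of each answer, you will include the main json tag "
        "from the portion of the reference text that you are using for the answer."
    )
    return f"{instructions} ''' {reference_text} '''"


prompt = build_system_prompt('{"TITLE": "example"}')
print(prompt)
```

The wording changes slightly the prompt the model sees, so results may differ from the outputs recorded in this notebook.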

In [7]:
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": "What are the key points in MAT1130?"},
    
]
output = pipe(messages, max_new_tokens=500)[0]['generated_text'][-1]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [8]:
print(output)

{'role': 'assistant', 'content': '**"TITLE"**: Material Specification for HYDRAULIC LINE TUBING - MAT1130**\n\nThe key points in MAT1130 are:\n\n1. The specification provides reference information for CNH hydraulic line tubing grades listed in Table 1.\n2. It is intended to replace Former CNH Company Material Specifications listed in Table 4 and should be used on all applicable new and updated engineering drawings.\n3. All National Standards and related test method designations are to be latest issue unless otherwise specified.\n\nNote: This summary is based on the main json tag "TITLE" and provides a brief overview of the key points in MAT1130.'}


In [None]:
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": "What is MAT1130?"},
    
]

In [9]:
print(pipe(messages, max_new_tokens=500)[0]['generated_text'][-1])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'role': 'assistant', 'content': '**"TITLE"**: Material Specification for HYDRAULIC LINE TUBING - MAT1130**\n\nThe key points in MAT1130 are:\n\n* Provides reference information for CNH hydraulic line tubing grades listed in Table 1.\n* Intended to replace Former CNH Company Material Specifications listed in Table 4.\n* Applicable to steel hydraulic line tubing used for applications where fluids are transferred under pressure.\n* Grades A, B, BF, C, E, and F are furnished normalized or annealed.\n* Grade A tubing can be used for applications specifying Grade B or BF tubing, and vice versa.\n* Grade C tubing can be used for applications specifying Grade A tubing, but requires review and approval by CNH Industrial Design Engineering and Manufacturing Engineering.\n* Dimensional differences are particularly important for brazed applications since clearances between tubing OD and fitting ID significantly affect brazed joint integrity.'}


### Calculate the number of tokens

In [11]:
# Consolidate all the content from messages
content = "".join([message["content"] for message in messages])
# tokenize the content (the tensors do not need to be moved to the GPU just to count tokens)
tokens = tokenizer(content, return_tensors="pt")
# get the number of tokens
num_tokens = tokens['input_ids'].shape[1]
# print the number of tokens (role headers added by the chat template are not included in this count)
print(num_tokens)

2442


## APPROACH 3
#### AutoModelForCausalLM && load_in_4bit

In [None]:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

In [None]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_4b = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")


In [None]:
text_file = "/home/vp899/projects/llm_pilot/Data/MAT1130_Context.txt"
# read the contents of text_file into str
with open(text_file, 'r') as file:
    input_text = file.read()
# print(input_text)
# First version of the system prompt (asks for the json tag); it is overridden by the shorter version below
system_prompt = "You are a helpful digital assistant.\
    You will provide clear answers in 3 sentences. Your main reference is the text included within triple quotes.At the beginning of each answer,\
        you will include the main json tag from the portion of the reference text that you are using for the answer.''' " +  input_text + " '''"
# Shorter version (no json-tag instruction); this is the one used in the cells that follow
system_prompt = "You are a helpful digital assistant.\
    You will provide clear answers in 3 sentences. Your main reference is the text included within triple quotes.''' " +  input_text + " '''"

In [None]:
system_prompt_token = tokenizer(system_prompt, return_tensors="pt").to("cuda")

In [None]:
system_prompt_token.input_ids.shape

In [None]:
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": "What are the key points in MAT1130?"},
    
]
# output = pipe(messages, max_new_tokens=500)[0]['generated_text'][-1]

In [None]:
# "[PAD]" is not a token in the Llama 3 vocabulary, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

In [None]:
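# model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")  # did not work in this run; the KV cache section below uses this form, following the HF model card
# without add_generation_prompt=True the prompt ends after the user turn, with no assistant header
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")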
# model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").input_ids.to("cuda")


terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
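
To make it easier to see what `apply_chat_template` produces and why `<|eot_id|>` is listed as a terminator, the sketch below approximates the Llama 3 Instruct prompt layout as a plain string. It follows the format published with the model as far as I know it. The tokenizer's own template remains the authoritative version, and `render_llama3_prompt` is a hypothetical name.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct prompt layout as a plain string."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += message["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model writes the answer next.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


demo = [
    {"role": "system", "content": "You are a helpful digital assistant."},
    {"role": "user", "content": "What are the key points in MAT1130?"},
]
print(render_llama3_prompt(demo))
```

Each turn ends with `<|eot_id|>`, which is why generation is stopped on that token as well as on the tokenizer's EOS token. Decoding with `skip_special_tokens=True` drops the header tokens but keeps the role names, which would explain the bare `system` and `user` words in decoded outputs.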

In [None]:
# generate output using the model
output = model_4b.generate(model_inputs, max_new_tokens = 500, eos_token_id = terminators, do_sample=True, temperature=0.8, top_p =0.9)
# output = model_4b.generate(model_inputs.input_ids, max_new_tokens = 500, eos_token_id = terminators, do_sample=True, temperature=0.8, top_p =0.9)

In [None]:
# decode the output
output = tokenizer.batch_decode(output, skip_special_tokens=True)

In [None]:
# save output to a text file
with open('output.txt', 'w') as f:
    f.write(output[0])

## SUMMARIZE INPUT FILE

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto")  # device_map="auto" needs the accelerate package (pip install accelerate)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [2]:
text_file = "/home/vp899/projects/llm_pilot/Data/MAT1130_Context.txt"
# read the contents of text_file into str
with open(text_file, 'r') as file:
    input_text = file.read()
# print(text)
system_prompt = "You are a helpful digital assistant. You will provide clear and factual answers"

In [3]:
# find the number of words in the input_text
num_words = len(input_text.split())
# print the number of words
print(num_words)


2496


In [4]:
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": "Can you please create a 1200 word summary of the text within the triple quotes?. Include all the quantitative details ''' " + input_text + "'''"},
    
]

In [7]:
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

model_output = model.generate(model_inputs, max_length = 5500, pad_token_id=tokenizer.eos_token_id, do_sample=True, temperature=0.8, top_p =0.9)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
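
The `max_length = 5500` used in the call above caps the prompt and the generated tokens together, whereas `max_new_tokens` caps only the generated part. The sketch below shows the arithmetic. `new_token_budget` is a hypothetical helper and the prompt sizes are made-up values, not measurements from this notebook.

```python
def new_token_budget(max_length, prompt_tokens):
    """Tokens left for generation when max_length caps prompt plus output."""
    return max(max_length - prompt_tokens, 0)


# Hypothetical prompt sizes, for illustration only.
for prompt_tokens in (2000, 4000, 6000):
    print(prompt_tokens, new_token_budget(5500, prompt_tokens))
```

With a long document in the prompt, passing `max_new_tokens` directly is usually the less surprising way to reserve room for the summary.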


In [8]:
# decode the output
output = tokenizer.batch_decode(model_output, skip_special_tokens=True)

In [9]:
# print the output
print(output[0])

system

You are a helpful digital assistant. You will provide clear and factual answersuser

Can you please create a 1200 word summary of the text within the triple quotes?. Include all the quantitative details ''' {"TITLE": "Material Specification for HYDRAULIC LINE TUBING - MAT1130",
"Document": [
{"SCOPE": "1.1) This specification provides reference information for CNH hydraulic line tubing grades listed in Table 1. It is intended to replace Former CNH Company Material Specifications listed in Table 4 and should be used on all applicable new and updated engineering drawings.1.2) All National Standards and related test method designations are to be latest issue unless otherwise specified."},
{"CNH Material Grade and Material Description":"
Following are the CNH Material Grade and Material Description shown within paranthesis:
(CNH Grade A: Cold Drawn, Single Wall, Welded Tubing, with No Internal Flash. For use in applications involving transmission of fluid at higher pressure; recomm

## GENERATE INPUT SUMMARY USING PIPELINE

In [1]:
import os
import torch
from transformers import pipeline
# read the Hugging Face access token from the environment instead of hardcoding it in the notebook
llama3_hf_token = os.environ.get("HF_TOKEN")
# pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", token = llama3_hf_token)
# create pipeline with cuda
pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", token = llama3_hf_token, device=0)


config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [2]:
text_file = "/home/vp899/projects/llm_pilot/Data/MAT1130_Context.txt"
# read the contents of text_file into str
with open(text_file, 'r') as file:
    input_text = file.read()
# print(text)
system_prompt = "You are a helpful digital assistant. You will provide clear and factual answers"

In [3]:
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": "Can you please create a 1200 word summary of the text within the triple quotes?. Include all the quantitative details ''' " + input_text + "'''"},
    
]

In [4]:
model_output = pipe(messages, max_new_tokens=2500)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [5]:
print(model_output[0]['generated_text'][-1])

{'role': 'assistant', 'content': 'Here is a 1200-word summary of the text:\n\nThe Material Specification for Hydraulic Line Tubing (MAT1130) provides reference information for CNH hydraulic line tubing grades listed in Table 1. This specification is intended to replace former CNH Company Material Specifications and should be used on all applicable new and updated engineering drawings.\n\nThe specification outlines the different CNH Material Grades and their descriptions, including Grade A, Grade B, Grade BF, Grade C, Grade D, Grade E, and Grade F. Each grade has its own unique characteristics, such as cold drawn, single wall, welded tubing, and flash control. The grades are categorized based on their material properties, such as tensile strength, yield strength, and elongation.\n\nTable 2 lists the local material and national standard references for each CNH Material Grade. The table shows the corresponding national standards and materials that can be used as alternatives to the CNH Ma

## KV cache experiment
## Parse output before sending to decoder [based on example in HF model card]

In [2]:
import torch

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", torch_dtype = torch.float16)  # device_map="auto" needs the accelerate package (pip install accelerate)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [4]:
# model.generation_config.cache_implementation = "static"

In [5]:
text_file = "/home/vp899/projects/llm_pilot/Data/MAT1130_Context.txt"
# read the contents of text_file into str
with open(text_file, 'r') as file:
    input_text = file.read()
# print(text)
system_prompt = "You are a helpful digital assistant. You will provide clear and factual answers"

In [6]:
# find the number of words in the input_text
num_words = len(input_text.split())
# print the number of words
print(num_words)


2496


In [7]:
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": "Can you please create a 1200 word summary of the text within the triple quotes?. Include all the quantitative details ''' " + input_text + "'''"},
    
]

In [8]:
# compiled_model = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
# compiled_model = model

#### Changes to tokenizer input based on HF model card example -- 5/15

In [9]:
# model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")

In [10]:
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

In [11]:
# model_output = model.generate(model_inputs, max_length = 5500, pad_token_id=tokenizer.eos_token_id, do_sample=True, temperature=0.8, top_p =0.9)
model_output = model.generate(
    model_inputs,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    cache_implementation="static",
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Triton Error [CUDA]: device kernel image is invalid

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
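
The `huggingface/tokenizers` fork warning in the output above suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of doing that from the notebook itself follows, assuming it runs in the first cell, before the tokenizer is used. It addresses only the warning, not the `BackendCompilerFailed` error.

```python
import os

# Set this before the tokenizer is used so the forked processes see the setting.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```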


In [None]:
response = model_output[0][model_inputs.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

NameError: name 'model_output' is not defined
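
The `NameError` above most likely appears because the `generate` cell failed, so `model_output` was never assigned. The slicing itself relies on `generate` returning the prompt ids followed by the new ids for decoder-only models. The sketch below shows the same slicing on plain lists. `strip_prompt` is a hypothetical helper and the token ids are made-up values.

```python
def strip_prompt(sequence_ids, prompt_length):
    """Keep only the token ids generated after the prompt."""
    return sequence_ids[prompt_length:]


# Made-up token ids: the first four stand for the prompt, the rest for the reply.
prompt_ids = [128000, 9906, 1917, 128009]
full_sequence = prompt_ids + [40, 1097, 264]
print(strip_prompt(full_sequence, len(prompt_ids)))
```

Decoding only the sliced ids avoids echoing the system prompt and the input document back in the printed answer.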