## Notebook-2 Model Testing 
## Fine-tune Llama 2 for Inverse Information Generation (Reverse-Thinking)
An instruction is a piece of text, or prompt, given to an LLM so that it generates an answer.

***The goal is to create a model that can perform inverse information generation, or filtering, based on user input***. The idea is that the model should be able to distill a large amount of information down to its key points, based on the fine-tuning dataset.
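
As a minimal sketch of how a piece of text is wrapped for this reverse task, the helper below rebuilds the instruction/input/response layout that the Test-1 cell in this notebook writes inline. The function name `build_prompt` is a hypothetical convenience and is not part of the original notebook.

```python
def build_prompt(text_input: str) -> str:
    """Wrap text in the instruction/input/response layout used by Test-1."""
    return (
        "### Instruction:\n"
        "Use the Input below to create an instruction, which could have been "
        "used to generate the input using an LLM.\n\n"
        "### Input:\n"
        f"{text_input}\n\n"
        "### Response:\n"
    )


# The model is expected to continue the text after the final marker
example = build_prompt("1. Mashing 2. Separation 3. Boiling 4. Fermentation.")
print(example)
```

Ending the prompt on `### Response:` means the continuation the model generates is the reconstructed instruction itself.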

### 1. Environment Setup and Info

In [None]:
!pip install "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "t

In [6]:
def show_cuda_info():
    """Print the active device and, on CUDA, basic GPU memory statistics."""
    import torch
    # Use the GPU if available, otherwise fall back to the CPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print('Using device:*', device)
    # Additional info when using CUDA
    if device.type == 'cuda':
        print('Device count:', torch.cuda.device_count())
        print('Device  Name:', torch.cuda.get_device_name(0))
        print('Memory Usage:')
        print('   Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
        print('      Cached:', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

show_cuda_info()

Using device:* cuda
Device count: 1
Device  Name: NVIDIA GeForce RTX 3090
Memory Usage:
   Allocated: 6.7 GB
      Cached: 7.1 GB
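
The memory figures above are byte counts divided by `1024**3`, so strictly speaking they are GiB even though the label reads GB. A small sketch of that conversion follows; `format_gib` is a hypothetical helper name used only for illustration.

```python
def format_gib(num_bytes: int, digits: int = 1) -> str:
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes), labeled like the cell output."""
    return f"{round(num_bytes / 1024**3, digits)} GB"


print(format_gib(1024**3))           # exactly one GiB
print(format_gib(3 * 1024**3 // 2))  # one and a half GiB
```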


## Test-1 Use trained adapter 

In [2]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

train_model_output_dir = "llama-7-int4-dolly"
# load the base model together with the trained LoRA adapter, and the tokenizer
train_model = AutoPeftModelForCausalLM.from_pretrained(
    train_model_output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    device_map="auto"
)
train_tokenizer = AutoTokenizer.from_pretrained(train_model_output_dir)


In [3]:
text_input="1. Mashing 2. Separation 3. Boiling 4. Fermentation. The ingredients are brought together through these 4 steps."

prompt = f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{text_input}

### Response:
"""
input_ids = train_tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [4]:
outputs = train_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9)
print(f"Generated instruction:\n{train_tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")



Generated instruction:
What are the 4 main steps in making beer?



Note: expected result: "What are the 4 main steps in making beer?"
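
The cell above isolates the generated instruction by slicing the decoded string with `len(prompt)`. That assumes the decoded text starts with the prompt character for character, which may not hold if the tokenizer round trip changes whitespace. A slightly more defensive sketch, with `strip_prompt` as a hypothetical helper name:

```python
def strip_prompt(decoded: str, prompt: str) -> str:
    """Return the text generated after the prompt, checking the prefix first."""
    if decoded.startswith(prompt):
        # Safe to slice: the decoded text really begins with the prompt
        return decoded[len(prompt):].strip()
    # Prefix mismatch: return everything rather than cut at a wrong offset
    return decoded.strip()


prompt = "### Instruction:\nCreate an instruction.\n\n### Response:\n"
decoded = prompt + "What are the 4 main steps in making beer?\n"
print(strip_prompt(decoded, prompt))
```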

## Test-2 Use merged model

In [9]:
## Use the Merged Model
merged_model = "merged_model"
model_id = merged_model

In [10]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from peft import AutoPeftModelForCausalLM

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    quantization_config=bnb_config, 
    use_cache=False, 
    device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


##### Test Model and run Inference
After training is done, we want to run and test the model. Test-1 loaded the LoRA adapter with peft; here the merged model loaded above with transformers is used directly.

In [6]:
## random question test
##
# input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# # with torch.inference_mode():
# outputs = model.generate(input_ids=input_ids, max_new_tokens=1000, do_sample=True, top_p=0.9,temperature=0.9)
# result = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]
# print(f"Prompt:\n{sample['response']}\n")
# print(f"**Generated instruction**:\n{result}")
# #print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
# print(f"Ground truth:\n{sample['instruction']}")

In [7]:
# print ("=== Original ===")
# print("Instructions:\n", sample["instruction"])
# # print("Context:\n", sample["context"])
# print("Response:\n",sample["response"])
# print("Category:\n",sample["category"]) 
# print ("=== Prompt ===")
# print(prompt)
# print ("=== Result ===")
# print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")

In [11]:
text_input1="1. Mashing 2. Separation 3. Boiling 4. Fermentation. The ingredients are brought together through these 4 steps."
text_input2='A polygon is a form in Geometry.  It is a single dimensional plane made of connecting lines and any number of vertices.  It is a closed chain of connected line segments or edges.  The vertices of the polygon are formed where two edges meet.  Examples of polygons are hexagons, pentagons, and octagons.  Any plane that does not contain edges or vertices is not a polygon.  An example of a non-polygon is a circle.'
text_input3='Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.'

In [None]:
device = "cuda:0"

In [12]:
## Direct input-1
inputs = tokenizer(text_input1, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

1. Mashing 2. Separation 3. Boiling 4. Fermentation. The ingredients are brought together through these 4 steps.

### Response:
What are the 4 steps in the brewing process?



Note: expected result: "What are the 4 main steps in making beer?"

In [13]:
## Direct input-2
device = "cuda:0"
inputs = tokenizer(text_input2, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

A polygon is a form in Geometry.  It is a single dimensional plane made of connecting lines and any number of vertices.  It is a closed chain of connected line segments or edges.  The vertices of the polygon are formed where two edges meet.  Examples of polygons are hexagons, pentagons, and octagons.  Any plane that does not contain edges or vertices is not a polygon.  An example of a non-polygon is a circle.

### Response:
What is a polygon?



Note: expected result: "What is a polygon?"

In [14]:
## Direct input-3
device = "cuda:0"
inputs = tokenizer(text_input3, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

### Response:
What is Delta Lake?



Note: expected result: "What is Delta Lake?"
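
In the three direct-input runs above, the decoded output echoes the input text and then emits a `### Response:` marker followed by the question. A minimal sketch for separating the two parts is shown here; `split_response` and `RESPONSE_MARKER` are hypothetical names, not part of the notebook.

```python
from typing import Tuple

RESPONSE_MARKER = "### Response:"


def split_response(decoded: str) -> Tuple[str, str]:
    """Split decoded output into (text before the marker, text after the marker)."""
    before, found, after = decoded.partition(RESPONSE_MARKER)
    if not found:
        # No marker was generated: treat everything as echoed text
        return decoded.strip(), ""
    return before.strip(), after.strip()


decoded = "Delta Lake is an open source storage layer.\n\n### Response:\nWhat is Delta Lake?\n"
echoed, instruction = split_response(decoded)
print(instruction)
```

`str.partition` splits at the first occurrence of the marker, so any text the model adds before the marker stays in the first element.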

## Test-3 Use Hugging Face transformers pipeline

In [15]:
import transformers
from transformers import AutoTokenizer
import torch

pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # torch_dtype=torch.bfloat16,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)

In [16]:
## pipeline test input-1
sequences = pipe(
    text_input1,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: 1. Mashing 2. Separation 3. Boiling 4. Fermentation. The ingredients are brought together through these 4 steps.

### Response:
What are the 4 key steps in the brewing of beer?



Note: expected result: "What are the 4 main steps in making beer?"

In [17]:
## pipeline test input-2
sequences = pipe(
    text_input2,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: A polygon is a form in Geometry.  It is a single dimensional plane made of connecting lines and any number of vertices.  It is a closed chain of connected line segments or edges.  The vertices of the polygon are formed where two edges meet.  Examples of polygons are hexagons, pentagons, and octagons.  Any plane that does not contain edges or vertices is not a polygon.  An example of a non-polygon is a circle.

### Response:
What is geometry?



Note: expected result: "What is a polygon?"
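
The pipeline cells sample with `do_sample=True` and `top_k=10`, so repeated runs can return different questions, which may explain the mismatch above. The sketch below illustrates top-k sampling with temperature on a toy distribution. The logits and the helper name `sample_top_k` are invented for illustration and do not come from the model.

```python
import math
import random


def sample_top_k(logits, k, temperature=1.0, rng=None):
    """Sample one token from the k highest-scoring logits after temperature scaling."""
    rng = rng or random.Random()
    # Keep only the k largest logits
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Softmax over the kept logits (shifted by the max for numerical stability)
    scaled = [value / temperature for _, value in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores
toy_logits = {"polygon": 2.0, "geometry": 1.5, "circle": 0.2, "beer": -3.0}
rng = random.Random(0)
draws = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(20)]
print(set(draws))
```

With `k=2`, only the two best-scoring candidates can ever be drawn, but either of them can be picked on a given run.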

In [18]:
## pipeline test input-3
sequences = pipe(
    text_input3,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake also provides an optimized query engine, the Delta Live Query engine, that can execute DML queries on your data lake with sub-second latency.

Deltas are a powerful and intuitive way to manage change in your data lake, so it is natural to think about how you can make your data more like a delta.  This blog covers several ways to do this in Delta Lake and some important considerations when choosing among these options.

### Response:
What are ways you can make your data lake look more like a Delta?



Note: expected result: "What is Delta Lake?"
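
The notes above compare each generated question with an expected one by eye. A rough way to make that comparison repeatable is a normalized exact match plus a word-overlap (Jaccard) score. This is only a sketch: the helper names are hypothetical, and the metric is a simple stand-in rather than the notebook's own evaluation.

```python
import re


def normalize(text: str) -> list:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    set_a, set_b = set(normalize(a)), set(normalize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# (generated, expected) pairs taken from the Test-3 outputs above
results = [
    ("What are the 4 key steps in the brewing of beer?",
     "What are the 4 main steps in making beer?"),
    ("What is geometry?", "What is a polygon?"),
    ("What are ways you can make your data lake look more like a Delta?",
     "What is Delta Lake?"),
]
for generated, expected in results:
    exact = normalize(generated) == normalize(expected)
    print(f"exact={exact} overlap={jaccard(generated, expected):.2f} | {generated}")
```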

## Test-4 LangChain with system prompt

In [1]:
## Use the Merged Model
merged_model = "merged_model"
model_id = merged_model

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig,pipeline
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
import torch

qa_template = """

Question: {user_question}

{model_context}

Answer: Let's think step by step."""

template = DEFAULT_SYSTEM_PROMPT+ qa_template
prompt = PromptTemplate(template=template, input_variables=["model_context","user_question"])

print('loading QLora model 4 or 8 bits from...', model_id)
# BitsAndBytesConfig: switch between int-4 and int-8 loading
# 4bits -- 8gb vram
# 8bits -- 12gb vram
# note: the bnb_4bit_* options only take effect when load_in_4bit=True
bnb_config = BitsAndBytesConfig(
    #load_in_4bit=True,
    load_in_8bit=True,    
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
local_model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    quantization_config=bnb_config, 
    use_cache=False, 
    device_map="auto")

print('loading tokenizer from...',model_id)
local_tokenizer = AutoTokenizer.from_pretrained(model_id)

print('making a pipeline...')
hf_pipe = pipeline(
    "text-generation", 
    model=local_model, 
    tokenizer=local_tokenizer, 
    max_new_tokens=256, 
    model_kwargs={"temperature":0}
)

print('setup llm chain...')
hf_llm = HuggingFacePipeline(pipeline=hf_pipe)
llm_chain = LLMChain(prompt=prompt, llm=hf_llm, verbose=True)

#print ("==cuda after loading mode==")
#show_cuda_info()

loading QLora model 4 or 8 bits from... merged_model



loading tokenizer from... merged_model
making a pipeline...
setup llm chain...

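
The chain built above fills two placeholders, `user_question` and `model_context`, into the system prompt plus `qa_template`. For a template this simple, the substitution can be approximated with plain `str.format`, as sketched below. The shortened system prompt is a stand-in, and this is an assumption about the effective behavior rather than LangChain's internal code.

```python
# Shortened stand-in for DEFAULT_SYSTEM_PROMPT
SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

QA_TEMPLATE = """

Question: {user_question}

{model_context}

Answer: Let's think step by step."""

template = SYSTEM_PROMPT + QA_TEMPLATE

# An empty context leaves only blank lines between the question and the answer cue
filled = template.format(
    user_question="1. Mashing 2. Separation 3. Boiling 4. Fermentation.",
    model_context="",
)
print(filled)
```

Note that this prompt does not use the `### Instruction` layout from Test-1, so the model is free to answer the text before it emits a response marker.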

In [3]:

input_text="1. Mashing 2. Separation 3. Boiling 4. Fermentation. The ingredients are brought together through these 4 steps."
#input_text='A polygon is a form in Geometry.  It is a single dimensional plane made of connecting lines and any number of vertices.  It is a closed chain of connected line segments or edges.  The vertices of the polygon are formed where two edges meet.  Examples of polygons are hexagons, pentagons, and octagons.  Any plane that does not contain edges or vertices is not a polygon.  An example of a non-polygon is a circle.'
input_context=""
local_result=llm_chain.run({'model_context': input_context,
                            'user_question':input_text})



> Entering new LLMChain chain...
Prompt after formatting:
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

Question: 1. Mashing 2. Separation 3. Boiling 4. Fermentation. The ingredients are brought together through these 4 steps.



Answer: Let's think step by step.

> Finished chain.


In [4]:
print("==ANSWER==")
print(local_result)

==ANSWER==


1. Mashing: The ingredients are mixed together to form a mash.
2. Separation: The mash is separated into liquid and solid.
3. Boiling: The liquid is boiled to evaporate the water.
4. Fermentation: The liquid is fermented to produce alcohol.

### Response:
What are the steps in making beer?



Note: expected result: "What are the 4 main steps in making beer?"

In [7]:
print ("==cuda after text generation==")
show_cuda_info()

==cuda after text generation==
Using device:* cuda
Device count: 1
Device  Name: NVIDIA GeForce RTX 3090
Memory Usage:
   Allocated: 6.7 GB
      Cached: 7.1 GB
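
As a rough cross-check of the `Allocated` figure, the memory taken by the quantized weights alone can be estimated as parameters times bits per parameter. The sketch below assumes a parameter count of 7e9, which the notebook does not state, and ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Estimate weight-only memory in GiB for a given parameter count and bit width."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_PARAMS = 7e9  # assumption: not stated in the notebook

for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: {weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```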


In [26]:
## End of Testing Model