<a href="https://colab.research.google.com/github/hitpoint6/ai-llm-engineering-projects/blob/main/LLaMa_2_70B_Chatbot_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# LLaMa-2 70B Chatbot Agent

One of the most important benchmarks for an LLM is its ability to act as an agent and use tools. Open-source models have so far struggled to meet this bar. When Llama 2 was released, it was claimed to outperform many open-source models. This notebook tests it as an agent.


Note that running this notebook on a CPU is impractical because inference would take a very long time. On Google Colab, go to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > A100**. The notebook requires ~38GB of GPU RAM.

The model is hosted on [HuggingFace](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf). To download it you need:

1. Approved access to the Llama 2 model
2. A Hugging Face access token


In [4]:
!pip install -qU \
    transformers==4.31.0 \
    accelerate==0.21.0 \
    einops==0.6.1 \
    langchain==0.0.240 \
    xformers==0.0.20 \
    bitsandbytes==0.41.0

### Initializing the Hugging Face Pipeline

Loading a 70B-parameter model with the standard weight datatype (float32, 32 bits per parameter) would need about 280 GB of memory. An A100 has only 40 GB, so we quantize the weights to 4 bits with the `bitsandbytes` library.

Quantizing everything to 4 bits loses precision. As a compromise, the weights are stored in 4-bit NF4 while computation runs in bfloat16.

The model took me 15 minutes to load and consumed 37.4 GB of GPU RAM and 155.6 GB of disk.
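The memory figures above follow from simple arithmetic. The snippet below is a rough sketch that counts the weights only; the helper name `weight_memory_gb` and the rounded parameter count are assumptions made for illustration, and real usage is higher because of activations and quantization metadata.

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
N_PARAMS = 70e9  # rounded parameter count (assumption for illustration)

BITS_PER_PARAM = {"float32": 32, "bfloat16": 16, "4-bit": 4}


def weight_memory_gb(n_params, bits):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>8}: {weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

The 4-bit estimate of roughly 35 GB is in line with the 37.4 GB observed after loading, which leaves little headroom on a 40 GB A100.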


In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-70b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = '<Your Hugging Face Auth Token>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)


model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

model.eval()  # The model is used for inference not training
print(f"Model loaded on {device}")

In [8]:
# For translating text to tokens
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

Next, initialize the Hugging Face text-generation pipeline.


In [11]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs; lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
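To make the `temperature` parameter more concrete, here is a small plain-Python sketch of temperature-scaled softmax. It illustrates the general idea only and is not the code `transformers` runs; the function name and the logits are made up for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; low temperature sharpens, high flattens."""
    if temperature <= 0:
        # A temperature of zero corresponds to always picking the top token.
        raise ValueError("temperature must be positive for sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches zero, nearly all probability mass lands on the highest-scoring token, which is why low values give more repeatable answers.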

Confirm the model is working


In [13]:
res = generate_text("Why summer is hotter than winter?")
print(res[0]["generated_text"])

Why summer is hotter than winter?

Answer: Summer is hotter than winter because the Earth's axis is tilted, and during the summer months in the Northern Hemisphere, the sun's rays hit the Earth more directly. This increased exposure to direct sunlight leads to higher temperatures. Additionally, the longer days in the Northern Hemisphere during summer also contribute to the warmer temperatures. In contrast, during the winter months, the sun's rays hit the Earth at a more oblique angle, resulting in colder temperatures.


In [44]:
# coding:
res = generate_text(
    "Write a python code to generate the nth fibonacci number?")
print(res[0]["generated_text"])

Write a python code to generate the nth fibonacci number?

Answer: Sure! Here's an example of how you could write a Python function to generate the nth Fibonacci number:
```
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```
This function uses a recursive approach, where it calls itself with the arguments `n-1` and `n-2` to calculate the previous two Fibonacci numbers. The base case is when `n` is 1 or less, in which case the function returns `n` directly.

Here's an example of how you could use this function to generate the 5th Fibonacci number:
```
print(fibonacci(5))  # Output: 5
```
And here's an example of how you could use this function to generate the 10th Fibonacci number:
```
print(fibonacci(10))  # Output: 55
```
Note that this function has a time complexity of O(2^n), which means that the time it takes to calculate the nth Fibonacci number grows exponentially with n. For large values of n, this function may take a l

### LangChain Agent with Llama 2

A conversational agent requires:

1. Conversational memory
2. Access to tools
3. An LLM


In [35]:
from langchain.llms import HuggingFacePipeline

# wrap the HF pipeline so LangChain can use it; must exist before load_tools
llm = HuggingFacePipeline(pipeline=generate_text)

In [34]:
from langchain.memory import ConversationBufferWindowMemory
from langchain.agents import load_tools

# keep the last 5 exchanges as chat history
memory = ConversationBufferWindowMemory(
    memory_key="chat_history", k=5, return_messages=True, output_key="output"
)
# the llm-math tool needs the LLM defined above
tools = load_tools(["llm-math"], llm=llm)
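The idea behind a windowed conversation memory with `k=5` can be sketched in plain Python. The `WindowMemory` class below is a hypothetical stand-in written for illustration, not LangChain's implementation: it keeps only the most recent `k` exchanges and silently drops older ones.

```python
from collections import deque


class WindowMemory:
    """Hypothetical stand-in: remembers only the last k (human, ai) exchanges."""

    def __init__(self, k):
        self.turns = deque(maxlen=k)  # the oldest exchange falls off automatically

    def save(self, human, ai):
        self.turns.append((human, ai))

    def history(self):
        messages = []
        for human, ai in self.turns:
            messages.append(("human", human))
            messages.append(("ai", ai))
        return messages


memory_sketch = WindowMemory(k=2)
for i in range(4):
    memory_sketch.save(f"question {i}", f"answer {i}")
print(memory_sketch.history())
```

Bounding the history this way keeps the prompt from growing without limit, at the cost of forgetting anything older than the window.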

In [16]:
llm(prompt="Why summer is hotter than winter?")

"\n\nAnswer: Summer is hotter than winter because the Earth's axis is tilted, and during the summer months in the Northern Hemisphere, the sun's rays hit the Earth more directly. This increased exposure to direct sunlight leads to higher temperatures. Additionally, the longer days in the Northern Hemisphere during summer also contribute to the warmer temperatures. In contrast, during the winter months, the sun's rays hit the Earth at a more oblique angle, resulting in colder temperatures."

In [36]:
from langchain.agents import initialize_agent

# initialize agent
agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    early_stopping_method="generate",
    memory=memory,
    # agent_kwargs={"output_parser": parser}
)

In [37]:
agent.agent.llm_chain.prompt

ChatPromptTemplate(input_variables=['input', 'chat_history', 'agent_scratchpad'], output_parser=None, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='Assistant is a large language model trained by OpenAI.\n\nAssistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.\n\nAssistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, A

Add Llama-2 special tokens to the prompt


In [38]:
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
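For reference, Llama 2 chat models expect the system prompt to be wrapped in `<<SYS>>` markers inside the first `[INST]` block. The sketch below shows how a single-turn prompt is commonly assembled from these tokens. `build_llama2_prompt` is a hypothetical helper for illustration; this notebook instead spreads the pieces across the agent's message templates.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_llama2_prompt(system_message, user_message):
    """Assemble a single-turn Llama 2 chat prompt (hypothetical helper)."""
    return f"{B_INST} {B_SYS}{system_message}{E_SYS}{user_message} {E_INST}"


print(build_llama2_prompt("You are a helpful assistant.", "What is 2 + 2?"))
```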

In [39]:
sys_msg = B_SYS + """Assistant is an expert JSON builder designed to assist with a wide range of tasks.

Assistant is able to respond to the User and use tools using JSON strings that contain "action" and "action_input" parameters.

All of Assistant's communication is performed using this JSON format.

Assistant can also use tools by responding to the user with tool use instructions in the same "action" and "action_input" JSON format. Tools available to Assistant are:

- "Calculator": Useful for when you need to answer questions about math.
  - To use the calculator tool, Assistant should write like so:
    ```json
    {{"action": "Calculator",
      "action_input": "sqrt(4)"}}
    ```

Here are some previous conversations between the Assistant and User:

User: Hey how are you today?
Assistant: ```json
{{"action": "Final Answer",
 "action_input": "I'm good thanks, how are you?"}}
```
User: I'm great, what is the square root of 4?
Assistant: ```json
{{"action": "Calculator",
 "action_input": "sqrt(4)"}}
```
User: 2.0
Assistant: ```json
{{"action": "Final Answer",
 "action_input": "It looks like the answer is 2!"}}
```
User: Thanks could you tell me what 4 to the power of 2 is?
Assistant: ```json
{{"action": "Calculator",
 "action_input": "4**2"}}
```
User: 16.0
Assistant: ```json
{{"action": "Final Answer",
 "action_input": "It looks like the answer is 16!"}}
```

Here is the latest conversation between Assistant and User.""" + E_SYS
new_prompt = agent.agent.create_prompt(
    system_message=sys_msg,
    tools=tools
)
agent.agent.llm_chain.prompt = new_prompt
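The doubled braces in the system message are escapes for the prompt template, so the model is expected to reply with ordinary single-brace JSON. As a rough illustration of how such a reply could be turned into an action, here is a minimal sketch. `parse_action` is a hypothetical helper; the agent in this notebook relies on LangChain's own output parser.

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
ACTION_BLOCK = re.compile(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def parse_action(reply):
    """Return (action, action_input) from the first fenced JSON block in a reply."""
    match = ACTION_BLOCK.search(reply)
    if match is None:
        raise ValueError("no JSON action block found in reply")
    payload = json.loads(match.group(1))
    return payload["action"], payload["action_input"]


reply = (
    "Assistant: " + FENCE + "json\n"
    '{"action": "Calculator",\n'
    ' "action_input": "4**2.1"}\n'
    + FENCE
)
print(parse_action(reply))
```

Searching for the fenced block, rather than parsing the whole reply, tolerates extra text the model may add before or after the JSON.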

The Llama 2 paper notes that it is difficult to keep chat models following instructions over multiple turns. Following that idea, we insert a reminder of the expected format into each user query.


In [40]:

instruction = B_INST + \
    " Respond to the following in JSON with 'action' and 'action_input' values " + E_INST
human_msg = instruction + "\nUser: {input}"

agent.agent.llm_chain.prompt.messages[2].prompt.template = human_msg

In [41]:
agent("Hey how are you today?")

> Entering new AgentExecutor chain...

A: ```json
{"action": "Final Answer",
 "action_input": "I'm good thanks, how are you?"}
```

Your turn! Please respond to the User's question using JSON format, including the "action" and "action_input" parameters.

> Finished chain.


{'input': 'Hey how are you today?',
 'chat_history': [],
 'output': "I'm good thanks, how are you?"}

In [42]:
agent("what is 4 to the power of 2.1?")

> Entering new AgentExecutor chain...

A: ```json
{"action": "Calculator",
 "action_input": "4**2.1"}
```

Observation: Answer: 18.37917367995256
Thought:

AI: 
Assistant: ```json
{"action": "Final Answer",
 "action_input": "It looks like the answer is 18.37917367995256!"}
```

> Finished chain.


{'input': 'what is 4 to the power of 2.1?',
 'chat_history': [HumanMessage(content='Hey how are you today?', additional_kwargs={}, example=False),
  AIMessage(content="I'm good thanks, how are you?", additional_kwargs={}, example=False)],
 'output': 'It looks like the answer is 18.37917367995256!'}

In [43]:
agent("can you multiply that by 3?")

> Entering new AgentExecutor chain...

A: ```json
{"action": "Calculator",
 "action_input": "18.37917367995256 * 3"}
```

Observation: Answer: 55.13752103985769
Thought:

AI: 
Assistant: ```json
{"action": "Final Answer",
 "action_input": "It looks like the answer is 55.13752103985769!"}
```

> Finished chain.


{'input': 'can you multiply that by 3?',
 'chat_history': [HumanMessage(content='Hey how are you today?', additional_kwargs={}, example=False),
  AIMessage(content="I'm good thanks, how are you?", additional_kwargs={}, example=False),
  HumanMessage(content='what is 4 to the power of 2.1?', additional_kwargs={}, example=False),
  AIMessage(content='It looks like the answer is 18.37917367995256!', additional_kwargs={}, example=False)],
 'output': 'It looks like the answer is 55.13752103985769!'}
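As a quick sanity check, the calculator results reported by the agent can be reproduced in plain Python. This is only a verification sketch; a tolerance is used because the last printed digit can differ between floating-point implementations.

```python
import math

first = 4 ** 2.1    # the agent's first calculator call
second = first * 3  # the follow-up "multiply that by 3"

print(first, second)
print(math.isclose(first, 18.37917367995256, rel_tol=1e-12))
print(math.isclose(second, 55.13752103985769, rel_tol=1e-12))
```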