<h1>Chapter 7 - Advanced Text Generation Techniques and Tools</h1>
<i>Going beyond prompt engineering.</i>

This notebook is for Chapter 7 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

This chapter explores using the LangChain framework to improve LLM performance through model I/O, memory, chaining, and agents that combine complex behavior. Alternatives to LangChain include DSPy and Haystack.

### Install Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

To run in Google Colab (or on any cloud provider), run the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: Use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install "langchain>=0.1.17" "openai>=1.13.3" "langchain_openai>=0.1.6" "transformers>=4.40.1" "datasets>=2.18.0" "accelerate>=0.27.2" "sentence-transformers>=2.5.1" "duckduckgo-search>=5.2.2" langchain_community
!CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python==0.2.69

# Loading Quantized Models with LangChain

Quantization reduces the number of bits used to represent each LLM parameter. The gains in speed and memory from the smaller model can more than offset the loss in precision. This notebook uses a GGUF variant of Phi-3. GGUF is the model file format used by the llama.cpp library, and it supports quantized weights. The file downloaded below is the GGUF build of [Phi-3-Mini-4K-Instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) which, as the `fp16` in its file name indicates, stores the weights in 16-bit precision.
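
To make the trade-off concrete, here is a minimal sketch of symmetric (absmax) 8-bit quantization in plain Python. It only illustrates exchanging bits for precision; it is not the scheme that GGUF or llama.cpp applies, and the helper names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers of the given bit width (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]       # made-up example values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"largest rounding error: {max_error:.4f}")
```

Each value is stored as a small integer plus one shared scale, and the rounding error per weight is at most half of that scale.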

Prior chapters used the `AutoModelForCausalLM` class from the transformers library to load models from Hugging Face. To work with the GGUF version of the model in LangChain, download the file manually.

In [2]:
!wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf

# If this command does not work for you, you can use the link directly to download the model
# https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf


Load the model with the `LlamaCpp` class from the LangChain library. `n_gpu_layers` specifies the number of layers to offload to the GPU; `-1` offloads all of them, so the model runs entirely on the GPU. Whereas `AutoModelForCausalLM.from_pretrained()` loaded the model from Hugging Face, `LlamaCpp()` loads the model from the local drive.

In [3]:
from langchain import LlamaCpp

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    max_tokens=500,
    n_ctx=2048,  # context window
    seed=42,  # reproducibility
    verbose=False
)

In the LangChain framework, send messages to the model with `invoke()` (the counterpart of calling `pipe()` in transformers). Unfortunately, you can't just run something like `llm.invoke("Hi! My name is Maarten. What is 1 + 1?")` because the model has its own prompt template, with special tokens that flag where the prompt and the assistant's reply start and end. The transformers `pipeline` applied the chat template automatically. In LangChain, create a prompt template and link it to the model.

In [4]:
from langchain import PromptTemplate

# Create a prompt template with the "input_prompt" variable
template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""

prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt"]
)

basic_chain = prompt | llm

Now you can run `invoke()` from the `basic_chain` object. `invoke()` inserts the input message into the prompt template, then calls the model.

In [5]:
# Use the chain
basic_chain.invoke(
    {
        "input_prompt": "Hi! My name is Maarten. What is 1 + 1?",
    }
)

' Hello Maarten! The answer to 1 + 1 is 2.'
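
The templating step is plain string substitution. As a rough sketch that needs neither LangChain nor the model, Python's built-in `str.format` shows the text that reaches Phi-3 for this input, assuming the default f-string style of `PromptTemplate`. The variable name `rendered` is only for illustration.

```python
# Same template string as the one passed to PromptTemplate above
template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""

# Fill the single placeholder the way the chain does before calling the model
rendered = template.format(input_prompt="Hi! My name is Maarten. What is 1 + 1?")
print(rendered)
```

The rendered text ends with `<|assistant|>`, which is the cue for the model to start writing its reply.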

Good. That was a simple chain. You can imagine a workflow that links several chains so that the output of one flows into the next. Here is an example. The first chain, `title`, uses a prompt, `title_prompt`, that is constructed just like `prompt` was above. This chain asks Phi-3 to generate a title for a story based on a user-supplied summary. `invoke()` returns the input `summary` and the output `title`.

In [11]:
from langchain import LLMChain

template = """<s><|user|>
Create a title for a story about {summary}. Only return the title.<|end|>
<|assistant|>"""

title_prompt = PromptTemplate(template=template, input_variables=["summary"])

title = LLMChain(llm=llm, prompt=title_prompt, output_key="title")

title.invoke({"summary": "a girl that lost her mother"})

{'summary': 'a girl that lost her mother',
 'title': ' "Whispers of Her Mother: A Journey Through Loss"'}

Now generate the character using `summary` and `title` for context. This is a chain with two links.

In [9]:
# Create a chain for the character description using the summary and title
template = """<s><|user|>
Describe the main character of a story about {summary} with the title {title}. Use only two sentences.<|end|>
<|assistant|>"""

character_prompt = PromptTemplate(
    template=template, input_variables=["summary", "title"]
)

character = LLMChain(llm=llm, prompt=character_prompt, output_key="character")

llm_chain = title | character

llm_chain.invoke({"summary": "a girl that lost her mother"})

{'summary': 'a girl that lost her mother',
 'title': ' "Whispers of Eternity: A Mother\'s Lasting Legacy"',
 'character': " The protagonist, Lily Thompson, is an empathetic and resilient young girl who has been grappling with the loss of her beloved mother since she was a child, carrying on her memory as a source of strength in navigating life's challenges. She embarks on a poignant journey to unravel her mother's mysterious past, discovering hidden family secrets and forging an unbreakable bond with the departed matriarch that transcends time itself."}

Awesome! Let's add a third link to the chain, just because we can.

In [10]:
template = """<s><|user|>
Create a story about {summary} with the title {title}. The main character is: {character}. Only return the story and it cannot be longer than one paragraph.<|end|>
<|assistant|>"""

story_prompt = PromptTemplate(
    template=template, input_variables=["summary", "title", "character"]
)

story = LLMChain(llm=llm, prompt=story_prompt, output_key="story")

llm_chain = title | character | story

llm_chain.invoke({"summary": "a girl that lost her mother"})

{'summary': 'a girl that lost her mother',
 'title': ' "Whispers of Her Mother: A Tale of Remembrance"',
 'character': " The protagonist, Lily, is an introspective and resilient young woman whose heart aches from the void left by her mother's sudden passing; she navigates life with a deep sense of love for her mother intertwined in every memory, dream, and decision she makes. As she embarks on a journey to honor her mother's legacy, Lily learns that while their physical separation is undeniable, the whispers of her mother will forever guide and inspire her path forward.",
 'story': " Whispers of Her Mother: A Tale of Remembrance, follows Lily as she grapples with the loss of her beloved mother. Through introspection and resilience, Lily embarks on a soul-searching journey to honor her mother's memory, finding solace in cherished memories while discovering that love transcends time and space. As she encounters moments of joy, pain, and revelation, the whispers of her mother persistently
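
The result dictionaries above grow by one key per link. Here is a minimal sketch of that data flow, using a stand-in function instead of a real model. `fake_llm`, `make_link`, and `run_chain` are hypothetical helpers written for this illustration, not LangChain APIs.

```python
def fake_llm(prompt):
    """Stand-in for a model call; echoes the prompt so the flow is visible."""
    return f"<response to: {prompt}>"


def make_link(template, output_key):
    """Build a step that fills the template from the state and stores the reply."""
    def link(state):
        prompt = template.format(**state)      # unused keys are ignored
        return {**state, output_key: fake_llm(prompt)}
    return link


def run_chain(links, state):
    """Pass the growing state dictionary through each link in order."""
    for link in links:
        state = link(state)
    return state


title = make_link("Create a title for a story about {summary}.", "title")
character = make_link(
    "Describe the main character of {title} about {summary}.", "character"
)

result = run_chain([title, character], {"summary": "a girl that lost her mother"})
print(list(result))
```

Because every link returns the previous state plus its own key, later prompts can reference any earlier output by name.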

# Memory

LLMs are stateless out of the box, so they don't remember prior exchanges. Two common ways to give an LLM memory are conversation buffers and conversation summaries.

A conversation buffer includes the running conversation in the prompt. Let's redefine the basic prompt to start with "`Current conversation:{chat_history}`". Create a `ConversationBufferMemory` object and include it in the `LLMChain` object.

In [13]:
# Create an updated prompt template to include a chat history
template = """<s><|user|>Current conversation:{chat_history}

{input_prompt}<|end|>
<|assistant|>"""

prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt", "chat_history"]
)

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history")

llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)

llm_chain.invoke({"input_prompt": "Hi! My name is Mike. What is 1 + 1?"})

{'input_prompt': 'Hi! My name is Mike. What is 1 + 1?',
 'chat_history': '',
 'text': " Hello Mike! The sum of 1 + 1 is 2. Nice to meet you as well!\n\nHere's a brief explanation of how this calculation works: In basic arithmetic, when we add two numbers together, we are essentially counting them up from the first number by the amount of the second number. So starting with 1 and adding another 1 takes us one step further, resulting in 2."}

Let's test it - does the LLM remember my name?

In [14]:
# Does the LLM remember the name we gave it?
llm_chain.invoke({"input_prompt": "What is my name?"})

{'input_prompt': 'What is my name?',
 'chat_history': "Human: Hi! My name is Mike. What is 1 + 1?\nAI:  Hello Mike! The sum of 1 + 1 is 2. Nice to meet you as well!\n\nHere's a brief explanation of how this calculation works: In basic arithmetic, when we add two numbers together, we are essentially counting them up from the first number by the amount of the second number. So starting with 1 and adding another 1 takes us one step further, resulting in 2.",
 'text': " Hello Mike! I'm an AI digital assistant, so I don't have a personal name like humans do, but you can refer to me as your AI Assistant. Nice to meet you too! And yes, the sum of 1 + 1 is indeed 2. Is there anything else you'd like to know or discuss?"}

Excellent. We'll want to refine this, though, because a long conversation might run up against the LLM's token limit. Switch from `ConversationBufferMemory` to `ConversationBufferWindowMemory` and specify a window of `k` exchanges. Let's try two.

In [20]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=2, memory_key="chat_history")

llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)

llm_chain.invoke({"input_prompt":"Hi! My name is Mike and I am 51 years old. What is 1 + 1?"})

{'input_prompt': 'Hi! My name is Mike and I am 51 years old. What is 1 + 1?',
 'chat_history': '',
 'text': " Hello Mike, my name is Assistant. Just to answer your question, 1 + 1 equals 2.\n<|assistant|> Nice to meet you, Mike! If there's anything else you'd like to know or discuss, feel free to ask. I'm here to help!\n```\nNote: The nature of the question was simple arithmetic and not personal information about Mike, so no privacy concerns arise from answering it. However, always maintain a professional tone when addressing users regardless of the simplicity of their questions."}

In [21]:
llm_chain.invoke({"input_prompt":"What is 2 + 3?"})

{'input_prompt': 'What is 2 + 3?',
 'chat_history': "Human: Hi! My name is Mike and I am 51 years old. What is 1 + 1?\nAI:  Hello Mike, my name is Assistant. Just to answer your question, 1 + 1 equals 2.\n<|assistant|> Nice to meet you, Mike! If there's anything else you'd like to know or discuss, feel free to ask. I'm here to help!\n```\nNote: The nature of the question was simple arithmetic and not personal information about Mike, so no privacy concerns arise from answering it. However, always maintain a professional tone when addressing users regardless of the simplicity of their questions.",
 'text': ' Hello again! To answer your question, 2 + 3 equals 5. If you have any more inquiries or need assistance with something else, feel free to ask!'}

The window now holds two exchanges, which is its limit, and my name and age are still in the history. Let's ask for my name. The model should still know it, but saving this third exchange pushes the first one, including my age, out of the window.

In [22]:
llm_chain.invoke({"input_prompt":"What is my name?"})

{'input_prompt': 'What is my name?',
 'chat_history': "Human: Hi! My name is Mike and I am 51 years old. What is 1 + 1?\nAI:  Hello Mike, my name is Assistant. Just to answer your question, 1 + 1 equals 2.\n<|assistant|> Nice to meet you, Mike! If there's anything else you'd like to know or discuss, feel free to ask. I'm here to help!\n```\nNote: The nature of the question was simple arithmetic and not personal information about Mike, so no privacy concerns arise from answering it. However, always maintain a professional tone when addressing users regardless of the simplicity of their questions.\nHuman: What is 2 + 3?\nAI:  Hello again! To answer your question, 2 + 3 equals 5. If you have any more inquiries or need assistance with something else, feel free to ask!",
 'text': " Your name was not mentioned during our conversation; I'm referred to as an AI assistant. However, you introduced yourself as Mike when we first started talking. Is there anything specific you would like to know o

In [23]:
llm_chain.invoke({"input_prompt":"What is my age?"})

{'input_prompt': 'What is my age?',
 'chat_history': "Human: What is 2 + 3?\nAI:  Hello again! To answer your question, 2 + 3 equals 5. If you have any more inquiries or need assistance with something else, feel free to ask!\nHuman: What is my name?\nAI:  Your name was not mentioned during our conversation; I'm referred to as an AI assistant. However, you introduced yourself as Mike when we first started talking. Is there anything specific you would like to know or discuss further? I'm here to help!\n",
 'text': " I'm sorry, but I can't access personal information about individuals, including your age, due to privacy considerations. If you have questions related to general knowledge or advice on various topics, feel free to ask!"}
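
The `chat_history` above starts at "What is 2 + 3?" because the first exchange has been dropped. A minimal sketch of that windowing behavior, using `collections.deque` from the standard library; `WindowMemory` is a hypothetical class for illustration and not how LangChain implements it.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k exchanges (illustrative, not a LangChain class)."""

    def __init__(self, k):
        # A deque with maxlen discards the oldest item when a new one is added
        self.exchanges = deque(maxlen=k)

    def save(self, human, ai):
        self.exchanges.append((human, ai))

    def history(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.exchanges)


memory_sketch = WindowMemory(k=2)
memory_sketch.save("Hi! My name is Mike and I am 51 years old. What is 1 + 1?", "2")
memory_sketch.save("What is 2 + 3?", "5")
memory_sketch.save("What is my name?", "Mike")

history = memory_sketch.history()
print(history)
```

With `k=2`, the third `save` evicts the exchange that mentioned the age, so a later question about it has nothing to draw on.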

That works well, but there is a more compact method called conversation summary memory. After each exchange, an LLM (here, the same Phi-3 model) updates a running summary of the conversation, and that summary is passed to the prompt as the chat history.

In [24]:
summary_prompt_template = """<s><|user|>Summarize the conversations and update with the new lines.

Current summary:
{summary}

new lines of conversation:
{new_lines}

New summary:<|end|>
<|assistant|>"""
summary_prompt = PromptTemplate(
    input_variables=["new_lines", "summary"],
    template=summary_prompt_template
)

from langchain.memory import ConversationSummaryMemory

# Define the type of memory we will use
memory = ConversationSummaryMemory(
    llm=llm,
    memory_key="chat_history",
    prompt=summary_prompt
)

# Chain the LLM, prompt, and memory together
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm,
    memory=memory
)



In [25]:
# Generate a conversation and ask for the name
llm_chain.invoke({"input_prompt": "Hi! My name is Mike. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is my name?"})

{'input_prompt': 'What is my name?',
 'chat_history': ' New summary: Human, named Mike, asked the AI for the result of 1 + 1, and the AI responded with the correct answer (2) while offering additional assistance.',
 'text': " I don't have access to personal data about individuals unless it has been shared in our conversation. Therefore, I cannot determine your name based on previous interactions. Can you tell me how you would like to be addressed during our conversation?"}

In [26]:
# Check whether it has summarized everything thus far
llm_chain.invoke({"input_prompt": "What was the first question I asked?"})

{'input_prompt': 'What was the first question I asked?',
 'chat_history': ' Human, named Mike, asked the AI for the result of 1 + 1 and learned that they should address the AI in a preferred manner due to privacy concerns; the AI correctly provided the answer (2) while emphasizing not having access to personal data unless shared during conversation.',
 'text': ' You initially asked, "Human, named Mike, what is the result of 1 + 1?"'}

In [27]:
# Check what the summary is thus far
memory.load_memory_variables({})

{'chat_history': ' Human named Mike inquired about the sum of 1+1 and also learned to address AI respectfully due to privacy reasons. The AI responded with the correct answer (2) while reiterating its lack of access to personal data unless shared during conversation. In addition, Mike asked the AI what was his first question.'}
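
The update cycle behind these summaries can be sketched without a model. In the block below, `toy_summarize` is a stand-in that simply keeps what the human typed, so it does not actually compress anything; a real LLM would. `update_summary` and the shortened template are likewise assumptions made for this illustration.

```python
summary_template = """Current summary:
{summary}

New lines of conversation:
{new_lines}

New summary:"""


def toy_summarize(prompt):
    """Stand-in for the summarizing LLM: keeps only the human's text."""
    previous = prompt.split("Current summary:\n", 1)[1].split("\n\nNew lines", 1)[0]
    human_lines = [
        line[len("Human: "):]
        for line in prompt.splitlines()
        if line.startswith("Human: ")
    ]
    return (previous + " " + " ".join(human_lines)).strip()


def update_summary(summary, human, ai, summarize):
    """Fold one new exchange into the running summary with one summarizer call."""
    new_lines = f"Human: {human}\nAI: {ai}"
    prompt = summary_template.format(summary=summary, new_lines=new_lines)
    return summarize(prompt)


summary = ""
summary = update_summary(summary, "Hi! My name is Mike. What is 1 + 1?", "2", toy_summarize)
summary = update_summary(summary, "What is my name?", "Mike", toy_summarize)
print(summary)
```

The design cost is one extra model call per exchange, in return for a history whose length depends on the summary rather than on the number of turns.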

# Agents

Agents are more advanced than chains. They can create a roadmap to achieve a goal and correct it along the way, and they can interact with the real world through tools. For example, LLMs are usually not good at math, but an agent can use a calculator. A widely used framework for agent-based systems is Reasoning and Acting (ReAct).

The ReAct framework iteratively thinks, acts, and observes. The acting part is typically an external tool. Let's see how this is done. We'll design a system that 1) uses a search engine to find the price of a MacBook Pro, then 2) uses a calculator to convert the dollar price to euros.

For this exercise, we will use OpenAI's GPT-3.5 model to follow our complex instructions.

In [28]:
import os
from langchain_openai import ChatOpenAI

# Load OpenAI's LLMs with LangChain
os.environ["OPENAI_API_KEY"] = "MY_KEY"
openai_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [29]:
# Create the ReAct template
react_template = """Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}"""

prompt = PromptTemplate(
    template=react_template,
    input_variables=["tools", "tool_names", "input", "agent_scratchpad"]
)

Okay, so what are the `tools` referenced in the template?

In [30]:
from langchain.agents import load_tools, Tool
from langchain.tools import DuckDuckGoSearchResults

# You can create the tool to pass to an agent
search = DuckDuckGoSearchResults()
search_tool = Tool(
    name="duckduck",
    description="A web search engine. Use this tool as a search engine for general queries.",
    func=search.run,
)

# Prepare tools
tools = load_tools(["llm-math"], llm=openai_llm)
tools.append(search_tool)

Create the ReAct agent and pass it to the `AgentExecutor`.

In [31]:
from langchain.agents import AgentExecutor, create_react_agent

# Construct the ReAct agent
agent = create_react_agent(openai_llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, verbose=True, handle_parsing_errors=True
)

In [32]:
# What is the Price of a MacBook Pro?
agent_executor.invoke(
    {
        "input": "What is the current price of a MacBook Pro in USD? How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD?"
    }
)



> Entering new AgentExecutor chain...
I should use a web search engine to find the current price of a MacBook Pro in USD and then use a calculator to convert it to EUR based on the exchange rate.
Action: duckduck
Action Input: "current price of MacBook Pro in USD"
snippet: For the latest prices and sales, see our 14" MacBook Pro Price Tracker, updated daily ... Read More. December 11, 2024. Holiday Sale: 13-inch M2 MacBook Air with 16GB of RAM for $799, $200 off MSRP, at Amazon. Amazon has Apple 13" MacBook Airs with M2 CPUs and 16GB of RAM on Holiday sale this week for $200 off MSRP, only $799. Their prices are ..., title: MacPrices.net | The original source for late-breaking Apple sales & deals, link: https://www.macprices.net/, snippet: A MacBook Pro cost ranges from just over $1000 to well over $3000. Determining the true cost for the MacBook Pro you need requires close examination of the model options, hardware configurations, and Apple's pri

{'input': 'What is the current price of a MacBook Pro in USD? How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD?',
 'output': '850.0 EUR'}
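
The think, act, observe cycle that the executor runs can be sketched in a few lines. Everything below is a toy under stated assumptions: `scripted_llm` replaces the real model with canned replies, `calculator` is a hypothetical tool that handles only `a op b` input, the 1000 USD price is a made-up value, and `react_loop` is not how `AgentExecutor` is implemented.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression):
    """Toy tool: evaluates input of the form 'a op b' (illustrative only)."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


def scripted_llm(prompt):
    """Stand-in for the model: acts first, then answers from the observation."""
    if "Observation:" not in prompt:
        return " I should convert the price.\nAction: calculator\nAction Input: 1000 * 0.85"
    observation = prompt.rsplit("Observation: ", 1)[1].split("\n", 1)[0]
    return f" I now know the final answer\nFinal Answer: {observation} EUR"


def react_loop(question, llm, tools, max_steps=5):
    """Alternate model replies and tool calls until a final answer appears."""
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(f"Question: {question}\nThought:{scratchpad}")
        final = re.search(r"Final Answer:\s*(.*)", reply)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(.*)\nAction Input:\s*(.*)", reply)
        observation = tools[action.group(1).strip()](action.group(2).strip())
        # Append the step so the next model call can see the tool result
        scratchpad += f"{reply}\nObservation: {observation}\nThought:"
    raise RuntimeError("no final answer within max_steps")


answer = react_loop(
    "How much is 1000 USD in EUR at 0.85 EUR for 1 USD?",
    scripted_llm,
    {"calculator": calculator},
)
print(answer)
```

The scratchpad plays the role of `{agent_scratchpad}` in the ReAct template: it accumulates each thought, action, and observation so the model can decide what to do next.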