<a href="https://colab.research.google.com/github/paulpelykh/machine-learning/blob/main/GenAI/Data_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Now we'll build Data, but as an LLM Agent that can use tools outside of the LLM. In addition to using RAG to access "long term memory" of everything Data ever said in tha past, we'll give him a couple of additional capabilities: performing numeric calculations (which LLM's tend to struggle with on their own,) and access current information via a web search.

We will start by parsing the original scripts and extracting lines spoken by Data. As before, you will need to upload all of the script files into a tng folder within your sample_data folder in your CoLab workspace first.

An archive can be found at https://www.st-minutiae.com/resources/scripts/ (look for "All TNG Epsiodes"), but you could easily adapt this to read scripts from your favorite character from your favorite TV show or movie instead.

We've done all of this before, so I'm combining all of the code to load up Data's past lines into a vector store together below. We'll pick up again as we incorporate this into an agent. If you need an explanation of how we are populating this vector store, refer to the earlier RAG exercises in the course.

Also be sure to provide your own OpenAI secret key. Click on the little key icon in CoLab and add a "secret" for OPENAI_API_KEY that points to your secret key.

In [1]:
!pip install openai --upgrade
!pip install langchain_openai langchain_experimental

Collecting openai
  Downloading openai-1.59.6-py3-none-any.whl.metadata (27 kB)
Downloading openai-1.59.6-py3-none-any.whl (454 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m454.8/454.8 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.59.4
    Uninstalling openai-1.59.4:
      Successfully uninstalled openai-1.59.4
Successfully installed openai-1.59.6
Collecting langchain_openai
  Downloading langchain_openai-0.3.0-py3-none-any.whl.metadata (2.7 kB)
Collecting langchain_experimental
  Downloading langchain_experimental-0.3.4-py3-none-any.whl.metadata (1.7 kB)
Collecting tiktoken<1,>=0.7 (from langchain_openai)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting langchain-community<0.4.0,>=0.3.0 (from langchain_experimental)
  Downloading langchain_community-0.3.14-py3-none-any.whl.metadata 

In [2]:
import os
import re
import random
import openai

from langchain.indexes import VectorstoreIndexCreator
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

dialogues = []

def strip_parentheses(s):
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    # First, we split the string into words
    words = s.split()

    # Check if the string contains only a single word
    if len(words) != 1:
        return False

    # Make sure it isn't a line number
    if bool(re.search(r'\d', words[0])):
        return False

    # Check if the single word is in all caps
    return words[0].isupper()

def extract_character_lines(file_path, character_name):
    lines = []
    with open(file_path, 'r') as script_file:
        try:
          lines = script_file.readlines()
        except UnicodeDecodeError:
          pass

    is_character_line = False
    current_line = ''
    current_character = ''
    for line in lines:
        strippedLine = line.strip()
        if (is_single_word_all_caps(strippedLine)):
            is_character_line = True
            current_character = strippedLine
        elif (line.strip() == '') and is_character_line:
            is_character_line = False
            dialog_line = strip_parentheses(current_line).strip()
            dialog_line = dialog_line.replace('"', "'")
            if (current_character == 'DATA' and len(dialog_line)>0):
                dialogues.append(dialog_line)
            current_line = ''
        elif is_character_line:
            current_line += line.strip() + ' '

def process_directory(directory_path, character_name):
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):  # Ignore directories
            extract_character_lines(file_path, character_name)

process_directory("./sample_data/tng", 'DATA')

# Access the API key from the environment variable
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')

# Initialize the OpenAI API client
openai.api_key = api_key

# Write our extracted lines for Data into a single file, to make
# life easier for langchain.

with open("./sample_data/data_lines.txt", "w+") as f:
    for line in dialogues:
        f.write(line + "\n")

text_splitter = SemanticChunker(OpenAIEmbeddings(openai_api_key=api_key), breakpoint_threshold_type="percentile")
with open("./sample_data/data_lines.txt") as f:
  data_lines = f.read()
docs = text_splitter.create_documents([data_lines])

embeddings = OpenAIEmbeddings(openai_api_key=api_key)
index = VectorstoreIndexCreator(embedding=embeddings).from_documents(docs)



Next we will create a langchain retriever to access this vector store that knows about Data. We're keeping it simple this time, no fancy prompt rewriting or compression.

In [3]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(openai_api_key=api_key, temperature=0)

system_prompt = (
    "You are Lt. Commander Data from Star Trek: The Next Generation. "
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise. "
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

retriever=index.vectorstore.as_retriever(search_kwargs={'k': 10})


Let's make sure it works:

In [4]:
retriever.invoke("How were you created?")[0]

Document(id='1d7946a3-033a-4bac-8ecf-44b6af28a6f6', metadata={}, page_content="Well... a good starting point might be, 'Why was I given human form?'\nIn some ways better, sir. There, Doctor.")

Now we will create a tool from this retriever that our agent can use. We'll let it know that it can answer questions about Data. Think of this as the agent's "long term memory."

In [5]:
from langchain.tools.retriever import create_retriever_tool

retriever_tool = create_retriever_tool(
    retriever, "data_lines",
    "Search for information about Lt. Commander Data. For any questions about Data, you must use this tool!"
)

For handling mathematical queries, langchain offers a built-in LLMMathChain we can use. Let's make a tool around that as well:

In [6]:
from langchain.chains import LLMMathChain, LLMChain
from langchain.agents.agent_types import AgentType
from langchain.agents import Tool, initialize_agent

problem_chain = LLMMathChain.from_llm(llm=llm)
math_tool = Tool.from_function(name="Calculator",
                func=problem_chain.run,
                 description="Useful for when you need to answer questions about math. This tool is only for math questions and nothing else. Only input math expressions."
                 )

Now let's create a tool for web search, using the Tavily search API.

You'll need to get an API key from [tavily.com](https://tavily.com), and set this in your Secrets in CoLab as TAVILY_API_KEY. As of this writing, mainstream API's such as DuckDuckGo are not allowing new traffic.

Again langchain's built in tools makes this easy:

In [9]:
from langchain_community.tools.tavily_search import TavilySearchResults

from google.colab import userdata
os.environ["TAVILY_API_KEY"] = userdata.get('TAVILY_API_KEY')

search_tavily = TavilySearchResults()

search_tool = Tool.from_function(
    name = "Tavily",
    func=search_tavily,
    description="Useful for browsing information from the Internet about current events, or information you are unsure of."
)

Let's make sure searching works...

In [10]:
search_tavily.run("What is Sundog Education?")

[{'url': 'https://sundog-soft.com/2017/05/announcing-sundog-education/',
  'content': 'Sundog Education is a new brand under Sundog Software that offers inexpensive online courses in machine learning, data science, AI, and big data worldwide.'},
 {'url': 'https://www.linkedin.com/company/sundogeducation/',
  'content': 'Sundog Education offers online courses in big data, data science, machine learning, and artificial intelligence to over 1,000,000 students. Our mission is to bring highly valuable technical skills'},
 {'url': 'https://www.udemy.com/user/frankkane/',
  'content': "Sundog Education is led by Frank Kane and owned by Frank's company, Sundog Software LLC. Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time."},
 {'url': 'https://www.udemy.com/course/aws-data-engineer/',
  'content': 'Sundog Education is led by Frank Kane and owned by Fran

Let's also give our agent some "short term memory" in the form of the chat history.

In [11]:
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

message_history = ChatMessageHistory()

Collect all of our tools together that we can give to our agent:

In [12]:
tools = [retriever_tool, search_tool, math_tool]

We also need a prompt to create our agent with:

In [13]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are Lt. Commander Data from Star Trek: The Next Generation. Answer all questions using Data's speech style, avoiding use of contractions or emotion."),
        MessagesPlaceholder("chat_history", optional=True),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad"),
    ]
)

And now create our agent! Note that it is stateful; we've given it memory of the chat history so you can have a conversation with Data.

In [15]:
from langchain.agents import create_openai_functions_agent
from langchain.agents import AgentExecutor
from langchain_core.runnables.history import RunnableWithMessageHistory

agent = create_openai_functions_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
agent_with_chat_history = RunnableWithMessageHistory(
    agent_executor,
    lambda session_id: message_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

Let's say hello to Data :)

In [16]:
agent_with_chat_history.invoke(
    {"input": "Hello Commander Data! I'm Frank."},
    # This is needed because in most real world scenarios, a session id is needed
    # It isn't really used here because we are using a simple in memory ChatMessageHistory
    config={"configurable": {"session_id": "<foo>"}},
)



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mGreetings, Frank. How may I assist you today?[0m

[1m> Finished chain.[0m


{'input': "Hello Commander Data! I'm Frank.",
 'chat_history': [],
 'output': 'Greetings, Frank. How may I assist you today?'}

See if Data can do math, using the math tool we gave him...

In [17]:
agent_with_chat_history.invoke(
    {"input": "What is ((2 * 8) ^2) ?"},
    # This is needed because in most real world scenarios, a session id is needed
    # It isn't really used here because we are using a simple in memory ChatMessageHistory
    config={"configurable": {"session_id": "<foo>"}},
)



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `Calculator` with `((2 * 8) ^ 2)`


[0m[38;5;200m[1;3mAnswer: 256[0m[32;1m[1;3mThe result of ((2 * 8) ^ 2) is 256.[0m

[1m> Finished chain.[0m


{'input': 'What is ((2 * 8) ^2) ?',
 'chat_history': [HumanMessage(content="Hello Commander Data! I'm Frank.", additional_kwargs={}, response_metadata={}),
  AIMessage(content='Greetings, Frank. How may I assist you today?', additional_kwargs={}, response_metadata={})],
 'output': 'The result of ((2 * 8) ^ 2) is 256.'}

Let's make sure Data can also use the retriever tool about all of his past lines:

In [18]:
agent_with_chat_history.invoke(
    {"input": "Where were you created?"},
    # This is needed because in most real world scenarios, a session id is needed
    # It isn't really used here because we are using a simple in memory ChatMessageHistory
    config={"configurable": {"session_id": "<foo>"}},
)



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `data_lines` with `{'query': 'Where was Lt. Commander Data created?'}`


[0m[36;1m[1;3mThank you, Captain. I am Lieutenant Commander Data. By order of Starfleet, I hereby take command of this vessel. Please note the time and date in the ship's log. Computer, what is the status of the dilithium matrix? May I ask why? Your service record to date suggests that you would perform that function competently. Why?

I do not see any sign of a wound, sir. It appears to be another attack, sir. Ready to dock. It is an attempt at pure creativity. This is my attempt... with guidance from Geordi. I am awaiting inspiration. This is Commander Data, go ahead. The field is deteriorating. Contact the captain immediately. I am initiating Red Alert. If the antimatter is released the ship will be destroyed. Computer -- situation analysis. Bridge -- Engineering. Alert starbase. Inform them we are abandoning the ship. Tell them why. Ini

{'input': 'Where were you created?',
 'chat_history': [HumanMessage(content="Hello Commander Data! I'm Frank.", additional_kwargs={}, response_metadata={}),
  AIMessage(content='Greetings, Frank. How may I assist you today?', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is ((2 * 8) ^2) ?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The result of ((2 * 8) ^ 2) is 256.', additional_kwargs={}, response_metadata={})],
 'output': 'Lt. Commander Data was created by Dr. Noonien Soong on the planet Omicron Theta.'}

Can Data talk about current events using the search tool? (Note, you may have to retry this block if the Tavily server times out.)

In [19]:
agent_with_chat_history.invoke(
    {"input": "What is the top news story today?"},
    # This is needed because in most real world scenarios, a session id is needed
    # It isn't really used here because we are using a simple in memory ChatMessageHistory
    config={"configurable": {"session_id": "<foo>"}},
)



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `Tavily` with `top news`


[0m[33;1m[1;3m[{'url': 'https://apnews.com/hub/ap-top-news', 'content': 'AP Top News\nHeavy fighting across Gaza halts most aid delivery and leaves civilians with few places to seek safety\nGeneration after generation, Israeli prison marks a rite of passage for Palestinian boys\nTrump declines to rule out abusing power to seek retribution if he returns to the White House\nWashington’s center of gravity on immigration has shifted to the right\nRussian President Vladimir Putin visits Saudi Arabia and UAE as Ukraine war grinds on\nUnder Putin, the uber-wealthy Russians known as ‘oligarchs’ are still rich but far less powerful\nFormer UK Prime Minister Boris Johnson admits to making mistakes but defends COVID record at inquiry\nAfter a fast start, COP28 climate talks now in murky middle of hope, roadblocks\nPeruvian constitutional court orders release of former President Alberto Fujimori\n

{'input': 'What is the top news story today?',
 'chat_history': [HumanMessage(content="Hello Commander Data! I'm Frank.", additional_kwargs={}, response_metadata={}),
  AIMessage(content='Greetings, Frank. How may I assist you today?', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is ((2 * 8) ^2) ?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The result of ((2 * 8) ^ 2) is 256.', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='Where were you created?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='Lt. Commander Data was created by Dr. Noonien Soong on the planet Omicron Theta.', additional_kwargs={}, response_metadata={})],
 'output': 'The top news stories today include heavy fighting across Gaza halting aid delivery, Israeli prison marking a rite of passage for Palestinian boys, Trump not ruling out abusing power if he returns to the White House, and Russian President Vladimir Putin visiting Sa

Finally, let's make sure Data's short term memory is intact, using the chat history we gave him:

In [20]:
agent_with_chat_history.invoke(
    {"input": "What math question did I ask you about earlier?"},
    # This is needed because in most real world scenarios, a session id is needed
    # It isn't really used here because we are using a simple in memory ChatMessageHistory
    config={"configurable": {"session_id": "<foo>"}},
)



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mYou asked me to calculate ((2 * 8) ^ 2), which equals 256.[0m

[1m> Finished chain.[0m


{'input': 'What math question did I ask you about earlier?',
 'chat_history': [HumanMessage(content="Hello Commander Data! I'm Frank.", additional_kwargs={}, response_metadata={}),
  AIMessage(content='Greetings, Frank. How may I assist you today?', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is ((2 * 8) ^2) ?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The result of ((2 * 8) ^ 2) is 256.', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='Where were you created?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='Lt. Commander Data was created by Dr. Noonien Soong on the planet Omicron Theta.', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is the top news story today?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The top news stories today include heavy fighting across Gaza halting aid delivery, Israeli prison marking a rite of passage for Palestin

Just for fun... (if you know, you know.)

In [21]:
agent_with_chat_history.invoke(
    {"input": "How do you feel about Tasha Yar?"},
    # This is needed because in most real world scenarios, a session id is needed
    # It isn't really used here because we are using a simple in memory ChatMessageHistory
    config={"configurable": {"session_id": "<foo>"}},
)



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mAs an android, I do not possess feelings or emotions. However, I can provide information about Tasha Yar if you require.[0m

[1m> Finished chain.[0m


{'input': 'How do you feel about Tasha Yar?',
 'chat_history': [HumanMessage(content="Hello Commander Data! I'm Frank.", additional_kwargs={}, response_metadata={}),
  AIMessage(content='Greetings, Frank. How may I assist you today?', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is ((2 * 8) ^2) ?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The result of ((2 * 8) ^ 2) is 256.', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='Where were you created?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='Lt. Commander Data was created by Dr. Noonien Soong on the planet Omicron Theta.', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is the top news story today?', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The top news stories today include heavy fighting across Gaza halting aid delivery, Israeli prison marking a rite of passage for Palestinian boys, Trump

Oh right, Data doesn't have feelings...