### New Agent Framework with Langchain Release v0.1.0

<p>
<img src="https://blog.langchain.dev/content/images/size/w1248/format/webp/2024/01/V0.1.0_Export--1-.png" 
      width="35%" height="auto"
      style="display: block; margin: 0 auto" />

[Update notes](https://blog.langchain.dev/langchain-v0-1-0/)

# Agents

We are going to learn how to build tools first, which are the engine making agents work

We'll work with two tools: A **search engine tool** and a **RAG tool**

In [152]:
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.document_loaders import WebBaseLoader, Docx2txtLoader, UnstructuredWordDocumentLoader, PyPDFLoader
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain import hub
from langchain.tools.retriever import create_retriever_tool
from langchain.agents import create_openai_functions_agent, AgentExecutor

In [2]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = ""
os.environ["TAVILY_API_KEY"] = ""
os.environ["OPENAI_API_KEY"] = ""


### Setup the Search Engine Tool

Load the Langchain built-in tool Tavily for live online search

In [3]:
online_search = TavilySearchResults()

In [7]:
online_search.invoke("Cuanto quedo el Boys contra el Cristal?")

[{'url': 'https://www.alaskacommons.com/sporting-cristal-vs-sport-boys-live-rimenses-win-3-0-in-liga-1-apertura-2024/123235/',
  'content': 'Sporting Cristal vs Sport Boys LIVE: ‘Rimenses’ win 3-0 in Liga 1 Apertura 2024 February 4, 2024 // News Team  duel being between Sporting Cristal and Sport Boys. Both teams have a rich history and aim to be protagonists in League  in defense for Cristal.  Sporting Cristal is leading 3-0 against Sport Boys at the end of the first half. Cazonatti scored a goal for SportingFebruary 4, 2024 // News Team Sporting Cristal is leading 3-0 against Sport Boys at the end of the first half. Cazonatti scored a goal for Sporting Cristal, establishing their lead in Callao. Quispe almost made a mistake by failing to catch a cross from Ignácio. Sport Boys are struggling to put together passes and create offensive plays.'},
 {'url': 'https://www.vsstats.com/football/2024-02-04/sport-boys-vs-sporting-cristal',
  'content': 'Sport Boys VS Sporting Cristal Team Stats

### Setup the Retriever tool

Recall that retrievers need:
1. Source text
2. Document loader
3. Text splitter
4. Embedding model
5. Vector store
6. Actual retriever

<p>
<img src="ILLUSTRATIONS/retriever_diagram.png" 
      width="75%" height="auto"
      style="display: block; margin: 0 auto" />

Illustration [reference](https://python.langchain.com/docs/use_cases/question_answering/)

#### We'll try two different word docx loaders and one pdf loader (word doc converted to pdf)

In [4]:
loader = UnstructuredWordDocumentLoader("sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx", mode="elements")
docs = loader.load()

In [33]:
print(len(docs))
docs[:5]

216


[Document(page_content='Source Data Mapping Approach to CDMV5.0.1', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 1, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'PageBreak'}),
 Document(page_content='Table name: stem_table', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 2, 'languages': ['eng'], 'file_directory': 'sample_docs', 'fil

In [25]:
sum([len(docs[i].dict()["page_content"]) for i in range(len(docs))])

47031

In [5]:
pdf_loader = PyPDFLoader("sample_docs/WebMD_PBM_ETL_5.0.1_20170606.pdf")
pdf_pages = loader.load_and_split()

In [34]:
print(len(pdf_pages))
pdf_pages[:5]

187


[Document(page_content='Source Data Mapping Approach to CDMV5.0.1', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 1, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='Table name: stem_table', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 2, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='Reading from sample_medical_claims_20170502.csv', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth'

In [32]:
pdf_pages[-1].dict()["page_content"][-200:]



In [6]:
sum([len(docs[i].dict()["page_content"]) for i in range(len(pdf_pages))])

30995

pdfloader() is capturing less characters than docx (unstructured loader()), but, it might be because we are not introducing "\n" or empty characters, which is a good thing

#### Let's move on to character splitting now 

In [7]:
docs_split = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
pdf_split = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(pdf_pages)

We get about the same number of chunks with either method

In [8]:
print(len(docs_split))
print(len(pdf_split))

213
214


This what the chunk object looks like

In [24]:
docs_split[:5]

[Document(page_content='Source Data Mapping Approach to CDMV5.0.1', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 1, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='Table name: stem_table', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 2, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='Reading from sample_medical_claims_20170502.csv', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth'

This is what each chunk contains

In [16]:
docs_split[-1].dict()

{'page_content': 'DECIMAL NULL value_as_string CHARACTER VARYING NULL value_source_value CHARACTER VARYING NULL anatomic_site_concept_id INTEGER 0 disease_status_concept_id INTEGER 0 specimen_source_id INTEGER NULL anatomic_site_source_value CHARACTER VARYING NULL disease_status_source_value CHARACTER VARYING NULL',
 'metadata': {'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx',
  'last_modified': '2024-02-04T20:19:13',
  'page_number': 34,
  'languages': ['eng'],
  'parent_id': 'ae5354ac2bc8a6763e23e55740ca8299',
  'file_directory': 'sample_docs',
  'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx',
  'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'category': 'Table'},
 'type': 'Document'}

#### Now let's place them in the vector DB by embedding text into a vector space (embedding)
We'll use FAISS DB for this exercise and OpenAI's embedding

In [41]:
vector_docs_db = FAISS.from_documents(docs_split, OpenAIEmbeddings())
vector_pdfs_db = FAISS.from_documents(pdf_split, OpenAIEmbeddings())

We can do simple querying on the Vectorstore

In [56]:
query = "how can I query condition_occurrence?"
query_results = vector_pdfs_db.similarity_search(query)
query_results

[Document(page_content='Table name: condition_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 14, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='Table name: visit_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 11, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='Table name: procedure_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modif

In [60]:
query_results[3].page_content

'Destination Field Source Field Logic Comment condition_occurrence_id id person_id person_id condition_concept_id concept_id condition_start_date start_date condition_end_date end_date condition_type_concept_id type_concept_id stop_reason stop_reason provider_id provider_id visit_occurrence_id visit_occurrence_id condition_source_value source_value condition_source_concept_id source_concept_id'

This retrieval can also be done with a score associated per retrieved entry. This is an **L2** distance, **lower is better**

In [62]:
query_results = vector_pdfs_db.similarity_search_with_score(query, 5)
query_results

[(Document(page_content='Table name: condition_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 14, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
  0.22454166),
 (Document(page_content='Table name: visit_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 11, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
  0.3295955),
 (Document(page_content='Table name: procedure_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 

#### Now that we have them in a vector store, let's work on finalizing the retrieval

This is simply accomplished calling a retriever function on the class

In [86]:
docs_retriever = vector_docs_db.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [87]:
query = 'how can I query condition_occurrence?'

docs_retriever.get_relevant_documents(query)

[Document(page_content='Table name: condition_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 14, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='Table name: visit_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 11, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}),
 Document(page_content='Table name: procedure_occurrence', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modif

### The Retriever is a main component of RAG

Let's that up before jumping into tools and agents

RAG chains build sequentially through:

1. Question
2. Retrieve documents from retriever (Built on Vector DB)
3. Any formatting (if needed)
4. Prompt
5. LLM
6. Output to user


<p>
<img src="ILLUSTRATIONS/rag_diagram.png" 
      width="65%" height="auto"
      style="display: block; margin: 0 auto" />

Illustration [reference](https://python.langchain.com/docs/use_cases/question_answering/)

We'll create a function to parse the retrieved docs so that it extracts the "page_content" only and joins all docs together. We'll also use a prompt (*you can use your own*) from [langchain hub](https://smith.langchain.com/hub?organizationId=492b1b26-fec0-561f-87a0-ff990d4ecc60)

In [110]:
def format_docs(docs):
    return "\n\n".join(i.page_content for i in docs)

prompt = hub.pull("rlm/rag-prompt")

prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])

Notice the **{question}** and **{context}** input variables in the prompt

In [130]:
prompt.messages[0].dict()

{'prompt': {'name': None,
  'input_variables': ['context', 'question'],
  'input_types': {},
  'output_parser': None,
  'partial_variables': {},
  'template': "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:",
  'template_format': 'f-string',
  'validate_template': False,
  '_type': 'prompt'},
 'additional_kwargs': {}}

We'll put this together in the chain using **LCEL**. The initial comma-separated elements within {} in the chain are the input to the chain. Each can be piped to be formatted further. Recall that **RunnablePassthrough()** allows passing of the input *invoked* query in the chain

We'll use the **gpt-4-0125-preview** model

In [140]:
openai_model_name = "gpt-4-0125-preview"
llm = ChatOpenAI(model=openai_model_name, temperature=0.5)

rag_chain = (
    {"context": docs_retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()} |  #Noticeable RunnableLambda() wrapping is not needed
    prompt |
    llm |
    StrOutputParser()
)

# rag_chain.invoke("how can I query condition_occurrence?")

We can know invoke the chain with our original query to get a very-well formed answer

In [145]:
print(rag_chain.invoke("how can I query condition_occurrence?"))

To query the `condition_occurrence` table, you can select fields such as `condition_occurrence_id`, `person_id`, `condition_concept_id`, `condition_start_date`, `condition_end_date`, `condition_type_concept_id`, `provider_id`, and `visit_occurrence_id` based on your specific requirements. You can use SQL (Structured Query Language) to write a query that matches your criteria, for example, `SELECT * FROM condition_occurrence WHERE condition_concept_id = [your_concept_id]`. Be sure to replace `[your_concept_id]` with the actual ID you are interested in querying.


We can also at **streaming** added (for better user engagement experience)

In [137]:
for chunk in rag_chain.stream("how can I query condition_occurrence?"):
    print(chunk, end="", flush=True)

To query the "condition_occurrence" table, you should select fields based on your requirements, such as "condition_occurrence_id", "person_id", "condition_concept_id", "condition_start_date", "condition_end_date", and any other relevant fields listed under the "condition_occurrence" context. Use SQL syntax to structure your query, for example: `SELECT * FROM condition_occurrence WHERE condition_concept_id = [desired concept ID];` Replace `[desired concept ID]` with the specific condition concept ID you are interested in. This approach allows you to retrieve specific records based on condition concepts or any other criteria relevant to your query.

### After looking at how RAG works, we'll go ahead and create the actual retriever tool

This will give it a name and provide tool instructions so the Agent knows how to use it

In [150]:
retriever_tool = create_retriever_tool(
    docs_retriever,
    "ETL_Retriever",
    "Search for information about the ETL Data Mapping to CDM V5.0.1"
    )

### We'll now use our tools tools to build our Agent

Let's pull our tools and define our LLM model

In [154]:
agent_tools = [online_search, retriever_tool]

openai_model_name = "gpt-4-0125-preview"
llm = ChatOpenAI(model=openai_model_name, temperature=0.2)

We'll use a pre-defined prompt from the hub (which we can modify)

In [148]:
prompt = hub.pull("hwchase17/openai-functions-agent")
prompt.messages

[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template='You are a helpful assistant')),
 MessagesPlaceholder(variable_name='chat_history', optional=True),
 HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template='{input}')),
 MessagesPlaceholder(variable_name='agent_scratchpad')]

Having all: prompt, llm and tools, we'll go ahead and **initialize the agent**

In [155]:
etl_plus_agent = create_openai_functions_agent(llm, agent_tools, prompt)

Now we'll create the **Agent Executor**, which calls the agent and execute as many times as needed

In [156]:
etl_plus_agent_executor = AgentExecutor(
    agent=etl_plus_agent, 
    tools=agent_tools,
    verbose=True
    )

#### Now we can finally call our agent

In [158]:
etl_plus_agent_executor.invoke({"input":"Hello!"})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mHello! How can I assist you today?[0m

[1m> Finished chain.[0m


{'input': 'Hello!', 'output': 'Hello! How can I assist you today?'}

In [159]:
etl_plus_agent_executor.invoke({"input":"how can query the person table in the new version of the CDM?"})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `ETL_Retriever` with `{'query': 'person table'}`


[0m[33;1m[1;3m[Document(page_content='Table name: person', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 8, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}), Document(page_content='Table: stem_table', metadata={'source': 'sample_docs/WebMD_PBM_ETL_5.0.1_20170606.docx', 'category_depth': 0, 'last_modified': '2024-02-04T20:19:13', 'page_number': 34, 'languages': ['eng'], 'file_directory': 'sample_docs', 'filename': 'WebMD_PBM_ETL_5.0.1_20170606.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}), Document(page_content='Table name: note', metad

{'input': 'how can query the person table in the new version of the CDM?',
 'output': "In the new version of the Common Data Model (CDM), the `person` table is still present. This table is a crucial component of the CDM, designed to store information about individuals whose data is being captured in the model. The documentation does not provide specific details about the schema or the fields within the `person` table, but it confirms the table's existence and its continued importance in the model. If you need detailed information about the schema, fields, or how to query the `person` table, please let me know, and I can provide further assistance."}

In [160]:
etl_plus_agent_executor.invoke({"input":"What team will Universitario play against tonight?"})



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `tavily_search_results_json` with `{'query': 'Universitario game tonight'}`


[0m[36;1m[1;3m[{'url': 'https://www.footballtipspredictions.com/alianza-lima-vs-universitario-de-deportes-prediction-odds-betting-tips-2024-02-11', 'content': 'Universitario de Deportes Last Games  Universitario de Deportes  The excitement for the 11 Feb 2024 match in Peru Liga 1 between Alianza Lima and Universitario de Deportes is palpable  Universitario de Deportes Away ResultsThe excitement for the 11 Feb 2024 match in Peru Liga 1 between Alianza Lima and Universitario de Deportes is palpable among the fans and sports pundits. Predictions are circulating, with various scenarios being contemplated for this pivotal game. While Alianza Lima aims to harness the energy of their home supporters, Universitario de Deportes ...'}, {'url': 'https://www.sportsmediawatch.com/college-basketball-tv-schedule-mens-2024-how-watch-games-today/', 'co

{'input': 'What team will Universitario play against tonight?',
 'output': 'Universitario de Deportes will play against Alianza Lima tonight, February 11, 2024, in a match in Peru Liga 1. The excitement for this game is palpable among fans and sports pundits.'}

#### As seen the Agent knows how to call which tool as needed.

While there might be edge, the better we describe, provide data in Vector DB and prompt, the better the Agent will be.