## 👋 Basics of LangChain 

The **key goal** of this tutorial is to get comfortable using the Python library, LangChain. 

Where you see a **🛸 TASK** in the tutorial, you will need to complete the task before moving on.

This tutorial relies heavily on a number of different resources mentioned below. Feel free to refer to some of these materials during the tutorial and in your project work: 

#### Documentation 

- [LangChain](https://python.langchain.com/docs/get_started/introduction.html)
    - [LangChain integrations](https://python.langchain.com/docs/integrations)

#### Repos 

- [The Practical Guides to Large Language Models](https://github.com/Mooler0410/LLMsPracticalGuide)
- [A series of LangChain tutorials](https://github.com/gkamradt/langchain-tutorials)

#### Videos

- [LangChain Crash Course for Beginners](https://www.youtube.com/watch?v=nAmC7SoVLd8)

#### Blogs

- [Retrieval Augmented Generation](https://betterprogramming.pub/harnessing-retrieval-augmented-generation-with-langchain-2eae65926e82)

**NOTE:** We won't be able to examine all of LangChain's capabilities; we are simply exploring its core principles. Please refer to the additional material mentioned above and the library's documentation to explore its other functionality.

In [44]:
import langchain

from dotenv import load_dotenv
from pathlib import Path
import os
import pandas as pd

from dap_taltech.utils.data_getters import DataGetter
from dap_taltech import logger

from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI, OpenAIChat

from langchain.output_parsers import (
    PydanticOutputParser, 
    CommaSeparatedListOutputParser,
    DatetimeOutputParser,
    EnumOutputParser,
    OutputFixingParser,
    RetryWithErrorOutputParser)

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import SystemMessage

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

from langchain.chains import (ConversationalRetrievalChain, 
                              SequentialChain, 
                              LLMChain)

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains.question_answering import load_qa_chain

from langchain.prompts.prompt import PromptTemplate
from langchain.prompts import MessagesPlaceholder, ChatPromptTemplate
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.example_selector import LengthBasedExampleSelector

from langchain.output_parsers import StructuredOutputParser, ResponseSchema

from langchain.memory import ConversationTokenBufferMemory, ConversationBufferMemory

from langchain.document_loaders import WikipediaLoader, DataFrameLoader 

from langchain.agents import create_pandas_dataframe_agent, tool, OpenAIFunctionsAgent, AgentExecutor
from langchain.agents.agent_types import AgentType

### Preamble

The code block below is the preamble to this tutorial. It ensures that we:

- install our dependencies;
- have access to our datasets; and
- load our OpenAI API key.

In [19]:
os.system(
    f"pip install -r {Path.cwd()}/llm_requirements.txt --quiet" #install requirements to run this notebook
)

load_dotenv() # load environment variables

oa_key = os.environ.get('OPENAI_API_KEY') #get our OpenAI API key from our environment variable

if not oa_key:
    logger.error("No OpenAI API key found. Please set your OpenAI API key as an environment variable named OPENAI_API_KEY.")

## Tour of LangChain 🦜🔗: Building a language model application

LangChain is a Python "framework for developing applications powered by language models."

According to [LangChain's documentation](https://github.com/langchain-ai/langchain), there are **6 key areas** that LangChain is designed to help you with (in order of complexity):

**📃 LLMs and Prompts:**

This includes prompt management, prompt optimization, a generic interface for all LLMs, and common utilities for working with LLMs.

**🔗 Chains**:

Chains go beyond a single LLM call and involve sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.

**📚 Data Augmented Generation:**

Data Augmented Generation involves specific types of chains that first interact with an external data source to fetch data for use in the generation step. Examples include summarization of long pieces of text and question/answering over specific data sources.

**🤖 Agents:**

Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.

**🧠 Memory:**

Memory refers to persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.

**🧐 Evaluation:**

Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. LangChain provides some prompts/chains for assisting in this.

Although we will touch on most of these concepts, this tutorial will go more in depth on:

1. 📃 LLMs and Prompts;
2. 🔗🤖 Chains & Agents; and
3. 📚 Data & Retrieval Augmented Generation.

In [20]:
model = OpenAI(temperature=0.9) #instantiate our language model. The temperature parameter controls how "creative" the model is.

#make this innovation mapping related 
text = "What are 5 vacation destinations for someone who likes to eat pasta?" #define our prompt

print(model(text)) #print the output



1. Rome, Italy
2. Venice, Italy
3. Bologna, Italy
4. Naples, Italy
5. Tuscany, Italy
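
To give a feel for what `temperature` does, here is a minimal sketch of temperature-scaled softmax, a common way of turning a model's raw scores into sampling probabilities. The scores below are made up for illustration, and the assumption that the hosted model applies temperature in exactly this way is ours, not something the tutorial states.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature must be greater than 0."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.2)  # low temperature: sharper distribution
hot = softmax_with_temperature(logits, 2.0)   # high temperature: flatter distribution

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

With a low temperature nearly all the probability sits on the top-scoring token, so outputs are more repeatable; with a high temperature the lower-scoring tokens are sampled more often, which reads as more "creative".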


## ❓ Choosing a model 

LangChain allows you to pick from many different types of language models and use them in your application. Subject to access restrictions (e.g. paying for an API), you can pick from a [range of LLMs](https://python.langchain.com/docs/integrations/llms/) or [chat models](https://python.langchain.com/docs/integrations/chat/).

It's worth considering things such as:

1. **Privacy**. Do you want to use a model that is hosted on a server or do you want to use a model that is hosted locally?  

2. **Cost**. Do you want to pay for access to a model or do you want to use a free model?

3. **Token window size**. How long will your prompts be? You can use a model that supports a larger token window size if you need to accommodate longer prompts.

4. **LLMs vs. chat models**. What type of model do you need? LLMs are good for generating text, while chat models are good for having conversations. As LangChain states: _LLMs in LangChain refer to pure text completion models. The APIs they wrap take a string prompt as input and output a string completion. Meanwhile, chat models are often backed by LLMs but tuned specifically for having conversations. And, crucially, their provider APIs expose a different interface than pure text completion models. Instead of a single string, they take a list of chat messages as input._  

In this tutorial, we'll primarily be using OpenAI LLMs and chat models unless otherwise stated.
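
The interface difference between LLMs and chat models can be sketched in plain Python. This is only an illustration: `Message`, `toy_llm` and `toy_chat_model` are hypothetical stand-ins, not LangChain classes, and no real model is called.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str      # e.g. "system", "human" or "ai"
    content: str

def toy_llm(prompt: str) -> str:
    """Stand-in for a text-completion model: string in, string out."""
    return f"(completion of: {prompt})"

def toy_chat_model(messages: List[Message]) -> Message:
    """Stand-in for a chat model: list of messages in, one message out."""
    last_human = [m for m in messages if m.role == "human"][-1]
    return Message(role="ai", content=f"(reply to: {last_human.content})")

completion = toy_llm("Name a pasta dish.")
reply = toy_chat_model([
    Message("system", "You are a helpful travel assistant."),
    Message("human", "Name a pasta dish."),
])

print(completion)
print(reply)
```

The practical consequence is that prompts for the two model types are built differently: a single formatted string for an LLM, and a list of role-tagged messages for a chat model.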



## 📃 Prompts 

Prompts refer to the textual input into the model.

You can use LangChain to create prompts, ranging from a simple hardcoded prompt built with the `PromptTemplate` class to prompts that include worked examples (few-shot prompting).
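
Conceptually, a prompt template is a string with named placeholders plus a check that every placeholder gets a value. The sketch below shows that idea using only the standard library; `ToyPromptTemplate` is a hypothetical class for illustration and is not how LangChain implements `PromptTemplate`.

```python
from string import Formatter

class ToyPromptTemplate:
    def __init__(self, template: str):
        self.template = template
        # Field names between braces become the expected input variables.
        self.input_variables = sorted(
            {name for _, name, _, _ in Formatter().parse(template) if name}
        )

    def format(self, **kwargs) -> str:
        missing = [v for v in self.input_variables if v not in kwargs]
        if missing:
            raise KeyError(f"Missing input variables: {missing}")
        return self.template.format(**kwargs)

toy = ToyPromptTemplate("Suggest {count} books about {topic}.")
print(toy.input_variables)
print(toy.format(count=3, topic="Estonian history"))
```

Keeping the template separate from its values means the same prompt can be reused with different inputs, which is what the tasks in this section practise.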

**🛸 TASK**: Can you re-write the `text` prompt to take as input variables:

1. The number of vacation destinations;
2. The meal type

And rerun the code block below?

Refer to the [prompt template documentation](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) for guidance. 

In [10]:
#Define your prompt template to take as input "number" and "meal" variables.
multiple_input_prompt = PromptTemplate(###) 
    
#format your prompt template with the variables you want to use using PromptTemplate

prompt = ###

## Format the prompt with the PromptTemplate and convert to string 
input = prompt.format_prompt().to_string()

## pass the prompt to the language model

model(input)

**🛸 TASK**: Can you create a similar prompt using the `ChatPromptTemplate` class instead for a chat model in the cell below? You will need to instantiate a chat model first. What are the differences in the output?

In [12]:
#use a chat model class here: ChatPromptTemplate produces a list of messages, which ChatOpenAI accepts
chat_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0)

##Define your prompt template to take as input "number" and "meal" variables.
template = ChatPromptTemplate.from_messages([#### ])

##Format your prompt template with the variables you want to use using ChatPromptTemplate
messages = template.format_messages(
    number=#,
    meal=#,
)

##pass the prompt to the chat model

Great! Now that we understand the basics of constructing a prompt and using a prompt template, let's discuss a common prompt engineering technique called few-shot prompting.

### 📃 Prompts: Few-shot prompting

Few-shot prompting is a prompt engineering technique for improving the model's output: you include a list of worked examples in the prompt so that the model can imitate their format. In LangChain, a few-shot prompt template can be constructed from either a set of examples or an example selector object.

In [13]:
#We first need to create a list of examples to pass to the prompt template

examples = [
    {"question": "What is the capital of Estonia?", "answer": "The capital of Estonia is: Tallinn"},
    {"question": "What is the capital of France?", "answer": "The capital of France is: Paris"},
    {"question": "What is the capital of Germany?", "answer": "The capital of Germany is: Berlin"},
] 

#we then create our prompt template 
example_prompt = PromptTemplate(input_variables=["question", "answer"], template="Question: {question}\n{answer}")

print('The prompt looks like this:')
print('-----------------------------------')
print(example_prompt.format(**examples[0]))
print('-----------------------------------')

#instantiating our fewshot prompt template
prompt = FewShotPromptTemplate(
    examples=examples, 
    example_prompt=example_prompt, 
    suffix="Question: {input}", 
    input_variables=["input"]
)

input = prompt.format(input="What is the capital of Australia?")
#You can see that in the prompt, we have passed the examples as input to the prompt template:
print(input)

##lets see what our model does with this prompt
output_few_shot = model(input)
##what about without examples?
output_zero_shot = model("Question: What is the capital of Australia?")
print('')
print(f'the output using few-shot prompting is: {output_few_shot}')
print('')
print(f'the output not using examples is: {output_zero_shot}')

The prompt looks like this:
-----------------------------------
Question: What is the capital of Estonia?
The capital of Estonia is: Tallinn
-----------------------------------
Question: What is the capital of Estonia?
The capital of Estonia is: Tallinn

Question: What is the capital of France?
The capital of France is: Paris

Question: What is the capital of Germany?
The capital of Germany is: Berlin

Question: What is the capital of Australia?

the output using few-shot prompting is: 
The capital of Australia is: Canberra

the output not using examples is: 

Answer: The capital of Australia is Canberra.


**🛸 TASK** Reflect on the following questions: How does the input to the model change when you pass in a list of examples as part of `few-shot prompting`? Why might this be useful for applications? 

Refer to LangChain's [example selector documentation](https://python.langchain.com/docs/modules/model_io/prompts/example_selectors). When would you use an example selector as opposed to hardcoding examples into the prompt? What are the benefits to the different kinds of example selectors LangChain supports?   

Can you create a few-shot prompt template for a prompt of your choice using a [length based example selector](https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/length_based) in the cell below?   

In [15]:
#assume that the list of examples is very large and you need a way to select the most relevant examples 
# using a length based example selector

#define your list of examples here
examples = [###
    
#define your example prompt
example_prompt =  PromptTemplate(###

#define your example selector 
example_selector = LengthBasedExampleSelector(###
                                              
#define your prompt using your example selector

prompt = FewShotPromptTemplate(###
                               
#An example with long input, so it selects only one example.
long_string = "this is a veeeerrrrrryyyyyy long stringggggggggg, it coulddddddnt beeeeeeeee longerrrrrrrr"
input = prompt.format(###
                 
model(input)                     
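
As a point of comparison for the task above, the idea behind length-based selection can be sketched without LangChain: count the words in the query, then keep adding examples, in order, until a word budget is used up. The function names and the budget of 50 words are assumptions made for this sketch; it is not LangChain's implementation.

```python
def count_words(text: str) -> int:
    return len(text.split())

def select_by_length(candidates, query: str, max_words: int):
    """Keep examples, in order, until the word budget is used up."""
    remaining = max_words - count_words(query)
    selected = []
    for example in candidates:
        cost = count_words(f"Question: {example['question']}\n{example['answer']}")
        if cost > remaining:
            break
        selected.append(example)
        remaining -= cost
    return selected

capital_examples = [
    {"question": "What is the capital of Estonia?", "answer": "The capital of Estonia is: Tallinn"},
    {"question": "What is the capital of France?", "answer": "The capital of France is: Paris"},
    {"question": "What is the capital of Germany?", "answer": "The capital of Germany is: Berlin"},
]

short_query = "What is the capital of Australia?"
long_query = "What is the capital of Australia? " + "please " * 20

n_short = len(select_by_length(capital_examples, short_query, max_words=50))
n_long = len(select_by_length(capital_examples, long_query, max_words=50))
print(n_short, n_long)
```

A short query leaves room for all three examples, while the padded query leaves room for only one, which is the behaviour the `long_string` in the task cell is meant to trigger.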

### 📃 Prompts: Parsing output 

Language models output text. However, sometimes you may want to get more structured information from the model's output. LangChain supports this via output parsers. An output parser implements two main methods:

1. one that returns format instructions to include in the prompt; and
2. one that parses the model's output directly.

LangChain supports a number of [different output parsers](https://python.langchain.com/docs/modules/model_io/output_parsers/), such as:

- **List Parser**: Parses LLM output to be a list of strings.
- **JSON Parser**: Parses LLM output to be in JSON format.
- **Datetime Parser**: Parses LLM output into datetime format.

There are also external libraries that can help with structuring outputs, like [Kor](https://eyurtsev.github.io/kor/), a wrapper that allows you to specify a schema of what should be extracted and provide examples (like what we saw in **📃 Prompts: Few-shot prompting**).

Let's explore the different types of output parsers, writing your own output parser and using an external library to parse the output in the cells below.


In [16]:
#Lets consider the same prompt again ("What are 5 vacation destinations for someone who likes to eat pasta?")

#we would like to return a list of vacation destination strings as output. We can do that using LangChain's output parsers

#First, lets define our output parser 
output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

#print the format instructions to get a sense of what the output parser expects
print(format_instructions)

#define your Prompt Template, passing the output parser as an argument
prompt = PromptTemplate(input_variables=["number", "meal"], 
                        #Add format instructions in the template
                        template="What are {number} vacation destinations for someone who likes to eat {meal}?\n{format_instructions}",
                        #we add our format instructions to the partial variable here
                        partial_variables={"format_instructions": format_instructions})

#define your input 
_input = prompt.format(number="3", meal="salad")

#pass the input to the model
output = model(_input)
output_parsed = output_parser.parse(output)

Your response should be a list of comma separated values, eg: `foo, bar, baz`


Let's see what the raw output looks like and what the parsed output looks like...

In [17]:
print(f"the model returns the following output: {output}")
print(f"the parsed output is: {output_parsed}")

the model returns the following output: 

Rome, Italy, Paris, France, Bangkok, Thailand
the parsed output is: ['Rome', 'Italy', 'Paris', 'France', 'Bangkok', 'Thailand']
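
Notice that the parsed list above has six items rather than three: each "city, country" pair was split in two. The sketch below reproduces this with a plain-Python parser, on the assumption that a comma-separated parser does little more than split on `", "`. It also shows one possible workaround, asking for a different delimiter; the semicolon-separated string is a made-up model response.

```python
def parse_comma_separated(text: str):
    """Split model output on ', ' after trimming surrounding whitespace."""
    return text.strip().split(", ")

raw_output = "\n\nRome, Italy, Paris, France, Bangkok, Thailand"
naive = parse_comma_separated(raw_output)
print(naive)

# A delimiter that does not occur inside an item keeps each pair intact.
raw_semicolons = "Rome, Italy; Paris, France; Bangkok, Thailand"
pairs = [item.strip() for item in raw_semicolons.split(";")]
print(pairs)
```

The takeaway is that format instructions and the parser have to agree with the shape of the data: if items can themselves contain commas, a comma-separated format is ambiguous.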


Now let's try a more complex example by defining our own response schema. You can define a response schema for which you want to return multiple fields using the `ResponseSchema` and `StructuredOutputParser` classes.

In [18]:
schema = [ResponseSchema(name="name", 
                         description="The first and last name of a person. Must be a string."),
          ResponseSchema(name="date", 
                         description="The date of an event. Must be a string.")]

output_parser = StructuredOutputParser.from_response_schemas(schema)
format_instructions = output_parser.get_format_instructions()

#define your Prompt Template, passing the output parser as an argument
prompt = PromptTemplate(input_variables=["sentence"], 
                        #Add format instructions in the template
                        template="Extract names and dates from the following sentence:{sentence}\n{format_instructions}",
                        #we add our format instructions to the partial variable here
                        partial_variables={"format_instructions": format_instructions})

#define your input 
sentence = "Kylie Jenner turned 26-years-old on Thursday."
_input = prompt.format(sentence=sentence)

#pass the input to the model
output = model(_input)
output_parsed = output_parser.parse(output)

print(f"the initial output of the model is: {output}")
print(f"the parsed output is: {output_parsed}")

the initial output of the model is: 

```json
{
	"name": "Kylie Jenner",
	"date": "Thursday"
}
```
the parsed output is: {'name': 'Kylie Jenner', 'date': 'Thursday'}
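
To make the parsing step less of a black box, here is a rough sketch of what turning the fenced JSON above into a dictionary involves: find the fenced block, load the JSON, and check that the expected keys are present. `parse_fenced_json` is a hypothetical helper written for this illustration and is not LangChain's code.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to avoid nesting code fences

def parse_fenced_json(text: str, expected_keys):
    """Pull the JSON object out of a fenced block and check its keys."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No fenced JSON object found in model output")
    data = json.loads(match.group(1))
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"Missing keys in model output: {missing}")
    return data

# The raw model output from the cell above, rebuilt as a string.
raw = f'\n\n{FENCE}json\n{{\n\t"name": "Kylie Jenner",\n\t"date": "Thursday"\n}}\n{FENCE}'
parsed = parse_fenced_json(raw, expected_keys=["name", "date"])
print(parsed)
```

Both failure branches (no JSON found, or a key missing) are the kind of error that a retrying or fixing parser would need to recover from, which is worth keeping in mind for the task below.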


**🛸 TASK**: Experiment with different prompts and types of output parsers that LangChain supports. What are the purposes of `OutputFixingParser` and `RetryWithErrorOutputParser`? Can you write your own custom output parser?

Refer to documentation on [output parsers](https://python.langchain.com/docs/modules/model_io/output_parsers/) for guidance.

In [19]:
#Experiment here

### 📃 Prompts: Chaining it all together

So far, we've learned about:

1. Writing a simple prompt;
2. Using LangChain's prompt templates;
3. Using few-shot prompting and;
4. Parsing the output of the model by passing formatting instructions. 

This should give you a good foundation to start building your own prompts and is a good departure point to explore LangChain's **🔗🤖 chains & agents.** 

## 🔗🤖 Chains & Agents

The core idea of LangChain is that we can "chain" together different components to create more advanced use cases around LLMs. Chains are simply end-to-end wrappers around multiple individual components.

The simplest chain is the `LLMChain`. It consists of a `PromptTemplate`, a model (either an LLM or a chat model), and an optional output parser. We've already explored all three of these components in the previous sections.
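
As a mental model, the three components can be composed in a few lines of plain Python: format the prompt, call the model, optionally parse the result. `make_chain` and `canned_model` are hypothetical names for this sketch, and the canned response is made up; a real chain would call an actual model.

```python
from typing import Callable, Optional

def make_chain(template: str,
               model: Callable[[str], str],
               parser: Optional[Callable[[str], object]] = None):
    """Return a function that formats the prompt, calls the model, then parses."""
    def run(**inputs):
        prompt_text = template.format(**inputs)
        text = model(prompt_text)
        parsed_text = parser(text) if parser else text
        return {**inputs, "text": parsed_text}
    return run

def canned_model(prompt_text: str) -> str:
    # Hypothetical stand-in for an LLM call; always returns the same text.
    return "\n\nrice noodles, coconut milk, chilli"

chain_sketch = make_chain(
    "Give me the ingredients for a {adjective} {meal}",
    canned_model,
    parser=lambda text: text.strip().split(", "),
)
result = chain_sketch(adjective="spicy", meal="laksa")
print(result)
```

The returned dictionary echoes the inputs alongside the generated `text`, mirroring the shape of the chain outputs shown in this section.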

Let's see what that looks like in the cell below.

In [24]:
#define your prompt
prompt_template = "What is a good alternative name for a celebrity called {name}?"

#here we define the chain - we pass the prompt template and the model to the LLMChain
llm_chain = LLMChain(
    llm=model,
    prompt=PromptTemplate.from_template(prompt_template)
)

#execute your chain
llm_chain('Kim Kardashian')

{'name': 'Kim Kardashian', 'text': '\n\nKendall Jenner.'}

Let's also pass an output parser to the chain.

In [26]:
#instantiate the output parser
output_parser = CommaSeparatedListOutputParser()

#define your prompt
template = """Give me the ingredients for a {adjective} {meal}"""

#instantiate the prompt template
prompt = PromptTemplate(template=template, input_variables=["adjective", "meal"], output_parser=output_parser)

#chain the prompt template and the model
llm_chain = LLMChain(prompt=prompt, 
                     llm=model)

#execute your chain
llm_chain({'adjective': "spicy", 'meal': "laksa"})

{'adjective': 'spicy',
 'meal': 'laksa',
 'text': '\n\n-Rice vermicelli\n-Curry paste (of your choice)\n-Canned coconut milk\n-Vegetable stock\n-Shallots\n-Garlic\n-Shrimps\n-Fish sauce\n-Sugar\n-Lemongrass\n-Lime juice\n-Coriander\n-Cumin\n-Red chilli\n-Red onion\n-Tomatoes\n-Fresh coriander'}

### 🔗🤖 Chains & Agents: Sequential and Index-related chains

Let's explore a few other chains and categories of chains that LangChain supports:

1. `SequentialChain`: A chain that executes a sequence of chains in order.
2. **Index-related chains**: A category of chains that allows you to combine your own data (stored in indexes) with LLMs. 

In [27]:
# This is an LLMChain to write a headline for a celebrity gossip article

llm = OpenAI(temperature=.7)
template = """You are a news reporter. Given a celebrity name and a date, write a gossip headline for an article about them.

name: {name}
date: {date}

news reporter: This just in:"""

prompt_template = PromptTemplate(input_variables=["name", "date"], template=template)
gossip_chain = LLMChain(llm=llm, prompt=prompt_template, output_key="gossip")

# This is an LLMChain to write a celebrity response, given the headline. 

llm = OpenAI(temperature=.7)

template = """You are a celebrity. Given a headline about you, you are writing a response to it.

News headline:
{gossip}

Response from a celebrity of the above headline:"""

prompt_template = PromptTemplate(input_variables=["gossip"], template=template)
response_chain = LLMChain(llm=llm, prompt=prompt_template, output_key="response")

In [None]:
#This is the overall chain where we chain the two chains together

overall_chain = SequentialChain(
    chains=[gossip_chain, response_chain],
    input_variables=["name", "date"],
    # Here we return multiple variables
    output_variables=["gossip", "response"],
    verbose=True)

overall_chain({"name":"Drake", "date": "April 2021"})



> Entering new SequentialChain chain...

> Finished chain.


{'name': 'Drake',
 'date': 'April 2021',
 'gossip': ' Drake Sparks Romance Rumors in April 2021!',
 'response': " \nI'm flattered by the rumors, but I'm focused on my music right now. I'm just enjoying the moment."}
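
The way the two chains are wired together through the `gossip` key can be sketched as a loop over a shared dictionary: each step reads the keys it needs and writes one new key. The step functions below are hypothetical stand-ins for the two LLM calls, so the text they produce is invented.

```python
def run_sequential(steps, inputs):
    """Run steps in order; each step reads named keys and writes one new key."""
    state = dict(inputs)
    for input_keys, output_key, func in steps:
        args = {key: state[key] for key in input_keys}
        state[output_key] = func(**args)
    return state

# Hypothetical stand-ins for the two LLM calls.
def write_gossip(name, date):
    return f"{name} makes headlines in {date}!"

def write_response(gossip):
    return f"No comment on: {gossip}"

steps = [
    (["name", "date"], "gossip", write_gossip),
    (["gossip"], "response", write_response),
]
state = run_sequential(steps, {"name": "Drake", "date": "April 2021"})
print(state)
```

This is why the `output_key` of the first chain has to match an input variable of the second: the key name is the only link between them.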

Now let's take a look at index-related chains. This category of chains is used for interacting with indexes, with the key purpose of allowing you to combine your own data (stored in indexes) with LLMs.

There are a few different types of index-related chains: `stuff`, `map_reduce` and `refine`.

`stuff`: you simply stuff all the related data into the prompt as context to pass to the language model.

`map_reduce`: This method involves running an initial prompt on each chunk of data (for summarization tasks, this could be a summary of that chunk; for question-answering tasks, it could be an answer based solely on that chunk). Then a different prompt is run to combine all the initial outputs. 

`refine`: This method involves running an initial prompt on the first chunk of data, generating some output. For the remaining documents, that output is passed in, along with the next document, asking the LLM to refine the output based on the new document.
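
The three strategies differ mainly in how many model calls they make and what each call sees. The sketch below implements each one against a hypothetical `CountingLLM` stand-in that only counts calls; the prompts and the three short "abstracts" are placeholders invented for illustration.

```python
class CountingLLM:
    """Hypothetical stand-in for an LLM that records how often it is called."""
    def __init__(self):
        self.calls = 0

    def __call__(self, prompt_text: str) -> str:
        self.calls += 1
        return f"<answer {self.calls}>"

def stuff(docs, question, llm_fn):
    # One call: every document goes into a single prompt.
    context = "\n\n".join(docs)
    return llm_fn(f"{context}\n\nQuestion: {question}")

def map_reduce(docs, question, llm_fn):
    # One call per document, then one call to combine the partial answers.
    partials = [llm_fn(f"{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n".join(partials)
    return llm_fn(f"Combine these answers:\n{combined}\n\nQuestion: {question}")

def refine(docs, question, llm_fn):
    # Sequential calls: each one revises the previous answer with a new document.
    answer = llm_fn(f"{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm_fn(f"Existing answer: {answer}\nNew context: {doc}\n\nQuestion: {question}")
    return answer

abstracts = ["Abstract about glass.", "Abstract about power transfer.", "Abstract about sensors."]
question = "What are the types of inventions patented?"

call_counts = {}
for strategy in (stuff, map_reduce, refine):
    counting_llm = CountingLLM()
    strategy(abstracts, question, counting_llm)
    call_counts[strategy.__name__] = counting_llm.calls
print(call_counts)
```

For three documents this gives one call for `stuff`, four for `map_reduce` and three for `refine`. `stuff` is cheapest but limited by the token window, `map_reduce` calls can run independently of each other, and `refine` must run in order because each call depends on the previous answer.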

In [29]:
#First, lets load a pre-defined question answering chain

#lets load some data
dg = DataGetter(local=False)
#we're going to get a sample of 10 Estonian patents
#whose assignee is from Tallinn
patents_sample = (dg.get_estonian_patents()
 .explode('assignee_harmonized_names')
 .query('assignee_harmonized_names.str.contains("TALLINN")')
 .drop_duplicates('family_id')
 .sample(10, random_state=42))

#lets load the data into a dataframe loader
loader = DataFrameLoader(patents_sample, page_content_column="abstract_localized")
docs = loader.load()

#lets define our prompt - we want to ask questions about patents
#from tallinn assignees using patent abstracts 
prompt = """You are a patent examiner. Given patent abstracts, answer the following question:

{Question}

"""

chain = load_qa_chain(llm, chain_type="map_reduce") #here, we can define the chain type to be "stuff", "map_reduce" or "refine".
chain({"input_documents": docs, 
       "question": "What are the types of inventions patented?"}, return_only_outputs=True)

2023-08-11 15:17:18,293 - TalTech HackWeek 2023 - INFO - Loading data from open dap-taltech s3 bucket. (data_getters.py:58)


{'output_text': ' Types of inventions patented include a method of making a transparent visible light activated photocatalytic superhydrophilic glass material, a system and method for power transfer between two DC voltage sources, a therapeutic mud mixture, and a sensor for the detection of Neurotrophic Factor.'}

**🛸 TASK**: Spend some time exploring the types of chains that LangChain supports. Can you build your own chain using a combination of the components we've explored so far?

In [30]:
#You can investigate the different types of chains by
#calling help on langchain.chains

help(langchain.chains)


Help on package langchain.chains in langchain:

NAME
    langchain.chains - Chains are easily reusable components which can be linked together.

DESCRIPTION
    Chains should be used to encode a sequence of calls to components like
    models, document retrievers, other chains, etc., and provide a simple interface
    to this sequence.
    
    The Chain interface makes it easy to create apps that are:
        - Stateful: add Memory to any Chain to give it state,
        - Observable: pass Callbacks to a Chain to execute additional functionality,
            like logging, outside the main sequence of component calls,
        - Composable: the Chain API is flexible enough that it is easy to combine
            Chains with other components, including other Chains.

PACKAGE CONTENTS
    api (package)
    base
    chat_vector_db (package)
    combine_documents (package)
    constitutional_ai (package)
    conversation (package)
    conversational_retrieval (package)
    elasticsearch_datab

### 🔗🤖 Chains & Agents: Agents

Now that we've explored a few different types of chains that LangChain supports, let's pivot to exploring agents.

Chains and agents are somewhat similar. However, in a chain, the sequence of actions is hardcoded. In an agent, a language model is used as a reasoning engine to determine which actions to take and in which order. Here is a list of [agent types that LangChain supports](https://python.langchain.com/docs/modules/agents/agent_types/).

Key to an Agent are tools. Tools are functions that an agent calls. You can define your own tools by adding a `tool` decorator to a function.
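
As a rough mental model, an agent is a loop: ask the model what to do, run the chosen tool, feed the observation back, and repeat until the model decides it is finished. In the sketch below the "model" is a hard-coded function (`scripted_model`) so that the loop can run without an API key; all names here are hypothetical and this is not how LangChain's `AgentExecutor` is written.

```python
def count_letter_s(word: str) -> int:
    """Tool: count occurrences of the letter s in a word."""
    return word.lower().count("s")

TOOLBOX = {"count_letter_s": count_letter_s}

def scripted_model(question, observations):
    """Hypothetical stand-in for the LLM's decision step.

    Returns ("call", tool_name, argument) or ("finish", answer).
    """
    if not observations:
        word = question.rstrip("?").split()[-1]
        return ("call", "count_letter_s", word)
    return ("finish", f"The letter 's' appears {observations[-1]} times.")

def run_agent(question, max_steps=5):
    observations = []
    for _ in range(max_steps):
        decision = scripted_model(question, observations)
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, argument = decision
        observations.append(TOOLBOX[tool_name](argument))
    raise RuntimeError("Agent did not finish within max_steps")

answer = run_agent("How many times does s appear in Sesquipedalian?")
print(answer)
```

In a real agent the decision step is a model call, so the order and number of tool calls are not known in advance. That is the key difference from a chain, and the reason for the `max_steps` guard.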

Let's walk through a simple example first.

In [31]:
#lets load our chat model 
llm = ChatOpenAI(temperature=0)

#lets define a really simple tool 
#to return the number of s's in a word
@tool
def get_s_count(word: str) -> int:
    """Returns a count of the number of s's in a word"""
    return word.lower().count("s")

system_message = SystemMessage(content="You are very powerful assistant, but bad at calculating the number of times the letter s appears in a word.")
prompt = OpenAIFunctionsAgent.create_prompt(system_message=system_message)

tools = [get_s_count]
#putting it all together
agent = OpenAIFunctionsAgent(llm=llm, tools=tools, prompt=prompt)
#this defines the run time for the agent
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

agent_executor.run("how many times do you spot the letter 's' in the word Sesquipedalian?")



> Entering new AgentExecutor chain...

Invoking: `get_s_count` with `{'word': 'Sesquipedalian'}`


2
The letter 's' appears 2 times in the word "Sesquipedalian".

> Finished chain.


'The letter \'s\' appears 2 times in the word "Sesquipedalian".'

Voila - we have an agent! However, the agent is stateless - meaning it doesn't remember anything about previous interactions, making follow up questions difficult. I don't know about you, but I have no idea what "Sesquipedalian" means. Let's add memory to fix this and ask a few follow up questions.

In [32]:
MEMORY_KEY = "chat_history"
prompt = OpenAIFunctionsAgent.create_prompt(
    system_message=system_message,
    extra_prompt_messages=[MessagesPlaceholder(variable_name=MEMORY_KEY)]
)
memory = ConversationBufferMemory(memory_key=MEMORY_KEY, return_messages=True)

agent = OpenAIFunctionsAgent(llm=llm, tools=tools, prompt=prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, memory=memory, verbose=True)
agent_executor.run("how many times do you spot the letter 's' in the word Sesquipedalian?")
agent_executor.run("what does that word even mean?")
agent_executor.run("can you use it in a sentence?")



> Entering new AgentExecutor chain...

Invoking: `get_s_count` with `{'word': 'Sesquipedalian'}`


2
The letter 's' appears 2 times in the word "Sesquipedalian".

> Finished chain.


> Entering new AgentExecutor chain...
"Sesquipedalian" is an adjective that means using long words or characterized by long words; long-winded. It is often used to describe someone who tends to use excessively long and complex words in their speech or writing.

> Finished chain.


> Entering new AgentExecutor chain...
Certainly! Here's an example sentence using the word "sesquipedalian":

"During the lecture, the professor's sesquipedalian style of speaking made it difficult for the students to understand the concepts."

> Finished chain.


'Certainly! Here\'s an example sentence using the word "sesquipedalian":\n\n"During the lecture, the professor\'s sesquipedalian style of speaking made it difficult for the students to understand the concepts."'

Great - we've created a simple agent that can remember previous interactions. 
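
What "memory" means mechanically can be sketched in a few lines: before each call, load the stored messages and pass them in with the new input; after each call, store the new turn. `BufferMemory`, `echo_model` and `ask` are hypothetical names for this sketch, and the stand-in model only reports how much history it can see.

```python
class BufferMemory:
    """Keep every past message and hand the full list back on request."""
    def __init__(self):
        self.messages = []

    def load(self):
        return list(self.messages)

    def save(self, user_text, ai_text):
        self.messages.append(("human", user_text))
        self.messages.append(("ai", ai_text))

def echo_model(history, user_text):
    # Hypothetical stand-in for a chat model: it can only "see" what is in its input.
    return f"I can see {len(history)} earlier messages."

def ask(memory_sketch, user_text):
    history = memory_sketch.load()          # 1. read the stored turns
    reply_text = echo_model(history, user_text)  # 2. call the model with history + new input
    memory_sketch.save(user_text, reply_text)    # 3. store the new turn
    return reply_text

buffer = BufferMemory()
first = ask(buffer, "How many s's are in Sesquipedalian?")
second = ask(buffer, "What does that word even mean?")
print(first)
print(second)
```

The model itself stays stateless; follow-up questions such as "what does that word even mean?" only work because the earlier turns are sent again as part of the prompt. A buffer like this grows with every turn, which is the motivation for token-limited memory variants.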

Let's explore different agents by revisiting, as a use case, our chain that summarised a series of Estonian patents.

In [33]:
#Lets create a dataframe agent 

#lets get our patents data
patents_sample = (dg.get_estonian_patents()
 .explode('assignee_harmonized_names')
 .query('assignee_harmonized_names.str.contains("TALLINN")')
 .drop_duplicates('family_id')
 .sample(10, random_state=42)
 .drop(columns=['inventor_harmonized_country_codes', 'assignee_harmonized_country_codes', 'abstract_localized', 'country_code', 'application_number', 'filing_date', 'priority_date']))

#lets add memory so we can ask follow up questions
MEMORY_KEY = "chat_history"
prompt = OpenAIFunctionsAgent.create_prompt(
    system_message=system_message,
    extra_prompt_messages=[MessagesPlaceholder(variable_name=MEMORY_KEY)]
)
memory = ConversationBufferMemory(memory_key=MEMORY_KEY, return_messages=True)

#let's instantiate a pandas df agent with an LLM
agent = create_pandas_dataframe_agent(OpenAI(temperature=0), patents_sample, verbose=True, memory=memory)
agent.run("what are the patents about?")
agent.run("can you tell me more about the patents?")



> Entering new AgentExecutor chain...
Thought: I need to look at the title_localized column
Action: python_repl_ast
Action Input: df['title_localized'].tolist()
Observation: ['A method of making a transparent visual light activated photocatalytic superhydrophilic glass material', 'Method and device for measuring charcteristics of refelection of light on surfaces', 'Method of shoot-through generation for modified sine wave z-source, quasi-z-source and trans-z-source inverters', 'Method of making a portable mip-based electrochemical sensor for the detection of the sars-cov-2 antigen', 'System and method for a partial power transfer between two dc sources', 'Synthesis and polymerization of isosorbide-based monomethacrylates', 'Therapeutic mud mixture and a method for its manufacture', 'Method and device for measuring and monitoring concentration of substances in a biological fluid', 'Method and device for frequency response measurement', 'Molecularl

'The dataframe contains 10 entries, with 10 columns of data. The columns contain information about the patent, such as the publication number, family id, title, publication date, grant date, cpc, inventor and assignee harmonized names, and inventor and assignee harmonized names.'

**🛸 TASK**: Reflect on the chain vs. agent approach. How do the two differ? 

Build your own **agent** using tools [from a list of available LangChain tools](https://python.langchain.com/docs/integrations/tools/) and a combination of the components we've explored so far.

In [181]:
#Build your own agent here

## 🔗🤖 Chains & Agents: Chaining it all together

Nice. Now we're familiar with prompts, chains and agents. We also dabbled in using our own data (primarily patents) as part of building our LLM application. 

LangChain provides much more functionality for using external data sources than what we've seen. Using external data is a key part of building real-world use cases with LLM applications. Let's explore this in more detail in the cells below. 

## 📚 Data Augmented Generation

### 📚 Data Augmented Generation: Document Loaders

So far we've been using data from our utils library. LangChain also offers many integrations for loading external data into your LLM application, ranging from your own files in a local directory or on AWS S3 to external sources such as Twitter or Open City Data.

In [3]:
# Load a list of Document objects related to the Wikipedia query "Barbie"
barbie_pages = WikipediaLoader(query="Barbie", load_max_docs=10).load()
len(barbie_pages)





9

In [4]:
# Let's have a look at a Document
print("The page content begins as follows:")
print()
print(f"{barbie_pages[3].page_content[:1000]}...")

The page content begins as follows:

Barbenheimer is an Internet phenomenon that began on social media before the simultaneous theatrical release of two blockbuster films, Barbie and Oppenheimer, on July 21, 2023, in the United States and several other countries. The word is a portmanteau of the films' titles. The contrast of Barbie—a fantasy comedy by Greta Gerwig about the fashion doll Barbie—and Oppenheimer—an epic biographical thriller by Christopher Nolan about physicist J. Robert Oppenheimer, scientific director of the Manhattan Project, which developed the first nuclear weapons during World War II—prompted a comedic response from Internet users, including memes and merchandise. Polygon described the two films as "extreme opposites", and Variety called the phenomenon "the movie event of the year".The films' simultaneous release was an instance of counterprogramming. As their release date approached, instead of generating a rivalry, suggestions emerged to watch the films as a doub

It's as easy as that! Feel free to refer to LangChain's document loaders [here](https://python.langchain.com/docs/integrations/document_loaders/) to explore different types of loaders. 
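
Under the hood, a loader turns some source into a list of objects carrying text plus metadata. The following is a rough sketch of that idea in plain Python, not LangChain's implementation. `SimpleDocument` and `load_text_directory` are hypothetical names, and only the `page_content` and `metadata` fields mirror what a LangChain `Document` exposes.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_text_directory(directory: Path) -> List[SimpleDocument]:
    """Read every .txt file in a directory into a SimpleDocument."""
    documents = []
    for path in sorted(directory.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        documents.append(SimpleDocument(page_content=text, metadata={"source": path.name}))
    return documents


# Demo on a temporary directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "a.txt").write_text("Barbie is a fashion doll.", encoding="utf-8")
    (tmp_path / "b.txt").write_text("Oppenheimer is a biographical thriller.", encoding="utf-8")
    loaded = load_text_directory(tmp_path)

print(len(loaded), loaded[0].metadata["source"], loaded[0].page_content[:10])
```

Keeping the source in `metadata` is what later allows a question-answering chain to cite where an answer came from.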

We won't explore document loading or transformation much further in this tutorial. Instead, we pivot to Retrieval Augmented Generation (RAG), which combines LLMs with traditional Information Retrieval (IR) techniques, using LangChain's document loaders as a starting point for building a vector database.

### 📚 Data Augmented Generation: Retrieval

As we've already seen, we can load external data sources and pass them to our LLMs as context via a prompt. However, sometimes when our data is much larger, we want to be able to retrieve the most relevant datapoints first. This is where **Retrieval Augmented Generation (RAG)** comes in.

**RAG** is a generative paradigm that fuses Large Language Models with traditional Information Retrieval (IR) techniques. A retrieval system finds content relevant to the input prompt, and that content is used to augment the output generated by the LLM. This lets us bypass fine-tuning: we expose the model to external (non-parametric) data at query time instead of retraining it on our domain-specific data. RAG has a number of advantages, including:

1. **Easy Knowledge Acquisition.** RAG methods can easily acquire knowledge from external sources, improving LLM performance on domain-specific tasks.

2. **Minimal Training Cost.** The only training needed is the indexing of your knowledge base. No fine-tuning necessary.

3. **Multiple Sources of Knowledge.**  With RAG, one can make use of multiple sources of knowledge, including those that are baked into the model parameters as well as information contained within many different knowledge bases.

4. **Scalability.** Using performant vector databases, we can easily scale RAG to large datasets and handle complex queries.

5. **Improved Performance & Reduced Hallucination.** RAG generates more accurate and contextually informed content by leveraging retrieval techniques, reducing the likelihood of generating incorrect or fabricated information.

6. **Overcome Context-Window Limit.** All language models have a fixed number of tokens they can process at once, known as the context window. Using Retrieval Augmentation, we can work around this constraint, allowing the model to incorporate data from larger document collections.

7. **Return Sources.** RAG also offers explainability, which is essential for building trust in LLMs. Unlike a black-box LLM, RAG allows users to read the sources it retrieved and judge their relevance and credibility for themselves.

_Taken from [Harnessing Retrieval Augmented Generation With Langchain](https://betterprogramming.pub/harnessing-retrieval-augmented-generation-with-langchain-2eae65926e82)_

Let's build on our knowledge of prompts, chains, agents and document loading to explore RAG in more detail in the cells below.
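
Before we reach for a vector database, here is a toy sketch of the retrieve-then-augment idea in plain Python. It is an illustration under simplifying assumptions: word-count vectors stand in for learned embeddings, and `KNOWLEDGE_BASE`, `embed`, `retrieve` and `build_prompt` are hypothetical names, not LangChain functions.

```python
import math
import re
from collections import Counter
from typing import List

KNOWLEDGE_BASE = [
    "Greta Gerwig directed the 2023 film Barbie.",
    "Christopher Nolan directed the film Oppenheimer.",
    "Chroma is an open-source vector store.",
]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (an assumption, not a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Augment the question with the retrieved context."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


top_docs = retrieve("Who directed Barbie?", KNOWLEDGE_BASE, k=1)
prompt = build_prompt("Who directed Barbie?", top_docs)
print(prompt)
```

The resulting prompt is what would be sent to the LLM. Swapping the word counts for real embeddings and the list for a vector store changes the quality of retrieval, not the shape of the pipeline.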

In [5]:
#1. Load data

# First, we need to load data from a document loader - let's revisit our barbie example by loading barbie related wikipedia pages
print(f"we will create a vector database from {len(barbie_pages)} barbie related wikipedia pages...")

#2. Preprocess data

# As we've already loaded the Barbie pages, we need to preprocess the data next.
# Let's revisit some of the techniques we learned in the text analysis tutorial to
# preprocess our Wikipedia pages. Let's chunk and tokenize our documents.

# It's important to chunk the data as we want to embed a meaningful length of context within our vector index.
# Embedding just a word or two gives too little information to match relevant vectors, and entire pages would be too long
# to fit within the context window of the prompt. Try to strike the right balance for your use case and dataset.
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=25)
docs = text_splitter.split_documents(barbie_pages)


we will create a vector database from 9 barbie related wikipedia pages...
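
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a small sliding-window sketch. It is a simplification under a stated assumption: it splits on whitespace rather than on model tokens, and `chunk_words` is a hypothetical helper, not the `TokenTextSplitter` implementation.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size words; neighbouring windows share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears with some of its neighbouring context in the next chunk, at the cost of storing a little duplicated text.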


In [38]:
##3. Index your data
 
# Once we’ve gathered our data sources, it’s time to build our knowledge-base index. 
# In general, the term “index” refers to a data structure that is used to optimize the retrieval of information 
# from a larger collection of data.
# In this demo, we'll use OpenAI embeddings together with Chroma (chromadb), a free, open-source vector store that
# runs on your local machine.
embeddings = OpenAIEmbeddings()

#build our vector store with OpenAI embeddings and barbie pages
db = Chroma.from_documents(docs, embeddings)

##4. Build a Retriever

# Once our vector store is indexed, it's time to define our retriever. The retriever is the component that determines
# how the relevant documents are fetched from the vector database, according to its search algorithm.

# initialize base retriever
retriever = db.as_retriever(search_kwargs={"k": 3})  # we will return the top 3 results

llm = ChatOpenAI(temperature=0)

# Add contextual compression: the compressor iterates over the initially returned documents
# and extracts from each only the content relevant to the query, rather than returning the whole chunk.
compressor = LLMChainExtractor.from_llm(llm)
# Note: despite its name, `reranker` compresses the retrieved documents; it does not reorder them.
reranker = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)

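##5. Create our Conversational Retrieval Chain

# Imports for this step (harmless if already imported earlier in the notebook)
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.memory import ConversationTokenBufferMemory
from langchain.prompts import PromptTemplate

# Now that we've stored our data in a vector store and defined our retriever,
# we can create our conversational retrieval chain.

# Let's define memory so we can ask follow-up questions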
memory = ConversationTokenBufferMemory(llm=llm, 
                                       memory_key="chat_history", 
                                       return_messages=True, 
                                       input_key='question', 
                                       max_token_limit=1000)
#Let's define our LLM chain

_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

question_generator = LLMChain(llm=llm, 
                              prompt=CONDENSE_QUESTION_PROMPT, 
                              verbose=True)
#Let's define our answer chain
answer_chain = load_qa_with_sources_chain(llm, chain_type="stuff", verbose=True)

chain = ConversationalRetrievalChain(
            retriever=reranker,
            question_generator=question_generator,
            combine_docs_chain=answer_chain,
            verbose=True,
            memory=memory,
            rephrase_question=False
)
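
The idea behind contextual compression can be sketched without an LLM: retrieve documents first, then keep only the parts that bear on the query. In the sketch below, keyword overlap is an assumed stand-in for the LLM-based extraction that `LLMChainExtractor` performs, and `keywords`, `compress_document` and `compress_all` are hypothetical helpers.

```python
import re
from typing import List, Set

STOPWORDS: Set[str] = {"the", "a", "an", "of", "who", "what", "is", "was", "did", "and", "in"}


def keywords(text: str) -> Set[str]:
    """Lowercased content words of a text (toy tokenisation)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def compress_document(document: str, query: str) -> str:
    """Keep only the sentences that share at least one keyword with the query."""
    query_terms = keywords(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if keywords(s) & query_terms]
    return " ".join(kept)


def compress_all(documents: List[str], query: str) -> List[str]:
    """Compress each retrieved document and drop those with nothing relevant left."""
    compressed = [compress_document(doc, query) for doc in documents]
    return [doc for doc in compressed if doc]


retrieved = [
    "Barbie is a 2023 film. It was directed by Greta Gerwig. The soundtrack charted widely.",
    "Chroma stores embeddings. It runs locally.",
]
compressed_docs = compress_all(retrieved, "Who directed the Barbie film?")
print(compressed_docs)
```

Compression shortens the context passed to the answer step, which saves tokens but adds an extra pass over every retrieved document.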

Amazing! Now we can ask all our Barbie-related questions 💅.

In [39]:
chain.run("Who directed the Barbie film?")



> Entering new ConversationalRetrievalChain chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). 
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.

QUESTION: Which state/country's law governs the interpretation of the contract?
Content: This Agreement is governed by English law and the parties submit to the exclusive jurisdiction of the English courts in  relation to any dispute (contractual or non-contractual) concerning this Agreement save that either party may apply to any court for an  injunction or other relief to protect its Intellectual Property Rights.
Source: 28-pl
Content: No Waiver. Failure or delay in exercising any right or remedy under this Agreement shall not cons

'The director of the Barbie film is Greta Gerwig.\nSOURCES: https://en.wikipedia.org/wiki/Barbie_(film), https://en.wikipedia.org/wiki/List_of_Barbie_animated_films'
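
The `stuff` chain type does what the log above suggests: it places every retrieved chunk, labelled with its source, into a single prompt. A minimal sketch of that assembly follows. `stuff_documents` is a hypothetical helper, the `QUESTION:`/`Content:`/`Source:` layout is copied from the log, and the sketch omits whatever follows in the real prompt, since the log is truncated.

```python
from typing import Dict, List

INSTRUCTIONS = (
    "Given the following extracted parts of a long document and a question, "
    'create a final answer with references ("SOURCES").'
)


def stuff_documents(question: str, documents: List[Dict[str, str]]) -> str:
    """Concatenate all retrieved chunks, each tagged with its source, into one prompt."""
    parts = [INSTRUCTIONS, "", f"QUESTION: {question}"]
    for doc in documents:
        parts.append(f"Content: {doc['page_content']}")
        parts.append(f"Source: {doc['source']}")
    return "\n".join(parts)


example_docs = [
    {"page_content": "Barbie is a 2023 film directed by Greta Gerwig.",
     "source": "https://en.wikipedia.org/wiki/Barbie_(film)"},
    {"page_content": "Barbenheimer is an Internet phenomenon.",
     "source": "https://en.wikipedia.org/wiki/Barbenheimer"},
]
stuffed_prompt = stuff_documents("Who directed the Barbie film?", example_docs)
print(stuffed_prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason we chunk and compress the documents first.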

In [40]:
chain.run("Who did she produce the film with?")



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: Who directed the Barbie film?
Assistant: The director of the Barbie film is Greta Gerwig.
SOURCES: https://en.wikipedia.org/wiki/Barbie_(film), https://en.wikipedia.org/wiki/List_of_Barbie_animated_films
Follow Up Input: Who did she produce the film with?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). 
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCES" par

'She produced the film with Laurence Mark.\nSOURCES: https://en.wikipedia.org/wiki/Barbie_(film)'
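
The rephrasing step in the log is plain template filling: the stored turns are rendered as text and substituted into the condense-question template alongside the follow-up. Here is a sketch using `str.format` with the template text from our chain. `render_history` is a hypothetical helper, and the `Human:`/`Assistant:` labels follow the log.

```python
from typing import List, Tuple

CONDENSE_TEMPLATE = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""


def render_history(turns: List[Tuple[str, str]]) -> str:
    """Render (human, assistant) turns in the Human:/Assistant: layout seen in the log."""
    lines = []
    for human, assistant in turns:
        lines.append(f"Human: {human}")
        lines.append(f"Assistant: {assistant}")
    return "\n".join(lines)


turns = [("Who directed the Barbie film?", "The director of the Barbie film is Greta Gerwig.")]
condense_prompt = CONDENSE_TEMPLATE.format(
    chat_history=render_history(turns),
    question="Who did she produce the film with?",
)
print(condense_prompt)
```

The model's completion of this prompt is the standalone question used for retrieval. Since we set `rephrase_question=False`, the original follow-up, not the rephrased one, is what gets passed on to the answer chain.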

In [41]:
chain.run("What does Barbie have to do with Oppenheimer?")



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: Who directed the Barbie film?
Assistant: The director of the Barbie film is Greta Gerwig.
SOURCES: https://en.wikipedia.org/wiki/Barbie_(film), https://en.wikipedia.org/wiki/List_of_Barbie_animated_films
Human: Who did she produce the film with?
Assistant: She produced the film with Laurence Mark.
SOURCES: https://en.wikipedia.org/wiki/Barbie_(film)
Follow Up Input: What does Barbie have to do with Oppenheimer?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following extracted parts of a long document and a question, create a final

'Barbie and Oppenheimer are two blockbuster films that were released simultaneously on July 21, 2023. The contrast between the two films, Barbie being a fantasy comedy and Oppenheimer being an epic biographical thriller about physicist J. Robert Oppenheimer, led to an Internet phenomenon called Barbenheimer. This phenomenon included memes and merchandise related to the combination of the two films. \nSOURCES: https://en.wikipedia.org/wiki/Barbenheimer'
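
Notice that the chat history in the prompt grows with each turn. Our memory was created with `max_token_limit=1000`, so older turns are eventually dropped. A rough sketch of that trimming logic follows. Counting whitespace-separated words is an assumed stand-in for real token counting, and `TokenBufferSketch` is a hypothetical class, not LangChain's implementation.

```python
from typing import List


class TokenBufferSketch:
    """Keep the most recent messages whose combined length fits a budget."""

    def __init__(self, max_token_limit: int) -> None:
        self.max_token_limit = max_token_limit
        self.messages: List[str] = []

    @staticmethod
    def count_tokens(message: str) -> int:
        # Assumption: one whitespace-separated word counts as one token.
        return len(message.split())

    def add(self, message: str) -> None:
        """Append a message, then drop the oldest ones until the buffer fits."""
        self.messages.append(message)
        while sum(self.count_tokens(m) for m in self.messages) > self.max_token_limit:
            self.messages.pop(0)


buffer = TokenBufferSketch(max_token_limit=12)
buffer.add("Human: Who directed the Barbie film?")   # 6 words
buffer.add("Assistant: Greta Gerwig directed it.")   # 5 words, total 11
buffer.add("Human: What is Barbenheimer?")           # 4 words, total 15 exceeds 12
print(buffer.messages)
```

Once the earliest question has been trimmed, a follow-up such as "Who did she produce the film with?" can no longer be resolved from the history, so the limit is a trade-off between prompt cost and conversational recall.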

**🛸 TASK**: Build your own `ConversationalRetrievalChain` using a different data source to index in a vector store. 

In [None]:
#Here is my own example...

### 📚 Data Augmented Generation: Chaining it all together

Great! We've learned about:

1. **Document Loaders**: Ways to load external data sources into our LLM application.
2. **Retrieval**: How to use a vector database to retrieve the most relevant datapoints from our external data sources.

### 🧐 A note on Evaluation

Throughout these exercises, responses from LLMs and chat models have not always been accurate. Evaluating LLM systems is still the wild west. There are some ways to evaluate components of a system, such as A/B testing prompts and examples, or using LLMs to evaluate the quality of their own responses.

To learn more about evaluating LLM applications, check out [this video of Josh Tobin discussing evaluation at the LLMs in Prod conference](https://www.youtube.com/watch?v=r-HUnht-Gns).
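
As a starting point, A/B testing prompts can be as simple as scoring each variant against a small set of questions with known answers. The sketch below is illustrative only: `stub_model` is a hypothetical stand-in for an LLM call, the two prompt variants and the test set are made up for the example, and substring matching is a crude, assumed metric.

```python
from typing import Callable, Dict, List

TEST_SET: List[Dict[str, str]] = [
    {"question": "Who directed the Barbie film?", "expected": "Greta Gerwig"},
    {"question": "Who directed Oppenheimer?", "expected": "Christopher Nolan"},
]

PROMPT_A = "Answer briefly: {question}"
PROMPT_B = (
    "Answer using the context. Context: Greta Gerwig directed Barbie; "
    "Christopher Nolan directed Oppenheimer. {question}"
)


def stub_model(prompt: str) -> str:
    """Hypothetical model: only 'knows' facts that appear in its prompt."""
    if "Barbie film" in prompt and "Greta Gerwig" in prompt:
        return "Greta Gerwig directed it."
    if "Oppenheimer?" in prompt and "Christopher Nolan" in prompt:
        return "Christopher Nolan directed it."
    return "I don't know."


def accuracy(template: str, model: Callable[[str], str]) -> float:
    """Fraction of test questions whose expected answer appears in the model output."""
    hits = 0
    for example in TEST_SET:
        output = model(template.format(question=example["question"]))
        hits += example["expected"].lower() in output.lower()
    return hits / len(TEST_SET)


scores = {"A": accuracy(PROMPT_A, stub_model), "B": accuracy(PROMPT_B, stub_model)}
print(scores)
```

With a real model you would replace `stub_model` with a call to your chain and use a larger test set. Substring matching will miss correct answers that are phrased differently, which is where LLM-based grading comes in.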

## 🍉 Conclusions

From this tutorial, you should have a sense of how to build LLM applications using LangChain. You've familiarised yourself with:

1. **Prompting.** How to make use of a prompt template, the benefits of few-shot prompting, and how to use format instructions to parse the output of a model.

2. **Chains & Agents.** How to write a simple LLMChain, how to build a sequential chain, and how to investigate agents.

3. **Data Augmented Generation.** How to use document loaders to load external data sources, how to create a vector database and store external data as embeddings, and how to build a QA chain using a vector database.

Let's see how we can build an LLM application around these areas in a practical use case, focused on innovation mapping. Please head to the `./llm_innovation_mapping.ipynb` notebook to follow along. 