In [1]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# Main Use Cases
- Summarization - Express the most important facts about a body of text or chat interaction
- Question and Answering Over Documents - Use information held within documents to answer questions or queries
- Extraction - Pull structured data from a body of text or a user query
- Evaluation - Understand the quality of output from your application
- Querying Tabular Data - Pull data from databases or other tabular sources
- Code Understanding - Reason about and digest code
- Interacting with APIs - Query APIs and interact with the outside world
- Chatbots - A framework for back-and-forth interaction with a user, combined with memory, in a chat interface
- Agents - Use LLMs to make decisions about what to do next. Enable these decisions with tools.

## Summaries Of Short Text
For summaries of short texts, the method is straightforward. In fact, you don't need to do anything fancy beyond simple prompting with instructions.

In [2]:
from langchain.llms import OpenAI
from langchain import PromptTemplate

# Note: no model is named here, so the default ('text-davinci-003') is used. Pass model_name to OpenAI() if you want to change it later
llm = OpenAI(temperature=0)

# Create our template
template = """
%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:
{text}
"""

# Create a LangChain prompt template that we can insert values to later
prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

In [3]:
confusing_text = """
For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”
"""

In [4]:
print ("------- Prompt Begin -------")

final_prompt = prompt.format(text=confusing_text)
print(final_prompt)

print ("------- Prompt End -------")

------- Prompt Begin -------

%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:

For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”


------- Prompt End -------
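
For a simple template like this one, filling in the prompt behaves much like Python's own `str.format`. Here is a minimal sketch of that idea in plain Python, with no LangChain and no API key. The names `sketch_template` and `fill_template` and the sample sentence are my own, made up for illustration.

```python
# A plain-Python stand-in for the prompt template above.
sketch_template = """
%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:
{text}
"""


def fill_template(template: str, **values: str) -> str:
    """Substitute named placeholders such as {text} into the template."""
    return template.format(**values)


# Hypothetical sample text, only here to show the substitution
sample_prompt = fill_template(
    sketch_template,
    text="Scientists argued for a long time about what Prototaxites was.",
)
print(sample_prompt)
```

The point is only that the template stays fixed while the `{text}` slot changes from call to call.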


In [5]:
output = llm(final_prompt)
print (output)


For 130 years, people argued about what Prototaxites was. Some thought it was a lichen, some thought it was a fungus, and some thought it was a tree. But no one could agree. It was so big that it was hard to figure out what it was.


## Summaries Of Longer Text
Note: This method also works for short text.


In [6]:
from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = OpenAI(temperature=0)

In [7]:
with open('worked.txt', 'r') as file:
    text = file.read()

# Printing the first 285 characters as a preview
print (text[:285])

February 2021Before college the two main things I worked on, outside of school,
were writing and programming. I didn't write essays. I wrote what
beginning writers were supposed to write then, and probably still
are: short stories. My stories were awful. They had hardly any plot,
just


In [8]:
num_tokens = llm.get_num_tokens(text)

print (f"There are {num_tokens} tokens in your file")

There are 17716 tokens in your file


In [11]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=5000, chunk_overlap=350)
docs = text_splitter.create_documents([text])

print (f"You now have {len(docs)} docs instead of 1 piece of text")

You now have 16 docs instead of 1 piece of text
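
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a rough sketch of overlapping windows in plain Python. It is a simplification: the real `RecursiveCharacterTextSplitter` also tries to break on the separators you give it, which this toy version ignores. The function name `split_with_overlap` is my own.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows. Each window repeats the last
    chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Notice how each chunk starts with the last three characters of the one before it. That overlap is what keeps a sentence cut at a boundary from losing its context entirely.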


In [12]:
# Get your chain ready to use
chain = load_summarize_chain(llm=llm, chain_type='map_reduce') # verbose=True optional to see what is getting sent to the LLM


In [13]:
# Use it. This will run through the 16 documents, summarize the chunks, then get a summary of the summaries.
output = chain.run(docs)
print (output)

 This essay is a reflection on the author's experience with programming and writing before college, his journey to becoming an artist, and his experience with founding Y Combinator. He discusses the difficulties of writing essays, the advantages of being independent-minded in fields affected by rapid change, and the difficulties of leaving Y Combinator. He also acknowledges the help of several people in reading drafts of the essay.
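
The idea behind `map_reduce` can be sketched without any LLM at all: summarize each chunk on its own, then summarize the summaries. This is only the shape of the approach, not LangChain's implementation. `toy_summarize` is a hypothetical stand-in for the model call, and the sample chunks are made up.

```python
from typing import Callable, List


def map_reduce_summary(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize every chunk separately (map), then summarize the
    joined partial summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))              # reduce step


calls = []  # records every text sent to the stand-in "model"


def toy_summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep the first five words."""
    calls.append(text)
    return " ".join(text.split()[:5])


# Made-up chunks, only for illustration
sample_chunks = [
    "Chunk one is about writing short stories as a teenager.",
    "Chunk two is about learning to program on an early computer.",
    "Chunk three is about starting a company with friends.",
]
final_summary = map_reduce_summary(sample_chunks, toy_summarize)
print(f"{len(calls)} summarize calls -> {final_summary!r}")
```

With three chunks you get four calls: one per chunk plus one to combine them. That is why this chain type costs more calls than a single prompt, but it can handle text far larger than one prompt allows.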


## Simple Q&A Example
Here let's review the basic convention: llm(your context + your question) = your answer.

In [14]:
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

In [15]:
context = """
Rachel is 30 years old
Bob is 45 years old
Kevin is 65 years old
"""

question = "Who is under 40 years old?"

In [16]:
output = llm(context + question)

# I strip the text to remove the leading and trailing whitespace
print (output.strip())

Rachel is under 40 years old.


## Using Embeddings
I informally call what we're about to go through "The VectorStore Dance". It's the process of splitting your text, embedding the chunks, putting the embeddings in a DB, and then querying them.

In [17]:
from langchain import OpenAI

# The vectorstore we'll be using
from langchain.vectorstores import FAISS

# The LangChain component we'll use to get the documents
from langchain.chains import RetrievalQA

# The easy document loader for text
from langchain.document_loaders import TextLoader

# The embedding engine that will convert our text to vectors
from langchain.embeddings.openai import OpenAIEmbeddings

llm = OpenAI(temperature=0)

In [18]:
loader = TextLoader('worked.txt')
doc = loader.load()
print (f"You have {len(doc)} document")
print (f"You have {len(doc[0].page_content)} characters in that document")

You have 1 document
You have 74663 characters in that document


In [19]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=400)
docs = text_splitter.split_documents(doc)

In [20]:
# Get the total number of characters so we can see the average later
num_total_characters = sum([len(x.page_content) for x in docs])

print (f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f} characters (smaller pieces)")

Now you have 29 documents that have an average of 2,930 characters (smaller pieces)


In [21]:
# Get your embeddings engine ready
embeddings = OpenAIEmbeddings()

# Embed your documents and combine with the raw text in a pseudo db. Note: This will make an API call to OpenAI
docsearch = FAISS.from_documents(docs, embeddings)
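
If the vector store feels like a black box, here is a toy version of the same idea using only the standard library. It is a sketch under heavy assumptions: word counts stand in for real embeddings (which are dense float vectors from a model), a Python list stands in for FAISS, and the chunks are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real embeddings are dense float vectors."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up chunks, only for illustration
sample_chunks = [
    "painting still lifes in the bedroom",
    "writing software for an online store",
    "raising money for a startup",
]
# The "vector store": each chunk stored next to its vector
index = [(chunk, embed(chunk)) for chunk in sample_chunks]


def search(query: str, k: int = 1):
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vec = embed(query)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


print(search("who was writing software"))
# ['writing software for an online store']
```

The retriever in the next cells does the same kind of lookup: embed the question, find the nearest chunks, and hand those chunks to the LLM as context.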

In [30]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())
query = "What does the author describe as good work?"
qa.run(query)

' The author describes painting as good work.'

In [31]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="map_reduce", retriever=docsearch.as_retriever())
query = "What does the author describe as good work?"
qa.run(query)

' Working on Bel was described as hard but satisfying, feeling like the author was doing life right, and building things was described as exciting.'

In [32]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=docsearch.as_retriever())
query = "What does the author describe as good work?"
qa.run(query)

'\n\nThe author describes working on things that are not prestigious, but that are personally meaningful and satisfying, as good work. He suggests that when someone is drawn to a type of work despite its lack of prestige, it is a sign that there is something real to be discovered there and that the person has the right kind of motives. He speaks from personal experience, having gone from a PhD program in computer science to pursuing a career in art, while also working on his book On Lisp. He found that art was something that could last, and that he could make a living doing, and he was inspired to pursue it despite its lack of prestige. He emphasizes the importance of taking time to enjoy the process of working on something, even if it is difficult, as it can be a rewarding experience. He also emphasizes the importance of learning from the process, as he did when he was painting still lives in his bedroom at night, and learning to emphasize the visual cues that tell you something about

In [33]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="map_rerank", retriever=docsearch.as_retriever())
query = "What does the author describe as good work?"
qa.run(query)



" Working on things that aren't prestigious."

## Extraction

In [34]:
# To help construct our Chat Messages
from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate

# We will be using a chat model, defaults to gpt-3.5-turbo
from langchain.chat_models import ChatOpenAI

# To parse outputs and get structured data back
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

chat_model = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')

In [35]:
instructions = """
You will be given a sentence with fruit names, extract those fruit names and assign an emoji to them
Return the fruit name and emojis in a python dictionary
"""

fruit_names = """
Apple, Pear, this is an kiwi
""" 

In [36]:
# Make your prompt which combines the instructions w/ the fruit names
prompt = (instructions + fruit_names)

# Call the LLM
output = chat_model([HumanMessage(content=prompt)])

print (output.content)
print (type(output.content)) 

{
  "Apple": "🍎",
  "Pear": "🍐",
  "kiwi": "🥝"
}
<class 'str'>


In [37]:
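import ast

# literal_eval only accepts Python literals, so it is safer than eval() on model output
output_dict = ast.literal_eval(output.content)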

print (output_dict)
print (type(output_dict))

{'Apple': '🍎', 'Pear': '🍐', 'kiwi': '🥝'}
<class 'dict'>


## Using LangChain's Response Schema
LangChain's response schema does two things for us:

1. Autogenerate a prompt with bona fide format instructions. This is great because I don't need to worry about the prompt engineering side; I'll leave that up to LangChain!

2. Read the output from the LLM and turn it into a proper python object for me

Here I define the schema I want. I'm going to pull out the song and artist that a user wants to play from a pseudo chat message.

In [38]:
# The schema I want out
response_schemas = [
    ResponseSchema(name="artist", description="The name of the musical artist"),
    ResponseSchema(name="song", description="The name of the song that the artist plays")
]

# The parser that will look for the LLM output in my schema and return it back to me
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

In [39]:
# The format instructions that LangChain makes. Let's look at them
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```


In [40]:
# The prompt template that brings it all together
# Note: This is a different prompt template than before because we are using a Chat Model

prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("Given a command from the user, extract the artist and song names \n \
                                                    {format_instructions}\n{user_prompt}")  
    ],
    input_variables=["user_prompt"],
    partial_variables={"format_instructions": format_instructions}
)

In [41]:
fruit_query = prompt.format_prompt(user_prompt="I really like So Young by Portugal. The Man")
print (fruit_query.messages[0].content)

Given a command from the user, extract the artist and song names 
                                                     The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```
I really like So Young by Portugal. The Man


In [42]:
fruit_output = chat_model(fruit_query.to_messages())
output = output_parser.parse(fruit_output.content)

print (output)
print (type(output))

{'artist': 'Portugal. The Man', 'song': 'So Young'}
<class 'dict'>
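
To demystify the parsing step a little, here is a simplified sketch of the idea: find the fenced json block in the reply and decode it. This is my own approximation, not LangChain's actual parser, and the sample reply is written by hand rather than taken from a model. The fence is built from `"`" * 3` only so the example reads cleanly.

```python
import json
import re

FENCE = "`" * 3  # three backticks

# Match: fence + "json", then the payload, then the closing fence
JSON_BLOCK = re.compile(
    re.escape(FENCE) + r"json\s*(.*?)\s*" + re.escape(FENCE),
    flags=re.DOTALL,
)


def parse_json_block(llm_text: str) -> dict:
    """Pull the first fenced json block out of a reply and decode it."""
    match = JSON_BLOCK.search(llm_text)
    if match is None:
        raise ValueError("no fenced json block found in the model output")
    return json.loads(match.group(1))


# Hand-written reply in the format the instructions ask for
reply = (
    "Sure!\n"
    + FENCE
    + 'json\n{"artist": "Portugal. The Man", "song": "So Young"}\n'
    + FENCE
)
parsed = parse_json_block(reply)
print(parsed)
# {'artist': 'Portugal. The Man', 'song': 'So Young'}
```

Raising an error when no block is found matters in practice, because models do not always follow the format instructions.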


## Evaluation

Evaluation is the process of doing quality checks on the output of your applications. Normal, deterministic code has tests we can run, but judging the output of LLMs is more difficult because of the unpredictability and variability of natural language. LangChain provides tools that aid us in this journey.

In [43]:
# Embeddings, store, and retrieval
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Model and doc loader
from langchain import OpenAI
from langchain.document_loaders import TextLoader

# Eval!
from langchain.evaluation.qa import QAEvalChain

llm = OpenAI(temperature=0)

In [44]:
# Our long essay from before
loader = TextLoader('worked.txt')
doc = loader.load()

print (f"You have {len(doc)} document")
print (f"You have {len(doc[0].page_content)} characters in that document")

You have 1 document
You have 74663 characters in that document


In [45]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=400)
docs = text_splitter.split_documents(doc)

# Get the total number of characters so we can see the average later
num_total_characters = sum([len(x.page_content) for x in docs])

print (f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f} characters (smaller pieces)")


Now you have 29 documents that have an average of 2,930 characters (smaller pieces)


In [46]:

# Embeddings and docstore
embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_documents(docs, embeddings)

In [47]:
chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever(), input_key="question")

In [48]:
question_answers = [
    {'question' : "Which company sold the microcomputer kit that his friend built himself?", 'answer' : 'Healthkit'},
    {'question' : "What was the small city he talked about in the city that is the financial capital of USA?", 'answer' : 'Yorkville, NY'}
]

In [49]:
predictions = chain.apply(question_answers)
predictions

[{'question': 'Which company sold the microcomputer kit that his friend built himself?',
  'answer': 'Healthkit',
  'result': ' The microcomputer kit was sold by Heathkit.'},
 {'question': 'What was the small city he talked about in the city that is the financial capital of USA?',
  'answer': 'Yorkville, NY',
  'result': ' The small city he talked about is New York City, which is the financial capital of the United States.'}]

In [50]:
# Start your eval chain
eval_chain = QAEvalChain.from_llm(llm)

# Have it grade itself. The code below helps the eval_chain know where the different parts are
graded_outputs = eval_chain.evaluate(question_answers,
                                     predictions,
                                     question_key="question",
                                     prediction_key="result",
                                     answer_key='answer')

In [51]:
graded_outputs

[{'results': ' CORRECT'}, {'results': ' INCORRECT'}]
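
Once you have more than a couple of questions, you'll probably want a single score rather than a list of verdicts. A small sketch, assuming the graded output keeps the shape shown above (a list of dicts with a `'results'` key). The `accuracy` helper is my own.

```python
# Same shape as the graded output above
sample_graded = [{'results': ' CORRECT'}, {'results': ' INCORRECT'}]


def accuracy(graded) -> float:
    """Share of answers the grader marked CORRECT."""
    verdicts = [item['results'].strip().upper() for item in graded]
    if not verdicts:
        return 0.0
    return sum(1 for verdict in verdicts if verdict == 'CORRECT') / len(verdicts)


print(f"{accuracy(sample_graded):.0%} of answers graded correct")
# 50% of answers graded correct
```

The `.strip()` is there because the verdicts come back with a leading space.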

## Querying Tabular Data

The most common type of data in the world sits in tabular form (ok, ok, besides unstructured data). It is super powerful to be able to query this data with LangChain and pass it through to an LLM.



In [58]:
from langchain import OpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain.sql_database import SQLDatabase

llm = OpenAI(temperature=0)

In [59]:
sqlite_db_path = 'San_Francisco_Trees.db'
db = SQLDatabase.from_uri(f"sqlite:///{sqlite_db_path}")

In [60]:
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)



In [61]:
db_chain.run("How many Species of trees are there in San Francisco?")



> Entering new SQLDatabaseChain chain...
How many Species of trees are there in San Francisco?
SQLQuery:SELECT COUNT(DISTINCT "qSpecies") FROM "SFTrees";
SQLResult: [(578,)]
Answer:There are 578 Species of trees in San Francisco.
> Finished chain.


'There are 578 Species of trees in San Francisco.'

In [62]:
import sqlite3
import pandas as pd

# Connect to the SQLite database
connection = sqlite3.connect(sqlite_db_path)

# Define your SQL query
query = "SELECT count(distinct qSpecies) FROM SFTrees"

# Read the SQL query into a Pandas DataFrame
df = pd.read_sql_query(query, connection)

# Close the connection
connection.close()

In [63]:
# Display the result in the first column first cell
print(df.iloc[0,0])

578
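
If you don't have the trees database handy, you can try the same `COUNT(DISTINCT ...)` query against a tiny in-memory table. A minimal sketch: the table and column names (`SFTrees`, `qSpecies`) mirror the ones above, but the `TreeID` column and the rows are made up.

```python
import sqlite3

# In-memory database, so nothing is written to disk
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE SFTrees (TreeID INTEGER PRIMARY KEY, qSpecies TEXT)"
)

# Made-up rows: four trees, three distinct species
connection.executemany(
    "INSERT INTO SFTrees (qSpecies) VALUES (?)",
    [("Species A",), ("Species B",), ("Species A",), ("Species C",)],
)

(distinct_species,) = connection.execute(
    "SELECT COUNT(DISTINCT qSpecies) FROM SFTrees"
).fetchone()
connection.close()

print(distinct_species)
# 3
```

Running the query by hand like this is a cheap way to check that the SQL the LLM wrote actually answers the question you asked.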


## Code Understanding

In [64]:
# Helper to read local files
import os

# Vector Support
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

# Model and chain
from langchain.chat_models import ChatOpenAI

# Text splitters
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader

llm = ChatOpenAI(model_name='gpt-3.5-turbo')

In [65]:
embeddings = OpenAIEmbeddings(disallowed_special=())

In [66]:
root_dir = 'thefuzz'
docs = []

# Go through each folder
for dirpath, dirnames, filenames in os.walk(root_dir):
    
    # Go through each file
    for file in filenames:
        try: 
            # Load up the file as a doc and split
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            # Skip files that can't be read as UTF-8 text (binaries, images, etc.)
            pass

In [67]:
print (f"You have {len(docs)} documents\n")
print ("------ Start Document ------")
print (docs[0].page_content[:300])

You have 167 documents

------ Start Document ------
*.py[oc]

# Temp files
*~
~*
.*~
\#*
.#*
*#

# Build files
build
dist
pkg
*.egg
*.egg-info

# Debian Files
debian/files
debian/python-beaver*

# Sphinx build
doc/_build

# Generated man page
doc/aws_hostname.1

# tox
.tox

# Hypothesis - keep the examples database
.hypothesis/tmp
.hypothesis/unicode
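
As you can see, the first document is a `.gitignore`, not code. If you only want source files in your index, one option is to filter by extension while walking the folder. This is a sketch of that idea, not part of the original flow. `find_source_files` is my own helper, and the temporary folder with placeholder files only exists to make the example runnable.

```python
import os
import tempfile


def find_source_files(root_dir, extensions=(".py",)):
    """Walk root_dir and return the paths whose names end with one of the extensions."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(extensions):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Build a throwaway folder with placeholder files to try it out
with tempfile.TemporaryDirectory() as root:
    for name in (".gitignore", "fuzz.py", "README.rst"):
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder\n")
    found = [os.path.basename(path) for path in find_source_files(root)]

print(found)
# ['fuzz.py']
```

You would then pass each returned path to `TextLoader` instead of every file in the folder.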


In [68]:
docsearch = FAISS.from_documents(docs, embeddings)

In [69]:
# Get our retriever ready
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())

In [70]:
query = "What function do I use if I want to find the most similar item in a list of items?"
output = qa.run(query)

In [71]:
print(output)

You can use the `process.extractOne()` function from the `thefuzz` library to find the most similar item in a list of items. It takes a target string and a list of choices as input and returns the best match along with a similarity score. Here's an example:

```python
from thefuzz import process

choices = [
    "item1",
    "item2",
    "item3",
    "item4"
]

target = "item"

best_match = process.extractOne(target, choices)
print(best_match)
```

Output:
```
("item1", 90)
```

In this example, "item1" is the best match for the target string "item" with a similarity score of 90.


In [72]:
query = "Can you write the code to use the process.extractOne() function? Only respond with code. No other text or explanation"
output = qa.run(query)

In [73]:
print(output)

best = process.extractOne(query, choices)


## Interacting with APIs

LangChain's APIChain has the ability to read API documentation and understand which endpoint it needs to call.

In [75]:
from langchain.chains import APIChain
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

In [93]:
from langchain.chains.openai_functions.openapi import get_openapi_chain

chain = get_openapi_chain(
    "https://www.klarna.com/us/shopping/public/openai/v0/api-docs/"
)
chain("What are some options for a men's large blue button down shirt")

Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.


ValueError: Unable to parse spec from source https://www.klarna.com/us/shopping/public/openai/v0/api-docs/

In [89]:
from langchain.chains import APIChain
from langchain.chains.api import open_meteo_docs
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
chain = APIChain.from_llm_and_api_docs(
    llm,
    open_meteo_docs.OPEN_METEO_DOCS,
    verbose=True,
    limit_to_domains=["https://api.open-meteo.com/"],
)
chain.run(
    "What is the weather like right now in Seattle, USA in degrees Fahrenheit?"
)



> Entering new APIChain chain...
https://api.open-meteo.com/v1/forecast?latitude=47.6062&longitude=-122.3321&current_weather=true&temperature_unit=fahrenheit
{"latitude":47.595562,"longitude":-122.32442,"generationtime_ms":0.048995018005371094,"utc_offset_seconds":0,"timezone":"GMT","timezone_abbreviation":"GMT","elevation":59.0,"current_weather_units":{"time":"iso8601","interval":"seconds","temperature":"°F","windspeed":"km/h","winddirection":"°","is_day":"","weathercode":"wmo code"},"current_weather":{"time":"2023-11-21T19:00","interval":900,"temperature":47.6,"windspeed":8.0,"winddirection":188,"is_day":1,"weathercode":3}}

> Finished chain.


' The current weather in Seattle, USA is 47.6°F with a windspeed of 8 km/h and a wind direction of 188°. The weather code is 3.'
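
The interesting part of the log above is the URL the chain wrote for itself. To see that there's no magic in it, here is the same URL built by hand with the standard library. The parameter names and values are copied from the log. No request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.open-meteo.com/v1/forecast"

# Parameters copied from the URL in the chain's log
params = {
    "latitude": 47.6062,
    "longitude": -122.3321,
    "current_weather": "true",
    "temperature_unit": "fahrenheit",
}

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

What the chain adds is the step before this: reading the API docs and deciding which endpoint and parameters fit your question.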

## Chatbots

In [94]:
from langchain.llms import OpenAI
from langchain import LLMChain
from langchain.prompts.prompt import PromptTemplate

# Chat specific components
from langchain.memory import ConversationBufferMemory

In [95]:
template = """
You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it

{chat_history}
Human: {human_input}
Chatbot:"""

prompt = PromptTemplate(
    input_variables=["chat_history", "human_input"], 
    template=template
)
memory = ConversationBufferMemory(memory_key="chat_history")

In [97]:
llm_chain = LLMChain(
    llm=OpenAI(), 
    prompt=prompt, 
    verbose=True, 
    memory=memory
)

In [98]:
llm_chain.predict(human_input="Is an pear a fruit or vegetable?")



> Entering new LLMChain chain...
Prompt after formatting:

You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it


H: Is an pear a fruit or vegetable?
Chatbot:

> Finished chain.


" It's both, although I'm not sure how you'd make a salad out of it."

In [99]:
llm_chain.predict(human_input="What was one of the fruits I first asked you about?")



> Entering new LLMChain chain...
Prompt after formatting:

You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it

H: Is an pear a fruit or vegetable?
AI:  It's both, although I'm not sure how you'd make a salad out of it.
Human: What was one of the fruits I first asked you about?
Chatbot:

> Finished chain.


' An apple of course!'
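
You can see in the second prompt that the first exchange was pasted in where `{chat_history}` sits. Here is a rough sketch of that mechanism in plain Python. It is my own simplification, not LangChain's `ConversationBufferMemory`. The class name `BufferMemory`, the shortened template, and the sample replies are all made up.

```python
class BufferMemory:
    """Keeps every turn and renders them as 'Human: ... / AI: ...' lines."""

    def __init__(self):
        self.turns = []

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def render(self) -> str:
        return "\n".join(f"Human: {human}\nAI: {ai}" for human, ai in self.turns)


# Shortened, hypothetical version of the chatbot template
SKETCH_TEMPLATE = "{chat_history}\nHuman: {human_input}\nChatbot:"

memory = BufferMemory()

# First turn: the history is still empty
first_prompt = SKETCH_TEMPLATE.format(
    chat_history=memory.render(), human_input="Is a pear a fruit?"
)
memory.save("Is a pear a fruit?", "It's both.")  # made-up reply

# Second turn: the first exchange is now part of the prompt
second_prompt = SKETCH_TEMPLATE.format(
    chat_history=memory.render(), human_input="What did I ask about?"
)
print(second_prompt)
```

The model itself remembers nothing between calls. Everything it "recalls" has to be in the prompt, which is also why a buffer like this grows with every turn.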