# LangChain Cookbook

### Summarization

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

In [2]:
hugging_face_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
langchain_token = os.getenv("LANGCHAIN_API_KEY")
serp_token = os.getenv("SERPAPI_API_KEY")

In [3]:
from langchain_huggingface import HuggingFaceEndpoint

repo_id = "mistralai/Mistral-7B-Instruct-v0.2"


llm = HuggingFaceEndpoint(repo_id=repo_id,
                          huggingfacehub_api_token=hugging_face_token,
                          temperature=0.1)



### Summaries Of Short Text

For summaries of short texts, the method is straightforward. You don't need anything fancier than a simple prompt with instructions.

In [4]:
from langchain_core.prompts import PromptTemplate
template = """ 
    %INSTRUCTIONS:
    Please summarize the following piece of text.
    Respond in a manner that a 5 year old would understand.
    
    %TEXT:
    {text}
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)
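
To make the templating step concrete, here is a minimal sketch that uses only Python's built-in `str.format` instead of `PromptTemplate`. The helper name `render_prompt` and the sample sentence are hypothetical and only illustrate how the `{text}` placeholder gets filled; this is not how LangChain implements it internally.

```python
# Stand-in for prompt templating using only the standard library.
TEMPLATE = """
%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:
{text}
"""


def render_prompt(template: str, **variables: str) -> str:
    """Fill every {placeholder} in the template.

    str.format raises KeyError if a placeholder has no matching variable.
    """
    return template.format(**variables)


# Hypothetical sample input, not taken from the notebook's data.
rendered = render_prompt(TEMPLATE, text="Prototaxites puzzled scientists for 130 years.")
print(rendered)
```

Declaring the input variables up front, as `PromptTemplate` does, makes a missing variable easier to spot before the prompt is sent to the model.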

In [5]:
confusing_text = """
For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”
"""

In [6]:
print("------- Prompt Begin -------")

final_prompt = prompt.format(text=confusing_text)
print(final_prompt)

print("------- Prompt End -------")


------- Prompt Begin -------
 
    %INSTRUCTIONS:
    Please summarize the following piece of text.
    Respond in a manner that a 5 year old would understand.
    
    %TEXT:
    
For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”


------- Prompt End -------


In [7]:
output = llm.invoke(final_prompt)
print(output)

    %SUMMARY:
    A long time ago, people argued about what a big, strange thing called Prototaxites was. Some thought it was a kind of plant called a lichen, others thought it was a different kind of plant called a fungus, and some thought it was a tree. But no one could really agree because it looked like lots of things, and it was really big, so people got mad when others suggested their ideas.


### Summaries of Longer Text

Note: This method also works for short text.

In [11]:
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

Let's load up a longer document

In [8]:
with open('../../data/good.txt', 'r') as file:
    text = file.read()
    
print(text[:285])

April 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we started Y Combinator we came up with the
phrase that became our motto: Make something people want.  We've
learned a lot since then, but if I were choosing now that's still
the one I'd pick.


Let's check how many tokens we have in the text

In [9]:
num_tokens = llm.get_num_tokens(text)
print(f"There are {num_tokens} tokens in my file")

Token indices sequence length is longer than the specified maximum sequence length for this model (3977 > 1024). Running this sequence through the model will result in indexing errors


There are 3977 tokens in my file
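
The token count matters because it decides whether the whole text fits into a single prompt. The sketch below shows that check with plain arithmetic. The function names, the 4-characters-per-token ratio, and the 8192-token window are assumptions for illustration, not properties of the Mistral endpoint. The 1024 limit in the warning above probably belongs to the tokenizer used for counting rather than to the model itself; treat that as an assumption too.

```python
import math


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough estimate; the characters-per-token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(num_tokens: int, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the prompt plus the space reserved for the answer fits the window."""
    return num_tokens + reserved_for_output <= context_window


# 3977 is the count reported above; both window sizes are hypothetical.
print(fits_in_context(3977, context_window=1024))
print(fits_in_context(3977, context_window=8192, reserved_for_output=512))
```

When the check fails, the text has to be split into chunks before summarizing.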


Let's split the text into smaller chunks

In [13]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=5000, chunk_overlap=350)
docs = text_splitter.create_documents([text])

print(f"We now have {len(docs)} docs instead of 1 piece of text")

We now have 4 docs instead of 1 piece of text
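
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one also tries to break on the given separators); the function name `split_with_overlap` is hypothetical, and the sketch only shows how consecutive chunks share characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some repeated text.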


Create the summarize chain

In [19]:
chain = load_summarize_chain(llm=llm, chain_type="map_reduce", verbose=True)
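
The `map_reduce` chain type summarizes each chunk on its own and then summarizes the summaries. The sketch below shows that control flow with a stand-in for the model: `first_sentence` is a hypothetical placeholder for an LLM call, and `map_reduce_summarize` is an illustrative name, not a LangChain function.

```python
from typing import Callable


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    head = text.split(".")[0].strip()
    return head + "." if head else ""


def map_reduce_summarize(chunks: list[str], summarize: Callable[[str], str]) -> str:
    # Map step: summarize every chunk independently.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: summarize the concatenated partial summaries.
    combined = "\n".join(partial_summaries)
    return summarize(combined)


chunks = [
    "Make something people want. We have learned a lot.",
    "Do not worry about money at first. It comes later.",
]
partials = [first_sentence(chunk) for chunk in chunks]
final = map_reduce_summarize(chunks, first_sentence)
print(partials)
print(final)
```

Because the map step treats chunks independently, no single call has to fit the whole document into its context window.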

In [20]:
output = chain.run(docs)
print(output)

  warn_deprecated(


> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:


"April 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we started Y Combinator we came up with the
phrase that became our motto: Make something people want.  We've
learned a lot since then, but if I were choosing now that's still
the one I'd pick.Another thing we tell founders is not to worry too much about the
business model, at least at first.  Not because making money is
unimportant, but because it's so much easier than building something
great.A couple weeks ago I realized that if you put those two ideas
together, you get something surprising.  Make something people want.
Don't worry too much about making money.  What you've got is a
description of a charity.When you get an unexpected result like this, it could either be a
bug or a new discovery.  Either businesse

### Question Answering Using Documents as Context

In [22]:
context = """
Rachel is 30 years old
Bob is 45 years old
Kevin is 65 years old
"""

question = "Who is under 40 years old?"

In [24]:
output = llm.invoke(context + question)
print(output.strip())

Rachel is the only one under 40 years old.
Here's the reasoning:
1. Rachel is 30 years old.
2. Bob is 45 years old.
3. Kevin is 65 years old.
4. To find out who is under 40 years old, we need to identify the person whose age is less than 40.
5. Rachel's age is 30, which is less than 40.
6. Therefore, Rachel is the only one under 40 years old.


#### Using Embeddings

In [25]:
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEmbeddings

In [27]:
loader = TextLoader('../../data/worked.txt')
doc = loader.load()

print(f"You have {len(doc)} documents")
print(f"You have {len(doc[0].page_content)} characters in the first document")

You have 1 documents
You have 74677 characters in the first document


Split the text into smaller pieces

In [29]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=400)
docs = text_splitter.split_documents(doc)

In [30]:
num_total_characters = sum([len(x.page_content) for x in docs])
print(f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f}  characters (smaller pieces)")

Now you have 29 documents that have an average of 2,931  characters (smaller pieces)


Embed the chunks and store the vectors in a FAISS index

In [31]:
embeddings = HuggingFaceEmbeddings()
docsearch = FAISS.from_documents(docs, embeddings)

  warn_deprecated(
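
Conceptually, the vector store turns each chunk into a vector and later returns the chunks whose vectors are closest to the query's. The sketch below imitates that with a toy bag-of-words "embedding" and cosine similarity from the standard library. The function names and sample chunks are hypothetical; a real embedding model returns dense float vectors, and FAISS uses optimized index structures rather than a full sort.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (an assumption for illustration)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical sample chunks, not taken from worked.txt.
sample_chunks = [
    "good work is work that lasts",
    "the weather was cold in winter",
    "painting is work you can make a living from",
]
best = top_k("what is good work", sample_chunks, k=1)
print(best)  # ['good work is work that lasts']
```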


Create the retrieval engine

In [32]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())
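
With `chain_type="stuff"`, the retrieved chunks are placed ("stuffed") into a single prompt together with the question. A minimal sketch of that assembly step follows. The function name `build_stuff_prompt` and the prompt wording are assumptions for illustration and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks standing in for the retriever's output.
stuffed = build_stuff_prompt(
    "What does the author describe as good work?",
    ["Chunk one about painting.", "Chunk two about prestige."],
)
print(stuffed)
```

This approach needs only one model call, but all retrieved chunks plus the question must fit into the model's context window.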

Ask a question

In [34]:
query = "What does the author describe as good work?"
qa.invoke(query)

{'query': 'What does the author describe as good work?',
 'result': " The author describes good work as something that lasts and can be made a living from. He specifically mentions painting as an example, but he also values work that is independent and not reliant on impressing others or being prestigious. He believes that working on things that aren't prestigious can lead to discovering something real and having the right motives."}

### Extraction

Here we want to parse structured data out of a piece of text or a document.