SUMMARIZATION

https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html

In [None]:
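# Core LangChain pieces used in this notebook (legacy import paths).
from langchain import OpenAI, PromptTemplate, LLMChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.mapreduce import MapReduceChain

# temperature=0 keeps the model's output as repeatable as possible.
llm = OpenAI(temperature=0)

# Default splitter settings; used below to break the book into chunks.
text_splitter = CharacterTextSplitter()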

In [None]:
path = "/Users/jeffreydiament/Desktop/Naamah.txt"
with open(path) as f:
    book = f.read()
texts = text_splitter.split_text(book)

In [None]:
char_count = len(book)
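# split() with no argument splits on any whitespace run, so newlines and
# repeated spaces do not inflate the count.
word_count = len(book.split())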
texts_count = len(texts)

print(f"Character count: {char_count}")
print(f"Word count: {word_count}")
print(f"Texts count: {texts_count}")

In [None]:
from langchain.docstore.document import Document

docs = [Document(page_content=t) for t in texts]

In [None]:
from langchain.chains.summarize import load_summarize_chain

STUFFING CHAIN

Stuffing is the simplest method: all of the related data is placed into the prompt as context and passed to the language model in one go. This is implemented in LangChain as the StuffDocumentsChain.

Pros: Only makes a single call to the LLM. When generating text, the LLM has access to all the data at once.

Cons: Most LLMs have a limited context length, so for large documents (or many documents) this will not work because the resulting prompt is larger than the context length.

The main downside of this method is that it only works on smaller pieces of data. Once you are working with many pieces of data, this approach is no longer feasible. The next two approaches are designed to help deal with that.

-> Stuffing does not work on large data sets.
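
To make the single-call idea concrete, here is a minimal sketch of stuffing in plain Python. It is not the StuffDocumentsChain implementation: `fake_llm`, `stuff_summarize`, the prompt wording, and the `max_prompt_chars` budget are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: reports the prompt size."""
    return f"summary of {len(prompt)} characters"


def stuff_summarize(
    chunks: List[str],
    llm: Callable[[str], str],
    max_prompt_chars: int = 4000,  # assumed budget, not a real model limit
) -> str:
    """Join every chunk into one prompt and make a single model call."""
    prompt = "Write a concise summary of the following:\n\n" + "\n\n".join(chunks)
    if len(prompt) > max_prompt_chars:
        # This is the failure mode described above: too much data for one prompt.
        raise ValueError("prompt exceeds the budget; use map_reduce or refine")
    return llm(prompt)


short_result = stuff_summarize(["First chunk.", "Second chunk."], fake_llm)
print(short_result)
```

Passing a whole book through `stuff_summarize` would raise the `ValueError`, which mirrors why the stuff chain is only practical for small inputs.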

In [None]:
chain = load_summarize_chain(llm, chain_type="stuff")
chain.run(docs)

MAP REDUCE CHAIN

This method involves running an initial prompt on each chunk of data (for summarization tasks, this could be a summary of that chunk; for question-answering tasks, it could be an answer based solely on that chunk). Then a different prompt is run to combine all the initial outputs. This is implemented in LangChain as the MapReduceDocumentsChain.

Pros: Can scale to larger documents (and more documents) than StuffDocumentsChain. The calls to the LLM on individual documents are independent and can therefore be parallelized.

Cons: Requires many more calls to the LLM than StuffDocumentsChain. Loses some information during the final combined call.
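
The map and reduce steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the MapReduceDocumentsChain itself: `fake_llm`, `map_reduce_summarize`, and both prompt strings are hypothetical, and the library's chain may do extra work (such as collapsing intermediate summaries that are still too long) that is omitted here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

calls: List[str] = []  # records every prompt sent to the stand-in model


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call."""
    calls.append(prompt)
    return f"S({len(prompt)})"


def map_reduce_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    map_prompt = "Summarize this passage:\n\n{text}"
    reduce_prompt = "Combine these summaries into one:\n\n{text}"
    # Map step: each chunk is summarized independently, so calls can run in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial = list(pool.map(lambda c: llm(map_prompt.format(text=c)), chunks))
    # Reduce step: one more call merges the partial summaries.
    return llm(reduce_prompt.format(text="\n".join(partial)))


result = map_reduce_summarize(["chunk one", "chunk two", "chunk three"], fake_llm)
print(len(calls), result)
```

Three chunks cost four model calls here (three map calls plus one reduce call), which is the "many more calls" trade-off listed under Cons.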

In [None]:
chain = load_summarize_chain(llm, chain_type="map_reduce")
chain.run(docs)

REFINE CHAIN

This method involves running an initial prompt on the first chunk of data, generating some output. For the remaining documents, that output is passed in, along with the next document, asking the LLM to refine the output based on the new document.

Pros: Can pull in more relevant context, and may be less lossy than MapReduceDocumentsChain.

Cons: Requires many more calls to the LLM than StuffDocumentsChain. The calls are also NOT independent, meaning they cannot be parallelized like MapReduceDocumentsChain. There are also some potential dependencies on the ordering of the documents.
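
The sequential nature of refine is easiest to see in a small sketch. The code below is an assumption-laden illustration, not the library's refine chain: `fake_llm`, `refine_summarize`, and the prompt wording are hypothetical.

```python
from typing import Callable, List

calls: List[str] = []  # records every prompt sent to the stand-in model


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: returns a version tag."""
    calls.append(prompt)
    return f"v{len(calls)}"


def refine_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    if not chunks:
        raise ValueError("need at least one chunk")
    # The first chunk produces the initial summary.
    summary = llm("Summarize this passage:\n\n" + chunks[0])
    # Each later call needs the previous summary, so the loop cannot be parallelized.
    for chunk in chunks[1:]:
        summary = llm(
            "Existing summary:\n" + summary
            + "\n\nRefine it using this new passage:\n\n" + chunk
        )
    return summary


result = refine_summarize(["alpha", "beta", "gamma"], fake_llm)
print(result)
```

Because every prompt after the first embeds the previous output, reordering the chunks can change the final summary, which is the ordering dependency noted above.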

In [None]:
chain = load_summarize_chain(OpenAI(temperature=0), chain_type="refine", return_intermediate_steps=True)

chain({"input_documents": docs}, return_only_outputs=True)

QUESTION ANSWERING OVER DOCS

https://python.langchain.com/en/latest/use_cases/question_answering.html

QUICK OVERVIEW

In [None]:
# Load Your Documents
from langchain.document_loaders import TextLoader
loader = TextLoader(path)

In [None]:
# Create Your Index
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])

In [None]:
# Query Your Index
query = "Who is Bethel?"
index.query(query)

In [None]:
index.query_with_sources(query)

MORE DETAILS

In [None]:
# Load Your Documents
documents = loader.load()

In [None]:
# Next, we split the documents into chunks. Smaller chunks keep each retrieved passage
# well within the model's context limit (the exact limit depends on the model) and make
# retrieval more precise. We use the CharacterTextSplitter with a chunk size of 1000
# characters and no overlap between chunks.
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
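
To show what chunk_size and chunk_overlap control, here is a simplified fixed-width splitter. It is only a sketch: `split_fixed` is a hypothetical helper, and as far as we understand the CharacterTextSplitter splits on a separator first and then merges pieces, so its chunk boundaries will generally differ from these.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters, sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "a" * 2500
chunks = split_fixed(sample)                          # no overlap
overlapped = split_fixed(sample, chunk_overlap=200)   # windows share 200 characters
print([len(c) for c in chunks], [len(c) for c in overlapped])
```

With no overlap the chunks concatenate back to the original text; a non-zero overlap repeats some text across neighbouring chunks, which can help when an answer straddles a boundary.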

In [None]:
# We will then select which embeddings we want to use.
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
# We now create the vectorstore to use as the index.
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)

In [None]:
# That completes the index. Next, we expose it through a retriever interface.
retriever = db.as_retriever()
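
Conceptually, a retriever embeds the query and returns the stored chunks whose vectors are most similar to it. The sketch below illustrates that idea with a toy bag-of-words vector and cosine similarity. Everything in it (`embed`, `cosine`, `retrieve`, the sample sentences) is hypothetical; real embeddings such as OpenAIEmbeddings are dense learned vectors, and this is not how Chroma is implemented.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (an assumption made for illustration only)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


store = ["the cat sat on the mat", "bread is baked in an oven", "a cat chased the mouse"]
top = retrieve("where is the cat", store)
print(top)
```

The chunks returned by a step like this are what the question-answering chain receives as context, instead of the whole document.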

In [None]:
# Then, as before, we create a chain and use it to answer questions!
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="refine", retriever=retriever)

In [None]:
# Ask the same question as in the quick overview, this time through the explicit chain.
query = "Who is Bethel?"
qa.run(query)
