# Summarize via Stuffing

"Stuffing" is a silly as it sounds -- essentially you "stuff" the large document into the input prompt.

In [None]:
!pip install langchain
!pip install langchain-community
!pip install gpt4all

## Import the model

In [1]:
from langchain_community.llms.gpt4all import GPT4All

# mistral-7b download available from the gpt4all website. Use the "Model Explorer"
# https://gpt4all.io/
llm = GPT4All(
    model="../../models/mistral-7b-openorca.Q4_0.gguf",
    max_tokens=1024,
)

## Load the data

I will be working with two datasets:

- `data/small-document.txt` - Paragraph from [this article](https://www.nature.com/articles/s41467-017-01082-6).
- `data/large-document.txt` - Full transcript of [this podcast](https://anchor.fm/s/74aab30/podcast/play/1593261/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fproduction%2F2018-9-22%2F5313967-44100-1-ae7cde1436c24.mp3).

In [2]:
from pathlib import Path

SMALL_DOC = Path("./data/small-document.txt").read_text()
LARGE_DOC = Path("./data/large-document.txt").read_text()

## Stuffing with a simple chain

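Stuffing needs nothing more than a `PromptTemplate` piped into the `llm`: the template wraps the document in a short instruction, and the model completes the text that follows.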

In [12]:
from langchain.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    "Here is the full document: \n\n"
    "{document}\n\n"
    "And here is the concise summary: \n\n"
)

summary = (prompt | llm).invoke({
    "document": SMALL_DOC
})

print("Summary: \n\n", summary, "\n\n")
print("Number of words in original document: ", len(SMALL_DOC.split()))
print("Number of words in summary: ", len(summary.split()))

Summary: 

 RNA is recognized as a powerful tool in controlling gene expression and engineering synthetic cellular functions due to its ability to create specific structures that can interact with cellular machinery. This has led to the development of RNA regulators capable of controlling almost every aspect of gene expression, which can be further controlled by small RNAs or ligand binding. The potential for using nucleic acid design algorithms to create de novo RNA regulators offers a major advantage over protein-based regulation and promises significant advancements in synthetic biology. 


Number of words in original document:  190
Number of words in summary:  87


Attempting to stuff the larger document should produce an error, since the document is larger than the model's context window (the maximum number of tokens it can process at once). I will learn more about tokenization in a future lesson; for now it's a bit of trial and error. Let's see what that error looks like:

In [6]:
summary = (prompt | llm).invoke({
    "document": LARGE_DOC
})

print("Summary: \n\n", summary, "\n\n")
print("Number of words in original document: ", len(LARGE_DOC.split()))
print("Number of words in summary: ", len(summary.split()))

Summary: 

 ERROR: The prompt size exceeds the context window size and cannot be processed. 


Number of words in original document:  7251
Number of words in summary:  13


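Huh. Worth noting that this did not raise an error in Python. Instead, the chain returned the error message as its output text, so the 13-word "summary" above is just the error itself. That makes the failure easy to miss. Curious how I should go about catching those errors properly in the future.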
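One possible way to surface this failure as a real exception is to inspect the returned text before using it. The sketch below is a minimal guard, assuming failures keep arriving as plain text that starts with `ERROR:` (as in the output above, which is observed behaviour rather than a documented contract). The names `LLMOutputError` and `ensure_llm_success` are hypothetical helpers, not part of LangChain or GPT4All.

```python
class LLMOutputError(RuntimeError):
    """Raised when the model returns error text instead of a completion."""


def ensure_llm_success(output: str, error_prefix: str = "ERROR:") -> str:
    """Return the output unchanged, or raise if it looks like an error message.

    Assumption: failures are reported as text starting with `error_prefix`.
    """
    if output.strip().startswith(error_prefix):
        raise LLMOutputError(output.strip())
    return output


# Example using the error text observed above
try:
    ensure_llm_success(
        " ERROR: The prompt size exceeds the context window size and cannot be processed."
    )
except LLMOutputError as exc:
    print("Caught:", exc)
```

In the notebook, the guard could wrap the chain call directly, for example `ensure_llm_success((prompt | llm).invoke({"document": LARGE_DOC}))`.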

### LangChain & Summarization Chains

It is worth noting that LangChain has a built-in chain for summarizing by stuffing: [StuffDocumentsChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.StuffDocumentsChain.html). I followed [this guide](https://www.comet.com/site/blog/mastering-document-chains-in-langchain/) to write the following code.

> 🌞 Side Note: This code covers several concepts which I have not learned about yet. Once I have written lessons on them, I will come back and provide links to those learnings.

In [13]:
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter

# Split the text on newlines, aiming for chunks of up to 512 characters
# with a 20-character overlap between neighbouring chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=20,
    length_function=len,
)

small_doc_chunks = text_splitter.split_text(SMALL_DOC)

docs = [Document(page_content=t) for t in small_doc_chunks]

stuff_chain = load_summarize_chain(
    llm,
    chain_type="stuff",
    prompt=prompt,
    document_variable_name="document",
)

output_summary = stuff_chain.run(docs)
print(output_summary)

RNA is recognized as a powerful tool in controlling gene expression and engineering synthetic cellular functions due to its ability to create specific structures that can interact with cellular machinery. This has led to the development of RNA regulators capable of controlling almost every aspect of gene expression, which can be further controlled by small RNAs or ligand binding. The potential for using nucleic acid design algorithms to create de novo RNA regulators offers a major advantage over protein-based regulation and promises significant advancements in synthetic biology.
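The output matches the simple chain because, conceptually, the "stuff" strategy just joins every chunk back together and places the result into one prompt. The sketch below illustrates that idea in plain Python. It is not LangChain's implementation: `stuff_documents` is a hypothetical helper, and the `"\n\n"` separator is an assumption about how chunks are joined.

```python
def stuff_documents(chunks: list[str], template: str, separator: str = "\n\n") -> str:
    """Join all chunks and place them into a single prompt string."""
    combined = separator.join(chunks)
    return template.format(document=combined)


# Same wording as the prompt template used earlier
template = (
    "Here is the full document: \n\n"
    "{document}\n\n"
    "And here is the concise summary: \n\n"
)

# Stand-in chunks for illustration
chunks = ["First chunk of text.", "Second chunk of text."]
stuffed_prompt = stuff_documents(chunks, template)
print(stuffed_prompt)
```

Since every chunk ends up in the same prompt, splitting the text first does not reduce the prompt size on its own.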


In [14]:
large_doc_chunks = text_splitter.split_text(LARGE_DOC)
docs = [Document(page_content=t) for t in large_doc_chunks]

output_summary = stuff_chain.run(docs)
print(output_summary)

ERROR: The prompt size exceeds the context window size and cannot be processed.
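Until I have covered tokenization properly, a rough pre-check can replace some of the trial and error. The sketch below estimates tokens from the word count using a rule of thumb of about 4/3 tokens per English word, which is only a heuristic and not the model's real tokenizer. The 2048-token context window is an assumed value, and reserving room for the output is a cautious assumption rather than a confirmed requirement; `estimate_tokens` and `fits_context` are hypothetical helper names.

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate from a whitespace word count (heuristic only)."""
    return round(len(text.split()) * tokens_per_word)


def fits_context(text: str, context_window: int, reserved_for_output: int) -> bool:
    """Check whether the estimated prompt plus the output budget fits the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


# Assumed window size; 1024 mirrors the max_tokens setting used earlier
CONTEXT_WINDOW = 2048
MAX_OUTPUT_TOKENS = 1024

small = "word " * 190    # stand-in for the 190-word document
large = "word " * 7251   # stand-in for the 7251-word document

print(fits_context(small, CONTEXT_WINDOW, MAX_OUTPUT_TOKENS))
print(fits_context(large, CONTEXT_WINDOW, MAX_OUTPUT_TOKENS))
```

Under these assumptions the small document fits and the large one does not, which lines up with the results above.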
