# Summarize via Refine

As the name implies, the Refine summarization method builds a summary iteratively:

  1. Split the source text into chunks.
  2. Ask the LLM to summarize the first chunk.
  3. Ask the LLM to improve that summary using the information in chunk 2.
  4. Repeat step 3 for each remaining chunk, up to chunk n.
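
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the control flow, not LangChain's implementation: `refine_summarize`, `fake_llm`, and the prompt wording are hypothetical names and text made up for this sketch, and `fake_llm` stands in for a real model.

```python
from typing import Callable, List


def refine_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Summarize the first chunk, then refine that summary with each later chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")

    # Step 2: the initial summary comes from the first chunk only.
    summary = llm(f"Summarize the following text:\n\n{chunks[0]}")

    # Steps 3-4: each remaining chunk updates the running summary, in order.
    for chunk in chunks[1:]:
        summary = llm(
            f"Existing summary:\n{summary}\n\n"
            f"New text:\n{chunk}\n\n"
            "Refine the summary to include the new text."
        )
    return summary


# Hypothetical stand-in for a real model: it only records the prompts it receives.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary v{len(calls)}"


final_summary = refine_summarize(["chunk one.", "chunk two.", "chunk three."], fake_llm)
print(final_summary)  # summary v3
```

Three chunks produce three model calls, and each refine prompt carries the summary returned by the call before it.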

In [None]:
!pip install langchain
!pip install langchain-community
!pip install gpt4all
# I use Google's Vertex AI to process the large document; uncomment only if you need it
#!pip install langchain-google-vertexai

## Import the model


In [1]:
from langchain_community.llms.gpt4all import GPT4All

# mistral-7b download available from the gpt4all website. Use the "Model Explorer"
# https://gpt4all.io/
llm = GPT4All(
    model="../../models/mistral-7b-openorca.Q4_0.gguf",
    max_tokens=1024,
)

## Load the data

I will be working with two datasets:

- `data/small-document.txt` - Paragraph from [this article](https://www.nature.com/articles/s41467-017-01082-6).
- `data/large-document.txt` - Full transcript of [this podcast](https://anchor.fm/s/74aab30/podcast/play/1593261/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fproduction%2F2018-9-22%2F5313967-44100-1-ae7cde1436c24.mp3).

In [1]:
from pathlib import Path

SMALL_DOC = Path("./data/small-document.txt").read_text(encoding="utf-8")
LARGE_DOC = Path("./data/large-document.txt").read_text(encoding="utf-8")

## Use LangChain's refine chain

LangChain implements this approach in its [RefineDocumentsChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.refine.RefineDocumentsChain.html).

> 🌞 Side Note: This code covers several concepts which I have not learned about yet. Once I have written lessons on them, I will come back and provide links to those learnings. 

### Prompt Design

The Refine approach requires two prompts:

* A prompt for summarizing the first chunk of text.
* A prompt for refining the summary based on new information.

In [2]:
from langchain.prompts import PromptTemplate

question_prompt = PromptTemplate.from_template(
    "I have taken the following text:\n\n"
    "TEXT: {text}\n\n"
    "And written a short, concise summary below. Let me know what you think!\n\n"
    "SUMMARY: "
)

refine_prompt = PromptTemplate.from_template(
    "Here is the concise summary from the last document:\n\n"
    "SUMMARY: {summary}\n\n"
    "But we have the following new information:\n\n"
    "TEXT: {text}\n\n"
    "So I have refined the original summary to include the new information.\n\n"
    "NEW SUMMARY: "
)
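
One detail worth checking is which placeholders each template expects, since the refine prompt here uses `{summary}` for the running summary while LangChain's default refine prompt uses `{existing_answer}`. The sketch below lists a template's placeholders with the standard library only. The `placeholders` helper is a hypothetical name for this example, and the template string is copied from the cell above.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the names of the {placeholders} used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Same template text as the refine prompt above.
refine_template = (
    "Here is the concise summary from the last document:\n\n"
    "SUMMARY: {summary}\n\n"
    "But we have the following new information:\n\n"
    "TEXT: {text}\n\n"
    "So I have refined the original summary to include the new information.\n\n"
    "NEW SUMMARY: "
)

refine_vars = placeholders(refine_template)
print(sorted(refine_vars))  # ['summary', 'text']
```

The `summary` name is the one passed as `initial_response_name` when the chain is built, so the two need to stay in sync.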

### Document Chunking

I chose to split the source text by sentence. As discussed in [this notebook](./2-summarize-via-map-reduce.ipynb), this is not an optimal approach. There will be a future lesson on better approaches to document splitting.

In [3]:
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=550,
    chunk_overlap=0,
    length_function=len,
)

# The splitter drops the trailing "." from each chunk, so append it again
# before wrapping each chunk in a Document.
small_doc_chunks = text_splitter.split_text(SMALL_DOC)
small_docs = [Document(page_content=t + ".") for t in small_doc_chunks]

large_doc_chunks = text_splitter.split_text(LARGE_DOC)
large_docs = [Document(page_content=t + ".") for t in large_doc_chunks]
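
To make the chunking idea concrete, here is a rough, dependency-free sketch of packing sentences into chunks under a character limit. It is not `CharacterTextSplitter`'s actual algorithm; `pack_sentences` is a hypothetical helper, and, like the cell above, it naively treats every `.` as a sentence boundary (so decimals and abbreviations would be split incorrectly).

```python
from typing import List


def pack_sentences(text: str, max_chars: int = 550) -> List[str]:
    """Greedily group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "One short sentence. Another short sentence. A third one."
packed = pack_sentences(sample, max_chars=40)
print(packed)
# ['One short sentence.', 'Another short sentence. A third one.']
```

A smaller limit means more chunks, and with Refine every extra chunk is one more sequential LLM call.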

### Refine Chain

In [None]:
from langchain.chains.summarize import load_summarize_chain

refine_chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",
    question_prompt=question_prompt,
    refine_prompt=refine_prompt,
    initial_response_name="summary",
    return_intermediate_steps=True,
    verbose=True,
)

### Run on the small document

This took 2m 35s on my local machine.

In [10]:
small_doc_summary = refine_chain.invoke({"input_documents": small_docs})

> Entering new RefineDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I have taken the following text:

TEXT: RNA is increasingly recognized as a powerful biomolecule for controlling gene expression and engineering synthetic cellular functions. One of the reasons for this is that natural and engineered RNA-based regulators are now available that can control almost every aspect of gene expression. In addition these regulatory functions can be enacted and tuned by the programmable formation of specific RNA structures, which mediate interactions with cellular machinery to perform gene regulation.

And written a short, concise summary below. Let me know what you think!

SUMMARY: 

> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
Here is the concise summary from the last document:

SUMMARY: 
RNA is becoming increasingly important in controlling gene expression and 

: 

In [8]:
import tabulate

left_column = [page.page_content for page in small_doc_summary["input_documents"]] + ["Final Summary"]
right_column = [out for out in small_doc_summary["intermediate_steps"]] + [small_doc_summary["output_text"]]

table_data = list(zip(left_column, right_column))
table = tabulate.tabulate(table_data, tablefmt='html')
table

Input chunk,Summary after this chunk
"RNA is increasingly recognized as a powerful biomolecule for controlling gene expression and engineering synthetic cellular functions. One of the reasons for this is that natural and engineered RNA-based regulators are now available that can control almost every aspect of gene expression. In addition these regulatory functions can be enacted and tuned by the programmable formation of specific RNA structures, which mediate interactions with cellular machinery to perform gene regulation","RNA is becoming increasingly important in controlling gene expression and engineering synthetic cellular functions due to the availability of natural and engineered RNA-based regulators that can control almost every aspect of gene expression. These regulatory functions are performed through programmable formation of specific RNA structures, which interact with cellular machinery for effective gene regulation."
"For example in bacteria, the formation of simple RNA structures such as hairpins within mRNAs can prevent their transcription and translation. Moreover, these cis-acting RNA structures can be further controlled through interacting with trans-acting small RNAs (sRNAs) or binding of a ligand, to prevent or allow their formation—in effect creating inducible genetic control elements","RNA is becoming increasingly important in controlling gene expression and engineering synthetic cellular functions due to the availability of natural and engineered RNA-based regulators that can control almost every aspect of gene expression. These regulatory functions are performed through programmable formation of specific RNA structures, which interact with cellular machinery for effective gene regulation. For example, in bacteria, simple RNA structures such as hairpins within mRNAs can prevent their transcription and translation. Moreover, these cis-acting RNA structures can be further controlled through interacting with trans-acting small RNAs (sRNAs) or binding of a ligand, to prevent or allow their formation—in effect creating inducible genetic control elements."
"This combination of versatile genetic regulation controlled by simple RNA structures creates the intriguing possibility of using nucleic acid design algorithms to create RNA regulators de novo. Thus RNA as a substrate for molecular programming has a potential major advantage over less designable protein regulators, and there is great promise for RNA synthetic biology to allow for the bottom up molecular-level design of genetic control systems","RNA is becoming increasingly important in controlling gene expression due to its versatile regulatory functions, which can be controlled through programmable formation of specific RNA structures that interact with cellular machinery for effective gene regulation. This has led to the possibility of using nucleic acid design algorithms to create RNA regulators de novo, offering a major advantage over less designable protein regulators and promising potential for RNA synthetic biology in designing genetic control systems at the molecular level."
Final Summary,"RNA is becoming increasingly important in controlling gene expression due to its versatile regulatory functions, which can be controlled through programmable formation of specific RNA structures that interact with cellular machinery for effective gene regulation. This has led to the possibility of using nucleic acid design algorithms to create RNA regulators de novo, offering a major advantage over less designable protein regulators and promising potential for RNA synthetic biology in designing genetic control systems at the molecular level."


### 💭 Thoughts on Small Document Summary

As I expected, there is some information loss in the summary, but no explicitly incorrect information.

One thing I like about this process is that it lets you summarize large files with LLMs that have small context windows. There would be significant information loss when summarizing a very large document, but there may be other use cases, such as extracting references from a large document.

Apart from the information loss, the other issue with the Refine chain is that it cannot be parallelized, since each new summarization depends on the previous one. So summarizing large documents will be significantly slower than stuffing or Map-Reduce (if the mapping stage is parallelized).
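
The difference in data dependencies can be sketched as follows. This is a simplified illustration, not how LangChain schedules its calls: `map_stage`, `refine_stage`, and the two `fake_*` functions are hypothetical names, and the fakes stand in for real LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_stage(chunks: List[str], summarize: Callable[[str], str]) -> List[str]:
    """Map-Reduce mapping stage: every chunk is summarized independently,
    so the calls can be issued concurrently."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(summarize, chunks))  # results keep chunk order


def refine_stage(chunks: List[str], refine: Callable[[str, str], str]) -> str:
    """Refine: call i needs the output of call i-1, so the loop is inherently serial."""
    summary = ""
    for chunk in chunks:
        summary = refine(summary, chunk)
    return summary


# Hypothetical stand-ins for LLM calls, used only to make the sketch runnable.
def fake_summarize(chunk: str) -> str:
    return chunk.upper()


def fake_refine(summary: str, chunk: str) -> str:
    return f"{summary} {chunk}".strip()


chunks = ["a", "b", "c"]
mapped = map_stage(chunks, fake_summarize)
refined = refine_stage(chunks, fake_refine)
print(mapped)   # ['A', 'B', 'C']
print(refined)  # a b c
```

In `map_stage` no call reads another call's result, which is what makes a thread pool possible. In `refine_stage` the `summary` variable threads through every iteration, so there is nothing to run side by side.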

### Run on the large document

I don't want to wait 45 minutes for this to run [like I did in the map-reduce notebook](./2-summarize-via-map-reduce.ipynb). So I'm cheating a bit by using Google's Gemini LLM. As always, you should be able to swap in any LLM of your choice; just keep in mind that prompts can vary quite a bit between LLMs.

In this example, I rely on LangChain's default prompts because Google's Gemini model has had some instruction tuning, which works well with the defaults. I have also excluded the cell outputs, since processing the large document produces a _lot_ of output. Feel free to run through it yourself!

In [22]:
from langchain.chains.summarize import refine_prompts

print(refine_prompts.PROMPT.template)

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:


In [23]:
print(refine_prompts.REFINE_PROMPT.template)

Your job is to produce a final summary.
We have provided an existing summary up to a certain point: {existing_answer}
We have the opportunity to refine the existing summary (only if needed) with some more context below.
------------
{text}
------------
Given the new context, refine the original summary.
If the context isn't useful, return the original summary.


In [8]:
import os
import getpass
from langchain_google_vertexai import VertexAI

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = Path(getpass.getpass("Path to google credentials json file")).absolute().as_posix()
google_project_id = getpass.getpass("Google Project ID")


llm = VertexAI(model_name="gemini-pro", project=google_project_id, max_output_tokens=1024)

refine_chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",
    return_intermediate_steps=True,
    verbose=True,
)

In [None]:
large_doc_summary = refine_chain.invoke({"input_documents": large_docs})

In [None]:
import tabulate

left_column = [page.page_content for page in large_doc_summary["input_documents"]] + ["Final Summary"]
right_column = [' '.join(out.split("- ")) for out in large_doc_summary["intermediate_steps"]] + [large_doc_summary["output_text"]]

table_data = list(zip(left_column, right_column))
table = tabulate.tabulate(table_data, tablefmt='html')
table

### 💭 Thoughts on Large Document Summary

1. Despite leveraging a cloud LLM, this still took about 11 minutes to run due to the size of the large document and the output limit I set (`max_output_tokens=1024`). It resulted in 74 calls to Google's Gemini model.
2. The final result was a combination of high-level themes found throughout the document and important details from the tail end. I think that with some proper prompt engineering, this method could produce a concise summary of a large document.
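
As a back-of-the-envelope check on the numbers in point 1, the average latency per call can be worked out directly. The two inputs are the approximate figures reported above. The `estimate_minutes` helper is a hypothetical name, and it assumes strictly sequential calls with roughly constant latency, which real runs will only approximate.

```python
total_minutes = 11  # approximate wall-clock time reported above
llm_calls = 74      # number of calls reported above

seconds_per_call = total_minutes * 60 / llm_calls
print(f"{seconds_per_call:.1f} s per call")  # 8.9 s per call


def estimate_minutes(n_calls: int, per_call_seconds: float = seconds_per_call) -> float:
    """Estimate total runtime in minutes for n sequential calls (assumes constant latency)."""
    return n_calls * per_call_seconds / 60


print(f"{estimate_minutes(150):.1f} min for 150 chunks")  # 22.3 min for 150 chunks
```

Because Refine issues one call per chunk and cannot overlap them, the runtime grows roughly linearly with the number of chunks.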