# Summarize via Map-Reduce

Map-Reduce has been a paradigm in programming for a long time. I was first introduced to it when I learned functional programming in 2013, but Google's MapReduce paper dates back to 2004. Here is a [link to a Cornell lecture](https://www.cs.cornell.edu/courses/cs3110/2014sp/lectures/5/map-fold-map-reduce) that dives deep into the paradigm if you are interested in learning more.

But for this notebook, it can be simplified to:
  * Iterate over the document via chunks and summarize each chunk (Map)
  * Combine the mini-summaries into the final summary (Reduce)
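The two steps above can be sketched in plain Python with the built-in `map` and `functools.reduce`. This is only an illustration of the paradigm: `split_into_chunks`, `toy_summarize`, and `combine` are hypothetical helpers, and `toy_summarize` stands in for a real LLM call by keeping the first three words of each chunk.

```python
from functools import reduce


def split_into_chunks(text: str, words_per_chunk: int) -> list[str]:
    """Break the text into chunks of at most `words_per_chunk` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def toy_summarize(chunk: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first three words."""
    return " ".join(chunk.split()[:3])


def combine(summary_so_far: str, next_summary: str) -> str:
    """Fold one more mini-summary into the running result."""
    return f"{summary_so_far} | {next_summary}"


document = "one two three four five six seven eight nine ten"
chunks = split_into_chunks(document, words_per_chunk=5)

mini_summaries = list(map(toy_summarize, chunks))  # Map: one summary per chunk
final_summary = reduce(combine, mini_summaries)    # Reduce: fold into one result

print(final_summary)  # one two three | six seven eight
```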

In [None]:
!pip install langchain
!pip install langchain-community
!pip install gpt4all

## Import the model

In [10]:
from langchain_community.llms.gpt4all import GPT4All

# mistral-7b download available from the gpt4all website. Use the "Model Explorer"
# https://gpt4all.io/
llm = GPT4All(
    model="../../models/mistral-7b-openorca.Q4_0.gguf",
    max_tokens=1024,
)

## Load the data

I will be working with two datasets:

- `data/small-document.txt` - Paragraph from [this article](https://www.nature.com/articles/s41467-017-01082-6).
- `data/large-document.txt` - Full transcript of [this podcast](https://anchor.fm/s/74aab30/podcast/play/1593261/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fproduction%2F2018-9-22%2F5313967-44100-1-ae7cde1436c24.mp3).

In [45]:
from pathlib import Path

SMALL_DOC = Path("./data/small-document.txt").read_text(encoding="utf-8")
LARGE_DOC = Path("./data/large-document.txt").read_text(encoding="utf-8")

## Use LangChain's map-reduce chain

LangChain offers a [MapReduceDocumentsChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.map_reduce.MapReduceDocumentsChain.html#langchain.chains.combine_documents.map_reduce.MapReduceDocumentsChain). I followed [this notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents_langchain.ipynb) to write the following code.

> 🌞 Side Note: This code covers several concepts which I have not learned about yet. Once I have written lessons on them, I will come back and provide links to those learnings. 

### Prompt Design

The MapReduce approach requires two prompts:

  * A prompt for summarizing each chunk of text from the original data.
  * A prompt for combining the summaries into a cohesive summary for the entire data.

In [69]:
from langchain.prompts import PromptTemplate

map_prompt = PromptTemplate.from_template(
    "I have taken the following text:\n\n"
    "TEXT:: {text}.\n\n"
    "And wrote a brief synopsis of the important information below. Let me know what you think!\n\n"
    "SUMMARY:: "
)

combine_prompt = PromptTemplate.from_template(
    "Here is the article: \n\n"
    "{text}\n\n"
    "I have written a concise summary of the article below. My summary is written in bullet point format. \n\n"
)

### Document Chunking

I chose to split the source text on periods, as a rough stand-in for sentence boundaries. This approach is very caveman-esque: given a line like `Something akin to Mr. Bean.`, the splitter treats `Something akin to Mr` and `Bean` as two separate sentences.

There have been many [long discussions](https://www.linkedin.com/pulse/very-long-discussion-legal-document-summarization-using-leonard-park/) about the best ways to split large documents. I recognize that my approach is not the best; perhaps sometime in the next 30 days I will do a deep dive on this topic.
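The `Mr. Bean` pitfall is easy to reproduce with plain string splitting. The sketch below first shows the naive split, then a slightly less naive regex that refuses to split after a short list of abbreviations. The `ABBREVIATIONS` list is a hypothetical example and far from complete, so treat this as a demonstration of the problem rather than a real sentence tokenizer.

```python
import re

text = "Something akin to Mr. Bean. He rarely speaks."

# Naive approach: every period is treated as a sentence boundary.
naive = [part.strip() for part in text.split(".") if part.strip()]

# Slightly less naive: split on a period followed by whitespace, unless the
# period comes right after a known abbreviation (hypothetical, incomplete list).
ABBREVIATIONS = ("Mr", "Mrs", "Dr")
lookbehinds = "".join(f"(?<!{re.escape(abbr)})" for abbr in ABBREVIATIONS)
boundary = re.compile(lookbehinds + r"\.\s+")
less_naive = boundary.split(text)

print(naive)       # ['Something akin to Mr', 'Bean', 'He rarely speaks']
print(less_naive)  # ['Something akin to Mr. Bean', 'He rarely speaks.']
```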

In [59]:
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=550,
    chunk_overlap=0,
    length_function=len,
)

small_doc_chunks = text_splitter.split_text(SMALL_DOC)
small_docs = [Document(page_content=t) for t in small_doc_chunks]

large_doc_chunks = text_splitter.split_text(LARGE_DOC)
large_docs = [Document(page_content=t) for t in large_doc_chunks]
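Note that the resulting chunks are not single sentences: the text is cut at every separator, and the pieces are then merged back together until a chunk would exceed `chunk_size`. The function below is a simplified approximation of that split-then-merge idea, not LangChain's actual implementation; `split_and_merge` is a hypothetical name and details such as whitespace handling may differ from the library.

```python
def split_and_merge(text: str, separator: str = ".", chunk_size: int = 550) -> list[str]:
    """Cut text at `separator`, then greedily merge pieces up to `chunk_size` characters."""
    pieces = [piece.strip() for piece in text.split(separator) if piece.strip()]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}{separator} {piece}" if current else piece
        if not current or len(candidate) <= chunk_size:
            # Either the chunk is empty (an oversized piece is kept whole)
            # or the piece still fits in the current chunk.
            current = candidate
        else:
            # The piece would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


example_chunks = split_and_merge("aaaa. bbbb. cccc. dddd.", chunk_size=10)
print(example_chunks)  # ['aaaa. bbbb', 'cccc. dddd']
```

As in the real output further down, the final period of each chunk is dropped because the separator is only re-inserted between merged pieces.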

### Map Reduce Chain

In [70]:
from langchain.chains.summarize import load_summarize_chain

map_reduce_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    return_intermediate_steps=True,
    verbose=True,
)
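Conceptually, the chain makes one model call per chunk and then one more call over the joined mini-summaries. The sketch below mimics that data flow with a fake model so it runs without LangChain. `run_map_reduce`, `fake_llm`, and the shortened templates are hypothetical; only the result keys (`input_documents`, `intermediate_steps`, `output_text`) mirror what this notebook reads from the real chain.

```python
from typing import Callable

# Shortened, hypothetical versions of the two prompts.
MAP_TEMPLATE = "TEXT:: {text}\n\nSUMMARY:: "
COMBINE_TEMPLATE = "Here is the article:\n\n{text}\n\nBullet point summary:\n\n"


def run_map_reduce(chunks: list[str], llm: Callable[[str], str]) -> dict:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    # Map: one LLM call per chunk.
    intermediate_steps = [llm(MAP_TEMPLATE.format(text=chunk)) for chunk in chunks]

    # Reduce: a single LLM call over all mini-summaries.
    combined = "\n\n".join(intermediate_steps)
    output_text = llm(COMBINE_TEMPLATE.format(text=combined))

    return {
        "input_documents": chunks,
        "intermediate_steps": intermediate_steps,
        "output_text": output_text,
    }


prompts_seen: list[str] = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the model: records the prompt and returns a tag."""
    prompts_seen.append(prompt)
    return f"<summary #{len(prompts_seen)}>"


result = run_map_reduce(["first chunk", "second chunk"], fake_llm)
print(result["intermediate_steps"])  # ['<summary #1>', '<summary #2>']
print(result["output_text"])         # <summary #3>
```

The reduce prompt contains every mini-summary at once, which is why its size grows with the number of chunks.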

### Run on the small document

This took 3 minutes on my local machine.

In [72]:
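# Calling the chain object directly is deprecated in recent LangChain releases; invoke() returns the same dict
map_reduce_outputs = map_reduce_chain.invoke({"input_documents": small_docs})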



> Entering new MapReduceDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I have taken the following text:

TEXT:: RNA is increasingly recognized as a powerful biomolecule for controlling gene expression and engineering synthetic cellular functions. One of the reasons for this is that natural and engineered RNA-based regulators are now available that can control almost every aspect of gene expression. In addition these regulatory functions can be enacted and tuned by the programmable formation of specific RNA structures, which mediate interactions with cellular machinery to perform gene regulation.

And wrote a brief synopsis of the important information below. Let me know what you think!

SUMMARY:: 
Prompt after formatting:
I have taken the following text:

TEXT:: For example in bacteria, the formation of simple RNA structures such as hairpins within mRNAs can prevent their transcription and translation.

In [74]:
import tabulate

left_column = [page.page_content for page in map_reduce_outputs["input_documents"]] + ["Final Summary"]
right_column = [out for out in map_reduce_outputs["intermediate_steps"]] + [map_reduce_outputs["output_text"]]

table_data = list(zip(left_column, right_column))
table = tabulate.tabulate(table_data, tablefmt='html')
table

| Input chunk | Summary |
| --- | --- |
| RNA is increasingly recognized as a powerful biomolecule for controlling gene expression and engineering synthetic cellular functions. One of the reasons for this is that natural and engineered RNA-based regulators are now available that can control almost every aspect of gene expression. In addition these regulatory functions can be enacted and tuned by the programmable formation of specific RNA structures, which mediate interactions with cellular machinery to perform gene regulation | RNA is becoming increasingly recognized for its potential in controlling gene expression and engineering synthetic cellular functions due to the availability of natural and engineered RNA-based regulators that can control almost every aspect of gene expression. These regulatory functions are performed through programmable formation of specific RNA structures, which mediate interactions with cellular machinery to carry out gene regulation. |
| For example in bacteria, the formation of simple RNA structures such as hairpins within mRNAs can prevent their transcription and translation. Moreover, these cis-acting RNA structures can be further controlled through interacting with trans-acting small RNAs (sRNAs) or binding of a ligand, to prevent or allow their formation—in effect creating inducible genetic control elements | In bacteria, RNA structures such as hairpins within mRNAs can regulate transcription and translation by preventing or allowing their formation. These cis-acting elements are controlled through interactions with small RNAs (sRNAs) or ligand binding, creating inducible genetic control elements. |
| This combination of versatile genetic regulation controlled by simple RNA structures creates the intriguing possibility of using nucleic acid design algorithms to create RNA regulators de novo. Thus RNA as a substrate for molecular programming has a potential major advantage over less designable protein regulators, and there is great promise for RNA synthetic biology to allow for the bottom up molecular-level design of genetic control systems | The text discusses the potential use of nucleic acid design algorithms in creating RNA regulators de novo, which could have significant advantages over protein-based genetic control systems due to its versatile genetic regulation controlled by simple RNA structures. This suggests a promising future for RNA synthetic biology and molecular programming at the bottom up level. |
| Final Summary | 1. RNA-based regulators are becoming increasingly recognized due to their potential in controlling gene expression and engineering synthetic cellular functions. 2. These regulatory functions are performed through programmable formation of specific RNA structures, which mediate interactions with cellular machinery to carry out gene regulation. 3. In bacteria, RNA structures such as hairpins within mRNAs can regulate transcription and translation by preventing or allowing their formation. 4. These cis-acting elements are controlled through interactions with small RNAs (sRNAs) or ligand binding, creating inducible genetic control elements. 5. The potential use of nucleic acid design algorithms in creating RNA regulators de novo could have significant advantages over protein-based genetic control systems due to its versatile genetic regulation controlled by simple RNA structures. 6. This suggests a promising future for RNA synthetic biology and molecular programming at the bottom up level. |


### Run on the large document

> ⚠ Caution! This took 47 minutes to run on my local machine with the suggested model. Probably not worth your time since it fails in the end 😊

In [None]:
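# Calling the chain object directly is deprecated in recent LangChain releases; invoke() returns the same dict
map_reduce_outputs = map_reduce_chain.invoke({"input_documents": large_docs})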

As I suspected, the chain summarized each chunk of text but failed in the final `Reduce` step.
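The notebook does not show the error, so the cause is an assumption here: one plausible explanation is that the joined mini-summaries of a long transcript no longer fit in the model's context window. A common workaround is to reduce in several passes, summarizing groups of mini-summaries until the remainder fits. The sketch below illustrates that idea with a character budget as a rough proxy for tokens; `collapse_then_reduce` and `toy_summarize` are hypothetical names, and this is not the code the chain above runs.

```python
from typing import Callable


def collapse_then_reduce(
    summaries: list[str],
    summarize: Callable[[str], str],
    max_chars: int,
) -> str:
    """Summarize groups of summaries repeatedly until they fit in `max_chars`."""
    while len(summaries) > 1 and sum(len(s) for s in summaries) > max_chars:
        # Pack summaries into groups that each stay within the budget.
        groups: list[list[str]] = []
        current: list[str] = []
        for summary in summaries:
            if current and sum(len(s) for s in current) + len(summary) > max_chars:
                groups.append(current)
                current = []
            current.append(summary)
        groups.append(current)

        if len(groups) == len(summaries):
            break  # No progress possible: every summary fills the budget alone.

        # One LLM call per group shrinks the list for the next pass.
        summaries = [summarize("\n".join(group)) for group in groups]

    # Final reduce over whatever is left.
    return summarize("\n".join(summaries))


calls: list[str] = []


def toy_summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep the first 10 characters."""
    calls.append(text)
    return text[:10]


final = collapse_then_reduce(["a" * 30] * 8, toy_summarize, max_chars=100)
print(len(calls), final)  # 4 aaaaaaaaaa
```

In this toy run, the eight mini-summaries are packed into three groups (three calls), and a fourth call produces the final summary.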