# Summarize via Map-Reduce

Map-Reduce has been a programming paradigm for a long time. I was first introduced to it when I learned functional programming in 2013, but Google's papers on it go back to 2004. Here is a [link to a Cornell lecture](https://www.cs.cornell.edu/courses/cs3110/2014sp/lectures/5/map-fold-map-reduce) that dives deep into the Map-Reduce paradigm if you are interested in learning more.

But for this notebook, it can be simplified to:
  * Iterate over the document via chunks and summarize each chunk (Map)
  * Combine the mini-summaries into the final summary (Reduce)
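
To make the two steps concrete, here is a minimal sketch in plain Python with no LLM involved. The helpers `split_into_chunks`, `summarize_chunk`, and `combine` are hypothetical stand-ins (the "summary" is just the first sentence of each chunk), and the sample `document` is made up for illustration. It folds the mini-summaries together pairwise, whereas the chain used later in this notebook places all of them into a single combine prompt.

```python
from functools import reduce


def split_into_chunks(sentences, sentences_per_chunk):
    """Group a list of sentences into consecutive chunks."""
    return [
        sentences[i:i + sentences_per_chunk]
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def summarize_chunk(chunk):
    """Stand-in for an LLM call: keep only the first sentence of the chunk."""
    return chunk[0]


def combine(summary_so_far, mini_summary):
    """Stand-in for the final LLM call: append the next mini-summary."""
    return f"{summary_so_far} {mini_summary}".strip()


# Made-up document, already split into sentences.
document = ["A one.", "A two.", "B one.", "B two.", "C one."]

chunks = split_into_chunks(document, sentences_per_chunk=2)
mini_summaries = list(map(summarize_chunk, chunks))   # Map
final_summary = reduce(combine, mini_summaries, "")   # Reduce

print(final_summary)
```

Swapping the two stand-ins for real model calls gives the same shape as the LangChain chain below: one call per chunk, then one more pass over the results.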

In [None]:
!pip install langchain
!pip install langchain-community
!pip install gpt4all

## Import the model

In [10]:
from langchain_community.llms.gpt4all import GPT4All

# mistral-7b download available from the gpt4all website. Use the "Model Explorer"
# https://gpt4all.io/
llm = GPT4All(
    model="../../models/mistral-7b-openorca.Q4_0.gguf",
    max_tokens=1024,
)

## Load the data

I will be working with two datasets:

- `data/small-document.txt` - Paragraph from [this article](https://www.nature.com/articles/s41467-017-01082-6).
- `data/large-document.txt` - Full transcript of [this podcast](https://anchor.fm/s/74aab30/podcast/play/1593261/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fproduction%2F2018-9-22%2F5313967-44100-1-ae7cde1436c24.mp3).

In [45]:
from pathlib import Path

SMALL_DOC = Path("./data/small-document.txt").read_text(encoding="utf-8")
LARGE_DOC = Path("./data/large-document.txt").read_text(encoding="utf-8")

## Use LangChain's map-reduce chain

LangChain offers a [MapReduceDocumentsChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.map_reduce.MapReduceDocumentsChain.html#langchain.chains.combine_documents.map_reduce.MapReduceDocumentsChain). I followed [this notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents_langchain.ipynb) to write the following code.

> 🌞 Side Note: This code covers several concepts which I have not learned about yet. Once I have written lessons on them, I will come back and provide links to those learnings.

### Prompt Design

The MapReduce approach requires two prompts:

  * A prompt for summarizing each chunk of text from the original data.
  * A prompt for combining the summaries into a cohesive summary for the entire data.

In [69]:
from langchain.prompts import PromptTemplate

map_prompt = PromptTemplate.from_template(
    "I have taken the following text:\n\n"
    "TEXT:: {text}.\n\n"
    "And wrote a brief synopsis of the important information below. Let me know what you think!\n\n"
    "SUMMARY:: "
)

combine_prompt = PromptTemplate.from_template(
    "Here is the article: \n\n"
    "{text}\n\n"
    "I have written a concise summary of the article below. My summary is written in bullet point format. \n\n"
)

### Document Chunking

I chose to split the source text by sentence, but this approach is very caveman-esque: it breaks on every period, so a line like `Something akin to Mr. Bean.` would be split into `Something akin to Mr` and `Bean` as if they were two separate sentences.

There have been many [long discussions](https://www.linkedin.com/pulse/very-long-discussion-legal-document-summarization-using-leonard-park/) about the best ways to split large documents. I recognize that my approach is not the best; perhaps sometime in the next 30 days I will do a deep dive on this topic.
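
The period problem is easy to reproduce without LangChain. The sketch below first splits on every period, then shows one possible refinement that skips periods following a known abbreviation. The `ABBREVIATIONS` tuple and the `split_sentences` helper are hypothetical names of my own, not part of any library, and the abbreviation list is far from complete.

```python
import re

text = "Something akin to Mr. Bean. He rarely speaks."

# Naive approach: treat every period as a sentence boundary.
naive = [piece.strip() for piece in text.split(".") if piece.strip()]

# Hypothetical refinement: ignore periods that follow a known abbreviation.
ABBREVIATIONS = ("Mr", "Mrs", "Ms", "Dr")


def split_sentences(text):
    """Split on periods, except those directly after an abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        preceding_words = text[start:match.start()].split()
        word_before = preceding_words[-1] if preceding_words else ""
        if word_before in ABBREVIATIONS:
            continue  # this period belongs to the abbreviation
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(naive)
print(split_sentences(text))
```

The naive version cuts `Mr` and `Bean` apart, while the refined one keeps `Something akin to Mr. Bean.` together. It still misses plenty of cases (decimals, initials, ellipses), which is part of why document splitting gets such long discussions.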

In [59]:
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=550,
    chunk_overlap=0,
    length_function=len,
)

small_doc_chunks = text_splitter.split_text(SMALL_DOC)
small_docs = [Document(page_content=t) for t in small_doc_chunks]

large_doc_chunks = text_splitter.split_text(LARGE_DOC)
large_docs = [Document(page_content=t) for t in large_doc_chunks]

### Map Reduce Chain

In [70]:
from langchain.chains.summarize import load_summarize_chain

map_reduce_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    return_intermediate_steps=True,
    verbose=True,
)

### Run on the small document

This took 3 minutes on my local machine.

In [72]:
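# Calling the chain object directly is deprecated in recent LangChain releases;
# invoke() returns the same dict of inputs and outputs.
map_reduce_outputs = map_reduce_chain.invoke({"input_documents": small_docs})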



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I have taken the following text:

TEXT:: RNA is increasingly recognized as a powerful biomolecule for controlling gene expression and engineering synthetic cellular functions. One of the reasons for this is that natural and engineered RNA-based regulators are now available that can control almost every aspect of gene expression. In addition these regulatory functions can be enacted and tuned by the programmable formation of specific RNA structures, which mediate interactions with cellular machinery to perform gene regulation.

And wrote a brief synopsis of the important information below. Let me know what you think!

SUMMARY:: 
Prompt after formatting:
I have taken the following text:

TEXT:: For example in bacteria, the formation of simple RNA structures such as hairpins within mRNAs can prevent their transcription and translation.

In [74]:
import tabulate

left_column = [page.page_content for page in map_reduce_outputs["input_documents"]] + ["Final Summary"]
right_column = [out for out in map_reduce_outputs["intermediate_steps"]] + [map_reduce_outputs["output_text"]]

table_data = list(zip(left_column, right_column))
table = tabulate.tabulate(table_data, tablefmt='html')
table

Source chunk,Summary
"RNA is increasingly recognized as a powerful biomolecule for controlling gene expression and engineering synthetic cellular functions. One of the reasons for this is that natural and engineered RNA-based regulators are now available that can control almost every aspect of gene expression. In addition these regulatory functions can be enacted and tuned by the programmable formation of specific RNA structures, which mediate interactions with cellular machinery to perform gene regulation","RNA is becoming increasingly recognized for its potential in controlling gene expression and engineering synthetic cellular functions due to the availability of natural and engineered RNA-based regulators that can control almost every aspect of gene expression. These regulatory functions are performed through programmable formation of specific RNA structures, which mediate interactions with cellular machinery to carry out gene regulation."
"For example in bacteria, the formation of simple RNA structures such as hairpins within mRNAs can prevent their transcription and translation. Moreover, these cis-acting RNA structures can be further controlled through interacting with trans-acting small RNAs (sRNAs) or binding of a ligand, to prevent or allow their formation—in effect creating inducible genetic control elements","In bacteria, RNA structures such as hairpins within mRNAs can regulate transcription and translation by preventing or allowing their formation. These cis-acting elements are controlled through interactions with small RNAs (sRNAs) or ligand binding, creating inducible genetic control elements."
"This combination of versatile genetic regulation controlled by simple RNA structures creates the intriguing possibility of using nucleic acid design algorithms to create RNA regulators de novo. Thus RNA as a substrate for molecular programming has a potential major advantage over less designable protein regulators, and there is great promise for RNA synthetic biology to allow for the bottom up molecular-level design of genetic control systems","The text discusses the potential use of nucleic acid design algorithms in creating RNA regulators de novo, which could have significant advantages over protein-based genetic control systems due to its versatile genetic regulation controlled by simple RNA structures. This suggests a promising future for RNA synthetic biology and molecular programming at the bottom up level."
Final Summary,"1. RNA-based regulators are becoming increasingly recognized due to their potential in controlling gene expression and engineering synthetic cellular functions. 2. These regulatory functions are performed through programmable formation of specific RNA structures, which mediate interactions with cellular machinery to carry out gene regulation. 3. In bacteria, RNA structures such as hairpins within mRNAs can regulate transcription and translation by preventing or allowing their formation. 4. These cis-acting elements are controlled through interactions with small RNAs (sRNAs) or ligand binding, creating inducible genetic control elements. 5. The potential use of nucleic acid design algorithms in creating RNA regulators de novo could have significant advantages over protein-based genetic control systems due to its versatile genetic regulation controlled by simple RNA structures. 6. This suggests a promising future for RNA synthetic biology and molecular programming at the bottom up level."


### Run on the large document

> ⚠ Caution! This took 47 minutes to run on my local machine with the suggested model. Probably not worth your time since it fails in the end 😊

In [75]:
# invoke() replaces the deprecated direct call on the chain object.
map_reduce_outputs = map_reduce_chain.invoke({"input_documents": large_docs})



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I have taken the following text:

TEXT:: It's good to be here with you this morning. It's nice and cold outside, which I like. If you have your Bibles, please turn with me to Matthew 1. We're going to pick up where we left off last week. We started in Matthew 1, verse 1. We talked through the genealogy of Jesus. And so this week we are continuing on in Matthew's gospel. I read a study this past week that half of children born in America now are born outside of wedlock. So that's a pretty historical number that half now is outside of wedlock.

And wrote a brief synopsis of the important information below. Let me know what you think!

SUMMARY:: 
Prompt after formatting:
I have taken the following text:

TEXT:: And that's not because there are single mothers everywhere. That's not why. A lot of people these days are just choosing to co

Token indices sequence length is longer than the specified maximum sequence length for this model (5099 > 1024). Running this sequence through the model will result in indexing errors



> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Here is the article: 


- The speaker begins by expressing his pleasure at being with the audience and appreciating the cold weather.
- They ask the congregation to open their Bibles to Matthew 1, where they left off last week in discussing Jesus' genealogy.
- A study mentioned that half of children born in America are now born outside of wedlock, which is a significant statistic.


The text discusses how many people are choosing to live together without getting married, which is becoming more common in America. The article highlights that cohabitation and marriage may seem similar but they are not the same thing. Cohabitation involves living with someone and doing life together, while marriage has a legal and social significance.


The text discusses the difference between relationships and marriage, highlighting how in a rel

In [76]:
import tabulate

left_column = [page.page_content for page in map_reduce_outputs["input_documents"]] + ["Final Summary"]
right_column = [out for out in map_reduce_outputs["intermediate_steps"]] + [map_reduce_outputs["output_text"]]

table_data = list(zip(left_column, right_column))
table = tabulate.tabulate(table_data, tablefmt='html')
table

Source chunk,Summary
"It's good to be here with you this morning. It's nice and cold outside, which I like. If you have your Bibles, please turn with me to Matthew 1. We're going to pick up where we left off last week. We started in Matthew 1, verse 1. We talked through the genealogy of Jesus. And so this week we are continuing on in Matthew's gospel. I read a study this past week that half of children born in America now are born outside of wedlock. So that's a pretty historical number that half now is outside of wedlock","- The speaker begins by expressing his pleasure at being with the audience and appreciating the cold weather. - They ask the congregation to open their Bibles to Matthew 1, where they left off last week in discussing Jesus' genealogy. - A study mentioned that half of children born in America are now born outside of wedlock, which is a significant statistic."
And that's not because there are single mothers everywhere. That's not why. A lot of people these days are just choosing to cohabitate and just say no to marriage altogether. So what you have really is generationally American saying we just don't want that. And this article is kind of talking about it when pastor noted on it how cohabitation and marriage they kind of look like the same thing but they're not the same thing at all. Because in cohabitation you're living with someone and you're doing life with someone,"The text discusses how many people are choosing to live together without getting married, which is becoming more common in America. The article highlights that cohabitation and marriage may seem similar but they are not the same thing. Cohabitation involves living with someone and doing life together, while marriage has a legal and social significance."
But really there's a freedom there to back out when you want. You're not really totally committed but in marriage you're making the vows and you're saying no matter what everything that you are I'm committed to it until the very end until death do we part as we say in our marriage vow. So marriage is a very serious full commitment. And while we may be saying generationally today is America is like I don't want to make that kind of commitment. I want like my secret license to kind of pull out if I need to,"The text discusses the difference between relationships and marriage, highlighting how in a relationship there is more freedom to back out when desired, whereas in marriage, individuals make vows that are meant to be lifelong commitments until death separates them. The speaker suggests that this idea of commitment may not appeal to younger generations in America today who prefer the option of pulling out if necessary."
Matthew's not going to let us do that in reference to Jesus. He's not going to let us do that in reference to the gospel. Last week we said Jesus alone. He's got this very unique life. Jesus has a very different human life than any other human. Jesus uniquely is worth watching. Jesus is uniquely worth following. But Matthew's saying make sure you know what you're saying when you say you follow Jesus. To follow Jesus at arm's length to follow Jesus at a comfortable distance Matthew's going to argue is not following Jesus at all,"- Matthew emphasizes that we should not take Jesus and his teachings lightly. - He highlights the uniqueness of Jesus' human life, which is different from any other person's life. - To truly follow Jesus, one must be fully committed and engaged in His teachings, rather than following Him at a comfortable distance or superficially."
If you really want to follow Jesus Matthew's saying you're going to have to buy into the whole thing. It's going to be like this very real full commitment. So make sure like you know what you're getting yourself into. That's what Matthew's doing. So Matthew's like going like right for the heart of it. He's starting in this very difficult place for us to really believe in except the script she's starting with this account of the supernatural birth of Jesus,"- Matthew emphasizes the importance of full commitment when following Jesus. - The text refers to Matthew's approach as going ""right for the heart of it."" - He starts with an account of the supernatural birth of Jesus, a difficult concept for many people to believe in."
Are we willing to own friends not some of Christ? Are we willing to commit to not what we like about Christ and Christianity? But can we own the whole thing? That's Matthew's I think challenge and thrown us in the deep end with this account of the supernatural birth. So we'll be in Matthew chapter 1 verse 18. I'm going to start reading for us there. Now the birth of Jesus Christ took place in this way. When his mother Mary had been patrolled to Joseph before they came together she was found to be with child from the Holy Spirit,"- Are we willing to own friends not some of Christ? - Matthew's challenge is about committing to the whole thing, including the supernatural birth. - The passage starts at Matthew chapter 1 verse 18."
"And her husband Joseph being a just man and unruly to put her to shame, resolved to divorce her quietly. But as he considered these things, behold an angel of the Lord appeared to him in a dream, saying Joseph son of David do not fear to take Mary as your wife. For that which is conceived in her is from the Holy Spirit, she will bear a son. And you shall call his name Jesus for he will save his people from their sins. All this took place to fulfill what the Lord had spoken by the prophet","- Joseph, a just man, planned to divorce Mary quietly due to her pregnancy. - An angel appeared in his dream and told him not to fear taking Mary as his wife because Jesus was conceived by the Holy Spirit. - The child's name should be Jesus, for he will save people from their sins. - This event fulfilled a prophecy spoken by a prophet."
"The whole the Virgin shall conceive in bear a son and they shall call his name Emmanuel, which means God with us. When Joseph woke from sleep he did as the angel of the Lord commanded him. He took his wife but knew her not until she had given birth to a son and he called his name Jesus. So Mary most likely she's probably 13 to 15 years old. This is what scholars would think. That would be a normal betrothal period for a girl in ancient Israel, 13 to 15 years old","- Virgin Mary will conceive and bear a son named Emmanuel (God with us) - Joseph follows angel's command, marries Mary but does not consummate until after birth - Scholars believe Mary was likely around 13 to 15 years old at the time of betrothal"
"And betrothal is not like what you and I think of when we think about like engagements. Like today if somebody makes an engagement that's yay exciting, it's wonderful. But someone can break off an engagement as easily as they make it. Like that wouldn't be a big deal to hear about. Oh such and such was engaged and they call it off whatever life moves on. That's not betrothal. So 21st century engagements very different from ancient Israel, betrothals. In a betrothal you were bound. If you were betrothed you were as good as Mary","21st century engagements are different from ancient Israel's betrothals, which were binding commitments that made one as committed as Mary was in her situation. In contrast, modern engagements can be easily broken off without significant consequences."
"You just hadn't had a ceremony yet to like celebrate it, like move in together as a family. You start building a family. You were legally bound to that person no matter what. So really if you're going to break off an engagement in ancient Israel you have to have a really significant reason to do it. You'd have to get the law involved, you'd have to divorce the person. It wasn't a simple let's make this and break this kind of thing","In ancient Israel, breaking off an engagement was not a casual matter but required significant reasons to do so. To end the engagement, one had to divorce their partner and involve the law. This highlights that forming a family in this time period was more than just a personal decision; it involved legal commitments as well."


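As I suspected, it summarized each block of text but failed in the final `Reduce` step. The warning in the log above points at the likely cause: the combined mini-summaries came to 5099 tokens, well past the maximum sequence length of 1024 that the warning reports.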
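
One possible way around an oversized `Reduce` input, which this notebook does not try, is to collapse the mini-summaries in batches until they fit in a single prompt. The sketch below is plain Python under loud assumptions: `count_tokens` is a crude word count and not a real tokenizer, `first_three_words` is a toy stand-in for an LLM summarizer, and `group_by_budget` and `collapse` are hypothetical helper names.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def group_by_budget(summaries, max_tokens):
    """Greedily pack summaries into groups that stay within max_tokens."""
    groups, current, current_size = [], [], 0
    for summary in summaries:
        size = count_tokens(summary)
        if current and current_size + size > max_tokens:
            groups.append(current)
            current, current_size = [], 0
        current.append(summary)
        current_size += size
    if current:
        groups.append(current)
    return groups


def collapse(summaries, summarize, max_tokens, max_rounds=5):
    """Re-summarize groups of summaries until the total fits in one prompt."""
    for _ in range(max_rounds):  # bounded, so the loop always terminates
        if sum(count_tokens(s) for s in summaries) <= max_tokens:
            break
        groups = group_by_budget(summaries, max_tokens)
        summaries = [summarize(" ".join(group)) for group in groups]
    return summarize(" ".join(summaries))


def first_three_words(text):
    """Toy summarizer standing in for an LLM call."""
    return " ".join(text.split()[:3])


mini_summaries = ["a b c d e", "f g h i j", "k l m n o", "p q r s t"]
final_summary = collapse(mini_summaries, first_three_words, max_tokens=10)
print(final_summary)
```

Each round costs extra model calls, so with a run that already took 47 minutes this trades even more time for a `Reduce` input that actually fits.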