# Summarizing Docs with LangChain

We will use a few techniques to split large amounts of document text. Current LLMs have large context windows, depending on the model you choose (e.g. 20K+ tokens). Even so, you may need to split a large text before summarizing it with your LLM of choice. This notebook walks through tools for doing just that.

Inspired by https://www.youtube.com/watch?v=LNq_2s_H01Y&list=PL8motc6AQftk1Bs42EW45kwYbyJ4jOdiZ&index=8

In [6]:
import os
from langchain import PromptTemplate, LLMChain, OpenAI, HuggingFaceHub
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter

In [3]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""

Load the model. We'll use GPT-4 here.

In [61]:
llm_model = OpenAI(model_name="gpt-4-0613", temperature=0, max_tokens=4096)



Instantiate LangChain's text splitter class.

In [27]:
text_splitter = CharacterTextSplitter()

We use Miguel Hernán's "What If" book as the example text. It is freely available [here](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/).

In [8]:
with open("sample_docs/what_if_3_4.txt") as f:
    whatif_3_4 = f.read()

In [10]:
print(whatif_3_4)

Chapter 3
OBSERVATIONAL STUDIES
Consider again the causal question “does one’s looking up at the sky make other pedestrians look up too?” After
considering a randomized experiment as in the previous chapter, you concluded that looking up so many times was
too time-consuming and unhealthy for your neck bones. Hence you decided to conduct the following study: Find
a nearby pedestrian who is standing in a corner and not looking up. Then find a second pedestrian who is walking
towards the first one and not looking up either. Observe and record their behavior during the next 10 seconds.
Repeat this process a few thousand times. You could now compare the proportion of second pedestrians who
looked up after the first pedestrian did, and compare it with the proportion of second pedestrians who looked up
before the first pedestrian did. Such a scientific study in which the investigator observes and records the relevant
data is referred to as an observational study.
If you had conducted the obse

In [28]:
texts = text_splitter.split_text(whatif_3_4)

Created a chunk of size 104402, which is longer than the specified 4000


In [25]:
len(texts)

3

In [29]:
[len(chunk) for chunk in texts]

[104402, 452, 8630]
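
The oversized first chunk follows from how separator-based splitting behaves: the text is cut only where the separator occurs, so a long run without the separator stays in one piece. The sketch below is a simplified illustration of that idea, not LangChain's implementation. The function name `split_on_separator`, the `"\n\n"` separator, and the sample text are assumptions made for the example.

```python
def split_on_separator(text, separator="\n\n", chunk_size=4000):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut, so it may exceed chunk_size on its own.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical text: a long run with only single newlines, then two short paragraphs.
sample = ("line of text\n" * 1000) + "\n" + "short paragraph" + "\n\n" + "another one"
chunks = split_on_separator(sample)
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)
```

The first chunk is far longer than `chunk_size` because the long run contains no blank line to split on, which mirrors the 104402-character chunk reported above.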

As we can see above, the chunks can be much larger than desired. We are going to test an <u>alternative splitting scheme that specifies the chunk size in tokens, keeping each chunk small enough for the LLM's token window at the cost of producing more chunks</u>.

In [49]:
from langchain.text_splitter import TokenTextSplitter

In [62]:
text_splitter = TokenTextSplitter(
    chunk_size=750, chunk_overlap=0
)

In [63]:
texts = text_splitter.split_text(whatif_3_4)
print(len(texts))

36


[TokenTextSplitter()](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token) keeps each split within the specified maximum token count (750 here). This matters because every LLM has a maximum allowable window size. Below we check that the longest split, measured in characters, is under 4000.

In [66]:
# Character length of the longest chunk, then of every chunk
print(max(len(chunk) for chunk in texts))

print([len(chunk) for chunk in texts])

3815
[3770, 3292, 3622, 3565, 2881, 3220, 3240, 3333, 2306, 3216, 3448, 3644, 3207, 3510, 3626, 3643, 3467, 3282, 2609, 2813, 2664, 2686, 2891, 3169, 2815, 3505, 3209, 3815, 3212, 3353, 3573, 3052, 1909, 3268, 2683, 1990]
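
The idea behind token-based splitting can be sketched as slicing a token sequence into fixed-size windows. This is a minimal illustration under stated assumptions, not the library's code: `split_by_tokens` is a hypothetical helper, and whitespace-separated words stand in for real model tokens, which a tokenizer would normally produce.

```python
def split_by_tokens(tokens, chunk_size=750, chunk_overlap=0):
    """Slice a token list into windows of at most `chunk_size` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts `step` tokens after the previous one.
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


# Stand-in tokenizer (assumption): whitespace-separated words, not model tokens.
sample_tokens = ("word " * 2000).split()
windows = split_by_tokens(sample_tokens, chunk_size=750, chunk_overlap=0)
window_sizes = [len(w) for w in windows]
print(window_sizes)
```

Because the limit is applied to tokens, every window is bounded by `chunk_size` regardless of where paragraph breaks fall. The character lengths printed above vary because tokens differ in how many characters they cover.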


Next, we create a Document object for each chunk. We test this with only the first 10 chunks below.

In [67]:
from langchain.docstore.document import Document

In [68]:
docs = [Document(page_content=chunk) for chunk in texts[:10]]

In [72]:
docs

[Document(page_content='Chapter 3\nOBSERVATIONAL STUDIES\nConsider again the causal question “does one’s looking up at the sky make other pedestrians look up too?” After\nconsidering a randomized experiment as in the previous chapter, you concluded that looking up so many times was\ntoo time-consuming and unhealthy for your neck bones. Hence you decided to conduct the following study: Find\na nearby pedestrian who is standing in a corner and not looking up. Then find a second pedestrian who is walking\ntowards the first one and not looking up either. Observe and record their behavior during the next 10 seconds.\nRepeat this process a few thousand times. You could now compare the proportion of second pedestrians who\nlooked up after the first pedestrian did, and compare it with the proportion of second pedestrians who looked up\nbefore the first pedestrian did. Such a scientific study in which the investigator observes and records the relevant\ndata is referred to as an observational st

Now that we have the docs, we'll explore three different ways to do summarization.