# Text-Summarization

### • Stuff: All documents are inserted into a single prompt, which is then passed to an LLM. This is the simplest approach.

### • Map-reduce: Each document is summarized individually ("map" step), then those summaries are combined into a final summary ("reduce" step). This two-stage process is more complex but offers more flexibility.

### • Refine: The refine documents chain builds a response by looping over the input documents and iteratively updating its answer. For each document, it passes all non-document inputs, the current document, and the latest intermediate answer to an LLM chain to get a new answer.
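
The three strategies can be sketched in plain Python to show how they differ in the number and shape of model calls. This is a minimal illustration, not the LangChain implementation: `summarize` is a hypothetical callable standing in for any LLM call, and `first_words` is a toy replacement for a model.

```python
from typing import Callable, List

# `summarize` stands in for any LLM call that maps a prompt string to a summary string.
Summarizer = Callable[[str], str]


def stuff(docs: List[str], summarize: Summarizer) -> str:
    """Stuff: join every document into one prompt and call the model once."""
    return summarize("\n\n".join(docs))


def map_reduce(docs: List[str], summarize: Summarizer) -> str:
    """Map-reduce: summarize each document, then summarize the summaries."""
    partial_summaries = [summarize(doc) for doc in docs]  # map step
    return summarize("\n\n".join(partial_summaries))      # reduce step


def refine(docs: List[str], summarize: Summarizer) -> str:
    """Refine: carry a running summary forward and update it with each document."""
    if not docs:
        return ""
    answer = summarize(docs[0])
    for doc in docs[1:]:
        answer = summarize(f"Existing summary: {answer}\nNew context: {doc}")
    return answer


def first_words(prompt: str, n: int = 5) -> str:
    """Toy stand-in for an LLM: keep only the first n words of the prompt."""
    return " ".join(prompt.split()[:n])
```

With N documents, this sketch makes one call for stuff, N + 1 calls for map-reduce, and N calls for refine; only the map step of map-reduce could run in parallel.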

### Install required libraries

In [None]:
!pip install -q transformers einops accelerate langchain bitsandbytes
!pip install sentencepiece



In [None]:
!pip install pypdf



In [None]:
!pip install rouge
!pip install langchainhub



In [None]:
import torch
from langchain import HuggingFacePipeline
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline

### The pre-trained LaMini-Flan-T5-248M model is used.

In [None]:
model_path = "MBZUAI/LaMini-Flan-T5-248M"

t5_tokenizer = T5Tokenizer.from_pretrained(model_path)
t5_model = T5ForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto',
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Loading the document

In [None]:
loader = PyPDFLoader("The-Hound-of-the-Baskervilles-part1.pdf")
pages = loader.load()

### Splitting the document into chunks

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200,chunk_overlap=50)
texts = text_splitter.split_documents(pages)
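
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs, lines, and spaces, so its chunks will generally differ from these. The function name `split_with_overlap` is made up for this sketch.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With `chunk_size=200` and `chunk_overlap=50`, each window in this sketch advances by 150 characters, so consecutive chunks repeat 50 characters of context.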

### Creating the pipeline

In [None]:
# Use a new name so the imported `pipeline` function is not shadowed
summarizer = pipeline(
    "summarization",
    model=t5_model,
    tokenizer=t5_tokenizer,
    max_length=400,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=t5_tokenizer.eos_token_id,
)

### Creating the HuggingFace pipeline

In [None]:
llm = HuggingFacePipeline(pipeline=summarizer, model_kwargs={'temperature': 0.8})

### Creating the LLM chain and prompt template

In [None]:
from langchain import PromptTemplate, LLMChain

template = """
              Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
           """

In [None]:
prompt = PromptTemplate.from_template(template)

In [None]:
llm_chain = LLMChain(llm=llm, prompt=prompt)
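
Conceptually, the prompt template is a string with a named placeholder that is filled in before the text reaches the model. The sketch below shows that substitution with plain `str.format`. The helper `fill_prompt` and the shortened `example_template` are illustrative stand-ins, not part of LangChain.

```python
def fill_prompt(template: str, **variables: str) -> str:
    """Substitute named placeholders such as {text} into a template string."""
    return template.format(**variables)


# A shortened, hypothetical template with the same {text} placeholder as the notebook's.
example_template = (
    "Write a concise summary of the text between the markers.\n"
    "<<<{text}>>>\n"
    "BULLET POINT SUMMARY:"
)

filled = fill_prompt(example_template, text="Holmes examines a visitor's stick.")
```

The placeholder name matters: the stuff chain is configured with `document_variable_name="text"`, which has to match the `{text}` variable in the template.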

## Stuff documents chain

In [None]:
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

In [None]:
stuff_text = stuff_chain.run(texts)

Token indices sequence length is longer than the specified maximum sequence length for this model (775 > 512). Running this sequence through the model will result in indexing errors
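
The warning above shows the stuffed prompt exceeding the model's 512-token limit. A simple guard is to estimate the prompt length before choosing the stuff strategy. The sketch below assumes a hypothetical `fits_context` helper and uses a whitespace word count as a rough stand-in for a real tokenizer, so its counts will be lower than subword token counts.

```python
from typing import Callable, List


def count_whitespace_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fits_context(
    chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_whitespace_tokens,
    prompt_overhead: int = 0,
) -> bool:
    """Return True if all chunks plus the prompt wrapper fit within max_tokens."""
    total = prompt_overhead + sum(count_tokens(chunk) for chunk in chunks)
    return total <= max_tokens
```

In this notebook, passing something like `lambda s: len(t5_tokenizer.encode(s))` as `count_tokens` should give counts from the actual tokenizer. When the check fails, map-reduce or refine is the safer choice.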


In [None]:
stuff_text

"- Mr. Sherlock Holmes is seated at the breakfast table and picks up a stick from his visitor. - The stick is a fine, thick piece of wood, bulbous-headed, of the sort known as a 'Penang lawyer', with a broad silver band nearly an inch across, engraved upon it with the date '1884.' - Holmes has no sign of his occupation, but he believes that Dr. Mortim, a successful, elder-ly medical man, is well-esteemed since those who know him give him this mark of their appreciation, and the probability is in favor of his being a country practitioner who does a great deal of his visiting on foot."

In [None]:
reference_text = "In the morning, Sherlock Holmes was at the breakfast table, examining a stick left by a visitor. The stick had a silver band engraved with 'To James Mortimer, M.R.C.S., from his friends of the C.C.H., 1884.' Holmes deduced that Dr. Mortimer, an elderly medical man, was likely a country practitioner who walked a lot. The 'C.C.H.' likely referred to a local hunt where Mortimer may have provided surgical assistance. Holmes praised Watson's deductions."
reference_text

"In the morning, Sherlock Holmes was at the breakfast table, examining a stick left by a visitor. The stick had a silver band engraved with 'To James Mortimer, M.R.C.S., from his friends of the C.C.H., 1884.' Holmes deduced that Dr. Mortimer, an elderly medical man, was likely a country practitioner who walked a lot. The 'C.C.H.' likely referred to a local hunt where Mortimer may have provided surgical assistance. Holmes praised Watson's deductions."

### Evaluating the stuff chain model

In [None]:
from rouge import Rouge
rouge = Rouge()
rouge_scores = rouge.get_scores(stuff_text, reference_text, avg=True)
rouge_scores

{'rouge-1': {'r': 0.390625,
  'p': 0.30864197530864196,
  'f': 0.34482758127562424},
 'rouge-2': {'r': 0.15853658536585366,
  'p': 0.12264150943396226,
  'f': 0.13829786742191055},
 'rouge-l': {'r': 0.375, 'p': 0.2962962962962963, 'f': 0.3310344778273484}}
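
For intuition about the `r`, `p`, and `f` values, the sketch below computes ROUGE-N from clipped n-gram overlap using the textbook definition. It is an assumption-laden simplification: it lowercases and splits on whitespace, and it is not guaranteed to reproduce the exact numbers from the `rouge` package, which may preprocess and count n-grams differently.

```python
from collections import Counter
from typing import Dict


def rouge_n(hypothesis: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Recall, precision, and F1 over n-grams shared by hypothesis and reference."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum((hyp & ref).values())  # clipped count of shared n-grams

    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"r": recall, "p": precision, "f": f1}
```

Recall is the share of the reference's n-grams that appear in the generated summary, and precision is the share of the summary's n-grams that appear in the reference.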

### MapReduceDocumentsChain

In [None]:
# Map
map_template = """Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

In [None]:
from langchain import hub

map_prompt = hub.pull("rlm/map-prompt")
map_chain = LLMChain(llm=llm, prompt=map_prompt)

In [None]:
# Reduce
reduce_template = """Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)

In [None]:
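# Note: this reuses the map prompt ("rlm/map-prompt") for the reduce step
# and replaces the reduce_prompt defined in the previous cell.
reduce_prompt = hub.pull("rlm/map-prompt")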

In [None]:

from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
# Run chain
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)

In [None]:
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)
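
The idea behind `token_max` and the collapse step can be sketched as follows: mapped summaries are packed into groups that fit the budget, each group is summarized, and this repeats until everything fits in one final call. This is a rough approximation of the behaviour, not LangChain's code. The names `group_under_budget`, `collapse_then_reduce`, and `max_rounds` are hypothetical.

```python
from typing import Callable, List


def group_under_budget(
    texts: List[str], token_max: int, count_tokens: Callable[[str], int]
) -> List[List[str]]:
    """Greedily pack consecutive texts into groups whose token total stays within token_max."""
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in texts:
        size = count_tokens(text)
        if current and used + size > token_max:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += size
    if current:
        groups.append(current)
    return groups


def collapse_then_reduce(
    summaries: List[str],
    summarize: Callable[[str], str],
    token_max: int,
    count_tokens: Callable[[str], int],
    max_rounds: int = 10,
) -> str:
    """Collapse groups of summaries until they fit the budget, then make one final call."""
    for _ in range(max_rounds):  # bounded so a non-shrinking summarizer cannot loop forever
        if sum(count_tokens(s) for s in summaries) <= token_max:
            break
        groups = group_under_budget(summaries, token_max, count_tokens)
        summaries = [summarize("\n".join(group)) for group in groups]
    return summarize("\n".join(summaries))
```

Because the model used here accepts about 512 tokens, a `token_max` of 4000 means the collapse step would rarely trigger before the final prompt is already too long for the model.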


In [None]:
loader = PyPDFLoader("The-Hound-of-the-Baskervilles-part1.pdf")
pages = loader.load()

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=50)
texts = text_splitter.split_documents(pages)

In [None]:
map_summarization = map_reduce_chain.run(texts)

Your max_length is set to 400, but your input_length is only 288. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=144)
Your max_length is set to 400, but your input_length is only 101. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Your max_length is set to 400, but your input_length is only 292. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=146)
Your max_length is set to 400, but your input_length is only 149. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=74)

In [None]:
print(map_summarization)

The main themes in the provided set of documents are Hound of the Baskervilles Chapter 1, Mr. Sherlock Holmes, the presence of a stick with a date engraved on it by a family practitioner, "eyes in the back of your head," "polished, silver-plated coffee-pot," and "visitor's stick."


In [None]:
map_summarization

'The main themes in the provided set of documents are Hound of the Baskervilles Chapter 1, Mr. Sherlock Holmes, the presence of a stick with a date engraved on it by a family practitioner, "eyes in the back of your head," "polished, silver-plated coffee-pot," and "visitor\'s stick."'

In [None]:
reference_text = "In the morning, Sherlock Holmes was at the breakfast table, examining a stick left by a visitor. The stick had a silver band engraved with 'To James Mortimer, M.R.C.S., from his friends of the C.C.H., 1884.' Holmes deduced that Dr. Mortimer, an elderly medical man, was likely a country practitioner who walked a lot. The 'C.C.H.' likely referred to a local hunt where Mortimer may have provided surgical assistance. Holmes praised Watson's deductions."
reference_text

"In the morning, Sherlock Holmes was at the breakfast table, examining a stick left by a visitor. The stick had a silver band engraved with 'To James Mortimer, M.R.C.S., from his friends of the C.C.H., 1884.' Holmes deduced that Dr. Mortimer, an elderly medical man, was likely a country practitioner who walked a lot. The 'C.C.H.' likely referred to a local hunt where Mortimer may have provided surgical assistance. Holmes praised Watson's deductions."

### Evaluating the map-reduce chain model

In [None]:
from rouge import Rouge
rouge = Rouge()
rouge_scores = rouge.get_scores(map_summarization, reference_text, avg=True)
rouge_scores

{'rouge-1': {'r': 0.15625, 'p': 0.2631578947368421, 'f': 0.19607842669742417},
 'rouge-2': {'r': 0.036585365853658534,
  'p': 0.06521739130434782,
  'f': 0.04687499539550827},
 'rouge-l': {'r': 0.140625, 'p': 0.23684210526315788, 'f': 0.1764705835601693}}

## Refine

In [None]:
prompt_template = """Write a concise summary of the following:
{text}
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

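# Adjacent string literals are concatenated, so each piece needs its own
# trailing space or newline to avoid words running together in the prompt.
# "in Italian" is kept from the original prompt; drop it if an English summary is wanted.
refine_template = (
    "Your job is to produce a final summary\n"
    "We have provided an existing summary up to a certain point: {existing_answer}\n"
    "We have the opportunity to refine the existing summary "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{text}\n"
    "------------\n"
    "Given the new context, refine the original summary in Italian. "
    "If the context isn't useful, return the original summary."
)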
refine_prompt = PromptTemplate.from_template(refine_template)
chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",
    question_prompt=prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
    input_key="input_documents",
    output_key="output_text",
)
result = chain({"input_documents": texts}, return_only_outputs=True)

  warn_deprecated(
Your max_length is set to 400, but your input_length is only 276. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=138)
Your max_length is set to 400, but your input_length is only 250. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=125)
Your max_length is set to 400, but your input_length is only 331. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=165)


In [None]:
refine_summary = result["output_text"]

In [None]:
refine_summary

"The Hound of the Baskervilles Chapter 1 is about Mr. Sherlock Holmes, a detective who was seated at the breakfast table. The author picks up a stick from a visitor's hearth-rug, engraved with the date '1884,' and reveals that he was a Penang lawyer. Free eBooks at Planet eBook.comnate become of importance as to miss him and have no notion of his errand, this accidental souvenir becomes of importance. However, we have the opportunity to refine the existing summary(only if needed) with some more context below. 'Really, Watson, you excel yourself,' said Holmes, push-ing back his chair and lighting a cigarette. ‘I am bound to say that in all the accounts which you have been so good as to give of my own small achievements you have habitually underrated your own abilities. It may be that you are not yourself luminous, but you are a conductor of light.'"

In [None]:
reference_text = "Mr. Sherlock Holmes, a detective who was seated at the breakfast table. The author picks up a stick from a visitor's hearth-rug, engraved with the date '1884,' and reveals that he was a Penang lawyer.. The stick had a silver band engraved with 'To James Mortimer, M.R.C.S., from his friends of the C.C.H., 1884.' Holmes deduced that Dr. Mortimer, an elderly medical man, was likely a country practitioner who walked a lot. The 'C.C.H.' likely referred to a local hunt where Mortimer may have provided surgical assistance. Holmes praised Watson's deductions."
reference_text

"Mr. Sherlock Holmes, a detective who was seated at the breakfast table. The author picks up a stick from a visitor's hearth-rug, engraved with the date '1884,' and reveals that he was a Penang lawyer.. The stick had a silver band engraved with 'To James Mortimer, M.R.C.S., from his friends of the C.C.H., 1884.' Holmes deduced that Dr. Mortimer, an elderly medical man, was likely a country practitioner who walked a lot. The 'C.C.H.' likely referred to a local hunt where Mortimer may have provided surgical assistance. Holmes praised Watson's deductions."

### Evaluating the refine chain model

In [None]:
from rouge import Rouge
rouge = Rouge()
rouge_scores = rouge.get_scores(refine_summary, reference_text, avg=True)
rouge_scores

{'rouge-1': {'r': 0.4864864864864865,
  'p': 0.32432432432432434,
  'f': 0.38918918438918926},
 'rouge-2': {'r': 0.35353535353535354,
  'p': 0.2413793103448276,
  'f': 0.286885241079347},
 'rouge-l': {'r': 0.4864864864864865,
  'p': 0.32432432432432434,
  'f': 0.38918918438918926}}
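
To compare the three runs side by side, the F1 values reported in the evaluation cells can be collected and ranked. The numbers below are copied from the outputs in this notebook and rounded to four decimal places. The helper `rank_by` is just an illustrative convenience.

```python
from typing import Dict, List

# F1 scores copied from the three evaluation outputs in this notebook (rounded).
f1_scores: Dict[str, Dict[str, float]] = {
    "stuff": {"rouge-1": 0.3448, "rouge-2": 0.1383, "rouge-l": 0.3310},
    "map-reduce": {"rouge-1": 0.1961, "rouge-2": 0.0469, "rouge-l": 0.1765},
    "refine": {"rouge-1": 0.3892, "rouge-2": 0.2869, "rouge-l": 0.3892},
}


def rank_by(metric: str, scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Return chain names ordered from highest to lowest score on the given metric."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


for metric in ("rouge-1", "rouge-2", "rouge-l"):
    print(metric, rank_by(metric, f1_scores))
```

One caveat when reading this ranking: the refine run was scored against a different `reference_text` whose opening sentences closely match the refine output, so its scores are not strictly comparable with the other two. Sampling (`do_sample=True`) also means the values are likely to vary between runs.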