In [149]:
from langchain.chains import StuffDocumentsChain, RefineDocumentsChain
from langchain.schema import Document
from langchain import PromptTemplate, LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
import re

# Reading and Preprocessing Document Data

In [150]:
loader = PyPDFLoader("Benchmark-GLUE-data-pdf.pdf")
documents = loader.load()

In [151]:
print(f"Number of pages loaded: {len(documents)}")

Number of pages loaded: 20


In [152]:

# Function to preprocess the text
def preprocess_text(text):
    # 1. Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    # 2. Remove emails
    text = re.sub(r'\S+@\S+', '', text)
    
    # 3. Remove special characters (except basic punctuation)
    text = re.sub(r'[^a-zA-Z0-9.,;:\'"\s-]', '', text)
    
    # 4. Remove numbers if unnecessary
    # text = re.sub(r'\b\d+\b', '', text)
    
    # 5. Convert to lowercase for uniformity
    text = text.lower()
    
    # 6. Remove headers/footers if present
    text = re.sub(r'published as a conference paper.*?iclr \d{4}', '', text, flags=re.IGNORECASE)
    
    # Return cleaned text
    return text

# Preprocess each document's content
preprocessed_documents = [
    Document(metadata=doc.metadata, page_content=preprocess_text(doc.page_content))
    for doc in documents
]

# Output the preprocessed documents
for idx, doc in enumerate(preprocessed_documents):
    print(f"Document {idx + 1}")
    print(f"Metadata: {doc.metadata}")
    print(f"Content: {doc.page_content[:500]}")  # Print the first 500 characters of content
    print("-" * 40)

Document 1
Metadata: {'source': 'Benchmark-GLUE-data-pdf.pdf', 'page': 0}
Content:  glue: a m ulti -task benchmark and analysis platform for natural language understand - ing alex wang1, amanpreet singh1, julian michael2, felix hill3, omer levy2  samuel r. bowman1 1courant institute of mathematical sciences, new york university 2paul g. allen school of computer science  engineering, university of washington 3deepmind    abstract for natural language understanding nlu technology to be maximally useful, it must be able to process language in a way that is not exclusive to a sing
----------------------------------------
Document 2
Metadata: {'source': 'Benchmark-GLUE-data-pdf.pdf', 'page': 1}
Content:  corpus train  test task metrics domain single-sentence tasks cola 8.5k 1k acceptability matthews corr. misc. sst-2 67k 1.8k sentiment acc. movie reviews similarity and paraphrase tasks mrpc 3.7k 1.7k paraphrase acc.f1 news sts-b 7k 1.4k sentence similarity pearsonspearman corr. misc. qqp 36

In [153]:

# Define the character splitter
text_splitter = CharacterTextSplitter(
    separator=" ",  # Use space as the separator
    chunk_size=300,  # Define the maximum size of each chunk
    chunk_overlap=50  # Overlap between chunks to maintain context
)

# Apply the splitter to preprocessed documents
split_documents = []
for doc in preprocessed_documents:
    chunks = text_splitter.split_text(doc.page_content)
    for chunk in chunks:
        # Create new Document objects for each chunk
        split_documents.append(Document(metadata=doc.metadata, page_content=chunk))

# Output the split documents
for idx, doc in enumerate(split_documents[:5]):  # Display only the first 5 chunks
    print(f"Chunk {idx + 1}")
    print(f"Metadata: {doc.metadata}")
    print(f"Content: {doc.page_content[:200]}")  # Print the first 200 characters
    print("-" * 40)

Chunk 1
Metadata: {'source': 'Benchmark-GLUE-data-pdf.pdf', 'page': 0}
Content: glue: a m ulti -task benchmark and analysis platform for natural language understand - ing alex wang1, amanpreet singh1, julian michael2, felix hill3, omer levy2 samuel r. bowman1 1courant institute o
----------------------------------------
Chunk 2
Metadata: {'source': 'Benchmark-GLUE-data-pdf.pdf', 'page': 0}
Content: g. allen school of computer science engineering, university of washington 3deepmind abstract for natural language understanding nlu technology to be maximally useful, it must be able to process langua
----------------------------------------
Chunk 3
Metadata: {'source': 'Benchmark-GLUE-data-pdf.pdf', 'page': 0}
Content: genre, or dataset. in pursuit of this objective, we introduce the general language understanding evaluation glue benchmark, a collection of tools for evaluat- ing the performance of models across a di
----------------------------------------
Chunk 4
Metadata: {'source': 'Benc

In [154]:
len(split_documents)

287
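
The `chunk_size` and `chunk_overlap` settings used with `CharacterTextSplitter` above can be illustrated with a small stand-alone sketch. This is not LangChain's implementation; `chunk_words` is a hypothetical helper that only shows the idea of packing space-separated words into chunks of bounded length while carrying a short tail of each chunk into the next one.

```python
def chunk_words(text, chunk_size=300, chunk_overlap=50):
    """Pack space-separated words into chunks of at most chunk_size characters,
    repeating up to chunk_overlap trailing characters at the start of the next chunk.
    Illustrative only; not how LangChain implements its splitter."""
    chunks, current, length = [], [], 0
    for word in text.split(" "):
        added = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + added > chunk_size:
            chunks.append(" ".join(current))
            # Drop words from the front until the retained tail fits the overlap
            while current and length > chunk_overlap:
                length -= len(current[0]) + (1 if len(current) > 1 else 0)
                current.pop(0)
            added = len(word) + (1 if current else 0)
        current.append(word)
        length += added
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = " ".join(f"w{i:02d}" for i in range(50))
chunks = chunk_words(sample, chunk_size=30, chunk_overlap=10)
print(chunks[0])
print(chunks[1])
```

With the small sizes used here, the second chunk begins with the last two words of the first one, which is the effect `chunk_overlap` is meant to have on context continuity.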

In [1]:
api_key= '<your_api_key>'

In [156]:
llm   = ChatOpenAI(openai_api_key=api_key)

# STUFF CHAIN

# StuffChain: Combining Multiple Documents into a Single Summary

**StuffChain** (`StuffDocumentsChain` in LangChain) processes multiple documents by concatenating their content into a single prompt and sending it to the language model in one call. It is the simplest way to summarize or analyze a small set of related documents as a whole, provided the combined text fits within the model's context window.

### How It Works:
1. **Input**: A list of documents is provided as input, each containing relevant information.
2. **Processing**: The content of these documents is combined into a single string (or chunk of text), which is then fed into the language model for summarization or other tasks.
3. **Output**: A unified output, typically in the form of a summary or an aggregated result, is returned, offering insights from the entire collection of documents.
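
The steps above can be sketched in plain Python without calling a model. This is a minimal illustration, not the library's code: `Doc` is a hypothetical stand-in for LangChain's `Document`, `stuff_prompt` is an invented helper, and the blank-line separator is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain.schema.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stuff_prompt(docs, template="Summarize the following text: {text}", separator="\n\n"):
    """Join the text of every document and place it into a single prompt string."""
    combined = separator.join(doc.page_content for doc in docs)
    return template.format(text=combined)


docs = [Doc("first chunk"), Doc("second chunk")]
prompt = stuff_prompt(docs)
print(prompt)
```

Because everything ends up in one prompt, the model sees all documents at once, and the prompt length grows with every document that is added.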

### Use Cases:
- **Document Summarization**: When you need to summarize a set of related documents into one cohesive summary.
- **Information Extraction**: To extract key insights from a series of documents for further analysis.
- **Data Aggregation**: Combining data from multiple sources into a single, coherent response.


In [157]:
summary_prompt = PromptTemplate(input_variables= ["text"], 
                                template       = "Summarize the following text: {text}")

llm_chain = LLMChain(llm=llm, 
                     prompt=summary_prompt)

stuff_chain = StuffDocumentsChain(llm_chain=llm_chain)

summary = stuff_chain.invoke(input={"input_documents": split_documents[:5]})
summary['output_text']

'The text introduces the General Language Understanding Evaluation (GLUE) benchmark, which is a collection of tools for evaluating the performance of natural language understanding models across various tasks. GLUE aims to encourage models that share general linguistic knowledge and includes a diagnostic test suite for detailed linguistic analysis. The text finds that multi-task training on all tasks performs better than training a separate model per task, but indicates the need for improved general NLU systems.'

# REFINE CHAIN

# RefineChain: Iterative Refinement for Enhanced Summarization

**RefineChain** (`RefineDocumentsChain` in LangChain) builds a summary or response through **iterative refinement**. It generates an initial answer from the first document, then passes that answer back to the model together with each subsequent document so the answer can be updated. The approach is useful when the documents are too long to fit into one prompt, or when an answer should be improved step by step as more context arrives.

### How It Works:
1. **Input**: The chain starts with an initial response or summary, generated from the first document (or the first batch of documents).
2. **Refinement**: The initial summary is refined one step at a time. At each step the model receives the current answer plus one additional document and returns an updated answer.
3. **Output**: After the refinement process, the final output is a more polished and detailed summary or answer, reflecting the combined insights from the original and additional information.
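
The loop described above can be sketched without LangChain or an API key. This is a simplified illustration under stated assumptions: `refine` is a hypothetical helper, the prompt wording is only an example, and `fake_llm` is a stub that stands in for a real model call so the control flow is visible.

```python
from typing import Callable, List

INITIAL_TEMPLATE = "Generate a brief overview based on the following information: {page_content}"
REFINE_TEMPLATE = (
    "Refine the following answer by incorporating this new information: "
    "{new_information}. Answer: {existing_answer}"
)


def refine(docs: List[str], llm: Callable[[str], str]) -> str:
    """Build an answer from the first document, then update it once per remaining document."""
    answer = llm(INITIAL_TEMPLATE.format(page_content=docs[0]))
    for doc in docs[1:]:
        answer = llm(REFINE_TEMPLATE.format(new_information=doc, existing_answer=answer))
    return answer


calls = []


def fake_llm(prompt: str) -> str:
    """Stub model: records each prompt and returns a numbered placeholder answer."""
    calls.append(prompt)
    return f"summary-v{len(calls)}"


result = refine(["chunk a", "chunk b", "chunk c"], fake_llm)
print(result, len(calls))
```

Each document costs one model call, and every call after the first depends on the previous answer, so the calls run sequentially.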

### Use Cases:
- **Improving Summaries**: If you start with a rough summary of a set of documents, RefineChain allows you to incrementally refine it, adding more context or information as needed.
- **Detailed Responses**: When generating answers to complex questions, you can start with an initial answer and then refine it by introducing more documents or data points.
- **Contextual Expansion**: Ideal for scenarios where the initial summary or answer needs further expansion and deeper understanding.


In [158]:
initial_prompt = PromptTemplate(
    input_variables= ["page_content"], 
    template       = "Generate a brief overview based on the following information: {page_content}"
)

refine_prompt = PromptTemplate(
    input_variables = ["existing_answer", "new_information"], 
    template        = "Refine the following answer by incorporating this new information: {new_information}. Answer: {existing_answer}"
)

In [159]:
# Create an initial LLMChain for the first response
initial_chain = LLMChain(llm=llm, prompt=initial_prompt)

# Create a refine LLMChain for the refinement process
refine_chain = LLMChain(llm=llm, prompt=refine_prompt)

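# Combine the chains into a RefineDocumentsChain.
# Note: this object is not invoked below; the following cells call
# initial_chain and refine_chain by hand. To run it end to end, the refine
# prompt would also need to use the same document variable as the initial
# prompt ("page_content") instead of "new_information".
refine_documents_chain = RefineDocumentsChain(
    initial_llm_chain    = initial_chain,       # Initial processing chain
    refine_llm_chain     = refine_chain,        # Refinement processing chain
    initial_response_name= "existing_answer",   # Must match the refine prompt's variable for the running answer
    verbose              = True
)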

In [197]:
# Initial Response: Process the first 2 chunks (indices 0 and 1)
initial_response = initial_chain.invoke(input={"page_content": " ".join(doc.page_content for doc in split_documents[:2])})


In [198]:
#print("Initial Response:", initial_response)

In [199]:
existing_answer = initial_response['text']  # Directly use the 'text' key for the refinement
existing_answer

'Glue is a multi-task benchmark and analysis platform developed by a team of researchers from New York University and the University of Washington. The platform aims to advance natural language understanding (NLU) technology by testing its ability to process language across various tasks, genres, and datasets. By providing a comprehensive evaluation of NLU capabilities, Glue seeks to enhance the overall performance and versatility of NLU technology for real-world applications.'

In [203]:
# New Information: Process the next 3 chunks (indices 2 to 4)
new_information = " ".join(doc.page_content for doc in split_documents[2:5])


In [194]:
refine_inputs = {
    "existing_answer": existing_answer,  # Use the initial response output
    "new_information": new_information
}

In [195]:
final_refined_response = refine_chain.invoke(input=refine_inputs)

In [196]:
# LLMChain returns the generated text under 'text'; 'existing_answer' only echoes the input
print("Final Refined Summary:", final_refined_response['text'])

Final Refined Summary: Glue is a multi-task benchmark and analysis platform for natural language understanding developed by a team of researchers from New York University, University of Washington, and DeepMind. The platform aims to improve NLU technology by enabling it to process language in a way that is not limited to a single task, genre, or dataset. By testing models on a variety of tasks, Glue provides a comprehensive evaluation of their performance and helps researchers identify areas for improvement in NLU technology.


## Comparison of Outputs: Refine Chain vs. Stuff Chain

### **Output from Refine Chain:**
> "Glue is a multi-task benchmark and analysis platform for natural language understanding developed by a team of researchers from New York University, University of Washington, and DeepMind. The platform aims to improve NLU technology by enabling it to process language in a way that is not limited to a single task, genre, or dataset. By testing models on a variety of tasks, Glue provides a comprehensive evaluation of their performance and helps researchers identify areas for improvement in NLU technology."

### **Output from Stuff Chain:**
> "The text introduces the General Language Understanding Evaluation (GLUE) benchmark, which is a collection of tools for evaluating the performance of natural language understanding models across various tasks. GLUE aims to encourage models that share general linguistic knowledge and includes a diagnostic test suite for detailed linguistic analysis. The text finds that multi-task training on all tasks performs better than training a separate model per task, but indicates the need for improved general NLU systems."

---

### **Key Differences:**

1. **Focus of the Output:**
   - **Refine Chain** focuses on describing the **goal of the Glue platform**: to improve NLU technology, evaluate models across tasks, and identify areas for improvement in NLU.
   - **Stuff Chain** provides **more detailed information** about the **GLUE benchmark** itself, mentioning its purpose as a collection of tools for evaluation and including a diagnostic test suite for linguistic analysis.

2. **Inclusion of the GLUE acronym:**
   - **Stuff Chain** explicitly introduces the **GLUE acronym** (General Language Understanding Evaluation) in the first sentence, providing more context about the benchmark.
   - **Refine Chain** does not include this acronym and focuses more on the platform's evaluation aspect.

3. **Analysis of Multi-Task Training:**
   - **Stuff Chain** includes a **mention of multi-task training**, reporting the finding that it performs better than training a separate model per task.
   - **Refine Chain** does not mention this aspect explicitly, focusing more on the general goal of the platform and model evaluation.

4. **Style and Detail:**
   - **Refine Chain** tends to focus on **higher-level goals**, highlighting the comprehensive evaluation and areas for improvement.
   - **Stuff Chain** provides a **bit more granular information**, such as mentioning the diagnostic test suite and emphasizing the goal of encouraging models with general linguistic knowledge.

---

### **Conclusion:**
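While both outputs convey similar core information about the **GLUE benchmark** and its purpose, the **Refine Chain** output stays at the level of broader objectives, while the **Stuff Chain** output goes deeper into the specifics of the benchmark and its diagnostic capabilities. The two runs also used different prompt wording ("Summarize the following text" versus "Generate a brief overview"), so some of the differences may come from the prompts rather than from the chain type.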

| Feature                | Stuff Chain                                             | Refine Chain                                           |
|------------------------|--------------------------------------------------------|-------------------------------------------------------|
| **What They Are**      | A chain that combines multiple documents into a single input for the model, typically used for generating a summary or response based on all combined content. | A chain designed to improve or refine an initial output by incorporating new information iteratively. |
| **Advantages**         | - Simplicity: Easy to implement and understand.<br>- Efficient for generating a response based on a larger context by combining documents. | - Incremental Improvement: Allows for step-by-step enhancement of the output.<br>- Flexibility: Can adapt the existing output with additional context or new information. |
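| **Disadvantages**      | - Context Limit: All documents go into one prompt, so the combined text must fit within the model's context window.<br>- Potential Information Loss: Important details may get lost when summarizing multiple documents into one.<br>- Limited Iteration: Does not allow for iterative feedback on individual documents. | - Complexity: More complex to implement and understand due to multiple steps.<br>- Cost and Latency: Requires one model call per document, and the calls run sequentially.<br>- Dependency on Initial Output: Relies on the quality of the initial response, which can affect the final output. |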
| **When to Use**       | - When needing to create a summary or response based on multiple pieces of information at once.<br>- Suitable for scenarios where a quick overview is needed without the need for refinement. | - When there is a need to refine or improve an existing response.<br>- Best used when new information becomes available that should enhance an already generated output. |
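
The "When to Use" row can be turned into a rough rule of thumb in code. The sketch below is only a heuristic under stated assumptions: `choose_chain` is a hypothetical helper, the 4000-token budget is an arbitrary example value, and the figure of about 4 characters per token is a coarse approximation rather than a real tokenizer count.

```python
def choose_chain(texts, context_budget_tokens=4000, chars_per_token=4):
    """Pick 'stuff' when the combined text plausibly fits in one prompt, else 'refine'.
    Both the budget and the characters-per-token ratio are illustrative assumptions."""
    estimated_tokens = sum(len(t) for t in texts) / chars_per_token
    return "stuff" if estimated_tokens <= context_budget_tokens else "refine"


# Five chunks of up to 300 characters, as used in the summaries above
print(choose_chain(["a" * 300] * 5))
# All 287 chunks of the document
print(choose_chain(["a" * 300] * 287))
```

For a real decision, the estimate would be better replaced with the token count from the tokenizer of the model actually in use.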