# Title: Simplified Evolution Generation Using LLMs
# Author: Juan Olano
# Date: 9/3/2024
# License: Permissive

"""
## Introduction

This script provides a simplified approach to understanding how the RAGAS framework generates and validates question evolutions using a Large Language Model (LLM) such as OpenAI's GPT-4o. The code demonstrates the core components of RAGAS's methodology: generating initial questions from a text corpus, evolving these questions into more complex forms, and using a critic function to ensure the evolved questions meet specific criteria.

### Purpose

The purpose of this script is to simulate the key processes involved in RAGAS's synthetic data generation pipeline, which is used to create diverse and challenging datasets for training and evaluating language models. This simplified version uses Python and an OpenAI chat model (`gpt-4o` in the code below) to replicate the following steps:

1. **Document Chunking:** Splits a given document into manageable chunks of text to focus on specific parts of the content.
2. **Question Generation:** Uses an LLM to generate simple, factual questions based on each chunk of the document.
3. **Applying Evolutions:** Transforms the simple questions into more complex forms (e.g., multi-step reasoning, compressed, or rephrased questions) using the LLM.
4. **Critic Validation:** Employs a critic function, also powered by the LLM, to evaluate whether the evolved questions are valid. The critic checks if the evolved questions are sufficiently different from the originals while maintaining the same meaning.
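
As a rough illustration of step 1, the sketch below splits text into fixed-size word windows, mirroring the `chunk_document` method defined later in this script. The standalone function name `chunk_words` and the sample text are illustrative assumptions, not part of RAGAS.

```python
def chunk_words(text, chunk_size=100):
    # Split on whitespace, then group consecutive words into windows
    # of at most chunk_size words each.
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


sample = "one two three four five six seven"
chunks = chunk_words(sample, chunk_size=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

Because the windows are counted in words, a chunk can end in the middle of a sentence; a sentence-aware splitter would avoid that at the cost of uneven chunk sizes.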

### Workflow

The workflow of the script is divided into distinct steps:

1. **Initialization:** The script initializes an instance of the `EvolutionGenerator` class, which handles communication with the LLM and manages the evolution process.
2. **Document Chunking:** The input document is split into smaller text chunks. This mimics the approach of breaking down a large corpus into smaller, manageable pieces that can be processed individually.
3. **Simple Question Generation:** For each text chunk, a simple question is generated using the `gpt()` function. This step highlights how initial, straightforward questions are created from raw text.
4. **Applying Evolutions:** The script applies three different types of evolutions to each simple question—complexity, compression, and rephrasing. These transformations are designed to increase the difficulty and variety of the questions, similar to how RAGAS would produce varied data points for model training.
5. **Critic Evaluation:** After generating the evolved questions, the script uses the critic function to validate each evolution. The critic ensures that the transformations meet the criteria of being sufficiently distinct from the original while preserving the core intent and meaning.
6. **Validation Output:** The script provides detailed feedback on the validity of each evolved question, helping users understand the decision-making process of the critic.
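
The steps above can be sketched end to end without calling a real model. In the example below, `fake_llm` is a hypothetical stand-in that returns canned text, and the `QUESTION`/`EVOLVE`/`CRITIC` prompt prefixes are placeholders invented for this sketch; the actual prompts used by the script appear in the `EvolutionGenerator` class.

```python
def fake_llm(prompt):
    # Hypothetical stand-in for a real model call; returns canned text.
    if prompt.startswith("CRITIC"):
        return "VALID"
    if prompt.startswith("EVOLVE"):
        return "Evolved: " + prompt.split("\n")[-1]
    return "What does this chunk say?"


def run_pipeline(document, llm, chunk_size=100):
    # Step 1: word-based chunking.
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    results = []
    for chunk in chunks:
        # Step 2: one simple question per chunk.
        question = llm(f"QUESTION\n{chunk}")
        # Step 3: three evolutions per question.
        for kind in ("complexity", "compression", "rephrasing"):
            evolved = llm(f"EVOLVE {kind}\n{question}")
            # Step 4: the critic returns a verdict for each evolution.
            verdict = llm(f"CRITIC\n{question}\n{evolved}")
            is_valid = verdict.strip().upper() == "VALID"
            results.append((question, kind, evolved, is_valid))
    return results


rows = run_pipeline("a b c d e", fake_llm, chunk_size=3)
for row in rows:
    print(row)
```

Passing the model as a plain callable keeps the control flow testable offline; swapping `fake_llm` for a function that wraps a real client would leave `run_pipeline` unchanged.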

### Conclusion

This script provides a hands-on, simplified example of RAGAS's approach to generating synthetic data through question evolutions and validation. By understanding these key concepts, users can appreciate the importance of diverse and high-quality training data in building robust and capable language models.

Note: This script requires an OpenAI API key with access to the chat model used below (`gpt-4o`).
"""


In [20]:
import random

In [21]:
import os
import openai
from openai import OpenAI

with open('../../../apikeys/api_openai_aimakerspace.key', 'r') as file:
    api_key = file.read().strip()
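
The cell above reads the key from a file at a fixed relative path. As an alternative sketch, the key could be taken from an environment variable first, with an optional file fallback. The helper name `load_api_key` and its parameters are assumptions made for this example; `OPENAI_API_KEY` is the variable the OpenAI Python client itself looks for by default.

```python
import os
from pathlib import Path


def load_api_key(env_var="OPENAI_API_KEY", key_file=None):
    # Prefer the environment variable; fall back to a key file if one is given.
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if key_file is not None and Path(key_file).is_file():
        return Path(key_file).read_text().strip()
    raise RuntimeError(f"No API key found in ${env_var} or key file")


# Example call (the path is a placeholder, not a real location):
# api_key = load_api_key(key_file="path/to/your.key")
```

Failing early with a clear error is usually easier to debug than a later authentication failure inside `gpt()`.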

In [22]:
# Define the class with the gpt() function
class EvolutionGenerator:
    def __init__(self):
        self.openai_client = OpenAI(api_key=api_key)
    
    def gpt(self, prompt):
        try:
            response = self.openai_client.chat.completions.create(
                messages=[{"role": "system", "content": ""}, {"role": "user", "content": prompt}],
                temperature=0,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                stop=None,
                model="gpt-4o",
            )     
            result = response.choices[0].message.content 
            return result
        except Exception as e:
            print(f"An error occurred on GPT: {e}")
            return None

    # Step 1: Document Chunking
    def chunk_document(self, doc, chunk_size=100):
        """Splits the document into smaller chunks of text."""
        words = doc.split()
        chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
        return chunks

    # Step 2: Generating Simple Questions using GPT
    def generate_simple_questions(self, chunks):
        """Generates simple questions for each chunk using GPT."""
        questions = []
        for chunk in chunks:
            prompt = f"Generate a simple factual question based on the following text:\n\n{chunk}"
            question = self.gpt(prompt)
            if question:
                questions.append(question.strip())
        return questions

    # Step 3: Applying Evolutions using GPT
    def apply_evolutions(self, question):
        """Generates evolved versions of a question using GPT."""
        evolutions = []

        # Evolution 1: Complexity Evolution
        complexity_prompt = f"Transform the following simple question into a more complex reasoning question that requires multi-step logic or inference:\n\n{question}"
        complex_question = self.gpt(complexity_prompt)
        if complex_question:
            evolutions.append(complex_question.strip())

        # Evolution 2: Compression Evolution
        compression_prompt = f"Rewrite the following question to be more concise, while retaining the original meaning:\n\n{question}"
        compressed_question = self.gpt(compression_prompt)
        if compressed_question:
            evolutions.append(compressed_question.strip())

        # Evolution 3: Rephrasing Evolution
        rephrasing_prompt = f"Rephrase the following question in a different way to ask the same information:\n\n{question}"
        rephrased_question = self.gpt(rephrasing_prompt)
        if rephrased_question:
            evolutions.append(rephrased_question.strip())

        return evolutions
    
    # Step 4: Critic function to validate evolved questions
    def critic(self, original_question, evolved_question):
        """Uses GPT to evaluate if the evolved question is valid."""
        critic_prompt = f"""
        Evaluate the following evolved question to determine if it is sufficiently different from the original question while conveying the same meaning.
        Original Question: "{original_question}"
        Evolved Question: "{evolved_question}"

        Answer "VALID" if the evolved question meets both criteria. Answer "INVALID" if it fails either criterion.
        """
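        result = self.gpt(critic_prompt)
        # gpt() returns None when the API call fails; treat that as INVALID
        # instead of raising AttributeError on None.strip().
        if result is None:
            return False
        return result.strip().upper() == "VALID"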

    # Main function to generate evolved questions
    def generate_evolved_questions(self, questions):
        """Generates a set of evolved questions from simple ones."""
        all_evolved_questions = []
        for question in questions:
            evolved_questions = self.apply_evolutions(question)
            all_evolved_questions.append((question, evolved_questions))  # Store original and evolved pairs
        return all_evolved_questions

    # New function to validate evolved questions
    def validate_evolved_questions(self, evolved_questions):
        """Validates evolved questions using the critic function."""
        valid_evolved_questions = []
        for original_question, evolutions in evolved_questions:
            print(f"\nOriginal Question: {original_question}")
            for evolved_question in evolutions:
                is_valid = self.critic(original_question, evolved_question)
                print(f"  Evolved Question: {evolved_question}")
                print(f"    -> Validity: {'VALID' if is_valid else 'INVALID'}")
                if is_valid:
                    valid_evolved_questions.append(evolved_question)
        return valid_evolved_questions
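
The critic compares the model's reply to the exact string `VALID`, so a reply such as `VALID.` or `"Valid"` would be counted as invalid. A more tolerant parser could look like the sketch below. The helper name `parse_verdict` is hypothetical and not part of the class above, and reading only the first word of the reply is a design assumption.

```python
import re


def parse_verdict(reply):
    # None (e.g. a failed API call) or empty text counts as invalid.
    if not reply:
        return False
    # Take the first alphabetic token, ignoring quotes and punctuation.
    match = re.search(r"[A-Za-z]+", reply)
    return match is not None and match.group(0).upper() == "VALID"


for reply in ["VALID", "VALID.", ' "valid" ', "INVALID", None]:
    print(repr(reply), "->", parse_verdict(reply))
```

Looking only at the first token means a reply like "The answer is VALID" is still rejected, which keeps the check strict about the requested answer format.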

In [23]:
generator = EvolutionGenerator()

In [24]:
# Sample document as a string
document = """
Forward-Looking Statements
This Quarterly Report on Form 10-Q contains forward-looking statements based on management’s beliefs and assumptions and on information currently
available to management. In some cases, you can identify forward-looking statements by terms such as “may,” “will,” “should,” “could,” “goal,” “would,” “expect,”
“plan,” “anticipate,” “believe,” “estimate,” “project,” “predict,” “potential” and similar expressions intended to identify forward-looking statements. These
statements involve known and unknown risks, uncertainties and other factors, which may cause our actual results, performance, time frames or achievements to
be materially different from any future results, performance, time frames or achievements expressed or implied by the forward-looking statements. We discuss
many of these risks, uncertainties and other factors in this Quarterly Report on Form 10-Q and our Annual Report on Form 10-K for the fiscal year ended
January 28, 2024 in greater detail under the heading “Risk Factors” of such reports. Given these risks, uncertainties, and other factors, you should not place
undue reliance on these forward-looking statements. Also, these forward-looking statements represent our estimates and assumptions only as of the date of this
filing. You should read this Quarterly Report on Form 10-Q completely and understand that our actual future results may be materially different from what we
expect. We hereby qualify our forward-looking statements by these cautionary statements. Except as required by law, we assume no obligation to update these
forward-looking statements publicly, or to update the reasons actual results could differ materially from those anticipated in these forward-looking statements,
even if new information becomes available in the future.
"""

In [25]:
# Step 1: Chunk the document
chunks = generator.chunk_document(document)
print("Document Chunks:", chunks)

Document Chunks: ['Forward-Looking Statements This Quarterly Report on Form 10-Q contains forward-looking statements based on management’s beliefs and assumptions and on information currently available to management. In some cases, you can identify forward-looking statements by terms such as “may,” “will,” “should,” “could,” “goal,” “would,” “expect,” “plan,” “anticipate,” “believe,” “estimate,” “project,” “predict,” “potential” and similar expressions intended to identify forward-looking statements. These statements involve known and unknown risks, uncertainties and other factors, which may cause our actual results, performance, time frames or achievements to be materially different from any future results, performance, time frames or achievements expressed or implied by the forward-looking statements.', 'We discuss many of these risks, uncertainties and other factors in this Quarterly Report on Form 10-Q and our Annual Report on Form 10-K for the fiscal year ended January 28, 2024 in

In [26]:
# Step 2: Generate simple questions from chunks
simple_questions = generator.generate_simple_questions(chunks)
print("\nSimple Questions:", simple_questions)


Simple Questions: ['What terms are used to identify forward-looking statements in the Quarterly Report on Form 10-Q?', 'On what date did the fiscal year end for the Annual Report on Form 10-K mentioned in the text?', 'What is the company not obligated to do publicly, except as required by law?']


In [27]:
# Step 3: Apply evolutions to generate complex questions
evolved_questions = generator.generate_evolved_questions(simple_questions)
print("\nEvolved Questions:")
for original, evolved in evolved_questions:
    print(f"\nOriginal Question: {original}")
    for eq in evolved:
        print(f"  - Evolved: {eq}")


Evolved Questions:

Original Question: What terms are used to identify forward-looking statements in the Quarterly Report on Form 10-Q?
  - Evolved: In the context of analyzing a company's Quarterly Report on Form 10-Q, how would you identify and interpret the specific language and terms that indicate forward-looking statements, and what implications might these statements have for the company's future performance and investor decision-making?
  - Evolved: What terms identify forward-looking statements in the Quarterly Report on Form 10-Q?
  - Evolved: Which phrases or terminology are employed to denote forward-looking statements in the Quarterly Report on Form 10-Q?

Original Question: On what date did the fiscal year end for the Annual Report on Form 10-K mentioned in the text?
  - Evolved: Given that the Annual Report on Form 10-K typically covers a company's financial performance over a fiscal year, and considering that fiscal years can vary between companies, analyze the provided

In [28]:
# Step 4: Validate evolved questions using the critic function
print("\nValidating Evolved Questions:")
valid_evolved_questions = generator.validate_evolved_questions(evolved_questions)

print("\nValid Evolved Questions:")
for valid_question in valid_evolved_questions:
    print("-", valid_question)


Validating Evolved Questions:

Original Question: What terms are used to identify forward-looking statements in the Quarterly Report on Form 10-Q?
  Evolved Question: In the context of analyzing a company's Quarterly Report on Form 10-Q, how would you identify and interpret the specific language and terms that indicate forward-looking statements, and what implications might these statements have for the company's future performance and investor decision-making?
    -> Validity: INVALID
  Evolved Question: What terms identify forward-looking statements in the Quarterly Report on Form 10-Q?
    -> Validity: VALID
  Evolved Question: Which phrases or terminology are employed to denote forward-looking statements in the Quarterly Report on Form 10-Q?
    -> Validity: VALID

Original Question: On what date did the fiscal year end for the Annual Report on Form 10-K mentioned in the text?
  Evolved Question: Given that the Annual Report on Form 10-K typically covers a company's financial perf