<a href="https://colab.research.google.com/github/meghashyamnb/test123/blob/master/Copy_of_GenAI_AgenticPipelines_HumanInput_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Created by: [Bharath Kumar Hemachandran](mailto:bharathh@gmail.com) ([LinkedIn](https://www.linkedin.com/in/bharath-hemachandran/))



**GenAI Agentic Workflows - Human Input Loop**
==============================================

# Setup

In [None]:
%%capture
!pip install crewai groq langchain_groq

In [None]:
import os
from crewai import Agent, Task, Crew, Process
from langchain_groq import ChatGroq

# Set up Groq with Llama3
os.environ["GROQ_API_KEY"] = "<your_groq_api_key>"  # Replace with your API key
model1 = "groq/llama3-8b-8192"
model2 = "llama-3.1-8b-instant"
llm = ChatGroq(model=model1)

# Create the Crew

In [None]:
def create_and_run_crew(topic, feedback=None):
    # Create your agents with specific roles
    researcher = Agent(
        role="Research Specialist",
        goal="Find and analyze data on the {topic} provided",
        backstory="You're an expert at finding relevant information and analyzing it thoroughly.",
        verbose=True,
        llm=llm
    )

    writer = Agent(
        role="Content Writer",
        goal="Create high-quality content based on research of the {topic} provided",
        backstory="You transform complex research into clear, engaging content.",
        verbose=True,
        llm=llm
    )

    reviewer = Agent(
        role="Quality Reviewer",
        goal="Ensure accuracy and quality of final output of the {topic}",
        backstory="You have a keen eye for detail and ensure all work meets high standards.",
        verbose=True,
        llm=llm
    )

    feedback_processor = Agent(
        role="Feedback Analyzer",
        goal="Process human feedback and extract actionable insights",
        backstory="You specialize in understanding user needs and translating feedback into concrete improvements.",
        verbose=True,
        llm=llm
    )

    # Define tasks for your agents
    research_task = Task(
        description="Research the given {topic} extensively and compile key findings",
        agent=researcher,
        expected_output="Comprehensive research notes on the topic"
    )

    # If there's feedback, include it in the writing task
    writing_description = "Using the research provided, create a well-structured article on the {topic} provided"
    if feedback:
        writing_description += f". Consider this feedback for improvement: {feedback}"

    writing_task = Task(
        description=writing_description,
        agent=writer,
        expected_output="A complete draft article",
        context=[research_task]
    )

    review_task = Task(
        description="Review the article for accuracy, clarity, and quality",
        agent=reviewer,
        expected_output="Final polished article with review notes",
        context=[writing_task]
    )

    # Create a crew with sequential process
    crew = Crew(
        agents=[researcher, writer, reviewer, feedback_processor],
        tasks=[research_task, writing_task, review_task],
        verbose=True,
        process=Process.sequential
    )

    # Execute the crew to complete all tasks
    result = crew.kickoff(inputs={"topic": topic})
    return result

# Handle the feedback

In [None]:
def process_feedback(feedback, article):
    """Use the LLM to process feedback and determine if it's sufficient"""
    prompt = f"""
    Analyze the following feedback on an article:

    FEEDBACK:
    {feedback}

    ARTICLE:
    {article}

    Task:
    1. Determine if the feedback is detailed enough to guide improvements (YES/NO)
    2. If NO, suggest what specific questions we should ask the user
    3. If YES, summarize the key points for improvement in a structured format

    Output your analysis in the following format:
    SUFFICIENT: [YES/NO]
    QUESTIONS: [if NO, list specific questions]
    ANALYSIS: [if YES, structured improvement points]
    """
    llm2 = ChatGroq(model=model2)
    messages = [("system",""),("human",prompt)]
    response = llm2.invoke(messages).content
    return response

def collect_detailed_feedback(feedback_analysis, initial_feedback):
    """Collect detailed feedback from the user based on analysis"""
    # Extract questions to ask the user
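    questions_start = feedback_analysis.find("QUESTIONS:")
    # find() returns -1 when the marker is missing; fall back to the whole analysis
    questions_start = questions_start + len("QUESTIONS:") if questions_start != -1 else 0
    questions_end = feedback_analysis.find("ANALYSIS:", questions_start)
    if questions_end == -1:
        questions_end = len(feedback_analysis)
    questions = feedback_analysis[questions_start:questions_end].strip()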

    print(f"\nCould you provide more specific feedback? {questions}")
    additional_feedback = input("Your detailed feedback: ")

    # Check if user wants to quit
    if additional_feedback.lower() in ['quit', 'exit']:
        return None, None

    # Now process this additional feedback to see if it's sufficient
    combined_feedback = f"{initial_feedback} {additional_feedback}"
    return combined_feedback, additional_feedback


# Run the program



In [None]:
def main():
    topic = input("Enter the topic for the article: ")
    # Check if user wants to quit at the beginning
    if topic.lower() in ['quit', 'exit']:
        print("Exiting the program.")
        return

    iteration = 1
    feedback = None

    while True:
        print(f"\n=== Starting iteration {iteration} ===")
        if feedback:
            print(f"Incorporating feedback: {feedback}")

        # Run the crew and get the article
        result = create_and_run_crew(topic, feedback)
        print("\n=== Final Article ===")
        print(result)

        # Ask for human feedback
        user_feedback = input("\nHow would you rate this article? Please provide feedback (or type 'quit' or 'exit' to finish): ")

        if user_feedback.lower() in ['quit', 'exit']:
            print("Exiting the program.")
            break

        # Process initial feedback
        feedback_analysis = process_feedback(user_feedback, result)
        print("\nFeedback Analysis:")
        print(feedback_analysis)

        # Check if initial feedback is sufficient
        if "SUFFICIENT: NO" in feedback_analysis:
            # Get more detailed feedback from user
            combined_feedback, additional_feedback = collect_detailed_feedback(feedback_analysis, user_feedback)

            # Check if user chose to quit during feedback collection
            if combined_feedback is None:
                print("Exiting the program.")
                break

            # Process the combined feedback to see if it's now sufficient
            second_analysis = process_feedback(combined_feedback, result)
            print("\nUpdated Feedback Analysis:")
            print(second_analysis)

            # Keep collecting feedback until the analysis marks it as sufficient
            while "SUFFICIENT: NO" in second_analysis:
                print("\nYour feedback is still not specific enough.")
                combined_feedback, more_feedback = collect_detailed_feedback(second_analysis, combined_feedback)

                # Check if user chose to quit during additional feedback collection
                if combined_feedback is None:
                    print("Exiting the program.")
                    return

                second_analysis = process_feedback(combined_feedback, result)
                print("\nUpdated Feedback Analysis:")
                print(second_analysis)

            # The combined feedback is now sufficient, so extract the analysis
            analysis_start = second_analysis.find("ANALYSIS:") + 9
            feedback = second_analysis[analysis_start:].strip()
        else:
            # Initial feedback was already sufficient
            analysis_start = feedback_analysis.find("ANALYSIS:") + 9
            feedback = feedback_analysis[analysis_start:].strip()

        iteration += 1

if __name__ == "__main__":
    main()

# Testing the Pipeline

## 1. Test Objectives
- Verify the core functionality of agent-based article generation.
- Ensure accurate handling of human feedback.
- Validate improvements in content quality through iterative feedback.
- Assess error handling and robustness in various scenarios.
- Evaluate system performance with different inputs.

## 2. Test Scope
- **Functional Testing**: Ensuring expected outputs for given inputs.
- **Edge Case Testing**: Handling unusual or extreme inputs.
- **Error Handling Testing**: Ensuring stability in erroneous scenarios.
- **Performance Testing**: Measuring response time and execution efficiency.
- **Usability Testing**: Checking clarity and responsiveness of human feedback interactions.

## 3. Test Strategy
The testing will be conducted in three phases:

- **Unit Testing**: Validate individual functions.
- **Integration Testing**: Verify interactions between agents.
- **End-to-End Testing**: Ensure a smooth workflow for article generation and feedback handling.

## 4. Test Scenarios and Steps

### **4.1 Functional Tests**
| **Test Case** | **Description** | **Expected Outcome** |
|--------------|----------------|----------------------|
| **TC-01: Generate Article** | Run `create_and_run_crew("Machine Learning")` | Returns a valid article with research, writing, and review. |
| **TC-02: Handle Feedback** | Provide structured feedback | Feedback analysis correctly determines sufficiency. |
| **TC-03: Insufficient Feedback Handling** | Give vague feedback (e.g., "Improve it") | System asks for more details. |
| **TC-04: Loop Iteration with Feedback** | Run multiple feedback cycles | Each cycle improves the article quality. |
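
TC-02 and TC-03 both depend on reading the `SUFFICIENT:`, `QUESTIONS:`, and `ANALYSIS:` fields out of the model's reply. The sketch below shows one way to parse that reply so the tests can assert on structured values instead of substring matches. `parse_feedback_analysis` is a hypothetical helper, not part of the notebook, and it assumes the reply uses the labels requested in the `process_feedback` prompt.

```python
import re

def parse_feedback_analysis(text):
    """Split an analysis reply into its SUFFICIENT / QUESTIONS / ANALYSIS fields.

    Hypothetical helper: assumes the reply uses the labels requested in the
    process_feedback prompt. Missing fields come back as empty strings.
    """
    fields = {"SUFFICIENT": "", "QUESTIONS": "", "ANALYSIS": ""}
    # Capture each label and everything up to the next label (or end of text)
    pattern = re.compile(
        r"(SUFFICIENT|QUESTIONS|ANALYSIS)\s*:\s*(.*?)"
        r"(?=(?:SUFFICIENT|QUESTIONS|ANALYSIS)\s*:|\Z)",
        re.DOTALL | re.IGNORECASE,
    )
    for label, body in pattern.findall(text):
        fields[label.upper()] = body.strip()
    # Treat anything other than an explicit YES as insufficient
    sufficient = fields["SUFFICIENT"].strip("[] *").upper().startswith("YES")
    return {
        "sufficient": sufficient,
        "questions": fields["QUESTIONS"],
        "analysis": fields["ANALYSIS"],
    }

# Sample replies written by hand for illustration
vague = "SUFFICIENT: NO\nQUESTIONS: Which sections are unclear?\nANALYSIS:"
detailed = "SUFFICIENT: YES\nQUESTIONS:\nANALYSIS: 1. Shorten the introduction."
print(parse_feedback_analysis(vague))
print(parse_feedback_analysis(detailed))
```

A reply with no recognizable labels is treated as insufficient, which keeps the feedback loop asking for detail rather than proceeding on an unparsed answer.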

### **4.2 Edge Case Tests**
| **Test Case** | **Description** | **Expected Outcome** |
|--------------|----------------|----------------------|
| **TC-05: Empty Topic Input** | Run `create_and_run_crew("")` | Returns an error or prompts user for input. |
| **TC-06: Non-Text Topic Input** | Input numbers/special characters as topic | Handles input gracefully or rejects invalid topics. |
| **TC-07: Feedback Injection Attack** | Enter feedback with code/script | System sanitizes input and prevents execution. |
| **TC-08: Extremely Long Feedback** | Provide a long feedback string (1000+ words) | System processes or truncates safely. |
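
The notebook passes the topic and feedback straight to the model, so TC-05, TC-06, and TC-08 need some validation layer to test against. A minimal sketch of such a layer follows. `validate_topic`, `truncate_feedback`, and the 500-word limit are assumptions made for illustration and do not exist in the notebook.

```python
MAX_FEEDBACK_WORDS = 500  # assumed limit; tune to the model's context window

def validate_topic(topic):
    """Return a cleaned topic, or raise ValueError for empty or non-text input."""
    cleaned = " ".join(str(topic).split())  # trim and collapse whitespace
    if not cleaned:
        raise ValueError("Topic must not be empty.")
    if not any(ch.isalpha() for ch in cleaned):
        raise ValueError("Topic must contain at least one letter.")
    return cleaned

def truncate_feedback(feedback, max_words=MAX_FEEDBACK_WORDS):
    """Keep at most max_words words of feedback."""
    words = feedback.split()
    return " ".join(words[:max_words])

print(validate_topic("  Machine   Learning "))
print(len(truncate_feedback("word " * 1200).split()))

for bad_topic in ["", "12345 !!!"]:
    try:
        validate_topic(bad_topic)
    except ValueError as error:
        print(f"Rejected {bad_topic!r}: {error}")
```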

### **4.3 Error Handling Tests**
| **Test Case** | **Description** | **Expected Outcome** |
|--------------|----------------|----------------------|
| **TC-09: API Key Missing** | Remove `GROQ_API_KEY` | System fails gracefully with an appropriate error. |
| **TC-10: Invalid API Response** | Mock invalid response from LLM | System handles error and retries or exits safely. |
| **TC-11: Unexpected User Input** | Provide gibberish input | System prompts user for clarification. |
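
TC-09 and TC-10 assume the pipeline fails gracefully, but the notebook has no explicit key check or retry logic. The sketch below shows what such guards could look like and how to exercise them without calling Groq. `require_api_key` and `invoke_with_retry` are hypothetical helpers, and `flaky_call` is a stand-in for an LLM call.

```python
import os
import time

def require_api_key(name="GROQ_API_KEY"):
    """Fail early with a clear message when the key is missing or a placeholder."""
    key = os.environ.get(name, "")
    if not key or key.startswith("<"):
        raise RuntimeError(f"{name} is not set. Export it before running the pipeline.")
    return key

def invoke_with_retry(call, retries=3, delay=0.0):
    """Call `call()` up to `retries` times; re-raise the last error if all fail."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return call()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            time.sleep(delay)
    raise last_error

# Stand-in for an LLM call that fails twice, then succeeds
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated invalid response")
    return "SUFFICIENT: YES"

print(invoke_with_retry(flaky_call))
```

In a real test, `flaky_call` would be replaced by a mock of the model's `invoke` method so that no network traffic is needed.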

### **4.4 Performance Tests**
| **Test Case** | **Description** | **Expected Outcome** |
|--------------|----------------|----------------------|
| **TC-12: Process Large Topic** | Use a long topic like "History of AI from 1950 to 2025 with trends" | System completes execution within a reasonable time. |
| **TC-13: Multiple Concurrent Runs** | Run `create_and_run_crew()` in parallel | System handles concurrency without crashing. |
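
"A reasonable time" in TC-12 is only testable once runs are timed. The sketch below times each call and runs several in parallel with a thread pool, which also covers TC-13. `fake_crew` is a stand-in for `create_and_run_crew` so the example runs without an API key; `timed_run` and `run_concurrently` are hypothetical helpers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_run(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def run_concurrently(func, topics, max_workers=5):
    """Run func on every topic in parallel; return (topic, result, seconds) tuples."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [(topic, pool.submit(timed_run, func, topic)) for topic in topics]
        # future.result() re-raises any exception from the worker thread
        return [(topic, *future.result()) for topic, future in futures]

def fake_crew(topic):
    """Stand-in for create_and_run_crew; sleeps instead of calling an LLM."""
    time.sleep(0.05)
    return f"Article about {topic}"

for topic, article, seconds in run_concurrently(fake_crew, ["A", "B", "C"]):
    print(f"{topic}: {seconds:.2f}s -> {article}")
```

Unlike bare `threading.Thread` objects, futures surface exceptions from worker threads, so a crash in one run is reported instead of being silently lost.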



## 5. Step-by-Step Testing Method

### **5.1 Set Up Environment**
```bash
pip install crewai groq langchain_groq
```
Ensure the `GROQ_API_KEY` environment variable is set before running any of the tests below.

### **5.2 Unit Testing**
```python
# Test process_feedback() with various inputs (detailed vs. vague)
feedback_analysis = process_feedback("This needs improvement", "Sample article")
print(feedback_analysis)
```
```python
# Test collect_detailed_feedback() by simulating user input
feedback_analysis = process_feedback("Not clear enough", "Sample article")
print(collect_detailed_feedback(feedback_analysis, "Not clear enough"))
```
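
The `collect_detailed_feedback()` call above blocks on `input()`, so it cannot run unattended. One way to simulate the user is to patch `builtins.input` with `unittest.mock.patch`, as sketched here. The function body is a condensed copy of the notebook's version so the example runs on its own, and the analysis string is written by hand rather than produced by the model.

```python
from unittest.mock import patch

def collect_detailed_feedback(feedback_analysis, initial_feedback):
    # Condensed copy of the notebook function so this sketch is self-contained
    start = feedback_analysis.find("QUESTIONS:") + 10
    end = (feedback_analysis.find("ANALYSIS:")
           if "ANALYSIS:" in feedback_analysis else len(feedback_analysis))
    questions = feedback_analysis[start:end].strip()
    print(f"\nCould you provide more specific feedback? {questions}")
    additional_feedback = input("Your detailed feedback: ")
    if additional_feedback.lower() in ["quit", "exit"]:
        return None, None
    return f"{initial_feedback} {additional_feedback}", additional_feedback

# Hand-written analysis in the format requested by process_feedback
analysis = "SUFFICIENT: NO\nQUESTIONS: Which section is unclear?\nANALYSIS:"

# Simulate a user who answers the follow-up question
with patch("builtins.input", return_value="The introduction is too long"):
    combined, extra = collect_detailed_feedback(analysis, "Not clear enough")
print(combined)

# Simulate a user who quits instead
with patch("builtins.input", return_value="quit"):
    print(collect_detailed_feedback(analysis, "Not clear enough"))
```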

### **5.3 Integration Testing**
```python
# Run article generation and check outputs
result = create_and_run_crew("Sample Topic")
print(result)
```
```python
# Provide different types of feedback and observe iterations
feedback = "Needs better structure."
feedback_analysis = process_feedback(feedback, result)
print(feedback_analysis)
```

### **5.4 End-to-End Testing**
```python
# Run the full script and simulate user interactions
main()
```

### **5.5 Precision & Accuracy Evaluation**
```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

def evaluate_output(expected_text, generated_texts, threshold=0.8):
    """Score generated texts by cosine similarity to a single reference text."""
    if not generated_texts:
        return {"precision": 0.0, "accuracy": 0.0}

    model = SentenceTransformer("all-MiniLM-L6-v2")  # Load a sentence similarity model
    # Encode the reference once instead of once per generated text
    expected_embedding = model.encode(expected_text, convert_to_tensor=True)
    generated_embeddings = model.encode(generated_texts, convert_to_tensor=True)
    # cos_sim returns a 1 x N tensor: one similarity per generated text
    similarities = util.cos_sim(expected_embedding, generated_embeddings)[0].tolist()
    # Count outputs above the similarity threshold for "correct" outputs
    relevant_outputs = sum(1 for sim in similarities if sim > threshold)

    # With one reference and no negative examples, both metrics reduce to
    # the share of outputs above the threshold
    precision = relevant_outputs / len(generated_texts)
    accuracy = relevant_outputs / len(similarities)

    return {"precision": precision, "accuracy": accuracy}

# Example Usage
expected = "Machine Learning is a subset of AI that involves training models on data."
generated = ["ML is part of AI and uses training data.", "AI includes ML and deep learning."]
metrics = evaluate_output(expected, generated)
print(metrics)  # A dict with 'precision' and 'accuracy'; values depend on the model
```

### **5.6 Performance & Stress Testing**
```python
# Increase topic complexity
result = create_and_run_crew("History of AI from 1950 to 2025 with trends and key research milestones")
print(result)
```
```python
# Run multiple instances in parallel
import threading
threads = []
for i in range(5):
    t = threading.Thread(target=create_and_run_crew, args=("Parallel Test",))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```


# Fairness & Bias Testing for AI-Generated Content

## **1. Objectives**
- Identify and mitigate biases in AI-generated content.
- Ensure fair representation across different demographic groups.
- Evaluate sentiment neutrality and contextual appropriateness.
- Apply AI fairness frameworks to measure and correct biases.

## **2. Scope**
- **Bias Detection:** Identify potential biases in text outputs.
- **Demographic Fairness:** Ensure equal representation.
- **Sentiment Analysis:** Detect and mitigate unintended bias in responses.
- **Fairness Framework Integration:** Use AI fairness tools for evaluation.

## **3. Test Scenarios**

### **3.1 Bias Detection in Content**
| **Test Case** | **Description** | **Expected Outcome** |
|--------------|----------------|----------------------|
| **TC-01: Gender Bias in Responses** | Generate responses to topics like careers, leadership, and intelligence. | No gender-stereotypical biases. |
| **TC-02: Racial Bias in Responses** | Evaluate generated content for different cultural topics. | Neutral and inclusive outputs. |
| **TC-03: Political Bias in Responses** | Generate text on political topics from various perspectives. | Balanced representation without partisanship. |
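
One common way to run TC-01 is counterfactual testing: generate content for two topics that differ only in a demographic term, then compare the outputs. The sketch below builds such topic pairs. The `SWAPS` table and the `counterfactual` helper are assumptions made for illustration and would need extending for the demographics under test.

```python
import re

# Assumed term pairs; each pair is swapped in both directions
SWAPS = {"men": "women", "man": "woman", "he": "she", "father": "mother", "boys": "girls"}

def counterfactual(text, swaps=SWAPS):
    """Swap each demographic term with its counterpart, keeping initial capitals."""
    both_ways = {**swaps, **{value: key for key, value in swaps.items()}}
    # Word boundaries stop "he" from matching inside words like "technology"
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, both_ways)) + r")\b", re.IGNORECASE
    )

    def replace(match):
        word = match.group(0)
        swapped = both_ways[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return pattern.sub(replace, text)

topics = ["Why men succeed in engineering careers", "A woman as a leader in technology"]
for topic in topics:
    print(topic, "->", counterfactual(topic))
```

Each original topic and its counterfactual can then be passed to `create_and_run_crew`, and the two articles compared for differences in tone, length, or attributed traits.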

### **3.2 Demographic Fairness in AI Output**
| **Test Case** | **Description** | **Expected Outcome** |
|--------------|----------------|----------------------|
| **TC-04: Representation Across Demographics** | Generate content for different demographic groups and compare results. | Equal and fair representation. |
| **TC-05: Bias in Text Summarization** | Summarize articles containing diverse viewpoints. | No preference for any particular group. |

### **3.3 Sentiment Analysis for Bias Detection**
| **Test Case** | **Description** | **Expected Outcome** |
|--------------|----------------|----------------------|
| **TC-06: Sentiment Polarity Check** | Analyze sentiment distribution in AI-generated text. | Neutral or contextually appropriate sentiment. |
| **TC-07: Stereotype Detection** | Detect common stereotypes in AI-generated text. | AI avoids reinforcing stereotypes. |

### **3.4 Fairness Framework Integration**
| **Test Case** | **Description** | **Expected Outcome** |
|--------------|----------------|----------------------|
| **TC-08: AI Fairness 360 (AIF360) Evaluation** | Use `AIF360` to quantify bias in AI-generated text. | Fairness score above threshold. |
| **TC-09: Fairlearn Bias Mitigation** | Apply `Fairlearn` to balance demographic representation. | Improved fairness in content distribution. |
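
The "fairness score" in TC-08 and TC-09 is not defined in this plan. One candidate is the demographic parity difference: the gap between the highest and lowest rate of favorable outcomes across groups, where 0 means every group is treated alike. The dependency-free sketch below computes it from annotated outputs. The labels and groups are hypothetical annotations, and the function names are chosen for this example only.

```python
from collections import defaultdict

def selection_rates(labels, groups):
    """Share of favorable labels (1) within each group."""
    totals, favorable = defaultdict(int), defaultdict(int)
    for label, group in zip(labels, groups):
        totals[group] += 1
        favorable[group] += label
    return {group: favorable[group] / totals[group] for group in totals}

def parity_difference(labels, groups):
    """Gap between the highest and lowest group selection rate (0 = parity)."""
    rates = selection_rates(labels, groups)
    return max(rates.values()) - min(rates.values())

# Hypothetical annotations: 1 = text judged free of stereotypes, 0 = flagged
labels = [1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "B", "B", "B"]

print(selection_rates(labels, groups))
print(round(parity_difference(labels, groups), 3))
```

A test could then assert that the difference stays below an agreed threshold; the threshold itself is a policy decision this plan still needs to state.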

## **4. Implementation of Fairness Testing**

### **4.1 Setup Environment**
```bash
pip install aif360 fairlearn scikit-learn pandas numpy
```

### **4.2 Bias Evaluation using AIF360**
```python
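from aif360.algorithms.preprocessing import Reweighing
from aif360.datasets import BinaryLabelDataset
import pandas as pd

# AIF360 works on numeric tabular data, not raw text, so each generated text
# needs an annotated protected attribute and a binary label first.
# Assumed encoding: group 1 = privileged, 0 = unprivileged; label 1 = favorable.
PRIVILEGED = [{"group": 1}]
UNPRIVILEGED = [{"group": 0}]

def evaluate_bias(records):
    df = pd.DataFrame(records)[["group", "label"]].astype(float)
    dataset = BinaryLabelDataset(
        df=df, label_names=["label"], protected_attribute_names=["group"],
        favorable_label=1.0, unfavorable_label=0.0,
    )
    # Reweighing assigns instance weights that balance label rates across groups
    reweighing = Reweighing(unprivileged_groups=UNPRIVILEGED, privileged_groups=PRIVILEGED)
    transformed_dataset = reweighing.fit_transform(dataset)

    return transformed_dataset

# Example Usage (the group and label annotations are illustrative, not measured)
generated_texts = [
    {"text": "Men are better at science.", "group": 1, "label": 1},
    {"text": "Women should focus on family.", "group": 0, "label": 0},
    {"text": "Men should focus on family.", "group": 1, "label": 0},
    {"text": "Women are better at science.", "group": 0, "label": 1},
]
print(evaluate_bias(generated_texts).instance_weights)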
```

### **4.3 Bias Mitigation using Fairlearn**
```python
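from fairlearn.metrics import demographic_parity_difference

# fairlearn.reductions.DemographicParity is a training constraint for classifiers
# (used with ExponentiatedGradient), not a standalone scorer. For annotated
# generated text, the demographic parity difference measures the gap that a
# mitigation step should reduce (0 means equal favorable rates across groups).
def demographic_parity_gap(labels, groups):
    # The metric ignores y_true, so the labels are passed for both arguments
    fairness_score = demographic_parity_difference(labels, labels, sensitive_features=groups)
    return fairness_score

# Example Usage (assumed annotations: label 1 = favorable, group = protected attribute)
labels = [1, 0, 0, 1]
groups = ["men", "women", "men", "women"]
print(demographic_parity_gap(labels, groups))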
```
