<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/google/bigquery_dataframes_with_gretel_navigator_qa_pairs_for_rag.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🤖 Generate High-Quality Q&A Pairs from Unstructured Data for AI Knowledge Bases

Transform unstructured data into valuable, privacy-safe Q&A pairs using [Gretel Navigator](https://gretel.ai/navigator), [Google BigQuery](https://cloud.google.com/bigquery), and the [BigFrames SDK](https://cloud.google.com/python/docs/reference/bigframes/latest).

## 🔍 In this Notebook:

1. Retrieve example IT-security podcast transcripts from BigQuery
2. Generate synthetic Q&A pairs, filtering out irrelevant information and PII
3. Evaluate data quality using LLM-based scoring (relevance, coherence, factual accuracy, bias, safety)
4. Store AI-ready, privacy-safe Q&A pairs in BigQuery

## 💪 Why It Matters:

- Extract insights from various unstructured data sources
- Automatically filter out PII and sensitive information
- Customize knowledge extraction for specific topics
- Power knowledge bases, FAQs, chatbots, or LLM training data
- Ensure high-quality, relevant synthetic data
- Optimize for RAG systems or generate diverse test examples
- Process large volumes of unstructured data efficiently

Combine Gretel's AI models with BigQuery's processing power to create customized AI experiences while maintaining data privacy and quality.

[Explore Gretel Navigator](https://gretel.ai/navigator) | [Learn about BigQuery](https://cloud.google.com/bigquery)

In [None]:
%%capture
!pip install -Uqq "gretel-client>=0.22.0" langchain-text-splitters

In [None]:
# Install bigframes if it's not already installed in the environment.

# %%capture
# !pip install bigframes

In [None]:
import textwrap
import plotly.graph_objs as go
import plotly.express as px
import pandas as pd
import numpy as np

from tqdm.auto import tqdm
from typing import Tuple

import bigframes.pandas as bpd


from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure your GCP Project here
BIGQUERY_PROJECT = "gretel-vertex-demo"

In [None]:
# Initialize Gretel for synthetic data generation and evaluation
from gretel_client import Gretel
from gretel_client.bigquery import BigFrames

gretel = Gretel(api_key="prompt", validate=True, project_name="bigframes-rag")

gretel_bigframes = BigFrames(gretel)

# Initialize two separate Navigator models for generation and evaluation
gretel_bigframes.init_navigator("generator", backend_model="gretelai/auto")
gretel_bigframes.init_navigator("evaluator", backend_model="gretelai-google/gemini-pro")

In [None]:
# Load the unstructured podcast transcripts from BigQuery using BigFrames

# Set BigFrames options
bpd.options.display.progress_bar = None
bpd.options.bigquery.project = BIGQUERY_PROJECT

# Define the source project and dataset
project_id = "gretel-public"
dataset_id = "public"
table_id = "sample-security-podcasts"

In [None]:
# Construct the table path
table_path = f"{project_id}.{dataset_id}.{table_id}"

# Read the table into a DataFrame
df = bpd.read_gbq_table(table_path)

In [None]:
# Visualize an example

def print_dataset_statistics(data_source):
    """Print high level dataset statistics"""
    num_rows = data_source.shape[0]
    num_chars = data_source['text'].str.len().sum()

    print(f"\nNumber of rows: {num_rows}")
    print(f"Number of characters: {num_chars}")

def print_wrapped_text(text, width=128):
    """Print text wrapped to a specified width"""
    wrapped_text = textwrap.fill(text, width=width)
    print(wrapped_text)

print("Sample Security Podcast Transcript:\n")
print_wrapped_text(df.iloc[0]['text'])
print_dataset_statistics(df)

## 🧠 Generating Synthetic Q&A Pairs with Gretel AI

Gretel Navigator generates domain-specific synthetic Q&A pairs from each chunk of transcript text. In this notebook, the generation step:

- Uses language models to understand the source text and generate context-appropriate content
- Creates thought-provoking questions that encourage critical thinking
- Generates comprehensive, textbook-quality answers
- Maintains topical relevance to security and cloud environments
- Scales to large volumes of podcast transcripts by processing them chunk by chunk

This approach enables the creation of synthetic Q&A pairs that capture the nuances of security topics while providing valuable training data for chatbots and other AI applications.

[Learn more about Gretel's AI Models](https://docs.gretel.ai/)
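
Long transcripts are split into chunks before prompting, so that each prompt stays a manageable size. The sketch below illustrates the idea with plain character windows. It is a simplified stand-in, not the splitter used in this notebook: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will differ. The name `chunk_by_chars` and the tiny sizes in the usage lines are illustrative assumptions.

```python
from typing import List


def chunk_by_chars(text: str, chunk_size: int = 6000, overlap: int = 20) -> List[str]:
    """Split text into fixed-size character windows that overlap slightly.

    Hypothetical helper for illustration only; sizes are in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny sizes make the overlap visible: each chunk repeats one character.
example_chunks = chunk_by_chars("abcdefghij", chunk_size=4, overlap=1)
print(example_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk becomes one prompt, so the number of generation requests grows with the number of chunks rather than the number of transcripts.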

In [None]:
# Define the prompt template

topics = "How to protect the security of your Google cloud environment, and techniques used by hackers and advanced threat actors"

prompt_template = """\
Given the following text extracted from a podcast, create a dataset with these columns:

* `question`: Generate unique, thought-provoking questions that require critical thinking and detailed answers. Focus on the following topics: {topics}. Ensure that each question:
  - Is complex enough to necessitate a multi-step or in-depth answer
  - Encourages the application of knowledge rather than mere recall
  - Addresses potential knowledge gaps or underrepresented aspects of the topic
  - Includes sufficient context to be understood without additional information
  - Does not reference 'the text'; it must be self-contained and introduce its own topic and context

* `answer`: Provide comprehensive, textbook-quality answers that thoroughly address the question. Each answer should:
  - Present a step-by-step explanation of the concept or solution
  - Include all relevant details from the source text, as well as logical extensions or implications
  - Explain the reasoning process, not just the final conclusion
  - Be self-contained, assuming the reader has no access to the original context
  - Aim for 3-5 sentences of rich, educational content

Source Text:
{text}
"""

In [None]:
# Synthesize examples from data

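def chunk_text(text, max_tokens=6000):
    """Split a transcript into overlapping chunks for prompting.

    Note: because length_function=len, max_tokens is measured in
    characters, not model tokens.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_tokens,
        chunk_overlap=20,
        length_function=len,
    )
    return text_splitter.split_text(text)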

all_qa_pairs = [] # List[bpd.DataFrame], stores intermediate QA pairs generated from Navigator
num_records = 2  # Number of Q&A pairs per chunk

total_chunks = sum(len(chunk_text(text)) for text in df.to_pandas()['text'])

# Initialize tqdm progress bar
with tqdm(total=total_chunks * num_records, desc="Generating Synthetic Q&A pairs") as pbar:
    for _, row in df.iterrows():
        text = row['text']
        chunks = chunk_text(text)
        for chunk in chunks:
            prompt = prompt_template.format(text=chunk, topics=topics)

            # Generate synthetic Q&A pairs for this chunk
            chunk_df = gretel_bigframes.navigator_generate("generator", prompt, num_records=num_records, disable_progress_bar=True)

            all_qa_pairs.append(chunk_df)

            # Update progress bar
            pbar.update(num_records)

print(f"\nGenerated Synthetic Q&A pairs for {len(all_qa_pairs)} chunks")

In [None]:
df_synth = bpd.concat(all_qa_pairs, ignore_index=True)
gretel_bigframes.display_dataframe_in_notebook(df_synth)

## 🎯 Evaluating Synthetic Q&A Pairs

To ensure the quality of our generated Q&A pairs, we'll use Gretel's AI evaluation capabilities. This process is crucial for maintaining the integrity and usefulness of our synthetic data, especially when dealing with sensitive or complex information sources.

💪 With Gretel Navigator, you can pass in tabular data from BigQuery and add new fields by describing them in a prompt, so augmenting the dataset takes **only 2 lines of code!**

Key benefits of this evaluation step:

- Assesses multiple aspects of each Q&A pair, including relevance, coherence, and factual accuracy
- Provides numerical scores for easy filtering and quality control
- Helps identify and remove low-quality, irrelevant, or potentially sensitive content
- Ensures the final dataset meets high standards for use in AI applications
- Enables creation of high-quality training data or knowledge bases from diverse, unstructured sources
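
The numerical scores make filtering straightforward: keep a Q&A pair only if its lowest score meets a threshold. Here is a minimal pure-Python sketch of that rule, assuming each record is a dict whose score fields end in `_score`. The sample records are made up, and treating missing or out-of-range scores as failures is an extra assumption for illustration, not something the notebook's `filter_and_summarize` helper does.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def filter_by_min_score(
    records: List[Record], threshold: int = 75
) -> Tuple[List[Record], List[Record]]:
    """Split records into (kept, dropped) based on their lowest *_score value."""
    kept, dropped = [], []
    for record in records:
        scores = [value for key, value in record.items() if key.endswith("_score")]
        # Assumption: a record with no scores, or a score outside 1-100, is rejected.
        valid = bool(scores) and all(
            isinstance(s, (int, float)) and not isinstance(s, bool) and 1 <= s <= 100
            for s in scores
        )
        if valid and min(scores) >= threshold:
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


# Hypothetical scored records, not real model output.
sample_records = [
    {"question": "Q1", "relevance_score": 90, "coherence_score": 85},
    {"question": "Q2", "relevance_score": 95, "coherence_score": 60},
    {"question": "Q3", "relevance_score": 120, "coherence_score": 90},
]

kept, dropped = filter_by_min_score(sample_records, threshold=75)
print([r["question"] for r in kept])     # ['Q1']
print([r["question"] for r in dropped])  # ['Q2', 'Q3']
```

Filtering on the minimum score is strict by design: one weak metric is enough to drop a pair, which suits knowledge bases where a single inaccurate or unsafe answer is costly.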

In [None]:
prompt = """
Please evaluate the following dataset of synthetically generated question-answer pairs using these five metrics. For each entry, provide a score between 1 and 100 for each metric:

Columns to add to dataset:
1. `relevance_score`: int
   Measures how well the question and answer relate to the source content and each other.
   1 (completely irrelevant) to 100 (highly relevant and on-topic)
   Consider: Does the Q&A pair address key points from the source? Is the answer directly related to the question?

2. `coherence_score`: int
   Assesses the logical flow, clarity, and internal consistency of both the question and the answer.
   1 (incoherent or confusing) to 100 (perfectly clear and logically structured)
   Consider: Is the language clear? Do ideas flow logically? Is there a consistent narrative?

3. `factual_accuracy_score`: int
   Measures the factual correctness and informativeness of the answer, based on the source content.
   1 (contains major errors or lacks depth) to 100 (completely accurate and informative)
   Consider: Are all stated facts correct? Does the answer provide substantial, useful information?

4. `bias_score`: int
   Evaluates the presence of unfair prejudice, stereotypes, or favoritism in the Q&A pair.
   1 (heavily biased) to 100 (neutral and balanced)
   Consider: Does the Q&A pair show unfair preference or discrimination? Are diverse perspectives represented fairly when relevant?

5. `safety_score`: int
   Assesses the degree to which the Q&A pair is free from harmful, toxic, or inappropriate content.
   1 (contains harmful or inappropriate content) to 100 (completely safe and appropriate)
   Consider: Is the language respectful and non-toxic? Are sensitive topics handled appropriately? Is the content suitable for a general audience?

"""

df_scored = gretel_bigframes.navigator_edit("evaluator", prompt, seed_data=df_synth)

# For local evaluation of the scores, we use a pandas DataFrame
df_eval = df_scored.to_pandas()

In [None]:
# Peek at the Q&A pairs with LLM-as-a-judge scores

gretel_bigframes.display_dataframe_in_notebook(df_scored.head(10))

In [None]:
# Helper functions to visualize and filter evaluation results

def plot_score_distribution(df: pd.DataFrame) -> None:
    score_columns = [col for col in df.columns if col.endswith('_score')]
    colors = px.colors.qualitative.Plotly

    fig = go.Figure()
    for i, metric in enumerate(score_columns):
        scores = df[metric]
        fig.add_trace(go.Box(
            y=scores,
            name=metric.replace('_score', '').capitalize(),
            boxpoints='all',
            jitter=0.3,
            pointpos=-1.8,
            marker_color=colors[i % len(colors)]
        ))
        print(f"{metric}: Average = {np.mean(scores):.2f}, Std Dev = {np.std(scores):.2f}")

    fig.update_layout(
        title='Distribution of LLM Judge Scores',
        xaxis_title='Evaluation Metrics',
        yaxis_title='Score (1-100)',
        xaxis_tickangle=-45,
        showlegend=False,
        margin=dict(l=40, r=40, t=40, b=80),
        height=800,
        width=1200,
        xaxis=dict(automargin=True, title_standoff=25),
        yaxis=dict(automargin=True, title_standoff=15, range=[0, 100])
    )
    fig.show()

def filter_and_summarize(df: pd.DataFrame, threshold: int = 80) -> Tuple[pd.DataFrame, str]:
    score_columns = [col for col in df.columns if col.endswith('_score')]
    total_records = len(df)

    # Filter records
    df_filtered = df[df[score_columns].min(axis=1) >= threshold]
    filtered_records = total_records - len(df_filtered)

    # Create summary
    summary = f"""
    ✨ Summary of Filtering Process ✨
    --------------------------------
    Total examples processed: {total_records}
    Examples filtered out: {filtered_records}
    Remaining examples: {len(df_filtered)}
    --------------------------------
    """

    return df_filtered, summary

In [None]:
# Visualize and Filter evaluations
score_columns = [col for col in df_eval.columns if col.endswith('_score')]
df_eval[score_columns] = df_eval[score_columns].astype(float)

# Plot original distribution
print("Original Distribution:")
plot_score_distribution(df_eval)

# Filter and summarize
df_filtered, summary = filter_and_summarize(df_eval, threshold=75)

# Print summary
print(summary)

In [None]:
# Write the synthetically generated data to your table in BQ
# NOTE: The BQ Dataset must already exist!

project_id = BIGQUERY_PROJECT
dataset_id = "syntheticdata"
table_id = "security-chatbot-qa"

# Construct the table path
table_path = f"{project_id}.{dataset_id}.{table_id}"

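# Write to the destination table in BQ, un-comment to actually write to BQ.
# NOTE: df_synth holds every generated pair, including those below the score threshold.
# df_synth.to_gbq(table_path, if_exists='replace')
#
# To store only the pairs that passed filtering, convert the pandas DataFrame first:
# bpd.read_pandas(df_filtered).to_gbq(table_path, if_exists='replace')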

## 🚀 Conclusion: Unlocking the Power of Unstructured Data

This notebook demonstrates how Gretel's synthetic data capabilities can transform raw, unstructured data into valuable, AI-ready knowledge:

1. **Data Transformation**: We've taken complex, potentially sensitive podcast transcripts and extracted focused, high-quality Q&A pairs.
2. **Quality Assurance**: By using LLM-based evaluation, we ensure that only the most relevant and accurate information is retained.
3. **Versatility**: The resulting dataset can power knowledge bases, chatbots, or serve as training data for LLMs, adapting to your specific needs.
4. **Scalability**: This process can be applied to various data sources and scaled to handle large volumes of information.

By leveraging these techniques, organizations can unlock the full potential of their diverse data assets, creating customized AI experiences while maintaining data quality and privacy.