# Quick Start with AutoDDG

This notebook demonstrates how to generate dataset descriptions, expand them for search, and evaluate their quality using **AutoDDG**.

---

## 1. Imports

In [6]:
import os

import pandas as pd
from openai import OpenAI

from autoddg import AutoDDG, GPTEvaluator
from autoddg.utils import get_sample

## 2. Initialisation of the OpenAI Client

In [9]:
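# Connection settings for an OpenAI-compatible endpoint (OpenRouter).
# Note: max_tokens and temperature are recorded here but are not passed
# to the client or to AutoDDG anywhere in this notebook.
MODEL_CONFIG = {
    "base_url": "https://openrouter.ai/api/v1",
    "api_key": os.getenv("OPENROUTER_API_KEY"),
    "model_name": "mistralai/mistral-7b-instruct:free",
    "max_tokens": 13000,
    "temperature": 0.3,
}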

# Create the OpenAI client using your MODEL_CONFIG
client = OpenAI(
    api_key=MODEL_CONFIG["api_key"],
    base_url=MODEL_CONFIG["base_url"]
)

# Initialize AutoDDG with your Mistral model
auto_ddg = AutoDDG(
    client=client, 
    model_name=MODEL_CONFIG["model_name"]
)
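
`os.getenv` returns `None` when a variable is unset, so it can help to fail early with a clear message before the client is built. The following is a minimal sketch only; `require_env` is a hypothetical helper written for this notebook and is not part of AutoDDG or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank.

    Hypothetical helper for illustration; not provided by AutoDDG.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (commented out so the sketch runs without credentials):
# MODEL_CONFIG["api_key"] = require_env("OPENROUTER_API_KEY")
```

Calling the helper before the `OpenAI(...)` constructor keeps a configuration problem separate from any error raised by the API itself.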

## 3. Load Dataset and Prepare Context

Here we sample rows, profile the dataset, extract semantic information, and generate a short topic.

In [10]:
# Load the dataset and its existing metadata
csv_file = "clark_dataset.csv"
title = "Renal Cell Carcinoma"
original_description = (
    "This study reports a large-scale proteogenomic analysis of ccRCC to discern the functional impact "
    "of genomic alterations and provides evidence for rational treatment selection stemming from ccRCC pathobiology"
)
csv_df = pd.read_csv(csv_file)

# Sample rows
sample_df, dataset_sample = get_sample(csv_df, sample_size=100)

# Generate profiles
basic_profile, structural_profile = auto_ddg.profile_dataframe(csv_df)
semantic_profile_details = auto_ddg.analyze_semantics(sample_df)
semantic_profile = "\n".join(
    section for section in [structural_profile, semantic_profile_details] if section
)

# Generate topic
data_topic = auto_ddg.generate_topic(
    title=title,
    original_description=original_description,
    dataset_sample=dataset_sample,
)
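
The `semantic_profile` string in the cell above is built by joining only the non-empty sections, so a missing structural or semantic profile does not leave a stray blank line. The sketch below isolates that idiom with placeholder strings; `join_sections` is a hypothetical name used for illustration, and the inputs are not real AutoDDG output.

```python
def join_sections(*sections):
    """Join the non-empty sections with newlines, skipping None and empty strings."""
    return "\n".join(section for section in sections if section)


# Placeholder text standing in for the two profile sections
print(join_sections("Structural profile text", "Semantic profile text"))
print(join_sections("Structural profile text", ""))
```

The first call prints both sections on separate lines; the second prints only the structural text, with no trailing newline.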


## 4. Generate Descriptions

We create both a **general dataset description** and a **search-focused description**.

In [11]:
# General description
prompt, description = auto_ddg.describe_dataset(
    dataset_sample=dataset_sample,
    dataset_profile=basic_profile,
    use_profile=True,
    semantic_profile=semantic_profile,
    use_semantic_profile=True,
    data_topic=data_topic,
    use_topic=True,
)

# Search-focused description
search_prompt, search_focused_description = auto_ddg.expand_description_for_search(
    description=description,
    topic=data_topic,
)

### General Description

In [12]:
description

' This dataset contains medical information related to renal cell carcinoma, including demographic data such as gender, age, race, and ethnicity. It also includes clinical measurements like BMI, tumor size, and tumor stage, as well as detailed medical classifications such as tumor site, histologic type, and tumor grade. The dataset is structured to support aggregation and classification tasks, with columns like Case_ID, Tumor_Site, and Tumor_Stage_Pathological serving as key aggregation points. The semantic profile indicates that the data is primarily healthcare-focused, with a strong emphasis on anatomical and medical classifications. This dataset is valuable for researchers studying renal cell carcinoma, offering a comprehensive view of patient demographics, tumor characteristics, and clinical measurements.'

### Search-Focused Description

In [14]:
search_focused_description

' '
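
In the run shown here, `search_focused_description` came back as a string containing only whitespace, so any score computed for it would not reflect a real description. A small check like the one below can catch this before evaluation. It is a sketch under the assumption that a blank result should be treated as a failed generation; `is_blank` is a hypothetical helper, not an AutoDDG function.

```python
def is_blank(text) -> bool:
    """Return True when text is None or contains only whitespace."""
    return text is None or not str(text).strip()


# The value displayed above is a single space, which counts as blank
print(is_blank(" "))
print(is_blank("This dataset contains medical information."))
```

In the notebook, the same check could wrap the evaluation call, for example by skipping or regenerating a description when `is_blank(search_focused_description)` is true.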

## 5. Evaluate Quality

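Finally, we attach an evaluator to `auto_ddg` and score both descriptions. The evaluator reports completeness, conciseness, and readability. Because this notebook uses OpenRouter rather than the OpenAI API, we define a small `BaseEvaluator` subclass instead of using `GPTEvaluator`.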

In [15]:
from autoddg.evaluation import BaseEvaluator

class MistralEvaluator(BaseEvaluator):
    """
    Evaluate descriptions using OpenRouter Mistral models
    """
    def __init__(
        self,
        openrouter_api_key: str = "",
        model_name: str = "mistralai/mistral-7b-instruct:free",
    ):
        client = OpenAI(
            api_key=openrouter_api_key, 
            base_url="https://openrouter.ai/api/v1"
        )
        super().__init__(client=client, model_name=model_name)

# Read the API key that the evaluator will use
my_api_key = os.getenv("OPENROUTER_API_KEY")

In [16]:
# Attach evaluator
# auto_ddg.set_evaluator(GPTEvaluator(gpt4_api_key=my_api_key))
auto_ddg.set_evaluator(MistralEvaluator(openrouter_api_key=my_api_key))


# Score descriptions
general_score = auto_ddg.evaluate_description(description)
search_score = auto_ddg.evaluate_description(search_focused_description)

print("Score of the general description:", general_score)
print("Score of the search-focused description:", search_score)

Score of the general description:  Completeness: 8, Conciseness: 8, Readability: 8
Score of the search-focused description:  The dataset consists of information on the number of alcohol-impaired driving deaths and occupant deaths across various states in the United States. The dataset includes data for 51 states, detailing the number of alcohol-impaired driving deaths and occupant deaths, with values ranging from 0 to 3723 and 0 to 10406, respectively. Each entry also contains the state abbreviation and its geographical coordinates. The dataset is structured with categorical and numerical data types, focusing on traffic safety and casualty statistics. Key attributes include state names, death counts, and location coordinates, making it a valuable resource for analyzing traffic safety trends and issues related to impaired driving.

Evaluation Form (scores ONLY):
Completeness: 8
Conciseness: 8
Readability: 8
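
The evaluator returns free text rather than structured scores, as the two outputs above show. A minimal sketch for pulling the numbers out of such text follows. It assumes the `Name: value` format seen in this single run, and `parse_scores` is a hypothetical helper rather than part of AutoDDG.

```python
import re

# Matches the three criteria as they appear in the output above
SCORE_PATTERN = re.compile(r"(Completeness|Conciseness|Readability):\s*(\d+)")


def parse_scores(raw: str) -> dict:
    """Extract integer scores by criterion; a later match overrides an earlier one."""
    return {name: int(value) for name, value in SCORE_PATTERN.findall(raw)}


print(parse_scores(" Completeness: 8, Conciseness: 8, Readability: 8"))
# prints {'Completeness': 8, 'Conciseness': 8, 'Readability': 8}
```

Text that contains none of the criteria yields an empty dictionary, which makes a malformed evaluator response easy to detect.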
