# Define the pipeline config

Defining the config requires some knowledge of the dataset. Note that the pipeline expects the dataset to have a train and a test split. If yours doesn't, you will have to adapt `huqu.stages.dataset_loading.py` accordingly.
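If you are unsure whether your dataset ships both splits, a quick check along these lines may help. This is only a minimal sketch: `missing_splits` is a hypothetical helper that is not part of `huqu`, and it assumes you already have the split names at hand (for a Hugging Face `DatasetDict`, these are its keys).

```python
REQUIRED_SPLITS = ("train", "test")


def missing_splits(available_splits, required=REQUIRED_SPLITS):
    """Return the required split names that are absent, in the order they are required."""
    available = set(available_splits)
    return [name for name in required if name not in available]


# A dataset that only ships a train split is missing "test".
print(missing_splits(["train"]))
# Extra splits such as "validation" are ignored.
print(missing_splits(["train", "validation", "test"]))
```

An empty list means both splits are present and the loading stage should not need adapting for this reason.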

But don't worry, we have already defined an example pipeline config for you.

## Default pipeline config

```yaml
dataset:
  path: "Bingsu/Cat_and_Dog"  # The path of the Hugging Face dataset
  config_name: "default"
  class_name: "dog"
  class_label: 1
  label_key: "labels"  # Some datasets have different keys for the labels, like "label" or "labels"
  num_train_samples: 2
  num_test_samples: 2
  main_subject: "dogs"  # The main theme or subject of the dataset
  captions_path: "data/captions_dogs.parquet"  # Path to store the captions DataFrame
  assignments_path: "data/assignments_dogs.parquet"  # Path to store the assignments DataFrame
  unrefined_criteria_path: "data/unrefined_criteria_dogs.json"  # Path to store initial criteria
  refined_criteria_path: "data/refined_criteria_dogs.json"  # Path to store refined criteria
  format: "parquet"  # Format for storing DataFrames
  compression: "snappy"  # Compression method for parquet files

stages:
  criteria_init:
    batch_size: 2
  criteria_refinement:
    num_rounds: 1
    sample_size: 2
```
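Before running any stage, it can be worth checking that the config contains every `dataset` key used in the example above. The sketch below assumes the YAML has already been parsed into a plain dictionary. The key list is copied from the example config, while `missing_dataset_keys` and the idea that all of these keys are mandatory are assumptions, not something `huqu` is known to enforce.

```python
# Keys copied from the example config; whether huqu requires all of them is an assumption.
EXPECTED_DATASET_KEYS = {
    "path", "config_name", "class_name", "class_label", "label_key",
    "num_train_samples", "num_test_samples", "main_subject",
    "captions_path", "assignments_path",
    "unrefined_criteria_path", "refined_criteria_path",
    "format", "compression",
}


def missing_dataset_keys(config):
    """Return the sorted expected dataset keys that are absent from a parsed config dict."""
    dataset_section = config.get("dataset", {})
    return sorted(EXPECTED_DATASET_KEYS - set(dataset_section))


# A deliberately incomplete config to show what gets reported.
example = {"dataset": {"path": "Bingsu/Cat_and_Dog", "class_name": "dog", "class_label": 1}}
print(missing_dataset_keys(example))
```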

# Loading the dataset
Once the pipeline is configured, you can load the dataset.

In [13]:
from huqu.stages.dataset_loading import DatasetLoadingStage
dataset_loader = DatasetLoadingStage()
dataset = dataset_loader.process()

# Creating captions
Once the dataset is loaded, we can load a model of your choice and create the captions.
To execute this cell, you need an OpenAI API key stored in `.env` as `OPENAI_API_KEY=<your-api-key>`.

The captions will automatically be saved to a new folder called `data`.
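A missing key typically only surfaces once the first request is made, so it can save time to check for it up front. The following is a rough sketch under stated assumptions: `parse_env_text` is a simplified, hypothetical parser for `KEY=VALUE` lines (not the loader `huqu` uses, which is not shown here), and `has_api_key` also accepts a key that is already set in the process environment.

```python
import os


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_api_key(env_values, name="OPENAI_API_KEY"):
    """True if the key has a non-empty value in the parsed file or the process environment."""
    return bool(env_values.get(name) or os.environ.get(name))


# Placeholder content standing in for the real .env file.
sample = "# local secrets\nOPENAI_API_KEY=sk-example\n"
print(has_api_key(parse_env_text(sample)))
```

To check a real file, you would pass its contents instead, for example `parse_env_text(open(".env").read())`.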

In [None]:
# First, we need to initialize the model.
from huqu.models.chatgpt import GPT4oMiniMLLM
multimodal_model = GPT4oMiniMLLM()

# Now we can kick off the caption generation.
from huqu.stages.caption_generation import CaptionGenerationStage
caption_generator = CaptionGenerationStage(multimodal_model)
caption_generator.process(dataset)

# Criteria Initialization
Now we need to create the criteria, i.e. the dimensions and their attributes. The unrefined criteria will be stored at the `unrefined_criteria_path` that you defined in `pipeline_config.yaml`.

In [None]:
# First, we need to initialize the model.
from huqu.models.chatgpt import GPT4oMiniLLM
text_model = GPT4oMiniLLM()

# Now we can initialize the criteria.
from huqu.stages.criteria_initilization import CriteriaInitializationStage
criteria_initializer = CriteriaInitializationStage(text_model)
criteria_initializer.process()

# Criteria Refinement
Now we need to refine the criteria. The refined criteria will be stored at the `refined_criteria_path` that you defined in `pipeline_config.yaml`.

In [None]:
from huqu.stages.criteria_refinement import CriteriaRefinementStage
criteria_refiner = CriteriaRefinementStage(text_model)
criteria = criteria_refiner.process()

# Image assignment stage
We have now successfully discovered and refined the relevant subgroups for the class defined in the pipeline config. It's time to assign each caption, which represents an image, to a subgroup. In practice, that means we assign each caption to one attribute per dimension.

The result will be stored at the `assignments_path` that you defined in `pipeline_config.yaml`.
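To make "one attribute per dimension" concrete, here is a small illustrative sketch. The criteria, attribute names, and the `validate_assignment` helper are all hypothetical. The actual structure `huqu` stores in the criteria and assignment files is not shown here and may differ.

```python
# Hypothetical criteria: each dimension maps to the attributes it allows.
criteria = {
    "size": ["small", "medium", "large"],
    "setting": ["indoor", "outdoor"],
}


def validate_assignment(assignment, criteria):
    """Return a list of problems; an empty list means one valid attribute per dimension."""
    problems = []
    for dimension, attributes in criteria.items():
        if dimension not in assignment:
            problems.append(f"missing dimension: {dimension}")
        elif assignment[dimension] not in attributes:
            problems.append(f"unknown attribute for {dimension}: {assignment[dimension]}")
    for dimension in assignment:
        if dimension not in criteria:
            problems.append(f"unexpected dimension: {dimension}")
    return problems


# A complete assignment for one caption passes without problems.
print(validate_assignment({"size": "small", "setting": "outdoor"}, criteria))
# An incomplete assignment with an unknown attribute is reported.
print(validate_assignment({"size": "tiny"}, criteria))
```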

In [None]:
from huqu.stages.image_assignment import ImageAssignmentStage

image_assignment = ImageAssignmentStage(text_model)
image_assignment.process()

## 🥳 You have discovered and assigned the subpopulations for the first class. Now it's time to run the pipeline for another class.

### To do that, we simply need to adapt the pipeline config

Please open `pipeline_config.yaml` and paste in the following parameters to run the pipeline for the class `cat`.


```yaml
dataset:
  path: "Bingsu/Cat_and_Dog"  # The path of the Hugging Face dataset
  config_name: "default"
  class_name: "cat"
  class_label: 0
  label_key: "labels"  # Some datasets have different keys for the labels, like "label" or "labels"
  num_train_samples: 2
  num_test_samples: 2
  main_subject: "cat"  # The main theme or subject of the dataset
  captions_path: "data/captions_cats.parquet"  # Path to store the captions DataFrame
  assignments_path: "data/assignments_cats.parquet"  # Path to store the assignments DataFrame
  unrefined_criteria_path: "data/unrefined_criteria_cats.json"  # Path to store initial criteria
  refined_criteria_path: "data/refined_criteria_cats.json"  # Path to store refined criteria
  format: "parquet"  # Format for storing DataFrames
  compression: "snappy"  # Compression method for parquet files

stages:
  criteria_init:
    batch_size: 2
  criteria_refinement:
    num_rounds: 1
    sample_size: 2
```
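Only a handful of fields differ between the dog and cat configs: the class name, the class label, the main subject, and the four output paths. If you run the pipeline for many classes, you could derive these fields programmatically instead of editing them by hand. The helper below is a hypothetical sketch that works on an already parsed config dictionary. It is not part of `huqu`, and you would still need to write the result back to `pipeline_config.yaml` yourself.

```python
import copy


def config_for_class(base_config, class_name, class_label, main_subject, file_suffix):
    """Return a copy of base_config with the class-specific dataset fields replaced."""
    config = copy.deepcopy(base_config)  # leave the original config untouched
    dataset = config["dataset"]
    dataset["class_name"] = class_name
    dataset["class_label"] = class_label
    dataset["main_subject"] = main_subject
    dataset["captions_path"] = f"data/captions_{file_suffix}.parquet"
    dataset["assignments_path"] = f"data/assignments_{file_suffix}.parquet"
    dataset["unrefined_criteria_path"] = f"data/unrefined_criteria_{file_suffix}.json"
    dataset["refined_criteria_path"] = f"data/refined_criteria_{file_suffix}.json"
    return config


# Abbreviated dog config; the real one contains more keys.
dog_config = {
    "dataset": {
        "path": "Bingsu/Cat_and_Dog",
        "class_name": "dog",
        "class_label": 1,
        "main_subject": "dogs",
        "captions_path": "data/captions_dogs.parquet",
    }
}
cat_config = config_for_class(dog_config, "cat", 0, main_subject="cat", file_suffix="cats")
print(cat_config["dataset"]["captions_path"])
```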

# Running the pipeline again
Now, we can simply run all pipeline stages again. They will pull from the updated pipeline config automatically.

In [None]:
dataset = dataset_loader.process()
caption_generator.process(dataset)
criteria_initializer.process()
criteria_refiner.process()
image_assignment.process()

# 🥳 You have now run the pipeline for two classes.

## But it's not the end yet

There is still a bit of work to do before we can analyze the results: we need to create a combined training set and a combined test set for the two classes we analyzed. We have created a script that does that for you.

The only thing you have to make sure of is that the `main` function of `merge_datasets.py` uses the right parameters. Setting them should be self-explanatory; you can infer them from `pipeline_config.yaml`.

In [19]:
from scripts.merge_datasets import main
main()


# Now it's time to analyze the subpopulations
## To do that, we first load the dataset again

In [20]:
parquet_file_test = "dogs_and_cats_test.parquet"
#parquet_file_test = "compost_and_metal_test.parquet"
#parquet_file_test = "fighting_and_laughing_test.parquet"

In [3]:
parquet_file_train = "dogs_and_cats_train.parquet"
#parquet_file_train = "compost_and_metal_train.parquet"
#parquet_file_train = "fighting_and_laughing_train.parquet"

## Combine dataset splits

In [4]:
import pandas as pd

df_train = pd.read_parquet(parquet_file_train)
df_test = pd.read_parquet(parquet_file_test)

In [5]:
df = {
    'train': df_train,
    'test': df_test
}

## Configure and Initialize the Analyzer


In [6]:
# Create optional custom configuration
# custom_config = {
#     'over_threshold': 0.6,       # Flag attributes appearing in more than X% of instances
#     'under_threshold': 0.05,     # Flag attributes appearing in less than X% of instances
#     'figure_size': (14, 8),      # Set larger figure size
#     'rare_threshold': 5          # Consider attributes appearing less than X times as rare
# }

In [7]:
# Initialize the analyzer with our data (the optional custom config above is not used here)
analyzer = DataAnalyzer(df)

In [None]:
# Generate a comprehensive report including intraclass and interclass analysis
analyzer.complete_report()

## Focused Analysis: Intra-Class Distribution

The `detect_outliers()` function identifies attributes that are overrepresented or underrepresented across all classes.

In [None]:
# Detect outliers in attribute distribution
outliers = analyzer.intra.detect_outliers()
display(outliers.head(10))
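As a rough illustration of what threshold-based flagging means, the sketch below labels an attribute as overrepresented when its share of instances exceeds `over_threshold` and as underrepresented when it falls below `under_threshold`. The defaults are borrowed from the commented-out `custom_config` earlier in this notebook. The function `flag_by_share` and the sample data are hypothetical, and this is not necessarily how `DataAnalyzer` computes its outliers.

```python
from collections import Counter


def flag_by_share(attribute_values, over_threshold=0.6, under_threshold=0.05):
    """Label attributes whose share of all instances is above or below the thresholds."""
    counts = Counter(attribute_values)
    total = sum(counts.values())
    flags = {}
    for attribute, count in counts.items():
        share = count / total
        if share > over_threshold:
            flags[attribute] = "overrepresented"
        elif share < under_threshold:
            flags[attribute] = "underrepresented"
    return flags


# Hypothetical "size" assignments for 100 dog captions.
sizes = ["large"] * 70 + ["medium"] * 27 + ["small"] * 3
print(flag_by_share(sizes))
```

In this made-up sample, `large` (70%) exceeds the upper threshold and `small` (3%) falls below the lower one, while `medium` is left unflagged.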

The `get_class_outliers()` function identifies any attributes that are overrepresented or underrepresented for a **specific class**. 

Adjust `over_threshold` and `under_threshold` if needed.

In [None]:
# Analyze a specific class
dog_outliers = analyzer.intra.get_class_outliers("dog")
display(dog_outliers.head(10))

In [None]:
# Visualize the results
analyzer.intra.plot_histogram(dog_outliers)

The `get_dimension_outliers()` function identifies any attributes that are overrepresented or underrepresented for a **specific class-dimension pair**. 

Adjust `over_threshold` and `under_threshold` if needed.

In [None]:
# Analyze a specific dimension within a class
dog_size_outliers = analyzer.intra.get_dimension_outliers("dog", "size")
display(dog_size_outliers)

In [None]:
# Visualize the results
analyzer.intra.plot_histogram(dog_size_outliers)

## Focused Analysis: Inter-Class Distribution

In [None]:
# Analyze rare attributes
rare_attrs = analyzer.inter.analyze_rare_attributes(threshold=4)
display(rare_attrs)
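The `threshold` here is an absolute count rather than a share: going by the `rare_threshold` comment in the custom configuration, an attribute counts as rare when it appears fewer than that many times. A minimal sketch of that idea follows. The `rare_attributes` helper and the sample values are hypothetical and not taken from `huqu`.

```python
from collections import Counter


def rare_attributes(attribute_values, threshold=4):
    """Return attributes that occur fewer than `threshold` times, with their counts."""
    counts = Counter(attribute_values)
    return {attribute: count for attribute, count in counts.items() if count < threshold}


# Hypothetical "setting" assignments.
settings = ["indoor"] * 10 + ["beach"] * 3 + ["snow"] * 4
print(rare_attributes(settings))
```

With `threshold=4`, `beach` (3 occurrences) is rare, while `snow` (exactly 4) is not.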

In [None]:
# Compare attributes across classes
missing_attrs = analyzer.inter.compare_class_attributes()
display(missing_attrs.head(10))

In [None]:
# Analyze attributes that are unique to a specific class
dog_missing = analyzer.inter.get_unique_to_class("dog")
display(dog_missing)
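Comparing attributes across classes boils down to set differences: an attribute is unique to a class if no other class has it, and missing from a class if at least one other class has it. The sketch below illustrates both directions on made-up data. The helper names and the `attributes_by_class` structure are assumptions for illustration and do not reflect the analyzer's internal representation.

```python
def _attributes_of_others(class_name, attributes_by_class):
    """Collect the attributes seen in every class except `class_name`."""
    others = set()
    for other_class, attributes in attributes_by_class.items():
        if other_class != class_name:
            others.update(attributes)
    return others


def attributes_unique_to(class_name, attributes_by_class):
    """Attributes that appear in `class_name` and in no other class."""
    own = set(attributes_by_class[class_name])
    return sorted(own - _attributes_of_others(class_name, attributes_by_class))


def attributes_missing_from(class_name, attributes_by_class):
    """Attributes that appear in some other class but never in `class_name`."""
    own = set(attributes_by_class[class_name])
    return sorted(_attributes_of_others(class_name, attributes_by_class) - own)


# Hypothetical attributes observed per class.
attributes_by_class = {
    "dog": ["indoor", "outdoor", "leash"],
    "cat": ["indoor", "outdoor", "windowsill"],
}
print(attributes_unique_to("dog", attributes_by_class))
print(attributes_missing_from("dog", attributes_by_class))
```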