chore: drop FAISS; cap compute (AUC, Contours, SeqLen, Subsets); handle empty tgt_data #194
Conversation
Pull Request Overview
This PR refactors data processing for compute operations by capping sample sizes and sequence lengths, handling empty target data columns, and removing the FAISS dependency in favor of sklearn’s NearestNeighbors. Key changes include removing "faiss-cpu" from pyproject.toml, adding safeguards and detailed logging in accuracy and coherence data preparation, and limiting sample sizes and sequence lengths in similarity and sampling functions.
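For reference, a minimal sketch of what the FAISS-to-sklearn swap can look like; the function name `nearest_neighbor_distances` and its signature are illustrative assumptions, not the actual code in `mostlyai/qa/_distances.py`:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearest_neighbor_distances(ori_embeds: np.ndarray, query_embeds: np.ndarray) -> np.ndarray:
    """Distance from each query embedding to its nearest original embedding.

    sklearn's exact NearestNeighbors stands in for a FAISS flat L2 index,
    avoiding the faiss-cpu dependency at moderate dataset sizes.
    """
    nn = NearestNeighbors(n_neighbors=1, metric="euclidean")
    nn.fit(ori_embeds)
    distances, _ = nn.kneighbors(query_embeds)
    return distances[:, 0]
```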
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pyproject.toml | Removed the faiss-cpu dependency |
| mostlyai/qa/reporting.py | Added error handling for empty target data columns and improved logging clarity |
| mostlyai/qa/_similarity.py | Limited samples to 10,000 in both mean AUC calculation and contour plotting (see the sketch after this table) |
| mostlyai/qa/_sampling.py | Introduced a cap on Q95 sequence length using a constant value |
| mostlyai/qa/_distances.py | Dropped FAISS usage and hardcoded groups count for column splitting |
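As a hedged illustration of the 10,000-row cap described for `mostlyai/qa/_similarity.py`: a helper along these lines would do it (the name `cap_samples` and the constant are hypothetical, not taken from the PR):

```python
import pandas as pd

MAX_SAMPLE_SIZE = 10_000  # cap applied before mean AUC and contour computations

def cap_samples(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Down-sample to at most MAX_SAMPLE_SIZE rows; smaller frames pass through."""
    if len(df) > MAX_SAMPLE_SIZE:
        return df.sample(n=MAX_SAMPLE_SIZE, random_state=seed)
    return df
```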
In `mostlyai/qa/reporting.py`:

```python
check_min_sample_size(syn_sample_size, 100, "synthetic")
check_min_sample_size(trn_sample_size, 90, "training")
if hol_tgt_data is not None:
    check_min_sample_size(hol_sample_size, 10, "holdout")
```
Copilot AI commented on May 13, 2025:
While handling empty target data for training and synthetic datasets, consider also checking hol_tgt_data (if available) for empty columns to maintain consistency in error handling.
Suggested change:

```python
    check_min_sample_size(hol_sample_size, 10, "holdout")
    if hol_tgt_data.shape[1] == 0:
        raise PrerequisiteNotMetError("Holdout data has no columns.")
```
In `mostlyai/qa/_sampling.py`:

```python
cap_sequence_length = 100
q95_sequence_length = trn_tgt_data.groupby(key).size().quantile(0.95)
syn_tgt_data = syn_tgt_data.groupby(key).sample(frac=1).groupby(key).head(n=q95_sequence_length)
trn_tgt_data = trn_tgt_data.groupby(key).sample(frac=1).groupby(key).head(n=q95_sequence_length)
max_sequence_length = min(q95_sequence_length, cap_sequence_length)
```
Copilot AI commented on May 13, 2025:
[nitpick] Consider defining cap_sequence_length as a module-level constant if similar caps might be used elsewhere, to improve maintainability and ease future adjustments.
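One possible shape for that refactor, assuming the constant name `CAP_SEQUENCE_LENGTH` (illustrative, not from the PR):

```python
# Module-level constant: a single knob for every sequence-length cap.
CAP_SEQUENCE_LENGTH = 100

def capped_q95_sequence_length(trn_tgt_data, key: str) -> int:
    """95th-percentile sequence length, capped at CAP_SEQUENCE_LENGTH."""
    q95 = int(trn_tgt_data.groupby(key).size().quantile(0.95))
    return min(q95, CAP_SEQUENCE_LENGTH)
```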
In `mostlyai/qa/_distances.py`:

```diff
  groups += split_columns_into_correlated_groups(ori_embeds, k=3)
  # check 3 random subsets of columns
- if ori_embeds.shape[1] > 10:
-     k = max(3, ori_embeds.shape[1] // 10)
-     groups += split_columns_into_random_groups(ori_embeds, k=k)
+ groups += split_columns_into_random_groups(ori_embeds, k=3)
```
Copilot AI commented on May 13, 2025:
[nitpick] Hardcoding the groups count to 3 for correlated and random groups may not suit all datasets; consider whether a dynamic calculation based on the number of features could yield more robust grouping.
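For concreteness, a dynamic count mirroring the logic this PR removed could look like the following sketch (the function name is hypothetical):

```python
def dynamic_group_count(n_columns: int, min_groups: int = 3, columns_per_group: int = 10) -> int:
    """Scale the number of column groups with the feature count,
    never dropping below min_groups."""
    return max(min_groups, n_columns // columns_per_group)

# e.g. 8 columns -> 3 groups; 57 columns -> 5 groups
```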