
Conversation

@mplatzer (Contributor)

No description provided.

@mplatzer mplatzer changed the title tmp chore: drop FAISS; cap subset compute; May 12, 2025
@mplatzer mplatzer changed the title chore: drop FAISS; cap subset compute; chore: drop FAISS; cap subset compute; cap AUC compute May 12, 2025
@mplatzer mplatzer changed the title chore: drop FAISS; cap subset compute; cap AUC compute chore: drop FAISS; cap compute (AUC, Contours, SeqLen, Subsets) May 13, 2025
@mplatzer mplatzer changed the title chore: drop FAISS; cap compute (AUC, Contours, SeqLen, Subsets) chore: drop FAISS; cap compute (AUC, Contours, SeqLen, Subsets); handle empty tgt_data May 13, 2025
@mplatzer mplatzer marked this pull request as ready for review May 13, 2025 00:45
@mplatzer mplatzer requested a review from Copilot May 13, 2025 00:45

Copilot AI left a comment


Pull Request Overview

This PR refactors data processing for compute operations by capping sample sizes and sequence lengths, handling empty target data columns, and removing the FAISS dependency in favor of sklearn’s NearestNeighbors. Key changes include removing "faiss-cpu" from pyproject.toml, adding safeguards and detailed logging in accuracy and coherence data preparation, and limiting sample sizes and sequence lengths in similarity and sampling functions.
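For context, a drop-in replacement of a FAISS index with scikit-learn's `NearestNeighbors` might look like the following sketch. This is illustrative only — the function name, metric, and array shapes are assumptions, not code taken from the PR.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearest_neighbor_distances(syn_embeds: np.ndarray, ori_embeds: np.ndarray, k: int = 1) -> np.ndarray:
    # Fit an index on the original embeddings, then query with the
    # synthetic embeddings to get distances to the k closest originals.
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean")
    nn.fit(ori_embeds)
    distances, _ = nn.kneighbors(syn_embeds)
    return distances


rng = np.random.default_rng(0)
ori = rng.normal(size=(1000, 8))
syn = rng.normal(size=(200, 8))
d = nearest_neighbor_distances(syn, ori)
print(d.shape)  # (200, 1)
```

Unlike FAISS, this needs no extra native dependency, at the cost of slower queries on very large embedding sets.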

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| pyproject.toml | Removed the `faiss-cpu` dependency |
| mostlyai/qa/reporting.py | Added error handling for empty target data columns and improved logging clarity |
| mostlyai/qa/_similarity.py | Limited samples to 10000 in both mean AUC calculation and contour plotting |
| mostlyai/qa/_sampling.py | Introduced a cap on Q95 sequence length using a constant value |
| mostlyai/qa/_distances.py | Dropped FAISS usage and hardcoded groups count for column splitting |
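The 10,000-sample caps in `_similarity.py` boil down to a subsample-before-compute guard. A minimal sketch, assuming a pandas DataFrame input (the function and constant names here are illustrative, not from the PR):

```python
import pandas as pd

MAX_SAMPLE_SIZE = 10_000  # illustrative cap, matching the 10000 mentioned above


def cap_samples(df: pd.DataFrame, max_n: int = MAX_SAMPLE_SIZE, seed: int = 42) -> pd.DataFrame:
    # Downsample only when the frame exceeds the cap, to bound the
    # cost of AUC / contour computations on large inputs.
    if len(df) > max_n:
        return df.sample(n=max_n, random_state=seed)
    return df


df = pd.DataFrame({"x": range(25_000)})
print(len(cap_samples(df)))  # 10000
```

Fixing the random seed keeps repeated QA runs on the same data comparable.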

```python
check_min_sample_size(syn_sample_size, 100, "synthetic")
check_min_sample_size(trn_sample_size, 90, "training")
if hol_tgt_data is not None:
    check_min_sample_size(hol_sample_size, 10, "holdout")
```

Copilot AI May 13, 2025


While handling empty target data for training and synthetic datasets, consider also checking hol_tgt_data (if available) for empty columns to maintain consistency in error handling.

Suggested change

```diff
 check_min_sample_size(hol_sample_size, 10, "holdout")
+if hol_tgt_data.shape[1] == 0:
+    raise PrerequisiteNotMetError("Holdout data has no columns.")
```

Comment on lines +303 to +305
```python
cap_sequence_length = 100
q95_sequence_length = trn_tgt_data.groupby(key).size().quantile(0.95)
syn_tgt_data = syn_tgt_data.groupby(key).sample(frac=1).groupby(key).head(n=q95_sequence_length)
trn_tgt_data = trn_tgt_data.groupby(key).sample(frac=1).groupby(key).head(n=q95_sequence_length)
max_sequence_length = min(q95_sequence_length, cap_sequence_length)
```

Copilot AI May 13, 2025


[nitpick] Consider defining cap_sequence_length as a module-level constant if similar caps might be used elsewhere, to improve maintainability and ease future adjustments.
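One way to act on this suggestion, sketched under the assumption that the cap lives in `mostlyai/qa/_sampling.py` (the constant and function names below are illustrative, not from the codebase):

```python
import pandas as pd

# Module-level constant, so related caps can be found and tuned in one place.
CAP_SEQUENCE_LENGTH = 100


def capped_q95_sequence_length(tgt_data: pd.DataFrame, key: str) -> int:
    # 95th percentile of per-group sequence lengths, bounded by the cap.
    q95 = tgt_data.groupby(key).size().quantile(0.95)
    return int(min(q95, CAP_SEQUENCE_LENGTH))


df = pd.DataFrame({"id": [1] * 5 + [2] * 300, "v": range(305)})
print(capped_q95_sequence_length(df, "id"))  # 100
```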

Comment on lines +78 to +81
```diff
 groups += split_columns_into_correlated_groups(ori_embeds, k=3)
 # check 3 random subsets of columns
-if ori_embeds.shape[1] > 10:
-    k = max(3, ori_embeds.shape[1] // 10)
-    groups += split_columns_into_random_groups(ori_embeds, k=k)
+groups += split_columns_into_random_groups(ori_embeds, k=3)
```

Copilot AI May 13, 2025


[nitpick] Hardcoding the group count to 3 for both correlated and random groups may not suit all datasets; consider whether a dynamic calculation based on the number of features could yield more robust grouping.
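A dynamic variant along the lines Copilot hints at could scale the group count with the number of embedding columns, mirroring the removed `max(3, shape[1] // 10)` logic (purely illustrative; the PR intentionally fixes `k=3`):

```python
def dynamic_group_count(n_columns: int, base: int = 3, columns_per_group: int = 10) -> int:
    # At least `base` groups, growing by roughly one group per
    # `columns_per_group` embedding columns.
    return max(base, n_columns // columns_per_group)


print(dynamic_group_count(8))   # 3
print(dynamic_group_count(45))  # 4
```

The trade-off is predictable compute: a fixed `k=3` bounds runtime regardless of embedding width, which matches this PR's cap-compute theme.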

@mplatzer mplatzer merged commit ed50129 into main May 13, 2025
6 checks passed
@mplatzer mplatzer deleted the optim branch May 13, 2025 00:47