chore: drop FAISS; cap compute (AUC, Contours, SeqLen, Subsets); handle empty tgt_data #194
Conversation
Pull Request Overview
This PR refactors data processing for compute operations by capping sample sizes and sequence lengths, handling empty target data columns, and removing the FAISS dependency in favor of sklearn’s NearestNeighbors. Key changes include removing "faiss-cpu" from pyproject.toml, adding safeguards and detailed logging in accuracy and coherence data preparation, and limiting sample sizes and sequence lengths in similarity and sampling functions.
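For reference, a minimal sketch of what the FAISS-to-sklearn swap can look like; the function name `nearest_neighbor_distances` and its signature are illustrative assumptions, not the actual code in `mostlyai/qa/_distances.py`:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearest_neighbor_distances(ori_embeds: np.ndarray, query_embeds: np.ndarray) -> np.ndarray:
    """Distance from each query embedding to its nearest original embedding.

    sklearn's exact NearestNeighbors stands in for a FAISS flat L2 index,
    avoiding the faiss-cpu dependency at moderate dataset sizes.
    """
    nn = NearestNeighbors(n_neighbors=1, metric="euclidean")
    nn.fit(ori_embeds)
    distances, _ = nn.kneighbors(query_embeds)
    return distances[:, 0]
```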
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pyproject.toml | Removed the faiss-cpu dependency |
| mostlyai/qa/reporting.py | Added error handling for empty target data columns and improved logging clarity |
| mostlyai/qa/_similarity.py | Limited samples to 10,000 in both mean AUC calculation and contour plotting (see the sketch after this table) |
| mostlyai/qa/_sampling.py | Introduced a cap on Q95 sequence length using a constant value |
| mostlyai/qa/_distances.py | Dropped FAISS usage and hardcoded groups count for column splitting |
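As a hedged illustration of the 10,000-row cap described for `mostlyai/qa/_similarity.py`: a helper along these lines would do it (the name `cap_samples` and the constant are hypothetical, not taken from the PR):

```python
import pandas as pd

MAX_SAMPLE_SIZE = 10_000  # cap applied before mean AUC and contour computations

def cap_samples(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Down-sample to at most MAX_SAMPLE_SIZE rows; smaller frames pass through."""
    if len(df) > MAX_SAMPLE_SIZE:
        return df.sample(n=MAX_SAMPLE_SIZE, random_state=seed)
    return df
```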
In `mostlyai/qa/reporting.py`:

```python
check_min_sample_size(syn_sample_size, 100, "synthetic")
check_min_sample_size(trn_sample_size, 90, "training")
if hol_tgt_data is not None:
    check_min_sample_size(hol_sample_size, 10, "holdout")
```
Copilot AI commented on May 13, 2025:
While handling empty target data for training and synthetic datasets, consider also checking hol_tgt_data (if available) for empty columns to maintain consistency in error handling.
Suggested change:

```python
    check_min_sample_size(hol_sample_size, 10, "holdout")
    if hol_tgt_data.shape[1] == 0:
        raise PrerequisiteNotMetError("Holdout data has no columns.")
```
In `mostlyai/qa/_sampling.py`:

```python
cap_sequence_length = 100
q95_sequence_length = trn_tgt_data.groupby(key).size().quantile(0.95)
syn_tgt_data = syn_tgt_data.groupby(key).sample(frac=1).groupby(key).head(n=q95_sequence_length)
trn_tgt_data = trn_tgt_data.groupby(key).sample(frac=1).groupby(key).head(n=q95_sequence_length)
max_sequence_length = min(q95_sequence_length, cap_sequence_length)
```
Copilot AI commented on May 13, 2025:
[nitpick] Consider defining cap_sequence_length as a module-level constant if similar caps might be used elsewhere, to improve maintainability and ease future adjustments.
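One possible shape for that refactor, assuming the constant name `CAP_SEQUENCE_LENGTH` (illustrative, not from the PR):

```python
# Module-level constant: a single knob for every sequence-length cap.
CAP_SEQUENCE_LENGTH = 100

def capped_q95_sequence_length(trn_tgt_data, key: str) -> int:
    """95th-percentile sequence length, capped at CAP_SEQUENCE_LENGTH."""
    q95 = int(trn_tgt_data.groupby(key).size().quantile(0.95))
    return min(q95, CAP_SEQUENCE_LENGTH)
```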
In `mostlyai/qa/_distances.py`:

```diff
  groups += split_columns_into_correlated_groups(ori_embeds, k=3)
  # check 3 random subsets of columns
- if ori_embeds.shape[1] > 10:
-     k = max(3, ori_embeds.shape[1] // 10)
-     groups += split_columns_into_random_groups(ori_embeds, k=k)
+ groups += split_columns_into_random_groups(ori_embeds, k=3)
```
Copilot AI commented on May 13, 2025:
[nitpick] Hardcoding the groups count to 3 for correlated and random groups may not suit all datasets; consider whether a dynamic calculation based on the number of features could yield more robust grouping.
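For concreteness, a dynamic count mirroring the logic this PR removed could look like the following sketch (the function name is hypothetical):

```python
def dynamic_group_count(n_columns: int, min_groups: int = 3, columns_per_group: int = 10) -> int:
    """Scale the number of column groups with the feature count,
    never dropping below min_groups."""
    return max(min_groups, n_columns // columns_per_group)

# e.g. 8 columns -> 3 groups; 57 columns -> 5 groups
```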