feat: improved fk model dataset sizing #665
# Step 3: Filter children to max_children_per_parent
# Remove rows where parent key is null or not in parent table
all_children_keys = all_children_keys[
(Maybe I'm missing the full picture.)
Isn't the special IS_NULL column meant to provide exactly this information?
E.g., if the FK is null or has a value that doesn't exist in the parent table, the IS_NULL column will be True.
(ref: add_is_null_for_non_context_relation)
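To make the question concrete, here is a minimal sketch of the kind of flag described above: True when the FK is null or has no match in the parent table. The helper name `flag_missing_parent` and the toy data are hypothetical; this is not the actual implementation of add_is_null_for_non_context_relation, which is not shown in this thread.

```python
import pandas as pd


def flag_missing_parent(children: pd.DataFrame, fk_col: str, parent_keys: pd.Series) -> pd.Series:
    """Return True where the FK is null or has no match among parent_keys.

    Hypothetical helper for illustration only.
    """
    fk = children[fk_col]
    # A null FK never counts as a match, even if the parent keys contain nulls.
    return fk.isna() | ~fk.isin(parent_keys.dropna())


# Toy data: two valid FKs, one null FK, one dangling FK (99 has no parent).
children = pd.DataFrame({"parent_id": [1, 2, None, 99]})
parents = pd.Series([1, 2, 3])

is_null = flag_missing_parent(children, "parent_id", parents)
print(is_null.tolist())
```

If such a flag is already available on the child rows, the explicit filter in Step 3 would be repeating the same check.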
Here we sample parents from the parent table independently of the child table, because we also want to sample parents that do not appear in the child table (negative pairs).
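A rough sketch of that idea, assuming pandas and toy key columns: parents are drawn from the parent table alone, so parents without any child rows can end up in the sample. The helper `sample_parents` and the column names are hypothetical, and how the PR actually builds negative pairs is not shown here.

```python
import pandas as pd


def sample_parents(parent_keys: pd.Series, n: int, random_state=None) -> pd.Series:
    """Sample up to n parent keys without looking at the child table (hypothetical helper)."""
    n = min(n, len(parent_keys))
    return parent_keys.sample(n=n, random_state=random_state).reset_index(drop=True)


parent_keys = pd.Series([1, 2, 3, 4, 5], name="id")
child_fks = pd.Series([1, 1, 2], name="parent_id")

sampled = sample_parents(parent_keys, n=5, random_state=0)
# Parents that never occur as an FK are the candidates for negative pairs.
has_children = sampled.isin(child_fks)
negatives = sampled[~has_children]
print(sorted(negatives.tolist()))
```

Had the parents been derived from the child table instead, keys 3, 4 and 5 could never be sampled.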
mostlyai/sdk/_data/non_context.py
Outdated
# Step 5: Keep max_children for training
if num_children > max_children:
    filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)
The user can set the random state in the generator config, so I think we shouldn't hard-code a value here.
mostlyai/sdk/_data/non_context.py
Outdated
# Re-apply filtering: max children per parent, then max total children
tgt_data = tgt_data.groupby(tgt_parent_key, as_index=False).head(max_children_per_parent)
if len(tgt_data) > max_children:
    tgt_data = tgt_data.sample(n=max_children, random_state=42).reset_index(drop=True)
Same question about the random_state here.
mostlyai/sdk/_data/non_context.py
Outdated
# Step 5: Keep max_children for training
if num_children > max_children:
    filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)
Why make this sampling deterministic?
If reproducibility is needed, it can be set at the GeneratorConfig level (random_state=42).
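One possible shape for this, as a sketch only: take the seed as a parameter instead of hard-coding 42, and let the caller decide whether the draw is reproducible. The helper `cap_rows` is hypothetical, and how a random_state set on GeneratorConfig would reach this code path is an assumption, not something shown in this thread.

```python
import pandas as pd


def cap_rows(df: pd.DataFrame, max_rows: int, random_state=None) -> pd.DataFrame:
    """Down-sample df to at most max_rows rows (hypothetical helper).

    random_state=None leaves the draw unseeded; passing an int makes it reproducible.
    """
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=random_state).reset_index(drop=True)


keys = pd.DataFrame({"parent_id": range(10)})

unseeded = cap_rows(keys, max_rows=3)                   # may differ between runs
seeded_a = cap_rows(keys, max_rows=3, random_state=7)   # reproducible on request
seeded_b = cap_rows(keys, max_rows=3, random_state=7)
print(seeded_a.equals(seeded_b))
```

With this shape the default stays non-deterministic, and reproducibility becomes a decision made by whoever supplies the seed.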
mostlyai/sdk/_data/non_context.py
Outdated
# Re-apply filtering: max children per parent, then max total children
tgt_data = tgt_data.groupby(tgt_parent_key, as_index=False).head(max_children_per_parent)
if len(tgt_data) > max_children:
    tgt_data = tgt_data.sample(n=max_children, random_state=42).reset_index(drop=True)
Same question as above.
mostlyai/sdk/_data/non_context.py
Outdated
# Step 5: Keep max_children for training
if num_children > max_children:
    filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)
    num_children = max_children
We don't use num_children anywhere afterwards. It might make sense to merge steps 4 and 5 into one.
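A sketch of what a merged step could look like. Step 4 is not shown in this thread, so this assumes it does little more than compute num_children; the helper `limit_children` and the toy columns are hypothetical. The row count is checked inline with `len(...)`, so no separate counter has to be kept in sync.

```python
import pandas as pd


def limit_children(
    keys: pd.DataFrame,
    parent_col: str,
    max_children_per_parent: int,
    max_children: int,
    random_state=None,
) -> pd.DataFrame:
    """Apply the per-parent cap, then the total cap, in one place (hypothetical helper)."""
    # Keep at most max_children_per_parent rows for each parent key.
    limited = keys.groupby(parent_col, as_index=False).head(max_children_per_parent)
    # Check the size inline instead of tracking num_children separately.
    if len(limited) > max_children:
        limited = limited.sample(n=max_children, random_state=random_state)
    return limited.reset_index(drop=True)


keys = pd.DataFrame({"parent_id": [1, 1, 1, 2, 2, 3], "child_id": range(6)})
capped = limit_children(keys, "parent_id", max_children_per_parent=2, max_children=4)
print(len(capped))
```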
mostlyai/sdk/_data/non_context.py
Outdated
parent_keys_to_fetch = list(parent_keys_with_children)
tgt_data = tgt_table.read_data(
    columns=tgt_columns,
    where={tgt_parent_key: parent_keys_to_fetch},
I guess it's safe to pass parent_keys_with_children directly here, without the list() conversion, because of this:
https://github.com/mostly-ai/mostlyai/blob/main/mostlyai/sdk/_data/db/base.py#L943-L944