
Conversation

@abon-mostly (Contributor)

No description provided.

@abon-mostly abon-mostly requested a review from a team as a code owner November 12, 2025 14:26

# Step 3: Filter children to max_children_per_parent
# Remove rows where parent key is null or not in parent table
all_children_keys = all_children_keys[
Contributor

(Maybe I'm missing the full picture.)
Isn't the special IS_NULL column exactly meant to provide this information?
E.g., if the FK is null or has a value that doesn't exist in the parent table, the IS_NULL column will be True.
(ref: add_is_null_for_non_context_relation)

Contributor Author

Here we are sampling parents from the parent table, independently of the child table. The reason is that we also want to sample parents that are not referenced in the child table (negative pairs).
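A minimal sketch of the idea (hypothetical frame and column names, not the PR's actual code): sample parents straight from the parent table, then label each one by whether it appears in the child FK column, so parents without any children can serve as negative pairs, which the child-side IS_NULL flag alone cannot supply.

import pandas as pd

# Hypothetical data: `parents` has one row per parent key,
# `children` references parents via a foreign-key column.
parents = pd.DataFrame({"id": [1, 2, 3, 4]})
children = pd.DataFrame({"parent_id": [1, 1, 2]})

# Sample from the parent table itself, not from the child FK column,
# so parents 3 and 4 (which have no children) can be drawn as negatives.
sampled_parents = parents.sample(n=3)

# True -> positive pair (parent has children), False -> negative pair.
sampled_parents["has_children"] = sampled_parents["id"].isin(children["parent_id"])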


# Step 5: Keep max_children for training
if num_children > max_children:
    filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)
Contributor

The user can set the random state in the generator config, so I think we shouldn't fix a value here?

# Re-apply filtering: max children per parent, then max total children
tgt_data = tgt_data.groupby(tgt_parent_key, as_index=False).head(max_children_per_parent)
if len(tgt_data) > max_children:
    tgt_data = tgt_data.sample(n=max_children, random_state=42).reset_index(drop=True)
Contributor

Same question for the random_state.


# Step 5: Keep max_children for training
if num_children > max_children:
    filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)
Contributor

Why make this sampling deterministic?
If reproducibility is needed, this can be set at the GeneratorConfig level (random_state=42).
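One possible shape for this (a sketch only; the parameter name and plumbing are illustrative, assuming the configured random state can be threaded into this step rather than hardcoded):

import pandas as pd

def limit_children(filtered_children_keys: pd.DataFrame,
                   max_children: int,
                   random_state: int | None = None) -> pd.DataFrame:
    # `random_state` would be forwarded from the generator config;
    # None keeps the downsampling non-deterministic by default.
    if len(filtered_children_keys) > max_children:
        filtered_children_keys = (
            filtered_children_keys
            .sample(n=max_children, random_state=random_state)
            .reset_index(drop=True)
        )
    return filtered_children_keys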

# Re-apply filtering: max children per parent, then max total children
tgt_data = tgt_data.groupby(tgt_parent_key, as_index=False).head(max_children_per_parent)
if len(tgt_data) > max_children:
    tgt_data = tgt_data.sample(n=max_children, random_state=42).reset_index(drop=True)
Contributor

Same question as above.

# Step 5: Keep max_children for training
if num_children > max_children:
    filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)
    num_children = max_children
Contributor

We don't use num_children anywhere afterwards. It might make sense to merge steps 4 & 5 into one.
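Assuming step 4 only computes num_children = len(filtered_children_keys) (it is not visible in this hunk), the two steps could collapse into a single block along these lines:

# Steps 4+5 merged: downsample directly, without tracking num_children,
# since the count is not used afterwards.
if len(filtered_children_keys) > max_children:
    filtered_children_keys = (
        filtered_children_keys
        .sample(n=max_children, random_state=42)
        .reset_index(drop=True)
    )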

parent_keys_to_fetch = list(parent_keys_with_children)
tgt_data = tgt_table.read_data(
    columns=tgt_columns,
    where={tgt_parent_key: parent_keys_to_fetch},
Contributor

@shuangwu5 Nov 13, 2025

I guess it's safe to feed parent_keys_with_children directly here because of this:
https://github.com/mostly-ai/mostlyai/blob/main/mostlyai/sdk/_data/db/base.py#L943-L944

@michdr michdr requested review from michdr November 13, 2025 11:43
@abon-mostly abon-mostly merged commit 0c252fa into main Nov 13, 2025
9 checks passed
@abon-mostly abon-mostly deleted the fk-model-dataset-sizing branch November 13, 2025 12:01
