feat: improved fk model dataset sizing #665

abon-mostly · 2025-11-12T14:26:44Z

No description provided.

shuangwu5 · 2025-11-13T10:10:41Z

mostlyai/sdk/_data/non_context.py

+
+    # Step 3: Filter children to max_children_per_parent
+    # Remove rows where parent key is null or not in parent table
+    all_children_keys = all_children_keys[


(Maybe I'm missing the full picture.)
Isn't the special IS_NULL column meant to provide exactly this information?
E.g., if the FK is null or has a value that doesn't exist in the parent table, the IS_NULL column will be True.
(ref: add_is_null_for_non_context_relation)

Here we are sampling parents from the parent table, independently of the child table. The reason is that we also want to sample parents that do not appear in the children table (negative pairs).
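
To make the distinction in this exchange concrete, here is a toy sketch with made-up tables (the column names `id` and `parent_id` are assumptions, and this is not the SDK's implementation). It shows a null-like FK flag on the child side next to parents drawn from the parent table alone, which is how childless parents can end up in the sample.

```python
import pandas as pd

# Toy data: parents 3 and 4 have no children; one child FK is missing, one is dangling.
parents = pd.DataFrame({"id": [1, 2, 3, 4]})
children = pd.DataFrame({"parent_id": [1, 1, 2, None, 99]})

# Child side: an FK counts as "null" if it is missing or has no match in the parent table.
fk_is_null = ~children["parent_id"].isin(parents["id"])

# Parent side: draw parents from the parent table only, ignoring the child table,
# so parents without any children can be part of the sample.
sampled_parents = parents.sample(frac=1.0, random_state=0).reset_index(drop=True)
has_child = sampled_parents["id"].isin(children["parent_id"].dropna())
childless_ids = sorted(sampled_parents.loc[~has_child, "id"])
```

In this toy setup the flag answers a question about each child row, while the childless parents only become visible when sampling starts from the parent table.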

shuangwu5 · 2025-11-13T10:20:35Z

mostlyai/sdk/_data/non_context.py

+
+    # Step 5: Keep max_children for training
+    if num_children > max_children:
+        filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)


The user can set the random state in the generator config, so I think we shouldn't fix a value here?
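
A minimal sketch of what this suggestion could look like, assuming the caller passes the seed down explicitly. The helper name `cap_rows` and its signature are hypothetical; how the SDK actually propagates the generator-level random state is not shown in this thread.

```python
import pandas as pd


def cap_rows(df: pd.DataFrame, max_rows: int, random_state=None) -> pd.DataFrame:
    """Downsample df to at most max_rows rows (hypothetical helper).

    random_state=None leaves the sampling non-deterministic; passing a seed
    (for example one taken from the user's configuration) makes it reproducible.
    """
    if len(df) <= max_rows:
        return df.reset_index(drop=True)
    return df.sample(n=max_rows, random_state=random_state).reset_index(drop=True)
```

With this shape, reproducibility is opt-in: the same seed gives the same subset, and omitting it gives a fresh draw on each call.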

shuangwu5 · 2025-11-13T10:23:48Z

mostlyai/sdk/_data/non_context.py

+    # Re-apply filtering: max children per parent, then max total children
+    tgt_data = tgt_data.groupby(tgt_parent_key, as_index=False).head(max_children_per_parent)
+    if len(tgt_data) > max_children:
+        tgt_data = tgt_data.sample(n=max_children, random_state=42).reset_index(drop=True)


Same question for the random_state here.
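
For reference, the two-stage filter in this hunk can be sketched as a standalone function with the seed exposed as a parameter instead of a fixed 42. The function name and parameters are hypothetical and only mirror the variable names in the diff.

```python
import pandas as pd


def limit_children(
    df: pd.DataFrame,
    parent_key: str,
    max_children_per_parent: int,
    max_children: int,
    random_state=None,
) -> pd.DataFrame:
    """Cap rows per parent first, then cap the total (hypothetical helper)."""
    # Stage 1: keep the first max_children_per_parent rows of each parent group,
    # in the current row order.
    capped = df.groupby(parent_key, as_index=False).head(max_children_per_parent)
    # Stage 2: if the result is still too large, downsample to max_children rows.
    if len(capped) > max_children:
        capped = capped.sample(n=max_children, random_state=random_state)
    return capped.reset_index(drop=True)
```

The order matters: capping per parent first prevents a few parents with many children from dominating the rows that survive the total cap.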

michdr · 2025-11-13T10:31:05Z

mostlyai/sdk/_data/non_context.py

+
+    # Step 5: Keep max_children for training
+    if num_children > max_children:
+        filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)


Why make this sampling deterministic?
If reproducibility is needed, it can be set at the GeneratorConfig level (random_state=42).

michdr · 2025-11-13T10:32:24Z

mostlyai/sdk/_data/non_context.py

+    # Re-apply filtering: max children per parent, then max total children
+    tgt_data = tgt_data.groupby(tgt_parent_key, as_index=False).head(max_children_per_parent)
+    if len(tgt_data) > max_children:
+        tgt_data = tgt_data.sample(n=max_children, random_state=42).reset_index(drop=True)


Same question as above.

michdr · 2025-11-13T10:42:36Z

mostlyai/sdk/_data/non_context.py

+    # Step 5: Keep max_children for training
+    if num_children > max_children:
+        filtered_children_keys = filtered_children_keys.sample(n=max_children, random_state=42).reset_index(drop=True)
+        num_children = max_children


We don't use num_children anywhere afterwards. It might make sense to merge steps 4 and 5 into one.

shuangwu5 · 2025-11-13T10:43:19Z

mostlyai/sdk/_data/non_context.py

+    parent_keys_to_fetch = list(parent_keys_with_children)
+    tgt_data = tgt_table.read_data(
+        columns=tgt_columns,
+        where={tgt_parent_key: parent_keys_to_fetch},


I guess it's safe to feed parent_keys_with_children in directly here, without converting it to a list first, because of this:
https://github.com/mostly-ai/mostlyai/blob/main/mostlyai/sdk/_data/db/base.py#L943-L944
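
The linked lines are not reproduced in this thread, so the following is only a toy stand-in for the idea behind the comment: if the read path itself normalizes the filter values, the caller does not need the explicit `list(...)` conversion. The function `filter_by_keys` is hypothetical and operates on an in-memory DataFrame rather than the SDK's table abstraction.

```python
from collections.abc import Iterable

import pandas as pd


def filter_by_keys(df: pd.DataFrame, column: str, values) -> pd.DataFrame:
    """Keep rows whose `column` value is in `values` (hypothetical helper)."""
    # Wrap scalars and strings so a single key is treated as one value.
    if isinstance(values, (str, bytes)) or not isinstance(values, Iterable):
        values = [values]
    # Normalize any other iterable (set, tuple, Index, ...) to a list.
    return df[df[column].isin(list(values))].reset_index(drop=True)
```

Under that assumption, passing a set of parent keys and passing the equivalent list select the same rows.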

abon-mostly added 2 commits November 12, 2025 09:23

wip

22119b8

wip

6f2162f

abon-mostly requested a review from a team as a code owner November 12, 2025 14:26

abon-mostly added 2 commits November 12, 2025 16:41

wip

3a56a81

wip

7b912ed

abon-mostly assigned michdr and shuangwu5 Nov 13, 2025

update engine version

0efd4d4

shuangwu5 reviewed Nov 13, 2025

michdr reviewed Nov 13, 2025

shuangwu5 reviewed Nov 13, 2025

abon-mostly added 2 commits November 13, 2025 12:31

pr comments

3a261b2

pr comments

30eab88

michdr requested review from michdr November 13, 2025 11:43

michdr approved these changes Nov 13, 2025

abon-mostly merged commit 0c252fa into main Nov 13, 2025
9 checks passed

abon-mostly deleted the fk-model-dataset-sizing branch November 13, 2025 12:01

4 participants
