Skip to content

Conversation

gribnoysup
Copy link
Collaborator

@gribnoysup gribnoysup commented Sep 3, 2025

This patch adds the logic for automatically discovering relationships between collections following the algorithm we outlined in the "Infer relationships" scope. This behavior is behind a feature flag until we add the toggle to disable the discovery to the schema generation UI, this is the next chunk of work for the project.

I added some collections to the sample_airbnb database in our teams "Cluster 0" cluster if you want to test it. If you create a data model for this namespace, it should identify relationships between listingsAndReview, reviewers, and reviews collections:

image

@github-actions github-actions bot added the feat label Sep 3, 2025
@gribnoysup gribnoysup changed the title feat(data-modeling): implement automatic relationship inference algorithm feat(data-modeling): implement automatic relationship inference algorithm COMPASS-9776 Sep 3, 2025
@gribnoysup gribnoysup added the feature flagged PRs labeled with this label will not be included in the release notes of the next release label Sep 9, 2025
@gribnoysup gribnoysup marked this pull request as ready for review September 9, 2025 13:35
@gribnoysup gribnoysup requested a review from a team as a code owner September 9, 2025 13:35
Copy link
Member

@Anemy Anemy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left a couple comments, no blockers.

There are a few places we could check the abort signal to fail bit quicker if we want but I don't think we need to go into that now, similarly if we know it's a view already we don't need to do index checks. We aren't special casing views yet in data modeling though.

schema: MongoDBJSONSchema,
path: string[],
isArrayItem: boolean
) => void,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it's just me, but my first thought reading this function is "this callback could have been a generator". Obviously not really all that different in the end, but mentioning it in case you also feel that that would be more idiomatic

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea and worth doing, let me adjust this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored in 811d84e

@gribnoysup
Copy link
Collaborator Author

@addaleax @Anemy Thanks for the reviews! I'm rerunning some flaky tests, but otherwise I think I addressed all the feedback, do you want to take another look when you have a moment?

Copy link
Collaborator

@addaleax addaleax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A few comments but basically LGTM 🙂

});
return indexes
.filter((index): index is IndexDescriptionInfo & { name: string } => {
return !!index.name;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own understanding, in what type of situation would index.name be false-y?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know, that's the type we're getting directly from the driver, so I just assumed in some corner cases it might be possible and decided to account for that, can do a bit more digging 🙂

Comment on lines 165 to 169
).filter((value) => {
// They will be matching in a lot of cases, but the chances of both
// local and foreign field in a relationship being _id are very slim, so
// skipping
return value[0] !== '_id';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
).filter((value) => {
// They will be matching in a lot of cases, but the chances of both
// local and foreign field in a relationship being _id are very slim, so
// skipping
return value[0] !== '_id';
).filter((propPath) => {
// The types of _id will be matching in a lot of cases, but the chances of both
// local and foreign field in a relationship being _id are very slim, so
// skipping
return propPath[0] !== '_id';

or at least that's what we're trying to communicate here, right?

Unfortunately, I think this would be a valid pattern that we do care about, where _ids of multiple collections are references to each other as a way of reducing a single document's size (e.g. you have a busy collection with smaller docs and a larger collection with rarely-accessed and potentially large docs that could be referenced via $lookup if necessary)

I think it's fine to skip this for now but if we do, we probably want a TODO ticket to think a bit more about this situation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, actually I was really just assuming there would never be a case like that and was trying to optimise a bit, if you think it's a possible scenario I think I can just drop this filter, I don't think performance-wise it's a very big change

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the logic in c2c98ff, thanks for bringing this up!

gribnoysup and others added 2 commits September 11, 2025 10:37
@gribnoysup gribnoysup merged commit f33ac5e into main Sep 11, 2025
56 of 58 checks passed
@gribnoysup gribnoysup deleted the COMPASS-9776 branch September 11, 2025 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feat feature flagged PRs labeled with this label will not be included in the release notes of the next release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants