Skip to content

Conversation

kmruiz
Copy link
Collaborator

@kmruiz kmruiz commented Oct 8, 2025

Proposed changes

This PR adds a new session level service called VectorSearchEmbeddings, that is responsible of:

  1. Understanding if Atlas Search is available
  2. Retrieving which fields of a collection are embeddings, based on Atlas Search index definitions.
  3. Validating that, given a document, it's valid according to the embedding definitions.

Given that the embedding combinations and detection can be inaccurate, we also provide a new configuration option called "disableEmbeddingsValidation" that can be set up by CLI/Env and when true, the validation is bypassed.

This PR also introduces the embedding validation in the insertMany tool, so users can not randomly add data that can break
existing models or indexes unknowingly.

We depend on #628 to be merged, as it implements a method to detect if Atlas Search is available. Whenever the PR is merged, I'll refactor the method introduced there and use VectorSearchEmbeddings, so we have only one single place for search detection.

Checklist

@kmruiz kmruiz self-assigned this Oct 8, 2025
@kmruiz kmruiz marked this pull request as ready for review October 9, 2025 16:01
@kmruiz kmruiz requested a review from a team as a code owner October 9, 2025 16:01
@Copilot Copilot AI review requested due to automatic review settings October 9, 2025 16:01
Copilot

This comment was marked as outdated.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@kmruiz kmruiz marked this pull request as draft October 9, 2025 16:10
@kmruiz kmruiz requested a review from Copilot October 13, 2025 16:35
@kmruiz kmruiz marked this pull request as ready for review October 13, 2025 16:36
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

Comment on lines +162 to +167
try {
await provider.getSearchIndexes("test", "test");
return true;
} catch {
return false;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider that atlas local deployments might take time to boot up search management service. This might become a common case as soon as we provide local-atlas tools.


switch (definition.quantization) {
case "none":
return true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its worth adding this as a comment.

Comment on lines +58 to +60
} else {
return definition;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {
return definition;
}
}
return definition;

constructor(
private readonly config: UserConfig,
private readonly embeddings: Map<EmbeddingNamespace, VectorFieldIndexDefinition[]> = new Map(),
private readonly atlasSearchStatus: Map<string, boolean> = new Map()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to listen to connection changes (here or in Session) to clear the orphan entries.

const result = await provider.insertMany(database, collection, documents);

const embeddingValidations = new Set(
...(await Promise.all(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small note about Promise here, the only reason findFieldsWithWrongEmbeddings returns a Promise is because of isAtlasSearchAvailable (directly / indirectly).

When we're already in a tool call, is there a possibility to assume that isAtlasSearchAvailable is already populated and work off from that?

Just to be clear - what we have is not a problem. I see a small scope of improvement because I had to re-look what async task findFieldsWithWrongEmbeddings was doing.

Comment on lines +203 to +215
function extractInsertedIds(content: string): ObjectId[] {
expect(content).toContain("Documents were inserted successfully.");
expect(content).toContain("Inserted IDs:");

const match = content.match(/Inserted IDs:\s(.*)/);
const group = match?.[1];
return (
group
?.split(",")
.map((e) => e.trim())
.map((e) => ObjectId.createFromHexString(e)) ?? []
);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this to helpers instead? I noticed myself sometimes re-inventing the helpers even though they exist in the codebase but only because they were in specific test files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants