Integration with MongoDB #535

Merged
merged 12 commits into from Feb 8, 2024

Conversation

Martin7-1
Contributor

@Martin7-1 Martin7-1 commented Jan 23, 2024

See the original PR #254. There are four main differences:

  1. Use Testcontainers and a MongoDB Atlas local deployment for testing.
  2. Create the collection and index when the MongoDbEmbeddingStore is initialized, rather than when the first embedding is added.
  3. Replace BsonUtils with org.bson.Document for building the index mapping.
  4. Rename langchain4j-mongodb to langchain4j-mongodb-atlas.

All local deployment tests pass, but the cloud tests have not been run yet because I ran into network problems when connecting to MongoDB Atlas. (I don't think this matters, because the local deployment behaves the same as the cloud one; its purpose is development and testing.)

@langchain4j
Owner

@Martin7-1 if this module can be used for local MongoDB as well, maybe we should remove "-atlas" part from the name? Now it assumes that only atlas can be used...

langchain4j
langchain4j previously approved these changes Jan 25, 2024
Owner

@langchain4j langchain4j left a comment

@Martin7-1 thanks a lot!

@Martin7-1
Contributor Author

> @Martin7-1 if this module can be used for local MongoDB as well, maybe we should remove "-atlas" part from the name? Now it assumes that only atlas can be used...

I don't think it can be used with local MongoDB. The LangChain Python version also names it mongodb-atlas (see the LangChain doc). The local IT uses an Atlas local deployment, which is still not vanilla MongoDB; the Atlas local deployment is intended only for development and testing, and is not recommended for production.

@langchain4j
Owner

> I don't think it can be used with local MongoDB. The LangChain Python version also names it mongodb-atlas (see the LangChain doc). The local IT uses an Atlas local deployment, which is still not vanilla MongoDB; the Atlas local deployment is intended only for development and testing, and is not recommended for production.

OK, got it, thanks for the explanation!

@langchain4j
Owner

I am not sure what the exact reason is, but both tests fail for me at the moment.
E.g. MongoDbEmbeddingStoreLocalIT:
com.mongodb.MongoCommandException: Command failed with error 8 (UnknownError): '"indexes[0].definition.mappings.fields.text.type" must be one of [autocomplete, boolean, date, dateFacet, document, embeddedDocuments, geo, knnVector, number, numberFacet, objectId, sortableDateBetaV1, sortableNumberBetaV1, sortableStringBetaV1, string, stringFacet, token]' on server localhost:52237. The full response is {"ok": 0.0, "errmsg": ""indexes[0].definition.mappings.fields.text.type" must be one of [autocomplete, boolean, date, dateFacet, document, embeddedDocuments, geo, knnVector, number, numberFacet, objectId, sortableDateBetaV1, sortableNumberBetaV1, sortableStringBetaV1, string, stringFacet, token]", "code": 8, "codeName": "UnknownError", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1706511172, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "eNUE2tp0aXe1rUDNF5POEY4KJO0=", "subType": "00"}}, "keyId": 7329409656818761730}}, "operationTime": {"$timestamp": {"t": 1706511172, "i": 1}}}
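
One way to catch this kind of failure before the request reaches the server is to check each field type against the list in the error message. The following is only a minimal sketch: the class and method names are hypothetical, the allowed types are copied from the error above, and the real server-side list may change between versions.

```java
import java.util.Set;

/** Hypothetical pre-flight check; not part of langchain4j or the MongoDB driver. */
final class FieldTypeCheckSketch {

    // Allowed field types, copied verbatim from the error message above.
    static final Set<String> ALLOWED_TYPES = Set.of(
            "autocomplete", "boolean", "date", "dateFacet", "document",
            "embeddedDocuments", "geo", "knnVector", "number", "numberFacet",
            "objectId", "sortableDateBetaV1", "sortableNumberBetaV1",
            "sortableStringBetaV1", "string", "stringFacet", "token");

    /** Returns true if the given index field type appears in the allowed list. */
    static boolean isAllowed(String type) {
        return ALLOWED_TYPES.contains(type);
    }

    public static void main(String[] args) {
        // "notARealType" is a made-up value used only to show a rejected type.
        for (String type : new String[] {"knnVector", "token", "notARealType"}) {
            System.out.println(type + " allowed: " + isAllowed(type));
        }
    }
}
```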

@Martin7-1
Contributor Author

> I am not sure what the exact reason is, but both tests fail for me at the moment. E.g. MongoDbEmbeddingStoreLocalIT: com.mongodb.MongoCommandException: Command failed with error 8 (UnknownError): '"indexes[0].definition.mappings.fields.text.type" must be one of [autocomplete, boolean, date, dateFacet, document, embeddedDocuments, geo, knnVector, number, numberFacet, objectId, sortableDateBetaV1, sortableNumberBetaV1, sortableStringBetaV1, string, stringFacet, token]' on server localhost:52237. The full response is {"ok": 0.0, "errmsg": ""indexes[0].definition.mappings.fields.text.type" must be one of [autocomplete, boolean, date, dateFacet, document, embeddedDocuments, geo, knnVector, number, numberFacet, objectId, sortableDateBetaV1, sortableNumberBetaV1, sortableStringBetaV1, string, stringFacet, token]", "code": 8, "codeName": "UnknownError", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1706511172, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "eNUE2tp0aXe1rUDNF5POEY4KJO0=", "subType": "00"}}, "keyId": 7329409656818761730}}, "operationTime": {"$timestamp": {"t": 1706511172, "i": 1}}}

I haven't finished it yet; the fix will be done this week. I am currently confused about the difference between Atlas Search and Atlas Vector Search. I may need a little time, so this PR may not be ready to merge before this release.

@Martin7-1
Contributor Author

Martin7-1 commented Jan 31, 2024

@langchain4j Hi, all tests have passed three times, both locally and in the cloud.

Please make sure your Atlas Search index is created as follows:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "embedding": {
        "dimensions": 384,
        "similarity": "cosine",
        "type": "knnVector"
      },
      "metadata": {
        "dynamic": false,
        "fields": {
          "test-key": {
            "type": "token"
          }
        },
        "type": "document"
      }
    }
  }
}
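
As a rough sketch of how a mapping like the one above could be assembled in code instead of by hand, the nesting can be built from plain maps. The class and method names here are hypothetical and this is not the module's actual code; since org.bson.Document implements Map<String, Object>, the same structure could presumably be expressed with Document as well.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that mirrors the index JSON above as nested maps. */
final class IndexMappingSketch {

    /**
     * @param dimensions   embedding vector length (384 in the config above)
     * @param metadataKeys metadata keys to index as "token" (e.g. "test-key")
     */
    static Map<String, Object> buildMapping(int dimensions, List<String> metadataKeys) {
        // The "embedding" field is a knnVector compared by cosine similarity.
        Map<String, Object> embedding = new LinkedHashMap<>();
        embedding.put("dimensions", dimensions);
        embedding.put("similarity", "cosine");
        embedding.put("type", "knnVector");

        // Each metadata key becomes a "token" field.
        Map<String, Object> metadataFields = new LinkedHashMap<>();
        for (String key : metadataKeys) {
            Map<String, Object> token = new LinkedHashMap<>();
            token.put("type", "token");
            metadataFields.put(key, token);
        }

        // "metadata" is a non-dynamic sub-document holding those token fields.
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("dynamic", false);
        metadata.put("fields", metadataFields);
        metadata.put("type", "document");

        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("embedding", embedding);
        fields.put("metadata", metadata);

        Map<String, Object> mappings = new LinkedHashMap<>();
        mappings.put("dynamic", false);
        mappings.put("fields", fields);

        Map<String, Object> root = new LinkedHashMap<>();
        root.put("mappings", mappings);
        return root;
    }

    public static void main(String[] args) {
        // Prints the nested structure in Java map syntax (not JSON).
        System.out.println(buildMapping(384, List.of("test-key")));
    }
}
```

Passing the dimension and metadata keys as parameters keeps the sketch usable for embedding models other than the 384-dimensional one used in the tests.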

As for the 'PlanExecutor error during aggregation :: caused by :: cannot query index while in state INITIAL_SYNC' error you mentioned, I have not encountered it, so I can't tell where it went wrong. But maybe you can see this to figure out what is wrong in your local environment.

@langchain4j
Owner

@Martin7-1 thank you, MongoDbEmbeddingStoreCloudIT finally works after I created an index manually with the config you provided. I was also confused about "Atlas Search" versus "Atlas Vector Search", so which is the right one? Your index config JSON should ideally be added to the MongoDbEmbeddingStore javadoc with an explanation of how to create an index. The last link in the javadoc also suggests creating an Atlas Vector Search index, which is even more confusing :D

@langchain4j
Owner

@Martin7-1 I will merge this now and add a javadoc with json config to include it in this release (today)

@langchain4j langchain4j merged commit c694755 into langchain4j:main Feb 8, 2024
6 checks passed
@Martin7-1
Contributor Author

> @Martin7-1 thank you, MongoDbEmbeddingStoreCloudIT finally works after I created an index manually with the config you provided. I was also confused about "Atlas Search" versus "Atlas Vector Search", so which is the right one? Your index config JSON should ideally be added to the MongoDbEmbeddingStore javadoc with an explanation of how to create an index. The last link in the javadoc also suggests creating an Atlas Vector Search index, which is even more confusing :D

Currently our implementation uses "Atlas Search" for the index. The difference between the two is:

Atlas Vector Search allows searching through data based on semantic meaning captured in vectors, whereas Atlas Search allows for keyword search (i.e., based on the actual text and any defined synonym mappings)

I am not sure which one is correct, because both seem to work fine. I chose Atlas Search because the former PR did it this way, so I continued with it.
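
To make the vector side of that distinction concrete, here is a minimal sketch of the cosine similarity that the index config names as its "similarity" setting. It only illustrates the math; it is not how Atlas computes it internally, and the class name is hypothetical.

```java
/** Hypothetical illustration of cosine similarity between two embeddings. */
final class CosineSimilaritySketch {

    /** Returns dot(a, b) / (|a| * |b|), a value in [-1, 1]. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Vectors pointing the same way score near 1 regardless of length.
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
    }
}
```

A keyword search, by contrast, matches on the stored text itself, so two documents with similar meaning but different wording would not be matched this way.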
