mongodb[minor]: MongoDB Partner Package -- Porting MongoDBAtlasVectorSearch (#17652)

This PR migrates the existing MongoDBAtlasVectorSearch abstraction from the `langchain_community` section to the partners package section of the codebase.

- [x] Run the partner package script as advised in the partner-packages documentation.
- [x] Add Unit Tests
- [x] Migrate Integration Tests
- [x] Refactor `MongoDBAtlasVectorStore` (autogenerated) to `MongoDBAtlasVectorSearch`
- [x] ~Remove~ deprecate the old `langchain_community` VectorStore references.

## Additional Callouts
- Implemented the `delete` method
- Included any missing async function implementations
  - `amax_marginal_relevance_search_by_vector`
  - `adelete`
- Added new Unit Tests that test for functionality of `MongoDBVectorSearch` methods
- Removed [`del res[self._embedding_key]`](https://github.com/langchain-ai/langchain/blob/e0c81e1cb0ede673a69aae6434e17e34868c3bcc/libs/community/langchain_community/vectorstores/mongodb_atlas.py#L218) in `_similarity_search_with_score` function as it would make the `maximal_marginal_relevance` function fail otherwise. The `Document` needs to store the embedding key in metadata to work.

Checklist:
- [x] PR title: Please title your PR "package: description", where "package" is whichever of langchain, community, core, experimental, etc. is being modified. Use "docs: ..." for purely docs changes, "templates: ..." for template changes, "infra: ..." for CI changes.
  - Example: "community: add foobar LLM"
- [x] PR message
- [x] Pass lint and test: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified to check that you're passing lint and testing. See contribution guidelines for more information on how to write/run tests, lint, etc: https://python.langchain.com/docs/contributing/
- [x] Add tests and docs: If you're adding a new integration, please include
  1. Existing tests supplied in docs/docs do not change. Updated docstrings for new functions like `delete`
  2. an example notebook showing its use. It lives in `docs/docs/integrations` directory. (This already exists)

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17.

---------

Co-authored-by: Steven Silvester <steven.silvester@ieee.org>
Co-authored-by: Erick Friis <erick@langchain.dev>
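The callout about removing `del res[self._embedding_key]` can be illustrated with a small sketch. This is not the package's code: `Doc`, `to_doc`, `embeddings_for_mmr`, and the field names `"embedding"` and `"text"` are hypothetical stand-ins chosen for illustration. It only shows why a reranking step that reads vectors from document metadata fails once the vector has been deleted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

EMBEDDING_KEY = "embedding"  # assumed field name, for illustration only
TEXT_KEY = "text"  # assumed field name, for illustration only


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_doc(raw: Dict[str, Any], keep_embedding: bool) -> Doc:
    """Convert a raw search hit into a Doc, optionally dropping its vector."""
    res = dict(raw)  # copy so the caller's dict is left untouched
    text = res.pop(TEXT_KEY)
    if not keep_embedding:
        del res[EMBEDDING_KEY]  # the behaviour the commit message says was removed
    return Doc(page_content=text, metadata=res)


def embeddings_for_mmr(docs: List[Doc]) -> List[List[float]]:
    """Collect the vectors that an MMR rerank step would need."""
    return [doc.metadata[EMBEDDING_KEY] for doc in docs]


hit = {"text": "hello", "embedding": [0.1, 0.2], "source": "a.txt"}

kept = to_doc(hit, keep_embedding=True)
print(embeddings_for_mmr([kept]))  # the vector is available for reranking

dropped = to_doc(hit, keep_embedding=False)
try:
    embeddings_for_mmr([dropped])
except KeyError as exc:
    print(f"MMR cannot run: missing {exc}")
```

Keeping the vector in metadata makes each returned document larger, which is the trade-off the callout accepts so that MMR keeps working.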
1 parent 4121487, commit 72bfc1d
Showing 23 changed files with 2,321 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
__pycache__ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2024 LangChain, Inc. | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
.PHONY: all format lint test tests integration_tests docker_tests help extended_tests | ||
|
||
# Default target executed when no arguments are given to make. | ||
all: help | ||
|
||
# Define a variable for the test file path. | ||
TEST_FILE ?= tests/unit_tests/ | ||
integration_test integration_tests: TEST_FILE=tests/integration_tests/ | ||
|
||
test tests integration_test integration_tests: | ||
	poetry run pytest $(TEST_FILE) | ||
|
||
|
||
###################### | ||
# LINTING AND FORMATTING | ||
###################### | ||
|
||
# Define a variable for Python and notebook files. | ||
PYTHON_FILES=. | ||
MYPY_CACHE=.mypy_cache | ||
lint format: PYTHON_FILES=. | ||
lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative=libs/partners/mongodb --name-only --diff-filter=d master | grep -E '\.py$$|\.ipynb$$') | ||
lint_package: PYTHON_FILES=langchain_mongodb | ||
lint_tests: PYTHON_FILES=tests | ||
lint_tests: MYPY_CACHE=.mypy_cache_test | ||
|
||
lint lint_diff lint_package lint_tests: | ||
	poetry run ruff . | ||
	poetry run ruff format $(PYTHON_FILES) --diff | ||
	poetry run ruff --select I $(PYTHON_FILES) | ||
	mkdir $(MYPY_CACHE); poetry run mypy $(PYTHON_FILES) --cache-dir $(MYPY_CACHE) | ||
|
||
format format_diff: | ||
	poetry run ruff format $(PYTHON_FILES) | ||
	poetry run ruff --select I --fix $(PYTHON_FILES) | ||
|
||
spell_check: | ||
	poetry run codespell --toml pyproject.toml | ||
|
||
spell_fix: | ||
	poetry run codespell --toml pyproject.toml -w | ||
|
||
check_imports: $(shell find langchain_mongodb -name '*.py') | ||
	poetry run python ./scripts/check_imports.py $^ | ||
|
||
###################### | ||
# HELP | ||
###################### | ||
|
||
help: | ||
	@echo '----' | ||
	@echo 'check_imports - check imports' | ||
	@echo 'format - run code formatters' | ||
	@echo 'lint - run linters' | ||
	@echo 'test - run unit tests' | ||
	@echo 'tests - run unit tests' | ||
	@echo 'test TEST_FILE=<test_file> - run all tests in file' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
# langchain-mongodb | ||
|
||
# Installation | ||
``` | ||
pip install -U langchain-mongodb | ||
``` | ||
|
||
# Usage | ||
- See [integrations doc](../../../docs/docs/integrations/vectorstores/mongodb.ipynb) for more in-depth usage instructions. | ||
- See [Getting Started with the LangChain Integration](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/#get-started-with-the-langchain-integration) for a walkthrough on using your first LangChain implementation with MongoDB Atlas. | ||
|
||
## Using MongoDBAtlasVectorSearch | ||
```python | ||
import os | ||
|
||
from langchain_mongodb import MongoDBAtlasVectorSearch | ||
from langchain_openai import OpenAIEmbeddings | ||
from pymongo import MongoClient | ||
|
||
# Pull MongoDB Atlas URI from environment variables | ||
MONGODB_ATLAS_CLUSTER_URI = os.environ.get("MONGODB_ATLAS_CLUSTER_URI") | ||
|
||
DB_NAME = "langchain_db" | ||
COLLECTION_NAME = "test" | ||
ATLAS_VECTOR_SEARCH_INDEX_NAME = "index_name" | ||
|
||
# Create the vector search via `from_connection_string` | ||
vector_search = MongoDBAtlasVectorSearch.from_connection_string( | ||
    MONGODB_ATLAS_CLUSTER_URI, | ||
    DB_NAME + "." + COLLECTION_NAME, | ||
    OpenAIEmbeddings(disallowed_special=()), | ||
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME, | ||
) | ||
|
||
# Initialize MongoDB python client and look up the collection | ||
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI) | ||
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME] | ||
|
||
# Create the vector search via instantiation | ||
vector_search_2 = MongoDBAtlasVectorSearch( | ||
    collection=MONGODB_COLLECTION, | ||
    embedding=OpenAIEmbeddings(disallowed_special=()), | ||
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME, | ||
) | ||
``` |
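The second argument to `from_connection_string` in the README example is a namespace string of the form `<database>.<collection>`. The sketch below shows one way such a string can be split back into its two parts. `split_namespace` is a hypothetical helper written for this illustration, and how the library itself parses the namespace is not shown in this part of the diff. Splitting at the first dot is an assumption based on MongoDB database names not containing dots, while collection names may.

```python
from typing import Tuple


def split_namespace(namespace: str) -> Tuple[str, str]:
    """Split "<database>.<collection>" at the first dot (hypothetical helper)."""
    db_name, sep, collection_name = namespace.partition(".")
    if not sep or not db_name or not collection_name:
        raise ValueError(f"expected '<database>.<collection>', got {namespace!r}")
    return db_name, collection_name


DB_NAME = "langchain_db"
COLLECTION_NAME = "test"

db_name, collection_name = split_namespace(DB_NAME + "." + COLLECTION_NAME)
print(db_name, collection_name)  # langchain_db test
```

With a `MongoClient` in hand, the same pair of names would select the collection as `client[db_name][collection_name]`, which is what the instantiation path in the README does explicitly.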
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
from langchain_mongodb.vectorstores import ( | ||
    MongoDBAtlasVectorSearch, | ||
) | ||
|
||
__all__ = [ | ||
    "MongoDBAtlasVectorSearch", | ||
] |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
""" | ||
Tools for the Maximal Marginal Relevance (MMR) reranking. | ||
Duplicated from langchain_community to avoid cross-dependencies. | ||
Functions "maximal_marginal_relevance" and "cosine_similarity" | ||
are duplicated in this utility respectively from modules: | ||
- "libs/community/langchain_community/vectorstores/utils.py" | ||
- "libs/community/langchain_community/utils/math.py" | ||
""" | ||
|
||
import logging | ||
from typing import List, Union | ||
|
||
import numpy as np | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
Matrix = Union[List[List[float]], List[np.ndarray], np.ndarray] | ||
|
||
|
||
def cosine_similarity(X: Matrix, Y: Matrix) -> np.ndarray: | ||
    """Row-wise cosine similarity between two equal-width matrices.""" | ||
    if len(X) == 0 or len(Y) == 0: | ||
        return np.array([]) | ||
|
||
    X = np.array(X) | ||
    Y = np.array(Y) | ||
    if X.shape[1] != Y.shape[1]: | ||
        raise ValueError( | ||
            f"Number of columns in X and Y must be the same. X has shape {X.shape} " | ||
            f"and Y has shape {Y.shape}." | ||
        ) | ||
    try: | ||
        import simsimd as simd  # type: ignore | ||
|
||
        X = np.array(X, dtype=np.float32) | ||
        Y = np.array(Y, dtype=np.float32) | ||
        Z = 1 - simd.cdist(X, Y, metric="cosine") | ||
        if isinstance(Z, float): | ||
            return np.array([Z]) | ||
        return Z | ||
    except ImportError: | ||
        logger.info( | ||
            "Unable to import simsimd, defaulting to NumPy implementation. If you want " | ||
            "to use simsimd please install with `pip install simsimd`." | ||
        ) | ||
        X_norm = np.linalg.norm(X, axis=1) | ||
        Y_norm = np.linalg.norm(Y, axis=1) | ||
        # Ignore divide by zero errors run time warnings as those are handled below. | ||
        with np.errstate(divide="ignore", invalid="ignore"): | ||
            similarity = np.dot(X, Y.T) / np.outer(X_norm, Y_norm) | ||
        similarity[np.isnan(similarity) | np.isinf(similarity)] = 0.0 | ||
        return similarity | ||
|
||
|
||
def maximal_marginal_relevance( | ||
    query_embedding: np.ndarray, | ||
    embedding_list: list, | ||
    lambda_mult: float = 0.5, | ||
    k: int = 4, | ||
) -> List[int]: | ||
    """Calculate maximal marginal relevance.""" | ||
    if min(k, len(embedding_list)) <= 0: | ||
        return [] | ||
    if query_embedding.ndim == 1: | ||
        query_embedding = np.expand_dims(query_embedding, axis=0) | ||
    similarity_to_query = cosine_similarity(query_embedding, embedding_list)[0] | ||
    most_similar = int(np.argmax(similarity_to_query)) | ||
    idxs = [most_similar] | ||
    selected = np.array([embedding_list[most_similar]]) | ||
    while len(idxs) < min(k, len(embedding_list)): | ||
        best_score = -np.inf | ||
        idx_to_add = -1 | ||
        similarity_to_selected = cosine_similarity(embedding_list, selected) | ||
        for i, query_score in enumerate(similarity_to_query): | ||
            if i in idxs: | ||
                continue | ||
            redundant_score = max(similarity_to_selected[i]) | ||
            equation_score = ( | ||
                lambda_mult * query_score - (1 - lambda_mult) * redundant_score | ||
            ) | ||
            if equation_score > best_score: | ||
                best_score = equation_score | ||
                idx_to_add = i | ||
        idxs.append(idx_to_add) | ||
        selected = np.append(selected, [embedding_list[idx_to_add]], axis=0) | ||
    return idxs |
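The selection rule in `maximal_marginal_relevance` scores each unselected candidate as `lambda_mult * similarity_to_query - (1 - lambda_mult) * max_similarity_to_selected` and repeatedly picks the highest score. The sketch below restates that rule in pure Python on a toy input so the effect of `lambda_mult` is visible. It is an illustrative rewrite, not the NumPy code from the diff: the names `cosine` and `mmr` and the sample vectors are assumptions made for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mmr(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    lambda_mult: float = 0.5,
    k: int = 4,
) -> List[int]:
    """Return candidate indices in MMR order (pure-Python sketch)."""
    if min(k, len(candidates)) <= 0:
        return []
    to_query = [cosine(query, c) for c in candidates]
    # Start with the candidate most similar to the query.
    picked = [max(range(len(candidates)), key=to_query.__getitem__)]
    while len(picked) < min(k, len(candidates)):
        best_score, best_idx = -math.inf, -1
        for i, query_score in enumerate(to_query):
            if i in picked:
                continue
            # Redundancy: closest match among the already selected items.
            redundancy = max(cosine(candidates[i], candidates[j]) for j in picked)
            score = lambda_mult * query_score - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_score, best_idx = score, i
        picked.append(best_idx)
    return picked


query = [1.0, 1.0]
candidates = [
    [1.0, 0.9],  # index 0: closest to the query
    [1.0, 0.8],  # index 1: near-duplicate of index 0
    [0.2, 1.0],  # index 2: less relevant, but points elsewhere
]

print(mmr(query, candidates, lambda_mult=0.5, k=2))  # [0, 2]
print(mmr(query, candidates, lambda_mult=1.0, k=2))  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate at index 1 is penalised for its similarity to index 0, so the more diverse index 2 is chosen second. With `lambda_mult=1.0` the redundancy term vanishes and the ranking falls back to plain similarity to the query.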