partners: MongoDB Partner Package -- Porting MongoDBAtlasVectorSearch #17652

Merged on Feb 29, 2024 (27 commits)

Commits
321ca5b
MongoDB Partner Package -- first pass
Jibola Feb 15, 2024
34ddb1d
Merge branch 'master' into partner-package-mongodbatlas
Jibola Feb 15, 2024
edeb555
kept the vectorsearch names and kept the mongodbatlasvectorsearch nam…
Jibola Feb 16, 2024
0052663
remove docs section
Jibola Feb 16, 2024
05d46e7
make format
Jibola Feb 16, 2024
4e751d9
make format && make lint
Jibola Feb 16, 2024
82e7eb4
Merge branch 'master' into partner-package-mongodbatlas
Jibola Feb 20, 2024
ba82b0f
added some more unit tests; removed duplicate import test statement
Jibola Feb 20, 2024
0921e47
deprecate langchain_community implementation
Jibola Feb 20, 2024
40ee1ac
updated the README.md to include usage instructions; 'beefed up' unit…
Jibola Feb 21, 2024
e0c81e1
set batch sizing back to 100
Jibola Feb 21, 2024
bde5a3c
Update libs/partners/mongodb-atlas/README.md
Jibola Feb 23, 2024
b156250
rename mongodb-atlas -> mongodb; update dependencies
Jibola Feb 28, 2024
fc36819
updaetd poetry.lock
Jibola Feb 28, 2024
c7f58de
Merge branch 'master' into partner-package-mongodbatlas
Jibola Feb 28, 2024
e3f9b3c
Merge branch 'master' into partner-package-mongodbatlas
efriis Feb 28, 2024
dbdc50f
format
efriis Feb 28, 2024
fcb04e1
deprecation version
efriis Feb 28, 2024
027bcb8
x
efriis Feb 28, 2024
68c9d56
fix typing in vectorstores tests and return test_compile placeholder
Jibola Feb 28, 2024
837eec1
rename documentation to import langchain_mongodb rather than langchai…
Jibola Feb 28, 2024
5fb1c14
fix integration_test validity checks & consolidate embedder
Jibola Feb 29, 2024
542bbc8
Merge branch 'master' into partner-package-mongodbatlas
efriis Feb 29, 2024
a869240
test
efriis Feb 29, 2024
a74c952
doc nit
efriis Feb 29, 2024
0f2ea82
x
efriis Feb 29, 2024
cf5b5d4
x
efriis Feb 29, 2024

6 changes: 3 additions & 3 deletions docs/docs/integrations/providers/mongodb_atlas.mdx
@@ -8,17 +8,17 @@

See [detail configuration instructions](/docs/integrations/vectorstores/mongodb_atlas).

-We need to install `pymongo` python package.
+We need to install `langchain-mongodb` python package.

```bash
-pip install pymongo
+pip install langchain-mongodb
```

## Vector Store

See a [usage example](/docs/integrations/vectorstores/mongodb_atlas).

```python
-from langchain_community.vectorstores import MongoDBAtlasVectorSearch
+from langchain_mongodb import MongoDBAtlasVectorSearch
```

@@ -16,6 +16,7 @@
)

import numpy as np
from langchain_core._api.deprecation import deprecated
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.vectorstores import VectorStore
@@ -32,6 +33,11 @@
DEFAULT_INSERT_BATCH_SIZE = 100


@deprecated(
    since="0.0.25",
    removal="0.2.0",
    alternative_import="langchain_mongodb.MongoDBAtlasVectorSearch",
)
class MongoDBAtlasVectorSearch(VectorStore):

It's not ideal to duplicate this whole class implementation. Instead can we make this a thin wrapper around langchain_mongodb_atlas.MongoDBAtlasVectorSearch?

Contributor Author

I don't think we'll be able to do that because the partner-package ends up needing a completely separate pip installation to use. I imagine having to import langchain_mongodb_atlas introduces a breaking change for folks using this class.

Ah I see, so the duplication is unavoidable in this case?

Contributor Author

Yeah, unless @efriis has any suggestions?
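
For readers following the thread, here is a minimal stdlib sketch of what a class-level deprecation decorator of this kind typically does: the duplicated class keeps working, but constructing it emits a warning that points at the new import path. This is not the `langchain_core` implementation; `deprecated_class` and `LegacyVectorSearch` are hypothetical names used only for illustration, and the real decorator may differ in its warning category and message.

```python
import functools
import warnings
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def deprecated_class(
    since: str, removal: str, alternative_import: str
) -> Callable[[Type[T]], Type[T]]:
    """Hypothetical stand-in for a class-level deprecation decorator."""

    def decorate(cls: Type[T]) -> Type[T]:
        original_init = cls.__init__

        @functools.wraps(original_init)
        def warning_init(self, *args, **kwargs):
            # Warn on every construction, then defer to the original __init__.
            warnings.warn(
                f"{cls.__name__} was deprecated in {since} and is scheduled for "
                f"removal in {removal}. Use {alternative_import} instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            original_init(self, *args, **kwargs)

        cls.__init__ = warning_init
        return cls

    return decorate


@deprecated_class(
    since="0.0.25",
    removal="0.2.0",
    alternative_import="langchain_mongodb.MongoDBAtlasVectorSearch",
)
class LegacyVectorSearch:
    """Placeholder for the duplicated community class (hypothetical)."""

    def __init__(self, index_name: str = "default") -> None:
        self.index_name = index_name


# Capture the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    store = LegacyVectorSearch(index_name="vector_index")

print(caught[0].message)
```

Because the warning is attached to the existing class rather than to a re-export, existing imports keep working without installing the partner package, which is the trade-off discussed in the thread above.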

"""`MongoDB Atlas Vector Search` vector store.

1 change: 1 addition & 0 deletions libs/partners/mongodb/.gitignore
@@ -0,0 +1 @@
__pycache__
21 changes: 21 additions & 0 deletions libs/partners/mongodb/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
57 changes: 57 additions & 0 deletions libs/partners/mongodb/Makefile
@@ -0,0 +1,57 @@
.PHONY: all format lint test tests integration_tests docker_tests help extended_tests

# Default target executed when no arguments are given to make.
all: help

# Define a variable for the test file path.
TEST_FILE ?= tests/unit_tests/
integration_test integration_tests: TEST_FILE=tests/integration_tests/

test tests integration_test integration_tests:
	poetry run pytest $(TEST_FILE)


######################
# LINTING AND FORMATTING
######################

# Define a variable for Python and notebook files.
PYTHON_FILES=.
MYPY_CACHE=.mypy_cache
lint format: PYTHON_FILES=.
lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative=libs/partners/mongodb --name-only --diff-filter=d master | grep -E '\.py$$|\.ipynb$$')
lint_package: PYTHON_FILES=langchain_mongodb
lint_tests: PYTHON_FILES=tests
lint_tests: MYPY_CACHE=.mypy_cache_test

lint lint_diff lint_package lint_tests:
	poetry run ruff .
	poetry run ruff format $(PYTHON_FILES) --diff
	poetry run ruff --select I $(PYTHON_FILES)
	mkdir $(MYPY_CACHE); poetry run mypy $(PYTHON_FILES) --cache-dir $(MYPY_CACHE)

format format_diff:
	poetry run ruff format $(PYTHON_FILES)
	poetry run ruff --select I --fix $(PYTHON_FILES)

spell_check:
	poetry run codespell --toml pyproject.toml

spell_fix:
	poetry run codespell --toml pyproject.toml -w

check_imports: $(shell find langchain_mongodb -name '*.py')
	poetry run python ./scripts/check_imports.py $^

######################
# HELP
######################

help:
	@echo '----'
	@echo 'check_imports				- check imports'
	@echo 'format                       - run code formatters'
	@echo 'lint                         - run linters'
	@echo 'test                         - run unit tests'
	@echo 'tests                        - run unit tests'
	@echo 'test TEST_FILE=<test_file>   - run all tests in file'
40 changes: 40 additions & 0 deletions libs/partners/mongodb/README.md
@@ -0,0 +1,40 @@
# langchain-mongodb

# Installation
```
pip install -U langchain-mongodb
```

# Usage
- See [integrations doc](../../../docs/docs/integrations/vectorstores/mongodb.ipynb) for more in-depth usage instructions.
- See [Getting Started with the LangChain Integration](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/#get-started-with-the-langchain-integration) for a walkthrough on using your first LangChain implementation with MongoDB Atlas.

## Using MongoDBAtlasVectorSearch
```python
import os

from langchain_mongodb import MongoDBAtlasVectorSearch
# Assumption: OpenAIEmbeddings comes from the separate `langchain-openai` package.
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient

# Pull MongoDB Atlas URI from environment variables
MONGODB_ATLAS_CLUSTER_URI = os.environ.get("MONGODB_ATLAS_CLUSTER_URI")

DB_NAME = "langchain_db"
COLLECTION_NAME = "test"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "index_name"

# Create the vector search via `from_connection_string`
vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    MONGODB_ATLAS_CLUSTER_URI,
    DB_NAME + "." + COLLECTION_NAME,
    OpenAIEmbeddings(disallowed_special=()),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)

# Initialize MongoDB python client and select the collection
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

# Create the vector search via instantiation
vector_search_2 = MongoDBAtlasVectorSearch(
    collection=MONGODB_COLLECTION,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
```
7 changes: 7 additions & 0 deletions libs/partners/mongodb/langchain_mongodb/__init__.py
@@ -0,0 +1,7 @@
from langchain_mongodb.vectorstores import (
    MongoDBAtlasVectorSearch,
)

__all__ = [
    "MongoDBAtlasVectorSearch",
]
Empty file.
87 changes: 87 additions & 0 deletions libs/partners/mongodb/langchain_mongodb/utils.py
@@ -0,0 +1,87 @@
"""
Tools for the Maximal Marginal Relevance (MMR) reranking.
Duplicated from langchain_community to avoid cross-dependencies.

Functions "maximal_marginal_relevance" and "cosine_similarity"
are duplicated in this utility respectively from modules:
- "libs/community/langchain_community/vectorstores/utils.py"
- "libs/community/langchain_community/utils/math.py"
"""

import logging
from typing import List, Union

import numpy as np

logger = logging.getLogger(__name__)

Matrix = Union[List[List[float]], List[np.ndarray], np.ndarray]


def cosine_similarity(X: Matrix, Y: Matrix) -> np.ndarray:
    """Row-wise cosine similarity between two equal-width matrices."""
    if len(X) == 0 or len(Y) == 0:
        return np.array([])

    X = np.array(X)
    Y = np.array(Y)
    if X.shape[1] != Y.shape[1]:
        raise ValueError(
            f"Number of columns in X and Y must be the same. X has shape {X.shape} "
            f"and Y has shape {Y.shape}."
        )
    try:
        import simsimd as simd  # type: ignore

        X = np.array(X, dtype=np.float32)
        Y = np.array(Y, dtype=np.float32)
        Z = 1 - simd.cdist(X, Y, metric="cosine")
        if isinstance(Z, float):
            return np.array([Z])
        return Z
    except ImportError:
        logger.info(
            "Unable to import simsimd, defaulting to NumPy implementation. If you want "
            "to use simsimd please install with `pip install simsimd`."
        )
        X_norm = np.linalg.norm(X, axis=1)
        Y_norm = np.linalg.norm(Y, axis=1)
        # Ignore divide by zero errors run time warnings as those are handled below.
        with np.errstate(divide="ignore", invalid="ignore"):
            similarity = np.dot(X, Y.T) / np.outer(X_norm, Y_norm)
        similarity[np.isnan(similarity) | np.isinf(similarity)] = 0.0
        return similarity


def maximal_marginal_relevance(
    query_embedding: np.ndarray,
    embedding_list: list,
    lambda_mult: float = 0.5,
    k: int = 4,
) -> List[int]:
    """Calculate maximal marginal relevance."""
    if min(k, len(embedding_list)) <= 0:
        return []
    if query_embedding.ndim == 1:
        query_embedding = np.expand_dims(query_embedding, axis=0)
    similarity_to_query = cosine_similarity(query_embedding, embedding_list)[0]
    most_similar = int(np.argmax(similarity_to_query))
    idxs = [most_similar]
    selected = np.array([embedding_list[most_similar]])
    while len(idxs) < min(k, len(embedding_list)):
        best_score = -np.inf
        idx_to_add = -1
        similarity_to_selected = cosine_similarity(embedding_list, selected)
        for i, query_score in enumerate(similarity_to_query):
            if i in idxs:
                continue
            redundant_score = max(similarity_to_selected[i])
            equation_score = (
                lambda_mult * query_score - (1 - lambda_mult) * redundant_score
            )
            if equation_score > best_score:
                best_score = equation_score
                idx_to_add = i
        idxs.append(idx_to_add)
        selected = np.append(selected, [embedding_list[idx_to_add]], axis=0)
    return idxs
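
To make the selection rule in `maximal_marginal_relevance` easier to check by hand, the following is a dependency-free sketch of the same greedy procedure on three 2-D vectors. It is a re-statement for illustration only, not the package's code: `cosine` and `mmr` are hypothetical helper names, plain lists stand in for NumPy arrays, and the `simsimd` fast path is left out.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 when either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mmr(
    query: Vector,
    candidates: List[Vector],
    lambda_mult: float = 0.5,
    k: int = 4,
) -> List[int]:
    """Greedy MMR: return indices of the chosen candidates, in pick order."""
    limit = min(k, len(candidates))
    if limit <= 0:
        return []
    to_query = [cosine(query, c) for c in candidates]
    # Seed with the candidate most similar to the query.
    chosen = [max(range(len(candidates)), key=lambda i: to_query[i])]
    while len(chosen) < limit:
        best_score, best_idx = -math.inf, -1
        for i, query_score in enumerate(to_query):
            if i in chosen:
                continue
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(cosine(candidates[i], candidates[j]) for j in chosen)
            score = lambda_mult * query_score - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_score, best_idx = score, i
        chosen.append(best_idx)
    return chosen


query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],  # identical to the query
    [0.9, 0.1],  # near-duplicate of candidate 0
    [0.0, 1.0],  # orthogonal to the query
]

diverse = mmr(query, candidates, lambda_mult=0.3, k=2)
relevant = mmr(query, candidates, lambda_mult=1.0, k=2)
print(diverse, relevant)
```

With `lambda_mult=0.3` the second pick is the orthogonal vector (`[0, 2]`), because the near-duplicate is penalized for its similarity to the first pick. With `lambda_mult=1.0` the redundancy term vanishes and the near-duplicate wins on relevance alone (`[0, 1]`).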