[Feature][VectorStore] Support StarRocks as vector db #6119

Merged
merged 16 commits into from
Jun 21, 2023

@dirtysalt dirtysalt commented Jun 13, 2023

Fixes # (issue)

Before submitting

Here is an example of using StarRocks as a vector DB:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import StarRocks
from langchain.vectorstores.starrocks import StarRocksSettings

embeddings = OpenAIEmbeddings()

# configure StarRocks settings
settings = StarRocksSettings()
settings.port = 41003
settings.host = '127.0.0.1'
settings.username = 'root'
settings.password = ''
settings.database = 'zya'

# to fill new embeddings (split_docs is a list of already-split Documents)
docsearch = StarRocks.from_documents(split_docs, embeddings, config=settings)

# or to use already-built embeddings in the database
docsearch = StarRocks(embeddings, settings)

Who can review?

Tag maintainers/contributors who might be interested:

@dev2049

@dirtysalt dirtysalt changed the title Support StarRocks as vector db [Feature][VectorStore] Support StarRocks as vector db Jun 14, 2023

@hwchase17 hwchase17 left a comment

an example notebook would be very helpful!

config: Optional[StarRocksSettings] = None,
**kwargs: Any,
) -> None:
"""StarRocks Wrapper to LangChain

better docstring would be nice

I've updated the docstring and added an example notebook showing how to use StarRocks as a vector DB.

@hwchase17 hwchase17 added the 03 enhancement Enhancement of existing functionality label Jun 18, 2023
@dirtysalt

> an example notebook would be very helpful!

OK. I'll fix that.

@rlancemartin

> > an example notebook would be very helpful!
>
> OK. I'll fix that.

Thanks! I'll have a look at the notebook tomorrow, and help get this merged.

@dirtysalt

> > > an example notebook would be very helpful!
> >
> > OK. I'll fix that.
>
> Thanks! I'll have a look at the notebook tomorrow, and help get this merged.

Thanks. I've added an example notebook showing how to use StarRocks as a vector DB.

dirtysalt added a commit to StarRocks/starrocks that referenced this pull request Jun 19, 2023
Add [`langchain`](langchain-ai/langchain#6119)
extension

You can test the function with the following SQL statements:

```
create table t1 (id int, data array<float>) engine = olap 
  distributed by hash(id) properties ("replication_num" = "1");

insert into t1 values (1, array<float>[0.1, 0.2, 0.3]),
    (2, array<float>[0.2, 0.1, 0.3]),
    (3, array<float>[0.3, 0.2, 0.1]);

select cosine_similarity(array<float>[0.1, 0.2, 0.3], data) as dist, id from t1;
```

Signed-off-by: yanz <dirtysalt1987@gmail.com>
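
For readers who want to sanity-check the `cosine_similarity` query above without a StarRocks instance, here is a minimal sketch in plain Python. It is not the StarRocks implementation; it only applies the textbook formula dot(a, b) / (|a| * |b|) to the same three rows. The helper name and the `rows`/`query` variables are made up for illustration, and StarRocks works on `float` arrays, so its output may differ in the last digits.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same rows as the t1 table in the SQL above (illustrative copy, not read from a DB).
rows: List[Tuple[int, List[float]]] = [
    (1, [0.1, 0.2, 0.3]),
    (2, [0.2, 0.1, 0.3]),
    (3, [0.3, 0.2, 0.1]),
]
query = [0.1, 0.2, 0.3]

# Mirrors: select cosine_similarity(<query>, data) as dist, id from t1;
results = [(cosine_similarity(query, data), row_id) for row_id, data in rows]
for dist, row_id in results:
    print(f"id={row_id} dist={dist:.4f}")
```

Row 1 is identical to the query vector, so its similarity is 1; rows 2 and 3 come out at roughly 0.9286 (0.13 / 0.14) and 0.7143 (0.10 / 0.14). Note that the SQL aliases the column as `dist`, but larger values mean closer vectors.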
@rlancemartin

@dirtysalt thanks!

(1) In the notebook, please add a header `# Starrocks` and a short description with context on the DB. It will help with adoption. E.g., see the examples for other vector DBs.

(2) Please run `make format` to format the code, and also have a look at the lint errors in the tests:

langchain/vectorstores/starrocks.py:26: error: Function is missing a type annotation  [no-untyped-def]
langchain/vectorstores/starrocks.py:32: error: Function is missing a type annotation  [no-untyped-def]
langchain/vectorstores/starrocks.py:33: error: Library stubs not installed for "pymysql.cursors"  [import]
langchain/vectorstores/starrocks.py:33: note: Hint: "python3 -m pip install types-PyMySQL"
langchain/vectorstores/starrocks.py:33: note: (or run "mypy --install-types" to install all missing stub packages)
langchain/vectorstores/starrocks.py:33: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
langchain/vectorstores/starrocks.py:33: error: Library stubs not installed for "pymysql"  [import]
langchain/vectorstores/starrocks.py:34: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
langchain/vectorstores/starrocks.py:193: error: "Iterable[str]" has no attribute "index"  [attr-defined]
langchain/vectorstores/starrocks.py:335: error: Need type annotation for "settings_strs" (hint: "settings_strs: List[<type>] = ...")  [var-annotated]
langchain/vectorstores/starrocks.py:449: error: "StarRocks" has no attribute "client"  [attr-defined]
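
As an illustration of how two of the mypy errors above are typically resolved, here is a small sketch. It is not the code from `starrocks.py`; the function `describe_settings` and its parameters are hypothetical. It shows materializing an `Iterable[str]` into a `List[str]` before calling `.index()` (the `attr-defined` error) and annotating an empty list (the `var-annotated` error).

```python
from typing import Dict, Iterable, List


def describe_settings(column_names: Iterable[str], settings: Dict[str, object]) -> str:
    """Hypothetical helper that type-checks cleanly under mypy."""
    # Iterable[str] has no .index(); convert it to a concrete list first.
    names: List[str] = list(column_names)

    # An empty list literal needs an explicit annotation so mypy knows its element type.
    settings_strs: List[str] = []
    for key, value in settings.items():
        settings_strs.append(f"{key}={value}")

    position = names.index("embedding")
    return f"embedding column at {position}; " + ", ".join(settings_strs)


summary = describe_settings(
    iter(["id", "document", "embedding"]),
    {"host": "127.0.0.1", "port": 41003},
)
print(summary)
```

The missing-stub errors for `pymysql` are a separate matter: they go away once the `types-PyMySQL` package suggested in the mypy hint is installed in the lint environment.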

@dirtysalt

@rlancemartin I'm quite a newbie to this community, and thanks for pointing out problems.

I've fixed lint and format problems.

And I've updated the notebook and added some descriptions about the background of StarRocks.

Really appreciate your review and time.

@rlancemartin

> @rlancemartin I'm quite a newbie to this community, and thanks for pointing out problems.
>
> I've fixed lint and format problems.
>
> And I've updated the notebook and added some descriptions about the background of StarRocks.
>
> Really appreciate your review and time.

Thanks! No problem. Lint errors are always a bit annoying :) I kicked off tests and will look again tomorrow. I can resolve any remaining ones quickly and get this in.

@dirtysalt

dirtysalt commented Jun 21, 2023

OK, Thanks.

I've checked the lint errors. It looks like a Python package has to be installed in the lint environment.

I use the pymysql package, and there are no pymysql stubs installed for type checking:

python3 -m pip install types-PyMySQL

@dev2049 dev2049 added the Ɑ: vector store Related to vector store module label Jun 21, 2023
def __init__(
self,
embedding: Embeddings,
config: Optional[StarRocksSettings] = None,

why make config optional and give it default None if we're gonna assert that it's non-null later?

The assert statement is right after handling the null case

        if config is not None:
            self.config = config
        else:
            self.config = StarRocksSettings()
        assert self.config

If config is None, we fall back to default settings.

So in theory we can remove the `assert self.config` statement, because it will always be true.
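
The pattern under discussion can be sketched as follows. This is an illustration only, not the merged code: `DemoSettings` and `DemoStore` are hypothetical stand-ins for `StarRocksSettings` and `StarRocks`, and the default field values are placeholders chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoSettings:
    """Hypothetical stand-in for StarRocksSettings; defaults are placeholders."""

    host: str = "localhost"
    port: int = 9030
    username: str = "root"
    password: str = ""
    database: str = "default"


class DemoStore:
    """Hypothetical stand-in for the vector store constructor being reviewed."""

    def __init__(self, config: Optional[DemoSettings] = None) -> None:
        # Callers may omit config; in that case fall back to default settings.
        # After this line self.config is never None, so no assert is needed.
        self.config: DemoSettings = config if config is not None else DemoSettings()


default_store = DemoStore()
custom = DemoSettings(host="127.0.0.1", port=41003, database="zya")
custom_store = DemoStore(custom)
print(default_store.config.port, custom_store.config.port)
```

Keeping the parameter `Optional` lets users construct the store with no arguments, while annotating the attribute as non-optional tells the type checker that later code never has to handle `None`.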

@rlancemartin

> OK, Thanks.
>
> I've checked the lint errors. It looks like a Python package has to be installed in the lint environment.
>
> I use the pymysql package, and there are no pymysql stubs installed for type checking:
>
> python3 -m pip install types-PyMySQL

Looks like checks are now passing. Good work.

@rlancemartin rlancemartin merged commit 57cc3d1 into langchain-ai:master Jun 21, 2023

Later commits to StarRocks/starrocks that referenced this pull request, each carrying the same commit message and test SQL as the Jun 19 commit above (cherry picked from commit 4253167):

mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request Jul 13, 2023
wanpengfei-git pushed a commit to StarRocks/starrocks that referenced this pull request Jul 14, 2023
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request Jul 27, 2023 (conflicts: be/src/exprs/vectorized/math_functions.cpp, be/src/exprs/vectorized/math_functions.h, gensrc/script/vectorized/vectorized_functions.py)
dirtysalt added a commit to StarRocks/starrocks that referenced this pull request Jul 27, 2023
wanpengfei-git pushed a commit to StarRocks/starrocks that referenced this pull request Jul 27, 2023
mergify bot pushed a commit to StarRocks/starrocks that referenced this pull request Jul 27, 2023
wanpengfei-git pushed a commit to StarRocks/starrocks that referenced this pull request Jul 27, 2023