feat: add py.typed for mypy library stubs #1071
Open · eduardjbotha wants to merge 397 commits into lancedb:main from eduardjbotha:mypy-typed
Conversation
Closes lancedb#721. FTS returns results as a pyarrow table. Pyarrow tables have a `filter` method, but it does not take SQL filter strings (only pyarrow compute expressions). Instead, we do one of two things to support `tbl.search("keywords").where("foo=5").limit(10).to_arrow()`: Default path: if duckdb is available, use duckdb to execute the SQL filter string on the pyarrow table. Backup path: otherwise, write the pyarrow table to a lance dataset and then call `to_table(filter=<filter>)`. Neither is ideal. The default path has two issues: 1. it requires installing an extra library (duckdb); 2. duckdb mangles some fields (e.g. fixed size list => list). The backup path incurs a latency penalty (~20ms on SSD) to write the result set to disk. In the short term, once lancedb#676 is addressed, we can write the dataset to "memory://" instead of disk, which makes the post-filter evaluation much quicker (ETA next week). In the longer term, we'd like to be able to evaluate the filter string on the pyarrow Table directly; one possibility is using Substrait to generate pyarrow compute expressions from the SQL string. Or, if there's enough progress in pyarrow, it could support Substrait expressions directly (no ETA). --------- Co-authored-by: Will Jones <willjones127@gmail.com>
If you add timezone information in the `Field` annotation for a datetime, it will now be passed to the pyarrow data type. I'm not sure how pyarrow enforces timezones; right now it silently coerces to the timezone given in the column regardless of whether the input had a matching timezone. This is probably not the right behavior, though we could make the user do the validation in the pydantic model instead of at the pyarrow conversion layer.
The OpenAI API has changed significantly; namely, `openai.Embedding.create` no longer exists (openai/openai-python#742). Update the OpenAI embedding function and put a minimum on the openai SDK version.
Issue separate requests under the hood and concatenate the results.
Add support for adding lists of string input (e.g., list of categorical labels) Follow-up items: lancedb#757 lancedb#758
Co-authored-by: Aidan <64613310+aidangomar@users.noreply.github.com>
I found it quite incoherent to have to read through the documentation while searching for the submodule each class should be imported from. For example, it is cumbersome to navigate to another documentation page just to find out that `EmbeddingFunctionRegistry` comes from `lancedb.embeddings`.
If the input text is None, Tantivy raises an error complaining that it cannot add a NoneType. We handle this upstream so None values are not added to the document. If all of the indexed fields are None, we skip the document.
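An illustrative sketch of that upstream handling (not lancedb's exact code, and `build_doc` is a hypothetical helper): drop None fields before building the tantivy document, and skip the row entirely when every indexed field is None.

```python
from typing import Optional

def build_doc(row: dict, indexed_fields: list) -> Optional[dict]:
    # Keep only indexed fields that actually have a value.
    doc = {f: row[f] for f in indexed_fields if row.get(f) is not None}
    return doc if doc else None  # None means "skip this document"

print(build_doc({"title": "moby", "body": None}, ["title", "body"]))  # {'title': 'moby'}
```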
In addition to lancedb#777, this pull request fixes more typos in the documentation for "Ingest Embedding Functions".
Addressed minor typos and grammatical issues to improve readability --------- Co-authored-by: Christopher Correa <chris.correa@gmail.com>
These examples don't work because of changes in the OpenAI API from version 1 onwards.
Raise an exception if the FTS index does not exist. --------- Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
…ancedb#762) By default tantivy-py uses a 128MB heap size. We change the default to 1GB and allow the user to customize it locally. This makes `test_fts.py` run 10x faster.
Closes lancedb#773. We pass an empty table over IPC so we don't need to manually deal with serde; we then just return the schema attribute from the empty table. --------- Co-authored-by: albertlockett <albert.lockett@gmail.com>
Closes lancedb#769. Adds a unit test and documentation on using quotes to perform a phrase query.
should fix the error on top of main https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725
Addresses lancedb#797. Problem: tantivy does not expose an option to explicitly … Proposed solution here: 1. Add a `.phrase_query()` option. 2. Under the hood, LanceDB takes care of wrapping the input in quotes and replacing nested double quotes with single quotes. I've also filed an upstream issue; if they support phrase queries natively, we can get rid of our manual custom processing here.
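The quoting described in step 2 can be sketched as follows (illustrative only, not lancedb's exact implementation; `as_phrase_query` is a hypothetical helper):

```python
def as_phrase_query(text: str) -> str:
    # Replace nested double quotes so they cannot terminate the phrase,
    # then wrap the whole query so tantivy parses it as a phrase.
    return '"{}"'.format(text.replace('"', "'"))

print(as_phrase_query("old man sea"))  # "old man sea"
```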
This will eventually replace the remote table implementations in python and node.
…PI (lancedb#1031) I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking changes from sync to async python.
…ancedb#1049) small QoL improvement
Fix a typo and a broken table.
…1047) The renaming of `vectordb` to `lancedb` broke the [quick start docs](https://lancedb.github.io/lancedb/basic/#__tabbed_5_3) (it's pointing to a non-existent directory). This PR fixes the code snippets and the paths in the docs page. Additionally, more fixes related to indexing docs below 👇🏽.
In order to add support for `add` we needed to migrate the rust `Table` trait to a `Table` struct and a `TableInternal` trait (similar to the way the connection is designed). While doing this we also cleaned up some inconsistencies between the SDKs:

* Python and Node are garbage-collected languages and it can be difficult to trigger something to be freed. The convention for these languages is to have some kind of close method. I added a close method to both the table and connection which will drop the underlying rust object.
* We made significant improvements to table creation in lancedb@cc5f213 for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these were not getting cleaned up. This is mostly harmless but annoying, so I changed it up a bit to ensure we clean up tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display`, which is hooked into python's `__repr__`. Node has no concept of a regular "to string" function, so I added a `display` method.
* Python method signatures are changing so that optional parameters are always `Optional[foo] = None` instead of something like `foo = False`. This is because we want those defaults to be in rust whenever possible (though we still need to mention the default in documentation).
* I changed the python `AsyncConnection`/`AsyncTable` classes from abstract classes with a single implementation to plain classes because we no longer have the remote implementation in python.

Note: this does NOT add the `add` function to the remote table. This PR was already large enough, and the remote implementation is unique enough, that I am going to do all the remote stuff at a later date (we should have the structure in place and correct, so there shouldn't be any refactor concerns). --------- Co-authored-by: Will Jones <willjones127@gmail.com>
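The close-method convention from the first bullet can be sketched in Python as follows (illustrative only, not lancedb's actual classes): explicitly drop the underlying native handle instead of waiting for the garbage collector.

```python
class Table:
    def __init__(self, inner):
        self._inner = inner  # stands in for the native (Rust) object

    @property
    def is_open(self) -> bool:
        return self._inner is not None

    def close(self) -> None:
        self._inner = None  # releases the native handle deterministically

    def __repr__(self) -> str:  # mirrors the Display -> __repr__ hookup
        return "Table(closed)" if self._inner is None else "Table(open)"

tbl = Table(inner=object())
tbl.close()
print(repr(tbl))  # Table(closed)
```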
The eslint rules specify some formatting requirements that are rather strict and conflict with vscode's default formatter. I was unable to get auto-formatting set up correctly. Also, eslint has quite recently [given up on formatting](https://eslint.org/blog/2023/10/deprecating-formatting-rules/) and recommends using a 3rd-party formatter. This PR adds prettier as the formatter and restores the eslint rules to their defaults. This means we now have the "no explicit any" check back on; I know that rule is pedantic, but it did help me catch a few corner cases in type testing that weren't covered in the current code. Leaving in draft as this is dependent on other PRs.
Arrow-js uses brittle `instanceof` checks throughout the code base. These fail unless the library instance that produced the object is exactly the same instance that vectordb is using. At a minimum, this means a user on arrow version 15 (or any version that doesn't exactly match the version vectordb uses) will get strange errors when they try to use vectordb. There are even cases where the versions are perfectly identical and the `instanceof` check still fails; one such example is when using `vite` (e.g. vitejs/vite#3910). This PR solves the problem in a rather brute-force but workable fashion: if we encounter a schema that does not pass the `instanceof` check, we attempt to sanitize it by traversing the object and, if it has all the correct properties, constructing an appropriate `Schema` instance via deep cloning.
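The sanitization idea is language-agnostic; here it is sketched in Python (the real fix lives in vectordb's TypeScript, and the `Schema`/`sanitize_schema` names below are illustrative): when an identity-based type check fails because the object came from a different copy of the library, fall back to duck-typing the expected properties and rebuilding a local instance.

```python
class Schema:
    def __init__(self, fields):
        self.fields = list(fields)

def sanitize_schema(obj) -> Schema:
    if isinstance(obj, Schema):
        return obj  # same library instance, nothing to do
    if hasattr(obj, "fields"):
        # Looks like a Schema from another library instance:
        # clone it into our own Schema type.
        return Schema(obj.fields)
    raise TypeError("object does not look like a Schema")

class ForeignSchema:  # simulates a Schema from a different module copy
    def __init__(self, fields):
        self.fields = fields

s = sanitize_schema(ForeignSchema(["id", "vector"]))
print(type(s).__name__, s.fields)  # Schema ['id', 'vector']
```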
1. Filtering with FTS mutated the schema, which caused schema mismatch problems with hybrid search, since hybrid search combines the FTS and vector search tables.
2. FTS with a filter failed with `with_row_id`. This was because the row id was calculated before filtering, which caused a size mismatch when attaching it afterwards.
3. The fix for 1 means the row id is now attached before filtering, but passing a filter to `to_lance` on a dataset that already contains `_rowid` raises a panic from lance. So, temporarily, when FTS is used with a filter AND `with_row_id`, we force the user onto the duckdb pathway. --------- Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: prrao87 <prrao87@gmail.com> Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
The fact that we convert errors to strings makes them really hard to work with. For example, in SaaS we want to know whether the underlying `lance::Error` was the `InvalidInput` variant, so we can return a 400 instead of a 500.
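A sketch of the motivation in Python (illustrative only, not lancedb's actual error types): keeping the error variant as a type, rather than flattening it to a string, lets callers map it to the right HTTP status.

```python
class LanceDBError(Exception):
    """Base class standing in for lance::Error."""

class InvalidInput(LanceDBError):
    """Stands in for the InvalidInput variant."""

def http_status(err: Exception) -> int:
    # 400 for caller mistakes, 500 for everything else -- a decision
    # that is unreliable once the variant is flattened to a string.
    return 400 if isinstance(err, InvalidInput) else 500

print(http_status(InvalidInput("bad filter")))  # 400
print(http_status(LanceDBError("io failure")))  # 500
```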
…ble_names function from sync table_names function (lancedb#1059) The synchronous table_names function in python lancedb relies on arrow's filesystem, which behaves slightly differently than object_store. As a result, the function would not work properly on GCS. The async table_names function, however, uses object_store directly and is therefore accurate. In most cases we can fall back to the async table_names function, and this PR does so. The one case we cannot is when the user is already in an async context (we can't start a new async event loop). Soon we can just redirect those users to the async API instead of the sync API, so that case will eventually go away. For now, we fall back to the old behavior there.
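The fallback logic can be sketched as follows (names are illustrative; the stub bodies stand in for the real implementations): drive the accurate async version from sync code, unless we are already inside a running event loop, in which case use the legacy sync path.

```python
import asyncio

async def table_names_async():
    return ["tbl_a", "tbl_b"]  # stand-in for the object_store-backed impl

def table_names_legacy():
    return ["tbl_a", "tbl_b"]  # stand-in for the arrow-filesystem impl

def table_names():
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running: safe to drive the accurate async version.
        return asyncio.run(table_names_async())
    # Already inside an event loop: asyncio.run would fail, fall back.
    return table_names_legacy()

print(table_names())  # ['tbl_a', 'tbl_b']
```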
lancedb#1002 accidentally changed `checkout_latest` to do nothing if the table was already in latest mode. This PR makes sure it forces a reload of the table (if there is a newer version).
@eduardjbotha I would like to support this. However, we do not check LanceDB Python's annotations in CI yet, so I don't think many of them are even correct at the moment. Therefore, I'm reticent to turn this on until we start validating our codebase with mypy. I've filed #1117 to do that first. If you would like, you are welcome to open a PR for that.
alexkohler pushed a commit to alexkohler/lancedb that referenced this pull request on Apr 20, 2024: "provide schema conversion for both schema and reference"
With `lancedb` as a dependency, running mypy on a project complains with the following error: