# Quickstart with LanceDB Enterprise
Welcome to LanceDB Enterprise!

This notebook walks through a simple example that shows how to use LanceDB Enterprise.



## Step 1: Install LanceDB

In [None]:
! pip install lancedb datasets

Collecting lancedb
  Downloading lancedb-0.21.1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (4.1 kB)
Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting deprecation (from lancedb)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl.metadata (4.6 kB)
Collecting overrides>=0.7 (from lancedb)
  Downloading overrides-7.7.0-py3-none-any.whl.metadata (5.8 kB)
Collecting pylance>=0.23.2 (from lancedb)
  Downloading pylance-0.24.1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (7.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading lancedb-0.21.1-cp39-abi3-manylinux_2_28_x86_64.whl (33.2 MB)

## Step 2: Obtain the `db_uri`, `api_key` and `host_override` from the LanceDB team

The LanceDB team will share the following values with you over a secure channel.

In [None]:
db_uri = "db://your-db-uri"  # @param {type:"string"}

In [None]:
api_key = "your-lancedb-api-key"  # @param {type:"string"}

In [None]:
host_override = "your-host-override"  # @param {type:"string"}
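
Rather than pasting credentials into notebook cells, one option is to read them from environment variables. The sketch below is a minimal illustration: the variable names (`LANCEDB_URI`, `LANCEDB_API_KEY`, `LANCEDB_HOST_OVERRIDE`) and the helper `load_connection_settings` are assumptions made for this example, not part of the notebook or of the values the LanceDB team provides.

```python
import os

# Hypothetical variable names, chosen for this sketch only.
REQUIRED_VARS = ("LANCEDB_URI", "LANCEDB_API_KEY", "LANCEDB_HOST_OVERRIDE")


def load_connection_settings(env=None):
    """Return (db_uri, api_key, host_override), raising if any value is missing."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in REQUIRED_VARS)
```

With the variables set, `db_uri, api_key, host_override = load_connection_settings()` could replace the three cells above.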

## Step 3: Connect to LanceDB Enterprise

In [None]:
import lancedb

# db_uri, api_key, and host_override were set in the Step 2 cells
db = lancedb.connect(
    uri=db_uri,
    api_key=api_key,
    host_override=host_override,
    region="us-east-2",
)

## Step 4: Ingest Data

We use the `ag_news` dataset from [HuggingFace](https://huggingface.co/datasets/sunhaozhepy/ag_news_sbert_keywords_embeddings), which includes precomputed 768-dimensional embeddings. To keep the example small and fast, we load only the first 1,000 rows of the test split.

In [None]:
from datasets import load_dataset
import pyarrow as pa

sample_dataset = load_dataset(
    "sunhaozhepy/ag_news_sbert_keywords_embeddings", split="test[:1000]"
)
vector_dim = len(sample_dataset[0]["keywords_embeddings"])
print(sample_dataset.column_names)
print(sample_dataset[:5])

table_name = "lancedb-enterprise-quickstart"
table = db.create_table(table_name, data=sample_dataset, mode="overwrite")

# convert list to fixedsizelist on the vector column
table.alter_columns(
    dict(path="keywords_embeddings", data_type=pa.list_(pa.float32(), vector_dim))
)

['text', 'label', 'keywords', 'keywords_embeddings']
{'text': ["Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", 'The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.', 'Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.', "Prediction Unit Helps Forecast Wildfires (AP) AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he

ℹ️ There are various ways to specify the table schema. More details can be found in our [documentation](https://docs.lancedb.com/core/ingestion).
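
The ingestion cell reads `vector_dim` from the first row only, while a fixed-size list column needs every row to hold exactly that many values. As a hedged sketch, the helper below (a hypothetical function, not a LanceDB or `datasets` API) checks that all embeddings share one length before the column is converted.

```python
def embedding_dimension(vectors):
    """Return the shared length of all vectors, or raise if the lengths differ."""
    lengths = {len(v) for v in vectors}
    if not lengths:
        raise ValueError("no vectors supplied")
    if len(lengths) != 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(lengths)}")
    return lengths.pop()
```

In this notebook it could be called as `vector_dim = embedding_dimension(sample_dataset["keywords_embeddings"])`, assuming the column can be iterated as a sequence of lists.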

### Check that the table was created and the data ingested

In [None]:
# list all tables in the db
print(db.table_names())

# list table schema, total number of rows
print("Table schema: ", table.schema)
print("Total number of rows: ", table.count_rows())

# show sample data from the table
print(table.search().limit(5).to_pandas())

['lancedb-enterprise-quickstart']
Table schema:  text: string
label: int64
keywords: string
keywords_embeddings: fixed_size_list<item: float>[768]
  child 0, item: float
-- schema metadata --
huggingface: '{"info": {"features": {"text": {"dtype": "string", "_type":' + 248
Total number of rows:  1000
                                                text  label  \
0  Fears for T N pension after talks Unions repre...      2   
1  The Race is On: Second Private Team Sets Launc...      3   
2  Ky. Company Wins Grant to Study Peptides (AP) ...      3   
3  Prediction Unit Helps Forecast Wildfires (AP) ...      3   
4  Calif. Aims to Limit Farm-Related Smog (AP) AP...      3   

                          keywords  \
0    pension, disappointed, unions   
1      launch, spaceflight, rocket   
2              peptides, amino, ap   
3  forecast, wildfires, prediction   
4       emissions, smog, pollution   

                                 keywords_embeddings  
0  [-0.04149173, 0.10335736, 0.02729

## Step 5: Create a vector index

We will create a vector index on the `keywords_embeddings` column.



In [None]:
table.create_index("cosine", vector_column_name="keywords_embeddings")

⚠️ WARNING: `create_index` is asynchronous: it returns while indexing is still in progress. Use the `list_indices` and `index_stats` APIs to check the index status. The index name is formed by appending `_idx` to the column name. Note that `list_indices` does not report the index until it has fully ingested and indexed all available data.

In [None]:
import time


def wait_for_index(table, index_name):
    POLL_INTERVAL = 10
    while True:
        indices = table.list_indices()

        if indices and any(index.name == index_name for index in indices):
            break
        print(f"⏳ Waiting for {index_name} to be ready...")
        time.sleep(POLL_INTERVAL)

    print(f"✅ {index_name} is ready!")
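
The loop above polls forever if the index never appears. A variant with a deadline is sketched below. The function name and the `timeout_s` and `poll_interval_s` parameters are assumptions for illustration. It relies only on `table.list_indices()` returning objects with a `name` attribute, as the cell above does.

```python
import time


def wait_for_index_with_timeout(table, index_name, timeout_s=600.0, poll_interval_s=10.0):
    """Poll table.list_indices() until index_name appears or timeout_s elapses.

    Returns True if the index was found, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        indices = table.list_indices()
        if indices and any(index.name == index_name for index in indices):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

Returning a boolean instead of looping indefinitely lets the caller decide whether to retry, raise, or continue without the index.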

In [None]:
index_name = "keywords_embeddings_idx"
wait_for_index(table, index_name)
print(table.index_stats(index_name))

✅ keywords_embeddings_idx is ready!
IndexStatistics(num_indexed_rows=1000, num_unindexed_rows=0, index_type='IVF_PQ', distance_type='cosine', num_indices=1)


## Step 6: Vector Query

Let's perform a vector search. Only the `text`, `keywords`, and `label` columns are selected; the result also includes a `_distance` column with each row's distance to the query vector.



In [None]:
query_dataset = load_dataset(
    "sunhaozhepy/ag_news_sbert_keywords_embeddings", split="test[5000:5001]"
)
print(query_dataset[0]["keywords"])
query_embed = query_dataset["keywords_embeddings"][0]

table.search(query_embed).select(["text", "keywords", "label"]).limit(5).to_pandas()

toyota, profit, carmaker


| | text | keywords | label | _distance |
|---|---|---|---|---|
| 0 | The Hunt for a Hybrid The Aug. 23 front-page a... | prius, civic, toyota | 2 | 0.766818 |
| 1 | GM pulls Corvette ad with underage driver DETR... | corvette, commercial, gm | 2 | 0.889155 |
| 2 | GM pulls Guy Ritchie car ad after protest Prot... | car, corvette, ad | 2 | 0.895505 |
| 3 | Toy store profits R back up TOY retailer Toys ... | toys, toy, profits | 2 | 0.91866 |
| 4 | Clicking on Profits The latest data from the U... | profits, commerce, sales | 2 | 0.932535 |
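
To help read the `_distance` column: the index was built with the `cosine` metric, and cosine distance is commonly defined as one minus cosine similarity. Under that assumption, values range from 0 (same direction) to 2 (opposite direction), so distances above 1 are possible. The pure-Python sketch below illustrates the definition; it is not LanceDB's implementation.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b), in the range [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)
```

Because the measure depends only on direction, scaling either vector leaves the distance unchanged.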


Let's perform another search to filter by the `label` column.

Note: For large datasets, scalar indexes dramatically accelerate filtering operations. Learn how to create and configure them in our [scalar indexing guide](https://docs.lancedb.com/core/index#scalar-index).

In [None]:
print(query_dataset[0]["keywords"])
query_embed = query_dataset["keywords_embeddings"][0]

table.search(query_embed).where("label > 2").select(
    ["text", "keywords", "label"]
).limit(5).to_pandas()

toyota, profit, carmaker


| | text | keywords | label | _distance |
|---|---|---|---|---|
| 0 | IT seeing steady but slow growth: Forrester pr... | tech, growth, companies | 3 | 0.975853 |
| 1 | Does Nick Carr matter? Strategybusiness conclu... | strategybusiness, strategic, controversial | 3 | 1.036829 |
| 2 | Salesforce.com 2Q Profit Up Sharply Software d... | salesforce, revenue, profit | 3 | 1.07117 |
| 3 | European Union Extends Review of Microsoft Dea... | microsoft, msft, belgium | 3 | 1.083879 |
| 4 | Caterpillar snaps up another remanufacturer of... | remanufacturer, caterpillar, acquire | 3 | 1.08823 |
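
The `where` clause takes a SQL-style filter string. When the label values come from a Python list, a small helper can build the string. The sketch below is a hypothetical helper, not a LanceDB API, and it assumes that the filter syntax of your LanceDB version accepts `IN` lists.

```python
def in_filter(column, values):
    """Build a SQL-style filter such as 'label IN (2, 3)' from integer values."""
    if not column.isidentifier():
        raise ValueError(f"unexpected column name: {column!r}")
    # Coerce to int so that only numeric literals end up in the filter string.
    ints = [int(v) for v in values]
    if not ints:
        raise ValueError("values must not be empty")
    return f"{column} IN ({', '.join(str(v) for v in ints)})"
```

It could then be used as `table.search(query_embed).where(in_filter("label", [2, 3]))`.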


In [None]:
!pip install tantivy

Collecting tantivy
  Downloading tantivy-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading tantivy-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 28.9 MB/s eta 0:00:00
Installing collected packages: tantivy
Successfully installed tantivy-0.22.0


## Step 7: Full-Text Search

Let's create a full-text search (FTS) index on the `text` column.

In [None]:
table.create_fts_index("text")

Like `create_index`, `create_fts_index` is asynchronous. Let's make sure the FTS index has been created before running the query.

In [None]:
index_name = "text_idx"
wait_for_index(table, index_name)
print(table.index_stats(index_name))

Now, let's perform a full-text query.

In [None]:
fts_result = (
    table.search("football", query_type="fts")
    .select(["text", "keywords", "label"])
    .limit(5)
    .to_pandas()
)
fts_result

## Step 8: Hybrid search

Let's combine vector search and full-text search in a hybrid search. LanceDB offers built-in rerankers and also lets you implement your own custom reranker.

In [None]:
from lancedb.rerankers import RRFReranker

query_text = query_dataset[0]["keywords"]
query_embed = query_dataset["keywords_embeddings"][0]
# we will use the RRF reranker
reranker = RRFReranker()

hybrid_result = (
    table.search(query_type="hybrid", fts_columns="text")
    .vector(query_embed)
    .text(query_text)
    .rerank(reranker)
    .select(["text", "keywords", "label"])
    .limit(5)
    .to_pandas()
)
hybrid_result
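
For intuition about what an RRF reranker does, the sketch below implements generic Reciprocal Rank Fusion over ranked lists of ids: each id receives the sum of `1 / (k + rank)` across the lists it appears in. This is the general technique, not necessarily how `RRFReranker` is implemented. The function name, the 1-based ranks, and the constant `k=60` are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each id scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because only ranks enter the score, the vector and full-text results can be fused without putting their raw scores on a common scale.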

## Step 9: Cleanup

We can now delete the table.


In [None]:
db.drop_table(table_name)

Please refer to the [LanceDB docs](https://docs.lancedb.com) for more details. If you have any questions, please contact us in the dedicated Slack channel created for your team. Thank you!