# Bulk ingestion demo (inserts)

This demo shows our prototype for faster bulk ingestion into LanceDB. We'll look at two demo ingestion pipelines:

1. A straightforward upload of the TPCH `lineitem` table (SF=10) from Parquet.
2. An upload of a 10 million row sample of the English Wikipedia dataset from Hugging Face, with vector embeddings generated on ingestion using a local mock of OpenAI.

## Part 1: Upload 10GB of TPCH lineitem data from Parquet

In [None]:
!tpchgen-cli -s 10 --tables lineitem --parts 5 --format=parquet

In [2]:
import pyarrow.dataset as pa_ds
import os

parquet_ds = pa_ds.dataset("lineitem")

size_bytes = sum(os.path.getsize(f) for f in parquet_ds.files)
print(f"{parquet_ds.count_rows():,} rows, {size_bytes / (1024**3):.2f} GB")
parquet_ds.head(2).to_pandas()

59,986,052 rows, 2.37 GB


| | l_orderkey | l_partkey | l_suppkey | l_linenumber | l_quantity | l_extendedprice | l_discount | l_tax | l_returnflag | l_linestatus | l_shipdate | l_commitdate | l_receiptdate | l_shipinstruct | l_shipmode | l_comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1551894 | 76910 | 1 | 17.0 | 33078.94 | 0.04 | 0.02 | N | O | 1996-03-13 | 1996-02-12 | 1996-03-22 | DELIVER IN PERSON | TRUCK | egular courts above the |
| 1 | 1 | 673091 | 73092 | 2 | 36.0 | 38306.16 | 0.09 | 0.06 | N | O | 1996-04-12 | 1996-02-28 | 1996-04-20 | TAKE BACK RETURN | MAIL | ly final dependencies: slyly bold |


In [3]:
import lancedb
import tqdm

db = lancedb.connect("demo")
table = db.create_table("lineitem", schema=parquet_ds.schema)

with tqdm.tqdm() as pbar:
    table.add(parquet_ds, progress=pbar)

27540052it [01:21, 337224.82it/s, bytes=4.0GB, throughput=55.0MB/s]   


KeyboardInterrupt: 
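
As a rough sanity check on the interrupted run above, the progress-bar figures can be turned into an elapsed time and an estimate of the time remaining. This is plain arithmetic on the numbers printed above. It assumes the average row rate would have held for the rest of the upload, which the demo does not confirm.

```python
# Figures copied from the outputs above.
total_rows = 59_986_052        # rows in the Parquet dataset
rows_done = 27_540_052         # rows written when the run was interrupted
rows_per_sec = 337_224.82      # average rate reported by tqdm

fraction_done = rows_done / total_rows
elapsed_s = rows_done / rows_per_sec
# Assumption: the average rate stays constant for the remaining rows.
remaining_s = (total_rows - rows_done) / rows_per_sec

print(f"{fraction_done:.1%} done after {elapsed_s:.0f} s, "
      f"about {remaining_s:.0f} s remaining at the same rate")
```

The elapsed time computed this way agrees with the `01:21` shown in the progress bar to within a second.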

## Part 2: Upload 10 million rows of Wikipedia data with on-the-fly embeddings

In [None]:
from datasets import load_dataset

hf_ds = (
    load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
    .take(10_000_000)
)
list(hf_ds.take(2))


[{'id': '12',
  'url': 'https://en.wikipedia.org/wiki/Anarchism',
  'title': 'Anarchism',
  'text': 'Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies and voluntary free associations. As a historically left-wing movement, this reading of anarchism is placed on the farthest left of the political spectrum, usually described as the libertarian wing of the socialist movement (libertarian socialism).\n\nHumans have lived in societies without formal hierarchies long before the establishment of states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist ideas are found all throughout history, modern anarchism emerged from the Enlightenment.

In [None]:
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

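# `dim=512` requests shortened vectors, which the text-embedding-3 models support.
title_func = get_registry().get("openai").create(name="text-embedding-3-small", dim=512)
embedding_func = get_registry().get("openai").create(name="text-embedding-3-large")

class WikiPage(LanceModel):
    # Field names and types mirror the Hugging Face records shown above:
    # `id` is a string and the article body lives in `text`.
    id: str
    url: str
    title: str = title_func.SourceField()
    title_embedding: Vector(512) = title_func.VectorField()
    text: str = embedding_func.SourceField()
    embedding: Vector(3072) = embedding_func.VectorField()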

wiki_table = db.create_table("wikipedia", schema=WikiPage)
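
The mock embedding service itself is not shown in this notebook. As a minimal sketch of what such a stand-in could compute, the hypothetical helper `mock_embedding` below maps each text to a deterministic unit-length vector using only the standard library. The function name, the hashing scheme, and the choice of Gaussian components are all assumptions made for illustration; they are not taken from the demo or from the LanceDB or OpenAI APIs.

```python
import hashlib
import math
import random


def mock_embedding(text: str, dim: int) -> list[float]:
    """Return a deterministic unit-length vector for `text` (hypothetical helper)."""
    # Seed a private RNG from a hash of the text, so equal inputs give equal vectors.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Normalise to unit length, as real embedding vectors usually are.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


title_vec = mock_embedding("Anarchism", 512)     # same width as `title_embedding`
body_vec = mock_embedding("Anarchism is a political philosophy", 3072)
print(len(title_vec), len(body_vec))
```

A mock like this keeps the ingestion path realistic (one vector of the right width per row) without network calls, so a benchmark measures the write path rather than the embedding provider.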

In [None]:
with tqdm.tqdm() as pbar:
    wiki_table.add(hf_ds, progress=pbar)
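
The notebook does not say how `add` groups rows from a streaming source before embedding and writing them. One generic way to do it is to cut the stream into fixed-size batches so that each embedding request and each write handles many rows at once. The sketch below uses a hypothetical helper `batched` and made-up sample rows; it illustrates the idea only and is not LanceDB's implementation.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of up to `batch_size` rows without materialising the whole stream."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Made-up rows standing in for the streaming dataset.
rows = ({"id": str(i), "title": f"page {i}"} for i in range(10))
sizes = [len(batch) for batch in batched(rows, 4)]
print(sizes)  # [4, 4, 2]
```

Only one batch is held in memory at a time, which matters for a 10 million row stream.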