Build a compact full-text search index for a Parquet file using hyparquet and hyparquet-writer.
Enable efficient full-text search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.
Perfect for serverless architectures where you want to offer search capabilities without managing infrastructure.
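To make the range-request idea concrete, here is a minimal sketch of fetching only a byte span of a remote file. The helper names `rangeHeader` and `fetchRange` are hypothetical and are not part of hypgrep or hyparquet; the sketch assumes a runtime with a global `fetch` (browsers, Node 18+) and a server that honors `Range` headers.

```javascript
// Hypothetical helpers for illustration; not part of hypgrep or hyparquet.

// Build an HTTP Range header value for the half-open byte span [start, end).
// HTTP byte ranges are inclusive on both ends, so the last byte is end - 1.
function rangeHeader(start, end) {
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 0 || end <= start) {
    throw new RangeError(`invalid byte range: ${start}..${end}`)
  }
  return `bytes=${start}-${end - 1}`
}

// Download only the requested bytes of a remote file.
async function fetchRange(url, start, end) {
  const res = await fetch(url, { headers: { Range: rangeHeader(start, end) } })
  // 206 Partial Content means the server returned just the requested span
  if (res.status !== 206) throw new Error(`expected 206, got ${res.status}`)
  return res.arrayBuffer()
}
```

A client that knows which parts of the file hold matching rows can call something like `fetchRange(url, start, end)` for each span instead of downloading the whole dataset.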
npx hypgrep dataset.parquet [dataset.index.parquet]

To install as a system-wide CLI tool:
npm install -g hypgrep
hypgrep dataset.parquet [dataset.index.parquet]

Use parquetFind to find rows matching a query while preserving natural row order (like Ctrl+F):
import { parquetFind } from 'hypgrep'
for await (const row of parquetFind({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // { title: '...', text: '...' }
}

Use parquetSearch to rank results by BM25 relevance score (like a search engine):
import { parquetSearch } from 'hypgrep'
for await (const row of parquetSearch({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // highest relevance first
}

Use createIndex to generate the index file from code:
import { asyncBufferFromFile } from 'hyparquet'
import { fileWriter } from 'hyparquet-writer'
import { createIndex } from 'hypgrep'
// Generate dataset.index.parquet from dataset.parquet
const sourceFile = await asyncBufferFromFile('dataset.parquet')
const indexFile = fileWriter('dataset.index.parquet')
await createIndex({ sourceFile, indexFile })

To search against local parquet files, provide an asyncBufferFactory that loads the file from the local filesystem:
import { asyncBufferFromFile } from 'hyparquet'
import { parquetFind } from 'hypgrep'
// Loads parquet file from local filesystem
function asyncBufferFactory({ url }) {
  return asyncBufferFromFile(url)
}
for await (const row of parquetFind({
  query: 'serverless',
  url: 'dataset.parquet',
  asyncBufferFactory,
})) {
  console.log(row)
}
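As background for the BM25 ranking that parquetSearch uses, the following is a minimal in-memory sketch of BM25 scoring. It is not hypgrep's implementation: the `tokenize` and `bm25Rank` names, the simple alphanumeric tokenizer, the `k1 = 1.2` and `b = 0.75` defaults, and the particular IDF variant are all assumptions chosen for illustration.

```javascript
// Illustrative BM25 sketch; names and parameter defaults are assumptions,
// not hypgrep's actual scoring code.

// Split text into lowercase alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? []
}

// Score each document against the query and return matches, best first.
function bm25Rank(docs, query, k1 = 1.2, b = 0.75) {
  const tokenized = docs.map(tokenize)
  const avgLen = tokenized.reduce((sum, t) => sum + t.length, 0) / docs.length
  const terms = tokenize(query)
  return docs
    .map((doc, i) => {
      const tokens = tokenized[i]
      let score = 0
      for (const term of terms) {
        // Term frequency in this document
        const tf = tokens.filter(t => t === term).length
        if (tf === 0) continue
        // Document frequency: how many documents contain the term
        const df = tokenized.filter(t => t.includes(term)).length
        // Rarer terms get a higher inverse document frequency
        const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5))
        // Length normalization penalizes longer-than-average documents
        const norm = k1 * (1 - b + b * tokens.length / avgLen)
        score += idf * (tf * (k1 + 1)) / (tf + norm)
      }
      return { doc, score }
    })
    .filter(r => r.score > 0)
    .sort((x, y) => y.score - x.score)
}
```

In this sketch a document that mentions the query term more often, relative to its length, scores higher, which is the intuition behind "highest relevance first".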