HypGrep

Build a compact full-text search index for a Parquet file using hyparquet and hyparquet-writer.

Why?

Enable efficient full-text search on large Parquet datasets from any client, without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or another client using HTTP range requests. The index tells the client exactly which row blocks to fetch, so it downloads only the data it needs.

Perfect for serverless architectures where you want to offer search capabilities without managing infrastructure.
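
As a rough illustration of the range-request idea above, an HTTP client can ask for a slice of a file by sending a Range header. This is only a sketch: the helper names byteRangeHeader and fetchByteRange are hypothetical and not part of hypgrep, and the library's own fetching logic may differ.

```javascript
// Hypothetical helper (not part of hypgrep): build the Range header value
// for the half-open byte span [start, end).
// HTTP byte ranges are inclusive on both ends, hence `end - 1`.
function byteRangeHeader(start, end) {
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 0 || end <= start) {
    throw new RangeError(`invalid byte span: ${start}..${end}`)
  }
  return `bytes=${start}-${end - 1}`
}

// Hypothetical helper: download only the requested bytes of a remote file.
// Uses the global fetch available in browsers and in Node.js 18+.
async function fetchByteRange(url, start, end) {
  const res = await fetch(url, { headers: { Range: byteRangeHeader(start, end) } })
  // 206 Partial Content means the server honored the range;
  // a 200 would mean it ignored the header and sent the whole file.
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`)
  }
  return res.arrayBuffer()
}
```

A static file host that supports range requests is enough for this pattern, which is why no search server is needed.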

CLI usage

npx hypgrep dataset.parquet [dataset.index.parquet]

To install as a system-wide CLI tool:

npm install -g hypgrep
hypgrep dataset.parquet [dataset.index.parquet]

Find rows in a Parquet file in JavaScript

Use parquetFind to find rows matching a query while preserving natural row order (like Ctrl+F):

import { parquetFind } from 'hypgrep'

for await (const row of parquetFind({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // { title: '...', text: '...' }
}
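
Because parquetFind is consumed with for await, results can be cut off early once enough rows have been seen. The take helper below is a hypothetical utility, not part of hypgrep; it works with any async iterable, and whether stopping early also avoids further downloads depends on how lazily the iterator does its work.

```javascript
// Hypothetical helper (not part of hypgrep): collect at most `limit` items
// from any async iterable, then stop iterating.
async function take(rows, limit) {
  const out = []
  if (limit <= 0) return out
  for await (const row of rows) {
    out.push(row)
    // break ends the loop and closes the iterator, so no more rows are pulled from it
    if (out.length >= limit) break
  }
  return out
}

// Usage sketch (assumes parquetFind is imported as in the example above):
// const firstTen = await take(parquetFind({ query: 'serverless', url }), 10)
```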

Ranked search

Use parquetSearch to rank results by BM25 relevance score (like a search engine):

import { parquetSearch } from 'hypgrep'

for await (const row of parquetSearch({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // highest relevance first
}
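
For readers unfamiliar with BM25, the sketch below shows the textbook Okapi BM25 formula with the commonly used defaults k1 = 1.2 and b = 0.75. It is not taken from hypgrep's source: the function name bm25Scores, the pre-tokenized input, and the parameter values are assumptions for illustration, and the library's tokenization and scoring details may differ.

```javascript
// Textbook Okapi BM25 (illustrative sketch, not hypgrep's implementation).
// docs: array of documents, each an array of tokens.
// queryTerms: array of query tokens.
// Returns one relevance score per document, in the same order as docs.
function bm25Scores(docs, queryTerms, k1 = 1.2, b = 0.75) {
  const n = docs.length
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / n
  return docs.map(doc => {
    let score = 0
    for (const term of queryTerms) {
      // term frequency within this document
      const tf = doc.filter(t => t === term).length
      if (tf === 0) continue
      // document frequency: how many documents contain the term
      const df = docs.filter(d => d.includes(term)).length
      // rarer terms get a higher inverse document frequency
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5))
      // saturating term frequency, normalized by relative document length
      score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc.length / avgLen))
    }
    return score
  })
}

// Usage sketch: rank document positions from most to least relevant.
const exampleDocs = [['a', 'a', 'b'], ['a', 'b', 'c'], ['x', 'y', 'z']]
const exampleScores = bm25Scores(exampleDocs, ['a'])
const ranked = exampleScores
  .map((score, i) => ({ i, score }))
  .sort((x, y) => y.score - x.score)
```

Documents that mention a query term more often, or that are shorter than average, score higher, while documents without any query term score zero.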

Create an index in JavaScript

import { asyncBufferFromFile } from 'hyparquet'
import { fileWriter } from 'hyparquet-writer'
import { createIndex } from 'hypgrep'

// Generate dataset.index.parquet from dataset.parquet
const sourceFile = await asyncBufferFromFile('dataset.parquet')
const indexFile = fileWriter('dataset.index.parquet')
await createIndex({ sourceFile, indexFile })
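
To give a feel for how an index can point a client at the right row blocks, here is a toy inverted index that maps each token to the block numbers containing it. This is purely illustrative: the index layout that createIndex actually writes is not described in this README, and buildBlockIndex, blocksFor, the tokenizer, and the blockSize parameter are all hypothetical.

```javascript
// Toy sketch, NOT hypgrep's index format: map each lowercase token to the
// sorted list of row-block numbers in which it appears.
// rows: array of strings; blockSize: number of rows per block.
function buildBlockIndex(rows, blockSize) {
  const index = new Map()
  rows.forEach((text, i) => {
    const block = Math.floor(i / blockSize)
    // naive tokenizer: runs of ASCII letters and digits
    for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
      if (!index.has(token)) index.set(token, new Set())
      index.get(token).add(block)
    }
  })
  // convert each set of block numbers into a sorted array
  return new Map(
    [...index].map(([token, blocks]) => [token, [...blocks].sort((a, b) => a - b)])
  )
}

// Look up which blocks could contain a term; an empty array means no match.
function blocksFor(index, term) {
  return index.get(term.toLowerCase()) ?? []
}

// Usage sketch with four rows split into blocks of two rows each
const toyIndex = buildBlockIndex(
  ['Serverless search', 'Parquet files', 'Range requests', 'serverless index'],
  2
)
const serverlessBlocks = blocksFor(toyIndex, 'serverless')
```

A client holding such a lookup table only needs to download the blocks listed for its query terms.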

Local Parquet files

To search local Parquet files, provide an asyncBufferFactory that loads each file from the local filesystem:

import { asyncBufferFromFile } from 'hyparquet'
import { parquetFind } from 'hypgrep'

// Loads parquet file from local filesystem
function asyncBufferFactory({ url }) {
  return asyncBufferFromFile(url)
}

for await (const row of parquetFind({
  query: 'serverless',
  url: 'dataset.parquet',
  asyncBufferFactory,
})) {
  console.log(row)
}
