hyparam/hypgrep

HypGrep

License: MIT

Build a compact full-text search index for a Parquet file using hyparquet and hyparquet-writer.

Why?

Enable efficient full-text search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.

This makes HypGrep a good fit for serverless architectures where you want to offer search without managing infrastructure.
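
The range-request mechanism described above can be sketched in a few lines. This is an illustration of plain HTTP range requests, not hypgrep's internal code; the helper names `rangeHeader` and `fetchRange` are hypothetical, and the sketch assumes a runtime with a global `fetch` (browsers, Node 18+) and a server that honors `Range`.

```javascript
// Hypothetical helpers, not part of hypgrep or hyparquet.

// Build a Range header value for bytes [start, end), end exclusive.
// HTTP byte ranges are inclusive on both ends, hence `end - 1`.
function rangeHeader(start, end) {
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 0 || end <= start) {
    throw new RangeError(`invalid byte range: ${start}-${end}`)
  }
  return `bytes=${start}-${end - 1}`
}

// Download only the requested bytes of a remote file.
async function fetchRange(url, start, end) {
  const res = await fetch(url, { headers: { Range: rangeHeader(start, end) } })
  // 206 Partial Content means the server honored the range.
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`)
  }
  return res.arrayBuffer()
}

// Usage (performs a network request):
// const bytes = await fetchRange('https://s3.hyperparam.app/hypgrep/wiki_en.parquet', 0, 4)
```

A client that knows which row blocks contain matches only needs to issue requests like this for those byte ranges, rather than downloading the whole file.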

CLI usage

npx hypgrep dataset.parquet [dataset.index.parquet]

To install as a system-wide CLI tool:

npm install -g hypgrep
hypgrep dataset.parquet [dataset.index.parquet]

Find rows in a Parquet file in JavaScript

Use parquetFind to find rows matching a query while preserving natural row order (like Ctrl+F):

import { parquetFind } from 'hypgrep'

for await (const row of parquetFind({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // { title: '...', text: '...' }
}
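
Since the result is consumed with `for await`, it can be cut short once enough matches have arrived. The sketch below uses a hypothetical `take` helper (not part of hypgrep) that works with any async iterable. Whether stopping early also avoids further downloads depends on hypgrep's implementation, which is not described here.

```javascript
// Hypothetical helper: collect at most `limit` items from an async iterable.
async function take(rows, limit) {
  const out = []
  if (limit <= 0) return out
  for await (const row of rows) {
    out.push(row)
    // Leaving the loop early calls the iterator's return() method,
    // so an async generator stops producing further items.
    if (out.length >= limit) break
  }
  return out
}

// Usage with hypgrep:
// import { parquetFind } from 'hypgrep'
// const firstTen = await take(parquetFind({
//   query: 'serverless',
//   url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
// }), 10)
```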

Ranked search

Use parquetSearch to rank results by BM25 relevance score (like a search engine):

import { parquetSearch } from 'hypgrep'

for await (const row of parquetSearch({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // highest relevance first
}
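
For readers unfamiliar with BM25, here is a minimal sketch of the standard Okapi BM25 formula over an in-memory list of strings. It is not hypgrep's implementation: the tokenizer, the parameters `k1 = 1.2` and `b = 0.75`, and the function names are all assumptions chosen for illustration.

```javascript
// Illustrative Okapi BM25 ranking; NOT hypgrep's implementation.
// Tokenizer, k1 and b are assumptions.

// Lowercase the text and split it into runs of letters and digits.
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? []
}

// BM25 contribution of one query term to one document.
function bm25TermScore({ tf, docLength, avgDocLength, docCount, docFreq, k1 = 1.2, b = 0.75 }) {
  if (tf === 0) return 0
  // Rare terms get a higher inverse document frequency.
  const idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
  // Longer-than-average documents are penalized.
  const lengthNorm = 1 - b + b * (docLength / avgDocLength)
  return idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm)
}

// Return [{ index, score }] for documents matching any query term, best first.
function rankByBm25(docs, query) {
  const tokenized = docs.map(tokenize)
  const docCount = docs.length
  const avgDocLength = tokenized.reduce((sum, t) => sum + t.length, 0) / docCount
  const terms = [...new Set(tokenize(query))]
  const scored = tokenized.map((tokens, index) => {
    let score = 0
    for (const term of terms) {
      const tf = tokens.filter(t => t === term).length
      const docFreq = tokenized.filter(t => t.includes(term)).length
      score += bm25TermScore({ tf, docLength: tokens.length, avgDocLength, docCount, docFreq })
    }
    return { index, score }
  })
  return scored.filter(d => d.score > 0).sort((a, b) => b.score - a.score)
}
```

A document that mentions the query term more often, relative to its length, ranks higher; documents without any query term are dropped.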

Create an index in JavaScript

import { asyncBufferFromFile } from 'hyparquet'
import { fileWriter } from 'hyparquet-writer'
import { createIndex } from 'hypgrep'

// Generate dataset.index.parquet from dataset.parquet
const sourceFile = await asyncBufferFromFile('dataset.parquet')
const indexFile = fileWriter('dataset.index.parquet')
await createIndex({ sourceFile, indexFile })
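
When indexing several datasets, it can help to derive the index path from the dataset path. The sketch below follows the `dataset.parquet` to `dataset.index.parquet` naming used in the examples above. The helper `defaultIndexPath` is hypothetical, and the naming convention is inferred from those examples rather than from hypgrep's documented behavior.

```javascript
// Hypothetical helper: derive an index path from a dataset path.
// Assumes the `<name>.index.parquet` convention shown in the examples.
function defaultIndexPath(datasetPath) {
  const suffix = /\.parquet$/i
  return suffix.test(datasetPath)
    ? datasetPath.replace(suffix, '.index.parquet')
    : `${datasetPath}.index.parquet`
}

// Usage with hypgrep:
// import { asyncBufferFromFile } from 'hyparquet'
// import { fileWriter } from 'hyparquet-writer'
// import { createIndex } from 'hypgrep'
//
// const datasetPath = 'dataset.parquet'
// const sourceFile = await asyncBufferFromFile(datasetPath)
// const indexFile = fileWriter(defaultIndexPath(datasetPath))
// await createIndex({ sourceFile, indexFile })
```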

Local Parquet files

To search local Parquet files, provide an asyncBufferFactory that loads the file from the local filesystem:

import { asyncBufferFromFile } from 'hyparquet'
import { parquetFind } from 'hypgrep'

// Loads parquet file from local filesystem
function asyncBufferFactory({ url }) {
  return asyncBufferFromFile(url)
}

for await (const row of parquetFind({
  query: 'serverless',
  url: 'dataset.parquet',
  asyncBufferFactory,
})) {
  console.log(row)
}
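
If the same code needs to handle both remote URLs and local paths, the factory can dispatch on the URL scheme. This is a sketch under the assumption that the factory is called with `{ url }` only, as in the example above. `makeAsyncBufferFactory` and `isRemoteUrl` are hypothetical names; the loaders are passed in so the sketch stays independent of any particular library.

```javascript
// Hypothetical helpers, not part of hypgrep or hyparquet.

// True for http:// and https:// URLs, false for anything else (treated as a local path).
function isRemoteUrl(url) {
  return /^https?:\/\//i.test(url)
}

// Build a factory that picks a loader based on the URL scheme.
// fromUrl is called with { url }, fromFile with the path string.
function makeAsyncBufferFactory({ fromUrl, fromFile }) {
  return function asyncBufferFactory({ url }) {
    return isRemoteUrl(url) ? fromUrl({ url }) : fromFile(url)
  }
}

// Usage with hyparquet's loaders:
// import { asyncBufferFromFile, asyncBufferFromUrl } from 'hyparquet'
// const asyncBufferFactory = makeAsyncBufferFactory({
//   fromUrl: asyncBufferFromUrl,
//   fromFile: asyncBufferFromFile,
// })
```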
