This module is built to allow progressive advancement from a simple JSON-based embedding database to more advanced solutions like Chroma or Redis.
import { loadProviders, Collection } from '@kessler/embedding'

async function main() {
  const { storage, embedders } = await loadProviders({
    fs: { directory: '/some/directory' },
    openai: { apiKey: 'openai key here' }
  })

  const { fs } = storage
  const { openai } = embedders

  await fs.init()
  await openai.init()

  const collection = new Collection('test', openai, fs)
  await collection.add('hello world', { created: Date.now() })
  console.log(await collection.query('hello'))

  await fs.shutdown()
  await openai.shutdown()
}

main()
Collections are the highest abstraction layer. They group together documents, their embedding data and some optional metadata.
class Collection {
  constructor(name, embeddingService, storage) {}
  // the options object is optional, so query(text) works on its own
  async query(text, { maxResults = Infinity, threshold = 0.8 } = {}) {}
  async add(text, metadata) {}
  async delete(id) {}
  async get(id) {}
}
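To make the layering concrete, here is a minimal sketch of how a collection might tie an embedder and a storage provider together. SketchCollection and both stub providers are hypothetical and written only for illustration; the real Collection may well behave differently.

```javascript
// Hypothetical sketch, not the library's actual Collection implementation.
class SketchCollection {
  constructor(name, embeddingService, storage) {
    this.name = name
    this.embeddingService = embeddingService
    this.storage = storage
  }

  // Embed the text, then let the storage provider persist it.
  async add(text, metadata = {}) {
    const embedding = await this.embeddingService.exec(text, metadata)
    return this.storage.add(this.name, text, embedding, metadata)
  }

  // Embed the query text, then let the storage provider rank stored documents.
  async query(text, { maxResults = Infinity, threshold = 0.8 } = {}) {
    const embedding = await this.embeddingService.exec(text)
    return this.storage.query(this.name, embedding, { maxResults, threshold })
  }
}

// Stub providers (assumptions) that only record what they were asked to do.
const stubEmbedder = { exec: async (text) => [text.length] }
const stubStorage = {
  calls: [],
  async add(collectionName, content, embedding, metadata) {
    this.calls.push({ op: 'add', collectionName, content, embedding, metadata })
    return this.calls.length
  },
  async query(collectionName, embedding, options) {
    this.calls.push({ op: 'query', collectionName, embedding, options })
    return []
  }
}
```

With these stubs, new SketchCollection('test', stubEmbedder, stubStorage) can be exercised without any network or disk access, which makes the delegation between the three layers easy to inspect.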
There are two categories of providers: embedding and storage. Embedding providers expose embedding services through a unified interface, and storage providers do the same for storing and querying documents.
Providers can be created manually by importing their classes and instantiating them, or they can be loaded through loadProviders (see below).
Once a provider is loaded you should call its init method, regardless of whether you loaded it manually or through loadProviders. (TODO: I might want to change this behavior)
class Embedder {
  constructor(underlyingProvider, config) {}
  async exec(text, metadata) {}
  async init() {}
  async shutdown() {}
}
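For tests it can help to have an embedder that needs no network access. The sketch below follows the interface shape above with a toy character-frequency embedding. ToyEmbedder, its dimensions option and the bucketing scheme are assumptions made for illustration, and the resulting vectors carry no semantic meaning.

```javascript
// Hypothetical embedder for tests; mirrors the Embedder interface shape.
class ToyEmbedder {
  constructor(underlyingProvider, config = {}) {
    this.provider = underlyingProvider // unused here, kept for interface parity
    this.dimensions = config.dimensions ?? 8
    this.ready = false
  }

  async init() {
    this.ready = true
  }

  async shutdown() {
    this.ready = false
  }

  // Counts characters into buckets, then normalizes the vector to unit length.
  async exec(text, metadata) {
    if (!this.ready) throw new Error('call init() before exec()')
    const vector = new Array(this.dimensions).fill(0)
    for (const char of text.toLowerCase()) {
      vector[char.codePointAt(0) % this.dimensions] += 1
    }
    const norm = Math.hypot(...vector) || 1
    return vector.map((value) => value / norm)
  }
}
```

Throwing when exec is called before init is a design choice of this sketch; it makes a forgotten init call fail loudly instead of silently.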
TODO: once a document is embedded with one service and stored, the embedding provider cannot be changed if the new provider uses a different embedding scheme. This must be addressed somehow in the design.
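One possible way to at least detect this situation is to record which embedder produced a vector and to reject mismatches. The sketch below is only an idea, not something the module does today; tagEmbedding, assertCompatible and the embedder and dimensions metadata fields are hypothetical names.

```javascript
// Hypothetical helper: attach the embedder identity to the stored metadata.
function tagEmbedding(metadata, embedderName, embedding) {
  return { ...metadata, embedder: embedderName, dimensions: embedding.length }
}

// Hypothetical helper: throws when a stored document was embedded by a different scheme.
function assertCompatible(storedMetadata, embedderName, embedding) {
  const sameName = storedMetadata.embedder === embedderName
  const sameSize = storedMetadata.dimensions === embedding.length
  if (!sameName || !sameSize) {
    throw new Error(
      `stored with ${storedMetadata.embedder}/${storedMetadata.dimensions}, ` +
      `queried with ${embedderName}/${embedding.length}`
    )
  }
}
```

A check like this does not solve the migration problem, it only turns a silent mismatch into an explicit error.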
class MyStorage {
  constructor(underlyingProvider, config) {}
  async query(collectionName, embedding, { maxResults, threshold }) {}
  async add(collectionName, content, embedding, metadata) {}
  async delete(collectionName, id) {}
  async get(collectionName, id) {}
  async init() {}
  async shutdown() {}
  async collections() {}
}
The intent of loadProviders is to load and instantiate every provider that can be loaded, meaning those whose peer dependencies are installed.
import { loadProviders } from '@kessler/embedding'

async function main() {
  const { storage, embedders } = await loadProviders({ /* ...providers config */ })

  const { pg } = storage
  const { openai } = embedders

  await pg.init()
  await openai.init()
}

main()
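The idea of loading only what can be loaded can be sketched with a dynamic import that tolerates a missing package. This is an illustration under assumptions, not the actual loadProviders code; tryImport, loadAvailable and the injectable importer parameter are hypothetical.

```javascript
// Hypothetical helper: resolves to the module, or undefined when it is not installed.
// The importer is injectable so the logic can be tested without real packages.
async function tryImport(specifier, importer = (s) => import(s)) {
  try {
    return await importer(specifier)
  } catch (error) {
    const missing = error.code === 'ERR_MODULE_NOT_FOUND' || error.code === 'MODULE_NOT_FOUND'
    if (missing) return undefined
    throw error // anything else is a real failure and should surface
  }
}

// Returns the names of the providers whose peer dependency could be imported.
// peerDependencies maps a provider name to a package specifier, e.g. { pg: 'pg' }.
async function loadAvailable(peerDependencies, importer) {
  const available = []
  for (const [providerName, specifier] of Object.entries(peerDependencies)) {
    if (await tryImport(specifier, importer)) available.push(providerName)
  }
  return available
}
```

Rethrowing every error other than "module not found" keeps a broken but installed dependency from being mistaken for an absent one.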
TBD
OpenAI is currently the only supported embedding service.
Run npm install openai to install its peer dependency.
import { loadProviders } from '@kessler/embedding'

async function main() {
  const { embedders, storage } = await loadProviders({
    openai: { apiKey: 'your-api-key' }
  })

  const { openai } = embedders
  await openai.init()
  // do stuff
  await openai.shutdown()
}

main()
The simplest, non-optimized solution: collections are saved on the file system as JSON files.
Embeddings are matched by going through all the existing documents, so it does not scale well.
I plan to implement a better algorithm in the future.
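The matching described above amounts to a linear scan over every stored document. A minimal sketch follows, assuming cosine similarity as the score and documents shaped as { id, content, embedding }; both are assumptions, since the actual metric and file layout are not stated here.

```javascript
// Cosine similarity of two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Scores every document (O(n) per query), filters by threshold,
// sorts by descending score and truncates to maxResults.
function bruteForceQuery(documents, embedding, { maxResults = Infinity, threshold = 0.8 } = {}) {
  return documents
    .map((doc) => ({
      id: doc.id,
      content: doc.content,
      score: cosineSimilarity(doc.embedding, embedding)
    }))
    .filter((result) => result.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, maxResults)
}
```

Every query touches every document, which is why this approach stops being practical as a collection grows.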
import { loadProviders } from '@kessler/embedding'

async function main() {
  const { embedders, storage } = await loadProviders({
    fs: { directory: '/some/path/to/embedding-db' },
  })

  const { fs } = storage
  await fs.init()
  // do stuff
  await fs.shutdown()
}

main()
Uses a PostgreSQL database with the pgvector extension installed.
Run npm install pg pgvector (mind the peer dependency versions).
import { loadProviders } from '@kessler/embedding'

async function main() {
  const { embedders, storage } = await loadProviders({
    // all fields are optional; the defaults are database "embedding" on localhost, user root and no password
    pg: {
      databaseConfig: {
        database: 'embedding',
        user: 'root',
        password: 'shhhhhhhhhhh'
      }
    }
  })

  const { pg } = storage
  await pg.init()
  // do stuff
  await pg.shutdown()
}

main()
TBD
TBD