This repository has been archived by the owner on Apr 3, 2024. It is now read-only.

Customized Embedding Hub - Examples, Datasets, Pre-Trained Matrices #18

Open
Glavin001 opened this issue Feb 23, 2023 · 0 comments

Problem

The default embeddings (e.g. Ada-002 from OpenAI) are great generalists. However, they are not tailored to your specific use case.

Proposed Solution

🎉 Customizing Embeddings!

ℹ️ See my tutorial / lessons learned if you're interested in learning more, step-by-step, with screenshots and tips.

🎯 For LangChain Hub specifically, this would mean providing a collection of pre-trained custom embeddings.

Similar to https://huggingface.co/models except focused on semantic embeddings.
The Hub would list the known tasks so developers can search the available custom embeddings for each.

The Hub provides a set of tasks, each with:

  • Modality (e.g. text, image, etc)
  • Embedding engine to use and number of dimensions (e.g. text => ada-002 with 1536 dimensions, image => CLIP...)
  • Expected prompt formats for documents and/or queries (i.e. what the data should look like before being sent to the embedding model)
    • e.g. Documents should look like X. Short-form queries look like Y. Topic or objective is Z.
  • Pre-made datasets for training your own
    • Data preparation scripts
  • Pre-trained Matrices
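
To make the list above concrete, here is a rough sketch of what one task entry could look like as a record, plus a search by modality. Everything in it is an assumption for illustration: the `EmbeddingTask` class, its field names, the `find_tasks` helper, and the two sample tasks are hypothetical and not part of LangChain or any existing Hub.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EmbeddingTask:
    """Hypothetical Hub entry describing one custom-embedding task."""
    name: str
    modality: str          # e.g. "text" or "image"
    engine: str            # base embedding model the matrix was trained against
    dimensions: int        # size of the base embedding vector
    document_format: str   # template applied to documents before embedding
    query_format: str      # template applied to queries before embedding
    dataset_urls: List[str] = field(default_factory=list)  # pre-made datasets
    matrix_url: Optional[str] = None  # location of a pre-trained matrix, if any


def find_tasks(tasks: List[EmbeddingTask], modality: str) -> List[EmbeddingTask]:
    """Return the tasks registered for one modality."""
    return [task for task in tasks if task.modality == modality]


# Made-up sample entries; names, templates and the CLIP size are illustrative.
tasks = [
    EmbeddingTask(
        name="faq-retrieval",
        modality="text",
        engine="text-embedding-ada-002",
        dimensions=1536,
        document_format="Answer: {text}",
        query_format="Question: {text}",
    ),
    EmbeddingTask(
        name="product-photos",
        modality="image",
        engine="CLIP",
        dimensions=512,  # assumed; depends on the CLIP variant
        document_format="{image}",
        query_format="{text}",
    ),
]

text_tasks = find_tasks(tasks, "text")
formatted_query = text_tasks[0].query_format.format(text="How do I reset my password?")
```

Keeping the prompt formats on the task record means a consumer can format documents and queries the same way the matrix's training data was formatted.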

Leverage LangChain's helpers to train and use the custom embedding matrix.
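
As a minimal sketch of the "use" half only (training is not shown), the wrapper below multiplies each base embedding by a custom matrix and re-normalizes it, so a dot product between results is a cosine similarity. It exposes `embed_documents` and `embed_query`, the two method names used by LangChain's embeddings interface, but the `MatrixCustomizedEmbeddings` class, the `toy_embed` stand-in, and the tiny 3x2 matrix are all assumptions for illustration; a real setup would call an embedding API and load a trained matrix of the matching size.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]  # shape: (base_dim, custom_dim)


def project(vector: Vector, matrix: Matrix) -> Vector:
    """Multiply a row vector by the matrix, then scale to unit length."""
    columns = len(matrix[0])
    out = [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(columns)
    ]
    norm = math.sqrt(sum(x * x for x in out))
    return [x / norm for x in out] if norm else out


class MatrixCustomizedEmbeddings:
    """Hypothetical wrapper: base embedder followed by a custom matrix."""

    def __init__(self, base_embed: Callable[[List[str]], List[Vector]], matrix: Matrix):
        self.base_embed = base_embed
        self.matrix = matrix

    def embed_documents(self, texts: List[str]) -> List[Vector]:
        return [project(v, self.matrix) for v in self.base_embed(texts)]

    def embed_query(self, text: str) -> Vector:
        return self.embed_documents([text])[0]


def toy_embed(texts: List[str]) -> List[Vector]:
    """Stand-in for a real embedding API: counts of 'a', 'b' and 'c'."""
    return [[float(t.count(ch)) for ch in "abc"] for t in texts]


# Illustrative matrix that keeps the first two dimensions and drops the third,
# as if the third dimension were irrelevant to the task.
matrix = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

embedder = MatrixCustomizedEmbeddings(toy_embed, matrix)
query = embedder.embed_query("aab")
docs = embedder.embed_documents(["aabccc", "bbc"])

# Vectors are unit length, so the dot product is the cosine similarity.
scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
```

With this toy matrix the first document scores highest, because the 'c' counts that made it differ from the query are projected away.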
