🚚 Loaders: PDF and Text #16
Conversation
@andreibondarev - this is likely the primary interface that I'd expect to use within my apps. There is still the "chunking" of text that would need to happen -- but this is the loading part.
Do you like this direction or would you like to chat through this to see if it's useful or if a different API makes sense at all.
@rickychilcott I think this makes sense! Instead of instantiating the Loaders::* instance, I think we should just let the user pass their text or file directly to the vector search DBs so that you're merely calling:
qdrant = Vectorsearch::Qdrant.new()
qdrant.add_data(path: "file.pdf")
We'd need to figure out what chunking looks like. Maybe there's a chunking_delimiter: to pass in, or a chunk_size:, or a whole block or lambda so that the user controls how chunks are created.
What do you think?
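To make the block-or-lambda idea concrete, here's a rough sketch. The add_data signature, the chunk_size: default, and the fixed-size fallback are all assumptions for illustration, not the gem's actual API:

```ruby
require "tempfile"

# Hypothetical sketch only: a real add_data would live on the
# Vectorsearch classes and forward the chunks to add_texts.
def add_data(path:, chunk_size: 256, &chunker)
  text = File.read(path)
  # A caller-supplied block wins; otherwise fall back to naive
  # fixed-size character chunks.
  chunker ? chunker.call(text) : text.scan(/.{1,#{chunk_size}}/m)
end

# Usage: the caller fully controls chunking via a block.
file = Tempfile.new("notes")
file.write("alpha\n\nbeta\n\ngamma")
file.flush
chunks = add_data(path: file.path) { |text| text.split("\n\n") }
```

The block form keeps the vector DB class ignorant of chunking policy entirely, which seems in the spirit of "the user controls how chunks are created."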
Cool. I think I'm following. You have to remember, I don't know what I'm doing here.
I sketched out a little bit more in line with what you are talking about. I'm adding functionality to Vectorsearch::Base which allows for adding loaders and data (which will trigger an add_texts). I'm imagining you might want to change the chunking and loader methodology in a myriad of ways, so see what you think about this path I'm headed down.
The idea of chunking is to break up the content (to fit within the max token size) - is that right?
More specifically, this is how I see it being used now:
qdrant = Vectorsearch::Qdrant.new
qdrant.add_loader(Loaders::PDF)
qdrant.add_data(path: "file.pdf")
# the above is equivalent to manually loading the pdf and
# calling qdrant.add_texts(texts: "...contents of pdf...")

In the future, I imagine being able to:

qdrant.add_loader(Loaders::PDF, chunked_via: Chunkers::Delimited.new(delimiter: ";", max_tokens: 1_000))

or

pdf_chunker = Chunkers::Delimited.new(delimiter: ";", max_tokens: 1_000)
pdf_loader = Loaders::PDF.new(chunker: pdf_chunker)
qdrant.add_loader(pdf_loader)

Sane defaults would be included, but this may allow for greater flexibility and configuration.
@rickychilcott

> I'm imagining you might want to change the chunking and loader methodology in a myriad of ways, so see what you think about this path I'm headed down.
Agreed! Although we should be careful not to give the developer too much optionality right out of the gate. I think making the framework super flexible right away might lead to additional (unnecessary) complexity.
> The idea of chunking is to break up the content (to fit within the max token size) - is that right?
Yep!
The couple of use-cases that I see for delimiters: splitting on token length/character length, on a character of choice, on paragraphs (this gem already handles it: https://github.com/ruby-docx/docx#reading), on pages (page 1, page 2, etc.), or you can pass in your own function, something along the lines of:
delimiter_function: -> (line) { line.start_with? "Section:" }
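For the function-based case, here's a sketch of how such a predicate could drive chunking. The class and method names are invented for illustration; only the delimiter_function lambda above comes from the discussion:

```ruby
# Illustrative only: starts a new chunk on every line where the
# user-supplied predicate returns true.
class PredicateChunker
  def initialize(delimiter_function:)
    @delimiter_function = delimiter_function
  end

  # slice_when breaks the line stream wherever the predicate matches
  # the upcoming line, so matching lines begin their own chunk.
  def chunks(text)
    text.each_line
        .slice_when { |_prev, line| @delimiter_function.call(line) }
        .map(&:join)
  end
end

chunker = PredicateChunker.new(
  delimiter_function: ->(line) { line.start_with?("Section:") }
)
```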
attr_reader :root
end

@default_loaders ||= []
What do you think about loading all of the default Loaders right out of the gate?
I was thinking we should indeed include a standard set of default loaders. Maybe not ALL of them, but a reasonable set of common ones.
Yeah, at least the ones for the most common file extensions.
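One way the "most common file extensions" idea could be wired up is an extension-keyed registry. This is a hypothetical sketch, not the gem's implementation; the module name and registration API are invented:

```ruby
require "tempfile"

# Hypothetical registry mapping file extensions to default loaders.
module DefaultLoaders
  REGISTRY = {}

  def self.register(ext, &loader)
    REGISTRY[ext] = loader
  end

  # Pick a loader by extension, failing loudly for unknown types.
  def self.load(path)
    loader = REGISTRY.fetch(File.extname(path)) do
      raise ArgumentError, "no default loader for #{path}"
    end
    loader.call(path)
  end
end

# Plain text is the simplest "common extension" default.
DefaultLoaders.register(".txt") { |path| File.read(path) }

file = Tempfile.new(["hello", ".txt"])
file.write("hello loaders")
file.flush
content = DefaultLoaders.load(file.path)
```

Failing loudly for unregistered extensions (rather than silently skipping) seems safer as a default while the loader set is still small.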
I did a test run with Qdrant: it worked! But I think we should really flesh out this end-to-end flow... in follow-up PRs.
This is a WIP PR that attempts to provide an interface for loading text-based content from files.