🚚 Loaders: PDF and Text #16

Merged
andreibondarev merged 8 commits into patterns-ai-core:main from mission-met:loaders/pdf
May 19, 2023

@rickychilcott
Contributor

This is a WIP PR that attempts to provide an interface for loading text-based content from files.

Contributor Author

@andreibondarev - this is likely the primary interface that I'd expect to use within my apps. There is still the "chunking" of text that would need to happen -- but this is the loading part.

Do you like this direction, or would you like to chat it through to see whether it's useful or whether a different API makes more sense?

Collaborator

@andreibondarev andreibondarev May 17, 2023

@rickychilcott I think this makes sense! Instead of instantiating the Loaders::* instance, I think we should just let the user pass their text or file directly to the vector search DBs so that you're merely calling:

qdrant = Vectorsearch::Qdrant.new
qdrant.add_data(path: "file.pdf")

We'd need to figure out what chunking looks like. Maybe there's a chunking_delimiter: or chunk_size: to pass in, or a whole block or lambda, so that the user controls how chunks are created.

What do you think?
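
To make the three options above concrete, here is a minimal sketch of what such a chunking helper could look like. The `chunk_text` method is hypothetical and not part of the gem; only the `chunking_delimiter:` and `chunk_size:` option names come from the comment above, and the precedence between them is an assumption.

```ruby
# Hypothetical helper (not part of the gem): split text into chunks.
# Assumed precedence: block first, then delimiter, then fixed size.
def chunk_text(text, chunk_size: nil, chunking_delimiter: nil, &block)
  # A caller-supplied block has full control over how chunks are created.
  return block.call(text) if block

  # Split on a delimiter, dropping surrounding whitespace and empty pieces.
  if chunking_delimiter
    return text.split(chunking_delimiter).map(&:strip).reject(&:empty?)
  end

  # Fixed-size chunks measured in characters (not tokens).
  return text.scan(/.{1,#{chunk_size}}/m) if chunk_size

  [text] # no option given: keep the text as a single chunk
end

by_delimiter = chunk_text("a;b;c", chunking_delimiter: ";")
by_size      = chunk_text("abcdefgh", chunk_size: 3)
by_block     = chunk_text("one\n\ntwo") { |t| t.split("\n\n") }
```

Measuring `chunk_size` in characters keeps the sketch dependency-free; counting real tokens would need a tokenizer for the target model.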

Contributor Author

Cool. I think I'm following. You have to remember, I don't know what I'm doing here.

I sketched out a little bit more in line with what you are talking about. I'm adding functionality to Vectorsearch::Base which allows for adding loaders and data (which will trigger an add_texts). I'm imagining you might want to change the chunking and loader methodology in a myriad of ways, so see what you think about this path I'm headed down.

The idea of chunking is to break up the content (to fit within the max token size) - is that right?

Contributor Author

More specifically, this is how I see it being used now:

qdrant = Vectorsearch::Qdrant.new
qdrant.add_loader(Loaders::PDF)
qdrant.add_data(path: "file.pdf")

# the above is equivalent to manually loading the pdf and 
# calling qdrant.add_texts(texts: "...contents of pdf...")

In the future, I imagine being able to:

qdrant.add_loader(Loaders::PDF, chunked_via: Chunkers::Delimited.new(delimiter: ";", max_tokens: 1_000))

or

pdf_chunker = Chunkers::Delimited.new(delimiter: ";", max_tokens: 1_000)
pdf_loader = Loaders::PDF.new(chunker: pdf_chunker)
qdrant.add_loader(pdf_loader)

Sane defaults would be included, but this may allow for greater flexibility and configuration.
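
For readers following along, the add_loader / add_data flow described above could be sketched roughly as below. This is not the code in this PR: `FakeVectorsearch`, the `Loaders::Text` stand-in, and the `loadable?` / `load` methods are assumptions made for illustration; only the `add_loader`, `add_data(path:)` and `add_texts(texts:)` names come from the comments.

```ruby
require "tempfile"

# Hypothetical stand-in loader: wraps a path and knows whether it can read it.
module Loaders
  class Text
    def initialize(path)
      @path = path
    end

    # Assumed convention: a loader reports whether it handles the file.
    def loadable?
      File.extname(@path) == ".txt"
    end

    def load
      File.read(@path)
    end
  end
end

# Hypothetical stand-in for a vector search class; it only records texts.
class FakeVectorsearch
  attr_reader :texts

  def initialize
    @loaders = []
    @texts = []
  end

  def add_loader(*loaders)
    @loaders.concat(loaders)
  end

  def add_texts(texts:)
    @texts.concat(Array(texts))
  end

  # Pick the first registered loader that accepts the path, then add its text.
  def add_data(path:)
    loader = @loaders.map { |klass| klass.new(path) }.find(&:loadable?)
    raise ArgumentError, "No loader registered for #{path}" unless loader

    add_texts(texts: loader.load)
  end
end

db = FakeVectorsearch.new
db.add_loader(Loaders::Text)
Tempfile.create(["doc", ".txt"]) do |file|
  file.write("hello loaders")
  file.flush
  db.add_data(path: file.path)
end
```

Keeping the loader lookup inside `add_data` means callers never touch loader instances directly, which matches the usage shown above.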

Collaborator

@andreibondarev andreibondarev May 18, 2023

@rickychilcott "I'm imagining you might want to change the chunking and loader methodology in a myriad of ways, so see what you think about this path I'm headed down."

Agreed! Although we should be careful not to give the developer too much optionality right out of the gate. I think making the framework super flexible right away might lead to additional (unnecessary) complexity.

"The idea of chunking is to break up the content (to fit within the max token size) - is that right?"

Yep!

Collaborator

@andreibondarev andreibondarev May 18, 2023

The use cases I see for delimiters are splitting on token or character length, on a character of choice, on paragraphs (this gem already handles it: https://github.com/ruby-docx/docx#reading), or on pages (page 1, page 2, etc.). Alternatively, you can pass in your own function, something along the lines of:

delimiter_function: -> (line) { line.start_with? "Section:" }
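
As a rough illustration of how such a delimiter function might be applied, the sketch below starts a new chunk at every line for which the lambda returns true. The `chunk_lines` helper is hypothetical; it relies only on Ruby's standard `Enumerable#slice_before`.

```ruby
# Hypothetical helper: start a new chunk at each line the delimiter
# function marks as a boundary; the boundary line opens the new chunk.
def chunk_lines(text, delimiter_function:)
  text.lines
      .slice_before { |line| delimiter_function.call(line) }
      .map(&:join)
end

text = "Section: A\nalpha\nSection: B\nbeta\n"
chunks = chunk_lines(
  text,
  delimiter_function: ->(line) { line.start_with?("Section:") }
)
```

With this input each "Section:" heading stays attached to the lines that follow it, so every chunk carries its own heading.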

attr_reader :root
end

@default_loaders ||= []
Collaborator

What do you think about loading all of the default Loaders right out of the gate?

Contributor Author

I was thinking we should indeed include a standard set of default loaders. Maybe not ALL of them, but a reasonable set of common ones.

Collaborator

Yeah, at least the ones for the most common file extensions.
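
One possible shape for a default set keyed by file extension is sketched below. The `DEFAULT_LOADERS` table and `loader_for` helper are hypothetical, and symbols stand in for the real loader classes so the snippet stays self-contained.

```ruby
# Hypothetical default registry: file extension => loader.
# Symbols stand in for loader classes such as a PDF or text loader.
DEFAULT_LOADERS = {
  ".pdf" => :pdf,
  ".txt" => :text
}.freeze

# Look up the default loader for a path, ignoring extension case.
def loader_for(path)
  DEFAULT_LOADERS.fetch(File.extname(path).downcase) do |ext|
    raise ArgumentError, "No default loader for #{ext.inspect}"
  end
end

pdf_loader  = loader_for("Exhibit-A.PDF")
text_loader = loader_for("notes.txt")
```

An explicit table keeps the set of defaults small and easy to audit, in line with the caution above about not shipping every loader at once.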

@andreibondarev
Collaborator

I did a test run with Qdrant:

qdrant.create_default_schema
qdrant.add_loader Loaders::PDF

path = File.join "/Users/andrei/Downloads/Exhibit-A.pdf"
qdrant.add_data path: path

qdrant.similarity_search(query:)

It worked! But I think we should really flesh out this end-to-end flow in follow-up PRs.

@andreibondarev andreibondarev self-requested a review May 19, 2023 17:25
@andreibondarev andreibondarev marked this pull request as ready for review May 19, 2023 17:33
@andreibondarev andreibondarev changed the title Loaders: PDF and Text 🚚 Loaders: PDF and Text May 19, 2023
@andreibondarev andreibondarev merged commit 73eb295 into patterns-ai-core:main May 19, 2023
@andreibondarev andreibondarev linked an issue May 19, 2023 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

Add an example how to use document loaders

2 participants