-
Notifications
You must be signed in to change notification settings - Fork 7.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
localdocs: implement .docx support #2986
Conversation
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
If we don't copy DocumentInfo, we can store uncopyable objects inside of it. Signed-off-by: Jared Van Bortel <jared@nomic.ai>
We already use 'doc' to refer to a parsed representation of the document, not the file info. Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
65c97dd
to
448da04
Compare
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Apple Clang still does not support __cpp_aggregate_paren_init (P0960R3). Signed-off-by: Jared Van Bortel <jared@nomic.ai>
if (m_chunk.length() < maxChunkSize + 1) { | ||
word = m_reader->word(); | ||
if (m_chunk.isEmpty()) | ||
m_page = m_reader->page(); // page number of first word |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, so chunkstreamer is a class now? what is the lifetime of this object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's matched to the lifetime of Database
. I thought it would make sense to separate the state from the main Database class, since they have separate responsibilities, even though they work together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a separate PR altogether.. sorry, was talking about first commit in this PR... the way we prefer to do PRs is just very different. Not sure why bumping the version is the first commit in this PR
That change isn't part of this PR anymore after the merge, since it was also done on main. I wanted to make sure the version number was no longer |
No, that's okay. Thanks for the explanation. |
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
The build passed for Mac and Linux, and on Windows after adjusting the CI timeouts (since build-and-test-gpt4all-chat is missing the timeout adjustment, even though the installer jobs have it). |
Using DuckX to parse .docx files similar to the way we parse PDF.
Leaving as draft until we can resolve the fact that we are chunking paragraphs and not pages. The best way to fix that is to stream the document instead of grabbing large discrete chunks of it, but this is blocked on merge of #2969 because that also touches chunkStream.edit: With the recent changes, this PR now streams the docx statefully so we don't need to worry about seeking to a particular run/paragraph every time we enter the database code, and we slice by time and not some arbitrary unit such as pages to allow the reader code to be more generic.