Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

localdocs: implement .docx support #2986

Merged
merged 14 commits into from
Sep 30, 2024
Merged

localdocs: implement .docx support #2986

merged 14 commits into from
Sep 30, 2024

Conversation

cebtenzzre
Copy link
Member

@cebtenzzre cebtenzzre commented Sep 24, 2024

Using DuckX to parse .docx files similar to the way we parse PDF.

Leaving as draft until we can resolve the fact that we are chunking paragraphs and not pages. The best way to fix that is to stream the document instead of grabbing large discrete chunks of it, but this is blocked on merge of #2969 because that also touches chunkStream.

edit: With the recent changes, this PR now streams the docx statefully so we don't need to worry about seeking to a particular run/paragraph every time we enter the database code, and we slice by time and not some arbitrary unit such as pages to allow the reader code to be more generic.

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
If we don't copy DocumentInfo, we can store uncopyable objects inside of
it.

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
We already use 'doc' to refer to a parsed representation of the
document, not the file info.

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@cebtenzzre cebtenzzre marked this pull request as ready for review September 30, 2024 15:39
Apple Clang still does not support __cpp_aggregate_paren_init (P0960R3).

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
gpt4all-chat/src/database.cpp Show resolved Hide resolved
if (m_chunk.length() < maxChunkSize + 1) {
word = m_reader->word();
if (m_chunk.isEmpty())
m_page = m_reader->page(); // page number of first word
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, so chunkstreamer is a class now? what is the lifetime of this object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's matched to the lifetime of Database. I thought it would make sense to separate the state from the main Database class, since they have separate responsibilities, even though they work together.

Copy link
Collaborator

@manyoso manyoso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a separate PR altogether.. sorry, was talking about first commit in this PR... the way we prefer to do PRs is just very different. Not sure why bumping the version is the first commit in this PR

@cebtenzzre
Copy link
Member Author

This should be a separate PR altogether.. sorry, was talking about first commit in this PR... the way we prefer to do PRs is just very different. Not sure why bumping the version is the first commit in this PR

That change isn't part of this PR anymore after the merge, since it was also done on main. I wanted to make sure the version number was no longer v3.3.1 and I didn't want to have to wait until a PR was approved and merged in this repo in order to start working on this change. Would you like me to rebase the PR so that commit goes away?

@manyoso
Copy link
Collaborator

manyoso commented Sep 30, 2024

No, that's okay. Thanks for the explanation.

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@cebtenzzre
Copy link
Member Author

The build passed for Mac and Linux, and on Windows after adjusting the CI timeouts (since build-and-test-gpt4all-chat is missing the timeout adjustment, even though the installer jobs have it).

@cebtenzzre cebtenzzre merged commit e190fd0 into main Sep 30, 2024
7 of 13 checks passed
@cebtenzzre cebtenzzre deleted the docx-support branch September 30, 2024 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants