localdocs: implement .docx support #2986

cebtenzzre · 2024-09-24T21:29:01Z

Uses DuckX to parse .docx files, similar to the way we parse PDFs.

Leaving this as a draft until we resolve the fact that we are chunking paragraphs and not pages. The best fix is to stream the document instead of grabbing large discrete chunks of it, but that is blocked on the merge of #2969, which also touches chunkStream.

Edit: With the recent changes, this PR now streams the docx statefully, so we don't need to seek to a particular run/paragraph every time we enter the database code. We also slice by time rather than by some arbitrary unit such as pages, which allows the reader code to be more generic.
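
To illustrate what "statefully" could mean here, below is a minimal sketch in plain standard C++. It is not the PR's code and does not use DuckX: `StatefulWordReader` and its paragraph list are hypothetical stand-ins for a parsed document. The point is only that the cursor lives in the object, so each call resumes where the previous one stopped.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed .docx: a list of paragraph strings.
// Nothing here is DuckX's API.
class StatefulWordReader {
public:
    explicit StatefulWordReader(std::vector<std::string> paragraphs)
        : m_paragraphs(std::move(paragraphs)) {}

    // Returns the next whitespace-delimited word, or std::nullopt at the end
    // of the document. The cursor (paragraph index + offset) persists between
    // calls, so the caller never has to seek back to where it left off.
    std::optional<std::string> word() {
        while (m_para < m_paragraphs.size()) {
            const std::string &p = m_paragraphs[m_para];
            while (m_pos < p.size() && isSpace(p[m_pos]))
                ++m_pos; // skip leading whitespace
            if (m_pos < p.size()) {
                const std::size_t start = m_pos;
                while (m_pos < p.size() && !isSpace(p[m_pos]))
                    ++m_pos;
                return p.substr(start, m_pos - start);
            }
            // Paragraph exhausted: advance to the next one.
            ++m_para;
            m_pos = 0;
        }
        return std::nullopt;
    }

private:
    static bool isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    std::vector<std::string> m_paragraphs;
    std::size_t m_para = 0; // index of the current paragraph
    std::size_t m_pos = 0;  // offset within the current paragraph
};
```

A caller can pull a few words, return to other work, and later call `word()` again on the same object without re-parsing or seeking.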

manyoso · 2024-09-30T17:30:03Z

gpt4all-chat/src/database.cpp

+        if (m_chunk.length() < maxChunkSize + 1) {
+            word = m_reader->word();
+            if (m_chunk.isEmpty())
+                m_page = m_reader->page(); // page number of first word


okay, so chunkstreamer is a class now? what is the lifetime of this object?

cebtenzzre

It's matched to the lifetime of Database. I thought it would make sense to separate the state from the main Database class, since they have separate responsibilities, even though they work together.
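
A rough sketch of the pattern under discussion, assuming a reader that exposes `word()` and `page()` as in the diff above. Everything else (the `Reader` interface, `VectorReader`, `Chunk`, `nextChunk()`, and the exact size rule) is hypothetical and written in standard C++ rather than Qt. It shows chunk state kept in its own object, with the page number taken from the first word of each chunk.

```cpp
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reader interface. The names word() and page() appear in the
// diff; the signatures are assumptions.
struct Reader {
    virtual ~Reader() = default;
    virtual std::optional<std::string> word() = 0; // nullopt at end of input
    virtual int page() const = 0;                  // page of the last word read
};

// Toy reader over a fixed list of (word, page) pairs.
class VectorReader : public Reader {
public:
    explicit VectorReader(std::vector<std::pair<std::string, int>> words)
        : m_words(std::move(words)) {}
    std::optional<std::string> word() override {
        if (m_next >= m_words.size())
            return std::nullopt;
        m_page = m_words[m_next].second;
        return m_words[m_next++].first;
    }
    int page() const override { return m_page; }

private:
    std::vector<std::pair<std::string, int>> m_words;
    std::size_t m_next = 0;
    int m_page = -1;
};

struct Chunk {
    std::string text;
    int page;
};

// Keeps its position between calls, so a caller can take a few chunks now
// and resume later without re-seeking.
class ChunkStreamer {
public:
    ChunkStreamer(std::unique_ptr<Reader> reader, std::size_t maxChunkSize)
        : m_reader(std::move(reader)), m_maxChunkSize(maxChunkSize) {}

    std::optional<Chunk> nextChunk() {
        Chunk chunk{std::string(), -1};
        for (;;) {
            if (!m_pending) {
                m_pending = m_reader->word();
                m_pendingPage = m_reader->page();
            }
            if (!m_pending)
                break; // end of input
            const std::size_t needed =
                m_pending->size() + (chunk.text.empty() ? 0 : 1);
            if (!chunk.text.empty() && chunk.text.size() + needed > m_maxChunkSize)
                break; // word does not fit; keep it pending for the next chunk
            if (chunk.text.empty())
                chunk.page = m_pendingPage; // page number of first word
            else
                chunk.text += ' ';
            chunk.text += *m_pending;
            m_pending.reset();
        }
        if (chunk.text.empty())
            return std::nullopt;
        return chunk;
    }

private:
    std::unique_ptr<Reader> m_reader;
    std::size_t m_maxChunkSize;
    std::optional<std::string> m_pending; // word read but not yet emitted
    int m_pendingPage = -1;
};
```

In this sketch a word longer than `maxChunkSize` is still emitted as its own chunk, so every call either makes progress or reports the end of input. Holding a `ChunkStreamer` as a member of a longer-lived owner is one way to get the "matched lifetime" described in the reply.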

manyoso

This should be a separate PR altogether... Sorry, I was talking about the first commit in this PR. The way we prefer to do PRs is just very different. I'm not sure why bumping the version is the first commit in this PR.

cebtenzzre · 2024-09-30T17:47:51Z

> This should be a separate PR altogether... Sorry, I was talking about the first commit in this PR. The way we prefer to do PRs is just very different. I'm not sure why bumping the version is the first commit in this PR.

That change isn't part of this PR anymore after the merge, since it was also done on main. I wanted to make sure the version number was no longer v3.3.1 and I didn't want to have to wait until a PR was approved and merged in this repo in order to start working on this change. Would you like me to rebase the PR so that commit goes away?

manyoso · 2024-09-30T17:57:34Z

No, that's okay. Thanks for the explanation.

cebtenzzre · 2024-09-30T21:36:14Z

The build passed for Mac and Linux, and on Windows after adjusting the CI timeouts (since build-and-test-gpt4all-chat is missing the timeout adjustment, even though the installer jobs have it).

cebtenzzre added 8 commits September 23, 2024 17:35

chat: bump version

aa26970

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

database: eliminate the need to copy DocumentInfo

adef7aa

If we don't copy DocumentInfo, we can store uncopyable objects inside of it. Signed-off-by: Jared Van Bortel <jared@nomic.ai>
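
As a hedged illustration of why avoiding copies matters, the sketch below uses a hypothetical `DocumentInfo` with a `std::unique_ptr` member. The real struct's fields are not shown in this thread, so the members and helper functions here are assumptions. Once a type holds a move-only member, it can only be passed by reference or moved.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical parser state that should have exactly one owner.
struct ParsedDocument {
    std::string text;
};

// Hypothetical stand-in for DocumentInfo. The unique_ptr member makes the
// whole struct move-only.
struct DocumentInfo {
    std::string path;
    std::unique_ptr<ParsedDocument> parsed;
};

static_assert(!std::is_copy_constructible_v<DocumentInfo>);
static_assert(std::is_move_constructible_v<DocumentInfo>);

// Reading only needs a const reference, so no copy is required.
std::size_t textLength(const DocumentInfo &info) {
    return info.parsed ? info.parsed->text.size() : 0;
}

// Ownership is transferred into the queue by moving.
void enqueue(std::vector<DocumentInfo> &queue, DocumentInfo &&info) {
    queue.push_back(std::move(info));
}
```

Any code path that previously copied the struct would fail to compile against this version, which is why such call sites have to be converted to references or moves first.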

database: rename DocumentInfo 'doc' to 'file'

82c1368

We already use 'doc' to refer to a parsed representation of the document, not the file info. Signed-off-by: Jared Van Bortel <jared@nomic.ai>

chat(build): create deps/CMakeLists.txt

7bbdb69

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

Merge branch 'main' into docx-support

4e56330

initial docx support

63996a6

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

stream docx files since we don't really have pages

0c3b62c

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

database: resolve TODOs

448da04

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

cebtenzzre force-pushed the docx-support branch from 65c97dd to 448da04 Compare September 30, 2024 03:49

cebtenzzre added 4 commits September 30, 2024 11:28

fix #includes with IWYU

84bf687

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

changelog: add this PR

e33a528

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

mysettings: add .docx to default list of file extensions

d7f4848

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

fix more #includes

ed79495

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

cebtenzzre marked this pull request as ready for review September 30, 2024 15:39

database: fix macOS build

fd4a779

Apple Clang still does not support __cpp_aggregate_paren_init (P0960R3). Signed-off-by: Jared Van Bortel <jared@nomic.ai>
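
The commit message does not say which construct broke the build. A common case is `std::make_unique` on an aggregate, because it initializes the object with parentheses. The sketch below, with a hypothetical `Chunk` struct and `makeChunk` helper, shows one portable way to handle that using the standard feature-test macro.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical aggregate: no user-declared constructors.
struct Chunk {
    std::string text;
    int page;
};

std::unique_ptr<Chunk> makeChunk(std::string text, int page) {
#if defined(__cpp_aggregate_paren_init) && __cpp_aggregate_paren_init >= 201902L
    // P0960R3 available: aggregates can be initialized with parentheses, so
    // make_unique (which evaluates new T(args...)) works directly.
    return std::make_unique<Chunk>(std::move(text), page);
#else
    // Fallback for compilers without P0960R3: brace-initialize explicitly.
    return std::unique_ptr<Chunk>(new Chunk{std::move(text), page});
#endif
}
```

The fallback branch compiles everywhere, so a simpler alternative is to use brace initialization unconditionally or to give the struct an explicit constructor.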

cebtenzzre requested a review from manyoso September 30, 2024 16:53

manyoso reviewed Sep 30, 2024

manyoso approved these changes Sep 30, 2024

usearch: update submodule to fix GDI conflict

ef2b109

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

cebtenzzre merged commit e190fd0 into main Sep 30, 2024
7 of 13 checks passed

cebtenzzre deleted the docx-support branch September 30, 2024 22:48

This was referenced Oct 11, 2024

localdocs: fix regressions caused by docx change #3079

Merged

localdocs: avoid cases where batch can make no progress #3094

Merged
