errors with org files containing certain types of structures #83

emacsomancer · 2022-09-09T15:57:42Z

With a khoj.yml file containing:

content-type:
  org:
    compressed-jsonl: ~/.khoj/content/org/org.jsonl.gz
    embeddings-file: ~/.khoj/content/org/org_embeddings.pt
    input-files: null
    input-filter: "/home/slade/Documents/Org/*.org"
processor: {}
search-type:
  asymmetric:
    cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
    encoder: sentence-transformers/multi-qa-MiniLM-L6-cos-v1
    model_directory: ~/.khoj/search/asymmetric/
  image:
    encoder: sentence-transformers/clip-ViT-B-32
    model_directory: ~/.khoj/search/image/
  symmetric:
    cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
    encoder: sentence-transformers/all-MiniLM-L6-v2
    model_directory: ~/.khoj/search/symmetric/

running khoj --regenerate produces an error:

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 170, in makelist
    thisNode = Orgnode(level, heading, bodytext, tags)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 217, in __init__
    self.level = len(level)
TypeError: object of type 'int' has no len()


debanjum · 2022-09-09T19:58:43Z

It doesn't seem to be an issue with input-filter. It looks more like an issue with the OrgNode parser parsing (some of?) your org file(s). The level argument is expected to be a string of *s seen at the start of a heading, not an int.

Are you seeing this issue even when you only set the input-files field in the khoj.yml?

Next Steps:

- Let me try to reproduce the issue on my end to see what's causing it.
- If you can share a test file or snippet that could be causing the failure, it'll speed up the fix for this issue.
- Until then, you can try to bypass the issue by identifying and excluding the file(s) that may be causing the parsing error.

emacsomancer · 2022-09-09T21:59:39Z

But what about numbered lists and -, + type headings?

(Additional: on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?)

emacsomancer · 2022-09-09T23:14:58Z

Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.

One, a file with *-headings, but only *-headers (nothing "underneath" any of the headings):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 181, in setup
    corpus_embeddings = compute_embeddings(entries, bi_encoder, config.embeddings_file, regenerate=regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 66, in compute_embeddings
    corpus_embeddings = bi_encoder.encode([entry['compiled'] for entry in entries], convert_to_tensor=True, device=state.device, show_progress_bar=True)
  File "/home/slade/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
    all_embeddings = torch.stack(all_embeddings)
RuntimeError: stack expects a non-empty TensorList

Two, a file with no *-headings at all (but which is still a valid .org file):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 170, in makelist
    thisNode = Orgnode(level, heading, bodytext, tags)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 217, in __init__
    self.level = len(level)
TypeError: object of type 'int' has no len()

I can provide specific examples of such files if necessary.

yantar92 · 2022-09-10T03:18:09Z

Benjamin Slade ***@***.***> writes:

> self.level = len(level)
> TypeError: object of type 'int' has no len()

For reference, I am also seeing this error on my Org files.

Note that Orgnode is very simplistic. The most accurate and fast Org parser in the wild that I know of is https://github.com/tecosaur/Org.jl

P.S. Your project is very promising :)

…

-- Ihor Radchenko, Org mode contributor, Learn more about Org mode at https://orgmode.org/. Support Org development at https://liberapay.com/org-mode, or support my work at https://liberapay.com/yantar92

debanjum · 2022-09-10T06:20:19Z

> Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.

Ah, thanks for the additional details. I can reproduce both issues now.
To summarize, khoj is failing to handle two types of files:

1. Files with only entries with no body
2. Files with no entries

Where an entry in org terminology is anything that starts with a heading, and a heading is anything that starts with *s.

Context:

1. khoj indexes and shows results at a per-entry level (similar to org-agenda search, org-rifle etc.)
2. So khoj shouldn't fail when it sees files with no entries, but it'll still ignore such files going forward
3. It only indexes entries with body text (i.e. entries that have something underneath the heading)
   This was done for quality of results reasons. Do you feel a need for khoj to index entries with only headings, or is it fine to ignore such entries? If indexing heading-only entries is needed, we can index them, but I'll add a filter to ignore entries with no body before we do.

As an immediate mitigation, I'll make khoj safely ignore the two cases instead of failing. Later, if needed, we can add more thought out solutions. Does that sound reasonable?

yantar92 · 2022-09-10T06:29:41Z

Debanjum ***@***.***> writes:

> > Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.
>
> Ah, thanks for the additional details. I can reproduce both issues now.
> To summarize `khoj` is failing to handle two type of files:
> 1. Files with only entries with no body
> 2. Files with no entries
>
> *Where an `entry` in `org` terminology is anything that starts with a `heading` and `heading` is anything that starts with `*`s*

Clarification: heading must start at bol and must have a space after "*": "^\\*+ ".
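The rule stated above can be checked with a few lines of Python. This is only a sketch of the stated regular expression, not the code khoj or Orgnode actually uses; `HEADING_RE` and `parse_heading` are hypothetical names chosen for illustration. It also shows why numbered lists and `-`/`+` list items are not headings.

```python
import re

# One or more '*' at the beginning of the line, followed by a space.
HEADING_RE = re.compile(r"^(\*+) (.*)$")


def parse_heading(line):
    """Return (level, title) for an Org heading line, or None for any other line."""
    match = HEADING_RE.match(line)
    if match is None:
        return None
    stars, title = match.groups()
    return len(stars), title.strip()


for sample in ["** TODO Write report", "*bold* text", " * indented item", "- list item", "1. numbered item"]:
    print(repr(sample), "->", parse_heading(sample))
```

Only the first sample is reported as a heading; the others fail either the beginning-of-line requirement or the space-after-stars requirement.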

> 3. It only indexes entries with body text (i.e has something underneath the heading)
> This was done for quality of results reasons. Do you feel a need for `khoj` to index entries with only headings or is it fine to ignore such entries? If index heading only entries is needed, we can index them but I'll add a filter to ignore entries with no body filter before we do.

When Org is used for bookmark management, empty bodies are not uncommon. Also, do you consider property drawer as a part of body?

> As an immediate mitigation, I'll make `khoj` safely ignore the two cases instead of failing. Later, if needed, we can add more thought out solutions. Does that sound reasonable?

In the context of org-roam, files without headings may not be uncommon - the "entry" is then defined by #+TITLE keyword or something similar. Of course, not failing on Org files without headings is a good starting point.


debanjum · 2022-09-10T06:32:29Z

> on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?

Khoj doesn't currently support multiple input-filters, but I've created an issue to track adding that. There's no way to exclude files from a filter for now either, but maybe I'll resolve that when multiple input-filter support is added.
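Until such support exists, one possible workaround is to expand the globs outside khoj and list the resulting paths under `input-files`. The sketch below is a standalone helper, not part of khoj; `collect_org_files` is a hypothetical name, and it assumes `input-files` accepts a list of file paths.

```python
from fnmatch import fnmatch
from glob import glob
from os.path import expanduser


def collect_org_files(include_patterns, exclude_patterns=()):
    """Expand several glob patterns and drop paths matching any exclude pattern."""
    files = set()
    for pattern in include_patterns:
        files.update(glob(expanduser(pattern)))
    excludes = [expanduser(pattern) for pattern in exclude_patterns]
    return sorted(f for f in files if not any(fnmatch(f, ex) for ex in excludes))


if __name__ == "__main__":
    # Example patterns only; adjust to your own directories.
    for path in collect_org_files(["~/Documents/Org/*.org", "~/Notes/*.org"],
                                  ["~/Documents/Org/archive*.org"]):
        print(f"- {path}")
```

The printed lines can be pasted as a YAML list. Note that `fnmatch` lets `*` match path separators too, so an exclude pattern may cover more than one directory level.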

debanjum · 2022-09-10T06:42:36Z

> Clarification: heading must start at bol and must have a space after "*": "^\\*+ ".

Agreed. That's how it is handled in code. I was just trying to keep my definition less verbose :)

> Also, do you consider property drawer as a part of body?

Property drawers are not considered part of body for the purposes of indexing in khoj.

> When Org is used for bookmark management, empty bodies are not uncommon.
> In the context of org-roam, files without headings may not be uncommon -
> the "entry" is then defined by #+TITLE keyword or something similar.

I see. Can you clarify the bookmark management scenario a bit more? Seems like there is a use-case for handling headings with empty bodies. So I can add an issue to track that change.

yantar92 · 2022-09-10T06:47:14Z

Debanjum ***@***.***> writes:

> I see. Can you clarify the bookmark management scenario a bit more? Seems like there is a use-case for handling headings with empty bodies. So I can add an issue to track that change.

See examples in https://github.com/yantar92/org-capture-ref

Bookmarks to this repo looks like

    ***** SOMEDAY [#A] debanjum [Github] debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos :BOOKMARK:FLAGGED:misc:SOMEDAY:
    :PROPERTIES:
    :TITLE:    debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos
    :BTYPE:    misc
    :ID:       Github-debanjum-debanjum-khoj-natural-e4a
    :AUTHOR:   debanjum
    :CREATED:  [2022-09-07 Wed 21:51]
    :HOWPUBLISHED: Github
    :NOTE:     Online; accessed 07 September 2022
    :RSS:      https://github.com/debanjum/khoj/commits.atom
    :URL:      https://github.com/debanjum/khoj
    :END:
    :LOGBOOK:
    - Refiled on [2022-09-07 Wed 22:47]
    :END:


debanjum · 2022-09-10T06:53:11Z

> Note that Orgnode is very simplistic. The most accurate and fast Org
> parser in the wild that I know of is https://github.com/tecosaur/Org.jl

Yeah, OrgNode is very basic. I've modified it for khoj to handle more scenarios but it's pretty ad-hoc. Org.jl looks interesting. I'm also tracking Org-Parser as they're being more methodical about parsing org syntax

debanjum · 2022-09-10T06:59:21Z

> Bookmarks to this repo looks like

    ***** SOMEDAY [#A] debanjum [Github] debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos :BOOKMARK:FLAGGED:misc:SOMEDAY:
    :PROPERTIES:
    :TITLE:    debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos
    :BTYPE:    misc
    :ID:       Github-debanjum-debanjum-khoj-natural-e4a
    :AUTHOR:   debanjum
    :CREATED:  [2022-09-07 Wed 21:51]
    :HOWPUBLISHED: Github
    :NOTE:     Online; accessed 07 September 2022
    :RSS:      https://github.com/debanjum/khoj/commits.atom
    :URL:      https://github.com/debanjum/khoj
    :END:
    :LOGBOOK:
    - Refiled on [2022-09-07 Wed 22:47]
    :END:

I see. This one will happen to get indexed, since entries with logbook drawer notes get indexed. But I see the use-case for indexing entries with no body text in khoj. Will add support for it soon.

yantar92 · 2022-09-10T07:07:05Z

Debanjum ***@***.***> writes:

> Yeah, OrgNode is very basic. I've modified it for `khoj` to handle more scenarios but it's pretty ad-hoc. `Org.jl` looks interesting. I'm also tracking [Org-Parser](https://github.com/200ok-ch/org-parser) as they're being more methodical about parsing org syntax

org-parser has major issues with performance scaling (200ok-ch/org-parser#56). Org.jl, on the other hand, has been developed by one of the core Org developers :) It is even faster than tree sitter Org syntax (https://github.com/milisims/tree-sitter-org).

Yet another parser is https://github.com/tgbugs/laundry

Of course, the basic headline parsing does not require all these fancy parsers.


debanjum · 2022-09-10T07:32:25Z

Ah, hadn't seen the org-parser perf concerns. The rest of the parsers info is very informative too. But yeah for khoj nothing too fancy is required (currently). The parsing required to create index is expected to be done in the background, so speed should be less of a concern.

yantar92 · 2022-09-10T08:10:48Z

Debanjum ***@***.***> writes:

> ... The parsing required to create index is expected to be done in the background, so speed should be less of a concern.

This may be problematic if there is a single large Org file and user makes changes to it followed by searching those changes. Such use-case is one of the two common paradigms to organize Org files (one large file vs. many small files aka org-roam). I am now trying to index my 20Mb notes.org file and the estimate says that the process will take over 1 hour to complete. Even if done in the background, such a long re-indexing will take forever to complete.


debanjum · 2022-09-10T08:49:55Z

Yes, indexing takes quite a while for larger data sets. Most of this is due to the model generating embeddings, not the actual file parsing itself.

PR #75 is meant to make this long indexing time only required the first time (or whenever a large amount of new data is to be indexed). But for subsequent runs it'll only re-index new or modified entries. This should speed up updating the index significantly. Enough hopefully so that the index can be updated automatically in the background from within the app itself 🤞🏾
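The idea of re-embedding only new or modified entries can be sketched as follows. This is not the implementation in PR #75; it is a minimal illustration that assumes entries can be fingerprinted by hashing their text, and `entry_key`, `update_index` and the `embed` callable are hypothetical names.

```python
import hashlib


def entry_key(entry_text):
    """Stable fingerprint of an entry's text."""
    return hashlib.sha256(entry_text.encode("utf-8")).hexdigest()


def update_index(previous, entries, embed):
    """Return a new {fingerprint: embedding} index, embedding only unseen entries.

    previous: index produced by the last run
    entries:  texts of the entries currently in the notes
    embed:    callable mapping a list of texts to a list of embeddings
    """
    keys = [entry_key(text) for text in entries]
    new_texts = [text for key, text in zip(keys, entries) if key not in previous]
    new_embeddings = {}
    if new_texts:  # skip the expensive model call when nothing changed
        new_embeddings = dict(zip((entry_key(t) for t in new_texts), embed(new_texts)))
    # Entries deleted from the notes drop out because only current keys are kept.
    return {key: previous[key] if key in previous else new_embeddings[key] for key in keys}
```

With this shape, editing one entry in a large file costs one embedding call for that entry, while unchanged entries reuse their stored embeddings.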

debanjum · 2022-09-10T09:00:40Z

> ... The parsing required to create index is expected to be done in the background, so speed should be less of a concern.

> This may be problematic if there is a single large Org file and user makes changes to it followed by searching those changes.

Note: Currently the index has to be manually updated by the user (by calling the /regenerate API endpoint). The user should not expect khoj to search on the latest modified notes but on the last indexed notes. Even once automatic indexing is implemented, the index will lag the latest state of notes.

This shouldn't impact most practical use-cases IMO, as you're usually searching for older entries that you don't recall, not the latest edits to notes you may have just made.

- Previously we were failing while computing embeddings if there were no valid entries. This was obscuring the actual issue of no valid entries being found in the specified content files
- Throwing an exception early with a clear message when no entries are found should clarify the issue to be fixed
- See issue #83 for details
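A guard of the kind described above might look like the sketch below. It is an illustration rather than the merged change: `compile_entries` is a hypothetical name, and the only detail taken from the tracebacks in this thread is that each entry carries a `compiled` field.

```python
def compile_entries(entries):
    """Return the text to embed for each entry, failing early if there is none."""
    compiled = [entry["compiled"] for entry in entries if entry.get("compiled")]
    if not compiled:
        # Fail here with a readable message instead of letting the encoder
        # fail later with "stack expects a non-empty TensorList".
        raise ValueError("No valid entries found in the specified content files")
    return compiled
```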

- The parsed `level` argument passed to OrgNode during init is expected to be a string, not an integer
- This was resulting in app failure only when parsing org files with no headings, like in issue #83, as level is set to a string of `*`s the moment a heading is found in the current file
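The failure mode can be reproduced with a toy parser. The code below is a simplified stand-in, not the real `orgnode.py`; `Node` and `parse` are hypothetical names, and only the `self.level = len(level)` line mirrors the traceback in this issue.

```python
import re

HEADING_RE = re.compile(r"^(\*+) (.*)$")


class Node:
    def __init__(self, level, heading):
        self.level = len(level)  # expects the heading's string of '*'s
        self.heading = heading


def parse(lines, initial_level):
    """Emit one node per heading, plus a final flush at end of file."""
    level, heading, nodes = initial_level, "", []
    for line in lines:
        match = HEADING_RE.match(line)
        if match:
            if heading:  # flush the previous heading, if there was one
                nodes.append(Node(level, heading))
            level, heading = match.groups()
    nodes.append(Node(level, heading))  # a file with no headings reaches this with initial_level
    return nodes


print([n.level for n in parse(["* A", "text", "** B"], initial_level=0)])  # headings overwrite the int
print([n.level for n in parse(["just text"], initial_level="")])           # string default works
try:
    parse(["just text"], initial_level=0)                                    # int default fails
except TypeError as error:
    print("TypeError:", error)
```

A file with at least one heading never passes the integer default to `len`, which is why the error only appeared for files without headings.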

- Set LINE, SOURCE link properties in property drawer correctly for content which falls under no heading
- See Issue #83 for more details

### Main Changes
- bf01a4f Use filename or "#+TITLE" as heading for 0th level content in org files
- d6bd7bf Fix initializing `OrgNode` `level` to string to parse org files with no headings
- d835467 Throw exception if no valid entries found in specified content files

### Miscellaneous Improvements
- 7df39e5 Reuse search models across `pytest` sessions. Merge unused pytest fixtures
- 2dc0588 Do not normalize absolute filenames for entry links in `OrgNode`
- e00bb53 Init word filter dictionary with default value as set to simplify code

Resolves #83
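The first main change can be pictured with a small helper. This is a guess at the behaviour, not the code in bf01a4f: `fallback_heading` is a hypothetical name, and preferring `#+TITLE` over the file name, as well as dropping the file extension, are assumptions.

```python
import re
from pathlib import Path

# "#+TITLE:" keyword at the start of a line; Org keywords are case-insensitive.
TITLE_RE = re.compile(r"^#\+TITLE:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


def fallback_heading(org_text, filename):
    """Heading for content under no heading: #+TITLE if present, else the file name."""
    match = TITLE_RE.search(org_text)
    if match:
        return match.group(1).strip()
    return Path(filename).stem


print(fallback_heading("#+TITLE: Reading list\nSome notes", "/home/user/notes.org"))
print(fallback_heading("Some notes without a title", "/home/user/notes.org"))
```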

debanjum · 2022-09-10T12:48:57Z

@emacsomancer, @yantar92 I've merged fixes for the 2 main issues found on this thread to master. khoj should now:

- Parse org files with no headings
- Throw an error (with an appropriate message) if no valid entries are found

It'd be great if you can try the merged changes by upgrading to a pre-release build of khoj with:

pip install --upgrade --pre khoj-assistant

Let me know if this hasn't fixed the above issues for you all

yantar92 · 2022-09-10T14:25:14Z

The error is gone on my side.

debanjum · 2022-09-10T14:39:32Z

That's good to know! Thanks for verifying

emacsomancer · 2022-09-10T16:49:00Z

On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).

But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 112, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 83, in makelist
    for line in f:
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 676: invalid start byte
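Since the traceback does not name the offending file, a small standalone script can locate it. This is a diagnostic sketch independent of khoj; `read_org_file` and `find_undecodable` are hypothetical helpers, and the glob in the usage section is only an example path.

```python
from glob import glob
from os.path import expanduser
from pathlib import Path


def read_org_file(path):
    """Read a file as UTF-8, naming the file in the error if decoding fails."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError as error:
        raise ValueError(f"Cannot decode {path} as UTF-8: {error}") from error


def find_undecodable(paths):
    """Return the paths whose contents are not valid UTF-8."""
    bad = []
    for path in paths:
        try:
            read_org_file(path)
        except ValueError:
            bad.append(str(path))
    return bad


if __name__ == "__main__":
    for path in find_undecodable(glob(expanduser("~/Documents/Org/*.org"))):
        print(path)
```

Files reported by the script can then be re-saved as UTF-8 or excluded from indexing.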

debanjum · 2022-09-10T19:03:27Z

> On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).

Thanks for validating! Good to know that the initial issue is resolved. I'll push an update to index header-only entries soon.

> But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):

And thanks for discovering another bug! :) Could you please open a separate GitHub issue to track this new error? It'll make it easier to track separate bugs (and fixes) for future reference.

debanjum added the fix ("Fix something that isn't working as expected") label Sep 9, 2022

emacsomancer changed the title ~~errors when using input-filter~~ errors with org files containing certain types of structures Sep 9, 2022

debanjum added a commit that referenced this issue Sep 10, 2022

Use filename or #+TITLE as heading for 0th level content in org files

bf01a4f

- Set LINE, SOURCE link properties in property drawer correctly for content which falls under no heading
- See Issue #83 for more details

debanjum mentioned this issue Sep 10, 2022

Handle Empty Org Files or Org Files with No Headings #86

Merged

debanjum added a commit that referenced this issue Sep 10, 2022

Use filename or #+TITLE as heading for 0th level content in org files

07b98d3

- Set LINE, SOURCE link properties in property drawer correctly for content which falls under no heading
- See Issue #83 for more details

debanjum closed this as completed in #86 Sep 10, 2022

debanjum mentioned this issue Sep 10, 2022

Add Configuration Flag to Index Entries with Empty Body #87

Closed

debanjum mentioned this issue Sep 11, 2022

Khoj insall failed #93

Closed
