errors with org files containing certain types of structures #83

Closed
emacsomancer opened this issue Sep 9, 2022 · 21 comments · Fixed by #86
Labels
fix Fix something that isn't working as expected

Comments

@emacsomancer

With a khoj.yml file containing:

content-type:
  org:
    compressed-jsonl: ~/.khoj/content/org/org.jsonl.gz
    embeddings-file: ~/.khoj/content/org/org_embeddings.pt
    input-files: null
    input-filter: "/home/slade/Documents/Org/*.org"
processor: {}
search-type:
  asymmetric:
    cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
    encoder: sentence-transformers/multi-qa-MiniLM-L6-cos-v1
    model_directory: ~/.khoj/search/asymmetric/
  image:
    encoder: sentence-transformers/clip-ViT-B-32
    model_directory: ~/.khoj/search/image/
  symmetric:
    cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
    encoder: sentence-transformers/all-MiniLM-L6-v2
    model_directory: ~/.khoj/search/symmetric/

running khoj --regenerate produces an error:

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 170, in makelist
    thisNode = Orgnode(level, heading, bodytext, tags)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 217, in __init__
    self.level = len(level)
TypeError: object of type 'int' has no len()
@debanjum debanjum added the fix Fix something that isn't working as expected label Sep 9, 2022
@debanjum
Member

debanjum commented Sep 9, 2022

It doesn't seem to be an issue with input-filter. More like an issue with the OrgNode parser parsing (some of?) your org file(s). The level argument is expected to be a string of *s seen at the start of a heading, not an int.
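
To illustrate the failure mode described above, here is a minimal sketch. `ToyNode` and `make_node` are hypothetical stand-ins, not khoj's actual `Orgnode`; the only thing taken from the traceback is that the constructor calls `len(level)`.

```python
class ToyNode:
    """Hypothetical stand-in for a parsed heading; not khoj's Orgnode."""

    def __init__(self, level: str, heading: str):
        # Mirrors the failing line in the traceback: level is assumed to be
        # the run of '*' characters that starts a heading.
        self.level = len(level)
        self.heading = heading


def make_node(level, heading: str) -> ToyNode:
    """Build a node, normalizing an integer level to a string of stars first."""
    if isinstance(level, int):
        level = "*" * level  # 0 becomes "", i.e. content above any heading
    return ToyNode(level, heading)


try:
    ToyNode(0, "text before any heading")
except TypeError as err:
    print(err)  # object of type 'int' has no len()
```

Passing a string everywhere (an empty string for content above the first heading) is one way to avoid the error; whether khoj does exactly this is not shown here.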

Are you seeing this issue even when you only set the input-files field in the khoj.yml?

Next Steps:

  • Let me try to reproduce the issue on my end to see what's causing it
  • If you can share a test file or snippet that could be causing the failure, it'll speed up the fix for this issue.
  • Until then you can try to bypass the issue by identifying and excluding the file(s) that may be causing the parsing error

@emacsomancer
Author

emacsomancer commented Sep 9, 2022

But what about numbered lists and -, + type headings?

(Additional: on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?)

@emacsomancer emacsomancer changed the title errors when using input-filter errors with org files containing certain types of structures Sep 9, 2022
@emacsomancer
Author

Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.

One, a file with *-headings, but only *-headers (nothing "underneath" any of the headings):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 181, in setup
    corpus_embeddings = compute_embeddings(entries, bi_encoder, config.embeddings_file, regenerate=regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 66, in compute_embeddings
    corpus_embeddings = bi_encoder.encode([entry['compiled'] for entry in entries], convert_to_tensor=True, device=state.device, show_progress_bar=True)
  File "/home/slade/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
    all_embeddings = torch.stack(all_embeddings)
RuntimeError: stack expects a non-empty TensorList

Two, a file with no *-headings at all (but which is still a valid .org file):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 170, in makelist
    thisNode = Orgnode(level, heading, bodytext, tags)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 217, in __init__
    self.level = len(level)
TypeError: object of type 'int' has no len()

I can provide specific examples of such files if necessary.

@yantar92

yantar92 commented Sep 10, 2022 via email

@debanjum
Member

Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.

Ah, thanks for the additional details. I can reproduce both issues now.
To summarize, khoj is failing to handle two types of files:

  1. Files with only entries with no body
  2. Files with no entries

Where an entry, in org terminology, is anything that starts with a heading, and a heading is anything that starts with *s.

Context:

  1. khoj indexes and shows results at a per-entry level (similar to org-agenda search, org-rifle etc.)
    So khoj shouldn't fail when it sees files with no entries, but it'll still ignore such files going forward
  2. It only indexes entries with body text (i.e. entries that have something underneath the heading)
    This was done for quality-of-results reasons. Do you feel a need for khoj to index entries with only headings, or is it fine to ignore such entries? If indexing heading-only entries is needed, we can index them, but I'll add a filter to ignore entries with no body before we do.

As an immediate mitigation, I'll make khoj safely ignore the two cases instead of failing. Later, if needed, we can add more thought-out solutions. Does that sound reasonable?

@yantar92

yantar92 commented Sep 10, 2022 via email

@debanjum
Member

debanjum commented Sep 10, 2022

on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?

Khoj doesn't currently support multiple input-filters, but I've created an issue to track adding that. There's no way to exclude files from a filter for now either, but maybe I'll resolve that when multiple input-filter support is added.
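
Until such support exists, one possible workaround is to build the file list outside khoj. The sketch below is not part of khoj; `collect_files` is a hypothetical helper that expands several glob patterns and drops paths matching any exclusion pattern.

```python
from fnmatch import fnmatch
from glob import glob
from os.path import expanduser


def collect_files(include_patterns, exclude_patterns=()):
    """Expand several glob patterns and drop paths matching any exclusion."""
    matched = set()
    for pattern in include_patterns:
        matched.update(glob(expanduser(pattern)))
    return sorted(
        path
        for path in matched
        if not any(fnmatch(path, expanduser(ex)) for ex in exclude_patterns)
    )
```

The resulting paths could then be listed under `input-files` in khoj.yml, assuming that field accepts a list of file paths.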

@debanjum
Member

Clarification: heading must start at bol and must have a space after "*": "^\*+ ".

Agreed, that's how it is handled in code. I was just trying to keep my definition less verbose :)
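
The stricter definition quoted above can be sketched as a small check. `heading_level` is a hypothetical helper, not khoj's parser; it only encodes the rule that a heading is one or more asterisks at the beginning of the line followed by a space.

```python
import re

# One or more '*' at the beginning of the line, followed by a space.
HEADING = re.compile(r"^(\*+) ")


def heading_level(line: str) -> int:
    """Return the heading depth of a line, or 0 if it is not a heading."""
    match = HEADING.match(line)
    return len(match.group(1)) if match else 0
```

Under this rule, lines such as `- item`, `+ item`, `1. item`, or `*bold* text` return 0, so list items and emphasized text are not treated as headings.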

Also, do you consider property drawer as a part of body?

Property drawers are not considered part of the body for the purposes of indexing in khoj

When Org is used for bookmark management, empty bodies are not uncommon.
In the context of org-roam, files without headings may not be uncommon -
the "entry" is then defined by #+TITLE keyword or something similar.

I see. Can you clarify the bookmark management scenario a bit more? Seems like there is a use-case for handling headings with empty bodies. So I can add an issue to track that change.

@yantar92

yantar92 commented Sep 10, 2022 via email

@debanjum
Member

Note that Orgnode is very simplistic. The most accurate and fast Org
parser in the wild that I know of is https://github.com/tecosaur/Org.jl

Yeah, OrgNode is very basic. I've modified it for khoj to handle more scenarios but it's pretty ad-hoc. Org.jl looks interesting. I'm also tracking Org-Parser as they're being more methodical about parsing org syntax

@debanjum
Member

Bookmarks to this repo look like

    ***** SOMEDAY [#A] debanjum [Github] debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos :BOOKMARK:FLAGGED:misc:SOMEDAY:
    :PROPERTIES:
    :TITLE:    debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos
    :BTYPE:    misc
    :ID:       Github-debanjum-debanjum-khoj-natural-e4a
    :AUTHOR:   debanjum
    :CREATED:  [2022-09-07 Wed 21:51]
    :HOWPUBLISHED: Github
    :NOTE:     Online; accessed 07 September 2022
    :RSS:      https://github.com/debanjum/khoj/commits.atom
    :URL:      https://github.com/debanjum/khoj
    :END:
    :LOGBOOK:
    - Refiled on [2022-09-07 Wed 22:47]
    :END:

I see. This one will happen to get indexed, since entries with logbook drawer notes get indexed. But I see the use-case for indexing entries with no body text in khoj. Will add support for it soon.
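
A rough sketch of the distinction drawn above: the property drawer is excluded from the body, while anything else under the heading, including logbook notes, counts as body text. Both functions are hypothetical and take the lines below the heading; crudely, the logbook's own delimiter lines also count as body here.

```python
def body_without_properties(lines):
    """Drop the :PROPERTIES: drawer from an entry's lines; keep everything else."""
    kept, in_properties = [], False
    for line in lines:
        marker = line.strip().upper()
        if marker == ":PROPERTIES:":
            in_properties = True
        elif in_properties:
            # Stay inside the drawer until its closing :END: line.
            in_properties = marker != ":END:"
        else:
            kept.append(line)
    return kept


def has_body(lines):
    """True if any non-blank text remains after removing the property drawer."""
    return any(line.strip() for line in body_without_properties(lines))
```

For the bookmark example above, `has_body` would return True because of the logbook note; the same entry without the logbook drawer would have no body.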

@yantar92

yantar92 commented Sep 10, 2022 via email

@debanjum
Member

Ah, hadn't seen the org-parser perf concerns. The rest of the parsers info is very informative too. But yeah for khoj nothing too fancy is required (currently). The parsing required to create index is expected to be done in the background, so speed should be less of a concern.

@yantar92

yantar92 commented Sep 10, 2022 via email

@debanjum
Member

debanjum commented Sep 10, 2022

Yes, indexing takes quite a while for larger data sets. Most of this is due to the model generating embeddings, not the actual file parsing itself.

PR #75 is meant to make this long indexing time necessary only the first time (or whenever a large amount of new data is to be indexed). On subsequent runs it'll only re-index new or modified entries. This should speed up updating the index significantly, hopefully enough that the index can be updated automatically in the background from within the app itself 🤞🏾
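
The idea of re-indexing only new or modified entries can be sketched with content fingerprints. This is an illustration under my own assumptions, not the mechanism used in PR #75; `entry_key` and `plan_reindex` are hypothetical names.

```python
import hashlib


def entry_key(text: str) -> str:
    """Stable fingerprint of an entry's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(previous_keys, current_entries):
    """Split current entries into those needing new embeddings and those reusable.

    previous_keys: set of fingerprints saved from the last indexing run.
    current_entries: list of entry texts parsed from the notes now.
    """
    to_embed, reusable = [], []
    for text in current_entries:
        (reusable if entry_key(text) in previous_keys else to_embed).append(text)
    return to_embed, reusable
```

Only the `to_embed` list would need to go through the slow embedding model; embeddings for the reusable entries could be carried over from the previous run.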

@debanjum
Member

... The parsing required to create index is expected to be done in the background, so speed should be less of a concern.

This may be problematic if there is a single large Org file and user makes changes to it followed by searching those changes.

Note: Currently the index has to be manually updated by the user (by calling the /regenerate API endpoint). The user should not expect khoj to search on the latest modified notes but on the last indexed notes. Even once automatic indexing is implemented, the index will lag the latest state of notes.

This shouldn't impact most practical use-cases IMO, as you're usually searching for older entries that you don't recall, not the latest edits to notes you may have just made.

debanjum added a commit that referenced this issue Sep 10, 2022
- Previously we were failing if no valid entries while computing
  embeddings. This was obscuring the actual issue of no valid entries
  found in the specified content files
- Throwing an exception early with a clear message when no entries are found
  should clarify the issue to be fixed
- See issue #83 for details
debanjum added a commit that referenced this issue Sep 10, 2022
- Parsed `level` argument passed to OrgNode during init is expected to
  be a string, not an integer
- This was resulting in app failure only when parsing org files with
  no headings, like in issue #83, as level is set to string of `*`s
  the moment a heading is found in the current file
debanjum added a commit that referenced this issue Sep 10, 2022
- Set LINE, SOURCE link properties in property drawer correctly for
  content which falls under no heading
- See Issue #83 for more details
debanjum added a commit that referenced this issue Sep 10, 2022
### Main Changes
- bf01a4f Use filename or "#+TITLE" as heading for 0th level content in org files
- d6bd7bf Fix initializing `OrgNode` `level` to string to parse org files with no headings
- d835467 Throw exception if no valid entries found in specified content files

### Miscellaneous Improvements
- 7df39e5 Reuse search models across `pytest` sessions. Merge unused pytest fixtures
- 2dc0588 Do not normalize absolute filenames for entry links in `OrgNode`
- e00bb53 Init word filter dictionary with default value as set to simplify code

Resolves #83
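
As a rough illustration of the first change listed above (using the filename or "#+TITLE" as the heading for content above any heading), here is a sketch. `top_level_heading` is a hypothetical helper; preferring the title over the filename and using the filename without its extension are my assumptions, not details confirmed by the commit.

```python
import re
from pathlib import Path

# Matches a "#+TITLE: ..." keyword line, case-insensitively.
TITLE = re.compile(r"^#\+TITLE:[ \t]*(.+?)[ \t]*$", re.IGNORECASE | re.MULTILINE)


def top_level_heading(org_text: str, filename: str) -> str:
    """Pick a heading for content that sits above any '*' heading."""
    match = TITLE.search(org_text)
    return match.group(1) if match else Path(filename).stem
```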
@debanjum
Member

@emacsomancer, @yantar92 I've merged fixes for the 2 main issues found on this thread to master. khoj should now:

  1. Parse org files with no headings
  2. Throw error (with appropriate message) if no valid entries found
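
The second behavior can be sketched as an early check before embeddings are computed. `require_entries` is a hypothetical helper; the `compiled` key comes from the earlier traceback, while the exception type and message wording are assumptions.

```python
def require_entries(entries, content_files):
    """Fail early with a readable message instead of deep inside the encoder."""
    valid = [entry for entry in entries if entry.get("compiled", "").strip()]
    if not valid:
        raise ValueError(
            f"No valid entries found in specified files: {', '.join(content_files)}"
        )
    return valid
```

Checking here means the user sees which content files produced nothing, rather than the `stack expects a non-empty TensorList` error reported earlier in the thread.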

It'd be great if you can try the merged changes by upgrading to a pre-release build of khoj with:

pip install --upgrade --pre khoj-assistant

Let me know if this hasn't fixed the above issues for you all.

@yantar92

The error is gone on my side.

@debanjum
Member

That's good to know! Thanks for verifying

@emacsomancer
Author

emacsomancer commented Sep 10, 2022

On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).

But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 112, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 83, in makelist
    for line in f:
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 676: invalid start byte
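
Since the traceback does not name the offending file, a small standalone script can locate it. This sketch is independent of khoj; `find_undecodable` is a hypothetical helper that tries to decode each file as UTF-8 and reports the ones that fail.

```python
from pathlib import Path


def find_undecodable(paths, encoding="utf-8"):
    """Return (path, byte_offset, reason) for each file that fails to decode."""
    problems = []
    for path in paths:
        try:
            Path(path).read_bytes().decode(encoding)
        except UnicodeDecodeError as err:
            problems.append((str(path), err.start, err.reason))
    return problems
```

It could be run over the same files the input-filter matches, for example the paths returned by `glob.glob` for that pattern. The reported offset is relative to the whole file, so it may differ from the position shown in the traceback, which comes from a buffered read.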

@debanjum
Member

On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).

Thanks for validating! Good to know that the initial issue is resolved. I'll push an update to index header only entries soon.

But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):

And thanks for discovering another bug! :) Could you please open a separate GitHub issue to track this new error? It'll make it easier to track separate bugs (and fixes) for future reference.
