improved markdown & orgmode parsing #766

Merged 1 commit on Aug 25, 2024

LeXofLeviafan
Collaborator

While trying to import a generated DB file, I noticed some irregularities. I worked around the issue by converting the file to Markdown format, but I still ended up making a few improvements to the parsing code, namely:

  • reimplemented Markdown/OrgMode parsing via regex
  • added support for "raw links" (<url>/[[url]])
  • added tests for all valid and invalid link formats (based on which formats Pandoc can handle, and treating blank links as "always invalid")

Markdown import

* [Bookmark title](http://example.com/1) <!-- TAGS: tag 1, tag 2, tag 3 -->
* [Bookmark title](javascript:void(2))   <!-- TAGS: tag 1, tag 2, tag 3 -->
* [Bookmark title]()                     <!-- TAGS: tag 1, tag 2, tag 3 -->
* [](http://example.com/4)               <!-- TAGS: tag 1, tag 2, tag 3 -->
* [](javascript:void(5))                 <!-- TAGS: tag 1, tag 2, tag 3 -->
* []()                                   <!-- TAGS: tag 1, tag 2, tag 3 -->
* <http://example.com/7>                 <!-- TAGS: tag 1, tag 2, tag 3 -->
* <javascript:void(8)>                   <!-- TAGS: tag 1, tag 2, tag 3 -->
* <>                                     <!-- TAGS: tag 1, tag 2, tag 3 -->
* [Bookmark title](http://example.com/10)
* [Bookmark title](javascript:void(11))
* [Bookmark title]()
* [](http://example.com/13)
* [](javascript:void(14))
* []()
* <http://example.com/16>
* <javascript:void(17)>
* <>

[image: imported]
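As a rough illustration of how the Markdown cases above could be told apart with a single regex, here is a minimal sketch. The pattern `MD_LINE` and the helper `parse_md_line` are hypothetical names written for this example; they are not the code from this PR, and they only check the line format, not whether the URL scheme is acceptable.

```python
import re

# ASSUMPTION: illustrative pattern, not the one merged in this PR.
MD_LINE = re.compile(
    r'^\s*[*+-]\s+'                                    # list marker
    r'(?:\[(?P<title>[^\]]*)\]\((?P<url>.+?)\)'        # [title](url), title may be empty
    r'|<(?P<raw>[^<>\s]+)>)'                           # <url> ("raw link")
    r'\s*(?:<!--\s*TAGS:\s*(?P<tags>.*?)\s*-->)?\s*$'  # optional tags comment
)


def parse_md_line(line):
    """Return a dict with url/title/tags, or None when the link is blank or malformed."""
    m = MD_LINE.match(line)
    if m is None:
        return None
    raw_tags = m.group('tags') or ''
    tags = [t.strip() for t in raw_tags.split(',') if t.strip()]
    # title stays None for raw links, '' for [](url)
    return {'url': m.group('url') or m.group('raw'),
            'title': m.group('title'),
            'tags': tags}


print(parse_md_line('* [](javascript:void(5))'))
print(parse_md_line('* <>   <!-- TAGS: tag 1, tag 2, tag 3 -->'))
```

Anchoring the pattern to the whole line keeps the `<url>` branch from accidentally matching the `<!-- TAGS: ... -->` comment when the link itself is blank.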

OrgMode import

- [[http://example.com/1][Bookmark title]] :tag 1:tag 2:tag 3:
- [[javascript:void(2)][Bookmark title]]   :tag 1:tag 2:tag 3:
- [[][Bookmark title]]                     :tag 1:tag 2:tag 3:
- [[http://example.com/4][]]               :tag 1:tag 2:tag 3:
- [[javascript:void(5)][]]                 :tag 1:tag 2:tag 3:
- [[][]]                                   :tag 1:tag 2:tag 3:
- [[http://example.com/7]]                 :tag 1:tag 2:tag 3:
- [[javascript:void(8)]]                   :tag 1:tag 2:tag 3:
- [[]]                                     :tag 1:tag 2:tag 3:
- [[http://example.com/10][Bookmark title]]
- [[javascript:void(11)][Bookmark title]]
- [[][Bookmark title]]
- [[http://example.com/13][]]
- [[javascript:void(14)][]]
- [[][]]
- [[http://example.com/16]]
- [[javascript:void(17)]]
- [[]]

(Unlike in Markdown, empty titles – [[url][]] – are explicitly invalid in OrgMode)
[image: imported]

if newtag:
    if newtag.lower() not in tags:
        tags_string = (newtag + DELIM) + tags_string
tags = list(dict.fromkeys(get_org_tags(match.group('tags') or '')))

@LeXofLeviafan LeXofLeviafan Aug 24, 2024

@@ -3670,34 +3651,24 @@ def import_org(filepath: str, newtag: Optional[str]):
tag_list_cleaned.append(tag.strip())
return tag_list_cleaned

# Supported OrgMode format: `[[url][title]] :tags:` (or `[[url]] :tags:`)
_url, _maybe_title = r'(?P<url>((?!\]\[).)+?)', r'(\]\[(?P<title>.+))?'

The URL regex means "any non-empty string not containing ][".
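To see the negative lookahead at work, the sketch below reuses the two fragments from the diff verbatim. The surrounding `\[\[ ... \]\]` wrapper, the name `ORG_LINK`, and the helper `parse_org_link` are assumptions made for this example; the PR's full pattern (list marker, trailing `:tags:`) is not shown in this excerpt.

```python
import re

# These two fragments are copied from the diff above.
_url, _maybe_title = r'(?P<url>((?!\]\[).)+?)', r'(\]\[(?P<title>.+))?'

# ASSUMPTION: illustrative wrapper only; it matches a bare link with nothing around it.
ORG_LINK = re.compile(r'\[\[' + _url + _maybe_title + r'\]\]')


def parse_org_link(text):
    """Return (url, title) for a well-formed link, or None otherwise."""
    m = ORG_LINK.fullmatch(text)
    return None if m is None else (m.group('url'), m.group('title'))


print(parse_org_link('[[http://example.com/1][Bookmark title]]'))
print(parse_org_link('[[javascript:void(8)]]'))   # raw link: title is None
print(parse_org_link('[[http://example.com/4][]]'))  # empty title: no match
print(parse_org_link('[[][Bookmark title]]'))        # empty url: no match
```

Because `(?!\]\[).` refuses to consume a `]` that is followed by `[`, the URL group can never swallow the `][` separator, and `.+` in the title group rejects the empty-title form.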

('foo, bar, baz', None, ',bar,baz,foo,'),
('foo, bar, baz', 'new tag', ',bar,baz,foo,new tag,'),
])
@pytest.mark.parametrize('title', ['Bookmark title', '', None])

[Bookmark title](…), [](…) & <…>

('tag1: ::tag2:tag::3:tag4:: :tag:::5: ta g::6:: ', None, ',tag1,:tag2,tag:3,tag4:,tag::5,ta g:6:,'),
('tag1: ::tag2:tag::3:tag4:: :tag:::5: ta g::6:: ', 'new tag', ',new tag,tag1,:tag2,tag:3,tag4:,tag::5,ta g:6:,'),
])
@pytest.mark.parametrize('title', ['Bookmark title', '', None])

[[…][Bookmark title]], [[…][]] & [[…]]

from buku import import_md

p = tmpdir.mkdir("importmd").join("test.md")
p.write("[text1](http://example.com)")
print(line := (f'<{url}>' if title is None else f'[{title}]({url})') +
('' if not tags else f' <!-- TAGS: {tags} -->'))

Printing out the line to be parsed makes it easier to figure out what went wrong when a test fails.
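A small sketch of the same idea outside the test suite: the helper `build_md_line` is a hypothetical name that mirrors the expression from the test above, and the URL and tags are placeholder values. Under pytest, output printed during a test is captured and shown only when that test fails, so the extra `print` costs nothing on a passing run.

```python
def build_md_line(url, title, tags):
    """Build one Markdown line; a title of None produces a raw <url> link."""
    link = f'<{url}>' if title is None else f'[{title}]({url})'
    return link + ('' if not tags else f' <!-- TAGS: {tags} -->')


# The walrus operator binds the value to `line` while printing it,
# so the same string can be written to the file under test afterwards.
for title in ['Bookmark title', '', None]:
    print(line := build_md_line('http://example.com', title, 'foo, bar'))
```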


parse_tags([tags])
tags = DELIM.join(s for s in [newtag, match.group('tags')] if s)
tags = parse_tags([tags])

…Smh the output of parse_tags() was ignored before 😅
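The difference between the old and new lines can be sketched as follows. `normalize_tags` is a hypothetical stand-in written for this example (it is not buku's `parse_tags()`), and `DELIM = ','` is an assumption inferred from the expected values in the test parameters.

```python
DELIM = ','  # ASSUMPTION: inferred from the expected strings in the tests


def normalize_tags(raw):
    """Hypothetical normaliser: strip, lowercase, dedupe, sort, wrap in DELIM."""
    unique = sorted({t.strip().lower() for t in raw.split(DELIM) if t.strip()})
    return DELIM + DELIM.join(unique) + DELIM if unique else DELIM


def merge_tags_buggy(newtag, found):
    tags = DELIM.join(s for s in [newtag, found] if s)
    normalize_tags(tags)   # return value discarded, so `tags` stays unnormalised
    return tags


def merge_tags_fixed(newtag, found):
    tags = DELIM.join(s for s in [newtag, found] if s)
    return normalize_tags(tags)  # keep the normalised result


print(merge_tags_buggy('new tag', 'foo, bar, baz'))
print(merge_tags_fixed('new tag', 'foo, bar, baz'))
```

Since strings are immutable in Python, a function like this can only report its result through the return value, which is why dropping the assignment silently skipped normalisation.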

@jarun jarun merged commit e898fcc into jarun:master Aug 25, 2024
1 check passed
@jarun
Owner

jarun commented Aug 25, 2024

Nice improvement, thank you!
