improved markdown & orgmode parsing #766

LeXofLeviafan · 2024-08-24T16:41:40Z

While trying to import a generated DB file, I noticed some irregularities. I worked around the issue by converting the file to Markdown format, but I still ended up making a few improvements to the parsing code, namely:

* reimplemented Markdown/OrgMode parsing via regex
* added support for "raw links" (<url> in Markdown, [[url]] in OrgMode)
* added tests for all valid and invalid link formats (based on which formats Pandoc can handle, and treating blank links as "always invalid")

Markdown import

* [Bookmark title](http://example.com/1) <!-- TAGS: tag 1, tag 2, tag 3 -->
* [Bookmark title](javascript:void(2))   <!-- TAGS: tag 1, tag 2, tag 3 -->
* [Bookmark title]()                     <!-- TAGS: tag 1, tag 2, tag 3 -->
* [](http://example.com/4)               <!-- TAGS: tag 1, tag 2, tag 3 -->
* [](javascript:void(5))                 <!-- TAGS: tag 1, tag 2, tag 3 -->
* []()                                   <!-- TAGS: tag 1, tag 2, tag 3 -->
* <http://example.com/7>                 <!-- TAGS: tag 1, tag 2, tag 3 -->
* <javascript:void(8)>                   <!-- TAGS: tag 1, tag 2, tag 3 -->
* <>                                     <!-- TAGS: tag 1, tag 2, tag 3 -->
* [Bookmark title](http://example.com/10)
* [Bookmark title](javascript:void(11))
* [Bookmark title]()
* [](http://example.com/13)
* [](javascript:void(14))
* []()
* <http://example.com/16>
* <javascript:void(17)>
* <>
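
To make the sample lines above concrete, here is a minimal sketch of how such lines could be matched with a single regex. The pattern `MD_LINE` and the helper `parse_md_line` are hypothetical names written for illustration; they are not the code merged in this PR, and the only validity rule assumed is the one stated in the description (blank links are always invalid).

```python
import re

# Hypothetical pattern for illustration only; buku's actual regex may differ.
MD_LINE = re.compile(
    r'^\s*[*+-]?\s*'                                   # optional list bullet
    r'(?:\[(?P<title>[^\]]*)\]\((?P<url>.*?)\)'        # [title](url)
    r'|<(?P<raw>[^<>\s]*)>)'                           # <url> (raw link)
    r'\s*(?:<!--\s*TAGS:\s*(?P<tags>.*?)\s*-->)?\s*$'  # optional tags comment
)


def parse_md_line(line):
    """Return (title, url, tags) or None when the line has no usable link."""
    m = MD_LINE.match(line)
    if not m:
        return None
    # Exactly one of the two alternatives matched; take whichever URL is set.
    url = m.group('url') if m.group('url') is not None else m.group('raw')
    url = (url or '').strip()
    if not url:  # blank links are treated as always invalid
        return None
    title = m.group('title') or None  # an empty title is reported as None
    tags = [t.strip() for t in (m.group('tags') or '').split(',') if t.strip()]
    return title, url, tags


print(parse_md_line('* <http://example.com/16>'))
# (None, 'http://example.com/16', [])
```

Because the URL group is lazy and the rest of the pattern is anchored at the end of the line, a URL containing parentheses such as javascript:void(2) is still captured whole.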

OrgMode import

- [[http://example.com/1][Bookmark title]] :tag 1:tag 2:tag 3:
- [[javascript:void(2)][Bookmark title]]   :tag 1:tag 2:tag 3:
- [[][Bookmark title]]                     :tag 1:tag 2:tag 3:
- [[http://example.com/4][]]               :tag 1:tag 2:tag 3:
- [[javascript:void(5)][]]                 :tag 1:tag 2:tag 3:
- [[][]]                                   :tag 1:tag 2:tag 3:
- [[http://example.com/7]]                 :tag 1:tag 2:tag 3:
- [[javascript:void(8)]]                   :tag 1:tag 2:tag 3:
- [[]]                                     :tag 1:tag 2:tag 3:
- [[http://example.com/10][Bookmark title]]
- [[javascript:void(11)][Bookmark title]]
- [[][Bookmark title]]
- [[http://example.com/13][]]
- [[javascript:void(14)][]]
- [[][]]
- [[http://example.com/16]]
- [[javascript:void(17)]]
- [[]]

(Unlike in Markdown, empty titles – [[url][]] – are explicitly invalid in OrgMode)

LeXofLeviafan · 2024-08-24T16:44:19Z

buku

-                    if newtag:
-                        if newtag.lower() not in tags:
-                            tags_string = (newtag + DELIM) + tags_string
+                tags = list(dict.fromkeys(get_org_tags(match.group('tags') or '')))


(from Python documentation)
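
For readers unfamiliar with the idiom in the added line: `dict.fromkeys` keeps the first occurrence of each key, and dicts preserve insertion order (guaranteed since Python 3.7), so wrapping it in `list()` removes duplicates without reordering. A small standalone sketch, with `dedupe_keep_order` as a hypothetical helper name:

```python
def dedupe_keep_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


tags = ['foo', 'bar', 'foo', 'baz', 'bar']
print(dedupe_keep_order(tags))  # ['foo', 'bar', 'baz']
```

A plain `set()` would also remove the duplicates, but it gives no ordering guarantee.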

LeXofLeviafan · 2024-08-24T16:48:16Z

buku

@@ -3670,34 +3651,24 @@ def import_org(filepath: str, newtag: Optional[str]):
                tag_list_cleaned.append(tag.strip())
        return tag_list_cleaned

+    # Supported OrgMode format: `[[url][title]] :tags:` (or `[[url]] :tags:`)
+    _url, _maybe_title = r'(?P<url>((?!\]\[).)+?)', r'(\]\[(?P<title>.+))?'


The regex for the URL means "any non-empty string not containing ]["
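
A quick way to see what these two fragments accept is to wrap them in a complete pattern and try a few of the sample lines. The `_url` and `_maybe_title` strings below are copied from the diff; the surrounding `ORG_LINK` pattern and the `parse_org_link` helper are assumptions made for this sketch and may not match how buku combines them.

```python
import re

# Fragments as they appear in the diff above.
_url, _maybe_title = r'(?P<url>((?!\]\[).)+?)', r'(\]\[(?P<title>.+))?'

# Hypothetical wrapper: the fragments enclosed in the [[ ... ]] brackets.
ORG_LINK = re.compile(r'\[\[' + _url + _maybe_title + r'\]\]')


def parse_org_link(line):
    """Return (url, title) for the first link in the line, or None."""
    m = ORG_LINK.search(line)
    if not m:
        return None
    return m.group('url'), m.group('title')


print(parse_org_link('- [[http://example.com/1][Bookmark title]] :tag 1:tag 2:tag 3:'))
# ('http://example.com/1', 'Bookmark title')
print(parse_org_link('- [[http://example.com/4][]]'))
# None
```

The `((?!\]\[).)+?` part consumes one character at a time, checking before each one that a `][` separator does not start there, which is why a blank URL (`[[][title]]`) cannot match. The title group requires at least one character, so `[[url][]]` is rejected as well.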

LeXofLeviafan · 2024-08-24T16:50:05Z

tests/test_buku.py

+    ('foo, bar, baz', None, ',bar,baz,foo,'),
+    ('foo, bar, baz', 'new tag', ',bar,baz,foo,new tag,'),
+])
+@pytest.mark.parametrize('title', ['Bookmark title', '', None])


The three title values correspond to the link forms [Bookmark title](…), [](…) & <…>

LeXofLeviafan · 2024-08-24T16:50:37Z

tests/test_buku.py

+    ('tag1: ::tag2:tag::3:tag4:: :tag:::5: ta g::6:: ', None, ',tag1,:tag2,tag:3,tag4:,tag::5,ta g:6:,'),
+    ('tag1: ::tag2:tag::3:tag4:: :tag:::5: ta g::6:: ', 'new tag', ',new tag,tag1,:tag2,tag:3,tag4:,tag::5,ta g:6:,'),
+])
+@pytest.mark.parametrize('title', ['Bookmark title', '', None])


The three title values correspond to the link forms [[…][Bookmark title]], [[…][]] & [[…]]

LeXofLeviafan · 2024-08-24T16:52:38Z

tests/test_buku.py

    from buku import import_md

    p = tmpdir.mkdir("importmd").join("test.md")
-    p.write("[text1](http://example.com)")
+    print(line := (f'<{url}>' if title is None else f'[{title}]({url})') +
+                  ('' if not tags else f' <!-- TAGS: {tags} -->'))


Printing out the line to be parsed makes it easier to figure out what went wrong when a test fails.
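
The same assign-and-print pattern can be tried outside the test suite. In this sketch, `make_md_line` is a hypothetical helper that wraps the expression from the diff; the assignment expression (`:=`, Python 3.8+) binds the whole concatenated string to `line` while `print` emits it. By default pytest captures stdout and shows it only for failing tests, so the printed line appears next to the failure.

```python
def make_md_line(url, title, tags):
    """Build one Markdown bookmark line, printing it as a debugging aid."""
    # `:=` has lower precedence than `+`, so `line` receives the full string.
    print(line := (f'<{url}>' if title is None else f'[{title}]({url})') +
                  ('' if not tags else f' <!-- TAGS: {tags} -->'))
    return line


make_md_line('http://example.com', None, '')
# prints: <http://example.com>
make_md_line('http://example.com', 'Bookmark title', 'foo, bar')
# prints: [Bookmark title](http://example.com) <!-- TAGS: foo, bar -->
```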

LeXofLeviafan · 2024-08-24T16:54:06Z

buku


-                    parse_tags([tags])
+                tags = DELIM.join(s for s in [newtag, match.group('tags')] if s)
+                tags = parse_tags([tags])


…Smh the output of parse_tags() was ignored before 😅
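
The bug class is easy to reproduce in isolation: a function that returns a new value has no effect unless the result is assigned. In the sketch below, `normalize_tags` is a hypothetical stand-in for `parse_tags()` (its exact behaviour is an assumption, chosen to match the sorted, delimiter-wrapped strings in the tests), and `DELIM = ','` is likewise assumed.

```python
DELIM = ','  # assumed tag delimiter


def normalize_tags(raw):
    """Hypothetical stand-in for parse_tags(): strip, dedupe, sort, wrap."""
    parts = sorted({t.strip() for t in raw.split(DELIM) if t.strip()})
    if not parts:
        return DELIM
    return DELIM + DELIM.join(parts) + DELIM


def merge_tags(newtag, parsed):
    """Join the optional new tag with the parsed tags, then normalize."""
    joined = DELIM.join(s for s in [newtag, parsed] if s)
    return normalize_tags(joined)  # the return value must be used


tags = 'foo, bar, baz'
normalize_tags(tags)           # result discarded: `tags` is unchanged
print(tags)                    # foo, bar, baz
print(merge_tags('new tag', tags))  # ,bar,baz,foo,new tag,
```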

jarun · 2024-08-25T01:04:05Z

Nice improvement, thank you!

improved markdown & orgmode parsing

a615c37

jarun merged commit e898fcc into jarun:master Aug 25, 2024
1 check passed

LeXofLeviafan deleted the import-regex branch August 25, 2024 01:24

LeXofLeviafan mentioned this pull request Sep 6, 2024

implementing search-with-markers #777

Merged
