Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implementing search-with-markers #777

Merged
merged 1 commit into from
Sep 7, 2024

Conversation

LeXofLeviafan
Copy link
Collaborator

fixes #740:

  • adding an option to enable "search-with-markers", wherein each keyword can be applied to a specific field instead of all fields (based on keyword prefix: . for title, > for description, : for URL, # for tags; * or no prefix for all fields)
  • implementing support for it via CLI (--markers), interactive shell (m), bukuserver buku filter mode (markers*), bukuserver index page/navbar search
  • updating user manuals accordingly

also:

  • adding support for specifying the bookmarks output sorting order (as appropriate)
  • implementing support for it via CLI (--order), interactive shell (v… more fitting letters are taken already 😅), bukuserver bookmarks filter (order); + a BukuDb method for sorting already fetched bookmarks (with matching behaviour)
  • fixing the filters form width in bukuserver, and ensuring all buku filter modes are synchronized (without resetting them)
  • updating the Markdown/OrgMode export of untitled bookmarks to match the recent import improvements

Note: multiple order filters can be affected by an upstream bug when edited.

Screenshots

CLI

("sort records by tags in ascending order, and those with the same tags by URL in descending order")

buku --order +tags,-url --print | less

sorted1
sorted2

buku --order +tags,-url --print --json | less

json
(“search for bar in titles and .foo/ in URLs”)

buku --np --order +tags,-url --markers --sall '. bar' :.foo/ | less

sorted search

export
buku --order +tags,-url --markers --sall '. bar' :.foo/ --export generated.md && less generated.md

export

interactive shell
v title, -desc, +id
p 1-10

sorted

m
v tags,-url
S . bar :.foo/

sorted search

webUI – index page

search form with tooltips
(these are default parameters now since the markers don't really interfere with most searches unless you actually include them)
sample form search
(navbar search applies the same defaults so it's the same search as above)
sample navbar search

webUI – bookmarks

(this is the search that either of these forms will submit when using the query . bar :.foo/)
search results
adding order filter
choosing the ordering
sorted search results
reversed bookmarks list
sorted bookmarks list
another sorted search

webUI – filters width

(these are based on primary @media (min-width: *) values used by Bootstrap)
1200px+
992px+
768px+
(the last one is the default width which is used as the fallback value when none of the above constraints are met)
default width

bookmark URL with comma-separated tags
(prepend tags with '+' or '-' to use fetched tags)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…Forgot to update readme in #775 😅

@@ -72,13 +73,19 @@ SKIP_MIMES = {'.pdf', '.txt'}
PROMPTMSG = 'buku (? for help): ' # Prompt message string

strip_delim = lambda s, delim=DELIM, sub=' ': str(s).replace(delim, sub)
taglist = lambda ss: sorted(s.lower() for s in set(ss) if s)
taglist = lambda ss: sorted(set(s.lower().strip() for s in ss if (s or '').strip()))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensuring taglist() works correctly with unstripped tags

like_escape = lambda s, c='`': s.replace(c, c+c).replace('_', c+'_').replace('%', c+'%')
split_by_marker = lambda s: re.split(r'\s+(?=[.:>#*])', s)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically how bukubrow does it

return self.value != other and ((self.value < other) == self.ascending)

def __repr__(self):
return ('+' if self.ascending else '-') + repr(self.value)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to apply a sort direction to each field individually (when sorting in Python)

valid = list(names) + list(names.values()) + ['tags']
_fields = [(re.sub(r'^[+-]', '', s), not s.startswith('-')) for s in (fields or [])]
_fields = [(names.get(field, field), direction) for field, direction in _fields if field in valid]
return _fields or [('id', True)]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing and validating an ordering description, converting it into a simple format for internal use.

text_fields = (set() if not ignore_case else {'url', 'desc', 'title', 'tags'})
get = lambda x, k: (getattr(x, k) if not ignore_case or k not in text_fields else str(getattr(x, k) or '').lower())
order = self._ordering(fields, for_db=False)
return sorted(bookmark_vars(records), key=lambda x: [SortKey(get(x, k), ascending=asc) for k, asc in order])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting already fetched data… is not actually used anywhere in the code, but it might come in handy anytime.

"""Converts field list to SQL 'ORDER BY' parameters. (See also BukuDb._ordering().)"""
text_fields = (set() if not ignore_case else {'url', 'desc', 'metadata', 'tags'})
return ', '.join(f'{field if field not in text_fields else "LOWER("+field+")"} {"ASC" if direction else "DESC"}'
for field, direction in self._ordering(fields))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generates a list of ORDER BY SQL clauses


Returns
-------
list
A list of tuples representing bookmark records.
"""

return self._fetch('SELECT * FROM bookmarks', lock=lock)
return self._fetch(f'SELECT * FROM bookmarks ORDER BY {self._order(order)}', lock=lock)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default behaviour (when order is not specified) won't be affected here

if not s:
return []
tags = ([s] if regex and not markers else taglist(s.split(DELIM)))
return [('metadata', deep, s), ('url', deep, s), ('desc', deep, s)] + (tags and [('tags', deep, *tags)])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each token includes the field name, its own deep value (to account for special tags behaviour), and one or more parameter (again, to account for tags since they cannot include ,).

Note that for a token to be matched, all of its parameters must be matched (i.e. when searching for foo,bar,baz in tags, all three tags must be matched for the token to match).

And for the keyword to match, any of its tokens must be matched.

param = border(field, param[0]) + re.escape(param) + border(field, param[-1])
args += [param]
clauses += (_clauses if len(_clauses) < 2 else [f'({" AND ".join(_clauses)})'])
return ' OR '.join(clauses), args
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on the original implementation of searchdb(); regex=True means that deep is ignored and only one param is expected in a token (since comma-splitting is not done in regex mode), and otherwise we rely on \b to determine the word edge (or , when working with tags).

Note that I removed stripping / from the end of the string, since it's easy to not include a slash for the user if it's not necessary, but if it's stripped then it's impossible to actually search for it correctly (e.g. specifying .com/ in URL search may not produce the desired results).

q0 = q0[:-3] + ' AS score FROM bookmarks WHERE score > 0 ORDER BY score DESC)'
query = ('SELECT id, url, metadata, tags, desc, flags\nFROM (SELECT *, (' +
'\n + '.join(map(_count, clauses)) +
f') AS score\n FROM bookmarks WHERE score > 0 ORDER BY score DESC, {_order})')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query-generating code itself is much more compact now.

(As for the \ns, they make no difference to the DB engine, but reading logged queries is easier when they're formatted.)

q0 = q0[:-3] + ' AS score FROM bookmarks WHERE score > 0 ORDER BY score DESC)'
query = ('SELECT id, url, metadata, tags, desc, flags\nFROM (SELECT *, (' +
'\n + '.join(map(_count, clauses)) +
f') AS score\n FROM bookmarks WHERE score > 0 ORDER BY score DESC, {_order})')
Copy link
Collaborator Author

@LeXofLeviafan LeXofLeviafan Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that since score is the first clause of ORDER BY, the order value only affects search results that have the same "score".

query = ('SELECT id, url, metadata, tags, desc, flags FROM bookmarks WHERE (' +
f' {search_operator} '.join("tags LIKE '%' || ? || '%'" for tag in qargs) +
')' + ('' if not excluded_tags else ' AND tags NOT REGEXP ?') +
f' ORDER BY {_order}')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than replacing the ORDER BY clause, the changes here don't really affect the produced SQL.

all_keywords: bool = False,
deep: bool = False,
regex: bool = False,
stag: Optional[List[str]] = None,
Copy link
Collaborator Author

@LeXofLeviafan LeXofLeviafan Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding default values allows using these as keyword parameters.

(…And yes, stag is not actually a string here from what I could tell – the code even uses .join() to convert it into a string 😅)

"""Search bookmarks for entries with keywords and specified
criteria while filtering out entries with matching tags.

Parameters
----------
keywords : list of str
Keywords to search.
without : list of str
Keywords to exclude; ignored if empty. Default is None.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this parameter simplifies search invocation


count = 0
out = ''
if export_type == 'markdown':
for row in resultset:
out += '- [' + title(row, 'Untitled') + '](' + row.url + ')'
_title = title(row)
out += (f'- <{row.url}>' if not _title else f'- [{_title}]({row.url})')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we can handle <raw URLs> as input, it seemed like a good idea to output them as well – thus allowing to reimport the same exact data that was exported (…sans descriptions, I suppose)

if row.tags:
out += ' TAGS="' + html.escape(row.tags).encode('ascii', 'xmlcharrefreplace').decode('utf-8') + '"'
out += '>\n<title>{}</title>\n</bookmark>\n'\
out += '>\n <title>{}</title>\n </bookmark>\n'\
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the same formatting as in the HTML export (…mostly for sake of appearances)

num : int
Number of results to show per page. Default is 10.
"""

if not isinstance(obj, BukuDb):
LOGERR('Not a BukuDb instance')
return
bdb = obj
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is just less confusing

@@ -4623,28 +4694,31 @@ def prompt(obj, results, noninteractive=False, deep=False, listtags=False, sugge

# search ANY match with new keywords
if nav.startswith('s '):
results = obj.searchdb(nav[2:].split(), False, deep)
keywords = (nav[2:].split() if not markers else split_by_marker(nav[2:]))
results = bdb.searchdb(keywords, deep=deep, markers=markers, order=order)
Copy link
Collaborator Author

@LeXofLeviafan LeXofLeviafan Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Search behaviour only changes after enabling search-with-markers

else:
print('Invalid input')
ids and bdb.print_rec(ids, order=order)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows to sort the printed records according to selected order

@@ -6010,6 +6103,8 @@ POSITIONAL ARGUMENTS:
elif args.unlock is not None:
BukuCrypt.decrypt_file(args.unlock)

order = [s for ss in (args.order or []) for s in re.split(r'\s*,\s*', ss.strip()) if s]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaces and/or commas can be used to separate fields in the order description

search_opted = True
tags_search = bool(args.stag is not None and len(args.stag))
exclude_results = bool(args.exclude is not None and len(args.exclude))
search_results, search_opted = None, True
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"not None and not empty" is how truthiness is defined for lists anyway

else:
LOGERR('no keyword')
search_results = bdb.search_keywords_and_filter_by_tags(
args.sany, deep=args.deep, stag=args.stag, markers=args.markers, without=args.exclude, order=order)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…Yes, all of that got replaced by a single call (which still does the same)

@@ -6273,7 +6299,7 @@ POSITIONAL ARGUMENTS:

if args.json is None and not args.format:
num = 10 if not args.count else args.count
prompt(bdb, search_results, oneshot, args.deep, num=num)
prompt(bdb, search_results, noninteractive=oneshot, deep=args.deep, markers=args.markers, order=order, num=num)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using keyword params explicitly

@media (min-width: 1200px) {
.filters .filter-op {width: 280px !important}
.filters .filter-val {width: 595px !important}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, flask-admin sets style="width: …" on all dropdowns in the filters (matching whatever the current width happens to be); thus the need for !important, and for reiterating the default width.

adder.onclick = () => {
try {
let key = bukuFilters().first().val() || 'buku_search_markers_match_all';
setTimeout(() => sync(key));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a buku filter is added, it's automatically matched to already existing ones (defaulting to the mode used in navbar search).

let _value = filterInput(this).val();
$(this).val(evt.val).triggerHandler('change', '$norecur$');
filterInput(this).val(_value); // retaining the last value for other filters
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When one of the buku filter modes is changed, all others get updated as well (and their keywords are restored)

{% endblock %}

{% block menu_links %}
{{ super() }}
<form class="navbar-form navbar-right" action="{{ url_for('bookmark.index_view') }}" method="GET">
<form class="navbar-form navbar-right" action="{{ url_for('admin.search') }}" method="POST">
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Search splitting is done in the admin.search handler, thus replacing the action= value.

@@ -4,14 +4,16 @@
{% block head %}
{{ super() }}
{{ buku.close_if_popup() }}
{{ buku.focus('form[action="/"]') }}
{{ buku.focus('main form[action="/"]') }}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both forms have the same action now, thus specifying which one we want focused on page load.

this.title = this.title.replace(/'(.*?)'/g, `'<strong><tt>$1</tt></strong>'`).replace(/\bFULL\b/g, `<strong><em>full</em></strong>`);
}).attr('data-container', 'body').attr('data-trigger', 'hover').tooltip();
</script>
<style>.tooltip-inner {text-align: left; white-space: pre; max-width: 600px}</style>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabling (and prettifying) Bootstrap tooltips

deep, all_keywords = (x and not regex for x in [form.deep.data, form.all_keywords.data])
flt = bs_filters.BookmarkBukuFilter(deep=deep, regex=regex, markers=markers, all_keywords=all_keywords)
vals = ([('', form.keyword.data)] if not markers else enumerate(buku.split_by_marker(form.keyword.data)))
url = url_for('bookmark.index_view', **{filter_key(flt, idx): val for idx, val in vals})
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old behaviour without markers, otherwise splitting into multiple keywords based on detected markers.

(…Also I removed combinations of regex with deep/all_keywords since those do nothing 😅)

(1, "http://example.org", None, ",", "", 0),
(2, "http://google.com", "Google", ",", "", 0),
(2, "http://example.org", None, ",bar,baz,foo,", "", 0),
(3, "http://google.com", "Google", ",bar,baz,foo,", "", 0),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding tags to test their export as well. (…Also fixing indices since these wouldn't be encountered in actual data)

assert split_by_marker(search_string) == [
' global substring', '.title substring', ':url substring', ':https',
'> description substring', '#partial,tags:', '#,exact,tags,', '*another global substring ',
]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split condition is “one or multiple spaces followed by a marker” (with the marker being retained)

assert SortKey('foo', ascending=False) < SortKey('bar', ascending=False)
assert not SortKey('foo', ascending=False) < SortKey('foo', ascending=False) # pylint: disable=unnecessary-negation
assert not SortKey('foo', ascending=False) > SortKey('foo', ascending=False) # pylint: disable=unnecessary-negation
assert not SortKey('foo', ascending=False) > SortKey('bar', ascending=False) # pylint: disable=unnecessary-negation
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing </> specifically here

assert not SortKey('foo', ascending=False) > SortKey('bar', ascending=False) # pylint: disable=unnecessary-negation

custom_order = lambda s: (SortKey(len(s), ascending=False), SortKey(s, ascending=True))
assert sorted(['foo', 'bar', 'baz', 'quux'], key=custom_order) == ['quux', 'bar', 'baz', 'foo']
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Sort longer strings first, ordering lexicographically when they have the same length”

" OR (tags LIKE ('%' || ? || '%') AND tags LIKE ('%' || ? || '%') AND tags LIKE ('%' || ? || '%'))"),
])
def test_search_clause(bukuDb, regex, tokens, args, clauses):
assert bukuDb()._search_clause(tokens, regex=regex) == (clauses, args)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoroughly testing search SQL generation

@jarun jarun merged commit e757c8e into jarun:master Sep 7, 2024
1 check passed
@jarun
Copy link
Owner

jarun commented Sep 7, 2024

Thank you!

@jarun
Copy link
Owner

jarun commented Sep 7, 2024

Please update the ToDO list tracker.

Also, is the --marker option added to the auto-completion scripts?

@LeXofLeviafan
Copy link
Collaborator Author

Also, is the --markers option added to the auto-completion scripts?

Ah, no; forgot about that.

…Say; are the completion lists meant to be sorted? I think the last few options that were added to them are misplaced if that's the case 😅

@LeXofLeviafan LeXofLeviafan deleted the search-with-markers branch September 7, 2024 06:48
@LeXofLeviafan LeXofLeviafan mentioned this pull request Sep 29, 2024
99 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add search option to include and exclude columns
2 participants