Issue to track v5.0.0 beta feedback #142

lucaong · 2022-03-11T09:35:24Z

This issue is for collecting feedback and possible issues during the beta testing of MiniSearch v5.0.0, which includes an improved scoring algorithm.

The text was updated successfully, but these errors were encountered:

lucaong · 2022-04-05T09:46:09Z

Here is one case that might need tweaking before releasing v5.0.0: when using both prefix and fuzzy match, if a term matches one document with prefix search and another with fuzzy search, all else being equal (including the edit distance, field length, etc.) by default the prefix match should probably score slightly higher than the fuzzy match, because it tends to be more relevant.

Here is a spec reproducing the issue:

it('ranks prefix matches higher than fuzzy matches with the same distance, all else being equal', () => {
  const ms = new MiniSearch({ fields: ['text'] })
  const documents = [
    { id: 1, text: 'unicorns' },
    { id: 2, text: 'unikorn' }
  ]
  ms.addAll(documents)
  expect(ms.documentCount).toEqual(documents.length)

  const results = ms.search('unicorn', { fuzzy: 0.2, prefix: true })
  expect(results.map(({ id }) => id)).toEqual([1, 2])
})

Also, this is mentioned by at least one user in this issue report (see the comment about order of results in the first message).

/cc @rolftimmermans

lucaong · 2022-04-05T10:00:10Z

The issue above is addressed in #146 by slightly tuning the default weights (in particular, slightly lowering the weight for fuzzy matches to pass all existing test cases plus the new one), but would benefit from more testing with different collections of documents.

lucaong · 2022-04-05T14:02:50Z

Decision about the issue above was to keep the weights unchanged, the relevant discussion is in #146

daKmoR · 2022-04-16T16:15:23Z

5.x might be the right time to add an export map?

something like this?

  "exports": {
    ".": {
      "module": "./dist/es/index.js",
      "default": "./dist/umd/index.js"
    }
  }

https://nodejs.org/api/packages.html#package-entry-points

if you want more info or are interested but need help let me know 🤗 I could prepare a PR 👍

lucaong · 2022-04-16T20:41:54Z

@daKmoR that does sound very interesting. I think it should be relatively easy to do that for MiniSearch, as the public API is well defined. At the moment, the two things that are publicly exported are minisearch and minisearch/SearchableMap. The latter is probably not widely used, but it is provided as a public export for people who need the data structure used internally for the index, a Map searchable also by prefix or fuzzy match, which could also be of general utility.

PRs would be very welcome, especially when the functionality doesn’t break existing use cases, which seems to be the case here (apart from those two entry points there is not much else that can be imported by users).

joyously · 2022-05-05T23:23:29Z

Is this project using semvar?
I had used minisearch 3.2.0 for a site last year, and I just checked back here and found a 4.0 and now a 5.0, but no docs to show what changed. Are there breaking changes?

lucaong · 2022-05-06T09:10:07Z

Hi @joyously ,
Yes, MiniSearch follows semantic versioning and tracks changes in the CHANGELOG.md, indicating which breaking changes are introduced by major releases. Breaking changes are only introduced in major releases.

From version 3 to version 4 the breaking changes are only to the serialization format: if you don’t cache/load the index with loadJSON you should be able to upgrade with no issue. You can also upgrade to version 5: the scoring algorithm was improved, so the sorting of results will be different (hopefully better), but nothing should break.

joyously · 2022-05-06T16:12:47Z

Ah, of course, I didn't even see the change log. Perhaps you could reference it in the readme if that's the only place these important topics are mentioned.
I had gone to the releases and found nothing, the tags have no info, and the listed doc has nothing useful.

lucaong · 2022-05-06T17:42:41Z

Perhaps you could reference it in the readme if that's the only place these important topics are mentioned.

That makes sense, I will do so

lucaong · 2022-05-06T18:32:21Z

Done ✅

lucaong · 2022-05-31T09:41:09Z

Version v5.0.0-beta2 is now published on NPM. On top of v5.0.0-beta1 it adds an export map in the package.json as suggested by @daKmoR . This change should be completely backwards compatible, and just making things cleaner for users of recent Node.JS versions using import vs require.

If all is well, I plan to release v5.0.0 soon, with no further additions from the beta. Working on improving the auto suggestions would be good, but can go in a later release, as there is no clear action plan at the moment.

lucaong · 2022-06-14T09:48:18Z

In the end, following the discussion in #157 , the defaults for auto suggestions have been changed slightly in v5.0.0-beta3 (see #161 ). With this change, I would say all is ready for v5.0.0.

joyously · 2022-06-14T13:14:44Z

I haven't looked(not sure how), but I was wondering about memory usage.
If I load my documents, that's a big chunk. I then call miniSearch to create an index, which takes another big chunk. But this happens in a function so the reference to the documents keeps it in memory?
Is memory being reclaimed? Is there error handling for running out of memory?

lucaong · 2022-06-15T13:04:16Z

@joyously MiniSearch keeps the index in memory, so a huge index could in theory exhaust memory. I am not aware of way to recover from a out-of-memory error from within a library or app. This is not specific to MiniSearch, but to any library.

That said, a couple of considerations:

MiniSearch does its best to keep the index compact. In particular, it uses a radix tree, which is a kind of prefix tree. This means that common prefixes are only stored once. For example, if I had the terms “unicorn”, “university” and “universe” (a total of 25 characters), the index would use less memory than those three strings, because it stores only once the common prefix “uni” (that appears in all three terms) and “vers” (that appears after “uni” in two terms). It basically would store the strings “uni”, “corn”, “vers”, “e”, and “ty” (a total of 14 characters), plus the data pointed by the terms.
The original documents are not copied by MiniSearch, and not even referenced, apart from the configured storeFields. They will therefore be in memory only once, or not at all if the original reference is garbage collected.
As soon as the MiniSearch instance goes out of scope, the index memory can be reclaimed: MiniSearch does not use any global reference to its data.

In sum, MiniSearch keeps data in memory, but doesn’t duplicate it, and tries to keep its own data structures compact. That said, if the original collection of documents doesn’t fit in memory, the index will likely not fit either. I routinely use MiniSearch in the browser with collections of tens of thousands of small documents, or thousands of bigger ones. Use cases that cannot fit in memory will necessary need a search server.

joyously · 2022-06-15T13:19:55Z

I am not aware of way to recover from a out-of-memory error from within a library or app.

Wouldn't it work to bracket the addition to the index with a try/catch ?

The original documents are not copied by MiniSearch

So my code needs to have a separate function to load the documents and return the index, so that the documents go out of scope.

As soon as the MiniSearch instance goes out of scope, the index memory can be reclaimed

Yes, but it usually needs to be in scope for the life of a web page, so that part is not something to affect.

joyously · 2022-06-15T14:28:31Z

Perhaps the index object should have a property of the count of the documents in the index.
This could be used in an error handler outside the library, as a threshold to continue or not, or even displayed to the user for them to decide to continue.
But the library would need to handle the error in order to return something...

lucaong · 2022-06-15T16:24:25Z

@joyously the number of documents in the index is available via the documentCount function.

As for detecting if the app is about to run out of memory, there is a browser API for such feature, but unfortunately it is non-standard, and browser support is still sketchy.

Wouldn't it work to bracket the addition to the index with a try/catch ?

Unfortunately, try/catch will probably not work for out-of-memory error (which might crash the browser tab).

So my code needs to have a separate function to load the documents and return the index, so that the documents go out of scope.

Not necessarily. Garbage collection should be able to do its job. The only problem would be if you assign the documents or the index to global variables in a single-page application. But otherwise, memory can be reclaimed as soon as your code does not reference it anymore, with or without a function. A function is often a good way to make the documents go out of scope after the function returns, so if I understand correctly you are right.

lucaong · 2022-06-16T16:23:59Z

Version v5.0.0 is now published on NPM. Thank you everyone for the outstanding contributions and feedback!

joyously · 2022-09-06T18:06:19Z

Version v5.0.0 is now published on NPM.

You didn't make a git tag for this or a GitHub release. The 5.0.0 is referenced in the readme only, in one line for the script tag.

lucaong · 2022-09-22T15:44:23Z

Thanks @joyously , the release was on NPM but I forgot to push the git tag

lucaong mentioned this issue Mar 11, 2022

Issues with scoring #129

Closed

lucaong mentioned this issue Apr 5, 2022

Make prefix match score higher than fuzzy (all else being equal) #146

Closed

lucaong closed this as completed Jun 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue to track v5.0.0 beta feedback #142

Issue to track v5.0.0 beta feedback #142

lucaong commented Mar 11, 2022

lucaong commented Apr 5, 2022

lucaong commented Apr 5, 2022

lucaong commented Apr 5, 2022

daKmoR commented Apr 16, 2022

lucaong commented Apr 16, 2022

joyously commented May 5, 2022

lucaong commented May 6, 2022

joyously commented May 6, 2022

lucaong commented May 6, 2022

lucaong commented May 6, 2022

lucaong commented May 31, 2022 •

edited

Loading

lucaong commented Jun 14, 2022

joyously commented Jun 14, 2022

lucaong commented Jun 15, 2022 •

edited

Loading

joyously commented Jun 15, 2022

joyously commented Jun 15, 2022

lucaong commented Jun 15, 2022 •

edited

Loading

lucaong commented Jun 16, 2022

joyously commented Sep 6, 2022

lucaong commented Sep 22, 2022

Issue to track v5.0.0 beta feedback #142

Issue to track v5.0.0 beta feedback #142

Comments

lucaong commented Mar 11, 2022

lucaong commented Apr 5, 2022

lucaong commented Apr 5, 2022

lucaong commented Apr 5, 2022

daKmoR commented Apr 16, 2022

lucaong commented Apr 16, 2022

joyously commented May 5, 2022

lucaong commented May 6, 2022

joyously commented May 6, 2022

lucaong commented May 6, 2022

lucaong commented May 6, 2022

lucaong commented May 31, 2022 • edited Loading

lucaong commented Jun 14, 2022

joyously commented Jun 14, 2022

lucaong commented Jun 15, 2022 • edited Loading

joyously commented Jun 15, 2022

joyously commented Jun 15, 2022

lucaong commented Jun 15, 2022 • edited Loading

lucaong commented Jun 16, 2022

joyously commented Sep 6, 2022

lucaong commented Sep 22, 2022

lucaong commented May 31, 2022 •

edited

Loading

lucaong commented Jun 15, 2022 •

edited

Loading

lucaong commented Jun 15, 2022 •

edited

Loading