Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue to track v5.0.0 beta feedback #142

Closed
lucaong opened this issue Mar 11, 2022 · 20 comments
Closed

Issue to track v5.0.0 beta feedback #142

lucaong opened this issue Mar 11, 2022 · 20 comments

Comments

@lucaong
Copy link
Owner

lucaong commented Mar 11, 2022

This issue is for collecting feedback and possible issues during the beta testing of MiniSearch v5.0.0, which includes an improved scoring algorithm.

@lucaong
Copy link
Owner Author

lucaong commented Apr 5, 2022

Here is one case that might need tweaking before releasing v5.0.0: when using both prefix and fuzzy match, if a term matches one document with prefix search and another with fuzzy search, all else being equal (including the edit distance, field length, etc.) by default the prefix match should probably score slightly higher than the fuzzy match, because it tends to be more relevant.

Here is a spec reproducing the issue:

it('ranks prefix matches higher than fuzzy matches with the same distance, all else being equal', () => {
  const ms = new MiniSearch({ fields: ['text'] })
  const documents = [
    { id: 1, text: 'unicorns' },
    { id: 2, text: 'unikorn' }
  ]
  ms.addAll(documents)
  expect(ms.documentCount).toEqual(documents.length)

  const results = ms.search('unicorn', { fuzzy: 0.2, prefix: true })
  expect(results.map(({ id }) => id)).toEqual([1, 2])
})

Also, this is mentioned by at least one user in this issue report (see the comment about order of results in the first message).

/cc @rolftimmermans

@lucaong
Copy link
Owner Author

lucaong commented Apr 5, 2022

The issue above is addressed in #146 by slightly tuning the default weights (in particular, slightly lowering the weight for fuzzy matches to pass all existing test cases plus the new one), but would benefit from more testing with different collections of documents.

@lucaong
Copy link
Owner Author

lucaong commented Apr 5, 2022

Decision about the issue above was to keep the weights unchanged, the relevant discussion is in #146

@daKmoR
Copy link

daKmoR commented Apr 16, 2022

5.x might be the right time to add an export map?

something like this?

  "exports": {
    ".": {
      "module": "./dist/es/index.js",
      "default": "./dist/umd/index.js"
    }
  }

https://nodejs.org/api/packages.html#package-entry-points

if you want more info or are interested but need help let me know 🤗 I could prepare a PR 👍

@lucaong
Copy link
Owner Author

lucaong commented Apr 16, 2022

@daKmoR that does sound very interesting. I think it should be relatively easy to do that for MiniSearch, as the public API is well defined. At the moment, the two things that are publicly exported are minisearch and minisearch/SearchableMap. The latter is probably not widely used, but it is provided as a public export for people who need the data structure used internally for the index, a Map searchable also by prefix or fuzzy match, which could also be of general utility.

PRs would be very welcome, especially when the functionality doesn’t break existing use cases, which seems to be the case here (apart from those two entry points there is not much else that can be imported by users).

@joyously
Copy link

joyously commented May 5, 2022

Is this project using semvar?
I had used minisearch 3.2.0 for a site last year, and I just checked back here and found a 4.0 and now a 5.0, but no docs to show what changed. Are there breaking changes?

@lucaong
Copy link
Owner Author

lucaong commented May 6, 2022

Hi @joyously ,
Yes, MiniSearch follows semantic versioning and tracks changes in the CHANGELOG.md, indicating which breaking changes are introduced by major releases. Breaking changes are only introduced in major releases.

From version 3 to version 4 the breaking changes are only to the serialization format: if you don’t cache/load the index with loadJSON you should be able to upgrade with no issue. You can also upgrade to version 5: the scoring algorithm was improved, so the sorting of results will be different (hopefully better), but nothing should break.

@joyously
Copy link

joyously commented May 6, 2022

Ah, of course, I didn't even see the change log. Perhaps you could reference it in the readme if that's the only place these important topics are mentioned.
I had gone to the releases and found nothing, the tags have no info, and the listed doc has nothing useful.

@lucaong
Copy link
Owner Author

lucaong commented May 6, 2022

Perhaps you could reference it in the readme if that's the only place these important topics are mentioned.

That makes sense, I will do so

@lucaong
Copy link
Owner Author

lucaong commented May 6, 2022

Done ✅

@lucaong
Copy link
Owner Author

lucaong commented May 31, 2022

Version v5.0.0-beta2 is now published on NPM. On top of v5.0.0-beta1 it adds an export map in the package.json as suggested by @daKmoR . This change should be completely backwards compatible, and just making things cleaner for users of recent Node.JS versions using import vs require.

If all is well, I plan to release v5.0.0 soon, with no further additions from the beta. Working on improving the auto suggestions would be good, but can go in a later release, as there is no clear action plan at the moment.

@lucaong
Copy link
Owner Author

lucaong commented Jun 14, 2022

In the end, following the discussion in #157 , the defaults for auto suggestions have been changed slightly in v5.0.0-beta3 (see #161 ). With this change, I would say all is ready for v5.0.0.

@joyously
Copy link

I haven't looked(not sure how), but I was wondering about memory usage.
If I load my documents, that's a big chunk. I then call miniSearch to create an index, which takes another big chunk. But this happens in a function so the reference to the documents keeps it in memory?
Is memory being reclaimed? Is there error handling for running out of memory?

@lucaong
Copy link
Owner Author

lucaong commented Jun 15, 2022

@joyously MiniSearch keeps the index in memory, so a huge index could in theory exhaust memory. I am not aware of way to recover from a out-of-memory error from within a library or app. This is not specific to MiniSearch, but to any library.

That said, a couple of considerations:

  • MiniSearch does its best to keep the index compact. In particular, it uses a radix tree, which is a kind of prefix tree. This means that common prefixes are only stored once. For example, if I had the terms “unicorn”, “university” and “universe” (a total of 25 characters), the index would use less memory than those three strings, because it stores only once the common prefix “uni” (that appears in all three terms) and “vers” (that appears after “uni” in two terms). It basically would store the strings “uni”, “corn”, “vers”, “e”, and “ty” (a total of 14 characters), plus the data pointed by the terms.
  • The original documents are not copied by MiniSearch, and not even referenced, apart from the configured storeFields. They will therefore be in memory only once, or not at all if the original reference is garbage collected.
  • As soon as the MiniSearch instance goes out of scope, the index memory can be reclaimed: MiniSearch does not use any global reference to its data.

In sum, MiniSearch keeps data in memory, but doesn’t duplicate it, and tries to keep its own data structures compact. That said, if the original collection of documents doesn’t fit in memory, the index will likely not fit either. I routinely use MiniSearch in the browser with collections of tens of thousands of small documents, or thousands of bigger ones. Use cases that cannot fit in memory will necessary need a search server.

@joyously
Copy link

I am not aware of way to recover from a out-of-memory error from within a library or app.

Wouldn't it work to bracket the addition to the index with a try/catch ?

The original documents are not copied by MiniSearch

So my code needs to have a separate function to load the documents and return the index, so that the documents go out of scope.

As soon as the MiniSearch instance goes out of scope, the index memory can be reclaimed

Yes, but it usually needs to be in scope for the life of a web page, so that part is not something to affect.

@joyously
Copy link

Perhaps the index object should have a property of the count of the documents in the index.
This could be used in an error handler outside the library, as a threshold to continue or not, or even displayed to the user for them to decide to continue.
But the library would need to handle the error in order to return something...

@lucaong
Copy link
Owner Author

lucaong commented Jun 15, 2022

@joyously the number of documents in the index is available via the documentCount function.

As for detecting if the app is about to run out of memory, there is a browser API for such feature, but unfortunately it is non-standard, and browser support is still sketchy.

Wouldn't it work to bracket the addition to the index with a try/catch ?

Unfortunately, try/catch will probably not work for out-of-memory error (which might crash the browser tab).

So my code needs to have a separate function to load the documents and return the index, so that the documents go out of scope.

Not necessarily. Garbage collection should be able to do its job. The only problem would be if you assign the documents or the index to global variables in a single-page application. But otherwise, memory can be reclaimed as soon as your code does not reference it anymore, with or without a function. A function is often a good way to make the documents go out of scope after the function returns, so if I understand correctly you are right.

@lucaong
Copy link
Owner Author

lucaong commented Jun 16, 2022

Version v5.0.0 is now published on NPM. Thank you everyone for the outstanding contributions and feedback!

@lucaong lucaong closed this as completed Jun 16, 2022
@joyously
Copy link

joyously commented Sep 6, 2022

Version v5.0.0 is now published on NPM.

You didn't make a git tag for this or a GitHub release. The 5.0.0 is referenced in the readme only, in one line for the script tag.

@lucaong
Copy link
Owner Author

lucaong commented Sep 22, 2022

Thanks @joyously , the release was on NPM but I forgot to push the git tag

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants