update docs
lucaong committed Feb 11, 2019
1 parent 190bb19 commit 962aba1
Showing 4 changed files with 41 additions and 28 deletions.
2 changes: 1 addition & 1 deletion docs/index.json
@@ -3484,7 +3484,7 @@
"kind": "manual",
"longname": "/Users/luca/Code/minisearch/DESIGN_DOCUMENT.md",
"name": "DESIGN_DOCUMENT.md",
"content": "# Design Document\n\nThis design document has the aim to explain the details of `MiniSearch`\ndesign and implementation to library developers that intend to contribute to\nthis project, or that are simply curious about the internals.\n\n**Last update: Oct. 9, 2018**\n\n## Goals (and non-goals)\n\n`MiniSearch` is aimed at providing rich fulltext search functionalities in a\nlocal setup (e.g. client side, in the browser). It is therefore optimized for:\n\n 1. Small memory footprint of the index data structure\n 2. Fast indexing of documents\n 3. Versatile and performant search features, to the extent possible while\n meeting goals 1 and 2\n 4. Small and simple API surface, on top of which more specific solutions can\n be built by application developers\n 5. Possibility to add and remove documents from the index at any time\n\n`MiniSearch` is therefore NOT directly aimed at offering:\n\n - A solution for use cases requiring large index data structure size\n - Distributed setup where the index resides on multiple nodes and need to be\n kept in sync\n - Turn-key opinionated solutions (e.g. supporting multiple locales):\n `MiniSearch` _enables_ developer to build these on top of the core API, but\n does not provide it out of the box.\n\nFor these points listed as non-goals, other solutions exist that should be\npreferred to `MiniSearch`. Adapting `MiniSearch` to support those goals would in\nfact necessarily go against the primary project goals.\n\n\n## Technical design\n\n`MiniSearch` is composed of two layers:\n\n 1. A compact and versatile data structure for indexing terms, providing\n prefix search and fuzzy get\n 2. An API layer on top of this data structure, providing the search\n features\n\nHere follows a description of these two layers.\n\n### Index data structure\n\nThe data structure chosen for the index is a [radix\ntree](https://en.wikipedia.org/wiki/Radix_tree), which is a trie where nodes\nwith no siblings are merged with the parent node. 
The reason for choosing this\ndata structure follows from the project goals:\n\n - The radix tree minimizes the memory footprint of the index, because common\n prefixes are stored only once, and nodes are compressed into a single\n multi-character node whenever possible.\n - Radix trees offer fast key lookup, with performance proportional to the key\n length, and fast lookup of subtrees sharing the same key prefix. These\n properties make it possible to offer exact match and prefix search.\n - On top of a radix tree it is possible to implement lookup of keys that are\n within a certain maximum edit distance from a given key. This search rapidly\n becomes complex as the maximum distance grows, but for practical search\n use-cases the maximum distance is small enough for this algorithm to be\n performant. Other more performant solutions for fuzzy search require more\n space.\n\nThe class implementing the radix tree is called `SearchableMap`, because it\nimplements the standard JavaScript `Map` interface, adding on top of it more\nadvanced key lookup features:\n\n - `SearchableMap.prototype.atPrefix(prefix)`, returning another\n `SearchableMap` representing a mutable view of the original one, containing\n only entries where the keys share the given prefix.\n - `SearchableMap.prototype.fuzzyGet(searchKey, maxEditDistance)`, returning\n all the entries where the key is within the given edit (Levenshtein)\n distance from `searchKey`.\n\nAs a trade-off for offering these additional features, `SearchableMap` is\nrestricted to use only string keys.\n\nThe `SearchableMap` data type is part of the public API of `MiniSearch`, exposed\nas `MiniSearch.SearchableMap`. Its usefulness is in fact not limited to\nproviding a data structure for the inverted index, and developers can use it as\na building block for other solutions (e.g. 
autocompletion).\n\n### Fuzzy search algorithm\n\nThe algorithm used to provide fuzzy search of keys within a maximum [Levenshtein\ndistance](https://en.wikipedia.org/wiki/Levenshtein_distance) from a given term\nis the following:\n\n - The search starts with a budget of edit distance, initially equal to the\n given maximum distance.\n - The radix tree is traversed, starting from the root, visiting each path and\n propagating the remaining budget along each path, but quitting any search\n path along which the budget is exhausted.\n - For each visited node in the radix tree, the string contained in the node is\n traversed character by character using cursors that are kept on a stack.\n - Each cursor has: a pointer to a position in the node string; a pointer to a\n corresponding position in the search string; the type of the last edit,\n either `deletion`, or `insertion`, or `change`, or `none`; a budget of\n \"available edits\". This budget is decremented whenever an edit is required.\n The budget is passed from parent to children cursors.\n - The algorithm pulls cursors from the stack, and compares the pointed\n character in the node string with the pointed character in the search\n string:\n * if they are the same, one single child cursor is created, advancing both\n pointers of 1 position. No edit was necessary, so the last edit type is\n `none`.\n * if they are not the same, and the remaining budget is higher than zero, up\n to three children cursors are created: one corresponding to a character\n `change`, where both pointers are incremented by 1; one corresponding to a\n `deletion`, where only the search string pointer is incremented; one\n corresponding to an `insertion`, where only the node string pointer is\n incremented. Each of the children cursors have a budget that is one less\n the parent budget.\n * Some special cases are considered to avoid creating unnecessary cursors. 
A\n sequence of adjacent `deletion`-`insertion`, or `insertion`-`deletion`,\n would have the same effect of a change, but would consume more budget:\n therefore, a delete cursor is never created after a insertion cursor, and\n vice-versa. Similarily, adjacent `change`-`deletion` and\n `deletion`-`change`, or `change`-`insertion` and `insertion`-`change`, are\n equivalent. Therefore, only one of these cases is generated, by never\n producing a change cursor after a deletion or insertion one.\n - Whenever the algorithm finds a leaf node, it reports it as a result.\n\nNote that this algorithm can get complex if the maximum edit distance is large,\nas many paths would be followed. The reason why this algorithm is employed is a\ntrade-off:\n\n - for fulltext search purposes, the maximum edit distance is small, so the\n algorithm is performant enough\n - The alternatives (e.g. trigram indexes), would require much more space\n - As `MiniSearch` is optimized for local and possibly memory-constrained\n setup, higher computation complexity is traded in exchange for smaller space\n requirement for the index.\n\n### Search API layer\n\nThe search API layer offers a small and simple API surface for application\ndevelopers. It does not assume that a specific locale is used in the indexed\ndocuments, therefore no stemming nor stop-word filtering is performed, but\ninstead offers easy options for developers to provide their own implementation.\nThis heuristic will be followed in future development too: rather than providing\nan opinionated solution, the project will offer simple building blocks for\napplication developers to implement their own solutions.\n\nThe inverted index is implemented with `SearchableMap`, and posting lists are\nstored as values in the Map. This way, the same data structure provides both the\ninverted index and the set of indexed terms. Different document fields are\nindexed within the same index, to further save space. 
The index is therefore\nstructure as following:\n\n```\nterm -> field -> { document frequency, posting list }\n```\n\nWhen performing a search, the entries corresponding to the search term are\nlooked up in the index (optionally searching the index with prefix or fuzzy\nsearch), then the documents are scored with a variant of\n[Tf-Idf](https://en.wikipedia.org/wiki/Tf–idf), and finally results for\ndifferent search terms are merged with the given combinator function (by default\n`OR`, but `AND` can be specified).\n",
"content": "# Design Document\n\nThis design document aims to explain the details of the `MiniSearch`\ndesign and implementation to library developers who intend to contribute to\nthis project, or who are simply curious about the internals.\n\n**Last update: Feb. 11, 2019**\n\n## Goals (and non-goals)\n\n`MiniSearch` is aimed at providing rich full-text search functionalities in a\nlocal setup (e.g. client side, in the browser). It is therefore optimized for:\n\n 1. Small memory footprint of the index data structure\n 2. Fast indexing of documents\n 3. Versatile and performant search features, to the extent possible while\n meeting goals 1 and 2\n 4. Small and simple API surface, on top of which more specific solutions can\n be built by application developers\n 5. Possibility to add and remove documents from the index at any time\n\n`MiniSearch` is therefore NOT directly aimed at offering:\n\n - A solution for use cases requiring large index data structure size\n - Distributed setup where the index resides on multiple nodes and needs to be\n kept in sync\n - Turn-key opinionated solutions (e.g. supporting multiple locales):\n `MiniSearch` _enables_ developers to build these on top of the core API, but\n does not provide them out of the box.\n\nFor these points listed as non-goals, other solutions exist that should be\npreferred to `MiniSearch`. Adapting `MiniSearch` to support those goals would in\nfact necessarily go against the primary project goals.\n\n\n## Technical design\n\n`MiniSearch` is composed of two layers:\n\n 1. A compact and versatile data structure for indexing terms, providing\n prefix and fuzzy lookup\n 2. An API layer on top of this data structure, providing the search\n features\n\nHere follows a description of these two layers.\n\n### Index data structure\n\nThe data structure chosen for the index is a [radix\ntree](https://en.wikipedia.org/wiki/Radix_tree), which is a prefix tree where\nnodes with no siblings are merged with the parent node. 
The reason for choosing\nthis data structure follows from the project goals:\n\n - The radix tree minimizes the memory footprint of the index, because common\n prefixes are stored only once, and nodes are compressed into a single\n multi-character node whenever possible.\n - Radix trees offer fast key lookup, with performance proportional to the key\n length, and fast lookup of subtrees sharing the same key prefix. These\n properties make it possible to offer exact match and prefix search.\n - On top of a radix tree it is possible to implement lookup of keys that are\n within a certain maximum edit distance from a given key. This search rapidly\n becomes complex as the maximum distance grows, but for practical search\n use-cases the maximum distance is small enough for this algorithm to be\n performant. Other more performant solutions for fuzzy search would require\n more space (e.g. n-gram indexes).\n\nThe class implementing the radix tree is called `SearchableMap`, because it\nimplements the standard JavaScript [`Map`\ninterface](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map),\nadding on top of it some key lookup methods:\n\n - `SearchableMap.prototype.atPrefix(prefix)`, returning another\n `SearchableMap` representing a mutable view of the original one, containing\n only entries where the keys share the given prefix.\n - `SearchableMap.prototype.fuzzyGet(searchKey, maxEditDistance)`, returning\n all the entries where the key is within the given edit (Levenshtein)\n distance from `searchKey`.\n\nAs a trade-off for offering these additional features, `SearchableMap` is\nrestricted to use only string keys.\n\nThe `SearchableMap` data type is part of the public API of `MiniSearch`, exposed\nas `MiniSearch.SearchableMap`. Its usefulness is in fact not limited to\nproviding a data structure for the inverted index, and developers can use it as\na building block for other solutions. 
When modifying this class, one should\nthink about it in terms of a generic data structure that could in principle be\nreleased as a separate library.\n\n### Fuzzy search algorithm\n\nThe algorithm used to provide fuzzy search of keys within a maximum [Levenshtein\ndistance](https://en.wikipedia.org/wiki/Levenshtein_distance) from a given term\nis the following:\n\n - The search starts with a budget of edit distance, initially equal to the\n given maximum distance.\n - The radix tree is traversed, starting from the root, visiting each path and\n propagating the remaining budget along each path, but quitting any search\n path along which the budget is exhausted.\n - For each visited node in the radix tree, the string contained in the node is\n traversed character by character using cursors that are kept on a stack.\n - Each cursor has: a pointer to a position in the node string; a pointer to a\n corresponding position in the search string; the type of the last edit,\n either `deletion`, or `insertion`, or `change`, or `none`; a budget of\n \"available edits\". This budget is decremented whenever an edit is required.\n The budget is passed from parent to child cursors.\n - The algorithm pulls cursors from the stack, and compares the pointed\n character in the node string with the pointed character in the search\n string:\n * if they are the same, one single child cursor is created, advancing both\n pointers by 1 position. No edit was necessary, so the last edit type is\n `none`.\n * if they are not the same, and the remaining budget is greater than zero, up\n to three child cursors are created: one corresponding to a character\n `change`, where both pointers are incremented by 1; one corresponding to a\n `deletion`, where only the search string pointer is incremented; one\n corresponding to an `insertion`, where only the node string pointer is\n incremented. 
Each of the child cursors has a budget that is one less\n than the parent budget.\n * Some special cases are considered to avoid creating unnecessary cursors. A\n sequence of adjacent `deletion`-`insertion`, or `insertion`-`deletion`,\n would have the same effect as a change, but would consume more budget:\n therefore, a deletion cursor is never created after an insertion cursor, and\n vice versa. Similarly, adjacent `change`-`deletion` and\n `deletion`-`change`, or `change`-`insertion` and `insertion`-`change`, are\n equivalent. Therefore, only one of these cases is generated, by never\n producing a change cursor after a deletion or insertion one.\n - Whenever the algorithm finds a leaf node, it reports it as a result.\n\nNote that this algorithm can get complex if the maximum edit distance is large,\nas many paths would be followed. The reason why this algorithm is employed is a\ntrade-off:\n\n - For full-text search purposes, the maximum edit distance is small, so the\n algorithm is performant enough\n - The alternatives (e.g. trigram indexes) would require much more space\n - As `MiniSearch` is optimized for local and possibly memory-constrained\n setups, higher computational complexity is traded in exchange for smaller space\n requirement for the index.\n\n### Search API layer\n\nThe search API layer offers a small and simple API surface for application\ndevelopers. It does not assume that a specific locale is used in the indexed\ndocuments, therefore no stemming or stop-word filtering is performed; instead,\nit offers easy options for developers to provide their own implementation.\nThis heuristic will be followed in future development too: rather than providing\nan opinionated solution, the project will offer simple building blocks for\napplication developers to implement their own solutions.\n\nThe inverted index is implemented with `SearchableMap`, and posting lists are\nstored as values in the Map. 
This way, the same data structure provides both the\ninverted index and the set of indexed terms. Different document fields are\nindexed within the same index, to further save space. The index is therefore\nstructured as follows:\n\n```\nterm -> field -> { document frequency, posting list }\n```\n\nWhen performing a search, the entries corresponding to the search term are\nlooked up in the index (optionally searching the index with prefix or fuzzy\nsearch), then the documents are scored with a variant of\n[Tf-Idf](https://en.wikipedia.org/wiki/Tf–idf), and finally results for\ndifferent search terms are merged with the given combinator function (by default\n`OR`, but `AND` can be specified).\n\nAs the document IDs necessarily occur many times in the posting lists, as a space\noptimization they are substituted by short generated IDs. An index of short ID\nto original ID is maintained alongside the search index, to reconstruct the\noriginal IDs. A similar optimization is applied to the field names.\n",
"static": true,
"access": "public"
}
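
The budget-based fuzzy lookup described in the design document above can be illustrated with a small sketch. This is not the `MiniSearch` implementation: it assumes a plain trie with one character per node rather than a radix tree, it omits the last-edit-type pruning (so equivalent edit sequences may be explored more than once), and the names `buildTrie` and `fuzzyKeys` are hypothetical. Only the idea of cursors kept on a stack, each carrying a remaining edit budget, is taken from the document.

```javascript
// Hypothetical helper: build a plain (uncompressed) trie from a list of keys.
function buildTrie(keys) {
  const root = { children: new Map(), key: null };
  for (const key of keys) {
    let node = root;
    for (const ch of key) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), key: null });
      }
      node = node.children.get(ch);
    }
    node.key = key; // marks the end of a stored key
  }
  return root;
}

// Hypothetical sketch: returns a Map of key -> edit distance for every stored
// key within maxDistance (Levenshtein) of the search string.
function fuzzyKeys(root, search, maxDistance) {
  const chars = Array.from(search);
  const results = new Map();
  // Each cursor holds a trie node, a position in the search string,
  // and the remaining budget of edits.
  const stack = [{ node: root, pos: 0, budget: maxDistance }];

  while (stack.length > 0) {
    const { node, pos, budget } = stack.pop();
    const remaining = chars.length - pos;

    // A stored key ends here: leftover search characters count as deletions.
    if (node.key !== null && remaining <= budget) {
      const distance = maxDistance - budget + remaining;
      const best = results.get(node.key);
      if (best === undefined || distance < best) results.set(node.key, distance);
    }

    // Deletion: skip one character of the search string.
    if (budget > 0 && pos < chars.length) {
      stack.push({ node, pos: pos + 1, budget: budget - 1 });
    }

    for (const [ch, child] of node.children) {
      if (pos < chars.length && chars[pos] === ch) {
        // Match: advance both pointers without spending budget.
        stack.push({ node: child, pos: pos + 1, budget });
      } else if (budget > 0 && pos < chars.length) {
        // Change: advance both pointers, spending one edit.
        stack.push({ node: child, pos: pos + 1, budget: budget - 1 });
      }
      if (budget > 0) {
        // Insertion: advance only along the trie, spending one edit.
        stack.push({ node: child, pos, budget: budget - 1 });
      }
    }
  }
  return results;
}

const trie = buildTrie(["motor", "motors", "rotor", "moth", "motion"]);
const withinOne = fuzzyKeys(trie, "motor", 1);
const withinTwo = fuzzyKeys(trie, "motor", 2);
```

Because every cursor that spends an edit has a smaller budget than its parent, paths are abandoned as soon as the budget is exhausted. The number of cursors still grows quickly with `maxDistance`, which matches the trade-off noted in the document: the approach is practical only because the maximum edit distance is small in full-text search.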