Implement schema conversion to returned docs and general improvements #69

ChillFish8 · 2022-01-22T20:59:01Z

Closes #67

This adds the ability for a user to explicitly state in the schema if a field is multi-value or not.
If the field isn't multi-value, the returned field will be unwrapped from its array wrapper.

kindlychung · 2022-01-23T10:25:25Z

lnx-engine/search-index/src/schema.rs

+pub static PRIMARY_KEY: &str = "_id";
+
+fn default_to_true() -> bool {
+    true


I'm curious what the purpose of this is.

This is a rather unfortunate result of serde not allowing default values directly and instead requiring a factory.

This is basically doing "if it's not given default to true" as hacky as it is. :(
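To make the "default to true" behaviour concrete, here is a minimal std-only sketch. With serde the factory would be attached to a field via `#[serde(default = "default_to_true")]`; the snippet emulates that with an `Option<bool>` so it runs without external crates. The `FieldOptions` struct, its `indexed` flag, and `field_options_from` are hypothetical names for illustration, not the PR's actual types.

```rust
/// Factory that stands in for a literal default, mirroring the helper in the diff.
fn default_to_true() -> bool {
    true
}

/// Hypothetical field options. With serde, `indexed` would carry
/// `#[serde(default = "default_to_true")]` (assumption: which flag uses the
/// helper is not shown in the diff).
#[derive(Debug, PartialEq)]
struct FieldOptions {
    indexed: bool,
}

/// Std-only stand-in for deserialization: `None` models a key that is absent
/// from the incoming payload, so the factory supplies the value.
fn field_options_from(indexed: Option<bool>) -> FieldOptions {
    FieldOptions {
        indexed: indexed.unwrap_or_else(default_to_true),
    }
}

fn main() {
    // Key omitted: falls back to the factory.
    println!("{:?}", field_options_from(None));
    // Key given explicitly: the given value wins.
    println!("{:?}", field_options_from(Some(false)));
}
```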

kindlychung · 2022-01-23T10:27:17Z

lnx-engine/search-index/src/schema.rs

+    /// These values need to either be a fast field (ints) or TEXT.
+    search_fields: Vec<String>,
+
+    /// A set of fields to boost by a given factor.


Could you comment a bit more on the boost factor? For example, what is the range?

Could probably mirror the docs from Tantivy's boost query to be honest, considering this is just a multiplier.

Ah, thanks for the clarification. Putting a link to Tantivy's doc page would be helpful.
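As a rough illustration of "just a multiplier", the sketch below scales a raw score by a per-field factor and treats fields without an entry as having a factor of 1.0. The `apply_boost` helper and the use of a plain `HashMap` are assumptions for illustration; they are not lnx's or Tantivy's API.

```rust
use std::collections::HashMap;

/// Multiplies a raw score by the boost configured for `field`.
/// Fields without a configured boost keep their score (factor 1.0).
fn apply_boost(field: &str, raw_score: f32, boosts: &HashMap<String, f32>) -> f32 {
    let factor = boosts.get(field).copied().unwrap_or(1.0);
    raw_score * factor
}

fn main() {
    let mut boosts = HashMap::new();
    boosts.insert("title".to_string(), 2.0_f32);

    // A boosted field has its score scaled by the factor.
    println!("title: {}", apply_boost("title", 0.5, &boosts));
    // An unlisted field is left untouched.
    println!("description: {}", apply_boost("description", 0.5, &boosts));
}
```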

kindlychung · 2022-01-23T10:28:24Z

lnx-engine/search-index/src/schema.rs

+    fn validate(&self) -> Result<()> {
+        if self.search_fields.is_empty() {
+            return Err(Error::msg(
+                "at least one indexed field must be given to search.",


Does it make sense to make search_fields optional and by default use all indexed fields for search?

That's a good idea! Shall add that.
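A minimal sketch of that suggestion, assuming a simplified schema where each field only records whether it is indexed. `FieldInfo` and `resolve_search_fields` are hypothetical names, and treating an empty list the same as an omitted one is a design choice of this sketch, not something the PR confirms.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified field description.
struct FieldInfo {
    indexed: bool,
}

/// Returns the explicitly given search fields, or every indexed field when
/// none were given (an empty list is treated like an omitted one here).
fn resolve_search_fields(
    explicit: Option<Vec<String>>,
    fields: &BTreeMap<String, FieldInfo>,
) -> Vec<String> {
    match explicit {
        Some(given) if !given.is_empty() => given,
        _ => fields
            .iter()
            .filter(|(_, info)| info.indexed)
            .map(|(name, _)| name.clone())
            .collect(),
    }
}

fn main() {
    let mut fields = BTreeMap::new();
    fields.insert("title".to_string(), FieldInfo { indexed: true });
    fields.insert("release_date".to_string(), FieldInfo { indexed: false });

    // No explicit list: only the indexed field is selected.
    println!("{:?}", resolve_search_fields(None, &fields));
}
```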

kindlychung · 2022-01-23T10:37:17Z

lnx-engine/search-index/src/schema.rs

+        let existing_fields_set: HashSet<&str> =
+            HashSet::from_iter(existing_fields.into_iter());
+
+        let union: Vec<&str> = defined_fields_set


Shouldn't this variable be called diff or something?

It should indeed, it's left over from testing 😓 It has been changed in my local copy.
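For reference, the naming matters because a set difference and a set union answer different questions. The sketch below uses `HashSet::difference` from the standard library; the direction (defined minus existing) and the `missing_fields` helper are assumptions, since the diff excerpt is cut off before the operation is visible.

```rust
use std::collections::HashSet;

/// Returns the fields present in `defined` but absent from `existing`,
/// sorted so the result is deterministic.
fn missing_fields<'a>(
    defined: &HashSet<&'a str>,
    existing: &HashSet<&'a str>,
) -> Vec<&'a str> {
    let mut diff: Vec<&'a str> = defined.difference(existing).copied().collect();
    diff.sort_unstable();
    diff
}

fn main() {
    let defined: HashSet<&str> = ["title", "release_date", "rating"].iter().copied().collect();
    let existing: HashSet<&str> = ["title", "release_date"].iter().copied().collect();

    // The difference holds only what `existing` lacks...
    println!("diff: {:?}", missing_fields(&defined, &existing));
    // ...whereas a union would hold every field from both sets.
    println!("union size: {}", defined.union(&existing).count());
}
```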

kindlychung

Not sure if it's appropriate to write an unsolicited review, so please just take these as generic comments from someone who doesn't know much about search engines.

kindlychung · 2022-01-23T10:42:11Z

lnx-engine/search-index/src/schema.rs

+        for (field, details) in self.fields.iter() {
+            if field == PRIMARY_KEY {
+                warn!(
+                    "{} is a reserved field name due to being a primary key",


Note that _id is the primary key identifier for MongoDB, too. This might cause pain if someone wants to build a bridge between Mongo and lnx. Maybe make it a bit more explicit, something like _lnx_id?

I think it's quite reasonable to do that, although probably want to make that change in a separate PR as this one's already pretty big.

ChillFish8 · 2022-01-23T14:47:06Z

Now that this PR is pretty much done I'll do a brief summary of the changes as there are quite a few:

Returned documents are now aligned with the given index schema: if a field is missing, it is added after the search with its default value, i.e. single-value fields return null and multi-value fields return []. This guarantees that returned documents always follow the defined schema.
You can now mark a field as required, which causes the system to reject documents that are missing it. This behaviour defaults to false, so all fields are optional (as long as one valid field is present), which makes the engine more forgiving by default.
Unknown document fields are skipped rather than rejected outright.
Uploading multiple values to a single-value field results in the last value in the given list being used, e.g. [1, 2, 3] results in just 3 being used.
The system now correctly returns an error when trying to sort by a multi-value field instead of panicking. See "Multi-value field sorting" (#70) for multi-value sorting itself.
Term queries can now provide separate boost factors to each target field.
Fields now only need to be defined as fast: bool rather than as a specific variant, because the cardinality is selected automatically depending on whether the field is multi-value or not.
Search fields automatically default to all indexed fields (text, string, and indexed fast fields). If you don't have any text/string fields, however, fuzzy searching will be disabled.
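
The alignment rules in the summary above can be sketched roughly as follows. This is a simplified, std-only illustration that folds the upload-time rules (required fields, last value wins, unknown fields skipped) and the return-time defaults (null or []) into one function; the `FieldSpec`, `Returned`, and `align` names are hypothetical and values are reduced to strings.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-field schema settings.
struct FieldSpec {
    multi: bool,
    required: bool,
}

/// Shape of a field in an aligned document.
#[derive(Debug, Clone, PartialEq)]
enum Returned {
    Null,
    Single(String),
    Multi(Vec<String>),
}

/// Aligns a raw document (every field carried as a list of values) to the schema.
fn align(
    schema: &BTreeMap<String, FieldSpec>,
    mut doc: BTreeMap<String, Vec<String>>,
) -> Result<BTreeMap<String, Returned>, String> {
    let mut out = BTreeMap::new();
    for (name, spec) in schema {
        // A missing field behaves like an empty list of values.
        let values = doc.remove(name).unwrap_or_default();
        if spec.required && values.is_empty() {
            return Err(format!("missing required field `{}`", name));
        }
        let value = if spec.multi {
            // Multi-value fields keep their list; missing ones become [].
            Returned::Multi(values)
        } else {
            // Single-value fields keep the last value; missing ones become null.
            match values.into_iter().last() {
                Some(last) => Returned::Single(last),
                None => Returned::Null,
            }
        };
        out.insert(name.clone(), value);
    }
    // Whatever is left in `doc` is unknown to the schema and is skipped.
    Ok(out)
}

fn main() {
    let mut schema = BTreeMap::new();
    schema.insert("release_date".to_string(), FieldSpec { multi: true, required: false });
    schema.insert("title".to_string(), FieldSpec { multi: false, required: false });

    let mut doc = BTreeMap::new();
    doc.insert("title".to_string(), vec!["first".to_string(), "second".to_string()]);
    doc.insert("unknown".to_string(), vec!["ignored".to_string()]);

    println!("{:?}", align(&schema, doc));
}
```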

ChillFish8 · 2022-01-23T14:55:52Z

New Schema:

{
  "override_if_exists": true,
  "index": {
    "name": "bench",
    "writer_buffer": 60000000,
    "writer_threads": 8,
    "reader_threads": 1,
    "max_concurrency": 12,
    "storage_type": "filesystem",
    "fields": {
      "release_date": {
        "type": "date",
        "stored": true,
        "indexed": false,
        "multi": true,
        "fast": true
      },
      "title": {
        "type": "text",
        "stored": true
      }
    },
    "boost_fields": {
      "title": 2.0
    },
    "use_fast_fuzzy": true,
    "strip_stop_words": true
  }
}

Results in this set of documents:

Being converted as results into:

{
  "status": 200,
  "data": {
    "hits": [
      {
        "doc": {
          "release_date": [],
          "title": "hello world 4!"
        },
        "document_id": "10994863538528176088",
        "score": 0.3646432
      },
      {
        "doc": {
          "release_date": [
            "2022-01-23T14:23:56+00:00"
          ],
          "title": "hello world 4!"
        },
        "document_id": "11592911168202451198",
        "score": 0.3646432
      }
    ],
    "count": 2,
    "time_taken": 0.0002273
  }
}

Harrison Burt added 2 commits January 22, 2022 20:32

add new schema logic

786c67c

adjust error message to cleanup code

dd4863f

ChillFish8 added the enhancement label Jan 22, 2022

ChillFish8 and others added 7 commits January 22, 2022 21:28

adjust error message to cleanup code

a5a73a7

Merge remote-tracking branch 'origin/schema-improvements' into schema-improvements

767a737

add the ability to specify required fields or optional fields

2a76d9e

add compute once fields

c9a1d5c

add computed fields and fix some unit tests

82e1c25

fix remaining unit tests relating to schema changes

c755d9c

deal with unittests fails later

9d6b5e4

kindlychung reviewed Jan 23, 2022

ChillFish8 and others added 16 commits January 23, 2022 11:40

fix unittests

24a4936

fix how fields are handled

0baf12a

cleanup some handling

032a11a

cleanup and reformat

7658511

remove unneeded clippy lint

bd890c3

add documents following schema

a14cc3c

reformat code

ba58e50

add unit tests

478ce2b

reformat code

07d6b85

make calculate a require function for Calculated trait

84d204e

remove redundant boost field for single field.

d8db722

remove debug print

f3567aa

remove now non-existent enum

ecdf248

remove unneeded test

efa5beb

improve function name

b8dc24c

cleanup unittests

4566ea9

ChillFish8 and others added 5 commits January 23, 2022 13:19

cleanup unittests

19e1b2b

update doc

2f70c8d

add protection to prevent people running into a panic when using multi-value fields.

480696d

cleanup code and add issue marker

94136f2

reformat code

dfa6352

kindlychung reviewed Jan 23, 2022

ChillFish8 and others added 2 commits January 23, 2022 14:32

fix how documents align themselves to schema

ae43ccd

add error if no fuzzy fields exist

d077719

Harrison Burt added 2 commits January 23, 2022 14:47

reformat

0796f3e

clean up code

8bae9c6

allow lint

5f22c00

ChillFish8 merged commit 01a61da into master Jan 23, 2022

ChillFish8 deleted the schema-improvements branch January 23, 2022 15:29
