Implement schema conversion to returned docs and general improvements #69

Merged
ChillFish8 merged 35 commits into master from schema-improvements
Jan 23, 2022

ChillFish8
Member

Closes #67

This adds the ability for a user to explicitly state in the schema if a field is multi-value or not.
If the field isn't multi-value, the returned field will be unwrapped from its array wrapper.

@ChillFish8 ChillFish8 added the enhancement New feature or request label Jan 22, 2022
pub static PRIMARY_KEY: &str = "_id";

fn default_to_true() -> bool {
    true

I'm curious what the purpose of this is.

Member Author

This is a rather unfortunate result of serde not allowing default values directly and instead requiring a factory.

This is basically doing "if it's not given, default to true", as hacky as it is. :(

/// These values need to either be a fast field (ints) or TEXT.
search_fields: Vec<String>,

/// A set of fields to boost by a given factor.

Could you comment a bit more on the boost factor? For example, what is the range?

Member Author

Could probably mirror the docs from Tantivy's boost query to be honest, considering this is just a multiplier.

Ah, thanks for the clarification. Putting a link to Tantivy's doc page would be helpful.
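
To make the "just a multiplier" point concrete, here is a minimal sketch of how a per-field boost factor could scale a raw relevance score. It is not lnx's or Tantivy's actual code: `boosted_score` and `example_boosts` are hypothetical names, and treating fields without an entry as having a neutral factor of 1.0 is an assumption.

```rust
use std::collections::HashMap;

/// Applies a per-field boost to a raw relevance score.
/// Assumption: fields without an entry use a neutral factor of 1.0.
fn boosted_score(boost_fields: &HashMap<String, f32>, field: &str, raw_score: f32) -> f32 {
    let factor = boost_fields.get(field).copied().unwrap_or(1.0);
    raw_score * factor
}

/// Builds a boost map mirroring the `"title": 2.0` entry in the example schema.
fn example_boosts() -> HashMap<String, f32> {
    let mut boosts = HashMap::new();
    boosts.insert("title".to_string(), 2.0);
    boosts
}

fn main() {
    let boosts = example_boosts();
    // A boosted field has its score multiplied by the configured factor.
    println!("title: {}", boosted_score(&boosts, "title", 0.5));
    // A field with no configured boost keeps its raw score.
    println!("release_date: {}", boosted_score(&boosts, "release_date", 0.5));
}
```

Under this reading, a factor above 1.0 raises a field's contribution to the ranking and a factor between 0 and 1 lowers it.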

fn validate(&self) -> Result<()> {
    if self.search_fields.is_empty() {
        return Err(Error::msg(
            "at least one indexed field must be given to search.",

Does it make sense to make search_fields optional and by default use all indexed fields for search?

Member Author

That's a good idea! Shall add that.
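
The suggested default could be sketched roughly as follows. This is an illustration only, not the code that was merged: `Field`, `resolve_search_fields`, and `sample_fields` are hypothetical, and the `indexed` values in the sample are chosen for the example.

```rust
/// Hypothetical, simplified view of a schema field.
struct Field {
    name: &'static str,
    indexed: bool,
}

/// Returns the explicitly configured search fields, or falls back to
/// every indexed field when none were given.
fn resolve_search_fields(explicit: Vec<String>, fields: &[Field]) -> Vec<String> {
    if !explicit.is_empty() {
        return explicit;
    }
    fields
        .iter()
        .filter(|f| f.indexed)
        .map(|f| f.name.to_string())
        .collect()
}

/// Illustrative field set; the `indexed` flags are assumptions for this sketch.
fn sample_fields() -> Vec<Field> {
    vec![
        Field { name: "title", indexed: true },
        Field { name: "release_date", indexed: false },
    ]
}

fn main() {
    // No explicit search fields: only the indexed field is selected.
    println!("{:?}", resolve_search_fields(Vec::new(), &sample_fields()));
}
```

With this fallback in place, validation no longer needs to reject an empty `search_fields` list unless the schema has no indexed fields at all.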

let existing_fields_set: HashSet<&str> =
    HashSet::from_iter(existing_fields.into_iter());

let union: Vec<&str> = defined_fields_set

Shouldn't this variable be called diff or something?

Member Author

It does indeed, left over from testing 😓 It has been changed in my local copy.
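
For reference, a set difference over two collections of field names could look like the sketch below. The diff context above is truncated, so this assumes the computation is a difference between the defined and existing fields; `missing_fields` is a hypothetical helper, not the function in the PR.

```rust
use std::collections::HashSet;

/// Returns the names present in `defined` but absent from `existing`,
/// sorted so the output order is stable.
fn missing_fields<'a>(defined: &[&'a str], existing: &[&'a str]) -> Vec<&'a str> {
    let defined_set: HashSet<&str> = defined.iter().copied().collect();
    let existing_set: HashSet<&str> = existing.iter().copied().collect();

    // `difference` yields the items of `defined_set` not found in `existing_set`.
    let mut diff: Vec<&str> = defined_set.difference(&existing_set).copied().collect();
    diff.sort_unstable();
    diff
}

fn main() {
    let diff = missing_fields(&["title", "release_date", "rating"], &["title"]);
    println!("{:?}", diff);
}
```

Naming the result `diff` matches what `HashSet::difference` computes, whereas `union` would imply the combined set of both inputs.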

@kindlychung kindlychung left a comment

Not sure if it's appropriate to write an unsolicited review; please just take these as generic comments from someone who doesn't know much about search engines.

for (field, details) in self.fields.iter() {
    if field == PRIMARY_KEY {
        warn!(
            "{} is a reserved field name due to being a primary key",

Note that _id is the primary key identifier for MongoDB, too. This might cause pain if someone wants to build a bridge between MongoDB and lnx. Maybe make it a bit more explicit, something like _lnx_id?

Member Author

I think it's quite reasonable to do that, although I'd probably want to make that change in a separate PR, as this one's already pretty big.

@ChillFish8
Member Author

Now that this PR is pretty much done I'll do a brief summary of the changes as there are quite a few:

  • Returned documents are now aligned with the given index schema: if a field is missing, it is added after the search with its default value, i.e. single-value fields return null and multi-value fields return []. This guarantees that returned documents always follow the defined schema.
  • You can now mark a field as required, which causes the system to reject documents that are missing it. This behaviour defaults to false, so all fields are optional (as long as one valid field is present), which makes the engine more forgiving by default.
  • Unknown document fields are skipped rather than causing the document to be rejected.
  • Uploading multiple values to a single-value field results in the last value in the given list being used, e.g. [1, 2, 3] results in just 3 being used.
  • The system now correctly returns an error when trying to sort by a multi-value field instead of panicking. See Multi-value field sorting (#70) for multi-value sorting itself.
  • Term queries can now provide a separate boost factor for each target field.
  • Fields now only need to be defined as fast: bool rather than as a specific fast-field variant, because the cardinality is selected automatically depending on whether the field is multi-value.
  • Search fields now default to all indexed fields (text, string, and indexed fast fields). However, if you don't have any text/string fields, fuzzy searching will be disabled.
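
The schema-alignment rules in the summary above can be sketched roughly as follows. This is a simplified illustration, not the merged implementation: `Value`, `FieldInfo`, `align_doc`, and `example_doc` are hypothetical, values are reduced to strings, and taking the last value when unwrapping a single-value field is an assumption carried over from the upload rule.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a returned field value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Single(String),
    Multi(Vec<String>),
}

/// Hypothetical per-field schema info; only the `multi` flag matters here.
struct FieldInfo {
    multi: bool,
}

/// Shapes a raw document (every field stored as a list) to match the schema.
fn align_doc(
    schema: &BTreeMap<String, FieldInfo>,
    raw: &BTreeMap<String, Vec<String>>,
) -> BTreeMap<String, Value> {
    let mut out = BTreeMap::new();
    // Iterating the schema means unknown fields in `raw` are simply skipped.
    for (name, info) in schema {
        let values = raw.get(name).cloned().unwrap_or_default();
        let value = if info.multi {
            // Missing multi-value fields become an empty list.
            Value::Multi(values)
        } else {
            // Single-value fields are unwrapped; missing ones become null.
            match values.into_iter().last() {
                Some(v) => Value::Single(v),
                None => Value::Null,
            }
        };
        out.insert(name.clone(), value);
    }
    out
}

/// Builds a document with a title, no release date, and one unknown field.
fn example_doc() -> BTreeMap<String, Value> {
    let mut schema = BTreeMap::new();
    schema.insert("release_date".to_string(), FieldInfo { multi: true });
    schema.insert("title".to_string(), FieldInfo { multi: false });

    let mut raw = BTreeMap::new();
    raw.insert("title".to_string(), vec!["hello world 4!".to_string()]);
    raw.insert("unknown".to_string(), vec!["ignored".to_string()]);

    align_doc(&schema, &raw)
}

fn main() {
    println!("{:?}", example_doc());
}
```

Driving the loop from the schema rather than from the document is what gives the guarantee that every returned document has exactly the schema's fields.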

@ChillFish8
Member Author

New schema:

{
  "override_if_exists": true,
  "index": {
    "name": "bench",
    "writer_buffer": 60000000,
    "writer_threads": 8,
    "reader_threads": 1,
    "max_concurrency": 12,
    "storage_type": "filesystem",
    "fields": {
      "release_date": {
        "type": "date",
        "stored": true,
        "indexed": false,
        "multi": true,
        "fast": true
      },
      "title": {
        "type": "text",
        "stored": true
      }
    },
    "boost_fields": {
      "title": 2.0
    },
    "use_fast_fuzzy": true,
    "strip_stop_words": true
  }
}

This results in the following set of documents:
[image]

being returned in search results as:

{
  "status": 200,
  "data": {
    "hits": [
      {
        "doc": {
          "release_date": [],
          "title": "hello world 4!"
        },
        "document_id": "10994863538528176088",
        "score": 0.3646432
      },
      {
        "doc": {
          "release_date": [
            "2022-01-23T14:23:56+00:00"
          ],
          "title": "hello world 4!"
        },
        "document_id": "11592911168202451198",
        "score": 0.3646432
      }
    ],
    "count": 2,
    "time_taken": 0.0002273
  }
}

@ChillFish8 ChillFish8 merged commit 01a61da into master Jan 23, 2022
@ChillFish8 ChillFish8 deleted the schema-improvements branch January 23, 2022 15:29
Successfully merging this pull request may close these issues.

Why is every doc field wrapped in an array in search results?
2 participants