Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indexing and search on nested properties? #100

Closed
dtchang opened this issue Jul 17, 2014 · 17 comments
Closed

indexing and search on nested properties? #100

dtchang opened this issue Jul 17, 2014 · 17 comments

Comments

@dtchang
Copy link

dtchang commented Jul 17, 2014

Can one index and search on nested properties, e.g.,
{ "employees": [{ "name": "...",
"address": { "city": "...",
"street": "..."
}
}]
}

If so, what is the path expression to be used in defining 'field'? I tried something like:
this.field('employees.address.city');
or
this.field('employees[].address.city');
but none worked.

@dtchang
Copy link
Author

dtchang commented Aug 1, 2014

It has been two weeks. Can someone knowledgeable about lunr respond? Thanks.

@olivernn
Copy link
Owner

It is currently not possible and it isn't something that will be solved by lunr.

A solution would be for your code to map between whatever structure your documents/data is in into a flat object for indexing by lunr.

For example, say you have your employee object that looks like this:

{
    "id": 123,
    "name": "bob",
    "address": {
        "street": "some street",
        "city": "some city"
    }
}

You could set up your index like so:

idx = lunr(function () {
    this.field('name')
    this.field('address_street')
    this.field('address_city')
})

And setup a mapping function:

var employeeMapper = function (employee) {
    return {
        "id": employee.id,
        "name": employee.name,
        "address_street": employee.address.street,
        "address_city": employee.address.city
    }
}

Now lets say you have a collection of employees that you need to index, you can make use of your map function:

employees
    .map(employeeMapper)
    .forEach(function (doc) { idx.add(doc) })

The reason for this not being in lunr is that is easy to solve outside lunr and isn't really relevant to searching/indexing the documents.

@matthiasg
Copy link

very sad to see this not inside lunr.js itself, since it does increase the amount of ram and cpu required and increases pressure on the garbage collector by having copies around (less when text fields are large but still very noticeable in my scenarios)

@kkirby
Copy link

kkirby commented May 31, 2018

@matthiasg Late reply since I'm looking into this as well. But something that I'd like to point out a is that if Lunr were to do it for you it would use more CPU and RAM than if you did it for Lunr.

@matthiasg
Copy link

matthiasg commented Jun 1, 2018

@kkirby no worries for the delayed response. Sorry but as for your comment, I do not follow yet.

When my code has to do it I have to build a new shallow copy object of all fields to index:

original:

{
  id: "myid",
 data: { 
   name:  "myname"
   myotherinfo: "myotherinfo"
 }
}

new document:

{
   id: "myid",
   "data.name":  "myname"
   "data.myotherinfo": "myotherinfo"
}

Now admittedly this data is just a shallow copy but it is a new document which requires garbage collection.

The way I would see it implemented in an indexing engine (we do this currently ourselves with some limitations) at time of indexing we resolve the fields into a index (shown simplified below). E.g

data_name_index: {
  "myname": ["myid"]
}

data_myotherinfo_index: {
  "myotherinfo: ["myid"]
}

It is similar in operation to how elasticsearch flattens documents and there is minimal cpu overhead (if any) at indexing time and no GC overhead. The right index is again selected at query time of course.

I might need to look at your code to see whether I am missing something though.

Thanks for your feedback.

@kkirby
Copy link

kkirby commented Jun 1, 2018

@matthiasg That's a good point. I suppose it all depends on how Lunr would end up storing the data. My thought process was that it would end up taking Lunr longer to detect and parse out the proper indexes based on your configuration vs having a flat tree of data. Though, I see what you mean about creating copies of objects and how that would trigger a GC, taking up CPU time.

@olivernn
Copy link
Owner

olivernn commented Jun 2, 2018

I think its a fair point that there is an overhead in creating the documents in the format required for Lunr. How much of an overhead is difficult to know without some benchmarks/traces etc. My guess is it would be insignificant compared to the amount of work in actually building the index though.

@matthiasg If Lunr did support this, what would you propose the interface would look like? The simple approach would probably be to pass some kind of extractor function when defining the field, perhaps something like this:

this.field('name', {
  extractor: function (doc) {
    return doc.data.name
  }
})

An alternative might be something like JSONPath, though I think that would add a bunch of complexity that probably isn't worth it.

This would only require a small amount of extra state at build time, the index wouldn't need to change I don't think. The changes would be isolated to lunr.Builder.

@matthiasg
Copy link

matthiasg commented Jun 4, 2018

an extractor function would allow for some very customized behavior of course.

I would have simply made it default to follow the nested value
e.g this.field('data.name') assuming that data.name is a the propertyName/Path as it means to follow the lookup as in

const nestedValueFromObject = nestedPropertyName.split('.').reduce((object, propertyName) => object[propertyName], object)

The indexed field name would simply be data.name including the .

EDIT: Of course the extractor function should also get the name of the field as a parameter. That way a generic function would be more easily implemented.

@matthiasg
Copy link

@olivernn I agree as for the overhead, but we have developed a specific software platform were entire data large trees can be taken offline with a minimum of thousands of objects and potentially tens of thousands locally.

The GC pressure for us is already pretty high in load/import scenarios and thus we try to always use algorithms that don't accrue more load if not necessary and if close wrt implementation overhead.

@olivernn
Copy link
Owner

olivernn commented Jun 4, 2018

I would have simply made it default to follow the nested value

Thats basically JSONPath then right? You would also need to handle arrays etc. I think this way pushes a bunch of complexity into Lunr that I don't really want to maintain.

EDIT: Of course the extractor function should also get the name of the field as a parameter. That way a generic function would be more easily implemented.

You could still make the extractor function generic without passing it the field name:

function dataExtractor = function (fieldName) {
  return function (doc) {
    return doc.data[fieldName]
  }
}

this.field('name', { extractor: dataExtractor('name') })
this.field('myotherinfo', { extractor: dataExtractor('myotherinfo') })

@olivernn
Copy link
Owner

@matthiasg unless it wasn't clear, I'm open to a pull requests that adds this feature 😉

I'm not 100% on the name extractor yet but I think passing an optional function that extracts the field from a document is a reasonable implementation.

@olivernn olivernn reopened this Jun 11, 2018
@olivernn
Copy link
Owner

I went ahead and implemented the extractor function on fields. It is available in 2.3.0.

@matthiasg
Copy link

@olivernn great !

Though I would still have passed the field name, since I now have to create a lot of functions (one for each unique name at least) which again will increase resource usage :) and this time it is even more difficult to understand the actual GC impact, since code will be compiled and I am not sure how well browsers will collect the compiled code once it is not needed anymore (could not find good info on this yet).

I might still add this later in a PR when I get around to it.

@olivernn
Copy link
Owner

@matthiasg How many fields are you indexing?

increase resource usage

Of course a function instance will consume some resources, I think its probably an insignificant amount though. That is unless I'm missing something about your implementation?

As for the impact on the GC, the extractor functions will need to exist as long as the builder exists, so they should be all discarded when the builder goes out of scope.

Without data collected form profiling we're just speculating though, data wins here ;)

@matthiasg
Copy link

Data always wins :)

Since we have hundreds of forms (mostly user generated) we do have an unspecified number of form/fields.

But just once more, I have no trouble believing this is an non-issue resource-wise, but when we are faced internally with two ways to implement, one with short lived objects or closures and one without, we try to choose the latter.

Again. Thanks for this constructive discussion and even implementing it (I did not expect that) !

@tuanluu-agilityio
Copy link

extractor works fine

     this.field('country', {
        extractor: function (doc = {}) {
          return doc.address.country.name
        }
      })
      this.field('city', {
        extractor: function (doc = {}) {
          return doc.address.city_name
        }
      })
      this.field('commission', {
        extractor: function (doc = {}) {
        return doc.commission && doc.commission.value
      }})

Thanks @olivernn

@tibabit
Copy link

tibabit commented Sep 7, 2020

What about position and index metdata here? How is that expected to be computed and added into token metdata?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants