indexing and search on nested properties? #100

dtchang · 2014-07-17T23:27:06Z

Can one index and search on nested properties, e.g.,
{ "employees": [{ "name": "...",
"address": { "city": "...",
"street": "..."
}
}]
}

If so, what is the path expression to be used in defining 'field'? I tried something like:
this.field('employees.address.city');
or
this.field('employees[].address.city');
but none worked.

dtchang · 2014-08-01T03:42:02Z

It has been two weeks. Can someone knowledgeable about lunr respond? Thanks.

olivernn · 2014-08-11T19:28:43Z

It is currently not possible and it isn't something that will be solved by lunr.

A solution would be for your code to map between whatever structure your documents/data is in into a flat object for indexing by lunr.

For example, say you have your employee object that looks like this:

{
    "id": 123,
    "name": "bob",
    "address": {
        "street": "some street",
        "city": "some city"
    }
}

You could set up your index like so:

idx = lunr(function () {
    this.field('name')
    this.field('address_street')
    this.field('address_city')
})

And setup a mapping function:

var employeeMapper = function (employee) {
    return {
        "id": employee.id,
        "name": employee.name,
        "address_street": employee.address.street,
        "address_city": employee.address.city
    }
}

Now lets say you have a collection of employees that you need to index, you can make use of your map function:

employees
    .map(employeeMapper)
    .forEach(function (doc) { idx.add(doc) })

The reason for this not being in lunr is that is easy to solve outside lunr and isn't really relevant to searching/indexing the documents.

matthiasg · 2018-04-26T07:11:14Z

very sad to see this not inside lunr.js itself, since it does increase the amount of ram and cpu required and increases pressure on the garbage collector by having copies around (less when text fields are large but still very noticeable in my scenarios)

kkirby · 2018-05-31T16:55:10Z

@matthiasg Late reply since I'm looking into this as well. But something that I'd like to point out a is that if Lunr were to do it for you it would use more CPU and RAM than if you did it for Lunr.

matthiasg · 2018-06-01T08:56:48Z

@kkirby no worries for the delayed response. Sorry but as for your comment, I do not follow yet.

When my code has to do it I have to build a new shallow copy object of all fields to index:

original:

{
  id: "myid",
 data: { 
   name:  "myname"
   myotherinfo: "myotherinfo"
 }
}

new document:

{
   id: "myid",
   "data.name":  "myname"
   "data.myotherinfo": "myotherinfo"
}

Now admittedly this data is just a shallow copy but it is a new document which requires garbage collection.

The way I would see it implemented in an indexing engine (we do this currently ourselves with some limitations) at time of indexing we resolve the fields into a index (shown simplified below). E.g

data_name_index: {
  "myname": ["myid"]
}

data_myotherinfo_index: {
  "myotherinfo: ["myid"]
}

It is similar in operation to how elasticsearch flattens documents and there is minimal cpu overhead (if any) at indexing time and no GC overhead. The right index is again selected at query time of course.

I might need to look at your code to see whether I am missing something though.

Thanks for your feedback.

kkirby · 2018-06-01T16:19:28Z

@matthiasg That's a good point. I suppose it all depends on how Lunr would end up storing the data. My thought process was that it would end up taking Lunr longer to detect and parse out the proper indexes based on your configuration vs having a flat tree of data. Though, I see what you mean about creating copies of objects and how that would trigger a GC, taking up CPU time.

olivernn · 2018-06-02T20:05:10Z

I think its a fair point that there is an overhead in creating the documents in the format required for Lunr. How much of an overhead is difficult to know without some benchmarks/traces etc. My guess is it would be insignificant compared to the amount of work in actually building the index though.

@matthiasg If Lunr did support this, what would you propose the interface would look like? The simple approach would probably be to pass some kind of extractor function when defining the field, perhaps something like this:

this.field('name', {
  extractor: function (doc) {
    return doc.data.name
  }
})

An alternative might be something like JSONPath, though I think that would add a bunch of complexity that probably isn't worth it.

This would only require a small amount of extra state at build time, the index wouldn't need to change I don't think. The changes would be isolated to lunr.Builder.

matthiasg · 2018-06-04T09:22:38Z

an extractor function would allow for some very customized behavior of course.

I would have simply made it default to follow the nested value
e.g this.field('data.name') assuming that data.name is a the propertyName/Path as it means to follow the lookup as in

const nestedValueFromObject = nestedPropertyName.split('.').reduce((object, propertyName) => object[propertyName], object)

The indexed field name would simply be data.name including the .

EDIT: Of course the extractor function should also get the name of the field as a parameter. That way a generic function would be more easily implemented.

matthiasg · 2018-06-04T09:26:30Z

@olivernn I agree as for the overhead, but we have developed a specific software platform were entire data large trees can be taken offline with a minimum of thousands of objects and potentially tens of thousands locally.

The GC pressure for us is already pretty high in load/import scenarios and thus we try to always use algorithms that don't accrue more load if not necessary and if close wrt implementation overhead.

olivernn · 2018-06-04T12:39:04Z

I would have simply made it default to follow the nested value

Thats basically JSONPath then right? You would also need to handle arrays etc. I think this way pushes a bunch of complexity into Lunr that I don't really want to maintain.

EDIT: Of course the extractor function should also get the name of the field as a parameter. That way a generic function would be more easily implemented.

You could still make the extractor function generic without passing it the field name:

function dataExtractor = function (fieldName) {
  return function (doc) {
    return doc.data[fieldName]
  }
}

this.field('name', { extractor: dataExtractor('name') })
this.field('myotherinfo', { extractor: dataExtractor('myotherinfo') })

olivernn · 2018-06-11T19:56:48Z

@matthiasg unless it wasn't clear, I'm open to a pull requests that adds this feature 😉

I'm not 100% on the name extractor yet but I think passing an optional function that extracts the field from a document is a reasonable implementation.

olivernn · 2018-06-26T06:45:55Z

I went ahead and implemented the extractor function on fields. It is available in 2.3.0.

matthiasg · 2018-06-26T07:32:01Z

@olivernn great !

Though I would still have passed the field name, since I now have to create a lot of functions (one for each unique name at least) which again will increase resource usage :) and this time it is even more difficult to understand the actual GC impact, since code will be compiled and I am not sure how well browsers will collect the compiled code once it is not needed anymore (could not find good info on this yet).

I might still add this later in a PR when I get around to it.

olivernn · 2018-06-26T08:22:29Z

@matthiasg How many fields are you indexing?

increase resource usage

Of course a function instance will consume some resources, I think its probably an insignificant amount though. That is unless I'm missing something about your implementation?

As for the impact on the GC, the extractor functions will need to exist as long as the builder exists, so they should be all discarded when the builder goes out of scope.

Without data collected form profiling we're just speculating though, data wins here ;)

matthiasg · 2018-06-26T11:18:37Z

Data always wins :)

Since we have hundreds of forms (mostly user generated) we do have an unspecified number of form/fields.

But just once more, I have no trouble believing this is an non-issue resource-wise, but when we are faced internally with two ways to implement, one with short lived objects or closures and one without, we try to choose the latter.

Again. Thanks for this constructive discussion and even implementing it (I did not expect that) !

tuanluu-agilityio · 2018-08-14T04:03:54Z

extractor works fine

     this.field('country', {
        extractor: function (doc = {}) {
          return doc.address.country.name
        }
      })
      this.field('city', {
        extractor: function (doc = {}) {
          return doc.address.city_name
        }
      })
      this.field('commission', {
        extractor: function (doc = {}) {
        return doc.commission && doc.commission.value
      }})

Thanks @olivernn

tibabit · 2020-09-07T07:06:32Z

What about position and index metdata here? How is that expected to be computed and added into token metdata?

olivernn closed this as completed Aug 11, 2014

olivernn mentioned this issue Jun 24, 2015

Nested field support for index setup #164

Closed

olivernn reopened this Jun 11, 2018

olivernn closed this as completed Jun 26, 2018

diablodale mentioned this issue Dec 19, 2022

Custom content (e.g. front matter) added to Lunr search index just-the-docs/just-the-docs#1067

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

indexing and search on nested properties? #100

indexing and search on nested properties? #100

dtchang commented Jul 17, 2014

dtchang commented Aug 1, 2014

olivernn commented Aug 11, 2014

matthiasg commented Apr 26, 2018

kkirby commented May 31, 2018

matthiasg commented Jun 1, 2018 •

edited

Loading

kkirby commented Jun 1, 2018

olivernn commented Jun 2, 2018

matthiasg commented Jun 4, 2018 •

edited

Loading

matthiasg commented Jun 4, 2018

olivernn commented Jun 4, 2018

olivernn commented Jun 11, 2018

olivernn commented Jun 26, 2018

matthiasg commented Jun 26, 2018

olivernn commented Jun 26, 2018

matthiasg commented Jun 26, 2018

tuanluu-agilityio commented Aug 14, 2018

tibabit commented Sep 7, 2020

indexing and search on nested properties? #100

indexing and search on nested properties? #100

Comments

dtchang commented Jul 17, 2014

dtchang commented Aug 1, 2014

olivernn commented Aug 11, 2014

matthiasg commented Apr 26, 2018

kkirby commented May 31, 2018

matthiasg commented Jun 1, 2018 • edited Loading

kkirby commented Jun 1, 2018

olivernn commented Jun 2, 2018

matthiasg commented Jun 4, 2018 • edited Loading

matthiasg commented Jun 4, 2018

olivernn commented Jun 4, 2018

olivernn commented Jun 11, 2018

olivernn commented Jun 26, 2018

matthiasg commented Jun 26, 2018

olivernn commented Jun 26, 2018

matthiasg commented Jun 26, 2018

tuanluu-agilityio commented Aug 14, 2018

tibabit commented Sep 7, 2020

matthiasg commented Jun 1, 2018 •

edited

Loading

matthiasg commented Jun 4, 2018 •

edited

Loading