HSEARCH-1219 Changed memroy allocation model for distance collection and... #349

Merged
merged 1 commit into from Nov 27, 2012

Projects

None yet

4 participants

@nicolashelleringer
Member

... sort + fix in bottom queue comparaison in sort

@Sanne
Member
Sanne commented Nov 12, 2012

Hi Nicolas,
thanks for the fix and your patience. I almost merged it last week but the Map usage has thrown me a bit off balance.

The test looks good and the implementation looks correct from a functional point of view, but the Collector can be huge, you might remember we already changed this previously to use arrays instead. The problem is not only Map vs Arrays but boxing of each document id into Integer instances.

Do you think you could rethink the fix while keeping the implementation array-based?

If you prefer we can merge this for now - as it definitely is a step for correctness - and open an issue to improve on it later, but I suspect it might be easier to do that by starting from the array based code. WDYT?

@nicolashelleringer
Member

The problem is I cannot forsee how many slots I ll need as indices are going beyond hitscount as they are base reader by reader. So i cannot chose an array size that garantees it will be ok :s

@Sanne
Member
Sanne commented Nov 12, 2012

is the problem to predict the array size for a single segment, or the
overall? In the second case we could think of defining an array for each
segment. I'll have a look at how other Collectors do this, I don't remember
anymore how it's supposed to be done on Lucene 3: with 4 all constructs
operate on a single segment only.

On 12 November 2012 16:47, Nicolas Helleringer notifications@github.comwrote:

The problem is I cannot forsee how many slots I ll need as indices are
going beyond hitscount as they are base reader by reader. So i cannot chose
an array size that garantees it will be ok :s


Reply to this email directly or view it on GitHubhttps://github.com/hibernate/hibernate-search/pull/349#issuecomment-10292289.

@nicolashelleringer
Member

It is the overall as we have to store distance for records not in the same segment to compare.

@emmanuelbernard
Member

I wonder if a List<Object[]> or whatever the type is would solve the problem? Or at least mitigate the problem?

@nicolashelleringer
Member

I did revert back to int[maxResults] in DistanceCollector but i am not satisfied with it.

If you index high volume of data ( 8 millions PoIs in geonames for instance), you end up allocating 3 arrays of 8 millions doubles where you could have survived with maps of only the hitcounts of your search.

This seems to me not efficient : do I miss something ?

@nicolashelleringer
Member

I have found something.

In lucene-facet module they have coded a IntToDoubleMap on primitives types.

This way it would be storage efficient on both way Map struct instead of big arrays and primitive types instead ob auto unboxing objects.

How do you feel about it ?

facet module is contrib in 3.6.1 http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/util/collections/IntToDoubleMap.html and standard in 4.0 http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/util/collections/IntToDoubleMap.html

Nicolas Helleringer HSEARCH-1219 Changed memory allocation model for distance collection …
…and sort + fix in bottom queue comparaison in sort
2e140ce
@nicolashelleringer
Member

Any news / comment on integrating this plz ?

@hferentschik
Member

Any news / comment on integrating this plz ?

@Sanne looked into it. He is at the moment moving house hence the delay.

@nicolashelleringer
Member

ok thx =)

@Sanne
Member
Sanne commented Nov 27, 2012

just about to resume this. I had not seen your comment about
IntToDoubleMap: I'm very glad you have been looking for examples, as that's
what I was about to do.

Your previous comment seems very reasonable, I'm tempted to retire my
concern about the HashMap implementation and take that version (your first
impl) for now to solve the issue and move forward, then we can discuss
efficiency as a separate concern/issue. WDYT?

On a different page, I have recently been thinking if you really need to
store this. I have also been lurking in other Lucene Collectors and hadn't
found the IntToDoubleMap but had found examples in which the implementation
stores stuff in an array of a limited size, then discarding values it can't
keep around to eventually re-compute them if needed: some sort of an
internal cache.

I wouldn't like it to break your head too much on this as the approach in
Lucene4 would be totally different..

On 27 November 2012 16:26, Nicolas Helleringer notifications@github.comwrote:

ok thx =)


Reply to this email directly or view it on GitHubhttps://github.com/hibernate/hibernate-search/pull/349#issuecomment-10764552.

@hferentschik
Member

he he, speaking of the devel ;-)

@Sanne, I keep hearing you talking about how different things are with Lucene 4. Would be great to have a IRC session about this at some stage. maybe announced one so that everyone interested can join in

@nicolashelleringer
Member

I like the way it is now with the Lucene Collection, memory efficient and fast =)

The only drawback is the added dependency to the facet module even if has nothing to do with faceting ( event if the module will be core in 4.x ).

Either this one or the native Java is ok for me.

@Sanne
Member
Sanne commented Nov 27, 2012

@hferentschik Lucene 4 is very different and I've been reading some code, but I won't presume I can lecture anyone about it. Your best friend it so have a look into http://lucene.apache.org/core/4_0_0/MIGRATE.html

Conceptually a strong difference is that segments are exposed explicitly to extension points (like a Collator) and generally the API requires you to deal with a single segment only. So in this case for example you wouldn't have so much trouble as you can collect & cache all stuff for the current segment easily; aggregating is still a problem to solve, but that's performed in a different component as far as I understood.

@nicolashelleringer

I like the way it is now with the Lucene Collection, memory efficient and fast =)

By now you mean in Lucene 4 ?

@nicolashelleringer
Member

No, i ment as in my last commit ;)

@hferentschik
Member

Either this one or the native Java is ok for me.

not really dived more into this, but why not the native Java solution. Do we really want to add another dependency now. O the other hand, work for integrating native Lucene faceting has started now so I would expect that we depend on this very shortly anyways. It is just the question whether we want this for this release?

@Sanne
Member
Sanne commented Nov 27, 2012

No, i ment as in my last commit ;)

ha! would be good to warn when you push changes, or I'm not notified you changed the proposal.. Good thanks, having a new look.. (I hope you have the older solutions around?)

@Sanne
Member
Sanne commented Nov 27, 2012

It is just the question whether we want this for this release?

good point. Not a big fan of this either, but we promised geo-queries for this release and if it really is far superior I'm not going to veto it..

@hferentschik
Member

For me a suboptimal native solution would be ok for this release. It would feel somehow "nicer" to introduce the facet jar later on.

@Sanne
Member
Sanne commented Nov 27, 2012

I'm not strong about it. After all we already have tons of jars coming in through solr, and this is not a minor release. I wish I had my large database setup ready to run some tests with geolocation.. would take me a day or so.

@Sanne
Member
Sanne commented Nov 27, 2012

Last commit looks great! @hferentschik please withdraw your concerns or elaborate on why we shouldn't already start getting pople used to this dependency :)

@hferentschik
Member

Fair enough. Was just a minor concern on my behalf.

@Sanne
Member
Sanne commented Nov 27, 2012

cool!

@Sanne Sanne merged commit 2e140ce into hibernate:master Nov 27, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment