
What triggers 'marked for removal'? #792

Closed
dbusch opened this issue Aug 26, 2013 · 16 comments

@dbusch commented Aug 26, 2013

I have several write-behind maps in which entries appear to be inadvertently marked for removal and are passed to my store via the 'deleteAll' call, when I'm expecting to see them passed to save instead. Watching both my logs and the admin console, I see a high percentage of entries marked for removal, but I never call remove or clear on these maps. Instead, I rely on the write-behind threads to flush entries out to my store based on the configured eviction and max-size policies. I'm curious what other mechanisms would mark these entries for removal if I am not doing it explicitly.

Here is an example configuration:

<map name="models.LowResolutionModel">
    <backup-count>0</backup-count>
    <read-backup-data>false</read-backup-data>
    <time-to-live-seconds>0</time-to-live-seconds>
    <max-idle-seconds>300</max-idle-seconds>
    <eviction-policy>LRU</eviction-policy>
    <max-size policy="map_size_per_jvm">5000</max-size>
    <map-store enabled="true">
        <class-name>models.LowResolutionModel$LowResolutionModelStore</class-name>
        <write-delay-seconds>5</write-delay-seconds>
    </map-store>
</map>  
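
For context, the store classes referenced above implement Hazelcast's MapStore contract. With write-behind I expect dirty entries to reach store/storeAll, and only explicitly removed entries to reach delete/deleteAll. A rough paraphrase of the callbacks in question (the interface name here is illustrative; the real one is com.hazelcast.core.MapStore, and the MapLoader methods are omitted):

import java.util.Collection;
import java.util.Map;

// Illustrative paraphrase of the write-behind callbacks; not the real interface name.
public interface ModelStoreCallbacks<K, V> {
    void store(K key, V value);         // single dirty entry flushed by write-behind (the path I expect)
    void storeAll(Map<K, V> map);       // batch flush of dirty entries (the path I expect)
    void delete(K key);                 // should only follow an explicit remove
    void deleteAll(Collection<K> keys); // batch form -- this is what I'm seeing instead of save
}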

I have not yet been able to reproduce this issue locally, and only sporadically in my two-node test environment. In production, however, it happens many times per minute and becomes progressively more frequent as I increase load (running between 6 and 12 nodes).

Any thoughts or help would be greatly appreciated!

@pveentjer (Member) commented Aug 26, 2013

Which Hazelcast version are you using?

@dbusch (Author) commented Aug 26, 2013

Currently using 2.6 on Java 7; we will soon be upgrading to 2.6.1.


@enesakar (Member) commented Aug 26, 2013

Hi Dan,

The records marked as removed may be the ones evicted from your cache. Looking at your configuration, I see you have enabled both max-size eviction and eviction due to max-idle-seconds.


@dbusch (Author) commented Aug 26, 2013

So entries evicted due to idleness will be marked for removal, while those evicted due to max-size will just be dropped from the in-memory cache?

Is it always the case that entries marked for removal will be passed to the deleteAll method of the configured map store?


@enesakar (Member) commented Aug 26, 2013

No, the evicted entries can be seen as "marked", but they should not be deleted from the map store because they should not be "dirty".
If you are seeing entries deleted from the map store without any remove/clear, then this is a bug.
But first we need to reproduce the problem.
Can you give us any clues about why you might see this behaviour only in production? High load? More nodes? More frequent eviction?

@dbusch (Author) commented Aug 26, 2013

I suspect it's load-related, but I have not yet been able to isolate the issue, and I am still trying to reproduce it in a local environment with a repeatable test case.

The basic call pattern is:

  1. Attempt to lock the entry designated by key K, waiting at most 1000 ms to acquire the lock.
  2. GET the entry for K.
  3. If the entry for K is null, create a new model M.
  4. Run some updates on M.
  5. PUT M into the cache with key K.
  6. Finally, attempt to unlock K.

The thread that executes this logic is a distributed task thread.
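
Roughly, each worker does something like this (the Model class and updateModel() here are placeholders, not our real types; the Hazelcast calls are the standard 2.x IMap API):

import java.io.Serializable;
import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Sketch of the lock/get/put/unlock pattern described above.
public class LockGetPutUnlockSketch {

    static class Model implements Serializable { long counter; }

    static void updateModel(Model m) { m.counter++; } // placeholder for the real update logic

    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(null); // null -> default/hazelcast.xml config
        IMap<String, Model> map = hz.getMap("models.LowResolutionModel");

        String key = "some-key";
        if (map.tryLock(key, 1000, TimeUnit.MILLISECONDS)) { // 1. lock, waiting at most 1000 ms
            try {
                Model model = map.get(key);                  // 2. get
                if (model == null) {
                    model = new Model();                     // 3. create a new model if absent
                }
                updateModel(model);                          // 4. run some updates
                map.put(key, model);                         // 5. put back into the cache
            } finally {
                map.unlock(key);                             // 6. unlock
            }
        }
    }
}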


@paolodedios commented Aug 27, 2013

@enesakar We are seeing this problem in production under high load. We are still trying to reproduce it at a smaller scale. Our workload churns through a lot of data in the cache, to the point where cache entries don't stay in the cache for very long and merely act as a small write buffer in front of our underlying data store.

Our current working theory is that under high and sustained load, the cache put rate churns our cache and causes it to constantly hit the configured max-size eviction policy. We think this triggers the eviction thread to run constantly, looking for entries to mark as no longer "active". Given our high cache churn rate, our cache.put() caller will always be marking entries as dirty. Our churn rate is high enough that most entries end up marked "dirty" while, at the same time, the eviction process starts marking anything it can as "not active" to relieve pressure. By the time the write-behind thread runs, entries get deleted because two separate threads have marked them dirty and no longer active. After scanning through the Hazelcast source code, this edge case seems plausible, and we're trying to simulate it in order to reproduce the problem.
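
Roughly the kind of load pattern in question (the map name matches our config; the key space and values here are just illustrative): sustained puts over far more keys than the configured 5000-entry max-size, so eviction runs constantly while entries keep being re-dirtied.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Illustrative churn loop: keeps the map well past its max-size so the eviction
// thread and the write-behind thread are both constantly busy.
public class ChurnSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(null); // picks up hazelcast.xml with the map config
        IMap<String, String> map = hz.getMap("models.LowResolutionModel");
        for (long i = 0; ; i++) {
            map.put("key-" + (i % 20000), "value-" + i); // ~4x the 5000-entry max-size
        }
    }
}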

As a simple test, we are going to try disabling the max-size policy and switching to TTL, to force the eviction thread to work on a timer instead. Please advise if there is anything else we can try.

I hope my explanation makes sense.

@mikehwang commented Aug 27, 2013

@enesakar I think I found the issue after looking through the maintenance-2.x branch. I think the removeTime on the records is not being set up correctly. Grepping through the code base, removeTime is set to either "0" (on initialization) or "now" (when markRemoved is called). We never explicitly mark records for removal in our code base, so the records never get a non-zero removeTime.

In CMap.java, there is a startCleanup call that buckets existing records and takes action (purge, evict, etc.). Our records are all getting bucketed to be purged because of the logic in:

boolean shouldPurgeRecord(Record record, long now) {
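    // removeTime stays 0 unless markRemoved is ever called, so for evicted-but-never-removed
    // records (now - 0) easily exceeds removeDelayMillis and this check always passes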
    return !record.isActive() && record.isRemovable()
            && ((now - record.getRemoveTime()) > removeDelayMillis);
}

Records with removeTime set to zero will always be purged as long as no other part of the system is using the record. We are now able to reproduce this locally with the following config:

<map name="models.aggregates.visitors.VisitInfoByHour">
    <backup-count>0</backup-count>
    <read-backup-data>false</read-backup-data>
    <time-to-live-seconds>5</time-to-live-seconds>
    <max-idle-seconds>30</max-idle-seconds>
    <eviction-policy>LRU</eviction-policy>
    <map-store enabled="true">
        <class-name>models.aggregates.visitors.VisitInfoByHour$VisitInfoByHourStore</class-name>
        <write-delay-seconds>5</write-delay-seconds>
    </map-store>
</map>

We changed the TTL and max-idle to be very short.

I tested with this fixed version, and my records are now getting evicted rather than purged:

boolean shouldPurgeRecord(Record record, long now) {
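    // the added getRemoveTime() > 0 guard skips records that were only evicted
    // (removeTime still 0) and were never explicitly removed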
    return !record.isActive() && record.isRemovable()
            && (record.getRemoveTime() > 0 && (now - record.getRemoveTime()) > removeDelayMillis);
}

What do you think?

@mdogan (Member) commented Aug 28, 2013

@paolodedios and @mikehwang, thanks for your findings and detailed information. It seems you're on the right track. We'll have a look at this very soon.

@mdogan (Member) commented Aug 28, 2013

@paolodedios;

Hazelcast 2 has a single operation thread (we call it the service thread), and marking records as active or not-active is done on this service thread. The cleanup thread doesn't mark a record as not-active; it only enqueues a task onto the service thread to evict/purge.

@mdogan (Member) commented Aug 28, 2013

@mikehwang;

Eviction and purge are consecutive processes. Eviction marks the records as "not-active", and purge removes not-active records from memory. The purge process never calls the map-store, so the current code seems right to me. If you disable purge, entries will just sit in memory as not-active.

@enesakar (Member) commented Aug 28, 2013

@mikehwang;

I tried to reproduce the problem: I put entries with your settings and checked whether any delete() was called on my MapStore, but I could not reproduce it.

Can you send the test code, with a mock MapStore, that reproduces the problem in your local environment?
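
For example, a minimal counting mock along these lines (class and field names are just illustrative) would make it obvious whether delete/deleteAll fire without any explicit remove()/clear():

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

import com.hazelcast.core.MapStore;

// Hypothetical mock store that just counts callbacks, to confirm whether
// delete/deleteAll are invoked without any explicit remove()/clear().
public class CountingMockStore implements MapStore<String, String> {

    private final Map<String, String> backend = new ConcurrentHashMap<String, String>();
    private final AtomicInteger storeCalls = new AtomicInteger();
    private final AtomicInteger deleteCalls = new AtomicInteger();

    public void store(String key, String value) {
        storeCalls.incrementAndGet();
        backend.put(key, value);
    }

    public void storeAll(Map<String, String> map) {
        storeCalls.addAndGet(map.size());
        backend.putAll(map);
    }

    public void delete(String key) {
        System.out.println("delete(" + key + ") called; total deletes=" + deleteCalls.incrementAndGet());
        backend.remove(key);
    }

    public void deleteAll(Collection<String> keys) {
        System.out.println("deleteAll(" + keys.size() + " keys) called; total deletes=" + deleteCalls.addAndGet(keys.size()));
        for (String key : keys) {
            backend.remove(key);
        }
    }

    public String load(String key) {
        return backend.get(key);
    }

    public Map<String, String> loadAll(Collection<String> keys) {
        Map<String, String> result = new HashMap<String, String>();
        for (String key : keys) {
            String value = backend.get(key);
            if (value != null) {
                result.put(key, value);
            }
        }
        return result;
    }

    public Set<String> loadAllKeys() {
        return null; // null disables eager pre-loading of keys
    }
}

Configuring a class like this as the map-store class-name and running the usual put workload against it should show the unexpected deleteAll calls, if they happen.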

@dbusch (Author) commented Aug 28, 2013

@enesakar;

We have not yet been able to reproduce this in an isolated local dev setting (i.e. a unit test), but we can now reproduce it more readily inside our application with an updated Hazelcast config. The config below triggers the issue much more quickly and easily inside our app. The differences are a non-zero TTL, a reduced max-idle-seconds, and no max-size. I believe these changes cause the eviction process to run more frequently in our application, and hence our issue occurs more frequently.

To add some more detail: these config options are used for about 40 maps in our application, alongside 20 queues (which have slightly different configs and do not encounter the issue we see with maps). There are several client threads on each node accessing each map to lock, mutate values, and put them back into the map, and we are using the partitioning abilities of Hazelcast to direct worker threads to the appropriate node.

<map name="models.LowResolutionModel">
    <backup-count>0</backup-count>
    <read-backup-data>false</read-backup-data>
    <time-to-live-seconds>5</time-to-live-seconds>
    <max-idle-seconds>30</max-idle-seconds>
    <eviction-policy>LRU</eviction-policy>
    <map-store enabled="true">
        <class-name>models.LowResolutionModel$LowResolutionModelStore</class-name>
        <write-delay-seconds>5</write-delay-seconds>
    </map-store>
</map> 

Hope this helps!

@mikehwang commented Aug 28, 2013

@mdogan Thanks for the feedback. I wanted some clarification: records that are purged will be explicitly removed from mapRecords, so won't they eventually be deleted via the map-store? If purge is not a remove, then why does it check removeTime? Maybe I'm just confused about the difference between evict, purge, and remove.

@paolodedios commented Aug 28, 2013

@mdogan What we seem to be noticing is that while the Hazelcast 2 service thread is marking entries as not-active, our application thread is marking entries as dirty via cache.put(). One of our theories is that the Hazelcast service thread and our application thread enter a race condition during periods of high load, when our cache is being churned at a high rate.

@mikehwang commented Aug 28, 2013

@mdogan I have observed the following timeline that is causing the map-store deletion:

  • The record gets bucketed to be evicted, so the async thread performs the operation.
  • The record gets bucketed as dirty.
  • The record gets markAsEvicted applied once the async thread has completed the operation.
  • The record gets deleted in runStoreUpdate, because the previous step made the dirty entry inactive.

There is a race condition here, as @paolodedios pointed out, since the eviction operation is not transactional.

Here are the raw log lines from a modified version of the Hazelcast jar that show the timeline in more detail:

2013-08-28 02:07:45,966 T=[hz._hzInstance_1_mhwang-machine.cached.thread-15] INFO ~ [192.168.15.89]:10000 [mhwang-machine] name=c:models.aggregates.visitors.TrafficSourceByDay tag=startCleanup(EVICT) Rec=[DefaultRecord: Record key=Data{partitionHash=-1557862892} size= 237, active=true, version=3, removable=false, evictable=true, valueCount= 1, isLocked= false, scheduled= 0, isDirty=false, isValid=false, create=Wed Aug 28 02:06:35 GMT 2013, exp=Sun Aug 17 07:12:55 GMT 292278994, update=Wed Aug 28 02:06:35 GMT 2013, remove=Thu Jan 01 00:00:00 GMT 1970, write=Wed Aug 28 02:06:40 GMT 2013] L=[CMap]

2013-08-28 02:08:08,024 T=[hz._hzInstance_1_mhwang-machine.scheduled.thread-1] INFO ~ [192.168.15.89]:10000 [mhwang-machine] name=c:models.aggregates.visitors.TrafficSourceByDay tag=startCleanup(DIRTY) Rec=[DefaultRecord: Record key=Data{partitionHash=-1557862892} size= 237, active=true, version=4, removable=false, evictable=true, valueCount= 1, isLocked= false, scheduled= 0, isDirty=true, isValid=true, create=Wed Aug 28 02:08:02 GMT 2013, exp=Sun Aug 17 07:12:55 GMT 292278994, update=Wed Aug 28 02:08:02 GMT 2013, remove=Thu Jan 01 00:00:00 GMT 1970, write=Wed Aug 28 02:08:07 GMT 2013] L=[CMap]

2013-08-28 02:08:09,086 T=[hz._hzInstance_1_mhwang-machine.ServiceThread] INFO ~ [192.168.15.89]:10000 [mhwang-machine] name=c:models.aggregates.visitors.TrafficSourceByDay tag=markAsEvicted Rec=[DefaultRecord: Record key=Data{partitionHash=-1557862892} size= 237, active=false, version=5, removable=true, evictable=true, valueCount= 0, isLocked= false, scheduled= 0, isDirty=false, isValid=false, create=Wed Aug 28 02:08:02 GMT 2013, exp=Sun Aug 17 07:12:55 GMT 292278994, update=Wed Aug 28 02:08:02 GMT 2013, remove=Thu Jan 01 00:00:00 GMT 1970, write=Wed Aug 28 02:08:07 GMT 2013] L=[CMap]

2013-08-28 02:08:11,447 T=[hz._hzInstance_1_mhwang-machine.cached.thread-23] INFO ~ [192.168.15.89]:10000 [mhwang-machine] name=c:models.aggregates.visitors.TrafficSourceByDay tag=runStoreUpdate Rec=[DefaultRecord: Record key=Data{partitionHash=-1557862892} size= 237, active=false, version=5, removable=true, evictable=true, valueCount= 0, isLocked= false, scheduled= 0, isDirty=false, isValid=false, create=Wed Aug 28 02:08:02 GMT 2013, exp=Sun Aug 17 07:12:55 GMT 292278994, update=Wed Aug 28 02:08:02 GMT 2013, remove=Thu Jan 01 00:00:00 GMT 1970, write=Wed Aug 28 02:08:07 GMT 2013] L=[CMap]

mdogan added a commit to mdogan/hazelcast that referenced this issue Aug 30, 2013

mdogan added a commit to mdogan/hazelcast that referenced this issue Aug 30, 2013

ghost assigned mdogan Sep 5, 2013

mdogan closed this Sep 5, 2013
