client near cache, race condition `File is already being used by this Hazelcast instance` #11648

Closed
Danny-Hazelcast opened this issue Oct 23, 2017 · 9 comments

Comments

Member

@Danny-Hazelcast Danny-Hazelcast commented Oct 23, 2017

https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-All/52/consoleFull

http://54.82.84.143/~jenkins/workspace/shutdown-All/3.9/2017_10_20-17_17_19/stable/destroy/create-use-dist-destroy

fail HzClient4HZ _create-use-dist-destroy_createUseDistDestroy_mapBak1HD hzcmd.map.CreateUse threadId=0 com.hazelcast.core.HazelcastException: Cannot acquire lock on /home/ec2-user/hz-root/HzClient4HZ/nearCache-mapBak1HDCreateUseDestroy_create-use-dist-destroy0.store.lock. File is already being used by this Hazelcast instance.

This issue is quite difficult to reproduce and does not happen frequently.

Contributor

@Donnerbart Donnerbart commented Oct 23, 2017

We get that error when the JDK throws an OverlappingFileLockException, which happens when a FileLock on the same file is already held by the same JVM (if another JVM held it, we would get a different error). The lock is acquired in DefaultNearCache.initialize() (when the NearCacheRecordStore is created). So if the test process is correct, we might have an issue where initialize() is called twice for the same Near Cache.

There is a check, but it's not atomic:

    @Override
    public void initialize() {
        if (nearCacheRecordStore == null) {
            nearCacheRecordStore = createNearCacheRecordStore(name, nearCacheConfig);
        }

Two threads could enter this in parallel.
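
For reference, the JDK behaviour described above can be reproduced in isolation. This is only a minimal sketch: it uses a throwaway temp file instead of the real `.store.lock` file, and the class and method names (`OverlappingLockDemo`, `secondLockIsRejected`) are made up for illustration and are not part of Hazelcast.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OverlappingLockDemo {

    /** Returns true if a second lock attempt on the same file, from the same JVM, is rejected. */
    static boolean secondLockIsRejected() throws IOException {
        // Hypothetical stand-in for the Near Cache preloader lock file.
        Path lockFile = Files.createTempFile("nearCache-demo", ".store.lock");
        try (FileChannel first = FileChannel.open(lockFile, StandardOpenOption.WRITE);
             FileChannel second = FileChannel.open(lockFile, StandardOpenOption.WRITE)) {
            FileLock held = first.tryLock();
            if (held == null) {
                throw new IllegalStateException("could not take the first lock");
            }
            try {
                // Same file, same JVM, different channel: the JDK rejects the overlap.
                second.tryLock();
                return false;
            } catch (OverlappingFileLockException e) {
                return true;
            } finally {
                held.release();
            }
        } finally {
            Files.deleteIfExists(lockFile);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second lock rejected: " + secondLockIsRejected());
    }
}
```

The point of the sketch is that the second attempt fails even though it goes through a different channel, so any code path that tries to lock the file again before the first lock is released would hit the same exception.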

@tombujok tombujok modified the milestones: 3.9.1, 3.9.2 Nov 21, 2017
Member

@sancar sancar commented Dec 19, 2017

@Donnerbart

The initialize() part not being atomic does not seem to be the problem, because it is always called under a lock.
From DefaultNearCacheManager.getOrCreateNearCache:

        NearCache<K, V> nearCache = nearCacheMap.get(name);
        if (nearCache == null) {
            synchronized (mutex) {
                nearCache = nearCacheMap.get(name);
                if (nearCache == null) {
                    nearCache = createNearCache(name, nearCacheConfig);
                    nearCache.initialize();   // <<<===========

                    nearCacheMap.put(name, nearCache);

                    .......
                }
            }
        }
Member Author

@Danny-Hazelcast Danny-Hazelcast commented Dec 19, 2017

The test is creating and destroying, so could there be an overlap / race condition between
initialize and destroy? Two threads could be creating and destroying at the same time.

@mmedenjak mmedenjak modified the milestones: 3.9.2, 3.9.3 Dec 19, 2017
Member

@sancar sancar commented Dec 19, 2017

I could not work out how it can fail.
I tried the following test, but it does not fail either. I use enterprise.
The Near Cache is configured as follows:

NearCacheConfig nearCacheConfig = new NearCacheConfig();
NearCachePreloaderConfig preloaderConfig = new NearCachePreloaderConfig();
preloaderConfig.setEnabled(true);
nearCacheConfig.setPreloaderConfig(preloaderConfig);
nearCacheConfig.setName("test*");

Test:

for (int i = 0; i < 100; i++) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            while (true) {
                IMap<Object, Object> map = client.getMap("test");
                map.put(1, 1);
                map.get(1);
                map.destroy();

                IMap<Object, Object> map2 = client.getMap("test0");
                map2.put(1, 1);
                map2.get(1);
                map2.destroy();
            }
        }
    }).start();
}

It feels like a scenario that should fail easily with a simple test, but I think I could not find the correct configuration yet. @Danny-Hazelcast Am I missing an important configuration?

Member Author

@Danny-Hazelcast Danny-Hazelcast commented Dec 19, 2017

Reproduced on master:

https://hazelcast-l337.ci.cloudbees.com/view/stable/job/stable-x2/129/console

fail HzClient1HZ _create-use-dist-destroy_createUseDistDestroy-Obj hzcmd.distributed.Destroy threadId=0 com.hazelcast.core.HazelcastException: Cannot acquire lock on /home/ec2-user/hz-root/HzClient1HZ/nearCache-mapBak1HD-ncHD_CreateUseDestroy_create-use-dist-destroy.store.lock. File is already being used by this Hazelcast instance. 

I have 4 members and 4 clients, all performing the create-use-dist-destroy
operations.

Here are the operations:
http://54.82.84.143/~jenkins/workspace/stable-x2/3.10-SNAPSHOT/2017_12_19-16_50_34/create-use-dist-destroy/createUseDestroy

And the XML files:
http://54.82.84.143/~jenkins/workspace/stable-x2/3.10-SNAPSHOT/2017_12_19-16_50_34/create-use-dist-destroy/config-hz/

I am calling distributed Destroy:
_create-use-dist-destroy_createUseDistDestroy-Obj@class=hzcmd.distributed.Destroy
_create-use-dist-destroy_createUseDistDestroy-Obj@threads=2

e.g.

for (DistributedObject distributedObject : hzInstance.getDistributedObjects()) {
    distributedObject.destroy();
}
Member Author

@Danny-Hazelcast Danny-Hazelcast commented Dec 19, 2017

However, some runs fail with an operation timeout:

https://hazelcast-l337.ci.cloudbees.com/view/stable/job/stable-x2/131/console

This issue is quite difficult to reproduce: Cannot acquire lock on /home/ec2-user/hz-root/HzClient1HZ/nearCache-mapBak1HD-ncHD_CreateUseDestroy_create-use-dist-destroy.store.lock

Member Author

@Danny-Hazelcast Danny-Hazelcast commented Dec 20, 2017

So a test could have:

X threads running from members and clients

 IMap<Object, Object> map = hzInstance.getMap("test");
 map.put(key, val);
 map.get(key);

and

X threads running from members and clients

 ICache cache = hzInstance.getCache("test");
 cache.put(key, val);
 cache.get(key);

and

2 * X threads running from members and clients

for (DistributedObject distributedObject : hzInstance.getDistributedObjects()) {
       distributedObject.destroy();
}
Member

@sancar sancar commented Dec 20, 2017

Thanks @Danny-Hazelcast for the details. I will try with these too.

Member Author

@Danny-Hazelcast Danny-Hazelcast commented Dec 20, 2017

I am running a simplified near-cache map/cache create-use-dist-destroy test:
https://hazelcast-l337.ci.cloudbees.com/view/stable/job/stable-x2/133/console

but I am now getting a different failure:

fail HzClient3HZ _create-use-dist-destroy_createUseDistDestroy_cacheBak1HD-ncHD hzcmd.cache.CreateUse threadId=0 java.lang.Exception: com.hazelcast.memory.NativeOutOfMemoryError: System allocations limit exceeded! Limit: 2048 KB, usage: 2048 KB, requested: 16 bytes

A client-side HD NativeOutOfMemoryError at map put; this should be a second issue.

sancar added a commit to sancar/hazelcast that referenced this issue Dec 20, 2017
The problem is caused by missing synchronization between Near Cache
creation and destruction inside the Near Cache manager.

The scenario that leads to the exception is as follows:
1. A Near Cache is put into the Near Cache manager and takes the file
lock while holding the mutex.
2. Thread1 initiates nearcache.destroy. It first removes the Near Cache
from the Near Cache manager (the file lock is not released yet).
3. Thread2 tries to create a Near Cache. It checks and sees that no
Near Cache is available with that name, so it tries to create one, and
fails with OverlappingFileLockException when trying to get the lock
for the nearCachePreLoader.
4. Thread1 releases the file lock.

The solution is to perform steps 2 and 4 under the mutex that is
used when creating the Near Cache.

Fixes hazelcast#11648
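
The scenario and fix described in the commit message can be sketched with a small model. This is not the Hazelcast code: the file lock is modelled by a set of held lock names, and `NearCacheManagerSketch`, `heldFileLocks`, `destroyRacy`, `destroyFixed` and `countFailures` are hypothetical names chosen for illustration. Only the shape of `getOrCreateNearCache` follows the snippet quoted earlier in this thread.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class NearCacheManagerSketch {

    // Stand-in for the JVM-wide file lock table: one entry per held ".store.lock" file.
    private final Set<String> heldFileLocks = ConcurrentHashMap.newKeySet();
    private final ConcurrentMap<String, Object> nearCacheMap = new ConcurrentHashMap<>();
    private final Object mutex = new Object();

    Object getOrCreateNearCache(String name) {
        Object nearCache = nearCacheMap.get(name);
        if (nearCache == null) {
            synchronized (mutex) {
                nearCache = nearCacheMap.get(name);
                if (nearCache == null) {
                    // Models initialize(): taking the preloader file lock.
                    if (!heldFileLocks.add(name)) {
                        throw new IllegalStateException("File is already being used: " + name);
                    }
                    nearCache = new Object();
                    nearCacheMap.put(name, nearCache);
                }
            }
        }
        return nearCache;
    }

    // Racy variant: between the two statements another thread can see an empty
    // map entry while the file lock is still held (steps 2 to 4 of the scenario).
    boolean destroyRacy(String name) {
        if (nearCacheMap.remove(name) == null) {
            return false;
        }
        heldFileLocks.remove(name);
        return true;
    }

    // Fixed variant: map removal and lock release happen under the creation mutex.
    boolean destroyFixed(String name) {
        synchronized (mutex) {
            if (nearCacheMap.remove(name) == null) {
                return false;
            }
            heldFileLocks.remove(name);
            return true;
        }
    }

    /** Runs create/destroy loops on several threads and counts "already being used" failures. */
    static int countFailures(boolean fixed, int threads, int iterations) throws InterruptedException {
        NearCacheManagerSketch manager = new NearCacheManagerSketch();
        AtomicInteger failures = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    try {
                        manager.getOrCreateNearCache("test");
                    } catch (IllegalStateException e) {
                        failures.incrementAndGet();
                    }
                    if (fixed) {
                        manager.destroyFixed("test");
                    } else {
                        manager.destroyRacy("test");
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("failures with fixed destroy: " + countFailures(true, 8, 10_000));
        System.out.println("failures with racy destroy:  " + countFailures(false, 8, 10_000));
    }
}
```

With the fixed variant, the map entry and the modelled file lock always change together under the mutex, so a creating thread that finds no map entry also finds the lock free. The racy variant may or may not fail on a given run, which matches how hard the original failure was to reproduce.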
@sancar sancar self-assigned this Dec 20, 2017
sancar added a commit to sancar/hazelcast that referenced this issue Dec 21, 2017