
Persistent Problems with Shard / Cache corruption #23

Closed
ThomasBergman1 opened this issue Aug 21, 2017 · 2 comments

@ThomasBergman1

I am having persistent problems like this:

Some day, suddenly I will get an error:
java.lang.RuntimeException: unable to open session

This error will be tied to a specific date range in a dataset.
E.g. if my query includes that date range in that dataset, it breaks; if not, it doesn't.

I can solve this by deleting the contents of the cache:
sudo rm -rf /var/data/file_cache/*
and then restarting the daemon and killing all active Imhotep processes (workaround due to #19).

However, it sucks that I have to do this manually, and with some frequency.

When I look in the logs for the daemon, I see some periodic problem that looks like:

2017-08-21 17:29:06,327 INFO  [CachingLocalImhotepServiceCore] loading shard index20170710.00-20170717.00 from com.indeed.imhotep.io.caching.CachedFile@3463f366

2017-08-21 17:29:06,578 ERROR [CachingLocalImhotepServiceCore] Exception during cleanup of a Closeable, ignoring
java.lang.NullPointerException
at com.indeed.imhotep.io.Shard.close(Shard.java:131)
at com.indeed.util.core.reference.SharedReference.decRef(SharedReference.java:111)
at com.indeed.util.core.reference.SharedReference.close(SharedReference.java:76)
at com.indeed.util.core.io.Closeables2.closeQuietly(Closeables2.java:29)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.updateShards(CachingLocalImhotepServiceCore.java:308)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.<init>(CachingLocalImhotepServiceCore.java:148)
at com.indeed.imhotep.service.ImhotepDaemon.newImhotepDaemon(ImhotepDaemon.java:758)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:728)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:694)

Not sure if related.
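The "ignoring" in that log line is the usual behavior of a close-quietly helper: an exception thrown while closing one resource (here a Shard whose backing files are apparently missing) is logged and swallowed so the rest of the cleanup can continue, which is why the daemon keeps running. A minimal sketch of that pattern, with illustrative names rather than the actual Closeables2 code:

import java.io.Closeable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative "close quietly" helper: an exception thrown by close() is logged
// and swallowed rather than propagated, so one broken Closeable doesn't abort
// the rest of the cleanup.
public final class CloseQuietlySketch {
    private static final Logger LOG = Logger.getLogger(CloseQuietlySketch.class.getName());

    public static void closeQuietly(final Closeable closeable) {
        if (closeable == null) {
            return;
        }
        try {
            closeable.close();
        } catch (Exception e) {   // also catches RuntimeExceptions like the NPE above
            LOG.log(Level.SEVERE, "Exception during cleanup of a Closeable, ignoring", e);
        }
    }
}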

@ThomasBergman1
Author

ThomasBergman1 commented Jan 16, 2018

I think I actually know what this is now.
Imhotep compresses files into shards and stores them in S3.
When the Imhotep daemon goes to query a shard, it first checks the cache; if the shard isn't there, it goes to S3 and grabs the data to load into the cache.
I think this problem happens when the Imhotep daemon's cache is full:
it goes to S3 to grab data, and that overwrites old data,
but it expects the old data to still be there, and shows the error "unable to open session".
Imhotep doesn't crash, but some subsection of the data becomes unqueryable.
Other queries, e.g. for other time ranges or in other datasets, work fine.

When I tail the logs while querying the broken time range, I see an error like 'blah blah metadata.txt not found'.

So in an ideal world it shouldn't fail like this;
it should give an error like 'out of room, couldn't load data for query, please clear the cache or increase the size of the instance'.
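That failure mode falls out of any size-capped local cache in front of a remote store: once the cap is hit, fetching a new shard evicts files that an already-open (or concurrently opening) session still expects to find on disk, so the problem surfaces as a missing-file error rather than an explicit "out of room" message. A rough sketch of the pattern being described, with made-up class names and a simple LRU policy rather than Imhotep's actual cache code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a size-capped local file cache backed by a remote store such as S3.
// When the cap is reached, least-recently-used files are deleted to make room;
// if a live session still references one of those files, its next read fails
// (e.g. "metadata.txt not found") and the session cannot be opened.
public class LocalFileCacheSketch {
    public interface RemoteStore {
        byte[] fetch(String key) throws IOException;    // e.g. download the object from S3
    }

    private final Path cacheDir;
    private final long maxBytes;
    private final RemoteStore remote;
    private long usedBytes = 0;
    // access-ordered map: iteration order is least-recently-used first
    private final LinkedHashMap<String, Long> entries = new LinkedHashMap<>(16, 0.75f, true);

    public LocalFileCacheSketch(Path cacheDir, long maxBytes, RemoteStore remote) throws IOException {
        this.cacheDir = cacheDir;
        this.maxBytes = maxBytes;
        this.remote = remote;
        Files.createDirectories(cacheDir);
    }

    // Returns a local path for the requested file, downloading it on a cache miss.
    public synchronized Path get(String key) throws IOException {
        Path local = cacheDir.resolve(key);
        if (entries.get(key) != null) {                 // cache hit (also refreshes LRU order)
            return local;
        }
        byte[] data = remote.fetch(key);                // cache miss: pull from the remote store
        evictUntilRoomFor(data.length);                 // may delete files an open session still needs
        Files.createDirectories(local.getParent());
        Files.write(local, data);
        entries.put(key, (long) data.length);
        usedBytes += data.length;
        return local;
    }

    private void evictUntilRoomFor(long incoming) throws IOException {
        Iterator<Map.Entry<String, Long>> oldestFirst = entries.entrySet().iterator();
        while (usedBytes + incoming > maxBytes && oldestFirst.hasNext()) {
            Map.Entry<String, Long> victim = oldestFirst.next();
            Files.deleteIfExists(cacheDir.resolve(victim.getKey()));
            usedBytes -= victim.getValue();
            oldestFirst.remove();
        }
    }
}

Nothing in a scheme like this knows which cached files are still referenced by open sessions, so running out of room shows up as missing files rather than as the explicit error suggested above.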

@youknowjack
Collaborator

There is a configuration parameter for the file cache size, specified in /opt/imhotep/imhotep-caching.yaml:

-   type: CACHED
    order: 6
    mountpoint: /
    cache-dir: /var/data/file_cache
    cacheSizeMB: 32000

If you have a configuration like that, check the size of /var/data/file_cache and make sure the cache size is well under the available space (minus the current cache size):

$ df -h /var/data/file_cache
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb        30G  122M   30G   1% /var/data
[ec2-user@ip-172-31-7-225 ~]$ du -sh /var/data/file_cache
14M	/var/data/file_cache

Adjust the cacheSizeMB param in imhotep-caching.yaml if it is too large for your cache file system. (Or give more space to the cache file system.)
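If you want to script that comparison, the same numbers df and du report can be pulled programmatically. A small sketch, assuming the 32000 MB value from the example config above and the /var/data/file_cache cache-dir (substitute your own values):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Prints the configured cache size, the free space on the cache filesystem, and the
// space the cache currently occupies, then applies the comparison suggested above.
// The 32000 MB value and the path are assumptions taken from the example config.
public class CacheSizeSanityCheck {
    public static void main(String[] args) throws IOException {
        final long cacheSizeMB = 32000L;
        final Path cacheDir = Paths.get("/var/data/file_cache");

        long availMB = new File(cacheDir.toString()).getUsableSpace() / (1024 * 1024);
        long cacheUsedMB;
        try (Stream<Path> paths = Files.walk(cacheDir)) {
            cacheUsedMB = paths.filter(Files::isRegularFile)
                               .mapToLong(p -> p.toFile().length())
                               .sum() / (1024 * 1024);
        }

        System.out.printf("cacheSizeMB=%d, avail=%d MB, cache currently uses=%d MB%n",
                cacheSizeMB, availMB, cacheUsedMB);
        // Mirrors the guidance above: keep cacheSizeMB well under the available space.
        if (cacheSizeMB >= availMB - cacheUsedMB) {
            System.out.println("cacheSizeMB looks too large for this filesystem; lower it or add disk.");
        }
    }
}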

I got this note from an Indeed developer about our experience with running out of space for the cache:

We had a case where it happened once. I have found 2 ways it could exceed the allowed cache size so far:

  1. It has an extra metadata index which is not part of the cache but is stored in the same location. In our case it is around 20 GB per daemon, so not enormous, but something to note.
  2. The cache counts the size of the data in the files rather than the size of the disk sectors they consume. In particular, it doesn't account for space taken by directory inodes, and the cache implementation does not delete directories when they become empty.

We have an internal fix for the empty-directory cleanup that we need to get applied here. Since that seems like a very likely cause, this issue can track delivering that fix.
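For reference, the kind of cleanup being described is small. A hedged sketch (this is not the internal fix itself) of pruning directories that become empty after a cached file is evicted:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of empty-directory pruning for a file cache: after deleting an evicted
// file, walk up and remove ancestor directories (below the cache root) that are
// now empty, so their inodes don't accumulate outside the cache's size accounting.
public class EmptyDirCleanup {
    static void deleteAndPruneEmptyDirs(Path file, Path cacheRoot) throws IOException {
        Files.deleteIfExists(file);
        Path dir = file.getParent();
        while (dir != null && !dir.equals(cacheRoot)) {
            try (DirectoryStream<Path> contents = Files.newDirectoryStream(dir)) {
                if (contents.iterator().hasNext()) {
                    return;                  // directory still has entries; stop pruning
                }
            }
            Files.delete(dir);               // empty: remove it, then check its parent
            dir = dir.getParent();
        }
    }
}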
