
[tsm1] Data older than a few minutes is not showing up in queries + memory is leaking #4354

Closed
KiNgMaR opened this issue Oct 7, 2015 · 30 comments


@KiNgMaR

KiNgMaR commented Oct 7, 2015

The new storage engine looks very promising so far. Memory usage is still going up non-stop (writing over UDP exclusively), but slower than before. Overall the system seems more stable and under much less pressure.

However, after a few minutes, points that have been written and were previously showing up in queries disappear. All WHERE time >= now() - 1h queries only show data for approx. the last 2-10 minutes. This affects CQs and normal queries. It can even be reproduced using a simple SELECT * FROM blah query without any time constraints.
The data returned by this type of query behaves as follows: new points are appended and show up. Then, every few minutes, the starting point (i.e. the timestamp of the first point returned) moves some minutes into the future.

P.S. Thanks everyone, you're doing a great job!

@jwilder
Contributor

jwilder commented Oct 7, 2015

What are your retention policies set to?

@otoolep
Contributor

otoolep commented Oct 7, 2015

Yes, please let us know what the retention policies are.

It would be great to have a small program (in any language) that allows us to a) recreate your database and retention policies and b) write the data pattern you have.

@KiNgMaR
Author

KiNgMaR commented Oct 8, 2015

Retention policies are as follows:

create database "xxx_nonprod"
create database "xxx_prod"
alter retention policy "default" on "xxx_nonprod" duration 7d
alter retention policy "default" on "xxx_prod" duration 31d

There may have been points written between create database and alter retention policy; could that be a problem?

I will try to reproduce on a clean database and/or with a code example.
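
For what it's worth, a minimal sketch of such a reproduction writer could look like the following (all assumptions: a UDP line-protocol listener on localhost:8089 pointed at the target database; the measurement, tag and interval are illustrative rather than my real data pattern):

package main

// Hypothetical reproduction writer: sends one line-protocol point per second
// over UDP. Host, port, measurement and tag names are assumptions and need to
// be adjusted to the real setup and data pattern.
import (
    "fmt"
    "net"
    "time"
)

func main() {
    conn, err := net.Dial("udp", "localhost:8089")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    for {
        // InfluxDB 0.9 line protocol: measurement,tag=value field=value timestamp(ns)
        line := fmt.Sprintf("cpu,host=test value=%d %d\n", 42, time.Now().UnixNano())
        if _, err := conn.Write([]byte(line)); err != nil {
            fmt.Println("write failed:", err)
        }
        time.Sleep(time.Second)
    }
}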

@otoolep
Contributor

otoolep commented Oct 8, 2015

@KiNgMaR -- that may be. To be clear, when points are first written shards are created and the end-time of the shard is determined by the retention policy duration at that time. Subsequently changing the retention policy duration does not change the end-time of pre-existing shards, only shards created after the ALTER command.
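
In other words, for the new duration to apply everywhere, the ALTER has to happen before the first points are written, e.g. (the same commands as above, repeated only to make the ordering explicit; only shards created by writes after the ALTER pick up the 7d duration):

create database "xxx_nonprod"
alter retention policy "default" on "xxx_nonprod" duration 7d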

@KiNgMaR
Author

KiNgMaR commented Oct 8, 2015

So I stopped InfluxDB, deleted all DBs, shards etc. and disabled the UDP listeners. Started InfluxDB back up, created the DBs, altered the default retention policies and checked that no shards had been created yet. Shut down InfluxDB, re-enabled the UDP listeners... but the issue remains. This log file should illustrate it: https://gist.github.com/KiNgMaR/d3419fde0b5f8c78d88e The shard properties look correct to me.

If needed I should probably be able to provide a node.js script to reproduce from scratch...

@pkittenis

I am also seeing this, and it seems to be caused by wal flushes. The retention period is the default (infinite), so I doubt the retention period is related in any way.

DB version is 0.9.5-nightly-64cef0f, branch master, commit 64cef0fa63487e4b675cbb0b59e83b493e88ff08

After messages like the ones below, data older than the time of the flush no longer appears in queries:

[tsm1wal] 2015/10/08 14:17:56 idle flush of /var/opt/influxdb/data/graphite/default/5 with 54 keys and 54 total values of 864 bytes
[tsm1wal] 2015/10/08 14:17:56 idle flush of /var/opt/influxdb/data/_internal/monitor/4 with 52 keys and 312 total values of 4992 bytes
[tsm1wal] 2015/10/08 14:17:56 idle flush of /var/opt/influxdb/data/graphite/default/2 with 83 keys and 504 total values of 8064 bytes
[tsm1wal] 2015/10/08 14:18:28 memory flush of /var/opt/influxdb/data/graphite/default/1 with 104109 keys and 328749 total values of 5259984 bytes
select  value from "development.amers.b00583.fix_eng.undefined.alpqfix01.os.cpu.idle" where time > now() - 1h
development.amers.b00583.fix_eng.undefined.alpqfix01.os.cpu.idle
time    value
2015-10-08T14:18:15Z    97
2015-10-08T14:18:17Z    97
2015-10-08T14:18:20Z    97

Then, a few minutes later, another flush:

[tsm1wal] 2015/10/08 14:25:56 idle flush of /var/opt/influxdb/data/_internal/monitor/4 with 52 keys and 312 total values of 4992 bytes
[tsm1wal] 2015/10/08 14:25:56 idle flush of /var/opt/influxdb/data/graphite/default/5 with 54 keys and 54 total values of 864 bytes
[tsm1wal] 2015/10/08 14:25:56 idle flush of /var/opt/influxdb/data/graphite/default/2 with 83 keys and 504 total values of 8064 bytes
[tsm1wal] 2015/10/08 14:26:12 memory flush of /var/opt/influxdb/data/graphite/default/1 with 127105 keys and 328706 total values of 5259296 bytes

And queries now only return data from 14:25 onwards.

select value from "development.amers.b00583.fix_eng.undefined.alpqfix01.os.cpu.idle" where time > now() - 1h
development.amers.b00583.fix_eng.undefined.alpqfix01.os.cpu.idle
time    value
2015-10-08T14:25:52Z    

Do not have a 100% reproducible test yet, will post back when I have one.

What I have noticed is that sometimes a log message like the one below is printed following a wal flush:

[tsm1] 2015/10/08 14:22:53 rewriting index file /var/opt/influxdb/data/graphite/default/1/0000064.tsm1 with /var/opt/influxdb/data/graphite/default/1/0000065.tsm1

However, this message is not printed when data disappears following a wal flush.

I would presume that, under certain circumstances, the merging of existing indexes during the index rewrite does not happen after a wal flush, which causes data to 'disappear' from the index and not show up in queries.

Hope this helps.

@KiNgMaR
Author

KiNgMaR commented Oct 9, 2015

I updated to the fresh nightly (bc569e8) and the issue appears to be resolved for me. 😃 However I browsed the commits and didn't find anything that struck me as related, so we probably have to keep an eye on it...

@pkittenis can you also test with the fresh build?

@otoolep
Contributor

otoolep commented Oct 9, 2015

Are you sure you are running the tsm1 engine? Just have to be sure it isn't something simple that fixed it, like inadvertently switching back to bz1. :-)

@KiNgMaR
Author

KiNgMaR commented Oct 9, 2015

# grep 'engine =' /etc/opt/influxdb/influxdb.conf
engine = "tsm1"

# grep wal /var/log/influxdb/influxd.log | tail -n1
[tsm1wal] 2015/10/09 08:44:49 memory flush of /var/opt/influxdb/data/xxx_prod/default/2 with 140876 keys and 330139 total values of 5282224 bytes

LGTM 👍

@jhofeditz

I had this same issue after wiping all data and upgrading to f1e0c59 nightly. Data disappeared after 5-10 minutes even though I changed to tsm1 (or thought I did). Even saw tsm1 messages in the log:
[tsm1wal] 2015/10/12 06:56:41 /ssd/influxdb/data/_internal/monitor/36 flush to index took 1.817µs
[tsm1wal] 2015/10/12 06:57:31 idle flush of /ssd/influxdb/data/_internal/monitor/36 with 34 keys and 166 total values of 2656 bytes

I then moved my 0.9.4 config and used a new 0.9.5 config. Then I noticed that I had neglected to uncomment the engine line: # engine = "tsm1". D'oh! Fixed it, wiped the data, and restarted. It's been working exactly as expected for the last few hours now.

Just putting this out there in case someone else has a similar issue.

@KiNgMaR
Author

KiNgMaR commented Oct 13, 2015

Just upgraded to the f0efd1f nightly, cleaned all databases and the issue has reappeared for me. :(

Looking at the log files, the time window is resetting whenever the WAL flushes to disk:

[tsm1wal] 2015/10/13 07:31:02 memory flush of /var/opt/influxdb/data/xxx_prod/default/2
          with 101755 keys and 336873 total values of 5389968 bytes

Possibly related, this is also when the process is adding some 100 MB to its RSS - which I described here: #4096 (comment) - so it looks like the WAL flush and/or the "join" with indexed data is causing the trouble I am currently experiencing. How can we get to the bottom of this?

@KiNgMaR KiNgMaR changed the title [tsm1] Data older than a few minutes is not showing up in queries [tsm1] Data older than a few minutes is not showing up in queries + memory is leaking Oct 13, 2015
@pkittenis

so it looks like the WAL flush and/or the "join" with indexed data is causing the trouble I am currently experiencing

Exactly my thoughts as well - see comment above.

I am working on a reproducible test; I have one, but it does not reproduce the issue 100% of the time.

@pkittenis

Hi @otoolep ,

Can you please provide an email address that I can send a DB dump exhibiting this issue to? The dump is 25 MB, and though the data is not sensitive, I do not want to publish it publicly.

I have been unable to replicate it in testing, but the issue immediately starts appearing in our live test environment with real clients sending data to InfluxDB using tsm1.

Needless to say it's a blocker for moving ahead with the new engine :)

Here are example queries that demonstrate the issue.

Query:

select value from "development.amers.b001287.cmcm.undefined.ouy-ingestion.os.nfs.nfs_call" where time > '2015-10-14T14:00:52Z'

First response:

development.amers.b001287.cmcm.undefined.ouy-ingestion.os.nfs.nfs_call
time    value
2015-10-14T14:54:52Z     0

1min later, second response:

Success! (no results to display)

Previous data has 'disappeared'. Several more data points that were written and seen before the first query have also disappeared. Judging by the size of the DB dump, the data still looks to be there, but it does not show up in queries.

I am fairly confident this is an issue with WAL writes and the joining of index files. The InfluxDB logs during writes in our test environment contain no rewriting index file <..> messages, unlike in local testing, where I have been unable to replicate this.

Eg:

[tsm1wal] 2015/10/14 15:22:32 idle flush of /var/opt/influxdb/data/graphite/default/2 with 83 keys and 504 total values of 8064 bytes
[tsm1wal] 2015/10/14 15:22:32 idle flush of /var/opt/influxdb/data/graphite/default/3 with 54 keys and 378 total values of 6048 bytes
[query] 2015/10/14 15:22:32 SELECT value FROM "graphite"."default"."development.amers.b001287.cmcm.undefined.ouy-ingestion.os.nfs.nfs_call" WHERE time > '2015-10-14T14:00:52Z'

You can see wal flushes, without any rewriting index message, followed by the second query above that returned no data.

Compare this to the log messages while load testing and exercising the wal:

[tsm1wal] 2015/10/14 15:31:08 memory flush of /var/opt/influxdb/data/graphite/default/1 with 1000 keys and 329000 total values of 5264000 bytes
[tsm1] 2015/10/14 15:31:08 rewriting index file /var/opt/influxdb/data/graphite/default/1/0000415.tsm1 with /var/opt/influxdb/data/graphite/default/1/0000416.tsm1
[tsm1wal] 2015/10/14 15:31:11 /var/opt/influxdb/data/graphite/default/1 flush to index took 2.236247905s

@otoolep
Contributor

otoolep commented Oct 14, 2015

@pauldix is actually looking at all tsm1 issues, so may be interested in your data.

@beckettsean -- can you provide the upload details for CloudWok, so we can get the data from @pkittenis ?

@rossmcdonald
Contributor

@pkittenis If you would like to send us your data, you can upload it to our S3 bucket here:

https://www.cloudwok.com/cu/gh4354

@pkittenis

Thanks @rossmcdonald - DB dump uploaded.

@KiNgMaR
Author

KiNgMaR commented Oct 20, 2015

I checked my log files from the last few days (running with a very light workload to avoid going OOM) and I absolutely never see any rewriting index file entries in my log files. My configuration is this: https://gist.github.com/KiNgMaR/2f0ba7ca6af1f83f39bc

Furthermore, I don't have any .tsm1 files on disk. Only .wal files... wut?!

@otoolep @pauldix: Any suggestions?

@pauldix
Member

pauldix commented Oct 20, 2015

@KiNgMaR: What does an ls of the directory look like?

@KiNgMaR
Author

KiNgMaR commented Oct 20, 2015

@KiNgMaR
Author

KiNgMaR commented Oct 21, 2015

I think something may be going wrong on the file or filesystem level. Looking at the following piece of code from wal.go, I think that an error is occurring in Write() in tsm1.go, but it's never logged. The return value from flush() in wal.go seems to be discarded as well. This should probably be addressed?

if l.LoggingEnabled {
    // ...
    l.logger.Printf("%s flush of %s with %d keys and %d total values of %d bytes\n", ftype, l.path, len(l.flushCache), valueCount, flushSize)
}

startTime := time.Now()
if err := l.IndexWriter.Write(l.flushCache, mfc, scc); err != nil {
    // *** SHOULD LOG HERE?! ***
    return err
}
if l.LoggingEnabled {
    l.logger.Printf("%s flush to index took %s\n", l.path, time.Since(startTime))
}

In my log file, I can see the %s flush of %s with %d keys and %d total values of %d bytes messages, but not the %s flush to index took %s ones...

I will try to come up with a pull request to log the error unless someone else beats me to it... :)
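
For illustration, the kind of change I have in mind might look roughly like this (a sketch only, not the actual contents of any PR):

startTime := time.Now()
if err := l.IndexWriter.Write(l.flushCache, mfc, scc); err != nil {
    if l.LoggingEnabled {
        // Surface the error instead of silently returning it up the stack.
        l.logger.Printf("failed to flush to index: %v", err)
    }
    return err
}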

@jwilder
Contributor

jwilder commented Oct 21, 2015

@KiNgMaR Nice catch. #4529 adds logging for those errors.

@KiNgMaR
Author

KiNgMaR commented Oct 21, 2015

Awesome, @jwilder! Can't wait to try the nightly tomorrow / whenever it's merged :)

@KiNgMaR
Author

KiNgMaR commented Oct 22, 2015

My issue originates from replaceCompressedFile in tsm1.go. For some reason, the Remove call in the following lines results in a Permission Denied error:

if err := os.Remove(name); err != nil && !os.IsNotExist(err) {
    return err
}

Weirdly enough, this does not seem to happen in cleanupMetafile. Looking at the implementation of os.Remove in Go, an unlink is performed and, if it fails, is followed by an rmdir syscall. In this case, the unlink would most probably return ENOENT or similar. However, if the subsequent rmdir returned something other than ENOTDIR (e.g. EACCES), the error from rmdir would be the one returned to the application, and os.IsNotExist would then be false.
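
For context, os.Remove on Unix looks roughly like this (simplified from the Go standard library, not a verbatim copy), which shows how the rmdir error can end up masking the unlink result:

func Remove(name string) error {
    // Try unlink first; if it fails, try rmdir, since the syscall interface
    // forces us to know whether name is a file or a directory.
    e := syscall.Unlink(name)
    if e == nil {
        return nil
    }
    e1 := syscall.Rmdir(name)
    if e1 == nil {
        return nil
    }
    // Both failed: rmdir(file) returns ENOTDIR, so use that to decide which
    // error is the real one. Any other rmdir error (e.g. EACCES) wins here.
    if e1 != syscall.ENOTDIR {
        e = e1
    }
    return &PathError{Op: "remove", Path: name, Err: e}
}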

My data directory lives on NFS, so maybe that is what happens.

According to the rename syscall manpage, the os.Remove may not be necessary at all:

   If newpath already exists, it will be atomically replaced (subject to
   a few conditions; see ERRORS below), so that there is no point at
   which another process attempting to access newpath will find it
   missing.
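
In code, the suggestion would amount to dropping the Remove and relying on the rename alone, roughly like this (a hypothetical sketch; tmpName is an assumed name, not taken from tsm1.go):

// rename(2) atomically replaces the destination if it already exists,
// so no preceding os.Remove(name) should be needed.
if err := os.Rename(tmpName, name); err != nil {
    return err
}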

Thoughts?

@pkittenis

I am testing last night's build and it is looking more stable: I no longer see data loss in queries so far, and rewriting index <..> lines have started appearing in the logs.

Will keep you updated if anything changes.

@pauldix
Member

pauldix commented Oct 22, 2015

@pkittenis I just merged #4530 which is a fairly important fix so it'll be interesting to hear what happens with a build that has that.

@KiNgMaR I'm not sure. You're saying we can just take out the Remove because the rename will replace the old file? Permission denied seems like a really odd error to get here.

@jwilder
Contributor

jwilder commented Oct 22, 2015

@pauldix @KiNgMaR Looking at replaceCompressedFile, the Remove call is not needed. We can remove it.

@pkittenis

FWIW, I also saw permission denied errors from one instance, which went away after I wiped the data on it.

[tsm1wal] 2015/10/22 10:12:28 idle flush of /var/opt/influxdb/data/_internal/monitor/2 with 46 keys and 276 total values of 4416 bytes
[tsm1wal] 2015/10/22 10:12:28 failed to flush to index: remove ids: permission denied
[tsm1] 2015/10/22 10:12:28 PerformMaintenance: WAL flush failed: remove ids: permission denied
[tsm1wal] 2015/10/22 10:12:28 idle flush of /var/opt/influxdb/data/graphite/default/3 with 54 keys and 54 total values of 864 bytes
[tsm1wal] 2015/10/22 10:12:28 failed to flush to index: remove ids: permission denied
[tsm1] 2015/10/22 10:12:28 PerformMaintenance: WAL flush failed: remove ids: permission denied

I did not investigate too hard and it has not re-occurred since.

Last night's build has been stable so far with no data loss - will test with #4530 included next.

@jwilder
Contributor

jwilder commented Oct 22, 2015

@KiNgMaR @pkittenis Are you installing upgrades using the nightly packages or building manually? There may be a bug with permissions in the packaging script.

@KiNgMaR
Author

KiNgMaR commented Oct 26, 2015

Fixed by #4555! Woohooo!! 💯

I am still keeping an eye on memory usage; I will report back in a new issue if it keeps growing.

@jwilder: I use the nightly rpm builds for CentOS 6. E.g.

    Running Transaction
      Updating   : influxdb-0.9.5_nightly_9633410-1.x86_64                                                          1/2
    ln: creating symbolic link `/opt/influxdb/influx_inspect': File exists
    ln: creating symbolic link `/opt/influxdb/influx_stress': File exists
    chown: cannot dereference `/opt/influxdb/influx_inspect': No such file or directory
    chown: cannot dereference `/opt/influxdb/influx_stress': No such file or directory
      Cleanup    : influxdb-0.9.5_nightly_3a5044d-1.x86_64                                                          2/2
      Verifying  : influxdb-0.9.5_nightly_9633410-1.x86_64                                                          1/2
      Verifying  : influxdb-0.9.5_nightly_3a5044d-1.x86_64                                                          2/2

    Updated:
      influxdb.x86_64 0:0.9.5_nightly_9633410-1

@KiNgMaR KiNgMaR closed this as completed Oct 26, 2015
@pkittenis

I can also confirm the issue appears resolved with the changes from #4530 included.

In general, the latest nightly builds seem much more stable. Thanks for your hard work.
