[tsm1] Data older than a few minutes is not showing up in queries + memory is leaking #4354
Comments
What are your retention policies set to?
Yes, please let us know what the retention policies are. A small program (in any language) that would allow us to a) recreate your database and retention policies and b) write the data pattern you have would be great.
Retention policies are as follows:
There may have been points written in between. I will try to reproduce on a clean database and/or with a code example.
@KiNgMaR -- that may be. To be clear, when points are first written, shards are created, and the end-time of each shard is determined by the retention policy duration at that time. Subsequently changing the retention policy duration does not change the end-time of pre-existing shards, only of shards created after the change.
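As an illustration of that behavior, a minimal InfluxQL sketch (the database and policy names here are made up):

```sql
-- Create a database and a one-week default retention policy.
CREATE DATABASE mydb
CREATE RETENTION POLICY one_week ON mydb DURATION 7d REPLICATION 1 DEFAULT

-- Points written now create shards whose end-times are derived
-- from the 7d duration in effect at creation time.

-- Altering the duration later only affects shards created afterwards;
-- shards that already exist keep their original end-times.
ALTER RETENTION POLICY one_week ON mydb DURATION 30d
```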
So I stopped InfluxDB, deleted all DBs, shards, etc. and disabled the UDP listeners. Started InfluxDB back up, created the DBs, altered the default retention policies, and checked that no shards had been created yet. Shut down InfluxDB, re-enabled the UDP listeners... but the issue remains. This log file should illustrate it: https://gist.github.com/KiNgMaR/d3419fde0b5f8c78d88e The shard properties look correct to me. If needed, I should be able to provide a node.js script to reproduce from scratch...
Also getting this, and it seems to be caused by WAL flushes. Retention period is the default (infinite), so I doubt the retention period is in any way related. After messages like the one below, data older than the time of the flush no longer appears in queries:
Then, a few minutes later, another flush:
And queries now only return data since 14:25.
I don't have a 100% reproducible test yet; I will post back when I have one. What I have noticed is that sometimes a log message like the one below is printed following a WAL flush:
However, this message is not printed when data disappears following a WAL flush. I would presume that the merging of existing indexes during the index rewrite does not happen in certain circumstances after a WAL flush, which causes data to 'disappear' from the index and not show up in queries. Hope this helps.
I updated to the fresh nightly (bc569e8) and the issue appears to be resolved for me. 😃 However, I browsed the commits and didn't find anything that struck me as related, so we should probably keep an eye on it... @pkittenis, can you also test with the fresh build?
Are you sure you are running the tsm1 engine?
LGTM 👍
I had this same issue after wiping all data and upgrading to the f1e0c59 nightly. Data disappeared after 5-10 minutes even though I had changed to tsm1 (or thought I had). I even saw tsm1 messages in the log. I then moved my 0.9.4 config aside and used a new 0.9.5 config. Then I noticed that I had neglected to uncomment the engine line: # engine = "tsm1". D'oh! Fixed, wiped data, and restarted. It's been working exactly as expected for the last few hours now. Just putting this out there in case someone else has a similar issue.
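In case it helps anyone else, here is a minimal sketch of the relevant config section, assuming the 0.9.x-era file layout (the path is illustrative); the key point is that the engine line must actually be uncommented:

```toml
# influxdb.conf
[data]
  dir = "/var/opt/influxdb/data"
  # This line ships commented out; leaving it commented means the
  # pre-tsm1 default engine is used. Uncomment it to opt in to tsm1:
  engine = "tsm1"
```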
Just upgraded to the f0efd1f nightly and cleaned all databases, and the issue has reappeared for me. :( Looking at the log files, the time window resets whenever the WAL flushes to disk:
Possibly related: this is also when the process adds some 100 MB to its RSS, which I described here: #4096 (comment). So it looks like the WAL flush and/or the "join" with indexed data is causing the trouble I am currently experiencing. How can we get to the bottom of this?
Exactly my thoughts as well; see my comment above. I am working on a reproducible test. I have one, but it does not reproduce the issue 100% of the time.
Hi @otoolep, can you please provide an email address that I can send a DB dump with this issue to? The dump is 25 MB, and though the data is not sensitive, I do not want to publish it publicly. I have been unable to replicate it in testing, but the issue immediately starts appearing in our live test environment with real clients sending data to InfluxDB using tsm1. Needless to say, it's a blocker for moving ahead with the new engine :) Here are example queries that demonstrate the issue. Query:
First response:
One minute later, second response:
Previous data has 'disappeared'. There were also several more data points written and seen before the first query that have also disappeared. The data looks to be there in the DB dump, judging by the size of the DB, but does not show up in queries. I am fairly confident this is an issue with WAL writes and the joining of index files. InfluxDB logs during writes in our test environment have no index-rewrite messages. E.g.:
You can see WAL flushes, sans index rewrites, followed by the second query above that returned no data. Compare with the log messages seen while load testing and exercising the WAL:
@pauldix is actually looking at all tsm1 issues, so he may be interested in your data. @beckettsean -- can you provide the upload details for CloudWok, so we can get the data from @pkittenis?
@pkittenis If you would like to send us your data, you can upload it to our S3 bucket here:
Thanks @rossmcdonald - DB dump uploaded.
I checked my log files from the last few days (running with a very light workload to avoid going OOM) and I absolutely never see any index-rewrite messages. Furthermore, I don't have any .tsm1 files on disk. Only .wal files... wut?!
@KiNgMaR: What does an ls of the directory look like?
@pauldix: here you go: https://gist.github.com/KiNgMaR/632d0acbad19fb9f4649
I think something may be going wrong at the file or filesystem level, looking at the following piece of code:
In my log file, I can see the messages leading up to this code path, but the error itself is never logged. I will try to come up with a pull request to log the error unless someone else beats me to it... :)
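For illustration only (this is not the actual InfluxDB source), a minimal Go sketch of the remove-then-rename pattern under discussion, with the previously swallowed error logged; replaceIndexFile and the file paths are hypothetical names:

```go
package main

import (
	"log"
	"os"
)

// replaceIndexFile is a hypothetical stand-in for the tsm1 code path
// being discussed: a freshly written file replaces the old index file.
func replaceIndexFile(tmpPath, indexPath string) error {
	// If this remove fails (e.g. permission denied on an NFS mount),
	// swallowing the error hides the failure entirely -- log it instead.
	if err := os.Remove(indexPath); err != nil && !os.IsNotExist(err) {
		log.Printf("error removing %s: %v", indexPath, err)
		return err
	}
	// On POSIX filesystems os.Rename replaces an existing destination
	// atomically, which is why the explicit remove may be unnecessary.
	if err := os.Rename(tmpPath, indexPath); err != nil {
		log.Printf("error renaming %s to %s: %v", tmpPath, indexPath, err)
		return err
	}
	return nil
}

func main() {
	if err := replaceIndexFile("index.tmp", "index"); err != nil {
		os.Exit(1)
	}
}
```

If the rename-replaces-destination semantics hold on the target filesystem, dropping the remove entirely would sidestep the permission error, which is what the discussion below gets at.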
Awesome, @jwilder! Can't wait to try the nightly tomorrow / whenever it's merged :)
My issue originates from
Weirdly enough, this does not always seem to happen. My data directory lives on NFS, so maybe that is what is going on. According to the docs:
Thoughts?
Testing last night's build and it's looking more stable: no data loss from queries so far. Will keep you updated if anything changes.
@pkittenis I just merged #4530, which is a fairly important fix, so it'll be interesting to hear what happens with a build that has it. @KiNgMaR Not sure; you're saying we can just take out the remove because the rename will replace the old file? It seems really odd to get a permission-denied error there.
FWIW, I also saw permission-denied errors from one instance, which went away after I wiped its data.
I did not investigate too hard, and it has not recurred since. Last night's build has been stable so far with no data loss; I will test with #4530 included next.
@KiNgMaR @pkittenis Are you installing upgrades using the nightly packages or building manually? There may be a bug with permissions in the packaging script.
Fixed by #4555! Woohooo!! 💯 Still observing memory usage; I will report back in a new issue in case it keeps growing. @jwilder: I use the nightly RPM builds for CentOS 6. E.g.:
I can also confirm the issue appears resolved with the changes from #4530 included. In general, the latest nightly builds seem much more stable; thanks for your hard work.
The new storage engine looks very promising so far. Memory usage is still going up non-stop (writing over UDP exclusively), but slower than before. Overall the system seems more stable and under much less pressure.
However, after a few minutes, points that have been written and were previously showing up in queries disappear. All

WHERE time >= now() - 1h

queries only show data for approx. the last 2-10 minutes. This affects CQs and normal queries. It can even be reproduced using a simple

SELECT * FROM blah

query without any time constraints. The data returned by this type of query behaves as follows: new points are appended and show up. Then, every few minutes, the starting point (i.e. the timestamp of the first point returned) moves some minutes into the future.
P.S. Thanks everyone, you're doing a great job!