
cache maximum memory size exceeded #6109

Closed
yangfan876 opened this issue Mar 24, 2016 · 22 comments

@yangfan876

Hi,
When I restart InfluxDB and there are too many WAL files, InfluxDB exits. Logs as follows:

[cacheloader] 2016/03/24 14:51:50 reading file /data1/influxdb/wal/sysnoc/default/2/_00692.wal, size 10502077
[cacheloader] 2016/03/24 14:51:52 reading file /data1/influxdb/wal/sysnoc/default/2/_00693.wal, size 10489020
[cacheloader] 2016/03/24 14:51:53 reading file /data1/influxdb/wal/sysnoc/default/2/_00694.wal, size 10498512
[cacheloader] 2016/03/24 14:51:55 reading file /data1/influxdb/wal/sysnoc/default/2/_00695.wal, size 10501253
[cacheloader] 2016/03/24 14:51:56 reading file /data1/influxdb/wal/sysnoc/default/2/_00696.wal, size 10498537
[cacheloader] 2016/03/24 14:51:58 reading file /data1/influxdb/wal/sysnoc/default/2/_00697.wal, size 10503642
[cacheloader] 2016/03/24 14:51:59 reading file /data1/influxdb/wal/sysnoc/default/2/_00698.wal, size 10504904
[cacheloader] 2016/03/24 14:52:01 reading file /data1/influxdb/wal/sysnoc/default/2/_00699.wal, size 10488883
[cacheloader] 2016/03/24 14:52:03 reading file /data1/influxdb/wal/sysnoc/default/2/_00700.wal, size 10495194
[cacheloader] 2016/03/24 14:52:04 reading file /data1/influxdb/wal/sysnoc/default/2/_00701.wal, size 10505617
[cacheloader] 2016/03/24 14:52:06 reading file /data1/influxdb/wal/sysnoc/default/2/_00702.wal, size 10506820
[cacheloader] 2016/03/24 14:52:07 reading file /data1/influxdb/wal/sysnoc/default/2/_00703.wal, size 10504926
[cacheloader] 2016/03/24 14:52:09 reading file /data1/influxdb/wal/sysnoc/default/2/_00704.wal, size 10494123
run: open server: open tsdb store: [shard 2] cache maximum memory size exceeded

I think that when there are too many WAL files, the service should pause reading and flush some of the cache to disk instead of exiting.

@rossmcdonald
Contributor

@yangfan876 Did you convert your shard to the TSM storage engine recently? This may be due to incorrect permissions on the shards (preventing them from flushing at all). Can you run the following (assuming InfluxDB is running as the influxdb user):

sudo chown -R influxdb:influxdb /data1/influxdb

to see if that fixes it?

@mark-rushakoff
Contributor

You can also try increasing the value of cache-max-memory-size in your config file, at least as a temporary workaround. It defaults to 500MB (524288000).
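For reference, that setting lives in the [data] section of the InfluxDB config file. A minimal sketch (the value is the default quoted above, in bytes; the exact accepted syntax can differ between versions):

[data]
  # Maximum size the shard cache can grow to before writes start being rejected.
  cache-max-memory-size = 524288000   # 500MB, the default at the time of this issue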

@yangfan876
Author

@rossmcdonald Yes, the version of InfluxDB I'm testing is 0.11, where TSM is the default. And I'm sure '/data1/influxdb' belongs to the influxdb user:
drwxr-xr-x 6 influxdb influxdb 47 Mar 23 18:36 influxdb

@yangfan876
Author

@mark-rushakoff Yes, I can increase the value of cache-max-memory-size, but in my view, when the cache reaches its maximum memory size we should flush the cache data to disk and then keep reading WAL files.

@mark-rushakoff
Contributor

@yangfan876 try lowering cache-snapshot-memory-size. The code looks like it should check the cache's size against that value once per second, and if the current value is greater than the configuration value, write the cache to disk as a snapshot file.
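For illustration only (this is not the actual InfluxDB source, and all names here are made up), the once-per-second check described above boils down to a loop like this:

package main

import (
	"log"
	"time"
)

// cache is a stand-in for the engine's in-memory write cache.
type cache struct {
	size              int64 // bytes currently held in the cache
	snapshotThreshold int64 // the cache-snapshot-memory-size equivalent
}

// writeSnapshot stands in for flushing the cache contents to a TSM snapshot file.
func (c *cache) writeSnapshot() error {
	log.Printf("snapshotting %d bytes to disk", c.size)
	c.size = 0
	return nil
}

// checkLoop polls the cache size once per second and snapshots it whenever it
// exceeds the configured threshold, mirroring the behavior described above.
func checkLoop(c *cache, done <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			if c.size > c.snapshotThreshold {
				if err := c.writeSnapshot(); err != nil {
					log.Printf("snapshot failed: %v", err)
				}
			}
		}
	}
}

func main() {
	done := make(chan struct{})
	c := &cache{size: 30 << 20, snapshotThreshold: 25 << 20} // 30MB cached, 25MB threshold
	go checkLoop(c, done)
	time.Sleep(2 * time.Second) // let the loop run a couple of checks
	close(done)
}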

@yangfan876
Author

@mark-rushakoff I read the InfluxDB code; the storage engine starts a goroutine to check the cache's size, but look at the engine's Open method:
file:tsdb/engine/tsm1/engine.go
func (e *Engine) Open() error {
	e.done = make(chan struct{})
	e.Compactor.Cancel = e.done

	if err := os.MkdirAll(e.path, 0777); err != nil {
		return err
	}

	if err := e.cleanup(); err != nil {
		return err
	}

	if err := e.WAL.Open(); err != nil {
		return err
	}

	if err := e.FileStore.Open(); err != nil {
		return err
	}

	if err := e.reloadCache(); err != nil { // reloads the WAL before the cache-check goroutine starts
		return err
	}

	e.wg.Add(5)
	go e.compactCache() // starts the goroutine that checks the cache size
	go e.compactTSMFull()
	go e.compactTSMLevel(true, 1)
	go e.compactTSMLevel(true, 2)
	go e.compactTSMLevel(false, 3)

	return nil
}
So, as you can see, when I have a lot of WAL files to reload, compactCache has not started yet, and the cache can reach its maximum memory size. I've made a pull request from my other account.
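The fix that eventually landed (see the commit message referenced further down) addresses exactly this ordering: roughly, the limit is lifted while the WAL is replayed and restored afterwards. A minimal sketch of that idea, with made-up names rather than the real InfluxDB API:

package main

import "log"

// cache is a stand-in for the shard's write cache; a maxSize of 0 means "no limit".
type cache struct {
	maxSize int64
	size    int64
}

// reloadWAL stands in for replaying WAL segments into the cache at startup.
func reloadWAL(c *cache) error {
	c.size += 600 << 20 // pretend the WAL held ~600MB of points
	return nil
}

// openShard lifts the cache limit for the duration of the WAL replay so startup
// cannot fail with "cache maximum memory size exceeded", then restores the
// configured limit for normal operation.
func openShard(c *cache) error {
	limit := c.maxSize
	c.maxSize = 0
	defer func() { c.maxSize = limit }()
	return reloadWAL(c)
}

func main() {
	c := &cache{maxSize: 500 << 20}
	if err := openShard(c); err != nil {
		log.Fatal(err)
	}
	log.Printf("cache holds %d bytes, limit restored to %d", c.size, c.maxSize)
}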

@yangfan876
Author

@mark-rushakoff BTW, I don't think sleeping for 1 second between checks of the cache's size is a good idea. Maybe the interval could depend on the load of the service.

@sstarcher

Same issue on InfluxDB 0.10, seemingly at random. After running a fresh install of 0.10.0 for about 2 weeks, the issue came up on a restart. I increased cache-max-memory-size to 1GB and it is working now.

@rossmcdonald
Contributor

@sstarcher Did you upgrade from a previous version of InfluxDB? Or did you start with v0.10?

@sstarcher

@rossmcdonald No, I have only used 0.10.0 for this data, no other version.

@sstarcher

@rossmcdonald I upgraded from 0.10.0 to 0.11.0 and I still get the error, but it no longer crashes

March 30th 2016, 09:12:47.262   [store] 2016/03/30 13:12:47 Failed to open shard: 231: [shard 231] cache maximum memory size exceeded
March 30th 2016, 09:12:46.893   [cacheloader] 2016/03/30 13:12:46 reading file /data/db/wal/telegraf/default/231/_01465.wal, size 7906114
March 30th 2016, 09:12:46.502   [cacheloader] 2016/03/30 13:12:46 reading file /data/db/wal/telegraf/default/231/_01464.wal, size 10638834
March 30th 2016, 09:12:46.080   [cacheloader] 2016/03/30 13:12:46 reading file /data/db/wal/telegraf/default/231/_01463.wal, size 11191183
March 30th 2016, 09:12:45.654   [cacheloader] 2016/03/30 13:12:45 reading file /data/db/wal/telegraf/default/231/_01462.wal, size 10949076
March 30th 2016, 09:12:45.239   [cacheloader] 2016/03/30 13:12:45 reading file /data/db/wal/telegraf/default/231/_01461.wal, size 11130405
March 30th 2016, 09:12:44.729   [cacheloader] 2016/03/30 13:12:44 reading file /data/db/wal/telegraf/default/231/_01460.wal, size 11082699
March 30th 2016, 09:12:44.237   [cacheloader] 2016/03/30 13:12:44 reading file /data/db/wal/telegraf/default/231/_01459.wal, size 10987478
March 30th 2016, 09:12:43.835   [cacheloader] 2016/03/30 13:12:43 reading file /data/db/wal/telegraf/default/231/_01458.wal, size 10588884
March 30th 2016, 09:12:43.436   [cacheloader] 2016/03/30 13:12:43 reading file /data/db/wal/telegraf/default/231/_01457.wal, size 11138151
March 30th 2016, 09:12:42.997   [cacheloader] 2016/03/30 13:12:42 reading file /data/db/wal/telegraf/default/231/_01456.wal, size 11160262
March 30th 2016, 09:12:42.509   [cacheloader] 2016/03/30 13:12:42 reading file /data/db/wal/telegraf/default/231/_01455.wal, size 11168045
March 30th 2016, 09:12:42.076   [cacheloader] 2016/03/30 13:12:42 reading file /data/db/wal/telegraf/default/231/_01454.wal, size 10495132
March 30th 2016, 09:12:41.657   [cacheloader] 2016/03/30 13:12:41 reading file /data/db/wal/telegraf/default/231/_01453.wal, size 10688138
March 30th 2016, 09:12:41.171   [cacheloader] 2016/03/30 13:12:41 reading file /data/db/wal/telegraf/default/231/_01452.wal, size 11241832
March 30th 2016, 09:12:40.814   [cacheloader] 2016/03/30 13:12:40 reading file /data/db/wal/telegraf/default/231/_01451.wal, size 10552126
March 30th 2016, 09:12:40.431   [cacheloader] 2016/03/30 13:12:40 reading file /data/db/wal/telegraf/default/231/_01450.wal, size 10961726
March 30th 2016, 09:12:40.079   [cacheloader] 2016/03/30 13:12:40 reading file /data/db/wal/telegraf/default/231/_01449.wal, size 10660695
March 30th 2016, 09:12:39.632   [cacheloader] 2016/03/30 13:12:39 reading file /data/db/wal/telegraf/default/231/_01448.wal, size 10796586

jwilder added a commit that referenced this issue Apr 20, 2016
The cache max memory size is an approximate size and can prevent a
shard from loading at startup. This change disables the max size
at startup to prevent this problem and sets the limit back after
reloading.

Fixes #6109
jwilder added this to the 0.13.0 milestone Apr 20, 2016
mohit added a commit to graveyard/influxdb-service that referenced this issue May 12, 2016
@vilinski

vilinski commented Mar 13, 2022

The cache size increases until it reaches the max size (within a couple of days), and then InfluxDB accepts no points. What could be the cause?
InfluxDB 1.8.10-1 on Debian
{"error":"engine: cache-max-memory-size exceeded: (1073884708/1073741824)"}

@samhld
Contributor

samhld commented Mar 14, 2022

@vilinski does your write volume increase over that time? That's usually an error wrt write volume being too high at some point in time.

If not that, any other errors in your logs leading up to this?

@vilinski

vilinski commented Mar 15, 2022

TBH I don't know what wrt write volume is 😄

DB size on disk is growing, cache as well
(screenshot of a Grafana dashboard showing the growth)

The only log entries other than POST and GET requests look like this:

Mar 15 09:09:11 suite-ipc-10 influxd-systemd-start.sh[6698]: ts=2022-03-15T08:09:11.464301Z lvl=info msg="Cache snapshot (start)" log_id=0_EfZnhG000 engine=tsm1 trace_id=0_F9pQA0000 op_name=tsm1_cache_snapshot op_event=start
Mar 15 09:09:11 suite-ipc-10 influxd-systemd-start.sh[6698]: ts=2022-03-15T08:09:11.817628Z lvl=info msg="Snapshot for path written" log_id=0_EfZnhG000 engine=tsm1 trace_id=0_F9pQA0000 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/data/8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData/autogen/129 duration=353.343ms
Mar 15 09:09:11 suite-ipc-10 influxd-systemd-start.sh[6698]: ts=2022-03-15T08:09:11.817661Z lvl=info msg="Cache snapshot (end)" log_id=0_EfZnhG000 engine=tsm1 trace_id=0_F9pQA0000 op_name=tsm1_cache_snapshot op_event=end op_elapsed=353.373ms

@samhld
Contributor

samhld commented Mar 15, 2022

@vilinski "wrt" means "with regard to". This cache fills up when you are writing more data in (which means the cache) than you can effectively snapshot (take out of the cache). This means you are either writing too much data for your cache size (you can configure it to be larger) at some point in your workload...or your snapshots are too slow because your disk is too slow.

Make sure your cache is big enough for your workload. Make sure the memory on the node is large enough to increase your cache size if needed. Most importantly, make sure your disk is fast. SSDs!

@vilinski

Thanks for the explanations.
Well, this is already an SSD, and it works even with higher data rates. Writes are batched as recommended, every 1s or 10000 points at most. Shouldn't that fit into the default 25MB?
Maybe the more than 100 continuous queries that aggregate the data play a role? One per measurement.

@samhld
Contributor

samhld commented Mar 17, 2022

@vilinski try to find the tsm1_cache_snapshot logs and see if the time they're taking is increasing.

Also, can you share an example continuous query?

@samhld
Contributor

samhld commented Mar 17, 2022

@vilinski Another thing to look at is the metric snapshotCount in the _internal database. If that's not enabled, you can find it at the /debug/vars endpoint. Telegraf can collect all these metrics for you, and you can visualize them in Grafana.
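If the _internal database isn't enabled, a quick way to peek at those stats is to hit /debug/vars directly. A small sketch (assuming the default HTTP address localhost:8086; the exact key names can vary by version):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Fetch InfluxDB's expvar output and print any top-level entries whose
	// key mentions the TSM cache (snapshotCount lives under these entries).
	resp, err := http.Get("http://localhost:8086/debug/vars")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var vars map[string]json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
		log.Fatal(err)
	}

	for key, raw := range vars {
		if strings.Contains(key, "tsm1_cache") {
			fmt.Printf("%s: %s\n", key, raw)
		}
	}
}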

@vilinski

We are already collecting the metrics with Telegraf and have Grafana dashboards, like the one I posted above.
The influxdb_tsm1_cache_snapshotCount metric always has a value of 0 for all DBs and paths.
The continuous queries precalculate values for display in an application:

16ce202f-1a22-497d-a840-1e15fe5156fe_1m CREATE CONTINUOUS QUERY "16ce202f-1a22-497d-a840-1e15fe5156fe_1m" ON "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData" BEGIN SELECT mean(value) AS value, min(value) AS min, max(value) AS max, stddev(value) AS stddev INTO "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1m" FROM "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe" GROUP BY time(1m), * END
16ce202f-1a22-497d-a840-1e15fe5156fe_1h CREATE CONTINUOUS QUERY "16ce202f-1a22-497d-a840-1e15fe5156fe_1h" ON "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData" BEGIN SELECT mean(value) AS value, min(value) AS min, max(value) AS max, stddev(value) AS stddev INTO "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1h" FROM "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1m" GROUP BY time(1h), * END

Each such measurement has one tag with up to 40 different tag values, so it gets about 40 data points/s.

@vilinski

The logged snapshot writes show durations varying between 300 and 500 ms, but they are not increasing.

@samhld
Contributor

samhld commented Mar 17, 2022

@vilinski Well, that's good -- it indicates the cache snapshotting is keeping up. I'm not sure how else to identify this issue without reproducible steps. Perhaps starting a thread in the community Slack would help.
