OutOfDirectMemoryError errors causing linkerd to fail #1690

Closed
DukeyToo opened this Issue Nov 1, 2017 · 48 comments


DukeyToo commented Nov 1, 2017

Issue Type:

  • Bug report
  • Feature request

What happened:
Over a period of days, linkerd memory usage climbs until it reaches a maximum and linkerd starts throwing OutOfDirectMemoryError errors.

What you expected to happen:
OutOfDirectMemoryError errors should not happen

How to reproduce it (as minimally and precisely as possible):
Run linkerd 1.3.0 as a single instance per host, with namerd on separate hosts. Environment memory overrides: JVM_HEAP_MAX=1024M and JVM_HEAP_MIN=1024M. Let it run for days and monitor process memory usage; usage increases until errors start occurring.
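For reference, those memory overrides are just environment variables passed when launching linkerd, along these lines (a minimal sketch; the exec file name is illustrative):

JVM_HEAP_MIN=1024M JVM_HEAP_MAX=1024M ./linkerd-1.3.0-exec config/linkerd.yaml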

Memory usage chart (10/27 new VMs running 1.3.0, dips at the end are me restarting linkerd):
[screenshot: memory usage chart]

Stack trace:

linkerd[2410]: W 1101 14:10:26.714 UTC THREAD42: Unhandled exception in connection with /10.49.154.23:42176, shutting down connection
linkerd[2410]: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 1048576 byte(s) of direct memory (used: 1037041958, max: 1037959168)
linkerd[2410]: at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:618)
linkerd[2410]: at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:572)
linkerd[2410]: at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
linkerd[2410]: at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
linkerd[2410]: at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
linkerd[2410]: at io.netty.buffer.PoolArena.allocate(PoolArena.java:214)
linkerd[2410]: at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
linkerd[2410]: at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
linkerd[2410]: at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:181)
linkerd[2410]: at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:172)
linkerd[2410]: at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133)
linkerd[2410]: at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80)
linkerd[2410]: at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122)
linkerd[2410]: at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
linkerd[2410]: at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
linkerd[2410]: at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
linkerd[2410]: at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
linkerd[2410]: at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
linkerd[2410]: at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
linkerd[2410]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
linkerd[2410]: at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
linkerd[2410]: at java.lang.Thread.run(Unknown Source)

Anything else we need to know?:

Environment:

  • linkerd/namerd version, config files:
    1.3.0 linkerd and namerd. linkerd config:
admin:
  port: 9990
  ip: 0.0.0.0
routers:
- protocol: http
  label: microservices_prod
  dstPrefix: /http
  interpreter:
    kind: io.l5d.mesh
    dst: /$/inet/cwl-mesos-masters.service.consul/4182
    experimental: true
    root: /microservices-prod
  identifier:
    - kind: io.l5d.path
      segments: 3
    - kind: io.l5d.path
      segments: 2
    - kind: io.l5d.path
      segments: 1
  servers:
  - port: 4170
    ip: 0.0.0.0
  client:
    kind: io.l5d.global
    loadBalancer:
      kind: ewma
    failureAccrual:
      kind: io.l5d.consecutiveFailures
      failures: 5
    requeueBudget:
      minRetriesPerSec: 10
      percentCanRetry: 0.2
      ttlSecs: 10
  service:
    kind: io.l5d.global
    totalTimeoutMs: 10000
    responseClassifier:
      kind: io.l5d.http.retryableRead5XX
    retries:
      budget:
        minRetriesPerSec: 10
        percentCanRetry: 0.2
        ttlSecs: 10
- protocol: http
  label: proxy_by_host
  dstPrefix: /http
  interpreter:
    kind: io.l5d.mesh
    dst: /$/inet/cwl-mesos-masters.service.consul/4182
    experimental: true
    root: /default
  identifier:
    - kind: io.l5d.header.token
      header: Host
  servers:
  - port: 4141
    ip: 0.0.0.0
  - port: 80
    ip: 0.0.0.0
- protocol: http
  label: proxy_by_path3
  dstPrefix: /http
  interpreter:
    kind: io.l5d.mesh
    dst: /$/inet/cwl-mesos-masters.service.consul/4182
    experimental: true
    root: /default
  identifier:
    - kind: io.l5d.path
      segments: 3
  servers:
  - port: 4153
    ip: 0.0.0.0
- protocol: http
  label: proxy_by_path2
  dstPrefix: /http
  interpreter:
    kind: io.l5d.mesh
    dst: /$/inet/cwl-mesos-masters.service.consul/4182
    experimental: true
    root: /default
  identifier:
    - kind: io.l5d.path
      segments: 2
  servers:
  - port: 4152
    ip: 0.0.0.0
- protocol: http
  label: proxy_by_path1
  dstPrefix: /http
  interpreter:
    kind: io.l5d.mesh
    dst: /$/inet/cwl-mesos-masters.service.consul/4182
    experimental: true
    root: /default
  identifier:
    - kind: io.l5d.path
      segments: 1
  servers:
  - port: 4151
    ip: 0.0.0.0
telemetry:
- kind: io.l5d.influxdb
  • Platform, version, and config files (Kubernetes, DC/OS, etc):
    CentOS Linux release 7.4.1708 (Core)

  • Cloud provider or hardware configuration:
    Google Cloud

siggy commented Nov 1, 2017

Thanks for the detailed report @DukeyToo, we are investigating this.

amitsaha commented Nov 1, 2017

hi @DukeyToo do you mind sharing the specific metric you are tracking for the memory usage? Thanks :)

zackangelo commented Nov 1, 2017

@DukeyToo can you share how many CPUs are on the linkerd hosts?

DukeyToo commented Nov 2, 2017

@zackangelo there are 8 cpu cores on the hosts.

@amitsaha we're analyzing /proc data on Linux, RSS memory only (using https://github.com/influxdata/telegraf/tree/master/plugins/inputs/procstat). My assumption is that the 2G+ usage shown on the chart is approximately 1G heap + 1G direct memory used by Netty + some Metaspace. The second line of the stack trace shows the specifics of how much total direct memory was allowed/allocated and how much was used.
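(For a quick manual check outside telegraf, RSS can also be read straight from /proc; a rough sketch, assuming a single linkerd JVM on the host:)

pid=$(pgrep -f io.buoyant.linkerd.Main)   # linkerd's main class, as it appears in ps output
grep VmRSS "/proc/$pid/status"            # resident set size, in kB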

siggy commented Nov 3, 2017

Quick update for folks watching this issue. We have reproduced a leak and are actively working on a fix. To confirm the leak you are seeing is the same one we have identified, have a look at your open_streams metrics. If they grow over time, that is the leak.

amitsaha commented Nov 3, 2017

Hi @siggy thanks, which specific metric is this? I tried searching in the prometheus exported metrics, but can't seem to find it?

amitsaha commented Nov 3, 2017

Thanks @DukeyToo

zackangelo commented Nov 6, 2017

@DukeyToo are you seeing this exception in your linkerd logs?

Nov 03 17:36:55  WARN 1103 17:36:55.421 CDT finagle/netty4-6: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
Nov 03 17:36:55  java.io.IOException: Connection reset by peer
Nov 03 17:36:55  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
Nov 03 17:36:55  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
Nov 03 17:36:55  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
Nov 03 17:36:55  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
DukeyToo commented Nov 7, 2017

@zackangelo no, I cannot find that error in our logs; however, the old logs from when we last saw the error have rotated out, so it might have been there.

siggy commented Nov 16, 2017

@amitsaha We're looking for any metric ending in open_streams.

For example:

  • rt:client:stream:open_streams
  • rt:server:stream:open_streams
  • rt:io_l5d_mesh:stream:open_streams
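These can be pulled from linkerd's admin metrics endpoint; for example, using the admin port from the config above (a sketch):

curl -s http://localhost:9990/admin/metrics.json | tr ',' '\n' | grep open_streams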
siggy commented Nov 16, 2017

Quick update: we are still actively investigating this. We believe it may be related to our usage of netty, in relation to reference counting; see http://netty.io/wiki/reference-counted-objects.html for more details.

To get additional information, if folks can try setting the environment variable LOCAL_JVM_OPTIONS to -Dio.netty.leakDetection.level=paranoid -Dio.netty.leakDetection.maxRecords=500, it should give us more detail about the nature of this leak.
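For example, along these lines (a sketch; the exec name/path is illustrative):

export LOCAL_JVM_OPTIONS="-Dio.netty.leakDetection.level=paranoid -Dio.netty.leakDetection.maxRecords=500"
./linkerd-<version>-exec config/linkerd.yaml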

siggy commented Nov 27, 2017

@DukeyToo We have a fix up at #1711. Is it possible to test this branch against your use case?

DukeyToo commented Nov 30, 2017

@siggy sure, I will try

hawkw commented Nov 30, 2017

Hi @DukeyToo, I just wanted to let you know that you can now test a release candidate for 1.3.3 by using the buoyantio/linkerd:1.3.3-RC1-jdk Docker image, which should make testing your use case against the fix much easier.

DukeyToo commented Dec 5, 2017

@siggy @hawkw I took a build from master at https://github.com/linkerd/linkerd/tree/c6f0d2eaeecca80c60314e6f6cb852a31870877a but was unable to reproduce in our test environment, because it takes more than 24 hours of traffic until I can see the issue occurring - and our test environment linkerd instances are restarted every day by an automated process.

I did plot rt:io_l5d_mesh:stream:open_streams over time for various routers in our test environment, and it does trend upward each day, so if the bug is indicated by gradual increases in that metric, then it does not seem to be fixed for me.

Example (the drops are due to instances restarting as-scheduled):
[screenshot: open_streams chart]

I also looked for rt:client:stream:open_streams and rt:server:stream:open_streams in /admin/metrics.json, but did not find those as metrics.

hawkw commented Dec 5, 2017

Hi @DukeyToo, the gradual increase in the rt:{client,server}:stream:open_streams metrics is a symptom of an unrelated issue that only affects HTTP/2 routers, which are what track those metrics. As your configuration doesn't contain any H2 routers, it's correct for you not to see those metrics, since HTTP/1.1 is not a streaming protocol. I believe @siggy was mistaken when he told you to look at that metric. Sorry for any confusion!

DukeyToo commented Dec 7, 2017

I have captured some logs with LOCAL_JVM_OPTIONS=-Dio.netty.leakDetection.level=paranoid -Dio.netty.leakDetection.maxRecords=500 at https://gist.github.com/DukeyToo/582285e2b3603ce9468e4f2c2f21be75

siggy commented Dec 7, 2017

Thanks for the log @DukeyToo. Can you confirm what version of linkerd this was with?

hawkw commented Dec 7, 2017

@siggy, I believe in the logs that @DukeyToo posted, Linkerd reports itself as linkerd 1.3.2-SNAPSHOT. A few comments ago (https://gist.github.com/DukeyToo/582285e2b3603ce9468e4f2c2f21be75#file-linkerd-log-L8) @DukeyToo said he did a build of master as of revision c6f0d2e; I'm assuming that's what he's running currently?

DukeyToo commented Dec 8, 2017

Yes, that is correct

hawkw commented Jan 16, 2018

Hi @DukeyToo, following up on this issue: we've fixed a number of memory leaks since the version where this issue was observed. Are you still seeing this leak in the latest release, Linkerd 1.3.4?

chrismikehogan commented Jan 27, 2018

FWIW, we were seeing similar issues to @DukeyToo. We deployed 1.3.5 and the leak we were seeing was resolved.

wmorgan commented Jan 27, 2018

@chrismikehogan thanks! @DukeyToo, it would be great to have you try 1.3.4 or 1.3.5.

DukeyToo commented Jan 30, 2018

I am busy testing 1.3.4. It takes a few days to reproduce, so I'm hoping for the best at this stage. Should know soon.

wmorgan commented Jan 30, 2018

Thank you! Standing by eagerly.

DukeyToo commented Feb 2, 2018

I restarted linkerd this afternoon, before it reached the memory usage where we have seen this issue in the past ... so it will be a while longer before I know if it still happens.

If nothing else, it seems to be taking longer to get to the memory usage threshold where the bug occurred :)

hawkw commented Feb 3, 2018

That, at least, is good to hear!

DukeyToo commented Feb 9, 2018

This is not fixed. Took 5 days to reproduce, but it happened again.

Let me know if you need any more info. Metrics are logged to InfluxDB, so I can post whatever we gather that might be relevant.

wmorgan commented Feb 14, 2018

Bummer. We're going to allocate some time to dig into this later this month.

mebe commented Mar 2, 2018

We've run into something that looks identical. We are running Linkerd in front of an object storage system, and are blocked by this issue. We get io.netty.util.internal.OutOfDirectMemoryError with an identical stack trace after we've uploaded around 400-500 GB of data. We're running Linkerd with -Xms and -Xmx of 4 GB and -XX:MaxDirectMemorySize of 2 GB.

This is fairly straightforward to reproduce by running Minio in Docker and using the AWS CLI to repeatedly upload and delete stuff.

Let's start by bringing up the Minio container:

docker pull minio/minio
docker run -p 9000:9000 minio/minio server /data

Make a note of the AccessKey and SecretKey printed by this. You'll need those later.

Install AWS CLI, e.g. with the bundled installer:

curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
unzip awscli-bundle.zip
./awscli-bundle/install -b ~/bin/aws

Configure AWS CLI to use Minio:

mkdir ~/.aws

cat > ~/.aws/config <<EOF
[default]
s3 =
    max_concurrent_request = 100
    endpoint_url = http://minio
    multipart_threshold = 50MB
    multipart_chunksize = 10MB
EOF

cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = <THE-ACCESS-KEY-FROM-MINIO-HERE>
aws_secret_access_key = <THE-SECRET-KEY-FROM-MINIO-HERE>
EOF

Configure Linkerd to proxy to Minio:

cat > linkerd-minio.yaml <<EOF
admin:
  port: 9990

routers:
- protocol: http
  dtab: |
    /svc/minio => /$/inet/127.0.0.1/6060;
  httpAccessLog: logs/access.log
  maxRequestKB: 51200
  maxResponseKB: 51200
  servers:
  - port: 6007
    ip: 127.0.0.1
EOF

Start Linkerd:

LOCAL_JVM_OPTIONS="-XX:MaxDirectMemorySize=2G" JVM_HEAP_MIN=4G JVM_HEAP_MAX=4G linkerd linkerd-minio.yaml

Make AWS CLI use Linkerd as a proxy:

export HTTP_PROXY=http://127.0.0.1:6007

Create a bucket in Minio:

~/bin/aws s3 mb s3://five-gigs

Create some files to upload:

mkdir five-gigs; touch five-gigs/5gb-{1..10} && shred -n 1 -s 5G five-gigs/5gb-{1..10}

Loop upload and delete until it fails:

until [ $? -gt 0 ]; do ~/bin/aws --endpoint-url http://minio s3 rm --recursive s3://five-gigs/ && ~/bin/aws --endpoint-url http://minio s3 sync five-gigs s3://five-gigs; done

After about 400-500 GB (which takes a bit over half an hour for me), you should start seeing OutOfDirectMemoryErrors from Linkerd and the sync will fail. Linkerd won't recover from this until it's restarted.

I imagine the reproduction time will be shorter with a smaller -XX:MaxDirectMemorySize. With a 4 GB heap, heap usage doesn't seem to be a problem, and total GC time for the whole run stays within a couple of seconds.
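For example, the same launch line with a smaller cap (a sketch; 512M is just an illustrative value):

LOCAL_JVM_OPTIONS="-XX:MaxDirectMemorySize=512M" JVM_HEAP_MIN=4G JVM_HEAP_MAX=4G linkerd linkerd-minio.yaml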

I hope this helps! Let me know if you need any more information.

wmorgan commented Mar 3, 2018

@mebe thank you for a really detailed repro case!

chris-goffinet-ck commented Mar 3, 2018

@mebe which version of linkerd is this repro case for?

mebe commented Mar 3, 2018

chrisgoffinet commented Mar 7, 2018

Can you try adding this to LOCAL_JVM_OPTIONS:

-Djdk.nio.maxCachedBufferSize=262144

Another user was having similar leaks in the JVM, and after we provided this suggestion their memory growth stabilized.
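Appended to an existing LOCAL_JVM_OPTIONS, that would look something like this (a sketch):

export LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Djdk.nio.maxCachedBufferSize=262144"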

mebe commented Mar 8, 2018

I added -Djdk.nio.maxCachedBufferSize=262144 and removed the -XX:MaxDirectMemorySize. The result is that Linkerd still crashed with the same error message after a bit shy of 3 TB had been uploaded. This took about 3.5 hours at a constant transfer rate of around 225-230 MB/s.

W 0307 19:18:57.801 UTC THREAD55: Unhandled exception in connection with /127.0.0.1:51350, shutting down connection
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 1048576 byte(s) of direct memory (used: 2146435079, max: 2147483648)

The RSS at the time of the crash was about 6972 MB.

[chart: linkerd S3 workload RSS until OOM]

I left Linkerd running with no traffic after the OOM. The RSS kept creeping up for about 3.5 hours, then suddenly spiked by about 760 MB over the course of an hour. The RSS stayed at that level for an hour before finally making another 40 MB jump to 7787 MB, and it has stayed at this level ever since.

The RSS didn't decrease at any point.

[chart: linkerd S3 workload RSS after OOM]

adleong commented Mar 22, 2018

@mebe Thanks for the very excellent reproduction instructions! I've tried to dig into this, but I'm having trouble getting it to run long enough to exhibit the issue. I set MaxDirectMemorySize to 512M in an attempt to reproduce the issue faster. I had to use only 2 of the 5G files so that Minio wouldn't run out of space on my laptop. I'm able to run the upload for a while, but it eventually stops when Linkerd encounters:

E 0322 10:23:07.654 PDT THREAD19 TraceId:17040cea6d9f4381: service failure: com.twitter.finagle.ChannelWriteException: com.twitter.finagle.ChannelClosedException: null at remote address: /127.0.0.1:9000. Remote Info: Not Available from service: $/inet/127.0.0.1/9000. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /127.0.0.1:9000, Downstream label: $/inet/127.0.0.1/9000, Trace Id: 17040cea6d9f4381.02e1c836ab7aaf92<:17040cea6d9f4381

indicating that Minio has closed the connection to Linkerd for some reason. This results in Linkerd responding to the client with a 502, which stops the upload.

What I have observed so far is that Linkerd memory usage shoots up to about 3.3 GB relatively quickly but then seems to stay steady there. I have not yet been able to reproduce either unbounded memory growth or an OutOfDirectMemoryError.

mebe commented Mar 26, 2018

@adleong,

I could provide you SSH access to an EC2 instance where I've already successfully reproduced the problem. If you think that's useful, let me know, and I'll set it up.

Thanks!

chrisgoffinet commented Mar 27, 2018

@mebe I'd be happy to take a look as well, feel free to email me at chris@threecomma.io. Logging into the machine would be super helpful

mebe commented Apr 4, 2018

I've been trying to reproduce this on a clean Packet.net c1.small.x86 (aka Type 1), but there it seems to work fine. The RSS quickly settles around 3.5 GB and doesn't really budge one way or the other.

There is a big difference between the hardware of the c1.small.x86 and what we're using internally:

  • c1.small.x86: 4 physical cores, 32 GB of RAM and 2 x SSD in Raid 1
  • Internal: 72 physical cores, 1.5 TB of RAM and 24 x SAS HD in Raid 6

I'm currently running the test on the internal server with CPU count limited to 4 with taskset. Initial results are promising.
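(Roughly like this; a sketch, with an illustrative exec name:)

taskset -c 0-3 ./linkerd-1.3.6-exec config/linkerd.yaml   # pin linkerd to 4 CPUs (cores 0-3)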

wmorgan commented Apr 4, 2018

@mebe That's interesting. Let us know what you find out with your tests on the internal server.

chrisgoffinet commented Apr 4, 2018

@mebe can you confirm you are using the same AMI image as production? I want to rule out any glibc differences. I too see it settle around 3.5 GB. Also, you should be running with -XX:+AlwaysPreTouch so that the JVM heap memory gets touched up front and RSS shows the actual usage; it will be much easier to watch.
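For example (a sketch):

export LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -XX:+AlwaysPreTouch"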

mebe commented Apr 5, 2018

@chrisgoffinet these are both bare metal instances from different providers, so the images aren't identical. The 4-core is running CentOS 7.4.1708 and the 72-core is running CentOS 7.2.1511. Both have glibc 2.17-196.

FWIW, the instance restricted to 4 cores on the 72-core machine has now been running continuously for 8 hours, where it previously crashed in about 3.5 hours.

chrisgoffinet commented Apr 6, 2018

@mebe I was able to reproduce your issue. Before I do a full write-up on what's going on, could you try applying the changes below and report back the results? When running linkerd, do not use taskset.

LOCAL_JVM_OPTIONS="-XX:MaxDirectMemorySize=2G -XX:+AlwaysPreTouch"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.threadLocalDirectBufferSize=0 -Djdk.nio.maxCachedBufferSize=262144"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.recycler.maxDelayedQueuesPerThread=4"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.recycler.maxCapacity=4096"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.allocator.numHeapArenas=4"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.allocator.numDirectArenas=4"

export JVM_HEAP_MIN=4G
export JVM_HEAP_MAX=4G
export MALLOC_ARENA_MAX=2
export MALLOC_MMAP_THRESHOLD_=131072
export MALLOC_TRIM_THRESHOLD_=131072
export MALLOC_TOP_PAD_=131072
export MALLOC_MMAP_MAX_=65536
chrisgoffinet commented Apr 6, 2018

Also, please paste the output of ps aux | grep linkerd so we can confirm the options above ended up getting passed properly, for sanity's sake. Thank you

mebe commented Apr 6, 2018

@chrisgoffinet here's a chart of the RSS after about 7 hours with these settings:

[chart: RSS with @chrisgoffinet's settings]

After about 6 hours, the RSS seems to have stabilized at around 5080 MB.

I started Linkerd with a script like this:

LOCAL_JVM_OPTIONS="-XX:MaxDirectMemorySize=2G -XX:+AlwaysPreTouch"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.threadLocalDirectBufferSize=0 -Djdk.nio.maxCachedBufferSize=262144"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.recycler.maxDelayedQueuesPerThread=4"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.recycler.maxCapacity=4096"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.allocator.numHeapArenas=4"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.allocator.numDirectArenas=4"
export LOCAL_JVM_OPTIONS

export JVM_HEAP_MIN=4G
export JVM_HEAP_MAX=4G
export MALLOC_ARENA_MAX=2
export MALLOC_MMAP_THRESHOLD_=131072
export MALLOC_TRIM_THRESHOLD_=131072
export MALLOC_TOP_PAD_=131072
export MALLOC_MMAP_MAX_=65536

./linkerd-1.3.6-exec -log.append='true' -log.output='logs/linkerd.log' -log.rollPolicy='Daily' config/linkerd.yaml

ps aux | grep linkerd output:

/usr/bin/java -XX:+PrintCommandLineFlags -Djava.net.preferIPv4Stack=true -Dsun.net.inetaddr.ttl=60 -Xms4G -Xmx4G -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+CMSClassUnloadingEnabled -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:-TieredCompilation -XX:+UseStringDeduplication -Dcom.twitter.util.events.sinkEnabled=false -Dorg.apache.thrift.readLength=10485760 -XX:MaxDirectMemorySize=2G -XX:+AlwaysPreTouch -Dio.netty.threadLocalDirectBufferSize=0 -Djdk.nio.maxCachedBufferSize=262144 -Dio.netty.recycler.maxDelayedQueuesPerThread=4 -Dio.netty.recycler.maxCapacity=4096 -Dio.netty.allocator.numHeapArenas=4 -Dio.netty.allocator.numDirectArenas=4 -cp ./linkerd-1.3.6-exec -server io.buoyant.linkerd.Main -log.append=true -log.output=logs/linkerd.log -log.rollPolicy=Daily config/linkerd.yaml
chrisgoffinet commented Apr 6, 2018

@mebe Could you add this to LOCAL_JVM_OPTIONS as well and report back the results:

-Dcom.twitter.finagle.netty4.numWorkers=4 (I missed that Finagle lazily defaults to numCores() * 2 workers)
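Appended to the option list from my earlier comment, that would be (a sketch):

LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dcom.twitter.finagle.netty4.numWorkers=4"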

Thank you

chrisgoffinet commented Apr 9, 2018

I will open a new PR with optimal settings.

tl;dr:

  • memory fragmentation due to glibc
  • netty and finagle tunables are required to bound memory growth; this gets especially bad on systems with many CPU cores because of the defaults finagle and netty use for their pool allocator

We have a couple of issues going on that I think I've been able to nail down. To rule out a memory leak, I used jemalloc + jeprof to trace where memory was being allocated outside the JVM. I was able to confirm there were no leaks occurring, and suspected that what we have been seeing was malloc heap fragmentation. But it became a bit more complicated, since netty implements its own allocator while the underlying malloc() still comes from glibc (more on that later).

The first issue is that finagle by default will use numCores() * 2 worker threads. So in @mebe's case, their large machine meant 72 cores * 2 = 144 finagle threads. -Dcom.twitter.finagle.netty4.numWorkers is a way to override this. The tricky part was that finagle does not create those threads on startup; they are created lazily, and even after applying most of these settings it took me a while to track down why I was still seeing some slow growth, which turned out to be due to that lazy thread creation.

In netty 4.x, they implemented a pooled allocator, which is based on a variant of jemalloc using buddy and slab allocation (https://netty.io/wiki/new-and-noteworthy-in-4.0.html).

Out of the box, netty and finagle will set io.netty.allocator.numHeapArenas and io.netty.allocator.numDirectArenas to numCores() * 2. This is to reduce per-thread contention when accessing memory. Whatever value we set numWorkers to above, we need to set those arena counts to match.
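In other words, if the worker count is capped, the arena counts should be capped along with it; a sketch using the value 4 from the settings above (illustrative, not a universal recommendation):

LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dcom.twitter.finagle.netty4.numWorkers=4"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.allocator.numHeapArenas=4"
LOCAL_JVM_OPTIONS="$LOCAL_JVM_OPTIONS -Dio.netty.allocator.numDirectArenas=4"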

glibc fragmentation:

Linkerd with Docker (which uses Debian Jessie with glibc 2.24) and most Linux distros are on glibc >= 2.10, where glibc introduced per-thread arenas. Feel free to read the malloc scalability section at https://udrepper.livejournal.com/20948.html.

You can adjust this behavior by setting MALLOC_ARENA_MAX, which caps the number of arenas the process can use, instead of the count being calculated from the number of running threads (the default). I've found MALLOC_ARENA_MAX=2 to be acceptable, and various other projects use this value without major performance issues.

Netty 4.x changes:

On Feb 12, an issue was opened against Netty (and fixed in 4.1.22) where it was caching DirectByteBuffer objects (up to 64k of them), along with the recycler, which may cache up to 32k objects. The PRs below fix this in 4.1.22, and thankfully these caches are already tunable, so we can adjust them by setting the following properties:

-Dio.netty.threadLocalDirectBufferSize=0
-Dio.netty.recycler.maxCapacity=4096

netty/netty#7704
netty/netty#7701

Here are some graphs I put together tracking RSS before and after.

Default Values

[chart: linkerd-defaults]

Tuned values
[chart: linkerd-arena]

mebe commented Apr 9, 2018

I ran my S3 stress test over the weekend for a bit shy of two days with these latest switches (including netty4.numWorkers=4). The results look very promising:

[chart: RSS with netty4.numWorkers=4]

There's a sudden spike (literally between two data points in my measurements) of 150 MB at 15 hours into the test, but otherwise the memory usage looks practically stable. I think we can consider the issue solved.

Great work, @chrisgoffinet!

chrisgoffinet added a commit to chrisgoffinet/linkerd that referenced this issue Apr 9, 2018

Tune for memory fragmentation and more optimal netty/finagle settings
* Added FINAGLE_WORKERS option to linkerd/namerd to tune the number of finagle workers, with a default of (eight) workers. If you send small sized requests
and need higher throughput, increase the # of workers.

* Tune netty settings to:

-Dio.netty.threadLocalDirectBufferSize=0
-Dio.netty.recycler.maxCapacity=4096

Per netty/netty#7704 and netty/netty#7701

* Support using MALLOC_ARENA_MAX = 2 for glibc >= 2.10, which switched to a per-thread memory pool

Fixes: linkerd#1690

Signed-off-by: Chris Goffinet <chris@threecomma.io>

adleong added a commit that referenced this issue Apr 11, 2018

Tune for memory fragmentation and more optimal netty/finagle settings (#1889)

* Added FINAGLE_WORKERS option to linkerd/namerd to tune the number of finagle workers, with a default of (eight) workers. If you send small sized requests
and need higher throughput, increase the # of workers.

* Tune netty settings to:

-Dio.netty.threadLocalDirectBufferSize=0
-Dio.netty.recycler.maxCapacity=4096

Per netty/netty#7704 and netty/netty#7701

* Support using MALLOC_ARENA_MAX = 2 for glibc >= 2.10, which switched to a per-thread memory pool

Fixes: #1690

Signed-off-by: Chris Goffinet <chris@threecomma.io>

adleong removed this from Bug in Linkerd Roadmap Apr 12, 2018
