
Server crashes after 1-4 hours of workload #10950

Closed
Dwynr opened this issue Nov 22, 2020 · 14 comments · Fixed by #10951

@Dwynr

Dwynr commented Nov 22, 2020

Expected Behavior

The server should not crash.

Current Behavior

After 1-4 hours of Put/Get workload, MinIO crashes.

Possible Solution

/

Steps to Reproduce (for bugs)

  1. Start minio in standalone mode
  2. Wait

Context

Maybe this is related to my setup having about 50,000 buckets with 50 million objects, but I'm not sure.

Your Environment

  • Version used (minio --version): minio version RELEASE.2020-11-19T23-48-16Z
  • Server setup and configuration: Standalone
  • Operating System and version (uname -a): Linux de1cluster1 4.15.0-124-generic #127-Ubuntu SMP Fri Nov 6 10:54:43 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

I attached the log files.

log.txt

@harshavardhana
Member

Looks like we found the problem:

fatal error: concurrent map iteration and map write

goroutine 12723471 [running]:
runtime.throw(0x1e1f2de, 0x26)
        runtime/panic.go:1116 +0x72 fp=0xc01cf76808 sp=0xc01cf767d8 pc=0x437952
runtime.mapiternext(0xc01cf76980)
        runtime/map.go:853 +0x554 fp=0xc01cf76888 sp=0xc01cf76808 pc=0x410db4
github.com/minio/minio/cmd.(*dataUsageCache).flatten(0xc01cf76ea0, 0x2d80645f418, 0x303873, 0x2b629, 0x248c5, 0x2b3985, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/minio/minio/cmd/data-usage-cache.go:302 +0xe9 fp=0xc01cf769f0 sp=0xc01cf76888 pc=0x15869a9
github.com/minio/minio/cmd.(*dataUsageCache).dui(0xc01cf76ea0, 0x1dd4216, 0x1, 0xc020cb2000, 0x18508, 0x1a666, 0x0, 0x0, 0x0, 0x0, ...)
        github.com/minio/minio/cmd/data-usage-cache.go:213 +0xf8 fp=0xc01cf76b98 sp=0xc01cf769f0 pc=0x1585858
github.com/minio/minio/cmd.(*erasureServerSets).CrawlAndGetDataUsage.func2.1()
        github.com/minio/minio/cmd/erasure-server-sets.go:370 +0x35d fp=0xc01cf76d68 sp=0xc01cf76b98 pc=0x179b8fd
github.com/minio/minio/cmd.(*erasureServerSets).CrawlAndGetDataUsage.func2(0xc003e67100, 0xc00cb0ba00, 0xc01110cf60, 0xc020cb2000, 0x18508, 0x1a666, 0xc011107270, 0xc0181591a0, 0xc011107290,
 0xc000bd0020)
        github.com/minio/minio/cmd/erasure-server-sets.go:389 +0x18c fp=0xc01cf76f90 sp=0xc01cf76d68 pc=0x179bc2c
runtime.goexit()
        runtime/asm_amd64.s:1374 +0x1 fp=0xc01cf76f98 sp=0xc01cf76f90 pc=0x470701
created by github.com/minio/minio/cmd.(*erasureServerSets).CrawlAndGetDataUsage
        github.com/minio/minio/cmd/erasure-server-sets.go:348 +0x8af
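
The fatal error above is the Go runtime detecting an unsynchronized map: one goroutine is ranging over the usage cache while another writes to it. The sketch below shows that failure class and the usual fix of guarding the map with a sync.RWMutex; the type and field names here are illustrative, not MinIO's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// dataUsageEntry stands in for the per-path accounting a crawler keeps;
// the name is hypothetical, chosen to echo the stack trace above.
type dataUsageEntry struct{ Objects int }

// usageCache guards its map with an RWMutex so one goroutine can
// iterate (flatten) while another updates entries.
type usageCache struct {
	mu    sync.RWMutex
	cache map[string]dataUsageEntry
}

// update replaces the entry for a path under the write lock.
func (c *usageCache) update(path string, objects int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[path] = dataUsageEntry{Objects: objects}
}

// flatten sums all entries; taking the read lock is what prevents the
// "concurrent map iteration and map write" fatal error.
func (c *usageCache) flatten() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	total := 0
	for _, e := range c.cache {
		total += e.Objects
	}
	return total
}

func main() {
	c := &usageCache{cache: make(map[string]dataUsageEntry)}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.update(fmt.Sprintf("bucket-%d", i), 1)
			_ = c.flatten() // safe to run alongside writers
		}(i)
	}
	wg.Wait()
	fmt.Println(c.flatten()) // 100
}
```

Without the locks, running this under `go run -race` (or just long enough under load, as here) reproduces the same class of crash.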

@Dwynr
Author

Dwynr commented Nov 22, 2020

Yep, that's what I was looking at as well. I tried disabling the usage crawler, but it did not help (maybe I did it wrong). I upgraded the server binary today, and ever since then it started crashing.

@harshavardhana
Member

> Yep, that's what I was looking at as well. I tried disabling the usage crawler, but it did not help (maybe I did it wrong). I upgraded the server binary today, and ever since then it started crashing.

Disabling is simple: export MINIO_DISK_USAGE_CRAWL_ENABLE=off @Dwynr
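
For the record, the toggle is read from the environment at startup, so it has to be exported before launching (or relaunching) the server; a minimal sketch:

```shell
# Disable the disk-usage crawler; the variable must be present in the
# environment of the minio server process, so restart it after setting this.
export MINIO_DISK_USAGE_CRAWL_ENABLE=off

# Verify the variable is set as expected:
echo "$MINIO_DISK_USAGE_CRAWL_ENABLE"
```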

@harshavardhana
Member

Can you paste your command line?

@Dwynr
Author

Dwynr commented Nov 22, 2020

Okay let me try disabling it then.

MINIO_BROWSER=off MINIO_STORAGE_CLASS_STANDARD=EC:2 MINIO_STORAGE_CLASS_RRS=EC:2 MINIO_ACCESS_KEY=accessKey MINIO_SECRET_KEY=secretKey minio server http://host/minio/data{1...8}

@harshavardhana
Member

> MINIO_BROWSER=off MINIO_STORAGE_CLASS_STANDARD=EC:2 MINIO_STORAGE_CLASS_RRS=EC:2 MINIO_ACCESS_KEY=accessKey MINIO_SECRET_KEY=secretKey minio server http://host/minio/data{1...8}

Why are you running a single-host system in this manner?

http://host/minio/data{1...8}

You can simply run minio server /minio/data{1...8}

@Dwynr
Author

Dwynr commented Nov 22, 2020

Yea, I'm pretty new to MinIO and I wanted to expand the cluster sooner or later. I figured I needed to run it this way (and just add more hosts to the command line later).

@harshavardhana
Member

> Yea, I'm pretty new to MinIO and I wanted to expand the cluster sooner or later. I figured I needed to run it this way (and just add more hosts to the command line later).

You cannot expand in this manner @Dwynr - a standalone system cannot be converted to multiple nodes this way; you will not get the distribution, i.e. erasure-coded objects wouldn't be spread out across nodes.

If you want to expand, you need to set up distributed mode first and then copy your content from your current cluster to the new cluster using tools like mc mirror or aws s3 sync.

@harshavardhana
Member

Can you try using minio server /minio/data{1...8} without disabling crawler and see if it fails? @Dwynr

@Dwynr
Author

Dwynr commented Nov 22, 2020

> Yea, I'm pretty new to MinIO and I wanted to expand the cluster sooner or later. I figured I needed to run it this way (and just add more hosts to the command line later).

> You cannot expand in this manner @Dwynr - a standalone system cannot be converted to multiple nodes this way; you will not get the distribution, i.e. erasure-coded objects wouldn't be spread out across nodes.
>
> If you want to expand, you need to set up distributed mode first and then copy your content from your current cluster to the new cluster using tools like mc mirror or aws s3 sync.

Thanks for your help! I wanted to create new nodes for the cluster anyway. So are you saying I should start a new cluster on 2 nodes in distributed mode and then just mirror all data from the old one? And then I can expand it as needed?

> Can you try using minio server /minio/data{1...8} without disabling crawler and see if it fails? @Dwynr

I will now. Will report back if it crashes again. Thanks for your help again. Much appreciated.

harshavardhana added a commit to harshavardhana/minio that referenced this issue Nov 22, 2020
this is needed to avoid initializing notification peers
that can lead to races in many sub-systems

fixes minio#10950
@harshavardhana harshavardhana self-assigned this Nov 22, 2020
harshavardhana added a commit to harshavardhana/minio that referenced this issue Nov 23, 2020
this is needed to avoid initializing notification peers
that can lead to races in many sub-systems

fixes minio#10950
@Dwynr
Author

Dwynr commented Nov 23, 2020

So the server ran fine for almost a day with the crawler disabled and your command-line suggestions, but now I get:

runtime: failed to create new OS thread (have 10802 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x1dde293, 0x9)
runtime/panic.go:1116 +0x72
runtime.newosproc(0xc0476c1c00)
runtime/os_linux.go:161 +0x1c5
runtime.newm1(0xc0476c1c00)
runtime/proc.go:1837 +0xdd
runtime.newm(0x0, 0xc000061800, 0x2a31)
runtime/proc.go:1816 +0x9b
runtime.startm(0xc000061800, 0x3252600)
runtime/proc.go:1973 +0xc9
runtime.handoffp(0xc000061800)
runtime/proc.go:2001 +0x52
runtime.retake(0x52f20429996, 0xc000050000)
runtime/proc.go:4826 +0x168
runtime.sysmon()
runtime/proc.go:4734 +0x1b1
runtime.mstart1()
runtime/proc.go:1172 +0xc8
runtime.mstart()
runtime/proc.go:1137 +0x6e

I have the open-files limit and user-processes limit set more than high enough on the machine:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127352
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 640000
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I didn't change anything (still the same workload).

harshavardhana added a commit that referenced this issue Nov 23, 2020

this is needed to avoid initializing notification peers
that can lead to races in many sub-systems

fixes #10950
@harshavardhana
Member

> So the server ran fine for almost a day with the crawler disabled and your command-line suggestions, but now I get:
>
> runtime: failed to create new OS thread (have 10802 already; errno=11)
> runtime: may need to increase max user processes (ulimit -u)
> fatal error: newosproc
> [...]
>
> I have the open-files limit and user-processes limit set more than high enough on the machine. I didn't change anything (still the same workload).

As the error message indicates, you need to increase ulimit -u.

@Dwynr
Author

Dwynr commented Nov 23, 2020

Well, the limit for processes is already at max user processes (-u) 640000. That should be enough, no?

@harshavardhana
Member

harshavardhana commented Nov 23, 2020

> Well, the limit for processes is already at max user processes (-u) 640000. That should be enough, no?

According to the error it's not @Dwynr - it looks like you are co-hosting this with other applications, which are eating into the per-user process limit on the system.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 1, 2022