
[BUG] ftruncate sometimes leaves a chunk with no valid copies #456

Open
njaard opened this issue Oct 5, 2021 · 4 comments

Comments

@njaard

njaard commented Oct 5, 2021

Have you read through available documentation, open Github issues and Github Q&A Discussions?

Yes

System information

Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).

Debian packages from moosefs.com version 3.0.116.

Operating system (distribution) and kernel version.

Debian bullseye, 5.10.0-8

Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients.

4 chunkservers and 1Gbit or 10Gbit ethernet between them. The same machines (plus two more) have the clients and one of them has the master.

The filesystems are by and large xfs with a smaller number that are ext4.

The chunkservers often report connect to X.X.X.X:9422 failed, error: ETIMEDOUT (Operation timed out)

The machines are all quite busy, but the disk drives themselves are exclusively used by moosefs-chunkserver.

How much data is tracked by moosefs master (order of magnitude)?

  • All fs objects: 4M
  • Total space: 600TB
  • Free space: 19TB
  • RAM used: 2.7GiB
  • last metadata save duration: 7.4s

Describe the problem you observed.

This happens rarely: occasionally, a file (potentially with concurrent readers) gets ftruncated, and afterwards its last chunk shows this:

        chunk 18647: 0000000023348B3C_00000311 / (id:590646076 ver:785)
                copy 1: X.X.X.242:9422 (status:WRONG VERSION)
                copy 2: X.X.X.245:9422 (status:INVALID)
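
For readers matching this output against the master log further down: the chunk name appears to be the chunk id and version in hexadecimal, joined by an underscore, while the values in parentheses are the same numbers in decimal. A minimal sketch of that conversion, assuming this naming scheme (the helper `parse_chunk_name` is hypothetical, not part of MooseFS):

```python
def parse_chunk_name(name: str) -> tuple:
    """Split 'CHUNKID_VERSION' (both hexadecimal) into decimal (id, version).

    Assumption: the name is exactly two hex fields separated by one underscore,
    as in the chunk listing quoted in this issue.
    """
    chunk_hex, version_hex = name.split("_")
    return int(chunk_hex, 16), int(version_hex, 16)


chunk_id, version = parse_chunk_name("0000000023348B3C_00000311")
print(chunk_id, version)  # 590646076 785
```

The result agrees with `id:590646076 ver:785` in the listing, so the log entries for `0000000023348B3C` refer to this same chunk.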

Can you reproduce it? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

I have one of these incidents every couple of days, very sporadically. There are around 100 or so truncates done on these files every minute. The truncates generally happen immediately after an append on the file, which is to say, I add some data to the end of the file and then immediately remove it.
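
The access pattern described above could be sketched roughly as follows. This is not the reporter's actual workload, which is not shown in the issue: the function name, payload size, and iteration count are assumptions for illustration, and the demo runs against a local temporary file, so it only demonstrates the append-then-ftruncate sequence and will not trigger the bug. Reproducing it would presumably require pointing `path` at a file on a MooseFS mount under load, with concurrent readers.

```python
import os
import tempfile


def append_then_truncate(path: str, payload: bytes, iterations: int) -> int:
    """Append payload to the file, then immediately ftruncate it back off.

    Returns the final file size, which should equal the starting size.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        for _ in range(iterations):
            size_before = os.fstat(fd).st_size
            os.write(fd, payload)           # add data to the end of the file
            os.ftruncate(fd, size_before)   # and remove it again right away
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


# Demo on a local temporary file (hypothetical stand-in for a MooseFS path).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"existing data")
final_size = append_then_truncate(tmp.name, b"x" * 4096, iterations=100)
os.unlink(tmp.name)
```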

Include any warning/errors/backtraces from the system logs.

Here's the moosefs-master syslog:

Oct 05 21:25:12 vogar mfsmaster[10840]: (X.X.X.242:9422) chunk: 0000000023348B35 truncate status: Operation not completed
Oct 05 21:25:13 vogar mfsmaster[10840]: (X.X.X.242:9422) chunk: 0000000023348B25 truncate status: Operation not completed
Oct 05 21:25:21 vogar mfsmaster[10840]: (X.X.X.242:9422) chunk: 00000000233489DE truncate status: Operation not completed
Oct 05 21:25:27 vogar mfsmaster[10840]: (X.X.X.242:9422) chunk: 0000000023348B3C truncate status: Operation not completed
Oct 05 21:25:31 vogar mfsmaster[10840]: (X.X.X.242:9422) chunk: 0000000023348ADA truncate status: Operation not completed
Oct 05 21:25:31 vogar mfsmaster[10840]: (X.X.X.242:9422) chunk: 0000000023348A93 truncate status: Operation not completed
Oct 05 21:25:31 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000314: chunk in middle of operation TRUNCATE, but no chunk server is busy - finish operation
Oct 05 21:25:31 vogar mfsmaster[10840]: (X.X.X.245:9422) chunk: 0000000023348B3C truncate status: Operation not completed
Oct 05 21:25:31 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000313: chunk in middle of operation TRUNCATE, but no chunk server is busy - finish operation
Oct 05 21:25:31 vogar mfsmaster[10840]: (X.X.X.245:9422) chunk: 0000000023348B3C truncate status: Operation not completed
Oct 05 21:25:32 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000312: chunk in middle of operation TRUNCATE, but no chunk server is busy - finish operation
Oct 05 21:25:32 vogar mfsmaster[10840]: (X.X.X.245:9422) chunk: 0000000023348B3C truncate status: Operation not completed
Oct 05 21:25:32 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000311: chunk in middle of operation TRUNCATE, but no chunk server is busy - finish operation
Oct 05 21:25:32 vogar mfsmaster[10840]: (X.X.X.245:9422) chunk: 0000000023348B3C truncate status: Operation not completed
Oct 05 21:25:33 vogar mfsmaster[10840]: (X.X.X.242:9422) chunk: 0000000023348B3C truncate status: Operation not completed
Oct 05 21:25:33 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000310: chunk in middle of operation TRUNCATE, but no chunk server is busy - finish operation
Oct 05 21:25:33 vogar mfsmaster[10840]: (X.X.X.245:9422) chunk: 0000000023348B3C truncate status: Operation not completed
Oct 05 21:25:34 vogar mfsmaster[10840]: (X.X.X.242:9422) chunk: 0000000023348B50 truncate status: Operation not completed
Oct 05 21:25:34 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000311: chunk in middle of operation TRUNCATE, but no chunk server is busy - finish operation
Oct 05 21:25:34 vogar mfsmaster[10840]: chunk 0000000023348B3C has only copies with wrong versions (1) - please repair it manually
Oct 05 21:25:34 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000311 - invalid copy on (X.X.X.245 - ver:00000000)
Oct 05 21:25:34 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000311 - copy with wrong version on (X.X.X.242 - ver:00000311)
Oct 05 21:25:34 vogar mfsmaster[10840]: (X.X.X.245:9422) chunk: 0000000023348B3C truncate status: Wrong chunk version
Oct 05 21:25:35 vogar mfsmaster[10840]: chunk 0000000023348B3C has only copies with wrong versions (1) - please repair it manually
Oct 05 21:25:35 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000311 - invalid copy on (X.X.X.245 - ver:00000000)
Oct 05 21:25:35 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000311 - copy with wrong version on (X.X.X.242 - ver:00000311)
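
For anyone checking their own master log for the same symptom, a minimal sketch that collects the chunk ids flagged with "has only copies with wrong versions". The regular expression is based only on the log lines quoted above and is an assumption about the message format in other MooseFS versions; `chunks_needing_repair` is a hypothetical helper name.

```python
import re

# Matches lines like:
#   chunk 0000000023348B3C has only copies with wrong versions (1) - please repair it manually
WRONG_VERSIONS = re.compile(
    r"chunk (?P<chunk>[0-9A-F]{16}) has only copies with wrong versions"
)


def chunks_needing_repair(log_lines):
    """Return the sorted, de-duplicated chunk ids flagged for manual repair."""
    found = set()
    for line in log_lines:
        match = WRONG_VERSIONS.search(line)
        if match:
            found.add(match.group("chunk"))
    return sorted(found)


sample = [
    "Oct 05 21:25:34 vogar mfsmaster[10840]: chunk 0000000023348B3C_00000311: "
    "chunk in middle of operation TRUNCATE, but no chunk server is busy - finish operation",
    "Oct 05 21:25:34 vogar mfsmaster[10840]: chunk 0000000023348B3C has only copies "
    "with wrong versions (1) - please repair it manually",
    "Oct 05 21:25:35 vogar mfsmaster[10840]: chunk 0000000023348B3C has only copies "
    "with wrong versions (1) - please repair it manually",
]
print(chunks_needing_repair(sample))  # ['0000000023348B3C']
```

In practice the lines would come from the syslog file or journal rather than an inline list.
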
@chogata
Member

chogata commented Oct 7, 2021

Thank you for this report, we will investigate.

@chogata
Member

chogata commented Oct 7, 2021

What is your load on chunk servers usually? (In CGI, the "Servers" tab, column "load").

@njaard
Author

njaard commented Oct 7, 2021

Hi @chogata, from the log file, I saw:

Heavy load server detected (X.X.X.244:9422); load: 210 ; threshold: 150 ; loadavg (without this server): 24.00

I have decreased the write load and stopped doing as many ftruncates to mitigate my production issues. Currently the load is 0-2.

@chogata
Member

chogata commented Oct 11, 2021

@njaard you can try and patch your instance with the new fix we have just submitted and this should solve your problem. However, your load is quite high. How many max workers do you have in your chunk server's config? If the overall system load of your machines is not too high, you can consider upping this number to increase the overall performance of your MooseFS instance.
