Strange chunk distribution and replication. #521

Open
dogshoes opened this issue Feb 16, 2017 · 5 comments

@dogshoes

I'm running a LizardFS cluster with three chunk servers holding approximately 18 million files. The goal is "3". This is a pre-production cluster I'm evaluating.

I'm converting the chunk servers one by one from magnetic media to SSD. I took the first chunk server down and performed the maintenance: removed the old drives (four magnetic drives) and installed the new ones (seven SSDs). I started the server, it joined the cluster again, and chunks were replicated. About half-way through I noticed that space was being consumed on this chunk server at twice the expected rate, and this trend continued until replication finished. The two untouched nodes had 509 GiB of data while the new node had 1.2 TiB, despite all three holding the same number of chunks: 19,512,380.

At this point I made note of the oddity and moved on to the second chunk server. Similar maintenance was performed and the node was brought back online. However, no chunks are being replicated to this updated chunk server. At this point, all of our chunks are being reported as undergoal.

In an attempt to kickstart replication again I used mfssetgoal to change the goal from 3 to 2 and back to 3. The chunks went from undergoal to stable, and back to undergoal, but no replication started to the new node.
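
(For reference, these were roughly the following invocations; the mount point is illustrative, and the goal is given first, then the path:)

mfssetgoal -r 2 /mnt/lizardfs    # temporarily lower the goal on the whole tree
mfssetgoal -r 3 /mnt/lizardfs    # restore the original goal of 3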

Since then, some data has been written to the cluster and those chunks are stable and are present on all three chunk servers. The only chunks present on the second chunk server are those written since the drives were swapped.

No errors reported on the chunk server or our metadata master. The mfschunkserver daemon has set up the expected directory structure on the new drives.

Not sure what's going on. Any advice?

(Screenshots attached: lizardfs-disks, lizardfs-info, lizardfs-servers)

@psarna (Member) commented Feb 16, 2017

As for the increased space usage, it could be a lot of things. My first idea is sparse files - perhaps the old filesystem on the magnetic hard drives had them and the new one does not. You can try running the fallocate command mentioned in issue #370 and see if space usage drops. If it's not that, it may still be something filesystem-related. What was mounted before and what is mounted now? If the number of chunks is the same, it doesn't really look like an internal LizardFS problem (I don't completely rule that out, though).
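
One quick way to check (the mount point below is just an example) is to compare allocated blocks with apparent file sizes on a chunk directory; a large gap means sparse files:

du -sh /mnt/chunkdisk01                   # space actually allocated on disk
du -sh --apparent-size /mnt/chunkdisk01   # logical file sizes, holes included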

As for question 2: changing the goal to 2 and then back to 3 won't kickstart replication. Here are the master's configuration entries (mfsmaster.cfg) that would:

OPERATIONS_DELAY_INIT
   Initial delay in seconds before starting chunk operations (default is 300).

OPERATIONS_DELAY_DISCONNECT
   Chunk operations delay in seconds after a chunkserver disconnection (default is 3600).

CHUNKS_LOOP_PERIOD
   Time in milliseconds between chunk loop executions (default is 1000).

CHUNKS_LOOP_MAX_CPU
   Hard limit on CPU usage by the chunk loop (percentage value, default is 60).

CHUNKS_SOFT_DEL_LIMIT
   Soft maximum number of chunks to delete on one chunkserver (default is 10).

CHUNKS_HARD_DEL_LIMIT
   Hard maximum number of chunks to delete on one chunkserver (default is 25).

CHUNKS_WRITE_REP_LIMIT
   Maximum number of chunks to replicate to one chunkserver (default is 2).

CHUNKS_READ_REP_LIMIT
   Maximum number of chunks to replicate from one chunkserver (default is 10).

The chunk loop is a continuous maintenance operation that happens in the master server. During each loop, some chunks might be scheduled for replication, but the number will never exceed CHUNKS_READ_REP_LIMIT for outgoing replication or CHUNKS_WRITE_REP_LIMIT for incoming replication. The same goes for deleting overgoal chunks, rebalancing, etc. Additionally, some chunks are in a protected period if their chunkservers were recently disconnected or the master server was restarted (OPERATIONS_DELAY_DISCONNECT, OPERATIONS_DELAY_INIT). Finally, you can set how often the chunk loop runs (CHUNKS_LOOP_PERIOD) and how aggressive it is (CHUNKS_LOOP_MAX_CPU). That's how you can boost replication in LizardFS.
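
As a rough sketch, the kind of mfsmaster.cfg adjustments meant here would look like this (values are only illustrative, not recommendations, and the path assumes the default /etc/mfs layout):

# /etc/mfs/mfsmaster.cfg
OPERATIONS_DELAY_INIT = 0          # don't wait after the master starts
OPERATIONS_DELAY_DISCONNECT = 0    # don't wait after a chunkserver reconnects
CHUNKS_LOOP_PERIOD = 1000          # run the chunk loop every second
CHUNKS_LOOP_MAX_CPU = 60           # let the chunk loop use up to 60% CPU
CHUNKS_WRITE_REP_LIMIT = 20        # allow more incoming replications per chunkserver per loop
CHUNKS_READ_REP_LIMIT = 20         # allow more outgoing replications per chunkserver per loop

After editing, reload or restart the master so the new limits take effect.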

@dogshoes (Author) commented Feb 16, 2017

Hi @psarna,

The FS and mount parameters are identical between the magnetic and SSD drives. I make a single GPT partition and run mkfs.xfs on it without extra arguments. The chunk drives are mounted with the recommended "rw,noexec,nodev,noatime,nodiratime,largeio,inode64,barrier=0". I'll take a look to see whether sparse files are in use, but I don't believe so. The dataset should be roughly the 550 GB that we were seeing before.
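
For completeness, the per-disk preparation is essentially the following (device and mount point names are placeholders), plus adding the mount point to mfshdd.cfg so the chunkserver picks it up:

parted --script /dev/sdX mklabel gpt mkpart primary xfs 0% 100%
mkfs.xfs /dev/sdX1
mount -o rw,noexec,nodev,noatime,nodiratime,largeio,inode64,barrier=0 /dev/sdX1 /srv/lizardfs/chunk01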

If I can get replication started again, I can try to determine whether the disk geometry or the number of disks accounts for the extra space usage.

The chunk loop parameters are configured to permit a high rate of replication without affecting client I/O. The first chunk server that replicated did so immediately and very quickly. Our current settings:

# CHUNKS_LOOP_MAX_CPS = 100000
# CHUNKS_LOOP_MIN_TIME = 300
# CHUNKS_LOOP_PERIOD = 1000
# CHUNKS_LOOP_MAX_CPU = 60
CHUNKS_WRITE_REP_LIMIT = 100
CHUNKS_READ_REP_LIMIT = 20
ENDANGERED_CHUNKS_PRIORITY = .1

Here's the replication rate graph. Our first chunk server finished replicating around 18:00, taking nearly a week to complete. The second chunk server has not yet started, nearly 12 hours later.

@dogshoes (Author)

I was able to spend a little more time investigating this issue. Sparse files are definitely involved, but they are equally involved on all chunk servers (those with 500 GB of use and the one with 1.2 TB of disk use).

To make things easier:

  • We had three identically configured chunk servers: A, B, and C. All three servers are less than three weeks old. We started testing failure scenarios and performance after our dataset completed copying to the cluster about a week after they were installed.
  • Chunk server A had all drives removed (as if it failed) and replaced with SSDs. It was brought back online. This was to both test the failure of a chunk server and to start improving performance (moving from magnetic HDDs to SSDs).
  • We noticed very slow chunk replication and tuned the CHUNKS_WRITE_REP_LIMIT and CHUNKS_READ_REP_LIMIT.
  • After about a week all 17,500,000 chunks had replicated, but chunk server A held a little over twice as much data as expected.
  • After chunk server A completed replication, chunk server B had all drives removed and replaced with SSDs. It was brought back online and so far has had no undergoal chunks replicated to it.
  • Chunk server C is untouched and running in its initial configuration.

Observations so far:

  • All of our nodes are running CentOS 7 and LizardFS 3.10.6.
  • The chunks-per-disk metric reported through the CGI server matches the number of chunk files on each disk.
  • We're still reporting a huge number of undergoal chunks, as chunk server B still hasn't replicated the bulk of the data.
  • Rebooted chunk server B, no replication started.
  • Chunk server B is receiving new chunks without issue. It also removes them without issue (we emptied the trash on the meta mount).
  • All chunk servers have SELinux enabled. Disabling SELinux doesn't seem to improve the situation, and there are no SELinux-related errors logged.
  • The datasource that was copied to this LizardFS cluster is 327GB in size.

Would it be ill-advised to mark all of the disks on chunk server A, which has too much data on it, as evicted, to see what happens when the master moves data off those disks? I'm curious whether chunk server B will then replicate the undergoal chunks or will still refuse to. I'm tempted to rebuild both chunk servers A and B from scratch, but during that time we'd only have one copy of each chunk. It would also complicate maintenance if we can't take chunk servers offline to upgrade or expand them without completely rebuilding them.
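
If we do go that route, my understanding is that the disks would be marked for removal by prefixing their entries in the chunkserver's mfshdd.cfg with an asterisk and then reloading the chunkserver (paths are illustrative):

# /etc/mfs/mfshdd.cfg on chunk server A
*/srv/lizardfs/chunk01    # leading '*' marks the disk for removal
*/srv/lizardfs/chunk02

mfschunkserver reload     # re-read mfshdd.cfg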

@4Dolio commented Mar 27, 2017

Try turning down the OPERATIONS_DELAY_* values to 0 perhaps.

@Blackpaw

Did you resolve the undergoal replication issue, dogshoes?
