Add distance penalty to chunk server configuration #525

Open
wants to merge 9 commits into
from

Conversation

@jdieter
Contributor

jdieter commented Feb 27, 2017

Currently when reading chunks from chunkservers, there is no way to
prioritize fast chunkservers running on SSDs over slow chunkservers
running on spinning disk.

This patch-set adds a new configuration option (DISTANCE_PENALTY)
that allows the admin to set a distance penalty for each chunkserver.
This penalty is added to any distance calculation made for the
chunkserver, making it a less likely candidate for reads (how much less
likely depends on the penalty's value and on whether my previous distance
calculation patch, #523, included as the first patch in this patch-set,
has been applied).

This patch-set is now in production at our school, where I've configured
all our spinning disks to have a penalty of 100, while all our SSDs have
a penalty of 0 (the default). Our VMs (which have some chunks on spinning
disk due to changed goals of snapshots) are now much faster, because
they only read from the spinning disks if the SSDs are unavailable.
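
For reference, here is a minimal sketch of the setting as we run it on a
spinning-disk chunkserver (the file name mfschunkserver.cfg is my assumption;
the option name and the 100/0 values are the ones described above):

```
# mfschunkserver.cfg (sketch) - spinning-disk chunkserver
# The penalty is added to every distance calculation made for this
# chunkserver; 0 (the default) leaves its distance unchanged.
DISTANCE_PENALTY = 100
```

SSD chunkservers simply keep the default of 0.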

This patch-set does include a network protocol change that makes it
incompatible with older LizardFS installs. If there's any way for the
chunkserver to tell older masters that they can ignore the
LIZ_CSTOMA_REGISTER_PENALTY packet, or for the chunkserver to detect that
the master doesn't recognize that packet, I would happily implement it.

@Blackpaw

Blackpaw commented Feb 27, 2017

Useful, but with some limitations?

  • Writes will still be at the rate of the slowest chunkserver though, won't they?
  • It requires chunkservers to be all SSD.

@Zorlin

Zorlin commented Feb 28, 2017

This is a very interesting patch, thank you. I hope something like it can be officially implemented.

@jdieter

Contributor

jdieter commented Feb 28, 2017

@Blackpaw, I haven't really looked at writes yet. If they are sent in parallel to all chunkservers, then yes, it will be at the speed of the slowest chunkserver.

This patch-set does not require all chunkservers to be SSD, though. In our use case, we have three SSD chunkservers and three spinning disk chunkservers. Our VMs are set up to use the SSD chunkservers with three replicas, but every night I make a snapshot and set the goal of the snapshot to be spinning disk. LizardFS will then make six replicas (three SSD, three spinning disk), and then free the SSD replicas as those chunks are no longer used by the VMs.

My problem was that the virtual hosts kept reading from the spinning disk, even though all new writes were going to the SSDs. This patch-set fixed that problem.

I would like to look at the writes, but this is the first step.
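
For illustration, the nightly step looks roughly like this (paths and the goal name are made up, and it assumes the standard mfsmakesnapshot/mfssetgoal client tools plus a spinning-disk goal defined on the master):

```sh
# Sketch of the nightly snapshot + goal change (hypothetical paths/goal name).
# Snapshot the VM images, then retarget the snapshot at spinning disk;
# LizardFS replicates its chunks to the HDD chunkservers and the SSD copies
# can be freed once nothing else references them.
mfsmakesnapshot /mnt/lizardfs/vms /mnt/lizardfs/snapshots/vms-$(date +%F)
mfssetgoal -r hdd_only /mnt/lizardfs/snapshots/vms-$(date +%F)
```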

@jdieter

Contributor

jdieter commented Mar 1, 2017

@MPaszkiewicz, I think I've addressed the reasons that this patch failed the sanity check in my last commit. Is there anything I need to do to have it resubmitted?

@MPaszkiewicz

Contributor

MPaszkiewicz commented Mar 1, 2017

@jdieter, your first patch introducing distance calculation between racks is quite likely to be merged soon.
The other set of patches cannot be merged at the moment. You can take a look at the comment we made here: http://cr.skytechnology.pl:8081/#/c/2863
Our current plan for handling this kind of situation is to implement a more complex topology mechanism based on graphs. In particular, we would like this change not to affect chunkserver logic.

@jdieter

Contributor

jdieter commented Mar 1, 2017

Thanks so much for the comment. I'll look at adding in a legacy data structure. As for the more complex topology mechanism, would there be a way to indicate that two chunkservers at the same IP have different distances? This was the main thing that convinced me to do a separate chunkserver configuration option, because our SSD chunkservers are running on the same server as our HDD chunkservers.

@MPaszkiewicz

Contributor

MPaszkiewicz commented Mar 2, 2017

Yes, in order to differentiate two chunkservers sharing the same IP, we can take ports into account.
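
A minimal illustration of what that could look like (entirely hypothetical types and names, not from the LizardFS code base): topology entries keyed by address and port rather than by address alone.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical sketch: identify a chunkserver by (IP, port) so that an SSD
// chunkserver and an HDD chunkserver running on the same host can be given
// different distances in the topology.
using ChunkserverId = std::pair<uint32_t, uint16_t>;  // (IPv4 address, port)

std::map<ChunkserverId, uint32_t> topologyDistance;

uint32_t distanceTo(const ChunkserverId& cs) {
    auto it = topologyDistance.find(cs);
    return it != topologyDistance.end() ? it->second : 2;  // default: different rack
}
```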

@jdieter

Contributor

jdieter commented Mar 27, 2017

Ok, just to make sure this is clear: is there any point in my addressing the issues in http://cr.skytechnology.pl:8081/#/c/2863, or do you feel strongly enough about not changing the chunkserver logic that I just shouldn't bother?

@psarna

Member

psarna commented Mar 27, 2017

http://cr.skytechnology.pl:8081/#/c/2861 will most likely be merged after we apply the amended version,
but as for http://cr.skytechnology.pl:8081/#/c/2863, right now it is too complicated to be considered for a merge, even after addressing existing issues.

I think the best way is to leave this issue open for future reference (when we decide to rewrite topology in a more sane way). Then, patches 2862-2865 could be reconsidered as merge candidates.

@jdieter

Contributor

jdieter commented Mar 27, 2017

Ok, thanks so much. I'll hold off polishing this patch until you're ready for it or we have something else that does essentially the same thing.

DarkHaze and others added some commits May 11, 2017

master: Fix high cpu usage in fs_periodic_file_test
This commit fixes high CPU usage in fs_periodic_file_test.

Fixes #547

Change-Id: Ia93173dd0f358f3ff606c7ebb5848f2b786b2158
master: Add missing initializer to load_factor
This commit adds the missing initialization of the load_factor member
to avoid valgrind warnings.

Change-Id: Ifca5ad0afd781c6fc23090206750a6fe66573f10
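
A minimal illustration of the kind of fix described (the surrounding struct and the member's type are guesses; only the name load_factor comes from the commit):

```cpp
// Hypothetical sketch: give the member an in-class default so it is never
// read uninitialized, which is what valgrind was warning about.
struct ChunkserverEntry {
    double load_factor = 0.0;  // previously left without an initializer
};
```
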
master: Use difference between racks to calculate distance
Currently distance is calculated as 0 (same machine), 1 (same rack) and 2
(different racks).  This doesn't allow any way of saying that rack 1 is
closer to rack 2 than it is to rack 42.

This patch changes the topology distance calculation so that the distance
between different racks is the absolute difference between their rack IDs
plus 1 (since a distance of 1 already means two servers are on the same rack).

Rack 2 now has a distance of 2 from rack 1 (abs(2-1)+1=2) and a distance
of 41 from rack 42 (abs(2-42)+1=41), which means clients on rack 2 will
far prefer chunk servers on rack 1.

Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>
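
A sketch of the calculation described above (function and parameter names are illustrative, not the actual LizardFS symbols):

```cpp
#include <cstdint>
#include <cstdlib>

// Distance between a client and a chunkserver, as described in the commit:
//   0                      - same machine
//   1                      - same rack
//   abs(rackA - rackB) + 1 - different racks, so nearby rack IDs stay close
uint32_t topologyDistance(bool sameMachine, int32_t rackA, int32_t rackB) {
    if (sameMachine) {
        return 0;
    }
    if (rackA == rackB) {
        return 1;
    }
    return static_cast<uint32_t>(std::abs(rackA - rackB)) + 1;
}
```
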
Change size to uint32_t and check for overflow.
We use the 64-bit version of std::min to make sure we don't overflow: the
value is clamped to the maximum of uint32_t and only then cast to uint32_t.

Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>
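
A sketch of the overflow guard this refers to (names are illustrative; only the 64-bit std::min followed by a cast comes from the commit message):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Do the arithmetic in 64 bits, clamp with the 64-bit std::min to the
// uint32_t maximum, then cast back down so the stored value cannot overflow.
uint32_t toUint32Clamped(uint64_t value) {
    return static_cast<uint32_t>(
        std::min<uint64_t>(value, std::numeric_limits<uint32_t>::max()));
}
```
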
chunkserver: Add distance penalty to chunk server configuration
Currently when reading chunks from chunkservers, there is no way to
prioritize fast chunkservers running on SSDs over slow chunkservers
running on spinning disk.

This patch-set adds a new configuration option that allows the admin
to set a distance penalty for each chunkserver.  This penalty will be
added to *any* distance calculations made for the chunkserver, which
(depending on its value and whether my previous distance calculation
patch has been applied) will make the chunkserver a less likely candidate
for reads.

This patch-set is now in production at our school, where I've configured
all our spinning disks to have a penalty of 100, while all our SSDs have
a penalty of 0 (the default).  Our VMs (which have some chunks on spinning
disk due to changed goals of snapshots) are now *much* faster, because
they only read from the spinning disks if the SSDs are unavailable.

This patch-set does include a network protocol change that makes it
incompatible with older LizardFS installs.  If there's any way for the
chunkserver to tell older masters that they can ignore the
LIZ_CSTOMA_REGISTER_PENALTY packet, or for the chunkserver to detect that
the master doesn't recognize that packet, I would happily implement it.

This first patch contains the chunkserver code to read the distance
penalty and send it to the master.

Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>
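
As a self-contained illustration of the chunkserver side (entirely hypothetical code; the real patch presumably uses LizardFS's existing configuration helpers, and only the option name comes from the patch):

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical sketch: parse DISTANCE_PENALTY from a "NAME = value" style
// chunkserver config file, defaulting to 0 (no penalty) when absent.  The
// value would then be reported to the master during registration (the
// LIZ_CSTOMA_REGISTER_PENALTY packet described above).
uint32_t readDistancePenalty(const std::string& cfgPath) {
    std::ifstream cfg(cfgPath);
    std::string line;
    while (std::getline(cfg, line)) {
        std::istringstream ss(line);
        std::string key, eq;
        uint64_t value;
        if ((ss >> key >> eq >> value) && key == "DISTANCE_PENALTY" && eq == "=") {
            return static_cast<uint32_t>(value);
        }
    }
    return 0;
}
```
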
master: Read and apply distance penalty from chunkservers
This patch-set adds a new configuration option that allows the admin
to set a distance penalty for each chunkserver.  This penalty will be
added to *any* distance calculations made for the chunkserver, which
(depending on its value and whether my previous distance calculation
patch has been applied) will make the chunkserver a less likely candidate
for reads.

With this patch, the master can read and apply the distance penalties sent
by the chunkservers.

Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>
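
And a sketch of how the master side could use the reported value when ranking read sources (hypothetical names; only "topology distance plus penalty, lower is better" comes from the patch description):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: rank candidate chunkservers by topology distance plus
// the penalty each one reported at registration; lower values are preferred
// as read sources.  The 64-bit sums avoid overflow in the comparison.
struct Candidate {
    uint32_t distance;  // topology distance to the client
    uint32_t penalty;   // DISTANCE_PENALTY reported by the chunkserver
};

void orderForRead(std::vector<Candidate>& candidates) {
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) {
                  return static_cast<uint64_t>(a.distance) + a.penalty <
                         static_cast<uint64_t>(b.distance) + b.penalty;
              });
}
```
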
cgi: Show distance penalty in web interface
This patch-set adds a new configuration option that allows the admin
to set a distance penalty for each chunkserver.  This penalty will be
added to *any* distance calculations made for the chunkserver, which
(depending on its value and whether my previous distance calculation
patch has been applied) will make the chunkserver a less likely candidate
for reads.

With this patch, the web interface now shows the distance penalty in the
servers tab.

Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>
doc: Update documentation to include DISTANCE_PENALTY
This patch-set adds a new configuration option that allows the admin
to set a distance penalty for each chunkserver.  This penalty will be
added to *any* distance calculations made for the chunkserver, which
(depending on its value and whether my previous distance calculation
patch has been applied) will make the chunkserver a less likely candidate
for reads.

This patch updates the documentation to show the DISTANCE_PENALTY
configuration option.

Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>
Convert spaces to tabs (I hope I got them all)
Signed-off-by: Jonathan Dieter <jdieter@lesbg.com>