New Feature: Adds support for External Volumes #489

Closed
wants to merge 4 commits into
base: master
from

Conversation

3 participants
@dvonthenen
Contributor

dvonthenen commented Feb 8, 2016

Adds support for external volumes to be used for both the config and data volumes of Elasticsearch nodes. This support works for both the Docker and Mesos containerizers.

This functionality works by assigning an Elasticsearch node ID to each node that is created. That node ID metadata is saved in the form of an environment variable that can be accessed later on. When a node fails (it crashes or its health check fails), say node 2 for this example, its task is released, returning the node ID to the pool. When a new task is launched to replace the failed node, node ID 2 will be free, so the new task assumes the ID of the old node, perhaps on a completely different Mesos agent, reattaches the volumes, and continues where the old node left off. This should yield a much quicker rebuild of the Elasticsearch node since it isn't rebuilding from scratch.
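
To make the recycling step concrete, here is a minimal sketch of how the lowest free node ID could be handed out and reclaimed; the class and method names are hypothetical and not taken from this PR:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of node ID recycling; not the actual implementation in this PR.
class NodeIdPool {
    private final Set<Integer> inUse = new TreeSet<>();

    // Hand out the lowest ID not currently assigned to a running task.
    synchronized int acquire() {
        int id = 1;
        while (inUse.contains(id)) {
            id++;
        }
        inUse.add(id);
        return id;
    }

    // Called when a task fails or is killed so its ID (and volumes) can be reused.
    synchronized void release(int id) {
        inUse.remove(id);
    }
}
```

A replacement task that acquires ID 2 would then reattach elasticsearch2data and elasticsearch2config no matter which agent it lands on.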

In the Docker container case (the default), the use of external volumes has been solved generically, so it works with any Docker volume driver. In TaskInfoFactory.java on line 134, we add the docker run parameter "volume-driver", which enables the external support. You will also notice that when "volume-driver" is set, the host path no longer references a path on the host node but rather the external volume name.
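
For illustration, here is a rough sketch of what that parameter and volume setup looks like with the Mesos Java protobufs; this is not the exact code from TaskInfoFactory.java, and the image name and volume values are placeholders:

```java
import org.apache.mesos.Protos;

public class ExternalVolumeExample {
    // Sketch only: when an external volume driver is configured, pass it as a
    // "volume-driver" docker run parameter and use the external volume name as
    // the host path instead of a directory on the agent.
    public static Protos.ContainerInfo buildContainerInfo() {
        Protos.ContainerInfo.DockerInfo docker = Protos.ContainerInfo.DockerInfo.newBuilder()
                .setImage("elasticsearch")                 // placeholder image
                .addParameters(Protos.Parameter.newBuilder()
                        .setKey("volume-driver")
                        .setValue("rexray"))               // the configured external driver
                .build();

        Protos.Volume dataVolume = Protos.Volume.newBuilder()
                .setHostPath("elasticsearch1data")         // external volume name, not a host directory
                .setContainerPath("/data")
                .setMode(Protos.Volume.Mode.RW)
                .build();

        return Protos.ContainerInfo.newBuilder()
                .setType(Protos.ContainerInfo.Type.DOCKER)
                .setDocker(docker)
                .addVolumes(dataVolume)
                .build();
    }
}
```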

In the Mesos container case, the use of external volumes has been implemented using Mesos isolator modules such as mesos-module-dvdi (github.com/emccode/mesos-module-dvdi). You can find the environment variable interface specification there.

Also, it is worth noting that the original OfferStrategy class has been made into a parent class. The class OfferStrategyNormal extends OfferStrategy and provides exactly the same strategy that the original OfferStrategy provided. The new class OfferStrategyExternalStorage is enabled when an external storage driver is specified; it removes the storage/disk checks from the strategy because the volumes are now managed externally.
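
In outline, the split looks roughly like this; it is a simplified sketch, not the verbatim classes from the PR, and the helper method and resource thresholds are illustrative:

```java
import org.apache.mesos.Protos;

// Simplified sketch of the strategy split; not the verbatim classes from the PR.
abstract class OfferStrategy {

    abstract boolean isOfferAcceptable(Protos.Offer offer);

    // Shared helper: does the offer carry at least 'amount' of a scalar resource?
    protected boolean hasScalar(Protos.Offer offer, String name, double amount) {
        return offer.getResourcesList().stream()
                .filter(r -> r.getName().equals(name))
                .anyMatch(r -> r.getScalar().getValue() >= amount);
    }
}

// Same behaviour as the original OfferStrategy, including the disk check.
class OfferStrategyNormal extends OfferStrategy {
    @Override
    boolean isOfferAcceptable(Protos.Offer offer) {
        return hasScalar(offer, "cpus", 1.0)       // illustrative thresholds
                && hasScalar(offer, "mem", 1024)
                && hasScalar(offer, "disk", 5000);
    }
}

// Used when an external storage driver is specified: disk checks are dropped
// because the config/data volumes no longer live on the agent.
class OfferStrategyExternalStorage extends OfferStrategy {
    @Override
    boolean isOfferAcceptable(Protos.Offer offer) {
        return hasScalar(offer, "cpus", 1.0)
                && hasScalar(offer, "mem", 1024);
    }
}
```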

I should add that if someone else needs to verify this before merging, you can find an example of a Docker volume driver below. Please install both the rexray driver and DVDCLI; configuration instructions can be found on those pages.
https://github.com/emccode/rexray
https://github.com/emccode/dvdcli

To test the Mesos containerizer, you need to install mesos-module-dvdi, located below. Configuration instructions are also found on that page.
https://github.com/emccode/mesos-module-dvdi

Special configuration considerations:
Rexray needs to be configured with mount.preempt set to true and unmount.ignoreUsedCount set to true. Below is an example with those flags set, using the Amazon EBS driver as the backing storage.

rexray:
  storageDrivers:
  - ec2
  volume:
    mount:
      preempt: true
    unmount:
      ignoreUsedCount: true
aws:
  accessKey: YOUR_KEY
  secretKey: YOUR_KEY 

Tested using the supported Docker method for the Elasticsearch framework.

David vonThenen added some commits Feb 2, 2016

David vonThenen
This adds support for external volumes to be used for both the config and data volumes for elastic search nodes. This support works for both Docker and Mesos containerizers.

In the Docker container case, the use of external volumes has been solved generically and it will enable all Docker Volume Drivers.

In the Mesos container case, the use of external volumes has been implemented using mesos isolator modules like the mesos-module-dvdi module. Example: github.com/emccode/mesos-module-dvdi

Will have more details in the pull requests.
@containersol

Collaborator

containersol commented Feb 8, 2016

Can one of the admins verify this patch?

@dvonthenen

Contributor

dvonthenen commented Feb 8, 2016

There was a request for a demo in the previous PR. I am looking to pull that together now that the majority of the code seems to have stabilized.

@dvonthenen

Contributor

dvonthenen commented Feb 8, 2016

I have created a demo that I posted to YouTube. You can view it here:
https://youtu.be/0dhlcft9aWc

If you have any questions about the functionality, feedback on the code, or need more clarification, please let me know. You can also find me on Slack at codecommunity.slack.com.

@philwinder

Contributor

philwinder commented Feb 9, 2016

David, thank you for the fantastic video. It looks great! Everyone at Container Solutions is very excited!

I only have two minor problems with the code.

The first is that because you are not storing the volume ID information within the TaskInfo or TaskStatus messages, it is not being persisted to ZooKeeper. So if the scheduler dies, it will not know which volumes should be attached to which executors.

The second, related to the first, is the use of a bitmask to store the volume number information. I would change this to a simple hashmap, which is easier to understand. But this will depend on where we can store it in ZooKeeper.

I will work on these two issues when I come to merging the code, and I also need to add some tests.

Again, thank you for this excellent work. I'm aiming to include it in the 0.8.0 release.

Phil

@dvonthenen

Contributor

dvonthenen commented Feb 9, 2016

Hi Phil,

The first point you brought up is actually intentional. Since the config and data are no longer stored on the slave node's direct-attached storage, the executors represent nothing more than compute engines and don't need to be tied back to their original volumes. So, for example, if we get a slave node failure where an executor needs to be brought up on a different Mesos slave node AND we scale the Elasticsearch executor nodes from 3 to 4, either the 3rd or the 4th node can take the volumes for node 3. If the 4th executor takes the volumes, it assumes the persona of node 3.

For the second, I can definitely make the change to a map if you would like. The bitmask does place an upper limit of 32 on the number of ES nodes; a map would not place an arbitrary limit on the number of nodes you could have.

@philwinder

Contributor

philwinder commented Feb 23, 2016

@dvonthenen I am integrating this PR now. Am I right in thinking that this will only work in docker mode? E.g. if someone runs the framework in jar mode, this won't work?

@philwinder

Contributor

philwinder commented Feb 23, 2016

@dvonthenen Hi David. I'm getting close.

The node ID code is only generating an ID of 1 for me, so all three nodes are trying to connect to elasticsearch1data. I don't know why; I will have to debug to find out.

But I don't quite follow the logic. Say something catastrophic has happened and only ID 3 is running (assume not a bitmask, just a counter). The first replacement will come in and generate a list of what is running, so just 3. What should it connect to? 1? 0? There needs to be some logic around which nodes are created. I guess the bitmask provides that logic: 000001 will be created first. But I think it will be easier for other people to understand if we just use a counter: start at 0; there should be 3 nodes, so go through and start 0, 1 and 2.

Does that make sense?

@dvonthenen

Contributor

dvonthenen commented Feb 23, 2016

Am I right in thinking that this will only work in docker mode? E.g. if someone runs the framework in jar mode, this won't work?

It will work in jar mode, but you need to have DVDCLI installed (https://github.com/emccode/dvdcli) and mesos-module-dvdi (https://github.com/emccode/mesos-module-dvdi).

In that case, the isolator will mount (and create, if you haven't already done so) the volumes on each executor node using environment variables that are generated automatically. For node 1, they should look like the following:

"env": {
  "DVDI_VOLUME_NAME": "elasticsearch1data",
  "DVDI_VOLUME_DRIVER": "rexray",
  "DVDI_VOLUME_OPTS": "size=100",
  "DVDI_VOLUME_CONTAINERPATH": "/data",
  "DVDI_VOLUME_NAME1": "elasticsearch1config",
  "DVDI_VOLUME_DRIVER1": "rexray",
  "DVDI_VOLUME_OPTS1": "size=10",
  "DVDI_VOLUME_CONTAINERPATH1": "/tmp/config"
}

When you flip it into non-docker mode, the code that creates this lives in ExecutorEnvironmentVariables.java, starting with the function populateEnvMapForMesos() at line 71.
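
As a rough sketch of how such a map could be built for one node (the variable names and values match the JSON above; the class and method here are illustrative, not the actual ExecutorEnvironmentVariables.java code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: builds the mesos-module-dvdi environment variables for one node.
// The suffix convention (no suffix, then "1" for the second volume) follows the JSON above.
class DvdiEnvExample {
    static Map<String, String> forNode(int nodeId, String driver) {
        Map<String, String> env = new HashMap<>();

        // First volume entry: the data volume.
        env.put("DVDI_VOLUME_NAME", "elasticsearch" + nodeId + "data");
        env.put("DVDI_VOLUME_DRIVER", driver);
        env.put("DVDI_VOLUME_OPTS", "size=100");
        env.put("DVDI_VOLUME_CONTAINERPATH", "/data");

        // Second volume entry: the config volume.
        env.put("DVDI_VOLUME_NAME1", "elasticsearch" + nodeId + "config");
        env.put("DVDI_VOLUME_DRIVER1", driver);
        env.put("DVDI_VOLUME_OPTS1", "size=10");
        env.put("DVDI_VOLUME_CONTAINERPATH1", "/tmp/config");
        return env;
    }
}
```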

@dvonthenen

Contributor

dvonthenen commented Feb 23, 2016

There needs to be some logic around which nodes are created. I guess the bitmask provides that logic. 000001 will be created first. But I think it will be easier for other people to understand if we just use a counter. Start at 0. There should be 3 nodes, so go through and start 0, 1 and 2. Does that make sense?

The bitmask does contain that logic, but using a simple counter won't work. The problem is that you need to recycle the IDs when nodes fail. If you only have 3 ES nodes, it would be wrong on a node failure to have the new node come up with ID 4, because the volumes elasticsearch4data and elasticsearch4config don't exist. If node ID 2 failed, you would want the new executor instance to come up with ID 2.

You can still use the list or map approach; you just need to walk the list starting with 1 (or 0 if 0-based indexing is used) and take the first available index. I think we might be saying the same thing.

@philwinder

Contributor

philwinder commented Feb 29, 2016

Addressed in #514.

@philwinder philwinder closed this Feb 29, 2016
