
User templates for hadoop / spark #202

Closed
wants to merge 6 commits

Conversation


@ncherel ncherel commented Jul 3, 2017

This PR makes the following changes:

  • add a field to the configuration file that allows users to supply their own templates for Hadoop / Spark

I manually tested this PR.
I started running pytest but encountered errors unrelated to my changes; test_static passes.
I will run the full test suite once the general direction of this PR is approved.

I was thinking of adding a flintrock generate-templates /path/to/folder command to provide the base templates (those bundled with Flintrock). What do you think?

Fixes #200

@nchammas
Owner

Hi @ncherel and thank you for this contribution to Flintrock! Sorry about the delay in reviewing it. I will try to take a detailed look through and give feedback by the end of this week.

I was thinking of adding a flintrock generate-templates /path/to/folder command to provide the base templates (those bundled with Flintrock). What do you think?

I would prefer not to add new commands and instead enhance flintrock configure to handle this. Perhaps we can duplicate the template tree into Flintrock's config directory (like we currently do for config.yaml) and that way users can simply find the templates with flintrock configure --locate and edit them as they please.

I'll have to think about this more, but that's my initial reaction.
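For illustration only, a minimal sketch of what that duplication could look like, assuming a hypothetical copy_templates() helper called from flintrock configure (the paths and names here are assumptions, not Flintrock's actual code):

import os
import shutil

# Assumption: Flintrock's config dir, i.e. where config.yaml already lives.
CONFIG_DIR = os.path.join(os.path.expanduser('~'), '.config', 'flintrock')
# Assumption: the built-in templates ship inside the Flintrock package.
BUILTIN_TEMPLATES = os.path.join(os.path.dirname(__file__), 'templates')

def copy_templates():
    """Copy the bundled templates into the user's config dir so that
    `flintrock configure --locate` surfaces them for editing."""
    dest = os.path.join(CONFIG_DIR, 'templates')
    if not os.path.exists(dest):
        shutil.copytree(BUILTIN_TEMPLATES, dest)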

@nchammas nchammas (Owner) left a comment

This is a solid first crack at implementing this feature, and several people have asked me about it in the past.

There are 2 critical things that need to be addressed to move this PR forward:

  • Resolve all merge conflicts.
  • Address the issue with the cluster manifest.

There is one additional thing we can address after those 2 are taken care of:

  • Have flintrock configure copy the built-in templates to Flintrock's config dir and use those as the default paths in Click.

But we can address this later.

@@ -9,12 +9,22 @@ services:
# - must be a tar.gz file
# download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
# executor-instances: 1
# template-dir: # folder containing spark configuration files:
Owner

To make this consistent with the other examples, I'd put the comment above the template-dir line and instead put an example path as template-dir's value.
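Concretely, something like this, mirroring the commented-out download-source example (the path is purely illustrative):

# optional; defaults to the templates bundled with Flintrock
# - folder containing the Spark configuration files
# template-dir: "/path/to/spark/templates"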

hdfs:
version: 2.7.3
# optional; defaults to download from a dynamically selected Apache mirror
# - must contain a {v} template corresponding to the version
# - must be a .tar.gz file
# download-source: "https://www.example.com/files/hadoop/{v}/hadoop-{v}.tar.gz"
# template-dir: # path to the folder containing the hadoop configuration files
Owner

Same comment here as on the Spark template dir.

@@ -218,6 +218,7 @@ def cli(cli_context, config, provider, debug):
help="URL to download Hadoop from.",
default='http://www.apache.org/dyn/closer.lua/hadoop/common/hadoop-{v}/hadoop-{v}.tar.gz?as_json',
show_default=True)
@click.option('--hdfs-template-dir')
Owner

The default for this should be under Flintrock's configuration dir.
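For illustration, a hedged sketch of what that default could look like; get_config_dir() is a stand-in here, not necessarily Flintrock's real helper:

import os
import click

# Assumption: a helper resolving Flintrock's config dir,
# analogous to where config.yaml is stored.
def get_config_dir():
    return os.path.join(os.path.expanduser('~'), '.config', 'flintrock')

@click.command()
@click.option(
    '--hdfs-template-dir',
    help="Directory containing the Hadoop configuration templates.",
    default=os.path.join(get_config_dir(), 'templates', 'hadoop'),
    show_default=True)
def launch(hdfs_template_dir):
    click.echo(hdfs_template_dir)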

@@ -239,6 +240,7 @@ def cli(cli_context, config, provider, debug):
help="Git repository to clone Spark from.",
default='https://github.com/apache/spark',
show_default=True)
@click.option('--spark-template-dir')
Owner

The default for this should be under Flintrock's configuration dir.

self.manifest = {
'version': version,
'download_source': download_source,
'template_dir': template_dir}
Owner

This raises an important issue: The point of the manifest is to be a self-contained representation of how the cluster is configured. Any instance of Flintrock should be able to manage a cluster solely using the information in the manifest and from EC2.

With this change, however, Flintrock needs access to the specified local directory to get the correct templates. So if one instance of Flintrock launches a cluster, another instance of Flintrock on a different machine won't be able to expand or restart it correctly because the manifest only has the paths to the templates on some non-cluster machine and not the templates themselves.

Realistically, I would guess that most people using Flintrock do so from a single machine. But I'm not comfortable breaking the implicit guarantee that any instance of Flintrock at version X on any machine can fully manage clusters launched by Flintrock at version X. And I know there are some teams using Flintrock where this would matter.

To address this issue, we need to put the full contents of all the templates in the manifest (or somehow otherwise on the cluster) and use that in the add-slaves, remove-slaves, and start commands as necessary.
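A rough sketch of the first option, embedding the template contents directly in the manifest (the function and key names are assumptions, not Flintrock's actual code):

import os

def read_templates(template_dir):
    """Read every template file so its full contents can travel
    with the cluster manifest, keyed by relative path."""
    templates = {}
    for root, _, files in os.walk(template_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, template_dir)
            with open(path) as f:
                templates[key] = f.read()
    return templates

# The manifest would then carry the templates themselves, not just a path:
# self.manifest = {
#     'version': version,
#     'download_source': download_source,
#     'templates': read_templates(template_dir)}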

Owner

A different approach we could take here is to note the template directory and some kind of hash of the contents in the manifest. If someone wants to expand an existing Flintrock cluster without having the same template files that were used during launch, Flintrock will display a warning but try to continue anyway.

This kind of forgiving behavior is technically dangerous, but in practice it should work well. And the warning should make it clear to users that they are responsible for any borked operations since they are not using the same files that were used during launch.

We should probably use this kind of approach for the other places where Flintrock references local files during launch, like --ec2-user-data.
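A hedged sketch of that hashing idea; hash_template_dir() and the manifest key are hypothetical:

import hashlib
import os
import warnings

def hash_template_dir(template_dir):
    """Compute a deterministic digest over the template files so the
    manifest can record exactly what was used at launch time."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(template_dir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, template_dir).encode())
            with open(path, 'rb') as f:
                digest.update(f.read())
    return digest.hexdigest()

# At add-slaves / start time, warn (but continue) on mismatch:
# if hash_template_dir(local_dir) != manifest['template_hash']:
#     warnings.warn("Local templates differ from those used at launch.")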

@nchammas
Owner

Hi @ncherel! Are you still interested in moving this PR forward? I think it's a solid feature addition to Flintrock, and if we resolve the issue with the manifest I'd be happy to merge it in. What do you think?

@nchammas
Owner

Closing this PR due to inactivity. I might pick this idea up later because I think it's really good!

@nchammas nchammas closed this Jan 27, 2020