User templates for hadoop / spark #202

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
10 changes: 10 additions & 0 deletions flintrock/config.yaml.template
@@ -9,12 +9,22 @@ services:
# - must be a tar.gz file
# download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
# executor-instances: 1
# template-dir: # folder containing spark configuration files:
Owner:
To make this consistent with the other examples, I'd put the comment above the template-dir line and instead put an example path as template-dir's value.

# - slaves
# - spark-env.sh
# - spark-defaults.conf
hdfs:
version: 2.7.3
# optional; defaults to download from a dynamically selected Apache mirror
# - must contain a {v} template corresponding to the version
# - must be a .tar.gz file
# download-source: "https://www.example.com/files/hadoop/{v}/hadoop-{v}.tar.gz"
# template-dir: # path to the folder containing the hadoop configuration files
Owner:
Same comment here as on the Spark template dir.

# - masters
# - slaves
# - hadoop-env.sh
# - hdfs-site.xml
# - core-site.xml

provider: ec2

12 changes: 11 additions & 1 deletion flintrock/flintrock.py
@@ -218,6 +218,7 @@ def cli(cli_context, config, provider, debug):
help="URL to download Hadoop from.",
default='http://www.apache.org/dyn/closer.lua/hadoop/common/hadoop-{v}/hadoop-{v}.tar.gz?as_json',
show_default=True)
@click.option('--hdfs-template-dir')
Owner:
The default for this should be under Flintrock's configuration dir.
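One way to read that suggestion, sketched below; the `get_config_dir()` helper and the exact path layout are illustrative assumptions, not Flintrock's actual API:

```python
import os

import click


def get_config_dir():
    # Hypothetical helper: resolve Flintrock's configuration directory.
    return os.environ.get(
        'FLINTROCK_CONFIG_DIR',
        os.path.join(os.path.expanduser('~'), '.config', 'flintrock'))


@click.command()
@click.option(
    '--hdfs-template-dir',
    default=os.path.join(get_config_dir(), 'templates', 'hadoop', 'conf'),
    show_default=True,
    help="Directory containing the Hadoop configuration file templates.")
def launch(hdfs_template_dir):
    click.echo(hdfs_template_dir)
```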

@click.option('--install-spark/--no-install-spark', default=True)
@click.option('--spark-executor-instances', default=1,
help="How many executor instances per worker.")
@@ -239,6 +240,7 @@ def cli(cli_context, config, provider, debug):
help="Git repository to clone Spark from.",
default='https://github.com/apache/spark',
show_default=True)
@click.option('--spark-template-dir')
Owner:
The default for this should be under Flintrock's configuration dir.

@click.option('--assume-yes/--no-assume-yes', default=False)
@click.option('--ec2-key-name')
@click.option('--ec2-identity-file',
@@ -281,12 +283,14 @@ def launch(
install_hdfs,
hdfs_version,
hdfs_download_source,
hdfs_template_dir,
install_spark,
spark_executor_instances,
spark_version,
spark_git_commit,
spark_git_repository,
spark_download_source,
spark_template_dir,
assume_yes,
ec2_key_name,
ec2_identity_file,
@@ -351,7 +355,11 @@ def launch(
check_external_dependency('ssh-keygen')

if install_hdfs:
hdfs = HDFS(version=hdfs_version, download_source=hdfs_download_source)
hdfs = HDFS(
version=hdfs_version,
download_source=hdfs_download_source,
template_dir=hdfs_template_dir
)
services += [hdfs]
if install_spark:
if spark_version:
@@ -360,6 +368,7 @@
version=spark_version,
hadoop_version=hdfs_version,
download_source=spark_download_source,
template_dir=spark_template_dir
)
elif spark_git_commit:
logger.warning(
@@ -373,6 +382,7 @@
git_commit=spark_git_commit,
git_repository=spark_git_repository,
hadoop_version=hdfs_version,
template_dir=spark_template_dir
)
services += [spark]

45 changes: 31 additions & 14 deletions flintrock/services.py
@@ -107,10 +107,14 @@ def health_check(


class HDFS(FlintrockService):
def __init__(self, *, version, download_source):
def __init__(self, *, version, download_source, template_dir):
self.version = version
self.download_source = download_source
self.manifest = {'version': version, 'download_source': download_source}
self.template_dir = template_dir
self.manifest = {
'version': version,
'download_source': download_source,
'template_dir': template_dir}
Owner:
This raises an important issue: The point of the manifest is to be a self-contained representation of how the cluster is configured. Any instance of Flintrock should be able to manage a cluster solely using the information in the manifest and from EC2.

With this change, however, Flintrock needs access to the specified local directory to get the correct templates. So if one instance of Flintrock launches a cluster, another instance of Flintrock on a different machine won't be able to expand or restart it correctly because the manifest only has the paths to the templates on some non-cluster machine and not the templates themselves.

Realistically, I would guess that most people using Flintrock do so from a single machine. But I'm not comfortable breaking the implicit guarantee that any instance of Flintrock at version X on any machine can fully manage clusters launched by Flintrock at version X. And I know there are some teams using Flintrock where this would matter.

To address this issue, we need to put the full contents of all the templates in the manifest (or somehow otherwise on the cluster) and use that in the add-slaves, remove-slaves, and start commands as necessary.
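A minimal sketch of that idea; the function names and manifest shape below are illustrative assumptions, not the PR's code. The point is that the manifest carries the template text itself rather than a path that only exists on the launching machine:

```python
import os

# Template file names HDFS configuration needs (mirrors the PR's template_paths).
HDFS_TEMPLATES = ['masters', 'slaves', 'hadoop-env.sh', 'core-site.xml', 'hdfs-site.xml']


def embed_templates(template_dir, template_names):
    """Read each template file and return its contents keyed by file name."""
    contents = {}
    for name in template_names:
        with open(os.path.join(template_dir, name)) as f:
            contents[name] = f.read()
    return contents


def build_hdfs_manifest(version, download_source, template_dir):
    # 'templates' holds full file contents, so add-slaves, remove-slaves, and
    # start can re-render configuration from the manifest alone, on any machine.
    return {
        'version': version,
        'download_source': download_source,
        'templates': embed_templates(template_dir, HDFS_TEMPLATES),
    }
```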

Owner:
A different approach we could take here is to note the template directory and some kind of hash of the contents in the manifest. If someone wants to expand an existing Flintrock cluster without having the same template files that were used during launch, Flintrock will display a warning but try to continue anyway.

This kind of forgiving behavior is technically dangerous, but in practice it should work well. And the warning should make it clear to users that they are responsible for any borked operations since they are not using the same files that were used during launch.

We should probably use this kind of approach for the other places where Flintrock references local files during launch, like --ec2-user-data.
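A sketch of that alternative, again with illustrative names rather than Flintrock's API: record a digest of the template contents at launch, then warn (but continue) when a later operation sees different local files:

```python
import hashlib
import os
import warnings


def hash_template_dir(template_dir, template_names):
    """Compute a stable digest of the named template files' contents."""
    digest = hashlib.sha256()
    for name in sorted(template_names):
        digest.update(name.encode('utf-8'))
        with open(os.path.join(template_dir, name), 'rb') as f:
            digest.update(f.read())
    return digest.hexdigest()


def check_templates_unchanged(template_dir, template_names, manifest_digest):
    """Warn, but don't fail, if local templates differ from those used at launch."""
    current = hash_template_dir(template_dir, template_names)
    if current != manifest_digest:
        warnings.warn(
            "Templates in {d} differ from the ones recorded at launch; "
            "continuing anyway, but any resulting misconfiguration is on "
            "the user.".format(d=template_dir))
    return current == manifest_digest
```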


def install(
self,
@@ -149,13 +153,18 @@ def configure(
cluster: FlintrockCluster):
# TODO: os.walk() through these files.
template_paths = [
'hadoop/conf/masters',
'hadoop/conf/slaves',
'hadoop/conf/hadoop-env.sh',
'hadoop/conf/core-site.xml',
'hadoop/conf/hdfs-site.xml',
'masters',
'slaves',
'hadoop-env.sh',
'core-site.xml',
'hdfs-site.xml',
]

# TODO: throw error if config folder / file can't be found
template_dir = self.template_dir
if template_dir is None:
template_dir = os.path.join(THIS_DIR, "templates/hadoop/conf")

for template_path in template_paths:
ssh_check_output(
client=ssh_client,
@@ -164,7 +173,7 @@
""".format(
f=shlex.quote(
get_formatted_template(
path=os.path.join(THIS_DIR, "templates", template_path),
path=os.path.join(template_dir, template_path),
mapping=generate_template_mapping(
cluster=cluster,
hadoop_version=self.version,
Expand All @@ -173,7 +182,7 @@ def configure(
spark_version='',
spark_executor_instances=0,
))),
p=shlex.quote(template_path)))
p=shlex.quote(os.path.join("hadoop/conf/", template_path))))

# TODO: Convert this into start_master() and split master- or slave-specific
# stuff out of configure() into configure_master() and configure_slave().
@@ -217,6 +226,7 @@ def __init__(
version: str=None,
hadoop_version: str,
download_source: str=None,
template_dir: str=None,
git_commit: str=None,
git_repository: str=None
):
Expand All @@ -230,6 +240,7 @@ def __init__(
self.version = version
self.hadoop_version = hadoop_version
self.download_source = download_source
self.template_dir = template_dir
self.git_commit = git_commit
self.git_repository = git_repository

@@ -238,6 +249,7 @@ def __init__(
'spark_executor_instances': spark_executor_instances,
'hadoop_version': hadoop_version,
'download_source': download_source,
'template_dir': template_dir,
'git_commit': git_commit,
'git_repository': git_repository}

@@ -310,10 +322,15 @@ def configure(
cluster: FlintrockCluster):

template_paths = [
'spark/conf/spark-env.sh',
'spark/conf/slaves',
'spark/conf/spark-defaults.conf',
'spark-env.sh',
'slaves',
'spark-defaults.conf',
]

template_dir = self.template_dir
if template_dir is None:
template_dir = os.path.join(THIS_DIR, "templates/spark/conf")

for template_path in template_paths:
ssh_check_output(
client=ssh_client,
@@ -322,14 +339,14 @@
""".format(
f=shlex.quote(
get_formatted_template(
path=os.path.join(THIS_DIR, "templates", template_path),
path=os.path.join(template_dir, template_path),
mapping=generate_template_mapping(
cluster=cluster,
spark_executor_instances=self.spark_executor_instances,
hadoop_version=self.hadoop_version,
spark_version=self.version or self.git_commit,
))),
p=shlex.quote(template_path)))
p=shlex.quote(os.path.join("spark/conf/", template_path))))

# TODO: Convert this into start_master() and split master- or slave-specific
# stuff out of configure() into configure_master() and configure_slave().