# Launching Spark in the AWS Cloud with `flintrock`

[**flintrock**](https://github.com/nchammas/flintrock) is a command-line tool that simplifies setting up a Spark cluster on Amazon EC2.

## Preamble

In [1]:
from data_science_learning_paths import show_command

## Setup

The following steps describe the creation of a Spark cluster with `flintrock`.

The configurations and code examples used below come from **[this repository]()**.

_Clone the repository to your local machine and point the `proj_path` variable to it._ 

In [2]:
proj_path = "/Users/cls/Documents/Work/Projects/point8/DataScienceLearningPaths/big-data-cluster"

In [3]:
show_command(f"cd {proj_path}")

### Configuration File

A `flintrock` configuration file describes the initial configuration of the cluster. Here is an example:

```yaml
services:
  spark:
    version: 2.2.0
  hdfs:
    version: 2.7.3

provider: ec2

providers:
  ec2:
    key-name: aws-bigdata
    identity-file: /Users/user/.ssh/aws-bigdata.pem
    instance-type: t2.micro
    region: eu-central-1
    ami: ami-043097594a7df80ec   # Amazon Linux
    user: ec2-user
    tenancy: default
    ebs-optimized: no
    instance-initiated-shutdown-behavior: terminate

launch:
  num-slaves: 4

debug: true
```

Edit as needed:

**Credentials**

- `key-name`: name of the SSH key pair used for login
- `identity-file`: path to the private key file of the SSH key pair

Furthermore, _you need to supply your AWS access key information_ when calling a `flintrock` command. The AWS documentation describes [how to set the keys as environment variables](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html).
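
Before launching anything, it can help to confirm that the credentials are actually visible to the current process. The following is a minimal sketch, assuming the keys are supplied through the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables described in the linked AWS documentation. The helper `missing_aws_vars` is a hypothetical name, not part of `flintrock`, and credentials configured by other means (for example a shared credentials file) are not detected by this check.

```python
import os

# Variable names as documented for the AWS CLI (assumption: keys are set this way).
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(environ=None):
    """Return the names of required AWS variables that are unset or empty.

    Hypothetical helper for illustration; it is not part of flintrock.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing AWS credentials:", ", ".join(missing))
else:
    print("AWS credentials found in the environment.")
```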

**Cluster configuration**

- `num-slaves`: number of worker nodes
- `instance-type`: the EC2 instance type of the cluster nodes
- `ami`: ID of the selected Amazon Machine Image, e.g. the latest Amazon Linux AMI (recommended)

**Instance configuration**

- `user`: the non-root user that `flintrock` uses to log in to each instance (`ec2-user` on Amazon Linux)

## Create Cluster

The command for launching a new cluster has the following pattern:

```shell
> flintrock --config=<config-file.yaml> launch <cluster-name>
```
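
All commands in this notebook follow the same pattern, so the pieces can be assembled programmatically. The sketch below is one possible way to do this. The helper `flintrock_argv` is a hypothetical name introduced for illustration, and the config path and cluster name are the example values used in this notebook.

```python
import shlex


def flintrock_argv(config_path, subcommand, cluster_name, *extra):
    """Build the argument list for one flintrock invocation.

    Hypothetical helper for illustration; it only assembles strings
    and does not contact AWS.
    """
    return ["flintrock", f"--config={config_path}", subcommand, cluster_name, *extra]


argv = flintrock_argv("config/bigdata-cluster.yaml", "launch", "bigdata-cluster")

# shlex.join renders the list as a single, shell-safe command line.
print(shlex.join(argv))
```

To execute the command from Python instead of printing it, the list could be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues altogether.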

_Select one of the prepared configurations_

In [4]:
#cluster_name = "test-cluster"
cluster_name = "bigdata-cluster"

In [5]:
config_path = f"config/{cluster_name}.yaml"
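
To see which prepared configurations exist, the configuration directory can be listed. This is a minimal sketch, assuming (as the `config_path` pattern above suggests) that each configuration is stored as `config/<cluster-name>.yaml` relative to the working directory. `available_clusters` is a hypothetical helper name.

```python
from pathlib import Path


def available_clusters(config_dir="config"):
    """Return cluster names derived from the *.yaml files in config_dir.

    Assumes one file per cluster, named <cluster-name>.yaml.
    Returns an empty list if the directory does not exist.
    """
    return sorted(p.stem for p in Path(config_dir).glob("*.yaml"))


print(available_clusters())
```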

_Launch cluster_

In [6]:
show_command(f"flintrock --config={config_path} launch {cluster_name}")

## Log in to Cluster

_Log in to the master node of the cluster with SSH_

In [7]:
show_command(f"flintrock --config={config_path} login {cluster_name}")

## Run Commands

_Run any shell command on the cluster nodes_

In [8]:
command = "mkdir jobs"

In [9]:
show_command(f"flintrock --config={config_path} run-command {cluster_name} '{command}'")

_Run on the master node only_

In [10]:
show_command(f"flintrock --config={config_path} run-command --master-only {cluster_name} '{command}'")
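
Wrapping `{command}` in literal single quotes works for simple commands such as `mkdir jobs`, but it breaks as soon as the command itself contains a single quote. A more robust sketch uses `shlex.quote` from the Python standard library. The example command and the literal config path and cluster name below are illustrative values only.

```python
import shlex

# Example command that itself contains single quotes (illustrative only).
command = "echo 'hello from the cluster' > jobs/hello.txt"

# shlex.quote escapes the string so the local shell passes it on as one argument.
quoted = shlex.quote(command)

print(f"flintrock --config=config/bigdata-cluster.yaml run-command bigdata-cluster {quoted}")

# Round trip: splitting the quoted form yields the original command again.
assert shlex.split(quoted) == [command]
```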

## Copy Files

_Copy files to all nodes_

In [11]:
show_command(f"flintrock --config={config_path} copy-file {cluster_name} jobs/pi_approximation.py jobs/pi_approximation.py")

## Adding / Removing Nodes

_You can change the number of worker nodes after the cluster has been created_

In [12]:
n_workers = 3

In [13]:
show_command(f"flintrock --config={config_path} add-slaves {cluster_name} --num-slaves {n_workers}")

In [14]:
show_command(f"flintrock --config={config_path} remove-slaves {cluster_name} --num-slaves {n_workers}")

## Destroy Cluster

The command for destroying the cluster (terminating all instances) has the following pattern:

```shell
> flintrock --config=<config-file.yaml> destroy <cluster-name>
```

In [15]:
show_command(f"flintrock --config={config_path} destroy {cluster_name}")

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_