Opening a black-box

Run cluster locally

  1. On first use, download the data with getdata.sh in the database directory.
  2. Go to the spark-config directory.
  3. Run run_cluster_locally.sh.
  4. Access the Spark master UI at http://localhost:8080/
  5. Submit the jar with submit.sh in the black-box directory (the full sequence is sketched below).
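
A minimal shell sketch of the steps above, assuming the scripts keep the names and locations listed and are run from the repository root:

cd database && ./getdata.sh                      # first run only: download the data
cd ../spark-config && ./run_cluster_locally.sh   # start the local cluster
# the Spark master UI is now available at http://localhost:8080/
cd ../black-box && ./submit.sh                   # submit the jar to the local cluster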

Run cluster remotely

Configure ssh connection

It's recommended to use aliases for connecting to the cluster; otherwise, some scripts won't work. Modify ~/.ssh/config following this pattern:

Host <number_of_node>
  Port 22
  User magisterka
  HostName <node_ip_address>
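
For example, an entry for the first worker node might look like this (the alias and address are illustrative; the worker IPs used later in this README follow the 192.168.55.x scheme):

Host 1
  Port 22
  User magisterka
  HostName 192.168.55.11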

Create env file with password

cd scripts
touch password.env
echo <your_password> > password.env

Configure and run cluster

scripts/prepare_nodes.sh <git_branch_to_checkout:-main>
scripts/start_master.sh
scripts/start_workers.sh
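
The placeholder syntax suggests the branch argument is optional and defaults to main. To prepare the nodes from a specific branch (branch name is illustrative):

scripts/prepare_nodes.sh my-experiment-branch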

Generate TPC-DS data

Make sure that the TPC-DS tool is available on the master node - if not, go to database/README.

Parametrize the script with the desired data size. The default value is 1 GB. TPC-DS can generate data sets ranging from 1 GB to 10 TB.

database/generate_tpc_ds.sh <data_size_in_GB>
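
For example, to generate a 10 GB data set:

database/generate_tpc_ds.sh 10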

Submit jar to the cluster with script

scripts/submit.sh <function_name>
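
For example, submitting one of the function names used elsewhere in this README:

scripts/submit.sh averageTemperatureByDeviceIdSeason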

Expected output:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20210402161642-0000",
  "serverSparkVersion" : "3.0.2",
  "submissionId" : "driver-20210402161642-0000",
  "success" : true
100   779  100   223  100   556    888   2215 --:--:-- --:--:-- --:--:--  3103

Submit jar to the cluster via REST API

curl --location --request POST '192.168.55.20:5000/submit' \
--header 'Content-Type: application/json' \
--data-raw '{
    "function_name": "averageTemperatureByDeviceIdSeason"
}'

Expected response:

{
  "action": "CreateSubmissionResponse",
  "message": "Driver successfully submitted as driver-20210407145229-0000",
  "serverSparkVersion": "3.0.2",
  "submissionId": "driver-20210407145229-0000",
  "success": true
}

Get the driver status via REST API

curl --location --request GET '192.168.55.20:5000/status' \
--header 'Content-Type: application/json' \
--data-raw '{
    "driver_id": "driver-20210407145229-0000"
}'

Expected response:

{
  "action": "SubmissionStatusResponse",
  "driverState": "FINISHED",
  "serverSparkVersion": "3.0.2",
  "submissionId": "driver-20210407145229-0000",
  "success": true,
  "workerHostPort": "10.5.0.6:40829",
  "workerId": "worker-20210407145657-10.5.0.6-40829"
}

Stop cluster

scripts/stop_all.sh

Potential problems

If you get an error like

Error response from daemon: attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded

Inspect the Docker network (spark-network) on the master node and make sure that it was assigned the following addresses (a sample check follows the list):

  • 10.5.0.2
  • 10.5.0.3
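
A quick way to check, assuming the Docker CLI is available on the master node:

docker network inspect spark-network | grep -E '"Name"|"IPv4Address"'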

Run experiments

Make sure you have a hosts_info.csv file in the monitor-manager directory. The file should end with an empty line.

host_ip container_name
192.168.55.20 spark-master
192.168.55.11 spark-worker-1
192.168.55.12 spark-worker-2
192.168.55.13 spark-worker-3

Create experiments plan csv file

For example:

function_name dataset_size iterations
countDistinctTicketNumber 1GB 25

Check if monitor-manager is running

curl --location --request GET 'http://192.168.55.20:8888/'

Start experiments

curl --location --request POST 'http://192.168.55.20:8888/experiments'

Get experiments data

scripts/get_experiments_data.sh 

Data preprocessing

To calculate the mean CPU usage, RAM usage, and duration across all experiment iterations (per node), run notebook/prepare_data.ipynb.

The notebook creates individual plots for every iteration (per node) as well as plots of the mean values across iterations. The calculated mean RAM and CPU values, together with a new column containing the mean experiment duration, are stored in a new file, experiment_mean_data.csv.
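
To run the notebook non-interactively, a standard option is nbconvert (assuming Jupyter is installed on the machine holding the experiment data):

jupyter nbconvert --to notebook --execute notebook/prepare_data.ipynb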