# Image Classification Labs

## Lab 3-4 Distributed Training (TensorFlow)

## Lab Steps - TensorFlow

Step 1. Add the SSH private key to the authentication agent (skip this step if you already did it in the distributed MXNet lab)

```
$ ssh-add -K <private key file name>
```

Step 2. SSH to the master node

Step 3. Clone git repository

Step 4. Create command files for each worker

Step 5. Run command files on all workers


### Copy training data into EFS directory

```
mkdir -p $EFS_MOUNT/cifar10_data && \
wget http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz --directory-prefix=$EFS_MOUNT/cifar10_data && \
tar -xzvf $EFS_MOUNT/cifar10_data/cifar-10-binary.tar.gz -C $EFS_MOUNT/cifar10_data
```
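
The CIFAR-10 binary archive unpacks into a `cifar-10-batches-bin` subdirectory holding five training batches and one test batch. The following is a minimal sketch, assuming a POSIX shell, of a check that the extraction completed; `check_cifar10_dir` is a hypothetical helper name and is not part of the lab repository.

```shell
# Verify that the CIFAR-10 binary archive was fully extracted.
# Usage: check_cifar10_dir <data_dir>
# (check_cifar10_dir is an illustrative helper, not part of deeplearning-cfn.)
check_cifar10_dir() {
    batch_dir="$1/cifar-10-batches-bin"
    missing=0
    for name in data_batch_1.bin data_batch_2.bin data_batch_3.bin \
                data_batch_4.bin data_batch_5.bin test_batch.bin; do
        if [ ! -f "$batch_dir/$name" ]; then
            # Report every missing file instead of stopping at the first one.
            echo "missing: $batch_dir/$name"
            missing=1
        fi
    done
    # Returns 0 when all six batch files exist, 1 otherwise.
    return "$missing"
}
```

For example, `check_cifar10_dir "$EFS_MOUNT/cifar10_data" && echo "dataset looks complete"` prints the confirmation only when all six files are present.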

### Generate commands to run TensorFlow workers and parameter servers on cluster instances

```
# Generate the commands that run workers and parameter servers on all the workers.
cd $EFS_MOUNT/deeplearning-cfn/examples/tensorflow && \
python generate_trainer.py --workers_file_path $DEEPLEARNING_WORKERS_PATH \
--worker_count $DEEPLEARNING_WORKERS_COUNT \
--worker_gpu_count $DEEPLEARNING_WORKER_GPU_COUNT \
--trainer_script_dir $EFS_MOUNT/deeplearning-cfn/examples/tensorflow \
--training_script $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/cifar10_multi_machine_train.py \
--batch_size 128 --data_dir=$EFS_MOUNT/cifar10_data \
--train_dir=$EFS_MOUNT/deeplearning-cfn/examples/tensorflow/train \
--log_dir $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/logs \
--max_steps 200000
```

A command file named `deeplearning-worker#.sh` is created for each worker, where `#` is the worker number. For example:

```
deeplearning-worker1.sh
source /etc/profile

CUDA_VISIBLE_DEVICES='' python /myEFSvolume/deeplearning-cfn/examples/tensorflow/cifar10_multi_machine_train.py --batch_size 128 --data_dir=/myEFSvolume/cifar10_data --train_dir=/myEFSvolume/deeplearning-cfn/examples/tensorflow/train --max_steps 200000 --ps_hosts=deeplearning-worker1:2222,deeplearning-worker2:2222,deeplearning-worker3:2222,deeplearning-worker4:2222 --worker_hosts= --job_name=ps --task_index=0 > /myEFSvolume/deeplearning-cfn/examples/tensorflow/logs/ps0 2>&1 &
```
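
The generator command relies on several environment variables (`EFS_MOUNT`, `DEEPLEARNING_WORKERS_PATH`, `DEEPLEARNING_WORKERS_COUNT`, `DEEPLEARNING_WORKER_GPU_COUNT`). If the generated files contain empty paths or host lists, an unset variable is a likely cause. The sketch below, which assumes a POSIX shell, reports any that are missing; `require_vars` is a hypothetical helper name, not something the lab environment provides.

```shell
# Report which of the named environment variables are unset or empty.
# Usage: require_vars NAME...
# (require_vars is an illustrative helper, not part of deeplearning-cfn.)
require_vars() {
    rc=0
    for name in "$@"; do
        # Indirect lookup: read the value of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name"
            rc=1
        fi
    done
    # Returns 0 when every variable has a value, 1 otherwise.
    return "$rc"
}
```

A typical call would be `require_vars EFS_MOUNT DEEPLEARNING_WORKERS_PATH DEEPLEARNING_WORKERS_COUNT DEEPLEARNING_WORKER_GPU_COUNT`, run before `generate_trainer.py`.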

### Run training command file on all workers

1. Terminate all running Python processes across the workers:
```
while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \
done 10<$DEEPLEARNING_WORKERS_PATH
```

2. Run the distributed training across all of the workers:
```
trainer_script_dir=$EFS_MOUNT/deeplearning-cfn/examples/tensorflow && while read -u 10 host; \
do ssh -o "StrictHostKeyChecking no" $host "bash $trainer_script_dir/$host.sh" ; \
done 10<$DEEPLEARNING_WORKERS_PATH
```

Once the scripts have started, a process listing on a worker (for example, from `ps -ef | grep python`) should show a parameter server process (`--job_name=ps`) and a worker process (`--job_name=worker`), similar to the following:

>ubuntu    2047     1  3 19:03 ?        00:00:01 python /myEFSvolume/deeplearning-cfn/examples/tensorflow/cifar10_multi_machine_train.py --batch_size 128 --data_dir=/myEFSvolume/cifar10_data --train_dir=/myEFSvolume/deeplearning-cfn/examples/tensorflow/train --max_steps 200000 --ps_hosts=deeplearning-worker1:2222,deeplearning-worker2:2222 --worker_hosts=deeplearning-worker1:2230,deeplearning-worker2:2230 --job_name=ps --task_index=0
>
>ubuntu    2048     1  4 19:03 ?        00:00:01 python /myEFSvolume/deeplearning-cfn/examples/tensorflow/cifar10_multi_machine_train.py --batch_size 128 --data_dir=/myEFSvolume/cifar10_data --train_dir=/myEFSvolume/deeplearning-cfn/examples/tensorflow/train --max_steps 200000 --ps_hosts=deeplearning-worker1:2222,deeplearning-worker2:2222 --worker_hosts=deeplearning-worker1:2230,deeplearning-worker2:2230 --job_name=worker --task_index=0
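
The generated script shown earlier redirects process output to a file under the `logs` directory (for example `logs/ps0`). Because that directory lives on the shared EFS volume, the logs can be read from the master node. The following is a minimal sketch, assuming a POSIX shell, that prints the tail of every log file; `tail_logs` is a hypothetical helper, and the names of the worker log files are an assumption, since only `ps0` appears above.

```shell
# Print the last N lines of every log file in a directory.
# Usage: tail_logs <log_dir> [lines]
# (tail_logs is an illustrative helper, not part of deeplearning-cfn.)
tail_logs() {
    log_dir="$1"
    count="${2:-5}"          # default to the last 5 lines of each file
    for f in "$log_dir"/*; do
        [ -f "$f" ] || continue   # skip subdirectories and an empty glob
        echo "== $(basename "$f") =="
        tail -n "$count" "$f"
    done
}
```

For example, `tail_logs "$EFS_MOUNT/deeplearning-cfn/examples/tensorflow/logs" 10` shows the ten most recent lines from each process.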

### How to monitor GPU usage

```
$ nvidia-smi
Mon Jul  3 19:14:06 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   43C    P0    77W / 149W |  10935MiB / 11439MiB |     58%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2048    C   python                                       10931MiB |
+-----------------------------------------------------------------------------+
```

To refresh the output every 0.5 seconds, run `watch -n 0.5 nvidia-smi`.
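
The full `nvidia-smi` table is hard to compare across several workers. `nvidia-smi` can also emit CSV through its `--query-gpu` and `--format` options, which is easier to condense. The sketch below assumes a POSIX shell with `awk`; `summarize_gpu` is a hypothetical helper that reads rows of `index, utilization, memory used, memory total` and prints one line per GPU.

```shell
# Turn "index, util, mem_used, mem_total" CSV rows into one summary line per GPU.
# Expected input comes from:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# (summarize_gpu is an illustrative helper, not part of deeplearning-cfn.)
summarize_gpu() {
    awk -F', *' '{
        printf "GPU %s: util %s%%, memory %s/%s MiB (%.1f%%)\n", $1, $2, $3, $4, 100 * $3 / $4
    }'
}
```

To use it, pipe the CSV query into the function, for instance `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | summarize_gpu`. The same pipeline can be wrapped in the `ssh` loop over `$DEEPLEARNING_WORKERS_PATH` used elsewhere in this lab to check every worker.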


## Evaluating the trained model

1. Terminate all running Python processes across the workers:
```
while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \
done 10<$DEEPLEARNING_WORKERS_PATH
```

2. Run the evaluation on the trained model:
```
python $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/models/tutorials/image/cifar10/cifar10_eval.py \
--data_dir=$EFS_MOUNT/cifar10_data/ \
--eval_dir=$EFS_MOUNT/deeplearning-cfn/examples/tensorflow/eval \
--checkpoint_dir=$EFS_MOUNT/deeplearning-cfn/examples/tensorflow/train
```

Sample output:
```
2017-07-03 19:15:44.131450: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-03 19:15:44.131492: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-03 19:15:44.131501: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-03 19:15:44.131507: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-03 19:15:44.131514: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-03 19:15:44.278553: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-03 19:15:44.279070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-03 19:15:44.279116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-07-03 19:15:44.279129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
2017-07-03 19:15:44.279145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
2017-07-03 19:15:50.506132: precision @ 1 = 0.784
```
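
The only line of interest in this output is the final `precision @ 1` value; the rest is TensorFlow startup logging. The following is a minimal sketch, assuming a POSIX shell with `sed`, that extracts just that number from saved evaluation output; `extract_precision` is a hypothetical helper name.

```shell
# Print the value from every "precision @ 1 = <number>" line read on stdin.
# (extract_precision is an illustrative helper, not part of deeplearning-cfn.)
extract_precision() {
    # -n suppresses lines that do not match; p prints only the captured number.
    sed -n 's/.*precision @ 1 = \([0-9.]*\).*/\1/p'
}
```

For example, if the evaluation output was saved to a file (here a hypothetical `eval.log`), `extract_precision < eval.log` prints one value per evaluation run.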

### [Optional] Using TensorBoard

```
tensorboard --logdir $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/train
```
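
TensorBoard serves its web UI on port 6006 by default, and that port is usually not reachable from outside the cluster. One common approach is an SSH tunnel from your local machine to the master node. The sketch below only prints the tunnel command so you can review it before running it; `tensorboard_tunnel_cmd` is a hypothetical helper, and the user and host names in the usage example are placeholders.

```shell
# Print (do not run) an SSH command that forwards a local port to TensorBoard
# on the master node.
# Usage: tensorboard_tunnel_cmd <user@host> [port]
# (tensorboard_tunnel_cmd is an illustrative helper, not part of deeplearning-cfn.)
tensorboard_tunnel_cmd() {
    target="$1"
    port="${2:-6006}"        # 6006 is TensorBoard's default port
    echo "ssh -N -L ${port}:localhost:${port} ${target}"
}
```

Running `tensorboard_tunnel_cmd ubuntu@<master node address>` on your local machine prints the command to execute; with the tunnel open, TensorBoard is available at `http://localhost:6006`.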

### Reference

* Distributed TensorFlow https://www.tensorflow.org/deploy/distributed