Merge pull request #5 from twinkarma/master
Now working on the main repo
twinkarma committed Oct 9, 2017
2 parents 0916d30 + 77312c1 commit 80d568b
Showing 8 changed files with 475 additions and 28 deletions.
2 changes: 2 additions & 0 deletions index.rst
@@ -6,6 +6,8 @@ JADE-HPC Facility User guide
:align: right
:alt: A picture showing some of the Jade hardware.

The Joint Academic Data Science Endeavour (JADE) is the largest GPU facility in the UK supporting world-leading research in machine learning. The computational hub will harness the capabilities of the NVIDIA DGX-1 Deep Learning System and comprise 22 servers, each containing 8 of the newest NVIDIA Tesla P100 GPUs linked by NVIDIA's NVLink interconnect technology. The new JADE facility aims to address the gap between university systems and access to national HPC services. This will drive forward innovation in machine learning, identifying new applications and insights into research challenges.

This is the documentation for Joint Academic Data Science Endeavour (JADE) High Performance Computing (HPC) facility.
Run by the Research Computing Group, with additional support from the Research Software Engineering team in Computer Science, it supports the computational needs of hundreds of researchers across all departments.

38 changes: 10 additions & 28 deletions jade/connecting.rst
@@ -3,7 +3,7 @@
Connecting to the cluster using SSH
===================================

The most versatile way to **run commands and submit jobs** on one of the clusters is to
The most versatile way to **run commands and submit jobs** on the cluster is to
use a mechanism called `SSH <https://en.wikipedia.org/wiki/Secure_Shell>`__,
which is a common way of remotely logging in to computers
running the Linux operating system.
@@ -59,6 +59,15 @@ log in to a cluster: ::

Here you need to replace ``$USER`` with your username (e.g. ``te1st-test``)

.. note::

    JADE has multiple front-end systems, and because of this some SSH software operating under stringent security settings might give **warnings about possible man-in-the-middle attacks** because of apparent changes in machine settings. This is a known issue and is being addressed, but in the meantime **these warnings can be safely ignored**.

    To suppress the warning, add the option ``-o StrictHostKeyChecking=no`` to your SSH command, e.g.: ::

        ssh -o StrictHostKeyChecking=no -l $USER jade.hartree.stfc.ac.uk

    Alternatively, add the following line to your ``~/.ssh/config`` file: ::

        StrictHostKeyChecking no
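
    If you prefer, the setting can be scoped to JADE only via a ``Host`` entry in ``~/.ssh/config``; a sketch (the ``jade`` alias and the username are illustrative): ::

        Host jade
            HostName jade.hartree.stfc.ac.uk
            User te1st-test
            StrictHostKeyChecking no

    You can then connect with just ``ssh jade``.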

.. note::

**macOS users**: if this fails then:
@@ -69,30 +78,3 @@ Here you need to replace ``$USER`` with your username (e.g. ``te1st-test``)
This should give you a prompt resembling the one below: ::

te1st-test@dgj223:~$

At this prompt, to run ``bash`` in an interactive working node, type: ::

srun --pty bash

Like this: ::

te1st-test@dgj223:~$ srun --pty bash

Notice that you have been moved from the head node ``dgj223`` to the working node ``dgj113`` ready to run jobs interactively: ::

te1st-test@dgj113:~$


.. note::

When you log in to a cluster you reach one of two login nodes.
You **should not** run applications on the login nodes.
Running ``srun`` gives you an interactive terminal
on one of the many worker nodes in the cluster.



What Next?
----------

Now that you have connected to the cluster, you can look at how to submit jobs with :ref:`Slurm <slurm>` or at the software installed on :ref:`JADE <software>`.
98 changes: 98 additions & 0 deletions software/apps/caffe.rst
@@ -0,0 +1,98 @@
.. _caffe:

Caffe
=====

.. sidebar:: Caffe

:URL: http://caffe.berkeleyvision.org/

`Caffe <http://caffe.berkeleyvision.org/>`_ is a Deep Learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (`BVLC <http://bvlc.eecs.berkeley.edu/>`_) and by community contributors.

The Caffe Docker Container
--------------------------

Caffe is available on JADE through the use of a `Docker container <https://docker.com>`_. For more information on JADE's use of containers, see :ref:`containers`.


Using Caffe Interactively
-------------------------

All the contained applications are launched interactively in the same way, within one compute node at a time. The number of GPUs to be used per node is requested using the ``--gres`` option. To request an interactive session on a compute node, issue the following command from the login node: ::

# Requesting 2 GPUs for Caffe version 17.04
srun --gres=gpu:2 --pty /jmain01/apps/docker/caffe 17.04

This command will show the following, which is now running on a compute node: ::

==================
== NVIDIA Caffe ==
==================

NVIDIA Release 17.04 (build 26740)

Container image Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014, 2015, The Regents of the University of California (Regents)
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

groups: cannot find name for group ID 1002
I have no name!@124cf0e3582e:/home_directory$

You are now inside the container, where the ``Caffe`` software is installed. Let's check the version: ::

caffe --version

You can now begin training your network: ::

caffe train -solver=my_solver.prototxt
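
The ``my_solver.prototxt`` file is your own solver definition. As a minimal sketch (the file names and parameter values here are illustrative, not JADE defaults): ::

    # Hypothetical solver definition for a small training run
    net: "my_train_val.prototxt"   # your network definition file
    base_lr: 0.01                  # initial learning rate
    lr_policy: "step"
    gamma: 0.1
    stepsize: 10000
    max_iter: 45000
    snapshot: 5000
    snapshot_prefix: "my_model"
    solver_mode: GPU               # train on the GPUs requested via --gres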

Using Caffe in Batch Mode
-------------------------

There are wrappers for launching the containers within batch mode.

First, navigate to the folder you wish your script to launch from; for example, we'll use the home directory: ::

cd ~

It is recommended that you create a script file, e.g. ``script.sh``: ::

#!/bin/bash

# Prints out Caffe's version number
caffe --version

And don't forget to make your ``script.sh`` executable: ::

chmod +x script.sh

Then create a Slurm batch script to launch the code, e.g. ``batch.sh``: ::

#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p all
#SBATCH -J Caffe-job
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00

# Launch the commands within script.sh
/jmain01/apps/docker/caffe-batch -c ./script.sh

You can then submit the job using ``sbatch``: ::

sbatch batch.sh

The output will appear in the Slurm standard output file.

.. _GPUComputing@Sheffield: http://gpucomputing.shef.ac.uk
File renamed without changes.
File renamed without changes.
167 changes: 167 additions & 0 deletions software/apps/tensorflow.rst
@@ -0,0 +1,167 @@
.. _tensorflow:

Tensorflow
==========

.. sidebar:: Tensorflow

:URL: https://www.tensorflow.org/

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

About Tensorflow on ShARC
-------------------------

**A GPU-enabled worker node must be requested in order to use the GPU version of this software. See** :ref:`GPUComputing_sharc` **for more information.**

Tensorflow is available on ShARC both as Singularity images and via local installation.

As Tensorflow and all its dependencies are written in Python, it can be installed locally in your home directory. The use of Anaconda (:ref:`sharc-python-conda`) is recommended as it is able to create a virtual environment in your home directory, allowing the installation of new Python packages without admin permission.

This software and documentation is maintained by the `RSES group <http://rse.shef.ac.uk/>`_ and `GPUComputing@Sheffield <http://gpucomputing.shef.ac.uk/>`_. For feature requests or if you encounter any problems, please raise an issue on the `GPU Computing repository <https://github.com/RSE-Sheffield/GPUComputing/issues>`_.

Tensorflow Singularity Images
-----------------------------

Singularity images are self-contained virtual machines similar to Docker. For more information on Singularity and how to use the images, see :ref:`singularity_sharc`.

Symlinked files are provided that always point to the latest images: ::

#CPU Tensorflow
/usr/local/packages/singularity/images/tensorflow/cpu.img

#GPU Tensorflow
/usr/local/packages/singularity/images/tensorflow/gpu.img

To get a bash terminal in an image, for example, use the command: ::

singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img /bin/bash

The ``exec`` command can also be used to call any command/script inside the image e.g. ::

singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img python your_tensorflow_script.py

You may get a warning similar to ``groups: cannot find name for group ID ...``. This can be ignored and will not have an effect on running the image.

The paths ``/fastdata``, ``/data``, ``/home``, ``/scratch``, ``/shared`` are automatically mounted to your ShARC filestore directories. For GPU-enabled images, ``/nvlib`` and ``/nvbin`` are mounted to the correct Nvidia driver version for the node that you're using.

Tensorflow is installed as part of Anaconda and can be found inside the image at: ::

/usr/local/anaconda3-4.2.0/lib/python3.5/site-packages/tensorflow


**To submit jobs that use a Singularity image, see** :ref:`use_image_batch_singularity_sharc` **for more detail.**
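
As an illustrative sketch only (the scheduler directives are assumptions), a batch job might wrap the ``singularity exec`` command shown earlier: ::

    #!/bin/bash
    # GPU request directive is an assumption; check the ShARC scheduler docs
    #$ -l gpu=1

    singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img \
        python your_tensorflow_script.py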

Image Index
^^^^^^^^^^^

Paths to the actual images and definition files are provided below for downloading and building of custom images.

* Shortcut to latest image

  * CPU: ``/usr/local/packages/singularity/images/tensorflow/cpu.img``
  * GPU: ``/usr/local/packages/singularity/images/tensorflow/gpu.img``

* CPU images

  * Latest: 1.0.1-CPU-Ubuntu16.04-Anaconda3.4.2.0.img (GCC 5.4.0, Python 3.5)

    * Path: ``/usr/local/packages/singularity/images/tensorflow/1.0.1-CPU-Ubuntu16.04-Anaconda3.4.2.0.img``
    * Def file: :download:`/sharc/software/apps/singularity/tensorflow_cpu.def </sharc/software/apps/singularity/tensorflow_cpu.def>`

* GPU images

  * Latest: 1.0.1-GPU-Ubuntu16.04-Anaconda3.4.2.0-CUDA8-cudNN5.0.img (GCC 5.4.0, Python 3.5)

    * Path: ``/usr/local/packages/singularity/images/tensorflow/1.0.1-GPU-Ubuntu16.04-Anaconda3.4.2.0-CUDA8-cudNN5.0.img``
    * Def file: :download:`/sharc/software/apps/singularity/tensorflow_gpu.def </sharc/software/apps/singularity/tensorflow_gpu.def>`

Installation in Home Directory
------------------------------

The following instructions describe how to set up Tensorflow under your user account.

First request an interactive session, e.g. with :ref:`qrshx`. To use GPUs see :ref:`GPUInteractive_sharc`.

Load the relevant modules (our example uses CUDA 8.0 with cuDNN 5.1 but :ref:`other versions are available <cudnn_sharc>`) ::

module load apps/python/anaconda3-4.2.0
module load libs/cudnn/5.1/binary-cuda-8.0.44


Create a conda environment on your local user account and activate it: ::

conda create -n tensorflow python=3.5
source activate tensorflow

Then install tensorflow with the following commands ::

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0-cp35-cp35m-linux_x86_64.whl
pip install $TF_BINARY_URL

You can test that Tensorflow is running on the GPU with the following python code ::

import tensorflow as tf
# Creates a graph.
with tf.device('/gpu:0'):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

This gives the following result: ::

[[ 22. 28.]
[ 49. 64.]]
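
As a quick sanity check, the same multiplication can be reproduced in plain Python with no Tensorflow involved (the ``matmul`` helper below is hypothetical): ::

    # Reproduce the [2, 3] x [3, 2] matrix product from the example above
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

    def matmul(x, y):
        # Naive matrix multiplication over lists of lists
        return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
                 for j in range(len(y[0]))]
                for i in range(len(x))]

    print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]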

Every Session Afterwards and in Your Job Scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The previous instructions install Tensorflow and its dependencies inside your home directory, but in every new session, and in your job scripts, the modules must be loaded and conda must be activated again. Use the following commands to activate the conda environment with Tensorflow installed: ::

module load apps/python/anaconda3-4.2.0
module load libs/cudnn/5.1/binary-cuda-8.0.44
source activate tensorflow
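
In a job script this means including those lines before your Python command; a sketch (the GPU request directive and script name are assumptions): ::

    #!/bin/bash
    #$ -l gpu=1

    # Load the modules and activate the environment inside the job itself
    module load apps/python/anaconda3-4.2.0
    module load libs/cudnn/5.1/binary-cuda-8.0.44
    source activate tensorflow

    python your_tensorflow_script.py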

Using multiple GPUs
-------------------
Example taken from `tensorflow documentation <https://www.tensorflow.org/versions/r0.11/how_tos/using_gpu/index.html>`_.

If you would like to run TensorFlow on multiple GPUs, you can construct your model in a multi-tower fashion where each tower is assigned to a different GPU. For example: ::

import tensorflow as tf
# Creates a graph.
c = []
for d in ['/gpu:2', '/gpu:3']:
with tf.device(d):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))

You will see the following output. ::

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K20m, pci bus
id: 0000:02:00.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla K20m, pci bus
id: 0000:03:00.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: Tesla K20m, pci bus
id: 0000:83:00.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: Tesla K20m, pci bus
id: 0000:84:00.0
Const_3: /job:localhost/replica:0/task:0/gpu:3
Const_2: /job:localhost/replica:0/task:0/gpu:3
MatMul_1: /job:localhost/replica:0/task:0/gpu:3
Const_1: /job:localhost/replica:0/task:0/gpu:2
Const: /job:localhost/replica:0/task:0/gpu:2
MatMul: /job:localhost/replica:0/task:0/gpu:2
AddN: /job:localhost/replica:0/task:0/cpu:0
[[ 44. 56.]
[ 98. 128.]]
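
Note that the final matrix is exactly twice the single-GPU result, since ``tf.add_n`` sums the two identical tower outputs; a quick plain-Python check (hypothetical, no Tensorflow needed): ::

    # Each tower computes the same [2, 2] product; add_n sums the two copies
    tower = [[22.0, 28.0], [49.0, 64.0]]
    total = [[2 * v for v in row] for row in tower]
    print(total)  # [[44.0, 56.0], [98.0, 128.0]]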
