Merge pull request #5 from twinkarma/master
Now working on the main repo
twinkarma committed Oct 9, 2017
2 parents 0916d30 + 77312c1 commit 80d568b
Showing 8 changed files with 475 additions and 28 deletions.
2 changes: 2 additions & 0 deletions index.rst
@@ -6,6 +6,8 @@ JADE-HPC Facility User guide
:align: right
:alt: A picture showing some of the Jade hardware.

The Joint Academic Data Science Endeavour (JADE) is the largest GPU facility in the UK supporting world-leading research in machine learning. The computational hub will harness the capabilities of the NVIDIA DGX-1 Deep Learning System and comprise 22 servers, each containing 8 of the newest NVIDIA Tesla P100 GPUs linked by NVIDIA's NVLink interconnect technology. The new JADE facility aims to address the gap between university systems and access to national HPC services. This will drive forward innovation in machine learning, identifying new applications and insights into research challenges.

This is the documentation for Joint Academic Data Science Endeavour (JADE) High Performance Computing (HPC) facility.
Run by the Research Computing Group, with additional support from the Research Software Engineering team in Computer Science, it supports the computational needs of hundreds of researchers across all departments.

38 changes: 10 additions & 28 deletions jade/connecting.rst
@@ -3,7 +3,7 @@
Connecting to the cluster using SSH
===================================

The most versatile way to **run commands and submit jobs** on one of the clusters is to
The most versatile way to **run commands and submit jobs** on the cluster is to
use a mechanism called `SSH <https://en.wikipedia.org/wiki/Secure_Shell>`__,
which is a common way of remotely logging in to computers
running the Linux operating system.
@@ -59,6 +59,15 @@ log in to a cluster: ::

Here you need to replace ``$USER`` with your username (e.g. ``te1st-test``)

.. note::

    JADE has multiple front-end systems, and because of this some SSH software operating under stringent security settings might give **warnings about possible man-in-the-middle attacks** because of apparent changes in machine settings. This is a known issue and is being addressed, but in the meantime **these warnings can be safely ignored**.

    To suppress the warning, add the option ``-o StrictHostKeyChecking=no`` to your SSH command, e.g.: ::

        ssh -o StrictHostKeyChecking=no -l $USER jade.hartree.stfc.ac.uk

    Alternatively, add the following line to your ``~/.ssh/config`` file: ::

        StrictHostKeyChecking no
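
    If you prefer, the setting can be scoped to JADE only via a ``Host`` entry in ``~/.ssh/config``; a sketch (the ``jade`` alias and the username are illustrative): ::

        Host jade
            HostName jade.hartree.stfc.ac.uk
            User te1st-test
            StrictHostKeyChecking no

    You can then connect with just ``ssh jade``.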

.. note::

**macOS users**: if this fails then:
@@ -69,30 +78,3 @@ Here you need to replace ``$USER`` with your username (e.g. ``te1st-test``)
This should give you a prompt resembling the one below: ::

te1st-test@dgj223:~$

At this prompt, to run ``bash`` in an interactive working node, type: ::

srun --pty bash

Like this: ::

te1st-test@dgj223:~$ srun --pty bash

Notice that you have been moved from the head node ``dgj223`` to the working node ``dgj113`` ready to run jobs interactively: ::

te1st-test@dgj113:~$


.. note::

When you log in to a cluster you reach one of two login nodes.
You **should not** run applications on the login nodes.
Running ``srun`` gives you an interactive terminal
on one of the many worker nodes in the cluster.



What Next?
----------

Now that you have connected to the cluster, you can look at how to submit jobs with :ref:`Slurm <slurm>` or at the software installed on :ref:`JADE <software>`.
98 changes: 98 additions & 0 deletions software/apps/caffe.rst
@@ -0,0 +1,98 @@
.. _caffe:

Caffe
=====

.. sidebar:: Caffe

:URL: http://caffe.berkeleyvision.org/

`Caffe <http://caffe.berkeleyvision.org/>`_ is a Deep Learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (`BVLC <http://bvlc.eecs.berkeley.edu/>`_) and by community contributors.

The Caffe Docker Container
--------------------------

Caffe is available on JADE through the use of a `Docker container <https://docker.com>`_. For more information on JADE's use of containers, see :ref:`containers`.


Using Caffe Interactively
-------------------------

All the contained applications are launched interactively in the same way, within one compute node at a time. The number of GPUs to be used per node is requested using the ``--gres`` option. To request an interactive session on a compute node, issue the following command from the login node: ::

# Requesting 2 GPUs for Caffe version 17.04
srun --gres=gpu:2 --pty /jmain01/apps/docker/caffe 17.04

This command will show the following, which is now running on a compute node: ::

==================
== NVIDIA Caffe ==
==================

NVIDIA Release 17.04 (build 26740)

Container image Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014, 2015, The Regents of the University of California (Regents)
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

groups: cannot find name for group ID 1002
I have no name!@124cf0e3582e:/home_directory$

You are now inside the container, where the ``Caffe`` software is installed. Let's check the version: ::

caffe --version

You can now begin training your network: ::

caffe train -solver=my_solver.prototxt
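
The ``my_solver.prototxt`` file is your own solver definition. As a minimal sketch (the file names and parameter values here are illustrative, not JADE defaults): ::

    # Hypothetical solver definition for a small training run
    net: "my_train_val.prototxt"   # your network definition file
    base_lr: 0.01                  # initial learning rate
    lr_policy: "step"
    gamma: 0.1
    stepsize: 10000
    max_iter: 45000
    snapshot: 5000
    snapshot_prefix: "my_model"
    solver_mode: GPU               # train on the GPUs requested via --gres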

Using Caffe in Batch Mode
-------------------------

There are wrappers for launching the containers within batch mode.

First, navigate to the folder you wish your script to launch from; for example, we'll use the home directory: ::

cd ~

It is recommended that you create a script file, e.g. ``script.sh``: ::

#!/bin/bash

# Prints out Caffe's version number
caffe --version

And don't forget to make your ``script.sh`` executable: ::

chmod +x script.sh

Then create a Slurm batch script to launch the code, e.g. ``batch.sh``: ::

#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p all
#SBATCH -J Caffe-job
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00

# Launch the commands within script.sh
/jmain01/apps/docker/caffe-batch -c ./script.sh

You can then submit the job using ``sbatch``: ::

sbatch batch.sh

The output will appear in the Slurm standard output file.

.. _GPUComputing@Sheffield: http://gpucomputing.shef.ac.uk
File renamed without changes.
File renamed without changes.
167 changes: 167 additions & 0 deletions software/apps/tensorflow.rst
@@ -0,0 +1,167 @@
.. _tensorflow:

Tensorflow
==========

.. sidebar:: Tensorflow

:URL: https://www.tensorflow.org/

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

About Tensorflow on ShARC
-------------------------

**A GPU-enabled worker node must be requested in order to use the GPU version of this software. See** :ref:`GPUComputing_sharc` **for more information.**

Tensorflow is available on ShARC both as Singularity images and via local installation.

As Tensorflow and all its dependencies are written in Python, it can be installed locally in your home directory. The use of Anaconda (:ref:`sharc-python-conda`) is recommended as it is able to create a virtual environment in your home directory, allowing the installation of new Python packages without admin permission.

This software and documentation is maintained by the `RSES group <http://rse.shef.ac.uk/>`_ and `GPUComputing@Sheffield <http://gpucomputing.shef.ac.uk/>`_. For feature requests or if you encounter any problems, please raise an issue on the `GPU Computing repository <https://github.com/RSE-Sheffield/GPUComputing/issues>`_.

Tensorflow Singularity Images
-----------------------------

Singularity images are self-contained virtual machines similar to Docker. For more information on Singularity and how to use the images, see :ref:`singularity_sharc`.

Symlinked files are provided that always point to the latest images: ::

#CPU Tensorflow
/usr/local/packages/singularity/images/tensorflow/cpu.img

#GPU Tensorflow
/usr/local/packages/singularity/images/tensorflow/gpu.img

To get a bash terminal in an image, for example, use the command: ::

singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img /bin/bash

The ``exec`` command can also be used to call any command/script inside the image e.g. ::

singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img python your_tensorflow_script.py

You may get a warning similar to ``groups: cannot find name for group ID ...``. This can be ignored and will not have an effect on running the image.

The paths ``/fastdata``, ``/data``, ``/home``, ``/scratch``, ``/shared`` are automatically mounted to your ShARC filestore directories. For GPU-enabled images, ``/nvlib`` and ``/nvbin`` are mounted to the correct Nvidia driver version for the node that you're using.

Tensorflow is installed as part of Anaconda and can be found inside the image at: ::

/usr/local/anaconda3-4.2.0/lib/python3.5/site-packages/tensorflow


**To submit jobs that use a Singularity image, see** :ref:`use_image_batch_singularity_sharc` **for more detail.**
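
As an illustrative sketch only (the scheduler directives are assumptions), a batch job might wrap the ``singularity exec`` command shown earlier: ::

    #!/bin/bash
    # GPU request directive is an assumption; check the ShARC scheduler docs
    #$ -l gpu=1

    singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img \
        python your_tensorflow_script.py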

Image Index
^^^^^^^^^^^

Paths to the actual images and definition files are provided below for downloading and building of custom images.

* Shortcut to latest image

  * CPU: ``/usr/local/packages/singularity/images/tensorflow/cpu.img``
  * GPU: ``/usr/local/packages/singularity/images/tensorflow/gpu.img``

* CPU images

  * Latest: 1.0.1-CPU-Ubuntu16.04-Anaconda3.4.2.0.img (GCC 5.4.0, Python 3.5)

    * Path: ``/usr/local/packages/singularity/images/tensorflow/1.0.1-CPU-Ubuntu16.04-Anaconda3.4.2.0.img``
    * Def file: :download:`/sharc/software/apps/singularity/tensorflow_cpu.def </sharc/software/apps/singularity/tensorflow_cpu.def>`

* GPU images

  * Latest: 1.0.1-GPU-Ubuntu16.04-Anaconda3.4.2.0-CUDA8-cudNN5.0.img (GCC 5.4.0, Python 3.5)

    * Path: ``/usr/local/packages/singularity/images/tensorflow/1.0.1-GPU-Ubuntu16.04-Anaconda3.4.2.0-CUDA8-cudNN5.0.img``
    * Def file: :download:`/sharc/software/apps/singularity/tensorflow_gpu.def </sharc/software/apps/singularity/tensorflow_gpu.def>`

Installation in Home Directory
------------------------------

The following instructions describe how to set up Tensorflow under your user account.

First request an interactive session, e.g. with :ref:`qrshx`. To use GPUs see :ref:`GPUInteractive_sharc`.

Load the relevant modules (our example uses CUDA 8.0 with cuDNN 5.1 but :ref:`other versions are available <cudnn_sharc>`) ::

module load apps/python/anaconda3-4.2.0
module load libs/cudnn/5.1/binary-cuda-8.0.44


Create a conda environment on your local user account and activate it: ::

conda create -n tensorflow python=3.5
source activate tensorflow

Then install tensorflow with the following commands ::

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0-cp35-cp35m-linux_x86_64.whl
pip install $TF_BINARY_URL

You can test that Tensorflow is running on the GPU with the following python code ::

import tensorflow as tf
# Creates a graph.
with tf.device('/gpu:0'):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

This gives the following result: ::

[[ 22. 28.]
[ 49. 64.]]
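
As a quick sanity check, the same multiplication can be reproduced in plain Python with no Tensorflow involved (the ``matmul`` helper below is hypothetical): ::

    # Reproduce the [2, 3] x [3, 2] matrix product from the example above
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

    def matmul(x, y):
        # Naive matrix multiplication over lists of lists
        return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
                 for j in range(len(y[0]))]
                for i in range(len(x))]

    print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]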

Every Session Afterwards and in Your Job Scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The previous instructions install Tensorflow and its dependencies inside your home directory, but in every new session, and in your job scripts, the modules must be loaded and conda must be activated again. Use the following commands to activate the conda environment with Tensorflow installed: ::

module load apps/python/anaconda3-4.2.0
module load libs/cudnn/5.1/binary-cuda-8.0.44
source activate tensorflow
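
In a job script this means including those lines before your Python command; a sketch (the GPU request directive and script name are assumptions): ::

    #!/bin/bash
    #$ -l gpu=1

    # Load the modules and activate the environment inside the job itself
    module load apps/python/anaconda3-4.2.0
    module load libs/cudnn/5.1/binary-cuda-8.0.44
    source activate tensorflow

    python your_tensorflow_script.py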

Using multiple GPUs
-------------------
Example taken from `tensorflow documentation <https://www.tensorflow.org/versions/r0.11/how_tos/using_gpu/index.html>`_.

If you would like to run TensorFlow on multiple GPUs, you can construct your model in a multi-tower fashion where each tower is assigned to a different GPU. For example: ::

import tensorflow as tf
# Creates a graph.
c = []
for d in ['/gpu:2', '/gpu:3']:
with tf.device(d):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))

You will see the following output. ::

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K20m, pci bus
id: 0000:02:00.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla K20m, pci bus
id: 0000:03:00.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: Tesla K20m, pci bus
id: 0000:83:00.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: Tesla K20m, pci bus
id: 0000:84:00.0
Const_3: /job:localhost/replica:0/task:0/gpu:3
Const_2: /job:localhost/replica:0/task:0/gpu:3
MatMul_1: /job:localhost/replica:0/task:0/gpu:3
Const_1: /job:localhost/replica:0/task:0/gpu:2
Const: /job:localhost/replica:0/task:0/gpu:2
MatMul: /job:localhost/replica:0/task:0/gpu:2
AddN: /job:localhost/replica:0/task:0/cpu:0
[[ 44. 56.]
[ 98. 128.]]
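
Note that the final matrix is exactly twice the single-GPU result, since ``tf.add_n`` sums the two identical tower outputs; a quick plain-Python check (hypothetical, no Tensorflow needed): ::

    # Each tower computes the same [2, 2] product; add_n sums the two copies
    tower = [[22.0, 28.0], [49.0, 64.0]]
    total = [[2 * v for v in row] for row in tower]
    print(total)  # [[44.0, 56.0], [98.0, 128.0]]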
