Merge pull request #6 from twinkarma/master
Updated docs for pi and project managers
twinkarma committed Oct 9, 2017
2 parents 80d568b + b68f25f commit e7691d6
Showing 8 changed files with 165 additions and 250 deletions.
10 changes: 5 additions & 5 deletions README.rst
@@ -4,10 +4,10 @@
.. image:: https://readthedocs.org/projects/jade-hpc/badge/?version=latest
:target: http://jade-hpc.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status

JADE HPC Facility Documentation
===============================
This is the source code for the JADE HPC facility user guide. It is written in reStructuredText (rst) format. For a guide to the rst file format see `this <http://thomas-cokelaer.info/tutorials/sphinx/rest_syntax.html>`_ document.


How to Contribute
@@ -26,7 +26,7 @@ Once you have made your changes and updated your Fork on GitHub you will need to
Building the documentation
##########################

#. Install Python on your machine

#. Install sphinx: ::

@@ -47,7 +47,7 @@ Building the documentation
#. To build the HTML documentation run: ::

make html

Or if you don't have the ``make`` utility installed on your machine then build with *sphinx* directly: ::

sphinx-build . ./html
@@ -71,7 +71,7 @@ The application also serves up the site at port ``8000`` by default at http://lo
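The live-reload server referred to above is typically ``sphinx-autobuild`` (an assumption -- check the repository's requirements for the exact tool); a minimal sketch of installing and running it: ::

    # Install the live-reload server (assumed dependency)
    pip install sphinx-autobuild

    # Rebuild on change and serve the site, by default at http://localhost:8000
    sphinx-autobuild . ./html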
Making Changes to the Documentation
-----------------------------------

The documentation consists of a series of `reStructured Text <http://sphinx-doc.org/rest.html>`_ files which have the ``.rst`` extension. These files are automatically converted to HTML and combined into the web version of the documentation by Sphinx. It is important that the rst syntax is followed when editing the files.


If there are any errors in your changes the build will fail and the documentation will not update. You can test your build locally by running ``make html``. The easiest way to learn what the files should look like is to read the ``rst`` files already in the repository.
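For reference, a minimal sketch of some common ``rst`` constructs (the section name, link target and command shown are placeholders, not content from this repository): ::

    A Section Heading
    =================

    Some introductory text with a link to `Sphinx <http://sphinx-doc.org>`_.

    * a bullet point
    * another bullet point

    A literal block containing a command::

        make html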
43 changes: 40 additions & 3 deletions index.rst
@@ -6,16 +6,53 @@ JADE-HPC Facility User guide
:align: right
:alt: A picture showing some of the Jade hardware.

The Joint Academic Data Science Endeavour (JADE) is the largest GPU facility in the UK supporting world-leading research in machine learning. The computational hub will harness the capabilities of the NVIDIA DGX-1 Deep Learning System and comprise 22 servers, each containing 8 of the newest NVIDIA Tesla P100 GPUs linked by NVIDIA's NVLink interconnect technology. The new JADE facility aims to address the gap between university systems and access to national HPC services. This will drive forward innovation in machine learning, identifying new applications and insights into research challenges.
This is the documentation for the Joint Academic Data science Endeavour
(JADE) facility.

This is the documentation for the Joint Academic Data Science Endeavour (JADE) High Performance Computing (HPC) facility.
Run by the Research Computing Group, with additional support from the Research Software Engineering team in Computer Science, it supports the computational needs of hundreds of researchers across all departments.
JADE is a UK Tier-2 resource, funded by EPSRC, owned by the University
of Oxford and hosted at the Hartree Centre. The hardware was supplied
and integrated by ATOS Bull.

A consortium of eight UK universities, led by the University of Oxford,
has been awarded £3 million by the Engineering and Physical Sciences
Research Council (EPSRC) to establish a new computing facility known as
the Joint Academic Data science Endeavour (JADE). This forms part of a
combined investment of £20m by EPSRC in the UK’s regional Tier 2
high-performance computing facilities, which aim to bridge the gap
between institutional and national resources.

JADE is unique amongst the Tier 2 centres in being designed for the
needs of machine learning and related data science applications. There
has been huge growth in machine learning in the last 5 years, and this
is the first national facility to support this rapid development, with
the university partners including the world-leading machine learning
groups in Oxford, Edinburgh, KCL, QMUL, Sheffield and UCL.

The system design exploits the capabilities of NVIDIA's DGX-1 Deep
Learning System which has eight of its newest Tesla P100 GPUs tightly
coupled by its high-speed NVLink interconnect. NVIDIA has clearly
established itself as the leader in massively-parallel computing for
deep neural networks, and the DGX-1 runs optimized versions of many
standard machine learning software packages such as Caffe, TensorFlow,
Theano and Torch.

This system design is also ideal for a large number of molecular
dynamics applications and so JADE will also provide a powerful resource
for molecular dynamics researchers at Bristol, Edinburgh, Oxford and
Southampton.

JADE Hardware
=============

JADE hardware consists of:
* 22 DGX-1 Nodes, each with 8 NVIDIA Tesla P100 GPUs
* 2 Head nodes

.. toctree::
:maxdepth: -1
:hidden:


jade/index
software/index
more_info
8 changes: 4 additions & 4 deletions jade/getting-account.rst
@@ -28,12 +28,12 @@ http://community.hartree.stfc.ac.uk/wiki/site/admin/safe%20user%20guide.html

Once your SAFE account is established, login to it and click on "Request Join Project".

From the drop-down list select the appropriate project, enter the signup code which you should have been given by the project PI or manager, and then click "Request".
From the drop-down list select the appropriate **project**, enter the **signup code** which you should have been given by the project PI or manager, and then click "Request".

The approval process goes through several steps:
a) approval by the PI or project manager -- once this is done the SAFE status changes to Pending
b) initial account setup -- once this is done the SAFE status changes to Active
c) completion of account setup -- once this is done you will get an email confirming you are all set, and your SAFE account will have full details on your new project account

This process shouldn't take more than 2 working days. If it takes longer than that, check whether the PI or project manager is aware that you have applied, as your application needs their approval through the SAFE system.

1 change: 1 addition & 0 deletions jade/index.rst
@@ -20,3 +20,4 @@ If you have not used a High Performance Computing (HPC) cluster, the Linux opera
scheduler/index
modules
containers
pi-projectmanager
87 changes: 87 additions & 0 deletions jade/pi-projectmanager.rst
@@ -0,0 +1,87 @@
.. _pi-projectmanager:

Information for PIs and Project Managers
========================================

Creating a Project
------------------

Access to the HPC systems at the Hartree Centre is available to groups of academic and industry partners. The PI (principal investigator) should complete a Project Proposal form in consultation with help desk staff and submit it for approval. Please contact `hartree@stfc.ac.uk <mailto:hartree@stfc.ac.uk>`_ for a proposal form.

Getting an Account
------------------

After the proposal has been approved by the Help Desk, a `SAFE <https://um.hartree.stfc.ac.uk>`_ account must be created which allows the management of **users** and **projects**. At registration, the PI provides an institutional email address which will be used as their SAFE login ID (please note, email addresses such as gmail or hotmail will not be accepted). An SSH key is also required at this stage, and instructions on how to generate and upload it are provided in the `SAFE User Guide <http://community.hartree.stfc.ac.uk/wiki/site/admin/home.html>`_.
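As a minimal sketch (the key type, size and email address below are placeholders -- follow the `SAFE User Guide <http://community.hartree.stfc.ac.uk/wiki/site/admin/home.html>`_ for the exact requirements and upload steps), a key pair can be generated on Linux or macOS with: ::

    # Generate a new key pair; accept the default file location or choose your own
    ssh-keygen -t rsa -b 4096 -C "your.name@your-institution.ac.uk"

    # Display the public key, which is the part uploaded to SAFE
    cat ~/.ssh/id_rsa.pub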

Once a project has been set up by Hartree staff, the PI can designate associated project managers. The PI and project managers are then able to define groups (in a tree hierarchy), accept people into the project and allocate them to groups.

Projects and Groups on JADE
---------------------------

At the top level there are **projects** with a PI, and within a **project** there are **groups** with resources such as GPU time and disk space. A **group must be created** in order to accept normal users into the project.

The simplest approach is to create a single group for all users in the Project, but additional groups can be created if necessary.

**To create a group:**

1. From the main portal of `SAFE <https://um.hartree.stfc.ac.uk>`_, click the `Administer` button next to the name of the project that you manage.
2. Click the `Project Group Administration` button, then click `Add New`.
3. Type in a name and description and click `Create`.

Project Signup Code
-------------------

The **project signup code is required** so that users can request to join your project.

**To set a project signup code:**

1. From the main portal of `SAFE <https://um.hartree.stfc.ac.uk>`_, click the `Administer` button next to the name of the project that you manage.
2. Click the `Update` button.
3. In the **Password** field, set the desired **project signup code** and press `Update`.

Adding Users to the Project
---------------------------

After creating a group, users can be added to a project.

**To add users:**

1. Provide your users with the **project name**, **project signup code** (see above section) and the signup instructions for regular users at:
`http://jade-hpc.readthedocs.io/en/latest/jade/getting-account.html <http://jade-hpc.readthedocs.io/en/latest/jade/getting-account.html>`_
2. Once a user has requested to join a project, there will be a "New Project Management Requests" box. Click the `Process` button and `Accept` (or `Reject`) the new member.
3. The PI can now add the member to a **group** previously created.
4. Click on `Administer` for the Project then choose `Project Group Administration`.
5. Click on the `Group name`, then on `Add Account`, and then select any available user from a list (including the PI).
6. There is now a manual process, which may take up to 24 hours, after which the new Project member will be notified of their new userid and invited to log on to the Hartree systems for the first time.
7. The new project member will have an `Active` status in the group once the process is completed.

.. note::
    The PI does NOT have to 'Request Join Project' as they are automatically a Project member. They must, however, add themselves to a Group.


Project groups and shared file system area
------------------------------------------


Each Project group maps to a Unix group, and each Project group member's home directory is set up under a group directory structure. The example below starts from a user's home directory and shows that all other members of the Project group are assigned home directories as "peer" directories.
::
-bash-4.1$ pwd
/gpfs/home/training/jpf03/jpf26-jpf03
-bash-4.1$ cd ..
-bash-4.1$ ls
afg27-jpf03 bbl28-jpf03 cxe72-jpf03 dxd46-jpf03 hvs09-jpf03 jjb63-jpf03 jxm09-jpf03 mkk76-jpf03 phw57-jpf03 rrr25-jpf03 rxw47-jpf03 sxl18-jpf03
ajd95-jpf03 bwm51-jpf03 cxl10-jpf03 dxp21-jpf03 hxo76-jpf03 jkj47-jpf03 kxm85-jpf03 mxm86-jpf03 pxj86-jpf03 rrs70-jpf03 sca58-jpf03 tcn16-jpf03
axa59-jpf03 bxp59-jpf03 djc87-jpf03 fxb73-jpf03 ivk29-jpf03 jpf26-jpf03 lim17-jpf03 nxt14-jpf03 rja87-jpf03 rwt21-jpf03 shared txc61-jpf03
axw52-jpf03 bxv09-jpf03 dwn60-jpf03 gxx38-jpf03 jds89-jpf03 jrh19-jpf03 ltc84-jpf03 pag51-jpf03 rjb98-jpf03 rxl87-jpf03 sls56-jpf03 vvt17-jpf03

Please note that for each Project group there is a "shared" directory, which can be reached at ::

../shared

from each user's home directory. Every member of the Project group is able to read and write to this shared directory, so it can be used for common files and applications for the Project.
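For example, a file can be made available to the whole group by copying it into the shared area (the filename below is illustrative): ::

    -bash-4.1$ cp my_results.tar.gz ../shared/
    -bash-4.1$ ls ../shared
    my_results.tar.gz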


Once a Project has Finished
---------------------------

It is Hartree Centre policy that, after the agreed date of completion of a Project, all data will be made read-only and will then remain retrievable for 3 months. During this period, users are able to login to retrieve their data, but will be unable to run jobs. After 3 months have elapsed, all login access associated with the Project will be terminated, and all data owned by the Project will be deleted.
101 changes: 12 additions & 89 deletions software/apps/tensorflow.rst
@@ -9,93 +9,15 @@ Tensorflow

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

About Tensorflow on ShARC
-------------------------
Tensorflow Docker Container
---------------------------

**A GPU-enabled worker node must be requested in order to use the GPU version of this software. See** :ref:`GPUComputing_sharc` **for more information.**
Tensorflow is available on JADE through the use of a `Docker container <https://docker.com>`_. For more information on JADE's use of containers, see :ref:`containers`.
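As a purely illustrative sketch (the image name, tag and bind mount are assumptions rather than the confirmed JADE workflow -- see :ref:`containers` for the exact commands), running a script inside an NVIDIA TensorFlow container might look like: ::

    # Run a TensorFlow script inside a GPU-enabled container,
    # mounting the current directory so the script and data are visible
    nvidia-docker run --rm -v $(pwd):/workspace \
        nvcr.io/nvidia/tensorflow:17.10 python /workspace/my_tensorflow_script.py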

Tensorflow is available on ShARC as both Singularity images and by local installation.

As Tensorflow and all its dependencies are written in Python, it can be installed locally in your home directory. The use of Anaconda (:ref:`sharc-python-conda`) is recommended as it is able to create a virtual environment in your home directory, allowing the installation of new Python packages without admin permission.

This software and documentation is maintained by the `RSES group <http://rse.shef.ac.uk/>`_ and `GPUComputing@Sheffield <http://gpucomputing.shef.ac.uk/>`_. For feature requests or if you encounter any problems, please raise an issue on the `GPU Computing repository <https://github.com/RSE-Sheffield/GPUComputing/issues>`_.

Tensorflow Singularity Images
-----------------------------

Singularity images are self-contained virtual machines similar to Docker. For more information on Singularity and how to use the images, see :ref:`singularity_sharc`.

A symlinked file is provided that always points to the latest image: ::

#CPU Tensorflow
/usr/local/packages/singularity/images/tensorflow/cpu.img

#GPU Tensorflow
/usr/local/packages/singularity/images/tensorflow/gpu.img

To get a bash terminal in to an image for example, use the command: ::

singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img /bin/bash

The ``exec`` command can also be used to call any command/script inside the image e.g. ::

singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img python your_tensorflow_script.py

You may get a warning similar to ``groups: cannot find name for group ID ...``; this can be ignored and will not have an effect on running the image.

The paths ``/fastdata``, ``/data``, ``/home``, ``/scratch``, ``/shared`` are automatically mounted to your ShARC filestore directories. For GPU-enabled images, ``/nvlib`` and ``/nvbin`` are mounted to the correct Nvidia driver version for the node that you're using.

Tensorflow is installed as part of Anaconda and can be found inside the image at: ::

/usr/local/anaconda3-4.2.0/lib/python3.5/site-packages/tensorflow


**To submit jobs that uses a Singularity image, see** :ref:`use_image_batch_singularity_sharc` **for more detail.**

Image Index
^^^^^^^^^^^

Paths to the actual images and definition files are provided below for downloading and building of custom images.

* Shortcut to Latest Image
* CPU
* ``/usr/local/packages/singularity/images/tensorflow/cpu.img``
* GPU
* ``/usr/local/packages/singularity/images/tensorflow/gpu.img``
* CPU Images
* Latest: 1.0.1-CPU-Ubuntu16.04-Anaconda3.4.2.0.img (GCC 5.4.0, Python 3.5)
* Path: ``/usr/local/packages/singularity/images/tensorflow/1.0.1-CPU-Ubuntu16.04-Anaconda3.4.2.0.img``
* Def file: :download:`/sharc/software/apps/singularity/tensorflow_cpu.def </sharc/software/apps/singularity/tensorflow_cpu.def>`
* GPU Images
* Latest: 1.0.1-GPU-Ubuntu16.04-Anaconda3.4.2.0-CUDA8-cudNN5.0.img (GCC 5.4.0, Python 3.5)
* Path: ``/usr/local/packages/singularity/images/tensorflow/1.0.1-GPU-Ubuntu16.04-Anaconda3.4.2.0-CUDA8-cudNN5.0.img``
* Def file: :download:`/sharc/software/apps/singularity/tensorflow_gpu.def </sharc/software/apps/singularity/tensorflow_gpu.def>`

Installation in Home Directory
Using Tensorflow Interactively
------------------------------

The following are instructions on how to set up Tensorflow on your user account.

First request an interactive session, e.g. with :ref:`qrshx`. To use GPUs see :ref:`GPUInteractive_sharc`.

Load the relevant modules (our example uses CUDA 8.0 with cuDNN 5.1 but :ref:`other versions are available <cudnn_sharc>`) ::

module load apps/python/anaconda3-4.2.0
module load libs/cudnn/5.1/binary-cuda-8.0.44


Create a conda environment to load relevant modules on your local user account and activate it ::

conda create -n tensorflow python=3.5
source activate tensorflow

Then install tensorflow with the following commands ::

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0-cp35-cp35m-linux_x86_64.whl
pip install $TF_BINARY_URL
.. TODO add info about using interactively
You can test that Tensorflow is running on the GPU with the following python code ::

@@ -115,17 +37,18 @@ Which gives the following results ::
[[ 22. 28.]
[ 49. 64.]]

Every Session Afterwards and in Your Job Scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using Tensorflow in Batch Mode
------------------------------





The previous instructions install Tensorflow and its dependencies inside your home directory, but every time you start a new session or run a job script the modules must be loaded and conda must be activated again. Use the following command to activate the Conda environment with Tensorflow installed: ::

module load apps/python/anaconda3-4.2.0
module load libs/cudnn/5.1/binary-cuda-8.0.44
source activate tensorflow

Using multiple GPUs
-------------------

Example taken from `tensorflow documentation <https://www.tensorflow.org/versions/r0.11/how_tos/using_gpu/index.html>`_.

If you would like to run TensorFlow on multiple GPUs, you can construct your model in a multi-tower fashion where each tower is assigned to a different GPU. For example: ::
