Merge pull request #6 from twinkarma/master
Updated docs for pi and project managers
twinkarma committed Oct 9, 2017
2 parents 80d568b + b68f25f commit e7691d6
Showing 8 changed files with 165 additions and 250 deletions.
10 changes: 5 additions & 5 deletions README.rst
@@ -4,10 +4,10 @@
.. image:: https://readthedocs.org/projects/jade-hpc/badge/?version=latest
:target: http://jade-hpc.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status

JADE HPC Facility Documentation
===============================
This is the source code for the JADE HPC facility user guide. It is written in reStructuredText (rst) format. For a guide to the rst file format see `this <http://thomas-cokelaer.info/tutorials/sphinx/rest_syntax.html>`_ document.


How to Contribute
@@ -26,7 +26,7 @@ Once you have made your changes and updated your Fork on GitHub you will need to
Building the documentation
##########################

#. Install Python on your machine

#. Install sphinx: ::

@@ -47,7 +47,7 @@ Building the documentation
#. To build the HTML documentation run: ::

make html

Or if you don't have the ``make`` utility installed on your machine then build with *sphinx* directly: ::

sphinx-build . ./html
@@ -71,7 +71,7 @@ The application also serves up the site at port ``8000`` by default at http://lo
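The live-reload server referred to above is typically ``sphinx-autobuild`` (an assumption -- check the repository's requirements for the exact tool); a minimal sketch of installing and running it: ::

    # Install the live-reload server (assumed dependency)
    pip install sphinx-autobuild

    # Rebuild on change and serve the site, by default at http://localhost:8000
    sphinx-autobuild . ./html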
Making Changes to the Documentation
-----------------------------------

The documentation consists of a series of `reStructured Text <http://sphinx-doc.org/rest.html>`_ files which have the ``.rst`` extension. These files are automatically converted to HTML and combined into the web version of the documentation by Sphinx. It is important that the rst syntax is followed when editing the files.


If there are any errors in your changes the build will fail and the documentation will not update. You can test your build locally by running ``make html``. The easiest way to learn what the files should look like is to read the ``rst`` files already in the repository.
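For reference, a minimal sketch of some common ``rst`` constructs (the section name, link target and command shown are placeholders, not content from this repository): ::

    A Section Heading
    =================

    Some introductory text with a link to `Sphinx <http://sphinx-doc.org>`_.

    * a bullet point
    * another bullet point

    A literal block containing a command::

        make html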
43 changes: 40 additions & 3 deletions index.rst
@@ -6,16 +6,53 @@ JADE-HPC Facility User guide
:align: right
:alt: A picture showing some of the Jade hardware.

The Joint Academic Data Science Endeavour (JADE) is the largest GPU facility in the UK supporting world-leading research in machine learning. The computational hub will harness the capabilities of the NVIDIA DGX-1 Deep Learning System and comprise 22 servers, each containing 8 of the newest NVIDIA Tesla P100 GPUs linked by NVIDIA's NVLink interconnect technology. The new JADE facility aims to address the gap between university systems and access to national HPC services. This will drive forward innovation in machine learning, identifying new applications and insights into research challenges.
This is the documentation for the Joint Academic Data science Endeavour
(JADE) facility.

This is the documentation for the Joint Academic Data Science Endeavour (JADE) High Performance Computing (HPC) facility.
Run by the Research Computing Group, with additional support from the Research Software Engineering team in Computer Science, it supports the computational needs of hundreds of researchers across all departments.
JADE is a UK Tier-2 resource, funded by EPSRC, owned by the University
of Oxford and hosted at the Hartree Centre. The hardware was supplied
and integrated by ATOS Bull.

A consortium of eight UK universities, led by the University of Oxford,
has been awarded £3 million by the Engineering and Physical Sciences
Research Council (EPSRC) to establish a new computing facility known as
the Joint Academic Data science Endeavour (JADE). This forms part of a
combined investment of £20m by EPSRC in the UK’s regional Tier 2
high-performance computing facilities, which aim to bridge the gap
between institutional and national resources.

JADE is unique amongst the Tier 2 centres in being designed for the
needs of machine learning and related data science applications. There
has been huge growth in machine learning in the last 5 years, and this
is the first national facility to support this rapid development, with
the university partners including the world-leading machine learning
groups in Oxford, Edinburgh, KCL, QMUL, Sheffield and UCL.

The system design exploits the capabilities of NVIDIA's DGX-1 Deep
Learning System which has eight of its newest Tesla P100 GPUs tightly
coupled by its high-speed NVLink interconnect. NVIDIA has clearly
established itself as the leader in massively-parallel computing for
deep neural networks, and the DGX-1 runs optimized versions of many
standard machine learning software packages such as Caffe, TensorFlow,
Theano and Torch.

This system design is also ideal for a large number of molecular
dynamics applications and so JADE will also provide a powerful resource
for molecular dynamics researchers at Bristol, Edinburgh, Oxford and
Southampton.

JADE Hardware
=============

JADE hardware consists of:
* 22 DGX-1 Nodes, each with 8 NVIDIA Tesla P100 GPUs
* 2 Head nodes

.. toctree::
:maxdepth: -1
:hidden:


jade/index
software/index
more_info
8 changes: 4 additions & 4 deletions jade/getting-account.rst
@@ -28,12 +28,12 @@ http://community.hartree.stfc.ac.uk/wiki/site/admin/safe%20user%20guide.html

Once your SAFE account is established, login to it and click on "Request Join Project".

From the drop-down list select the appropriate project, enter the signup code which you should have been given by the project PI or manager, and then click "Request".
From the drop-down list select the appropriate **project**, enter the **signup code** which you should have been given by the project PI or manager, and then click "Request".

The approval process goes through several steps:
a) approval by the PI or project manager -- once this is done the SAFE status changes to Pending
b) initial account setup -- once this is done the SAFE status changes to Active
c) completion of account setup -- once this is done you will get an email confirming you are all set, and your SAFE account will have full details on your new project account

This process shouldn't take more than 2 working days. If it takes longer than that, check whether the PI or project manager is aware that you have applied, as your application needs their approval through the SAFE system.

1 change: 1 addition & 0 deletions jade/index.rst
@@ -20,3 +20,4 @@ If you have not used a High Performance Computing (HPC) cluster, the Linux opera
scheduler/index
modules
containers
pi-projectmanager
87 changes: 87 additions & 0 deletions jade/pi-projectmanager.rst
@@ -0,0 +1,87 @@
.. _pi-projectmanager:

Information for PIs and Project Managers
========================================

Creating a Project
------------------

Access to the HPC systems at the Hartree Centre is available to groups of academic and industry partners. The PI (principal investigator) should complete a Project Proposal form in consultation with help desk staff and submit it for approval. Please contact `hartree@stfc.ac.uk <mailto:hartree@stfc.ac.uk>`_ for a proposal form.

Getting an Account
------------------

After the proposal has been approved by the Help Desk, a `SAFE <https://um.hartree.stfc.ac.uk>`_ account must be created which allows the management of **users** and **projects**. At registration, the PI provides an institutional email address which will be used as their SAFE login ID (please note, email addresses such as gmail or hotmail will not be accepted). An SSH key is also required at this stage, and instructions on how to generate and upload it are provided in the `SAFE User Guide <http://community.hartree.stfc.ac.uk/wiki/site/admin/home.html>`_.
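As a minimal sketch (the key type, size and email address below are placeholders -- follow the `SAFE User Guide <http://community.hartree.stfc.ac.uk/wiki/site/admin/home.html>`_ for the exact requirements and upload steps), a key pair can be generated on Linux or macOS with: ::

    # Generate a new key pair; accept the default file location or choose your own
    ssh-keygen -t rsa -b 4096 -C "your.name@your-institution.ac.uk"

    # Display the public key, which is the part uploaded to SAFE
    cat ~/.ssh/id_rsa.pub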

Once a project has been set up by Hartree staff, the PI can designate associated project managers. The PI and project managers are then able to define groups (in a tree hierarchy), accept people into the project and allocate them to groups.

Projects and Groups on JADE
---------------------------

At the top level there are **projects** with a PI, and within a **project** there are **groups** with resources such as GPU time and disk space. A **group must be created** in order to accept normal users into the project.

The simplest approach is to create a single group for all users in the Project, but additional groups can be created if necessary.

**To create a group:**

1. From the main portal of `SAFE <https://um.hartree.stfc.ac.uk>`_, click the `Administer` button next to the name of the project that you manage.
2. Click the `Project Group Administration` button, then click `Add New`.
3. Type in a name and description and click `Create`.

Project Signup Code
-------------------

The **project signup code is required** so that users can request to join your project.

**To set a project signup code:**

1. From the main portal of `SAFE <https://um.hartree.stfc.ac.uk>`_, click the `Administer` button next to the name of the project that you manage.
2. Click the `Update` button.
3. In the **Password** field, set the desired **project signup code** and press `Update`.

Adding Users to the Project
---------------------------

After creating a group, users can be added to a project.

**To add users:**

1. Provide your users with the **project name**, **project signup code** (see above section) and the signup instructions for regular users at:
`http://jade-hpc.readthedocs.io/en/latest/jade/getting-account.html <http://jade-hpc.readthedocs.io/en/latest/jade/getting-account.html>`_
2. Once a user has requested to join a project, there will be a "New Project Management Requests" box. Click the `Process` button and `Accept` (or `Reject`) the new member.
3. The PI can now add the member to a **group** previously created.
4. Click on `Administer` for the Project then choose `Project Group Administration`.
5. Click on the `Group name`, then on `Add Account`, and then select any available user from a list (including the PI).
6. There is now a manual process, which may take up to 24 hours, after which the new Project member will be notified of their new userid and invited to log on to the Hartree systems for the first time.
7. The new project member will have an `Active` status in the group once the process is completed.

.. note::
    The PI does NOT have to 'Request Join Project' as they are automatically a Project member. They must, however, add themselves to a Group.


Project groups and shared file system area
------------------------------------------


Each Project group maps to a Unix group, and each Project group member's home directory is set up under a group directory structure. The example below starts from a user's home directory and shows that all other members of the Project group are assigned home directories as "peer" directories.
::
-bash-4.1$ pwd
/gpfs/home/training/jpf03/jpf26-jpf03
-bash-4.1$ cd ..
-bash-4.1$ ls
afg27-jpf03 bbl28-jpf03 cxe72-jpf03 dxd46-jpf03 hvs09-jpf03 jjb63-jpf03 jxm09-jpf03 mkk76-jpf03 phw57-jpf03 rrr25-jpf03 rxw47-jpf03 sxl18-jpf03
ajd95-jpf03 bwm51-jpf03 cxl10-jpf03 dxp21-jpf03 hxo76-jpf03 jkj47-jpf03 kxm85-jpf03 mxm86-jpf03 pxj86-jpf03 rrs70-jpf03 sca58-jpf03 tcn16-jpf03
axa59-jpf03 bxp59-jpf03 djc87-jpf03 fxb73-jpf03 ivk29-jpf03 jpf26-jpf03 lim17-jpf03 nxt14-jpf03 rja87-jpf03 rwt21-jpf03 shared txc61-jpf03
axw52-jpf03 bxv09-jpf03 dwn60-jpf03 gxx38-jpf03 jds89-jpf03 jrh19-jpf03 ltc84-jpf03 pag51-jpf03 rjb98-jpf03 rxl87-jpf03 sls56-jpf03 vvt17-jpf03

Please note that for each Project group there is a "shared" directory, which can be reached at ::

../shared

from each user's home directory. Every member of the Project group is able to read and write to this shared directory, so it can be used for common files and applications for the Project.
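For example, a file can be made available to the whole group by copying it into the shared area (the filename below is illustrative): ::

    -bash-4.1$ cp my_results.tar.gz ../shared/
    -bash-4.1$ ls ../shared
    my_results.tar.gz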


Once a Project has Finished
---------------------------

It is Hartree Centre policy that, after the agreed date of completion of a Project, all data will be made read-only and will then remain retrievable for 3 months. During this period, users are able to login to retrieve their data, but will be unable to run jobs. After 3 months have elapsed, all login access associated with the Project will be terminated, and all data owned by the Project will be deleted.
101 changes: 12 additions & 89 deletions software/apps/tensorflow.rst
@@ -9,93 +9,15 @@ Tensorflow

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

About Tensorflow on ShARC
-------------------------
Tensorflow Docker Container
---------------------------

**A GPU-enabled worker node must be requested in order to use the GPU version of this software. See** :ref:`GPUComputing_sharc` **for more information.**
Tensorflow is available on JADE through the use of a `Docker container <https://docker.com>`_. For more information on JADE's use of containers, see :ref:`containers`.
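As a purely illustrative sketch (the image name, tag and bind mount are assumptions rather than the confirmed JADE workflow -- see :ref:`containers` for the exact commands), running a script inside an NVIDIA TensorFlow container might look like: ::

    # Run a TensorFlow script inside a GPU-enabled container,
    # mounting the current directory so the script and data are visible
    nvidia-docker run --rm -v $(pwd):/workspace \
        nvcr.io/nvidia/tensorflow:17.10 python /workspace/my_tensorflow_script.py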

Tensorflow is available on ShARC as both Singularity images and by local installation.

As Tensorflow and all its dependencies are written in Python, it can be installed locally in your home directory. The use of Anaconda (:ref:`sharc-python-conda`) is recommended as it is able to create a virtual environment in your home directory, allowing the installation of new Python packages without admin permission.

This software and documentation is maintained by the `RSES group <http://rse.shef.ac.uk/>`_ and `GPUComputing@Sheffield <http://gpucomputing.shef.ac.uk/>`_. For feature requests or if you encounter any problems, please raise an issue on the `GPU Computing repository <https://github.com/RSE-Sheffield/GPUComputing/issues>`_.

Tensorflow Singularity Images
-----------------------------

Singularity images are self-contained virtual machines similar to Docker. For more information on Singularity and how to use the images, see :ref:`singularity_sharc`.

A symlinked file is provided that always points to the latest image: ::

#CPU Tensorflow
/usr/local/packages/singularity/images/tensorflow/cpu.img

#GPU Tensorflow
/usr/local/packages/singularity/images/tensorflow/gpu.img

To get a bash terminal in to an image for example, use the command: ::

singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img /bin/bash

The ``exec`` command can also be used to call any command/script inside the image e.g. ::

singularity exec /usr/local/packages/singularity/images/tensorflow/gpu.img python your_tensorflow_script.py

You may get a warning similar to ``groups: cannot find name for group ID ...``; this can be ignored and will not have an effect on running the image.

The paths ``/fastdata``, ``/data``, ``/home``, ``/scratch``, ``/shared`` are automatically mounted to your ShARC filestore directories. For GPU-enabled images, ``/nvlib`` and ``/nvbin`` are mounted to the correct Nvidia driver version for the node that you're using.

Tensorflow is installed as part of Anaconda and can be found inside the image at: ::

/usr/local/anaconda3-4.2.0/lib/python3.5/site-packages/tensorflow


**To submit jobs that uses a Singularity image, see** :ref:`use_image_batch_singularity_sharc` **for more detail.**

Image Index
^^^^^^^^^^^

Paths to the actual images and definition files are provided below for downloading and building of custom images.

* Shortcut to Latest Image
* CPU
* ``/usr/local/packages/singularity/images/tensorflow/cpu.img``
* GPU
* ``/usr/local/packages/singularity/images/tensorflow/gpu.img``
* CPU Images
* Latest: 1.0.1-CPU-Ubuntu16.04-Anaconda3.4.2.0.img (GCC 5.4.0, Python 3.5)
* Path: ``/usr/local/packages/singularity/images/tensorflow/1.0.1-CPU-Ubuntu16.04-Anaconda3.4.2.0.img``
* Def file: :download:`/sharc/software/apps/singularity/tensorflow_cpu.def </sharc/software/apps/singularity/tensorflow_cpu.def>`
* GPU Images
* Latest: 1.0.1-GPU-Ubuntu16.04-Anaconda3.4.2.0-CUDA8-cudNN5.0.img (GCC 5.4.0, Python 3.5)
* Path: ``/usr/local/packages/singularity/images/tensorflow/1.0.1-GPU-Ubuntu16.04-Anaconda3.4.2.0-CUDA8-cudNN5.0.img``
* Def file: :download:`/sharc/software/apps/singularity/tensorflow_gpu.def </sharc/software/apps/singularity/tensorflow_gpu.def>`

Installation in Home Directory
Using Tensorflow Interactively
------------------------------

The following are instructions on how to set up Tensorflow on your user account.

First request an interactive session, e.g. with :ref:`qrshx`. To use GPUs see :ref:`GPUInteractive_sharc`.

Load the relevant modules (our example uses CUDA 8.0 with cuDNN 5.1 but :ref:`other versions are available <cudnn_sharc>`) ::

module load apps/python/anaconda3-4.2.0
module load libs/cudnn/5.1/binary-cuda-8.0.44


Create a conda environment to load relevant modules on your local user account and activate it ::

conda create -n tensorflow python=3.5
source activate tensorflow

Then install tensorflow with the following commands ::

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0-cp35-cp35m-linux_x86_64.whl
pip install $TF_BINARY_URL
.. TODO add info about using interactively
You can test that Tensorflow is running on the GPU with the following python code ::

@@ -115,17 +37,18 @@ Which gives the following results ::
[[ 22. 28.]
[ 49. 64.]]

Every Session Afterwards and in Your Job Scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using Tensorflow in Batch Mode
------------------------------





The previous instructions install Tensorflow and its dependencies inside your home directory, but every time you start a new session or run a job script the modules must be loaded and conda must be activated again. Use the following command to activate the Conda environment with Tensorflow installed: ::

module load apps/python/anaconda3-4.2.0
module load libs/cudnn/5.1/binary-cuda-8.0.44
source activate tensorflow

Using multiple GPUs
-------------------

Example taken from `tensorflow documentation <https://www.tensorflow.org/versions/r0.11/how_tos/using_gpu/index.html>`_.

If you would like to run TensorFlow on multiple GPUs, you can construct your model in a multi-tower fashion where each tower is assigned to a different GPU. For example: ::
