# Data analysis tool overview

> All information can be found at [the Ley Digital Universe](http://confluence.eb.local/display/LDU/Infrastructure)

> Make sure to read [Andre Noll's user-guide](http://ilm.eb.local/user-guide/)!

* compute resources
* projects
* conda environments
* jupyter notebooks
* data analysis with R
* git

# Computational resources

## Your computer

### Pros

* easy to use and set up

### Cons

* limited computational resources
* hard for others to access your files  

## The lab "server"

The lab "server" actually comprises multiple servers and virtual machines (VMs).
  
### File server

The file server just stores the files

Name: `lux`

Location: `/ebio/abt3_projects/`

### Submit host server: "ivy" 
   
* CPUs: 80
* Memory: 1TB
* Virtual machines (VMs) on ivy
  * "rick" = submit VM
  * "morty" = VM for Rstudio-server
  * "shiny-wetlab" = VM for Shiny apps 

### Compute cluster

* Many servers connected together
  * You submit jobs to the compute cluster from a "submit host"
  * Our submit host is the "rick" VM
* Sun Grid Engine (SGE)
  * Schedules resource usage for jobs
  * Jobs sit in a queue until the needed resources are available 

<img src="http://gridscheduler.sourceforge.net/howto/knoppix/grid-knoppix.jpg" width="400">
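
The submit-and-queue workflow described above can be sketched as follows. This is a minimal example, not the lab's official template: it assumes the standard SGE commands `qsub` and `qstat` are available on the submit host, and the job name and file names are hypothetical.

```shell
# Sketch: write a minimal SGE job script, then submit it if qsub is available.
job_script="example_job.sh"   # hypothetical file name

# Lines starting with "#$" are read by SGE as qsub options:
#   -N   job name (hypothetical)
#   -cwd run the job in the current working directory
#   -j y merge stderr into stdout
#   -o   log file (hypothetical name)
cat > "$job_script" <<'EOF'
#!/bin/bash
#$ -N example_job
#$ -cwd
#$ -j y
#$ -o example_job.log
echo "Running on $(hostname)"
EOF

if command -v qsub >/dev/null 2>&1; then
    # The job waits in the queue until the requested resources are free.
    qsub "$job_script"
    # List your own pending and running jobs.
    qstat -u "$USER"
else
    echo "qsub not found; run this on the submit host"
fi
```

Resource requests (memory, number of CPUs) are site-specific and are deliberately left out here; see the user-guide for the options that apply to this cluster.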

### Compute cluster

* Technical specs
  * \>20 nodes
  * ~2200 CPUs
  * Up to 2TB can be used for a single job (e.g., metagenome assembly)
  * For more information, see [Ganglia](http://ilm.eb.local/ganglia/)

# Projects


**From Andre Noll's user-guide:**

For most departments and groups, the available file storage is subdivided into storage projects. Each storage project has its dedicated folder, the size of which can not exceed a pre-defined limit.
New storage projects will only be created if the project has been acknowledged and documented. Documenting the storage requirements helps to

* coordinate the disk usage among the members of the department,
* keep the tape backup jobs in sync with the data on disk,
* estimate the medium-term demand of disk space, 
* aid the **unlucky fellow** who is responsible to decide what to do with your data after you have left.

## Long term archive

**From Andre Noll's user-guide:**

According to the Max Planck regulations, all data used in publications must be kept for **10 years**, in case somebody wants to verify the results or redo the analysis performed in the paper.

## Setting up a project

See [Project-specific environments](http://confluence.eb.local/display/LDU/Project-specific+environments) on Confluence

## Existing projects

`ls /ebio/abt3_projects/`

    amylase_oral_stool_microbiome/           methanogen_host_evo/
    Anxiety_Twins_Metagenomes/               Methanogen_SCFA/
    Bacteroides_sphingolipids_brainIHC/      microbiome_of_TLR5_KO_mice/
    BGI_metagenomes/                         NIH_HIV_Miami/
    Bifido_Lactose_tolerance/                Oral_Stool_Pilot_Processing/
    Christensenella_CM12/                    PGD_gut_microbiota/
    Christensenella_CM19/                    Polar_bear_gut_communities/
    Christensenella_Cornell/                 PseudoGenomes/
    Christensenella_Project/                 PseudoMetagenomes/
    Chronic_Fatigue_Syndrome_Microbiome/     pseudomonas_maize_genome/
    Cincinnati_bad_water/                    Psychrobacter_comparativ

### Project disk usage

`cat /ebio/abt3_projects/disk-usage`

    disk usage (unit: Gigabytes)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    inode usage (unit: 1000 files)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    project disk usage (unit: Gigabytes)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    databases_no-backup                       *********************** 8123    (18%)
    amylase_oral_stool_microbiome             ********** 3734                  (8%)
    databases                                 ********** 3536                  (8%)
    TwinsUK_viromes_Shao_Pei                  ********* 3303                   (7%)
    Christensenella_CM12                      ******** 3069                    (7%)
    TwinsUK_Bifido                            ******** 2833                    (6%)
    TwinsUK                                   ******* 2754                     (6%)
    Christensenella_Project                   ***** 1838                       (4%)
    methanogen_host_evo                       **** 1622                        (3%)
    Bifido_Lactose_tolerance                  **** 1471
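
The per-project lines of the report can be filtered with standard tools. The sketch below assumes every project line has the form `<name> <bar of asterisks> <gigabytes> [(percent)]`, as in the excerpt above; `top_projects` and the sample file name are hypothetical helpers, not part of the lab setup.

```shell
# Sketch: list the largest projects in a disk-usage report.
top_projects() {
    # $1 = path to the report, $2 = number of projects to show (default 5)
    # Keep only lines whose 2nd field is a bar of asterisks and whose
    # 3rd field is a number, then sort by size in descending order.
    awk '$2 ~ /^\*+$/ && $3 ~ /^[0-9]+$/ { print $3 "\t" $1 }' "$1" |
        sort -k1,1nr |
        head -n "${2:-5}"
}

# Small sample copied from the excerpt above, so the sketch runs anywhere.
cat > sample_disk_usage.txt <<'EOF'
project disk usage (unit: Gigabytes)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
amylase_oral_stool_microbiome             ********** 3734                  (8%)
databases_no-backup                       *********************** 8123    (18%)
Christensenella_CM12                      ******** 3069                    (7%)
EOF

top_projects sample_disk_usage.txt 2
```

On the lab servers the same function could be pointed at the real report, e.g. `top_projects /ebio/abt3_projects/disk-usage 10`, provided the file keeps this layout.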

## Project descriptions

Project descriptions are stored in [Andre's user-info git repo](http://ilm.eb.local/gitweb/?p=user-info;a=tree)

# conda environments


<img src="http://www.gurobi.com/images/logo-anaconda.png" width="600">

## What are conda environments?

Environments contain alternative installations of software (different software and/or software versions) that the user can easily switch between.


**Examples**

Example 1

> QIIME requires Python 2, but Software XXX is only compatible with Python 3. How do I use both?

Example 2

> The analysis for Project A used QIIME 1.9, but I want to use QIIME 2.0 for my new project. What if I need to redo the Project A analysis (e.g., in response to manuscript peer review)? Do I have to re-install QIIME 1.9? Will QIIME 2.0 produce different results?

Example 3

> I want to reproduce the results of a former lab member. The lab member provides code to redo the analysis, but it requires specific software versions. How do I install those specific versions easily? Moreover, I don't want to use those software versions for analyses on other projects.

## How do I use conda environments?

Examples:

* Activate a python2 environment
  * `source activate py2`
* Activate a python3 environment
  * `source activate py3`  
  
### Using conda environments with Jupyter notebooks

Each notebook file can use a different conda environment
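
One common way to make an environment selectable from a notebook is to register it as a Jupyter kernel through the `ipykernel` package. This is a sketch of that approach and not necessarily how the lab servers are configured; `register_kernel` and `run` are hypothetical helpers, and the commands are only printed unless `DRY_RUN=0` is set.

```shell
# Sketch: register a conda environment as a Jupyter kernel.
run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

register_kernel() {
    # $1 = name of an existing conda environment (e.g. py3)
    # Step 1: install ipykernel into the environment.
    # Step 2: register that environment's Python as a per-user kernel.
    run conda install --yes --name "$1" ipykernel &&
    run conda run --name "$1" python -m ipykernel install --user \
        --name "$1" --display-name "Python ($1)"
}

register_kernel py3
```

After registration, the environment should appear in the notebook's kernel menu under the chosen display name, so different notebooks can each select their own kernel.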

## How do I create environments?

### Creation

* Install conda 
* Add conda to your PATH
* Create a python 2 environment that includes `checkm`
  * `conda create -n py2_checkm python=2 bioconda::checkm-genome`
* Create a python 3 environment that includes `panX`
  * `conda create -n py3_panx python=3 bioconda::panx`
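
For reproducibility (see Example 3 above), the same environment can also be described in a file that is kept with the project. The sketch below writes such a file for the `py2_checkm` example; the file name is hypothetical, and listing `conda-forge` alongside `bioconda` is an assumption based on the usual Bioconda channel setup.

```shell
# Sketch: describe the py2_checkm environment in a YAML file.
env_file="py2_checkm.yml"   # hypothetical file name

cat > "$env_file" <<'EOF'
name: py2_checkm
channels:
  - conda-forge   # assumption: Bioconda packages usually rely on conda-forge
  - bioconda
dependencies:
  - python=2
  - checkm-genome
EOF

echo "Wrote $env_file; build the environment with:"
echo "  conda env create -f $env_file"
```

An existing environment can be captured the other way around with `conda env export -n py2_checkm > py2_checkm.yml`, which records the exact package versions that are installed.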

## More information

See [conda environment setup](http://confluence.eb.local/display/LDU/conda+environment+setup) on Confluence

# Next: Jupyter notebooks

[notebook basics](./notebook_basics/01_resources.ipynb)