## Increasing re-usability/executability and reproducibility through virtualization
### A walk-through example from virtual environments to Docker and Singularity

*Please note that this is a `Jupyter Notebook` using the [bash kernel](https://github.com/takluyver/bash_kernel) to demonstrate the content of this tutorial. The outputs and the way they appear will differ when running the commands in a `shell`, and will depend on the OS.*

Imagine you want to run a DTI analysis on a dataset you recently acquired or re-analyze an open dataset (more on that in the next few days). One amazing and comprehensive python package that can help you with that is [DIPY](https://nipy.org/dipy/). While going through [DIPY](https://nipy.org/dipy/)'s great and extensively documented example gallery, you find just the functionality you're interested in: [fiber tracking](http://nipy.org/dipy/examples_built/introduction_to_basic_tracking.html). After downloading the script, you're excited to run it on an example dataset to get a basic idea of how it works, so that you can later adapt it to your own dataset:

In [3]:
python introduction_to_basic_tracking.py

Creating new folder /Users/peerherholz/.dipy/stanford_hardi
Downloading "HARDI150.nii.gz" to /Users/peerherholz/.dipy/stanford_hardi
Download Progress: [##################################] 100.00%  of 87.15 MB
Downloading "HARDI150.bval" to /Users/peerherholz/.dipy/stanford_hardi
Download Progress: [##################################] 100.00%  of 0.00 MB
Downloading "HARDI150.bvec" to /Users/peerherholz/.dipy/stanford_hardi
Download Progress: [##################################] 100.00%  of 0.00 MB
Files successfully downloaded to /Users/peerherholz/.dipy/stanford_hardi
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Downloading "aparc-reduced.nii.gz" to /Users/peerherholz/.dipy/stanford_hardi
Download Progress: [##################################] 100.00%  of 0.06 MB
Downloading "label_info.txt" to /Users/peerherholz/.dipy/stanford_hardi
Download Progress: [##################################] 100.00%  of 0.

: 1

Depending on your setup, you'll most likely receive an error message that can range from "not having DIPY installed" to "missing certain sub-functionality". As this is obviously sub-optimal, you want to fix whatever goes wrong here. But before you start, you should think about how to approach this problem so that your solution, and hopefully the resulting code and analyses, become more re-usable/executable and reproducible. This saves your future self, lab colleagues working on similar topics and basically everyone else who is potentially interested a lot of time and effort, and it enables a form of "long term maintenance" that helps to further improve (neuro)science. One way, if not the perfect way, to do this is "[virtualization](https://en.wikipedia.org/wiki/Virtualization)". So let's see how this can be achieved for our example.

As briefly introduced before, there are several possible levels of virtualization:

<img src="https://raw.githubusercontent.com/goanpeca/pyday-cali-2019/master/img_source/isolation.png"
     alt="virtualization layers"
     style="float: left; margin-right: 2px;" width=600; height=800  />

Based on that, our first option, the one with the least amount of virtualization, is to create a virtual python environment using either [virtualenv](https://virtualenv.pypa.io/en/latest/) or [venv](https://docs.python.org/3/library/venv.html). The difference between the two is that the former supports python 2.* and 3.*, while the latter is part of the standard library of python 3 only (3.3 and upwards). As python 2.* will retire in 2020, let's use [venv](https://docs.python.org/3/library/venv.html).

### Virtualization using python environments - venv

*This part heavily relies on a [virtual environment tutorial from www.realpython.com](https://realpython.com/python-virtual-environments-a-primer/).*

At first we have to create a new python environment by using `venv` from the command line, providing it with a name. The respective syntax is: `python -m venv *name*`, where `*name*` is the name of your new python environment. You can use almost any name, but it is definitely a good idea to provide a meaningful one:

In [4]:
python -m venv dipy_tracks

By default no output will be given, so how can we check what happened? Just type `ls`.

In [6]:
ls

dipy_tracks
empty
introduction_to_basic_tracking.py
introduction_virtualization_practicals.ipynb


As you can see, a new directory named as our environment was created. Let's check what's inside:

In [7]:
ls dipy_tracks

bin        include    lib        pyvenv.cfg


We have `bin`, `include` and `lib`. Here's what they include:

- `bin` : files that interact with the virtual environment
- `include` : C headers used to compile python packages
- `lib` : a copy of the Python version along with a site-packages folder where each dependency is installed
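
The fourth entry, `pyvenv.cfg`, is a small key/value file that records how the environment was created. As a minimal sketch (the file below is a hand-written stand-in with typical keys; the `home` path and the version number are assumptions, not taken from this tutorial's environment), the setting that decides whether packages of the base installation are visible inside the environment could be read like this:

```shell
# Hand-written stand-in for dipy_tracks/pyvenv.cfg.
# The "home" path and the version are hypothetical values for illustration.
cfg_dir=$(mktemp -d)
cat > "$cfg_dir/pyvenv.cfg" <<'EOF'
home = /usr/local/bin
include-system-site-packages = false
version = 3.6.8
EOF

# Print only the value that follows the key and the "=" sign.
isolation=$(sed -n 's/^include-system-site-packages *= *//p' "$cfg_dir/pyvenv.cfg")
echo "system site-packages visible: $isolation"
```

For the real environment, `cat dipy_tracks/pyvenv.cfg` shows the actual content. A value of `false` means the environment starts without the packages of your default installation.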

But how can we work with this newly created python environment? At first, we have to `activate` it, by utilizing the `activate` script in `bin`:

In [8]:
source dipy_tracks/bin/activate

(dipy_tracks) 

: 1

We're now in our newly created python environment and can use its resources. The change of environment is indicated through the display of its name to the left of the command prompt (only visible in the `shell`).

Using `deactivate` we (you guessed right) `deactivate` or "leave" our environment again.

In [10]:
deactivate

Now that we have that, let's try to run our example again, after activating our environment.

In [11]:
source dipy_tracks/bin/activate

(dipy_tracks) 

: 1

In [12]:
python introduction_to_basic_tracking.py

Traceback (most recent call last):
  File "introduction_to_basic_tracking.py", line 30, in <module>
    from dipy.data import read_stanford_labels
ModuleNotFoundError: No module named 'dipy'
(dipy_tracks) 

: 1

Most likely, you'll receive the error message "ModuleNotFoundError: No module named 'dipy'". But why is that? The reason is fairly simple: when creating a virtual environment, your already installed packages are not automatically included, but must be installed again in the new environment. Every python environment on your machine is its own entity in the sense that binaries, libraries, etc. are not shared between environments. Which python environment is used to execute a certain command is determined by the `$PATH` variable, which is set either in your `bash profile` or manually by you when activating an environment. Let's investigate that a bit more.

You can check which python environment is currently set by running `which python` from within your `shell`.

In [13]:
which python

/Users/peerherholz/google_drive/GitHub/NeuroDataSci-course-2019/content/day2/pm/dipy_tracks/bin/python
(dipy_tracks) 

: 1

You should see that the environment that is currently running is the one we just created. Now let's deactivate it and try again.

In [14]:
deactivate

In [15]:
which python

/Users/peerherholz/anaconda3/bin/python


You should now see the environment that is running as default on your system. Let's investigate `$PATH` to further grasp what's going on.

In [16]:
echo $PATH

/Users/peerherholz/anaconda3/bin:/Users/peerherholz/abin:/usr/local/antsbin/bin:/Applications/MATLAB_R2014a.app/bin:/Applications/freesurfer/bin:/Applications/freesurfer/fsfast/bin:/Applications/freesurfer/tktools:/usr/local/fsl/bin:/Applications/freesurfer/mni/bin:/usr/local/fsl/bin:/Users/peerherholz/anaconda3/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Applications/workbench/bin_macosx64:/usr/local/texlive/2019/bin/x86_64-darwin/:/Users/peerherholz/abin


And now from within our new environment:

In [19]:
source dipy_tracks/bin/activate
echo $PATH

(dipy_tracks) /Users/peerherholz/google_drive/GitHub/NeuroDataSci-course-2019/content/day2/pm/dipy_tracks/bin:/Users/peerherholz/anaconda3/bin:/Users/peerherholz/anaconda3/bin:/Users/peerherholz/abin:/usr/local/antsbin/bin:/Applications/MATLAB_R2014a.app/bin:/Applications/freesurfer/bin:/Applications/freesurfer/fsfast/bin:/Applications/freesurfer/tktools:/usr/local/fsl/bin:/Applications/freesurfer/mni/bin:/usr/local/fsl/bin:/Users/peerherholz/anaconda3/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Applications/workbench/bin_macosx64:/usr/local/texlive/2019/bin/x86_64-darwin/:/Users/peerherholz/abin
(dipy_tracks) 

: 1

As you can see, the first entry in `$PATH` is now our newly created python environment. Therefore, for everything that is run from the `command line`, the first location searched for an executable is that environment and not your default one. That means that everything will be executed through/within that environment based on its resources.
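
The lookup order can be reproduced without any python environment. The following is a minimal sketch, assuming two throw-away directories and a made-up executable name `mytool`; it mimics what an `activate` script does when it prepends the environment's `bin` folder to `$PATH`:

```shell
# Two temporary directories, each holding an executable with the same
# (hypothetical) name "mytool".
base_bin=$(mktemp -d)
env_bin=$(mktemp -d)
printf '#!/bin/sh\necho "default tool"\n' > "$base_bin/mytool"
printf '#!/bin/sh\necho "environment tool"\n' > "$env_bin/mytool"
chmod +x "$base_bin/mytool" "$env_bin/mytool"

# Remember the original search path so it can be restored later.
OLD_PATH="$PATH"

# Only the "default" directory is on the search path.
PATH="$base_bin:$PATH"
before=$(command -v mytool)

# Prepend the "environment" directory, as activation does.
PATH="$env_bin:$PATH"
hash -r   # forget cached command locations
after=$(command -v mytool)

echo "before activation: $before"
echo "after activation:  $after"

# Restoring the saved value is the essence of "deactivate".
PATH="$OLD_PATH"
```

Both directories stay on the search path after the second step; the shell simply stops at the first match, which is why the order of the entries matters.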

While this is at the core of `virtualization`, it also requires you to pay close attention to what and how you're executing functions, scripts, etc. That being said, we should populate our so far rather empty environment with, foremost, DIPY.

In [20]:
pip install dipy

Collecting dipy
  Downloading https://files.pythonhosted.org/packages/62/8f/63fe5e03244b0044307bcb9742df367ebcd6b00220c7ce1d16150fd5bcdb/dipy-0.16.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (11.7MB)
    100% |████████████████████████████████| 11.7MB 1.6MB/s ta 0:00:01
Collecting scipy>=0.9 (from dipy)
  Downloading https://files.pythonhosted.org/packages/81/ae/125c21f09b202c3009ad7d9fb0263fb7d6053813d4b67ccbbe4d65f7f53a/scipy-1.3.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (27.7MB)
    100% |████████████████████████████████| 27.7MB 621kB/s ta 0:00:011
Collecting h5py>=2.4.0 (from dipy)
  Downloading https://files.pythonhosted.org/packages/03/21/1cdf7fa7868528b35c1a08a770eb9334279574a8b5f1d7a2966dcec14e42/h5py-2.9.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64

: 1

As you can see, using `pip install dipy` not only downloaded and installed DIPY, but also all its dependencies. This works because python packages declare their dependencies in their package metadata, which `pip` reads during installation; many projects additionally list them in a `requirements.txt`. The one of DIPY looks like [this](https://github.com/nipy/dipy/blob/master/requirements.txt) and includes the packages we just downloaded and installed. This handy functionality is possible because `pip` is a `package manager`. Another functionality it provides is the easy inspection of our environment, for example which packages are installed:

In [21]:
pip freeze

dipy==0.16.0
h5py==2.9.0
nibabel==2.4.1
numpy==1.17.0
scipy==1.3.0
six==1.12.0
You are using pip version 10.0.1, however version 19.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(dipy_tracks) 

: 1

Furthermore, we can save the output of `pip freeze` to create our own `requirements.txt`, which you can then share with whoever is interested in running your scripts and analyses using the same python libraries in identical versions. This is `virtualization` of python environments.

In [22]:
pip freeze > requirements.txt
cat requirements.txt

You are using pip version 10.0.1, however version 19.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(dipy_tracks) 

: 1
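
Sharing identical versions only works if every entry in the file is pinned with `==`. The following is a small sketch of such a sanity check, assuming a sample file filled with the versions reported by `pip freeze` above; the temporary file merely stands in for `requirements.txt`:

```shell
# Sample requirements file with the versions reported by "pip freeze" above.
req_file=$(mktemp)
cat > "$req_file" <<'EOF'
dipy==0.16.0
h5py==2.9.0
nibabel==2.4.1
numpy==1.17.0
scipy==1.3.0
six==1.12.0
EOF

# Count entries without an exact "==" pin; blank lines and comments are skipped.
unpinned=$(awk '!/^[ \t]*(#|$)/ && !/==/ { n++ } END { print n + 0 }' "$req_file")
echo "unpinned entries: $unpinned"
```

Whoever receives the file can then install the listed versions inside a fresh environment with `pip install -r requirements.txt`.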

But before we do share our environment, we should ensure that everything is working as expected:

In [23]:
python introduction_to_basic_tracking.py

Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Traceback (most recent call last):
  File "introduction_to_basic_tracking.py", line 98, in <module>
    from dipy.viz import window, actor, colormap as cmap, have_fury
ImportError: cannot import name 'window'
(dipy_tracks) 

: 1

"Luckily" we run into the same error as before in our default python environment. Now we can work on resolving this error and thus create an environment that is rather specific, instead of sharing our own, most likely large, default python environment with lots of libraries that are not necessary.

Most python packages have very useful error messages like the one we're receiving. Paying a bit more attention, we see that something is missing, and after a short session with Google (or any other search engine) we know that it is the `fury` package. By now, we know what to do:

In [24]:
pip install fury

Collecting fury
  Downloading https://files.pythonhosted.org/packages/36/f2/ad1109e541c091a3c591ce2841e55e5ef177cd738d90b1a5bd038fa1dfa0/fury-0.2.0-py3-none-any.whl (140kB)
    100% |████████████████████████████████| 143kB 1.8MB/s ta 0:00:01
Collecting vtk>=8.1.0 (from fury)
  Downloading https://files.pythonhosted.org/packages/8d/3b/a92a64a5d1203aae2af17dccc686ff4eb3bb7114db79eaab1593c03fb678/vtk-8.1.2-cp36-cp36m-macosx_10_6_x86_64.whl (54.9MB)
    100% |████████████████████████████████| 54.9MB 264kB/s ta 0:00:011
Installing collected packages: vtk, fury
Successfully installed fury-0.2.0 vtk-8.1.2
You are using pip version 10.0.1, however version 19.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(dipy_tracks) 

: 1

As you can see, `vtk` was installed as a dependency of `fury`. So far so good, let's try the example again.

In [25]:
python introduction_to_basic_tracking.py

Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
  orient = np.abs(orient / np.linalg.norm(orient))
(dipy_tracks) 

: 1

As this tutorial was put together in such a way as to stress the many possible errors and problems, and thus the importance of virtualization, it will be divided from now on, based on the output you received above (I'm truly sorry, but we can't anticipate the setup of all participants). If the above command worked for you, please continue below; if not, please go to the section [Virtualization using python environments - conda](#virtualization_conda).

If you see the message: `dipy_tracks/lib/python3.6/site-packages/fury/colormap.py:229: RuntimeWarning: invalid value encountered in true_divide orient = np.abs(orient / np.linalg.norm(orient))`, don't worry for now, as it's a `warning`, not an `error`, and our example script completed successfully (*Please note that you should of course investigate any warning when working on real scripts and analyses*). If we check our current working directory (`pwd`), we see four new files: two `.trk` files and two graphics in `.png` format.

As we've seen, the script also allows enabling an interactive view. Let's check if that's working as well.

In [28]:
python introduction_to_basic_tracking.py

Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
  orient = np.abs(orient / np.linalg.norm(orient))
Terminated: 15
(dipy_tracks) 

: 1

Now that we have everything necessary in place, we can update our `requirements.txt` and share it:

In [29]:
pip freeze > requirements.txt
cat requirements.txt

You are using pip version 10.0.1, however version 19.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(dipy_tracks) dipy==0.16.0
fury==0.2.0
h5py==2.9.0
nibabel==2.4.1
numpy==1.17.0
scipy==1.3.0
six==1.12.0
vtk==8.1.2
(dipy_tracks) 

: 1
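
To see what changed between the two snapshots of the environment, the two package lists can be compared. This is a minimal sketch, assuming the first and the second `pip freeze` output were kept in two separate files; the file names are made up for illustration:

```shell
work_dir=$(mktemp -d)

# First snapshot, taken right after installing DIPY.
cat > "$work_dir/requirements_before.txt" <<'EOF'
dipy==0.16.0
h5py==2.9.0
nibabel==2.4.1
numpy==1.17.0
scipy==1.3.0
six==1.12.0
EOF

# Second snapshot, taken after installing fury.
cat > "$work_dir/requirements_after.txt" <<'EOF'
dipy==0.16.0
fury==0.2.0
h5py==2.9.0
nibabel==2.4.1
numpy==1.17.0
scipy==1.3.0
six==1.12.0
vtk==8.1.2
EOF

# Print the lines that occur only in the second file:
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from a file.
added=$(grep -F -x -v -f "$work_dir/requirements_before.txt" "$work_dir/requirements_after.txt")
echo "$added"
```

The result lists `fury` and its dependency `vtk`, the two packages added in the last step.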

<a id='virtualization_conda'></a>
### Virtualization using python environments - conda

If the last run of the example resulted in something along the following lines for you, there is more work to do, work that goes a bit deeper:

- `from .vtkOpenGLKitPython import *`
- `ImportError: libSM.so.6: cannot open shared object file: No such file or directory`
- `ModuleNotFoundError: No module named 'vtkOpenGLKitPython'`

Even though we seemingly installed all necessary dependencies based on the script and the previous output, we still receive an error related to some missing resources. One thing we could try now is to use a different virtualization method and python package manager. One that has become well known and heavily used within the last few years is [conda](https://docs.conda.io/en/latest/). While before we used `venv` to create our virtual environment and `pip` as a package manager, we can use `conda` for both, as it combines the respective functionality. Recreating our virtual environment from before is straightforward with `conda`, the general syntax being: `conda create -n *name* python=*python_version* *libraries*`, where `*name*` is the name of your virtual environment, `*python_version*` the python version you want to use and `*libraries*` the libraries you want to install. Adapted to our example, this looks as follows:

In [1]:
conda create -y -n dipy_tracks  python=3.6 dipy 

Collecting package metadata: done
Solving environment: done


  current version: 4.6.14
  latest version: 4.7.10

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /Users/peerherholz/anaconda3/envs/dipy_tracks

  added / updated specs:
    - dipy
    - python=3.6


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    bzip2-1.0.8                |       h01d97ff_0         148 KB  conda-forge
    certifi-2019.6.16          |           py36_1         149 KB  conda-forge
    h5py-2.9.0                 |nompi_py36h3d62f72_1103         980 KB  conda-forge
    libcxx-8.0.0               |                4         1.0 MB  conda-forge
    libopenblas-0.3.6          |       hd44dcd8_6         8.4 MB  conda-forge
    libpng-1.6.37              |       h2573ce8_0         298 KB  conda-forge
    matplotlib-base-3.1.1      |   py36h3a684a6_1     

As you can see, `conda` by default already installs a fair number of libraries compared to `venv`. However, we're still missing `fury`. So let's activate our newly created `conda environment` and install it. The steps and syntax are very similar to what we've done before:

In [8]:
source activate dipy_tracks

(dipy_tracks) 

: 1

In [10]:
conda install fury

Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - fury

Current channels:

  - https://conda.anaconda.org/conda-forge/osx-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/osx-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.


(dipy_tracks) 

: 1

And yet another problem: `fury` can't be found in the channels `conda` is searching (*spoiler: installing `fury` using `conda` only worked till python3.4*). But we can get it using `pip`. This is a good example of not all libraries being available in all package managers.

In [11]:
pip install fury

Collecting fury
  Using cached https://files.pythonhosted.org/packages/65/28/14fe94c26e947f650222f076f57b607148cbf0bb2c303324ce13a8b87bdc/fury-0.3.0-py3-none-any.whl
Collecting vtk>=8.1.0 (from fury)
  Using cached https://files.pythonhosted.org/packages/8d/3b/a92a64a5d1203aae2af17dccc686ff4eb3bb7114db79eaab1593c03fb678/vtk-8.1.2-cp36-cp36m-macosx_10_6_x86_64.whl
Installing collected packages: vtk, fury
Successfully installed fury-0.3.0 vtk-8.1.2
(dipy_tracks) 

: 1
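
The two manual steps above, trying `conda` first and switching to `pip` when the package is not available in the configured channels, could be wrapped in a small helper. This is only a sketch: `install_with_fallback` is a hypothetical function name, not something `conda` or `pip` provide, and it assumes the target environment is already activated:

```shell
# Try conda first and fall back to pip when conda cannot provide the package.
# install_with_fallback is a hypothetical helper for illustration only.
install_with_fallback() {
    pkg="$1"
    # Installer messages go to stderr so that stdout only carries the result.
    if conda install -y "$pkg" >&2; then
        echo "conda"
    elif pip install "$pkg" >&2; then
        echo "pip"
    else
        echo "none"
        return 1
    fi
}

# Usage (needs network access, therefore not executed here):
#   install_with_fallback fury
```

The function prints which package manager succeeded, which is worth noting down, as packages installed via `pip` are not managed by `conda` afterwards.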

Okay, so far so good. Let's check our environment before we try to run the example:

In [12]:
conda info --envs

# conda environments:
#
base                     /Users/peerherholz/anaconda3
dipy_tracks           *  /Users/peerherholz/anaconda3/envs/dipy_tracks
py27                     /Users/peerherholz/anaconda3/envs/py27
python3.6_test           /Users/peerherholz/anaconda3/envs/python3.6_test
                         /usr/local/fsl/fslpython/envs/fslpython

(dipy_tracks) 

: 1

In [13]:
which python

/Users/peerherholz/anaconda3/envs/dipy_tracks/bin/python
(dipy_tracks) 

: 1

In [14]:
conda list

# packages in environment at /Users/peerherholz/anaconda3/envs/dipy_tracks:
#
# Name                    Version                   Build  Channel
blosc                     1.17.0               h6de7cb9_0    conda-forge
bz2file                   0.98                       py_0    conda-forge
bzip2                     1.0.8                h01d97ff_0    conda-forge
ca-certificates           2019.6.16            hecc5488_0    conda-forge
certifi                   2019.6.16                py36_1    conda-forge
cvxopt                    1.2.3           py36h18a38e7_202    conda-forge
cycler                    0.10.0                     py_1    conda-forge
dipy                      0.16.0           py36h917ab60_0    conda-forge
dsdp                      5.8               h971f2e1_1203    conda-forge
fftw                      3.3.8           nompi_h5c49c53_1106    conda-forge
freetype                  2.10.0               h24853df_0    conda-forge
fury                      0.3.0                

: 1

Okay, looks good. Let's export the environment to be available for later:

In [15]:
conda env export > environment_conda.yml

(dipy_tracks) 

: 1
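
An exported environment file typically keeps packages installed through `conda` and those installed through `pip` apart. The following is a hand-written sketch of that layout, not the actual content of `environment_conda.yml`: a real export lists every package, usually with its build string, and the file name used here is made up:

```shell
# Hand-written sketch of an exported environment file (hypothetical file name).
env_file="$(mktemp -d)/environment_sketch.yml"
cat > "$env_file" <<'EOF'
name: dipy_tracks
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.6
  - dipy=0.16.0
  - pip
  - pip:
    - fury==0.3.0
    - vtk==8.1.2
EOF

# pip entries are pinned with "==", conda entries with a single "=",
# so matching "==" lists the packages that came from pip.
pip_pkgs=$(sed -n 's/^ *- \(.*==.*\)$/\1/p' "$env_file")
echo "$pip_pkgs"
```

Someone receiving the exported file can recreate the environment with `conda env create -f environment_conda.yml`.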

And we're ready for a new test run:

In [16]:
python introduction_to_basic_tracking.py

Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
Dataset is already in place. If you want to fetch it again please first remove the folder /Users/peerherholz/.dipy/stanford_hardi 
  orient = np.abs(orient / np.linalg.norm(orient))
(dipy_tracks) 

: 1

The same error as in our `venv` environment... Feeling the frustration? Good, that was the plan. This example stresses two important things: a) you should read error messages carefully and b) quite often a virtualization at the pure python level is not enough. Focusing on the latter, we need to install dependencies that are outside python, that is [VTK](https://vtk.org/). At this point there is nothing more we can do based on pure python virtualization; we have to go one step further and also include non-python resources in order to enable a re-usable/executable and reproducible functionality that can easily be shared. To this end, let's have a look at our graphic from above:

As briefly introduced before, there are several possible levels of virtualization:

<img src="https://raw.githubusercontent.com/goanpeca/pyday-cali-2019/master/img_source/isolation.png"
     alt="virtualization layers"
     style="float: left; margin-right: 2px;" width=600; height=800  />
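
Before moving up a level, it can help to confirm from the shell that a missing piece really lives outside python. The `libSM.so.6` named in the `ImportError` at the start of the conda section is a shared system library, which is why installing python packages did not provide it here. The following is a rough sketch, assuming a Linux system where `ldconfig` is available; `check_shared_lib` is a hypothetical helper name:

```shell
# Report whether the dynamic linker cache lists a given shared library.
# ldconfig is Linux-specific; without it the function reports "unknown".
check_shared_lib() {
    lib_name="$1"
    if ! command -v ldconfig >/dev/null 2>&1; then
        echo "unknown"
    elif ldconfig -p 2>/dev/null | grep -q -F "$lib_name"; then
        echo "found"
    else
        echo "missing"
    fi
}

check_shared_lib libSM.so.6
```

If the result is `missing`, the library has to come from the operating system or from a higher level of virtualization, not from `pip` or `venv`.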