# Local Installation Setup Tutorial

Hello! This tutorial shows you how to set up and tear down your workspace in a Jupyter Lab notebook (or ipython environment) in order to run the MIND end-to-end pathology analysis tutorial. Here are the steps we will review:

1. Prerequisites
2. Create a new directory for your project
3. Set up your virtual environment
4. Clone the repository and install dependencies
5. Teardown your project and virtual environment
6. References

## 1. Prerequisites

It is assumed you have a Jupyter Lab environment set up for executing these notebooks. If not, follow the instructions at https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html to install Jupyter Lab on your host system of choice.

The prerequisites listed here must be installed on the host system and not through the jupyter lab (or ipython) environment. 

If Apache Spark is not already installed on your local computer, download it from https://spark.apache.org/downloads.html.

Make sure that compatible versions of Java, Scala, Python, and R are installed and discoverable on your system (for example, on your PATH). Apache Spark runs on Java 8/11, Scala 2.12, Python 3.6+ and R 3.5+.
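
The version requirements above can be checked from Python before going further. The following is a minimal sketch, not part of the original tutorial: the helper names `java_major_version` and `python_is_supported` are hypothetical, and the parser only assumes the usual `version "..."` line that `java -version` prints.

```python
import re
import sys


def java_major_version(version_output):
    """Return the Java major version parsed from `java -version` text, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.8.0_275" means Java 8.
    # Modern numbering: "11.0.9" means Java 11.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def python_is_supported(version_info=None, minimum=(3, 6)):
    """Return True when the interpreter version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)
```

Note that `java -version` writes to stderr, so when capturing it with `subprocess` you would typically read the `stderr` stream rather than `stdout`.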

Here are the links for installations of Java, Scala, Python, and R. Again, make sure you download the correct versions:

- Java (AdoptOpenJDK): https://adoptopenjdk.net/installation.html
- Scala: https://www.scala-lang.org/download/
- Python: https://www.python.org/downloads/
- R: https://www.r-project.org/

It is important to have the path to your Java installation in your JAVA_HOME environment variable. 

In [1]:
!java -version
!python3 --version

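import os, subprocess
# strip() removes the trailing newline that `which` prints.
# Note: this stores the path of the java executable; JAVA_HOME conventionally
# names the installation directory that contains bin/java.
os.environ['JAVA_HOME'] = subprocess.check_output(['bash','-c', 'which java']).decode("utf-8").strip()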
!echo 'JAVA_HOME=' $JAVA_HOME

openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)
Python 3.6.9
JAVA_HOME= /gpfs/mskmindhdp_emc/sw/env/bin/java
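
The output above shows JAVA_HOME holding the path of the `java` executable itself. Many tools expect JAVA_HOME to name the installation directory that contains `bin/java`. The sketch below shows one way to derive that directory. It assumes the conventional `<home>/bin/java` layout, and the helper names are hypothetical.

```python
import os
import shutil
from pathlib import Path


def java_home_from_executable(java_path):
    """Return the directory two levels above a <home>/bin/java executable path."""
    path = Path(java_path.strip())
    if path.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/java, got {path}")
    return str(path.parent.parent)


def detect_java_home():
    """Locate java on PATH and return its installation directory, or None."""
    java_path = shutil.which("java")
    if java_path is None:
        return None
    # realpath follows symlinks (for example /usr/bin/java) to the real install.
    return java_home_from_executable(os.path.realpath(java_path))
```

In a notebook cell you could then assign the result of `detect_java_home()` to `os.environ['JAVA_HOME']`, after checking that it is not `None`.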


You must also install Hadoop on your computer. On macOS, you can install it with Homebrew:

    brew install hadoop

Hadoop needs some additional setup on macOS. The single-node cluster guide is a useful reference: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html.

Next, install OpenSlide (https://openslide.org/download/). This library reads SVS images and their tiles. On macOS, you can install it with Homebrew:

    brew install openslide
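
After installing Hadoop and OpenSlide, a quick check like the one below can confirm that they are visible to your environment. This is a hedged sketch using only the standard library. The helper names are hypothetical, and `find_library` may miss a library installed in a non-standard location, so a `False` result is a hint rather than proof that OpenSlide is absent.

```python
import shutil
from ctypes.util import find_library


def missing_executables(names):
    """Return the subset of command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def openslide_library_found():
    """Return True when the system linker search can locate the OpenSlide library."""
    return find_library("openslide") is not None
```

For example, `missing_executables(["java", "hadoop", "spark-submit"])` returns the commands that still need to be installed or added to your PATH.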

Lastly, you must find where Spark is installed on your machine and set the SPARK_HOME environment variable yourself. You can find your Spark installation directory by executing:

    which spark-submit
    
If, for example, the output is "/opt/spark-3.0.0-bin-hadoop3.2/bin/spark-submit", then set your SPARK_HOME environment variable to "/opt/spark-3.0.0-bin-hadoop3.2" by running the code below in a code cell.

    import os
    os.environ['SPARK_HOME']='/opt/spark-3.0.0-bin-hadoop3.2'
    !echo $SPARK_HOME
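
A mistyped SPARK_HOME is easy to miss until Spark fails to start. The sketch below validates a candidate directory before exporting it. It assumes only the layout described above, where `spark-submit` lives in the `bin` subdirectory, and the function names are hypothetical.

```python
import os
from pathlib import Path


def is_spark_home(candidate):
    """Return True when <candidate>/bin/spark-submit exists as a file."""
    return (Path(candidate) / "bin" / "spark-submit").is_file()


def set_spark_home(candidate):
    """Export SPARK_HOME after checking that the directory looks like a Spark install."""
    if not is_spark_home(candidate):
        raise ValueError(f"{candidate} does not contain bin/spark-submit")
    os.environ["SPARK_HOME"] = str(candidate)
    return os.environ["SPARK_HOME"]
```

Calling `set_spark_home('/opt/spark-3.0.0-bin-hadoop3.2')` would then either set the variable or raise an error that names the bad path.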

## 2. Create a new directory for your project

In [2]:
!pwd
!rm -rf pathology-tutorial-sandbox
!mkdir -p pathology-tutorial-sandbox/project pathology-tutorial-sandbox/data/toy-data-set 

/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial
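
The shell commands in the cell above can also be written in plain Python, which may help on systems where `rm` and `mkdir -p` are unavailable. This is a minimal sketch that mirrors the same directory layout. The function name `create_sandbox` is hypothetical.

```python
import shutil
from pathlib import Path


def create_sandbox(root="."):
    """Recreate the tutorial sandbox under `root` and return its path."""
    sandbox = Path(root) / "pathology-tutorial-sandbox"
    # Mirror `rm -rf`: remove any previous sandbox so reruns start clean.
    shutil.rmtree(sandbox, ignore_errors=True)
    # Mirror `mkdir -p`: create nested directories in one call each.
    for sub in ("project", "data/toy-data-set"):
        (sandbox / sub).mkdir(parents=True, exist_ok=True)
    return sandbox
```

As with the shell version, this deletes any existing `pathology-tutorial-sandbox` directory under `root`, so run it only where that is intended.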


## 3. Set up your virtual environment

Next, set up your virtual environment within the jupyter lab (or ipython) environment. The end of this tutorial has steps for tearing down this virtual environment. 

Open a terminal in your Jupyter Lab environment by selecting File -> New -> Terminal and execute the following commands. It is assumed that the default Python environment on the host system has python3-venv installed (on Debian or Ubuntu: sudo apt-get install python3-venv -y).

    # change directory to your pathology tutorial sandbox directory
    cd [LOCATION-WHERE-YOU-WANT-TO-CREATE-THE-VIRTUAL-ENV]

    # create the virtual environment
    python3 -m venv pt-venv
    
    # activate the virtual environment
    source pt-venv/bin/activate 
    
    # upgrade pip
    python3 -m pip install --upgrade pip
    
    # install ipykernel
    pip install ipykernel

    # Register this env with jupyter lab. It’ll now show up in the
    # launcher & kernels list once you refresh the page
    python3 -m ipykernel install --user --name pt-venv --display-name "pathology tutorial venv"

    # List kernels to ensure it was created successfully
    jupyter kernelspec list
    
    # deactivate the virtual environment in the terminal
    deactivate

Now, apply the new kernel to your notebook by first selecting the default kernel (which is typically "Python 3") and then selecting your new kernel "pathology tutorial venv" from the drop-down list. **NOTE:** It may take a minute for the drop-down list to update. 

Any Python packages you install with pip from the notebook will now be installed only in this virtual environment.
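
To confirm that the notebook is really running on the new kernel, you can compare the interpreter's prefixes from a code cell. This is a small sketch based on the standard `sys.prefix` and `sys.base_prefix` attributes, which differ inside an environment created with `python3 -m venv`. The function name is hypothetical.

```python
import sys


def running_in_venv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base installation."""
    if prefix is None:
        prefix = sys.prefix
    if base_prefix is None:
        base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return prefix != base_prefix
```

If `running_in_venv()` returns `False` after you have switched kernels, the notebook is probably still attached to the default kernel. Printing `sys.executable` shows which interpreter is in use.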


## 4. Clone the repository and install dependencies

In [3]:
!pwd
!cd pathology-tutorial-sandbox && git clone https://github.com/msk-mind/data-processing.git 
!tree -d -L 2 pathology-tutorial-sandbox

/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial
Cloning into 'data-processing'...
remote: Enumerating objects: 8628, done.
remote: Counting objects: 100% (669/669), done.
remote: Compressing objects: 100% (347/347), done.
remote: Total 8628 (delta 447), reused 448 (delta 286), pack-reused 7959
Receiving objects: 100% (8628/8628), 110.67 MiB | 6.35 MiB/s, done.
Resolving deltas: 100% (5820/5820), done.
Checking out files: 100% (615/615), done.
pathology-tutorial-sandbox
├── data
│   └── toy-data-set
├── data-processing
│   ├── conf
│   ├── data_processing
│   ├── integration
│   └── tests
└── project

8 directories


Please contact the MSK MIND team if there are access issues cloning into the repository.

At this point, your sandbox directory should match the tree shown above. You can also check which Python interpreter the notebook is using:

In [1]:
!which python

/gpfs/mskmindhdp_emc/sw/env/bin/python


Next, navigate to the data-processing root folder and install the python dependencies. 

In [7]:
%pip install -q -e pathology-tutorial-sandbox/data-processing/.

You should consider upgrading via the '/gpfs/mskmindhdp_emc/user/pashaa/pt-venv/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


At this point, you can check which additional dependencies were installed by running "%pip list" in a code cell.

*Note: if you receive an error message about a particular installation during this process that halts the previous command from being fully executed, run '%pip install x', where x is the package, and then run the previous command again.*

If you have followed all of these steps so far, your Jupyter installation should be set up! Try importing the data_processing library:

In [None]:
import data_processing

You should have no errors with this step. Congratulations, you are ready to move on to the dataset prep!
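
If the import does fail, it helps to know whether the package is missing entirely or fails while loading. The sketch below uses the standard library's `importlib.util.find_spec` to test whether a top-level module can be found without actually importing it. The helper names are hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True when a top-level module can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(module_names):
    """Return the top-level module names that cannot be located."""
    return [name for name in module_names if not is_importable(name)]
```

For example, `missing_modules(["data_processing"])` returning an empty list suggests the editable install succeeded and the active kernel can see it. A non-empty list usually means the install ran against a different environment or the kernel needs a restart.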

## 5. Teardown your project and virtual environment

**WARNING:** Follow these steps only after you are done using this Jupyter environment and are ready to restore your system to its original state.

    # in your jupyter terminal, uninstall the pt-venv kernel
    jupyter kernelspec uninstall pt-venv
    
    # delete the virtual environment 
    rm -rf pt-venv
    
Next, delete the sandbox.

In [1]:
!rm -rf pathology-tutorial-sandbox
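
Because `rm -rf` deletes without confirmation, a slightly more defensive Python equivalent can be useful. This is a sketch under the assumption that the sandbox keeps the directory name used in this tutorial. The function name `remove_sandbox` is hypothetical.

```python
import shutil
from pathlib import Path


def remove_sandbox(root=".", name="pathology-tutorial-sandbox"):
    """Delete the sandbox directory under `root`; return True if it was removed."""
    target = Path(root) / name
    if not target.is_dir():
        # Nothing to delete, so leave the file system untouched.
        return False
    shutil.rmtree(target)
    return True
```

Unlike the shell command, this reports whether anything was removed, which makes accidental runs from the wrong working directory easier to spot.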

## 6. References

Use Virtual Environments Inside Jupyter Notebooks & Jupter Lab [Best Practices] -
https://www.zainrizvi.io/blog/jupyter-notebooks-best-practices-use-virtual-environments/

Installing the IPython kernel - 
https://ipython.readthedocs.io/en/stable/install/kernel_install.html#kernels-for-python-2-and-3