<a href="https://colab.research.google.com/github/andrewkern/disperseNN2/blob/adk_doc/docs/disperseNN2_vignette.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# disperseNN2 Colab Notebook Vignette
This notebook walks through an example of training a disperseNN2 model on a small dataset.
It is meant to be run on Google Colab, which provides a GPU to speed up training, but it can also be run locally,
with or without a GPU, if the required packages are installed. The steps are laid out in the numbered sections below.


## 1. Set up the environment

First we need to set up our Colab instance by installing software, installing disperseNN2, cloning the repo to get example data, and importing packages.

In [None]:
%%bash
# install software we will need for the vignette
apt-get install poppler-utils pigz -y
pip install disperseNN2 pdf2image

# clone repo
git clone https://github.com/chriscrsmith/disperseNN2.git



## 2. Grab preprocessed data
Rather than wait on simulations, we have created a tarball for this training example that contains preprocessed data.
All of the simulations were created following the detailed descriptions [in our documentation](https://dispersenn2.readthedocs.io/en/latest/vignette.html#vignette-simulation). Further, the tree sequences were preprocessed using the `disperseNN2 --preprocess` mode and the metadata was
extracted according to the protocol [here](https://dispersenn2.readthedocs.io/en/latest/vignette.html#vignette-preprocessing).
We will download and extract this tarball.

In [None]:
# grab data from Google Drive using gdown
!gdown 1eKaX19H0nWneOKi5_tBDpiGMMSaQnMfO
# also available via wget, but that is too slow for Colab
# wget http://sesame.uoregon.edu/~adkern/vignette.tar.gz


Uncompress the tarball:

In [None]:
%%bash
pigz -d vignette.tar.gz
tar xf vignette.tar
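
If you are running locally and `pigz` is not installed, the same extraction can be done from Python. The following is a minimal sketch using the standard-library `tarfile` module; `extract_tarball` is a hypothetical helper written for this notebook, not part of disperseNN2, and it assumes the downloaded `vignette.tar.gz` is still compressed in the working directory.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_tarball(archive, dest="."):
    """Extract a .tar or .tar.gz archive into dest; return its top-level entries."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # mode "r:*" lets tarfile detect compression (gzip or none) on its own
    with tarfile.open(archive, mode="r:*") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    # Collect the first path component of every member, e.g. "vignette"
    tops = set()
    for name in names:
        parts = PurePosixPath(name).parts
        if parts:
            tops.add(parts[0])
    return sorted(tops)


# Usage (only if the bash cell above was not run):
# print(extract_tarball("vignette.tar.gz"))
```

This is single-threaded, so it will be slower than `pigz` on a large archive, but it needs no extra software.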


## 3. Train a model
We will train a model on the data we just downloaded, using the `disperseNN2 --train` mode.
In the training command below, we set `--pairs` to 1000;
this is the number of pairs of individuals from each training dataset that are included in the analysis, and we chose 1000 to reduce the memory requirement. We’ve found that using 100 for `--pairs_encode` works well and reduces memory use significantly. Training takes approximately 20 minutes on a T4 GPU instance.
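
To put the `--pairs` and `--pairs_encode` values in context, here is a minimal sketch that counts all possible pairs among the sampled individuals and compares the two settings to that total. The sample size of 100 individuals is an assumption made for illustration (it is not stated in this notebook), and `pair_budget` is a hypothetical helper, not part of disperseNN2.

```python
from math import comb


def pair_budget(n_samples, pairs, pairs_encode):
    """Compare the chosen pair counts to the number of possible pairs."""
    total = comb(n_samples, 2)  # unordered pairs of distinct individuals
    if not (0 < pairs_encode <= pairs <= total):
        raise ValueError("expected 0 < pairs_encode <= pairs <= total pairs")
    return {
        "total_pairs": total,
        "fraction_of_total_used": pairs / total,
        "fraction_of_used_encoded": pairs_encode / pairs,
    }


# n_samples=100 is an assumed sample size, used only for illustration
budget = pair_budget(n_samples=100, pairs=1000, pairs_encode=100)
for key, value in budget.items():
    print(key, value)
```

Under that assumption there are 4950 possible pairs, so `--pairs 1000` uses roughly a fifth of them.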

In [None]:
%%bash
disperseNN2 \
             --out vignette/output_dir \
             --seed 12345 \
             --train \
             --max_epochs 100 \
             --validation_split 0.2 \
             --batch_size 10 \
             --learning_rate 1e-4 \
             --pairs 1000 \
             --pairs_encode 100 \
             --gpu any \
             > vignette/output_dir/training_history_12345.txt

Okay, training is done! Let's plot the training history and then display it here in the notebook.

In [None]:
!disperseNN2 --plot_history vignette/output_dir/training_history_12345.txt

In [None]:


from pdf2image import convert_from_path
from IPython.display import display, Image

# convert each page of the PDF plot to a PNG and display it inline
images = convert_from_path("vignette/output_dir/training_history_12345.txt_plot.pdf")
for i, image in enumerate(images):
    fname = "image" + str(i) + ".png"
    image.save(fname, "PNG")
    display(Image(fname))

## 4. Validation
Next, we will validate the trained model on simulated test data. In a real application you should hold out datasets from training.



In [None]:
%%bash
disperseNN2 \
    --out vignette/output_dir \
    --seed 12345 \
    --predict \
    --batch_size 10 \
    --pairs 1000 \
    --pairs_encode 100 \
    --num_pred 100 \
    --gpu any

Below is a plot of the predictions, ``vignette/output_dir/Test/predictions_12345.txt``:


In [None]:
import pandas as pd
from matplotlib import pyplot as plt

x = pd.read_csv('vignette/output_dir/Test/predictions_12345.txt', sep='\t', header=None)
plt.scatter(x[0], x[1])
plt.xlabel('true')
plt.ylabel('predicted')

Looks pretty good!
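
To go beyond eyeballing the scatter plot, we can compute a few summary statistics from the same two columns. The sketch below is one possible way to do this, not the package's own evaluation; `summarize_predictions` is a hypothetical helper, and it assumes the true values are positive and not all identical.

```python
from math import sqrt


def summarize_predictions(true, pred):
    """Return RMSE, mean relative absolute error, and squared Pearson r."""
    true = [float(t) for t in true]
    pred = [float(p) for p in pred]
    if len(true) != len(pred) or len(true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(true)
    # root mean squared error, in the same units as the predictions
    rmse = sqrt(sum((p - t) ** 2 for t, p in zip(true, pred)) / n)
    # mean relative absolute error: average of |pred - true| / true
    mrae = sum(abs(p - t) / t for t, p in zip(true, pred)) / n
    # Pearson correlation between true and predicted values
    mean_t, mean_p = sum(true) / n, sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true, pred))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    r = cov / sqrt(var_t * var_p)
    return {"rmse": rmse, "mrae": mrae, "r_squared": r ** 2}


# Toy values for illustration only; in the notebook, pass x[0] and x[1] instead
print(summarize_predictions([1.0, 2.0, 4.0], [1.1, 1.8, 4.4]))
```

In the notebook this would be called as `summarize_predictions(x[0], x[1])`, using the columns of the predictions table loaded above.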

## 5. Empirical application
Since we are satisfied with the performance of the model on the held-out test set, we can finally predict σ in our empirical data.

Before predicting with ``disperseNN2``, we need both the empirical .vcf and .locs files in the same place.

In [None]:
!ln -s $PWD/disperseNN2/Examples/VCFs/iraptus.vcf vignette/
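
A broken symlink or a missing .locs file is easy to overlook, so a quick check before predicting can save a failed run. This is a minimal sketch; `missing_empirical_files` is a hypothetical helper, and it assumes the two files share a prefix and differ only in their `.vcf` and `.locs` extensions.

```python
from pathlib import Path


def missing_empirical_files(prefix, extensions=(".vcf", ".locs")):
    """Return the expected files for this prefix that do not exist."""
    prefix = str(prefix)
    # Path.exists() follows symlinks, so a dangling link counts as missing
    return [prefix + ext for ext in extensions if not Path(prefix + ext).exists()]


missing = missing_empirical_files("vignette/iraptus")
if missing:
    print("Missing files:", missing)
else:
    print("Both empirical input files are in place.")
```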

And then we can run ``disperseNN2`` to predict σ in the empirical data. We will use the ``--predict`` mode together with the ``--empirical`` flag.

In [None]:
%%bash
disperseNN2 \
    --out vignette/output_dir \
    --seed 12345 \
    --predict \
    --empirical vignette/iraptus \
    --batch_size 10 \
    --pairs 1000 \
    --pairs_encode 100 \
    --num_reps 10

The final empirical results are stored in: ``vignette/output_dir/empirical_12345.txt``.


In [None]:
%%bash
cat vignette/output_dir/empirical_12345.txt

**Interpretation**.
The output, $\sigma$, is an estimate of the standard deviation of the Gaussian dispersal kernel from our training simulations; in addition, the same parameter was used for the mating distance (and competition distance). Therefore, to get the distance to a random parent, i.e., effective $\sigma$, we would apply a post hoc correction of $\sqrt{\frac{3}{2}} \times \sigma$ (see the original [disperseNN paper](https://doi.org/10.1093/genetics/iyad068) for details). In this example, we trained with only 100 spatial generations, hence the dispersal rate estimate reflects demography in the recent past.
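
The correction above is a single multiplication, sketched here for convenience. `effective_sigma` is a hypothetical helper, and the input value is a placeholder chosen for illustration, not a result from this notebook.

```python
from math import sqrt


def effective_sigma(sigma):
    """Apply the post hoc correction sqrt(3/2) * sigma described above."""
    return sqrt(3 / 2) * sigma


# Placeholder value; substitute an estimate from empirical_12345.txt
example_sigma = 1.0
print(effective_sigma(example_sigma))
```

The correction scales the estimate up by a factor of about 1.22.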




