In [None]:
import numpy as np
import scipy as sp
import scipy.sparse
import pandas as pd
import os

# conda install scvi-tools -c bioconda -c conda-forge

import anndata as ad

import scanpy as sc
import scvi

import matplotlib.pyplot as plt

import hummingbird as hbdx
from hummingbird.io import print, rule

%matplotlib inline
%load_ext autoreload
%autoreload 2

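# The default for n_top_genes is written out as 2**14 (the value given to n_features
# further down), because n_features is not yet defined when this cell runs.
# NOTE: feature_selection is currently unused.
def apply_preprocessing(adata, feature_selection=False, n_top_genes=2**14):
    adata = adata.copy()

    # Keep the raw values available in a separate layer
    adata.layers["counts"] = adata.raw.X

    # Keep only samples with more than 5 detected genes
    adata = adata[adata.obs.n_genes_by_counts > 5, :]

    scaler = hbdx.pipeline.StandardScaler()
    adata = scaler.fit_transform(adata)

    # Feature selection with the batch column passed as batch_key
    hvg_selector = hbdx.pipeline.HVGSelector(n_top_genes=n_top_genes, batch_key="batch")
    adata = hvg_selector.fit_transform(adata)

    return adata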

def filt(_ad, col, val):
    # Keep only the groups in `col` (e.g. batches) that have more than `val` samples
    _indx = _ad.obs.groupby(col).filter(lambda x: len(x) > val).index
    _ad = _ad[_indx, :]
    return _ad

## Generative models


We will only talk about GANs and VAEs.

## What is a distribution?

Before we talk about generative models we need to talk about data.

Specifically, we need to talk about distributions.

### Q: What is the distribution we are working with?

### A: Sequenced blood samples from all human beings

### Q: What are the variables/parameters for this distribution?

### A: Age, Gender, Cancer status, Time of day, Hospital, Nurse, Sequencer, Sample deterioration, etc.


![distributions.png](attachment:distributions.png)




Say that we wanted to model this distribution. That is not easy: the data distribution looks very complicated, and we have many sources of variance to account for. But we can try!

In generative models we write the data distribution as p(x). Sometimes you want to condition the distribution on a specific subset, written p(x|z), with z being something like gender.

Note! Generative models do not work with targets. They model the data itself, which is a fundamentally different problem from predicting a label.
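To make the difference between p(x) and p(x|z) concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the toy feature `x`, the binary covariate `z`, and the choice of a single Gaussian as the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: z is a binary covariate, x is one measured feature
z = rng.integers(0, 2, size=10_000)
x = rng.normal(loc=np.where(z == 0, -2.0, 2.0), scale=1.0)

def gaussian_fit(values):
    """Return the maximum-likelihood mean and standard deviation of a 1D sample."""
    return values.mean(), values.std()

# p(x): one model for all the data, ignoring z
mu_all, sd_all = gaussian_fit(x)

# p(x|z): one model per value of z
mu_z0, sd_z0 = gaussian_fit(x[z == 0])
mu_z1, sd_z1 = gaussian_fit(x[z == 1])

print(f"p(x):     mean={mu_all:.2f}, sd={sd_all:.2f}")
print(f"p(x|z=0): mean={mu_z0:.2f}, sd={sd_z0:.2f}")
print(f"p(x|z=1): mean={mu_z1:.2f}, sd={sd_z1:.2f}")
```

The unconditional fit has to be much wider because it absorbs the variance that z explains. Note that no target is involved anywhere: z is just another variable describing the sample.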



#### Say that we have some data...

![Screenshot%202021-05-20%20at%2016.16.26.png](attachment:Screenshot%202021-05-20%20at%2016.16.26.png)

## Why use these generative models?

- Uncertainty calculation

- Calculate the probability of data (Outlier Rejection)

- Simulated Data generation

- Feature learning (on the latent)

(Most of these apply only to VAEs.)
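As a rough illustration of the "probability of data" use case, the sketch below fits the simplest possible generative model, an independent Gaussian per feature, and flags samples with a very low log-probability. This is not a VAE or a GAN, and the data, the 1% threshold, and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: 1000 samples with 2 features
train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

# "Train" the simplest generative model: an independent Gaussian per feature
mean = train.mean(axis=0)
std = train.std(axis=0)

def log_prob(x):
    """Log-density of x under the fitted per-feature Gaussian model."""
    z = (x - mean) / std
    return np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2 * np.pi), axis=-1)

# Flag anything less likely than the 1% least likely training samples
threshold = np.quantile(log_prob(train), 0.01)

typical = np.array([0.1, -0.3])
outlier = np.array([6.0, 6.0])
is_outlier = log_prob(np.stack([typical, outlier])) < threshold
```

A classifier could not do this on its own, because it never learns how likely an input is, only which side of a boundary it falls on.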


### This is different from classification models

Classification models do not inherently care about how the data is distributed. They aim to find decision boundaries.

#### The importance of I.I.D

IID stands for independent and identically distributed. Our data is, for all intents and purposes, independent. However, it is not identically distributed.

First, we are actively sampling patients with cancer. Second, we are sampling from several different distributions, the most important of which is the 'batch'.

This influences what our distribution looks like! More on this later.
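One simple way to check whether two batches look identically distributed for a single feature is a two-sample Kolmogorov-Smirnov test. The sketch below uses simulated batches, so the shift of 0.8 and the sample sizes are assumptions, not properties of our data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical values of one feature in two batches; batch_b is shifted
batch_a = rng.normal(loc=0.0, scale=1.0, size=500)
batch_b = rng.normal(loc=0.8, scale=1.0, size=500)

# Control: a second draw from the same distribution as batch_a
same_batch = rng.normal(loc=0.0, scale=1.0, size=500)

shifted = ks_2samp(batch_a, batch_b)
control = ks_2samp(batch_a, same_batch)

print(f"shifted batches: statistic={shifted.statistic:.3f}, p={shifted.pvalue:.2e}")
print(f"same source:     statistic={control.statistic:.3f}, p={control.pvalue:.2e}")
```

A small p-value for the shifted pair says the two batches are unlikely to come from the same distribution, which is exactly the "not identically distributed" situation described above.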


### Demo GAN

Today we will only look at GANs and VAEs. They are the most influential generative models, and they are very different from each other.

![GAN1.png](attachment:GAN1.png)

![GAN2.png](attachment:GAN2.png)

#### Generated images

![32000.png](attachment:32000.png)

#### Interpolation (walking) between 2 generated images

![interpolation_25000.png](attachment:interpolation_25000.png)
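Walking between two generated images comes down to interpolating between the two latent vectors that produced them and decoding every intermediate point. A minimal sketch of the interpolation step, assuming plain linear interpolation and a hypothetical latent size of 8 (the trained generator itself is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim = 8  # hypothetical latent size
z_start = rng.normal(size=latent_dim)
z_end = rng.normal(size=latent_dim)

def interpolate(z0, z1, n_steps):
    """Return n_steps latent vectors evenly spaced on the line from z0 to z1."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * z0 + t * z1

path = interpolate(z_start, z_end, n_steps=10)
# Each row of `path` would be passed through the trained generator to get one image
```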

### Demo VAE

The big takeaway is to understand the difference between variational autoencoders and autoencoders.

An autoencoder is a compression tool. It encodes (compresses) the data to a lower dimension and then reconstructs the same data from it. The process is deterministic. A variational autoencoder introduces smoothness through its generative properties: its low-dimensional representation is a latent distribution, which is modelled on the data distribution.



![VAE1.png](attachment:VAE1.png)

The variational aspect comes from the latent being a distribution. The encoder gives the parameters of this distribution (mean and variance), and the latent code is sampled from it.
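Sampling from the latent distribution is commonly written as z = mu + sigma * eps with eps drawn from a unit normal (the so-called reparameterization trick). The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension, which is a common convention but not something these notes specify; the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, exp(log_var)) as z = mu + sigma * eps with eps ~ N(0, 1)."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Hypothetical encoder output for one sample with a 2-dimensional latent
mu = np.array([1.0, -1.0])
log_var = np.log(np.array([0.25, 4.0]))

# Encoding the same input many times gives a cloud of latent codes, not one point
samples = np.stack([sample_latent(mu, log_var, rng) for _ in range(20_000)])
```

A plain autoencoder would return the same code every time for the same input. Here the codes scatter around `mu` with the variance the encoder asked for.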

#### Loss

The loss has two terms: the reconstruction loss and the KL divergence. Together they form the negative ELBO (evidence lower bound).

The reconstruction loss compares the output of the decoder with the input of the encoder.

The KL term is intuitively the 'distance' between the latent distribution that the encoder produces for the data and the prior (a unit normal).

The KL term regularizes the latent to be smooth and have variational properties. The reconstruction loss makes sure the latents contain useful information.
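When the encoder's distribution is a diagonal Gaussian and the prior is a unit normal, the regularizing 'distance' to the prior has a closed form (the KL divergence between the two Gaussians). The sketch below combines it with a squared-error reconstruction loss. The squared error, the log-variance parameterization, and the `beta` weight (with `beta=1` giving the standard VAE and larger values the beta-VAE mentioned further down) are assumptions for illustration.

```python
import numpy as np

def kl_to_unit_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def vae_loss(x, x_reconstructed, mu, log_var, beta=1.0):
    """Squared-error reconstruction loss plus beta-weighted KL, averaged over the batch."""
    reconstruction = np.sum((x - x_reconstructed) ** 2, axis=-1)
    return np.mean(reconstruction + beta * kl_to_unit_normal(mu, log_var))

# Hypothetical batch of 4 samples, 5 features, 2 latent dimensions
x = np.ones((4, 5))
zeros = np.zeros((4, 2))

# Perfect reconstruction and a latent that matches the prior: no loss at all
perfect = vae_loss(x, x, mu=zeros, log_var=zeros)

# Same reconstruction, but the latent mean has drifted away from the prior
shifted = vae_loss(x, x, mu=np.full((4, 2), 2.0), log_var=zeros)
```

The second case shows the trade-off: even with a perfect reconstruction, the loss is penalized when the latent distribution moves away from the unit normal.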

##### Properties of the latent

If you plot the latent space, you will notice that it has some interesting properties!

![VAE_manifold_epoch_39.png](attachment:VAE_manifold_epoch_39.png)

![BETA_VAE_manifold_epoch_39.png](attachment:BETA_VAE_manifold_epoch_39.png)

This is arguably the most valuable property of VAEs (and maybe of generative models).

A separation of the data has been achieved completely unsupervised! In other words, disentanglement. Furthermore, with beta-VAE this disentanglement is even more pronounced.

![VAE5.png](attachment:VAE5.png)

### scvi-tools

Single-cell variational inference.


These tools were built for single-cell research, BUT they are fully integrated with anndata, so they are actually quite useful for us.

The goal is to remove unwanted variance (batch effect) while retaining relevant variance (cell type).

Let's take a look at how their models operate!

In [2]:
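adata_path = '~/data/LC__ngs__rpm_log-21.5.0.h5ad'

n_features = 2**14

# Assumption: batch_col was never defined in this notebook; the multiplexing pool
# is used here because it is the column the small-batch filter below works on.
batch_col = "Lab_Multiplexing_pool_ID"

sc.set_figure_params(figsize=(10,10))

adata = hbdx.io.load(adata_path)
adata.obs["batch"] = adata.obs[batch_col]

# Drop pools with 10 or fewer samples
adata = filt(adata, "Lab_Multiplexing_pool_ID", 10)

# Samples per batch, split by sequencer and RNA extraction protocol
counts = adata.obs.groupby(["Sequencer", "Lab_RNA_extr_protocol"]).batch.value_counts()
counts.unstack([0, 1]).plot.barh()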

![SCVI1.png](attachment:SCVI1.png)


## SCGen

![Screenshot%202021-05-23%20at%2015.58.07.png](attachment:Screenshot%202021-05-23%20at%2015.58.07.png)