# This Jupyter notebook contains an analysis example using [EmptyNN](https://github.com/lkmklsmn/empty_nn). The code reproduces the preprocessing analysis of the single-cell RNA sequencing data published by [Stoeckius et al.](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1603-1), which we refer to as the "cell hashing dataset" in our manuscript.
### The following code runs in an R environment. There is no requirement on the R version. The required libraries are EmptyNN, Seurat (for downstream analysis), and pheatmap and ggplot2 (for visualization).

## Load R libraries

In [5]:
suppressMessages({library("EmptyNN")
                  library("Seurat")
                  library("pheatmap")
                  library("ggplot2")})

## Load cell hashing data from Stoeckius et al.

### Option 1: Run "sh ./code/download_data.sh" in a terminal to download the example datasets. <br> Option 2: Download the raw **cell hashing** datasets from the [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1603-1) or from [GSE108313](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108313). <br> Option 3: Download the RData file used in our analysis from [Google Drive](https://drive.google.com/file/d/12y0fW_Y9OdhBLns_2gpjo2Xq25c4qnGY/view?usp=sharing). It contains both the count matrix and the label for each barcode.

In [7]:
load("../data/cell_hashing_raw.RData")
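
Because `load()` restores objects under whatever names they were saved with, it can help to check what a file contains before relying on a name such as `counts`. The sketch below uses a throwaway file with made-up objects (`toy_counts`, `toy_labels`), not the cell hashing data; the same pattern could be applied to `../data/cell_hashing_raw.RData`.

```r
# Build a small stand-in .RData file (hypothetical objects, for illustration only)
toy_counts <- matrix(0:5, nrow = 2)
toy_labels <- c("a", "b", "c")
tmp_file <- tempfile(fileext = ".RData")
save(toy_counts, toy_labels, file = tmp_file)

# Load into a separate environment so nothing in the workspace is overwritten
inspect_env <- new.env()
loaded_names <- load(tmp_file, envir = inspect_env)

# load() returns the names of the restored objects
print(loaded_names)

# Check the dimensions of the count matrix before using it
print(dim(get("toy_counts", envir = inspect_env)))
```

Loading into a separate environment is a design choice here: it avoids silently replacing an existing `counts` object in the global workspace.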

## We apply EmptyNN for preprocessing.
### The input is a raw count matrix, in either h5 or mtx format, with barcodes as rows and genes as columns. <br> The output is a boolean vector indicating whether each droplet is cell-free (an empty droplet) or cell-containing.
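
The orientation requirement is easy to get wrong, since many single-cell tools store genes as rows. A minimal sketch with a made-up matrix (`toy_counts`, with hypothetical gene and barcode names) shows the transposition and a quick orientation check before the matrix is passed on:

```r
# Toy genes-by-barcodes matrix (made-up values, not the cell hashing data)
toy_counts <- matrix(
  c(0, 3, 1,
    5, 0, 2,
    1, 1, 0,
    0, 4, 7),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("barcode", 1:4))
)

# Transpose so that rows are barcodes and columns are genes
toy_input <- t(toy_counts)

# Sanity checks on the orientation
stopifnot(nrow(toy_input) == ncol(toy_counts))
stopifnot(identical(rownames(toy_input), colnames(toy_counts)))
print(dim(toy_input))
```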

### Runtime depends on the dataset and on the parameter settings: a higher `k` (number of folds) or `iteration` means a longer runtime. A run with the default parameters takes ~30 mins. <br> For demonstration purposes, we set the parameters to low values, which takes ~5 mins.
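
As a rough guide to how these two parameters drive runtime, the log printed by the run in this notebook reports one training pass per fold within each iteration. Assuming that pattern holds in general (an inference from the log, not a documented guarantee), the number of training passes is the product of the two settings. The helper `n_training_passes` is hypothetical and only does this arithmetic; the larger setting is purely illustrative.

```r
# Hypothetical helper: number of fold-level training passes,
# assuming one pass per fold in every iteration
n_training_passes <- function(k, iteration) {
  stopifnot(k >= 1, iteration >= 1)
  k * iteration
}

# Demonstration setting used in this notebook
demo_passes <- n_training_passes(k = 2, iteration = 1)
print(demo_passes)

# A larger, purely illustrative setting
larger_passes <- n_training_passes(k = 10, iteration = 10)

# Ratio of training passes between the two settings
print(larger_passes / demo_passes)
```

The ratio counts training passes only; actual wall-clock time also depends on dataset size and hardware.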

In [9]:
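# Transpose so rows are barcodes and columns are genes; low k and iteration keep the demo run short
nn.res <- emptynn(t(counts), threshold = 50, k = 2, iteration = 1)
# Boolean vector separating cell-containing from cell-free droplets
nn.keep <- nn.res$nn.keep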

[1] "there are 11865 in P set"
[1] "there are 27977 in U set"
[1] "Samples in U set were split into 2 folds"
[1] "data normalization"
[1] "start training"
[1] "iteration 1"
[1] "training fold 1"
[1] "training fold 2"
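
With the boolean vector in hand, the retained barcodes can be tabulated and the count matrix subset accordingly. The sketch below uses made-up stand-ins (`toy_counts`, `toy_keep`) so that it runs on its own. It assumes that `TRUE` marks a cell-containing droplet and that the vector follows the barcode order of the input; both assumptions should be checked against the EmptyNN documentation before applying the same pattern to `counts` and `nn.keep`.

```r
# Toy genes-by-barcodes matrix and a stand-in for nn.keep (made-up values)
toy_counts <- matrix(
  1:12, nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("barcode", 1:4))
)
toy_keep <- c(TRUE, FALSE, TRUE, TRUE)

# The vector must have one entry per barcode
stopifnot(length(toy_keep) == ncol(toy_counts))

# How many droplets are kept versus removed
print(table(toy_keep))

# Keep only the barcodes flagged TRUE; drop = FALSE preserves the matrix shape
toy_retained <- toy_counts[, toy_keep, drop = FALSE]
print(colnames(toy_retained))
```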
