# Data pre-processing
In this tutorial, we impute a melanoma single-cell dataset that is freely
available from GEO under accession
[GSE99330](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99330).
 
## 1. Download example dataset

In the [RNA-seq dataset](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE99330&format=file&file=GSE99330%5FdropseqHumanDge%2Etxt%2Egz),
8,640 single cells were sequenced on the Illumina NextSeq 500 platform.

To assess imputation performance, we use an independent reference [smFISH dataset](https://www.dropbox.com/s/ia9x0iom6dwueix/fishSubset.txt?dl=0)
that measures 26 genes in thousands of melanoma cells.

In [None]:
import numpy as np
import pandas as pd
import h5py

Load the downloaded melanoma RNA-seq data, which will be imputed.

In [None]:
melanoma_rnaseq_path = "E:/DISC/reproducibility/data/MELANOMA/original_data/GSE99330_dropseqHumanDge.txt.gz"
melanoma_rnaseq_pd = pd.read_csv(melanoma_rnaseq_path, sep=" ", compression='gzip', index_col=0, skiprows=1)

In [None]:
melanoma_rnaseq_pd
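
The `read_csv` arguments above can be tried on a small stand-in file first. The sketch below writes a toy gzip-compressed, space-delimited table and reads it back with the same options. The toy file's layout (one leading line to skip, then a header row) and all names in it are assumptions for illustration, not a description of the actual GEO file.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed, space-delimited table (illustrative content only).
toy_path = os.path.join(tempfile.mkdtemp(), "toy_dge.txt.gz")
with gzip.open(toy_path, "wt") as handle:
    handle.write("# leading line that skiprows=1 discards\n")
    handle.write("GENE cell_1 cell_2\n")
    handle.write("GENE_A 0 3\n")
    handle.write("GENE_B 5 1\n")

# Same options as the cell above: skip the first line, use the first column as index.
toy_dge = pd.read_csv(toy_path, sep=" ", compression="gzip", index_col=0, skiprows=1)
```

Here `toy_dge` is a genes x cells table whose row labels come from the first column.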

Load the downloaded melanoma FISH data, which is used for validation.

In [None]:
melanoma_fish_path = "E:/DISC/reproducibility/data/MELANOMA/original_data/fishSubset.txt"
melanoma_fish_pd = pd.read_csv(melanoma_fish_path, sep=" ", index_col=0).T

In [None]:
melanoma_fish_pd

We fill missing values with -1 because [loom](http://loompy.org/) does not support `np.nan`.

In [None]:
melanoma_fish_pd = melanoma_fish_pd.fillna(-1)
melanoma_fish_pd
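
Because -1 is only a placeholder, any later evaluation should treat it as missing rather than as a measured value. The following is a minimal sketch on a toy table, assuming real FISH counts are never negative. The gene and cell names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy FISH-like table (genes x cells); names and values are illustrative only.
toy_fish = pd.DataFrame(
    [[3.0, np.nan, 7.0], [np.nan, 2.0, 0.0]],
    index=["GENE_A", "GENE_B"],
    columns=["cell_1", "cell_2", "cell_3"],
)

# Replace missing measurements with the -1 sentinel before saving.
toy_filled = toy_fish.fillna(-1)

# When evaluating, mask the sentinel again so it is not read as an expression value.
toy_restored = toy_filled.where(toy_filled != -1)

# Per-gene means computed only over the cells that were actually measured.
gene_means = toy_restored.mean(axis=1)
```

Masking before computing statistics keeps the sentinel from pulling means or correlations toward negative values.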

## 2. Format transformation and cell filtering

`DISC` uses [loom](http://loompy.org/) as its I/O format so we save these data as loom-formatted files.

In [None]:
with h5py.File("E:/DISC/reproducibility/data/MELANOMA/original.loom", "w") as out_f:
    out_f.create_group("row_graphs")
    out_f.create_group("col_graphs")
    out_f.create_group("layers")
    # np.bytes_ replaces np.string_, which was removed in NumPy 2.0.
    out_f["row_attrs/Gene"] = melanoma_rnaseq_pd.index.values.astype(np.bytes_)
    out_f["col_attrs/CellID"] = melanoma_rnaseq_pd.columns.values.astype(np.bytes_)
    out_f.create_dataset("matrix", shape=melanoma_rnaseq_pd.shape,
                             chunks=(melanoma_rnaseq_pd.shape[0], 1), dtype=np.float32, fletcher32=False,
                             compression="gzip", shuffle=False, compression_opts=2)
    out_f["matrix"][...] = melanoma_rnaseq_pd.values

In [None]:
with h5py.File("E:/DISC/reproducibility/data/MELANOMA/fish.loom", "w") as out_f:
    out_f.create_group("row_graphs")
    out_f.create_group("col_graphs")
    out_f.create_group("layers")
    # np.bytes_ replaces np.string_, which was removed in NumPy 2.0.
    out_f["row_attrs/Gene"] = melanoma_fish_pd.index.values.astype(np.bytes_)
    out_f["col_attrs/CellID"] = melanoma_fish_pd.columns.values.astype(np.bytes_)
    out_f.create_dataset("matrix", shape=melanoma_fish_pd.shape,
                             chunks=(melanoma_fish_pd.shape[0], 1), dtype=np.float32, fletcher32=False,
                             compression="gzip", shuffle=False, compression_opts=2)
    out_f["matrix"][...] = melanoma_fish_pd.values
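
The two cells above repeat the same steps, so they could be factored into one helper. The sketch below uses a hypothetical function, `write_minimal_loom`, that writes the same groups and datasets as the cells above and then reads a toy file back to check the round trip. It mirrors only the layout used in this tutorial and is not claimed to be a complete implementation of the loom specification.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def write_minimal_loom(path, frame):
    """Write a genes x cells DataFrame with the layout used in this tutorial."""
    with h5py.File(path, "w") as out_f:
        out_f.create_group("row_graphs")
        out_f.create_group("col_graphs")
        out_f.create_group("layers")
        # Store labels as byte strings so HDF5 keeps them as fixed-length strings.
        out_f["row_attrs/Gene"] = np.array([str(g).encode("utf-8") for g in frame.index])
        out_f["col_attrs/CellID"] = np.array([str(c).encode("utf-8") for c in frame.columns])
        # One chunk per cell (column), gzip-compressed, as in the cells above.
        out_f.create_dataset("matrix", data=frame.values.astype(np.float32),
                             chunks=(frame.shape[0], 1),
                             compression="gzip", compression_opts=2)


# Toy genes x cells table; names and values are illustrative only.
toy = pd.DataFrame(np.arange(6).reshape(2, 3),
                   index=["GENE_A", "GENE_B"], columns=["c1", "c2", "c3"])
toy_path = os.path.join(tempfile.mkdtemp(), "toy.loom")
write_minimal_loom(toy_path, toy)

# Read the file back to confirm the matrix and labels survived the round trip.
with h5py.File(toy_path, "r") as f:
    read_matrix = f["matrix"][...]
    read_genes = [g.decode("utf-8") for g in f["row_attrs/Gene"][...]]
    read_cells = [c.decode("utf-8") for c in f["col_attrs/CellID"][...]]
```

With such a helper, the RNA-seq and FISH tables would each need a single call with their own output path.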

Following [SAVER](https://www.nature.com/articles/s41592-018-0033-z), we remove cells with a library size below 500 or above 20,000 from the RNA-seq data.

In [None]:
with h5py.File("E:/DISC/reproducibility/data/MELANOMA/raw.loom", "w") as out_f:
    with h5py.File("E:/DISC/reproducibility/data/MELANOMA/original.loom", "r", libver='latest', swmr=True) as f:
        gene_bc_mat = f["matrix"][...]
        gene_name = f["row_attrs/Gene"][...]
        cell_id = f["col_attrs/CellID"][...]
    out_f.create_group("row_graphs")
    out_f.create_group("col_graphs")
    out_f.create_group("layers")
    out_f["row_attrs/Gene"] = gene_name
    # Library size is the total count per cell (sum over genes, axis 0).
    library_size = gene_bc_mat.sum(0)
    cell_filter = np.bitwise_and(library_size >= 500, library_size <= 20000)
    out_f["col_attrs/CellID"] = cell_id[cell_filter]
    gene_bc_filt = gene_bc_mat[:, cell_filter]
    out_f.create_dataset("matrix", shape=gene_bc_filt.shape,
                             chunks=(gene_bc_filt.shape[0], 1), dtype=np.float32, fletcher32=False,
                             compression="gzip", shuffle=False, compression_opts=2)
    out_f["matrix"][...] = gene_bc_filt
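
The filtering rule can be checked on a small matrix before applying it to the full dataset. The following is a minimal sketch with made-up counts. It uses the same inclusive thresholds as the cell above (500 and 20,000).

```python
import numpy as np

# Toy genes x cells count matrix; the values are illustrative only.
toy_counts = np.array([
    [100, 400, 15000, 300],
    [200, 100, 6000, 200],
], dtype=np.float32)

# Library size is the total count per cell (sum over genes, axis 0).
library_size = toy_counts.sum(axis=0)

# Keep cells whose library size lies in the inclusive range [500, 20000].
keep = (library_size >= 500) & (library_size <= 20000)
filtered = toy_counts[:, keep]
```

In this toy matrix the first cell is dropped for being too small and the third for being too large, while the two cells with a library size of exactly 500 are kept because the bounds are inclusive.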

We will use `raw.loom` (RNA-seq) for imputation and `fish.loom` (FISH) for evaluation.

Reference: 

1. Huang, M. et al. Nature Methods 15, 539–542 (2018).