# Combining scNT-seq data

Thanks to the authors of the scNT-seq paper ([X. Qiu, P. Hu, et al., 2020](https://www.nature.com/articles/s41592-020-0935-4)) for making the analysis scripts and processed data publicly available. This saves a lot of time when reusing this data set and software to repeat the analysis and validate the methodology.

All data sets can be found via the author-provided repo: https://github.com/wulabupenn/scNT-seq

**Data links**:
* Processed data (neuron_splicing_4_11.h5ad, Neu_one_shot.h5ad, 0408_grp_info.txt): download links are given in the authors' notebook, https://github.com/wulabupenn/scNT-seq/blob/master/notebook_for_figures/neuron_revision_figures_n_s_velocity.ipynb
* UMAP coordinates and selected cells:
  [Fig3b.rds](https://drive.google.com/drive/folders/1CTdrLUpzye_nlZXWJH9ggS7BRzM-VSqQ?usp=sharing)
  on Google Drive.

**Dumping the data with R**

```
dat = readRDS('Fig3b.rds')
write.table(dat, 'Fig3b.tsv', sep='\t', quote=FALSE, row.names=FALSE)
```

## Annotated data - scNT

In [1]:
import numpy as np
import scanpy as sc

In [2]:
dat_dir = '/storage/yhhuang/research/rnaVelo/scNT/'

In [3]:
adata = sc.read(dat_dir + '/neuron_splicing_4_11.h5ad')

In [4]:
np.sum((adata.X > 0).sum(1) > 2000)

3132
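The count above is the number of cells with more than 2000 detected genes (genes with at least one count). As a minimal sketch of the same expression on a toy matrix, with hypothetical values and a hypothetical threshold of 2 instead of 2000:

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns); values are hypothetical.
X = np.array([
    [0, 3, 0, 1, 2],
    [5, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 7],
])

min_genes = 2  # hypothetical threshold; the notebook cell uses 2000

# Number of genes with at least one count in each cell.
# np.asarray(...).ravel() also flattens the (n, 1) matrix that a
# scipy sparse X would return from .sum(axis=1).
genes_per_cell = np.asarray((X > 0).sum(axis=1)).ravel()

# Number of cells whose detected-gene count exceeds the threshold.
n_pass = int(np.sum(genes_per_cell > min_genes))
print(genes_per_cell, n_pass)
```

Note that this count is not necessarily the set of cells kept later; the cells retained below are the ones listed in `Fig3b.tsv`.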

### Get total RNAs, add UMAP, and keep QC-passed cells

In [5]:
obs_dat = np.genfromtxt(dat_dir + '/Fig3b.tsv', delimiter='\t', dtype='str')
obs_dat.shape

(3067, 7)

In [6]:
obs_dat[:3, :]

array([['cell', 'umap_0', 'umap_1', 'time', 'cluster', 'early', 'late'],
       ['Neu-4sU-only-run1n2_CAACATGACCGC', '2.9563963', '-4.174852',
        '0', 'Ex', '0.132488875641741', '0.403597919486812'],
       ['Neu-4sU-only-run1n2_TGGACGCTGCAA', '2.2422173', '-3.8909454',
        '0', 'Ex', '0.202600940514841', '0.430327636205354']],
      dtype='<U38')
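Because `np.genfromtxt` with `dtype='str'` returns everything as strings (header row included), each column has to be sliced and cast by hand later. As an alternative sketch, not what this notebook does, the same table could be read with pandas, which parses the header and infers column types. The two rows below are copied from the output above; in practice the file path would be passed instead of the `StringIO` buffer.

```python
import io
import pandas as pd

# Two rows copied from the Fig3b.tsv preview, used here as stand-in input.
tsv_text = (
    "cell\tumap_0\tumap_1\ttime\tcluster\tearly\tlate\n"
    "Neu-4sU-only-run1n2_CAACATGACCGC\t2.9563963\t-4.174852\t0\tEx\t"
    "0.132488875641741\t0.403597919486812\n"
    "Neu-4sU-only-run1n2_TGGACGCTGCAA\t2.2422173\t-3.8909454\t0\tEx\t"
    "0.202600940514841\t0.430327636205354\n"
)

# The header row becomes the column names; the cell barcode becomes the index.
obs_df = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="cell")

# UMAP coordinates as a float32 array, ready for an AnnData .obsm slot.
umap = obs_df[["umap_0", "umap_1"]].to_numpy(dtype="float32")
print(obs_df.dtypes)
```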

#### match

In [7]:
import hilearn
mm = hilearn.match(obs_dat[1:, 0], adata.obs_names)

In [8]:
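# Subset and reorder cells to follow Fig3b.tsv; .copy() turns the view into
# an independent object so that the fields added below are set on it directly.
adata_lite = adata[mm, :].copy()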
np.mean(adata_lite.obs.index == obs_dat[1:, 0])

1.0
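The check above confirms that `mm` reorders the cells of `adata` to follow the order in `Fig3b.tsv`. For readers without `hilearn` installed, the same kind of index lookup can be sketched with pandas, assuming all that is needed is the position of each query ID in a target list. The cell IDs below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs: the query order (e.g. from the TSV) and the target order
# (e.g. adata.obs_names).
query_ids = np.array(["cellC", "cellA", "cellD"])
target_ids = pd.Index(["cellA", "cellB", "cellC", "cellD"])

# For each query ID, the position of that ID in target_ids (-1 if absent).
idx = target_ids.get_indexer(query_ids)

# Fail loudly instead of silently indexing with -1.
if (idx < 0).any():
    raise ValueError("some query IDs are missing from the target")

# Indexing the target with idx reproduces the query order.
reordered = target_ids[idx]
print(idx, list(reordered))
```

The explicit check for `-1` matters because a negative index would otherwise select the last cell without raising an error.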

In [9]:
adata_lite.obsm['X_umap'] = obs_dat[1:, 1:3].astype(np.float32)
adata_lite.obs['time'] = obs_dat[1:, 3].astype(np.int32)
adata_lite.obs['early'] = obs_dat[1:, 5].astype(np.float32)
adata_lite.obs['late'] = obs_dat[1:, 6].astype(np.float32)

In [10]:
adata_lite

AnnData object with n_obs × n_vars = 3066 × 44021
    obs: 'cellname', 'time', 'early', 'late'
    var: 'gene_short_name'
    obsm: 'X_umap'
    layers: 'spliced', 'unspliced'

#### save

In [11]:
adata_lite.write(dat_dir + "/neuron_splicing_totalRNA.h5ad")

In [12]:
df = adata_lite.obs['time']
df.to_csv(dat_dir + '/neuron_splicing_time.tsv', sep='\t', index_label='cellID')

### Replace unspliced with nascent RNAs

In [13]:
# Metabolic-labelling data; provides the 'new' (nascent RNA) layer used below
Neu = sc.read(dat_dir + '/Neu_one_shot.h5ad')

In [14]:
intersect_gene = list(set(adata_lite.var_names).intersection(Neu.var_names))
# intersect_cells = list(set(adata_lite.obs_names).intersection(Neu.obs_names))
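One caveat with `list(set(...))` is that the iteration order of a set of strings is not fixed between Python sessions, so the gene order in the saved file can differ from run to run. A minimal sketch of an order-preserving alternative, using hypothetical gene lists in place of the two `var_names`:

```python
import pandas as pd

# Hypothetical gene lists standing in for adata_lite.var_names and Neu.var_names.
genes_a = pd.Index(["Fos", "Actb", "Npas4", "Gapdh"])
genes_b = pd.Index(["Gapdh", "Fos", "Egr1", "Npas4"])

# Keep the genes of the first list that also occur in the second,
# in the order of the first list.
shared = genes_a[genes_a.isin(genes_b)]
print(list(shared))
```

With this form the combined object keeps the gene order of the first data set, which makes the output reproducible across runs.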

In [15]:
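# .copy() makes adata_comb independent of adata_lite before X and the
# unspliced layer are overwritten below.
adata_comb = adata_lite[:, intersect_gene].copy()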
Neu_matched = Neu[adata_lite.obs_names, :][:, intersect_gene]

In [16]:
np.mean(Neu_matched.obs.index == adata_comb.obs.index)

1.0

In [17]:
adata_comb.X = Neu_matched.X.copy()
adata_comb.layers['unspliced'] = Neu_matched.layers['new'].copy()


In [18]:
adata_comb.write(dat_dir + "/neuron_splicing_nascent.h5ad")