# Liver data preliminary analysis and whitelisting

We would like to pseudoalign and analyze the Andrews human liver data, specifically the data for donor 4 (C72). To avoid potential sources of model misspecification, we need to fit individual cell types separately. More generally, we only want to use cells that already have annotations. We could pseudoalign everything, but it is considerably easier to take the existing annotations for C72 and use the barcodes they contain as a whitelist for pseudoalignment. Here, we do not apply knee plot filtering.

This notebook constructs these whitelists and inspects the cell type abundances.

In [19]:
import numpy as np

## Metadata extraction

In [20]:
meta_str = '/home/ggorin/datasets/liver_andrews/GSE185477_Final_Metadata.txt'

In [21]:
import pandas as pd

In [22]:
meta = pd.read_csv(meta_str,sep='\t')

Each of the two C72 samples contains around 10k annotated cells.

In [24]:
len(meta[meta['sample']=='C72_RESEQ']['cell_barcode'])

11219

In [25]:
len(meta[meta['sample']=='C72_TST']['cell_barcode'])

9054

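Export the whitelists: the `C72_RESEQ` barcodes go to the `sc` directory and the `C72_TST` barcodes go to the `sn` directory.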

In [26]:
# Write one barcode per line; '%16s' pads each entry to a minimum width of 16 characters.
for sample, folder in [('C72_RESEQ', 'sc'), ('C72_TST', 'sn')]:
    barcodes = meta.loc[meta['sample'] == sample, 'cell_barcode']
    np.savetxt(f'/home/ggorin/datasets/liver_andrews/{folder}/whitelist.txt',
               np.asarray(barcodes, dtype=str), fmt='%16s')
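Before handing a whitelist to the pseudoaligner, it can be worth checking that the barcodes are well formed and unique. The sketch below is not part of the original workflow. It assumes the barcodes are plain 16-nucleotide strings over `ACGT` (suggested by the `'%16s'` format, but not verified here), and `check_whitelist` and `toy_meta` are hypothetical names with made-up barcodes. If the real `cell_barcode` entries carry a prefix or suffix, the pattern would need adjusting.

```python
import re
import pandas as pd

# Toy stand-in for the metadata table; these barcodes are made up.
toy_meta = pd.DataFrame({
    'sample': ['C72_RESEQ', 'C72_RESEQ', 'C72_TST', 'C72_TST'],
    'cell_barcode': ['AAACCTGAGAAACCAT', 'AAACCTGAGAAACCGC',
                     'AAACCTGAGAAACCAT', 'AAACCTGTCGGATGGA'],
})

def check_whitelist(barcodes, length=16):
    """Return the barcodes as a list after checking format and uniqueness.

    Assumes each barcode is exactly `length` characters drawn from ACGT.
    """
    barcodes = list(barcodes)
    pattern = re.compile(r'[ACGT]{%d}' % length)
    bad = [bc for bc in barcodes if not pattern.fullmatch(bc)]
    if bad:
        raise ValueError(f'{len(bad)} malformed barcodes, e.g. {bad[0]!r}')
    if len(set(barcodes)) != len(barcodes):
        raise ValueError('duplicate barcodes in whitelist')
    return barcodes

# Check the barcodes of one sample at a time, mirroring the per-sample export.
sn_barcodes = check_whitelist(
    toy_meta.loc[toy_meta['sample'] == 'C72_TST', 'cell_barcode'])
```

Checking one sample at a time matters because the same barcode sequence can legitimately appear in different samples, as in the toy table.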


Which cell types are most promising? Pericentral, periportal, and interzonal hepatocytes (`CentralHep`, `PortalHep`, and `InterHep` in the annotations).

In [29]:
meta[meta['sample']=='C72_TST']['Manual_Annotation'].value_counts()

CentralHep       2888
PortalHep        2862
InterHep         1888
cvLSECs           359
Stellate          331
Cholangiocyte     183
NonInfMac         175
PortalEndo        146
Bcells            107
InfMac             63
NKTcell            49
Erythroid           3
Name: Manual_Annotation, dtype: int64

In [30]:
meta[meta['sample']=='C72_RESEQ']['Manual_Annotation'].value_counts()

PortalHep        4523
CentralHep       1779
cvLSECs          1050
InterHep         1046
NonInfMac         689
Stellate          600
NKTcell           472
Cholangiocyte     343
InfMac            322
PortalEndo        226
Bcells            167
Erythroid           2
Name: Manual_Annotation, dtype: int64
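Raw counts are hard to compare across samples of different sizes, so a quick way to inspect abundances is to convert them to per-sample fractions. The sketch below re-enters the counts printed above by hand, so it runs without the metadata file. The names `counts`, `fractions`, and `hep_fraction` are introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
counts = pd.DataFrame({
    'C72_TST': {
        'CentralHep': 2888, 'PortalHep': 2862, 'InterHep': 1888,
        'cvLSECs': 359, 'Stellate': 331, 'Cholangiocyte': 183,
        'NonInfMac': 175, 'PortalEndo': 146, 'Bcells': 107,
        'InfMac': 63, 'NKTcell': 49, 'Erythroid': 3,
    },
    'C72_RESEQ': {
        'CentralHep': 1779, 'PortalHep': 4523, 'InterHep': 1046,
        'cvLSECs': 1050, 'Stellate': 600, 'Cholangiocyte': 343,
        'NonInfMac': 689, 'PortalEndo': 226, 'Bcells': 167,
        'InfMac': 322, 'NKTcell': 472, 'Erythroid': 2,
    },
})

# Divide each column by its total to get per-sample cell type fractions.
fractions = counts / counts.sum()

# Combined fraction of the three hepatocyte annotations in each sample.
hep_types = ['CentralHep', 'PortalHep', 'InterHep']
hep_fraction = fractions.loc[hep_types].sum()

print(fractions.round(3))
print(hep_fraction.round(3))
```

With the full table loaded, `pd.crosstab(meta['Manual_Annotation'], meta['sample'], normalize='columns')` should give the same fractions for all samples at once.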