### Notebook for the evaluation of `scNym` label transfer for HHH vs DCM immune populations

- **Developed by**: Carlos Talavera-López, Ph.D.
- **Institute of Computational Biology - Computational Health Centre - Helmholtz Munich**
- v220414 

### Import required modules

In [1]:
import anndata
import scipy as sp
import numpy as np
import pandas as pd
import scanpy as sc

### Set up working environment

In [2]:
sc.settings.verbosity = 3
sc.logging.print_versions()
sc.settings.set_figure_params(dpi = 140, color_map = 'magma_r', dpi_save = 300, vector_friendly = True, format = 'svg')

-----
anndata     0.8.0
scanpy      1.9.1
-----
PIL                         8.4.0
anyio                       NA
appnope                     0.1.2
attr                        21.2.0
babel                       2.9.1
backcall                    0.2.0
beta_ufunc                  NA
binom_ufunc                 NA
bottleneck                  1.3.2
brotli                      NA
certifi                     2021.10.08
cffi                        1.14.6
chardet                     4.0.0
charset_normalizer          2.0.4
cloudpickle                 2.0.0
colorama                    0.4.4
cycler                      0.10.0
cython_runtime              NA
cytoolz                     0.11.0
dask                        2021.10.0
dateutil                    2.8.2
debugpy                     1.4.1
decorator                   5.1.0
defusedxml                  0.7.1
entrypoints                 0.3
fsspec                      2021.08.1
google                      NA
h5py                        3.6.0
hyp

### Read in `scNym`-annotated object

In [3]:
heart_immune = sc.read_h5ad('/Volumes/Bf110/ct5/raw_data/heart/analysis/subpopulations/3-immune/immune_HHH_vs_DCM_scNym.v1.h5ad')
heart_immune

AnnData object with n_obs × n_vars = 142914 × 15172
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', 'Cells_Nuclei', 'combined', 'label', 'study', 'Type', 'Individual', 'state', 'leiden', '_scvi_labels', '_scvi_batch', 'leiden_annotated', 'set', 'scNym', 'scNym_confidence', 'Patient', 'Sample', 'Gender', 'Gene', 'Diagnosis', 'Clinical.dominant.mutation', 'Age', 'Mutation.Type', 'Genomic.location', 'Region', 'Origin', 'X10X_version'
    var: 'gene_ids-query', 'feature_types-query', 'genome-query', 'gene_ids-Harvard-Nuclei-full-reference-reference', 'feature_types-Harvard-Nuclei-full-reference-reference', 'gene_ids-Sanger-Nuclei-full-reference-reference', 'feature_types-Sanger-Nuclei-full-reference-reference', 'gene_ids-Sanger-Cells-full-reference-reference', 'feature_types-Sanger-Cells-full-reference-reference', 'gene

### Fix labels for downstream analyses

- Add disease status label

In [4]:
heart_immune.obs['study'].cat.categories

Index(['Litvinukova_2020', 'MDC_2022', 'Rao_2021', 'Tucker_2020', 'Wang_2019',
       'clara'],
      dtype='object')

In [5]:
heart_immune.obs['disease_status'] = 'Healthy'
heart_immune.obs.loc[heart_immune.obs['study'] == 'clara', ['disease_status']] = 'DCM'
heart_immune.obs['disease_status'] = heart_immune.obs['disease_status'].astype('category')
heart_immune.obs['disease_status'].cat.categories

Index(['DCM', 'Healthy'], dtype='object')
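
A quick way to confirm that the label landed where intended is to cross-tabulate `study` against `disease_status`. The sketch below applies the same rule to a small toy table; the rows are invented for illustration, and only the study names and the `'clara'` to `'DCM'` rule come from the notebook.

```python
import pandas as pd

# Toy stand-in for `heart_immune.obs`; the rows are illustrative, not real data.
obs = pd.DataFrame({'study': ['Litvinukova_2020', 'clara', 'Tucker_2020', 'clara']})

# Same rule as in the notebook: cells from the 'clara' study are DCM, all others Healthy.
obs['disease_status'] = 'Healthy'
obs.loc[obs['study'] == 'clara', 'disease_status'] = 'DCM'
obs['disease_status'] = obs['disease_status'].astype('category')

# Every study should map to exactly one disease status.
status_per_study = obs.groupby('study')['disease_status'].nunique()
print(pd.crosstab(obs['study'], obs['disease_status']))
```

If any entry of `status_per_study` were greater than 1, a study would contain both healthy and diseased cells and the labelling rule would need revisiting.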

- Normalise `adata.obs['region']` and set `adata.obs['cell_source']` for the DCM subset

In [6]:
heart_immune_dcm = heart_immune[heart_immune.obs['disease_status'].isin(['DCM'])]
heart_immune_dcm

View of AnnData object with n_obs × n_vars = 60312 × 15172
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', 'Cells_Nuclei', 'combined', 'label', 'study', 'Type', 'Individual', 'state', 'leiden', '_scvi_labels', '_scvi_batch', 'leiden_annotated', 'set', 'scNym', 'scNym_confidence', 'Patient', 'Sample', 'Gender', 'Gene', 'Diagnosis', 'Clinical.dominant.mutation', 'Age', 'Mutation.Type', 'Genomic.location', 'Region', 'Origin', 'X10X_version', 'disease_status'
    var: 'gene_ids-query', 'feature_types-query', 'genome-query', 'gene_ids-Harvard-Nuclei-full-reference-reference', 'feature_types-Harvard-Nuclei-full-reference-reference', 'gene_ids-Sanger-Nuclei-full-reference-reference', 'feature_types-Sanger-Nuclei-full-reference-reference', 'gene_ids-Sanger-Cells-full-reference-reference', 'feature_types-Sanger-Cells-full-re

In [10]:
heart_immune_dcm.obs['study'].cat.categories

Index(['clara'], dtype='object')

In [11]:
heart_immune_dcm.obs['region'] = heart_immune_dcm.obs['Region']
heart_immune_dcm.obs['cell_source'] = 'MDC-Nuclei'


In [12]:
# Map alternative region codes onto the harmonised labels. Codes that already match
# ('AX', 'SP', 'RV', 'LV', 'RA', 'LA') are left unchanged and missing values ('nan') become 'U'.
region_map = {'AP': 'AX', 'S': 'SP', 'FW': 'LV', 'nan': 'U'}

heart_immune_dcm.obs['region'] = [str(i) for i in heart_immune_dcm.obs['region']]
# A single vectorised replace avoids the chained assignment that raises pandas' SettingWithCopyWarning.
heart_immune_dcm.obs['region'] = heart_immune_dcm.obs['region'].replace(region_map)

In [13]:
heart_immune_dcm.obs['region'] = heart_immune_dcm.obs['region'].astype('category')
heart_immune_dcm.obs['region'].cat.categories

Index(['AX', 'LV', 'RV', 'SP'], dtype='object')
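
After harmonising the region codes it can help to check that nothing fell outside the expected label set. The following is a minimal sketch on a hypothetical series of raw codes; `region_map` and `allowed` are illustrative names, and the input values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical raw region codes, including one missing value.
raw_region = pd.Series(pd.Categorical(['AP', 'S', 'FW', 'RV', np.nan, 'LV']))

# Alias -> harmonised label; codes not listed here are assumed to be valid already.
region_map = {'AP': 'AX', 'S': 'SP', 'FW': 'LV', 'nan': 'U'}
allowed = {'AX', 'SP', 'RV', 'LV', 'RA', 'LA', 'U'}

# Convert to plain strings first so that missing values become the string 'nan'.
region = pd.Series([str(r) for r in raw_region], index=raw_region.index)
region = region.replace(region_map)

# Fail loudly if a code is neither an alias nor an accepted label.
unmapped = sorted(set(region) - allowed)
if unmapped:
    raise ValueError(f'Unmapped region codes: {unmapped}')

region = region.astype('category')
print(region.value_counts())
```

Raising on unknown codes is a design choice: a silent pass-through would let a typo in the metadata become its own region category.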

In [14]:
heart_immune_h = heart_immune[~heart_immune.obs['disease_status'].isin(['DCM'])]
heart_immune_h

View of AnnData object with n_obs × n_vars = 82602 × 15172
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', 'Cells_Nuclei', 'combined', 'label', 'study', 'Type', 'Individual', 'state', 'leiden', '_scvi_labels', '_scvi_batch', 'leiden_annotated', 'set', 'scNym', 'scNym_confidence', 'Patient', 'Sample', 'Gender', 'Gene', 'Diagnosis', 'Clinical.dominant.mutation', 'Age', 'Mutation.Type', 'Genomic.location', 'Region', 'Origin', 'X10X_version', 'disease_status'
    var: 'gene_ids-query', 'feature_types-query', 'genome-query', 'gene_ids-Harvard-Nuclei-full-reference-reference', 'feature_types-Harvard-Nuclei-full-reference-reference', 'gene_ids-Sanger-Nuclei-full-reference-reference', 'feature_types-Sanger-Nuclei-full-reference-reference', 'gene_ids-Sanger-Cells-full-reference-reference', 'feature_types-Sanger-Cells-full-re

- Merge objects

In [15]:
cardiac_immune = heart_immune_dcm.concatenate(heart_immune_h, batch_key = 'diagnosis', batch_categories = ['dcm', 'no_dcm'], join = 'inner')
cardiac_immune


AnnData object with n_obs × n_vars = 142914 × 15172
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', 'Cells_Nuclei', 'combined', 'label', 'study', 'Type', 'Individual', 'state', 'leiden', '_scvi_labels', '_scvi_batch', 'leiden_annotated', 'set', 'scNym', 'scNym_confidence', 'Patient', 'Sample', 'Gender', 'Gene', 'Diagnosis', 'Clinical.dominant.mutation', 'Age', 'Mutation.Type', 'Genomic.location', 'Region', 'Origin', 'X10X_version', 'disease_status', 'diagnosis'
    var: 'gene_ids-query', 'feature_types-query', 'genome-query', 'gene_ids-Harvard-Nuclei-full-reference-reference', 'feature_types-Harvard-Nuclei-full-reference-reference', 'gene_ids-Sanger-Nuclei-full-reference-reference', 'feature_types-Sanger-Nuclei-full-reference-reference', 'gene_ids-Sanger-Cells-full-reference-reference', 'feature_types-Sanger-Cells-f
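
With `join = 'inner'`, only variables present in both objects are kept, while all cells from both objects are stacked. The sketch below illustrates the idea with hypothetical gene names and then checks the cell counts reported in the outputs above; it does not call `anndata` itself.

```python
import pandas as pd

# Hypothetical gene names for two objects; only the shared ones survive an inner join.
genes_dcm = pd.Index(['TTN', 'MYH7', 'CD68', 'PTPRC'])
genes_healthy = pd.Index(['MYH7', 'CD68', 'PTPRC', 'NPPA'])
shared = genes_dcm.intersection(genes_healthy)
print(sorted(shared))

# Cell counts taken from the notebook outputs: the merge should neither drop nor duplicate cells.
n_dcm, n_healthy, n_merged = 60312, 82602, 142914
cells_preserved = (n_dcm + n_healthy == n_merged)
print(f'All cells preserved: {cells_preserved}')
```

Here both subsets come from the same parent object, so the inner join leaves the 15172 variables unchanged.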

### Clean up object `adata.obs` and `adata.var`

In [16]:
cardiac_immune

AnnData object with n_obs × n_vars = 142914 × 15172
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', 'Cells_Nuclei', 'combined', 'label', 'study', 'Type', 'Individual', 'state', 'leiden', '_scvi_labels', '_scvi_batch', 'leiden_annotated', 'set', 'scNym', 'scNym_confidence', 'Patient', 'Sample', 'Gender', 'Gene', 'Diagnosis', 'Clinical.dominant.mutation', 'Age', 'Mutation.Type', 'Genomic.location', 'Region', 'Origin', 'X10X_version', 'disease_status', 'diagnosis'
    var: 'gene_ids-query', 'feature_types-query', 'genome-query', 'gene_ids-Harvard-Nuclei-full-reference-reference', 'feature_types-Harvard-Nuclei-full-reference-reference', 'gene_ids-Sanger-Nuclei-full-reference-reference', 'feature_types-Sanger-Nuclei-full-reference-reference', 'gene_ids-Sanger-Cells-full-reference-reference', 'feature_types-Sanger-Cells-f

In [17]:
# Drop `obs` columns that are not needed downstream
obs_drop = ['NRP', 'age_group', 'cell_type', 'gender', 'n_counts', 'n_genes',
            'percent_mito', 'percent_ribo', 'sample', 'scrublet_score', 'source',
            'version', 'Used', 'Cells_Nuclei', 'combined', 'label', 'Type', 'state',
            'leiden', '_scvi_labels', '_scvi_batch', 'leiden_annotated', 'set',
            'Sample', 'Gender', 'Diagnosis', 'Clinical.dominant.mutation', 'Age',
            'Mutation.Type', 'Genomic.location', 'Region', 'Origin', 'X10X_version',
            'diagnosis']
cardiac_immune.obs = cardiac_immune.obs.drop(columns = obs_drop)
cardiac_immune

AnnData object with n_obs × n_vars = 142914 × 15172
    obs: 'cell_source', 'donor', 'region', 'type', 'cell_states', 'study', 'Individual', 'scNym', 'scNym_confidence', 'Patient', 'Gene', 'disease_status'
    var: 'gene_ids-query', 'feature_types-query', 'genome-query', 'gene_ids-Harvard-Nuclei-full-reference-reference', 'feature_types-Harvard-Nuclei-full-reference-reference', 'gene_ids-Sanger-Nuclei-full-reference-reference', 'feature_types-Sanger-Nuclei-full-reference-reference', 'gene_ids-Sanger-Cells-full-reference-reference', 'feature_types-Sanger-Cells-full-reference-reference', 'gene_ids-Sanger-CD45-full-reference-reference', 'feature_types-Sanger-CD45-full-reference-reference', 'n_cells-myeloid-reference-reference', 'n_counts-myeloid-reference-reference', 'n_cells-reference', 'n_counts-reference', 'n_cells', 'n_counts'
    obsm: 'X_scnym', 'X_umap'

In [18]:
# Drop the per-dataset `var` annotations carried over from the query and reference objects
var_drop = ['gene_ids-query', 'feature_types-query', 'genome-query',
            'gene_ids-Harvard-Nuclei-full-reference-reference',
            'feature_types-Harvard-Nuclei-full-reference-reference',
            'gene_ids-Sanger-Nuclei-full-reference-reference',
            'feature_types-Sanger-Nuclei-full-reference-reference',
            'gene_ids-Sanger-Cells-full-reference-reference',
            'feature_types-Sanger-Cells-full-reference-reference',
            'gene_ids-Sanger-CD45-full-reference-reference',
            'feature_types-Sanger-CD45-full-reference-reference',
            'n_cells-myeloid-reference-reference', 'n_counts-myeloid-reference-reference',
            'n_cells-reference', 'n_counts-reference', 'n_cells', 'n_counts']
cardiac_immune.var = cardiac_immune.var.drop(columns = var_drop)
cardiac_immune

AnnData object with n_obs × n_vars = 142914 × 15172
    obs: 'cell_source', 'donor', 'region', 'type', 'cell_states', 'study', 'Individual', 'scNym', 'scNym_confidence', 'Patient', 'Gene', 'disease_status'
    obsm: 'X_scnym', 'X_umap'

In [19]:
# Remove the stored embeddings ('X_scnym', 'X_umap')
del cardiac_immune.obsm
cardiac_immune

AnnData object with n_obs × n_vars = 142914 × 15172
    obs: 'cell_source', 'donor', 'region', 'type', 'cell_states', 'study', 'Individual', 'scNym', 'scNym_confidence', 'Patient', 'Gene', 'disease_status'
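
An alternative to deleting unwanted columns one by one is to name the columns to keep. The sketch below shows this on a toy table; `obs_keep` lists the columns that remain in the cleaned object above, and the extra columns and values are placeholders.

```python
import pandas as pd

# Columns retained in the cleaned object, as listed in the output above.
obs_keep = ['cell_source', 'donor', 'region', 'type', 'cell_states', 'study',
            'Individual', 'scNym', 'scNym_confidence', 'Patient', 'Gene', 'disease_status']

# Toy table holding the kept columns plus two that should be dropped.
obs = pd.DataFrame({col: ['x', 'y'] for col in obs_keep + ['NRP', 'leiden']})

# Guard against silently losing a column that was expected to exist.
missing = [col for col in obs_keep if col not in obs.columns]
if missing:
    raise KeyError(f'Columns not found: {missing}')

obs_clean = obs.loc[:, obs_keep].copy()
print(obs_clean.columns.tolist())
```

A keep-list stays valid when upstream steps add new metadata columns, whereas a delete-list has to be extended each time.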

In [20]:
cardiac_immune.write('/Volumes/Bf110/ct5/raw_data/heart/analysis/subpopulations/3-immune/immune_HHH-DCM_scNym.raw.h5ad')

... storing 'cell_source' as categorical
... storing 'donor' as categorical
... storing 'region' as categorical
... storing 'type' as categorical
... storing 'cell_states' as categorical
... storing 'study' as categorical
... storing 'Individual' as categorical
... storing 'Patient' as categorical
... storing 'Gene' as categorical
... storing 'disease_status' as categorical


In [23]:
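# Scale every cell to 10,000 total counts, then log-transform with log1p
sc.pp.normalize_per_cell(cardiac_immune, counts_per_cell_after = 1e4)
sc.pp.log1p(cardiac_immune)
# Convert the matrix to CSC sparse format. Thanks to `kp9` for help with this!
cardiac_immune.X = cardiac_immune.X.tocsc()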

normalizing by total count per cell
    finished (0:00:02): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
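
To make the transformation concrete, here is a minimal sketch of library-size scaling followed by `log1p` on a toy sparse count matrix. It reproduces only the arithmetic with NumPy and SciPy, is not scanpy's implementation, and assumes every cell has at least one count.

```python
import numpy as np
from scipy import sparse

# Toy count matrix: 3 cells x 3 genes (values are made up).
counts = sparse.csr_matrix(np.array([[1, 0, 3],
                                     [0, 2, 2],
                                     [5, 5, 0]], dtype=np.float32))

target = 1e4  # same target as `counts_per_cell_after` in the notebook

# Per-cell totals, then rescale each row so it sums to `target`.
totals = np.asarray(counts.sum(axis=1)).ravel()
scaled = sparse.diags(target / totals) @ counts

# log(1 + x) keeps zeros at zero, so the matrix stays sparse; then switch to CSC.
logged = scaled.log1p().tocsc()
print(logged.toarray().round(3))
```

CSC stores the matrix column by column, which makes per-gene access cheaper than in the row-oriented CSR format.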


### Export object

In [24]:
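# Build a minimal export object with a dense matrix; `.toarray()` returns a plain `numpy.ndarray` rather than a `numpy.matrix`
adata_export = anndata.AnnData(X = cardiac_immune.X.toarray(), var = cardiac_immune.var, obs = cardiac_immune.obs)
adata_export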

AnnData object with n_obs × n_vars = 142914 × 15172
    obs: 'cell_source', 'donor', 'region', 'type', 'cell_states', 'study', 'Individual', 'scNym', 'scNym_confidence', 'Patient', 'Gene', 'disease_status', 'n_counts'
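
Densifying a matrix of this shape is memory-hungry, so a rough estimate is worth doing before running the cell. The sketch below uses the shape reported above; the element widths are assumptions, since the notebook does not show the dtype of `X`.

```python
n_obs, n_vars = 142914, 15172  # shape reported in the output above

def dense_size_gib(n_rows, n_cols, bytes_per_value):
    """Memory needed for a dense n_rows x n_cols array, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30

# The dtype of `X` is not shown in the notebook, so both common widths are listed.
for dtype_name, width in [('float32', 4), ('float64', 8)]:
    print(f'{dtype_name}: {dense_size_gib(n_obs, n_vars, width):.2f} GiB')
```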

- Split between healthy and diseased

In [27]:
adata_export.obs['disease_status'].cat.categories

Index(['DCM', 'Healthy'], dtype='object')

In [29]:
adata_export_H = adata_export[adata_export.obs['disease_status'].isin(['Healthy'])]
adata_export_D = adata_export[adata_export.obs['disease_status'].isin(['DCM'])]

In [30]:
adata_export_H.write('/Volumes/Bf110/ct5/raw_data/heart/analysis/subpopulations/3-immune/immune_HHH_scNym.log.h5ad')

In [31]:
adata_export_D.write('/Volumes/Bf110/ct5/raw_data/heart/analysis/subpopulations/3-immune/immune_DCM_scNym.log.h5ad')
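
Before relying on the two exported files, it is worth confirming that the split is a clean partition, with every cell in exactly one subset. A minimal sketch on a toy table, with made-up rows standing in for `adata_export.obs`:

```python
import pandas as pd

# Toy stand-in for `adata_export.obs`; the rows are illustrative.
obs = pd.DataFrame({'disease_status': pd.Categorical(['Healthy', 'DCM', 'Healthy', 'DCM', 'DCM'])})

# One subset per category, built with the same kind of boolean mask as in the notebook.
subsets = {status: obs[obs['disease_status'].isin([status])]
           for status in obs['disease_status'].cat.categories}

# The subsets should cover all rows without overlap.
n_split = sum(len(subset) for subset in subsets.values())
for status, subset in subsets.items():
    print(status, len(subset))
print('Partition complete:', n_split == len(obs))
```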