### Notebook for training `scNym` on human PBMC data from TB

- **Developed by**: Carlos Talavera-López
- **Institute of Computational Biology - Computational Health Centre - Helmholtz Munich**
- v221102

### Load required modules

In [1]:
import time
import scnym
import anndata
import scipy as sp
import numpy as np
import pandas as pd
import scanpy as sc

### Set up working environment

In [2]:
sc.settings.verbosity = 3
sc.logging.print_versions()
sc.settings.set_figure_params(dpi = 160, color_map = 'magma_r', dpi_save = 300, vector_friendly = True, format = 'svg')

The `sinfo` package has changed name and is now called `session_info` to become more discoverable and self-explanatory. The `sinfo` PyPI package will be kept around to avoid breaking old installs and you can downgrade to 0.3.2 if you want to use it without seeing this message. For the latest features and bug fixes, please install `session_info` instead. The usage and defaults also changed slightly, so please review the latest README at https://gitlab.com/joelostblom/session_info.
-----
anndata     0.8.0
scanpy      1.6.0
sinfo       0.3.4
-----
PIL                 9.2.0
absl                NA
asttokens           NA
backcall            0.2.0
certifi             2022.06.15
chardet             3.0.4
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.0
decorator           5.1.1
dunamai             1.12.0
entrypoints         0.4
executing           0.9.1
get_version         3.5.4
google              NA
h5py                3.7.0
idna          

In [3]:
config_name = "new_identity_discovery"
config = scnym.api.CONFIGS[config_name]
config["domain_groupby"] = "domain_label"

### Read in reference object

In [4]:
combined_object = sc.read_h5ad('/home/cartalop/data/single_cell/lung/tb/working_objects/CaiY_PBMC_TB_pre-scnym_ctl221017.h5ad')
combined_object

AnnData object with n_obs × n_vars = 319065 × 22792
    obs: 'object', 'domain_label', 'cell_states'
    var: 'gene_id-query'
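
For semi-supervised training, scNym treats cells whose label equals the special token `Unlabeled` as the target set. The sketch below checks that every query cell carries that token before training. It runs on a toy stand-in for `combined_object.obs`: the `'reference'` value and the cell type names are assumptions for illustration, and `check_query_labels` is a hypothetical helper, not part of `scnym`.

```python
import pandas as pd

# Toy stand-in for combined_object.obs; values are illustrative only.
obs = pd.DataFrame({
    "object": ["reference", "reference", "query", "query"],
    "cell_states": ["T cell", "B cell", "Unlabeled", "Unlabeled"],
})

def check_query_labels(obs, source_col="object", label_col="cell_states",
                       query_value="query", token="Unlabeled"):
    """Count query cells whose label is not the unlabeled token."""
    is_query = obs[source_col] == query_value
    return int((obs.loc[is_query, label_col] != token).sum())

n_bad = check_query_labels(obs)
print(f"Query cells with an unexpected label: {n_bad}")
```

On the real object the same call would be `check_query_labels(combined_object.obs)`; a non-zero count would mean some query cells are treated as labelled training data.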

In [5]:
sc.pp.filter_genes(combined_object, min_cells = 3)
sc.pp.filter_genes(combined_object, min_counts  = 10)
combined_object

filtered out 2474 genes that are detected in less than 3 cells
filtered out 119 genes that are detected in less than 10 counts


AnnData object with n_obs × n_vars = 319065 × 20199
    obs: 'object', 'domain_label', 'cell_states'
    var: 'gene_id-query', 'n_cells', 'n_counts'

In [6]:
combined_object.X[:8,:8].todense()

matrix([[0.       , 0.       , 0.       , 0.       , 0.       ,
         0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.       ,
         0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.       ,
         0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.       ,
         5.9510117, 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.       ,
         6.3671494, 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.       ,
         0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.       ,
         0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.       ,
         0.       , 0.       , 0.       ]], dtype=float32)
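
The preview above contains non-integer values such as 5.9510117, which suggests the matrix may already have been transformed before this notebook normalises it. A minimal heuristic for checking this is sketched below; `looks_like_raw_counts` is a hypothetical helper, and inspecting only the first rows is an assumption made to keep the check cheap.

```python
import numpy as np

def looks_like_raw_counts(X, n_rows=1000):
    """Heuristic: True if the first rows hold only non-negative integers."""
    block = X[:n_rows]
    if hasattr(block, "toarray"):  # SciPy sparse matrices expose toarray()
        block = block.toarray()
    values = np.asarray(block)
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))

# Toy matrices: integer counts versus values like those in the preview.
raw = np.array([[0, 2, 0], [1, 0, 3]], dtype=np.float32)
transformed = np.array([[0, 5.9510117], [6.3671494, 0]], dtype=np.float32)
print(looks_like_raw_counts(raw), looks_like_raw_counts(transformed))
```

If `looks_like_raw_counts(combined_object.X)` returned `False`, it would be worth confirming how the input was prepared before applying `normalize_total` and `log1p` again.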

In [7]:
sc.pp.normalize_total(combined_object, target_sum = 1e6)
sc.pp.log1p(combined_object)

normalizing counts per cell
    finished (0:00:01)
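
With `target_sum = 1e6`, the two calls above produce log(CPM + 1) values, which is the input scale scNym expects. The arithmetic can be sketched on a dense toy matrix as follows; this is an illustration of the transformation, not scanpy's implementation, and `log_cpm` is a hypothetical name.

```python
import numpy as np

def log_cpm(counts, target_sum=1e6):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave all-zero cells at zero instead of dividing by 0
    return np.log1p(counts / totals * target_sum)

toy = np.array([[1, 3], [0, 0]])
out = log_cpm(toy)
print(out)
```

Undoing the log with `np.expm1` recovers rows that sum to `target_sum`, which is a quick sanity check that a matrix is on the expected scale.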


### Filter lowly expressed genes

In [8]:
sc.pp.filter_genes(combined_object, min_cells = 3)
sc.pp.filter_genes(combined_object, min_counts  = 10)
combined_object

AnnData object with n_obs × n_vars = 319065 × 20199
    obs: 'object', 'domain_label', 'cell_states'
    var: 'gene_id-query', 'n_cells', 'n_counts'
    uns: 'log1p'

### Train reference with `scNym`

- Record start time for `scNym` training

In [9]:
start_time = time.time()

- Train model

- Record end time for `scNym` training

In [10]:
end_time = time.time()

- Compute the elapsed time

In [11]:
total_time = end_time - start_time
print(f"Total time: {total_time}")

Total time: 0.09972429275512695
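
An elapsed time of about 0.1 s indicates that no training call ran between the two timestamps. The sketch below shows one way the training step could be wrapped and timed. `timed` and `train_scnym` are hypothetical helpers; the `scnym_api` arguments follow the `task = 'train'` usage in the scNym documentation, and using `cell_states` as the `groupby` column is an assumption based on the `obs` columns shown earlier.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def train_scnym(adata, out_path, config, groupby="cell_states"):
    """Hypothetical wrapper around the scNym training task."""
    # Imported lazily so that `timed` can be used without scnym installed.
    from scnym.api import scnym_api
    scnym_api(
        adata=adata,
        task="train",
        groupby=groupby,   # assumed label column
        out_path=out_path,
        config=config,
    )

# Usage sketch (not executed here):
# _, total_time = timed(train_scnym, combined_object, out_path, config)

# Small demonstration of the timer on a cheap function.
result, elapsed = timed(sum, [1, 2, 3])
print(f"Total time: {elapsed:.6f} s")
```

Wrapping the call keeps the start and end timestamps tied to the work being measured, so an empty training cell cannot silently produce a near-zero timing.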


### Predict cell labels

In [12]:
from scnym.api import scnym_api

scnym_api(
    adata = combined_object,
    task = 'predict',
    key_added = 'scNym',
    trained_model = '/home/cartalop/data/single_cell/lung/tb/models/scnym_model/',
    out_path = '/home/cartalop/data/single_cell/lung/tb/models/scnym_model/',
    config = 'new_identity_discovery',
)


CUDA compute device found.


FileNotFoundError: [Errno 2] No such file or directory: '/home/cartalop/data/single_cell/lung/tb/models/scnym_model/scnym_train_results.pkl'
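
The error above shows that prediction expects `scnym_train_results.pkl` inside the model directory. A small pre-flight check, sketched below, reports missing files before `scnym_api` is called. `missing_model_files` is a hypothetical helper, and the `required` tuple only lists the file named in the error message; other files may also be needed.

```python
import tempfile
from pathlib import Path

def missing_model_files(model_dir, required=("scnym_train_results.pkl",)):
    """Return the required file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]

# Demonstration on a temporary directory instead of the real model path.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_model_files(tmp)
    (Path(tmp) / "scnym_train_results.pkl").touch()
    after = missing_model_files(tmp)

print(before, after)
```

A non-empty result for the real model directory would indicate that the training step has not yet written its outputs there.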

### Visualise label transfer and confidence using `X_scnym`

In [None]:
sc.pp.neighbors(combined_object, use_rep = 'X_scnym', n_neighbors = 50)
sc.tl.umap(combined_object, min_dist = 0.3, spread = 8, random_state = 1712)
sc.pl.umap(combined_object, color = ['scNym', 'scNym_confidence', 'cell_states'], size = 0.6, frameon = False, legend_loc = 'on data', legend_fontsize = 4)

In [None]:
sc.pl.umap(combined_object, color = ['scNym', 'scNym_confidence', 'object'], size = 0.2, frameon = False, legend_fontsize = 5)
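
Alongside the UMAP plots, a per-label summary of `scNym_confidence` can highlight predicted cell states that contain many uncertain cells. The sketch below uses a toy stand-in for `combined_object.obs`; the 0.5 threshold, the label names, and `low_confidence_fraction` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for combined_object.obs after prediction; values are illustrative.
obs = pd.DataFrame({
    "scNym": ["T cell", "T cell", "B cell", "B cell"],
    "scNym_confidence": [0.95, 0.40, 0.88, 0.91],
})

def low_confidence_fraction(obs, label_col="scNym",
                            conf_col="scNym_confidence", threshold=0.5):
    """Fraction of cells per predicted label with confidence below threshold."""
    is_low = obs[conf_col] < threshold
    return is_low.groupby(obs[label_col]).mean()

frac = low_confidence_fraction(obs)
print(frac)
```

Labels with a high fraction of low-confidence cells are candidates for closer inspection, for example as possible cell states absent from the reference.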

In [None]:
combined_object

### Save object

In [None]:
adata_export = anndata.AnnData(X = combined_object.X, obs = combined_object.obs, var = combined_object.var, uns = combined_object.uns, obsm = combined_object.obsm, obsp = combined_object.obsp)
adata_export

### Subset query cells only

In [None]:
adata_query = adata_export[adata_export.obs['object'].isin(['query'])]
adata_query

In [None]:
adata_query.write('/home/cartalop/data/single_cell/lung/tb/working_objects/CaiY_PBMC_TB_post-scnym_ctl220717.h5ad')