## Normalization of human head and neck squamous cell carcinoma samples from *Zheng et al.*, published in *Nature Communications* (2020). __[(GSE145370)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145370)__

#### Import the required libraries. This notebook was run on an HPC cluster so that a high-performance GPU could be used to cut down on training time.

In [1]:
import scanpy as sc
import pandas as pd
import numpy as np
import scvi
from scipy.sparse import csr_matrix


Global seed set to 0


#### Load in our h5ad data from the preprocessing and integration step

In [None]:
adata_combined = sc.read_h5ad('projects/def-jinkol/SCRNA-Seq/integrated.h5ad')
# Call the method (with parentheses) so duplicate cell barcodes are actually renamed
adata_combined.obs_names_make_unique()

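#### Convert the adata_combined matrix to CSR (compressed sparse row) format. CSR is itself a sparse format; it keeps memory use low and scvi-tools trains faster on CSR input than on other sparse layouts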

In [5]:
adata_combined.X = csr_matrix(adata_combined.X)
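
To make the CSR layout concrete, here is a minimal sketch on a toy matrix (the 3 x 4 array below is made up for illustration and is not part of the dataset). It shows that `csr_matrix` produces a sparse object and how the three underlying arrays encode the non-zero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Toy count matrix: 3 cells (rows) x 4 genes (columns), mostly zeros.
dense_counts = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
])

sparse_counts = csr_matrix(dense_counts)

print(issparse(sparse_counts))   # CSR is a sparse format
print(sparse_counts.data)        # stored non-zero values, row by row
print(sparse_counts.indices)     # column index of each stored value
print(sparse_counts.indptr)      # row i occupies data[indptr[i]:indptr[i + 1]]
```

Only the three non-zero values are stored, which is why this format suits single-cell count matrices, where most entries are zero.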

In [6]:
adata_combined.obs.groupby('Sample').count()



| Sample | n_genes | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt |
|---|---|---|---|---|---|
| Normal | 40928 | 40928 | 40928 | 40928 | 40928 |
| Tumor | 65099 | 65099 | 65099 | 65099 | 65099 |


In [7]:
adata_combined.layers['Counts'] = adata_combined.X.copy()
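
scVI models raw counts, so the `Counts` layer should hold unnormalized, non-negative integer values. The sketch below is one way to sanity-check that before training. The helper name `looks_like_raw_counts` and the toy matrices are hypothetical and only illustrate the idea.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

def looks_like_raw_counts(matrix):
    """Return True if every stored value is a non-negative integer."""
    # For sparse input only the explicitly stored values need checking;
    # the implicit zeros are valid counts already.
    values = matrix.data if issparse(matrix) else np.asarray(matrix).ravel()
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))

# Toy raw counts: 2 cells x 3 genes (made-up values).
raw = csr_matrix(np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 5.0]]))

# A log-transformed copy, as would exist after log1p normalization.
log_transformed = raw.copy()
log_transformed.data = np.log1p(log_transformed.data)

print(looks_like_raw_counts(raw))              # raw counts pass the check
print(looks_like_raw_counts(log_transformed))  # log-transformed values do not
```

In this notebook the same check could be applied to `adata_combined.layers['Counts']`; if it fails, the matrix loaded from the integration step was probably already normalized.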

#### Here we use scvi-tools to normalize our count data. scVI uses variational inference to learn the model parameters and, at the same time, an approximate posterior distribution over each cell's latent representation. First we register the data with the model and train it.

In [9]:
scvi.model.SCVI.setup_anndata(adata_combined, layer = 'Counts',
                              categorical_covariate_keys=['Sample'],
                              continuous_covariate_keys=['pct_counts_mt', 'total_counts'])

model=scvi.model.SCVI(adata_combined)
model.train()


CUDA backend failed to initialize: Found CUDA version 11070, but JAX was built against version 11080, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.


Epoch 75/75: 100%|██████████| 75/75 [18:50<00:00, 15.08s/it, loss=3.58e+03, v_num=1]
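
For reference, the variational objective behind this training run can be written compactly. The form below is a simplified sketch: it assumes the library size is treated as observed, writes the covariates registered in `setup_anndata` as *s_i*, and uses notation chosen here rather than taken from the notebook.

```latex
\log p_\theta(x_i \mid s_i) \;\ge\;
\mathbb{E}_{q_\phi(z_i \mid x_i, s_i)}\!\left[ \log p_\theta(x_i \mid z_i, s_i) \right]
\;-\; \mathrm{KL}\!\left( q_\phi(z_i \mid x_i, s_i) \,\|\, p(z_i) \right),
\qquad p(z_i) = \mathcal{N}(0, I)
```

The first term rewards reconstructing the observed counts *x_i*; the KL term keeps the approximate posterior close to the standard normal prior. The `loss` value in the progress bar is based on the negative of this bound, so it is expected to decrease as training proceeds.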


#### Here we store the latent representation learned by the model (the mean of the variational distribution that approximates the posterior distribution of the latent representation *z_i* for cell *i*) as a multidimensional array in `obsm`

In [10]:
adata_combined.obsm['X_scVI'] = model.get_latent_representation()
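
The difference between taking the mean of the variational distribution and drawing a sample from it can be sketched with NumPy. The encoder outputs below (`q_mean`, `q_var`) are made-up numbers, not values from the trained model; in scvi-tools the choice is controlled by the `give_mean` argument of `get_latent_representation`, which defaults to returning the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for 2 cells in a 3-dimensional latent space.
q_mean = np.array([[0.5, -1.0, 2.0],
                   [0.0, 0.3, -0.7]])
q_var = np.array([[0.04, 0.09, 0.01],
                  [0.25, 0.16, 0.36]])

# Point estimate: the mean of the variational distribution (deterministic).
z_mean = q_mean

# One stochastic draw: z = mean + std * eps, with eps ~ N(0, I).
eps = rng.standard_normal(q_mean.shape)
z_sample = q_mean + np.sqrt(q_var) * eps

# Averaging many draws converges back to the mean.
draws = q_mean + np.sqrt(q_var) * rng.standard_normal((20000,) + q_mean.shape)
max_gap = np.abs(draws.mean(axis=0) - q_mean).max()
print(max_gap)
```

Using the mean gives a reproducible embedding, which is convenient for the neighbour graph and clustering steps that follow.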

#### Here we use the trained model to compute normalized expression, scaled to a library size of 10,000 counts per cell, and store it in its own layer.

In [11]:
adata_combined.layers['scvi_normalized'] = model.get_normalized_expression(library_size = 1e4)
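
What `library_size = 1e4` means can be illustrated as follows. With its default settings the scVI decoder outputs, for each cell, a vector of gene proportions that sums to 1 (a softmax), and the normalized expression is that vector multiplied by the requested library size. The sketch below assumes this default behaviour; the logits are random placeholders rather than real decoder outputs.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Placeholder decoder outputs for 3 cells x 5 genes (hypothetical values).
decoder_logits = rng.normal(size=(3, 5))

# Per-cell gene proportions: non-negative and summing to 1 for each cell.
proportions = softmax(decoder_logits)

# Scale every cell to the same total of 10,000.
library_size = 1e4
scaled = proportions * library_size

print(scaled.sum(axis=1))  # each row sums to the chosen library size
```

Because every cell is scaled to the same total, differences in sequencing depth between cells no longer dominate comparisons of expression values.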

#### We can now use the scVI latent representation to calculate nearest neighbours, which are needed for clustering and for dimensionality-reduction visualizations later

In [12]:
sc.pp.neighbors(adata_combined, use_rep = 'X_scVI')
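
Conceptually, this step finds the closest cells to each cell in the latent space. The brute-force sketch below illustrates that idea on five made-up 2-D points. It is not how Scanpy implements it: `sc.pp.neighbors` uses 15 neighbours by default, relies on faster search methods for large datasets, and also builds a weighted connectivity graph. The helper `knn_indices` is hypothetical.

```python
import numpy as np

def knn_indices(embedding, k):
    """Return the indices of the k nearest neighbours of each row (Euclidean)."""
    diffs = embedding[:, None, :] - embedding[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbours
    return np.argsort(dist, axis=1)[:, :k]

# Toy latent space: two well-separated groups of cells (made-up coordinates).
latent = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
    [0.0, 0.2],
])

neighbors = knn_indices(latent, k=2)
print(neighbors)
```

Cells 0, 1 and 4 end up as each other's neighbours, while cells 2 and 3 are each other's closest neighbour, which is the structure that clustering later picks up on.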



#### Export our integrated and normalized data to analyze and visualize

In [14]:
adata_combined.write_h5ad('normalized_scvi.h5ad')