### QC+ for Cai 2020 dataset

#### Objective: Run advanced QC for the Cai 2020 dataset, including data normalization, log transformation, and UMAP visualization


**Developed by**: Mairi McClean

**Affiliation**: Institute of Computational Biology - Computational Health Centre - Helmholtz Munich

**v230207**


### Load modules

In [1]:
import anndata
import logging
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sb
import scrublet as scr
import matplotlib.pyplot as plt
import igraph as ig
from matplotlib import colors
from matplotlib import rcParams

#### Log file and figure output settings

In [2]:
sc.settings.verbosity = 3
sc.logging.print_versions()
sc.settings.set_figure_params(dpi = 160, color_map = 'RdPu', dpi_save = 180, vector_friendly = True, format = 'svg')

-----
anndata     0.8.0
scanpy      1.9.1
-----
PIL                 9.2.0
appnope             0.1.3
asttokens           NA
backcall            0.2.0
beta_ufunc          NA
binom_ufunc         NA
cffi                1.15.1
colorama            0.4.6
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.3
decorator           5.1.1
defusedxml          0.7.1
entrypoints         0.4
executing           1.1.1
google              NA
h5py                3.6.0
hypergeom_ufunc     NA
igraph              0.10.2
ipykernel           6.16.2
ipython_genutils    0.2.0
ipywidgets          8.0.2
jedi                0.18.1
joblib              1.2.0
kiwisolver          1.4.4
leidenalg           0.9.0
llvmlite            0.39.1
louvain             0.8.0
matplotlib          3.6.1
mpl_toolkits        NA
natsort             8.2.0
nbinom_ufunc        NA
ncf_ufunc           NA
numba               0.56.4
numexpr             2.8.1
numpy               1.23.5
packaging  

### Read in anndata object

In [4]:
adata = sc.read_h5ad('/Users/mairi.mcclean/github/data/tb_pbmc_datasets/qcd_objects/2111_2511_exported_objects/23/Nathan2021_PBMC_TB_QCed_pre-process_mm221123.h5ad')
adata

AnnData object with n_obs × n_vars = 500089 × 33538
    obs: 'cell_id', 'nUMI', 'nGene', 'percent_mito', 'batch', 'TB_status', 'UMAP_1', 'UMAP_2', 'cluster_name', 'cluster_ids', 'donor', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'percent_mt2', 'n_counts', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'predicted_doublets'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: 'TB_status_colors', 'cluster_ids_colors'
    layers: 'counts', 'sqrt_norm'

### Data normalization

`target_sum` is taken from the Scanpy PBMC3k tutorial [https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html].

From the Scanpy documentation: if `exclude_highly_expressed=True`, very highly expressed genes are excluded from the computation of the normalization factor (size factor) for each cell.
> This is meaningful as these can strongly influence the resulting normalized values for all other genes [Weinreb17].

Here `target_sum` is changed from 1e4 to 1e6, and the option to exclude highly expressed genes is set to `True`. Change these two settings first, before changing any other variable, when observing effects on the PCA output.
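
As a rough illustration of what these two settings do, the sketch below reimplements the size-factor logic with NumPy on a toy matrix. The function name `normalize_total_sketch` and the toy values are made up for illustration. `max_fraction` mirrors the Scanpy parameter of the same name (default 0.05) and is raised to 0.5 here only because the toy matrix has three genes. This is a simplified sketch, not Scanpy's actual implementation.

```python
import numpy as np

def normalize_total_sketch(counts, target_sum=1e6,
                           exclude_highly_expressed=True, max_fraction=0.05):
    """Scale each cell (row) so that its size-factor genes sum to target_sum."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    use_gene = np.ones(counts.shape[1], dtype=bool)
    if exclude_highly_expressed:
        # Drop a gene from the size factor if it exceeds max_fraction
        # of the total counts in at least one cell.
        use_gene = ~(counts > max_fraction * totals).any(axis=0)
    size = counts[:, use_gene].sum(axis=1, keepdims=True)
    size[size == 0] = 1.0  # guard against division by zero
    # All genes are rescaled, including the excluded ones.
    return counts / size * target_sum, use_gene

# Toy data: 2 cells x 3 genes; gene 0 dominates both cells.
toy = np.array([[80, 10, 10],
                [60, 20, 20]])
norm, use_gene = normalize_total_sketch(toy, target_sum=1e6, max_fraction=0.5)
print(use_gene)                          # gene 0 is excluded from the size factor
print(norm[:, use_gene].sum(axis=1))     # remaining genes sum to target_sum per cell
```

Because the dominant gene no longer contributes to the size factor, the other genes end up on a comparable scale across cells, which is why this option can visibly change the PCA.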


In [5]:
sc.pp.normalize_total(adata, target_sum=1e6, exclude_highly_expressed=True)

### Data log transformation

In [None]:
sc.pp.log1p(adata)
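
For reference, `sc.pp.log1p` applies the natural logarithm of (1 + x) by default. The small NumPy sketch below, with made-up values, shows that zeros stay zero and that the transform can be inverted with `expm1`.

```python
import numpy as np

# Made-up normalised values for a single cell
normalised = np.array([0.0, 1.0, 99.0, 9999.0])

logged = np.log1p(normalised)   # ln(1 + x); zero counts remain zero
recovered = np.expm1(logged)    # inverse transform, exp(x) - 1

print(logged)
print(np.allclose(recovered, normalised))
```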

### Identify highly variable genes

Code from Carlos' notebook scVI_exploratory_analysis. The Seurat flavour (`flavor = "seurat_v3"`) should be used for all generative models.
The number of genes (`n_top_genes`) depends on how mixed the sample is and on the available computational power. The highest gene number is 10,000.
Carlos runs between 4000 (low RAM) and 7000.
`batch_key` names the `obs` column that defines the batches; highly variable genes are selected within each batch and then merged.
`subset = True` removes all non-variable genes from the object.

In [None]:
sc.pp.highly_variable_genes(
    adata,
    flavor = "seurat_v3",
    n_top_genes = 8000,
    layer = "counts",
    batch_key = "sample",
    subset = True
)
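
The `seurat_v3` flavour expects raw counts, which is why the call above points at the `counts` layer rather than the normalised `adata.X`. A minimal sanity check, assuming a dense array or the `.data` attribute of a sparse matrix is passed in, might look like the sketch below. The helper name `looks_like_raw_counts` is hypothetical.

```python
import numpy as np

def looks_like_raw_counts(values, n_check=1000):
    """Return True if the first n_check values are non-negative integers."""
    sample = np.asarray(values, dtype=float).ravel()[:n_check]
    return bool(np.all(sample >= 0) and np.all(sample == np.round(sample)))

raw = np.array([[0, 3], [5, 1]])
print(looks_like_raw_counts(raw))            # True
print(looks_like_raw_counts(np.log1p(raw)))  # False
```

For a sparse layer, something like `looks_like_raw_counts(adata.layers["counts"].data)` would check only the stored non-zero entries.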

In [None]:
adata.var.head()


In [None]:
sc.pl.highly_variable_genes(adata)


We want to see between 0.25 and 0.75 after 1.
This particular pattern either means that the data is garbage or that it is highly significant, caused by disease.
A possible next step is to check whether changing the number of genes between 4000 and 7000 affects this pattern.

### PCA

In [None]:
sc.tl.pca(adata, svd_solver='arpack')

In [None]:
adata

Covariates can be added to the `color` list below; frames can be removed from the image with `frameon=False`.

In [None]:
adata.obs.head()

In [None]:
sc.pl.pca(adata, color = ['sample', 'donor', 'n_genes_by_counts', 'status'], wspace=0.5)
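
Plotting fails if a requested key is not a column of `adata.obs` (or a gene name). The `obs` listing printed when the object was loaded shows `batch`, `donor` and `TB_status`, but no `sample` or `status` column, so it may be worth checking the keys before plotting. The sketch below works on plain lists of column names; the helper `missing_obs_keys` is hypothetical and the column list is a subset copied from that listing.

```python
def missing_obs_keys(available, requested):
    """Return the requested keys that are not among the available columns."""
    available = set(available)
    return [key for key in requested if key not in available]

# Subset of the obs columns shown in the AnnData summary above
obs_columns = ['cell_id', 'batch', 'TB_status', 'donor', 'n_genes_by_counts']
requested = ['sample', 'donor', 'n_genes_by_counts', 'status']

print(missing_obs_keys(obs_columns, requested))  # ['sample', 'status']
```

In the notebook the first argument would be `adata.obs.columns`.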

### Computing neighbourhood graph

Carlos chooses 50 neighbours, and 50 PCs (PCs used to be taken from elbow graph)

This step builds a nearest-neighbour graph that acts as a 'scaffold' of the data; the UMAP embedding then lays the cells out in two dimensions on top of this scaffold.
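
The elbow approach mentioned above relies on the variance explained by each PC. The NumPy sketch below computes that ratio on synthetic data and picks the number of PCs needed to reach a cumulative threshold. The function names, the 0.9 threshold and the synthetic data are assumptions for illustration only. In Scanpy the corresponding plot is available as `sc.pl.pca_variance_ratio(adata)`.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()

def n_pcs_for_variance(ratio, threshold=0.9):
    """Smallest number of PCs whose cumulative variance reaches threshold."""
    cumulative = np.cumsum(ratio)
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(ratio))

# Synthetic data: 200 cells, 10 features driven by 2 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.05 * rng.normal(size=(200, 10))

ratio = explained_variance_ratio(X)
print(np.round(ratio[:4], 3))
print(n_pcs_for_variance(ratio, threshold=0.9))
```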

In [None]:
sc.pp.neighbors(adata, n_neighbors=50, n_pcs=50)
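
To make the 'scaffold' idea concrete, the sketch below finds the nearest neighbours of each point by brute force in a toy two-dimensional PC space. It is only a conceptual illustration: `sc.pp.neighbors` uses faster neighbour search and also computes weighted connectivities, which are not reproduced here. The helper `knn_indices` and the toy coordinates are made up.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of each row (Euclidean, brute force)."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]

# Five toy cells forming two well-separated groups
pcs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
                [5.0, 5.0], [5.1, 5.0]])
neighbours = knn_indices(pcs, k=1)
print(neighbours.ravel())  # [1 0 0 4 3]
```

Edges never cross between the two groups, which is the structure that clustering and UMAP later build on.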



### UMAP embedding

#### Clustering

In [None]:
# added to avoid error arising from running subsequent cell on its own

sc.tl.leiden(adata)

In [None]:
sc.tl.umap(adata)


In [None]:
adata.var.head()


In [None]:
adata.obs.head()

In [None]:
sc.pl.umap(
    adata,
    color=[
        'leiden',
        'percent_chrY',
        'XIST-counts',
        'pct_counts_ribo',
        'percent_mt2',
        'n_genes_by_counts',
        'n_counts',
        'predicted_doublets',
        'sample',
    ],
    size=1,
    wspace=0.50,
)

In [None]:
# Each sample has formed its own cluster, which indicates a batch effect
# Choose a variety of covariates from obs for the panel
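
One way to put a number on this observation is to ask, for each cluster, what fraction of its cells come from the single most common sample. The sketch below does this on made-up labels with the standard library only; the helper name `dominant_sample_fraction` is hypothetical, and values close to 1.0 would suggest that clusters follow samples rather than biology.

```python
from collections import Counter, defaultdict

def dominant_sample_fraction(clusters, samples):
    """Per cluster, the fraction of cells that come from its most common sample."""
    per_cluster = defaultdict(Counter)
    for cluster, sample in zip(clusters, samples):
        per_cluster[cluster][sample] += 1
    return {
        cluster: max(counts.values()) / sum(counts.values())
        for cluster, counts in per_cluster.items()
    }

# Made-up labels for eight cells
clusters = ['0', '0', '0', '1', '1', '1', '2', '2']
samples = ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B']

fractions = dominant_sample_fraction(clusters, samples)
print(fractions)
```

In the notebook the inputs would be the cluster and sample columns of `adata.obs`, for example `adata.obs['leiden']` and whichever column identifies the sample.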

In [None]:
# Plot the UMAP with use_raw=False, so that any gene passed to `color`
# is taken from adata.X (normalised, log-transformed) rather than adata.raw

sc.pl.umap(adata, use_raw=False)

### Writing out object


In [None]:
adata.write('/Users/mairi.mcclean/github/data/tb_pbmc_datasets/qc_plus_visualisation/230207_Nathan2021_MM_QCplus.h5ad')
# needs extension .h5ad
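
Following the note about the extension, a small helper can make sure the output path ends in `.h5ad` before writing. This is a sketch using `pathlib`; the helper name `with_h5ad_suffix` and the example paths are hypothetical.

```python
from pathlib import Path

def with_h5ad_suffix(path):
    """Return path as a Path, appending '.h5ad' if it is not already the suffix."""
    path = Path(path)
    if path.suffix == ".h5ad":
        return path
    return path.with_name(path.name + ".h5ad")

print(with_h5ad_suffix("output/example_QCplus"))       # gains the .h5ad suffix
print(with_h5ad_suffix("output/example_QCplus.h5ad"))  # left unchanged
```

The result can then be passed to `adata.write(...)`.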