# *xomx tutorial:* **preprocessing and clustering 3k PBMCs**

-----

This tutorial follows the single cell RNA-seq Scanpy tutorial on 3k PBMCs:
https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html.

The objective is to analyze a dataset of Peripheral Blood Mononuclear Cells (PBMC)
freely available from 10X Genomics, composed of 2,700 single cells that were
sequenced on the Illumina NextSeq 500.
We replace some Scanpy plots with interactive *xomx* plots, and modify the
computation of marker genes: instead of using a t-test, a Wilcoxon-Mann-Whitney test
or logistic regression, we perform recursive feature elimination with
the Extra-Trees algorithm.

In [None]:
# imports:
import os
from IPython.display import clear_output
import requests
try:
    import xomx
except ImportError:
    !pip install git+https://github.com/perrin-isir/xomx.git
    clear_output()
    import xomx
try:
    import scanpy as sc
except ImportError:
    !pip install scanpy
    clear_output()
    import scanpy as sc
import numpy as np

We define `save_dir`, the folder in which everything will be saved.

In [None]:
save_dir = os.path.join(os.path.expanduser('~'), 'results', 'xomx-tutorials', 'pbmc')
os.makedirs(save_dir, exist_ok=True)

In [None]:
# Setting the pseudo-random number generator
rng = np.random.RandomState(0)

## Step 1: data importation, preprocessing and clustering

We download scRNA-seq data freely available from 10x Genomics:

In [None]:
pbmc3k_file = 'pbmc3k.tar.gz'
pbmc3k_path = os.path.join(save_dir, pbmc3k_file)
if not os.path.isfile(pbmc3k_path):
    url = (
        "https://cf.10xgenomics.com/samples/cell/pbmc3k/"
        + "pbmc3k_filtered_gene_bc_matrices.tar.gz"
    )
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # stop here if the download failed
    # write the archive and close the file before extracting it
    with open(pbmc3k_path, "wb") as f:
        f.write(r.content)
    os.popen("tar -xzf " + pbmc3k_path + " -C " + save_dir).read()

We turn this data into an [AnnData](https://anndata.readthedocs.io) object with the Scanpy function 
`read_10x_mtx()`:

In [None]:
xd = sc.read_10x_mtx(
    os.path.join(save_dir, "filtered_gene_bc_matrices", "hg19"),
    var_names="gene_symbols",
)
xd.var_names_make_unique()

We apply basic filtering, annotate the group of mitochondrial genes and compute various
quality-control metrics, as is done in the [Scanpy tutorial](
https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html):

In [None]:
sc.pp.filter_cells(xd, min_genes=200)
sc.pp.filter_genes(xd, min_cells=3)
xd.var["mt"] = xd.var_names.str.startswith(
    "MT-"
)  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(
    xd, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
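
As a rough illustration of what these metrics mean, the sketch below recomputes three of them with NumPy on a tiny count matrix. The matrix, the gene names and the variable names are invented for the example; Scanpy's own implementation works on sparse matrices and computes more metrics than shown here.

```python
import numpy as np

# Hypothetical toy data: 2 cells (rows) x 3 genes (columns)
toy_gene_names = np.array(["MT-CO1", "LYZ", "CD14"])
toy_counts = np.array([[1, 0, 9],
                       [5, 5, 0]])

# Boolean mask of the mitochondrial genes (names starting with "MT-")
toy_is_mt = np.char.startswith(toy_gene_names, "MT-")

# Number of genes with at least one count, per cell
toy_n_genes_by_counts = (toy_counts > 0).sum(axis=1)
# Total number of counts, per cell
toy_total_counts = toy_counts.sum(axis=1)
# Percentage of counts coming from mitochondrial genes, per cell
toy_pct_counts_mt = 100 * toy_counts[:, toy_is_mt].sum(axis=1) / toy_total_counts

print(toy_n_genes_by_counts, toy_total_counts, toy_pct_counts_mt)
```

In the real data, these quantities are stored in `xd.obs['n_genes_by_counts']`, `xd.obs['total_counts']` and `xd.obs['pct_counts_mt']`.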

In [None]:
xd

We compute the following NumPy array:

In [None]:
# The k-th element of the following array is the fraction of a cell's total counts
# that is assigned to the k-th gene, averaged over all cells
mean_count_fractions = np.squeeze(
    np.asarray(
        np.mean(
            xd.X / np.array(xd.obs["total_counts"]).reshape((xd.n_obs, 1)), axis=0
        )
    )
)

The k-th element of `mean_count_fractions` is the fraction of a cell's total counts
that is assigned to the k-th gene, averaged over all cells.
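
To make this definition concrete, here is the same computation on a small dense matrix. The values are hypothetical and chosen so that the result can be checked by hand.

```python
import numpy as np

# Hypothetical toy counts: 3 cells (rows) x 2 genes (columns)
toy_counts = np.array([[1.0, 3.0],
                       [2.0, 2.0],
                       [0.0, 4.0]])

# Total counts per cell (here 4 for every cell)
toy_totals = toy_counts.sum(axis=1)
# Fraction of each cell's counts assigned to each gene
toy_fractions = toy_counts / toy_totals.reshape((-1, 1))
# Average of these fractions over all cells, one value per gene
toy_mean_count_fractions = toy_fractions.mean(axis=0)

print(toy_mean_count_fractions)  # [0.25 0.75]
```

The per-cell fractions are (0.25, 0.75), (0.5, 0.5) and (0, 1), so the first gene gets a mean fraction of 0.25 and the second 0.75.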

Below are three examples of interactive plots with *xomx* functions:

1. Plot, for all genes, the mean fraction of counts in single cells, across all cells.  
We use `xomx.pl.plot()`. Besides the AnnData object, it takes as input a 
function (here `lambda idx: mean_count_fractions[idx]`) which itself takes as input 
the index of a feature (if `obs_or_var` is 'var') or of a sample (if `obs_or_var` is 'obs'). 

In [None]:
# Plot, for all genes, the mean fraction
# of counts in single cells, across all cells
xomx.pl.plot(
    xd,
    lambda idx: mean_count_fractions[idx],
    obs_or_var='var',
    ylog_scale=False,
    xlabel='genes',
    ylabel='mean fractions of counts across all cells',
)

Hovering over points with the cursor shows information about the corresponding genes.

2. Plot the total counts per cell.

In [None]:
# Plot the total counts per cell
xomx.pl.plot(
    xd,
    lambda idx: xd.obs['total_counts'].iloc[idx],  # idx is a position, hence .iloc
    obs_or_var='obs',
    ylog_scale=False,
    xlabel='cells',
    ylabel='total number of counts',
)

Hovering over points with the cursor shows information about the corresponding cells.

3. Plot mitochondrial count percentages vs total number of counts.  
We use `xomx.pl.scatter()`, which takes as input two functions, one for the x-axis and one for the y-axis. Both of them must take as input the index of a feature if `obs_or_var` is 'var', or the index of a sample if `obs_or_var` is 'obs'. We use a log scale for the total number of counts (x-axis).

In [None]:
# Plot mitochondrial count percentages vs total number of counts
xomx.pl.scatter(
    xd,
    lambda idx: xd.obs['total_counts'].iloc[idx],  # idx is a position, hence .iloc
    lambda idx: xd.obs['pct_counts_mt'].iloc[idx],
    obs_or_var='obs',
    xlog_scale=True,
    ylog_scale=False,
    xlabel='total number of counts',
    ylabel='mitochondrial count percentages',
)

We then follow the steps of the [Scanpy tutorial](
https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) for the preprocessing
and clustering of the data:

In [None]:
xd_processed = xd[xd.obs.n_genes_by_counts < 2500, :]
xd_processed = xd_processed[xd_processed.obs.pct_counts_mt < 5, :]
xd_processed = sc.pp.normalize_total(xd_processed, target_sum=1e4, copy=True)
xd_processed = sc.pp.log1p(xd_processed, copy=True)
sc.pp.highly_variable_genes(xd_processed, min_mean=0.0125, max_mean=3, min_disp=0.5)
xd_processed.raw = xd_processed
xd_processed = xd_processed[:, xd_processed.var.highly_variable]
sc.pp.regress_out(xd_processed, ["total_counts", "pct_counts_mt"])
sc.pp.scale(xd_processed, max_value=10)
sc.tl.pca(xd_processed, svd_solver="arpack", random_state=rng.randint(1000))
sc.pp.neighbors(xd_processed, n_neighbors=10, n_pcs=40, random_state=rng.randint(1000))
sc.tl.leiden(xd_processed, random_state=rng.randint(1000))
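
Two of these steps, `sc.pp.normalize_total()` and `sc.pp.log1p()`, are easy to reproduce by hand. The sketch below applies the same idea to a small dense matrix with hypothetical values; it only illustrates the arithmetic and does not replace the Scanpy functions, which also handle sparse data.

```python
import numpy as np

# Hypothetical toy counts: 2 cells x 2 genes.
# The second cell was sequenced 10 times deeper than the first one.
toy_counts = np.array([[1.0, 3.0],
                       [10.0, 30.0]])
toy_target_sum = 1e4

# Rescale every cell so that its counts sum to toy_target_sum
toy_normalized = toy_counts / toy_counts.sum(axis=1, keepdims=True) * toy_target_sum
# Apply the natural logarithm of (1 + x) to every entry
toy_log_normalized = np.log1p(toy_normalized)

print(toy_normalized)
```

After normalization both cells have the same profile (2500 and 7500 counts), which shows how this step removes differences in sequencing depth before the logarithm is applied.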

We rename the clusters as is done in the [Scanpy tutorial](
https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html):


In [None]:
new_cluster_names = [
    "CD4 T",
    "CD14 Monocytes",
    "B",
    "CD8 T",
    "NK",
    "FCGR3A Monocytes",
    "Dendritic",
    "Megakaryocytes",
]
xd_processed.rename_categories("leiden", new_cluster_names)  # ignore the warning

In [None]:
xd_processed.obs['leiden']

To efficiently compute the neighborhood graph and the clusters, the data was filtered by selecting only the highly variable genes (`xd_processed = xd_processed[:, xd_processed.var.highly_variable]`).  

Now, we retrieve the data with all the features, as follows:

In [None]:
obsp = xd_processed.obsp.copy()
xd_processed = xd_processed.raw.to_adata()
xd_processed.obsp = obsp

The copy of `xd_processed.obsp` is necessary as it is not restored by `xd_processed.raw.to_adata()`.

We compute the dictionary of feature indices, which is required by some *xomx* functions:

In [None]:
xd_processed.uns["var_indices"] = xomx.tl.var_indices(xd_processed)

Example:  `xd_processed.uns['var_indices']['MALAT1']` is 7854 and `xd_processed.var_names[7854]` is 
'MALAT1'.
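
Conceptually, such a dictionary maps every feature name to the position of its column in the data matrix. A minimal sketch with plain Python follows; the gene list is hypothetical, and this is not necessarily how `xomx.tl.var_indices()` is implemented.

```python
# Hypothetical feature names, in the order of the columns of the data matrix
toy_var_names = ["LYZ", "MALAT1", "CD14"]

# Map each feature name to its column index
toy_var_indices = {name: idx for idx, name in enumerate(toy_var_names)}

# Looking up a name and then indexing the list gives the name back
print(toy_var_indices["MALAT1"], toy_var_names[toy_var_indices["MALAT1"]])
```

The dictionary makes the lookup from a gene name to its index a constant-time operation, instead of a search through `var_names`.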

The 'leiden' clusters define labels, but *xomx* uses labels stored in `.obs['labels']`, so
we make the following copy:

In [None]:
xd_processed.obs['labels'] = xd_processed.obs['leiden']

Several *xomx* functions require the list of all labels and the 
dictionary of sample indices per label:

In [None]:
xd_processed.uns['all_labels'] = xomx.tl.all_labels(xd_processed.obs['labels'])
xd_processed.uns['obs_indices_per_label'] = xomx.tl.indices_per_label(xd_processed.obs['labels'])

Example: `xd_processed.uns['obs_indices_per_label']['Megakaryocytes']` is the list of indices
of the samples that are labelled as 'Megakaryocytes'.
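
The following sketch shows what these two annotations contain, using NumPy on a short hypothetical array of labels. It is only meant to illustrate the data structures; the *xomx* functions may build them differently.

```python
import numpy as np

# Hypothetical labels for 6 samples
toy_labels = np.array(["B", "NK", "B", "CD4 T", "NK", "B"])

# Sorted list of the distinct labels
toy_all_labels = sorted(set(toy_labels.tolist()))
# For each label, the positions of the samples that carry it
toy_indices_per_label = {
    label: np.where(toy_labels == label)[0] for label in toy_all_labels
}

print(toy_all_labels)
print(toy_indices_per_label["B"])
```

Here the samples labelled "B" are at positions 0, 2 and 5.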

We then randomly split the samples into training and test sets:

In [None]:
xomx.tl.train_and_test_indices(xd_processed, "obs_indices_per_label", test_train_ratio=0.25, rng=rng)

With `test_train_ratio=0.25`, for every label, 25% of the samples are assigned to 
the test set, and 75% to the training set. The function creates the following unstructured 
annotations:
- `xd_processed.uns['train_indices']`: the array of indices of all samples that belong 
to the training set.
- `xd_processed.uns['test_indices']`: the array of indices of all samples that belong 
to the test set.
- `xd_processed.uns['train_indices_per_label']`: the dictionary of sample indices in the 
training set, per label. For instance,
`xd_processed.uns['train_indices_per_label']['Megakaryocytes']` is the array
of indices of all the samples labelled as 'Megakaryocytes' that belong to the
training set.
- `xd_processed.uns['test_indices_per_label']`: the dictionary of sample indices in the 
test set, per label.
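
A per-label split of this kind can be sketched with NumPy as follows. The labels, the rounding rule and all variable names are assumptions made for the example; `xomx.tl.train_and_test_indices()` may differ in its details.

```python
import numpy as np

toy_rng = np.random.RandomState(0)
# Hypothetical labels: 8 samples labelled "A" and 4 labelled "B"
toy_labels = np.array(["A"] * 8 + ["B"] * 4)
toy_test_fraction = 0.25

toy_train_per_label, toy_test_per_label = {}, {}
for label in np.unique(toy_labels):
    # Shuffle the indices of the samples with this label
    idx = toy_rng.permutation(np.where(toy_labels == label)[0])
    # Send the first 25% (rounded) to the test set, the rest to the training set
    n_test = int(round(toy_test_fraction * len(idx)))
    toy_test_per_label[label] = np.sort(idx[:n_test])
    toy_train_per_label[label] = np.sort(idx[n_test:])

# Global index arrays, obtained by concatenating the per-label arrays
toy_train_indices = np.concatenate(list(toy_train_per_label.values()))
toy_test_indices = np.concatenate(list(toy_test_per_label.values()))
```

Splitting label by label, rather than over all samples at once, guarantees that rare labels are represented in both sets.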

We use the Scanpy function `rank_genes_groups()` to rank the genes for each 
cluster with a t-test:

In [None]:
sc.tl.rank_genes_groups(xd_processed, 'leiden', method='t-test')

After that, the ranking information is contained in 
`xd_processed.uns['rank_genes_groups']`. For instance, 
`xd_processed.uns['rank_genes_groups']['names']['Megakaryocytes']` is the list of genes 
ordered from highest to lowest rank for the label 'Megakaryocytes'.
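
To give an idea of what such a ranking involves, the sketch below computes a Welch t-statistic for every gene, comparing one group of cells with all the others, and sorts the genes by decreasing score. The expression values and gene names are hypothetical, and the code is a simplified illustration rather than Scanpy's implementation.

```python
import numpy as np

# Hypothetical log-normalized expression: 6 cells (rows) x 3 genes (columns)
toy_X = np.array([
    [5.0, 1.0, 1.0],
    [6.0, 1.2, 1.1],
    [5.5, 0.8, 0.9],
    [1.0, 1.1, 2.0],
    [1.5, 0.9, 2.2],
    [0.5, 1.0, 1.8],
])
toy_genes = np.array(["geneA", "geneB", "geneC"])
# The first three cells belong to the group of interest
toy_in_group = np.array([True, True, True, False, False, False])


def welch_t(a, b):
    """Welch's t-statistic per column, for two groups of cells (rows)."""
    var_a, var_b = a.var(axis=0, ddof=1), b.var(axis=0, ddof=1)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(var_a / len(a) + var_b / len(b))


toy_scores = welch_t(toy_X[toy_in_group], toy_X[~toy_in_group])
# Genes ordered from the highest to the lowest score
toy_ranked_genes = toy_genes[np.argsort(-toy_scores)]
print(toy_ranked_genes)
```

In this toy example geneA is strongly over-expressed in the group and comes first, geneB shows no difference, and geneC is under-expressed and comes last.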

We save `xd_processed` as the file **xomx_pbmc.h5ad**
in the `save_dir` directory:

In [None]:
xd_processed.write(os.path.join(save_dir, 'xomx_pbmc.h5ad'))

## Step 2: training binary classifiers and performing recursive feature elimination

Loading the AnnData object:

In [None]:
xd_processed = sc.read(os.path.join(save_dir, 'xomx_pbmc.h5ad'), cache=True)

Just like in the [xomx_kidney_classif_2.ipynb tutorial](
https://colab.research.google.com/github/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_2.ipynb),
we use the Extra-Trees algorithm and run it several times per label to select
100, then 30, 20, 15 and finally 10 marker genes for each label.  
The only difference here is the use of the option `init_selection_size=8000` in `init()`. 
This option speeds up the process of feature elimination by starting with an
initial selection of features of size 8000, different for each label (while 
in [xomx_kidney_classif_2.ipynb](
https://colab.research.google.com/github/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_2.ipynb), a global filtering was applied
to start with a common initial selection of 8000 highly variable genes).  

With the `init_selection_size` option, we must also provide as input a list or array of features ordered by rank (the most important features first).
The first `init_selection_size` features will be selected.  
In our case, `xd_processed.uns['rank_genes_groups']` has been computed before, and for each label, 
`xd_processed.uns['rank_genes_groups']['names'][label]` is an array of the features ordered by rank.  
So for each label, our initial selection of 8000 genes coincides with the
8000 highest ranked features in `xd_processed.uns['rank_genes_groups']['names'][label]`.
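
The initial selection therefore amounts to truncating a ranked list. Here is a minimal sketch with plain Python, in which the ranking, the index dictionary and the selection size are all hypothetical:

```python
# Hypothetical ranking for one label, most important features first
toy_rank = ["gene_c", "gene_e", "gene_b", "gene_d", "gene_a"]
# Hypothetical dictionary from feature names to column indices
toy_var_indices = {"gene_a": 0, "gene_b": 1, "gene_c": 2, "gene_d": 3, "gene_e": 4}

toy_init_selection_size = 3
# Keep only the first toy_init_selection_size features of the ranking
toy_initial_features = toy_rank[:toy_init_selection_size]
# Convert the names to column indices of the data matrix
toy_initial_indices = [toy_var_indices[name] for name in toy_initial_features]

print(toy_initial_features, toy_initial_indices)
```

Because the ranking depends on the label, each binary classifier starts from its own subset of features.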

After the training, for each label, `feature_selectors[label]` is a
binary classifier using only 10 features to discriminate samples with the label 
from other samples.
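
For readers who have not seen recursive feature elimination before, the sketch below shows the general idea with scikit-learn's `ExtraTreesClassifier` on synthetic data: train a forest, keep the features with the highest importances, and repeat with a smaller selection. The data, the selection sizes and the use of `feature_importances_` are assumptions of this example; `xomx.fs.RFEExtraTrees` may proceed differently.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

toy_rng = np.random.RandomState(0)
# Hypothetical data: 200 samples x 50 features, only features 0 and 1 carry signal
toy_X = toy_rng.normal(size=(200, 50))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)  # 1 = has the label, 0 = other

# Start from all features and shrink the selection step by step
toy_selected = np.arange(toy_X.shape[1])
for size in [20, 10, 5]:
    forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
    forest.fit(toy_X[:, toy_selected], toy_y)
    # Keep the `size` features with the highest importances
    keep = np.argsort(forest.feature_importances_)[::-1][:size]
    toy_selected = np.sort(toy_selected[keep])

print(toy_selected)
```

Retraining after every reduction matters: feature importances are recomputed on the remaining features only, so the ranking can change from one step to the next.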

In [None]:
feature_selectors = {}
for label in xd_processed.uns['all_labels']:
    print('Label: ' + label)
    feature_selectors[label] = xomx.fs.RFEExtraTrees(
        xd_processed,
        label,
        n_estimators=450,
        random_state=rng,
    )
    feature_selectors[label].init(init_selection_size=8000, rank=xd_processed.uns['rank_genes_groups']['names'][label])
    for siz in [100, 30, 20, 15, 10]:
        print('Selecting', siz, 'features...')
        feature_selectors[label].select_features(siz)
        print(
            'MCC score:',
            xomx.tl.matthews_coef(feature_selectors[label].confusion_matrix),
        )
    feature_selectors[label].save(os.path.join(save_dir, 'feature_selectors', label))
    print('Done.')
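
The MCC (Matthews correlation coefficient) printed above summarizes a binary confusion matrix in a single number between -1 and 1, where 1 means perfect prediction. As a reminder of the formula, here is a small NumPy sketch. The function name and the confusion matrices are hypothetical, and this is not the code of `xomx.tl.matthews_coef()`.

```python
import numpy as np


def toy_mcc(cm):
    """MCC of a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # By convention, return 0 when a row or column of the matrix is empty
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


toy_perfect = np.array([[50, 0], [0, 10]])
toy_mixed = np.array([[45, 5], [2, 8]])
print(toy_mcc(toy_perfect), toy_mcc(toy_mixed))
```

Unlike accuracy, the MCC stays informative when one class is much smaller than the other, which is the case for most labels here. The formula is also unchanged if the matrix is transposed or if the two classes are swapped, so the exact layout of the confusion matrix does not matter.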

## Step 3: visualizing the results

Loading the AnnData object:

In [None]:
xd_processed = sc.read(os.path.join(save_dir, 'xomx_pbmc.h5ad'), cache=True)

Loading the binary classifiers, and creating `gene_dict`, a dictionary of the 10-gene
signatures for each label:

In [None]:
feature_selectors = {}
gene_dict = {}
for label in xd_processed.uns['all_labels']:
    feature_selectors[label] = xomx.fs.load_RFEExtraTrees(
        os.path.join(save_dir, 'feature_selectors', label),
        xd_processed,
    )
    gene_dict[label] = [
        xd_processed.var_names[idx_]
        for idx_ in feature_selectors[label].current_feature_indices
    ]

We construct a multiclass classifier based on the binary classifiers:


In [None]:
sbm = xomx.cl.ScoreBasedMulticlass(xd_processed, xd_processed.uns['all_labels'], feature_selectors)

In [None]:
sbm.plot()
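
A common way to combine one-vs-rest binary classifiers is to let each of them score a sample, and to predict the label whose classifier gives the highest score. The sketch below illustrates this rule on hypothetical scores. Whether `ScoreBasedMulticlass` applies exactly this rule is an assumption, not something stated in this tutorial.

```python
import numpy as np

toy_labels = ["B", "NK", "CD4 T"]
# Hypothetical scores: one row per sample, one column per binary classifier
toy_scores = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.4],
])

# For every sample, pick the label with the highest score
toy_predicted = [toy_labels[j] for j in toy_scores.argmax(axis=1)]
print(toy_predicted)  # ['B', 'CD4 T', 'NK']
```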

With the function `xomx.pl.plot_var()`, we visualize the 10-gene signatures of CD14 Monocytes and FCGR3A Monocytes:

In [None]:
xomx.pl.plot_var(xd_processed, gene_dict["CD14 Monocytes"] + gene_dict["FCGR3A Monocytes"])

Some categories have significantly fewer samples than others, so we can pass the option `equal_size=True` to duplicate some of the samples and get a plot with categories of equal sizes:

In [None]:
xomx.pl.plot_var(xd_processed, gene_dict["CD14 Monocytes"] + gene_dict["FCGR3A Monocytes"], equal_size=True, width=1000)
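
Equalizing category sizes by duplication can be sketched as follows: every category is padded with randomly chosen copies of its own samples until it reaches the size of the largest one. The indices and the random choice of duplicates are assumptions of this example; the text above does not specify how *xomx* picks the samples to duplicate.

```python
import numpy as np

toy_rng = np.random.RandomState(0)
# Hypothetical sample indices per category: 5 samples in "big", 2 in "small"
toy_indices = {"big": np.arange(5), "small": np.array([5, 6])}
# Size of the largest category
toy_target = max(len(idx) for idx in toy_indices.values())

toy_balanced = {}
for label, idx in toy_indices.items():
    # Draw the missing samples with replacement from the category itself
    extra = toy_rng.choice(idx, size=toy_target - len(idx), replace=True)
    toy_balanced[label] = np.concatenate([idx, extra])

print({label: len(idx) for label, idx in toy_balanced.items()})
```

Duplication only affects the display: it makes small categories as visible as large ones, but it adds no new information.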

We gather all the selected genes in a single list:

In [None]:
all_selected_genes = np.asarray(list(gene_dict.values())).flatten()

For comparison, we define a set of known biomarkers as suggested in the 
[Scanpy tutorial](
https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html):

In [None]:
biomarkers = {
    "IL7R",
    "CD14",
    "LYZ",
    "MS4A1",
    "CD8A",
    "GNLY",
    "NKG7",
    "FCGR3A",
    "MS4A7",
    "FCER1A",
    "CST3",
    "PPBP",
}

In [None]:
print(biomarkers.intersection(all_selected_genes))

We use Scanpy to create a UMAP embedding, stored in `.obsm["X_umap"]`: 

In [None]:
sc.tl.umap(xd_processed)

Using `xomx.pl.plot_2d_obsm()`, we get an interactive plot of this embedding:

In [None]:
xomx.pl.plot_2d_obsm(xd_processed, "X_umap")

By default, different colors correspond to the different labels, but 
we can also specify a feature:

In [None]:
xomx.pl.plot_2d_obsm(xd_processed, "X_umap", "CST3")

In [None]:
xd_processed.obs["colors"] = xomx.tl._to_dense(xd_processed[:, "CST3"].X)

In [None]:
xd_processed.obs["colors"]

In [None]:
xomx.tl._to_dense