# *xomx tutorial:* preprocessing and clustering 3k PBMCs

-----

This tutorial follows the single-cell RNA-seq Scanpy tutorial on 3k PBMCs:
https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html.

The objective is to analyze a dataset of Peripheral Blood Mononuclear Cells (PBMCs)
freely available from 10x Genomics, composed of 2,700 single cells that were
sequenced on the Illumina NextSeq 500.
We replace some Scanpy plots with interactive *xomx* plots, and modify the
computation of marker genes: instead of using a t-test, a Wilcoxon-Mann-Whitney test
or logistic regression, we perform recursive feature elimination with
the Extra-Trees algorithm.
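
To give an idea of what recursive feature elimination with Extra-Trees means, here is a minimal sketch using scikit-learn on synthetic data. It is not the *xomx* implementation: the toy matrix, the labels, and the choices `n_estimators=100`, `step=3` and `n_features_to_select=3` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

rng = np.random.RandomState(0)

# Hypothetical toy data: 200 "cells" x 30 "genes" of Poisson counts,
# with a binary label per cell (e.g. in a cluster / not in that cluster).
n_cells, n_genes = 200, 30
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
labels = rng.randint(0, 2, size=n_cells)

# Make genes 0, 1 and 2 informative by shifting their counts in labelled cells.
X[:, :3] += 5.0 * labels[:, np.newaxis]

# Recursive feature elimination: fit Extra-Trees, drop the 3 least
# important genes, refit on the remaining genes, repeat until 3 are left.
selector = RFE(
    estimator=ExtraTreesClassifier(n_estimators=100, random_state=0),
    n_features_to_select=3,
    step=3,
)
selector.fit(X, labels)

selected_genes = np.flatnonzero(selector.support_)
print("Selected gene indices:", selected_genes)
```

The genes that survive the elimination play the role of marker genes for the labelled group. Unlike a per-gene statistical test, the selection is driven by how useful the genes are jointly for classification.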

In [None]:
import numpy as np
import scanpy as sc
import os
import requests
try:
    import xomx
except ImportError:
    !pip install git+https://github.com/perrin-isir/xomx
import xomx

In [None]:
# To display interactive plots:
%matplotlib widget

We first define `save_dir`, the folder in which everything will be saved.

In [None]:
save_dir = os.path.join(os.path.expanduser('~'), 'results', 'xomx', 'tutorials', 'xomx_pbmc')
os.makedirs(save_dir, exist_ok=True)

In [None]:
# Setting the pseudo-random number generator
rng = np.random.RandomState(0)

We download scRNA-seq data freely available from 10x Genomics:

In [None]:
import tarfile

pbmc3k_file = "pbmc3k.tar.gz"
pbmc3k_path = os.path.join(save_dir, pbmc3k_file)
if not os.path.isfile(pbmc3k_path):
    url = (
        "https://cf.10xgenomics.com/samples/cell/pbmc3k/"
        + "pbmc3k_filtered_gene_bc_matrices.tar.gz"
    )
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # stop here if the download failed
    # Write the archive, closing the file handle before extraction
    with open(pbmc3k_path, "wb") as f:
        f.write(r.content)
    # Extract the archive into save_dir
    with tarfile.open(pbmc3k_path, "r:gz") as tar:
        tar.extractall(save_dir)

We turn this data into an [AnnData](https://anndata.readthedocs.io) object with the Scanpy function 
`read_10x_mtx()`:

In [None]:
xd = sc.read_10x_mtx(
    os.path.join(save_dir, "filtered_gene_bc_matrices", "hg19"),
    var_names="gene_symbols",
)
xd.var_names_make_unique()

We apply basic filtering, annotate the group of mitochondrial genes and compute various
quality-control metrics, as in the [Scanpy tutorial](
https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html):

In [None]:
sc.pp.filter_cells(xd, min_genes=200)
sc.pp.filter_genes(xd, min_cells=3)
xd.var["mt"] = xd.var_names.str.startswith(
    "MT-"
)  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(
    xd, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
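
To make these steps concrete, the following sketch reproduces them with plain NumPy on a tiny dense matrix. The gene names, the counts, and the thresholds `min_genes=2` and `min_cells=2` are hypothetical values chosen so that the toy example filters something; it only mimics what we understand the Scanpy calls above to compute and is not their implementation.

```python
import numpy as np

# Hypothetical toy count matrix: 4 cells (rows) x 5 genes (columns)
gene_names = np.array(["MT-ND1", "CD3D", "MT-CO1", "LYZ", "RARE1"])
counts = np.array(
    [
        [2, 5, 1, 0, 0],
        [0, 3, 0, 4, 1],
        [1, 0, 0, 0, 0],
        [0, 2, 2, 6, 0],
    ]
)

# Keep cells expressing at least min_genes genes (analogue of filter_cells)
min_genes = 2
keep_cells = (counts > 0).sum(axis=1) >= min_genes
counts = counts[keep_cells]

# Keep genes detected in at least min_cells cells (analogue of filter_genes)
min_cells = 2
keep_genes = (counts > 0).sum(axis=0) >= min_cells
counts = counts[:, keep_genes]
gene_names = gene_names[keep_genes]

# Annotate mitochondrial genes by their "MT-" prefix
is_mt = np.char.startswith(gene_names, "MT-")

# Per-cell quality-control metrics
total_counts = counts.sum(axis=1)
n_genes_by_counts = (counts > 0).sum(axis=1)
pct_counts_mt = 100.0 * counts[:, is_mt].sum(axis=1) / total_counts

print(gene_names, total_counts, n_genes_by_counts, pct_counts_mt)
```

Note that the order matters: genes are filtered on the cells that remain after the cell filter, exactly as in the two successive calls above.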

In [None]:
xd

In [None]:
# The k-th element of the following array is the fraction of a cell's total
# counts that belongs to the k-th gene, averaged over all cells
mean_count_fractions = np.squeeze(
    np.asarray(
        np.mean(
            xd.X / np.array(xd.obs["total_counts"]).reshape((xd.n_obs, 1)), axis=0
        )
    )
)
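
As a sanity check on this computation, here is the same formula applied to a small dense array with hypothetical counts. With a dense NumPy array the `np.squeeze(np.asarray(...))` wrapper is not needed; we assume it is there in the cell above because `xd.X` is a sparse matrix, for which the division returns a 2D matrix object.

```python
import numpy as np

# Hypothetical toy count matrix: 3 cells (rows) x 2 genes (columns)
counts = np.array(
    [
        [1.0, 3.0],
        [2.0, 2.0],
        [0.0, 4.0],
    ]
)
total_counts = counts.sum(axis=1)

# Divide each row by the total counts of the corresponding cell,
# then average the resulting fractions over cells (axis=0)
fractions = counts / total_counts.reshape((-1, 1))
toy_mean_count_fractions = fractions.mean(axis=0)

print(toy_mean_count_fractions)
```

Since the fractions of each cell sum to 1, the mean fractions over all genes also sum to 1, which is a quick way to verify the result on real data.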

In [None]:
# Plot, for all genes, the mean fraction
# of counts in single cells, across all cells
xomx.pl.function_plot(
    xd,
    lambda idx: mean_count_fractions[idx],
    obs_or_var="var",
    violinplot=False,
    ylog_scale=False,
    xlabel="genes",
    ylabel="mean fractions of counts across all cells",
)

In [None]:
# Plot the total counts per cell
xomx.pl.function_plot(
    xd,
    lambda idx: xd.obs["total_counts"].iloc[idx],
    obs_or_var="obs",
    violinplot=True,
    ylog_scale=False,
    xlabel="cells",
    ylabel="total number of counts",
)

In [None]:
# Plot mitochondrial count percentages vs total number of counts
xomx.pl.function_scatter(
    xd,
    lambda idx: xd.obs["total_counts"].iloc[idx],
    lambda idx: xd.obs["pct_counts_mt"].iloc[idx],
    obs_or_var="obs",
    violinplot=False,
    xlog_scale=False,
    ylog_scale=False,
    xlabel="total number of counts",
    ylabel="mitochondrial count percentages",
)