# Exploratory Data Analysis

---

This few-shot benchmark tests various meta-learning methods in the context of
biomedical applications. In particular, we are dealing with the [Tabula Muris]()
and [SwissProt]() datasets. The former is a cell type classification task based
on single-cell gene expression and the latter is a protein function prediction
task based on protein sequences. The goal of this notebook is to explore basic
statistics about the two datasets and to understand how data loading is
implemented for episodic meta-training.


## Setup

---

First, let's import the required modules.


In [2]:
# ruff: noqa: E402
# Reload modules automatically
%reload_ext autoreload
%autoreload 2

# Standard library imports
import sys
import time

# Third-party imports
import numpy as np
import seaborn as sns
import torch

In [3]:
# Add path to load local modules
sys.path.append("..")

# Set styles
sns.set_style("whitegrid")

## Base Classes

---

Both datasets are implemented as subclasses of the `FewShotDataset` class and
use some other generic utility classes. We will explore these here in detail.
They are all defined in the `datasets.dataset` module.


### FewShotDataset

The `FewShotDataset(torch.utils.data.Dataset)` class is the base class for all
few-shot datasets. It implements the `__getitem__` and `__len__` methods and has
some utilities for checking data validity. Furthermore, it is responsible for
loading and extracting the dataset into the `root` directory if it is specified
and does not yet exist. However, as it is an abstract base class, it cannot be
instantiated directly: subclasses must define `_dataset_name` and `_dataset_dir`
as class attributes.


In [4]:
# Demo: FewShotDataset
from datasets.dataset import FewShotDataset  # noqa

try:
    few_shot_dataset = FewShotDataset()
except Exception as e:
    print(f"❌ Fails with error {e}.")

❌ Fails with error FewShotDataset must have attribute _dataset_name..
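
The error above suggests that the base class checks for required class
attributes at construction time. The following is a minimal, dependency-free
sketch of that pattern, not the actual implementation: `ToyBaseDataset`,
`ToyDataset` and `_required_attrs` are hypothetical names chosen for
illustration, and the real class may perform its check differently.

```python
class ToyBaseDataset:
    """Hypothetical stand-in for an abstract dataset base class."""

    # Class attributes every concrete subclass is expected to define
    _required_attrs = ("_dataset_name", "_dataset_dir")

    def __init__(self):
        for attr in self._required_attrs:
            if not hasattr(self, attr):
                raise AttributeError(
                    f"{type(self).__name__} must have attribute {attr}."
                )


class ToyDataset(ToyBaseDataset):
    # Defining both attributes makes the subclass instantiable
    _dataset_name = "toy"
    _dataset_dir = "toy_dir"


# The base class itself lacks the attributes, so construction fails
try:
    ToyBaseDataset()
    base_error = None
except AttributeError as err:
    base_error = str(err)

# The concrete subclass can be created without problems
toy = ToyDataset()
```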


In [5]:
# Demo: FewShotSubDataset
from datasets.dataset import FewShotSubDataset  # noqa

# Create a random dataset with 100 samples, 5 features and 5 classes
samples = torch.rand(100, 5)
targets = torch.randint(0, 5, (100,))  # 5-way
subset_target = 4

# Get all samples that belong to class 4
subset_samples = samples[targets == subset_target]

# Create a few-shot dataset for class 4
few_shot_sub_dataset = FewShotSubDataset(subset_samples, subset_target)

# Sanity checks
assert (
    len(few_shot_sub_dataset) == (targets == subset_target).sum()
), "❌ Length of few-shot dataset is not correct."
assert (
    few_shot_sub_dataset.dim == samples.shape[1]
), "❌ Dimension of few-shot dataset is not correct."

### Episodic Batch Sampler

The `EpisodicBatchSampler` is a utility class that randomly samples `n_way`
classes (out of a total of `n_classes`) for each of `n_episodes` episodes. It can
be used in episodic training to sample the classes used in each episode.

The sampler is `n_episodes` long and, in each episode, draws `n_way` class
indices at random (without replacement) from `{0, ..., n_classes-1}`.


In [6]:
# Demo: EpisodicBatchSampler
from datasets.dataset import EpisodicBatchSampler  # noqa

n_episodes, n_way, n_classes = 3, 5, 10
episodic_batch_sampler = EpisodicBatchSampler(n_classes, n_way, n_episodes)

print(f"Episodes: {n_episodes}, Ways: {n_way}, Classes: {n_classes}")
for batch_idx, indices in enumerate(episodic_batch_sampler):
    print(f"Episode {batch_idx+1} w/ classes {indices.numpy()}")

Episodes: 3, Ways: 5, Classes: 10
Episode 1 w/ classes [5 8 1 2 6]
Episode 2 w/ classes [9 5 6 7 8]
Episode 3 w/ classes [2 0 3 8 5]
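
To make the sampling logic concrete, here is a minimal sketch of an episodic
class sampler using only the standard library. `ToyEpisodicSampler` and its
`seed` argument are hypothetical; the real `EpisodicBatchSampler` yields tensors
and may be implemented differently.

```python
import random


class ToyEpisodicSampler:
    """Hypothetical stand-in: yields `n_way` distinct class indices per episode."""

    def __init__(self, n_classes, n_way, n_episodes, seed=None):
        if n_way > n_classes:
            raise ValueError("n_way cannot exceed n_classes")
        self.n_classes = n_classes
        self.n_way = n_way
        self.n_episodes = n_episodes
        self.rng = random.Random(seed)

    def __len__(self):
        return self.n_episodes

    def __iter__(self):
        for _ in range(self.n_episodes):
            # Sampling without replacement: no class appears twice in an episode
            yield self.rng.sample(range(self.n_classes), self.n_way)


sampler = ToyEpisodicSampler(n_classes=10, n_way=5, n_episodes=3, seed=0)
episodes = list(sampler)
```

Classes may repeat across episodes (as class 5 and class 8 do in the output
above), but never within a single episode.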


## Tabula Muris

---

**Tabula Muris** is a dataset of single-cell transcriptome data (gene
expression) from mice, containing nearly `100,000` cells from `20` organs and
tissues. The data allow for direct and controlled comparison of gene expression
in cell types shared between tissues, such as immune cells from distinct
anatomical locations. They also allow for a comparison of two distinct technical
approaches: microfluidic droplet-based 3'-end counting and FACS-based
full-length transcript analysis.

_More Resources_:

- [Tabula Muris Website](https://tabula-muris.ds.czbiohub.org/)
- [SF Biohub Article](https://www.czbiohub.org/sf/tabula-muris/)


### TMSimpleDataset

The `TMSimpleDataset` is a simple dataset class designed for regular
multi-class classification training and fine-tuning. It loads the entire
(processed) dataset into memory and wraps it inside a PyTorch `Dataset` object.
It supports retrieving a single sample, creating a batched data loader and
querying the dimensionality of the data.

_Note: Upon first call, the `TMSimpleDataset` class will download the data into
the `root` directory._


In [7]:
# Demo: TMSimpleDataset
from datasets.cell.tabula_muris import TMSimpleDataset  # noqa

# Arguments to provide
batch_size = 10
root = "../data"
min_samples = 20
subset = 1.0

kwargs = {
    "batch_size": batch_size,
    "root": root,
    "min_samples": min_samples,
    "subset": subset,
}

# Controls data split (returns subset of tissue types)
modes = ["train", "val", "test"]

# Initialise the Tabula Muris dataset for each split
for mode in modes:
    start = time.time()
    data = TMSimpleDataset(**kwargs, mode=mode)
    print(
        f"✅ TabulaMuris {mode} split ({len(data)}) loaded in {time.time() - start:.2f} seconds."
    )

✅ TabulaMuris train split (65812) loaded in 4.97 seconds.
✅ TabulaMuris val split (14962) loaded in 2.08 seconds.
✅ TabulaMuris test split (25065) loaded in 2.81 seconds.


Now, the raw data is downloaded and saved onto disk. Let's view the size of the
raw and processed data.


In [8]:
# This assumes the data is already downloaded
!du -sh ../data/tabula_muris/*

 84M	../data/tabula_muris/gene_association.mgi
 32M	../data/tabula_muris/go-basic.obo
3.2G	../data/tabula_muris/processed
2.3G	../data/tabula_muris/tabula-muris-comet.h5ad



In [9]:
!du -sh ../data/tabula_muris/processed/*

4.0K	../data/tabula_muris/processed/mapping.pkl
3.2G	../data/tabula_muris/processed/tabula-muris.pkl


Nice, let's take a look at the data. We will only use the training set for now.


In [10]:
data = TMSimpleDataset(**kwargs, mode="train")

In [11]:
# Statistics on the data splits
num_samples = len(data)
dim = data.dim
unique_classes = np.unique(data.targets)
num_classes = len(unique_classes)

print(f"ℹ️ Tabula Muris dataset has {num_samples} train samples.")
print(f"ℹ️ Each sample has {dim} features (gene expression levels).")
print(f"ℹ️ Tabula Muris dataset has {num_classes} unique classes.")

ℹ️ Tabula Muris dataset has 65812 train samples.
ℹ️ Each sample has 2866 features (gene expression levels).
ℹ️ Tabula Muris dataset has 57 unique classes.


We can get a sample by indexing the dataset.


In [12]:
# Get sample by indexing
x, y = data[0]

print(f"Training sample shape: {x.shape} and target {y}")

Training sample shape: (2866,) and target 17


Makes sense. Each sample in the processed data has 2866 features (gene
expression levels) and a single integer indicating the target cell type.


In [13]:
# Get data loader
data_loader = data.get_data_loader(num_workers=0, pin_memory=False)

# Create iterator
data_iter = iter(data_loader)

# Get five batches
for batch_idx in range(5):
    tr_smps, tr_trgs = next(data_iter)
    print(f"Batch {batch_idx+1}: Features {tr_smps.shape} and target {tr_trgs}")

Batch 1: Features torch.Size([10, 2866]) and target tensor([32, 24,  0, 26, 38, 39, 48,  0,  0, 16], dtype=torch.int32)
Batch 2: Features torch.Size([10, 2866]) and target tensor([51, 30, 33, 26, 39, 39, 23, 12, 16, 39], dtype=torch.int32)
Batch 3: Features torch.Size([10, 2866]) and target tensor([39, 24, 16, 39, 39, 48, 39, 39,  0, 48], dtype=torch.int32)
Batch 4: Features torch.Size([10, 2866]) and target tensor([37, 33, 39, 11, 42, 55,  0, 16, 23, 17], dtype=torch.int32)
Batch 5: Features torch.Size([10, 2866]) and target tensor([51, 18, 17, 12, 38, 40, 39, 17, 16, 16], dtype=torch.int32)


**NB 1:** On CPU, setting `num_workers=0` and `pin_memory=False` is recommended
and loads the batches almost instantly. However, the original parameters in the
code are `num_workers=4` and `pin_memory=True`, which are likely optimised for
GPU training. To give full control, the signature of the `get_data_loader`
method has been changed to allow for customisation of the data loader.

**NB 2:** Shuffling the data is crucial during training for regular fine-tuning.
The raw data does not seem to be shuffled well, which will likely hurt
training.


### TMSetDataset

The `TMSetDataset` is designed for few-shot learning. Most configurations are
the same as for the `TMSimpleDataset`, but crucially the dataset class will
return an episodic batch sampler based on the `n_way`, `n_support` and `n_query`
parameters.


In [14]:
# Demo: TMSetDataset
from datasets.cell.tabula_muris import TMSetDataset  # noqa

# Arguments to provide
root = "../data"
n_way = 5
n_support = 3
n_query = 3
subset = 1.0

kwargs = {
    "n_way": n_way,
    "n_support": n_support,
    "n_query": n_query,
    "root": root,
    "subset": subset,
}

# Controls data split (returns subset of tissue types)
modes = ["train", "val", "test"]

for mode in modes:
    start = time.time()
    TMSetDataset(**kwargs, mode=mode)

    print(
        f"✅ TMSetDataset {mode} split loaded in {time.time() - start:.2f} seconds.")

✅ TMSetDataset train split loaded in 8.39 seconds.
✅ TMSetDataset val split loaded in 2.29 seconds.
✅ TMSetDataset test split loaded in 3.71 seconds.


In a few-shot learning dataset, a single "sample" is not a (feature vector,
target) pair of gene expression levels and target label, but a set of support
and query samples from a single class. Thus, the `__getitem__` method returns a
tuple of the support and query samples and targets for the `i`-th class. The
returned tensors therefore have shape `(n_support + n_query, n_features)` for
the samples and `(n_support + n_query, )` for the targets.


In [15]:
# Load Dataset
data = TMSetDataset(**kwargs, mode="train")

In [16]:
# Get sample by indexing
tr_smp, tr_trg = data[0]

# Support samples and target
sup_tr_smp, sup_tr_trg = tr_smp[:n_support], tr_trg[:n_support]

# Query samples and target
que_tr_smp, que_tr_trg = tr_smp[n_support:], tr_trg[n_support:]

print(f"Training samples shape: {tr_smp.shape} and target {tr_trg}")
print(f"Support samples shape: {sup_tr_smp.shape} and target {sup_tr_trg}")
print(f"Query samples shape: {que_tr_smp.shape} and target {que_tr_trg}")

Training samples shape: torch.Size([6, 2866]) and target tensor([0, 0, 0, 0, 0, 0], dtype=torch.int32)
Support samples shape: torch.Size([3, 2866]) and target tensor([0, 0, 0], dtype=torch.int32)
Query samples shape: torch.Size([3, 2866]) and target tensor([0, 0, 0], dtype=torch.int32)
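
The per-class draw can be sketched as follows. This is a hedged illustration
with plain Python lists rather than tensors: `draw_class_set` is a hypothetical
helper, and sampling without replacement is an assumption about how the real
per-class sampler behaves.

```python
import random


def draw_class_set(class_samples, target, n_support, n_query, rng):
    """Draw n_support + n_query samples of one class (without replacement)."""
    k = n_support + n_query
    if len(class_samples) < k:
        raise ValueError("class has too few samples for this episode size")
    drawn = rng.sample(class_samples, k)
    targets = [target] * k  # all drawn samples share the class label
    return drawn, targets


rng = random.Random(0)

# Toy class 0 with 20 samples of 4 features each
class_samples = [[rng.random() for _ in range(4)] for _ in range(20)]

smp, trg = draw_class_set(class_samples, target=0, n_support=3, n_query=3, rng=rng)

# The first n_support rows act as support set, the remaining rows as query set
support, query = smp[:3], smp[3:]
```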


Next, the data loader class combines the support-query sampler per class (as
defined above) and the `EpisodicBatchSampler` to create a data loader that
returns batches of episodes, where each episode contains `n_way` classes with
`n_support` support samples and `n_query` query samples per class. First, the
episodic batch sampler samples `n_way` random class indices and then the
support-query sampler samples the support and query samples for each class.
Thus, the final tensor shapes will be `(n_way, n_query + n_support, n_features)`
for the samples and `(n_way, n_query + n_support, )` for the targets.


In [17]:
# Get data loader
train_loader = data.get_data_loader(num_workers=0, pin_memory=False)

# Get batch
tr_smps, tr_trgs = next(iter(train_loader))

print(f"Training batch shape: {tr_smps.shape} and target {tr_trgs.shape}")
print(f"Targets:\n{tr_trgs}")

Training batch shape: torch.Size([5, 6, 2866]) and target torch.Size([5, 6])
Targets:
tensor([[24, 24, 24, 24, 24, 24],
        [31, 31, 31, 31, 31, 31],
        [10, 10, 10, 10, 10, 10],
        [46, 46, 46, 46, 46, 46],
        [ 1,  1,  1,  1,  1,  1]], dtype=torch.int32)
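
The two-stage sampling that produces these shapes can be sketched with nested
lists. `build_episode` and `data_by_class` are hypothetical names, and the toy
data is random; the sketch only illustrates how choosing classes first and
drawing per-class sets second yields a `(n_way, n_support + n_query, n_features)`
layout.

```python
import random


def build_episode(data_by_class, n_way, n_support, n_query, rng):
    """Return (samples, targets) as nested lists for one episode."""
    k = n_support + n_query
    # Step 1: pick n_way distinct classes
    classes = rng.sample(sorted(data_by_class), n_way)
    # Step 2: draw support + query samples for each chosen class
    samples = [rng.sample(data_by_class[c], k) for c in classes]
    targets = [[c] * k for c in classes]
    return samples, targets


rng = random.Random(0)

# Toy data: 10 classes, 8 samples each, 4 features per sample
data_by_class = {
    c: [[rng.random() for _ in range(4)] for _ in range(8)] for c in range(10)
}

samples, targets = build_episode(data_by_class, n_way=5, n_support=3, n_query=3, rng=rng)

# (n_way, n_support + n_query, n_features)
shape = (len(samples), len(samples[0]), len(samples[0][0]))
```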


**NB:** Shuffling in meta-learning tasks is not necessary because the episodic
batch sampler and the per-class sampler already sample randomly.


## SwissProt

---

SWISS-PROT is an annotated protein sequence database, which was created at the
Department of Medical Biochemistry of the University of Geneva (first started
1987). In SWISS-PROT two classes of data can be distinguished: the core data and
the annotation. For each sequence entry the core data consists of the sequence
data; the citation information (bibliographical references) and the taxonomic
data (description of the biological source of the protein), while the annotation
consists of the description of the following items:

- Function(s) of the protein
- Post-translational modification(s). For example carbohydrates,
  phosphorylation, acetylation, GPI-anchor, etc.
- Domains and sites. For example calcium binding regions, ATP-binding sites,
  zinc fingers, homeoboxes, SH2 and SH3 domains, etc.
- Secondary structure. For example alpha helix, beta sheet, etc.
- Quaternary structure. For example homodimer, heterotrimer, etc.
- Similarities to other proteins
- Disease(s) associated with deficiencie(s) in the protein
- Sequence conflicts, variants, etc.

Within this project we focus on the function annotation of proteins: given the
protein sequence (a string of amino acids), we want to predict the function of
the protein.

_More Resources_:

- [National Library of Medicine](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102476/)


### SPSimpleDataset

The `SPSimpleDataset` is a simple dataset class designed for regular
multi-class classification training and fine-tuning. It loads the entire
(processed) dataset into memory and wraps it inside a PyTorch `Dataset` object.
It supports retrieving a single sample, creating a batched data loader and
querying the dimensionality of the data.


In [18]:
# The data has to be manually downloaded to this folder
!du -sh ../data/swissprot/*

111M	../data/swissprot/embeds
 47M	../data/swissprot/filtered_goa_uniprot_all_noiea.gaf
 30M	../data/swissprot/go-basic.obo
 74M	../data/swissprot/processed
8.2M	../data/swissprot/sprot_ancestors.txt
271M	../data/swissprot/uniprot_sprot.fasta



In [46]:
# Demo: SPSimpleDataset
from datasets.prot.swissprot import SPSimpleDataset  # noqa

# Arguments to provide
root = "../data"
batch_size = 10
min_samples = 6

kwargs = {
    "root": root,
    "batch_size": batch_size,
    "min_samples": min_samples,
}

# Show loading time for each split
modes = ["train", "val", "test"]
for mode in modes:
    start = time.time()
    data = SPSimpleDataset(**kwargs, mode=mode)

    print(
        f"✅ SwissProt {mode} split ({len(data)} samples) loaded in {time.time() - start:.2f} seconds."
    )

✅ SwissProt train split (10795 samples) loaded in 1.49 seconds.
✅ SwissProt val split (1139 samples) loaded in 1.30 seconds.
✅ SwissProt test split (1183 samples) loaded in 1.04 seconds.


In [43]:
# This assumes the data is downloaded manually
!du -sh ../data/swissprot/processed*

 74M	../data/swissprot/processed



Nice, we can load the data quickly and easily. The class also supports
subsetting, which is useful for debugging the training. However, because we
still load all the data into memory before subsetting, we do not get any
speed-up (_NB: We could save the split data to disk and load it from there, but
this is not implemented yet_).

**NB:** The SwissProt dataset as-is is relatively small and can be loaded into
memory without many problems. In this case, subsetting the data decreases the
number of classes that are included, because fewer classes meet our specified
minimum number of samples per class (`min_samples`), and is therefore generally
not advised.
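
The effect of subsetting on the number of classes can be illustrated with a toy
example. This is a sketch under assumptions: `keep_frequent_classes` is a
hypothetical helper, the labels are made up, and whether the real filter uses
`>=` or `>` for the `min_samples` threshold is not shown in this notebook.

```python
from collections import Counter


def keep_frequent_classes(targets, min_samples):
    """Return the set of classes with at least `min_samples` samples."""
    counts = Counter(targets)
    return {cls for cls, n in counts.items() if n >= min_samples}


# Toy labels: class "a" has 12 samples, "b" has 8 and "c" has 6
full = ["a"] * 12 + ["b"] * 8 + ["c"] * 6

# Keeping every second sample halves the per-class counts (6, 4 and 3)
half = full[::2]

classes_full = keep_frequent_classes(full, min_samples=6)
classes_half = keep_frequent_classes(half, min_samples=6)
```

With the full data all three classes pass the threshold, while after
subsetting only class `"a"` remains.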


Let's load the full training split again and do some exploratory data analysis.


In [47]:
# Demo: SPSimpleDataset
data = SPSimpleDataset(**kwargs, mode="train")

# Get some basic statistics
num_samples = len(data)
dim = data.dim
classes = np.unique([smp.annot for smp in data.samples])
unique_classes = len(classes)

print(f"ℹ️ SwissProt train split has {len(data)} samples")
print(f"ℹ️ Each sample is an encoded protein sequence of length {dim}")
print(f"ℹ️ SwissProt train split has {unique_classes} classes.")

ℹ️ SwissProt train split has 10795 samples
ℹ️ Each sample is an encoded protein sequence of length 1280
ℹ️ SwissProt train split has 168 classes.


We can get a single sample by indexing the dataset.


In [48]:
# Sample by indexing
x, y = data[0]

print(f"Training sample (encoded protein sequence): {x.shape} and target {y}")

Training sample (encoded protein sequence): torch.Size([1280]) and target 1474


And we can get a regular batch by using the data loader returned by the
`get_data_loader` method.


In [49]:
# Sample using data loader (use pin_memory=True if using GPU)
data_loader = data.get_data_loader(num_workers=0, pin_memory=False)

# Create iterator
data_iter = iter(data_loader)

# Get five batches
for batch_idx in range(5):
    xb, yb = next(data_iter)
    print(f"Batch {batch_idx+1} Sequence: {xb.shape} and target {yb}")

Batch 1 Sequence: torch.Size([10, 1280]) and target tensor([1239,  736, 1789, 3111, 4164, 1508, 2251, 1042, 2251, 1060])
Batch 2 Sequence: torch.Size([10, 1280]) and target tensor([1044, 2951, 1520, 1520, 1060,  736, 1789, 1789, 1796, 1789])
Batch 3 Sequence: torch.Size([10, 1280]) and target tensor([ 474, 1042,  960, 4501, 1044, 1524, 1044, 1044, 4118, 1380])
Batch 4 Sequence: torch.Size([10, 1280]) and target tensor([1812, 1044, 2145,  736,  736, 1044,  731, 3538, 1789, 1789])
Batch 5 Sequence: torch.Size([10, 1280]) and target tensor([1044, 1087,  731, 1042, 1789, 1789, 1380, 1396, 1042, 1789])


### SPSetDataset

The `SPSetDataset` is designed for few-shot learning. Most configurations are
the same as for the `SPSimpleDataset`, but crucially the dataset class will
return an episodic batch sampler based on the `n_way`, `n_support` and `n_query`
parameters.


In [50]:
# Demo: SPSetDataset
from datasets.prot.swissprot import SPSetDataset  # noqa

# Arguments to provide
root = "../data"
n_way = 5
n_support = 3
n_query = 3
subset = 1.0

kwargs = {
    "n_way": n_way,
    "n_support": n_support,
    "n_query": n_query,
    "root": root,
    "subset": subset,
}

modes = ["train", "val", "test"]
for mode in modes:
    start = time.time()
    data = SPSetDataset(**kwargs, mode=mode)
    print(
        f"✅ SPSetDataset {mode} split ({len(data)} class data loaders) loaded in {time.time() - start:.2f} seconds."
    )

✅ SPSetDataset train split (168 class data loaders) loaded in 1.09 seconds.
✅ SPSetDataset val split (33 class data loaders) loaded in 1.24 seconds.
✅ SPSetDataset test split (18 class data loaders) loaded in 1.01 seconds.


Let's load the full training split again and do some exploratory data analysis.


In [51]:
# Demo: SPSetDataset
data = SPSetDataset(**kwargs, mode="train")

In [52]:
# Get some basic statistics
num_samples = len(data)
dim = data.dim

print(f"ℹ️ SwissProt train split has {len(data)} classes")
print(f"ℹ️ Each sample is an encoded protein sequence of length {dim}")

ℹ️ SwissProt train split has 168 classes
ℹ️ Each sample is an encoded protein sequence of length 1280


Again, in few-shot learning a single "sample" is not a (feature vector, target)
pair but a set of support and query samples from a single class. Thus, the
`__getitem__` method returns a tuple of the support and query samples and
targets for the `i`-th class. The returned tensors therefore have shape
`(n_support + n_query, n_features)` for the samples and
`(n_support + n_query, )` for the targets.


In [53]:
# Sample by indexing
x, y = data[0]

print(
    f"Training sample (encoded protein sequence): feature shape {x.shape} and target shape {y.shape}"
)

Training sample (encoded protein sequence): feature shape torch.Size([6, 1280]) and target shape torch.Size([6])


Next, the data loader class combines the support-query sampler per class (as
defined above) and the `EpisodicBatchSampler` to create a data loader that
returns batches of episodes, where each episode contains `n_way` classes with
`n_support` support samples and `n_query` query samples per class. First, the
episodic batch sampler samples `n_way` random class indices and then the
support-query sampler samples the support and query samples for each class.
Thus, the final tensor shapes will be `(n_way, n_query + n_support, n_features)`
for the samples and `(n_way, n_query + n_support, )` for the targets.


In [54]:
# Sample using data loader (use pin_memory=True if using GPU)
data_loader = data.get_data_loader(num_workers=0, pin_memory=False)

# Get one batch
xb, yb = next(iter(data_loader))
print(f"Episode batch: {xb.shape} and target {yb.shape}")
print("Target")
print(yb)

Episode batch: torch.Size([5, 6, 1280]) and target torch.Size([5, 6])
Target
tensor([[4115, 4115, 4115, 4115, 4115, 4115],
        [3218, 3218, 3218, 3218, 3218, 3218],
        [3237, 3237, 3237, 3237, 3237, 3237],
        [ 960,  960,  960,  960,  960,  960],
        [1812, 1812, 1812, 1812, 1812, 1812]])


### Additional: Explore Raw Tabula Muris Data

**Note: Code here doesn't run anymore**

The `MacaData` class is responsible for loading and processing the Tabula Muris
dataset. Since `TMSimpleDataset` and `TMSetDataset` work on the processed data,
let's investigate the data loading and processing as well.

**Changes to the original implementation:**

Originally, the class loads the entire dataset and processes it within the
constructor. This comes with several limitations:

1. We cannot easily look at the raw data.

2. We have to load and preprocess the entire dataset, even if we just want to
   use samples within a specific data split.

3. It does not support any subsampling.

To account for this, the `MacaData` class has been augmented in the following
way.

1. Data processing is no longer performed inside the constructor but has to be
   called via the public method `process_data`.

2. The processed data may now be saved to disk via the public method
   `save_data`.

To support efficient loading of subsets and splits of the data, we later also
implement the `MacaDataLoader` class which will be used to load the data during
training.


#### Raw Data

We first look at the raw data. The `MacaData` class expects the path to a
`.h5ad` file containing the data as input and loads the data as well as computes
the class mapping.


In [29]:
# from datasets.cell.utils import MacaData  # noqa
#
# path = os.path.join("..", "data", "tabula_muris", "tabula-muris-comet.h5ad")
#
# start = time.time()
# maca_data = MacaData(path=path)
# print(f"⌛ Loaded data in {time.time() - start:.2f} seconds.")

In [30]:
#       # Save attributes
#       raw_data = maca_data.adata
#       trg2idx, idx2trg = maca_data.trg2idx, maca_data.idx2trg

The `MacaData` class stores the loaded data in the attribute `adata` (annotated
data) as an object of type `anndata.AnnData`, a data structure that stores data
together with its annotations and is often used for bioinformatics data. We can
get detailed information about the data by printing the object.


In [31]:
# Print meta-data of entire dataset
#       raw_data

We can view the annotation for each cell (sample) and each gene (feature) by
accessing the `obs` and `var` attributes of the `anndata.AnnData` object. The
`obs` attribute is a `pandas.DataFrame` with the cell annotations and the `var`
attribute is a `pandas.DataFrame` with the gene annotations.


In [32]:
# Cell annotations
#       raw_data.obs

We observe the metadata of each cell in the dataset. The metadata contains
information about the mice (like the id, gender, age, etc.) and the cell (like
the cell type, (sub-)tissue, etc.) and much more. There are a total of
105,960 cells in the dataset.


In [33]:
# Gene annotations
#       raw_data.var

In [34]:
#       # We can get the features and targets as numpy arrays (this is done in the TMDataset class as well)
#       feature_matrix = raw_data.X
#       targets = raw_data.obs["label"].cat.codes.to_numpy()
#
#       print(f"Feature matrix: {feature_matrix.shape}, Targets: {targets.shape}")
#       print(f"Number of target tissues: {len(np.unique(targets))}")

Let's visualise the distribution of the targets by showing the 10 most frequent
cell types.


In [35]:
#       # Plot Cell Type Distribution
#       _, ax = plt.subplots(figsize=(20, 10))
#       cell_types = [maca_data.idx2trg[idx] for idx in targets]
#
#       top_k = 10
#       counts = collections.Counter(cell_types)
#       counts = dict(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:top_k])
#
#       sns.barplot(x=list(counts.keys()), y=list(counts.values()), palette="mako", ax=ax)
#       ax.set(
#           xlabel="Cell type", ylabel="Count", title=f"Cell Type Distribution (Top {top_k})"
#       )
#       ax.set_xticklabels(ax.get_xticklabels(), fontsize=8)
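
Since the cell above no longer runs, here is a small self-contained sketch of
the counting step only. The labels are made up for illustration and are not
taken from the dataset; `Counter.most_common` gives the same result as the
manual sort-and-slice used in the commented-out code.

```python
from collections import Counter

# Hypothetical toy labels standing in for the per-cell cell type annotations
cell_types = (
    ["B cell"] * 5 + ["T cell"] * 3 + ["hepatocyte"] * 2 + ["keratinocyte"]
)

top_k = 3

# most_common returns (label, count) pairs sorted by decreasing count
top_counts = dict(Counter(cell_types).most_common(top_k))
```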

For each sample we record the gene expression levels for all genes. The gene
annotation contains some summary statistics about the expression of each gene
as metadata. The index of this data frame corresponds to the column names of
the gene expression feature matrix.


#### Pre-Processed Data

Let's preprocess the data according to the original implementation. The
following steps are performed in the `process_data` method:

- Filter out cells with no target
- Filter out genes that are expressed in fewer than 5 cells
- Filter out cells with fewer than 5000 counts and 500 genes expressed
- Normalize per cell (simple library-size normalization)
- Filter out genes with low dispersion (retain the ones with high variance)
- Log-transform and scale the data
- Zero-imputation of NaNs
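
The numerical steps in this list (normalisation, log-transform, scaling and
zero-imputation) can be sketched on a toy count matrix in plain Python. This is
an illustration under assumptions, not the `process_data` implementation: the
`preprocess` helper and the `target_sum` value are hypothetical, the filtering
steps are omitted, and every cell is assumed to have a non-zero total count.

```python
import math


def preprocess(counts, target_sum=1e4):
    """Normalise per cell, log-transform, then scale each gene (toy sketch)."""
    # 1) Library-size normalisation: rescale every cell (row) to the same total
    normed = [[v / sum(row) * target_sum for v in row] for row in counts]

    # 2) Log-transform with log(1 + x) so that zeros stay zero
    logged = [[math.log1p(v) for v in row] for row in normed]

    # 3) Scale each gene (column) to zero mean and unit variance
    n_cells, n_genes = len(logged), len(logged[0])
    out = [[0.0] * n_genes for _ in range(n_cells)]
    for j in range(n_genes):
        col = [logged[i][j] for i in range(n_cells)]
        mean = sum(col) / n_cells
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n_cells)
        for i in range(n_cells):
            # 4) A constant gene would give 0/0 = NaN, so impute zero instead
            out[i][j] = (col[i] - mean) / std if std > 0 else 0.0
    return out


# Toy matrix: 3 cells (rows) x 3 genes (columns); gene 1 is never expressed
counts = [[10, 0, 5], [20, 0, 10], [1, 0, 9]]
processed = preprocess(counts)
```

Note that the first two cells differ only in sequencing depth, so they end up
with the same processed values, which is the purpose of the per-cell
normalisation.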


In [36]:
# Process data
#       start = time.time()
#       maca_data.process_data()
#       print(f"⌛ Processed data in {time.time() - start:.2f} seconds.")

In [37]:
#       # Save attributes
#       processed_data = maca_data.adata
#       trg2idx, idx2trg = maca_data.trg2idx, maca_data.idx2trg

In [38]:
# Cell annotations
#       processed_data.obs

In [39]:
# Gene annotations
#       processed_data.var

In [40]:
# We can get the features and targets as numpy arrays (this is done in the TMDataset class as well)
# feature_matrix = processed_data.X
#       targets = processed_data.obs["label"].cat.codes.to_numpy()
#
#       print(f"Feature matrix: {feature_matrix.shape}, Targets: {targets.shape}")
#       print(f"Number of target tissues: {len(np.unique(targets))}")

In [41]:
#       # Plot Cell Type Distribution
#       _, ax = plt.subplots(figsize=(20, 10))
#       cell_types = [maca_data.idx2trg[idx] for idx in targets]
#
#       top_k = 10
#       counts = collections.Counter(cell_types)
#       counts = dict(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:top_k])
#
#       sns.barplot(x=list(counts.keys()), y=list(
#           counts.values()), palette="mako", ax=ax)
#       ax.set(
#           xlabel="Cell type", ylabel="Count", title=f"Cell Type Distribution (Top {top_k})"
#       )
#       ax.set_xticklabels(ax.get_xticklabels(), fontsize=8)