## Overview


The goal of this sprint is to develop _rich_ (HTML/CSS/JS) representations of popular objects in the 
scverse ecosystem.

We will start with [`AnnData`](https://anndata.readthedocs.io/en/latest/), the core 
object for storing single-cell data that powers [Scanpy](https://scanpy.readthedocs.io/en/stable/) - a toolkit for analyzing single-cell gene
expression.

`AnnData` is a systematic way of storing and retrieving intermediate analysis results, like 
principal component scores, UMAP embeddings, cluster labels, etc., for single-cell experiments.

From the [paper](https://www.biorxiv.org/content/10.1101/2021.12.16.473007v1.full.pdf),

> The anndata library provides a canonical data structure for book-keeping [both original and learned single-cell annotations, as well as task-associated representations], a capability not addressed by pandas (McKinney, 2010), xarray (Hoyer & Hamman, 2017), or commonly-used modeling packages like scikit-learn (Pedregosa et al., 2011).


TL;DR - `AnnData` is a specialized container, built on familiar data structures such as NumPy/SciPy arrays and pandas DataFrames, 
that provides a standard way to operate on and store single-cell data.

## AnnData Model

Annotated data are stored as follows:

<img width="300" src="https://raw.githubusercontent.com/scverse/anndata/main/docs/_static/img/anndata_schema.svg" />

- `X`: Main data matrix with observations as rows and variables as columns (Array)
- `obs`: One-dimensional annotations for observations (DataFrame)
- `var`: One-dimensional annotations for variables (DataFrame)
- `obsm`: Multi-dimensional annotations for observations (Dictionary)
- `varm`: Multi-dimensional annotations for variables (Dictionary)
- `obsp`: Pairwise relationships among observations (Dictionary of square matrices, often sparse)
- `varp`: Pairwise relationships among variables (Dictionary of square matrices, often sparse)
- `uns`: Unstructured data associated with the dataset (Dictionary)



## Motivation

As you can see, `AnnData` objects can be complex, with many related data structures and features.

Let's take a closer look at an `AnnData` object, specifically this PBMC 3K dataset from the 
AnnData tutorial.

In [1]:
import anndata
import pooch

# download a dataset
datapath = pooch.retrieve(
    url="https://figshare.com/ndownloader/files/40067737",
    known_hash="md5:b80deb0997f96b45d06f19c694e46243",
    path="../data",
    fname="scverse-getting-started-anndata-pbmc3k_processed.h5ad",
)
# pooch.retrieve returns the local path of the downloaded file
adata = anndata.read_h5ad(datapath)
adata

AnnData object with n_obs × n_vars = 2638 × 11505
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain_cell_types'
    var: 'gene_names', 'n_cells', 'gene_ids'
    uns: 'louvain', 'louvain_colors', 'pca'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    layers: 'raw'
    obsp: 'distances_all'

As you can see, this simple text `__repr__` gives a nice high-level overview of the dataset,
but visually inspecting individual components like the main data matrix (`X`), the annotations (`obs`, `var`),
and related structures requires executing additional cells to get the nested _representations_ of the
underlying data objects.



In [2]:
adata.X

<2638x11505 sparse matrix of type '<class 'numpy.float32'>'
	with 2076576 stored elements in Compressed Sparse Row format>

In [3]:
adata.obs.head()

| cell_barcode     | n_genes | percent_mito | n_counts | louvain_cell_types |
|------------------|---------|--------------|----------|--------------------|
| AAACATACAACCAC-1 | 781     | 0.030178     | 2419.0   | CD4 T cells        |
| AAACATTGAGCTAC-1 | 1352    | 0.037936     | 4903.0   | B cells            |
| AAACATTGATCAGC-1 | 1131    | 0.008897     | 3147.0   | CD4 T cells        |
| AAACCGTGCTTCCG-1 | 960     | 0.017431     | 2639.0   | CD14+ Monocytes    |
| AAACCGTGTATGCG-1 | 522     | 0.012245     | 980.0    | NK cells           |


In [4]:
adata.var.head()

| gene_names | gene_names | n_cells | gene_ids        |
|------------|------------|---------|-----------------|
| LINC00115  | LINC00115  | 18      | ENSG00000225880 |
| NOC2L      | NOC2L      | 258     | ENSG00000188976 |
| KLHL17     | KLHL17     | 9       | ENSG00000187961 |
| PLEKHN1    | PLEKHN1    | 7       | ENSG00000187583 |
| HES4       | HES4       | 145     | ENSG00000188290 |


### What can we do?

IPython (and thus Jupyter) has [special "hooks"](https://ipython.readthedocs.io/en/stable/config/integrating.html) to display richer representations of objects using technologies beyond simple text. This feature lets you give your own Python objects rich displays.

In fact, you have already seen this feature with the pandas DataFrame `adata.obs` above. See what happens when we just run the cell:


In [5]:
adata.obs

| cell_barcode     | n_genes | percent_mito | n_counts | louvain_cell_types |
|------------------|---------|--------------|----------|--------------------|
| AAACATACAACCAC-1 | 781     | 0.030178     | 2419.0   | CD4 T cells        |
| AAACATTGAGCTAC-1 | 1352    | 0.037936     | 4903.0   | B cells            |
| AAACATTGATCAGC-1 | 1131    | 0.008897     | 3147.0   | CD4 T cells        |
| AAACCGTGCTTCCG-1 | 960     | 0.017431     | 2639.0   | CD14+ Monocytes    |
| AAACCGTGTATGCG-1 | 522     | 0.012245     | 980.0    | NK cells           |
| ...              | ...     | ...          | ...      | ...                |
| TTTCGAACTCTCAT-1 | 1155    | 0.021104     | 3459.0   | CD14+ Monocytes    |
| TTTCTACTGAGGCA-1 | 1227    | 0.009294     | 3443.0   | B cells            |
| TTTCTACTTCCTCG-1 | 622     | 0.021971     | 1684.0   | B cells            |
| TTTGCATGAGAGGC-1 | 454     | 0.020548     | 1022.0   | B cells            |


We get a richly formatted table in the notebook that uses HTML/CSS. We can hover over rows
and get high-level information about our data. This is possible because the `pd.DataFrame` object implements
some "special" methods that tell Jupyter it has a _rich_, HTML-based representation of the object.

Jupyter detects these methods and then will prefer to render this nicer representation within a web-based notebook. 
Other environments may need to render the raw text (if they run outside of the browser). To access the original
text representation, we can still call `print()` or `repr()` on the object:

In [6]:
print(adata.obs)

                  n_genes  percent_mito  n_counts louvain_cell_types
cell_barcode                                                        
AAACATACAACCAC-1      781      0.030178    2419.0        CD4 T cells
AAACATTGAGCTAC-1     1352      0.037936    4903.0            B cells
AAACATTGATCAGC-1     1131      0.008897    3147.0        CD4 T cells
AAACCGTGCTTCCG-1      960      0.017431    2639.0    CD14+ Monocytes
AAACCGTGTATGCG-1      522      0.012245     980.0           NK cells
...                   ...           ...       ...                ...
TTTCGAACTCTCAT-1     1155      0.021104    3459.0    CD14+ Monocytes
TTTCTACTGAGGCA-1     1227      0.009294    3443.0            B cells
TTTCTACTTCCTCG-1      622      0.021971    1684.0            B cells
TTTGCATGAGAGGC-1      454      0.020548    1022.0            B cells
TTTGCATGCCTCAC-1      724      0.008065    1984.0        CD4 T cells

[2638 rows x 4 columns]


IPython supports enhanced object representations through custom `_repr_*_()` methods, enabling objects to be displayed in formats beyond standard text. If these methods are absent or return `None`, the basic (text) `repr()` is used.

An object can define several of these methods, and the frontend chooses which one to display.
Each method should return data in its specific format and have no side effects.


| Format             | REPL | Notebook | Qt Console |
|--------------------|------|----------|------------|
| `_repr_pretty_`    | yes  | yes      | yes        |
| `_repr_svg_`       | no   | yes      | yes        |
| `_repr_png_`       | no   | yes      | yes        |
| `_repr_jpeg_`      | no   | yes      | yes        |
| `_repr_html_`      | no   | yes      | no         |
| `_repr_javascript_`| no   | yes      | no         |
| `_repr_markdown_`  | no   | yes      | no         |
| `_repr_latex_`     | no   | yes      | no         |
| `_repr_mimebundle_`| no   | ?        | ?          |
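
As a minimal sketch of how these hooks can coexist on one object, the class below offers a text, an HTML, and a Markdown representation, and opts out of the rich formats by returning `None`. The `Temperature` class is hypothetical and exists only for illustration; it is not part of IPython or any scverse library.

```python
class Temperature:
    """Hypothetical example object with text, HTML, and Markdown reprs."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __repr__(self):
        # Plain-text fallback, used by print(), repr(), and terminal REPLs.
        return f"Temperature(celsius={self.celsius})"

    def _repr_html_(self):
        # Returning None signals that no HTML representation is available.
        if self.celsius is None:
            return None
        return f"<b>{self.celsius} &deg;C</b>"

    def _repr_markdown_(self):
        # Same opt-out rule for the Markdown representation.
        if self.celsius is None:
            return None
        return f"**{self.celsius} °C**"


t = Temperature(21.5)
print(repr(t))                          # text representation
print(t._repr_html_())                  # HTML string a notebook would render
print(Temperature(None)._repr_html_())  # None, so the text repr is used
```

Calling the methods by hand like this is a convenient way to check what a frontend would receive, without needing a running notebook.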

We can see that `adata.obs` implements the `_repr_html_` method, which returns HTML. Jupyter detects this method and automatically calls it when displaying the cell's output. We can see the resulting string by calling the method ourselves:


In [7]:
print(adata.obs.head(2)._repr_html_())

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>n_genes</th>
      <th>percent_mito</th>
      <th>n_counts</th>
      <th>louvain_cell_types</th>
    </tr>
    <tr>
      <th>cell_barcode</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>AAACATACAACCAC-1</th>
      <td>781</td>
      <td>0.030178</td>
      <td>2419.0</td>
      <td>CD4 T cells</td>
    </tr>
    <tr>
      <th>AAACATTGAGCTAC-1</th>
      <td>1352</td>
      <td>0.037936</td>
      <td>4903.0</td>
      <td>B cells</td>
    </tr>
  </tbody>
</table>
</div>


## Your first `_repr_html_`

We can implement a custom repr on a simple object of our own:

In [8]:
class Welcome:
    def __init__(self, name):
        self.name = name 

    def _repr_html_(self):
        return f"""
        <style>
            .scverse-hackathon {{
                max-width: 300px;
                display: grid;
                place-items: center;
            }}
        </style>
        <div class="scverse-hackathon">
            <img src="https://hms-dbmi.github.io/scverse-hackathon-spring-2024/logo.svg" />
            <p>Welcome to the hackathon, {self.name}!</p>
        </div>
        """

Welcome(name="Lisa Simpson")
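
Because `_repr_html_` returns markup that the notebook renders as-is, any string interpolated into it is treated as HTML. A minimal sketch, assuming the name may contain arbitrary characters, escapes it with the standard library's `html.escape` first. `SafeWelcome` is a hypothetical, stripped-down variant of the `Welcome` class used only to illustrate the idea.

```python
from html import escape


class SafeWelcome:
    """Hypothetical variant of Welcome that escapes its input."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # Text fallback for environments that cannot render HTML.
        return f"SafeWelcome(name={self.name!r})"

    def _repr_html_(self):
        # escape() replaces &, <, > and quotes with HTML entities,
        # so the name is shown literally instead of being parsed as markup.
        return f"<p>Welcome to the hackathon, {escape(self.name)}!</p>"


print(SafeWelcome("<b>Bart</b>")._repr_html_())
```

This matters for data containers in particular, since column names and labels often come from files rather than from the person viewing the notebook.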

## Other libraries / inspiration

Both dask and xarray provide very nice, expandable HTML reprs for the objects associated with these frameworks. I think these could serve as a source of inspiration for our work with `AnnData`.

For example, Dask provides a high-level view of multidimensional arrays with some additional metadata about memory/chunk sizes:

In [9]:
import numpy as np
import xarray as xr
import dask.array as da

data = np.random.randn(2, 10_000)
ddata = da.array(data)
ddata # dask array

|            | Array                     | Chunk                     |
|------------|---------------------------|---------------------------|
| Bytes      | 156.25 kiB                | 156.25 kiB                |
| Shape      | (2, 10000)                | (2, 10000)                |
| Dask graph | 1 chunks in 1 graph layer | 1 chunks in 1 graph layer |
| Data type  | float64 numpy.ndarray     | float64 numpy.ndarray     |


The `xarray.DataArray` extends this further by adding _collapsible_ metadata about the components beyond the main multidimensional array container. I think xarray's notion of coordinates could be a real inspiration for our work.

In [10]:
xrdata = xr.DataArray(ddata, dims=("x", "y"), coords={"x": [10, 20]})
xrdata

|            | Array                     | Chunk                     |
|------------|---------------------------|---------------------------|
| Bytes      | 156.25 kiB                | 156.25 kiB                |
| Shape      | (2, 10000)                | (2, 10000)                |
| Dask graph | 1 chunks in 1 graph layer | 1 chunks in 1 graph layer |
| Data type  | float64 numpy.ndarray     | float64 numpy.ndarray     |


Click on the arrows and the database icons to expand and collapse fields.

This is a very nice user experience for understanding the dataset at a high level. Another benefit is that SVG-based graphics can be exported and embedded as figures to display the datasets!

You can see a more complete version of a `xarray.DataArray` with their built-in example dataset:


In [11]:
ds = xr.tutorial.load_dataset("air_temperature")
ds.air

Notice how we could expand/collapse obs & var for `AnnData` in a table like _Coordinates_.
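
One way to prototype that idea, offered as a rough sketch rather than a proposed design, is to wrap each slot in native HTML `<details>`/`<summary>` elements, which expand and collapse without any JavaScript. The `SlotSummary` class and its `slots` mapping are hypothetical stand-ins; a real implementation would read the keys from an `AnnData` object instead.

```python
from html import escape


class SlotSummary:
    """Hypothetical stand-in mapping slot names (obs, var, ...) to their keys."""

    def __init__(self, n_obs, n_vars, slots):
        self.n_obs = n_obs
        self.n_vars = n_vars
        self.slots = slots  # e.g. {"obs": ["n_genes", "percent_mito"]}

    def __repr__(self):
        # Text fallback, loosely modeled on the AnnData text repr.
        lines = [f"SlotSummary with n_obs × n_vars = {self.n_obs} × {self.n_vars}"]
        for slot, keys in self.slots.items():
            lines.append(f"    {slot}: " + ", ".join(repr(k) for k in keys))
        return "\n".join(lines)

    def _repr_html_(self):
        sections = []
        for slot, keys in self.slots.items():
            # One collapsible section per slot, listing its keys.
            items = "".join(f"<li><code>{escape(k)}</code></li>" for k in keys)
            sections.append(
                f"<details><summary>{escape(slot)} ({len(keys)})</summary>"
                f"<ul>{items}</ul></details>"
            )
        header = f"<p>n_obs × n_vars = {self.n_obs} × {self.n_vars}</p>"
        body = "".join(sections)
        return f"<div>{header}{body}</div>"


summary = SlotSummary(
    2638,
    11505,
    {"obs": ["n_genes", "percent_mito"], "var": ["gene_ids"]},
)
html_out = summary._repr_html_()
print(html_out)
```

In a notebook, evaluating `summary` as the last expression of a cell would render the collapsible sections; elsewhere the text `__repr__` is used.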