# Getting started with anndata

This notebook follows a tutorial on the AnnData library for single-cell data analysis in Python.

AnnData is meant to handle matrix data (e.g. cell-by-gene) that has metadata (e.g. donor information for each cell, or alternative gene symbols for each gene). It also naturally handles sparsity and unstructured data.

In [1]:
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix 
print(ad.__version__)

0.10.9


### Initializing AnnData

In [2]:
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
adata

AnnData object with n_obs × n_vars = 100 × 2000

In [3]:
# adata.X holds the sparse matrix with the data
adata.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 126422 stored elements and shape (100, 2000)>
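
To make the stored-element count above concrete, here is a minimal sketch using only NumPy and SciPy (the names `dense`, `sparse`, and `density` are illustrative and not part of the tutorial). For Poisson counts with mean 1, roughly 1 - exp(-1), or about 63%, of the entries should be non-zero, which is in line with 126422 stored elements out of 200000.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
dense = rng.poisson(1, size=(100, 2000))
sparse = csr_matrix(dense, dtype=np.float32)

# nnz counts the explicitly stored entries; zeros are not stored
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"{sparse.nnz} stored elements, density = {density:.3f}")
```

Real single-cell count matrices are usually far sparser than this simulated one, so the memory savings from a sparse format tend to be larger in practice.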

In [4]:
# provide indices for both "obs" (row) and "var" (col) axes using
#   .obs_names and .var_names
adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')


### Subsetting AnnData

We can use the index names specified above to subset our data by name.

In [5]:
adata[["Cell_1", "Cell_10"], ["Gene_5", "Gene_1900"]]

View of AnnData object with n_obs × n_vars = 2 × 2

### Adding aligned metadata

We can add metadata to both rows and columns of adata, since adata.obs and adata.var are pandas DataFrames.

In [6]:
ct = np.random.choice(["B", "T", "Monocyte"], size=(adata.n_obs,))  # random cell type labels
adata.obs["cell_type"] = pd.Categorical(ct)  # Categoricals are preferred for efficiency
adata.obs

Unnamed: 0,cell_type
Cell_0,Monocyte
Cell_1,T
Cell_2,B
Cell_3,T
Cell_4,B
...,...
Cell_95,T
Cell_96,B
Cell_97,B
Cell_98,T
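
The comment in the cell above says categoricals are preferred for efficiency. A rough way to see why, using only pandas, is to compare the memory footprint of the same labels stored as plain Python strings and as a categorical. The sizes printed depend on the pandas version and platform, so treat this as a sketch; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

labels = np.random.choice(["B", "T", "Monocyte"], size=10_000)

as_strings = pd.Series(labels, dtype=object)         # one Python string per row
as_categorical = pd.Series(pd.Categorical(labels))   # integer codes plus 3 categories

bytes_strings = as_strings.memory_usage(index=False, deep=True)
bytes_categorical = as_categorical.memory_usage(index=False, deep=True)
print(f"strings: {bytes_strings} bytes, categorical: {bytes_categorical} bytes")
```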


In [7]:
# let's see how the AnnData representation has been updated
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'

In [8]:
# we can also subset the data using metadata such as our "cell types"
bdata = adata[adata.obs.cell_type == "B"]
bdata

View of AnnData object with n_obs × n_vars = 33 × 2000
    obs: 'cell_type'

In [9]:
# add a second annotation: a random "cell size" value for each cell
cell_size = np.random.uniform(0, 1, size=adata.n_obs)
adata.obs["cell_size"] = cell_size
adata.obs

Unnamed: 0,cell_type,cell_size
Cell_0,Monocyte,0.588107
Cell_1,T,0.853403
Cell_2,B,0.653777
Cell_3,T,0.403350
Cell_4,B,0.413571
...,...,...
Cell_95,T,0.442407
Cell_96,B,0.697304
Cell_97,B,0.371230
Cell_98,T,0.073341


In [10]:
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type', 'cell_size'

### Observation / variable-level matrices

We can also store metadata with more than one dimension per observation or variable, such as UMAP embeddings, using the .obsm and .varm attributes. Each entry is identified by a key, and its first dimension must match n_obs (for .obsm) or n_vars (for .varm).

In [12]:
adata.obsm['X_umap'] = np.random.normal(0, 1, size=(adata.n_obs, 2))
adata.obsm['X_tsne'] = np.random.normal(0, 1, size=(adata.n_obs, 3))
adata.varm['gene_stuff'] = np.random.normal(0, 1, size=(adata.n_vars, 5))
adata.obsm

AxisArrays with keys: X_umap, X_tsne

In [13]:
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type', 'cell_size'
    obsm: 'X_umap', 'X_tsne'
    varm: 'gene_stuff'

### Unstructured metadata

AnnData has .uns, which holds any unstructured metadata. This could be anything, like a list or a dictionary holding general information. It does not have to be one entry per obs or var.

In [14]:
adata.uns["random"] = [1, 2, 3]
adata.uns

OrderedDict([('random', [1, 2, 3])])

### Layers

We may have different forms of our original core data, such as one that is normalized and one that is not. We can store these as different "layers" in the same AnnData object.

In [15]:
# example: log-transform data and store it in a layer
adata.layers["log_transformed"] = np.log1p(adata.X)
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type', 'cell_size'
    uns: 'random'
    obsm: 'X_umap', 'X_tsne'
    varm: 'gene_stuff'
    layers: 'log_transformed'

### Conversion to DataFrames

We can also ask AnnData to return a DataFrame from one of the layers.

In [16]:
adata.to_df(layer="log_transformed")

Unnamed: 0,Gene_0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,...,Gene_1990,Gene_1991,Gene_1992,Gene_1993,Gene_1994,Gene_1995,Gene_1996,Gene_1997,Gene_1998,Gene_1999
Cell_0,1.098612,0.000000,0.000000,0.693147,0.000000,0.693147,0.693147,1.098612,0.693147,0.000000,...,0.693147,0.000000,1.098612,1.098612,0.693147,1.098612,0.693147,0.000000,0.693147,1.386294
Cell_1,0.693147,1.098612,1.098612,0.000000,0.693147,0.693147,1.098612,1.609438,0.693147,1.609438,...,1.609438,1.098612,0.693147,1.098612,1.098612,1.098612,0.000000,0.000000,1.386294,0.693147
Cell_2,0.693147,0.000000,0.000000,0.693147,0.000000,0.693147,0.000000,0.693147,0.693147,0.693147,...,0.693147,1.609438,0.693147,0.693147,1.386294,0.000000,0.000000,0.693147,0.693147,0.000000
Cell_3,0.000000,1.098612,0.693147,0.693147,0.693147,0.000000,0.693147,1.098612,0.000000,0.000000,...,0.693147,0.693147,1.386294,0.000000,0.693147,0.693147,0.000000,0.000000,0.000000,1.098612
Cell_4,1.098612,0.000000,1.098612,0.000000,0.693147,0.000000,0.693147,0.693147,0.000000,0.000000,...,0.693147,1.098612,0.693147,0.000000,0.693147,0.693147,1.098612,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cell_95,0.693147,0.000000,1.098612,0.693147,1.386294,0.000000,0.693147,0.000000,1.098612,0.000000,...,0.693147,1.098612,0.693147,1.098612,1.098612,0.000000,0.693147,0.000000,0.000000,1.098612
Cell_96,0.693147,0.000000,0.693147,0.000000,0.693147,0.000000,0.693147,0.000000,1.098612,1.098612,...,0.000000,0.693147,0.693147,0.000000,0.693147,0.000000,0.000000,0.693147,0.000000,1.098612
Cell_97,0.693147,0.000000,1.098612,0.000000,0.693147,0.000000,0.693147,0.693147,0.693147,1.791759,...,0.693147,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.098612,0.693147
Cell_98,0.693147,0.693147,0.000000,0.693147,0.693147,0.000000,0.000000,1.609438,0.693147,0.693147,...,1.098612,0.693147,0.000000,0.000000,1.098612,0.000000,0.693147,0.693147,0.000000,0.000000


### Writing results to disk and reading back in

AnnData comes with its own HDF5-based file format, h5ad. We can use AnnData.write to save the whole object (matrix, annotations, and layers) to disk. String columns with a small number of distinct labels are converted to categoricals automatically on write.

In [17]:
adata.write('my_results.h5ad', compression='gzip')

In [18]:
!h5ls 'my_results.h5ad'

'h5ls' is not recognized as an internal or external command,
operable program or batch file.


In [20]:
# load the data
mydat = ad.read_h5ad('my_results.h5ad')
mydat

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type', 'cell_size'
    uns: 'random'
    obsm: 'X_tsne', 'X_umap'
    varm: 'gene_stuff'
    layers: 'log_transformed'

In [None]:
# if a single h5ad is very large, you can partially read it into memory using "backed mode"
# here we open it in read only mode (r) to avoid any risks
bigdat = ad.read_h5ad('my_results.h5ad', backed='r')
bigdat.isbacked

True

In [23]:
# if you do this, you'll need to remember that the AnnData object has an open connection to the file used for reading
bigdat.filename

WindowsPath('my_results.h5ad')

In [24]:
# close the connection
bigdat.file.close()