## The anndata package

### Students notebook

#### Organizing single cell data in Python

This notebook is the first in a series in which you will replicate some of the analyses presented in the coding lectures on the T-cell use case, this time using iTreg-activated T cells rather than resting ones. Let's start!

In [1]:
# importing anndata, numpy and pandas as we did in the coding lectures
import anndata as ad
import numpy as np
import pandas as pd

In [2]:
# exercise: read the scRNA-seq dataset from the './data/scde.h5ad' file and name it "scde"
scde = ad.read_h5ad('./data/scde.h5ad')
# your code here

print(scde)

AnnData object with n_obs × n_vars = 500 × 20953


In [None]:
# tip: use the anndata function for reading h5ad files!


Note that we have only 500 cells; this will allow us to proceed with the analysis faster.
Let's check the `obs` data frame:

In [3]:
# obs data frame
scde.obs

N_Th2_r1_GCATACACAGTAGAGC
N_Th2_r1_TAGAGCTGTTAAGATG
M_Th17_r1_TCTTTCCAGCACGCCT
M_Th17_r2_GAGGTGACAGGCGATA
N_Th17_r1_CTCGTACAGTGGCACA
...
M_iTreg_r2_GTGGGTCAGTGGGATC
M_iTreg_r2_TCGAGGCGTCAAAGAT
M_iTreg_r2_TCGCGAGGTGCAACTT
M_iTreg_r2_TCGTACCAGGTGGGTT
M_iTreg_r2_TCTTCGGAGCTGCCCA


It appears that the AnnData object holds no annotation for the selected cells: `obs` contains only the cell identifiers. However, we know that this information is stored in a CSV file, namely 'obs.csv'. Let's load it and assign the resulting data frame to `scde.obs`.

In [5]:
# exercise: read the ./data/obs.csv file into a pandas data frame called obs
obs = pd.read_csv('./data/obs.csv', index_col=0)
# your code here

print(obs)

                            cell.type cytokine.condition donor.id  batch.10X  \
N_Th2_r1_GCATACACAGTAGAGC       Naive                Th2       D3          1   
N_Th2_r1_TAGAGCTGTTAAGATG       Naive                Th2       D2          1   
M_Th17_r1_TCTTTCCAGCACGCCT     Memory               Th17       D3          1   
M_Th17_r2_GAGGTGACAGGCGATA     Memory               Th17       D3          2   
N_Th17_r1_CTCGTACAGTGGCACA      Naive               Th17       D2          1   
...                               ...                ...      ...        ...   
M_iTreg_r2_GTGGGTCAGTGGGATC    Memory              iTreg       D2          2   
M_iTreg_r2_TCGAGGCGTCAAAGAT    Memory              iTreg       D4          2   
M_iTreg_r2_TCGCGAGGTGCAACTT    Memory              iTreg       D4          2   
M_iTreg_r2_TCGTACCAGGTGGGTT    Memory              iTreg       D1          2   
M_iTreg_r2_TCTTCGGAGCTGCCCA    Memory              iTreg       D4          2   

                             nGene   nU

In [None]:
# tip: use the pandas function for reading csv files! The first column is the index column


In [6]:
# exercise: assign the obs data frame to scde.obs
# your code here
scde.obs = obs
print(scde.obs)

                            cell.type cytokine.condition donor.id  batch.10X  \
N_Th2_r1_GCATACACAGTAGAGC       Naive                Th2       D3          1   
N_Th2_r1_TAGAGCTGTTAAGATG       Naive                Th2       D2          1   
M_Th17_r1_TCTTTCCAGCACGCCT     Memory               Th17       D3          1   
M_Th17_r2_GAGGTGACAGGCGATA     Memory               Th17       D3          2   
N_Th17_r1_CTCGTACAGTGGCACA      Naive               Th17       D2          1   
...                               ...                ...      ...        ...   
M_iTreg_r2_GTGGGTCAGTGGGATC    Memory              iTreg       D2          2   
M_iTreg_r2_TCGAGGCGTCAAAGAT    Memory              iTreg       D4          2   
M_iTreg_r2_TCGCGAGGTGCAACTT    Memory              iTreg       D4          2   
M_iTreg_r2_TCGTACCAGGTGGGTT    Memory              iTreg       D1          2   
M_iTreg_r2_TCTTCGGAGCTGCCCA    Memory              iTreg       D4          2   

                             nGene   nU

In [None]:
# tip: it's a simple assignment! :-)
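
Note that a plain assignment only makes sense if the rows of `obs` are in the same order as the cells in `scde`. Assuming anndata does not reorder the rows by label for you, a quick check before assigning is cheap insurance. Here is a minimal sketch on toy data; the cell names and the `annotations` table are hypothetical, and only pandas is used:

```python
import pandas as pd

# Hypothetical cell identifiers, standing in for scde.obs_names
obs_names = pd.Index(["cell_a", "cell_b", "cell_c"])

# Hypothetical annotation table, deliberately stored in a different row order
annotations = pd.DataFrame(
    {"cell.type": ["Memory", "Naive", "Naive"]},
    index=["cell_c", "cell_a", "cell_b"],
)

# Same cells, same order? Only then is a plain assignment safe
same_order = annotations.index.equals(obs_names)

# Same cells, possibly in a different order?
same_cells = set(annotations.index) == set(obs_names)

# Reorder the rows to follow obs_names (raises KeyError if a cell is missing)
aligned = annotations.loc[obs_names]

print(same_order, same_cells)
print(aligned)
```

On the real data the equivalent check would be `obs.index.equals(scde.obs_names)`.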


Now that we have the cell annotation from the original study, we can check what types of cells are included.

In [7]:
# exercise: create a list of the cell types contained in the cluster.id column of obs
#cell_types = []
# your code here
cell_types = scde.obs['cluster.id'].unique().tolist()
print(cell_types)

['TN', 'TEM', 'TCM', 'TEMRA']


In [None]:
# tip: each cell type should be present once. Consider using the `unique()` and `tolist()` methods
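
If you also want to know how many cells fall into each type, `value_counts()` is a natural companion to `unique()`. The sketch below uses a hypothetical toy column rather than the real `obs`:

```python
import pandas as pd

# Hypothetical annotation column, standing in for scde.obs['cluster.id']
toy_obs = pd.DataFrame(
    {"cluster.id": ["TN", "TEM", "TN", "TCM", "TN", "TEMRA", "TEM"]}
)

# unique() keeps each label once, in order of first appearance
toy_cell_types = toy_obs["cluster.id"].unique().tolist()

# value_counts() reports how many cells carry each label
toy_counts = toy_obs["cluster.id"].value_counts()

print(toy_cell_types)
print(toy_counts)
```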

We now want to ensure that all cells have a sufficient number of reads before continuing our analyses. Thus, let's compute the sum of reads for each cell and store it in a numpy array.

In [8]:
# exercise: compute the number of reads for each cell
num_reads = scde.X.sum(axis=1)
# your code here

print(num_reads[0:10])

[[15615.]
 [18644.]
 [19328.]
 [18852.]
 [ 7192.]
 [41474.]
 [47778.]
 [22769.]
 [48416.]
 [ 7270.]]


In [None]:
# tip: simple sum across all genes for each cell
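
A detail worth knowing: the two-dimensional output above suggests that `X` is stored as a SciPy sparse matrix, in which case `sum(axis=1)` gives back a `numpy.matrix` with one column rather than a flat array. A minimal sketch, on a hypothetical toy count matrix, of how to flatten such a result into a plain one-dimensional numpy array:

```python
import numpy as np
from scipy import sparse

# Hypothetical toy counts: 3 cells (rows) x 3 genes (columns)
toy_X = sparse.csr_matrix(
    np.array([[5, 0, 2],
              [0, 0, 1],
              [3, 4, 0]], dtype=np.float32)
)

# Summing across genes on a sparse matrix yields a (3, 1) numpy.matrix
row_sums = toy_X.sum(axis=1)

# Convert to a regular ndarray and flatten to shape (3,)
toy_num_reads = np.asarray(row_sums).ravel()

print(row_sums.shape)
print(toy_num_reads)
```

The flat version is often more convenient later on, for example as a boolean mask or as a new column of `obs`.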

In [9]:
# exercise: remove cells with fewer than 10,000 reads in total. Name the new anndata object 'scder'
scder = scde[num_reads >= 10000, :]
# your code here

print(scder.n_obs)

424


In [None]:
# tip: recall how an AnnData object can be subset

We are almost done with this first phase of the analysis. We just need to save the reduced AnnData object in a new file, 'scder.h5ad'.

In [10]:
# making sure the 'scder.h5ad' file is not on the disk already
import os
if os.path.isfile('./data/scder.h5ad'):
    os.remove('./data/scder.h5ad')

In [11]:
# exercise: save the new AnnData object in a file named './data/scder.h5ad'
# your code here
scder.write('./data/scder.h5ad')

Trying to set attribute `.obs` of view, copying.
... storing 'cell.type' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cytokine.condition' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'donor.id' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'Phase' as categorical
Trying to set attribute `.obs` of view, copying.
... storing 'cluster.id' as categorical


In [None]:
# tip: use the AnnData method for writing h5ad files