## The anndata package

### Students notebook

#### Organizing single cell data in Python

This notebook is the first in a series in which you will replicate some of the analyses presented in the coding lectures on the T-cell use case, this time using iTreg-activated T cells rather than resting ones. Let's start!

In [32]:
# importing anndata, numpy and pandas as we did in the coding lectures, plus os and scanpy
import anndata as ad
import numpy as np
import pandas as pd
import os
import scanpy as sc

# exercise: read the scRNA-seq dataset from the './data/scde.h5ad' file and name it "scde"
# The ad.read_h5ad function is the standard way to load AnnData objects.
scde = ad.read_h5ad('./data/scde.h5ad')

print("AnnData object loaded:")
print(scde)
# ---
# exercise: read the ./data/obs.csv file into a pandas data frame called obs
# We use pandas to read the cell metadata, specifying the first column (index 0) as the index.
obs = pd.read_csv('./data/obs.csv', index_col=0)

print("\nCell metadata loaded:")
print(obs.head())
# exercise: assign the obs data frame to scde.obs
# Now, we attach the loaded metadata directly to our AnnData object's .obs attribute.
scde.obs = obs

print("\nAnnData object with updated .obs attribute:")
print(scde.obs.head())
# ---
# exercise: create a list indicating the cell types contained in the cluster.id column of obs.
# The .unique() method finds all unique entries, and .tolist() converts them to a Python list.
cell_types = scde.obs['cluster.id'].unique().tolist()

print(f"\nUnique cell types found: {cell_types}")
# ---

# A more standard approach using scanpy for QC
# This function calculates total_counts (same as your num_reads) and other metrics
sc.pp.calculate_qc_metrics(scde, inplace=True)

print("\nQC metrics calculated and added to scde.obs:")
print(scde.obs[['n_genes_by_counts', 'total_counts']].head())

# exercise: remove cells with fewer than 10000 reads in total.
# We can now filter directly using the 'total_counts' column in .obs
scder = scde[scde.obs['total_counts'] >= 10000, :].copy()

print(f"\nNumber of cells after filtering: {scder.n_obs}")
# ---
# making sure the 'scder.h5ad' file is not on the disk already
if os.path.isfile('./data/scder.h5ad'):
    os.remove('./data/scder.h5ad')
    print("\nRemoved existing 'scder.h5ad' file.")
# exercise: save the new scanpy object in a file named './data/scder.h5ad'
# The .write() method is used to save the AnnData object to a file.
scder.write('./data/scder.h5ad')

print("Filtered AnnData object saved to './data/scder.h5ad'.")

AnnData object loaded:
AnnData object with n_obs × n_vars = 500 × 20953

Cell metadata loaded:
                           cell.type cytokine.condition donor.id  batch.10X  \
N_Th2_r1_GCATACACAGTAGAGC      Naive                Th2       D3          1   
N_Th2_r1_TAGAGCTGTTAAGATG      Naive                Th2       D2          1   
M_Th17_r1_TCTTTCCAGCACGCCT    Memory               Th17       D3          1   
M_Th17_r2_GAGGTGACAGGCGATA    Memory               Th17       D3          2   
N_Th17_r1_CTCGTACAGTGGCACA     Naive               Th17       D2          1   

                            nGene   nUMI  percent.mito   S.Score  G2M.Score  \
N_Th2_r1_GCATACACAGTAGAGC    2890  15616      0.052834 -0.141680  -0.229996   
N_Th2_r1_TAGAGCTGTTAAGATG    3672  18645      0.040603  0.009671  -0.051532   
M_Th17_r1_TCTTTCCAGCACGCCT   4375  19334      0.019609 -0.239274  -0.278496   
M_Th17_r2_GAGGTGACAGGCGATA   3714  18852      0.036336  0.111576   0.663065   
N_Th17_r1_CTCGTACAGTGGCACA   1899  

... storing 'cell.type' as categorical
... storing 'cytokine.condition' as categorical
... storing 'donor.id' as categorical
... storing 'Phase' as categorical
... storing 'cluster.id' as categorical



QC metrics calculated and added to scde.obs:
                            n_genes_by_counts  total_counts
N_Th2_r1_GCATACACAGTAGAGC                2889       15615.0
N_Th2_r1_TAGAGCTGTTAAGATG                3671       18644.0
M_Th17_r1_TCTTTCCAGCACGCCT               4370       19328.0
M_Th17_r2_GAGGTGACAGGCGATA               3714       18852.0
N_Th17_r1_CTCGTACAGTGGCACA               1898        7192.0

Number of cells after filtering: 424

Removed existing 'scder.h5ad' file.
Filtered AnnData object saved to './data/scder.h5ad'.


In [33]:
# tip: use the scanpy function for reading h5ad files!


Note that we have only 500 cells; this will allow us to proceed with the analysis faster.
Let's check the `obs` data frame:

In [34]:
# obs data frame
scde.obs

Unnamed: 0,cell.type,cytokine.condition,donor.id,batch.10X,nGene,nUMI,percent.mito,S.Score,G2M.Score,Phase,cluster.id,effectorness,n_genes_by_counts,log1p_n_genes_by_counts,total_counts,log1p_total_counts,pct_counts_in_top_50_genes,pct_counts_in_top_100_genes,pct_counts_in_top_200_genes,pct_counts_in_top_500_genes
N_Th2_r1_GCATACACAGTAGAGC,Naive,Th2,D3,1,2890,15616,0.052834,-0.141680,-0.229996,G1,TN,0.359369,2889,7.969012,15615.0,9.656052,43.944925,62.747358,71.386487,79.750240
N_Th2_r1_TAGAGCTGTTAAGATG,Naive,Th2,D2,1,3672,18645,0.040603,0.009671,-0.051532,S,TN,0.330299,3671,8.208492,18644.0,9.833333,37.851319,53.320103,62.953229,73.476722
M_Th17_r1_TCTTTCCAGCACGCCT,Memory,Th17,D3,1,4375,19334,0.019609,-0.239274,-0.278496,G1,TEM,0.084227,4370,8.382747,19328.0,9.869362,28.119826,40.599131,50.077608,62.805257
M_Th17_r2_GAGGTGACAGGCGATA,Memory,Th17,D3,2,3714,18852,0.036336,0.111576,0.663065,G2M,TCM,0.566855,3714,8.220134,18852.0,9.844427,32.601316,46.387651,58.354551,70.878421
N_Th17_r1_CTCGTACAGTGGCACA,Naive,Th17,D2,1,1899,7193,0.015434,-0.138268,-0.193798,G1,TN,0.365239,1898,7.549083,7192.0,8.880863,42.171858,55.478309,65.600667,77.989433
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
M_iTreg_r2_GTGGGTCAGTGGGATC,Memory,iTreg,D2,2,4466,25602,0.047889,0.137482,-0.198448,S,TCM,0.905753,4465,8.404248,25601.0,10.150426,32.951838,46.822390,57.634467,69.692590
M_iTreg_r2_TCGAGGCGTCAAAGAT,Memory,iTreg,D4,2,4184,19147,0.023504,-0.272980,-0.257085,G1,TEM,0.879334,4183,8.339023,19146.0,9.859901,29.541419,41.643163,51.556461,64.378983
M_iTreg_r2_TCGCGAGGTGCAACTT,Memory,iTreg,D4,2,3823,15027,0.022027,0.413030,0.257933,S,TEM,0.888103,3823,8.249052,15027.0,9.617670,23.684035,33.453118,44.719505,60.730685
M_iTreg_r2_TCGTACCAGGTGGGTT,Memory,iTreg,D1,2,4054,15939,0.029238,0.395395,-0.088206,S,TEM,0.905234,4053,8.307459,15938.0,9.676524,27.926967,37.545489,48.437696,62.416865


It appears that the AnnData object does not carry any information on the selected cells. However, we know that this information is stored in a CSV file, namely 'obs.csv'. Let's load it and assign the resulting data frame to `scde.obs`.

In [35]:
import pandas as pd

# exercise: read the ./data/obs.csv file into a pandas data frame called obs
# We use pandas to read the cell metadata, specifying the first column (index 0) as the index.
obs = pd.read_csv('./data/obs.csv', index_col=0)

# Print the first few rows to verify it loaded correctly
print(obs.head())


                           cell.type cytokine.condition donor.id  batch.10X  \
N_Th2_r1_GCATACACAGTAGAGC      Naive                Th2       D3          1   
N_Th2_r1_TAGAGCTGTTAAGATG      Naive                Th2       D2          1   
M_Th17_r1_TCTTTCCAGCACGCCT    Memory               Th17       D3          1   
M_Th17_r2_GAGGTGACAGGCGATA    Memory               Th17       D3          2   
N_Th17_r1_CTCGTACAGTGGCACA     Naive               Th17       D2          1   

                            nGene   nUMI  percent.mito   S.Score  G2M.Score  \
N_Th2_r1_GCATACACAGTAGAGC    2890  15616      0.052834 -0.141680  -0.229996   
N_Th2_r1_TAGAGCTGTTAAGATG    3672  18645      0.040603  0.009671  -0.051532   
M_Th17_r1_TCTTTCCAGCACGCCT   4375  19334      0.019609 -0.239274  -0.278496   
M_Th17_r2_GAGGTGACAGGCGATA   3714  18852      0.036336  0.111576   0.663065   
N_Th17_r1_CTCGTACAGTGGCACA   1899   7193      0.015434 -0.138268  -0.193798   

                           Phase cluster.id  effec

In [36]:
# tip: use the pandas function for reading csv files! The first column is the index column
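
To see what the index column option changes, here is a minimal sketch on a tiny in-memory CSV. The two rows and the names `csv_text`, `obs_demo` and `default_demo` are made up for illustration; only `pd.read_csv` and its `index_col` parameter come from pandas.

```python
import io

import pandas as pd

# Made-up CSV text: the first column has an empty header and holds cell names.
csv_text = """,cell.type,cluster.id
cell_a,Naive,TN
cell_b,Memory,TEM
"""

# With index_col=0 the first column becomes the row index.
obs_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps the column as data and names it 'Unnamed: 0'.
default_demo = pd.read_csv(io.StringIO(csv_text))

print(obs_demo.index.tolist())
print(default_demo.columns.tolist())
```

Using the cell names as the index is what lets the data frame line up with the cells of an AnnData object later on.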


In [49]:
import pandas as pd
import scanpy as sc

# This cell assumes the AnnData object `scde` has already been loaded.

# Load the cell metadata from './data/obs.csv', using the first column
# (index_col=0) as the index.
obs_metadata = pd.read_csv('./data/obs.csv', index_col=0)

# Attach the metadata data frame to the AnnData object's .obs attribute.
scde.obs = obs_metadata

# Confirm the assignment by printing the first rows.
print("Cell metadata has been successfully loaded and assigned:")
print(scde.obs.head())

# List the unique cell types, if the 'cluster.id' column is present.
if 'cluster.id' in scde.obs.columns:
    cell_types = scde.obs['cluster.id'].unique().tolist()
    print(f"\nUnique cell types found: {cell_types}")


Cell metadata has been successfully loaded and assigned:
                           cell.type cytokine.condition donor.id  batch.10X  \
N_Th2_r1_GCATACACAGTAGAGC      Naive                Th2       D3          1   
N_Th2_r1_TAGAGCTGTTAAGATG      Naive                Th2       D2          1   
M_Th17_r1_TCTTTCCAGCACGCCT    Memory               Th17       D3          1   
M_Th17_r2_GAGGTGACAGGCGATA    Memory               Th17       D3          2   
N_Th17_r1_CTCGTACAGTGGCACA     Naive               Th17       D2          1   

                            nGene   nUMI  percent.mito   S.Score  G2M.Score  \
N_Th2_r1_GCATACACAGTAGAGC    2890  15616      0.052834 -0.141680  -0.229996   
N_Th2_r1_TAGAGCTGTTAAGATG    3672  18645      0.040603  0.009671  -0.051532   
M_Th17_r1_TCTTTCCAGCACGCCT   4375  19334      0.019609 -0.239274  -0.278496   
M_Th17_r2_GAGGTGACAGGCGATA   3714  18852      0.036336  0.111576   0.663065   
N_Th17_r1_CTCGTACAGTGGCACA   1899   7193      0.015434 -0.138268  -0.1937

In [38]:
# tip: it's a simple assignment! :-)
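
Before such an assignment it can be worth confirming that the metadata rows are in the same order as the cells, since it is not safe to assume the assignment reorders rows for us. The sketch below uses pandas only; `cell_names` stands in for `scde.obs_names`, and all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical cell names, in the order they appear in the expression matrix.
cell_names = ["cell_a", "cell_b", "cell_c"]

# Hypothetical metadata whose rows are in a different order.
obs_demo = pd.DataFrame(
    {"cluster.id": ["TEM", "TN", "TCM"]},
    index=["cell_b", "cell_a", "cell_c"],
)

# Check that both describe exactly the same cells.
same_cells = set(obs_demo.index) == set(cell_names)

# Reorder the metadata rows to follow the order of the cells.
aligned = obs_demo.loc[cell_names]

print(same_cells)
print(aligned["cluster.id"].tolist())
```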


Now that we have the cell annotation from the original study, we can check what types of cells are included.

In [39]:
# exercise: assign the obs data frame to scde.obs
scde.obs = obs

# print the head of the .obs attribute to confirm the assignment
print(scde.obs.head())


                           cell.type cytokine.condition donor.id  batch.10X  \
N_Th2_r1_GCATACACAGTAGAGC      Naive                Th2       D3          1   
N_Th2_r1_TAGAGCTGTTAAGATG      Naive                Th2       D2          1   
M_Th17_r1_TCTTTCCAGCACGCCT    Memory               Th17       D3          1   
M_Th17_r2_GAGGTGACAGGCGATA    Memory               Th17       D3          2   
N_Th17_r1_CTCGTACAGTGGCACA     Naive               Th17       D2          1   

                            nGene   nUMI  percent.mito   S.Score  G2M.Score  \
N_Th2_r1_GCATACACAGTAGAGC    2890  15616      0.052834 -0.141680  -0.229996   
N_Th2_r1_TAGAGCTGTTAAGATG    3672  18645      0.040603  0.009671  -0.051532   
M_Th17_r1_TCTTTCCAGCACGCCT   4375  19334      0.019609 -0.239274  -0.278496   
M_Th17_r2_GAGGTGACAGGCGATA   3714  18852      0.036336  0.111576   0.663065   
N_Th17_r1_CTCGTACAGTGGCACA   1899   7193      0.015434 -0.138268  -0.193798   

                           Phase cluster.id  effec

In [40]:
# tip: each cell type should be present once. Consider using the `unique()` and `tolist()` methods
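
As a small illustration of the tip, the sketch below applies `unique()` and `tolist()` to a made-up column; the values in `cluster_ids` are invented and only mimic the labels used in this dataset.

```python
import pandas as pd

# Made-up cluster labels, with repeats.
cluster_ids = pd.Series(["TN", "TN", "TEM", "TCM", "TEM", "TEMRA"], name="cluster.id")

# unique() keeps each label once, in order of first appearance;
# tolist() turns the resulting array into a plain Python list.
cell_types_demo = cluster_ids.unique().tolist()

# value_counts() is a handy companion when the group sizes matter too.
label_counts = cluster_ids.value_counts()

print(cell_types_demo)
print(label_counts["TN"])
```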

We now want to ensure that all cells have a sufficient number of reads before continuing our analyses. Thus, let's compute the sum of reads for each cell and store it in a numpy array.

In [41]:
import numpy as np

# exercise: compute the number of reads for each cell
# We access the expression matrix scde.X and sum across the genes (axis=1) for each cell.
# np.array() and .flatten() ensure the result is a simple 1D numpy array.
num_reads = np.array(scde.X.sum(axis=1)).flatten()

# Print the first 10 values to verify
print("Total reads for the first 10 cells:")
print(num_reads[:10])

# List the unique cell types in the 'cluster.id' column.
cell_types = scde.obs['cluster.id'].unique().tolist()
print(f"Unique cell types found: {cell_types}")

# Compute the total reads per cell manually by summing the expression
# matrix (.X) across genes (axis=1).
print("\nCalculating total reads per cell...")
num_reads = np.array(scde.X.sum(axis=1)).flatten()
print("Total reads for the first 5 cells:")
print(num_reads[:5])
# scanpy offers a more comprehensive alternative:
# sc.pp.calculate_qc_metrics(scde, inplace=True)

# Keep only the cells with at least 10,000 total reads, using the
# `num_reads` array as a boolean mask. .copy() turns the resulting view
# into an independent object.
print("\nFiltering cells with fewer than 10,000 total reads...")
scder = scde[num_reads >= 10000, :].copy()
print(f"Original number of cells: {scde.n_obs}")
print(f"Number of cells after filtering: {scder.n_obs}")




Total reads for the first 10 cells:
[2436.052  2904.658  3526.4187 3066.2966 2394.8398 3621.706  3601.4758
 3343.7559 3696.0298 2402.5   ]
Unique cell types found: ['TN', 'TEM', 'TCM', 'TEMRA']

Calculating total reads per cell...
Total reads for the first 5 cells:
[2436.052  2904.658  3526.4187 3066.2966 2394.8398]

Filtering cells with fewer than 10,000 total reads...
Original number of cells: 500
Number of cells after filtering: 0
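
A result of zero cells deserves a second look. The per-cell sums printed above are non-integer values between roughly 2,400 and 3,700, which is what one would expect if `scde.X` held normalized or transformed values rather than raw counts when this cell ran; that is only one possible explanation. The sketch below shows a quick check on made-up matrices. `looks_like_raw_counts` is a hypothetical helper written for this example, not an anndata or scanpy function.

```python
import numpy as np


def looks_like_raw_counts(x):
    """Return True if every value is a non-negative whole number."""
    x = np.asarray(x)
    return bool(np.all(x >= 0) and np.all(np.mod(x, 1) == 0))


# Made-up raw counts: 2 cells x 3 genes.
raw = np.array([[3.0, 0.0, 12.0],
                [0.0, 1.0, 7.0]])

# A typical transformation: scale each cell to 10,000 reads, then log1p.
normalized = np.log1p(raw / raw.sum(axis=1, keepdims=True) * 1e4)

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

If the check fails on the real matrix, a threshold expressed in raw reads, such as 10,000, would not be meaningful for the values stored in `.X`.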


In [42]:
# tip: simple sum across all genes for each cell
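
The sum itself is simple, but the shape of the result depends on how the matrix is stored. The sketch below assumes the matrix may be either a dense numpy array or a scipy sparse matrix; the toy values and variable names are made up.

```python
import numpy as np
from scipy import sparse

# Toy matrix: 2 cells (rows) x 3 genes (columns).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0]])
sparse_x = sparse.csr_matrix(dense)

# Summing a sparse matrix along axis=1 gives a 2D column (shape (2, 1)).
raw_sum = sparse_x.sum(axis=1)

# np.asarray(...).ravel() flattens it into a 1D array, one value per cell.
flat = np.asarray(raw_sum).ravel()

# The same expression also works for a dense array, where it changes nothing.
dense_flat = np.asarray(dense.sum(axis=1)).ravel()

print(raw_sum.shape, flat.shape)
print(flat.tolist(), dense_flat.tolist())
```

This is why the solution cells wrap the sum in `np.array(...)` and call `.flatten()`: the result can then be used directly as a per-cell vector.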

In [43]:
import numpy as np

# exercise: compute the number of reads for each cell
# We sum the expression matrix (.X) across all genes (axis=1) for each cell.
num_reads = np.array(scde.X.sum(axis=1)).flatten()

# Optional check while testing: uncomment to print the first values.
# print(num_reads[:5])


In [44]:
# tip: recall how anndata object can be subsetted

We are almost done with this first phase of the analysis. We just need to save the reduced AnnData object in a new file, './data/scder.h5ad'.

In [45]:
# making sure the 'scder.h5ad' file is not on the disk already
import os
if os.path.isfile('./data/scder.h5ad'):
    os.remove('./data/scder.h5ad')

In [46]:
# exercise: save the new scanpy object in a file named './data/scder.h5ad'
# The .write() method is used to save the AnnData object to a file.
scder.write('./data/scder.h5ad')

# You can add a print statement to confirm the file was saved.
print("Filtered AnnData object saved to './data/scder.h5ad'.")


Filtered AnnData object saved to './data/scder.h5ad'.


In [47]:
# tip: use the scanpy method for writing h5ad files

In [48]:
# exercise: compute the number of reads for each cell
# We sum the expression values (scde.X) across all genes (axis=1) for each cell.
# np.array() and .flatten() ensure we get a simple 1D numpy array.
num_reads = np.array(scde.X.sum(axis=1)).flatten()

print(num_reads[0:10])


[2436.052  2904.658  3526.4187 3066.2966 2394.8398 3621.706  3601.4758
 3343.7559 3696.0298 2402.5   ]
