**Setup**

If using your own machine and Python installation:
- Install the environment according to the [GitHub instructions](https://github.com/iMM-Workshops/2024_Hitchhiker_Guide_scRNA-seq/tree/main/Day_3_Adaptive_Immune_Receptor)
- Run this notebook with the `imm_air_env` environment we have previously created

If using Google Colab:
- Add the data from https://drive.google.com/drive/folders/1Uk6pmMRzpwnjfZobabHChDNMoGAd-aHE?usp=drive_link to your own Google Drive
- Run the below lines to mount your Google Drive to the Colab session
- Install the dependencies

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!pip install git+https://github.com/Teichlab/cell2tcr.git@db_extension scirpy seaborn bbknn

# Exploratory data analysis

We will base our tutorial on a longitudinal scRNA-seq and scVDJ-seq dataset, where 16 donors were profiled at 6 different time points during a viral challenge trial: 

**Human SARS-CoV-2 challenge uncovers local and systemic response dynamics**
[Lindeboom et al, Nature, 2024](https://doi.org/10.1038/s41586-024-07575-x).

The samples were processed with the 10x Chromium Single Cell Immune Profiling kit. For more information, read the [10x documentation](https://cdn.10xgenomics.com/image/upload/v1660261285/support-documents/CG000361_GettingStartedImmuneProfiling_RevA.pdf).

You can access the data at this Google Drive folder: https://drive.google.com/drive/folders/1Uk6pmMRzpwnjfZobabHChDNMoGAd-aHE?usp=sharing

Some information on the VDJ notation:
- VJ refers to TCR alpha chain, VDJ to TCR beta chain
- Each cell has 2 alleles for each of the TCR alpha and beta loci, and can theoretically express two different alpha and two different beta chains. IR_VJ_1 refers to the more abundant TCR alpha chain, IR_VJ_2 to the less abundant one (if any). The same applies to IR_VDJ_1 / IR_VDJ_2.
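
As a rough aid for reading the column names in this dataset, the sketch below splits a name such as `IR_VDJ_1_cdr3_tcr` into its parts. The helper `describe_ir_column` is hypothetical and simply assumes the `IR_<arm>_<rank>_<field>_tcr` pattern used by the columns in this tutorial.

```python
# Hypothetical helper: decode column names of the form IR_<arm>_<rank>_<field>_tcr
ARM_TO_CHAIN = {"VJ": "TCR alpha", "VDJ": "TCR beta"}
RANK_TO_LABEL = {"1": "more abundant", "2": "less abundant"}


def describe_ir_column(name):
    # e.g. 'IR_VDJ_1_v_gene_tcr' -> ['IR', 'VDJ', '1', 'v', 'gene', 'tcr']
    parts = name.split("_")
    arm, rank = parts[1], parts[2]
    # everything between the rank and the trailing 'tcr' is the field name
    field = "_".join(parts[3:-1])
    return f"{ARM_TO_CHAIN[arm]} chain ({RANK_TO_LABEL[rank]}), field: {field}"


print(describe_ir_column("IR_VJ_2_cdr3_tcr"))
print(describe_ir_column("IR_VDJ_1_v_gene_tcr"))
```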

## T cells are diverse!
### Load environment and dataset

In [None]:
import scanpy as sc
import pandas as pd
import scirpy as ir
import cell2tcr

sc.settings.verbosity = 3  # use this flag to manage verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(6, 6), facecolor="white")

In [None]:
# load data
# file path depends on where you save the objects to !
# If on Google Colab: If you have saved the h5ad to a folder on your Drive called 'data', you would access it like this:
# adata = sc.read_h5ad('drive/MyDrive/data/human_challenge_gex.h5ad')

# adata = sc.read_h5ad('/nfs/team205/ld21_sharing/imm_workshop/human_challenge_w_UMAP_w_BBKNN.h5ad')
adata = sc.read_h5ad('/path/to/human_challenge_gex.h5ad')
adata

### Visualise without batch correction

In [None]:
# compute 500 highly variable genes (HVGs)
sc.pp.highly_variable_genes(
    adata, 
    layer='logcounts',
    n_top_genes=500
)

In [None]:
# subset to HVGs; copy so that later steps work on a real object rather than a view
adata = adata[:,adata.var.highly_variable].copy()

In [None]:
# compute principal components
sc.tl.pca(adata, layer='logcounts')

In [None]:
# compute nearest neighbours
sc.pp.neighbors(adata)

In [None]:
# compute 2D UMAP projection
sc.tl.umap(adata)

In [None]:
# plot UMAP
sc.pl.umap(
    adata,
    color=['donor_id','time_point'],
    ncols=1
)

sc.pl.umap(
    adata,
    color=['cell_compartment'],
    ncols=1,
)

sc.pl.umap(
    adata,
    color=['cell_type'],
    ncols=1,
    groups=adata.obs.cell_type.value_counts()[:13].index.values,
    legend_loc='on data',
    legend_fontsize=6
)

### Visualise with simple batch correction
We will integrate the samples by correcting for donor-specific effects with [BBKNN](https://doi.org/10.1093/bioinformatics/btz625) (Polanski et al, 2020).

Restart the kernel and load the dataset anew, then process using the below code.

In [None]:
# compute 500 highly variable genes (HVGs)
sc.pp.highly_variable_genes(
    adata, 
    layer='logcounts',
    n_top_genes=500
)

In [None]:
# subset to HVGs; copy so that later steps work on a real object rather than a view
adata = adata[:,adata.var.highly_variable].copy()

In [None]:
# compute principal components
sc.tl.pca(adata, layer='logcounts')

In [None]:
# compute batch-balanced k-nearest neighbour graph
sc.external.pp.bbknn(adata, batch_key='donor_id')

In [None]:
# compute 2D UMAP projection
sc.tl.umap(adata)

In [None]:
# plot UMAP
sc.pl.umap(
    adata,
    color=['donor_id','time_point'],
    ncols=1
)

sc.pl.umap(
    adata,
    color=['cell_compartment'],
    ncols=1,
)

sc.pl.umap(
    adata,
    color=['cell_type'],
    ncols=1,
    groups=adata.obs.cell_type.value_counts()[:13].index.values,
    legend_loc='on data',
    legend_fontsize=6
)

### Visualise after proper batch correction
We will visualise the UMAP obtained after more detailed batch correction in the paper (process not shown here).

In [None]:
# UMAP after proper batch correction
# we restrict the visible space of the UMAP to T cells as the original UMAP included non-T cell types which are no longer present in this object

sc.pl.embedding(
    adata[(adata.obsm['X_umap_harmony_rna_wvdj_30pcs_6000hvgs'][:,0]>-2)&(adata.obsm['X_umap_harmony_rna_wvdj_30pcs_6000hvgs'][:,1]<0)],
    'X_umap_harmony_rna_wvdj_30pcs_6000hvgs',
    color=['donor_id','time_point'],
    ncols=1
)

sc.pl.embedding(
    adata[(adata.obsm['X_umap_harmony_rna_wvdj_30pcs_6000hvgs'][:,0]>-2)&(adata.obsm['X_umap_harmony_rna_wvdj_30pcs_6000hvgs'][:,1]<0)],
    'X_umap_harmony_rna_wvdj_30pcs_6000hvgs',
    color=['cell_compartment'],
    ncols=1,
)

sc.pl.embedding(
    adata[(adata.obsm['X_umap_harmony_rna_wvdj_30pcs_6000hvgs'][:,0]>-2)&(adata.obsm['X_umap_harmony_rna_wvdj_30pcs_6000hvgs'][:,1]<0)],
    'X_umap_harmony_rna_wvdj_30pcs_6000hvgs',
    color=['cell_type'],
    ncols=1,
    groups=adata.obs.cell_type.value_counts()[:13].index.values,
    legend_loc='on data',
    legend_fontsize=6
)

In [None]:
sc.pl.embedding(
    adata[(adata.obsm['X_umap_harmony_rna_wvdj_30pcs_6000hvgs'][:,0]>-2)&(adata.obsm['X_umap_harmony_rna_wvdj_30pcs_6000hvgs'][:,1]<0)],
    'X_umap_harmony_rna_wvdj_30pcs_6000hvgs',
    color=['cell_state'],
    ncols=1,
    groups=adata.obs.cell_state.value_counts()[:15].index.values,
    legend_loc='on data',
    legend_fontsize=6,
    legend_fontoutline=2,
)

## Understanding the VDJ columns in our data
- Find the columns that contain immune receptor data

**Tips**

- Per-cell information is in `adata.obs`, which is a pandas dataframe `pd.DataFrame()`. Any function of pandas can be applied to it, usually by directly appending it to the dataframe.
- View all columns: Simply `adata` or `adata.obs.columns`
- Select columns by text: `adata.obs.loc[:,adata.obs.columns.str.contains('yourtexthere')]`
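
The column-selection tip can be tried on a small stand-in table first. This is a minimal sketch in which the dataframe `obs` and its values are invented for illustration; on the real data you would use `adata.obs` instead.

```python
import pandas as pd

# Toy stand-in for adata.obs; the values are invented for illustration
obs = pd.DataFrame({
    "donor_id": ["participant_1", "participant_2"],
    "IR_VJ_1_cdr3_tcr": ["CAVSDGQKLLF", None],
    "IR_VDJ_1_cdr3_tcr": ["CASSLGQAYEQYF", "CASSPGTEAFF"],
})

# Boolean mask over the column names, then select the matching columns
ir_mask = obs.columns.str.contains("IR_")
ir_columns = obs.loc[:, ir_mask]
print(list(ir_columns.columns))
```

The match is case-sensitive, so `donor_id` is not selected here.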

In [None]:
# your code for inspecting the VDJ columns

In [None]:
adata

## Missing or extra immune receptor chains
- How many T cells have 1, 2, 3, or 4 IR chains?

**Tips**

- Select columns by name: `adata.obs.loc[:,['IR_VDJ_1_v_gene_tcr','IR_VDJ_1_j_gene_tcr']]`
- Select cells by conditioning on a column: `adata[adata.obs.donor_id=='participant_1']`
- Select cells by conditioning on multiple columns: `adata[(adata.obs.donor_id=='participant_1')&(adata.obs.time_point=='D10')]`
- Use `pd.DataFrame().isna()` and `pd.DataFrame().notna()` to query which cells have real values in a column. Cells without TCR information will have `NaN` instead.
- Use `pd.DataFrame().value_counts()` to show the elements present in a column and how often they appear.
- View help for a function by placing the cursor inside its parentheses to see the docstring (Shift+Tab in Jupyter), or look up examples online.

For the remainder of this tutorial, we will not need the gene expression data. Thus, we can create a copy of the per-cell dataframe and operate on it directly.

In [None]:
# create dataframe with per-cell information
df = adata.obs.copy()

In [None]:
# How many cells have how many IR chains ? Try to find this yourself or inspect the code below
df.loc[:,['IR_VJ_1_cdr3_tcr','IR_VDJ_1_cdr3_tcr','IR_VJ_2_cdr3_tcr','IR_VDJ_2_cdr3_tcr']].notna().sum(axis=1).value_counts().sort_index().plot.bar(title='IR chain distribution', xlabel='Number of chains', ylabel='Number of T cells')

## TCR repertoire diversity
**Paired clonotype diversity:** For each TCR clone in the filtered contig annotations, count the number of unique barcodes (cells). Divide the counts of unique barcodes by the total number of unique barcodes to derive the proportional abundances. Then square each of the proportional abundances. One over the sum of the squared values gives the inverse Simpson’s index.

The index value ranges from one to the estimated number of cells. A value of one indicates no diversity, and a value equal to the estimated number of cells indicates maximal diversity. If clonotypes are absent, the value is set to zero. The paired clonotype diversity metric is useful to compare samples, e.g. pre and post treatment, and to assess clonal expansion. 
- Write a function to compute paired clonotype diversity for an arbitrary set of T cells
- What is the paired clonotype diversity for the different samples? And for the donors?
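
As a worked toy example of the calculation described above, the sketch below computes the inverse Simpson's index from a plain list of clonotype labels. The function name `inverse_simpson` and the labels are made up for illustration; it does not use the dataset columns.

```python
from collections import Counter


def inverse_simpson(clonotype_labels):
    """Inverse Simpson's index for a list of per-cell clonotype labels."""
    counts = Counter(clonotype_labels)
    total = sum(counts.values())
    # proportional abundance of each clonotype, squared and summed
    sum_of_squares = sum((n / total) ** 2 for n in counts.values())
    return 1 / sum_of_squares


# six cells: clonotype A three times, B twice, C once
# proportions 3/6, 2/6, 1/6 -> squares sum to 14/36 -> index 36/14, about 2.57
print(inverse_simpson(["A", "A", "A", "B", "B", "C"]))

# one clonotype only: no diversity, index 1
print(inverse_simpson(["A", "A", "A", "A"]))

# every cell unique: index equals the number of cells
print(inverse_simpson(["A", "B", "C", "D"]))
```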

In [None]:
def paired_clonotype_diversity(df):
    # compute abundance of each clonotype, defined by unique AA sequence. Normalize by sample size. 
    # Square and sum these values, then take the inverse
    return 1/df.value_counts(['IR_VJ_1_cdr3_tcr','IR_VDJ_1_cdr3_tcr'], normalize=True).pow(2).sum()

In [None]:
for i in adata.obs.sample_id.unique():
    print(i, f': {paired_clonotype_diversity(adata[adata.obs.sample_id==i].obs):.1f}')

## Public and private immune receptor repertoire
- Do you find public TCRs at the amino acid level? And at the nucleotide level?
- How are T cells with public TCRs distributed by phenotype? E.g. compare naive vs effector/memory compartments.


**Tips**

- Show statistics for a single column: `df['yourcolumn'].mean()`, or `.unique()`, `.nunique()`, `.value_counts()`, etc
- Compute statistics by column with `df.groupby('yourcolumn')`. Useful statistics are `.mean()`, `.nunique()`
- Sort a dataframe by the values of a column: `df.sort_values('yourcolumn')` or `df['yourcolumn'].sort_values()`
- Select rows or columns by slicing: `df.iloc[:10, :5]`. The first entry will select rows (cells), the second columns
- Sort a dataframe by its index: `df.sort_index()`
- Numerical comparison on a column: 
    - `df['yourcolumn'].max()` : return the maximum value
    - `df['yourcolumn'].min()` : return the minimum value
    - `df['yourcolumn'].gt(2)` : greater than e.g. 2, returns a boolean (True/False)
    - `df['yourcolumn'].eq(2)` : equal to e.g. 2
    - `df['yourcolumn'].lt(2)` : less than e.g. 2
    - You can combine the above operations: `df['yourcolumn'].gt(2).sum()`
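
Several of these tips can be chained to find clonotypes that occur in more than one donor. The following is a minimal sketch on an invented table; the column name `clonotype` and the donor labels are placeholders, not columns of the tutorial dataset.

```python
import pandas as pd

# Invented example: seven cells, three clonotypes, three donors
toy = pd.DataFrame({
    "clonotype": [0, 0, 1, 1, 2, 2, 2],
    "donor_id": ["d1", "d2", "d1", "d1", "d1", "d2", "d3"],
})

# number of distinct donors in which each clonotype was observed
donors_per_clonotype = toy.groupby("clonotype")["donor_id"].nunique()

# a clonotype seen in more than one donor counts as public here
n_public = donors_per_clonotype.gt(1).sum()

# clonotype shared by the largest number of donors
most_shared = donors_per_clonotype.sort_values(ascending=False).index[0]

print(donors_per_clonotype)
print(n_public, most_shared)
```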

In [None]:
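# define clonotype: any TCR with a unique combination of VJ and VDJ gene calls as well as unique amino acid (AA) sequence
# this will assign a clonotype number to every cell, and cells with identical TCRs will get the same number.
df['clonotype_AA'] = df.groupby(['IR_VDJ_1_cdr3_tcr', 'IR_VDJ_1_v_gene_tcr', 'IR_VDJ_1_j_gene_tcr', 'IR_VJ_1_cdr3_tcr', 'IR_VJ_1_v_gene_tcr','IR_VJ_1_j_gene_tcr'], sort=False).ngroup()

# cells with a missing value in any of these columns have no clonotype; depending on the pandas version,
# ngroup() labels them -1 or NaN. Set both to NaN so that later groupby calls skip them.
df['clonotype_AA'] = df['clonotype_AA'].where(df['clonotype_AA'].ge(0))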

In [None]:
# are any clonotypes found in multiple donors? How many clonotypes are public?
df.groupby('clonotype_AA').donor_id.nunique().gt(1).sum()

In [None]:
# what are the 10 clonotype_AA ids for which we have most sharing?
df.groupby('clonotype_AA').nunique().donor_id.sort_values(ascending=False).iloc[:10].index

In [None]:
# for the below, it may help to save the columns as strings instead of categories. This way, for value_counts(), categories with zero counts are omitted
for column in ['donor_id','cell_compartment','cell_type','cell_state']:
    df[column] = df[column].astype(str)

In [None]:
# iterate through the 10 most public TCRs, show how many donors they are shared by, and which cell states they belong to
for i in df.groupby('clonotype_AA').nunique().donor_id.sort_values(ascending=False).iloc[:10].index:
    print('Clonotype:', i)
    print('Donors:', df[df.clonotype_AA==i].donor_id.unique())
    print(df[df.clonotype_AA==i].cell_state.value_counts(),'\n') # '\n' inserts a new line to separate the print output, purely aesthetic

# Scirpy package
The [Scirpy package](https://scirpy.scverse.org/en/latest/index.html) by [Sturm et al, 2020](https://doi.org/10.1093/bioinformatics/btaa611), comes with many built-in VDJ analysis functions, which we will use to investigate our dataset in more detail.

Scirpy expects the AIR data in its own format. You can load data into this format with one of Scirpy's `ir.io.read_` functions, detailed in their [documentation](https://scirpy.scverse.org/en/latest/tutorials/tutorial_io.html). For the purpose of this tutorial, continue with the data you can find on [Google Drive](https://drive.google.com/drive/folders/1Uk6pmMRzpwnjfZobabHChDNMoGAd-aHE?usp=drive_link) (human_challenge_airr.h5ad).

In [None]:
# adata = sc.read_h5ad('/nfs/team205/ld21_sharing/imm_workshop/human_challenge_airr_w_obs_covid_status.h5ad')
adata = sc.read_h5ad('/path/to/human_challenge_air.h5ad')
adata

In [None]:
# subset to relevant T cell subsets; copy so that Scirpy can write its results to a real object rather than a view
adata = adata[adata.obs.cell_type.isin(['T CD4 Naive', 'T CD4 Helper', 'T CD8 Naive', 'T CD8 CTL','T CD8 Memory', 'T Reg', 'T MAI', 'T CD4 CTL', 'T Double Negative'])].copy()

In [None]:
# Scirpy processing
ir.pp.index_chains(adata)

In [None]:
ir.tl.chain_qc(adata)

## AIR abundance plots
Check out the below plots
- What do you deduce from them?
- Are they as expected?
- Try plotting some other metadata (phenotype, time point)

In [None]:
ir.pl.group_abundance(adata, groupby='chain_pairing')

In [None]:
ir.pl.group_abundance(adata, groupby='donor_id', target_col='chain_pairing')

In [None]:
# inspect the chain pairing categories
adata.obs.chain_pairing.value_counts()

In [None]:
# subset to cells with exactly one full alpha and beta chain pair
adata = adata[adata.obs.chain_pairing=='single pair'].copy()

## Compute TCR similarity and clonotypes
Have a look at the functions `ir.pp.ir_dist` and `ir.tl.define_clonotypes`.
- What do the function parameters `metric` and `cutoff` mean?
- What impact will the parameter `sequence` have?
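
To build intuition for what a distance metric and a cutoff do, the sketch below uses a plain Levenshtein (edit) distance and treats two sequences as neighbours when their distance does not exceed the cutoff. This illustrates the general idea only and is not Scirpy's implementation; the function names and example sequences are made up.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def are_neighbours(a, b, cutoff):
    """Two sequences are linked when their distance is at most `cutoff`."""
    return levenshtein(a, b) <= cutoff


# invented CDR3-like sequences that differ at a single position
seq1, seq2 = "CASSLGQAYEQYF", "CASSLGQGYEQYF"

print(levenshtein(seq1, seq2))
print(are_neighbours(seq1, seq2, cutoff=0))  # only identical sequences pass
print(are_neighbours(seq1, seq2, cutoff=1))  # one edit is tolerated
```

With a cutoff of 0 only identical sequences are grouped; raising the cutoff merges increasingly different sequences into the same group.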

In [None]:
# using default parameters, `ir_dist` will compute nucleotide sequence identity
ir.pp.ir_dist(adata)

In [None]:
# This step is computationally expensive and may run for 5 minutes
ir.tl.define_clonotypes(adata, receptor_arms='all', dual_ir='primary_only', n_jobs=1) # n_jobs determines how many CPUs are used - can try more if your computer supports it

## Visualise clonotypes
We can visualise the clonotypes in network plots.
- Plot the clonotype network for all T cells. Is it helpful or informative? 
- Subset the dataset (e.g. by phenotype and/or donor) and use different colors (donor / time point).

**Tips**

- Subsetting by donor: `adata[adata.obs.donor_id=='participant_XX']`
- Adjust these parameters: `base_size`, `label_fontsize`

In [None]:
ir.tl.clonotype_network(adata, min_cells=2)

In [None]:
ir.pl.clonotype_network(adata, color='donor_id')

## Clonotypes at the amino acid level
This computation takes a long time, so we will run it on a subset of our data. For this example, I selected the entire CD8 T cell compartment, for which computing the VJ and VDJ distance matrices took 25 minutes. If this runs for too long, load the checkpoint object `human_challenge_air_checkpoint_1.h5ad` from the Google Drive folder to skip the `ir.pp.ir_dist` step. You can also try this with a random sample of cells from cell types you find interesting.

**Tips**
- Subsetting adata to 5000 randomly sampled cells: `adata[adata.obs.sample(5000).index]`
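
If you want the same random subset on every run, pandas' `sample` accepts a `random_state`. The sketch below shows this on an invented stand-in for `adata.obs`; the barcodes and the sample size of 3 are placeholders.

```python
import pandas as pd

# Toy stand-in for adata.obs with invented barcodes
obs = pd.DataFrame(
    {"cell_type": ["T CD8 CTL"] * 10},
    index=[f"cell_{i}" for i in range(10)],
)

# random_state fixes the seed, so both calls return the same cells
sampled_index = obs.sample(3, random_state=0).index
resampled_index = obs.sample(3, random_state=0).index

print(list(sampled_index))
print(list(sampled_index) == list(resampled_index))
```

On the real object, the resulting index can be used in the same way as in the tip above, e.g. `adata[sampled_index]`.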

In [None]:
adata_cd8 = adata[adata.obs.cell_compartment=='T CD8'].copy()  # copy to avoid working on a view of adata

# Alternatively:
# adata_cd8 = sc.read_h5ad('path/to/human_challenge_air_checkpoint_1.h5ad')

In [None]:
ir.pp.ir_dist(
    adata_cd8,
    metric="fastalignment",
    sequence="aa",
    cutoff=15,
    n_jobs=1,
)

In [None]:
ir.tl.define_clonotypes(adata_cd8, receptor_arms="all", dual_ir="primary_only", n_jobs=1)

In [None]:
ir.tl.define_clonotype_clusters(adata_cd8, receptor_arms="all", dual_ir="primary_only", n_jobs=1, metric='fastalignment')

In [None]:
ir.tl.clonotype_network(adata_cd8, min_cells=3, sequence="aa", metric="fastalignment")

In [None]:
# checkpoint object saved at this stage
# adata_cd8.write_h5ad('/nfs/team205/ld21_sharing/imm_workshop/human_challenge_air_checkpoint_1.h5ad', compression='gzip')

In [None]:
ir.pl.clonotype_network(adata_cd8, color="donor_id", label_fontsize=5)#, panel_size=(7, 7), base_size=20)

## Clonal expansion
We can use our full dataset for these plots. Alternatively, find and load the checkpoint object in the Google Drive folder called `human_challenge_air_checkpoint_2.h5ad` to continue.
- Which T cell compartments show most expanded clones? Does this make sense to you?
- How are clones shared across T cell states, types and compartments?
- How are clones shared across individuals? Is this sharing different for MAI T cells?
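
To make the idea of expansion categories concrete, the sketch below counts how many cells carry each clone id and assigns each clone to a size bin defined by breakpoints. It is a simplified illustration with invented clone ids; the helper `expansion_category` is hypothetical, and Scirpy's own labels and edge handling may differ.

```python
from bisect import bisect_left
from collections import Counter


def expansion_category(clone_size, breakpoints=(1, 3)):
    """Index of the size bin: 0 for size <= 1, 1 for size 2-3, 2 for size > 3."""
    return bisect_left(breakpoints, clone_size)


# invented per-cell clone ids: c1 has 1 cell, c2 has 2 cells, c3 has 5 cells
clone_ids = ["c1", "c2", "c2", "c3", "c3", "c3", "c3", "c3"]
clone_sizes = Counter(clone_ids)

categories = {clone: expansion_category(size) for clone, size in clone_sizes.items()}
print(clone_sizes)
print(categories)
```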

In [None]:
# if loading from checkpoint 2:
# adata = sc.read_h5ad('/path/to/human_challenge_air_checkpoint_2.h5ad')

In [None]:
ir.tl.clonal_expansion(adata)

In [None]:
ir.pl.clonal_expansion(adata, target_col='clone_id', groupby='cell_compartment', breakpoints=(1, 3), normalize=False)

In [None]:
ir.pl.clonal_expansion(adata, target_col='clone_id', groupby='cell_type', breakpoints=(1, 2, 5), normalize=True)

In [None]:
ir.pl.alpha_diversity(adata, metric="normalized_shannon_entropy", groupby="cell_type")

In [None]:
ir.pl.group_abundance(adata, groupby="clone_id", target_col="cell_state", max_cols=15)

In [None]:
ir.pl.group_abundance(adata, groupby="clone_id", target_col="donor_id", max_cols=25)

## VDJ gene usage
- Inspect the gene usage by donor
- Is there a VDJ bias in any of the cell types? If yes, which?

In [None]:
with ir.get.airr_context(adata, "j_call"):
    ir.pl.group_abundance(
        adata,
        groupby="VDJ_1_j_call",
        target_col="donor_id",
        normalize=True,
        max_cols=20,
    )

In [None]:
ir.pl.vdj_usage(
    adata,
    full_combination=False,
    vdj_cols=('VJ_1_v_call', 'VJ_1_j_call', 'VDJ_1_v_call', 'VDJ_1_j_call'),
    max_segments=None,
    max_labelled_segments=10,
    max_ribbons=60,
#     fig_kws={"figsize": (10, 10)},
)

In [None]:
ir.pl.spectratype(adata, color="cell_type", viztype="bar")

In [None]:
ir.pl.spectratype(
    adata,
    color="cell_type",
    viztype="curve",
    curve_layout="shifted",
    fig_kws={"dpi": 120},
    kde_kws={"kde_norm": False},
)

## Repertoire overlap
Now that we have defined T cell clonotypes, we can check the repertoire overlap across categories.
- Investigate the longitudinal aspect of our data: Does the repertoire overlap group donors together? For this, use `groupby='sample_id'` and use `heatmap_cats=['donor_id']`.
- Do time points and covid_status show clonotype overlap?
- What about overlaps across phenotypes?
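
One common way to quantify overlap between two repertoires is the Jaccard index: the number of shared clonotypes divided by the number of clonotypes found in either sample. The sketch below shows it on invented clonotype sets; whether this matches the measure used in a given Scirpy plot depends on that function's settings.

```python
def jaccard(a, b):
    """Shared clonotypes divided by all clonotypes seen in either repertoire."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# invented repertoires: sample name -> set of clonotype ids
repertoires = {
    "donor1_D0": {"c1", "c2", "c3"},
    "donor1_D14": {"c2", "c3", "c4"},
    "donor2_D0": {"c5", "c6"},
}

# two time points of the same donor share two of four clonotypes
print(jaccard(repertoires["donor1_D0"], repertoires["donor1_D14"]))

# different donors share nothing in this toy example
print(jaccard(repertoires["donor1_D0"], repertoires["donor2_D0"]))
```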

In [None]:
ir.pl.repertoire_overlap(
    adata,
    'sample_id',
    heatmap_cats=['donor_id'],
    yticklabels=False,
    xticklabels=False,
)

In [None]:
ir.pl.repertoire_overlap(adata, 'sample_id', pair_to_plot=['participant_9_D3', 'participant_9_D14'])

In [None]:
ir.pl.repertoire_overlap(adata, "cell_type", pair_to_plot=['T CD8 CTL', 'T CD8 Memory'])

In [None]:
# Checkpoint 2 
# adata.write_h5ad('/nfs/team205/ld21_sharing/imm_workshop/human_challenge_air_checkpoint_2.h5ad', compression='gzip')

## Antigen specificity database
We can load the entries of [VDJdb](https://vdjdb.cdr3.net/) directly through Scirpy. VDJdb is a database that includes experimental data on immune receptor specificity.
- Can you find the columns that specify the following antigen information: Epitope sequence, species and protein?
- Which species have most entries?

**Tips**
- Use `df['yourcolumn'].value_counts().plot.barh()` to visualise distribution of elements in 'yourcolumn'

In [None]:
vdjdb = ir.datasets.vdjdb()

In [None]:
ir.pp.ir_dist(adata, vdjdb, metric="identity", sequence="aa") # takes ~7min

In [None]:
ir.tl.ir_query( # takes ~7min
    adata,
    vdjdb,
    metric="identity",
    sequence="aa",
    receptor_arms="any",
    dual_ir="any",
    n_jobs=1
)

In [None]:
ir.tl.ir_query_annotate(
    adata,
    vdjdb,
    metric="identity",
    sequence="aa",
    include_ref_cols=["antigen.species", "antigen.gene", "antigen.epitope"],
    strategy="most-frequent",
)

In [None]:
# Checkpoint 3
# adata.write_h5ad('/nfs/team205/ld21_sharing/imm_workshop/human_challenge_air_checkpoint_3.h5ad', compression='gzip')

## Antigen specificity prediction
We have now annotated our clonotypes by comparing them to the VDJdb database entries. If you have trouble running the annotation part, find and load the checkpoint object in the Google Drive folder called `human_challenge_air_checkpoint_3.h5ad` to continue.
- How many T cells could we annotate in this way?
- How are matches distributed across phenotypes?
- How are they distributed across sample metadata such as `covid_status` and `time_point`?
- What other interesting information can you extract from this?
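
A possible starting point for these questions is to count the cells with a non-missing annotation, overall and per group. The sketch below uses an invented table that mimics the `antigen.species` and `cell_type` columns; the values are placeholders and not results from the dataset.

```python
import pandas as pd

# Invented stand-in for adata.obs after annotation; None means no database match
toy = pd.DataFrame({
    "cell_type": ["T CD8 CTL", "T CD8 CTL", "T CD4 Naive", "T CD8 Memory"],
    "antigen.species": ["SARS-CoV-2", None, None, "CMV"],
})

# number of cells that received an annotation
n_annotated = toy["antigen.species"].notna().sum()

# fraction of annotated cells within each cell type
fraction_by_type = toy.groupby("cell_type")["antigen.species"].apply(
    lambda s: s.notna().mean()
)

print(n_annotated)
print(fraction_by_type)
```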

In [None]:
# if loading from checkpoint 3
# adata = sc.read_h5ad('/path/to/human_challenge_air_checkpoint_3.h5ad')

In [None]:
adata.obs.value_counts(['antigen.species','cell_type']).head(40)

# Cell2TCR
We will now reproduce some results from the paper **Human SARS-CoV-2 challenge uncovers local and systemic response dynamics** by [Lindeboom et al, Nature, 2024](https://doi.org/10.1038/s41586-024-07575-x). In particular, we will compute TCR motifs, which correspond to larger clonotype groups that still likely recognise the same epitope. We will use the [Cell2TCR](https://github.com/Teichlab/cell2tcr/tree/main) package to this end.

We will combine the AIR data from the Human SARS-CoV-2 challenge dataset with 5 other COVID-19 datasets to look for patterns of SARS-CoV-2 specific T cells.

**Datasets**
- COMBAT consortium, 2022, doi.org/10.1016/j.cell.2022.01.012
- Liu et al, 2021, doi.org/10.1016/j.cell.2021.02.018
- Ren et al, 2021, doi.org/10.1016/j.cell.2021.01.053
- Stephenson et al, 2021, doi.org/10.1038/s41591-021-01329-
- Yoshida et al, 2021, doi.org/10.1038/s41586-021-04345-x

## Computing TCR motifs

For convenience, a dataframe holding all immune receptors together with phenotype information from the above studies is available on Google Drive as `air_combined_6_datasets.csv`. The data object is too big to compute the motifs during this session (it takes several hours for the 779884 cells), but you can find the precomputed motifs in the column `motif_precomputed`. We will compute TCR motifs for a subset of the overall data:
- Create a dataframe with all activated T cell states
- Run Cell2TCR on it
- How do you interpret the results?
- How do the results change if you use another phenotype (e.g. memory or naive)?

**Tips**
- Slicing by text:  `df_sub = df[df.cell_state.str.contains('yourtext')]`
- Running Cell2TCR:  `cell2tcr.motifs(df)`
- Showing CDR3s for a set of TCRs:  `cell2tcr.draw_cdr3(df)`
- Cell2TCR assigns the motif id in decreasing order of TCR motif size. Thus, plotting the motifs 0-9 corresponds to plotting the 10 largest TCR motifs

In [None]:
# df = pd.read_csv('/nfs/team205/ld21_sharing/imm_workshop/air_combined_6_datasets.csv')
df = pd.read_csv('/path/to/air_combined_6_datasets.csv')
df

In [None]:
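# na=False skips cells without a cell_state; copy so that new columns can be added to df_sub safely
df_sub = df[df.cell_state.str.contains('CD8 Activated', na=False)].copy()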

In [None]:
# if the below code runs into Memory Allocation Errors or gets stuck, you can reduce the chunk_size
cell2tcr.motifs(df_sub, chunk_size=200)

In [None]:
# draw the 10 biggest TCR motifs
for i in range(10):
    cell2tcr.draw_cdr3(df_sub[df_sub.motif==i])

In [None]:
cell2tcr.draw_cdr3(df[df.motif_precomputed==28])
df[df.motif_precomputed==28].cell_state.value_counts()

## Scoring matches with the antigen specificity database
Cell2TCR includes a function to compare TCR sequences to the [iedb.org](https://www.iedb.org/) database. It returns a dataframe of matches, `db_matches`, in which every row corresponds to a database TCR sequence that matched one of our own sequences (`input_sequence`).

- How many matches were found?
- Do some input_sequences have several matches? What does this imply?
- Is the species/organism distribution of matches comparable to the Scirpy analysis?

In [None]:
db_matches = cell2tcr.db_match(df_sub.IR_VDJ_1_junction_aa) # takes a few minutes

In [None]:
db_matches

## Assigning likely antigen specificity
We can now add the database matches to our dataframe using `cell2tcr.db_annotate()`.

**Tips**
- Drop rows with `nan` values using `df.dropna()`

In [None]:
cell2tcr.db_annotate(df_sub, db_matches, 'IR_VDJ_1_junction_aa')

# Gamma-delta T cell analysis
We also have gamma-delta T cell data for the same paper **Human SARS-CoV-2 challenge uncovers local and systemic response dynamics**
[Lindeboom et al, Nature, 2024](https://doi.org/10.1038/s41586-024-07575-x). You can access it in the [Google Drive](https://drive.google.com/drive/folders/1Uk6pmMRzpwnjfZobabHChDNMoGAd-aHE?usp=drive_link) under `air_gammadelta.csv`.

- How are gamma-delta TCR motifs different from alpha-beta ones?

In [None]:
# df_gd = pd.read_csv('/nfs/team205/ld21_sharing/imm_workshop/air_gammadelta.csv')
df_gd = pd.read_csv('/path/to/air_gammadelta.csv')
df_gd

In [None]:
cell2tcr.motifs(df_gd, add_suffix=False)

In [None]:
for i in range(10,20):
    cell2tcr.draw_cdr3(df_gd[df_gd.motif==i])

# Optional: Loading 10x VDJ data
If you would like to practise loading data that has been processed using CellRanger, continue with the code below (taken from the [tutorial by Scirpy](https://scirpy.scverse.org/en/latest/tutorials/tutorial_io.html)).

**Single-cell landscape of bronchoalveolar immune cells in COVID-19 patients**

Liao et al.
- Paper: https://doi.org/10.1101/2020.02.23.20026690
- Data: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926

Let us download the data from GEO using the above link. In a terminal, type:

`wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE145nnn/GSE145926/suppl/GSE145926_RAW.tar `

Next, untar the file:

`tar -xvf GSE145926_RAW.tar`

Now we can read in the different VDJ files using `ir.io.read_10x_vdj`

**Tips**

- Use `glob` to fetch all file names that match a pattern.
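
The sketch below shows how `glob` pattern matching behaves, using a temporary folder with empty files. The file names are invented and merely follow a `<accession>_<sample>_filtered_contig_annotations.csv.gz` layout assumed for illustration.

```python
import glob
import os
import tempfile

# invented file names; only the first two match the pattern below
file_names = [
    "GSM0000001_S1_filtered_contig_annotations.csv.gz",
    "GSM0000002_S2_filtered_contig_annotations.csv.gz",
    "GSM0000001_S1_other.txt",
]

with tempfile.TemporaryDirectory() as tmp:
    # create empty placeholder files
    for name in file_names:
        open(os.path.join(tmp, name), "w").close()

    # '*' matches any run of characters in the file name
    pattern = os.path.join(tmp, "*filtered_contig_annotations.csv.gz")
    matches = sorted(glob.glob(pattern))

    # take the sample label from the file name only, ignoring the folder path
    sample_names = [os.path.basename(path).split("_")[1] for path in matches]

print(sample_names)
```

Parsing the base name rather than the full path keeps the result independent of underscores in folder names.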

In [None]:
import glob
import anndata

In [None]:
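# Load the TCR data
import os

adatas = []
# for file in glob.glob('/nfs/team205/tcr_pipeline/datasets/GSE145926/*filtered_contig_annotations.csv.gz'):
for file in glob.glob('/path/to/GSE145926/*filtered_contig_annotations.csv.gz'):
    print(file)
    adata_tcr = ir.io.read_10x_vdj(file)
    # assumes file names of the form <GSM accession>_<sample>_filtered_contig_annotations.csv.gz;
    # parse the file name only, so that underscores in the folder path do not shift the index
    adata_tcr.obs['sample_name'] = os.path.basename(file).split('_')[1] # add sample name to each file's cells
    adatas.append(adata_tcr)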

In [None]:
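# index_unique appends the position of each sample to the barcodes, so that identical barcodes from different samples stay distinct
adata = anndata.concat(adatas, index_unique='_')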

In [None]:
ir.pp.index_chains(adata)
ir.tl.chain_qc(adata)

In [None]:
adata

Done! Feel free to browse this dataset in more detail using the methods we tried in the previous sections. You can also load the gene expression data and start looking into T cells by phenotype.