# IsoTools Hands-on Practice

This practical exercise was designed for the LongTREC bioinformatics summer school on long-read transcriptomics.

IsoTools resources:

- GitHub repository: https://github.com/HerwigLab/IsoTools2
- Documentation and tutorial: https://isotools.readthedocs.io/en/latest/index.html
- pip package: https://pypi.org/project/isotools/

This practice includes:

- transcriptome reconstruction
- data export
- gene model characteristics
- alternative splicing analysis and differential splicing analysis

To reduce the running time, we only consider a subset of chromosome 8 for demonstration purposes. This was done using `samtools`.

First, import all the dependencies.

In [None]:
from isotools import Transcriptome
from isotools._transcriptome_io import write_fasta
from isotools.plots import plot_diff_results, plot_embedding, plot_str_var_number, triangle_plot
from isotools import __version__ as isotools_version

import logging
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
logger = logging.getLogger('isotools')

logger.info(f'This practice uses IsoTools version {isotools_version}.')

Specify the path to the directory in which the demonstration data and reference are saved. Create a folder to save the output files.

In [None]:
path = os.getcwd()

out_dir = os.path.join(path, 'output')
os.makedirs(out_dir, exist_ok=True)

## Transcriptome reconstruction

In this step, we reconstruct the transcriptome from the sample reads.

For this exercise, we require the following input files:

- genome sequence in `fasta` format (human genome GRCh38.p14 downloaded from Gencode)
- reference annotation in `gtf` or `gff3` format, sorted and indexed (Gencode v45)
- long-read alignments in `bam` format and corresponding index (raw fastq files downloaded from Encode)

### Import of reference annotation

We first import the reference annotation to initialise the transcriptome object.

In [None]:
ref_gtf = "reference/gencode.v45.chr8.gtf"

transcriptome = Transcriptome.from_reference(ref_gtf)

### Import of sample data

We need to know the sample names as well as the corresponding alignment file names and group assignment, which are saved in the file `sample_table_chr8.txt`, in order to tell IsoTools where to find the sample read files and which groups they belong to.
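
The import loop further down reads the columns `id`, `file` and `group` from this table. As a minimal sketch of the expected layout, assuming a tab-separated file with exactly these three columns (the sample names and BAM paths below are hypothetical placeholders, not the actual contents of `sample_table_chr8.txt`):

```python
import csv
import io

# Hypothetical content; the real sample_table_chr8.txt may use other names and paths.
example_table = (
    "id\tfile\tgroup\n"
    "h1_rep1\talignments/h1_rep1_chr8.bam\th1\n"
    "h1_rep2\talignments/h1_rep2_chr8.bam\th1\n"
    "endo_rep1\talignments/endo_rep1_chr8.bam\tendo\n"
)

# Parse the tab-separated text into one dictionary per sample.
rows = list(csv.DictReader(io.StringIO(example_table), delimiter="\t"))

# Collect the sample names per group.
samples_by_group = {}
for row in rows:
    samples_by_group.setdefault(row["group"], []).append(row["id"])

print(samples_by_group)
```

Each row describes one sample: a unique name, the alignment file to read, and the group used later for comparisons.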

First, we import this sample annotation table.

In [None]:
samples = pd.read_csv(f'{path}/sample_table_chr8.txt', sep='\t')

samples

'h1' refers to the H1 human embryonic stem (hES) cell line. These cells are undifferentiated and have the ability to develop into any cell type.

'endo' refers to definitive endoderm derived from H1. These cells can differentiate into various endoderm-derived cell types, including those found in the liver, pancreas and lungs.

The two groups represent different developmental stages. There are three biological replicates of each.

Next, we start to import the sequencing data and reconstruct the transcriptome.

In [None]:
for _,row in samples.iterrows():
    transcriptome.add_sample_from_bam(row["file"], sample_name=row["id"], group=row["group"], progress_bar=False)

As the log messages show, IsoTools checks the quality of the alignments during import. The cleaner the alignments, the fewer reads are discarded.

During transcriptome reconstruction, IsoTools also detects chimeric reads, also known as split reads.

These indicate structural variation and mean that one sequencing read aligns to two distinct portions of the genome with little or no overlap.

![chimeric read](https://www.drive5.com/usearch/manual/chimera.gif)
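
To make the definition concrete, the following toy sketch flags a read as chimeric when two of its alignment segments cover different parts of the read and map to non-overlapping genomic regions. It is not how IsoTools detects chimeric reads internally; the segment dictionaries, the function names and the `max_read_overlap` threshold are assumptions made for illustration.

```python
def interval_overlap(start_a, end_a, start_b, end_b):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def is_chimeric(segments, max_read_overlap=10):
    """True if two segments of one read map to distinct genomic regions
    while sharing at most `max_read_overlap` bases of the read."""
    for i, seg_a in enumerate(segments):
        for seg_b in segments[i + 1:]:
            on_read = interval_overlap(seg_a["read_start"], seg_a["read_end"],
                                       seg_b["read_start"], seg_b["read_end"])
            if seg_a["chrom"] == seg_b["chrom"]:
                on_genome = interval_overlap(seg_a["ref_start"], seg_a["ref_end"],
                                             seg_b["ref_start"], seg_b["ref_end"])
            else:
                on_genome = 0  # different chromosomes never overlap
            if on_read <= max_read_overlap and on_genome == 0:
                return True
    return False


# Hypothetical read with two segments mapping far apart on chr8.
split_read = [
    {"chrom": "chr8", "ref_start": 1000, "ref_end": 1800, "read_start": 0, "read_end": 800},
    {"chrom": "chr8", "ref_start": 500000, "ref_end": 500705, "read_start": 795, "read_end": 1500},
]
# Hypothetical read with a single alignment segment.
normal_read = [
    {"chrom": "chr8", "ref_start": 1000, "ref_end": 2500, "read_start": 0, "read_end": 1500},
]

print(is_chimeric(split_read), is_chimeric(normal_read))
```

Note that a spliced read is still a single alignment segment with gaps for the introns, so splicing alone does not make a read chimeric in this sketch.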

For this analysis, we will focus on the non-chimeric reads.

Let's take a look at how many reads have been imported.

In [None]:
transcriptome.sample_table

Let's have a look at the group information, and define colors for groups for visualisation later on.

In [None]:
groups = transcriptome.groups()

groups

In [None]:
# choose the colors for the groups

group_colors = {'h1': '#f28e2b', 'endo': 'rebeccapurple'}

## Quality control and filtering

Quality control is important for downstream analysis. IsoTools has a built-in function to compute quality control metrics based on the genetic features of reads. It measures:

- downstream A content
- direct repeat length at junctions
- noncanonical splicing
- potential fragments
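
Two of these metrics can be illustrated on a toy sequence. The sketch below computes the adenosine fraction downstream of a transcript end and checks whether an intron has the canonical GT...AG boundaries. It only covers the plus strand; the function names, the toy sequence and the default window of 30 bases are assumptions, and `add_qc_metrics` may compute these values differently.

```python
def downstream_a_fraction(genome_seq, transcript_end, window=30):
    """Fraction of A bases in the `window` bases after the transcript end."""
    downstream = genome_seq[transcript_end:transcript_end + window].upper()
    if not downstream:
        return 0.0
    return downstream.count("A") / len(downstream)


def is_canonical_junction(genome_seq, intron_start, intron_end):
    """True if the intron [intron_start, intron_end) starts with GT and ends with AG."""
    intron = genome_seq[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy plus-strand sequence: exon, intron, exon, then an A-rich genomic stretch.
exon1 = "ACGTACGT"
intron = "GTAAGTCCCTAG"
exon2 = "CCGGCCGG"
genomic_tail = "AAAAAAAAAACGAAAAAAAA"
genome = exon1 + intron + exon2 + genomic_tail

transcript_end = len(exon1) + len(intron) + len(exon2)
a_fraction = downstream_a_fraction(genome, transcript_end, window=20)
canonical = is_canonical_junction(genome, len(exon1), len(exon1) + len(intron))

print(a_fraction, canonical)
```

A high downstream A fraction suggests that the oligo(dT) primer may have annealed to a genomic A stretch rather than to a poly(A) tail.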


In [None]:
genome_fn = f'{path}/reference/GRCh38.p14.chr8.fa'

transcriptome.add_qc_metrics(genome_fn, correct_tss=False)

We want to remove low-quality transcripts and retain high-quality ones for downstream analysis. IsoTools implements transcript filtering using a flexible query syntax based on logical combinations of tags, which are by convention a single word in capital letters.

Some predefined tags are available in IsoTools for common technical artifacts, which are detected based on minimal coverage and the genetic features mentioned above. These include:

- Internal priming (IP): unspliced and downstream adenosine content of at least 50%.
- Reverse transcriptase template switching (RTTS): non-canonical splicing where neither splice site is in the reference.
- Fragments: transcripts contained within other transcripts with no TSS/PAS overlap with the reference annotation.

<img src="https://www.researchgate.net/publication/337117264/figure/fig3/AS:962623642300428@1606518769874/The-mechanisms-of-internal-priming-and-template-switching-a-Internal-priming-occurs.png" width="400">
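
The idea of combining tags into a query can be sketched in plain Python. This toy example is not the IsoTools implementation: the per-transcript property names and the `passes` helper are hypothetical, and only the tag names and the internal priming rule are taken from the text above.

```python
# Hypothetical per-transcript properties; the keys are illustrative only.
transcripts = [
    {"name": "tr1", "spliced": False, "downstream_a": 0.8, "noncanonical_novel": False, "fragment": False},
    {"name": "tr2", "spliced": True, "downstream_a": 0.2, "noncanonical_novel": True, "fragment": False},
    {"name": "tr3", "spliced": True, "downstream_a": 0.1, "noncanonical_novel": False, "fragment": False},
]

# Each tag is a named boolean test on one transcript.
tags = {
    "INTERNAL_PRIMING": lambda t: not t["spliced"] and t["downstream_a"] >= 0.5,
    "RTTS": lambda t: t["noncanonical_novel"],
    "FRAGMENT": lambda t: t["fragment"],
}


def passes(transcript, query):
    """Evaluate a logical combination of tags for one transcript."""
    tag_values = {name: test(transcript) for name, test in tags.items()}
    # eval is used only because the query is a trusted string in this toy example.
    return eval(query, {"__builtins__": {}}, tag_values)


# Keep transcripts that carry none of the three artifact tags.
query = "not (INTERNAL_PRIMING or RTTS or FRAGMENT)"
kept = [t["name"] for t in transcripts if passes(t, query)]
print(kept)
```

Because a query is just a logical expression over tag names, new filters can be built by recombining existing tags with `and`, `or` and `not`.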

In [None]:
# predefined tags to detect technical artifacts
artifact_tags = ['INTERNAL_PRIMING', 'RTTS', 'FRAGMENT']

for tag in artifact_tags:
    print(f'{tag}: {transcriptome.filter["transcript"][tag]}')

Based on previous experience, we typically observe around 10-15% of artifact transcripts.

What percentage of the transcripts in each group are artifacts?

In [None]:
transcriptome.filter_stats(tags=artifact_tags, groups=groups, weight_by_coverage=False)

What percentage of the transcripts in each group are novel?

Hint: the tag for a 'novel transcript' is 'NOVEL_TRANSCRIPT'.

In [None]:
transcriptome.filter_stats( ... )

Quality control is essential for meaningful biological analysis!

There are predefined tags for filtering based on coverage and on whether a transcript is affected by technical artifacts.

In [None]:
for tag in ['HIGH_COVER', 'PERMISSIVE', 'BALANCED', 'STRICT']:
    print(f'{tag}: {transcriptome.filter["transcript"][tag]}')

We will use 'BALANCED' for demonstration in this practice. It's possible to customise the query depending on your needs.

In [None]:
query_string = 'BALANCED'

Create a table summarising the transcripts that passed the filtering.

In [None]:
transcript_tab = transcriptome.transcript_table(groups=groups, query=query_string, coverage=True)

transcript_tab

Basic statistics about the transcriptome after selection:

- how many transcripts?
- how many genes?
- how many transcripts per gene on average?
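
These three numbers only require counting rows per gene. The sketch below does this on a few hypothetical records using the standard library; on the actual `transcript_tab` DataFrame, `len(transcript_tab)` and `transcript_tab['gene_id'].nunique()` should give the first two values directly.

```python
from collections import Counter

# Hypothetical filtered transcript records; only gene_id matters here.
records = [
    {"gene_id": "geneA", "transcript_nr": 0},
    {"gene_id": "geneA", "transcript_nr": 1},
    {"gene_id": "geneA", "transcript_nr": 4},
    {"gene_id": "geneB", "transcript_nr": 0},
]

# Count how many transcripts each gene contributes.
transcripts_per_gene = Counter(rec["gene_id"] for rec in records)

n_transcripts = sum(transcripts_per_gene.values())
n_genes = len(transcripts_per_gene)
mean_per_gene = n_transcripts / n_genes

print(n_transcripts, n_genes, mean_per_gene)
```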

## Transcriptome export

IsoTools supports the export of the transcriptome in different formats.

1. pickle file: it contains the entire transcriptome data, which can be restored in an IsoTools session without re-importing the alignment files.

In [None]:
transcriptome.save(f'{out_dir}/h1_endo_chr8.pkl')

If you need to import the pickle file later, please try:

In [None]:
transcriptome = Transcriptome.load(f'{out_dir}/h1_endo_chr8.pkl')

2. GTF (Gene Transfer Format) + expression matrix: you can apply filters and export only transcripts that have passed the filter.

In [None]:
transcriptome.write_gtf(f'{out_dir}/h1_endo_chr8_{query_string}.gtf', source='isotools', query=query_string, gzip=False)

In [None]:
transcript_tab.to_csv(f'{out_dir}/h1_endo_chr8_{query_string}.txt', sep='\t', index=False)

3. fasta: you can also export the sequences of selected transcripts in fasta file format.

In [None]:
transcriptome.write_fasta(genome_fn=genome_fn,
                          fn=f'{out_dir}/h1_endo_chr8_{query_string}.fasta',
                          query=query_string)

## Gene model characteristics

### Transcript identification - structural variation

Transcript identification tells us which transcripts map to which genes. There is usually more than one transcript per gene, and these transcripts differ in structure.

This structural variation comes from the transcription start site (TSS), the exon chain, and the polyadenylation site (PAS). Let's take a closer look.

In [None]:
str_var_count = transcriptome.str_var_calculation(groups=groups, query=query_string, strict_ec=0, strict_pos=15, count_number=True)

for gn in str_var_count.columns[~str_var_count.columns.str.startswith('gene')].str.replace('_tss|_ec|_pas', '', regex=True).unique():
    fig = plot_str_var_number(str_var_count, group_name=gn)

We normalise the number of different TSSs, exon chains and PASs of a gene to simplex coordinates, which can then be used for visualisation in a triangle plot.
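
As a rough illustration of what such a normalisation can look like, the sketch below divides each count by the sum of the three counts and maps the result to 2D coordinates inside an equilateral triangle. This is one plausible normalisation chosen for illustration; the exact formula used by `str_var_calculation` is not given here and may differ, and both function names are hypothetical.

```python
import math


def simplex_coordinates(n_tss, n_ec, n_pas):
    """Scale three counts so that they sum to 1 (assumed normalisation)."""
    total = n_tss + n_ec + n_pas
    if total == 0:
        raise ValueError("at least one count must be positive")
    return n_tss / total, n_ec / total, n_pas / total


def to_triangle_xy(tss, ec, pas):
    """Map simplex coordinates to 2D: TSS bottom left, PAS bottom right, exon chain top."""
    x = pas + 0.5 * ec
    y = ec * math.sqrt(3) / 2
    return x, y


# A gene with one TSS, one exon chain and one PAS lands in the centre.
simple_gene = simplex_coordinates(1, 1, 1)
# A gene with four exon chains is pulled towards the top corner.
splicing_gene = simplex_coordinates(1, 4, 1)

print(simple_gene, to_triangle_xy(*simple_gene))
print(splicing_gene, to_triangle_xy(*splicing_gene))
```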

In [None]:
str_var_tab = transcriptome.str_var_calculation(groups=groups, query=query_string, strict_ec=0, strict_pos=15)

str_var_tab

Based on the simplex coordinates, we can divide the triangle plot into five categories:

- splicing high (top)
- TSS high (bottom left)
- PAS high (bottom right)
- simple (the dot in the centre)
- mix (middle)
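
One way to assign these categories from the counts is sketched below. The cut-off `high=0.5`, the sum-based normalisation and the function name `classify_gene` are all assumptions made for illustration; they are not taken from IsoTools, which may draw the category boundaries differently.

```python
def classify_gene(n_tss, n_ec, n_pas, high=0.5):
    """Assign a category from the three counts (each assumed to be at least 1).

    `high` is a hypothetical cut-off on the normalised share of one feature.
    """
    if n_tss == n_ec == n_pas == 1:
        return "simple"
    total = n_tss + n_ec + n_pas
    shares = {
        "TSS high": n_tss / total,
        "splicing high": n_ec / total,
        "PAS high": n_pas / total,
    }
    # Pick the dominant feature; without a clear winner the gene is a mix.
    label, share = max(shares.items(), key=lambda item: item[1])
    return label if share > high else "mix"


for counts in [(1, 1, 1), (1, 4, 1), (5, 1, 1), (1, 1, 5), (2, 2, 1)]:
    print(counts, classify_gene(*counts))
```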

First, let's have a general overview of all the genes.

In [None]:
fig, ax = plt.subplots(1, figsize=(8, 7))
triangle_plot(str_var_tab, ax=ax, colors=group_colors, tax_title='all genes together')
fig.tight_layout()

There are genes whose category changes between h1 and endo. Let's look at some examples.

In [None]:
# ENSG00000168615.13 - splicing high in endo, simple in h1
example_gene = 'ENSG00000168615.13'

What does the triangle plot look like for this gene?

In [None]:
fig, ax = plt.subplots(1, figsize=(8, 7))
triangle_plot(str_var_tab[str_var_tab['gene_id'] == example_gene], ax=ax, colors=group_colors, tax_title=example_gene)
fig.tight_layout()

What are the transcripts like in this gene?

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 2))
transcriptome[example_gene].gene_track(ax=ax, reference=False, query=query_string)
ax.set_title(f"{query_string} transcripts in {example_gene}")
plt.tight_layout()
plt.show()

Please think about ...

- Which transcripts are found in endo and which in h1?
- Where does the structural variation happen?
- Explain how the category is changed between conditions.

Explore some other examples.

### Transcript quantification - entropy

We know the coverage of transcripts. This quantification information can help us to discover genes whose transcript usage changes between groups.
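
The sketch below shows how an entropy value can be derived from transcript coverage. It assumes that "relative" entropy means the Shannon entropy of the transcript proportions divided by its maximum, log2 of the number of expressed transcripts; the definition used by `entropy_calculation` may differ, and the function name is hypothetical.

```python
import math


def relative_entropy(coverages):
    """Shannon entropy of transcript usage, scaled to the range 0 to 1."""
    expressed = [c for c in coverages if c > 0]
    if len(expressed) < 2:
        return 0.0  # a single transcript carries no uncertainty
    total = sum(expressed)
    entropy = -sum((c / total) * math.log2(c / total) for c in expressed)
    return entropy / math.log2(len(expressed))


# Two equally used transcripts: maximal entropy.
balanced = relative_entropy([50, 50])
# One dominant transcript: entropy close to zero.
dominated = relative_entropy([99, 1])

print(round(balanced, 3), round(dominated, 3))
```

Under this definition, a low value means that one transcript dominates the gene, and a high value means that coverage is spread evenly across transcripts.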

In [None]:
entropy_tab = transcriptome.entropy_calculation(groups=groups, query=query_string, relative=True)

entropy_tab

Let's find some examples.

In [None]:
entropy_tab[
    (abs(entropy_tab['endo_rel_entropy'] - entropy_tab['h1_rel_entropy']) >= 0.5)
]

In [None]:
# ENSG00000170961.7 - lower entropy in endo, higher in h1
example_gene = 'ENSG00000170961.7'

Let's compare the expression levels of the transcripts of this gene between h1 and endo.

In [None]:
gene_part = transcript_tab.loc[transcript_tab['gene_id'] == example_gene]
entropy_part = entropy_tab.loc[entropy_tab['gene_id'] == example_gene]

# create a 1x2 grid of plots
plt.rcParams["figure.figsize"] = (6, 3)
fig, axs = plt.subplots(1, 2, constrained_layout=True)

for i, group in enumerate(groups):
    group_trs = gene_part.loc[gene_part[group+"_sum_coverage"] > 0, ['transcript_nr', group+"_sum_coverage"]]

    rel_entropy = float(entropy_part[f'{group}_rel_entropy'].values[0])

    # define the color based on relative entropy
    # --- can customize the thresholds as needed ---
    if rel_entropy < 0.3: # low entropy
        tr_color = 'lightcoral'
    elif rel_entropy > 0.7: # high entropy
        tr_color = 'powderblue'
    else: # medium entropy
        tr_color = 'lightgrey'

    # plot the bar chart for each group
    axs[i%2].bar([j for j in range(0,len(group_trs))], group_trs[group+"_sum_coverage"], color =tr_color, width = 0.4)
    
    # add the text on top of the bar
    for j, val in enumerate(group_trs[group+"_sum_coverage"]):
        axs[i%2].text(j, val, f'{val/sum(group_trs[group+"_sum_coverage"])*100:.1f}%', ha='center', va='center', fontsize=8, color='black')
    axs[i%2].set_xticks([j for j in range(0,len(group_trs))])
    axs[i%2].set_xticklabels(group_trs["transcript_nr"])

    if i%2 == 0: axs[0].set_ylabel('coverage')
    axs[i%2].set_xlabel('transcript id')
    axs[i%2].set_title(
        f'{group}\nrelative entropy = {float(entropy_part[f"{group}_rel_entropy"].values[0]):.2f}',
        fontsize=10,
        color=group_colors[group]
    )

    axs[i%2].grid(False)

    # add the ticks for x and y axes
    axs[i%2].tick_params(axis='x', which='both', bottom=True, top=False)
    axs[i%2].tick_params(axis='y', which='both', left=True, right=False)

plt.tight_layout()
plt.show()

What are the transcripts like in this gene?

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))
transcriptome[example_gene].gene_track(ax=ax, reference=False, colorbySqanti=True, query=query_string)
ax.set_title(f"{query_string} transcripts in {example_gene}")
plt.tight_layout()
plt.show()

Explore some other examples.

## Alternative splicing analysis

### ASE identification

In this step, we identify various types of alternative splicing events (ASEs), including:

- exon skipping (ES)
- intron retention (IR)
- mutually exclusive exons (ME)
- 3’ alternative splice sites (3AS)
- 5’ alternative splice sites (5AS)
- alternative first exons (TSS)
- alternative last exons (PAS)
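
To illustrate what an event looks like at the level of exon coordinates, the toy sketch below finds single exons of one transcript that another transcript skips with one junction. It is a pairwise simplification for illustration only and not the method used by `alternative_splicing_events`; the function names and coordinates are hypothetical.

```python
def junctions(exons):
    """Introns as (donor, acceptor) pairs between consecutive exons."""
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]


def skipped_exons(exons_a, exons_b):
    """Exons of transcript B that transcript A skips with a single junction."""
    junctions_b = set(junctions(exons_b))
    skipped = []
    for donor, acceptor in junctions(exons_a):
        for start, end in exons_b:
            # B must splice from the same donor into the exon
            # and from the exon into the same acceptor.
            if (donor, start) in junctions_b and (end, acceptor) in junctions_b:
                skipped.append((start, end))
    return skipped


# Hypothetical exon chains as (start, end) pairs on the same strand.
long_isoform = [(100, 200), (300, 400), (500, 600)]
short_isoform = [(100, 200), (500, 600)]

print(skipped_exons(short_isoform, long_isoform))
```

The other event types can be described in a similar way, as pairs of alternative paths between shared splice sites.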

Firstly, identify the alternative splicing events and summarise the number of each type in a table.

In [None]:
splice_events = transcriptome.alternative_splicing_events()

# count the identified events by type
splice_events.splice_type.value_counts()

Two-dimensional embeddings (PCA and UMAP) have been implemented to illustrate the relationship between samples and common ASEs.

In [None]:
# plot PCA embedding
pca = {}

plt.rcParams["figure.figsize"] = (10, 10)
f,axs = plt.subplots(3, 2)
for ax,t in zip(axs.flatten(), ['all', '3AS', '5AS', 'ES', 'IR', 'ME']):
    pca[t] = plot_embedding(splice_events,
                            ax=ax,
                            labels=True,
                            groups=groups,
                            splice_types=t)
axs[0,0].legend(fontsize='medium', ncol=4,handleheight=2.4, labelspacing=0.05,
                bbox_to_anchor=(0, 1.1), loc='lower left')
plt.tight_layout()

### Differential splicing events

In this step, we test for ASEs that are differentially spliced between the two groups.

In [None]:
types_of_interest = ['ES','ME','5AS','3AS','IR']

diff_splice = transcriptome.altsplice_test(groups,
                                           types=types_of_interest,
                                           min_total=200)
diff_splice = diff_splice.sort_values('pvalue').reset_index(drop=True)

sig = diff_splice.padj < 0.05
n_genes = len(diff_splice.loc[sig,"gene"].unique())
print(f'{sum(sig)} differential splice sites in {n_genes} genes for '+
      " vs ".join(groups))

In [None]:
pd.set_option('display.max_columns', None)
diff_splice

IsoTools implements a specific plot to depict differential splicing results. The curves show the distribution of the posterior probability of PSI values for the two groups, while the dots represent the observed PSI values for individual samples.
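
PSI (percent spliced in) is the fraction of reads supporting one alternative of an event among all reads covering the event. The sketch below computes it per sample and pooled per group; the read counts and function names are hypothetical, and the statistical model behind `altsplice_test` is not reproduced here.

```python
def psi(inclusion_reads, exclusion_reads):
    """Fraction of reads supporting the inclusion alternative, or None without coverage."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


def pooled_psi(samples):
    """PSI computed from the summed read counts of all samples in a group."""
    inclusion = sum(inc for inc, _ in samples)
    exclusion = sum(exc for _, exc in samples)
    return psi(inclusion, exclusion)


# Hypothetical (inclusion, exclusion) read counts for three replicates per group.
counts = {
    "h1": [(90, 10), (80, 20), (85, 15)],
    "endo": [(30, 70), (20, 80), (25, 75)],
}

sample_psi = {group: [psi(inc, exc) for inc, exc in reps] for group, reps in counts.items()}
group_psi = {group: pooled_psi(reps) for group, reps in counts.items()}
delta_psi = group_psi["h1"] - group_psi["endo"]

print(sample_psi)
print(group_psi, round(delta_psi, 2))
```

The per-sample values correspond to the dots in the plot, and the difference between the group values gives a sense of the effect size of an event.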

In [None]:
plt.rcParams["figure.figsize"] = (10,4)

f,axs,plotted = plot_diff_results(diff_splice,
                                  min_diff=0.1,
                                  min_support=2,
                                  grid_shape=(1,2),
                                  group_colors=group_colors)

Additionally, the structure of the isoforms and the read coverage over the event can be visualised using a sashimi plot.

In [None]:
row = diff_splice.iloc[0]

plt.rcParams["figure.figsize"] = (10,10)
pos = [row['start']-500, row['end']+500]
joi = [(row['start'], row['end'])]
fig,axs = plt.subplots(3)
gene = transcriptome[row['gene_id']]
gene.gene_track(x_range=pos,
                ax=axs[0],
                reference=False,
                select_transcripts=gene.filter_transcripts('SUBSTANTIAL'))
gene.sashimi_plot(samples=groups['h1'],
                  junctions_of_interest=joi,
                  x_range=pos,
                  ax=axs[1],
                  title='h1',
                  log_y=False)
gene.sashimi_plot(samples=groups['endo'],
                  junctions_of_interest=joi,
                  x_range=pos,
                  ax=axs[2],
                  title='endo',
                  log_y=False)

fig.tight_layout()

## Conclusion

In this hands-on practice, we have explored some features of IsoTools for long-read transcriptomics analysis, including:

- transcriptome reconstruction from long-read alignments with quality control  
- data export in multiple formats (gtf, fasta, pickle) for downstream analysis  
- gene model characterization using simplex coordinates for structural variation and relative entropy for expression variation  
- ASE detection and differential splicing analysis between conditions

Long-read data enables the comprehensive detection of transcript isoforms and splicing events. However, quality control is crucial for meaningful biological interpretation.

IsoTools provides flexible filtering options to remove technical artifacts (e.g. internal priming, template switching and fragments) and offers various options for downstream analysis and visualisation.

**Further analysis**:
- differential isoform usage
- functional annotation of transcripts
- pathways and functional domains affected by splicing

**Happy long-read transcriptomics analysis!** 🧬📊