# Day 2: SQANTI3 Practice Session
see https://github.com/ConesaLab/SQANTI3

This Practice Session will focus on learning how to:
1. Configure and run SQANTI3 QC to perform Quality Control on a custom transcriptome using different types of orthogonal data;
2. Configure and run SQANTI3 Filter (Machine Learning) to use the results of Quality Control to filter out likely artifacts;
3. Configure and run SQANTI3 Rescue to "rescue" reference transcripts to represent reads that were lost during the Filter.

If you are interested in learning more about SQANTI3 and the SQANTI-verse, especially how to use the Rules-Filter, you can find more tutorials here: https://github.com/ConesaLab/courses-SQANTI_verse

# Setup

In [2]:
# imports
import os
import subprocess
import pandas as pd

pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [3]:
# initialize integrated IGV viewer in the notebook
import igv_notebook
igv_notebook.init()

<IPython.core.display.Javascript object>


## Installing SQANTI3

In [4]:
sqanti_path = os.path.abspath(os.path.expanduser("~/tools/SQANTI3-5.4"))

In [5]:
%%script echo Skipping SQANTI3 installation
# How to install SQANTI3?
# see also: https://github.com/ConesaLab/SQANTI3/wiki/Dependencies-and-installation
sqanti_zip = "~/downloads/SQANTI3_v5.4.zip"
sqanti_env = "sus_sqanti"
!wget https://github.com/ConesaLab/SQANTI3/releases/download/v5.4/SQANTI3_v5.4.zip -O $sqanti_zip
!rm -rf $sqanti_path
!mkdir -p $sqanti_path
!unzip $sqanti_zip -d $sqanti_path
!mv $sqanti_path/release_sqanti3/* $sqanti_path/
!rm -r $sqanti_path/release_sqanti3

# run in parent shell (outside of this notebook)
# micromamba env create -f $sqanti_path/SQANTI3.conda_env.yml -n $sqanti_env
# micromamba activate $sqanti_env
# additional requirements for this course
# micromamba install -y ipykernel

Skipping SQANTI3 installation


## Preparing paths to data ...

In [6]:
n_cores = 8
data_dir = os.path.abspath(os.path.expanduser("../data"))
output_dir = os.path.abspath(os.path.expanduser("../output"))
sample_name = "h1_endo_chr8_isotools"

# reference
ref_genome = f"{data_dir}/GRCh38.primary_assembly.genome_chr8.fa"
ref_gtf = f"{data_dir}/gencode.v45.annotation_chr8.gtf"

# transcriptome
input_gtf = f"{data_dir}/h1_endo_chr8_all.gtf"

# output prefixes and paths
qc_prefix = f"{sample_name}_qc"
qc_dir = f"{output_dir}/sqanti_qc/{sample_name}"
filter_prefix = f"{sample_name}_filter"
filter_dir = f"{output_dir}/sqanti_filter/{sample_name}"
ref_qc_prefix = "gencode.v45.annotation_chr8_qc"
ref_qc_dir = f"{output_dir}/sqanti_qc/gencode.v45.annotation_chr8"
rescue_prefix = f"{sample_name}_rescue"
rescue_dir = f"{output_dir}/sqanti_rescue/{sample_name}"

# orthogonal data
polyA_motifs = f"{sqanti_path}/data/polyA_motifs/mouse_and_human.polyA_motif.txt"
cage = f"{sqanti_path}/data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed"
counts = f"{data_dir}/h1_endo_chr8_all_count.txt"
polyA_peaks = f"{data_dir}/atlas.clusters.2.0.GRCh38.96_chr8.bed"
splice_junctions = f"{data_dir}/ENCFF498FDF_ENCFF181VTPSJ.out_chr8.tab"
sr_bam = f"{data_dir}/ENCFF498FDF_ENCFF181VTPAligned.sortedByCoord.out_chr8.bam"
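Because SQANTI3 QC takes many input files, it can save time to confirm that every path exists before launching a long run. The following is a minimal sketch of such a check; the helper name `find_missing_inputs` and the demonstration files are hypothetical and not part of SQANTI3 or of this course's data.

```python
import os
import tempfile


def find_missing_inputs(paths):
    """Return the labels of all entries whose file does not exist."""
    return [label for label, path in paths.items() if not os.path.isfile(path)]


# Self-contained demonstration with one existing and one missing file
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, "present.gtf")
    open(existing, "w").close()  # create an empty placeholder file
    demo_inputs = {
        "present": existing,
        "absent": os.path.join(tmp_dir, "absent.gtf"),
    }
    missing = find_missing_inputs(demo_inputs)
    print("Missing inputs:", missing)
```

In the notebook, the same helper could be called with a dictionary built from the variables defined above, for example `{"ref_genome": ref_genome, "ref_gtf": ref_gtf, "input_gtf": input_gtf}`.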


# SQANTI3 Quality Control

## What is SQANTI3 QC?

See also: https://github.com/ConesaLab/SQANTI3/wiki/Introduction-to-SQANTI3

The main purpose of the SQANTI3 QC module is to perform quality control on a custom transcriptome by comparing it to a reference annotation as well as orthogonal supporting data. In doing so, we can **a)** gather information about our transcriptome to better understand the known and novel transcripts we observe, and **b)** later use the Filter and Rescue modules of SQANTI3 to curate our transcriptome. This allows us to separate transcript isoforms that are probably real from probable artifacts, which should be removed from our transcriptome before downstream analyses (e.g. differential expression).

In this demonstration, besides comparing a custom transcriptome generated with **IsoTools** to a GENCODE reference annotation, we will also use the following types of supporting orthogonal data:

* **Quantification:** the count table of our isoforms as generated by **IsoTools**. Isoforms with higher expression levels are likely more trustworthy than ones with lower expression.
* **Short-read RNA-sequencing data:** the LRGASP data set also contains short-read RNA-sequencing data, which we use in two ways:
    * **Splice Junction coverage:** the "*SJ.out.tab" file as generated by STAR. This informs us whether short reads also support the same splice junctions as we observe in long-read data.
    * **Short-read alignments:** the alignment in .bam format as generated by STAR. This allows SQANTI3 to calculate the "ratio_TSS" value, which compares short-read coverage downstream and upstream of a transcript's TSS and thereby helps judge whether a possible alternative TSS is real.
* **CAGE peaks:** CAGE (*cap analysis of gene expression*) peak data from the [refTSS](http://reftss.clst.riken.jp/reftss/Main_Page) database for human and mouse comes bundled with SQANTI3. This is particularly useful for validating alternative TSS annotations.
* **PolyA peaks:** obtained from the [polyAsite](https://polyasite.unibas.ch/atlas#2) database, which is built from 3' end sequencing protocols such as QuantSeq and PolyA-seq. Similar to CAGE peaks, but for validating alternative TTS.
* **PolyA motifs:** PolyA motifs for human and mouse come bundled with SQANTI3. These are also useful for validating TTS.
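To get a feeling for the splice junction file before passing it to SQANTI3, it can help to load it as a table. The sketch below assumes the standard nine-column, header-less layout of STAR's `SJ.out.tab`; the column names are labels chosen here for readability (the file itself has no header), and the two junctions are made-up demonstration data.

```python
import io

import pandas as pd

# Labels for the nine columns of STAR's SJ.out.tab (the file has no header line)
SJ_COLUMNS = [
    "chrom", "intron_start", "intron_end", "strand", "intron_motif",
    "annotated", "unique_reads", "multi_reads", "max_overhang",
]


def load_star_junctions(handle):
    """Read a STAR SJ.out.tab file (path or file-like object) into a DataFrame."""
    return pd.read_csv(handle, sep="\t", header=None, names=SJ_COLUMNS)


# Two made-up junctions standing in for a real SJ.out.tab file
demo_sj = io.StringIO(
    "chr8\t1000\t2000\t1\t1\t1\t25\t3\t40\n"
    "chr8\t5000\t5600\t2\t0\t0\t0\t1\t12\n"
)
junctions = load_star_junctions(demo_sj)

# Keep only junctions supported by at least one uniquely mapping read
supported = junctions[junctions["unique_reads"] > 0]
print(supported[["chrom", "intron_start", "intron_end", "unique_reads"]])
```

With the course data, `load_star_junctions(splice_junctions)` would read the file configured in the setup cell.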

In [7]:
# How to run SQANTI3 Quality Control?
# See also: https://github.com/ConesaLab/SQANTI3/wiki/Running-SQANTI3-Quality-Control#running
!$sqanti_path/sqanti3_qc.py --help


      ░██████╗░░█████╗░
      ██╔═══██╗██╔══██╗
      ██║██╗██║██║░░╚═╝
      ╚██████╔╝██║░░██╗
      ░╚═██╔═╝░╚█████╔╝
      ░░░╚═╝░░░░╚════╝░
    
usage: sqanti3_qc.py [-h] --isoforms ISOFORMS --refGTF REFGTF --refFasta
                     REFFASTA [--min_ref_len MIN_REF_LEN] [--force_id_ignore]
                     [--fasta] [--genename]
                     [--novel_gene_prefix NOVEL_GENE_PREFIX] [-s SITES]
                     [-w WINDOW]
                     [--aligner_choice {minimap2,deSALT,gmap,uLTRA}]
                     [-x GMAP_INDEX] [--skipORF] [--orf_input ORF_INPUT]
                     [--short_reads SHORT_READS] [--SR_bam SR_BAM]
                     [--CAGE_peak CAGE_PEAK]
                     [--polyA_motif_list POLYA_MOTIF_LIST]
                     [--polyA_peak POLYA_PEAK] [--phyloP_bed PHYLOP_BED]
                     [-e EXPRESSION] [-c COVERAGE] [-fl FL_COUNT]
                     [--isoAnnotLite] [--gff3 GFF3] [-o OUTPUT] [-d DIR]
                     [--s

## Running SQANTI3 Quality Control

Below, we have already set up a command to run SQANTI3 QC, except that one parameter is missing.

Review the following cell, identify the missing parameter and complete the command to run SQANTI3 QC.

If you need some pointers, here's some advice on how you could get started:

* Check the output of the "--help" option in the cell above to see the parameters of SQANTI3 QC
    * If you are uncertain about what any of the parameters mean, check the [documentation](https://github.com/ConesaLab/SQANTI3/wiki/Running-SQANTI3-Quality-Control#arguments-and-parameters-in-sqanti3-qc) to find explanations of each parameter and the data that should be supplied to it.
* Review the incomplete command in the cell below; maybe you can already spot which parameter is missing.
* Run the cell below to execute the incomplete command and check whether it produces an error message.

In [10]:
# Build the SQANTI3 QC command
cmd = [
    "/usr/bin/time", "-v",                  # Measure time and resources of our execution
    f"{sqanti_path}/sqanti3_qc.py",         # SQANTI3 QC script
    "--isoforms", input_gtf,                # GTF to Quality Control
    "--refGTF", ref_gtf,                    # Reference GTF
    "--refFasta", ref_genome,               # Reference Genome
    "--fl_count", counts,                   # Counts file
    "--coverage", splice_junctions,         # splice junction short-read coverage (from STAR)
    "--SR_bam", sr_bam,                     # Short-read BAMs
    "--CAGE_peak", cage,                    # CAGE Peaks
    "--polyA_motif_list", polyA_motifs,     # PolyA Motifs
    "--polyA_peak", polyA_peaks,            # PolyA Peaks
    "--output", qc_prefix,                  # Output Prefix
    "--dir", qc_dir,                        # Output Location
    "--skipORF",                            # Skip ORF prediction (saves runtime)
    "--cpus", str(n_cores)                  # Number of Threads
]

# Print the full command for reference
print("Running command:")
print(" ".join(cmd))

# Run the command
result = subprocess.run(cmd, capture_output=True, text=True)

# Print output and errors
print("Standard Output:")
print(result.stdout)
print("Standard Error:")
print(result.stderr)

Running command:
/usr/bin/time -v /home/fjetzinger/tools/SQANTI3-5.4/sqanti3_qc.py --isoforms /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/h1_endo_chr8_all.gtf --refGTF /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/gencode.v45.annotation_chr8.gtf --refFasta /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/GRCh38.primary_assembly.genome_chr8.fa --fl_count /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/h1_endo_chr8_all_count.txt --coverage /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/ENCFF498FDF_ENCFF181VTPSJ.out_chr8.tab --SR_bam /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/ENCFF498FDF_ENCFF181VTPAligned.sortedByCoord.out_chr8.bam --CAGE_peak /home/fjetzinger/tools/SQANTI3-5.4/data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed --polyA_motif_list /home/fjetzinger/tools/SQANTI3-5.4/data/polyA_motifs/mouse_and_human.polyA_mot

#### Solution

We have not supplied the custom transcriptome that we want to perform our quality control on! We can do this with the "--isoforms" parameter. This is one of the required parameters, so SQANTI3 QC cannot run without it.
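A missing required parameter can also be caught before the command is launched. The sketch below checks a command list for the three flags that the usage message lists without square brackets (`--isoforms`, `--refGTF`, `--refFasta`). The helper `missing_required_flags` and the demonstration command are hypothetical; the check only matches flags spelled out in full.

```python
# Flags shown without square brackets in the sqanti3_qc.py usage message
REQUIRED_FLAGS = ("--isoforms", "--refGTF", "--refFasta")


def missing_required_flags(cmd, required=REQUIRED_FLAGS):
    """Return the required flags that do not appear in the command list."""
    return [flag for flag in required if flag not in cmd]


# Demonstration command that lacks the --isoforms parameter
demo_cmd = ["sqanti3_qc.py", "--refGTF", "ref.gtf", "--refFasta", "genome.fa"]
print(missing_required_flags(demo_cmd))
```

Calling `missing_required_flags(cmd)` on the list built in the exercise cell would return an empty list once the command is complete.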

In [8]:
%%script echo Skipping SQANTI3 QC solution cell
# Solution: full SQANTI3 QC command
cmd = [
    "/usr/bin/time", "-v",
    f"{sqanti_path}/sqanti3_qc.py",         # SQANTI3 QC script
    "--isoforms", input_gtf,                # GTF to Quality Control
    "--refGTF", ref_gtf,                    # Reference GTF
    "--refFasta", ref_genome,               # Reference Genome
    "--polyA_motif_list", polyA_motifs,     # PolyA Motif List
    "--polyA_peak", polyA_peaks,            # PolyA Peaks
    "--fl_count", counts,                   # Counts file
    "--coverage", splice_junctions,         # splice junction short-read coverage file (from STAR)
    "--CAGE_peak", cage,                    # CAGE Peak file
    "--SR_bam", sr_bam,                     # Short-read BAM file
    "--output", qc_prefix,                  # Output Prefix
    "--dir", qc_dir,                        # Output Location
    # "--skipORF",                          # Skip ORF Prediction (takes longer)
    "--cpus", str(n_cores)                  # Number of Threads
]

Skipping SQANTI3 QC solution cell


## Investigating SQANTI3 QC results

see also: https://github.com/ConesaLab/SQANTI3/wiki/Understanding-the-output-of-SQANTI3-QC

### Can you find answers to the following Questions in the SQANTI3 Quality Control report?

SQANTI3 provides a detailed report with a great variety of information, statistics, and plots that detail the results of the Quality Control process. You can find it under `data/sqanti_qc/h1_endo_chr8_isotools/h1_endo_chr8_isotools_qc_SQANTI3_report.html`.

1. **Splice Junction Classification:**
    
    a. How many non-canonical splice junctions exist in the transcriptome? How many of them are known vs. novel?

    b. Which structural categories do you think the majority of transcripts with novel non-canonical splice junctions belong to? 

2. **Exon Structure:**

    a. How many transcripts are multi- vs. mono-exonic? How does this differ between reference and novel transcripts?

    b. Can you think of some possible reasons for novel transcripts to contain more mono-exons than reference transcripts?

3. **Features of Good Quality:**

    a. Examine how the support levels of annotation, canonical splice junctions, splice junction coverage, and CAGE coverage differ across the FSM, ISM, NIC, and NNC structural categories. Then, try to think of possible explanations for what you are observing.

    b. Can you explain the following observations?

    1. ISM transcripts show noticeably lower levels of CAGE support.

    2. NNC transcripts show noticeably lower levels of canonical splice junctions as well as splice junction coverage.


### Investigate classification file as pandas dataframe...

Below are some examples of how pandas dataframes (or similar data structures) can be used to explore the SQANTI3 QC classification output file. Along with the report, this is one of the key outputs of SQANTI3 QC, as it lists all transcript isoforms in our transcriptome together with the information that SQANTI3 QC has added through comparison to the reference and orthogonal data (e.g. structural categories, distance to CAGE peak, etc.).

Feel free to explore and play with the following expressions to select different combinations of conditions and investigate transcripts in more detail.

In [9]:
# Load classification file
sq3_qc_class_file = f"{qc_dir}/{qc_prefix}_classification.txt"
sq3_qc_class = pd.read_csv(sq3_qc_class_file, sep='\t', low_memory=False)
sq3_qc_class.head()

Unnamed: 0,isoform,chrom,strand,length,exons,structural_category,associated_gene,associated_transcript,ref_length,ref_exons,diff_to_TSS,diff_to_TTS,diff_to_gene_TSS,diff_to_gene_TTS,subcategory,RTS_stage,all_canonical,min_sample_cov,min_cov,min_cov_pos,sd_cov,FL,n_indels,n_indels_junc,bite,iso_exp,gene_exp,ratio_exp,FSM_class,coding,ORF_length,CDS_length,CDS_start,CDS_end,CDS_genomic_start,CDS_genomic_end,predicted_NMD,perc_A_downstream_TTS,seq_A_downstream_TTS,dist_to_CAGE_peak,within_CAGE_peak,dist_to_polyA_site,within_polyA_site,polyA_motif,polyA_dist,polyA_motif_found,ORF_seq,ratio_TSS,FL.cDNA_PacBio-endo_1_coverage,FL.cDNA_PacBio-endo_2_coverage,FL.cDNA_PacBio-endo_3_coverage,FL.cDNA_PacBio-h1_1_coverage,FL.cDNA_PacBio-h1_2_coverage,FL.cDNA_PacBio-h1_3_coverage
0,ENSG00000003987.14_0,chr8,-,5071,14,full-splice_match,ENSG00000003987.14,ENST00000180173.10,5110.0,14.0,-24.0,-15.0,0.0,-15.0,reference_match,False,canonical,1.0,2.0,junction_1,1.47564,,,,False,,,,C,coding,660.0,1983.0,36.0,2018.0,17413292.0,17299862.0,False,35.0,CACTGAGATGAACTACTCAT,7.0,True,2.0,False,AATAAA,-4.0,True,MEHIRTPKVENVRLVDRVSPKKAALGTLYLTATHVIFVENSPDPRK...,3.487562,1,2,6,1,5,1
1,ENSG00000003987.14_1,chr8,-,3527,2,incomplete-splice_match,ENSG00000003987.14,ENST00000180173.10,5110.0,14.0,-111087.0,-15.0,-9409.0,-15.0,3prime_fragment,False,canonical,1.0,2.0,junction_1,0.0,,,,False,,,,C,coding,150.0,453.0,22.0,474.0,17302243.0,17299862.0,False,35.0,CACTGAGATGAACTACTCAT,,False,2.0,False,AATAAA,-4.0,True,MQPRQSVTDYLMAVKEETQQLEEELEALEERLEKIQKVQLNCTKVK...,1.333111,1,0,0,0,0,0
2,ENSG00000003987.14_2,chr8,-,1803,1,incomplete-splice_match,ENSG00000003987.14,ENST00000180173.10,5110.0,14.0,-114753.0,-2.0,-13075.0,-2.0,mono-exon,False,,,,,,,,,,,,,C,non_coding,,,,,,,,30.0,TACTCATGTGATTATGTAGA,,False,-11.0,True,AATAAA,-17.0,True,,1.19992,0,1,1,2,0,0
3,ENSG00000003987.14_3,chr8,-,4775,12,novel_in_catalog,ENSG00000003987.14,novel,5110.0,14.0,,,0.0,-15.0,combination_of_known_splicesites,False,canonical,0.0,0.0,junction_10,1.772144,,,,False,,,,C,coding,364.0,1095.0,628.0,1722.0,17313378.0,17299862.0,False,35.0,CACTGAGATGAACTACTCAT,7.0,True,2.0,False,AATAAA,-4.0,True,MSDFLWGLENSGWLRHIKAIMDAGIFIAKAVSEEGASVLVHCSDGW...,3.487562,0,1,0,0,0,0
4,ENSG00000003987.14_4,chr8,-,5180,15,novel_not_in_catalog,ENSG00000003987.14,novel,5110.0,14.0,,,0.0,-15.0,at_least_one_novel_splicesite,False,canonical,0.0,0.0,junction_14,1.933855,,,,False,,,,C,coding,544.0,1635.0,493.0,2127.0,17361236.0,17299862.0,False,35.0,CACTGAGATGAACTACTCAT,7.0,True,2.0,False,AATAAA,-4.0,True,MLDKEEREQGWVLIDLSEEYTRMGLPNHYWQLSDVNRDYRVCDSYP...,3.487562,0,1,0,0,0,0
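A quick way to get an overview of the classification table is to count transcripts per structural category and to split them into mono- and multi-exonic transcripts. The sketch below uses a small made-up table with the `structural_category` and `exons` columns shown above; with the real data, `demo_class` would be replaced by `sq3_qc_class`.

```python
import pandas as pd

# Made-up stand-in for the classification table loaded above
demo_class = pd.DataFrame({
    "isoform": ["t1", "t2", "t3", "t4", "t5"],
    "structural_category": [
        "full-splice_match", "full-splice_match", "incomplete-splice_match",
        "novel_in_catalog", "novel_not_in_catalog",
    ],
    "exons": [14, 1, 1, 12, 15],
})

# Number of transcripts per structural category
category_counts = demo_class["structural_category"].value_counts()

# Mono- vs. multi-exonic transcripts within each structural category
exon_type = demo_class["exons"].gt(1).map({True: "multi-exonic", False: "mono-exonic"})
exon_table = pd.crosstab(demo_class["structural_category"], exon_type)

print(category_counts)
print(exon_table)
```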


In [10]:
# Investigate specific gene...
sq3_qc_class[sq3_qc_class["associated_gene"] == "ENSG00000249395.4"]

Unnamed: 0,isoform,chrom,strand,length,exons,structural_category,associated_gene,associated_transcript,ref_length,ref_exons,diff_to_TSS,diff_to_TTS,diff_to_gene_TSS,diff_to_gene_TTS,subcategory,RTS_stage,all_canonical,min_sample_cov,min_cov,min_cov_pos,sd_cov,FL,n_indels,n_indels_junc,bite,iso_exp,gene_exp,ratio_exp,FSM_class,coding,ORF_length,CDS_length,CDS_start,CDS_end,CDS_genomic_start,CDS_genomic_end,predicted_NMD,perc_A_downstream_TTS,seq_A_downstream_TTS,dist_to_CAGE_peak,within_CAGE_peak,dist_to_polyA_site,within_polyA_site,polyA_motif,polyA_dist,polyA_motif_found,ORF_seq,ratio_TSS,FL.cDNA_PacBio-endo_1_coverage,FL.cDNA_PacBio-endo_2_coverage,FL.cDNA_PacBio-endo_3_coverage,FL.cDNA_PacBio-h1_1_coverage,FL.cDNA_PacBio-h1_2_coverage,FL.cDNA_PacBio-h1_3_coverage
17415,ENSG00000249395.5_0,chr8,-,1665,2,full-splice_match,ENSG00000249395.4,ENST00000669950.1,1704.0,2.0,-39.0,0.0,0.0,0.0,reference_match,True,canonical,1.0,6.0,junction_1,0.0,,,,False,,,,C,non_coding,,,,,,,,30.0,TTCTATAGCTAATGGAAGTG,,False,-18.0,True,AATAAA,-20.0,True,,1.940623,3,2,0,0,0,0
17416,ENSG00000249395.5_1,chr8,-,1513,2,full-splice_match,ENSG00000249395.4,ENST00000670695.1,1513.0,2.0,0.0,0.0,0.0,0.0,reference_match,False,canonical,1.0,1.0,junction_1,0.0,,,,False,,,,C,non_coding,,,,,,,,30.0,TTCTATAGCTAATGGAAGTG,,False,-18.0,True,AATAAA,-20.0,True,,0.900025,1,1,0,0,1,0
17417,ENSG00000249395.5_2,chr8,-,415,2,incomplete-splice_match,ENSG00000249395.4,ENST00000676364.1,668.0,3.0,-202713.0,-4.0,959.0,0.0,3prime_fragment,False,canonical,1.0,1.0,junction_1,0.0,,,,False,,,,C,non_coding,,,,,,,,10.0,CTGCTACGTTTGTGGTCATT,,False,-19.0,True,AATAAA,-16.0,True,,1.664452,0,1,0,0,0,0
17418,ENSG00000249395.5_3,chr8,-,6303,3,novel_not_in_catalog,ENSG00000249395.4,novel,668.0,3.0,,,10821.0,0.0,at_least_one_novel_splicesite,False,canonical,0.0,0.0,junction_2,0.5,,,,True,,,,C,coding,102.0,309.0,1084.0,1392.0,75130536.0,75130228.0,True,10.0,CTGCTACGTTTGTGGTCATT,,False,-19.0,True,AATAAA,-16.0,True,MDLQLHMAGLSTWGKIFPHDSITSHRVPPMWEFKMRFGWGHSQIIS...,,0,1,0,0,0,0
17419,ENSG00000249395.5_4,chr8,-,1236,1,genic,ENSG00000249395.4,novel,,,,,-7887.0,0.0,mono-exon,False,,,,,,,,,,,,,C,non_coding,,,,,,,,30.0,TTCTATAGCTAATGGAAGTG,,False,-18.0,True,AATAAA,-20.0,True,,1.608563,0,1,0,0,2,0
17420,ENSG00000249395.5_5,chr8,-,798,2,novel_not_in_catalog,ENSG00000249395.4,novel,668.0,3.0,,,0.0,-89699.0,intron_retention,False,canonical,1.0,1.0,junction_1,0.0,,,,True,,,,C,non_coding,,,,,,,,10.0,CTTTTCCTGACTGGACCCTG,,False,-9.0,True,ATTAAA,-19.0,True,,1.940623,0,1,0,0,0,0
17421,ENSG00000249395.5_6,chr8,-,3935,1,full-splice_match,ENSG00000249395.4,ENST00000675539.1,386.0,1.0,3545.0,4.0,3545.0,0.0,mono-exon,False,,,,,,,,,,,,,C,non_coding,,,,,,,,10.0,TACGTTTGTGGTCATTTGTT,,False,-23.0,True,AATAAA,-20.0,True,,,0,0,2,0,0,0
17422,ENSG00000249395.5_7,chr8,-,730,2,novel_not_in_catalog,ENSG00000249395.4,novel,668.0,3.0,,,0.0,1.0,intron_retention,False,non_canonical,0.0,0.0,junction_1,0.0,,,,False,,,,C,non_coding,,,,,,,,10.0,TGCTACGTTTGTGGTCATTT,,False,-20.0,True,AATAAA,-17.0,True,,1.940623,0,0,1,0,0,0


In [11]:
# Investigate specific conditions...
# Display multi-exonic FSM transcripts that are reference matches and have a predicted coding ORF
sq3_qc_class[
    (sq3_qc_class["exons"] > 1) &
    (sq3_qc_class["structural_category"] == "full-splice_match") &
    (sq3_qc_class["subcategory"] == "reference_match") &
    (sq3_qc_class["coding"] == "coding")
]

Unnamed: 0,isoform,chrom,strand,length,exons,structural_category,associated_gene,associated_transcript,ref_length,ref_exons,diff_to_TSS,diff_to_TTS,diff_to_gene_TSS,diff_to_gene_TTS,subcategory,RTS_stage,all_canonical,min_sample_cov,min_cov,min_cov_pos,sd_cov,FL,n_indels,n_indels_junc,bite,iso_exp,gene_exp,ratio_exp,FSM_class,coding,ORF_length,CDS_length,CDS_start,CDS_end,CDS_genomic_start,CDS_genomic_end,predicted_NMD,perc_A_downstream_TTS,seq_A_downstream_TTS,dist_to_CAGE_peak,within_CAGE_peak,dist_to_polyA_site,within_polyA_site,polyA_motif,polyA_dist,polyA_motif_found,ORF_seq,ratio_TSS,FL.cDNA_PacBio-endo_1_coverage,FL.cDNA_PacBio-endo_2_coverage,FL.cDNA_PacBio-endo_3_coverage,FL.cDNA_PacBio-h1_1_coverage,FL.cDNA_PacBio-h1_2_coverage,FL.cDNA_PacBio-h1_3_coverage
0,ENSG00000003987.14_0,chr8,-,5071,14,full-splice_match,ENSG00000003987.14,ENST00000180173.10,5110.0,14.0,-24.0,-15.0,0.0,-15.0,reference_match,False,canonical,1.0,2.0,junction_1,1.475640,,,,False,,,,C,coding,660.0,1983.0,36.0,2018.0,17413292.0,17299862.0,False,35.0,CACTGAGATGAACTACTCAT,7.0,True,2.0,False,AATAAA,-4.0,True,MEHIRTPKVENVRLVDRVSPKKAALGTLYLTATHVIFVENSPDPRK...,3.487562,1,2,6,1,5,1
63,ENSG00000008853.18_0,chr8,+,5477,10,full-splice_match,ENSG00000008853.18,ENST00000251822.7,5482.0,10.0,0.0,-5.0,0.0,2.0,reference_match,False,canonical,1.0,13.0,junction_7,5.130398,,,,False,,,,C,coding,727.0,2184.0,569.0,2752.0,23004435.0,23017469.0,False,15.0,CACAATGTGCTTCCTTGTTT,5.0,True,-23.0,True,CATAAA,-18.0,True,MDSDMDYERPNVETIKCVVVGDNAVGKTRLICARACNATLTQYQLL...,2.921599,8,10,14,5,3,0
66,ENSG00000008853.18_3,chr8,+,3459,4,full-splice_match,ENSG00000008853.18,ENST00000690180.1,3425.0,4.0,0.0,34.0,0.0,2.0,reference_match,False,canonical,1.0,13.0,junction_1,4.242641,,,,False,,,,C,coding,177.0,534.0,201.0,734.0,23010568.0,23017469.0,False,15.0,CACAATGTGCTTCCTTGTTT,-2731.0,False,-23.0,True,CATAAA,-18.0,True,MRAVLEYLYTGMFTSSPDLDDMKLIILANRLCLPHLVALTEQYTVT...,0.971437,1,0,1,0,0,0
79,ENSG00000008988.11_1,chr8,-,519,4,full-splice_match,ENSG00000008988.11,ENST00000009589.8,519.0,4.0,0.0,0.0,0.0,0.0,reference_match,False,canonical,1.0,1789.0,junction_1,54.162308,,,,False,,,,C,coding,119.0,360.0,124.0,483.0,56074383.0,56073090.0,False,10.0,CTTCTGTTGGTTTTTATTCA,0.0,True,-25.0,True,AATAAA,-19.0,True,MAFKDTGKTPVEPEVAIHRIRITLTSRNVKSLEKVCADLIRGAKEK...,8.591371,179,252,270,82,435,78
80,ENSG00000008988.11_2,chr8,-,646,4,full-splice_match,ENSG00000008988.11,ENST00000521262.5,650.0,4.0,-4.0,0.0,0.0,0.0,reference_match,False,canonical,1.0,19.0,junction_3,864.060183,,,,False,,,,C,coding,119.0,360.0,251.0,610.0,56074256.0,56073090.0,False,10.0,CTTCTGTTGGTTTTTATTCA,0.0,True,-25.0,True,AATAAA,-19.0,True,MAFKDTGKTPVEPEVAIHRIRITLTSRNVKSLEKVCADLIRGAKEK...,8.591371,2,5,3,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18855,ENSG00000291080.2_6,chr8,-,5207,3,full-splice_match,ENSG00000291080.1,ENST00000426361.7,5211.0,3.0,-1.0,-3.0,0.0,-3.0,reference_match,False,canonical,1.0,1.0,junction_1,3.000000,,,,False,,,,C,coding,481.0,1446.0,215.0,1660.0,144999633.0,144977139.0,False,20.0,CCATTATTGCCCATTTTCAT,-3.0,True,-20.0,True,ATTAAA,-17.0,True,MGTYGNLLSLGCDSRIEKEEMIPKQDISEELESQRAKSEDHVRNIF...,16.841584,0,1,5,0,2,0
18890,ENSG00000291317.1_1,chr8,-,1469,2,full-splice_match,ENSG00000291317.1,ENST00000403000.6,1477.0,2.0,0.0,-8.0,0.0,0.0,reference_match,False,canonical,1.0,65.0,junction_1,0.000000,,,,False,,,,C,coding,192.0,579.0,589.0,1167.0,144464901.0,144464127.0,False,30.0,TATGAAAACTGCCCATTTTT,63.0,True,-27.0,True,AATAAA,-21.0,True,MAPKPGAEWSTALSHLVLGVVSLHAAVSTAEASRGAAAGFLLQVLA...,3.515318,7,5,3,1,7,2
18891,ENSG00000291317.1_2,chr8,-,1213,3,full-splice_match,ENSG00000291317.1,ENST00000306145.10,1216.0,3.0,-3.0,0.0,0.0,0.0,reference_match,False,canonical,1.0,33.0,junction_2,16.000000,,,,False,,,,C,coding,192.0,579.0,333.0,911.0,144464901.0,144464127.0,False,30.0,TATGAAAACTGCCCATTTTT,63.0,True,-27.0,True,AATAAA,-21.0,True,MAPKPGAEWSTALSHLVLGVVSLHAAVSTAEASRGAAAGFLLQVLA...,3.515318,6,21,15,2,13,2
18897,ENSG00000291317.1_8,chr8,-,1187,3,full-splice_match,ENSG00000291317.1,ENST00000424149.6,1162.0,3.0,33.0,-8.0,14.0,0.0,reference_match,False,canonical,1.0,1.0,junction_2,32.000000,,,,False,,,,C,coding,192.0,579.0,307.0,885.0,144464901.0,144464127.0,False,30.0,TATGAAAACTGCCCATTTTT,-4.0,True,-27.0,True,AATAAA,-21.0,True,MAPKPGAEWSTALSHLVLGVVSLHAAVSTAEASRGAAAGFLLQVLA...,1.624220,0,0,1,0,1,0


In [12]:
# Investigate specific conditions...

# dynamically identify columns with quantification data per sample
fl_sample_cols = [col for col in sq3_qc_class.columns if col.startswith('FL.cDNA_PacBio-')]

# Display high-confidence, coding novel isoforms
sq3_qc_class[
    (sq3_qc_class["exons"] > 1) &                                               # multi-exonic
    ((sq3_qc_class["structural_category"] == "novel_in_catalog") |              # NIC (novel combination of known splice-sites)
        (sq3_qc_class["structural_category"] == "novel_not_in_catalog")) &      # or NNC (at least 1 novel splice site)
    (sq3_qc_class["coding"] == "coding") &                                      # has predicted coding sequence
    (sq3_qc_class["min_cov"] > 0) &                                             # every splice junction has coverage of at least 1 short read
    (sq3_qc_class["within_CAGE_peak"]) &                                        # TSS is within CAGE peak
    (sq3_qc_class["within_polyA_site"]) &                                       # TTS is within polyA site
    (sq3_qc_class["polyA_motif_found"]) &                                       # TTS has polyA motif
    ((sq3_qc_class[fl_sample_cols] > 0).sum(axis=1) >= 2) &                     # detected in at least 2 samples
    ((sq3_qc_class[fl_sample_cols].sum(axis=1)) >= 10)                          # detected with at least 10 reads
]

Unnamed: 0,isoform,chrom,strand,length,exons,structural_category,associated_gene,associated_transcript,ref_length,ref_exons,diff_to_TSS,diff_to_TTS,diff_to_gene_TSS,diff_to_gene_TTS,subcategory,RTS_stage,all_canonical,min_sample_cov,min_cov,min_cov_pos,sd_cov,FL,n_indels,n_indels_junc,bite,iso_exp,gene_exp,ratio_exp,FSM_class,coding,ORF_length,CDS_length,CDS_start,CDS_end,CDS_genomic_start,CDS_genomic_end,predicted_NMD,perc_A_downstream_TTS,seq_A_downstream_TTS,dist_to_CAGE_peak,within_CAGE_peak,dist_to_polyA_site,within_polyA_site,polyA_motif,polyA_dist,polyA_motif_found,ORF_seq,ratio_TSS,FL.cDNA_PacBio-endo_1_coverage,FL.cDNA_PacBio-endo_2_coverage,FL.cDNA_PacBio-endo_3_coverage,FL.cDNA_PacBio-h1_1_coverage,FL.cDNA_PacBio-h1_2_coverage,FL.cDNA_PacBio-h1_3_coverage
474,ENSG00000040341.18_1,chr8,-,2865,14,novel_in_catalog,ENSG00000040341.18,novel,2970.0,15.0,,,0.0,-1.0,combination_of_known_junctions,False,canonical,1.0,22.0,junction_13,78.103103,,,,False,,,,C,coding,538.0,1617.0,247.0,1863.0,73738301.0,73421372.0,False,5.0,AGCGCTGTTGTTGTTCTTTT,-12.0,True,-14.0,True,AGTAAA,-23.0,True,MLQINQMFSVQLSLGEQTWESEGSSIKKAQQAVANKALTESTLPKP...,6.549390,2,5,3,2,4,0
526,ENSG00000040341.18_53,chr8,-,4088,12,novel_in_catalog,ENSG00000040341.18,novel,4061.0,12.0,,,0.0,0.0,combination_of_known_junctions,False,canonical,1.0,22.0,junction_11,72.177811,,,,False,,,,C,coding,479.0,1440.0,247.0,1686.0,73738301.0,73552006.0,False,35.0,AAAGTTGTTTTAACTTTTAA,-12.0,True,-10.0,True,AGTAAA,-23.0,True,MLQINQMFSVQLSLGEQTWESEGSSIKKAQQAVANKALTESTLPKP...,6.549390,2,3,2,1,4,1
903,ENSG00000066777.9_2,chr8,-,7132,40,novel_in_catalog,ENSG00000066777.9,novel,7320.0,39.0,,,0.0,1.0,combination_of_known_junctions,True,canonical,1.0,10.0,junction_39,9.298328,,,,False,,,,C,coding,1840.0,5523.0,349.0,5871.0,67343260.0,67198934.0,False,35.0,TCAGTGCAATCAAAACCTGT,-3.0,True,-26.0,True,AATAAA,-18.0,True,MFLTRALEKILADKEVKKAHHSQLRKACEVALEEIKAETEKQSPPH...,1.157853,5,5,4,4,6,1
1018,ENSG00000067167.8_7,chr8,-,2763,11,novel_not_in_catalog,ENSG00000067167.8,novel,3056.0,11.0,,,0.0,-2.0,intron_retention,False,canonical,1.0,8.0,junction_9,118.005254,,,,True,,,,C,coding,373.0,1122.0,154.0,1275.0,70608199.0,70574932.0,False,20.0,AAGTTGTATCCTGGATGTTT,23.0,True,-29.0,True,AATAAA,-16.0,True,MAIRKKSTKSPPVLSHEFVLQNHADIVSCVAMVFLLGLMFEITAKA...,7.607977,4,4,6,5,3,2
1059,ENSG00000070501.12_1,chr8,+,1231,13,novel_in_catalog,ENSG00000070501.12,novel,1290.0,14.0,,,0.0,-1.0,combination_of_known_junctions,False,canonical,1.0,17.0,junction_1,12.583885,,,,False,,,,C,coding,181.0,546.0,536.0,1081.0,42357209.0,42371657.0,False,30.0,ATTGGATGATGATGATCTTA,0.0,True,-26.0,True,AATAAA,-16.0,True,MLQMQDIVLNEVKKVDSEYIATVCGSFRRGAESSGDMDVLLTHPSF...,4.540192,1,7,6,1,6,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17267,ENSG00000241852.11_1,chr8,+,2109,8,novel_in_catalog,ENSG00000248235.6,novel,977.0,5.0,,,-5.0,2050.0,combination_of_known_splicesites,False,canonical,1.0,6.0,junction_2,39.158495,,,,False,,,,B,coding,400.0,1203.0,79.0,1281.0,22589597.0,22603306.0,False,30.0,CAAAGTGCTTTGTAAACCTC,2.0,True,-21.0,True,ATTAAA,-7.0,True,MDSEGGSLLLDEDSEVFKMLQENREGRAAPRQSSSFRLLQEALEAE...,2.980002,2,4,2,0,1,1
17282,ENSG00000241852.11_16,chr8,+,3167,6,novel_in_catalog,ENSG00000241852.11,novel,2056.0,7.0,,,0.0,1.0,intron_retention,False,canonical,1.0,21.0,junction_5,16.702096,,,,False,,,,C,coding,438.0,1317.0,1023.0,2339.0,22600623.0,22603306.0,False,30.0,CAAAGTGCTTTGTAAACCTC,-4.0,True,-21.0,True,ATTAAA,-7.0,True,MDVACQGFAYHTAQSTGSLSLTTVAKGIPSTHLPDSWGHIRAARGC...,3.054985,2,2,4,2,1,1
17509,ENSG00000250571.7_8,chr8,+,2868,4,novel_not_in_catalog,ENSG00000250571.7,novel,1335.0,4.0,,,0.0,0.0,at_least_one_novel_splicesite,True,canonical,1.0,3.0,junction_1,13.199327,,,,False,,,,C,coding,381.0,1146.0,1596.0,2741.0,143275659.0,143276804.0,False,10.0,GTGGGAGCCGGCACCTGCCT,17.0,True,-32.0,True,AATAAA,-20.0,True,MVPAALGHPATPPTPVSMSALLGTALRAPVRLSPIHWVPGPSLSPL...,4.530979,3,1,6,0,0,0
17511,ENSG00000250571.7_10,chr8,+,1949,4,novel_not_in_catalog,ENSG00000250571.7,novel,1335.0,4.0,,,0.0,0.0,at_least_one_novel_splicesite,True,canonical,1.0,3.0,junction_1,12.036980,,,,False,,,,C,coding,376.0,1131.0,692.0,1822.0,143269397.0,143276804.0,False,10.0,GTGGGAGCCGGCACCTGCCT,17.0,True,-32.0,True,AATAAA,-20.0,True,MAALGDIQESPSVPSPVSLSSPGTPGTQHHEPQLHLHGHQHGSPGS...,4.530979,4,9,3,0,0,0
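The support levels discussed in the report questions can also be approximated directly from the classification table by averaging boolean columns per structural category. The sketch below uses made-up rows with the `within_CAGE_peak`, `within_polyA_site`, and `all_canonical` columns shown above; with the real data, `demo_class` would be replaced by `sq3_qc_class`.

```python
import pandas as pd

# Made-up stand-in for the classification table
demo_class = pd.DataFrame({
    "structural_category": [
        "full-splice_match", "full-splice_match",
        "incomplete-splice_match", "incomplete-splice_match",
        "novel_not_in_catalog",
    ],
    "within_CAGE_peak": [True, True, False, True, False],
    "within_polyA_site": [True, False, True, True, False],
    "all_canonical": ["canonical", "canonical", "canonical", "canonical", "non_canonical"],
})

# Fraction of transcripts per category that have each type of support
support = (
    demo_class
    .assign(all_canonical=demo_class["all_canonical"].eq("canonical"))  # text label -> boolean
    .groupby("structural_category")[["within_CAGE_peak", "within_polyA_site", "all_canonical"]]
    .mean()
)
print(support)
```

Note that mono-exonic transcripts have no splice junctions, so their `all_canonical` value is missing and would count as unsupported here; restricting the table to `exons > 1` first avoids this.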


### Investigate in Genome Viewer...

In [13]:
igv_browser_qc = igv_notebook.Browser(
    {
        "reference": {
            "id": "hg38",
            "name": "Human (GRCh38/hg38)",
            "fastaPath": "../../data/reference/GRCh38.primary_assembly.genome_chr8.fa",
            "indexPath": "../../data/reference/GRCh38.primary_assembly.genome_chr8.fa.fai"
        },
        "locus": "chr8:75,391,802-75,577,425",
        "tracks": [
            {
                "name": "IsoTools",
                "path": "../../data/sqanti_qc/h1_endo_chr8_isotools/h1_endo_chr8_isotools_qc_corrected.gtf",
                "format": "gtf",
                "type": "annotation",
                "displayMode": "SQUISHED"
            },
            {
                "name": "Reference",
                "path": "../../data/reference/gencode.v45.annotation_chr8.gtf",
                "format": "gtf",
                "type": "annotation",
                "displayMode": "SQUISHED"
            }
        ]
    }
)

<IPython.core.display.Javascript object>

# SQANTI3 Filter

See also: https://github.com/ConesaLab/SQANTI3/wiki/Running-SQANTI3-filter

Now that we have run SQANTI3 Quality Control, we can differentiate between isoforms with higher confidence (e.g. high expression, supported by different types of orthogonal data, etc.) and lower confidence (e.g. no orthogonal data support). To automate the process of differentiating between real transcript isoforms and artifacts, the SQANTI3 Filter module gives us two options:

* Define a set of hard-coded **rules** by which to differentiate isoforms and artifacts.
    * While the decisions taken by this approach are easily explainable as all rules are explicitly detailed, it takes considerable expertise to craft and refine a suitable filter, and requirements may also change based on data type, availability of data, and many more factors.
* Apply **machine learning** (specifically, a random forest model) to perform this differentiation in a more automated fashion.
    * Although this approach still requires a set of individual rules to identify high-confidence true positive and true negative sets by which to perform the initial training of the model, it then automatically learns how to differentiate isoforms and artifacts. Explainability in this case is limited to evaluating the importance of the features used in the filter.
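
To make the first option more concrete, the sketch below applies one hand-written rule to a toy classification table with pandas. It is only an illustration under our own assumptions: the rows are invented, the rule is arbitrary, and this is not how the SQANTI3 rules filter is configured (see the wiki page linked above for that). Only the column names are taken from the classification file.

```python
import pandas as pd

# Toy classification table; the rows are invented for illustration only.
toy_class = pd.DataFrame(
    {
        "isoform": ["iso_1", "iso_2", "iso_3", "iso_4"],
        "structural_category": [
            "full-splice_match",
            "novel_not_in_catalog",
            "novel_not_in_catalog",
            "novel_in_catalog",
        ],
        "all_canonical": ["canonical", "non_canonical", "canonical", "canonical"],
        "within_CAGE_peak": [True, False, True, False],
    }
)

# Example rule (an assumption, not a SQANTI3 default): keep every FSM, and keep
# other isoforms only if all junctions are canonical and the TSS has CAGE support.
is_fsm = toy_class["structural_category"] == "full-splice_match"
is_supported = (toy_class["all_canonical"] == "canonical") & toy_class["within_CAGE_peak"]
toy_class["filter_result"] = (is_fsm | is_supported).map(
    {True: "Isoform", False: "Artifact"}
)

print(toy_class[["isoform", "filter_result"]])
```

Every decision here can be traced back to an explicit condition, which is the main appeal of the rules approach; the price is that each condition has to be chosen and justified by hand.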


In this course, we will focus on the application of **the machine learning filter**.

In [14]:
# How to run SQ3 Filter?
# see also: https://github.com/ConesaLab/SQANTI3/wiki/Running-SQANTI3-filter
# specifically for the Machine Learning Filter: https://github.com/ConesaLab/SQANTI3/wiki/Running-SQANTI3-filter#ml
!$sqanti_path/sqanti3_filter.py ml --help


      ███████╗██╗██╗░░░░░████████╗███████╗██████╗░
      ██╔════╝██║██║░░░░░╚══██╔══╝██╔════╝██╔══██╗
      █████╗░░██║██║░░░░░░░░██║░░░█████╗░░██████╔╝
      ██╔══╝░░██║██║░░░░░░░░██║░░░██╔══╝░░██╔══██╗
      ██║░░░░░██║███████╗░░░██║░░░███████╗██║░░██║
      ╚═╝░░░░░╚═╝╚══════╝░░░╚═╝░░░╚══════╝╚═╝░░╚═╝
    
usage: sqanti3_filter.py ml [-h] --sqanti_class SQANTI_CLASS
                            [--isoAnnotGFF3 ISOANNOTGFF3]
                            [--filter_isoforms FILTER_ISOFORMS]
                            [--filter_gtf FILTER_GTF]
                            [--filter_sam FILTER_SAM]
                            [--filter_faa FILTER_FAA] [-o OUTPUT] [-d DIR]
                            [--skip_report] [-e] [-v] [-c CPUS]
                            [-t PERCENT_TRAINING] [-p TP] [-n TN]
                            [-j THRESHOLD] [-f] [--intermediate_files]
                            [-r REMOVE_COLUMNS] [-z MAX_CLASS_SIZE]
                            [-i INTRAPRIMING]

ML fil

## Running SQ3 Filter

### Preprocessing

#### Choosing True Positive / True Negative Transcript Sets 
SQANTI3 can choose these sets automatically, but we have found that the following criteria yield more stringent filtering results.
The SQANTI3 Machine Learning Filter trains a random forest model on these True Positive / True Negative transcripts and then applies it to filter the entire transcriptome.

In [15]:
filter_sets_dir = os.path.join(filter_dir, "filter_sets")
max_set_size = 300 # reduced from 3000 to 300 for faster runtime for the course; use e.g. 3000 for real analysis

# Ensure required columns are present
required_columns = ['structural_category', 'all_canonical', 'within_CAGE_peak', 'within_polyA_site', 'exons', 'isoform']
for col in required_columns:
    if col not in sq3_qc_class.columns:
        raise ValueError(f"Required column '{col}' not found in the classification file.")

# Define True Positives (TP)
tp_conditions = (
    (sq3_qc_class['structural_category'] == 'full-splice_match') &
    (sq3_qc_class['all_canonical'] == "canonical") &
    (sq3_qc_class['within_CAGE_peak'] == True) &
    (sq3_qc_class['within_polyA_site'] == True) &
    (sq3_qc_class['exons'] > 1) 
)
tp_df = sq3_qc_class[tp_conditions]

# Define True Negatives (TN)
tn_conditions = (
    (sq3_qc_class['structural_category'] != 'full-splice_match') &
    (
        (sq3_qc_class['all_canonical'] == "non_canonical") |
        (sq3_qc_class['within_CAGE_peak'] != True) |
        (sq3_qc_class['within_polyA_site'] != True)
    ) &
    (sq3_qc_class['exons'] > 1)
)
tn_df = sq3_qc_class[tn_conditions]

# Sample up to max_set_size transcripts for each set
tp_sample = tp_df.sample(n=min(len(tp_df), max_set_size), random_state=42)
tn_sample = tn_df.sample(n=min(len(tn_df), max_set_size), random_state=42)

if len(tp_sample) < 250:
    raise ValueError(f"Not enough TP transcripts found. Found {len(tp_sample)}, required 250.")
if len(tn_sample) < 250:
    raise ValueError(f"Not enough TN transcripts found. Found {len(tn_sample)}, required 250.")

# Save isoform IDs to files
os.makedirs(filter_sets_dir, exist_ok=True)
tp_file = os.path.join(filter_sets_dir, "TP_list.txt")
tn_file = os.path.join(filter_sets_dir, "TN_list.txt")
tp_sample['isoform'].to_csv(tp_file, index=False, header=False)
tn_sample['isoform'].to_csv(tn_file, index=False, header=False)

print(f"Saved {len(tp_sample)} TP isoforms to {tp_file}")
print(f"Saved {len(tn_sample)} TN isoforms to {tn_file}")
    
# Exclusion file for variables that shouldn't be used in ML filtering
exclusion_file = os.path.join(filter_sets_dir, "exclusion_list.txt")
with open(exclusion_file, 'w') as f:
    f.write("all_canonical\n")
    f.write("within_CAGE_peak\n")
    f.write("within_polyA_site\n")
    f.write("dist_to_CAGE_peak\n")
    f.write("dist_to_polyA_site\n")
print(f"Exclusion list saved to {exclusion_file}")


Saved 300 TP isoforms to /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools/filter_sets/TP_list.txt
Saved 300 TN isoforms to /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools/filter_sets/TN_list.txt
Exclusion list saved to /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools/filter_sets/exclusion_list.txt


#### Adjusting quantification columns

The quantification data is currently split into 6 individual columns, one for each sample. However, we assume that presenting the information to the filter like this may not be as effective as e.g. providing the total sum of counts, or perhaps the counts per condition rather than per sample - so we can adjust the classification file before we supply it to the Filter.

In [16]:
fl_endo_cols = [col for col in fl_sample_cols if col.startswith('FL.cDNA_PacBio-endo')]
fl_h1_cols = [col for col in fl_sample_cols if col.startswith('FL.cDNA_PacBio-h1')]

sq3_qc_class["FL"] = sq3_qc_class[fl_sample_cols].sum(axis=1)
sq3_qc_class["FL.endo"] = sq3_qc_class[fl_endo_cols].sum(axis=1)
sq3_qc_class["FL.h1"] = sq3_qc_class[fl_h1_cols].sum(axis=1)

sq3_qc_class.drop(columns=fl_sample_cols, inplace=True)

sq3_qc_class_file_mod = f"{qc_dir}/{qc_prefix}_classification_mod.txt"
sq3_qc_class.to_csv(sq3_qc_class_file_mod, sep='\t', index=False)

In [None]:
# Run SQ3 Filter

corrected_fasta = qc_dir + "/" + qc_prefix + "_corrected.fasta"
corrected_gtf = qc_dir + "/" + qc_prefix + "_corrected.gtf"

# Build the command

# Default execution with automatic definition of TP and TN sets:
# TP: all multi-exon Reference Match (RM, subcategory of FSM where TTS/TSS within 50bp); if <250, all FSM
# TN: all multi-exon Novel Not in Catalog (NNC) that have at least one non-canonical junction; if <250, all NNC
cmd_default = [
    "/usr/bin/time", "-v",
    f"{sqanti_path}/sqanti3_filter.py",         # SQANTI3 Filter script
    "ml",                                       # Mode: Machine Learning
    "--sqanti_class", sq3_qc_class_file_mod,    # Classification file
    "--dir", filter_dir,                        # Output Location
    "--filter_isoforms", corrected_fasta,       # Corrected fasta (isoform sequences to filter)
    "--filter_gtf", corrected_gtf,              # GTF to filter
    "--output", filter_prefix,                  # Output prefix
    "--cpus", str(n_cores),                     # Number of cores
]

# Command with manual definition of TP and TN sets
cmd = [
    "/usr/bin/time", "-v",
    f"{sqanti_path}/sqanti3_filter.py",         # SQANTI3 Filter script
    "ml",                                       # Mode: Machine Learning
    "--sqanti_class", sq3_qc_class_file_mod,    # Classification file
    "--dir", filter_dir,                        # Output Location
    "--filter_isoforms", corrected_fasta,       # Corrected fasta (isoform sequences to filter)
    "--filter_gtf", corrected_gtf,              # GTF to filter
    "--threshold", "0.7",                       # Threshold for Machine Learning
    "--TP", tp_file,                            # TP list
    "--TN", tn_file,                            # TN list
    "--remove_columns", exclusion_file,         # Exclusion list (features that were used in identifying TP/TN sets)
    "--output", filter_prefix,                  # Output prefix
    "--cpus", str(n_cores),                     # Number of cores
]

# Print the command for reference
print("Running command:")
print(" ".join(cmd))

# Run the command
result = subprocess.run(cmd, capture_output=True, text=True)

# Print output and errors
print("Output:")
print(result.stdout)
print("Errors:")
print(result.stderr)

Running command:
/usr/bin/time -v /home/fjetzinger/tools/SQANTI3-5.4/sqanti3_filter.py ml --sqanti_class /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_qc/h1_endo_chr8_isotools/h1_endo_chr8_isotools_qc_classification_mod.txt --dir /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools --filter_isoforms /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_qc/h1_endo_chr8_isotools/h1_endo_chr8_isotools_qc_corrected.fasta --filter_gtf /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_qc/h1_endo_chr8_isotools/h1_endo_chr8_isotools_qc_corrected.gtf --threshold 0.7 --TP /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools/filter_sets/TP_list.txt --TN /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools/filter_sets/TN_list.
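
The cell above prints stdout and stderr, but it does not check whether the command actually succeeded. One possible way to make failures visible is a small wrapper that raises on a non-zero exit code. `run_cmd` is a hypothetical helper name introduced for this sketch, and the demonstration uses a harmless Python one-liner instead of a real SQANTI3 call.

```python
import subprocess
import sys


def run_cmd(cmd):
    """Run a command, print its output, and raise if it exits with an error."""
    print("Running command:")
    print(" ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print("Output:")
    print(result.stdout)
    print("Errors:")
    print(result.stderr)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}")
    return result


# Demonstration with a harmless command instead of a real SQANTI3 call
demo = run_cmd([sys.executable, "-c", "print('ok')"])
```

In the notebook, the same wrapper could be called as `run_cmd(cmd)` with the command lists built above, so that a failed run stops the notebook instead of passing silently.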

## Investigating SQANTI3 Filter Results

### Can you find answers to the following Questions?

Similarly to the report of the Quality Control process, SQANTI3 also provides a report for the Filter process. You can find it under `data/sqanti_filter/h1_endo_chr8_isotools/h1_endo_chr8_isotools_filter_SQANTI3_filter_report.html`

1. **Isoforms per category:**
    
    Which structural categories are more strongly affected by the filter?

2. **Variable importance in Random Forest classifier:**

    The *variable importance* measures how strongly a feature affects the outcome of the random forest classifier (whether an isoform is classified as an artifact or not). Review which features are particularly important. Which types of artifact might each of these features help to identify?

3. **Configuration:** (review documentation rather than report)

    In the course of a real analysis, it could occur that you are not yet happy with the results of the SQANTI3 Filter - maybe you would like the filter to be more or less stringent, or you observe a particular type of artifact in your results that you would like the filter to remove. Which options do you have to further adjust the filter?



#### **Configuration**: solution

In order to further adjust the SQANTI3 Filter, you could consider the following options:

* Look for additional data (short reads, CAGE, polyA). There are publicly available data sets for many species and tissues. These provide valuable information to SQANTI3 QC and, subsequently, to the Filter.
* Supply a custom set of True Positive and True Negative isoforms, or adjust the sets you supply. E.g. if you observe a specific type of artifact you would like to filter out, ensure that sufficient examples of that type of artifact are present in the True Negative set. However, keep in mind that any features you use to identify the TP/TN sets should be excluded from the Filter to avoid overfitting.
* Adjust the `--threshold` parameter to make the filter more or less stringent.
* (Out of scope of SQANTI) Review the data and data quality as well as the choice of tool and its parameters used to build the transcriptome. Try a different set of parameters or a different tool and see what changes.
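
To illustrate the effect of `--threshold`, the sketch below counts how many isoforms of a toy table would be classified as positive at different cut-offs. The `POS_MLprob` values are invented, and the comparison used (`>=`) is our assumption about how the cut-off is applied; please check the SQANTI3 documentation for the exact rule.

```python
import pandas as pd

# Invented probabilities; in real output they are found in the POS_MLprob column
probs = pd.DataFrame(
    {
        "isoform": ["iso_1", "iso_2", "iso_3", "iso_4"],
        "POS_MLprob": [0.95, 0.72, 0.65, 0.30],
    }
)


def count_positive(df, threshold):
    """Count isoforms whose positive-class probability reaches the threshold."""
    return int((df["POS_MLprob"] >= threshold).sum())


for threshold in (0.5, 0.7, 0.9):
    print(f"threshold={threshold}: {count_positive(probs, threshold)} positive isoforms")
```

A higher threshold leaves fewer positive isoforms, i.e. a more stringent filter. Note that the real output reports `filter_result` alongside `ML_classifier` and `intra_priming`, so the ML probability is not necessarily the only criterion for the final decision.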

### Investigate in Genome Viewer... 

In [18]:
igv_browser_filter = igv_notebook.Browser(
    {
        "reference": {
            "id": "hg38",
            "name": "Human (GRCh38/hg38)",
            "fastaPath": "../../data/reference/GRCh38.primary_assembly.genome_chr8.fa",
            "indexPath": "../../data/reference/GRCh38.primary_assembly.genome_chr8.fa.fai"
        },
        "locus": "chr8:75,391,802-75,577,425",
        "tracks": [
            {
                "name": "IsoTools (after filter)",
                "path": "../../data/sqanti_filter/h1_endo_chr8_isotools/h1_endo_chr8_isotools_filter.filtered.gtf",
                "format": "gtf",
                "type": "annotation",
                "displayMode": "SQUISHED"
            },
            {
                "name": "Reference",
                "path": "../../data/reference/gencode.v45.annotation_chr8.gtf",
                "format": "gtf",
                "type": "annotation",
                "displayMode": "SQUISHED"
            }
        ]
    }
)

<IPython.core.display.Javascript object>

# SQANTI3 Rescue

See also: https://github.com/ConesaLab/SQANTI3/wiki/Running-SQANTI3-rescue

Now that SQANTI3 Filter has filtered out probable artifacts, the remaining transcripts in our transcriptome are much more reliable. However, we have also lost some data in this process - and while a specific isoform may have been filtered out, the underlying reads could still be useful.

SQANTI3 Rescue now allows us to "rescue" some transcripts from the reference annotations. 

Imagine, for instance, the following situation:

* Gene X exists in the reference annotation with 1 transcript, X.1

* Our custom transcriptome does not contain this transcript X.1, but instead defines a slightly different novel transcript, X.2, supported by 10 reads.

* The Filter decides that X.2 is likely an artifact and therefore removes it from our curated transcriptome. However, the data obtained from those 10 reads is now completely lost! The X.2 transcript may be an artifact, but we could still consider those 10 reads to be evidence that Gene X is expressed.

* So now, rather than completely losing Gene X from our transcriptome, SQANTI3 Rescue can "rescue" the transcript X.1 from the reference annotation to represent those 10 reads, provided that **1.** X.2 is either an FSM or ISM of X.1, or it maps well to X.1, and **2.** the transcript X.1 passes the filter.

To use the terminology of the graphics below, in this case the novel transcript (artifact) X.2 is the rescue candidate, while the reference transcript X.1 is the rescue target.
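
The decision logic of this example can be condensed into a few lines of Python. This is a strongly simplified sketch under our own assumptions (invented IDs, a plain dictionary standing in for the assignment of rescue candidates to rescue targets), not SQANTI3's implementation.

```python
# Rescue candidates (artifacts removed by the Filter) and the reference
# transcript (rescue target) each one was assigned to. All IDs are invented.
candidate_to_target = {
    "X.2": "X.1",  # novel artifact of Gene X -> reference transcript X.1
    "X.3": "X.1",  # a second candidate pointing at the same target
    "Y.3": "Y.1",
}

# Reference transcripts that pass the Filter when it is applied to the reference
reference_passing_filter = {"X.1"}

# A target is rescued once, no matter how many candidates point to it,
# and only if the target itself passes the Filter.
rescued = sorted(
    {
        target
        for target in candidate_to_target.values()
        if target in reference_passing_filter
    }
)
print(rescued)
```

Here X.1 is rescued once on behalf of both X.2 and X.3, while Y.1 is not rescued because it does not pass the Filter itself.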

<img src="../images/SQ3_rescue_01-automatic.png" width=500 style="background-color:lightgrey;" />
<img src="../images/SQ3_rescue_02-mapping.png" width=364 style="background-color:lightgrey;" />
<img src="../images/SQ3_rescue_03-ref-filter.png" width=869 style="background-color:lightgrey;" />
<img src="../images/SQ3_rescue_04-rescue.png" width=869 style="background-color:lightgrey;" />

In [26]:
# How to run SQ3 Rescue?
# see also: https://github.com/ConesaLab/SQANTI3/wiki/Running-SQANTI3-rescue
!$sqanti_path/sqanti3_rescue.py ml --help


      ██████╗░███████╗░██████╗░█████╗░██╗░░░██╗███████╗
      ██╔══██╗██╔════╝██╔════╝██╔══██╗██║░░░██║██╔════╝
      ██████╔╝█████╗░░╚█████╗░██║░░╚═╝██║░░░██║█████╗░░
      ██╔══██╗██╔══╝░░░╚═══██╗██║░░██╗██║░░░██║██╔══╝░░
      ██║░░██║███████╗██████╔╝╚█████╔╝╚██████╔╝███████╗
      ╚═╝░░╚═╝╚══════╝╚═════╝░░╚════╝░░╚═════╝░╚══════╝
    
usage: sqanti3_rescue.py ml [-h] --filter_class FILTER_CLASS --refGTF REFGTF
                            --refFasta REFFASTA
                            [--rescue_isoforms RESCUE_ISOFORMS]
                            [--rescue_gtf RESCUE_GTF] [-k REFCLASSIF]
                            [--counts COU

## Running SQANTI3 QC to classify the reference transcriptome

In [27]:
# %%script echo Skipping SQANTI3 QC on reference
# Run SQ3 QC on reference with the same orthogonal data
cmd = [
    "/usr/bin/time", "-v",
    f"{sqanti_path}/sqanti3_qc.py",         # SQANTI3 QC script
    "--isoforms", ref_gtf,                  # GTF to Quality Control
    "--refGTF", ref_gtf,                    # Reference GTF
    "--refFasta", ref_genome,               # Reference Genome
    "--polyA_motif_list", polyA_motifs,     # PolyA Motif List
    "--polyA_peak", polyA_peaks,            # PolyA Peaks
    "--fl_count", counts,                   # Counts file
    "--coverage", splice_junctions,         # SJ file
    "--CAGE_peak", cage,                    # CAGE Peak file
    "--SR_bam", sr_bam,                     # SR BAM file
    "--output", ref_qc_prefix,              # Output Prefix
    "--dir", ref_qc_dir,                    # Output Location
    # "--skipORF",                            # Skip ORF Prediction (takes longer)
    "--cpus", str(n_cores)                  # Number of Threads
]

# Print the command for reference
print("Running command:")
print(" ".join(cmd))

# Run the command
result = subprocess.run(cmd, capture_output=True, text=True)

# Print output and errors
print("Output:")
print(result.stdout)
print("Errors:")
print(result.stderr)

ref_qc_class_file = os.path.join(ref_qc_dir, f"{ref_qc_prefix}_classification.txt")

Running command:
/usr/bin/time -v /home/fjetzinger/tools/SQANTI3-5.4/sqanti3_qc.py --isoforms /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/gencode.v45.annotation_chr8.gtf --refGTF /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/gencode.v45.annotation_chr8.gtf --refFasta /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/GRCh38.primary_assembly.genome_chr8.fa --polyA_motif_list /home/fjetzinger/tools/SQANTI3-5.4/data/polyA_motifs/mouse_and_human.polyA_motif.txt --polyA_peak /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/atlas.clusters.2.0.GRCh38.96_chr8.bed --fl_count /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/h1_endo_chr8_all_count.txt --coverage /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/ENCFF498FDF_ENCFF181VTPSJ.out_chr8.tab --CAGE_peak /home/fjetzinger/tools/SQANTI3-5.4/data/ref_TSS_annotation/human.refTSS_v3.1.hg38.be

In [28]:
# Apply the same quantification changes to the reference classification file
ref_qc_class = pd.read_csv(ref_qc_class_file, sep='\t', low_memory=False)

ref_qc_class["FL"] = ref_qc_class[fl_sample_cols].sum(axis=1)
ref_qc_class["FL.endo"] = ref_qc_class[fl_endo_cols].sum(axis=1)
ref_qc_class["FL.h1"] = ref_qc_class[fl_h1_cols].sum(axis=1)

ref_qc_class.drop(columns=fl_sample_cols, inplace=True)
ref_qc_class_file_mod = os.path.join(ref_qc_dir, f"{ref_qc_prefix}_classification_mod.txt")
ref_qc_class.to_csv(ref_qc_class_file_mod, sep='\t', index=False)
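
The same aggregation is now applied twice, once to the custom transcriptome and once to the reference. As a sketch of how the step could be shared, the hypothetical helper `aggregate_fl_counts` below wraps it in a function. The column prefixes are taken from this notebook, while the sample suffixes and the counts in the toy table are invented.

```python
import pandas as pd


def aggregate_fl_counts(df, sample_cols, groups):
    """Replace per-sample FL columns with a total and one column per group.

    groups maps a new column name to the prefix of the sample columns it sums.
    """
    out = df.copy()
    out["FL"] = out[sample_cols].sum(axis=1)
    for new_col, prefix in groups.items():
        group_cols = [col for col in sample_cols if col.startswith(prefix)]
        out[new_col] = out[group_cols].sum(axis=1)
    return out.drop(columns=sample_cols)


# Toy table: sample suffixes (_1, _2) and counts are invented for illustration
toy = pd.DataFrame(
    {
        "isoform": ["iso_1", "iso_2"],
        "FL.cDNA_PacBio-endo_1": [3, 0],
        "FL.cDNA_PacBio-endo_2": [2, 1],
        "FL.cDNA_PacBio-h1_1": [0, 4],
    }
)
toy_sample_cols = [col for col in toy.columns if col.startswith("FL.")]
toy_mod = aggregate_fl_counts(
    toy,
    toy_sample_cols,
    {"FL.endo": "FL.cDNA_PacBio-endo", "FL.h1": "FL.cDNA_PacBio-h1"},
)
print(toy_mod)
```

Because the function returns a modified copy, the original table stays untouched, which makes it safe to re-run a cell.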

## Running SQANTI3 Rescue

In [29]:
filter_classification = os.path.join(filter_dir, f"{filter_prefix}_ML_result_classification.txt")
filtered_fasta = os.path.join(filter_dir, f"{filter_prefix}.filtered.fasta")
filtered_gtf = os.path.join(filter_dir, f"{filter_prefix}.filtered.gtf")
random_forest_model = os.path.join(filter_dir, f"{filter_prefix}_randomforest.RData")

# Run SQ3 Rescue
cmd = [
    "/usr/bin/time", "-v",
    f"{sqanti_path}/sqanti3_rescue.py",         # SQANTI3 Rescue script
    "ml",                                       # Filter Mode: Machine Learning
    "--rescue_isoforms", filtered_fasta,        # Filtered fasta
    "--rescue_gtf", filtered_gtf,               # Filtered GTF
    "--filter_class", filter_classification,    # Filter Classification
    "--refGTF", ref_gtf,                        # Reference GTF
    "--refFasta", ref_genome,                   # Reference Genome
    "--refClassif", ref_qc_class_file_mod,      # Reference Classification
    # "--requant",                              # Requantify
    # "--counts", counts,                       # Counts file
    "--random_forest", random_forest_model,     # Random Forest Model from Filter
    "--threshold", "0.7",                       # Threshold for Machine Learning
    "--mode", "full",                           # Rescue Mode: Full (extend rescue to non-FSM isoforms)
    "--output", rescue_prefix,                  # Output Prefix
    "--dir", rescue_dir,                        # Output Location
    "--cpus", str(n_cores)                      # Number of Threads
]

# Print the command for reference
print("Running command:")
print(" ".join(cmd))

# Run the command
result = subprocess.run(cmd, capture_output=True, text=True)

# Print output and errors
print("Output:")
print(result.stdout)
print("Errors:")
print(result.stderr)

Running command:
/usr/bin/time -v /home/fjetzinger/tools/SQANTI3-5.4/sqanti3_rescue.py ml --rescue_isoforms /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools/h1_endo_chr8_isotools_filter.filtered.fasta --rescue_gtf /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools/h1_endo_chr8_isotools_filter.filtered.gtf --filter_class /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_filter/h1_endo_chr8_isotools/h1_endo_chr8_isotools_filter_ML_result_classification.txt --refGTF /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/gencode.v45.annotation_chr8.gtf --refFasta /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/data/GRCh38.primary_assembly.genome_chr8.fa --refClassif /mnt/c/Users/jetzi/other_repos/summer_school/practicals/day2/sqanti3/output/sqanti_qc/gencode.v45.annotation_chr8/gencode.v45.annota

## Investigating SQANTI3 Rescue Results

### Can you find answers to the following Questions in the SQANTI3 Rescue results?

While SQANTI3 Rescue does not provide a report, we can explore the results to find answers to the following questions.

1. **Rescue:**
    
    a. How many reference transcripts were added to the transcriptome in total due to the rescue?

    b. How many genes were affected? 


### Aggregate and investigate classification file...

In [30]:
rescued_gtf = os.path.join(rescue_dir, f"{rescue_prefix}_rescued.gtf")

# extract transcript ids from rescued gtf
transcript_ids = set()
with open(rescued_gtf, 'r') as f:
    for line in f:
        if line.startswith('#'):
            continue
            
        fields = line.strip().split('\t')
        if len(fields) >= 9 and fields[2] == 'transcript':
            attributes = fields[8]
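            for attr in attributes.split(';'):
                attr = attr.strip()
                # match the attribute key exactly, not any key that merely contains it
                if attr.startswith('transcript_id '):
                    transcript_id = attr.split('"')[1]
                    transcript_ids.add(transcript_id)
                    break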

# load classification file
filtered_class = pd.read_csv(filter_classification, sep='\t', low_memory=False)

# merge filtered_class and ref_qc_class for final transcript ids
rescued_class = pd.concat([filtered_class, ref_qc_class])

# filter for final transcript ids
rescued_class = rescued_class[rescued_class['isoform'].isin(transcript_ids)].copy()  # copy to avoid SettingWithCopyWarning below

# fill NaN in filter_result with "Rescue"
rescued_class['filter_result'] = rescued_class['filter_result'].fillna('Rescue')

rescued_class

Unnamed: 0,isoform,chrom,strand,length,exons,structural_category,associated_gene,associated_transcript,ref_length,ref_exons,diff_to_TSS,diff_to_TTS,diff_to_gene_TSS,diff_to_gene_TTS,subcategory,RTS_stage,all_canonical,min_sample_cov,min_cov,min_cov_pos,sd_cov,FL,n_indels,n_indels_junc,bite,iso_exp,gene_exp,ratio_exp,FSM_class,coding,ORF_length,CDS_length,CDS_start,CDS_end,CDS_genomic_start,CDS_genomic_end,predicted_NMD,perc_A_downstream_TTS,seq_A_downstream_TTS,dist_to_CAGE_peak,within_CAGE_peak,dist_to_polyA_site,within_polyA_site,polyA_motif,polyA_dist,polyA_motif_found,ORF_seq,ratio_TSS,FL.endo,FL.h1,POS_MLprob,NEG_MLprob,ML_classifier,intra_priming,filter_result
0,ENSG00000003987.14_0,chr8,-,5071,14,full-splice_match,ENSG00000003987.14,ENST00000180173.10,5110.0,14.0,-24.0,-15.0,0.0,-15.0,reference_match,False,canonical,1.0,2.0,junction_1,1.475640,16.0,,,False,,,,C,coding,660.0,1983.0,36.0,2018.0,17413292.0,17299862.0,False,35.0,CACTGAGATGAACTACTCAT,7.0,True,2.0,False,AATAAA,-4.0,True,MEHIRTPKVENVRLVDRVSPKKAALGTLYLTATHVIFVENSPDPRK...,3.487562,9.0,7.0,0.900,0.100,Positive,False,Isoform
2,ENSG00000003987.14_2,chr8,-,1803,1,incomplete-splice_match,ENSG00000003987.14,ENST00000180173.10,5110.0,14.0,-114753.0,-2.0,-13075.0,-2.0,mono-exon,False,,,,,,4.0,,,,,,,C,non_coding,,,,,,,,30.0,TACTCATGTGATTATGTAGA,,False,-11.0,True,AATAAA,-17.0,True,,1.199920,2.0,2.0,,,,False,Isoform
7,ENSG00000003987.14_7,chr8,-,3372,1,genic,ENSG00000003987.14,novel,,,,,25738.0,1143.0,mono-exon,False,,,,,,1.0,,,,,,,C,non_coding,,,,,,,,30.0,CTATAGGTAAGTGATGGCCA,,False,,False,AGTAAA,-43.0,True,,1.000000,1.0,0.0,,,,False,Isoform
8,ENSG00000003987.14_8,chr8,-,1448,2,full-splice_match,ENSG00000003987.14,ENST00000521177.1,506.0,2.0,-201.0,1143.0,0.0,1143.0,alternative_3end5end,False,canonical,1.0,3.0,junction_1,0.000000,1.0,,,False,,,,C,non_coding,,,,,,,,30.0,CTATAGGTAAGTGATGGCCA,7.0,True,,False,AGTAAA,-43.0,True,,3.487562,1.0,0.0,0.720,0.280,Positive,False,Isoform
15,ENSG00000003987.14_15,chr8,-,7012,13,full-splice_match,ENSG00000003987.14,ENST00000521857.5,2158.0,13.0,0.0,4854.0,0.0,-3.0,alternative_3end,False,canonical,1.0,2.0,junction_9,1.381927,1.0,,,False,,,,C,coding,555.0,1668.0,36.0,1703.0,17413292.0,17302106.0,False,25.0,CTACTCATGTGATTATGTAG,7.0,True,-10.0,True,AATAAA,-16.0,True,MEHIRTPKVENVRLVDRVSPKKAALGTLYLTATHVIFVENSPDPRK...,3.487562,0.0,1.0,0.894,0.106,Positive,False,Isoform
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10137,ENST00000702030.1,chr8,+,1984,2,full-splice_match,ENSG00000286113.2,ENST00000702030.1,1984.0,2.0,0.0,0.0,0.0,0.0,reference_match,False,canonical,0.0,0.0,junction_1,0.000000,0.0,,,False,,,,C,non_coding,,,,,,,,15.0,GTCCTGTGAACATTTGTTTG,-1.0,True,-9.0,False,ATTAAA,-18.0,True,,1.497512,0.0,0.0,,,,,Rescue
10219,ENST00000704784.1,chr8,-,3727,4,full-splice_match,ENSG00000164983.9,ENST00000704784.1,3727.0,4.0,0.0,0.0,0.0,0.0,reference_match,False,canonical,1.0,9.0,junction_2,5.099020,0.0,,,False,,,,C,coding,112.0,339.0,1704.0,2042.0,124324613.0,124324275.0,True,45.0,CATTTATAATAATATAATTT,,False,0.0,True,AGTAAA,-1.0,True,MPYLLDCCYQVPLFYIGAPVFLHSDMMIHMFILFLCLFLPFLPLYF...,1.000000,0.0,0.0,,,,,Rescue
10224,ENST00000704789.1,chr8,+,1652,10,full-splice_match,ENSG00000197858.12,ENST00000704789.1,1652.0,10.0,0.0,0.0,0.0,0.0,reference_match,False,canonical,0.0,0.0,junction_3,102.369459,0.0,,,False,,,,C,coding,487.0,1464.0,98.0,1561.0,144082731.0,144086125.0,False,5.0,TTTGAGCTCCTGGCCCGCTG,5.0,True,-38.0,True,AATAAA,-23.0,True,MGLLSDPVRRRALARLVLRLNAPLCVLSYVAGIAWFLALVFPPLTQ...,6.718815,0.0,0.0,,,,,Rescue
10260,ENST00000707113.1,chr8,+,2052,3,full-splice_match,ENSG00000136997.22,ENST00000707113.1,2052.0,3.0,0.0,0.0,0.0,0.0,reference_match,False,canonical,0.0,0.0,junction_1,3.500000,0.0,,,False,,,,C,coding,439.0,1320.0,257.0,1576.0,127738263.0,127740958.0,False,10.0,GTTGTGAATGTTTTGTTTCG,-725.0,False,-34.0,True,AATAAA,-36.0,True,MPLNVSFTNRNYDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPS...,1.000000,0.0,0.0,,,,,Rescue


In [31]:
print("Nr. of final transcripts: ", len(rescued_class))
print("Nr. of final genes: ", len(set(rescued_class['associated_gene'])))

print("Nr. of rescued transcripts: ", len(rescued_class[rescued_class['filter_result'] == 'Rescue']))
print("Nr. of genes with at least 1 rescued transcript: ", len(set(rescued_class[rescued_class['filter_result'] == 'Rescue']['associated_gene'])))

# group by gene and then display only genes where all transcripts are rescued
rescue_only_genes = rescued_class.groupby('associated_gene').filter(lambda x: (x['filter_result'] == 'Rescue').all())
print("Nr. of genes with only rescued transcripts: ", len(set(rescue_only_genes['associated_gene'])))


Nr. of final transcripts:  8078
Nr. of final genes:  2866
Nr. of rescued transcripts:  393
Nr. of genes with at least 1 rescued transcript:  259
Nr. of genes with only rescued transcripts:  48
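
To put these counts into perspective, they can be converted into fractions. The numbers below are copied from the output above; the variable names are ours.

```python
# Counts copied from the output above
final_transcripts = 8078
final_genes = 2866
rescued_transcripts = 393
genes_with_rescue = 259
genes_only_rescue = 48

rescued_fraction = rescued_transcripts / final_transcripts
genes_with_rescue_fraction = genes_with_rescue / final_genes
genes_only_rescue_fraction = genes_only_rescue / final_genes

print(f"Rescued transcripts: {rescued_fraction:.1%} of the final transcriptome")
print(f"Genes with at least 1 rescued transcript: {genes_with_rescue_fraction:.1%} of final genes")
print(f"Genes represented only by rescued transcripts: {genes_only_rescue_fraction:.1%} of final genes")
```

Roughly 5% of the final transcripts come from the rescue, about 9% of the genes contain at least one rescued transcript, and a little under 2% of the genes would be missing from the final transcriptome without it.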


### Investigate in Genome Viewer... 

In [32]:
igv_browser_rescue = igv_notebook.Browser(
    {
        "reference": {
            "id": "hg38",
            "name": "Human (GRCh38/hg38)",
            "fastaPath": "../../data/reference/GRCh38.primary_assembly.genome_chr8.fa",
            "indexPath": "../../data/reference/GRCh38.primary_assembly.genome_chr8.fa.fai"
        },
        "locus": "chr8:28701580-28753690",
        "tracks": [
            {
                "name": "IsoTools (after rescue)",
                "path": "../../data/sqanti_rescue/h1_endo_chr8_isotools/h1_endo_chr8_isotools_rescue_rescued.gtf",
                "format": "gtf",
                "type": "annotation",
                "displayMode": "SQUISHED"
            },
            {
                "name": "Reference",
                "path": "../../data/sqanti_qc/gencode.v45.annotation_chr8/gencode.v45.annotation_chr8_qc_corrected.gtf",
                "format": "gtf",
                "type": "annotation",
                "displayMode": "SQUISHED"
            }
        ]
    }
)

<IPython.core.display.Javascript object>

# Where do we go from here?

After using SQANTI3 to perform Quality Control on a custom transcriptome, Filter out spurious transcripts, and Rescue related reference transcripts, further downstream analyses can be conducted.

Some examples of further analyses:

* Differential Gene Expression; Differential Transcript Expression/Usage; Differential Exon Expression/Usage

* Functional Annotation, Enrichment Analysis, Pathway Analysis