# Exon GTF Generation

This guide explains how to generate an exon-level GTF reference file. The file is used to align scRNA-seq data at the exon level so that exon read counts and junction read counts can be extracted. In the exon-level GTF, the exons within each gene are unique and do not overlap with one another.

![Illustration of exon-level GTF generation](./exon_gtf_demonstration.png)


## **For Human GRCh38**
You can directly download the pre-generated exon-level GTF file from [here](https://mcgill-my.sharepoint.com/my?id=%2Fpersonal%2Fkailu%5Fsong%5Fmail%5Fmcgill%5Fca%2FDocuments%2FDeepExonas%5Fgithub%5Fexample%2Fgraph%5Fgeneration%5Frequired%5Ffile).  

## **For Other Species**
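1. Download the reference GTF file for your species from [Ensembl](https://www.ensembl.org/index.html).

2. Run the following code to generate the exon-level GTF file. The example uses the human GRCh38 GTF (Ensembl release 107) as input; replace `input_gtf_path` with the path to your own file.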

In [5]:
from DOLPHIN.preprocess import generate_nonoverlapping_exons

In [3]:
# === Step 1: Set paths ===
# Define the output directory
output_path = "./"

# Path to the input Ensembl GTF file
input_gtf_path = "/mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf"

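In [4]:
# === Step 2: Generate the exon-level GTF ===
# Output files are written to a dolphin_exon_gtf/ folder under output_path
gtf_df, overlaps = generate_nonoverlapping_exons(input_gtf_path, output_path)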

[Step] Reading GTF file from: /mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf
[Status] GTF loaded and parsed with 3371244 total entries.
[Status] Removed duplicates: 674296 unique exon entries remain.
[Step] Start processing and saving exons by batch...


Processing all genes: 100%|██████████| 61860/61860 [1:16:37<00:00, 13.46it/s]


[Done] Finished saving all exon batches.
Successfully combined 7 files into a single DataFrame with 354386 rows.
Found 0 overlapping exon entries.
All 61860 expected genes are present in the merged DataFrame.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=""
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=df['attribute']+c+' "'+inGTF[c].astype(str)+'"; '

GTF file saved to: ./dolphin_exon_gtf/dolphin.exon.gtf
Pickle file saved to: ./dolphin_exon_gtf/./dolphin.exon.pkl
[Success] Exon GTF processing pipeline completed.
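
The `Found 0 overlapping exon entries.` line in the log can also be checked independently on the saved GTF. The following is a minimal sketch using only the Python standard library. It assumes the output follows the standard 9-column, tab-separated GTF layout, that exon records carry `exon` in the third column, and that each record has a `gene_id` attribute. The helper names `parse_exons` and `find_overlaps` are hypothetical and are not part of DOLPHIN.

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')

def parse_exons(lines):
    """Collect (start, end) exon intervals per gene from an iterable of GTF lines."""
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep exon records only
        match = GENE_ID_RE.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))
    return exons

def find_overlaps(exons):
    """Return (gene_id, exon_a, exon_b) for neighbouring exons that overlap.

    Exons are sorted by start, so an empty result means no exons overlap
    within any gene.
    """
    overlaps = []
    for gene_id, intervals in exons.items():
        ordered = sorted(intervals)
        for prev, cur in zip(ordered, ordered[1:]):
            # GTF coordinates are 1-based and inclusive, so equal end/start overlap
            if cur[0] <= prev[1]:
                overlaps.append((gene_id, prev, cur))
    return overlaps

# Toy records: G1 has adjacent but disjoint exons, G2 has exons sharing base 250
example = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "A";',
    'chr1\tdemo\texon\t201\t300\t.\t+\t.\tgene_id "G1"; gene_name "A";',
    'chr1\tdemo\texon\t150\t250\t.\t+\t.\tgene_id "G2"; gene_name "B";',
    'chr1\tdemo\texon\t250\t400\t.\t+\t.\tgene_id "G2"; gene_name "B";',
]
overlap_report = find_overlaps(parse_exons(example))
print(overlap_report)
```

To run the same check on the real output, pass an open file handle for `./dolphin_exon_gtf/dolphin.exon.gtf` to `parse_exons` instead of the toy list.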


## Generate Adjacency Index
This step computes a per-gene adjacency index from the exon annotation (the `.pkl` file saved alongside the exon GTF in the previous step).
The resulting index is used to locate each gene's exon adjacency matrix within the full graph structure.

In [3]:
from DOLPHIN.preprocess import generate_adj_index_table

In [None]:
# Path to the exon pickle file produced in the previous step
exon_pkl_path = "./dolphin_exon_gtf/dolphin.exon.pkl"
df_adj_index = generate_adj_index_table(exon_pkl_path)

[Saved] Adjacency index table saved to: ./dolphin_exon_gtf/dolphin_adj_index.csv


In [6]:
df_adj_index

| | geneid | ind_st | ind |
|---|---|---|---|
| 0 | ENSG00000223972 | 0.0 | 16.0 |
| 1 | ENSG00000227232 | 16.0 | 121.0 |
| 2 | ENSG00000278267 | 137.0 | 1.0 |
| 3 | ENSG00000243485 | 138.0 | 9.0 |
| 4 | ENSG00000284332 | 147.0 | 1.0 |
| ... | ... | ... | ... |
| 61855 | ENSG00000224240 | 6529063.0 | 1.0 |
| 61856 | ENSG00000227629 | 6529064.0 | 9.0 |
| 61857 | ENSG00000237917 | 6529073.0 | 169.0 |
| 61858 | ENSG00000231514 | 6529242.0 | 1.0 |
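
In the adjacency index table, `ind_st` behaves like a running offset: each gene's `ind_st` equals the previous gene's `ind_st` plus its `ind` (0 + 16 = 16, 16 + 121 = 137, and so on). The `ind` values shown are all perfect squares, which is consistent with a gene with n exons contributing an n × n adjacency matrix flattened to n² entries. That reading is an assumption inferred from the numbers, not something stated by DOLPHIN. Under that assumption, the sketch below shows how such an index could be built and used to recover one gene's matrix from a concatenated flat vector. The helper names and toy gene IDs are hypothetical.

```python
import math

def build_index(exon_counts):
    """Build (gene_id, start_offset, length) rows, assuming length = n_exons ** 2."""
    rows, offset = [], 0
    for gene_id, n_exons in exon_counts:
        length = n_exons * n_exons
        rows.append((gene_id, offset, length))
        offset += length  # the next gene starts where this one ends
    return rows

def gene_matrix(flat, start, length):
    """Slice one gene's entries out of the flat vector and reshape them to n x n."""
    n = math.isqrt(length)
    if n * n != length:
        raise ValueError("length is not a perfect square")
    block = flat[start:start + length]
    return [block[i * n:(i + 1) * n] for i in range(n)]

# Toy exon counts (4, 11, 1, 3) chosen to reproduce the first offsets in the table
index_rows = build_index([("geneA", 4), ("geneB", 11), ("geneC", 1), ("geneD", 3)])
print(index_rows)

# Stand-in for the concatenated adjacency values of all genes
flat_vector = list(range(index_rows[-1][1] + index_rows[-1][2]))
_, start, length = index_rows[3]
matrix_d = gene_matrix(flat_vector, start, length)
print(matrix_d)
```

With real data, `start` and `length` would come from the `ind_st` and `ind` columns (cast to `int`, since the table stores them as floats).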


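Next, generate the metadata table, which lists the gene ID, gene name, and a `Gene_Junc_name` label for each row.

In [1]: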
from DOLPHIN.preprocess import generate_adj_metadata_table

In [3]:
df_adj_index_meta = generate_adj_metadata_table(exon_pkl_path)

[Saved] Metadata table saved to: ./dolphin_exon_gtf/dolphin_adj_metadata_table.csv


In [4]:
df_adj_index_meta

| | Geneid | GeneName | Gene_Junc_name |
|---|---|---|---|
| 0 | ENSG00000223972 | DDX11L1 | DDX11L1-1 |
| 1 | ENSG00000223972 | DDX11L1 | DDX11L1-2 |
| 2 | ENSG00000223972 | DDX11L1 | DDX11L1-3 |
| 3 | ENSG00000223972 | DDX11L1 | DDX11L1-4 |
| 4 | ENSG00000223972 | DDX11L1 | DDX11L1-5 |
| ... | ... | ... | ... |
| 6529239 | ENSG00000237917 | PARP4P1 | PARP4P1-167 |
| 6529240 | ENSG00000237917 | PARP4P1 | PARP4P1-168 |
| 6529241 | ENSG00000237917 | PARP4P1 | PARP4P1-169 |
| 6529242 | ENSG00000231514 | CCNQP2 | CCNQP2-1 |
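
The row numbers of the metadata table appear to line up with the offsets in the adjacency index: row 6529242 belongs to ENSG00000231514, whose `ind_st` is 6529242.0 in the index table. Assuming that correspondence holds throughout, a flat position can be mapped back to its gene with a binary search over the start offsets. The sketch below uses the first five rows of the index table; `gene_for_position` is a hypothetical helper, not a DOLPHIN function.

```python
from bisect import bisect_right

# (geneid, ind_st, ind) copied from the first rows of the adjacency index table
index_rows = [
    ("ENSG00000223972", 0, 16),
    ("ENSG00000227232", 16, 121),
    ("ENSG00000278267", 137, 1),
    ("ENSG00000243485", 138, 9),
    ("ENSG00000284332", 147, 1),
]
starts = [row[1] for row in index_rows]

def gene_for_position(position):
    """Return (geneid, offset_within_gene) for a flat adjacency position."""
    # Find the last gene whose start offset is <= position
    i = bisect_right(starts, position) - 1
    if i < 0:
        raise IndexError("position is before the first gene")
    gene_id, start, length = index_rows[i]
    if position - start >= length:
        raise IndexError("position is past the last indexed entry")
    return gene_id, position - start

print(gene_for_position(4))
print(gene_for_position(137))
```

Position 4 falls at offset 4 within ENSG00000223972, which matches metadata row 4 (`DDX11L1-5`) if the numeric suffix in `Gene_Junc_name` is 1-based.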
