# Get Non-Canonical Transcript Sequences: RNA-intersected transcripts

Retrieve the exon and CDS information for every non-canonical transcript that was paired with a candidate start codon.

The high-quality, sample-specific SNPs (FreeBayes quality > 20) were then inserted at their correct positions in the intersected transcripts. When a position was ambiguous, the corresponding IUPAC symbol was inserted instead.
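
The SNP integration described above can be sketched roughly as follows. This is a minimal illustration, not the code in `getInfoTranscripts.py`: the function names, the tuple layout of `snps`, and the reading of "ambiguity" as a heterozygous call are all assumptions made for the example, and only single-base substitutions on a forward-strand sequence are handled.

```python
# Hypothetical helpers for illustration; not taken from getInfoTranscripts.py.

# Standard two-base IUPAC ambiguity codes, keyed by the unordered allele pair.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def iupac_symbol(ref, alt):
    """Return the IUPAC code covering both the reference and alternate base."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return ref
    return IUPAC_PAIRS[frozenset((ref, alt))]


def insert_snps(sequence, start, snps, min_quality=20.0):
    """Return `sequence` with the qualifying SNPs written in.

    sequence    : forward-strand nucleotide sequence
    start       : 1-based genomic coordinate of sequence[0]
    snps        : iterable of (position, ref, alt, quality, heterozygous)
                  (assumed layout, chosen for this sketch)
    min_quality : SNPs must have quality strictly above this value
    """
    bases = list(sequence)
    for pos, ref, alt, qual, het in snps:
        if qual <= min_quality:
            continue  # low-quality call: keep the reference base
        idx = pos - start
        if not 0 <= idx < len(bases):
            continue  # SNP falls outside this sequence
        # Assumption: an ambiguous (heterozygous) site gets the IUPAC code,
        # an unambiguous one gets the alternate base.
        bases[idx] = iupac_symbol(ref, alt) if het else alt.upper()
    return "".join(bases)
```

For example, `insert_snps("ACGTACGT", 100, [(101, "C", "T", 30.0, True)])` writes `Y` at the second base, because `Y` is the IUPAC code for C or T.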

Input Files
```bash
"""
logPath : path
    Path to save log output
saveOutputQsub : path
    Path to save qsub output
nameTask : string
    qsub name task 
strand : string
    Either '+' (forward) or '-' (backward)
mergedAssembledTranscripts : path
    Path to the assembled transcripts, either RNA_AssembledTranscripts+.gtf or RNA_AssembledTranscripts-.gtf
freeByesSNPs : path
    Path to the FreeBayes output file (output.var.5X.pga)
quality : int
    SNP quality threshold (default: 20)
bedIntersected : path
    Path to either the Intersection_BedFileForward_Candidates_StartCodons_RNA or the Intersection_BedFileBackward_Candidates_StartCodons_RNA BED file
output : path
    Path to save outputs
genome : path
    Path to genome fasta
genomeFai : path
    Path to genome index

sCAfterThreshold : path
    Path to the dic of start codons kept after applying the threshold from the ROC curve, generated in the TIS calling step

starts_annotated : path
    Path to the BED file with the start codons intercepted on the corresponding strand, either Known_SC_Intercepted_Forward or Known_SC_Intercepted_Backward
"""
```
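
As an illustration of how the `quality` threshold could be applied, the sketch below filters variant records by their quality score. It assumes the FreeBayes file is tab-separated in standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, ...); the actual layout of `output.var.5X.pga` is not described here, so treat that as an assumption. The function name `high_quality_snps` is hypothetical.

```python
# Hypothetical filter; assumes VCF-style, tab-separated records.
BASES = {"A", "C", "G", "T"}


def high_quality_snps(vcf_lines, min_quality=20.0):
    """Yield (chrom, pos, ref, alt, qual) for single-base SNPs above the threshold."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual = fields[:6]
        if qual == ".":
            continue  # no quality score reported
        if ref not in BASES or alt not in BASES:
            continue  # skip indels, multi-allelic sites, and missing alleles
        if float(qual) > min_quality:
            yield chrom, int(pos), ref, alt, float(qual)
```

Because the function accepts any iterable of lines, it can be given an open file handle directly, for example `with open(path) as handle: snps = list(high_quality_snps(handle))`, where `path` points to the variant file.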

Output Files
```bash
"""
    Total_Transcripts_Intersected_Canonical_+.dic or Total_Transcripts_Intersected_Canonical_-.dic : dic
        Dic that contains, for each non-canonical transcript, its information (CDS, start codon position, scoreTis)
    
    InfoTranscriptsIntersected+.gtf or InfoTranscriptsIntersected-.gtf : gtf files
        GTF files that contain the same information as above, in GTF format
    
    InfoTranscriptsNotIntersected-.gtf or InfoTranscriptsNotIntersected+.gtf : gtf files
        GTF files that contain the transcripts that were not intercepted by the candidate start codons
    
    totalGenesIntersected+.dic or totalGenesIntersected-.dic : dic
        Dic that contains, for each non-canonical transcript, the gene names according to StringTie
"""
```

In [1]:
%%bash
module add torque 

logPath='.../logs'
saveOutputQsub='.../qsub_outputs'

echo 'Proteins Strand + '
nameTask='Get_Transcripts_Strand+'
strand='+'
mergedAssembledTranscripts='.../StringTieAssemblies/RNA/RNA_AssembledTranscripts+.gtf'
freeByesSNPs='.../FreeBayes/output.var.5X.pga'
quality=20
bedIntersected='.../Transcripts/Noncanonical/Intersection_BedFileForward_Candidates_StartCodons_RNA.bed'
output='.../Transcripts/Noncanonical/RNA/'
genome='../../../Data_Input_Scripts/GRCh38_Gencode26/GRCh38.primary_assembly.genome.fa'
genomeFai='../../../Data_Input_Scripts/GRCh38_Gencode26/GRCh38.primary_assembly.genome.fa.fai'
sCAfterThreshold='.../StartCodonsDetection/StartCodonsAfterThreshold.dic'
starts_annotated='.../StartCodonsDetection/Known_SC_Intercepted_Forward.bed'

echo "python ../../../Scripts/4_Get_Active_Transcripts/Noncanonical-Proteins/getInfoTranscripts.py -s $strand -t $mergedAssembledTranscripts -n $freeByesSNPs -q $quality -b $bedIntersected -o $output -f $genome -i $genomeFai -d $sCAfterThreshold -l $logPath -k $starts_annotated" | qsub -V -l nodes=1:ppn=10,mem=150gb,walltime=48:00:00 -j oe -N $nameTask -d $saveOutputQsub

echo 'Proteins Strand - '
nameTask='Get_Transcripts_Strand-'
strand='-'
mergedAssembledTranscripts='.../StringTieAssemblies/RNA/RNA_AssembledTranscripts-.gtf'
bedIntersected='.../Transcripts/Noncanonical/Intersection_BedFileBackward_Candidates_StartCodons_RNA.bed'
starts_annotated='.../StartCodonsDetection/Known_SC_Intercepted_Backward.bed'

echo "python ../../../Scripts/4_Get_Active_Transcripts/Noncanonical-Proteins/getInfoTranscripts.py -s $strand -t $mergedAssembledTranscripts -n $freeByesSNPs -q $quality -b $bedIntersected -o $output -f $genome -i $genomeFai -d $sCAfterThreshold -l $logPath -k $starts_annotated" | qsub -V -l nodes=1:ppn=10,mem=150gb,walltime=48:00:00 -j oe -N $nameTask -d $saveOutputQsub


Proteins Strand + 
Proteins Strand - 


# Get Non-Canonical Transcript Sequences: Ribo-Elong-intersected transcripts

Input Files
```bash
"""
logPath : path
    Path to save log output
saveOutputQsub : path
    Path to save qsub output
nameTask : string
    qsub name task 
strand : string
    Either '+' (forward) or '-' (backward)
mergedAssembledTranscripts : path
    Path to the assembled transcripts, either RiboElong_AssembledTranscripts+.gtf or RiboElong_AssembledTranscripts-.gtf
freeByesSNPs : path
    Path to the FreeBayes output file (output.var.5X.pga)
quality : int
    SNP quality threshold (default: 20)
bedIntersected : path
    Path to either the Intersection_BedFileForward_Candidates_StartCodons_Elong or the Intersection_BedFileBackward_Candidates_StartCodons_Elong BED file
output : path
    Path to save output
genome : path
    Path to genome fasta
genomeFai : path
    Path to genome index

sCAfterThreshold : path
    Path to the dic of start codons kept after applying the threshold from the ROC curve, generated in the TIS calling step

starts_annotated : path
    Path to the BED file with the start codons intercepted on the corresponding strand, either Known_SC_Intercepted_Forward or Known_SC_Intercepted_Backward
"""
```

Output Files
```bash
"""
    Total_Transcripts_Intersected_Canonical_+.dic or Total_Transcripts_Intersected_Canonical_-.dic : dic
        Dic that contains, for each non-canonical transcript, its information (CDS, start codon position, scoreTis)
    
    InfoTranscriptsIntersected+.gtf or InfoTranscriptsIntersected-.gtf : gtf files
        GTF files that contain the same information as above, in GTF format
    
    InfoTranscriptsNotIntersected-.gtf or InfoTranscriptsNotIntersected+.gtf : gtf files
        GTF files that contain the transcripts that were not intercepted by the candidate start codons
    
    totalGenesIntersected+.dic or totalGenesIntersected-.dic : dic
        Dic that contains, for each non-canonical transcript, the gene names according to StringTie
"""
```

In [2]:
%%bash

module add torque 

logPath='.../logs'
saveOutputQsub='.../qsub_outputs'

echo 'Proteins Strand + '
nameTask='Get_Transcripts_Strand+'
strand='+'
mergedAssembledTranscripts='.../StringTieAssemblies/RiboElong/RiboElong_AssembledTranscripts+.gtf'
freeByesSNPs='.../FreeBayes/output.var.5X.pga'
quality=20
bedIntersected='.../Transcripts/Noncanonical/Intersection_BedFileForward_Candidates_StartCodons_Elong.bed'
output='.../Transcripts/Noncanonical/RiboElong/'
genome='../../../Data_Input_Scripts/GRCh38_Gencode26/GRCh38.primary_assembly.genome.fa'
genomeFai='../../../Data_Input_Scripts/GRCh38_Gencode26/GRCh38.primary_assembly.genome.fa.fai'
sCAfterThreshold='.../StartCodonsDetection/StartCodonsAfterThreshold.dic'
starts_annotated='.../StartCodonsDetection/Known_SC_Intercepted_Forward.bed'

echo "python ../../../Scripts/4_Get_Active_Transcripts/Noncanonical-Proteins/getInfoTranscripts.py -s $strand -t $mergedAssembledTranscripts -n $freeByesSNPs -q $quality -b $bedIntersected -o $output -f $genome -i $genomeFai -d $sCAfterThreshold -l $logPath -k $starts_annotated" | qsub -V -l nodes=1:ppn=10,mem=150gb,walltime=48:00:00 -j oe -N $nameTask -d $saveOutputQsub

echo 'Proteins Strand - '
nameTask='Get_Transcripts_Strand-'
strand='-'
mergedAssembledTranscripts='.../StringTieAssemblies/RiboElong/RiboElong_AssembledTranscripts-.gtf'
bedIntersected='.../Transcripts/Noncanonical/Intersection_BedFileBackward_Candidates_StartCodons_Elong.bed'
starts_annotated='.../StartCodonsDetection/Known_SC_Intercepted_Backward.bed'

echo "python ../../../Scripts/4_Get_Active_Transcripts/Noncanonical-Proteins/getInfoTranscripts.py -s $strand -t $mergedAssembledTranscripts -n $freeByesSNPs -q $quality -b $bedIntersected -o $output -f $genome -i $genomeFai -d $sCAfterThreshold -l $logPath -k $starts_annotated" | qsub -V -l nodes=1:ppn=10,mem=150gb,walltime=48:00:00 -j oe -N $nameTask -d $saveOutputQsub

Proteins Strand + 
Proteins Strand - 
