# Intersection of Annotated (Canonical) Start Codons with the GTF Genome Annotation

<b>Intersect:</b> To detect the set of actively translated ORFs, we apply the intersect function of the BEDTools suite to the BED file containing the genomic positions of the positive start codons and to each of the GTF files reported by StringTie, i.e., the transcriptome assemblies based on ribosome profiling (elongation) and on RNA-seq. Start codons that intersect assembled transcripts (i.e., (start codon, transcript) pairs) are collected, as they represent the active ORFs that will be translated in silico. A canonical protein is defined as the protein translated from a known start codon coupled with its corresponding transcript, whereas a noncanonical protein is defined as the protein translated from any unknown coupling.
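
To make the pairing concrete, the sketch below reproduces the idea of the intersection in plain Python: every start codon is paired with every transcript interval it overlaps on the same chromosome. It is only an illustration. The function name `intersect_start_codons`, the tuple layout, and the coordinates are assumptions made for this example; the pipeline itself delegates this step to BEDTools through `intersectSC_Assemblies.sh`, whose exact options are not shown here.

```python
def intersect_start_codons(start_codons, transcripts):
    """Pair each start codon with every transcript interval it overlaps.

    Both inputs are lists of (chrom, start, end, name) tuples using
    0-based, half-open coordinates (BED convention). This layout is an
    assumption for the example, not the pipeline's actual file format.
    """
    pairs = []
    for sc_chrom, sc_start, sc_end, sc_name in start_codons:
        for tx_chrom, tx_start, tx_end, tx_name in transcripts:
            # Two half-open intervals overlap when each one starts
            # before the other one ends.
            if sc_chrom == tx_chrom and sc_start < tx_end and tx_start < sc_end:
                pairs.append((sc_name, tx_name))
    return pairs


# Hypothetical coordinates, for illustration only.
start_codons = [("chr1", 100, 103, "sc_A"), ("chr1", 500, 503, "sc_B")]
transcripts = [
    ("chr1", 50, 200, "tx_1"),
    ("chr1", 90, 150, "tx_2"),
    ("chr2", 500, 600, "tx_3"),
]

pairs = intersect_start_codons(start_codons, transcripts)
print(pairs)
```

In this toy input, `sc_A` overlaps both `tx_1` and `tx_2`, while `sc_B` overlaps nothing on its own chromosome, so only the two `sc_A` pairs are reported. Strand handling and the 1-based coordinates of GTF files are deliberately left out of the sketch.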

Input Files
```bash
"""
input : path
    Path to the Known_SC_Intercepted_Backward/Known_SC_Intercepted_Forward bed files that were generated at the TIS-calling step
input2 : path
    Path to the gencode.v26.primary_assembly.annotation-.gtf or
    gencode.v26.primary_assembly.annotation+.gtf
output : path
    Path to save the bed file produced by the BEDTools intersect function
nameTask : string
    Name of the qsub task
saveOutputQsub : path
    Path to save qsub output
logPath : path
    Path to save log output
"""
```
Output Files
```bash
"""
    Intersection_BedFileForward_KnownStartCodons or Intersection_BedFileBackward_KnownStartCodons : bed-like files
        bed files with the transcripts intersected by the known start codons identified in the sample
"""
```

In [1]:
%%bash

echo 'Intersection Backward'
input='.../StartCodonsDetection/Known_SC_Intercepted_Backward.bed'
input2='../../../Data_Input_Scripts/gencode.v26.primary_assembly.annotation-.gtf'
output='.../Transcripts/Canonical/Intersection_BedFileBackward_KnownStartCodons.bed'
nameTask='IntersectBedtools_Backward'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/3_Trascriptome_Assembly/intersectSC_Assemblies.sh $input $input2 $output $nameTask $saveOutputQsub $logPath

echo 'Intersection Forward'
input='.../StartCodonsDetection/Known_SC_Intercepted_Forward.bed'
input2='../../../Data_Input_Scripts/gencode.v26.primary_assembly.annotation+.gtf'
output='.../Transcripts/Canonical/Intersection_BedFileForward_KnownStartCodons.bed'
nameTask='IntersectBedtools_Forward'
saveOutputQsub='.../qsub_outputs'
logPath='.../logs'

sh ../../../Scripts/3_Trascriptome_Assembly/intersectSC_Assemblies.sh $input $input2 $output $nameTask $saveOutputQsub $logPath


Intersection Backward
Intersection Forward


Filter the intersected BED files to keep only the records in which the start codon is paired with its own associated transcript.

In [None]:
def filterStartCodon_by_Transcript(intersectedFile, toSave):
    """Keep only the lines where the start codon overlaps its own transcript."""
    keptLines = []
    with open(intersectedFile) as f:
        for line in f:
            splitLine = line.strip().split('\t')
            # Column 5: transcript on which the start codon was annotated.
            transcript_sc = splitLine[4]
            # Column 15: GTF attributes of the intersected record; the quoted
            # value of the second attribute is taken as the transcript id.
            transcript_intersected = splitLine[14].split(';')[1].split('"')[1]
            if transcript_sc == transcript_intersected:
                keptLines.append(line)

    print('Total Transcripts intersected %d' % len(keptLines))
    with open(toSave, 'w') as fileFiltered:
        fileFiltered.writelines(keptLines)
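
The filter relies on a fixed column layout, which is easier to see on a single line. The sketch below builds one synthetic intersected record and applies the same indexing as the function above. The six leading BED columns, the identifiers, and the helper name `transcript_from_attributes` are assumptions made for illustration; only the positions used by the filter (column 5 and column 15) come from the code above.

```python
def transcript_from_attributes(attributes):
    """Return the quoted value of the second ';'-separated GTF attribute."""
    return attributes.split(";")[1].split('"')[1]


# Synthetic record: 6 BED columns for the start codon (the sixth is a
# placeholder) followed by the 9 standard GTF columns of the transcript.
bed_columns = ["chr1", "100", "103", "+", "ENST00000000001.1", "."]
gtf_columns = [
    "chr1", "HAVANA", "transcript", "90", "900", ".", "+", ".",
    'gene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.1"; '
    'gene_type "protein_coding";',
]
line = "\t".join(bed_columns + gtf_columns)

split_line = line.strip().split("\t")
transcript_sc = split_line[4]
transcript_intersected = transcript_from_attributes(split_line[14])

# The record is kept only when both identifiers agree.
keep = transcript_sc == transcript_intersected
print(transcript_sc, transcript_intersected, keep)
```

Because the lookup is positional, a record whose second attribute is not `transcript_id` (for example, a gene-level GTF line) would yield a different value and the record would simply not match.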

In [None]:
print('Filtering Backward')

intersectedFile = '.../Transcripts/Canonical/Intersection_BedFileBackward_KnownStartCodons.bed'
toSave = '.../Transcripts/Canonical/Filtered_Intersection_BedFileBackward_KnowStartCodons.bed'

filterStartCodon_by_Transcript(intersectedFile, toSave)

print('Filtering Forward')

intersectedFile = '.../Transcripts/Canonical/Intersection_BedFileForward_KnownStartCodons.bed'
toSave = '.../Transcripts/Canonical/Filtered_Intersection_BedFileForward_KnowStartCodons.bed'

filterStartCodon_by_Transcript(intersectedFile, toSave)

# Get Canonical TPM Information from RNA-assembled Transcripts

TPM values are obtained from the transcripts that StringTie assembled from the RNA-seq data, and are stored as log2(TPM + 1).

In [None]:
import math
import pickle

tpm_dictionnary = {}
gtf_only_transcripts='..../StringTieAssemblies/RNA/RNA_info_Transcripts.gtf'

with open(gtf_only_transcripts) as f:
    for line in f:
        # Only transcripts matched to a reference annotation carry a reference_id.
        if 'reference_id' in line:
            transcript = line.strip().split(' reference_id ')[1].split('"')[1]
            tpm = line.strip().split(' TPM ')[1].split('"')[1]
            # Store the expression level as log2(TPM + 1).
            tpm_dictionnary[transcript] = math.log(float(tpm) + 1, 2)

# Serialize the dictionary so the next step can load it.
with open('.../StringTieAssemblies/RNA/tpm_canonical_transcripts.dic', 'wb') as fp:
    pickle.dump(tpm_dictionnary, fp, protocol=pickle.HIGHEST_PROTOCOL)
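
As a worked example of the transform, the sketch below parses a single transcript line and computes log2(TPM + 1). The line is synthetic and the helper name `parse_tpm` is hypothetical; it uses regular expressions instead of the string splits above, which is a stylistic alternative rather than the notebook's method. It assumes the attribute column carries quoted `reference_id` and `TPM` values, as the parsing code above expects.

```python
import math
import re


def parse_tpm(line):
    """Return (reference transcript id, log2(TPM + 1)), or None if either is absent."""
    ref = re.search(r'reference_id "([^"]+)"', line)
    tpm = re.search(r'TPM "([^"]+)"', line)
    if ref is None or tpm is None:
        return None
    return ref.group(1), math.log(float(tpm.group(1)) + 1, 2)


# Synthetic transcript line, for illustration only.
example = (
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; '
    'reference_id "ENST00000000001.1"; cov "12.5"; FPKM "3.2"; TPM "7.0";'
)

parsed = parse_tpm(example)
print(parsed)
```

With a TPM of 7.0 the stored value is log2(8), i.e. about 3. Adding 1 before taking the logarithm keeps transcripts with a TPM of 0 at a value of 0 instead of an undefined logarithm.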
    

# Get Canonical Transcripts Sequences

Retrieve the exon and CDS information for all canonical transcripts that were coupled with their corresponding annotated start codon.

The high-quality, sample-specific SNPs identified (FreeBayes quality > 20) are then inserted at their correct positions in the intersected transcripts. When a position is ambiguous, the SNP is integrated using the corresponding IUPAC symbol.
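
The SNP integration can be sketched as follows. This is a minimal illustration, not the code of `getInfoTranscripts.py`: the function name `insert_snps`, the tuple layout of the SNPs, and the reading of "ambiguity" as a per-SNP flag (for instance a heterozygous call) are all assumptions. The two-base IUPAC codes themselves are standard.

```python
# Standard IUPAC symbols for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def insert_snps(sequence, snps, min_quality=20):
    """Apply SNPs to a transcript sequence.

    snps is a list of (offset, ref, alt, quality, ambiguous) tuples, where
    offset is 0-based within the sequence (hypothetical layout). SNPs whose
    quality is not above min_quality are ignored.
    """
    seq = list(sequence)
    for offset, ref, alt, quality, ambiguous in snps:
        if quality <= min_quality:
            continue
        if seq[offset] != ref:
            raise ValueError("reference base mismatch at offset %d" % offset)
        # Ambiguous positions get the IUPAC symbol covering both alleles;
        # unambiguous ones are replaced by the alternative allele.
        seq[offset] = IUPAC[frozenset((ref, alt))] if ambiguous else alt
    return "".join(seq)


# Hypothetical transcript and SNPs, for illustration only.
transcript = "ATGGCCTAA"
snps = [
    (3, "G", "A", 35.0, True),    # ambiguous: G/A becomes R
    (5, "C", "T", 50.0, False),   # unambiguous: C becomes T
    (7, "A", "G", 10.0, False),   # below the quality threshold: ignored
]

result = insert_snps(transcript, snps)
print(result)
```

Here the first SNP is written as `R`, the second replaces the base outright, and the third is skipped because its quality does not exceed 20, giving `ATGRCTTAA`.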

Input Files
```python
"""
logPath : path
    Path to save log output (e.g. '.../logs/')
strand : string
    Either '+' for forward or '-' for backward
mergedAssembledTranscripts : path
    Path to the gtf file from which the transcript information will be gathered (Data_Input_Scripts/gencode.v26.primary_assembly.annotation.gtf)
freeByesSNPs : path
    Path to the FreeBayes output file (output.var.5X.pga)
quality : int
    Minimum quality of the SNP (default: 20)
bedIntersected : path
    Path to either Filtered_Intersection_BedFileForward_KnowStartCodons or Filtered_Intersection_BedFileBackward_KnowStartCodons bed files
output : path
    Path to save outputs

genome : path
    Path to genome fasta
genomeFai : path
    Path to genome index
dicStartCodons : path
    Path to the Canonical_SC.dic dictionary, which summarizes the canonical start codons with their scores generated at the TIS-calling step
tpm_dictionary : path
    Path to the TPM dictionary of the canonical transcripts (see the previous step)
"""
```
Output Files

```python
"""
Total_Transcripts_Intersected_Canonical_+.dic or Total_Transcripts_Intersected_Canonical_-.dic : dic
    Dictionary that contains, for each canonical transcript, its information (CDS, start codon position, scoreTis)

InfoTranscriptsIntersected+.gtf or InfoTranscriptsIntersected-.gtf : gtf
    GTF files that contain the same information as above, in GTF format
"""
```

In [3]:
%%bash

echo 'Transcripts Strand + '
logPath='.../logs/'

strand='+'
mergedAssembledTranscripts='../../../Data_Input_Scripts/gencode.v26.primary_assembly.annotation.gtf'
freeByesSNPs='.../FreeBayes/output.var.5X.pga'
quality=20
bedIntersected='.../Transcripts/Canonical/Filtered_Intersection_BedFileForward_KnowStartCodons.bed'
output='.../Transcripts/Canonical/DB/'
genome='../../../Data_Input_Scripts/GRCh38_Gencode26/GRCh38.primary_assembly.genome.fa'
genomeFai='../../../Data_Input_Scripts/GRCh38_Gencode26/GRCh38.primary_assembly.genome.fa.fai'
dicStartCodons='.../StartCodonsDetection/Canonical_SC.dic'
tpm_dictionary='.../StringTieAssemblies/RNA/tpm_canonical_transcripts.dic'

python ../../../Scripts/4_Get_Active_Transcripts/Canonical_Proteins/getInfoTranscripts.py -s $strand -d $dicStartCodons -t $mergedAssembledTranscripts -n $freeByesSNPs -q $quality -b $bedIntersected -o $output -f $genome -i $genomeFai -l $logPath -k $tpm_dictionary

echo 'Transcripts Strand - '
strand='-'
bedIntersected='.../Transcripts/Canonical/Filtered_Intersection_BedFileBackward_KnowStartCodons.bed'

python ../../../Scripts/4_Get_Active_Transcripts/Canonical_Proteins/getInfoTranscripts.py -s $strand -d $dicStartCodons -t $mergedAssembledTranscripts -n $freeByesSNPs -q $quality -b $bedIntersected -o $output -f $genome -i $genomeFai -l $logPath -k $tpm_dictionary



Proteins Strand + 
Proteins Strand - 
