# Python app: Find proteins containing c-terminal degrons

This app finds and selects all proteins that contain c-terminal degrons.

1. Upload human gene data &rarr; annotate gene_name, transcript_name, protein_name

2. Translate &rarr; annotate proteins

3. Find c-degron sequences: use consensus sequences 

4. Results visualization  

|Number|C-degrons|
|--:|---------:|
|1|-GG|  
|2|-RG|  
|3|-PG|  
|4|-XR|  
|5|-RXXG|  
|6|-EE| 
|7|-RXX|  
|8|-VX|  
|9|-AX|  
|10|-A|    

Sources: Varshavsky *et al.* 2019 (**Fig. S3**, Supplementary material)  
Lin *et al.* 2018  

### 1. Upload genomic data

In [3]:
#Pending: download data from ensembl, use pyensembl (pypi.org/project/pyensembl/)

from pyensembl import EnsemblRelease

#help(EnsemblRelease)


In [2]:
#Import cDNA sequences (from local file)

from Bio import SeqIO

cdna_seqs = []
cdna_ids = []
for record in SeqIO.parse("Data/Homo_sapiens.GRCh38.cdna.all.fa", "fasta"):
    cdna_seqs.append(str(record.seq))
    cdna_ids.append(record.id)

print(cdna_ids[0:10])
print(len(cdna_ids))
print(cdna_seqs[0:10])
print(len(cdna_seqs))

['ENST00000434970.2', 'ENST00000415118.1', 'ENST00000448914.1', 'ENST00000631435.1', 'ENST00000632684.1', 'ENST00000390583.1', 'ENST00000431440.2', 'ENST00000632524.1', 'ENST00000633009.1', 'ENST00000634070.1']
190432
['CCTTCCTAC', 'GAAATAGT', 'ACTGGGGGATACG', 'GGGACAGGGGGC', 'GGGACAGGGGGC', 'GTATTACTATGGTTCGGGGAGTTATTATAAC', 'TGACTACAGTAACTAC', 'CTAACTGGGGA', 'GGTATAGTGGGAGCTACTAC', 'GGGTATAGCAGCGGCTAC']
190432
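
Step 1 of the outline asks for gene, transcript and protein annotations. One possible route, sketched below, is to read them from the FASTA header itself (available in Biopython as `record.description`). The sketch assumes that the header carries space-separated `key:value` fields such as `gene:` and `gene_symbol:`, followed by a free-text `description:` field. The helper name `parse_ensembl_header` and the example header are hypothetical.

```python
def parse_ensembl_header(description):
    """Split an Ensembl-style FASTA header into a dict of annotations.

    Assumes the layout '<id> <type> key:value key:value ... description:free text'.
    """
    # Everything after ' description:' is free text that may contain spaces and colons
    head, _, free_text = description.partition(' description:')
    tokens = head.split()
    fields = {'id': tokens[0]}
    for token in tokens[1:]:
        key, sep, value = token.partition(':')
        if sep:  # keep only key:value tokens (skips bare words such as 'cdna')
            fields[key] = value
    if free_text:
        fields['description'] = free_text
    return fields

# Hypothetical header written in the assumed layout
example = ("ENST00000000001.1 cdna chromosome:GRCh38:1:100:200:1 "
           "gene:ENSG00000000001.1 gene_biotype:protein_coding "
           "gene_symbol:EXAMPLE1 description:example gene [Source:made up]")
annotations = parse_ensembl_header(example)
print(annotations['id'], annotations['gene'], annotations['gene_symbol'])
```

The resulting dictionaries could be collected into extra columns of the ID and sequence tables built in the next cells.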


- Create tables (IDs + sequences):

In [3]:
#Create cDNA sequences table

import pandas as pd

frame = {'ID': cdna_ids, 'Sequences': cdna_seqs}
cdna_df = pd.DataFrame(frame)
print(cdna_df)

                       ID                                          Sequences
0       ENST00000434970.2                                          CCTTCCTAC
1       ENST00000415118.1                                           GAAATAGT
2       ENST00000448914.1                                      ACTGGGGGATACG
3       ENST00000631435.1                                       GGGACAGGGGGC
4       ENST00000632684.1                                       GGGACAGGGGGC
...                   ...                                                ...
190427  ENST00000639660.1  GGCGTCTACAAGAGACCTTCCTTCTCAGCTCAACTGTGCCCTGCAG...
190428  ENST00000673346.1  CTAACAGATGTCTCTATATTCCTCCTCCTCGAACTCTCAGAGGATC...
190429  ENST00000673247.1  GTTACCTAACCAAACTCCTGCAAAACCACACCACCTATGACTGTGA...
190430  ENST00000672305.1  GGCTGTGACCGTCTATGACAAGCCGGCATCTTTCTTTAAAGAGACA...
190431  ENST00000671911.1  ACTCAGAGCTACTGCTGATCTCCTTCCAGGGCTTCCACTGGGACTA...

[190432 rows x 2 columns]


In [5]:
#Import protein sequences

from Bio import SeqIO

prot_seqs = []
prot_ids = []
for record in SeqIO.parse("Data/Homo_sapiens.GRCh38.pep.all.fa", "fasta"):
    prot_seqs.append(str(record.seq))
    prot_ids.append(record.id)

print(prot_ids[0:10])
print(len(prot_ids))
print(prot_seqs[0:10])
print(len(prot_seqs))

['ENSP00000451515.1', 'ENSP00000451042.1', 'ENSP00000452494.1', 'ENSP00000488240.1', 'ENSP00000487941.1', 'ENSP00000419773.1', 'ENSP00000430034.1', 'ENSP00000488695.1', 'ENSP00000488000.1', 'ENSP00000488392.1']
111047
['PSY', 'EI', 'TGGY', 'GTGG', 'GTGG', 'VLLWFGELL', '*LQ*L', 'LTG', 'GIVGAT', 'GYSSGY']
111047


In [6]:
#Create protein list table

import pandas as pd

frame = {'ID': prot_ids, 'Sequences': prot_seqs}
prot_df = pd.DataFrame(frame)
print(prot_df)

                       ID                                          Sequences
0       ENSP00000451515.1                                                PSY
1       ENSP00000451042.1                                                 EI
2       ENSP00000452494.1                                               TGGY
3       ENSP00000488240.1                                               GTGG
4       ENSP00000487941.1                                               GTGG
...                   ...                                                ...
111042  ENSP00000494625.1  RQGRCDTYATEFDLEAEEYVPLPKGDVHKKKEIIQDVTLHDLDVAN...
111043  ENSP00000494933.1  MAGRRVNVNVGVLGHIDSGKTALARALSTTASTAAFDKQPQSRERG...
111044  ENSP00000495578.1  MAGRRVNVNVGVLGHIDSGKTALARALSTTASTAAFDKQPQSRERG...
111045  ENSP00000496548.1  MPSMLERISKNLVKEIGSKDLTPVKYLLSATKLRQFVILRKKKDSR...
111046  ENSP00000494855.1  MPSMLERISKNLVKEIGSKDLTPVKYLLSATKLRQFVILRKKKDSR...

[111047 rows x 2 columns]


### 2. cDNA to RNA and translation

- From cDNA to RNA:

In [23]:
def cdna_to_rna(dna_seq_list):
    """Creates a list of RNA sequences from a list of cDNA sequences"""
    #DNA-to-RNA base mapping (only T changes, to U); defined once, outside the loops
    base_dna_rna = {'A':'A', 'T':'U', 'C':'C', 'G':'G'}
    rna_seq_list = []
    for dna_seq in dna_seq_list:
        rna_seq = ''
        for base in dna_seq:
            #Skip ambiguous bases (N); any other unexpected character raises a KeyError
            if base != 'N':
                rna_seq += base_dna_rna[base]
        rna_seq_list.append(rna_seq)
    return rna_seq_list

rna_seqs = cdna_to_rna(cdna_seqs)
print(rna_seqs[0:10])

['CCUUCCUAC', 'GAAAUAGU', 'ACUGGGGGAUACG', 'GGGACAGGGGGC', 'GGGACAGGGGGC', 'GUAUUACUAUGGUUCGGGGAGUUAUUAUAAC', 'UGACUACAGUAACUAC', 'CUAACUGGGGA', 'GGUAUAGUGGGAGCUACUAC', 'GGGUAUAGCAGCGGCUAC']
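
The same conversion can also be written with built-in string methods, which avoids building each sequence one character at a time. This is a minimal sketch and not a drop-in replacement: the name `cdna_to_rna_fast` is hypothetical, and unlike `cdna_to_rna` it lets unexpected characters pass through instead of raising an error.

```python
def cdna_to_rna_fast(dna_seq_list):
    """Return RNA strings for a list of cDNA strings using built-in string methods."""
    # Drop ambiguous 'N' bases, then swap T for U; other characters pass through unchanged
    return [dna_seq.replace('N', '').replace('T', 'U') for dna_seq in dna_seq_list]

converted = cdna_to_rna_fast(['CCTTCCTAC', 'GANAT'])
print(converted)
```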


- From RNA to protein (translation):

In [24]:
#Method 1 to generate the translated sequence (using this dictionary)

#Dictionary triplet RNA to amino acid:
triplet_rna_aa = {'GAA': 'E', 'CGA': 'R', 'GUG': 'V', 'UAA': '*', 'CGU': 'R', 'AUA': 'I', 'GAC': 'D', 'UCG': 'S', 
                  'GAU': 'D', 'AUG': 'M', 'CUG': 'L', 'CUA': 'L', 'UAC': 'Y', 'GGA': 'G', 'CGG': 'R', 'AGC': 'S', 
                  'UCU': 'S', 'UGA': '*', 'AAA': 'K', 'ACC': 'T', 'ACA': 'T', 'UGC': 'C', 'AAG': 'K', 'GUC': 'V', 
                  'UCC': 'S', 'ACU': 'T', 'AGA': 'R', 'CUU': 'L', 'GCC': 'A', 'GUA': 'V', 'UAG': '*', 'CAA': 'Q', 
                  'CAC': 'H', 'GCU': 'A', 'UUA': 'L', 'CAU': 'H', 'CGC': 'R', 'UUC': 'F', 'AUU': 'I', 'GGC': 'G', 
                  'CAG': 'Q', 'AAC': 'N', 'CCC': 'P', 'GUU': 'V', 'AGG': 'R', 'UGU': 'C', 'CCG': 'P', 'GGG': 'G', 
                  'AUC': 'I', 'UUU': 'F', 'AAU': 'N', 'UCA': 'S', 'GAG': 'E', 'CCA': 'P', 'GCA': 'A', 'UAU': 'Y', 
                  'GGU': 'G', 'UGG': 'W', 'GCG': 'A', 'CUC': 'L', 'UUG': 'L', 'CCU': 'P', 'ACG': 'T', 'AGU': 'S'}


In [25]:
#Method 2 to generate the translated sequence (using a function for the genetic code and gencode.txt file)

#Generate a function for the genetic code

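def genetic_code(file):
    """Reads a whitespace-separated 'codon amino_acid' file into a dictionary"""
    code_table = {}
    #The with block closes the file even if a line cannot be parsed
    with open(file) as gencode:
        for line in gencode.read().splitlines():
            codon, aa = line.split()
            code_table[codon] = aa
    return code_table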

In [26]:
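#Example with the first RNA sequence
rna1 = rna_seqs[0]
#Load the genetic code once, outside the loop
triplet_to_aa = genetic_code('gencode.txt')
rna_seq_triplets = [rna1[i:i+3] for i in range(0, len(rna1), 3)]
prot_seq = ''
for triplet in rna_seq_triplets:
    prot_seq += triplet_to_aa[triplet]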

print(rna1)
print(rna_seq_triplets)
print(prot_seq)


CCUUCCUAC
['CCU', 'UCC', 'UAC']
PSY


In [27]:
def translation(rna_seq_list):
    """Creates a list of protein sequences from a list of rna sequences"""
    
    prot_seq_list = []
    for rna_seq in rna_seq_list:
        rna_seq_triplets = [rna_seq[i:i+3] for i in range(0, len(rna_seq), 3)]
        prot_seq = ''
        
        for triplet in rna_seq_triplets:
            if len(triplet) == 3:
                prot_seq += triplet_rna_aa[triplet]
            else:
                break
                
        prot_seq_list.append(prot_seq)
    return prot_seq_list

prot_seqs2 = translation(rna_seqs)
print(prot_seqs2[0:10])

#Pending: eliminate sequence after STOP codons... AUG check

['PSY', 'EI', 'TGGY', 'GTGG', 'GTGG', 'VLLWFGELL*', '*LQ*L', 'LTG', 'GIVGAT', 'GYSSGY']
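
The pending item above (stop at STOP codons, check for AUG) could be approached along the following lines. This is a sketch under simplifying assumptions: it starts at the first AUG found in the sequence, which is not necessarily the annotated start codon, and it stops at the first in-frame stop codon. The names `CODON_TABLE` and `translate_orf` are hypothetical, and the codon table is rebuilt here only to keep the example self-contained.

```python
from itertools import product

# Standard genetic code, built from the conventional U, C, A, G codon ordering
BASES = 'UCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_orf(rna_seq):
    """Translate from the first AUG up to (not including) the first in-frame stop codon."""
    start = rna_seq.find('AUG')
    if start == -1:
        return ''  # no start codon found
    prot_seq = ''
    # Step through complete codons only; a trailing partial codon is ignored
    for i in range(start, len(rna_seq) - 2, 3):
        aa = CODON_TABLE[rna_seq[i:i+3]]
        if aa == '*':
            break
        prot_seq += aa
    return prot_seq

print(translate_orf('GGAUGCCUUCCUACUAAGGG'))  # MPSY
```

Sequences without any AUG return an empty string, so they can be filtered out before the degron search.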


In [28]:
print(len(prot_seqs))
print(prot_seqs[0:10])
print(len(prot_seqs2))
print(prot_seqs2[0:10])

111047
['PSY', 'EI', 'TGGY', 'GTGG', 'GTGG', 'VLLWFGELL', '*LQ*L', 'LTG', 'GIVGAT', 'GYSSGY']
190432
['PSY', 'EI', 'TGGY', 'GTGG', 'GTGG', 'VLLWFGELL*', '*LQ*L', 'LTG', 'GIVGAT', 'GYSSGY']


### 3. List all proteins containing c-terminal degrons
- Find c-degrons (main function):

In [29]:
import re

def find_cdegron(prot_seq_list, cdegron_motif):
    """Finds all proteins containing c-terminal degrons (cdegrons)
    input: a list of protein sequences and the c-degron motif (a regular expression)
    return: a list of protein sequences containing the c-degron motif"""
    cdegron_seq_list = []
    for prot_seq in prot_seq_list:
        #re.search stops at the first match, which is enough to test for presence
        if re.search(cdegron_motif, prot_seq):
            cdegron_seq_list.append(prot_seq)
    return cdegron_seq_list
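
Some translated sequences end with the stop symbol (for example `'VLLWFGELL*'` in the output of the translation cell), so a pattern anchored with `$` sees `*` as the last character. A possible workaround, sketched here with a hypothetical helper `has_cdegron`, is to trim trailing stop symbols before searching.

```python
import re

def has_cdegron(prot_seq, cdegron_motif):
    """Return True if the motif matches the C-terminus, ignoring trailing '*' symbols."""
    # Remove stop symbols at the end so that '$' anchors on the last residue
    trimmed = prot_seq.rstrip('*')
    return re.search(cdegron_motif, trimmed) is not None

print(has_cdegron('VLLWFGELL*', 'LL$'))
print(has_cdegron('VLLWFGELL*', 'GG$'))
```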


In [None]:
#Pending: function to find degrons from cdna sequences
#make if statements: if cdna... if protein... else print("The sequence cannot be recognized. Please upload a list of cdnas or proteins")...
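
One way to approach the pending input check is to look at the letters the sequences use. The sketch below is only a heuristic: a protein list made up solely of A, C, G, T and N residues would be mistaken for cDNA. The alphabets and the name `guess_sequence_type` are assumptions for illustration.

```python
DNA_LETTERS = set('ACGTN')
# 20 standard amino acids plus U (selenocysteine), X (unknown) and * (stop)
PROTEIN_LETTERS = set('ACDEFGHIKLMNPQRSTVWYUX*')

def guess_sequence_type(seq_list):
    """Return 'cdna', 'protein' or 'unknown' for a list of sequences."""
    letters = set()
    for seq in seq_list:
        letters.update(seq.upper())
    if not letters:
        return 'unknown'
    # Test the smaller DNA alphabet first, since it is a subset of the protein one
    if letters <= DNA_LETTERS:
        return 'cdna'
    if letters <= PROTEIN_LETTERS:
        return 'protein'
    return 'unknown'

for seqs in (['CCTTCCTAC', 'GAAATAGT'], ['PSY', 'EI'], ['PSY1']):
    kind = guess_sequence_type(seqs)
    if kind == 'unknown':
        print("The sequence cannot be recognized. Please upload a list of cdnas or proteins")
    else:
        print(kind)
```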

- Functions for finding more than one c-degron motif in a given prot_seqs list:

In [30]:
cdegron_motifs = ['GG', 'RG', 'PG', 'XR', 'RXXG', 'EE', 'RXX', 'VX', 'AX', 'A']

#Prepare cdegron list to re terms:
def cdegron_to_re (cdegron_motifs):
    """This function converts a list of c-degron motifs to regular expressions"""
    cdegron_motifs_re = []
    for motif in cdegron_motifs:
        motif_re = motif +'$'
        cdegron_motifs_re.append(motif_re)
    cdegron_motifs_re = [c.replace('X', '.') for c in cdegron_motifs_re]
    return cdegron_motifs_re

In [32]:
#Search for each cdegron motif in a list of protein sequences:

def find_cdegron_list(cdegron_motifs, prot_seqs):
    """This function finds all cdegron motifs provided in a list"""
    cdegron_motifs_re = cdegron_to_re(cdegron_motifs)
    n_cdegron_motifs = []
    for i in range(len(cdegron_motifs_re)):
        motif = cdegron_motifs_re[i]
        cdegron_prot_list = find_cdegron(prot_seqs, motif)
        n_cdegron_motifs.append(len(cdegron_prot_list))
    return n_cdegron_motifs
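
When the matching sequences themselves are needed, not only their counts, they can be grouped per motif. The following is a minimal sketch with a hypothetical function `group_by_cdegron` and made-up toy sequences; it repeats the X-to-`.` conversion so that it runs on its own.

```python
import re

def group_by_cdegron(cdegron_motifs, prot_seqs):
    """Map each motif to the list of sequences whose C-terminus matches it."""
    groups = {}
    for motif in cdegron_motifs:
        # Same conversion as cdegron_to_re: X is a wildcard, '$' anchors the end
        pattern = re.compile(motif.replace('X', '.') + '$')
        groups[motif] = [seq for seq in prot_seqs if pattern.search(seq)]
    return groups

toy_seqs = ['MKTGG', 'MKTAR', 'MKTEE', 'MKTLL']
groups = group_by_cdegron(['GG', 'XR', 'EE'], toy_seqs)
for motif, seqs in groups.items():
    print(motif, seqs)
```

The per-motif counts are then simply the lengths of the lists in the dictionary.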

In [33]:
#Calculate the percentage of each c-degron motif:

def percentages_cdegron (n_cdegron_motifs, prot_seqs):
    """This function calculates the percentage of proteins containing each c-degron provided"""
    percent_degron_list = []
    for i in range(len(n_cdegron_motifs)):
        percent_degron_i = round(n_cdegron_motifs[i]/len(prot_seqs)*100, 3)
        percent_degron_list.append(percent_degron_i)
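    #Note: a protein matching several motifs (e.g. one ending in AR matches both XR and AX)
    #is counted once per motif, so this total can exceed the number of distinct proteins
    total_n_cdegron = sum(n_cdegron_motifs)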
    percent_degrons_total = round(total_n_cdegron/len(prot_seqs)*100, 3)
    return [total_n_cdegron, percent_degron_list, percent_degrons_total]
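
Because the motifs overlap, adding up the per-motif counts can count the same protein more than once. A sketch of one alternative is to combine all motifs into a single alternation and count each sequence at most once. The function name `count_unique_cdegron_proteins` and the toy sequences are illustrative only.

```python
import re

def count_unique_cdegron_proteins(cdegron_motifs, prot_seqs):
    """Count sequences matching at least one motif, each sequence counted once."""
    # One alternation such as '(?:.R|A.|GG)$' covers all motifs in a single search
    alternatives = '|'.join(motif.replace('X', '.') for motif in cdegron_motifs)
    pattern = re.compile('(?:' + alternatives + ')$')
    return sum(1 for seq in prot_seqs if pattern.search(seq))

# 'MKTAR' matches both XR and AX but contributes only once
toy_seqs = ['MKTAR', 'MKTGG', 'MKTLL']
n_unique = count_unique_cdegron_proteins(['XR', 'AX', 'GG'], toy_seqs)
print(n_unique)
```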


In [40]:
#Results summary message:

def results_summary (cdegron_motifs, n_cdegron_motifs, percentages_degrons, prot_seqs):
    """This function displays a summary of the results"""
    total_n_cdegron = percentages_degrons[0]
    percent_degron_list = percentages_degrons[1]
    percent_degrons_total = percentages_degrons[2]
    
    sentence1 = "The protein list you provided harbors:\n" 
    sentence2 = ""
    for i in range(len(cdegron_motifs)): 
        #f-string: without the f prefix the placeholders are printed literally
        sentencei = f"- {n_cdegron_motifs[i]} proteins with the {cdegron_motifs[i]} c-degron motif ({percent_degron_list[i]} %)\n"
        sentence2 += sentencei
    sentence3 = "- " + str(total_n_cdegron) + " proteins with all c-degron motifs (" + str(percent_degrons_total) +"%)\n"

    sentence4 = "from a total " + str(len(prot_seqs)) + " proteins.\n"
    sentence5 = sentence1 + sentence2 + sentence3 + sentence4
    return sentence5

#Pending: List of lists... create a list of sequences for each cdegron motif

- Test the functions:

In [41]:
#Data:
cdegron_motifs = ['GG', 'RG', 'PG', 'XR', 'RXXG', 'EE', 'RXX', 'VX', 'AX', 'A']
#Subset for quick tests (not used below; the calls run on the full prot_seqs2 list)
prot_seqs1000 = prot_seqs[0:1000]

#cdegron_to_re
cdegron_re = cdegron_to_re(cdegron_motifs)
print(cdegron_re)

#find_cdegron_list
cdegron_list2 = find_cdegron_list(cdegron_motifs, prot_seqs2)
print(cdegron_list2)

#percentages_cdegron
percents2 = percentages_cdegron(cdegron_list2, prot_seqs2)
print(percents2)

#results_summary
output2 = results_summary(cdegron_motifs, cdegron_list2, percents2, prot_seqs2)
print(output2)

#Pending: Implement functions inside other functions!!


['GG$', 'RG$', 'PG$', '.R$', 'R..G$', 'EE$', 'R..$', 'V.$', 'A.$', 'A$']
[799, 686, 793, 10074, 552, 520, 9804, 8693, 9266, 8910]
[50097, [0.42, 0.36, 0.416, 5.29, 0.29, 0.273, 5.148, 4.565, 4.866, 4.679], 26.307]
The protein list you provided harbors:
- 799 proteins with the GG c-degron motif (0.42 %)
- 686 proteins with the RG c-degron motif (0.36 %)
- 793 proteins with the PG c-degron motif (0.416 %)
- 10074 proteins with the XR c-degron motif (5.29 %)
- 552 proteins with the RXXG c-degron motif (0.29 %)
- 520 proteins with the EE c-degron motif (0.273 %)
- 9804 proteins with the RXX c-degron motif (5.148 %)
- 8693 proteins with the VX c-degron motif (4.565 %)
- 9266 proteins with the AX c-degron motif (4.866 %)
- 8910 proteins with the A c-degron motif (4.679 %)
- 50097 proteins with all c-degron motifs (26.307%)
from a total 190432 proteins.

In [None]:
#Pending: List and enumerate all proteins containing c-degrons: ANNOTATION, pandas


#Pending: Save outputs (protein ids+seqs, summary, etc) in files
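
For the pending export step, the standard-library `csv` module may be enough. The sketch below writes the IDs and sequences that match one motif to any file-like object. The function name `write_cdegron_table` and the file name in the comment are hypothetical; the two example entries are taken from the protein lists printed earlier.

```python
import csv
import io
import re

def write_cdegron_table(handle, prot_ids, prot_seqs, motif_re):
    """Write 'ID,Sequence' rows for sequences matching motif_re; return the row count."""
    writer = csv.writer(handle)
    writer.writerow(['ID', 'Sequence'])
    n_rows = 0
    for prot_id, prot_seq in zip(prot_ids, prot_seqs):
        if re.search(motif_re, prot_seq):
            writer.writerow([prot_id, prot_seq])
            n_rows += 1
    return n_rows

# In-memory demonstration; with a real file use open('cdegron_GG.csv', 'w', newline='')
buffer = io.StringIO()
n_written = write_cdegron_table(buffer,
                                ['ENSP00000488240.1', 'ENSP00000451515.1'],
                                ['GTGG', 'PSY'], 'GG$')
print(n_written)
print(buffer.getvalue())
```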

### 4. Data visualisation

In [None]:
# Percentages, etc...
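
Until a plotting library is chosen, the percentages can be inspected with a dependency-free text chart. This is only a placeholder sketch: `text_bar_chart` is a hypothetical helper, and the values are the per-motif percentages printed by the test cell above.

```python
def text_bar_chart(labels, percentages, width=40):
    """Return one line per label with a bar scaled to the largest percentage."""
    largest = max(percentages)
    lines = []
    for label, pct in zip(labels, percentages):
        # Bars are proportional to the largest value; an all-zero input gives empty bars
        bar = '#' * round(pct / largest * width) if largest else ''
        lines.append(f"{label:>5} | {bar} {pct}%")
    return '\n'.join(lines)

motifs = ['GG', 'RG', 'PG', 'XR', 'RXXG', 'EE', 'RXX', 'VX', 'AX', 'A']
percents = [0.42, 0.36, 0.416, 5.29, 0.29, 0.273, 5.148, 4.565, 4.866, 4.679]
chart = text_bar_chart(motifs, percents)
print(chart)
```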

