### Finding motifs and residue sites
This notebook provides some examples of searching for specific motifs in the human proteome. Specifically, it makes use of the `sequence_tools` module, which contains a string-search function, as described below.

In [1]:
from shephard.apis import uniprot
from shephard.interfaces import si_domains
from shephard.tools import sequence_tools
from shephard import Proteome

#### Using `sequence_tools`
The cell below makes use of `sequence_tools`, a module that contains stand-alone, stateless functions for exploring sequences.

Specifically, we're going to use a search function for finding degenerate motifs. For those familiar, it behaves like a regular expression (regex) search. The cell below demonstrates how this works: the first parameter is the search query and the second is the sequence to search. The function returns the start positions of matches using 1-indexing, which makes the results compatible with protein indexing.

In [2]:
hits = sequence_tools.find_string_positions('L.VP', 'AAALAVPAAAA')
print(f'We searched for L.VP in AAALAVPAAAA, and this motif starts at position {hits[0]:d} in the sequence, using protein indexing')

We searched for L.VP in AAALAVPAAAA, and this motif starts at position 4 in the sequence, using protein indexing
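To make the regex analogy concrete, here is a minimal sketch of the same idea using only Python's built-in `re` module. The helper name `find_positions_regex` is hypothetical and is not part of SHEPHARD; it is an illustration of the behaviour described above, not the library's implementation. Note that `re.finditer` reports non-overlapping matches only, and we make no claim here about how `find_string_positions` treats overlapping hits.

```python
import re

def find_positions_regex(pattern, sequence):
    """Hypothetical stand-in (not SHEPHARD code): return the 1-indexed
    start position of every non-overlapping regex match in sequence."""
    # m.start() is 0-indexed, so add 1 to get protein-style indexing
    return [m.start() + 1 for m in re.finditer(pattern, sequence)]

# '.' matches any single residue, so 'L.VP' matches 'LAVP'
print(find_positions_regex('L.VP', 'AAALAVPAAAA'))
```

For the example sequence this gives the same start position (4) as the cell above.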


In [3]:
# read in the human proteome and annotate it with disordered regions
human_proteome = uniprot.uniprot_fasta_to_proteome('../shprd_data/human_proteome_validated.fasta')
si_domains.add_domains_from_file(human_proteome, '../shprd_data/shprd_domains_metapredictv2_rerun.tsv')

In [4]:
LxVxE_count = 0   

# The loop here cycles over each protein in the human proteome
for protein in human_proteome:
    
    # for each domain (here, each annotated IDR)
    for domain in protein.domains:
        
        # find hits in the IDR
        hits = sequence_tools.find_string_positions('L.V.E', domain.sequence)
        
        # hits are indexed relative to the IDR sequence, so convert each one
        # to a position in the full protein before adding the site
        for hit in hits:
            protein.add_site(domain.start + hit - 1, 'LxVxE site')
            LxVxE_count += 1


# report the total number of hits
print(f'Found {LxVxE_count:d} LxVxE sites in IDRs in the human proteome')    

Found 863 LxVxE sites in IDRs in the human proteome
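One detail worth keeping in mind when searching inside a domain: the positions returned are relative to the string that was searched (the domain sequence), not to the full protein. The sketch below illustrates the coordinate conversion with plain Python and a made-up toy sequence. The function name and the toy sequence are hypothetical, and domain boundaries are assumed to be 1-indexed and inclusive.

```python
import re

def domain_hits_to_protein_positions(protein_seq, domain_start, domain_end, pattern):
    """Hypothetical helper: search a domain for a motif and return the
    hit positions in full-protein coordinates (1-indexed).

    Assumes domain_start and domain_end are 1-indexed and inclusive."""
    # slice out the domain sequence (convert to 0-indexed, end-exclusive)
    domain_seq = protein_seq[domain_start - 1:domain_end]

    # 1-indexed positions relative to the start of the domain
    local_hits = [m.start() + 1 for m in re.finditer(pattern, domain_seq)]

    # shift each hit by the domain start to get protein coordinates
    return [domain_start + hit - 1 for hit in local_hits]

# toy example: the motif 'LAVAE' starts at residue 7 of the full sequence,
# which is residue 3 of a domain spanning residues 5 to 14
toy_seq = 'GGGGGGLAVAEGGG'
print(domain_hits_to_protein_positions(toy_seq, 5, 14, 'L.V.E'))
```

Here the hit is at position 3 within the domain, which corresponds to position 7 in the full sequence.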


### Asking about local sequence context
You can also use `find_string_positions` to find every occurrence of a single residue. In the demo below we examine the local sequence context around each arginine in a fragment of the protein [TDP-43](https://www.uniprot.org/uniprotkb/Q13148/entry).

In [5]:
# define empty proteome
new_prot = Proteome()

# add a protein
seq = 'KGISVHISNAEPKHNSNRQLERSGRFGGNPGGFGNQGGFGNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNSYSGSNSGAAIGWGSASNAGSGSGFNGGFGSSMDSKSSGWGM'
new_prot.add_protein(seq, 'fragment_TDP43', 'seq_001')

# save protein object as variable
p = new_prot.protein('seq_001')

# add a site to that protein for every arginine residue
for i in sequence_tools.find_string_positions('R', p.sequence):
    p.add_site(i, 'interest_site_R')

In [6]:
# get sequence around all sites
print('Local sequence context of R sites:')

# note the offset 3 means we look +/- 3 residues around each arginine
[s.get_local_sequence_context(offset=3) for s in new_prot.protein('seq_001').sites]

Local sequence context of R sites:


['NSNRQLE', 'QLERSGR', 'RSGRFGG', 'GNSRGGG', 'NMQREPN']
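To show what a +/- 3 window means in terms of indexing, here is a small stand-alone sketch that extracts the same kind of local context with plain string slicing. The function `local_context` is hypothetical and is not SHEPHARD's implementation. In particular, truncating the window at the sequence termini is an assumption of this sketch, not a statement about how `get_local_sequence_context` behaves there.

```python
def local_context(sequence, position, offset):
    """Hypothetical helper: return the residues within +/- offset of a
    1-indexed position, truncated at the sequence termini (an assumption)."""
    # convert the 1-indexed position to a 0-indexed slice, clamped to the ends
    start = max(position - 1 - offset, 0)
    end = min(position + offset, len(sequence))
    return sequence[start:end]

# the first 32 residues of the fragment used above
fragment = 'KGISVHISNAEPKHNSNRQLERSGRFGGNPGG'

# 1-indexed positions of every arginine in the fragment
r_positions = [i + 1 for i, aa in enumerate(fragment) if aa == 'R']

print([local_context(fragment, pos, 3) for pos in r_positions])
```

For the three arginines in this shortened fragment, the windows match the first three entries of the output above.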