# Finding a Protein Motif

## Problem

To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
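
The shorthand maps almost directly onto regular-expression character classes: `[XY]` is already valid regex, and `{X}` corresponds to the negated class `[^X]`. A minimal sketch of that translation follows; the helper name `shorthand_to_regex` is hypothetical and not part of the problem statement.

```python
import re


def shorthand_to_regex(motif: str) -> str:
    """Translate motif shorthand into a regular expression.

    [XY] is left untouched (it is already a regex character class);
    {X} is rewritten as the negated class [^X].
    """
    return re.sub(r'\{([A-Z]+)\}', r'[^\1]', motif)


pattern = shorthand_to_regex('N{P}[ST]{P}')
print(pattern)  # N[^P][ST][^P]
```

The resulting string can be passed to `re.compile` as-is.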

To view the complete description and features of a particular protein in the UniProt database, insert its access ID ("uniprot_id") into the following URL:

    http://www.uniprot.org/uniprot/uniprot_id
    
Alternatively, you can obtain the protein sequence in FASTA format from:

    http://www.uniprot.org/uniprot/uniprot_id.fasta
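
A FASTA response is plain text: one header line starting with `>` followed by the sequence wrapped over several lines. As a rough sketch of how such text could be turned into a single protein string without extra libraries, the hypothetical helper below splits off the header and joins the rest. The example record is made up for illustration and is not a real UniProt entry.

```python
from typing import Tuple


def parse_fasta_record(text: str) -> Tuple[str, str]:
    """Split single-record FASTA text into (header, sequence)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith('>'):
        raise ValueError('expected a FASTA header line starting with ">"')
    header = lines[0][1:]            # drop the leading '>'
    sequence = ''.join(lines[1:])    # re-join the wrapped sequence lines
    return header, sequence


# Made-up record, only to show the shape of the input.
example = ">sp|X00000|DEMO_TOY made-up record for illustration\nMKNNTS\nYNPSA\n"
header, sequence = parse_fasta_record(example)
print(sequence)  # MKNNTSYNPSA
```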
    
### Given
At most 15 UniProt Protein Database access IDs.
### Return
 For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.
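
Two details of the expected output are easy to miss: locations are counted from 1, and occurrences of the motif may overlap (the sample output lists both 115 and 116 for one protein). The sketch below shows one way to handle both, using a zero-width lookahead so that each match consumes no characters. The function name `motif_positions` and the toy sequence are assumptions made for illustration.

```python
import re

# The lookahead (?=...) matches at a position without consuming it,
# so a match starting at the very next residue can still be found.
MOTIF = re.compile(r'(?=N[^P][ST][^P])')


def motif_positions(sequence):
    """Return 1-based start positions of every (possibly overlapping) match."""
    return [m.start() + 1 for m in MOTIF.finditer(sequence)]


toy = 'NNTSYNPSA'  # made-up sequence with two overlapping occurrences
print(motif_positions(toy))  # [1, 2]

# Without the lookahead, the second overlapping occurrence is skipped.
print([m.start() + 1 for m in re.finditer(r'N[^P][ST][^P]', toy)])  # [1]
```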
 
### Sample Dataset
>A2Z669 <br>
B5ZC00 <br>
P07204_TRBM_HUMAN <br>
P20840_SAG1_YEAST <br>
### Sample Output
>B5ZC00<br>
85 118 142 306 395<br>
P07204_TRBM_HUMAN<br>
47 115 116 382 409<br>
P20840_SAG1_YEAST<br>
79 109 135 248 306 348 364 402 485 501 614


In [1]:
import re
from io import StringIO
from urllib.request import urlopen

from Bio import SeqIO


def MPRT(file):
    # Read one access ID per line, skipping blank lines.
    with open(file) as ds:
        access_ids = [line.strip() for line in ds if line.strip()]

    # N{P}[ST]{P} as a regex. The lookahead makes each match zero-width,
    # so overlapping occurrences (e.g. 115 and 116) are all reported.
    motif = re.compile(r'(?=(N[^P][ST][^P]))')

    for access_id in access_ids:
        url = 'http://www.uniprot.org/uniprot/' + access_id + '.fasta'
        with urlopen(url) as dataset:
            fasta = dataset.read().decode('utf-8', 'ignore')

        # Parse the downloaded text in memory. Appending every download to a
        # shared file left stale records behind on repeated runs and could
        # misalign records with their access IDs.
        for r in SeqIO.parse(StringIO(fasta), 'fasta'):
            # Convert 0-based match offsets to 1-based positions.
            pos = [m.start() + 1 for m in motif.finditer(str(r.seq))]
            if pos:
                print(access_id)
                print(' '.join(map(str, pos)))

In [2]:
MPRT('/home/kip/Downloads/sampledataset.txt')

B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614
