# Genome Mining Script
##### Original Author: Chun Yin Larry So | Python Interpretation: Nathan Alam
[Github repository](https://github.com/nathanalam/genomemining)

The purpose of this notebook is to provide scripts that scan bacterial genomes for amino acid sequences matching a regular expression that describes candidate lasso peptide precursors.

## Links to relevant publications:
- [Genome mining for lasso peptides: past, present, and future](https://link.springer.com/article/10.1007/s10295-019-02197-z)
- [Prospecting Genomes for Lasso Peptides](https://www.ncbi.nlm.nih.gov/pubmed/24142336)

### Comments from Perl Script
- In this version, each time MAST needs to be run on a putative maturation enzyme in a particular genome, the database is first checked to see whether this protein has already been analyzed by MAST; if it has, the earlier values are reused
- In this version transeq is used instead of getorf for finding precursors
- getorf is still used for finding neighbors
- Like v4 except that multiple proteins of the same sequence don't cause wrong locations of maturation enzymes to be reported
- Another change is that the pattern needs to be adjusted to [MVL] instead of ^ in the beginning
- This version fixes the problem of having a useless %AME hash and also of erasing sequences from %AME_scores
- Precursor pattern is output into the log file
- ORF searching behavior is changed from stop-to-stop to [MVL]-to-stop
- Clusters of precursors are saved in clusters.txt
- **Warning**: transeq doesn't label the -1, -2, -3 frames sequentially. Sometimes it is -1, -3, -2, sometimes some other combination. This means the precursor locations on the reverse strand are off by one sometimes
- Take note that rank_hits expects 4 motifs for the B enzyme and 3 motifs for the C enzyme. Adjust accordingly.
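
The change from stop-to-stop to [MVL]-to-stop ORF searching noted in the list above can be illustrated with a short sketch. The helper names `stop_to_stop_orfs` and `start_to_stop_orfs` are hypothetical, and this is only an interpretation of the comment, not the original Perl implementation.

```python
import re

# Hypothetical helpers: a sketch of the behaviour described in the comments,
# not a port of the original Perl code.
def stop_to_stop_orfs(translated):
    """Split a translated frame at every stop codon ('*')."""
    return [segment for segment in translated.split('*') if segment]

def start_to_stop_orfs(translated):
    """Return stretches that begin at the first M, V or L after a stop and run to the next stop."""
    return [m.group(0) for m in re.finditer(r'[MVL][A-Z]*\*', translated)]

frame = "QQMKLV*AAVT*GG"
print(stop_to_stop_orfs(frame))   # ['QQMKLV', 'AAVT', 'GG']
print(start_to_stop_orfs(frame))  # ['MKLV*', 'VT*']
```

In the second form, residues before the first possible start (M, V or L) are dropped, and a trailing stretch with no stop codon is not reported.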

In [1]:
import re
import sys
import os
import json
import requests

### THE Pattern

This is the pattern that we're using to identify lasso proteins. TODO - Use Machine Learning to adjust the pattern to maximize the number of valid lasso proteins.

In [2]:
# raw string, so that \* reaches the regex engine as an escaped literal '*' (the stop codon)
PATTERN = r'M[A-Z]{15,45}T[A-Z][A-Z]{6,8}[DE][A-Z]{5,30}\*'
# PATTERN = 'CC.CGCCC...TGGC.'
# PATTERN = '.*'
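
To make the pattern concrete, here is a small check of what it accepts. Read left to right, it asks for a start M, 15 to 45 residues, a T, 7 to 9 further residues, a D or E, 5 to 30 residues, and finally a stop (`*`). The `candidate` string is one of the hits reported in this notebook's own output; `too_short` is a made-up counterexample.

```python
import re

PATTERN = r'M[A-Z]{15,45}T[A-Z][A-Z]{6,8}[DE][A-Z]{5,30}\*'

# A hit taken from this notebook's output, and a made-up sequence that is too short.
candidate = 'MVNSTRVPDRPVWKVIFSLSENRFTSAQATTSPSEPSGISSLTV*'
too_short = 'MKT*'

print(re.fullmatch(PATTERN, candidate) is not None)  # True
print(re.fullmatch(PATTERN, too_short) is not None)  # False
```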

### FASTA Function

Define a function that takes the relative path of a FASTA-formatted text file and returns a list of sequence objects. Each sequence object has a description field ["description"] and a sequence field ["sequence"].

From http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/example.htm, specification of a FASTA formatted file:
- The first line of each query protein input format must begin with a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier and description of the sequence, but both are optional.
- The sequence (in single-character code) begins in a different line and ends if another line starting with a ">" appears, which indicates the start of another query protein.
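
As an illustration of the two rules above, the following sketch parses a small FASTA-formatted string held in memory. `parse_fasta_text` is a hypothetical helper used only for this example; it is not the function defined in the next cell, which reads from a file.

```python
def parse_fasta_text(text):
    """Hypothetical helper: split FASTA-formatted text into description/sequence records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        if line.startswith('>'):
            # a '>' line starts a new record
            records.append({"description": line[1:].strip(), "sequence": ""})
        elif records:
            # any other line continues the sequence of the current record
            records[-1]["sequence"] += line
    return records

example = """>seq1 first example
ATGCATG
GCAGCA
>seq2 second example
TTGACC
"""
for record in parse_fasta_text(example):
    print(record["description"], "->", record["sequence"])
# seq1 first example -> ATGCATGGCAGCA
# seq2 second example -> TTGACC
```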

In [3]:
def readFASTA(name, cleanspace = 0):
    # cleanspace is unused; it is kept so that existing calls keep working
    sequenceList = []
    description = None
    tempSequences = []

    with open(name) as file:
        for line in file:
            if line.startswith('>'):
                # a new description closes the previous record, if there is one
                if description is not None:
                    sequenceList.append({
                        "description": description,
                        "sequence": ''.join(tempSequences).replace(' ', '')
                    })
                description = line[1:].replace('\n', '')
                tempSequences = []
            else:
                tempSequences.append(line.replace('\n', ''))

    # the last record is not followed by a '>' line, so close it here
    if description is not None:
        sequenceList.append({
            "description": description,
            "sequence": ''.join(tempSequences).replace(' ', '')
        })

    print("Read " + str(len(sequenceList)) + " objects from FASTA file " + name)

    return sequenceList
        


### Obtain Amino acid sequence directories

Begin by getting a list of all of the genomes available in the genomes folder alongside this script.

In [11]:
ALLDIRNAMES = []
for dirname in os.listdir("genomes"):
    ## if a regular file, just add to directory
    if (dirname.find(".") != -1):
        ALLDIRNAMES.append("genomes/" + dirname)
    else:
        for filename in os.listdir("genomes/" + dirname):
            ALLDIRNAMES.append("genomes/" + dirname + "/" + filename)
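
The loop above decides between files and folders by looking for a "." in the name. As an alternative sketch, `pathlib` can collect files by extension at any depth. The helper name `list_genome_files` and the temporary directory layout are assumptions made for this example only.

```python
import tempfile
from pathlib import Path

def list_genome_files(root, suffix=".fna"):
    """Hypothetical helper: return sorted paths under root that end with the given suffix."""
    return sorted(str(p) for p in Path(root).rglob("*" + suffix))

# Build a throwaway directory tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    genome_dir = Path(tmp) / "genomes" / "Example_genome"
    genome_dir.mkdir(parents=True)
    (genome_dir / "contig1.fna").write_text(">contig1\nATGC\n")
    (genome_dir / ".DS_Store").write_text("")
    found = list_genome_files(Path(tmp) / "genomes")
    print([Path(p).name for p in found])  # ['contig1.fna']
```

Filtering by suffix also keeps stray files such as `.DS_Store` out of the list.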

In [12]:
print(ALLDIRNAMES)

['genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/.DS_Store', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000001.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000002.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000003.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000004.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000005.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000006.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000007.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000008.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000009.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000010.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000011.fna', 'genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000012.f

### Clear Directory
Delete all translated (.faa) files so that the translation step can be rerun from scratch. **ONLY DO THIS IF YOU KNOW WHAT YOU'RE DOING**

In [6]:
for dirname in ALLDIRNAMES:
    if (dirname[len(dirname) - 3:] == "faa"):
        os.remove(dirname)

The .fna files need to be converted to amino acid sequences (.faa files). The original pipeline did this with [Emboss Transeq](https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/EMBOSS+transeq+Help+and+Documentation#EMBOSStranseqHelpandDocumentation-Reference); the cell below performs a six-frame translation with Biopython's translate function and saves the resulting amino acid sequences as a .faa file next to each .fna file.
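
For readers without Biopython installed, six-frame translation can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: `six_frame_translate` is a hypothetical name, the codon table is the standard genetic code written in the usual TCAG ordering, and unknown codons become 'X'.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frame_translate(dna):
    """Hypothetical helper: map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for strand, sign in ((dna, 1), (reverse, -1)):
        for offset in range(3):
            # step through complete codons only; a trailing partial codon is ignored
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames[sign * (offset + 1)] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames

print(six_frame_translate("ATGCATGGCAGCA"))
# {1: 'MHGS', 2: 'CMAA', 3: 'AWQ', -1: 'CCHA', -2: 'AAMH', -3: 'LPC'}
```

The negative frames are read from the reverse complement, starting at its first, second and third base respectively.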

In [13]:
from Bio.Seq import Seq, reverse_complement, translate

# takes a three letter codon, and returns the corresponding amino acid or 'X' if unknown
def translate_codon(codonString):
    if(not len(codonString) == 3):
        raise ValueError("expected a 3-letter codon, got " + repr(codonString))
        
    try:
        return translate(codonString, to_stop=False, table = 11)
    except Exception:
        # any codon Biopython cannot translate is reported as unknown
        return 'X'

## Translates one reading frame of a DNA string, starting at offset 0, 1 or 2
## and ignoring any incomplete codon at the end
def translate_frame(DNAseq, offset):
    seqLen = len(DNAseq) - ((len(DNAseq) - offset) % 3)
    try:
        return translate(DNAseq[offset:seqLen])
    except Exception:
        # fall back to codon-by-codon translation so that bad codons become 'X'
        codonArr = []
        for i in range(offset, seqLen, 3):
            codonArr.append(translate_codon(DNAseq[i:i + 3]))
        return ''.join(codonArr)

## An adapter function for Biopython's translate: takes in a DNA sequence and
## returns a list of six protein sequences, one per reading frame
def get_orfs(DNAseq):
    AAList = []

    # forward strand: frames 1, 2, 3
    for offset in range(3):
        AAList.append({
            "ORF": offset + 1,
            "sequence": translate_frame(DNAseq, offset)
        })

    # reverse strand: frames -1, -2, -3
    backwards_dna = reverse_complement(DNAseq)
    for offset in range(3):
        AAList.append({
            "ORF": -(offset + 1),
            "sequence": translate_frame(backwards_dna, offset)
        })

    return AAList

In [14]:
get_orfs("ATGCATGGCAGCA")

[{'ORF': 1, 'sequence': 'MHGS'},
 {'ORF': 2, 'sequence': 'CMAA'},
 {'ORF': 3, 'sequence': 'AWQ'},
 {'ORF': -1, 'sequence': 'CCHA'},
 {'ORF': -2, 'sequence': 'AAMH'},
 {'ORF': -3, 'sequence': 'LPC'}]

In [15]:
for dirname in ALLDIRNAMES:
    if((dirname[len(dirname) - 3:] == "fna") and not (dirname[:len(dirname) - 3] + "faa") in ALLDIRNAMES):
        print("Opening up " + dirname + " and converting into peptide sequences...")
        DNAseqs = []
        seqDescriptions = []
        for fastaobj in readFASTA(dirname):
            DNAseqs.append(fastaobj["sequence"])
            seqDescriptions.append(fastaobj["description"])
            
        entries = []
        for i in range(0, len(DNAseqs)):
            print("converting " + str(len(DNAseqs[i])) + " base pairs from " + seqDescriptions[i])
            aalist = get_orfs(DNAseqs[i])
            print("created " + str(len(aalist)) + " peptide sequences from " + seqDescriptions[i])
            for e in range(0, len(aalist)):
                entries.append({
                    "sequence": aalist[e]["sequence"],
                    "description": str(seqDescriptions[i] + " - ORF " + str(aalist[e]["ORF"])) 
                })
        
        print("writing read peptides into '" + dirname[len('genomes/'):len(dirname) - 3] + "faa'")
        with open(dirname[:len(dirname) - 3] + "faa", 'w') as outfile:
            for ent in entries:
                outfile.write("> " + ent["description"] + "\n")
                outfile.write(ent["sequence"] + "\n\n")

Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000001.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000001.fna
converting 676657 base pairs from gi|484226714|ref|NZ_AQWM01000001.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_0.1_C, whole genome shotgun sequence
created 6 peptide sequences from gi|484226714|ref|NZ_AQWM01000001.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_0.1_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000001.faa'
Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000002.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000002.fna
converting 451660 base pairs from gi|484226718|re

created 6 peptide sequences from gi|484226750|ref|NZ_AQWM01000012.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_11.12_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000012.faa'
Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000013.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000013.fna
converting 110774 base pairs from gi|484226753|ref|NZ_AQWM01000013.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_12.13_C, whole genome shotgun sequence
created 6 peptide sequences from gi|484226753|ref|NZ_AQWM01000013.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_12.13_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM

Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000024.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000024.fna
converting 57033 base pairs from gi|484226785|ref|NZ_AQWM01000024.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_23.24_C, whole genome shotgun sequence
created 6 peptide sequences from gi|484226785|ref|NZ_AQWM01000024.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_23.24_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000024.faa'
Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000025.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000025.fna
converting 52813 base pairs from gi|484226788|

created 6 peptide sequences from gi|484226824|ref|NZ_AQWM01000037.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_36.37_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000037.faa'
Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000038.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000038.fna
converting 33452 base pairs from gi|484226827|ref|NZ_AQWM01000038.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_37.38_C, whole genome shotgun sequence
created 6 peptide sequences from gi|484226827|ref|NZ_AQWM01000038.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_37.38_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM0

Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000052.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000052.fna
converting 20428 base pairs from gi|484226872|ref|NZ_AQWM01000052.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_51.52_C, whole genome shotgun sequence
created 6 peptide sequences from gi|484226872|ref|NZ_AQWM01000052.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_51.52_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000052.faa'
Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000053.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000053.fna
converting 19396 base pairs from gi|484226874|

created 6 peptide sequences from gi|484226905|ref|NZ_AQWM01000063.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_62.63_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000063.faa'
Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000064.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000064.fna
converting 11411 base pairs from gi|484226908|ref|NZ_AQWM01000064.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_63.64_C, whole genome shotgun sequence
created 6 peptide sequences from gi|484226908|ref|NZ_AQWM01000064.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_63.64_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM0

Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000089.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000089.fna
converting 1753 base pairs from gi|484226983|ref|NZ_AQWM01000089.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_88.89_C, whole genome shotgun sequence
created 6 peptide sequences from gi|484226983|ref|NZ_AQWM01000089.1| Asticcacaulis benevestitus DSM 16100 = ATCC BAA-896 strain DSM 16100 B060DRAFT_scaffold_88.89_C, whole genome shotgun sequence
writing read peptides into 'Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000089.faa'
Opening up genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000090.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000090.fna
converting 1753 base pairs from gi|484226986|re

created 6 peptide sequences from gi|315497051|ref|NC_014816.1| Asticcacaulis excentricus CB 48 chromosome 1, complete sequence
writing read peptides into 'Asticcacaulis_excentricus_CB_48_GCA_000175215_2/NC_014816.1.faa'
Opening up genomes/Asticcacaulis_excentricus_CB_48_GCA_000175215_2/NC_014817.1.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_excentricus_CB_48_GCA_000175215_2/NC_014817.1.fna
converting 1315949 base pairs from gi|315499382|ref|NC_014817.1| Asticcacaulis excentricus CB 48 chromosome 2, complete sequence
created 6 peptide sequences from gi|315499382|ref|NC_014817.1| Asticcacaulis excentricus CB 48 chromosome 2, complete sequence
writing read peptides into 'Asticcacaulis_excentricus_CB_48_GCA_000175215_2/NC_014817.1.faa'
Opening up genomes/Asticcacaulis_excentricus_CB_48_GCA_000175215_2/NC_014818.1.fna and converting into peptide sequences...
Read 1 objects from FASTA file genomes/Asticcacaulis_excentricus_CB_48_GCA_

created 6 peptide sequences from gi|471265493|ref|NC_020817.1| Xanthomonas citri subsp. citri Aw12879 plasmid pXcaw58, complete genome
writing read peptides into 'Xanthomonas_citri_citri_Aw12879_GCA_000349225_1/NC_020817.1.faa'


Now we collect the paths of all translated files (those ending in ".faa") into a list called DIRNAMES.

In [10]:
DIRNAMES = []
for dirname in os.listdir("genomes"):
    if (dirname.find(".") != -1):
        if(dirname[len(dirname) - 3:] == "faa"):
            DIRNAMES.append("genomes/" + dirname)
    else:
        for filename in os.listdir("genomes/" + dirname):
            if(filename[len(filename) - 3:] == "faa"):
                DIRNAMES.append("genomes/" + dirname + "/" + filename)
    
        
print(DIRNAMES)

[]


### Pattern Matching
Uses the Python regular expression library to find every stretch of a translated sequence that matches the pattern. The function takes an overall amino acid sequence, the pattern, and a description, and returns a list of matches. Each match records the matched sequence, the pattern used, the position of the match, the length of the overall sequence, and the description and genome it came from.

In [13]:
def patternMatch(overallSequence, pattern, description):
    matchedProteinList = []
    
    # find all (non-overlapping) matches of the pattern in the sequence
    for match in re.finditer(pattern, overallSequence):
        # get the correct range based on span
        indices = list(match.span())
        # the reading frame is the last two characters of the description, e.g. " 2" or "-3"
        ORF = int(description[len(description) - 2:])
        if ORF == 2:
            indices[0] += 1
            indices[1] += 1
        elif ORF == 3:
            indices[0] += 1
            indices[1] += 1
        elif ORF == -1:
            indices[0] = len(overallSequence) - indices[0]
            indices[1] = len(overallSequence) - indices[1]
        elif ORF == -2:
            indices[0] = len(overallSequence) - indices[0] - 1
            indices[1] = len(overallSequence) - indices[1] - 1
        elif ORF == -3:
            indices[0] = len(overallSequence) - indices[0] - 2
            indices[1] = len(overallSequence) - indices[1] - 2
        matchedProteinList.append({
            "description": description,
            "sequence": match.group(0),
            "searchPattern": match.re.pattern,
            "searchRange": indices,
            "overallLength": len(overallSequence),
            "genome": description[:description.index("/")]
            ## "overallString": match.string
        })
    return matchedProteinList
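
One property of `re.finditer` worth keeping in mind is that it reports non-overlapping matches only, so a candidate that starts at a later M inside an already matched stretch is not listed separately. The sketch below shows the difference using a lookahead group. The `frame` string is a hit from this notebook's output with a hypothetical upstream `MAA` added for the demonstration.

```python
import re

PATTERN = r'M[A-Z]{15,45}T[A-Z][A-Z]{6,8}[DE][A-Z]{5,30}\*'

# A known hit, extended with a hypothetical upstream start so two start positions exist.
frame = 'MAA' + 'MVNSTRVPDRPVWKVIFSLSENRFTSAQATTSPSEPSGISSLTV*'

# Plain finditer: once a match is consumed, starts inside it are skipped.
plain = [m.span() for m in re.finditer(PATTERN, frame)]

# Lookahead with a capture group: every start position is tried independently.
overlapping = [m.span(1) for m in re.finditer('(?=(' + PATTERN + '))', frame)]

print(plain)        # [(0, 48)]
print(overlapping)  # [(0, 48), (3, 48)]
```

Whether the nested candidates are wanted depends on the downstream analysis; the lookahead form simply makes them visible.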

In [14]:
matchedProteins = []
for filename in DIRNAMES:
    readSequences = readFASTA(filename)
    for seq in readSequences:
        matchedProteins.extend(patternMatch(seq["sequence"], PATTERN, filename[8:len(filename) - 4] + " - " + seq["description"]))
    

print("Found " + str(len(matchedProteins)) + " that satisfy the pattern: " + PATTERN)
# for match in matchedProteins:
#     print(match["sequence"] + ", found in range " + str(match["searchRange"]))
#     print("description: " + match["description"])



Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000001.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000002.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000003.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000004.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000005.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000006.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000007.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000008.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_benevestitus_DSM_16100_uid199072/NZ_AQWM01000009.faa
Read 6 objects from FASTA file genomes/Asticca

Read 6 objects from FASTA file genomes/Asticcacaulis_excentricus_CB_48_GCA_000175215_2/NC_014817.1.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_excentricus_CB_48_GCA_000175215_2/NC_014818.1.faa
Read 6 objects from FASTA file genomes/Asticcacaulis_excentricus_CB_48_GCA_000175215_2/NC_014819.1.faa
Read 6 objects from FASTA file genomes/Frankia_CcI3_GCA_000013345_1/NC_007777.1.faa
Read 6 objects from FASTA file genomes/Streptococcus_suis_ST3_GCA_000204625_1/NC_015433.1.faa
Read 6 objects from FASTA file genomes/Streptomyces_albus_GCA_000827005_1/NZ_CP010519.1.faa
Read 6 objects from FASTA file genomes/Streptomyces_albus_GCA_001577385_1/NZ_CP014485.1.faa
Read 6 objects from FASTA file genomes/Streptomyces_albus_J1074_GCA_000359525_1/NC_020990.1.faa
Read 6 objects from FASTA file genomes/Streptomyces_leeuwenhoekii_GCA_001013905_1/NZ_LN831788.1.faa
Read 6 objects from FASTA file genomes/Streptomyces_leeuwenhoekii_GCA_001013905_1/NZ_LN831789.1.faa
Read 6 objects from FAS

In [18]:
matchedProteins[20000]

{'description': 'Streptomyces_lividans_TK24_GCA_000739105_1\uf028/NZ_CP009124.1 -  gi|749298793|ref|NZ_CP009124.1| Streptomyces lividans TK24, complete genome - ORF 2',
 'sequence': 'MVNSTRVPDRPVWKVIFSLSENRFTSAQATTSPSEPSGISSLTV*',
 'searchPattern': 'M[A-Z]{15,45}T[A-Z][A-Z]{6,8}[DE][A-Z]{5,30}\\*',
 'searchRange': [2440332, 2440377],
 'overallLength': 2781760,
 'genome': 'Streptomyces_lividans_TK24_GCA_000739105_1\uf028'}

In [16]:
with open('matches.json', 'w') as outfile:
    json.dump(matchedProteins, outfile)

In [17]:
with open('matches.json', 'r') as storedfile:
    lassopeptides = json.load(storedfile)
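
Once the matches are stored as JSON, they can be summarised without rerunning the search. The sketch below counts hits per genome. The `records` list is made-up data that only mimics the keys of the entries in matches.json.

```python
import json
from collections import Counter

# Hypothetical records with the same keys as the entries written to matches.json.
records = [
    {"genome": "Genome_A", "sequence": "MAAA*", "searchRange": [10, 15]},
    {"genome": "Genome_A", "sequence": "MCCC*", "searchRange": [40, 45]},
    {"genome": "Genome_B", "sequence": "MDDD*", "searchRange": [7, 12]},
]

# Round-trip through JSON text, as the cells above do through a file.
restored = json.loads(json.dumps(records))

hits_per_genome = Counter(record["genome"] for record in restored)
print(hits_per_genome.most_common())  # [('Genome_A', 2), ('Genome_B', 1)]
```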
