# Genome Mining Script
##### Original Author: Chun Yin Larry So | Python Interpretation: Nathan Alam
[Github repository](https://github.com/nathanalam/genomemining)

This notebook provides scripts that scan bacterial genomes for sequences matching a regular expression, where a match marks a candidate sequence coding for a lasso peptide.

## Links to relevant publications:
- [Genome mining for lasso peptides: past, present, and future](https://link.springer.com/article/10.1007/s10295-019-02197-z)
- [Prospecting Genomes for Lasso Peptides](https://www.ncbi.nlm.nih.gov/pubmed/24142336)

### Comments from Perl Script
- In this version, each time MAST needs to be run on a putative maturation enzyme in a particular genome, the database is first checked to see whether this protein has already been analyzed by MAST; if it has, the earlier values are reused
- In this version transeq will be used instead of getorf for finding precursors
- getorf is still used for finding neighbors
- Like v4 except that multiple proteins of the same sequence don't cause wrong locations of maturation enzymes to be reported
- Another change is that the pattern needs to be adjusted to [MVL] instead of ^ in the beginning
- This version fixes the problem of having a useless %AME hash and also of erasing sequences from %AME_scores
- Precursor pattern is output into the log file
- ORF searching behavior is changed from stop-to-stop to [MVL]-to-stop
- Clusters of precursors are saved in clusters.txt
- **Warning**: transeq doesn't label the -1, -2, -3 frames sequentially. Sometimes it is -1, -3, -2, sometimes some other combination. This means the precursor locations on the reverse strand are off by one sometimes
- Take note that rank_hits expects 4 motifs for the B enzyme and 3 motifs for the C enzyme. Adjust accordingly.
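
The [MVL]-to-stop ORF search mentioned in the comments can be illustrated with a short regular expression. This is only a sketch and is not taken from the Perl script: the helper `find_orfs` and the constant `ORF_PATTERN` are hypothetical names, the input frame is made up, and it assumes that stop codons appear as `*` in the translated sequence.

```python
import re

# Hypothetical pattern: an ORF starts at M, V, or L and runs to the next
# stop codon, assumed here to be written as '*' in the translated frame.
ORF_PATTERN = re.compile(r'[MVL][^*]*\*')

def find_orfs(translated_frame):
    """Return (start, end, peptide) for each non-overlapping [MVL]-to-stop ORF."""
    orfs = []
    for m in ORF_PATTERN.finditer(translated_frame):
        # m.group() includes the trailing '*', so drop it from the peptide
        orfs.append((m.start(), m.end(), m.group()[:-1]))
    return orfs

frame = "AAMKT*GGVLP*QQ"  # made-up translated frame, for illustration only
for start, end, peptide in find_orfs(frame):
    print(start, end, peptide)
```

Because the matches do not overlap, only the first start residue after each stop is reported; alternative start residues inside an ORF (such as the L in `VLP` above) would need a separate pass.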

In [1]:
import re
import sys

### THE Pattern

This is the pattern used to identify candidate lasso peptides. TODO: use machine learning to tune the pattern so that it matches as many valid lasso peptides as possible.

In [2]:
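# Candidate lasso precursor pattern (currently disabled):
# PATTERN = '^M.{15,45}T..{6,8}[DE].{5,30}$'
# Placeholder pattern that matches every sequence
PATTERN = '.*'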
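
To make the commented-out pattern more concrete, the sketch below applies it to two synthetic sequences. The sequences are invented for illustration and are not real lasso peptides; `CANDIDATE_PATTERN`, `matching`, and `too_short` are names introduced only for this example. Read left to right, the pattern requires a leading M, 15 to 45 residues, a T, 7 to 9 further residues, a D or E, and then 5 to 30 residues up to the end of the sequence.

```python
import re

# The commented-out precursor pattern, copied from the cell above
CANDIDATE_PATTERN = r'^M.{15,45}T..{6,8}[DE].{5,30}$'

# Synthetic sequences for illustration only (not real lasso peptides)
matching = "M" + "A" * 20 + "T" + "G" * 8 + "D" + "K" * 10
too_short = "MAAAT"

print(bool(re.search(CANDIDATE_PATTERN, matching)))   # True
print(bool(re.search(CANDIDATE_PATTERN, too_short)))  # False
```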

### Other defined parameters

In [3]:
# Relative path of the FASTA file to read (a single file, despite the name)
DIRNAME = "test.txt"

### FASTA Function

Define a function that takes the relative path of a FASTA-formatted text file as input and returns a list of protein objects. Each protein object has a description field ["description"] and a sequence field ["sequence"].

From http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/example.htm, specification of a FASTA formatted file:
- The first line of each query protein input format must begin with a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier and description of the sequence, but both are optional.
- The sequence (in single-character code) begins in a different line and ends if another line starting with a ">" appears, which indicates the start of another query protein.
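
As a small illustration of the header rule, the sketch below splits a header line into its identifier and description. The helper `split_header` is a hypothetical name introduced here; the function in this notebook does not perform this split and keeps the whole header text as the description.

```python
def split_header(header_line):
    """Split a FASTA header line into (identifier, description)."""
    text = header_line[1:].strip()   # drop the leading '>' and outer whitespace
    parts = text.split(None, 1)      # first word, then the rest of the line
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description

print(split_header(">HSBGPG Human gene for bone gla protein (BGP)"))
```

Both parts come back as empty strings when the header contains only the `>` symbol, which the specification allows.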

In [4]:
def readFASTA(name, cleanspace = 0):
    # NOTE: cleanspace is unused; it is kept so that existing calls still work
    proteinList = []
    description = None    # header of the record currently being read
    tempSequences = []    # sequence lines collected for that record

    with open(name) as file:
        for line in file:
            if line.startswith('>'):
                # a new header closes the previous record, if there is one
                if description is not None:
                    proteinList.append({
                        "description": description,
                        # join the lines and strip all whitespace
                        "sequence": ''.join(''.join(tempSequences).split())
                    })
                description = line[1:].rstrip('\r\n')
                tempSequences = []
            elif description is not None:
                # sequence line; text before the first header is ignored
                tempSequences.append(line)

    # close the last record (skipped when the file contains no header,
    # so an empty file yields an empty list instead of an IndexError)
    if description is not None:
        proteinList.append({
            "description": description,
            "sequence": ''.join(''.join(tempSequences).split())
        })

    print("Read " + str(len(proteinList)) + " objects from FASTA file " + name)

    return proteinList
        


### Pattern Matching
Uses Python's regular expression library to determine which proteins match the pattern. This function takes a list of proteins (objects that have a ["description"] and a ["sequence"]) and checks each ["sequence"] against the pattern regular expression. Matched proteins are returned with an additional field ["matched_sequences"], which is a list of the substrings that satisfy the pattern.

In [5]:
def patternMatch(proteinList):
    matchedProteinList = []
    for protein in proteinList:
        # find all matches in protein that match
        matches = re.findall(PATTERN, protein['sequence'])
        if(len(matches) > 0):
            matchedProteinList.append({
                "description": protein["description"],
                "sequence": protein["sequence"],
                "matched_sequences": matches
            })
    return matchedProteinList
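
One detail of `re.findall` is worth seeing in isolation: with the placeholder pattern `'.*'` it also reports an empty match at the end of the string, so every protein passes the `len(matches) > 0` check, including one with an empty sequence. The sketch below uses a made-up sequence; filtering out empty matches with `re.finditer` is a suggestion for illustration, not part of the original script.

```python
import re

sequence = "MKT"  # made-up sequence, for illustration only

# '.*' first matches the whole string, then matches the empty string at the end
print(re.findall('.*', sequence))  # ['MKT', '']

# finditer exposes match positions; empty matches are filtered out here
hits = [(m.start(), m.end(), m.group())
        for m in re.finditer('.*', sequence)
        if m.group()]
print(hits)  # [(0, 3, 'MKT')]
```

Note also that `re.findall` returns the captured groups rather than the full match once a pattern contains parentheses, which matters if the precursor pattern is later extended with groups.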

In [6]:
matchedProteins = patternMatch(readFASTA(DIRNAME))

print("Found " + str(len(matchedProteins)) + " that satisfy the pattern:")
for match in matchedProteins:
    print(match['description'])

Read 6 objects from FASTA file test.txt
Found 6 that satisfy the pattern:
brotein: gives you protein
reedtein: makes u smarticle
query protein 1; example of multiple subcellular locations
query protein 2; example of single subcellular location
HSBGPG Human gene for bone gla protein (BGP)
HSGLTH1 Human theta 1-globin gene
