Add one more class called FileSequenceReader. The purpose of this class is to control how FASTA files (and potentially files in other formats) are read and converted into our SequenceRecord objects. A very large portion of this class will be the function you originally wrote back in OOP Lab 1, which read in the FASTA file and made your collection of SequenceRecord objects.
 

Before you go copying and pasting that code into a method in your new class though, consider the following:

- Should that method be an instance, class, or static method?
- Does it make sense to create FileSequenceReader objects? If so, will it need instance variables?
- Does this class need class variables? What would those be?
- Can you think of any class or static method that would be helpful? Does it make sense for any of your other methods to be a part of this class instead of where you originally wrote them?
 

Remember, there are no correct answers to these questions. You do, however, need to think them through and include your justification for how you've chosen to structure your classes in a Markdown cell. At minimum, this justification should include answers to each of these questions, but I highly recommend going into more detail.
 

Finally, write a Markdown cell that explains how to use your classes. Put yourself in the shoes of someone who is coming in with a FASTA file, and wants to use your code to read and eventually analyze it. How does this person use what you've written to go from that starting point to usable objects? What is produced at each step? What attributes do your objects have? What are the things each object type can do, how do you use those methods, and what do they return if anything? Be comprehensive and carefully consider your formatting and presentation. Make good use of those options in Markdown cells.

In [56]:
#Sequence class
from functools import total_ordering

@total_ordering
class Sequence:

    def __init__(self,seq):
        valid_char = "MFLCYWPHQRITNKSVADEG*"
        for i,char in enumerate(seq,start=1): #positions are reported starting from 1
            if char not in valid_char:
                raise ValueError(f'Error at position {i}')
        self.seq = seq
       
    def __str__(self):
        return self.seq

    def __repr__(self):
        return f'Sequence: {self.seq}'
    
    def __eq__(self,other): #this will only allow objects of the same class to be equal
        if not isinstance(other,Sequence):
            return NotImplemented #Python falls back to its default, so == gives False
        return type(self) is type(other) and self.seq == other.seq
        
    def __add__(self,other): #this will only allow objects of the same class to be added to each other
        if not isinstance(other,Sequence) or type(self) is not type(other):
            return NotImplemented #Python then raises TypeError for mismatched classes
        return type(self)(self.seq + other.seq) #result has the same class as the operands
        
    def __lt__(self,other): #sequences are ordered by length
        if not isinstance(other,Sequence):
            return NotImplemented
        return len(self.seq) < len(other.seq)
    
    def __len__(self):
        return len(self.seq)
    

In [57]:
#DNA sequence class

@total_ordering
class DNAseq(Sequence):
    
    def __init__(self,seq):
        valid_char = "ATCG"
        for i,base in enumerate(seq,start=1): #positions are reported starting from 1
            if base not in valid_char:
                raise ValueError(f'Invalid base at position {i}')
        super().__init__(seq)
        
    def __repr__(self):
        return f'DNA Sequence: {self.seq}'
    
    def translate_to_protein(self): #translates DNA to Amino Acids. Utilized last semester notes as an aid
        aa_dict = {'M':['ATG'], 'F':['TTT', 'TTC'], 'L':['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'C':['TGT', 'TGC'], 'Y':['TAC', 'TAT'], 'W':['TGG'], 'P':['CCT', 'CCC', 'CCA', 'CCG'], 'H':['CAT', 'CAC'], 'Q':['CAA', 'CAG'], 'R':['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'I':['ATT', 'ATC', 'ATA'], 'T':['ACT', 'ACC', 'ACA', 'ACG'], 'N':['AAT', 'AAC'], 'K':['AAA', 'AAG'], 'S':['AGT', 'AGC', 'TCT', 'TCC', 'TCA', 'TCG'], 'V':['GTT', 'GTC', 'GTA', 'GTG'], 'A':['GCT', 'GCC', 'GCA', 'GCG'], 'D':['GAT', 'GAC'], 'E':['GAA', 'GAG'], 'G':['GGT', 'GGC', 'GGA', 'GGG'], '*':['TAA','TAG','TGA']}
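        #invert the table once so each codon maps straight to its amino acid
        codon_to_aa = {codon:aa for aa,codons in aa_dict.items() for codon in codons}
        protein_seq = ""
        for i in range(0,len(self.seq),3):
            codon = self.seq[i:i+3]
            protein_seq += codon_to_aa.get(codon,"") #an incomplete trailing codon is skipped
                    
        return ProteinSeq(protein_seq)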
    
    def __GC_content__(self): #produces the proportion of the sequence that is G and C bases
        if not self.seq: #avoid dividing by zero on an empty sequence
            return 0.0
        GC_count = self.seq.count("G") + self.seq.count("C")
        return GC_count/len(self.seq)

In [58]:
#Protein Sequence Class

@total_ordering
class ProteinSeq(Sequence):
    
    def __init__(self,seq):
        valid_char = "MFLCYWPHQRITNKSVADEG*"
        for i,aa in enumerate(seq,start=1): #positions are reported starting from 1
            if aa not in valid_char:
                raise ValueError(f'Invalid Amino Acid at position {i}')
        super().__init__(seq)
        
    def __repr__(self):
        return f'Protein Sequence: {self.seq}'
    
    def __highest_AA_content__(self): #returns a list of the amino acid(s) found most in the sequence
        if not self.seq: #max() would fail on an empty sequence
            return []
        aa_counter = {}
        for aa in self.seq:
            aa_counter[aa] = aa_counter.get(aa,0) + 1
        highest_value = max(aa_counter.values())
        return [aa for aa,value in aa_counter.items() if value == highest_value] #ties are all returned

In [59]:
# SequenceRecord class 
from functools import total_ordering

@total_ordering
class SequenceRecord:

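    def __init__(self,title,*args):
        self.title = title
        self.sequence = None
        for seq in args:
            if isinstance(seq,Sequence): #keep Sequence, DNAseq and ProteinSeq objects as they are
                self.sequence = seq
            elif isinstance(seq,str): #plain strings become generic Sequence objects
                self.sequence = Sequence(seq)
        if self.sequence is None: #fail early instead of leaving the attribute unset
            raise TypeError("SequenceRecord needs a Sequence object or a sequence string")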
  
    def __str__(self):
        return str(self.title)+ "\n" + str(self.sequence)

    def __repr__(self):
        return f'{self.title}: {self.sequence}'
    
    def __eq__(self,other): #records are equal when their sequences are equal
        if not isinstance(other,SequenceRecord):
            return NotImplemented
        return self.sequence == other.sequence
    
    def __lt__(self,other):
        return len(self.sequence) < len(other.sequence)
    
    def __gt__(self,other):
        return len(self.sequence) > len(other.sequence)    

In [116]:
#FileSequenceReader Class

class FileSequenceReader:
    
    def __init__(self,file):
        self.file = file
              
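    def __Fasta__(self): #generator that yields one SequenceRecord per FASTA entry
        name = None
        seq = ""
        with open(str(self.file)) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if name: #a new header means the previous record is complete
                        yield SequenceRecord(name,seq)
                    name = line #the header keeps its leading ">"
                    seq = ""
                else:
                    seq += line #sequences split over several lines are joined
            if name: #yield the last record in the file
                yield SequenceRecord(name,seq)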
   

In [120]:
try1 = FileSequenceReader("OOPlab1.fasta")
print(try1)
print(try1.__Fasta__())
for record in try1.__Fasta__():
    print(record)
for record in try1.__Fasta__():
    print(type(record))
for record in try1.__Fasta__():
    print(type(record.sequence))

<__main__.FileSequenceReader object at 0x7fe6508db190>
<generator object FileSequenceReader.__Fasta__ at 0x7fe6100c1040>
>NC_000913.3:1368079-1368747 Escherichia coli str. K-12 substr. MG1655, complete genome
ATGGGTATTTTTTCTCGCTTTGCCGACATCGTGAATGCCAACATCAACGCTCTGTTAGAGAAAGCGGAAGATCCACAGAAACTGGTTCGTCTGATGATCCAGGAGATGGAAGATACACTGGTTGAAGTACGTTCTACTTCGGCGCGCGCGTTGGCAGAAAAGAAACAGCTGACTCGCCGTATTGAACAAGCGTCGGCGCGTGAGGTTGAATGGCAGGAAAAAGCCGAACTGGCGCTGCTGAAAGAGAGAGAGGATCTGGCACGTGCAGCGTTAATTGAAAAACAGAAACTGACCGATCTGATTAAGTCCCTGGAACATGAAGTGACGCTGGTGGACGATACGCTGGCACGCATGAAGAAAGAGATTGGCGAGCTGGAAAACAAATTGAGCGAAACACGCGCTCGCCAGCAGGCATTGATGTTACGTCATCAGGCGGCAAACTCGTCGCGCGATGTGCGTCGTCAGCTGGACAGTGGCAAACTGGATGAAGCAATGGCTCGTTTCGAATCTTTCGAACGTCGTATTGACCAGATGGAAGCGGAAGCAGAAAGCCACAGCTTCGGTAAACAAAAATCGCTGGACGATCAGTTTGCCGAACTGAAAGCCGATGATGCAATCAGCGAACAACTGGCACAATTAAAAGCCAAAATGAAGCAAGACAATCAATAA
>NC_000913.3:167484-169727 Escherichia coli str. K-12 substr. MG1655, complete genome
ATGGCGCGTTCCAAAACTGCTCAGCCAAAACACTC

Questions:

- Should that method be an instance, class, or static method?
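As written, this method is an instance method: it reads the file name from self.file, which is set in __init__. Neither a class method nor a static method can access instance variables, so either one would need the file name passed in as an argument instead.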
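
A minimal sketch of the three kinds of method, using a hypothetical `Reader` class and a made-up `supported_extensions` class variable (neither is part of the code in this notebook):

```python
class Reader:
    """Hypothetical stand-in for FileSequenceReader, for illustration only."""

    supported_extensions = (".fasta", ".fa")  # class variable shared by all readers

    def __init__(self, file):
        self.file = file  # instance variable, different for each reader

    def describe(self):
        # Instance method: receives self, so it can read self.file.
        return f"reader for {self.file}"

    @classmethod
    def extensions(cls):
        # Class method: receives the class, so it sees class variables only.
        return cls.supported_extensions

    @staticmethod
    def is_header(line):
        # Static method: receives neither self nor cls, only its arguments.
        return line.startswith(">")


reader = Reader("example.fasta")
print(reader.describe())        # reader for example.fasta
print(Reader.extensions())      # ('.fasta', '.fa')
print(Reader.is_header(">id"))  # True
```

Only `describe` needs an object to exist first; the other two can be called on the class itself.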

- Does it make sense to create FileSequenceReader objects? If so, will it need instance variables?
It could make sense to make reader objects, and in that case the class needs instance variables (here, the file name stored in self.file). In the future, though, it might make more sense to have a generic line-by-line file reader as the parent class, with a child class for each type of file. The current FASTA generator would then move into one of those child classes. That is how I would extend this.
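
A rough sketch of that parent/child layout, under some assumptions: the names `LineReader`, `FastaReader`, `lines` and `records` are hypothetical, and the child yields plain (header, sequence) tuples instead of SequenceRecord objects so the example runs on its own.

```python
import os
import tempfile


class LineReader:
    """Hypothetical parent class: yields stripped lines from any text file."""

    def __init__(self, file):
        self.file = file

    def lines(self):
        with open(self.file) as fh:
            for line in fh:
                yield line.rstrip()


class FastaReader(LineReader):
    """Hypothetical child class: groups lines into (header, sequence) pairs."""

    def records(self):
        name, chunks = None, []
        for line in self.lines():
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line, []
            elif name is not None:
                chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)


# Write a tiny FASTA file so the sketch can run without any other input.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nATG\nCCC\n>seq2\nGGG\n")

parsed = list(FastaReader(tmp.name).records())
os.remove(tmp.name)
print(parsed)  # [('>seq1', 'ATGCCC'), ('>seq2', 'GGG')]
```

A reader for another format would be a second child class that reuses `lines()` and supplies its own grouping logic.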

- Does this class need class variables? What would those be?
This class did not need class variables. One could be added to describe the parser generator, but I did not have time to add it.

- Can you think of any class or static method that would be helpful? Does it make sense for any of your other methods to be a part of this class instead of where you originally wrote them?
Another helpful method would be one that reads the whole file without the user having to loop over a generator. I could not fully figure this out: a second method that used print() instead of yield partly worked, so I think something along those lines would be beneficial. It would also make sense to check for valid characters here, but that already happens when the sequence is passed to the other classes, so I don't think it is necessary.
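
One possible way to get everything at once is to wrap the generator in `list()`. A minimal sketch, assuming hypothetical `parse_fasta` and `read_all` functions that work on a list of lines and return (header, sequence) tuples rather than SequenceRecord objects:

```python
def parse_fasta(lines):
    """Hypothetical generator: yields (header, sequence) pairs from FASTA lines."""
    name, seq = None, ""
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, seq
            name, seq = line, ""
        else:
            seq += line
    if name is not None:
        yield name, seq


def read_all(lines):
    """Hypothetical helper: runs the generator to the end and returns a list."""
    return list(parse_fasta(lines))


example = [">seq1\n", "ATG\n", "CCC\n", ">seq2\n", "GGG\n"]
records = read_all(example)
print(len(records))  # 2
print(records[0])    # ('>seq1', 'ATGCCC')
```

The trade-off is memory: the generator holds one record at a time, while the list holds the whole file.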

How To Use My Code:

1. Sequence Class
To use the Sequence class you need a string (written in quotes, "") containing only the following characters: MFLCYWPHQRITNKSVADEG*. Any other character raises a ValueError that reports its position. To create a Sequence object, call Sequence(yourString).
There are some commands you can use if you set yourVariable = Sequence(yourString):
type(yourVariable) #should return Sequence type object
yourVariable.__str__() #returns sequence as a string
yourVariable.__repr__() #returns sequence with a header of "Sequence: "
yourVariable.__len__() #returns length of the sequence
yourVariable.__add__(yourVariable2) #adds two sequences to each other
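
The double-underscore methods listed here are normally triggered through Python's built-ins and operators rather than called by name. A minimal sketch, using a hypothetical stripped-down `MiniSequence` stand-in (no validation) so that it runs on its own:

```python
class MiniSequence:
    """Hypothetical, stripped-down stand-in for the Sequence class."""

    def __init__(self, seq):
        self.seq = seq

    def __str__(self):
        return self.seq

    def __repr__(self):
        return f'Sequence: {self.seq}'

    def __len__(self):
        return len(self.seq)

    def __add__(self, other):
        return MiniSequence(self.seq + other.seq)


first = MiniSequence("MFL")
second = MiniSequence("CYW")

print(str(first))      # MFL            (calls first.__str__())
print(repr(first))     # Sequence: MFL  (calls first.__repr__())
print(len(first))      # 3              (calls first.__len__())
print(first + second)  # MFLCYW         (calls first.__add__(second))
```

The same built-ins should work on Sequence, DNAseq and ProteinSeq objects, since they define the same methods.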

2. DNA Sequence Class
To use the DNA Sequence class you need a string (written in quotes, "") containing only the following characters: ATCG. Any other character raises a ValueError that reports its position. To create a DNA sequence object, call DNAseq(yourString). There are some commands you can use if you set yourVariable = DNAseq(yourString):
type(yourVariable) #should return DNAseq type object
yourVariable.__str__() #returns sequence as a string
yourVariable.__repr__() #returns sequence with a header of "DNA Sequence: "
yourVariable.__len__() #returns length of the sequence
yourVariable.translate_to_protein() #returns the corresponding protein sequence as a ProteinSeq object
yourVariable.__GC_content__() #returns GC content of your sequence
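
A small worked example of what the last two commands compute, written with plain strings so it runs on its own. The three-entry codon table is a deliberately partial assumption that only covers this example sequence; the class itself uses the full table.

```python
# Partial codon table, just enough for this example (an assumption for brevity).
codon_to_aa = {"ATG": "M", "GCC": "A", "TAA": "*"}

dna = "ATGGCCTAA"

# GC content: the share of bases that are G or C (here 4 out of 9).
gc = (dna.count("G") + dna.count("C")) / len(dna)

# Translation: read the sequence three bases at a time.
protein = "".join(codon_to_aa[dna[i:i + 3]] for i in range(0, len(dna), 3))

print(round(gc, 3))  # 0.444
print(protein)       # MA*
```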

3. Protein Seq Class
To use the Protein Sequence class you need a string (written in quotes, "") containing only the following characters: MFLCYWPHQRITNKSVADEG*. Any other character raises a ValueError that reports its position. To create a protein sequence object, call ProteinSeq(yourString). There are some commands you can use if you set yourVariable = ProteinSeq(yourString):
type(yourVariable) #should return ProteinSeq type object
yourVariable.__str__() #returns sequence as a string
yourVariable.__repr__() #returns sequence with a header of "Protein Sequence: "
yourVariable.__len__() #returns length of the sequence
yourVariable.__highest_AA_content__() #returns which amino acids are in the sequence the most
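
To show what the last command returns when two amino acids tie, here is a short sketch with a plain string. It uses `collections.Counter` from the standard library, which is a swapped-in technique and not how the class itself counts.

```python
from collections import Counter

protein = "MKKLLA"

# Count every amino acid, then keep all of those that reach the top count.
counts = Counter(protein)
top = max(counts.values())
most_common = [aa for aa, n in counts.items() if n == top]

print(most_common)  # ['K', 'L']
```

Because K and L both appear twice, both come back in the list.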

4. Sequence Record Class
To use the Sequence Record class you need two strings (written in quotes, ""), one with the header and the other with the corresponding sequence. The sequence string can only contain the following characters: MFLCYWPHQRITNKSVADEG*. Instead of a string, the sequence can also be an existing Sequence, DNAseq, or ProteinSeq object; a plain string is stored as a generic Sequence object. To create a record, call SequenceRecord(header,sequence). There are some commands you can use if you set yourVariable = SequenceRecord(header,sequence):
type(yourVariable) #should return SequenceRecord type object
yourVariable.__str__() #returns header then the corresponding sequence on a new line
yourVariable.__repr__() #returns header: sequence

5. File Sequence Reader Class
To use the File Sequence Reader class you need a FASTA file: each record is a header line beginning with '>' followed by one or more lines of sequence, and this pattern repeats for the whole file. To create a reader, call FileSequenceReader("yourfilename.fasta"). There are some commands you can use if you set yourVariable = FileSequenceReader("yourfilename.fasta"):
type(yourVariable) #should return FileSequenceReader type object
for record in yourVariable.__Fasta__():
    print(record) #prints each header, then its sequence on one line with newline characters removed
for record in yourVariable.__Fasta__():
    print(type(record)) #should print the SequenceRecord class for each iteration
for record in yourVariable.__Fasta__():
    print(type(record.sequence)) #should print the Sequence class for each iteration