Name: Jonathan Kim

Email: jkim185@uncc.edu


## Part 1 - Sequence Class

Write a Sequence class. In the __init__ method, you should initialize one attribute: a string that represents a DNA sequence.

This class should also have the following magic methods we discussed in class yesterday:

- __repr__ and __str__
- __eq__ and __lt__ (then use the decorator I demonstrated)

It is up to you to decide how these should be implemented. For instance, what criteria make the most sense for saying two sequences are equal to one another? What criteria should make one sequence "less than" another?

In [1]:
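#sequence class goes here
from functools import total_ordering

@total_ordering
class Sequence:
    def __init__(self, seq):
        self.seq = seq

    # Informal report: just the bases
    def __str__(self):
        return self.seq

    # Formal representation
    def __repr__(self):
        return f'The DNA Sequence is {self.seq}.'

    # Equal only if both strings are EXACTLY the same
    def __eq__(self, other):
        if not isinstance(other, Sequence):
            return NotImplemented
        return self.seq == other.seq

    # Shorter sequences sort first; ties on length fall back to the string
    # itself, so __lt__ stays consistent with __eq__ under total_ordering
    def __lt__(self, other):
        if not isinstance(other, Sequence):
            return NotImplemented
        return (len(self.seq), self.seq) < (len(other.seq), other.seq)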

### Testing of Sequences w/ Sanity Checks

In [2]:
#Use this cell for testing your Sequence class. Show us what tests you ran to confirm your methods worked correctly
s1 = Sequence("TCGTCAGCTGACTGATATAGC")
s2 = Sequence("CTGACCTAGTCGATCGATCG")
s3 = Sequence("TCGTCAGCTGACTGATATAGC")
print("Test for __str__: ", s1.__str__())
print("Test for __repr__: ", s1.__repr__())
print("Sanity Check: s1 == s2 is ", s1 == s2, "and __eq__ gives ", s1.__eq__(s2))
print("Sanity Check: s1 == s3 is ", s1 == s3, "and __eq__ gives ", s1.__eq__(s3))
print("Sanity Check: s1 < s3 is ", s1 < s3, "and __lt__ gives ", s1.__lt__(s3))
print("Sanity Check: s2 < s3 is ", s2 < s3, "and __lt__ gives ", s2.__lt__(s3))

Test for __str__:  TCGTCAGCTGACTGATATAGC
Test for __repr__:  The DNA Sequence is TCGTCAGCTGACTGATATAGC.
Sanity Check: s1 == s2 is  False and __eq__ gives  False
Sanity Check: s1 == s3 is  True and __eq__ gives  True
Sanity Check: s1 < s3 is  False and __lt__ gives  False
Sanity Check: s2 < s3 is  True and __lt__ gives  True
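
Because `@total_ordering` derives `<=`, `>`, and `>=` from `__eq__` and `__lt__`, those operators are worth checking as well. The following is a minimal sketch, not one of the tests run above. `DemoSeq` is a hypothetical stand-in for the Sequence class, and its ordering rule (length first, then the string itself to break ties) is an assumption chosen so that `__lt__` never contradicts `__eq__`.

```python
from functools import total_ordering

@total_ordering
class DemoSeq:
    """Hypothetical stand-in for Sequence (assumption: order by length, then text)."""
    def __init__(self, seq):
        self.seq = seq

    def __eq__(self, other):
        if not isinstance(other, DemoSeq):
            return NotImplemented
        return self.seq == other.seq

    def __lt__(self, other):
        if not isinstance(other, DemoSeq):
            return NotImplemented
        return (len(self.seq), self.seq) < (len(other.seq), other.seq)

a = DemoSeq("ACGT")
b = DemoSeq("TTGA")   # same length as a, different bases
c = DemoSeq("ACGTA")  # one base longer than a

# Operators supplied by total_ordering rather than written by hand
print(a <= c, c > a, c >= a)
# Equal-length but different sequences are ordered in exactly one direction
print(a < b, a > b)
```

If `__lt__` compared lengths only, `a > b` and `b > a` would both come out True for two different sequences of the same length, which is why the tie-breaker matters.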


## Part 2 - SequenceRecord Class

Write a class called SequenceRecord. This class should have two attributes:

- A label/title (something that describes the source of the sequence, like the contents of a header line in a FASTA file)
- A Sequence object

Your initializer should attempt to confirm that the second attribute is, in fact, a Sequence object. Consider the following code and how it could be applied here:

```
>>> s = "hello"
>>> type(s) == str
True 
```

You should also, at minimum, add a __str__ and __repr__ method.

In [3]:
s = 'hello'
print("test1",isinstance(s1,Sequence))
print("test2",isinstance(s1,str))

test1 True
test2 False
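
The prompt's `type(s) == str` check and the `isinstance` check used in this cell behave differently for subclasses. A small sketch of that difference follows; `Base` and `Child` are hypothetical classes introduced only for this illustration and are not part of the assignment.

```python
class Base:
    """Hypothetical parent class, used only for this illustration."""

class Child(Base):
    """Hypothetical subclass of Base."""

obj = Child()

# type() compares against the exact class, so a subclass instance fails the check
exact_match = type(obj) == Base
# isinstance() also accepts instances of subclasses
subclass_aware = isinstance(obj, Base)

print(exact_match, subclass_aware)
```

For a SequenceRecord initializer, `isinstance` would therefore also accept any future subclass of Sequence.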


In [4]:
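# SequenceRecord class goes here
class SequenceRecord:
    def __init__(self, label, seq):
        self.label = label
        # Store the Sequence, or an error message if seq is not a Sequence
        self.seq = self.seqCheck(seq)

    # Return var unchanged when it is a Sequence; otherwise return an error message
    def seqCheck(self, var):
        if isinstance(var, Sequence):
            return var
        return "Not a Sequence, input a valid Sequence"

    # Informal report: just the header
    def __str__(self):
        return self.label

    # Formal representation: header plus sequence
    def __repr__(self):
        return f"The header is {self.label} with sequence: {self.seq}"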

### Sanity Check Testing

In [5]:
# Use this cell to test your SequenceRecord class
header1 = "MD10G1276500"
header2 = "MD10G1110200"
header3 = "MD10G1036500"
rec = SequenceRecord(header1,s1)
print("Test for __str__: ", rec.__str__())
print("Test for __repr__: ", rec.__repr__())

Test for __str__:  MD10G1276500
Test for __repr__:  The header is MD10G1276500 with sequence: TCGTCAGCTGACTGATATAGC


In [6]:
fakes = "ATGCTAGCTGATGTCAG"
# fakes is a plain str, not a Sequence, so the check fails
fakerec = SequenceRecord(header2,fakes)
fakerec # the record stores the error message instead of a sequence

The header is MD10G1110200 with sequence: Not a Sequence, input a valid Sequence
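
Storing an error message in `seq` lets an invalid record be created without any visible failure. One alternative, sketched here as an assumption and not as something the assignment requires, is to raise a `TypeError` in the initializer. `DemoSequence` and `StrictRecord` are hypothetical stand-ins so that the sketch runs on its own.

```python
class DemoSequence:
    """Hypothetical minimal stand-in for the Sequence class."""
    def __init__(self, seq):
        self.seq = seq

    def __str__(self):
        return self.seq

class StrictRecord:
    """Hypothetical variant of SequenceRecord that rejects non-sequence input."""
    def __init__(self, label, seq):
        if not isinstance(seq, DemoSequence):
            raise TypeError(
                f"seq must be a DemoSequence, got {type(seq).__name__}"
            )
        self.label = label
        self.seq = seq

# A valid record is created as usual
ok = StrictRecord("MD10G1276500", DemoSequence("TCGTCAGC"))

try:
    StrictRecord("MD10G1110200", "ATGCTAGCTGATGTCAG")  # plain str, not a DemoSequence
except TypeError as err:
    message = str(err)
    print("Rejected:", message)
```

Raising at construction time means a bad record can never reach later code, at the cost of requiring callers to handle the exception.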

## Part 3 - Parsing using your new classes

Build yourself a test FASTA file with approximately 3 simple records. Read in this file, and use its contents to create a SequenceRecord for each record in the file.

- Please note this process is identical to what we did previously with FASTA parsing, only before we used a dictionary where the key stored the header info, and the value stored the sequence info. Now, our SequenceRecord object holds BOTH pieces.

Be sure to confirm your SequenceRecord objects hold the correct information.

For extra credit, write your parser as a generator.

In [7]:
# Write your parser and testing code here
class Parser:
    def __init__(self, file):
        # Read the file once, then wrap each (header, sequence) pair in a SequenceRecord
        headers, sequences = self.loadFa(file)
        self.file = self.toRecord(headers, sequences)

    def loadFa(self, filename):
        with open(filename, 'r') as fp:
            # split at headers
            data = fp.read().split('>')
        # ignore whatever appears before the 1st header
        data.pop(0)
        headers = []
        sequences = []
        for sequence in data:
            lines = sequence.split('\n')
            headers.append(lines.pop(0))
            sequences.append(''.join(lines))
        return (headers, sequences)

    # Uses the created Sequence and SequenceRecord classes for a file
    def toRecord(self, heads, seqs):
        templist = []
        for head, seq in zip(heads, seqs):
            templist.append(SequenceRecord(head, Sequence(seq)))
        return templist

    # Returns the list of SequenceRecords so that p.__str__()[0] is the first record;
    # because this is not a str, str(p) and print(p) raise a TypeError
    def __str__(self):
        return self.file

    def __repr__(self):
        return f'The genes in the file are {self.file}'
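
For the extra-credit generator, one possible shape is sketched below. It is not the Parser above: `read_fasta` is a hypothetical helper that takes any iterable of lines and yields plain `(header, sequence)` tuples one record at a time, and the in-memory sample data is made up for the demonstration.

```python
import io

def read_fasta(lines):
    """Yield (header, sequence) tuples lazily from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, ''.join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines that appear before the first header are ignored
            chunks.append(line)
    # Emit the final record once the input is exhausted
    if header is not None:
        yield header, ''.join(chunks)

# Made-up sample data standing in for a small FASTA file
sample = io.StringIO(">rec1\nACGT\nTTGA\n>rec2\nGGCC\n")
records = list(read_fasta(sample))
print(records)
```

With the notebook's classes, each pair could be wrapped as `SequenceRecord(header, Sequence(seq))` while iterating over an open file handle, so only one record needs to be held in memory at a time.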

### Personal Logic Testing

In [8]:
# Sanity Checking
def loadFasta(filename):
    fp = open(filename, 'r')
    # split at headers
    data = fp.read().split('>')
    fp.close()
    # ignore whatever appears before the 1st header
    data.pop(0)     
    headers = []
    sequences = []
    for sequence in data:
        lines = sequence.split('\n')
        headers.append(lines.pop(0))
        sequences.append(''.join(lines))
    return (headers, sequences)
head, seq = loadFasta("./three.fa")
len(seq)

3

### Final Parser Output

In [9]:
p = Parser("./three.fa")
print("p.__str__()[0] gives the first header: ",p.__str__()[0])
print("p.__repr__() gives the entire file with associated sequence file: ",p.__repr__())

p.__str__()[0] gives the first header:  MD10G1276500 pacid=40089867 polypeptide=MD10G1276500 locus=MD10G1276500 ID=MD10G1276500.v1.1.491 annot-version=v1.1
p.__repr__() gives the entire file with associated sequence file:  The genes in the file are [The header is MD10G1276500 pacid=40089867 polypeptide=MD10G1276500 locus=MD10G1276500 ID=MD10G1276500.v1.1.491 annot-version=v1.1 with sequence: CAGTCCGTGGCTCCTGTGTGCAATGTCTGCGGCGAGCAGGTGGGGCTTGGTGCCAATGGGGAGGTTTTCGTGGCATGCCACGAGTGTAATTTCCCCATTTGCAAGGCTTGTTTCGATGAAGATGTCAAGGCTGGGCGTAAAGTTTGCTTGCAGTGTGGTATTCCCTATGACGATAACCCGTTGGCGGAGTATGAAACAAAGGTGTCAGGCACTCGATCCACAATGGAAGCTCACCTGAATAATACACAGGATACAGGAATTCATGCTAGGCATATCAGCAGTGTGTCTACGTTGGATAGTGAATTAAACGATGAATCTGGCAATCCGATTTGGAAGAATAGAGTGGAAAGTTGGAAGGATAAGAAGGATAAGAAGGATAAAAAGATCAAGAAGAAAAAGGATACACCTAATGGGGAAAAAGAGGCTCAAATTCCACCTGAGAAGCAGATGACAGAGGAATATTCATCAGAGGCTGCGGAACCACTTTCAACTCTCGTCCCACTTCCATCTAACAGAATCACACCATACAGAACTGTTATAATTATGCGATTGATCATTCTCGCCCTTTTCTTCCATTATCGAGTAACAAATCCTGTTGATAGTGC