# Processing FASTA files

Exercise for processing DNA sequences.

* **Contact:** mate.balajti@unibas.ch

In [13]:
import numpy as np

filename1 = "sequences.fasta"
filename2 = "genome.fasta"

## Exercise 1.1 (3 points)

Write a function `parse_fasta()` that takes a path to a FASTA file as input
and returns a tuple of two lists, the first containing sequence headers
stripped of the leading `>`, and the second containing the actual sequences.

> **Notes:**
>  
> * Please write the parser from scratch and do _not_ use existing FASTA
>   parsers, such as the one provided by Biopython.
> * Ensure that wrapped sequences are handled such that fragments of a given
>   sequence are concatenated, without white space, in the order they appear
>   in the file. Make use of the leading `>` character to separate records
>   from each other.
> * Ensure that the number of items in the returned lists corresponds to the
>   original number of records in the input file.

In [14]:
def parse_fasta(path: str) -> tuple[list[str], list[str]]:
    """Parse a FASTA file into parallel lists of headers and sequences."""
    headers = []
    sequences = []

    # The context manager closes the file even if parsing fails
    with open(path) as file:
        for line in file:
            line = line.strip()  # drop the newline and surrounding white space
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: keep the full header without the ">"
                headers.append(line[1:])
                sequences.append("")
            elif sequences:
                # Wrapped sequence line: append it to the current record
                sequences[-1] += line

    return headers, sequences

parse_fasta("genome.fasta")

(['chr1', 'chr2', 'chr3', 'chr4'],
 ['TATTTACGCTCCCGCACATATACCGTCGTAAGCGACTGGCTGCTCGCGACCAACCGCCTCGCTACGATTGAGTCGAGAGCGGGTAGACCGGAGGCCGTTCGGCGCATAACTTGTCTTGTTAATTAGGCTGCCTGTGAACGATATTCAATTATGGCACGTGGCGCATTAAGCTCCACGCCACATTCCAACTTCTAGTAGTATGCTGAAACACCGTGCGGGGGACAGAGAGCCATGGCAAATTCAGTTCGGTGTAGACCATACTTTATAACTGGACAACAAGTCCGAGTCCTTCACCTGATGCGCCCAGAACCCCGCATTTCTGTACCTATAGCACCGGCGCACACCCCTCTCACATGCTGGGCAGCAGAGTACTTCTCACTCGATCAGAGAAACCAAACCAGTTGATTGAAGTTCGCATAAGCGTGTTTGCTGTTGCTTGCGGCTCTCATGCACATAGTCGTTCAAGCTTACCCTCCACCGTCGGGCGACAATGCTCGGGGCGCGCATCAGACCGTATTGACTTTTTAGTACAGATTGGACGCAGAGGTTGAGTCATCGTATTCATGTGATAATCTCAACAGGTCCGTCTTGTTATCAAACTTAGTAGTGGGGAGGTGAGTGGTGCAATGTGGAAGTGCCGTAGCCAGCCCGCGATCCACGCACTATTTCAGGGACCGAGGTAACTTAAACCACCCGTTGATGGACATACTGGTTATAGATTTTAGTGCGGGAGTAGGCCTCGTCGCAGCGCAAAGGATGAAGTTTACTAACTGGAGTGGTCAGAAGTTGCCGCCTGTGAAGGCGGCACACGCTTAGTCCGCAATATTATGGGACTGGGGAGGCGACGGTATCCGAAGGTGCTCTGATCAGGATAGCCTCTAGACAGTAATCATGGGTTCTCCATTACGGAGTATTATGGCGGACCGATCGGTCACGTTAATTTGGCTAACTATACGTTCATA
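
The wrapped-sequence requirement can be checked without any file on disk. The following is a minimal sketch, not the notebook's own parser: `parse_fasta_lines` is a hypothetical helper that applies the same record-grouping idea to any iterable of lines, and the example records are made up for illustration.

```python
def parse_fasta_lines(lines):
    """Group FASTA lines into (headers, sequences). Hypothetical helper."""
    headers, sequences = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between or inside records
        if line.startswith(">"):
            headers.append(line[1:])  # full header, without the ">"
            sequences.append("")      # start an empty sequence for this record
        elif sequences:
            sequences[-1] += line     # join wrapped fragments in file order
    return headers, sequences


# Made-up input: one wrapped record, a blank line, and a last line
# that has no trailing newline.
example = [">chr1 test record\n", "ACGT\n", "TTGA\n", "\n", ">chr2\n", "GG"]
headers, sequences = parse_fasta_lines(example)
print(headers)    # ['chr1 test record', 'chr2']
print(sequences)  # ['ACGTTTGA', 'GG']
```

Using `strip()` instead of slicing off the last character keeps the final nucleotide when the file does not end with a newline, and appending to `sequences[-1]` avoids a separate record counter.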

## Exercise 1.2 (2 points)

Write a function `discard_ambiguous_seqs()` that takes two arguments, a list of sequence headers and a list of sequence strings,
and returns a tuple of two lists containing only those headers and sequences whose sequence consists exclusively of letters of the
"DNA alphabet" (`A`, `C`, `G`, `T`).

> **Notes:**
>  
> * Make sure your implementation is case-insensitive, i.e., sequences
>   containing lowercase DNA characters, even if mixed with uppercase
>   characters, are valid as well.

In [15]:
def discard_ambiguous_seqs(header: list[str], sequence: list[str]) -> \
  tuple[list[str], list[str]]:

    # Headers and sequences are parallel lists, so their lengths must agree
    if len(header) != len(sequence):
        raise ValueError("List sizes of the two inputs mismatch.")

    r_header = []
    r_sequence = []

    for counter, (name, seq) in enumerate(zip(header, sequence)):
        # Upper-casing first makes the check case-insensitive
        if set(seq.upper()) <= set("ACGT"):
            r_header.append(name)
            r_sequence.append(seq)
        else:
            print("Error: Sequence " + str(counter + 1) + " contains illegal characters.")

    return r_header, r_sequence


list1 = ["chr1", "chr2", "chr3"]
list2 = ["ABCD", "GTAC", "AAAA"]

discard_ambiguous_seqs(list1, list2)



Error: Sequence 1 contains illegal characters.


(['chr2', 'chr3'], ['GTAC', 'AAAA'])
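
The case-insensitivity requirement is easy to get wrong, so here is a small sketch of the check in isolation. `is_unambiguous` and `DNA_ALPHABET` are hypothetical names introduced for illustration, and treating an empty sequence as valid is an assumption that mirrors the loop-based solution above.

```python
DNA_ALPHABET = frozenset("ACGT")


def is_unambiguous(seq: str) -> bool:
    """True if seq uses only A, C, G, T in any case. Hypothetical helper."""
    # set(...) <= set(...) is the subset test; an empty sequence passes it.
    return set(seq.upper()) <= DNA_ALPHABET


headers = ["chr1", "chr2", "chr3"]
seqs = ["ABCD", "GtAc", "AAAA"]

# Keep header and sequence together so the two lists cannot drift apart.
kept = [(h, s) for h, s in zip(headers, seqs) if is_unambiguous(s)]
print(kept)  # [('chr2', 'GtAc'), ('chr3', 'AAAA')]
```

The original casing of each kept sequence is preserved; only the comparison is done in upper case.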

## Exercise 1.3 (2 points)

Write a function `nucleotide_frequencies()` that takes a list of strings as
input, and which prints out the total frequency of each nucleotide across
all input sequences. Use the following example as a template to format your
output:

```console
A: 0.31
C: 0.21
G: 0.17
T: 0.31
```

> **Notes:**
>
> * Note how numbers are rounded in the example and format decimals printed
>   by your solution in the same manner, i.e., rounded to two significant
>   digits.
> * The function does not require any specific return value. In case you are
>   not aware of how Python deals with functions without an explicit `return`
>   statement, look up the behavior in relevant documentation.

In [16]:
def nucleotide_frequencies(seqs: list[str]) -> None:
    """Print the fraction of A, C, G and T across all input sequences.

    A plain string is accepted as well, because iterating over it yields
    one-character strings.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    number_char = 0

    # Iterate over every sequence, then over every character in it
    for seq in seqs:
        for char in seq.upper():
            if char in counts:
                counts[char] += 1
            number_char += 1

    if number_char == 0:
        print("No nucleotides to count.")
        return

    # round(0.123, 2) == 0.12
    for nucleotide in "ACGT":
        print(nucleotide + ": " + str(round(counts[nucleotide] / number_char, 2)))
    print("Total: " + str(number_char))

nucleotide_frequencies(["AAACacgtcggctagcta"])

        


    

A: 0.33
C: 0.28
G: 0.22
T: 0.17
Total: 18
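
Because `round()` returns a float, `str()` drops trailing zeros, so a value such as 0.5 prints as `0.5` rather than `0.50`. A format specifier keeps two decimals in every case. The sketch below is one possible variant, not the graded solution: `format_frequencies` is a hypothetical helper, and dividing by the number of A/C/G/T characters only (rather than by all characters) is an assumption.

```python
from collections import Counter


def format_frequencies(seqs: list[str]) -> list[str]:
    """Return 'X: 0.dd' lines for A, C, G, T. Hypothetical helper."""
    counts = Counter()
    for seq in seqs:
        counts.update(seq.upper())  # count characters case-insensitively
    # Assumption: only the four DNA letters contribute to the denominator.
    total = sum(counts[n] for n in "ACGT")
    if total == 0:
        return []  # nothing to report, and no division by zero
    return [f"{n}: {counts[n] / total:.2f}" for n in "ACGT"]


for line in format_frequencies(["TTTGAG"]):
    print(line)
# A: 0.17
# C: 0.00
# G: 0.33
# T: 0.50
```

Returning the formatted lines and printing them in the caller also makes the formatting testable; a function that only prints implicitly returns `None`.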


## Exercise 1.4 (3 points)

Write a function `map_reads()` that takes as input two FASTA files, the first
containing short read sequences ("query"), and the second containing reference
sequences. The function should

* read the files,
* discard _query_ sequences that contain non-DNA characters,
* print the nucleotide fractions for both files to the console, and
* return a dictionary of dictionaries, where the outer dictionary uses the
  names of query sequences as its keys, and the inner dictionary uses
  reference sequence names as keys and, as values, lists of 1-based indices
  indicating at which positions (counting from left to right) in the
  reference sequence the query sequence occurs as an exact substring.

Execute the function, passing `sequences.fasta` and `genome.fasta` as input.
Inspect the returned "hits" object (the dictionary of dictionaries). Interpret
the results in at least 2-3 bullet points. What's special about query sequence
`sequence4`?

In [41]:
def map_reads(filename1: str, filename2: str) -> dict[str, dict]:

    # Read the query records: full header line, wrapped lines joined
    list1 = []
    list2 = []

    with open(filename1) as file1:
        for line in file1:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                list1.append(line[1:])
                list2.append("")
            elif list2:
                list2[-1] += line

    # Discard queries that contain non-DNA characters
    cleanListFile1 = discard_ambiguous_seqs(list1, list2)

    print("cleanListFile1: ")
    print(cleanListFile1)

    # Fractions, printed once per sequence
    print("Percentages of File " + filename1)
    print("\n")

    for seq in cleanListFile1[1]:
        nucleotide_frequencies(seq)
        print("\n")

    print("Percentages of File " + filename2)
    print("\n")

    listFile2 = parse_fasta(filename2)
    # The reference file is assumed to be clean, but it is filtered anyway
    cleanListFile2 = discard_ambiguous_seqs(listFile2[0], listFile2[1])
    for seq in cleanListFile2[1]:
        nucleotide_frequencies(seq)
        print("\n")

    # Build the nested dictionary: query name -> reference name -> hit
    findings = {}

    for query_name, query_seq in zip(*cleanListFile1):
        findings[query_name] = {}
        for ref_name, ref_seq in zip(*cleanListFile2):
            # Limitation: str.find() is case-sensitive and reports only the
            # first match, so each hit is a single 1-based position
            position = ref_seq.find(query_seq)
            if position != -1:
                findings[query_name][ref_name] = position + 1  # find() is 0-based
            else:
                findings[query_name][ref_name] = []

    return findings

map_reads("sequences.fasta", "genome.fasta")

Error: Sequence 3 contains illegal characters.
cleanListFile1: 
(['sequence1', 'sequence2', 'sequence4'], ['GAAGTTTACTAaCTGGAGTGGTCAGaAGTTGCCGCCTGTG', 'GCCCGGGCGTATGTATGAGAGATGTGGCCAGAAGTCGAAA', 'TTTGAG'])
Percentages of File sequences.fasta


A: 0.23
C: 0.17
G: 0.33
T: 0.28
Total: 40


A: 0.28
C: 0.17
G: 0.38
T: 0.17
Total: 40


A: 0.17
C: 0.0
G: 0.33
T: 0.5
Total: 6


Percentages of File genome.fasta


A: 0.25
C: 0.26
G: 0.25
T: 0.24
Total: 2126


A: 0.24
C: 0.26
G: 0.26
T: 0.25
Total: 2127


A: 0.24
C: 0.26
G: 0.25
T: 0.25
Total: 2124


A: 0.26
C: 0.25
G: 0.24
T: 0.25
Total: 2135




{'sequence1': {'chr1': [], 'chr2': [], 'chr3': [], 'chr4': []},
 'sequence2': {'chr1': [], 'chr2': 1422, 'chr3': [], 'chr4': []},
 'sequence4': {'chr1': [], 'chr2': 1039, 'chr3': 1422, 'chr4': 1455}}

Inspect the returned "hits" object (the dictionary of dictionaries). Interpret
the results in at least 2-3 bullet points. What's special about query sequence
`sequence4`?

* Sequence 1 is not found in any of the reference sequences. It contains lowercase characters and `str.find()` is case-sensitive, so a match against the uppercase reference would be missed by this version of the code.
* Sequence 2 is found only in chr2, at position 1422.
* Sequence 3 was removed before mapping because it contains an illegal character ("D").
* Sequence 4 is found in chr2, chr3 and chr4.

Sequence 4 has the most hits because it is by far the shortest of the three remaining queries (6 nucleotides versus 40), so the probability of it occurring by chance is much higher.

Note that each reported position is only the first occurrence of the query in that reference sequence, so this version of the code cannot tell whether a query occurs more than once.
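
The first-occurrence limitation can be lifted by calling `str.find()` repeatedly with a moving start offset. The sketch below is an illustration under stated assumptions, not the code that produced the results above: `find_all` is a hypothetical helper, matching is made case-insensitive by upper-casing both strings, overlapping matches are counted, and the reference sequences are made up.

```python
def find_all(query: str, reference: str) -> list[int]:
    """Return 1-based start positions of every exact match. Hypothetical helper."""
    # Assumption: case should not matter, so compare in upper case.
    query, reference = query.upper(), reference.upper()
    positions = []
    if not query:
        return positions  # an empty query would otherwise match everywhere
    start = reference.find(query)
    while start != -1:
        positions.append(start + 1)               # convert 0-based to 1-based
        start = reference.find(query, start + 1)  # step by one to allow overlaps
    return positions


# Made-up reference sequences, used only to show the nested dictionary shape.
references = {"refA": "ttgagTTTGAGtt", "refB": "ACGT"}
hits = {"read1": {name: find_all("TTTGAG", seq) for name, seq in references.items()}}
print(hits)                    # {'read1': {'refA': [6], 'refB': []}}
print(find_all("AA", "AAAA"))  # [1, 2, 3]
```

With this shape every inner value is a list, so "no hit" and "one hit" have the same type and the number of occurrences is simply the length of the list.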