## Mini Project for Bioinformatics Track: 

The genetic code of all living organisms is represented by a long sequence of simple molecules called nucleotides, or bases, which make up deoxyribonucleic acid, better known as DNA. There are only four such nucleotides, and the entire genetic code of a human can be seen as a simple, though roughly 3 billion letters long, string of the letters A, C, G, and T. Analyzing DNA data to gain biological understanding is largely a matter of searching long strings for patterns involving the letters A, C, G, and T. This is an integral part of bioinformatics, a scientific discipline addressing the use of computers to search for, explore, and use information about genes, nucleic acids, and proteins.

**FASTQ** is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. Each sequence letter and each quality score is encoded as a single ASCII character for brevity.
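As a rough illustration of the single-character encoding, the sketch below converts quality characters into numeric scores. It assumes the common Phred+33 convention (score = ASCII code minus 33), which is not stated in the text above. The function name `decode_quality` is hypothetical.

```python
def decode_quality(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer scores.

    Assumption: Phred+33 encoding (offset=33); files produced by older
    pipelines may use a different offset.
    """
    return [ord(ch) - offset for ch in quality_line.strip()]


# The first few and the last quality characters of the first record shown below
scores = decode_quality("AAA1>#")
print(scores)
```

Under this assumption, the long run of `#` characters at the end of a quality line corresponds to very low scores.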

A FASTQ file normally uses four lines per sequence.
* Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
* Line 2 is the raw sequence letters.
* Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
* Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
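The four-line layout above means that the sequence is always the second line of each record. A minimal sketch of extracting the sequences, assuming every record occupies exactly four lines (no wrapped sequences); the function name `read_fastq_sequences` and the toy records are made up for illustration:

```python
def read_fastq_sequences(lines):
    """Return the sequence line (line 2 of each 4-line record) from FASTQ lines."""
    sequences = []
    for i, line in enumerate(lines):
        if i % 4 == 1:  # 0: '@' header, 1: sequence, 2: '+', 3: quality
            sequences.append(line.strip())
    return sequences


# Toy records (not taken from DNA.fastq)
example_lines = [
    "@read1 example description\n",
    "ACGT\n",
    "+\n",
    "IIII\n",
    "@read2\n",
    "GGCA\n",
    "+\n",
    "IIII\n",
]
print(read_fastq_sequences(example_lines))
```

Because an open file handle is itself an iterable of lines, the same function could be passed a handle from `open("DNA.fastq")` instead of a list.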

In [61]:
!head -8 DNA.fastq

@HWI-M20149:202:000000000-AF422:1:1101:16309:1827 2:N:0:CAAATTCGGGAT
CCTGTTTGCTCCCCTCGCTTTCGTACCTCAGCGTCCATTCTTGTCCAGTCAGTCGCCTTCGCCACTGGTGTTCTTCCGTATATCTACGACTTTCACCTCTACACTCGGAATTCCACTCTCCTCTCCTATCTTCTAGCTATCTCGTTTCAATGGCTGTTCTGGCGTTGAGCTCCTGGCTTTCCCCTCTGACTTGATTATCCTCCTACGTACTCTTTACGCCCACTCCTTCCTATTCTCGCTTGCTTCCTCCT
+
AAA1>FD1BFFFGG1A1EFGGGEB00AGF111AAA0/D222A1DB121D111B1AA/AEH/EE//AB>F0BEH@F2@/10B1BFG21///?EGF2F1FGH1B10>0////?FE121>01BBBGGF011211BF>221>22<120?<?F22221<0/?<1<111/?<-.1<<1<110<<CCG00C<.<0=GGD<00:000;0:/::::.0:::0BF####################################
@HWI-M20149:202:000000000-AF422:1:1101:14316:1930 2:N:0:CAAATTCGGGAT
CCTCTTCGCTCCCCTCGCTTTCGTCCCTCAGCGTCAGTTTTGGCCCAGTAGCCTGCCTTCGCCATCGGTGTTCTTTCTCATCTCTGTGCATTTCACCGCTCCACTACGTATTCCCTTTACCTCTACTGTCTTCTAGACCTCTAGTTTCCTTCTCCCTTTTCCCATTCGACCCCTGGCTTTCGCCAGTTGCTTTTCTGCCCCCCTTACCACCCTTTCAACCCCATTCATCCCGATCACGCCTGCCACCTCCT
+
AA>1>F1AAA1AF111AFG0AGEBA0EHHBG1BAAAEH2220/1A/BFFB1111BF0BFHHGEE/1BEE/EEHG112@22B@BGH12BGBGHGFB

## Steps to complete the project: 
<font color=blue>
* ### Read sequence from a file (DNA.fastq)
* ### Create a list or an array for sequence data
* ### Calculate counts for each nucleotide (A, C, G, T)
* ### Plot nucleotide base frequency

In [1]:
import numpy as np

### Extract sequence from file

<font color=red>
#### Blank 1: Read DNA.fastq information into a list

<font color=red>
#### Blank 2: Reshape seq_list into a NumPy array seq_array

In [None]:
print(seq_array)
print(seq_array.ndim, seq_array.size)

### Count nucleotides 

In [42]:
# A function to count occurrences of a single base
def count(dna, base):
    m = []   # matches for base in dna: m[i]=True if dna[i]==base
    for c in dna:
        m.append(c == base)
    return sum(m)
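The loop-based `count` function above can be cross-checked against Python's built-in `str.count`, which returns the number of non-overlapping occurrences of a substring. A small sketch, using a made-up toy sequence and a hypothetical helper name `count_builtin`:

```python
def count_builtin(dna, base):
    """Count occurrences of a single base using the built-in str.count."""
    return dna.count(base)


toy_dna = "GATTACAGG"  # made-up sequence, not from DNA.fastq
print(count_builtin(toy_dna, "G"))
print(count_builtin(toy_dna, "A"))
```

For a single-character `base`, both approaches should give the same result; the built-in is simply shorter and avoids building an intermediate list.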


In [43]:
count(seq_array[0,0], 'G')
count(seq_list[0], 'G')


36

In [None]:
# A function to count occurrences of all 4 bases with a list
def freq_lists(dna_list):
    A=T=G=C=0
    for dna in dna_list:
        for index, base in enumerate(dna):
            if base == 'A':
                A +=1
            elif base == 'C':
                C += 1
            elif base == 'G':
                G += 1
            elif base == 'T':
                T += 1
            elif base == '\n':
                break
    return A, C, G, T

In [56]:
print(freq_lists(seq_array[0,0]))
print(freq_lists(seq_array[1,0]))
print(freq_lists(seq_list))

([30], [92], [36], [93])
([28], [105], [32], [86])
([30], [92], [36], [93])
([28], [105], [32], [86])


In [47]:
# A function to count occurrences of all 4 bases
# input: list
# output: dictionary

def freq_dict_of_list(dna_list):
    #frequency_matrix = {'A':0, 'C':0, 'G':0, 'T':0 }
    frequency_matrix = {base: 0 for base in 'ACGT'}
    for dna in dna_list:
        for base in dna:
            if base != '\n':  # skip the trailing newline
                frequency_matrix[base] += 1
    return frequency_matrix
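Note that `freq_dict_of_list` raises a `KeyError` if a sequence contains any character other than A, C, G, T, or a newline (for example an ambiguous base written as `N`). One possible alternative, sketched here with `collections.Counter` from the standard library, ignores such characters instead; the name `freq_counter` and the toy input are assumptions for illustration only:

```python
from collections import Counter


def freq_counter(dna_list):
    """Count A, C, G, T across all sequences; other characters are ignored."""
    totals = Counter()
    for dna in dna_list:
        totals.update(dna)  # counts every character in the string
    # Keep only the four bases, in a fixed order
    return {base: totals[base] for base in "ACGT"}


# Toy input (not from DNA.fastq), including an 'N' and trailing newlines
print(freq_counter(["ACGT\n", "GGNA\n"]))
```

Whether silently skipping unexpected characters is desirable depends on the data; raising an error can be the safer choice when the input is expected to be clean.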

<font color=red>
#### Blank 3: call freq_dict_of_list(dna_list)

In [48]:
# call freq_dict_of_list(dna_list)
matrix_seq = ??
print(matrix_seq)

{'A': 405, 'T': 962, 'C': 830, 'G': 313}


In [49]:
# A function to count occurrences of all 4 bases
# input: array
# output: dictionary

def freq_dict_of_array(dna_array):
    #frequency_matrix = {'A':0, 'C':0, 'G':0, 'T':0 }
    frequency_matrix = {base: 0 for base in 'ACGT'}
    # iterate over the rows; each sequence is stored in column 0
    for i in range(dna_array.shape[0]):
        dna = dna_array[i, 0]
        for base in dna:
            if base != '\n':  # skip the trailing newline
                frequency_matrix[base] += 1
    return frequency_matrix

<font color=red>
#### Blank 4: call freq_dict_of_array(dna_array)

In [None]:
seq_count = ??
print(seq_count)

### Calculate nucleotide (A, C, G, T) frequencies

<font color=red>
#### Blank 5: calculate the frequency of each base

In [None]:
# calculate total counts of all bases
values = ??
value_sum = ??

# create a list of keys and a list of base frequencies
keys = []
seq_feqs = []

for key in sorted(seq_count):
    ??
    ??

print(seq_feqs)
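For reference, a base's frequency is its count divided by the total count of all bases. The sketch below works this through using the counts printed earlier for `matrix_seq`; the variable names prefixed with `example_` are hypothetical and are kept separate from the blanks above:

```python
# Counts copied from the matrix_seq output shown earlier in this notebook
example_counts = {'A': 405, 'T': 962, 'C': 830, 'G': 313}

# Total number of bases counted
example_total = sum(example_counts.values())

# Sorted keys give a stable A, C, G, T order for later plotting
example_keys = sorted(example_counts)
example_freqs = [example_counts[k] / example_total for k in example_keys]

for k, f in zip(example_keys, example_freqs):
    print(f"{k}: {f:.3f}")
```

The frequencies should sum to 1 (up to floating-point rounding), which is a quick sanity check on the calculation.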


### Plotting nucleotide frequency

In [449]:
import matplotlib.pyplot as plt

In [450]:
x = keys
y = seq_feqs


<font color=red>
#### Blank 6: plot a bar chart where x: base, y: frequency

In [None]:

plt.bar( ?? )
plt.title( ?? )
plt.xlabel('Nucleotides')
plt.ylabel('Frequency')
# show plot
??