<a href="https://colab.research.google.com/github/mariyagolchin/Python-and-Bioinformatics/blob/main/Bioinformatics_Examples1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**source:**  http://hplgit.github.io/bioinf-py/doc/pub/bioinf-py.html


# Counting Letters in DNA Strings
Given some string dna containing the letters *A, C, G, or T*, representing the bases that make up DNA, we ask the question: 

**how many times does a certain base occur in the DNA string?** 

For example, if dna is *ATGGCATTA* and we ask how many times the base A occurs in this string, the answer is 3.

This problem can be solved in Python in many ways. Several possible solutions are presented below.



# List Iteration
The most straightforward solution is to loop over the letters in the string, test if the current letter equals the desired one, and if so, increase a counter. Looping over the letters is straightforward when they are stored in a list, and the built-in list function converts a string to a list of its letters:

In [1]:
list('ACGT')

['A', 'C', 'G', 'T']

In [8]:
# Count the occurrences of a base in a DNA string via a list of letters

def find_occurrence(dna, base):
    count = 0
    dna = list(dna)  # convert string to list of letters
    for item in dna:
        # print("item=>", item)
        if item == base:
            count += 1
    return count

dna = 'ACGTTACCTCGTA'
base = 'G'
find_occurrence(dna, base)

2
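
The comparison item == base is an exact, case-sensitive match, which matters when sequences arrive in lowercase. The sketch below repeats the function so that it runs on its own and tries two edge cases; the test strings are made up for illustration, and normalizing with upper() is one possible convention, not something the source prescribes.

```python
def find_occurrence(dna, base):
    """Count base in dna by looping over a list of letters."""
    count = 0
    for item in list(dna):
        if item == base:
            count += 1
    return count

# Illustrative inputs (assumptions, not taken from the source).
empty_count = find_occurrence('', 'G')                  # loop body never runs
lower_count = find_occurrence('acgttacc', 'G')          # 'g' does not equal 'G'
upper_count = find_occurrence('acgttacc'.upper(), 'G')  # normalize case first

print(empty_count, lower_count, upper_count)
```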

# String Iteration
Python allows us to iterate directly over a string without converting it to a list:

In [9]:
for c in 'ATGC':
  print(c) 

A
T
G
C


In [23]:
# Count the occurrences of a base by iterating directly over the DNA string

def find_occurrence_str(dna, base):
    count = 0
    for item in dna:
        # print("item=>", item)
        if item == base:
            count += 1
    return count

dna = 'ACCGCTA'
base = 'C'
find = find_occurrence_str(dna, base)

# printf-style formatting
print('%s appears %d times in %s' % (base, find, dna))

# or the newer str.format syntax
print('{base} appears {find} times in {dna}'.format(base=base, find=find, dna=dna))

C appears 3 times in ACCGCTA
C appears 3 times in ACCGCTA
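
Python 3.6 and later also accept f-strings, which give a third way to build the same message. The following is a minimal sketch that assumes such a Python version; the variable message is introduced here only for illustration.

```python
dna = 'ACCGCTA'
base = 'C'

# Count with direct string iteration.
find = 0
for item in dna:
    if item == base:
        find += 1

# f-string: names inside the braces are evaluated in place (Python 3.6+).
message = f'{base} appears {find} times in {dna}'
print(message)
```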


# Index Iteration
An index loop uses an integer counter that runs over all valid indices of a string or array, so each letter is reached as dna[j]:

In [22]:
def find_occurrence_idx(dna, base):
    i = 0  # number of matches found so far
    for j in range(len(dna)):  # j runs over the indices 0, ..., len(dna)-1
        if dna[j] == base:
            print(dna[j])
            i += 1
    return i

dna = 'ACCGCTA'
base = 'C'
find = find_occurrence_idx(dna, base)
print(find)

C
C
C
3
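
One practical reason to loop over indices is that the position of each match is available, not just the letter. The sketch below records those positions; find_positions is a hypothetical helper name, not a function from the source.

```python
def find_positions(dna, base):
    """Return the zero-based indices at which base occurs in dna."""
    positions = []
    for j in range(len(dna)):  # j runs over 0, 1, ..., len(dna) - 1
        if dna[j] == base:
            positions.append(j)
    return positions

dna = 'ACCGCTA'
positions = find_positions(dna, 'C')
print(positions)       # indices of 'C' in dna
print(len(positions))  # the count is the number of recorded indices
```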


# While Loops
The while loop equivalent to the last function must initialize and increment the index j by hand. It reads:

In [30]:
def find_occurrence_w(dna, base):
    i = 0  # number of matches found so far
    j = 0  # current index into dna
    while j < len(dna):
        if dna[j] == base:
            print(dna[j])
            i += 1
        j = j + 1
    return i

dna = 'ACCGCAATA'
base = 'A'
find = find_occurrence_w(dna, base)
print(find)

A
A
A
A
4
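
To make the correspondence between the two loop forms concrete, the sketch below places an index-based for loop next to its while counterpart and runs both on a few inputs. The names count_for and count_while and the test cases are assumptions made for illustration.

```python
def count_for(dna, base):
    """Index-based for loop: range supplies every index."""
    count = 0
    for j in range(len(dna)):
        if dna[j] == base:
            count += 1
    return count

def count_while(dna, base):
    """While loop: the index is initialized, tested, and advanced by hand."""
    count = 0
    j = 0                    # initialize
    while j < len(dna):      # test
        if dna[j] == base:
            count += 1
        j += 1               # advance; without it a non-empty dna loops forever
    return count

# Illustrative inputs, including an empty string and an absent letter.
cases = [('ACCGCAATA', 'A'), ('', 'A'), ('TTTT', 'T'), ('ACGT', 'N')]
for dna, base in cases:
    print(repr(dna), base, count_for(dna, base), count_while(dna, base))
```

The for loop leaves the bookkeeping to range, which is why it is usually preferred when the number of iterations is known in advance.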


# Summing a Boolean List

The idea now is to create a list m where m[i] is True if dna[i] equals the letter we search for (base). The number of True values in m is then the number of occurrences of base in dna. The sum function gives this number because Python treats True as 1 and False as 0 in arithmetic, so sum(m) returns the number of True elements in m. The function below builds m; calling sum on its result gives the count:

In [34]:
def find_occurrence_bl(dna, base):
    m = []
    for char in dna:
        if char == base:
            m.append(True)
        else:
            m.append(False)
    return m


dna = 'ACCGCAATA'
base = 'A'
m_list = find_occurrence_bl(dna, base)
print(m_list)
print(sum(m_list))  # number of occurrences of base in dna


[True, False, False, False, False, True, True, False, True]
4
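
The reason sum works on such a list is that bool is a subclass of int in Python. The short sketch below checks this directly and then sums the boolean list from the output above, copied in by hand.

```python
# bool is a subclass of int, so True and False act as 1 and 0 in arithmetic.
print(isinstance(True, int))
print(True + True)
print(False + 0)

# The boolean list from the output above, copied by hand.
m = [True, False, False, False, False, True, True, False, True]
total = sum(m)           # adds 1 for every True, 0 for every False
print(total)
print(m.count(True))     # the same number via list.count
```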


# Inline If Test
Shorter code is a worthwhile goal when the compactness enhances readability. The four-line if test in the previous function can be condensed to one line using the inline if construction: value1 if condition else value2.

In [38]:
def find_occurrence_bl(dna, base):
    m = []
    for char in dna:
        m.append(True if char == base else False)  # inline if
    return m

dna = 'ACCGCAGATA'
base = 'G'
m_list = find_occurrence_bl(dna, base)
print(m_list)
print(sum(m_list))

[False, False, False, True, False, False, True, False, False, False]
2
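
The inline if is not limited to True and False; the two values can be anything. The sketch below uses it to mark matches with characters chosen purely for illustration ('*' and '.'). It also checks that, when the two values are True and False, the expression gives the same result as the comparison char == base alone.

```python
dna = 'ACCGCAGATA'
base = 'G'

# General form: value1 if condition else value2
marks = ''
for char in dna:
    marks += '*' if char == base else '.'
print(marks)  # one mark per letter, '*' where the letter equals base

# With True/False as the two values, the inline if equals the condition itself.
m_inline = []
m_plain = []
for char in dna:
    m_inline.append(True if char == base else False)
    m_plain.append(char == base)
print(m_inline == m_plain)
```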
