# Lecture 2: Data Types and Flow Control


> ### __At the core of any language:__

- Control the flow of the program

- Construct and access data elements

- Operate on data elements

- Construct functions

- Construct classes

- Libraries and built-in classes

> "The Practice of Computing Using Python", 3rd Edition, Punch & Enbody, 2017


#### Today we will present a very brief overview of the basics of __data types__, __flow control__ and __functions__.

------
## Part 1: Data Types

### What is a Type?

A type in Python essentially defines two things:

 </br>

- the __internal structure__ of the type (what it contains)

    - For example, many languages store an integer in a fixed number of bytes, typically four bytes (32 bits), which caps its value at roughly ±2 billion. Python's `int` is different: it has no fixed size and grows as needed to hold arbitrarily large values.

 </br>
 
- the kinds of __operations__ you can perform:

    - `'abc'.capitalize()` is a method you can call on strings, but not integers

    - Each data type has its own operators. Operators are special symbols in Python that carry out arithmetic or logical computation: an operator takes one or more operands, computes a result, and makes that result available to Python for further use.

        - e.g.,`+`, `<`, `-`, `*`, `**`, `//`, `!=`, `==`

 </br>
 
Some types have multiple elements (__collections__); we'll see those later

 </br>
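To make these two ideas concrete, here is a minimal sketch that inspects a few values with the built-in `type()` function and shows that the available operations depend on the type. The variable names are illustrative assumptions, not part of the lecture code.

```python
# Inspect the type of a few values
n_reads = 42          # an integer
label = 'abc'         # a string

print(type(n_reads))
print(type(label))

# The same operator can mean different things for different types
total = n_reads + 8           # numeric addition
joined = label + 'def'        # string concatenation
print(total, joined)

# A method defined for strings is not available on integers
capitalized = label.capitalize()
int_has_capitalize = hasattr(n_reads, 'capitalize')
print(capitalized, int_has_capitalize)
```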
 

------

### Fundamental Types

#### __Numbers__

In [1]:
# integers

total_seq = 100

In [2]:
# Floating point numbers

ph = 8.0 
avocado_number = 6.02e23

In [3]:
# You can also do math on numbers

stock_conc = 50
final_vol = 100
final_conc = 10

required_vol = final_conc*final_vol/stock_conc  # 10*100/50

print(f'The required volume of stock solution is {required_vol:.2f}.') # format spec: {} is the placeholder, ':' starts the format, '.2f' means two decimals

The required volume of stock solution is 20.00.


#### __Bonus coverage: The function type__

A function is itself a data type in Python. You can think of the name of the function as a variable that contains the address of the function’s lines of code. The several lines are packaged together to be reused later.

> __Python format of defining a function:__

``` python
#---------------------------------

def my_function(argument1, argument2): 

    '''what this function does''' #docstring
    
    function body

    return values # return statement

#---------------------------------
```


In [4]:
# create a dilution calculator function:

def desired_volume(stock_conc, final_vol, final_conc): 
    required_vol = final_conc*final_vol/stock_conc
    
    # print(f'The volume of stock solution we need to use is {required_vol:.2f}.')

    return required_vol
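
The function is defined above but never called. As a small usage sketch, the calls below reuse the numbers from the earlier dilution example (50, 100, 10); the docstring and the keyword-argument call are additions for illustration.

```python
def desired_volume(stock_conc, final_vol, final_conc):
    '''Return the volume of stock solution needed for a dilution (C1*V1 = C2*V2).'''
    required_vol = final_conc * final_vol / stock_conc
    return required_vol

# Positional arguments follow the order in the definition
vol = desired_volume(50, 100, 10)
print(f'Required volume: {vol:.2f}')

# Keyword arguments make the call easier to read
vol_kw = desired_volume(stock_conc=50, final_vol=100, final_conc=10)
print(vol == vol_kw)
```

Because the function returns its result instead of printing it, the value can be stored and reused in later calculations.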



Many useful, widely applicable functions have already been written by others and packaged into modules, which can be imported and used. 

In [5]:
import math 

print("180 / pi Degrees is equal to Radians : ", end ="")
print (math.radians(180 / math.pi)) 

180 / pi Degrees is equal to Radians : 1.0


#### __Booleans__

In [6]:
# Boolean variables represent quantities that are True or False

read_gene_1 = 56
threshold = 5


bool_1 = (read_gene_1 == threshold)
# bool = (read_gene_1 == threshold)  # Don't do this: it would overwrite the built-in name `bool`
print("'read_gene_1 == threshold' is", bool_1)
bool_2 = (read_gene_1 > threshold)
print("'read_gene_1 > threshold' is", bool_2)
bool_3 = (read_gene_1 <= threshold)
print("'read_gene_1 <= threshold' is", bool_3)

'read_gene_1 == threshold' is False
'read_gene_1 > threshold' is True
'read_gene_1 <= threshold' is False


#### __Strings__

In [7]:
# Strings can be defined using either single or double quotes
first_name = 'Barbara'
last_name = "McClintock"
print(first_name, last_name) # Python strings are Unicode, so text in other languages works too

Barbara McClintock


In [8]:
# Multiline strings can be defined using triple quotes (single or double)

address = """
Cold Spring Harbor Laboratory
1 Bungtown Rd.
Cold Spring Harbor, NY 11724
"""
print(address)


Cold Spring Harbor Laboratory
1 Bungtown Rd.
Cold Spring Harbor, NY 11724



In [9]:
# The '+' sign concatenates strings

# EcoRI recognition site:

coding_5 = 'G'
coding_3 = 'AATTC'
noncoding_3 = 'CTTAA'  
noncoding_5 = 'G'

print(coding_5 + ' ' + coding_3 + '\n' + noncoding_3 + ' ' + noncoding_5) # an expression can also span multiple lines inside ()

G AATTC
CTTAA G


Here is the DNA sequence of the multiple cloning site (MCS) on the plasmid [pcDNA5](https://www.addgene.org/vector-database/2132/), a popular vector for mammalian gene expression.

In [10]:
# It is simple to test if one string is contained within another

seq = 'GAGACCCAAGCTGGCTAGCGTTTAAACTTAAGCTTGGTACCGAGCTCGGATCCACTA' \
      'GTCCAGTGTGGTGGAATTCTGCAGATATCCAGCACAGTGGCGGCCGCTCGAGTCTAG' \
      'AGGGCCCGTTTAAACCCGCTGATCAGCCT'

# Does this MCS contain a restriction site for NheI (GCTAGC)? 
print('GCTAGC' in seq)

# How about for MscI (TGGCCA)?
print('TGGCCA' in seq)

True
False


In [11]:
# The len() function tells you the length of a string

len(seq)

143

In [12]:
full_name = last_name + ', ' + first_name

# The contents in a string can be indexed using brackets

print('First character:', full_name[0])         # strings are indexed starting at 0
print('Last character:', full_name[-1])         # str[-n] returns the n'th character from the end.
print('Characters 5-7:', full_name[5:7])        # str[start:stop]; stop is excluded. This is called a 'slice'.
print('Every other character:', full_name[::2]) # str[start:stop:stride]
print('Reverse the string:', full_name[::-1])    # strings can be reversed using a stride of -1

First character: M
Last character: a
Characters 5-7: nt
Every other character: MCitc,Braa
Reverse the string: arabraB ,kcotnilCcM


There are many functions and methods for strings. You'll encounter the use of some of them for sequencing data in __Exercise_2.1__.
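
As a brief illustration, the sketch below applies a few built-in string methods to a short made-up sequence (`short_seq` is a hypothetical example, not data from the lecture).

```python
short_seq = 'atgGCGtaa'   # hypothetical mixed-case sequence

upper_seq = short_seq.upper()               # convert to upper case
n_g = upper_seq.count('G')                  # number of G's
n_c = upper_seq.count('C')                  # number of C's
gc_fraction = (n_g + n_c) / len(upper_seq)  # fraction of G or C bases
starts_with_atg = upper_seq.startswith('ATG')

print(upper_seq)
print(f'GC fraction: {gc_fraction:.2f}')
print('Starts with a start codon:', starts_with_atg)
```

String methods return new strings or values; `short_seq` itself is left unchanged.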

### Converting Types

A character `'1'` is not an integer `1`!

In [13]:
# int(some_var) returns an integer; it fails if the original value cannot be interpreted as an integer!

my_int = int(5.0) 
print('my_int =', my_int, '; type is', type(my_int))

# my_int = int(5.2)     # works, but truncates to 5
# my_int = int('five')  # raises ValueError

my_int = 5 ; type is <class 'int'>


In [14]:
# float(some_var) returns a float

my_float = float('3.1415926')
print('my_float =', my_float, '; type is', type(my_float))

my_float = 3.1415926 ; type is <class 'float'>


In [15]:
# str(some_var) returns a string

my_str = str(5)
print('my_str =', my_str, '; type is', type(my_str))

my_str = 5 ; type is <class 'str'>


In [16]:
# The input() function always returns a string
min_copy = input('threshold transcript copy number:')
3 < min_copy 

TypeError: '<' not supported between instances of 'int' and 'str'
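
One way to avoid this error is to convert the string before comparing. In the sketch below, `user_text` is a hypothetical stand-in for the value `input()` would return, so the example runs without a prompt; the `try`/`except` guard is an addition for illustration.

```python
user_text = '5'            # stand-in for input('threshold transcript copy number:')

min_copy = int(user_text)  # convert the string to an integer first
print(3 < min_copy)        # the comparison now works: int vs int

# Guard against text that cannot be converted
bad_text = 'five'
try:
    int(bad_text)
    converted = True
except ValueError:
    converted = False
print('Could convert', repr(bad_text), '->', converted)
```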

-----

### Collection Types

#### __Lists and tuples__




In [17]:
# Define a list using brackets and commas.

lb_ingredients = ['Tryptone', 'NaCl', 'Yeast extract', 'Distilled water']

mixed_dtype_list = [1, 'two', 3.0, 'four', 5]


In [18]:
# Lists can be indexed using brackets just like strings can.

print('First element:', lb_ingredients[0])
print('Last element:', lb_ingredients[-1])
print('The list reversed:', lb_ingredients[::-1])

First element: Tryptone
Last element: Distilled water
The list reversed: ['Distilled water', 'Yeast extract', 'NaCl', 'Tryptone']


In [19]:
# Use 'in' to test whether an element is contained in a list.

'Ethanol' in lb_ingredients

False

In [20]:
# Change an element in a list.
print('Before: ', lb_ingredients)
lb_ingredients[-1] = 'Micropore water'
print('After:', lb_ingredients)

Before:  ['Tryptone', 'NaCl', 'Yeast extract', 'Distilled water']
After: ['Tryptone', 'NaCl', 'Yeast extract', 'Micropore water']


In [21]:
# Append an element to the end of a list.
print('Before:', lb_ingredients)
lb_ingredients.append('NaOH')
print('After:', lb_ingredients)

Before: ['Tryptone', 'NaCl', 'Yeast extract', 'Micropore water']
After: ['Tryptone', 'NaCl', 'Yeast extract', 'Micropore water', 'NaOH']


In [22]:
# You get an error if you try to access an index that doesn't exist.
lb_ingredients[10]

IndexError: list index out of range

In [23]:
# You also get an error if you pass a non-integer as an index.
lb_ingredients[4.0]

TypeError: list indices must be integers or slices, not float

In [24]:
# To create a list of numbers from 0 to n-1, use list(range(n))
enum_list = list(range(10))

In [25]:
# Sort a list of numbers
vals = [0,2,4,6,8,1,3,5,7,9]
print('Before sorting: ', *vals) # the * operator unpacks the list, so the items are printed separated by spaces
vals.sort()
print('After sorting: ', *vals)

Before sorting:  0 2 4 6 8 1 3 5 7 9
After sorting:  0 1 2 3 4 5 6 7 8 9


In [26]:
# Tuples are like lists, though they are defined using parentheses instead of brackets.
# Functions often pass tuples (not lists) back to the user.

t = (0, 1, 2, 3, 4)
print(t)

(0, 1, 2, 3, 4)
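
Two properties are worth seeing in action: a function that returns several values hands them back as a tuple, and tuples cannot be modified in place. The helper `read_range` below is a hypothetical example, not part of the lecture code.

```python
def read_range(reads):
    '''Return the smallest and largest value as a tuple.'''
    return min(reads), max(reads)   # the comma builds a tuple

result = read_range([56, 3, 300, 129])
print(result, type(result))

# Tuples can be unpacked into separate variables
lowest, highest = result
print(lowest, highest)

# Unlike lists, tuples are immutable: item assignment raises TypeError
t = (0, 1, 2, 3, 4)
try:
    t[0] = 10
    modified = True
except TypeError:
    modified = False
print('Tuple modified:', modified)
```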


#### __Dictionaries__
Dictionaries are one of Python's most useful datatypes. 

They can be thought of as a list of key-value pairs that allow values to be rapidly looked up via keys. 
- Keys can be any immutable (hashable) object, such as a string, a number, or a tuple. Values can be anything.

In [27]:
# Dictionaries are defined using braces, colons, and commas
lb_ingredients_1L = {'Tryptone':10, 'NaCl':10, 'Yeast extract':5, 'Distilled water':975}
print(lb_ingredients_1L)

{'Tryptone': 10, 'NaCl': 10, 'Yeast extract': 5, 'Distilled water': 975}


In [28]:
# Access dictionary elements using a "key" enclosed in brackets
print(lb_ingredients_1L['NaCl'])

10


In [29]:
# You can replace and add elements to a dictionary after it is created.
lb_ingredients_1L['Yeast extract'] = 8
lb_ingredients_1L['NaOH'] = .1

In [30]:
# From a dictionary, you can get a list of both the keys and the values.
keys = list(lb_ingredients_1L.keys())
print('keys:', keys)

values = list(lb_ingredients_1L.values())
print('values:', values)

keys: ['Tryptone', 'NaCl', 'Yeast extract', 'Distilled water', 'NaOH']
values: [10, 10, 8, 975, 0.1]
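
Here is a minimal sketch of three more dictionary operations that often come in handy: testing for a key with `in`, looking up a key with a fallback via `.get()`, and listing key-value pairs with `.items()`. It starts again from the original 1 L recipe; the `'Agar'` key is an illustrative assumption.

```python
lb_ingredients_1L = {'Tryptone': 10, 'NaCl': 10, 'Yeast extract': 5, 'Distilled water': 975}

# 'in' tests the keys, not the values
has_agar = 'Agar' in lb_ingredients_1L
print('Agar in recipe:', has_agar)

# .get() returns a default instead of raising KeyError for a missing key
agar_amount = lb_ingredients_1L.get('Agar', 0)
print('Agar amount:', agar_amount)

# .items() gives the (key, value) pairs
pairs = list(lb_ingredients_1L.items())
print(pairs)
```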


#### __Exercise 1__



In [31]:
# The sequence we've defined before; note how to define a long string over multiple lines
seq = 'GAGACCCAAGCTGGCTAGCGTTTAAACTTAAGCTTGGTACCGAGCTCGGATCCACTA' \
      'GTCCAGTGTGGTGGAATTCTGCAGATATCCAGCACAGTGGCGGCCGCTCGAGTCTAG' \
      'AGGGCCCGTTTAAACCCGCTGATCAGCCT'

__E1.1__: Using the string method `.find()`, find the location(s) of the above restriction sites within the MCS.

In [32]:
# Answer here

__E1.2__: Using the string method `.replace()`, compute the RNA sequence corresponding to the DNA sequence above. 

In [33]:
# Answer here

**E1.3**: We have not yet discussed sets. Using Google, figure out what `set` objects are and explain what they represent. In particular, explain why Python evaluates `{2,3,3} < {1,2,3}` as True.

--------

## Part 2: Flow Control

__Selection__:

- Selection is how programs make choices, and making choices provides much of the power of computing. In Python, we use conditional statements to evaluate a condition and select which statements to execute.


__Repetition__:

- Besides selecting which statements to execute, a fundamental need in a program is repetition: repeating a set of statements while some condition holds.

<br/>


### Selection

#### __Python Selection: `if` statements:__


> __Python format:__

``` python
#---------------------------------

if boolean expression:
    code body # watch out for indentations! 
    
#---------------------------------
```


We evaluate the boolean (`True` or `False`); if `True`, execute all statements in the suite

- example comparison operators, each of which evaluates to a boolean: `<`, `>`, `<=`, `==`, `!=`


In [34]:
# review on boolean logic:
read_gene_1 = 56
threshold = 5

read_gene_1 > threshold 

True

In [35]:
# use if statement to output a statement if read is above a threshold.

read_gene_1 = 56
threshold = 5

if read_gene_1 > threshold:
    print('Gene_1 will be included in the analysis.')

Gene_1 will be included in the analysis.


#### __Python Selection with two branches: `if/else`__

> __Python format:__

``` python
#---------------------------------

if boolean expression: # evaluate the boolean, if True, run suite1
    code body 1
else:
    code body 2 # if False, run suite2

#---------------------------------

```


In [36]:
# use if/else statement to output statement for inclusion of a gene.

read_gene_1 = 56
read_gene_2 = 3

threshold = 5

if read_gene_2 > threshold:
    print('Gene_2 will be included in the analysis.')
else:
    print('We disregard gene_2 in the analysis.')

We disregard gene_2 in the analysis.


#### __Python Selection with multiple branches: `if/elif/else`__

> __Python format:__

``` python
#---------------------------------

if boolean expression A: # evaluate A; if True, run suite1
    code body 1
elif boolean expression B:
    code body 2 # if A is False and B is True, run suite2
else:
    code body 3 # if both A and B are False, run suite3

#---------------------------------

```


In [37]:
# use if/elif/else statement:

read_gene_1 = 56
read_gene_2 = 3
read_gene_3 = 300

threshold = 5
high_exp = 100

if read_gene_3 > high_exp:
    print('Gene_3 is considered as highly expressed.')

elif read_gene_3 > threshold:
    print('Gene_3 will be included in the analysis.')
    
else:
    print('We disregard gene_3 in the analysis.')

Gene_3 is considered as highly expressed.


-----

### Repetition (loop)

#### __`while` and `for` Statements__

__while statement__:

– repeats a set of statements while some condition is `True`

– more general repetition construct

__for statement__:

– useful for iteration, moving through all the elements of data structure, one at a time

<br/>


> __Python format:__

``` python
#---------------------------------

flag = True # define a variable tracker

while flag:

    code body

# exit when flag is no longer true

#---------------------------------

```

- A `while` loop repeats the statements in the suite as long as the boolean is `True` (or its Python equivalent)

- If the boolean expression never changes during the course of the loop, the loop will continue __forever__


In [38]:
# find cells with intensity that passed the threshold:
fluorescent_intesnity = [10.1, 15,3, 78.5, 0.3, 46.9, 0.2, 0.7, 2.2, 12.5, 33.2]

passed_index = []

index = 0 # initialize variable tracker

while index < len(fluorescent_intesnity): # stop at the end of the list instead of hard-coding the last index

    if fluorescent_intesnity[index] > 20:
        passed_index.append(index)

    index += 1 # modify variable evaluated at the condition statement  

print(passed_index)

[3, 5, 10]


__General Approach to a `while` loop__:

__i)__ outside the loop, initialize the variable that the boolean condition tests

__ii)__ somewhere inside the loop, perform some operation that changes the state of the program, eventually making the boolean `False` and exiting the loop

Must have both!
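
As a minimal sketch of these two steps, the loop below counts how many doublings a culture needs to reach a target size. The starting count and the target are made-up numbers.

```python
cell_count = 1_000        # (i) initialize the variable tested in the condition
target = 1_000_000
n_doublings = 0

while cell_count < target:
    cell_count *= 2       # (ii) change the state so the condition eventually becomes False
    n_doublings += 1

print(f'{n_doublings} doublings give {cell_count} cells.')
```

If the line `cell_count *= 2` were left out, the condition would stay `True` and the loop would never end.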

<br/>


#### __`for` loop and iteration__

The `for` statement iterates through each element of a collection (list, etc.)

> __Python format:__

``` python
#---------------------------------

for element in collection:

    code body

#---------------------------------

```

We have already encountered collections and their elements: 
- a string and its characters;
- a list and the objects in the list;
- a dictionary and its keys (or key-value pairs, via `.items()`).

These objects are __"iterable"__: each is capable of returning its members one at a time, which permits it to be iterated over in a `for` loop. 
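
A short sketch of iterating directly over each of these collection types. The values are illustrative; note that looping over a dictionary yields its keys, while `.items()` yields key-value pairs.

```python
# A string yields its characters
bases = []
for base in 'GATC':
    bases.append(base)
print(bases)

# A list yields its elements
total_reads = 0
for reads in [56, 3, 300, 129]:
    total_reads += reads
print(total_reads)

# A dictionary yields its keys; .items() yields (key, value) pairs
recipe = {'Tryptone': 10, 'NaCl': 10}
keys_seen = []
for key in recipe:
    keys_seen.append(key)

pairs_seen = []
for key, value in recipe.items():
    pairs_seen.append(f'{key}: {value}')
print(keys_seen, pairs_seen)
```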

In [None]:
# iterate over genes to decide whether to include them based on reads:
read_gene_1 = 56
read_gene_2 = 3
read_gene_3 = 300
read_gene_4 = 129

read_list = [56, 3, 300, 129]

threshold = 5
high_exp = 100

for i in range(len(read_list)): # iterate over the indexes:

    if read_list[i] > high_exp:
        print(f'Gene_{i+1} is considered as highly expressed.')

    elif read_list[i] > threshold:
        print(f'Gene_{i+1} will be included in the analysis.')
        
    else:
        print(f'Gene_{i+1} is discarded in the analysis.')

Gene_1 will be included in the analysis.
Gene_2 is discarded in the analysis.
Gene_3 is considered as highly expressed.
Gene_4 is considered as highly expressed.


There are other objects that are iterable and can be unpacked into lists:

In [None]:
# example iterable object classes:
v_iter = range(10)
print(type(v_iter))
print('iterable:', range(10))

v_list = list(range(10))
print(type(v_list))
print('list:    ', v_list, '\n')

e_iter = enumerate(full_name)
print(type(e_iter))
print('iterable:', e_iter)

e_list = list(enumerate(full_name))
print(type(e_list))
print('list:    ', e_list, '\n')

d = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
print(type(d.keys()))
print('iterable:', d.keys())
print(type(list(d.keys())))
print('list:    ', list(d.keys()))

<class 'range'>
iterable: range(0, 10)
<class 'list'>
list:     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 

<class 'enumerate'>
iterable: <enumerate object at 0x7feeec2aa080>
<class 'list'>
list:     [(0, 'M'), (1, 'c'), (2, 'C'), (3, 'l'), (4, 'i'), (5, 'n'), (6, 't'), (7, 'o'), (8, 'c'), (9, 'k'), (10, ','), (11, ' '), (12, 'B'), (13, 'a'), (14, 'r'), (15, 'b'), (16, 'a'), (17, 'r'), (18, 'a')] 

<class 'dict_keys'>
iterable: dict_keys(['A', 'C', 'G', 'T'])
<class 'list'>
list:     ['A', 'C', 'G', 'T']


In [None]:
# find cells with intensity that passed the threshold, 2nd way:

fluorescent_intesnity = [10.1, 15,3, 78.5, 0.3, 46.9, 0.2, 0.7, 2.2, 12.5, 33.2]

passed_index = []

for i, f in enumerate(fluorescent_intesnity): # enumerate yields (index, value) pairs, so no manual counter is needed

    if f > 20:
        passed_index.append(i)

print(passed_index)

[3, 5, 10]



#### __Exercise 2__
__Exercise 2.1__: Fill in the code to complete the `while` loop and find the day on which the activity of P32 drops below the threshold.

In [None]:
# Assume you have a vial of P32
half_life = 14.3 # days

# Initially, the vial is at 100% activity
current_activity = 100

# As long as it has ~10% activity, it's still good to use for radioactive gels
min_activity = 10

# Compute how many days the vial is good for before it needs to be thrown out
num_days = 0

while #### (1) fill in the code here ####

    # exponential decay 
    current_activity /= 2**(1/half_life) 

    # keep track of the days:
     #### (2) fill in the code here ####
    
print('P32 activity will be reduced to %.1f%% by day %d.'%(current_activity, num_days))

__Exercise 2.2__: Use a `for` loop and a Python dictionary to translate the sequence in Exercise 1 (there are many ways to achieve the same thing!)

In [None]:
# Answer here

--------

You can do this in one line with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html)! Try it on your own!

It can also be an easy task using __vectorized computations__ with `numpy`.

In [None]:
import numpy as np

# these two modules are also commonly imported for data analysis
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
np.argwhere(np.array(fluorescent_intesnity) > 20)

array([[ 3],
       [ 5],
       [10]])

In [None]:
### add some numpy math: plot sin(x)
# plt.plot()