In [1]:
import jupman
jupman.init()

# Practical 2

In this practical we will interact more with Python and practice handling data, functions and methods. We will see several built-in data types and then dive deeper into strings.

## Slides

The slides of the introduction can be found here: [Intro](docs/Practical2.pdf)

## Modules

Python modules are simply text files with the extension **.py** (e.g. ```exercise1.py```). When you wrote code in the IDE in the previous practical, you were in fact implementing the corresponding module. 

As said in the previous practical, once you have implemented and saved the code of a module, you can execute it by typing 

```python3 exercise1.py```

or, in Visual Studio Code, by right clicking on the code panel and selecting **Run Python File in Terminal**.

A module A can be loaded from another module B so that B can use the functions defined in A. Remember when we used the ```sqrt``` function? It is defined in the module ```math```. To import it and use it we wrote something like:

In [2]:
import math

A = math.sqrt(4)
print(A)

2.0


When importing modules we do not need to specify the extension ".py" of the file. 
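
As a small sketch of two other common import forms (they are not used elsewhere in this practical and are shown only for illustration), a single name can be imported from a module, or the module can be given a shorter alias. The alias ```m``` below is an arbitrary choice:

```python
import math              # import the whole module
from math import sqrt    # import only the name sqrt from the module
import math as m         # import the module under an alias (arbitrary name)

print(math.sqrt(4))      # access through the module name
print(sqrt(4))           # access the imported name directly
print(m.sqrt(4))         # access through the alias
```

All three calls run the same function, so they print the same value.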


## Objects

In Python everything is an object. Objects have **properties** (characteristic features) and **methods** (things they can do). For example, an object *car* has the *properties* model, make, color, number of doors etc., and the *methods* steer right, steer left, accelerate, brake, stop, change gear,... 
According to Python's official documentation: 

"Objects are Python's abstraction for data. All data in a Python program is represented by objects or by relations between objects."

All you need to know for now is that in Python objects have an **identifier (ID)** (a number that uniquely identifies the object while it exists; it is not the variable name), a **type** (numbers, text, collections,...) and a **value** (the actual data represented by the object). Once an object has been created its *identifier* and *type* never change, while its *value* can either change (**mutable objects**) or stay constant (**immutable objects**).

Python provides these built-in data types:

![](img/pract2/types.png)


We will stick with the simplest ones for now, but later on we will dive deeper into all of them.

## Variables

Variables are just references to objects, in other words they are the **name** given to an object. Variables can be **assigned** to objects by using the assignment operator ```=```.

The instruction 


In [3]:
sides = 4

might represent the number of sides of a square. What happens when we execute it in Python? An object is created, it is given an identifier, its type is set to "int" (an integer number) and its value to 4, and a name *sides* is placed in the current namespace to point to that object, so that after that instruction we can access the object through its name. The type of an object can be accessed with the function **type()** and the identifier with the function **id()**: 

In [4]:
sides = 4
print( type(sides) )
print( id(sides) )

<class 'int'>
10915392


Consider now the following code:

In [5]:
sides = 4 #a square 
print ("value:", sides, " type:", type(sides), " id:", id(sides))
sides = 5 #a pentagon
print ("value:", sides, " type:", type(sides), " id:", id(sides))


value: 4  type: <class 'int'>  id: 10915392
value: 5  type: <class 'int'>  id: 10915424


The value of the variable sides has been changed from 4 to 5, but as stated in the table above, the type ```int``` is **immutable**. Luckily, this did not prevent us from changing the value of sides from 4 to 5. What happened behind the scenes when we executed the instruction sides = 5 is that a new object of type int has been created (5 is still an integer) and it has been made accessible with the same name *sides*. Since it is a different object (i.e. the integer 5), its identifier is different. Note: you do not really have to worry about what happens behind the scenes, as the Python interpreter will take care of these aspects for you, but it is nice to know what it does.
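
A minimal sketch of the same idea with two names (the variable names ```a``` and ```b``` are chosen only for this illustration): assigning one variable to another makes both names point to the same object, and re-assigning the first name does not touch the object the second name refers to.

```python
a = 4
b = a                          # b now refers to the same object as a
same_before = id(a) == id(b)   # both names point to one object

a = 5                          # a is re-bound to a different object
same_after = id(a) == id(b)    # b still points to the integer 4

print("b:", b)
print("same object before:", same_before, " after:", same_after)
```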

You can even change the type of a variable during execution but that is normally a **bad idea** as it makes understanding the code more complicated.

You can do (**but, please, refrain!**):

In [6]:
sides = 4 #a square
print ("value:", sides, " type:", type(sides), " id:", id(sides))
sides = "four" #the sides in text format
print ("value:", sides, " type:", type(sides), " id:", id(sides))

value: 4  type: <class 'int'>  id: 10915392
value: four  type: <class 'str'>  id: 140606094871104


<div class="alert alert-warning">

**IMPORTANT NOTE:** 
You can choose any name you like for your variables (I advise picking something that recalls their meaning), but you need to adhere to some simple rules.


1. Names can only contain upper/lower case letters (```A-Z```, ```a-z```), digits (```0-9```) or underscores ```_```;


2. Names cannot start with a number;


3. Names cannot be equal to reserved keywords:

![](img/pract2/keywords.png)


</div>
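
The rules above can be checked from Python itself. The sketch below uses the string method ```isidentifier()``` and the standard module ```keyword```; the candidate names are made up for illustration. Note that ```isidentifier()``` only checks the characters, so reserved keywords must be tested separately (and Python 3 also accepts many non-ASCII letters, which we do not use here).

```python
import keyword

print("sides".isidentifier())      # letters only: valid
print("n_sides2".isidentifier())   # letters, underscore, digit: valid
print("2sides".isidentifier())     # starts with a digit: not valid
print("my-var".isidentifier())     # contains a hyphen: not valid

# a keyword has valid characters but is still not allowed as a name
print("for".isidentifier(), keyword.iskeyword("for"))
```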


## Numeric types

We already mentioned that numbers are **immutable objects**. Python provides different numeric types: integers, booleans, reals (floats) and even complex numbers and fractions (but we will not get into those).

### Integers 

Their range of values is limited only by the memory available. As we have already seen, Python also provides a set of standard operators to work with numbers:



In [7]:
a = 7
b = 4

a + b # 11
a - b # 3
a // b # integer division: 1
a * b # 28
a ** b # power: 2401
a / b # division: 1.75
type(a / b)

float

Note that in the last case the result is no longer an integer, but a float (we will get to that later).

### Booleans 

These objects are used for boolean algebra. Truth values are represented in Python with the keywords ```True``` and ```False```: a boolean object can only have the value ```True``` or ```False```. Integers can be converted into booleans with the built-in function ```bool``` and booleans into integers with the built-in function ```int```: 

In [8]:
a = bool(1)
b = bool(0)
c = bool(72)
d = bool(-5)
t = int(True)
f = int(False)

print("a: ", a, " b: ", b, " c: ", c, " d: ", d , " t: ", t, " f: ", f)

a:  True  b:  False  c:  True  d:  True  t:  1  f:  0


Any integer evaluates to ```True```, except 0. Note that the truth values ```True``` and ```False``` behave like the integers 1 and 0, respectively.
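
As a short illustration of booleans behaving like integers (the variable name ```hits``` is made up for this sketch), truth values can be summed, which is a handy way to count how many conditions hold:

```python
print(True + True)               # True counts as 1
print(isinstance(True, int))     # booleans are a special kind of integer

# count how many of the three comparisons are true
hits = (3 > 1) + (2 > 5) + (7 == 7)
print("conditions satisfied:", hits)
```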

We can operate on boolean values with the boolean operators ```and```, ```or```, ```not```. Recall boolean algebra for their use: 

In [9]:
T = True
F = False

print ("T: ", T, " F:", F)

print ("T and F: ", T and F) #False
print ("T and T: ", T and T) #True
print ("F and F: ", F and F) #False
print ("not T: ", not T) # False
print ("not F: ", not F) # True
print ("T or F: ", T or F) # True
print ("T or T: ", T or T) # True
print ("F or F: ", F or F) # False


T:  True  F: False
T and F:  False
T and T:  True
F and F:  False
not T:  False
not F:  True
T or F:  True
T or T:  True
F or F:  False


Numeric comparators are operators that return a boolean value. Here are some examples (from the lecture):

![](img/pract2/comparators.png)




**Example**: Given a variable a = 10 and a variable b = 77, let's swap their values (i.e. at the end a will be equal to 77 and b to 10). Let's also check the values at the beginning and at the end.

In [10]:
a = 10
b =  77
print("a: ", a, " b:", b)
print("is a equal to 10?", a == 10)
print("is b equal to 77?", b == 77)

TMP = b  #we need to store the value of b safely
b = a    #ok, the old value of b is gone... is it?
a = TMP  #a gets the old value of b... :-)

print("a: ", a, " b:", b)
print("is a equal to 10?", a == 10)
print("is a equal to 77?", a == 77)
print("is b equal to 10?", b == 10)
print("is b equal to 77?", b == 77)



a:  10  b: 77
is a equal to 10? True
is b equal to 77? True
a:  77  b: 10
is a equal to 10? False
is a equal to 77? True
is b equal to 10? True
is b equal to 77? False
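
The temporary variable makes every step of the swap explicit. As a side note, Python also allows the swap in a single assignment: the right-hand side is evaluated first, then both names are re-bound. A minimal sketch:

```python
a = 10
b = 77

a, b = b, a   # both values on the right are read before any name is re-assigned

print("a: ", a, " b:", b)
```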


### Real numbers

Python stores real numbers (floating point numbers) in 64 bits of information divided in sign, exponent and mantissa. 
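
Because only a finite number of bits is available, most decimal fractions cannot be stored exactly. The sketch below (using the standard ```sys``` module, which is not otherwise needed in this practical) shows the size of the mantissa and a classic rounding effect:

```python
import sys

print(sys.float_info.mant_dig)   # bits of precision in the mantissa
print(0.1 + 0.2)                 # not exactly 0.3
print(0.1 + 0.2 == 0.3)          # exact comparison fails
```

On common platforms the first line prints 53 (52 stored bits plus one implicit bit).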

**Example:**
Let's calculate the area of the center circle of a football pitch (radius = 9.15 m), recalling that $area = \pi \cdot R^2$:

In [11]:
R = 9.15
Pi = 3.1415926536
Area = Pi*(R**2)
print (Area)

263.02199094102605


Note that the parentheses around ```R**2``` are not necessary, as the operator ```**``` has higher precedence than ```*```, but I personally think they help readability. 

Here is a reminder of the precedence of operators: 

![](img/pract2/precedence.png)
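
A few quick checks of operator precedence (the numbers are arbitrary and serve only as an illustration). Note in particular that ```**``` binds more tightly than the unary minus and is evaluated from right to left:

```python
print(2 + 3 * 4)      # multiplication first: 14
print((2 + 3) * 4)    # parentheses first: 20
print(-2 ** 2)        # power first, then the minus sign: -4
print((-2) ** 2)      # parentheses first: 4
print(2 ** 3 ** 2)    # right to left, i.e. 2 ** (3 ** 2): 512
```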

**Example:** Let's compute the GC content of a DNA sequence 33 base pairs long, having 12 As, 9 Ts, 5 Cs and 7 Gs. The GC content can be expressed by the formula: $gc = \frac{G+C}{A+T+C+G}$ where A, T, C, G represent the number of nucleotides of each kind. What is the AT content? Is the GC content higher than the AT content?

In [12]:
A = 12
T = 9
C = 5
G = 7

gc = (G+C)/(A+T+C+G)

print("The GC content is: ", gc)

at = 1 - gc

print (gc > at)

The GC content is:  0.36363636363636365
False


## Strings 

Strings are **immutable objects** (note the actual type is **str**) used by Python to handle text data. Strings are sequences of *unicode code points* that can represent characters, but also formatting information (e.g. '\\n' for new line). Unlike some other programming languages, Python does not have a separate character data type: a character is simply a string of length 1.

There are several ways to define a string:


In [13]:
S = "my first string, in double quotes"

S1 = 'my second string, in single quotes'

S2 = '''my third string is 
in triple quotes
therefore it can span several lines'''

S3 = """my fourth string, in triple double-quotes
can also span
several lines"""

print(S, '\n') #let's add a new line at the end of the string with \n
print(S1,'\n')
print(S2, '\n')
print(S3, '\n')

my first string, in double quotes 

my second string, in single quotes 

my third string is 
in triple quotes
therefore it can span several lines 

my fourth string, in triple double-quotes
can also span
several lines 



To put special characters like ' and " inside a string you need to "escape" them (i.e. write them right after a backslash).

![](img/pract2/escapes.png)

**Example**:
Let's print a string containing a quote and double quote (i.e. ' and ").

In [14]:
myString = "This is how I \'quote\' and \"double quote\" things in strings"
print(myString)

This is how I 'quote' and "double quote" things in strings


Strings can be converted to and from numbers with the functions ```str()```, ```int()``` or ```float()```.

**Example**:
Let's define a string *myString* with the value "47001" and convert it into an int. Try adding one and print the result.

In [15]:
myString = "47001"
print(myString, " has type ", type(myString))

myInt = int(myString)

print(myInt, " has type ", type(myInt))

myInt = myInt + 1   #adds one

myString = myString + "1" #cannot add 1 (we need to use a string). 
                          #This will append 1 at the end of the string

print(myInt)
print(myString)

47001  has type  <class 'str'>
47001  has type  <class 'int'>
47002
470011


Python defines some operators to work with strings. Recall the slides shown during the lecture:

![](img/pract2/stringoperators.png)


**Example** 
A tandem repeat is a short sequence of DNA that is repeated several times in a row. Let's create a string representing the tandem repeat of the motif "ATTCG" repeated 5 times. What is the length of the whole repetitive region? Is the motif "TCGAT" (m1) present in the region? The motif "TCCT" (m2)? Let's give an orientation to the tandem repeat by adding the string "5'-" (5' end) on the left and "-3'" (3' end) to the right.

In [57]:
motif = "ATTCG"

tandem_repeat = motif * 5

print(motif)
print(tandem_repeat, " has length", len(tandem_repeat))
m1 = "TCGAT"
m2 = "TCCT"

print("Is ", m1, " in ", tandem_repeat, " ? ", m1 in tandem_repeat )
print("Is ", m2, " in ", tandem_repeat, " ? ", m2 in tandem_repeat )
oriented_tr = "5\'-" + tandem_repeat + "-3\'"
print(oriented_tr)

ATTCG
ATTCGATTCGATTCGATTCGATTCG  has length 25
Is  TCGAT  in  ATTCGATTCGATTCGATTCGATTCG  ?  True
Is  TCCT  in  ATTCGATTCGATTCGATTCGATTCG  ?  False
5'-ATTCGATTCGATTCGATTCGATTCG-3'


We can access strings at specific positions (indexing) or get a substring starting from a position S to a position E. The only thing to remember is that numbering starts from 0. The ```i```-th character of a string can be accessed as ```str[i-1]```. Substrings can be accessed as ```str[S:E]```; optionally, a third parameter can be specified to set the step (i.e. ```str[S:E:STEP]```). **Remember that when you do str[S:E], S is inclusive, while E is exclusive** (see S[0:6] below).

![](img/pract2/slicingstring.png)

In [62]:
S = "Luther College"

print(S) #print the whole string
print(S == S[:]) #a fancy way of making a copy of the original string
print(S[0]) #first character
print(S[3]) #fourth character
print(S[-1]) #last character
print(S[0:6]) #first six characters
print(S[-7:]) #final seven characters
print(S[0:len(S):2]) #every other character starting from the first
print(S[1:len(S):2]) #every other character starting from the second

Luther College
True
L
h
e
Luther
College
Lte olg
uhrClee


### Methods for the str object

The object ```str``` has some methods that can be applied to it (remember methods are things you can do on objects). Recall from the lecture that the main methods are:

![](img/pract2/strmethods.png)

<div class="alert alert-warning">
**IMPORTANT NOTE**:
Since strings are immutable, every operation that changes a string actually produces a new *str* object having the modified string as value. 
</div>
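
A minimal sketch of what immutability means in practice (the variable names are arbitrary, and the ```try```/```except``` construct is used here only to show the error without stopping the program; it will be covered later):

```python
s = "atcg"
t = s.upper()      # upper() returns a new string...
print(s, t)        # ...and the original is unchanged

try:
    s[0] = "A"     # in-place modification is not allowed
except TypeError as err:
    print("cannot modify a string in place:", err)
```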

**Example**:
Given the DNA sequence S = "   aTATGCCCATatcgctAAATTGCTGCCATTACA    ", print its length (after removing any blank spaces at either end) and the number of adenines, cytosines, guanines and thymines present. Is the sequence "ATCG" present in S? Print how many times the substring "TGCC" appears in S and all the corresponding indexes.

In [61]:
S = "   aTATGCCCATatcgctAAATTGCTGCCATTACA    "

print(S)
S = S.strip(" ")
print(S)

print(len(S))
tmpS = S.upper() #for simplicity to count only 4 different nucleotides
print("A count: ", tmpS.count("A"))
print("C count: ", tmpS.count("C"))
print("T count: ", tmpS.count("T"))
print("G count: ", tmpS.count("G"))
print("Is ATCG in ", tmpS, "? ", tmpS.find("ATCG") != -1) #or tmpS.count("ATCG") > 0
print("TGCC is present ", tmpS.count("TGCC"), " times in ", tmpS)
print("TGCC is present at pos ", tmpS.find("TGCC"))
print("TGCC is present at pos ", tmpS.rfind("TGCC"))


   aTATGCCCATatcgctAAATTGCTGCCATTACA    
aTATGCCCATatcgctAAATTGCTGCCATTACA
33
A count:  10
C count:  9
T count:  10
G count:  4
Is ATCG in  ATATGCCCATATCGCTAAATTGCTGCCATTACA ?  True
TGCC is present  2  times in  ATATGCCCATATCGCTAAATTGCTGCCATTACA
TGCC is present at pos  3
TGCC is present at pos  23
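
With only two occurrences, ```find``` and ```rfind``` are enough to list all positions. For more occurrences, one possible approach (a sketch, not the only way) is to pass ```find``` an optional second argument, the position from which the search starts, and restart the search just after each hit:

```python
S = "ATATGCCCATATCGCTAAATTGCTGCCATTACA"

first = S.find("TGCC")               # search from the beginning
second = S.find("TGCC", first + 1)   # search again, just after the first hit
third = S.find("TGCC", second + 1)   # -1 means there are no more occurrences

print(first, second, third)
```

This prints ```3 23 -1```.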


## Exercises

1. An exon of a gene starts from position 12030 on a genome and ends at position 12174. Does an A/T SNP present at position 12111 affect this exon? And what about a SNP present at position 12188?

<div class="tggle" onclick="toggleVisibility('ex1');">Show/Hide Solution</div>
<div id="ex1" style="display:none;">

In [19]:
E_start = 12030
E_end = 12174
SNP1_pos = 12111
SNP2_pos = 12188

Test1 = (SNP1_pos >= E_start and SNP1_pos <=E_end)
Test2 = (SNP2_pos >= E_start and SNP2_pos <=E_end)
print ("SNP1 (", SNP1_pos,") in [", E_start, ",", E_end, "]?", Test1)
print ("SNP2 (", SNP2_pos,") in [", E_start, ",", E_end, "]?", Test2)

SNP1 ( 12111 ) in [ 12030 , 12174 ]? True
SNP2 ( 12188 ) in [ 12030 , 12174 ]? False
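
As an alternative sketch, Python allows comparisons to be chained, which reads closer to the mathematical notation of an interval:

```python
E_start = 12030
E_end = 12174

# equivalent to (12111 >= E_start and 12111 <= E_end)
print(E_start <= 12111 <= E_end)
print(E_start <= 12188 <= E_end)
```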


</div>

2. SNP FB_AFFY_0000024 of the Apple 480K SNP chip has 5' flanking region (i.e. the forward probe) CATTATTTTCACTTGGGTCGAGGCCAGATTCCATC and 3' flanking region (i.e. reverse probe) GGATTGCCCGAAATCAGAGAAAAGTCG. The SNP is a G/A transition. Answer the following questions:

    1. What is the length of the 5' flanking region? And that of the 3' flanking region? 
    2. The IUPAC code of the G/A transition is R. What is the sequence of the whole region using the [G/A] notation for the SNP (hint: concatenate in a new string called *region*) and using the IUPAC notation R (*region_iupac*)?
    3. Retrieve and print only the SNP from *region* and *region_iupac*.

<div class="tggle" onclick="toggleVisibility('ex2');">Show/Hide Solution</div>
<div id="ex2" style="display:none;">

In [20]:
SNP_5prime =  "CATTATTTTCACTTGGGTCGAGGCCAGATTCCATC"
SNP_3prime = "GGATTGCCCGAAATCAGAGAAAAGTCG"

SNPseq = "G/A"
SNPiupac = "R"

print("Length of 5\' end: ", len(SNP_5prime))
print("Length of 3\' end: ", len(SNP_3prime))
region = SNP_5prime + "[" + SNPseq + "]" + SNP_3prime
region_iupac = SNP_5prime + SNPiupac + SNP_3prime
print(region)
print(region_iupac)

#string slicing and indexing!

snp_from_region = region[ len(SNP_5prime) + 1 : len(SNP_5prime) + 4 ] 
snp_from_iupac = region_iupac[ len(SNP_5prime) ]

print("SNP from region: ", snp_from_region)
print("SNP from iupac region: ", snp_from_iupac)


Length of 5' end:  35
Length of 3' end:  27
CATTATTTTCACTTGGGTCGAGGCCAGATTCCATC[G/A]GGATTGCCCGAAATCAGAGAAAAGTCG
CATTATTTTCACTTGGGTCGAGGCCAGATTCCATCRGGATTGCCCGAAATCAGAGAAAAGTCG
SNP from region:  G/A
SNP from iupac region:  R


</div>

3. Compute the melting temperature $T_m$ of the primer with sequence "TTAGCACACGTGAGCCAATGGAGCAAACGGGTAATT". The melting temperature $T_m$ (in degrees Celsius) can be computed as: $T_m = 64.9 + 41(GC - 16.4)/N$, where $GC$ is the total number of G and C in the primer and $N$ is its length.  

<div class="tggle" onclick="toggleVisibility('ex3');">Show/Hide Solution</div>
<div id="ex3" style="display:none;">

In [26]:
primer = "TTAGCACACGTGAGCCAATGGAGCAAACGGGTAATT"
N = len(primer)

gc = (primer.count("G") + primer.count("C"))

Tm = 64.9 + 41 * (gc - 16.4)/N

print("Melting T for primer ", primer, " is: ", Tm, "°C")

Melting T for primer  TTAGCACACGTGAGCCAATGGAGCAAACGGGTAATT  is:  65.58333333333334 °C


</div>

4. Convert the following extract of the [PalB2](http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000083093;r=16:23603160-23641310) gene into mRNA (i.e. replace thymine with uracil):
CTGTCTCCCTCACTGTATGTAAATTGCATCTAGAATAGCA
TCTGGAGCACTAATTGACACATAGTGGGTATCAATTATTA
TTCCAGGTACTAGAGATACCTGGACCATTAACGGATAAAT
AGAAGATTCATTTGTTGAGTGACTGAGGATGGCAGTTCCT
GCTACCTTCAAGGATCTGGATGATGGGGAGAAACAGAGAA
CATAGTGTGAGAATACTGTGGTAAGGAAAGTACAGAGGAC
TGGTAGAGTGTCTAACCTAGATTTGGAGAAGGACCTAGAA
GTCTATCCCAGGGAAATAAAAATCTAAGCTAAGGTTTGAG
GAATCAGTAGGAATTGGCAAAGGAAGGACATGTTCCAGAT
GATAGGAACAGGTTATGCAAAGATCCTGAAATGGTCAGAG
CTTGGTGCTTTTTGAGAACCAAAAGTAGATTGTTATGGAC
CAGTGCTACTCCCTGCCTCTTGCCAAGGGACCCCGCCAAG
CACTGCATCCCTTCCCTCTGACTCCACCTTTCCACTTGCC
CAGTATTGTTGGTGT

and print the number of uracils present and the total length of the sequence (**remember to remove newlines**).

Considering the genetic code and all the possible open reading frames, answer the following questions:

![](img/pract2/genetic_code.png)
    
1. How many stop codons are present in the sequence?
2. How many Glycines (Gly)?
3. Is Tryptophan (Trp) present?
4. What is the position of the leftmost Trp? Print the codon to double check correctness (hint: slicing).
5. What is the position of the rightmost Trp? Print the codon to double check correctness (hint: slicing).
    
<div class="tggle" onclick="toggleVisibility('ex4');">Show/Hide Solution</div>
<div id="ex4" style="display:none;">

In [63]:
seq ="""CTGTCTCCCTCACTGTATGTAAATTGCATCTAGAATAGCA
TCTGGAGCACTAATTGACACATAGTGGGTATCAATTATTA
TTCCAGGTACTAGAGATACCTGGACCATTAACGGATAAAT
AGAAGATTCATTTGTTGAGTGACTGAGGATGGCAGTTCCT
GCTACCTTCAAGGATCTGGATGATGGGGAGAAACAGAGAA
CATAGTGTGAGAATACTGTGGTAAGGAAAGTACAGAGGAC
TGGTAGAGTGTCTAACCTAGATTTGGAGAAGGACCTAGAA
GTCTATCCCAGGGAAATAAAAATCTAAGCTAAGGTTTGAG
GAATCAGTAGGAATTGGCAAAGGAAGGACATGTTCCAGAT
GATAGGAACAGGTTATGCAAAGATCCTGAAATGGTCAGAG
CTTGGTGCTTTTTGAGAACCAAAAGTAGATTGTTATGGAC
CAGTGCTACTCCCTGCCTCTTGCCAAGGGACCCCGCCAAG
CACTGCATCCCTTCCCTCTGACTCCACCTTTCCACTTGCC
CAGTATTGTTGGTGT"""

seq = seq.replace("\n","")
mRNA = seq.replace("T","U")

print("Number of uracils: ", mRNA.count("U"))
print("Total length of the sequence: ", len(seq))
stopc = mRNA.count("UAA") + mRNA.count("UGA") + mRNA.count("UAG")
print("Number of stop codons: ", stopc)
gly = mRNA.count("GGU") + mRNA.count("GGC") + mRNA.count("GGA") + mRNA.count("GGG")
print("Number of glycines: ", gly)
print("Is Trp present? ", mRNA.find("UGG") != -1) #find returns -1 when absent; 0 is a valid position
lmTrp = mRNA.find("UGG")   #leftmost occurrence
print("Leftmost Trp at pos:", lmTrp, " Codon: ", mRNA[lmTrp : lmTrp + 3])
rmTrp = mRNA.rfind("UGG")  #rightmost occurrence
print("Rightmost Trp at pos:", rmTrp, " Codon: ", mRNA[rmTrp : rmTrp + 3])


Number of uracils:  140
Total length of the sequence:  535
Number of stop codons:  32
Number of glycines:  34
Is Trp present?  True
Leftmost Trp at pos: 42  Codon:  UGG
Rightmost Trp at pos: 529  Codon:  UGG
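
One caveat when counting codons over all reading frames: ```count``` only counts *non-overlapping* occurrences, so runs such as "GGGG" contribute a single "GGG" even though the codon occurs in two different frames. Whether this affects the numbers above has not been checked here; the toy string below only illustrates the behaviour:

```python
toy = "GGGG"

print(toy.count("GGG"))              # non-overlapping occurrences only

first = toy.find("GGG")              # occurrence starting at position 0
second = toy.find("GGG", first + 1)  # overlapping occurrence at position 1
print(first, second)
```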


</div>

5. Consider the following Illumina HiSeq 4000 read: 
"""AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGC
CATCAGAGAGTCTGGGTCTCAGGTACCGCAGTTGTATCTTGCGCGACTATA
ATCCACGGCTCTTATTCTAGCGTGCGCGTACGGCGGTGGGCGTCGTTACGCTATATT""". 

Answer the following questions:

    1. How long is the read (beware of newlines)?
    2. What is the GC content of the read (remember $gc = \frac{G+C}{A+T+C+G}$)?
    3. A Nextera adapter is "AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGCCATCAG". 
    Is it present in the read? How long is it?
    4. Remove the Nextera adapter from the read and recompute the GC content.
    Has GC content increased after adapter trimming?


<div class="tggle" onclick="toggleVisibility('ex5');">Show/Hide Solution</div>
<div id="ex5" style="display:none;">

In [64]:
read = """AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGC
CATCAGAGAGTCTGGGTCTCAGGTACCGCAGTTGTATCTTGCGCGACTATA
ATCCACGGCTCTTATTCTAGCGTGCGCGTACGGCGGTGGGCGTCGTTACGCTATATT"""

read = read.replace("\n","")
print("Read length is: ", len(read), " base pairs")
g = read.count("G")
c = read.count("C")
t = read.count("T")
a = read.count("A")

gc = (g + c) / (a + t + c + g)

print("GC content of read: ", gc)

adapter = "AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGCCATCAG"

print("Is the adapter present? ", adapter in read)
print("Adapter length: ", len(adapter))
print("The adapter starts at: ", read.find(adapter))

trimmed_read = read.replace(adapter,"")

tr_g = trimmed_read.count("G")
tr_c = trimmed_read.count("C")
tr_t = trimmed_read.count("T")
tr_a = trimmed_read.count("A")

tr_gc = (tr_g + tr_c) / (tr_a + tr_t + tr_c + tr_g)
print("GC content of trimmed read: ", tr_gc)
print("GC content has increased after trimming: ", tr_gc > gc)

Read length is:  150  base pairs
GC content of read:  0.56
Is the adapter present?  True
Adapter length:  48
The adapter starts at:  0
GC content of trimmed read:  0.5392156862745098
GC content has increased after trimming:  False
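
Note that ```replace``` removes *every* occurrence of the adapter, wherever it is. When the adapter is known to sit at the start of the read, slicing removes only that prefix. The sketch below uses a short made-up adapter and read (not real Illumina sequences) to show the difference:

```python
adapter = "AATGAT"                  # toy adapter, for illustration only
read = "AATGATCCGGAATGATTT"         # toy read: adapter at the start and once inside

by_replace = read.replace(adapter, "")   # removes both occurrences

if read.startswith(adapter):
    by_slicing = read[len(adapter):]     # removes only the leading adapter
else:
    by_slicing = read

print(by_replace)
print(by_slicing)
```

This prints ```CCGGTT``` and ```CCGGAATGATTT```.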


</div>

6. Consider *geneA*, starting at position 1000 and ending at position 3400, and *geneB*, starting at position 3700 and ending at position 5201. Randomly select a position (*pos*) from 1 to 6000 and check the following:
        a. is pos in geneA?
        b. is pos in geneB?
        c. is pos in between the two genes?
        d. is pos within one of the two genes?
        e. is pos outside both genes?
        f. is pos within 100 bases before the start of geneA?
    To pick a random number you can import the random module and use the random.randint(start,end) function:
    
    ```import random```
    
    ```pos = random.randint(1,6000)```
    
<div class="tggle" onclick="toggleVisibility('ex6');">Show/Hide Solution</div>
<div id="ex6" style="display:none;">

In [44]:
import random

geneA_start = 1000
geneA_end = 3400
geneB_start = 3700
geneB_end = 5201

pos = random.randint(1,6000)
print("Random position is: ", pos)

answerA = (pos >= geneA_start and pos <= geneA_end)
answerB = (pos >= geneB_start and pos <= geneB_end)
answerC = (pos >geneA_end and pos <geneB_start)
answerD = (answerA or answerB)
answerE = (pos < geneA_start or (pos > geneA_end and pos < geneB_start) or (pos > geneB_end))
answerF = (pos >= geneA_start - 100 ) and (pos < geneA_start)
print("Is ", pos, " in geneA [", geneA_start, ",", geneA_end, "]? ", answerA)
print("Is ", pos, " in geneB [", geneB_start, ",", geneB_end, "]? ", answerB)
print("Is ", pos, " between the two genes? ", answerC)
print("Is ", pos, " in one of the two genes? ", answerD)
print("Is ", pos, " outside of both genes? ", answerE)
print("Is ", pos, " within 100 bases from the start of geneA? ", answerF)           

Random position is:  4835
Is  4835  in geneA [ 1000 , 3400 ]?  False
Is  4835  in geneB [ 3700 , 5201 ]?  True
Is  4835  between the two genes?  False
Is  4835  in one of the two genes?  True
Is  4835  outside of both genes?  False
Is  4835  within 100 bases from the start of geneA?  False
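
Two optional remarks, sketched below under the assumption that reproducible runs are wanted: ```random.seed``` fixes the sequence of random numbers (the seed value 2024 is arbitrary), and "outside both genes" is simply the negation of "within one of the two genes".

```python
import random

random.seed(2024)                      # arbitrary seed, chosen only for this sketch
first_draw = random.randint(1, 6000)
random.seed(2024)                      # re-seeding restarts the same sequence
second_draw = random.randint(1, 6000)
print(first_draw == second_draw)       # the two draws are identical

pos = 3500                             # fixed position between the two genes
in_geneA = (pos >= 1000 and pos <= 3400)
in_geneB = (pos >= 3700 and pos <= 5201)
outside_both = not (in_geneA or in_geneB)
print("Is", pos, "outside of both genes?", outside_both)
```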


</div>

7. The DNA-binding domain of the Tumor Suppressor Protein TP53 can be represented by the string:

chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
EPHHELPPGSTKRALPNNT"""

Answer the following questions:

    1. How many lines is the sequence written on?
    2. How long is the sequence (remove newlines)?
    3. Create a new sequence with all new lines removed
    4. How many cysteines "C" and histidines "H" are there in the sequence?
    5. Does the chain contain the sub-sequence "NLRVEYLDDRN"? Where?
    6. Extract the first line of the sequence (Hint: use find and string slicing).

<div class="tggle" onclick="toggleVisibility('ex7');">Show/Hide Solution</div>
<div id="ex7" style="display:none;">

In [53]:
chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
EPHHELPPGSTKRALPNNT"""

print(chain_a)
lines = chain_a.count('\n') + 1
print("The sequence is in ", lines, " lines")

sequence = chain_a.replace("\n","")
print("The sequence has ", len(sequence), " amino acids")
print("The sequence counts ", sequence.count('C'), " cysteines")
print("The sequence counts ", sequence.count('H'), " histidines")
subseq = "NLRVEYLDDRN"
print("Does the sequence contain ", subseq, "?", subseq in sequence )
pos = sequence.find(subseq)
getS = sequence[pos:pos+len(subseq)]
print(subseq, " is present at pos: ", pos , "[check:", getS , "]")

end_first_line = chain_a.find('\n')
print("The first line is: ", chain_a[0:end_first_line])

SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
EPHHELPPGSTKRALPNNT
The sequence is in  6  lines
The sequence has  219  amino acids
The sequence counts  10  cysteines
The sequence counts  9  histidines
Does the sequence contain  NLRVEYLDDRN ? True
NLRVEYLDDRN  is present at pos:  106 [check: NLRVEYLDDRN ]
The first line is:  SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
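
As an alternative sketch, the string method ```splitlines()``` splits a multi-line string into a list of lines (lists will be covered in a later practical). The short made-up sequence below is used only to keep the example readable:

```python
chain = """SSSV
PSQK
TY"""

lines = chain.splitlines()   # list with one string per line
print(len(lines))            # number of lines
print(lines[0])              # first line
print("".join(lines))        # all lines glued together, newlines removed
```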


</div>

8. Calculate the zeros of the equation $ax^2-b = 0$ where a = 10 and b = 1. Hint: use math.sqrt or ** 0.5. Finally check that substituting the obtained value of x in the equation gives zero. 

<div class="tggle" onclick="toggleVisibility('ex8');">Show/Hide Solution</div>
<div id="ex8" style="display:none;">

In [45]:
import math

A = 10
B = 1

X = math.sqrt(B/A)

print("10X**2 - 1 = 0 for X:", X)
print(10*X**2 -1 == 0)

10X**2 - 1 = 0 for X: 0.31622776601683794
True
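
The equation has two zeros, $x = \pm\sqrt{b/a}$, and the solution above checks the positive one. The exact comparison ```== 0``` works for these values, but floating point rounding can make such checks fail in general, so a tolerance-based comparison is usually safer. A minimal sketch using ```math.isclose``` (the tolerance ```1e-12``` is an arbitrary choice):

```python
import math

a = 10
b = 1

x_pos = math.sqrt(b / a)   # positive zero
x_neg = -x_pos             # negative zero

residual_pos = a * x_pos**2 - b
residual_neg = a * x_neg**2 - b

# compare with zero up to a small absolute tolerance
print(math.isclose(residual_pos, 0.0, abs_tol=1e-12))
print(math.isclose(residual_neg, 0.0, abs_tol=1e-12))
```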


</div>