# <center> Python Tutorial Session 2 </center>

## What we covered in our previous session:
 - Error Types and Error Handling
 - List Comprehension
 - Dictionary Comprehension
 - Lambda Functions
 - Importing Modules
 - The Math module

## <font color=green>Table of Contents IIb </font>
 - [The Collections module](#collections)
 - [The OS module](#os)
 - [Working with random numbers](#random)
 - [Working with Time](#time)
 - [Regular Expressions](#re)
 - [Bioinformatics Examples](#Bioinformatics)
 - [What we covered](#Conclusion)

## <a id="collections"> The Collections Module </a>

In [1]:
import collections

In [3]:
print(collections.Counter("AAAAATTTGGGG"))
print(collections.Counter(['A','A','T','T']))
print(collections.Counter(["apple","dog","dog","apple"]))
print(collections.Counter({"apple":10,"banana":11}))

Counter({'A': 5, 'G': 4, 'T': 3})
Counter({'A': 2, 'T': 2})
Counter({'apple': 2, 'dog': 2})
Counter({'banana': 11, 'apple': 10})


You can also find the most common items in a collection using the most_common() method.  Here I print the 2 most common items in the collection.

In [4]:
collections.Counter("AAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGCCCCCTTTAAA").most_common(2)

[('G', 24), ('A', 11)]
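A Counter behaves like a dictionary with a few extras. The short sketch below (the variable names are my own, chosen for illustration) shows three behaviors that are handy when counting bases: missing keys return 0, update() adds more counts, and two Counters can be added together.

```python
from collections import Counter

counts = Counter("AATTG")

# Missing keys return 0 instead of raising a KeyError.
print(counts["C"])          # 0

# update() adds counts from another iterable.
counts.update("GGC")
print(counts["G"])          # 3

# Counters can be added together.
total = Counter("AA") + Counter("AT")
print(total)                # Counter({'A': 3, 'T': 1})
```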

A deque created with the maxlen argument stores at most a fixed number of items. Once it is full, appending a new item discards an item from the opposite end.

In [5]:
from collections import deque

In [6]:
j=deque([1,2,3,4,5],maxlen=10)

In [7]:
j.append(6)
j.append(7)
j.append(8)
j.append(9)
j.append(10)
j.append(11)

In [8]:
j

deque([2, 3, 4, 5, 6, 7, 8, 9, 10, 11], maxlen=10)

Notice that the first value in the collection was removed when the eleventh item was appended. The same behavior can be achieved with a list, but it requires extra code; deque is a more elegant solution. You can also pop from either the left or the right.

In [9]:
print(j.pop())
print(j.popleft())
print(j)

11
2
deque([3, 4, 5, 6, 7, 8, 9, 10], maxlen=10)
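To make the comparison with a list concrete, here is a minimal sketch. The helper append_bounded is hypothetical (it is not part of Python); it only illustrates the extra code a plain list would need to behave like a deque with maxlen.

```python
from collections import deque

def append_bounded(items, value, maxlen):
    """Hypothetical helper: append to a list and drop the oldest item if it grows too long."""
    items.append(value)
    if len(items) > maxlen:
        items.pop(0)  # removing from the front of a list is O(n)
    return items

as_list = [1, 2, 3]
append_bounded(as_list, 4, maxlen=3)

as_deque = deque([1, 2, 3], maxlen=3)
as_deque.append(4)  # the oldest item (1) is discarded automatically

print(as_list)         # [2, 3, 4]
print(list(as_deque))  # [2, 3, 4]

# appendleft() pushes onto the left end and discards from the right instead.
as_deque.appendleft(0)
print(list(as_deque))  # [0, 2, 3]
```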


If you want the fields of your tuple to have names, you can use the namedtuple function. First, you give your tuple a type name; in the example below, the type name is "Food".  Next, within square brackets, you list the field names.  To assign values, call the "types_of_food" variable as if it were a function and pass in your items.  You can then use "." to access a field by name.

In [10]:
types_of_food=collections.namedtuple('Food',["fruits","vegetables","juice"])
food_tuple=types_of_food("apple","spinach","orange juice")
print(food_tuple)

Food(fruits='apple', vegetables='spinach', juice='orange juice')


In [11]:
print(food_tuple.fruits)
print(food_tuple.vegetables)

apple
spinach
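A named tuple is still a tuple, so it supports indexing and unpacking, and it cannot be changed in place. The sketch below reuses the field names from the example above (the variable names Food and meal are my own) and shows the _replace() and _asdict() helper methods.

```python
from collections import namedtuple

Food = namedtuple("Food", ["fruits", "vegetables", "juice"])
meal = Food("apple", "spinach", "orange juice")

# Fields can be read by position as well as by name.
print(meal[0])              # apple

# Named tuples unpack like ordinary tuples.
fruit, vegetable, juice = meal

# _replace() returns a new tuple; the original is unchanged (tuples are immutable).
new_meal = meal._replace(juice="apple juice")
print(new_meal.juice)       # apple juice
print(meal.juice)           # orange juice

# _asdict() converts the named tuple into a dictionary.
print(meal._asdict()["vegetables"])  # spinach
```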


In [12]:
from collections import namedtuple
mods=namedtuple("modification",["nucleotide","modification"])
new_mods=mods(["A","U"],['m6A','pseudouridine'])

In [13]:
new_mods.nucleotide

['A', 'U']

In [14]:
new_mods.modification

['m6A', 'pseudouridine']

In [15]:
new_mods.nucleotide[1]

'U'

In [16]:
new_mods

modification(nucleotide=['A', 'U'], modification=['m6A', 'pseudouridine'])

The example above shows that lists can be stored inside namedtuples as well.  
Below is a different type of container called ChainMap.  It allows us to combine multiple dictionaries.

In [17]:
nucleotide={"A":'Adenosine',
           'T':"Thymine",
           "C":"Cytosine",
           "G":"Guanine"}
modification={"Adenosine":"m6A","Cytosine":'m5C'}
combined=collections.ChainMap(nucleotide,modification)

In [18]:
print(combined["Adenosine"])
print(combined['A'])

m6A
Adenosine


While I called the ChainMap "combined", in reality ChainMap does not merge the dictionaries. It keeps track of the original dictionaries in a list and searches them in order when you look up a key.
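The lookup order matters when the same key appears in more than one dictionary. Here is a small sketch with made-up settings dictionaries (defaults and user_settings are illustrative names, not from the example above).

```python
from collections import ChainMap

defaults = {"k": 4, "rounding": 1}
user_settings = {"k": 3}

# Earlier mappings take priority when a key appears in more than one.
settings = ChainMap(user_settings, defaults)
print(settings["k"])         # 3
print(settings["rounding"])  # 1

# The underlying dictionaries are kept, not copied, in the .maps list.
print(settings.maps)         # [{'k': 3}, {'k': 4, 'rounding': 1}]

# Changes to an underlying dictionary are visible through the ChainMap.
defaults["rounding"] = 2
print(settings["rounding"])  # 2
```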

The collections library also has other classes such as <b>OrderedDict, defaultdict, etc. </b>  I am not going to cover them here and leave the reader to look them up if necessary.

## <a id="os"> The OS Module </a>

The next module we are going to talk about is the OS module, which can be used to get the current working directory, change directory, etc.

In [19]:
import os

In [20]:
os.getcwd()

'C:\\Users\\banskotan2\\Documents\\Python Class'

In [21]:
os.cpu_count()

20

In [22]:
os.listdir()

['.ipynb_checkpoints',
 'new_file.txt',
 'Session 1a.ipynb',
 'Session 1b.ipynb',
 'Session 2a.ipynb',
 'Session 2b.ipynb']

The module also provides os.walk() for traversing a directory tree, as well as os.mkdir(), os.rmdir(), os.chdir(), and more. <br>
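As a rough illustration of os.walk(), the sketch below builds a small throwaway directory tree with the tempfile module (the file and folder names are invented for the example) and then lists every file in it.

```python
import os
import tempfile

# Build a small throwaway directory tree so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "data"))
    for name in ["a.txt", os.path.join("data", "b.txt")]:
        with open(os.path.join(root, name), "w") as handle:
            handle.write("example\n")

    # os.walk yields (directory, subdirectories, files) for every directory in the tree.
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            found.append(os.path.relpath(full_path, root))

    print(sorted(found))
```

Using os.path.join() instead of typing "/" or "\\" by hand keeps the code working on both Windows and Unix-like systems.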


## <a id="random"> Working with random numbers </a> <br>
The <b> random </b> module can be used to generate random numbers.

Note also that the numpy package has its own random functions as well!

In [24]:
import random

The random.random() function generates a floating-point value between 0 and 1 (0 is included, 1 is not).

In [25]:
random.random()

0.9056917606447467

In [26]:
random.random()

0.9380524590177166

The random.randint() function generates an integer between two values, with both endpoints included.

In [27]:
random.randint(0,1)

0

In [28]:
random.randint(0,100)

58

For reproducibility, set the seed so that the same sequence of values is produced on every run.  This can be important for algorithms that rely on random numbers, for example the k-means algorithm.

In [29]:
random.seed(5)
random.randint(5,10)

9

In [30]:
random.randint(5,10)

7

In [31]:
random.seed(5)
random.randint(5,10)

9
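random.seed() changes the state of the shared, module-level generator. If you would rather not touch that global state, one option is to create your own random.Random instance with its own seed. A minimal sketch (rng_a and rng_b are illustrative names):

```python
import random

# A dedicated generator with its own seed; the global random state is untouched.
rng_a = random.Random(5)
rng_b = random.Random(5)

draws_a = [rng_a.randint(5, 10) for _ in range(3)]
draws_b = [rng_b.randint(5, 10) for _ in range(3)]

# Same seed, same sequence of values.
print(draws_a == draws_b)  # True
```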

Some other random number functions:

In [32]:
random.choice([1,2,3,4,5])

3

In [33]:
random.choice(["A","T","G","C"])

'G'

In [34]:
random.choice("ATGC")

'A'

In [35]:
random.seed(10)
numbers=[1,2,3,4]
random.shuffle(numbers)
print(numbers)

[4, 3, 2, 1]
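Two more functions that may be useful are random.choices(), which draws with replacement, and random.sample(), which draws without replacement. The sketch below uses them to build a random DNA string and to pick distinct numbers; the sequence length and number range are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same results

# choices() draws with replacement, so it can build a sequence of any length.
random_dna = "".join(random.choices("ATGC", k=12))
print(len(random_dna))  # 12

# sample() draws without replacement, so every picked item is distinct.
picked = random.sample(range(1, 11), k=3)
print(len(set(picked)))  # 3
```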


## <a id="time"> Working with Time </a> <br>
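The time module provides time-related functions. For example, time.time() returns the current time as the number of seconds since the epoch (January 1, 1970 on most systems).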

In [36]:
import time

In [37]:
print(time.time())

1699990235.907581


The time.time() function can be used to check how long a piece of code takes to run: record the time before and after, then subtract.

In [38]:
start=time.time()
for i in range(100000000):
    pass
end=time.time()
print(f"time elapsed:{end-start}")

time elapsed:1.9190759658813477


In [39]:
start=time.time()
for i in range(500000000):
    pass
end=time.time()
print(f"time elapsed:{end-start}")

time elapsed:9.887067556381226
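For timing short pieces of code, the time module also offers time.perf_counter(), which is intended for measuring short durations. The sketch below follows the same start/end pattern as above; the summation is just a stand-in for whatever work you want to time.

```python
import time

start = time.perf_counter()
total = sum(range(1_000_000))   # stand-in for the work being timed
elapsed = time.perf_counter() - start

print(f"time elapsed: {elapsed:.4f} seconds")
```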


Jupyter Notebook also comes with the timeit magic command that can be run as follows:

In [40]:
%%timeit
for i in range(100000):
    pass


962 µs ± 25.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
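Outside of Jupyter, similar measurements can be made with the timeit module from the standard library. This is only a sketch: the function name empty_loop and the number of repetitions are my own choices.

```python
import timeit

def empty_loop():
    for i in range(100000):
        pass

# Run the function 20 times and report the total time in seconds.
runs = 20
total_seconds = timeit.timeit(empty_loop, number=runs)
print(f"{total_seconds / runs * 1e6:.1f} µs per loop")
```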


## <a id="re"> Regular Expressions</a>

The re module is used for working with regular expressions. First of all, what is a regular expression?  It is a pattern that describes the strings you are interested in matching.  Many different programming languages support regular expressions, including Perl, R, and Python.  We will load the package after going through a few operators worth knowing. <br>



<font size="+5"><b>*</b></font>: Matches zero or more repetitions of the preceding character. Greedy.

<font size="+5"><b>*?</b></font>: Matches zero or more repetitions, but is not greedy.

<font size="+5"><b>+</b></font>: Matches one or more repetitions of the preceding character. Greedy.

<font size="+5"><b>+?</b></font>: Matches one or more repetitions, but is not greedy.

<font size="+5"><b>.</b></font>: Dot matches one character except newline.

<font size="+5"><b>^</b></font>: Matches the start of the string.

<font size="+5"><b>$</b></font>: Matches the end of the string.

<font size="+5"><b>{m}</b></font>: Matches exactly m repetitions of the preceding regular expression.

<font size="+5"><b>{m,n}</b></font>: Matches m to n repetitions. Greedy.

<font size="+5"><b>{m,n}?</b></font>: Matches m to n repetitions, but is not greedy.

<font size="+5"><b>\</b></font>: Escapes special characters. For example, to match a literal ".", which is otherwise a special character, put a backslash in front of it.

<font size="+5"><b>[]</b></font>: Matches any one of the characters inside the square brackets.

<font size="+5"><b>|</b></font>:  For example A|B matches either A or B 

<font size="+5"><b>(...)</b></font>:  Where "..." is a regular expression. For group capture.

<font size="+5"><b>\number</b></font>:  For example \1.  This matches the same text that group 1 matched.

<font size="+5"><b>\b</b></font>:  Matches the empty string at the beginning or end of a word.

<font size="+5"><b>\B</b></font>:  Matches the empty string NOT at the beginning or end of a word.

<font size="+5"><b>\d</b></font>:  Matches a digit.

<font size="+5"><b>\D</b></font>:  Matches a character that is not a digit.

<font size="+5"><b>\s</b></font>:  Matches a whitespace character.

<font size="+5"><b>\S</b></font>:  Matches a character that is not whitespace.

<font size="+5"><b>\w</b></font>: Matches a word character (a letter, digit, or underscore).

<font size="+5"><b>\W</b></font>: Matches a character that is not a word character.


<b> And there are many other operators out there</b>
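To see a few of the operators listed above in action, here is a short sketch. The test strings are invented for illustration, and the patterns are written as raw strings (r"...") so that backslashes reach the re module unchanged.

```python
import re

# Greedy vs non-greedy: .* grabs as much as possible, .*? as little as possible.
print(re.search(r"A.*T", "ATGT").group())    # ATGT
print(re.search(r"A.*?T", "ATGT").group())   # AT

# {m,n} repetition and the \d digit class.
print(re.findall(r"\d{2,3}", "7 42 123 4567"))  # ['42', '123', '456']

# A character set and alternation.
print(re.findall(r"[GC]", "ATGCAT"))          # ['G', 'C']
print(re.findall(r"cat|dog", "cat dog cow"))  # ['cat', 'dog']

# \b word boundary: match "at" only as a whole word.
print(re.findall(r"\bat\b", "cat at bat"))    # ['at']

# Backreference: \1 repeats whatever group 1 matched.
print(re.search(r"(AT)\1", "GGATATCC").group())  # ATAT
```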

In [42]:
import re

In [43]:
txt="I want to know. Have you ever seen the rain?"

The re.search() function scans the string for the first location where the pattern matches.  If it finds a match, it returns a Match object; otherwise it returns None.

In [44]:
re.search("edljdkd",txt)

In [45]:
re.search(".*know\\.",txt)

<re.Match object; span=(0, 15), match='I want to know.'>

In [46]:
if re.search("know",txt): print("yes")

yes


In [47]:
if re.search("gibberish",txt):print("Yes")

In [48]:
if not re.search("gibberish",txt):print("No")

No
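Because re.search() returns None when nothing matches, it is usually safer to store the result and check it before calling methods on it. The sketch below shows that pattern, and also uses a raw string so the escaped dot can be written with a single backslash.

```python
import re

txt = "I want to know. Have you ever seen the rain?"

# A raw string (r"...") lets us write \. with a single backslash.
match = re.search(r"know\.", txt)
if match:
    print(match.group())   # know.

# With no match, re.search returns None, so check before calling methods on it.
missing = re.search("gibberish", txt)
print(missing is None)     # True
```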


In [51]:
txt="ATTTTT"

In [52]:
re.search("ATG*",txt)

<re.Match object; span=(0, 2), match='AT'>

The pattern above matches because the * allows the preceding G to occur zero or more times, so "AT" followed by no G still matches.
<br>
The original string is available through the .string attribute of the match.

In [54]:
txt="AAAATTTTAAAGGGGCCTTT"
re.search("ATG*",txt).string

'AAAATTTTAAAGGGGCCTTT'

The .string attribute does not appear to be that informative.  However, the group() method will return the match, which can be useful.

In [55]:
txt="AAAATTTTAAAGGGGCCTTT"
re.search("ATG*",txt).group()

'AT'

Here are three examples for comparison.  First, we call group() without an argument, which returns the full match.<br>
Next, we pass 1 to group(), which returns the first captured group.
After this, we pass 2, which returns the second captured group.

In [56]:
txt="AAAATTTTAAAGGGGCCTTT"
print(re.search("(ATG*).*(G.*CT)",txt).group())
print(re.search("(ATG*).*(G.*CT)",txt).group(1))
print(re.search("(ATG*).*(G.*CT)",txt).group(2))


ATTTTAAAGGGGCCT
AT
GCCT


The span() method returns the start and end indices of the match as a tuple.  As with group(), we can specify the group number.

In [57]:
txt="AAAATTTTAAAGGGGCCTTT"
print(re.search("(ATG*).*(G.*CT)",txt).span())
print(re.search("(ATG*).*(G.*CT)",txt).span(1))
print(re.search("(ATG*).*(G.*CT)",txt).span(2))


(3, 18)
(3, 5)
(14, 18)
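The indices returned by span() can be used to slice the original string. The sketch below also searches only once and reuses the Match object, which avoids repeating the same re.search() call on every line.

```python
import re

txt = "AAAATTTTAAAGGGGCCTTT"
match = re.search("(ATG*).*(G.*CT)", txt)

# Search once, keep the Match object, and reuse it.
start, end = match.span(2)
print(txt[start:end])                    # GCCT
print(txt[start:end] == match.group(2))  # True

# groups() returns all captured groups as a tuple.
print(match.groups())                    # ('AT', 'GCCT')
```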


The next is the findall() function, which returns all matches.

In [58]:
re.findall("A",txt)

['A', 'A', 'A', 'A', 'A', 'A', 'A']

The findall() function returns an empty list if there is no match as can be seen below.

In [59]:
txt="AAAATTTTAAAGGGGCCTTT"
re.findall("ATGC",txt)

[]
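If you also need the position of every match, re.finditer() returns Match objects instead of plain strings. The sketch below also shows that findall() returns the contents of the capture group when the pattern contains one.

```python
import re

txt = "AAAATTTTAAAGGGGCCTTT"

# finditer yields one Match object per non-overlapping match.
for match in re.finditer("A+", txt):
    print(match.group(), match.span())
# AAAA (0, 4)
# AAA (8, 11)

# With one capture group, findall returns the group contents rather than the full match.
print(re.findall("A(T+)", txt))   # ['TTTT']
```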

The re package also has a split() function, which is more powerful than the built-in str.split() method because it can split on a pattern.

In [60]:
txt="I am going to the park"
re.split("am|the",txt)

['I ', ' going to ', ' park']

In the example above, we use the | (or) operator to split wherever the match is either "am" or "the".  Note that the match is not included in the returned list.

In [61]:
re.split("am|the",txt,maxsplit=1)

['I ', ' going to the park']

The sub() function can be used for substitution. <br>
The syntax is re.sub(\<pattern\>, \<replacement\>, \<full text\>).
An example is shown below.  

In [62]:
ensembl_ID="ENSG000000003"
re.sub("ENSG","ENSMUSG",ensembl_ID)

'ENSMUSG000000003'
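Since the first argument of re.sub() is a pattern, it can use anchors and groups. The sketch below is one possible variation on the example above: it only rewrites an ID that starts with "ENSG" followed by digits, and it shows the optional count argument.

```python
import re

ensembl_ID = "ENSG000000003"

# Anchor the pattern with ^ and $ so only a whole ID of the form ENSG + digits is rewritten,
# and reuse the digits captured by group 1 in the replacement.
print(re.sub(r"^ENSG(\d+)$", r"ENSMUSG\1", ensembl_ID))  # ENSMUSG000000003

# count limits how many replacements are made.
print(re.sub("A", "N", "AAAT", count=2))   # NNAT
```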

## <a id="Bioinformatics">Bioinformatics Examples </a>

#### Example 1: 
Count the number of A,T,G,C in a sequence.

In [63]:
from collections import Counter
def count_sequence(sequence):
    return Counter(sequence)

In [64]:
print(count_sequence("AAAAAAAAAATTTTTTTTGCCCC"))

Counter({'A': 10, 'T': 8, 'C': 4, 'G': 1})


#### Example 2:
Report the percentage of A, T, G, and C in a sequence.

In [65]:
def percentage_sequence(sequence,rounding=1):
    seq_dict=dict(Counter(sequence))
    total=sum(seq_dict.values())  # total number of bases, computed once
    return {key:round(100*(value/total),rounding) for key,value in seq_dict.items()}

In [66]:
print(percentage_sequence("AAAAAAAAAATTTTTTTTGCCCC"))

{'A': 43.5, 'T': 34.8, 'G': 4.3, 'C': 17.4}


In [67]:
print(percentage_sequence("AAAAAAAAAATTTTTTTTGCCCC",2))

{'A': 43.48, 'T': 34.78, 'G': 4.35, 'C': 17.39}


In [68]:
print(percentage_sequence("AAAATTTTGGGGCCCC",1))

{'A': 25.0, 'T': 25.0, 'G': 25.0, 'C': 25.0}


#### Example 3 <br>
Split a sequence into consecutive, non-overlapping kmers of size k.  If the final piece is shorter than k, exclude it.

In [69]:
def kmer_size_n(sequence,k=4):
    # Step through the sequence k bases at a time and keep only full-length kmers.
    return [sequence[n:(n+k)] for n in range(0,len(sequence),k) if len(sequence[n:(n+k)])==k]

In [70]:
kmer_size_n("ATGCAT")

['ATGC']

In [71]:
kmer_size_n("ATGCAATT")

['ATGC', 'AATT']

For a sequence shorter than k, such as kmer_size_n("AAA"), the function returns an empty list.

In [72]:
kmer_size_n("AAAATTGCA")

['AAAA', 'TTGC']

In [73]:
kmer_size_n("ATGATC",3)

['ATG', 'ATC']
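The function above cuts the sequence into non-overlapping pieces. In many applications, k-mers are instead taken as overlapping windows that slide one base at a time. A minimal sketch of that variant follows; sliding_kmers is a hypothetical helper name, not part of the examples above.

```python
def sliding_kmers(sequence, k=4):
    """Hypothetical helper: every overlapping window of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(sliding_kmers("ATGCAT"))   # ['ATGC', 'TGCA', 'GCAT']
print(sliding_kmers("AAA"))      # []
```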

#### Example 4 <br>
Return a kmer only if it is of size k and it is a palindrome.  Here, a palindrome is a sequence that reads the same when reversed.  For example, the reverse of ATA is also ATA.

In [74]:
def kmer_palindrome(sequence,k=4):
    # Split into consecutive kmers, then keep full-length kmers that equal their own reverse.
    kmers=[sequence[n:(n+k)] for n in range(0,len(sequence),k)]
    return [kmer for kmer in kmers if len(kmer)==k and kmer==kmer[::-1]]

In [75]:
kmer_palindrome("ATAAAATTATATATTA")

['ATTA']

In [76]:
kmer_palindrome("ATATATTTAATT",k=3)

['ATA', 'TAT']

#### Example 5

Return the reverse complement of a sequence.

In [77]:
def reverse_complement(seq):
    # Swap each base for its complement, then reverse the resulting string.
    return seq.translate(str.maketrans("ATGC","TACG"))[::-1]

In [78]:
reverse_complement("ATGCATAT")

'ATATGCAT'

In [79]:
reverse_complement("AAAATTTTTGGGCCC")

'GGGCCCAAAAATTTT'

In [80]:
reverse_complement("GCATATCTTAAGGCC")

'GGCCTTAAGATATGC'
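Example 4 treated a palindrome as a sequence that reads the same when reversed. In molecular biology, "palindromic" usually refers instead to a sequence that equals its own reverse complement, such as the EcoRI recognition site GAATTC. Here is a small sketch of that check; is_revcomp_palindrome is a hypothetical helper name, and reverse_complement is repeated so the block runs on its own.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ATGC", "TACG"))[::-1]

def is_revcomp_palindrome(seq):
    """Hypothetical helper: True if seq equals its own reverse complement."""
    return seq == reverse_complement(seq)

print(is_revcomp_palindrome("GAATTC"))  # True
print(is_revcomp_palindrome("ATTA"))    # False
```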

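#### Example 6

Translate a coding sequence into protein, given just the coding region.  Note that the codon table below is written with T rather than U, so the function expects the sequence in its DNA form.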

In [81]:
def translate_seq(RNA):
    translation_dict={"TTT":"F",
                      "TTC":"F",
                      "TTA":"L",
                      "TTG":"L",
                      "CTT":"L",
                      "CTC":"L",
                      "CTA":"L",
                      "CTG":"L",
                      "ATT":"I",
                      "ATC":"I",
                      "ATA":"I",
                      "ATG":"M",
                      "GTT":"V",
                      "GTC":"V",
                      "GTA":"V",
                      "GTG":"V",
                      "TCT":"S",
                      "TCC":"S",
                      "TCA":"S",
                      "TCG":"S",
                      "CCT":"P",
                      "CCC":"P",
                      "CCA":"P",
                      "CCG":"P",
                      "ACT":"T",
                      "ACC":"T",
                      "ACG":"T",
                      "ACA":"T",
                      "GCT":"A",
                      "GCC":"A",
                      "GCA":"A",
                      "GCG":"A",
                      "TAT":"Y",
                      "TAC":"Y",
                      "TAA":"*",
                      "TAG":"*",
                      "CAT":"H",
                      "CAC":"H",
                      "CAA":"Q",
                      "CAG":"Q",
                      "AAT":"N",
                      "AAC":"N",
                      "AAA":"K",
                      "AAG":"K",
                      "GAT":"D",
                      "GAC":"D",
                      "GAA":"E",
                      "GAG":"E",
                      "TGT":"C",
                      "TGC":"C",
                      "TGA":"*",
                      "TGG":"W",
                      "CGT":"R",
                      "CGC":"R",
                      "CGA":"R",
                      "CGG":"R",
                      "AGT":"S",
                      "AGC":"S",
                      "AGA":"R",
                      "AGG":"R",
                      "GGT":"G",
                      "GGC":"G",
                      "GGA":"G",
                      "GGG":"G"}
    return "".join([translation_dict[RNA[k:k+3]] for k in range(0,len(RNA),3)])
    

In [82]:
translate_seq("ATGCAG")

'MQ'

In [83]:
sequence_from_function=translate_seq("ATGGCGGCTAACGCTACTACCAACCCGTCGCAGCTGCTGCCCTTAGAGCTTGTGGACAAATGTATAGGATCAAGAATTCACATCGTGATGAAGAGTGATAAGGAAATTGTTGGTACTCTTCTAGGATTTGATGACTTTGTCAATATGGTACTGGAAGATGTCACTGAGTTTGAAATCACACCAGAAGGAAGAAGGATTACTAAATTAGATCAGATTTTGCTAAATGGAAATAATATAACAATGCTGGTTCCTGGAGGAGAAGGACCTGAAGTGTGA")

In [84]:
print(sequence_from_function)

MAANATTNPSQLLPLELVDKCIGSRIHIVMKSDKEIVGTLLGFDDFVNMVLEDVTEFEITPEGRRITKLDQILLNGNNITMLVPGGEGPEV*


In [85]:
actual_sequence="MAANATTNPSQLLPLELVDKCIGSRIHIVMKSDKEIVGTLLGFDDFVNMVLEDVTEFEITPEGRRITKLDQILLNGNNITMLVPGGEGPEV*"

In [86]:
sequence_from_function==actual_sequence

True
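The function above raises a KeyError if the input contains U or if its length is not a multiple of 3. Below is a sketch of a more forgiving variant, under my own assumptions about the desired behavior: it accepts U or T, stops at the first stop codon, and ignores a trailing incomplete codon. The codon table is built compactly from the standard genetic code, and translate_until_stop is a hypothetical name.

```python
from itertools import product

# Standard genetic code, built compactly: codons in TCAG order paired with amino acids.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_until_stop(seq):
    """Hypothetical helper: translate up to, but not including, the first stop codon."""
    seq = seq.upper().replace("U", "T")   # accept RNA (U) or DNA (T) input
    protein = []
    # Stop at len(seq) - 2 so a trailing incomplete codon is ignored.
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_until_stop("AUGCAGUAA"))  # MQ
print(translate_until_stop("ATGCA"))      # M
```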

## <a id="Conclusion"> What we covered </a>

Print, Input statements

Conditionals

Functions

Iteration/Loops

String manipulation

Lists, Tuples, Sets, Dictionaries

List and Dictionary Comprehension

Collections package

Error Types and error handling

OS package

Math package

Time package

Random package


<b>What we have not covered (and we can cover in future classes)</b>:<br>
- Generators
- Decorators
- Python for Bioinformatics
    - Biopython
- Scientific Computing, Data Analysis and Visualization
    - Pandas
    - Matplotlib, Seaborn
    - Numpy
    - Scipy
- Deep Learning
    - Packages:
        - Tensorflow
        - Pytorch
    - Algorithms that can be covered:
        - Fully-connected neural network
        - Convolutional Neural Network
    - Natural Language Processing:
        - Large Language Models (LLMs) using Hugging Face