In [28]:
from IPython.display import display, HTML
# Larger display 
display(HTML("<style>.container { width:90% !important; }</style>"))

# High-performance container datatypes from the standard library

**The Python standard library contains a very useful module that simplifies file parsing: "collections" (see the [documentation](https://docs.python.org/3.5/library/collections.html) for detailed information)**

**This module implements specialized container datatypes providing alternatives to Python’s native data structures (dict, list, set...)**

**Two new containers are particularly useful:**
* Counter
* OrderedDict

---
## Counter

**A [Counter](https://docs.python.org/3.5/library/collections.html#collections.Counter) container is provided to support convenient and rapid counting of specific occurrences. See also [defaultdict](https://docs.python.org/3.5/library/collections.html#collections.defaultdict) for a generalization to value types other than integers.**

***Example: counting characters in a string*** 

In [None]:
from collections import Counter

In [None]:
random_text = """Ukip is likely to be asked to repay tens of thousands of euros by European parliament finance chiefs
who have accused the party of misspending EU funds on party workers and Nigel Farage’s failed bid to win a seat in
Westminster.The Alliance for Direct Democracy in Europe (ADDE), a Ukip-dominated political vehicle, will be asked to
repay €173,000 (£148,000) in misspent funds and denied a further €501,000 in EU grants for breaking European rules
that ban spending EU money on national election campaigns and referendums. According to a European parliament audit
report seen by the Guardian, Ukip spent EU funds on polling and analysis in constituencies where they hoped to win a 
seat in the 2015 general election, including the South Thanet seat that party leader Farage contested. The party also
funded polls to gauge the public mood on leaving the EU, months before the official campaign kicked off in April 2016"""

In [None]:
# Example with a Counter 
c = Counter()

# Iterate over each character of the string
for character in random_text:
    # Increment the counter for the current element
    c[character.lower()] += 1

# Order by most frequent element
c.most_common()

In [None]:
# Same thing but with native collections
d = {}

# Iterate over each character of the string
for character in random_text:
    # Normalize the case once so the membership test and the update use the same key
    key = character.lower()
    # If the element is not in the dict we have to create an entry first
    if key not in d:
        d[key] = 0
    # Increment the counter for the current element
    d[key] += 1
    
# Order by most frequent element
sorted(d.items(), key=lambda t: t[1], reverse=True)
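
The `defaultdict` mentioned earlier generalizes the same idea to other default values. The following is a minimal sketch, assuming a hypothetical word list: missing keys are created on first access with an empty list, so no explicit existence test is needed.

```python
from collections import defaultdict

# Hypothetical data: group words by their first letter
words = ["apple", "apricot", "banana", "blueberry", "cherry"]

# list() is called to build the default value of any missing key
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

print(dict(by_letter))
```

Note that reading a missing key from a `defaultdict` also creates it, which is convenient when filling the container but can be surprising when only querying it.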

***Example: Randomly selected global event with catastrophic consequences***

In [None]:
c = Counter()

# Open the file
with open("../data/US_election vote_sample.txt", "r") as fp:
    # Separate words by tabulations
    for candidate in fp.read().split("\t"):
        # Increment the counter for the current element
        c[candidate] += 1

# Order by most frequent element
c.most_common()

***Example: Counting feature types in a gene annotation file (gff3)*** 

In [None]:
# print the 2 first lines of a file to analyse the file structure
with open ("../data/gencode_random.gff3", "r") as fp:
    for i in range (2):
        print (next(fp))

The "feature type" is the 3rd tab-separated field of each line => index 2 in Python

In [None]:
c = Counter()

# Open the file
with open ("../data/gencode_random.gff3", "r") as fp:
    # Iterate over lines
    for line in fp:
        # Split the line on tabulations and get the 3rd field (index 2)
        feature_type = line.split("\t")[2]
        # Increment the counter
        c[feature_type] += 1
        
# Order by most frequent element
c.most_common()
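
Real GFF3 files usually start with `##` directive lines and may contain blank lines, which would make the index-based split fail. Here is a hedged sketch of the same counting loop with those lines skipped. The lines below are hypothetical and held in memory so that the example does not depend on a file.

```python
from collections import Counter

# Hypothetical GFF3 content (not taken from gencode_random.gff3)
gff3_lines = [
    "##gff-version 3\n",
    "chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tID=exon1\n",
    "chr1\tHAVANA\tCDS\t120\t180\t.\t+\t0\tID=cds1\n",
    "\n",
    "chr2\tENSEMBL\texon\t300\t400\t.\t-\t.\tID=exon2\n",
]

feature_counts = Counter()
for line in gff3_lines:
    # Skip directive/comment lines and blank lines
    if line.startswith("#") or not line.strip():
        continue
    # The feature type is the 3rd tab-separated field (index 2)
    fields = line.rstrip("\n").split("\t")
    feature_counts[fields[2]] += 1

print(feature_counts.most_common())
```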

---
## OrderedDict

**In Python versions before 3.7, the order of the elements in a standard dictionary is not guaranteed and may not match the insertion order. In many situations this can be annoying, particularly if the order of elements in a parsed file matters (fastq, fasta...). Since Python 3.7, standard dictionaries also preserve insertion order.**

**[Ordered dictionaries](https://docs.python.org/3.5/library/collections.html#collections.OrderedDict) are just like regular dictionaries but they remember the order in which items were inserted, like lists.**

**When iterating over an ordered dictionary, the items are returned in the order their keys were first added.**
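
As a small illustration of this behaviour, the sketch below uses hypothetical sequence names. It shows that reassigning an existing key keeps its original position, and that `move_to_end()` can be used when a key really has to be moved.

```python
from collections import OrderedDict

# Hypothetical records, inserted in a non-alphabetical order
records = OrderedDict()
records["seq3"] = "ATGC"
records["seq1"] = "GGCC"
records["seq2"] = "TTAA"

# Reassigning an existing key does not change its position
records["seq3"] = "ATGCATGC"
order_after_update = list(records)
print(order_after_update)

# move_to_end() explicitly moves a key to the last position
records.move_to_end("seq3")
order_after_move = list(records)
print(order_after_move)
```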

In [None]:
from collections import OrderedDict, Counter

In [None]:
fruit_str = "banana ripe:banana unripe:banana ripe:banana rotten:apple ripe:apple ripe:apple ripe:apple unripe:orange unripe:orange unripe:orange unripe:pear rotten:pear rotten:pear ripe"

***Parsing with a normal dictionary*** 

In [None]:
d={}

for element in fruit_str.split(":"):
    fruit, status = element.split(" ")
    if fruit not in d:
        d[fruit] = Counter()
    d[fruit][status]+=1

d

***Parsing with an ordered dictionary***

In [None]:
d=OrderedDict()

for element in fruit_str.split(":"):
    fruit, status = element.split(" ")
    if fruit not in d:
        d[fruit] = Counter()
    d[fruit][status]+=1

d

**Since an ordered dictionary remembers its insertion order, it can be used in conjunction with sorting to make a sorted dictionary (by key or value) from a standard dictionary**

In [None]:
print ("\nStandard unsorted dictionary")
d = {'banana':3, 'apple':4, 'pear':1, 'orange':2, "peach":10, "apricot":2}
print (d)

print("\nDictionary sorted by key")
d_per_key = OrderedDict(sorted(d.items(), key=lambda t: t[0]))
print (d_per_key)

print("\nDictionary sorted by value")
d_per_val = OrderedDict(sorted(d.items(), key=lambda t: t[1]))
print (d_per_val)

---
# Pandas: powerful data structures for data analysis, time series, and statistics

* Flexible and expressive data structures: **Series** (1-dimensional) and **DataFrame** (2-dimensional)
* High-level building block for doing **practical, real world data analysis**
* **Nearly as fast as C** for many operations = built on top of NumPy, with extensive use of Cython
* **Robust IO tools** for loading and parsing data from text files, Excel files and databases.

## Introduction to Series

* **1D labeled array capable of holding any data type** (integers, strings, float...)

* Similar to a standard Python dictionary, but **faster** for vectorized operations (because it is backed by typed NumPy arrays) and more **user-friendly** in Jupyter

In [95]:
import pandas as pd

### Create series

***From 2 lists or tuples***

In [37]:
Base = ('A','T','C','G','N')
Freq = (0.21, 0.24, 0.27, 0.25, 0.03)
pd.Series(data=Freq, index=Base)

A    0.21
T    0.24
C    0.27
G    0.25
N    0.03
dtype: float64

***From a python dictionary***

In [38]:
d = {'A':0.21, 'T':0.24, 'C':0.27, 'G':0.25, 'N':0.03}
pd.Series(d)

A    0.21
C    0.27
G    0.25
N    0.03
T    0.24
dtype: float64

***The data type and series names can be specified*** 

In [51]:
d = {'A':21.0, 'T':24.0, 'C':27.0, 'G':25.0, 'N':3.0}
pd.Series(d, name="Percent", dtype=int)

A    21
C    27
G    25
N     3
T    24
Name: Percent, dtype: int64

***From a file containing 2 columns, with the squeeze option***

In [43]:
pd.read_table("../data/DNA_distrib.tsv", index_col=0, squeeze=True, sep="\t")

base
A    0.21
T    0.24
C    0.27
G    0.25
N    0.03
Name: freq, dtype: float64
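
The `squeeze` keyword of the readers was deprecated in pandas 1.4 and removed in pandas 2.0. A possible equivalent that works on both older and recent versions is to call `.squeeze("columns")` on the resulting DataFrame. The sketch below assumes a hypothetical in-memory stand-in for the 2-column file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a 2-column tab-separated file
tsv_text = "base\tfreq\nA\t0.21\nT\t0.24\nC\t0.27\nG\t0.25\nN\t0.03\n"

# Read the table, using the first column as index
df = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

# A DataFrame with a single column is squeezed into a Series
freq = df.squeeze("columns")
print(type(freq).__name__)
print(freq)
```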

### Manipulate series

***Support list methods***

In [96]:
s = pd.Series({'A':0.21, 'T':0.24, 'C':0.27, 'G':0.25})

# Concat 2 series
s2 = pd.Series({'Y':0.01, 'N':0.03})
s3 = s.append(s2)
print(s3)

# Slicing
print(s[2:4])

# Extraction
print(s[2])

# the "for" loop works as for a list 
for i in s:
    print (i)


A    0.21
C    0.27
G    0.25
T    0.24
N    0.03
Y    0.01
dtype: float64
G    0.25
T    0.24
dtype: float64
0.25
0.21
0.27
0.25
0.24
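
The cell above reflects an older pandas release: `Series.append` was removed in pandas 2.0, and integer positions on a label-indexed Series are better written with `.iloc`. A hedged sketch of the same operations for recent versions follows. The element order may differ from the output shown above, since recent versions keep the insertion order of the dictionary.

```python
import pandas as pd

s = pd.Series({'A': 0.21, 'T': 0.24, 'C': 0.27, 'G': 0.25})
s2 = pd.Series({'Y': 0.01, 'N': 0.03})

# Concatenate 2 series (replacement for s.append(s2))
s3 = pd.concat([s, s2])
print(s3)

# Positional slicing and extraction, made explicit with .iloc
print(s.iloc[2:4])
print(s.iloc[2])
```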


***Support dictionary methods***

In [85]:
s = pd.Series({'A':21.0, 'T':24.0, 'C':27.0, 'G':25.0, 'N':3.0}, name="Percent", dtype=int)

# Update value
s["A"] = 22
print(s)

# Named indexing
print(s["A"])

# Test for existence
print ("A" in s)
print ("V" in s)

A    22
C    27
G    25
N     3
T    24
Name: Percent, dtype: int64
22
True
False


***Support a wide range of mathematical operations (thanks to NumPy)***

In [93]:
s = pd.Series({'A':21, 'T':24, 'C':27, 'G':25, 'N':3}, name="Percent")

print(s.max())
print(s.mean())
# True only if every value is greater than 20
print((s > 20).all())
print(s.sem())

# Adding 2 series returns a result for every label present in the 2 series
s2 = pd.Series({'A':0.2, 'T':0.7, 'C':0.4, 'G':1.5, 'N':-3}, name="Percent")

print (s + s2)


27
20.0
False
4.35889894354
A    21.2
C    27.4
G    26.5
N     0.0
T    24.7
Name: Percent, dtype: float64
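
In the example above both series share the same labels. When they do not, pandas aligns the values on the index labels and fills the unmatched ones with NaN. A minimal sketch with hypothetical values, including the `fill_value` option of `Series.add` that treats a missing label as 0:

```python
import pandas as pd

# Hypothetical series with partially overlapping labels
s = pd.Series({'A': 21, 'T': 24, 'C': 27})
s2 = pd.Series({'A': 0.5, 'T': 1.0, 'G': 2.0})

# Labels present in only one series give NaN
total = s + s2
print(total)

# With fill_value, a label missing on one side counts as 0
total_filled = s.add(s2, fill_value=0)
print(total_filled)
```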


## Introduction to Dataframes

* **2-dimensional labeled data structure with columns of potentially different types**
* **HTML rendering in jupyter**

### Create dataframe

***You can optionally pass index (row labels) and columns (column labels) arguments***

***From a pandas Series***

In [114]:
s = pd.Series({'A':21.0, 'T':24.0, 'C':27.0, 'G':25.0, 'N':3.0})
pd.DataFrame(s, columns=["Percent"])

| | Percent |
|---|---|
| A | 21.0 |
| C | 27.0 |
| G | 25.0 |
| N | 3.0 |
| T | 24.0 |


***From a list of pandas Series***

In [121]:
series_list = [
    pd.Series({'A':21.0, 'T':24.0, 'C':27.0, 'G':25.0, 'N':3.0}, name="Percent"),
    pd.Series({'A':331.2, 'T':322.2, 'C':307.2, 'G':347.2, 'N':None}, name="MolecularWeight"),
    pd.Series({'A':259, 'T':267, 'C':271, 'G':253, 'N':None}, name="AbsorbanceMax")]

pd.DataFrame(series_list)

| | A | C | G | N | T |
|---|---|---|---|---|---|
| Percent | 21.0 | 27.0 | 25.0 | 3.0 | 24.0 |
| MolecularWeight | 331.2 | 307.2 | 347.2 | NaN | 322.2 |
| AbsorbanceMax | 259.0 | 271.0 | 253.0 | NaN | 267.0 |


***From a simple list of lists***

In [120]:
list_list = [[21.0, 24.0, 27.0, 25.0, 3.0], [331.2, 322.2, 307.2, 347.2, None], [259, 267, 271, 253, None]]
column_list = ['A', 'T', 'C', 'G', 'N']
index_list = ["Percent", "MolecularWeight", "AbsorbanceMax"]

pd.DataFrame(list_list, index=index_list, columns=column_list)

| | A | T | C | G | N |
|---|---|---|---|---|---|
| Percent | 21.0 | 24.0 | 27.0 | 25.0 | 3.0 |
| MolecularWeight | 331.2 | 322.2 | 307.2 | 347.2 | NaN |
| AbsorbanceMax | 259.0 | 267.0 | 271.0 | 253.0 | NaN |


***DataFrame creation is very versatile: it can also be done from dictionaries of lists, dicts or Series, and from numpy.ndarray...***
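
As one illustration of these alternatives, here is a short sketch building a DataFrame from a dictionary of lists, reusing the values of the previous examples. Each dictionary key becomes a column, so the result is the transpose of the tables shown earlier.

```python
import pandas as pd

# Each key becomes a column, each list holds the values of that column
data = {
    "Percent": [21.0, 24.0, 27.0, 25.0, 3.0],
    "MolecularWeight": [331.2, 322.2, 307.2, 347.2, None],
    "AbsorbanceMax": [259, 267, 271, 253, None],
}
df = pd.DataFrame(data, index=['A', 'T', 'C', 'G', 'N'])
print(df)

# .T transposes the table to get one row per property
print(df.T)
```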

***One of the major strengths of Pandas is its ability to parse complex files into a comprehensive dataframe format***

In [125]:
pd.read_table("../data/gencode_random.gff3", sep="\t", names =["chrom", "source", "type", "start", "end", "score", "strand", "frame", "info"])

| | chrom | source | type | start | end | score | strand | frame | info |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chr18 | HAVANA | CDS | 12452255 | 12452391 | . | - | 0 | ID=CDS:ENST00000410092.7;Parent=ENST0000041009... |
| 1 | chr2 | HAVANA | five_prime_UTR | 9995879 | 9995970 | . | + | . | ID=UTR5:ENST00000480736.1;Parent=ENST000004807... |
| 2 | chr2 | HAVANA | CDS | 159886420 | 159886530 | . | - | 2 | ID=CDS:ENST00000504764.5;Parent=ENST0000050476... |
| 3 | chr1 | HAVANA | stop_codon | 87097937 | 87097939 | . | + | 0 | ID=stop_codon:ENST00000370551.8;Parent=ENST000... |
| 4 | chr1 | HAVANA | start_codon | 208217920 | 208217922 | . | - | 0 | ID=start_codon:ENST00000367033.3;Parent=ENST00... |
| 5 | chr22 | ENSEMBL | CDS | 24876577 | 24876715 | . | + | 2 | ID=CDS:ENST00000610372.4;Parent=ENST0000061037... |
| 6 | chr14 | HAVANA | exon | 52646192 | 52646287 | . | - | . | ID=exon:ENST00000555069.1:2;Parent=ENST0000055... |
| 7 | chr1 | ENSEMBL | five_prime_UTR | 145728244 | 145728327 | . | + | . | ID=UTR5:ENST00000235933.10;Parent=ENST00000235... |
| 8 | chr1 | HAVANA | exon | 202736213 | 202736392 | . | - | . | ID=exon:ENST00000367265.7:21;Parent=ENST000003... |
| 9 | chr1 | HAVANA | five_prime_UTR | 179082086 | 179082128 | . | + | . | ID=UTR5:ENST00000352445.10;Parent=ENST00000352... |
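
Once the file is in a DataFrame, the feature type counting done earlier with a Counter becomes a one-liner. The sketch below is only illustrative: it reads a few in-memory lines instead of the real file, and the `info` values are hypothetical placeholders. The `comment="#"` option is used to skip directive lines, under the assumption that no data field contains a `#` character.

```python
import io
import pandas as pd

# Illustrative GFF3-like content; the info column holds placeholder IDs
gff3_text = (
    "##gff-version 3\n"
    "chr18\tHAVANA\tCDS\t12452255\t12452391\t.\t-\t0\tID=placeholder_1\n"
    "chr2\tHAVANA\tfive_prime_UTR\t9995879\t9995970\t.\t+\t.\tID=placeholder_2\n"
    "chr2\tHAVANA\tCDS\t159886420\t159886530\t.\t-\t2\tID=placeholder_3\n"
    "chr1\tHAVANA\tstop_codon\t87097937\t87097939\t.\t+\t0\tID=placeholder_4\n"
)

columns = ["chrom", "source", "type", "start", "end", "score", "strand", "frame", "info"]
df = pd.read_csv(io.StringIO(gff3_text), sep="\t", names=columns, comment="#")

# Count the feature types, most frequent first
type_counts = df["type"].value_counts()
print(type_counts)

# Select only the rows describing a CDS
cds = df[df["type"] == "CDS"]
print(cds)
```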
