In [2]:
from IPython.display import display, HTML
# Widen the notebook display
display(HTML("<style>.container { width:90% !important; }</style>"))

# High-performance container datatypes from the standard library

**The Python standard library includes a very useful module that simplifies parsing files: `collections` (see the [documentation](https://docs.python.org/3.5/library/collections.html) for detailed information).**

**This module implements specialized container datatypes providing alternatives to Python’s native data structures (dict, list, set...)**

**Two new containers are particularly useful:**
* Counter
* OrderedDict

---
## Counter

**A [Counter](https://docs.python.org/3.5/library/collections.html#collections.Counter) is a dict subclass that makes counting occurrences convenient and fast. See also [defaultdict](https://docs.python.org/3.5/library/collections.html#collections.defaultdict) for a generalization to default values of types other than integers.**
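
As a minimal sketch of the difference between the two containers, the snippet below counts first letters with a `Counter` and groups the same words with a `defaultdict(list)`. The word list and variable names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input data, used only for this illustration
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Counter: a missing key behaves as 0, so no initialisation is needed
first_letters = Counter()
for word in words:
    first_letters[word[0]] += 1

# defaultdict(list): a missing key is created as an empty list on first access
words_by_letter = defaultdict(list)
for word in words:
    words_by_letter[word[0]].append(word)

print(first_letters.most_common())
print(dict(words_by_letter))
```

Reading a missing key from a `Counter` returns 0 without creating an entry, whereas a `defaultdict` stores the default value as soon as the key is accessed.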

***Example: counting characters in a string*** 

In [20]:
from collections import Counter

In [None]:
random_text = """Ukip is likely to be asked to repay tens of thousands of euros by European parliament finance chiefs
who have accused the party of misspending EU funds on party workers and Nigel Farage’s failed bid to win a seat in
Westminster.The Alliance for Direct Democracy in Europe (ADDE), a Ukip-dominated political vehicle, will be asked to
repay €173,000 (£148,000) in misspent funds and denied a further €501,000 in EU grants for breaking European rules
that ban spending EU money on national election campaigns and referendums. According to a European parliament audit
report seen by the Guardian, Ukip spent EU funds on polling and analysis in constituencies where they hoped to win a 
seat in the 2015 general election, including the South Thanet seat that party leader Farage contested. The party also
funded polls to gauge the public mood on leaving the EU, months before the official campaign kicked off in April 2016"""

In [None]:
# Example with a Counter 
c = Counter()

# Iterate over each character of the string
for character in random_text:
    # Increment the counter for the current element
    c[character] += 1

# Order by most frequent element
c.most_common()

In [None]:
# Same thing but with native collections
d = {}

# Iterate over each character of the string
for character in random_text:
    # If the element is not in the dict we have to create an entry first
    if character not in d:
        d[character] = 0
    # Increment the counter for the current element
    d[character] += 1
    
# Order by most frequent element
sorted(d.items(), key=lambda t: t[1], reverse=True)
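
The explicit loops above spell out what happens at each step. As a shorter alternative, a `Counter` can also be built directly from any iterable, and `most_common(n)` limits the result to the `n` most frequent elements. The sketch below uses a short stand-in string (`sample_text`, a hypothetical name) rather than `random_text`.

```python
from collections import Counter

# Hypothetical short string standing in for random_text
sample_text = "abracadabra"

# Counter accepts any iterable and counts its elements in a single call
char_counts = Counter(sample_text)

# most_common(n) returns only the n most frequent (element, count) pairs
top_two = char_counts.most_common(2)
print(top_two)
```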

***Example: Randomly selected global event with catastrophic consequences***

In [28]:
c = Counter()

# Open the file
with open("../data/US_election vote_sample.txt", "r") as fp:
    # Split the file content on tab characters
    for candidate in fp.read().split("\t"):
        # Increment the counter for the current element
        c[candidate] += 1

# Order by most frequent element
c.most_common()

[('Clinton', 96191),
 ('Trump', 94075),
 ('Johnson', 6532),
 ('Stein', 2084),
 ('McMullin', 819),
 ('Castle', 299)]
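
Because a `Counter` is a dict subclass, the usual dict methods still apply. As a small sketch, the counts shown in the output above can be turned into vote shares with `sum(c.values())`. The counts are copied by hand here so the snippet runs without the data file, and the name `votes` is only illustrative.

```python
from collections import Counter

# Counts copied from the output above so that no data file is needed
votes = Counter({
    "Clinton": 96191, "Trump": 94075, "Johnson": 6532,
    "Stein": 2084, "McMullin": 819, "Castle": 299,
})

# Total number of votes in the sample
total = sum(votes.values())

# Print each candidate with the raw count and the share of the total
for candidate, count in votes.most_common():
    print("{:<10}{:>8}{:>8.1%}".format(candidate, count, count / total))
```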

***Example: Counting feature types in a gene annotation file (gff3)*** 

In [None]:
# Print the first 2 lines of the file to inspect its structure
with open("../data/gencode_random.gff3", "r") as fp:
    for i in range(2):
        print(next(fp))

The "feature type" is the 3rd tab-separated field of each line, i.e. index 2 in Python.
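
To make the indexing concrete, here is a minimal sketch that splits a single record. The record itself is invented for illustration and only assumes the nine standard tab-separated GFF3 columns (seqid, source, type, start, end, score, strand, phase, attributes).

```python
# Hypothetical GFF3 record, made up for this illustration
line = "chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tID=gene_1\n"

# Remove the trailing newline, then split on tab characters
fields = line.rstrip("\n").split("\t")

# The first five columns are seqid, source, type, start and end
seqid, source, feature_type, start, end = fields[:5]

print(feature_type)
```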

In [None]:
c = Counter()

# Open the file
with open("../data/gencode_random.gff3", "r") as fp:
    # Iterate over lines
    for line in fp:
        # Skip GFF3 comment and directive lines, which start with "#"
        if line.startswith("#"):
            continue
        # Split the line and take the 3rd field (index 2)
        feature_type = line.split("\t")[2]
        # Increment the counter
        c[feature_type] += 1
        
# Order by most frequent element
c.most_common()

---
## OrderedDict

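**Ordered dictionaries are just like regular dictionaries, but they remember the order in which items were inserted, as lists do. (Since Python 3.7, regular dictionaries also preserve insertion order; `OrderedDict` still adds order-aware methods such as `move_to_end`.)**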

**When iterating over an ordered dictionary, the items are returned in the order their keys were first added.**
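
The behaviour described above can be sketched with a few insertions. The keys and values are arbitrary placeholders; `move_to_end` and `popitem(last=False)` are the order-aware methods that `OrderedDict` provides on top of the regular dict interface.

```python
from collections import OrderedDict

od = OrderedDict()
od["chr2"] = 10
od["chr1"] = 20
od["chrX"] = 30

# Iteration follows insertion order, not alphabetical order
keys_in_order = list(od.keys())
print(keys_in_order)

# Updating an existing key changes its value but keeps its position
od["chr2"] = 99
keys_after_update = list(od.keys())

# move_to_end pushes an existing key to the last position
od.move_to_end("chr2")
keys_after_move = list(od.keys())

# popitem(last=False) removes and returns the first inserted item
first_key, first_value = od.popitem(last=False)
print(first_key, first_value, list(od.keys()))
```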

In [None]:
from collections import OrderedDict

***Example: extract *** 

[('Clinton', 96191),
 ('Trump', 94075),
 ('Johnson', 6532),
 ('Stein', 2084),
 ('McMullin', 819),
 ('Castle', 299)]

***Example: listing a gene annotation file (gff3)***

**Extract coordinates per reference sequence.**
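
The original cell is unfinished, so the following is only one possible reading of "coordinates per reference sequence", not the author's implementation. It assumes standard GFF3 columns (seqid in column 1, start and end in columns 4 and 5) and uses a few invented in-memory records (`gff3_lines`) instead of `../data/gencode_random.gff3`, so that it runs on its own.

```python
from collections import OrderedDict

# Hypothetical GFF3 records, made up for this illustration
gff3_lines = [
    "##gff-version 3",
    "chr2\tHAVANA\tgene\t100\t500\t.\t+\t.\tID=g1",
    "chr1\tHAVANA\texon\t20\t80\t.\t-\t.\tID=e1",
    "chr2\tHAVANA\texon\t150\t300\t.\t+\t.\tID=e2",
]

# Keys are reference sequences, kept in order of first appearance
coordinates = OrderedDict()

for line in gff3_lines:
    # Skip comment and directive lines
    if line.startswith("#"):
        continue
    fields = line.rstrip("\n").split("\t")
    seqid, start, end = fields[0], int(fields[3]), int(fields[4])
    # Create the list the first time a reference sequence is seen
    if seqid not in coordinates:
        coordinates[seqid] = []
    coordinates[seqid].append((start, end))

# Reference sequences come out in the order they first appeared in the input
for seqid, intervals in coordinates.items():
    print(seqid, intervals)
```

To run the same logic on a real file, the loop over `gff3_lines` could be replaced by a loop over an open file handle.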