Note: solutions to the exercises of this lesson are at the end of this notebook
---------------------

# Imports

**Importing the packages you will need at the top of your notebook is good programming practice**

In [None]:
# Import the packages that will be useful for this part of the lesson
from collections import OrderedDict, Counter
import pandas as pd
from pprint import pprint

# Small trick to get a larger display
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

# Reminder on file parsing strategy

* **Read the first line of the file and try to understand the structure and determine the length of the file**

* **If the file is a standard genomic/proteomic format, read the documentation**

* **Think about the most efficient way to parse the file to get the information you want**

    > How are you going to access the field(s) of interest? (You can test that on 1 line before starting with the whole file.)
    
    > A real-life file can contain millions of lines, and file reading is usually slow. Try to read the file only once, even if you need to parse multiple elements per line.
    
    > How are you going to collect the information (dictionary, list, dataframe...)?
    
**... Now you can parse the file**
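
The strategy above can be sketched as follows. This is only a minimal illustration: the file content is made up and written to a temporary location, and the `peek_and_count` helper is a hypothetical name that is not part of the lesson material.

```python
import os
import tempfile
from collections import Counter


def peek_and_count(path, n_preview=1):
    """Return the first n_preview lines and the total line count in one pass."""
    preview = []
    n_lines = 0
    with open(path, "r") as fh:
        for line in fh:
            if n_lines < n_preview:
                preview.append(line.rstrip("\n"))
            n_lines += 1
    return preview, n_lines


# Made-up example file (hypothetical content, for illustration only)
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("chr1\tgene\t100\nchr1\texon\t100\nchr2\tgene\t500\n")
    example_path = tmp.name

preview, n_lines = peek_and_count(example_path)

# Test the field access on a single line before parsing the whole file
first_fields = preview[0].split("\t")

# Read the file once, collecting two pieces of information per line
seqid_counts, type_counts = Counter(), Counter()
with open(example_path, "r") as fh:
    for line in fh:
        seqid, feature_type, _start = line.rstrip("\n").split("\t")
        seqid_counts[seqid] += 1
        type_counts[feature_type] += 1

os.remove(example_path)
```

Iterating over the file handle directly reads one line at a time, so the whole file never has to fit in memory.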

## Counters

**Exercise 1: Randomly selected global event with catastrophic consequences**

*The file indicated below contains a representative sample of popular votes for the last US presidential elections. Parse the file and return a count of the number of occurrences of each name*

In [None]:
file = "../data/US_election_vote_sample.txt"

**Exercise 2: Parsing a gene annotation file (gff3)**

*Parse the following gff3 gene annotation file. Extract the **type** of each line and print the counts ordered by descending occurrence*

*gff3 is a standard genomic format. Read the [documentation here](http://gmod.org/wiki/GFF3).*
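
As a quick orientation before reading the documentation, the sketch below splits one GFF3 line into the nine tab-separated columns defined by the format. The example line is made up for illustration and is not taken from the lesson data.

```python
# Column names defined by the GFF3 format, in order
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]

# Made-up line, for illustration only (not taken from the lesson data)
example_line = (
    "chrX\texample_source\texon\t100\t200\t.\t+\t.\t"
    "ID=exon:EXAMPLE1;Parent=EXAMPLE1\n"
)

# Remove the trailing newline, then split on tabs
fields = example_line.rstrip("\n").split("\t")
record = dict(zip(GFF3_COLUMNS, fields))

# The attributes column is a ';'-separated list of key=value pairs
attributes = dict(pair.split("=", 1) for pair in record["attributes"].split(";"))
```

Testing the field access on a single line like this is a cheap way to check the indices before looping over the whole file.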

In [None]:
file = "../data/gencode_sample.gff3"

In [None]:
! head -2 ../data/gencode_sample.gff3 # check format of GFF file

In [None]:
# rearrange the lines below to create a working solution
        type_field = line.split('\t')[2]
type_counter = Counter()
    print('%s:\t%d' % count_info)
with open(file, 'r') as fh:
    for line in fh.readlines():
for count_info in type_counter.most_common():
from collections import Counter
        type_counter[type_field] += 1

# OrderedDict

**Exercise 3:**

*Parse the following gff3 gene annotation file. For each **seqid** (chromosome), list the associated sequence **ID**s. Use an ordered dict to preserve the original chromosome ordering from the file*

Example:

    d={"chr1": ["stop_codon:ENST00000370551.8", "UTR5:ENST00000235933.10", ...], "chr2": ["CDS:ENST00000504764.5", "UTR5:ENST00000480736.1", ...], ... }

In [None]:
file = "../data/gencode_sample.gff3"
! head -2 ../data/gencode_sample.gff3

In [None]:
# fill in the blanks (---) in the code below to create a working solution
from collections import ---

sequences = OrderedDict()

with open(file, ---) as fh:
    --- line in fh.readlines():
        fields = line.split()
        seqid = fields[---]
        attributes = ---
        ID = attributes.split(';')[0]
        if --- in sequences:
            sequences[seqid].append(ID)
        else:
            sequences[seqid] = [ID]

for seq, ids in sequences.items():
    print('%s:\t%s' % (seq, ', '.join(ids)))

# Statistics and viewing

**Exercise 4:**

*Parse the following SAM file. SAM is a standard genomic format. Read the [documentation here](http://genome.sph.umich.edu/wiki/SAM). It can be read as a table with `pandas`.*

In [None]:
file = "../data/sample_alignment.sam"

In [None]:
import pandas as pd
df = pd.read_table(file, comment='@', header=None)

Which of the following code blocks will:

1) *Print the last 10 rows*?

```Python
# a)
df.head()

# b)
df.tail(10)

# c)
df.last10()

# d)
df[6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```

2) *Randomly sample 10 rows and compute the mean and median fragment length (TLEN)*?

```Python
# a)
sample = df.sample(10)
meanTLEN = sample[1].mean()
medianTLEN = sample[1].median()

# b)
sample = df.sample(10)
meanTLEN = sample[8].mean
medianTLEN = sample[8].median

# c)
sample = df.sample(10)
meanTLEN = sample[8].mean()
medianTLEN = sample[8].median()
```

3) *Generate a summary for all the columns*?

```Python
# a)
pd.describe(df)

# b)
df.describe(include='all')

# c)
summarise(pd.df)

# d)
df.summarise('columns')
```

# Selection and indexing

**Exercise 5:**

*Parse the following count file obtained by Kallisto from an RNA-seq dataset. The file is not a standard genomics format, but it is a tabulated file and some information can be found in the [Kallisto documentation](https://pachterlab.github.io/kallisto/manual.html).*

1. *Extract the rows for the following **target_id** values: 'ENST00000487368.4', 'ENST00000623229.1', 'ENST00000444276.1', 'ENST00000612487.4', 'ENST00000556673.2', 'ENST00000623191.1'*
2. *Select only the **est_counts** and **tpm** columns and print the **first 10 lines** of the table*
3. *Extract the rows with a **tpm** value higher than 10000*
4. *Extract the rows with a **tpm** and an **est_counts** higher than 1000, order by descending **eff_length** and print the **first 10 lines**.*

In [None]:
file = "../data/abundance.tsv"

---


# POSSIBLE ANSWERS

**Exercise 1**

In [None]:
c = Counter()

# Open the file
with open("../data/US_election_vote_sample.txt", "r") as fp:
    for candidate in fp:
        # Strip the trailing newline, then increment the counter for this name
        c[candidate.strip()] += 1

# Order by most frequent element
c.most_common()
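
A more compact sketch of the same count passes an iterable straight to `Counter`. It uses an in-memory `io.StringIO` object with made-up names as a stand-in for the vote file, so the names and counts below are purely illustrative.

```python
import io
from collections import Counter

# Stand-in for the vote file: made-up names, one per line
fake_file = io.StringIO("Alice\nBob\nAlice\nCarol\nAlice\nBob\n")

# Counter consumes the generator in a single pass; strip() drops the newline
votes = Counter(line.strip() for line in fake_file)

# List of (name, count) tuples, most frequent first
ranking = votes.most_common()
```

With a real file, the `io.StringIO` object would simply be replaced by the handle returned by `open`.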

**Exercise 2**

In [None]:
file = "../data/gencode_sample.gff3"
c = Counter()

# Open the file
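with open(file, "r") as fp:
    # Iterate over lines
    for line in fp:
        # Skip GFF3 header/comment lines, which have no tab-separated fields
        if line.startswith("#"):
            continue
        # Split the line and get the third field (index 2): the feature type
        feature_type = line.split("\t")[2]
        # Increment the counter
        c[feature_type] += 1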
        
# Order by most frequent element
c.most_common()

**Exercise 3**

In [None]:
!head -n 1 "../data/gencode_sample.gff3"

In [None]:
file = "../data/gencode_sample.gff3"
d =  OrderedDict()

# Open the file
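with open(file, "r") as fp:
    # Iterate over lines
    for line in fp:
        # Skip GFF3 header/comment lines, which have no tab-separated fields
        if line.startswith("#"):
            continue
        # Split the line and get the first field (index 0): the seqid
        seqid = line.split("\t")[0]
        # Take the first attribute of the 9th field and drop its "ID=" prefix
        ID = line.split("\t")[8].split(";")[0][3:]
        # Create an empty list the first time a seqid is seen
        if seqid not in d:
            d[seqid] = []
        d[seqid].append(ID)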
        
d
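
The "create the list if missing, then append" pattern can also be written with `dict.setdefault`. The sketch below is one possible variant, run on made-up GFF3-like lines rather than on the lesson file.

```python
from collections import OrderedDict

# Made-up GFF3-like lines (illustrative only)
lines = [
    "chr2\tsrc\tCDS\t1\t9\t.\t+\t0\tID=CDS:EXAMPLE_A;Parent=EXAMPLE_A",
    "chr1\tsrc\texon\t5\t20\t.\t-\t.\tID=exon:EXAMPLE_B;Parent=EXAMPLE_B",
    "chr2\tsrc\tUTR5\t1\t3\t.\t+\t.\tID=UTR5:EXAMPLE_C;Parent=EXAMPLE_C",
]

ids_by_seqid = OrderedDict()
for line in lines:
    fields = line.split("\t")
    seqid = fields[0]
    # First attribute is "ID=..."; slicing off 3 characters removes "ID="
    seq_id_value = fields[8].split(";")[0][3:]
    # setdefault returns the existing list, or inserts and returns a new one
    ids_by_seqid.setdefault(seqid, []).append(seq_id_value)

# Keys come back in the order they were first seen in the input
seqid_order = list(ids_by_seqid)
```

Here `chr2` stays ahead of `chr1` because it appears first in the input, which is the ordering the exercise asks to preserve.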

**Exercise 4**

In [None]:
file = "../data/sample_alignment.sam"
columns_names = ['QNAME', 'FLAG', 'RNAME', 'POS', 'MAPQ', 'CIGAR', 'RNEXT', 'PNEXT', 'TLEN', 'SEQ', 'QUAL']
df = pd.read_table(file, sep="\t", names = columns_names, skiprows=[0,1], index_col=0)

In [None]:
df.tail(10)

In [None]:
tlen_sample = df.sample(10).TLEN
print(tlen_sample)
print("\nMean:", tlen_sample.mean())
print("\nMedian:", tlen_sample.median())

In [None]:
df.describe(include="all")
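
Two details behind the quiz are worth checking on a toy table: `tail()` returns 5 rows unless a number is given, and `mean` without parentheses is only a reference to the method, so nothing is computed. The table below is made up for illustration.

```python
import pandas as pd

# Made-up table with 12 rows (illustrative only)
toy = pd.DataFrame({"TLEN": range(12)})

n_default = len(toy.tail())    # tail() without argument returns 5 rows
n_ten = len(toy.tail(10))      # tail(10) returns the last 10 rows

not_called = toy["TLEN"].mean  # bound method object, nothing is computed
called = toy["TLEN"].mean()    # calling the method computes the mean

is_method = callable(not_called)
```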

**Exercise 5**

In [None]:
file = "../data/abundance.tsv"
df = pd.read_table(file, index_col=0)

In [None]:
df.loc[['ENST00000487368.4', 'ENST00000623229.1', 'ENST00000444276.1', 'ENST00000612487.4', 'ENST00000556673.2', 'ENST00000623191.1']]

In [None]:
df[["est_counts", "tpm"]].head(10)

In [None]:
df[(df.tpm > 10000)]

In [None]:
df = df[(df.est_counts > 1000) & (df.tpm > 1000)]
# Sort by descending effective length, as requested in the exercise
df = df.sort_values("eff_length", ascending=False)
df.head(10)
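
The filtering and sorting logic can be checked by hand on a small table. The sketch below uses a made-up table with the same column names as the abundance file; the row labels and values are hypothetical.

```python
import pandas as pd

# Made-up abundance-like table (values are illustrative, not real data)
toy = pd.DataFrame(
    {
        "eff_length": [300, 1200, 800, 50],
        "est_counts": [1500.0, 2500.0, 10.0, 4000.0],
        "tpm": [2000.0, 1100.0, 5000.0, 900.0],
    },
    index=["tx_a", "tx_b", "tx_c", "tx_d"],
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > in Python.
mask = (toy["est_counts"] > 1000) & (toy["tpm"] > 1000)

# Keep the rows where the mask is True, longest effective length first
selected = toy[mask].sort_values("eff_length", ascending=False)
```

Only `tx_a` and `tx_b` pass both thresholds, and `tx_b` comes first because it has the larger `eff_length`.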