In [9]:
import pandas as pd
from itertools import repeat
import re
from orangecontrib.bio.ontology import OBOOntology

### Read column headers from FANTOM5 data
Read the column headers and extract the sample information from them.

In [1]:
!ls data

fantom5_ds.txt				   samples1829
fantom5_head2000.txt			   samples1829_LIBRARY_IDs
hg19.cage_peak_phase1and2combined_ann.txt  samples1829_simplified
human_samples_nature13182-s2


In [44]:
!grep "^##ColumnVariables" data/fantom5_ds.txt | cut -d"=" -f2 > data/column_vars.txt
col_info = !cat data/column_vars.txt

In [42]:
col_info[:10]

['CAGE peak id',
 'short form of the description below. Common descriptions in the long descriptions has been omited',
 'description of the CAGE peak',
 'transcript which 5end is the nearest to the the CAGE peak',
 'entrezgene (genes) id associated with the transcript',
 'hgnc (gene symbol) id associated with the transcript',
 'uniprot (protein) id associated with the transcript',
 'tpm of 293SLAM rinderpest infection, 00hr, biol_rep1.CNhs14406.13541-145H4',
 'tpm of 293SLAM rinderpest infection, 00hr, biol_rep2.CNhs14407.13542-145H5',
 'tpm of 293SLAM rinderpest infection, 00hr, biol_rep3.CNhs14408.13543-145H6']

In [8]:
sample_info = col_info[7:]

### Retrieving information from the ontology

The column headers are difficult to parse (inconsistent commas, etc.).
We found an ontology on the FANTOM5 web page. [1]

First, we check whether all the IDs from the column headers appear in the ontology.

[1] http://fantom.gsc.riken.jp/5/datafiles/latest/extra/Ontology/ff-phase2-140729.obo.txt

In [98]:
# Escape the dot so it only matches the literal separator after the library id.
LIB_ID_REGEX = re.compile(r'CNhs\d+\.(\w+)-(\w+)')

In [96]:
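for info_line in sample_info:
    # Rebuild the sample id (e.g. 13541-145H4) from the two captured groups.
    ff_id = "-".join(LIB_ID_REGEX.search(info_line).groups())
    # grep prints every ontology line containing the id; no output means the id is missing.
    res = !grep {ff_id} data/ff-phase2-140729.obo.txt
    assert len(res) > 0, ff_id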

That seems to be the case.
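
The same check can be sketched in plain Python, which avoids one `grep` call per column. This is only a sketch under assumptions: `extract_ff_id` and `find_missing_ids` are hypothetical helper names, and it assumes that every sample term in the OBO file has a line of the form `id: FF:<sample id>` (as the `FF:` ids in the ontology suggest). The OBO stanza in the example is made up for illustration.

```python
import re

# Same pattern as LIB_ID_REGEX, with the dot escaped; captures both halves of the sample id.
FF_ID_REGEX = re.compile(r'CNhs\d+\.(\w+)-(\w+)')
# Assumption: term ids in the OBO file look like "id: FF:13541-145H4".
OBO_ID_REGEX = re.compile(r'^id: FF:(\S+)', re.MULTILINE)

def extract_ff_id(header):
    """Return the sample id (e.g. '13541-145H4') from a column header, or None."""
    match = FF_ID_REGEX.search(header)
    return "-".join(match.groups()) if match else None

def find_missing_ids(headers, obo_text):
    """Return the sample ids from the headers that have no term id in obo_text."""
    known_ids = set(OBO_ID_REGEX.findall(obo_text))
    return [ff_id for ff_id in map(extract_ff_id, headers) if ff_id not in known_ids]

headers = [
    'tpm of 293SLAM rinderpest infection, 00hr, biol_rep1.CNhs14406.13541-145H4',
    'tpm of 293SLAM rinderpest infection, 00hr, biol_rep2.CNhs14407.13542-145H5',
]
# Made-up OBO fragment that only knows the first sample.
obo_text = "[Term]\nid: FF:13541-145H4\nname: example term\n"
print(find_missing_ids(headers, obo_text))  # ['13542-145H5']
```

Unlike `grep`, this matches whole term ids rather than substrings, so an id that only appears inside a longer id or in a comment would be reported as missing. Headers that do not match the pattern show up as `None` in the result.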

#### Try out the Ontology Parser

In [11]:
obo = OBOOntology()
# Close the file handle once the ontology has been loaded.
with open("data/ff-phase2-140729.obo.txt") as obo_file:
    obo.load(obo_file)

In [19]:
print(obo.term("FF:1394-42H2").tags())

[('id', 'FF:1394-42H2', None, None), ('name', 'lung, neonate N30, rep1', None, None), ('namespace', 'FANTOM5', None, None), ('subset', 'phase1', None, None), ('subset', 'phase2', None, None), ('subset', 'update022', None, None), ('is_a', 'EFO:0002091', None, 'biological replicate'), ('is_a', 'FF:0011489', None, 'mouse lung- neonate N30 sample')]
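
The flat tag list is awkward to query because tags such as `subset` and `is_a` repeat. A minimal sketch of grouping it by tag name follows; `tags_to_dict` is a hypothetical helper, and it works on plain tuples shaped like the output above rather than on the parser objects. It assumes the third and fourth fields are modifiers and a trailing comment, which is how the output reads, and simply drops them.

```python
from collections import defaultdict

def tags_to_dict(tags):
    """Group (tag, value, modifiers, comment) tuples into {tag: [values]}."""
    grouped = defaultdict(list)
    for tag, value, _modifiers, _comment in tags:
        grouped[tag].append(value)
    return dict(grouped)

# A subset of the tuples printed above for FF:1394-42H2.
tags = [
    ('id', 'FF:1394-42H2', None, None),
    ('name', 'lung, neonate N30, rep1', None, None),
    ('subset', 'phase1', None, None),
    ('subset', 'phase2', None, None),
    ('is_a', 'EFO:0002091', None, 'biological replicate'),
    ('is_a', 'FF:0011489', None, 'mouse lung- neonate N30 sample'),
]
term = tags_to_dict(tags)
print(term['name'])  # ['lung, neonate N30, rep1']
print(term['is_a'])  # ['EFO:0002091', 'FF:0011489']
```

Every value is kept in a list, even for tags that occur once, so callers do not need to special-case repeated tags.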


#### Parse the ontology. 

At least we don't run into massive comma-parsing trouble again. Some issues remain, though:
* Sometimes the time/donor/replicate is not annotated using the ontology but only appears in the 'name'. In that case, parse it using a regex and check whether it is consistent with the ontology annotation, if available.
* I need to figure out how to determine the cell type from the name. Is it enough to rely on the 'derives_from' annotation?
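
The regex fallback from the first point could look roughly like the sketch below. The patterns are assumptions derived only from the sample names seen so far (`00hr`, `rep1`, `biol_rep1`); the `donor<N>` pattern in particular is a guess at the naming scheme, and `parse_name` and `is_consistent` are hypothetical helpers.

```python
import re

# Assumed naming patterns; extend these once more sample names have been inspected.
PATTERNS = {
    'time_hr': re.compile(r'\b(\d+)hr\b'),                # e.g. "00hr"
    'donor': re.compile(r'\bdonor(\d+)\b'),               # assumption: "donor1", "donor2", ...
    'replicate': re.compile(r'(?<![A-Za-z])rep(\d+)\b'),  # e.g. "rep1" or "biol_rep1"
}

def parse_name(name):
    """Extract time point (hours), donor and replicate numbers from a sample name."""
    fields = {}
    for key, regex in PATTERNS.items():
        match = regex.search(name)
        fields[key] = int(match.group(1)) if match else None
    return fields

def is_consistent(parsed, annotated):
    """True if every value present in both dicts agrees; missing values are ignored."""
    return all(
        parsed[key] == annotated[key]
        for key in parsed
        if parsed[key] is not None and annotated.get(key) is not None
    )

parsed = parse_name('293SLAM rinderpest infection, 00hr, biol_rep1')
print(parsed)  # {'time_hr': 0, 'donor': None, 'replicate': 1}
print(is_consistent(parsed, {'replicate': 1}))  # True
print(is_consistent(parsed, {'replicate': 2}))  # False
```

Fields that are missing on either side are skipped in the comparison, so a sample that lacks an ontology annotation is never flagged as inconsistent.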