In [16]:
%load_ext autoreload
%autoreload 1
from orangecontrib.bio.ontology import OBOOntology
import networkx as nx
%matplotlib inline
from pylab import * 
from itertools import repeat
%aimport network_tools
from network_tools import *
import pandas as pd
import re

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Improve ontology
Starting from the 'human_samples' network built in `explore_ontology.py`, I will improve the annotation:
* group singletons that represent the same sample but are replicates or come from different donors
* add 'cell_line' / 'primary cell' / .. annotation from supplementary information
* ...

Moreover, I will start clustering the samples to get an overview of which cell signatures are biologically meaningful and worth generating.

### What do we have in the F5 dataset
From the F5 paper (Nature 2014): 
* 573 human primary cell samples (3 donors for most cell types) 
* 128 mouse primary cell samples
* 250 cancer cell lines
* 152 human post-mortem tissues
* 271 mouse developmental tissue samples


In [4]:
573+128+250+152+271

1374

### Check human_samples network
* Does it contain all (human) samples? 

In [6]:
obo = OBOOntology()
obo.load(open("data/ff-phase2-140729.obo"))

In [20]:
hsa_samples = relabel_nodes(obo, build_tree(obo, "FF:0000210"))

The network also contains many 'annotation' nodes that do not represent a sample. Sample IDs follow the pattern `FF:?????-?????`.

In [46]:
contained_samples = [s for s in hsa_samples.nodes() if re.match(r'FF:(.{5})-(.{5})', s) is not None]

In [42]:
col_vars = pd.read_csv("data/column_vars.processed.csv")

In [50]:
col_obo_ids = set(col_vars["obo_id"])

In [51]:
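# Keep the first whitespace-separated token of each node label and drop its last character (the trailing colon after the ID)
hsa_obo_ids = set([x.split()[0][:-1] for x in hsa_samples.nodes()])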

In [65]:
print("number of nodes in network: ", len(hsa_obo_ids))
print("number of samples in network: ", len(contained_samples))
print("number of samples in data table: ", len(col_obo_ids))
print("intersection: ", len(col_obo_ids & hsa_obo_ids))

number of nodes in network:  2298
number of samples in network:  1800
number of samples in data table:  1816
intersection:  1800


This means that all 1800 human samples in the network are contained in the data table.
However, 16 samples in the data table are not annotated as human samples in the ontology:

In [66]:
sorted([tag2name(obo, n) for n in col_obo_ids.difference(hsa_obo_ids)])

['FF:10150-102I6: medial frontal gyrus, adult, donor10252',
 'FF:11914-125G6: CD4+CD25-CD45RA- memory conventional T cells expanded, donor2',
 'FF:11915-125G7: CD4+CD25+CD45RA+ naive regulatory T cells expanded, donor2',
 'FF:11918-125H1: CD4+CD25-CD45RA- memory conventional T cells expanded, donor3',
 'FF:11919-125H2: CD4+CD25+CD45RA+ naive regulatory T cells expanded, donor3',
 'FF:11937-126A2: gamma delta positive T cells, donor1',
 'FF:11938-126A3: gamma delta positive T cells, donor2',
 'FF:11939-126A4: Mast cell, expanded, donor5',
 'FF:11940-126A5: Mast cell, expanded and stimulated, donor5',
 'FF:11941-126A6: Mast cell, expanded, donor8',
 'FF:11942-126A7: Mast cell, expanded and stimulated, donor8',
 'FF:13364-143F7: HES3-GFP Embryonic Stem cells, cardiomyocytic induction, day00, biol_rep1 (UH-1)',
 'FF:13365-143F8: HES3-GFP Embryonic Stem cells, cardiomyocytic induction, day00, biol_rep2 (UH-2)',
 'FF:13424-144D4: iPS differentiation to neuron, control donor C11-CRL2429, day1

Looking at the network shows that most of these samples have simply been forgotten. For example,
`FF:13364-143F7: HES3-GFP Embryonic Stem cells, cardiomyocytic induction, day00, biol_rep1 (UH-1)` matches `FF:13366-143F9: HES3-GFP Embryonic Stem cells, cardiomyocytic induction, day00, biol_rep3 (UH-3)`, which exists as a singleton in the network.

### Integrating the missing values in the network
We will now manually correct the ontology:
* Add the missing entries
* group the singletons by name (if they are the same sample but replicates/from different donors) 
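
One possible way to group singletons by name is sketched below. It is only an illustration, not the procedure used in this notebook: the helper names `group_key` and `group_by_name` and the suffix pattern are assumptions, derived from the donor/replicate markers visible in the sample names listed above (e.g. `, donor2` or `, biol_rep1 (UH-1)`). Names that follow other conventions would need additional patterns.

```python
import re
from collections import defaultdict

# Assumed pattern: a trailing ", donorX" or ", biol_repN (…)" marker,
# as seen in the sample names printed above.
REPLICATE_SUFFIX = re.compile(r",\s*(donor\w+|biol_rep\d+(\s*\([^)]*\))?)\s*$")

def group_key(label):
    """Drop the 'FF:xxxxx-xxxxx: ' prefix and a trailing donor/replicate marker."""
    name = label.split(": ", 1)[1] if ": " in label else label
    return REPLICATE_SUFFIX.sub("", name)

def group_by_name(labels):
    """Map each stripped name to the list of labels that share it."""
    groups = defaultdict(list)
    for label in labels:
        groups[group_key(label)].append(label)
    return dict(groups)

labels = [
    "FF:11937-126A2: gamma delta positive T cells, donor1",
    "FF:11938-126A3: gamma delta positive T cells, donor2",
    "FF:13364-143F7: HES3-GFP Embryonic Stem cells, cardiomyocytic induction, day00, biol_rep1 (UH-1)",
    "FF:13366-143F9: HES3-GFP Embryonic Stem cells, cardiomyocytic induction, day00, biol_rep3 (UH-3)",
]
groups = group_by_name(labels)
for key, members in groups.items():
    print(key, len(members))
```

Each key with more than one member is a candidate group node under which the individual samples could be attached; the grouping should still be checked by hand before the ontology is changed.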

In [67]:
## todo

### Find inconsistencies between name and ontology

In [68]:
## I'm particularly interested in donor/replicate information that is not part of the ontology.
## This could be helpful for grouping singletons.
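
As a rough sketch of how donor and replicate information might be pulled out of the sample names, the snippet below uses `Series.str.extract` with named groups. The two regular expressions and the column names `donor` and `replicate` are assumptions based on the names shown earlier; they will miss names that use a different convention (for example a donor label separated from the word "donor" by a space).

```python
import pandas as pd

names = pd.Series([
    "CD4+CD25-CD45RA- memory conventional T cells expanded, donor2",
    "Mast cell, expanded and stimulated, donor8",
    "HES3-GFP Embryonic Stem cells, cardiomyocytic induction, day00, biol_rep2 (UH-2)",
])

# Named groups become column names; names without a match yield NaN.
donor = names.str.extract(r"\bdonor(?P<donor>\w+)", expand=True)
replicate = names.str.extract(r"\bbiol_rep(?P<replicate>\d+)", expand=True)

# One row per sample: original name plus the extracted labels.
info = pd.concat([names.rename("name"), donor, replicate], axis=1)
print(info)
```

Comparing these extracted columns with the ontology would show which samples carry donor or replicate labels in their name only.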

### Add sample type information from supplementary information

In [None]:
## `sheet_name` replaces the older `sheetname` keyword in current pandas versions
si_table = pd.read_excel("data/fantom5-S1.xls", sheet_name=1)

## Conclusion
We tried our best to fix the errors in the annotation; however, it is unlikely that this analysis revealed and corrected every mistake. As a next step, we therefore need some computational outlier detection.