# Workflow 8 - module 2 - initial testing

This notebook tests and gives examples for Module 2 of Workflow 8. See <a href="https://docs.google.com/presentation/d/1IkAzjSrOMzOLx5z8GqRVKmVd1GKrpIEb_xF4g4RlI1U/edit?usp=sharing">here</a> for an overview of Workflow 8. Module 2 takes **genes and tissues** as input and returns **interacting genes** and/or a **gene interaction/similarity matrix** that can be used as input for Module 3 (DDOT). Module 2 uses Google BigQuery. Boilerplate code (API calls, for instance) is wrapped in `wf8_module2.py`, which contains API calls written by John Earls and Theo Knijnenburg.

Notebook written by: Samson Fong, John Earls, Theo Knijnenburg, Chris Churas and Aaron Gary. 

## Libraries and such

In [3]:
%load_ext autoreload 
%autoreload 2

import json
from pprint import pprint
from wf8_module1 import doid_to_genes_and_tissues
from wf8_module2 import call_biggim
import numpy as np
import pandas as pd
import time

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


The main function **call_biggim** has seven input arguments, of which only the first two are required:
1. genes [required] - List of input genes as NCBI gene IDs, e.g. genes = ['2188', '79728', '7124']
2. tissues [required] - List of tissues, e.g. tissues = ['pancreas', 'liver']
3. limit=1000000 - Maximum number of rows returned. Use a smaller value for quicker testing and smaller tables
4. average_columns=False - If tissues are associated with multiple columns, average the scores into one column called 'mean'
5. query_id2=False - A second list of input genes. Only pairs made up of genes in 'genes' and 'query_id2' are returned
6. return_genes=False - Return new genes (interacting with the original set of genes) as a list
7. N=250 - Number of new genes returned
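
To make the effect of `average_columns=True` concrete, here is a minimal sketch of collapsing several score columns into a single `mean` column with pandas. It is an illustration only, not the code inside `wf8_module2.py`: the toy table and its values are made up, the two score column names are borrowed from the Big GIM column list printed later in this notebook, and skipping missing values when averaging is an assumption.

```python
import pandas as pd

# Hypothetical pairwise table: one row per gene pair, one score column per
# Big GIM column. The values are invented for illustration.
pairs = pd.DataFrame({
    "Gene1": ["2177", "11065"],
    "Gene2": ["2176", "675"],
    "GTEx_Esophagus_Correlation": [0.10, 0.30],
    "GIANT_blood_KnownFunctionalInteraction": [0.20, None],
})

# Every column that is not a gene identifier is treated as a score column.
score_cols = [c for c in pairs.columns if c not in ("Gene1", "Gene2")]

# Row-wise mean; pandas skips missing values by default (assumed behaviour,
# call_biggim may handle missing scores differently).
pairs["mean"] = pairs[score_cols].mean(axis=1)
averaged = pairs[["Gene1", "Gene2", "mean"]]
print(averaged)
```

The result has the same three-column shape (`Gene1`, `Gene2`, `mean`) as the table printed at the end of this notebook.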

# Testing for Fanconi Anemia

Tissue and genes are from module 1. Tissues were filtered with the help of Maureen.

## Getting Interacting Genes

In [12]:
genes = ['2188', '79728', '7124', '675', '83990', '29089', '2178', '57697', '2187', '84464', '2175', '5889', '2177', '55120', '2072', '2189', '5888', '7516', '4599', '2176', '55215', '10459']
genes = list(set(genes))
tissues = ['esophagus', 'hematopoietic_system', 'neck', 'trachea']
genes2 = call_biggim(genes, tissues, average_columns=True, return_genes=True, N=100)


Sent: GET http://biggim.ncats.io/api/metadata/tissue/esophagus?None
Sent: GET http://biggim.ncats.io/api/metadata/tissue/hematopoietic_system?None
Sent: GET http://biggim.ncats.io/api/metadata/tissue/neck?None
Sent: GET http://biggim.ncats.io/api/metadata/tissue/trachea?None
Returned 56 Big GIM columns
['GIANT_dendritic_cell_KnownFunctionalInteraction', 'GIANT_megakaryocyte_KnownFunctionalInteraction', 'GIANT_megakaryocyte_ProbabilityOfFunctionalInteraction', 'GTEx_Esophagus_Correlation', 'GIANT_thyroid_gland_KnownFunctionalInteraction', 'GIANT_blood_platelet_KnownFunctionalInteraction', 'GIANT_lymphocyte_ProbabilityOfFunctionalInteraction', 'GIANT_t_lymphocyte_ProbabilityOfFunctionalInteraction', 'GIANT_b_lymphocyte_ProbabilityOfFunctionalInteraction', 'GIANT_hematopoietic_stem_cell_KnownFunctionalInteraction', 'GIANT_blood_KnownFunctionalInteraction', 'GIANT_lymph_node_ProbabilityOfFunctionalInteraction', 'GIANT_neutrophil_ProbabilityOfFunctionalInteraction', 'GIANT_spleen_Probabilit

In [13]:
print('All genes')
print(genes2)

All genes
['10036', '10051', '10346', '10459', '10561', '10592', '1062', '1070', '10964', '11065', '11073', '1111', '11130', '1434', '1479', '1854', '2072', '2146', '2175', '2175', '2176', '2176', '2177', '2177', '2178', '2178', '2187', '2188', '2189', '2189', '2237', '22974', '23310', '24137', '259266', '26271', '29028', '29089', '3014', '3148', '3383', '3430', '3431', '3433', '3434', '3437', '3553', '3627', '3838', '4001', '4061', '4085', '4171', '4173', '4174', '4175', '4176', '4436', '4599', '4599', '4600', '4678', '4938', '4939', '51203', '51514', '5427', '54739', '55120', '55120', '55215', '55215', '55355', '55706', '55872', '57697', '5888', '5888', '5889', '5889', '5984', '5985', '6240', '6241', '641', '64151', '64761', '6672', '672', '6737', '675', '675', '6772', '6790', '701', '7083', '7124', '7124', '7128', '7153', '7157', '7298', '7411', '7468', '7516', '79682', '79728', '81620', '8317', '83990', '83990', '84464', '86', '8638', '890', '9055', '9232', '9246', '9319', '9636', 

In [14]:
print('Here are the new genes')
print(list(set(genes2)-set(genes)))

Here are the new genes
['26271', '5427', '4938', '9833', '1854', '7128', '51514', '1062', '3838', '9918', '5984', '3437', '51203', '4061', '9768', '10036', '10964', '4085', '4174', '3627', '8638', '10592', '3383', '6672', '3433', '7083', '2146', '4173', '9232', '3434', '64151', '10051', '7298', '6737', '7157', '9246', '1070', '4176', '81620', '24137', '4939', '1434', '55872', '9055', '3014', '4171', '9928', '3431', '3148', '5985', '6240', '7468', '3553', '3430', '701', '11065', '2237', '641', '7153', '4436', '259266', '4175', '6772', '79682', '8317', '6790', '11073', '54739', '1111', '4001', '55706', '64761', '6241', '29028', '10346', '9319', '11130', '86', '672', '1479', '23310', '4678', '4600', '7411', '9636', '55355', '10561', '22974', '890']
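
The printed `genes2` list contains repeated IDs (for example `'2175'` appears twice), and a set difference returns the new genes in arbitrary order. If a reproducible order matters, a small helper along the following lines could be used instead. The function name `new_genes_in_order` is hypothetical and not part of `wf8_module2.py`; the short input lists are toy data.

```python
def new_genes_in_order(returned, seeds):
    """Return genes from `returned` that are not in `seeds`,
    without duplicates and in order of first appearance."""
    seed_set = set(seeds)
    seen = set()
    result = []
    for gene in returned:
        if gene not in seed_set and gene not in seen:
            seen.add(gene)
            result.append(gene)
    return result


# Toy example with a duplicated seed gene and a duplicated new gene.
seeds = ['2175', '675']
returned = ['10036', '2175', '2175', '675', '10051', '10036']
new_genes = new_genes_in_order(returned, seeds)
print(new_genes)  # ['10036', '10051']
```

In the notebook this would be called as `new_genes_in_order(genes2, genes)`.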


## Running Big GIM for the second round, now producing the interaction matrix (df) that can serve as an input for DDOT

In [15]:
df = call_biggim(genes2, tissues, average_columns=True, query_id2=genes2)

Sent: GET http://biggim.ncats.io/api/metadata/tissue/esophagus?None
Sent: GET http://biggim.ncats.io/api/metadata/tissue/hematopoietic_system?None
Sent: GET http://biggim.ncats.io/api/metadata/tissue/neck?None
Sent: GET http://biggim.ncats.io/api/metadata/tissue/trachea?None
Returned 56 Big GIM columns
['GIANT_dendritic_cell_KnownFunctionalInteraction', 'GIANT_megakaryocyte_KnownFunctionalInteraction', 'GIANT_megakaryocyte_ProbabilityOfFunctionalInteraction', 'GTEx_Esophagus_Correlation', 'GIANT_thyroid_gland_KnownFunctionalInteraction', 'GIANT_blood_platelet_KnownFunctionalInteraction', 'GIANT_lymphocyte_ProbabilityOfFunctionalInteraction', 'GIANT_t_lymphocyte_ProbabilityOfFunctionalInteraction', 'GIANT_b_lymphocyte_ProbabilityOfFunctionalInteraction', 'GIANT_hematopoietic_stem_cell_KnownFunctionalInteraction', 'GIANT_blood_KnownFunctionalInteraction', 'GIANT_lymph_node_ProbabilityOfFunctionalInteraction', 'GIANT_neutrophil_ProbabilityOfFunctionalInteraction', 'GIANT_spleen_Probabilit

In [11]:
print(df)

     Gene1  Gene2      mean
0     2177   2176  0.126488
1    11065    675  0.214483
2    29089   2188  0.089574
3     5889   3161  0.143898
4    79728   7516  0.057075
5    11065   2178  0.095352
6     5889   2176  0.049106
7     5889   2178  0.085579
8    55120    580  0.216469
9    55215   2175  0.186829
10    2175   1479  0.079013
11    2176    675  0.078430
12   81620   4939  0.066903
13   83990  79728  0.071995
14    4939    675  0.071992
15    7516   3161  0.081837
16   84464  57697  0.080099
17   29089    580  0.187491
18    2178    580  0.067098
19   57697  55215  0.112632
20    7124   4599  0.073832
21   84464   2178  0.058176
22   57697   7516  0.084148
23    5888   2176  0.069216
24    2176   2175  0.256351
25    2188    675  0.071333
26   55215   7516  0.094710
27   84464   2175  0.095109
28   79728  57697  0.081342
29   29089   3161  0.265825
..     ...    ...       ...
299  83990    580  0.092860
300   8317   2189  0.230804
301   9246   7124  0.096379
302   2189   2176  0
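
The table above is in long form, with one row per gene pair. If a downstream step needs a square gene-by-gene matrix, the long table can be pivoted as in the sketch below. This is an assumption about one possible input format, not a statement of what DDOT requires. The three rows are copied from the printed output; filling unobserved pairs and the diagonal with 0.0 is also an assumption.

```python
import pandas as pd

# First three rows of the printed table, with gene IDs kept as strings.
edges = pd.DataFrame({
    "Gene1": ["2177", "11065", "29089"],
    "Gene2": ["2176", "675", "2188"],
    "mean": [0.126488, 0.214483, 0.089574],
})

# All genes that appear on either side of a pair, in a fixed order.
genes_sorted = sorted(set(edges["Gene1"]) | set(edges["Gene2"]))

# Square matrix initialised to 0.0 (assumed default for unobserved pairs).
matrix = pd.DataFrame(0.0, index=genes_sorted, columns=genes_sorted)

# Fill both (g1, g2) and (g2, g1) so the matrix is symmetric.
for g1, g2, score in edges.itertuples(index=False):
    matrix.loc[g1, g2] = score
    matrix.loc[g2, g1] = score

print(matrix.shape)
```

For large tables a vectorised pivot would be faster; the explicit loop is used here only to keep the symmetric fill easy to read.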