# Demonstration of feature extraction methods

In [1]:
import sys
sys.path.append('../src')

from extractor import MatrixFormattedGraph

## Conversion of Neo4j Formatted .csv files to Matrices

The format of Neo4j node and edge CSV files is shown below. Note the colons before some of the column titles.

Nodes:

|:ID|name|:LABEL|
|-|-|-|
|UBERON:0000002|uterine cervix|Anatomy|
|UBERON:0000004|nose|Anatomy|
|UBERON:0000006|islet of Langerhans|Anatomy|

Edges:

|:START_ID|:END_ID|:TYPE|
| - | - | - |
|9021|GO:0071357|participates_GpBP|
|51676|GO:0098780|participates_GpBP|
|19|GO:0055088|participates_GpBP|

The conversion process is fairly fast.  A `metapaths.json` file can be provided instead of the parameters `start_kind`, `end_kind`, and `max_length`.  However, when this file is passed, a metagraph is not built, so there is less flexibility in which metapaths can be extracted down the line.

One important caveat for successful initialization of the metagraph is the format of the `:TYPE` column.  Each value must have the form `{EdgeName}_{EdgeAbbreviation}`, with an underscore separating the two parts.
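
To make the naming requirement concrete, the sketch below checks a `:TYPE` value by splitting it on its last underscore. The helper `split_edge_type` is hypothetical and not part of `extractor`; how the library itself parses this column may differ.

```python
def split_edge_type(edge_type):
    """Split a ':TYPE' value into (edge name, edge abbreviation).

    Hypothetical helper for illustration only. It splits on the LAST
    underscore, so the edge name may itself contain underscores.
    """
    name, sep, abbrev = edge_type.rpartition('_')
    if not sep or not name or not abbrev:
        raise ValueError(
            f"expected '{{EdgeName}}_{{EdgeAbbreviation}}', got {edge_type!r}"
        )
    return name, abbrev


print(split_edge_type('participates_GpBP'))  # ('participates', 'GpBP')
```

Running a check like this over the `:TYPE` column before building the graph would surface malformed edge types early.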

In [2]:
%%time
mg = MatrixFormattedGraph('nodes.csv', 'edges.csv', start_kind='Compound', end_kind='Disease', max_length=4)

Reading file information...
Initializing metagraph...
Generating adjacency matrices...


100%|██████████| 24/24 [00:59<00:00,  5.30s/it]



Weighting matrices by degree with dampening factor 0.4...


100%|██████████| 25/25 [00:30<00:00,  1.69it/s]

CPU times: user 1min 33s, sys: 1 s, total: 1min 34s
Wall time: 1min 36s





## Extraction of Degree Features

Degree features can also be extracted using matrices. In a machine learning context, we are primarily interested in the degrees of the source and target nodes of the edge we are trying to predict. For this reason, this function takes two required parameters, `start_nodes` and `end_nodes`.
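
As a rough illustration of the idea (not the implementation of `extract_degrees`), the degree of each node for one edge type can be read off an adjacency matrix as a row or column sum. The toy matrix below is made up.

```python
import numpy as np

# Toy Compound-treats-Disease adjacency matrix
# (rows: compounds, columns: diseases). Values are made up.
ctd = np.array([[1, 0, 1],
                [0, 0, 1]])

compound_degrees = ctd.sum(axis=1)  # CtD degree of each compound (row sums)
disease_degrees = ctd.sum(axis=0)   # DtC degree of each disease (column sums)

# One record per (compound, disease) pair, mirroring the layout of a
# pairwise feature table: (compound index, disease index, CtD, DtC).
pairs = [(c, d, int(compound_degrees[c]), int(disease_degrees[d]))
         for c in range(ctd.shape[0])
         for d in range(ctd.shape[1])]
```

Each pair simply repeats the degree of its compound and of its disease, which is why rows sharing a compound or a disease share the corresponding feature values.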

In [3]:
%%time
degs = mg.extract_degrees(start_nodes='Compound', end_nodes='Disease')

100%|██████████| 24/24 [00:29<00:00,  1.03s/it]

CPU times: user 29.1 s, sys: 860 ms, total: 29.9 s
Wall time: 29.9 s





In [4]:
degs.head(2)

| |compound_id|disease_id|CbG|CcSE|CdG|CiPC|CpD|CrC|CtD|CuG|DaG|DdG|DlA|DpC|DpS|DrD|DtC|DuG|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|DB00014|DOID:0050156|2|249|0|1|0|7|2|1|18|250|4|1|8|2|0|250|
|1|DB00014|DOID:0050425|2|249|0|1|0|7|2|1|12|0|16|10|21|6|0|0|


In [5]:
degs.sort_values('disease_id').head(2)

| |compound_id|disease_id|CbG|CcSE|CdG|CiPC|CpD|CrC|CtD|CuG|DaG|DdG|DlA|DpC|DpS|DrD|DtC|DuG|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|DB00014|DOID:0050156|2|249|0|1|0|7|2|1|18|250|4|1|8|2|0|250|
|166181|DB01440|DOID:0050156|5|189|0|0|1|0|0|0|18|250|4|1|8|2|0|250|


In [6]:
import bz2

In [7]:
with bz2.open('degree-features.tsv.bz2', 'wt') as write_file:
    degs.to_csv(write_file, sep='\t', index=False)

## Extraction of DWPC features

DWPC (degree-weighted path count) features are similarly extracted using matrix-matrix multiplication, which makes them fast to compute.  This method also benefits greatly from multiprocessing.

Currently, reformatting the results to a DataFrame takes longer than producing the matrix, and this step gains little from parallel processing.
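
The matrix approach can be sketched as follows, assuming a dense NumPy representation and the damping factor of 0.4 reported during graph construction. The function `degree_weight` is hypothetical, and the sketch omits the correction needed for metapaths that repeat a node type (where paths revisiting a node must be excluded), so it should not be read as the implementation of `extract_dwpc`.

```python
import numpy as np


def degree_weight(adj, w=0.4):
    """Scale each edge by (source degree * target degree) ** -w.

    Hypothetical helper for illustration only.
    """
    adj = np.asarray(adj, dtype=float)
    row_deg = adj.sum(axis=1)
    col_deg = adj.sum(axis=0)
    # Nodes with degree 0 have no edges; give them weight 0 so that
    # 0 ** -w is never evaluated.
    row_w = np.zeros_like(row_deg)
    col_w = np.zeros_like(col_deg)
    row_w[row_deg > 0] = row_deg[row_deg > 0] ** -w
    col_w[col_deg > 0] = col_deg[col_deg > 0] ** -w
    return adj * row_w[:, None] * col_w[None, :]


# Toy matrices for the metapath CbGaD; values are made up.
cbg = np.array([[1, 1],   # 2 compounds x 2 genes
                [0, 1]])
gad = np.array([[1],      # 2 genes x 1 disease
                [1]])

# Multiplying the weighted matrices along the metapath sums the
# degree-weighted products over all Compound-Gene-Disease paths.
dwpc = degree_weight(cbg) @ degree_weight(gad)
```

Entry `dwpc[i, j]` is then the score for compound `i` and disease `j` along this metapath; longer metapaths chain more matrix products.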

In [8]:
%%time
dwpcs = mg.extract_dwpc(start_nodes='Compound', end_nodes='Disease', n_jobs=4)

Calculating DWPCs...


100%|██████████| 1206/1206 [06:02<00:00,  1.38s/it]



Sub-setting resultant matrices...


100%|██████████| 1206/1206 [00:03<00:00, 369.42it/s]



Formatting results to series...


100%|██████████| 1206/1206 [00:05<00:00, 240.98it/s]



Concatenating series to DataFrame...
CPU times: user 14.4 s, sys: 3.19 s, total: 17.6 s
Wall time: 6min 14s


In [9]:
print(dwpcs.shape)
dwpcs.head()

(212624, 1208)


| |compound_id|disease_id|CuGuAdGdD|CbGiGaD|CtDdGbCpD|CtDaGdAlD|CtDrDuGdD|CdG<rGaDrD|CuGdCbGuD|CdGdCuGdD|...|CdGpMFpGaD|CdGuDdGaD|CdGdCdGuD|CpDuGcGdD|CbGr>Gr>GuD|CdGbCbGaD|CcSEcCbGdD|CtDpSpD|CbGr>GbCpD|CuGiGcGaD|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|DB00014|DOID:0050156|0.001575|0.0|0.004113|0.00586|0.004513|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.015462|0.003645|0.0|0.0|
|1|DB00014|DOID:0050425|0.0|0.0|0.0|0.012053|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.012345|0.0|0.0|
|2|DB00014|DOID:0050741|0.000856|0.0|0.0|0.031147|0.002543|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.006986|0.010633|0.0|0.0|
|3|DB00014|DOID:0050742|0.001518|0.0|0.000933|0.022739|0.000912|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.008941|0.017212|0.0|0.0|
|4|DB00014|DOID:0060073|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|


In [None]:
with bz2.open('dwpc-features.tsv.bz2', 'wt') as write_file:
    dwpcs.to_csv(write_file, sep='\t', index=False, float_format='%.4g')

### Subsetting

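A subset of metapaths or Compound-Disease pairs can also be extracted by passing lists as arguments to the function.  Subsetting the metapaths results in faster computation, but subsetting the start and end nodes does not, likely because the result matrices are sub-set only after they have been calculated (note the order of the log messages in the output).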

In [11]:
subset = mg.extract_dwpc(metapaths=['CbGaD', 'CrCrCrCtD', 'CuGpPWpGaD'], 
                         start_nodes=['DB00132', 'DB00179', 'DB00238'], 
                         end_nodes=['DOID:0060119', 'DOID:7148', 'DOID:9970'])

Calculating DWPCs...


100%|██████████| 3/3 [00:00<00:00,  3.28it/s]



Sub-setting resultant matrices...


100%|██████████| 3/3 [00:00<00:00, 341.19it/s]



Formatting results to series...


100%|██████████| 3/3 [00:00<00:00, 462.62it/s]


Concatenating series to DataFrame...





In [12]:
subset

| |compound_id|disease_id|CbGaD|CrCrCrCtD|CuGpPWpGaD|
|-|-|-|-|-|-|
|0|DB00132|DOID:0060119|0.0|0.0|0.001396|
|1|DB00132|DOID:7148|0.049131|0.0|0.013806|
|2|DB00132|DOID:9970|0.031676|0.014459|0.016529|
|3|DB00179|DOID:0060119|0.0|0.0|0.0|
|4|DB00179|DOID:7148|0.009839|0.0|0.0|
|5|DB00179|DOID:9970|0.004522|0.003562|0.0|
|6|DB00238|DOID:0060119|0.0|0.0|0.0|
|7|DB00238|DOID:7148|0.0|0.0|0.0|
|8|DB00238|DOID:9970|0.0|0.0|0.0|


## Extraction of Prior probability

An estimate of the prior probability that two entities are related across a given edge can be extracted quickly.  The following formula is used to estimate this probability:

$$ 1 - \displaystyle\prod_{i=0}^{S-1} \cfrac{(T-E-i)}{(T-i)} $$

where $S$ is the degree of the *Start* node of the given edge, $E$ is the degree of the *End* node of the given edge, and $T$ is the *Total* number of edges of the given type.  The product gives the probability that the two nodes are not connected by an edge of this type if the edges were randomized while node degrees were held constant; subtracting it from 1 gives the probability that they are connected.
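
The formula can be evaluated directly for a single pair of nodes. The function below is a minimal sketch with a hypothetical name, `prior_estimate`; it is not the code behind `extract_prior_estimate`.

```python
def prior_estimate(start_degree, end_degree, total_edges):
    """Return 1 - prod_{i=0}^{S-1} (T - E - i) / (T - i).

    Hypothetical helper: S = start_degree, E = end_degree, T = total_edges.
    """
    not_connected = 1.0
    for i in range(start_degree):
        remaining = total_edges - end_degree - i
        if remaining <= 0:
            # The factor (and so the whole product) is zero:
            # the nodes are connected with certainty.
            return 1.0
        not_connected *= remaining / (total_edges - i)
    return 1.0 - not_connected


# S = 2, E = 3, T = 10: 1 - (7/10) * (6/9)
print(prior_estimate(2, 3, 10))
```

When either degree is 0 the product is 1 and the estimate is 0. This agrees with the output for DB00014 and DOID:0050156, where the disease has a `DtC` degree of 0 in the degree features and a `CtD` prior of 0.0.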

In [13]:
prior = mg.extract_prior_estimate('CtD')
prior.head()

100%|██████████| 1/1 [00:00<00:00, 10.97it/s]


| |compound_id|disease_id|prior|
|-|-|-|-|
|0|DB00014|DOID:0050156|0.0|
|1|DB00014|DOID:0050425|0.0|
|2|DB00014|DOID:0050741|0.011091|
|3|DB00014|DOID:0050742|0.002778|
|4|DB00014|DOID:0060073|0.024872|


In [14]:
with bz2.open('prior.tsv.bz2', 'wt') as write_file:
    prior.to_csv(write_file, sep='\t', index=False, float_format='%.4g')