# Demonstration of feature extraction methods

In [1]:
import sys
sys.path.append('../src')

from extractor import MatrixFormattedGraph

## Conversion of Neo4j Formatted .csv files to Matrices

The format for Neo4j node and edge CSV files is shown below. Note the colon prefix on the `:ID`, `:LABEL`, `:START_ID`, `:END_ID`, and `:TYPE` column titles.

Nodes:

|:ID|name|:LABEL|
|-|-|-|
|UBERON:0000002|uterine cervix|Anatomy|
|UBERON:0000004|nose|Anatomy|
|UBERON:0000006|islet of Langerhans|Anatomy|

Edges:

|:START_ID|:END_ID|:TYPE|
| - | - | - |
|9021|GO:0071357|participates_GpBP|
|51676|GO:0098780|participates_GpBP|
|19|GO:0055088|participates_GpBP|

The conversion process is fairly fast. A `metapaths.json` file can be provided instead of the parameters `start_kind`, `end_kind`, and `max_length`. However, when this file is passed in place of those parameters, a metagraph is not built, so there is less flexibility in which metapaths can be extracted down the line.

One important caveat for successful initialization of the metagraph is the format of the `:TYPE` column. Each value must follow the pattern `{EdgeName}_{EdgeAbbreviation}`, including the separating underscore.
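
As a quick pre-flight check, the `:TYPE` values can be inspected before building the graph. The sketch below is not part of `extractor`; the helper name `has_name_and_abbreviation` is hypothetical, the toy edge rows are adapted from the table above (the last one is deliberately malformed), and it assumes the abbreviation is whatever follows the last underscore.

```python
import io

import pandas as pd

# Toy edge file in the Neo4j import format shown above. The last row is
# deliberately malformed: it has no underscore-separated abbreviation.
edges_csv = io.StringIO(
    ":START_ID,:END_ID,:TYPE\n"
    "9021,GO:0071357,participates_GpBP\n"
    "19,GO:0055088,participates_GpBP\n"
    "51676,GO:0098780,participates\n"
)
edges = pd.read_csv(edges_csv, dtype=str)


def has_name_and_abbreviation(edge_type):
    """Return True if edge_type looks like '{EdgeName}_{EdgeAbbreviation}'."""
    # Assumption: the abbreviation is everything after the last underscore.
    name, sep, abbreviation = edge_type.rpartition('_')
    return bool(name) and bool(sep) and bool(abbreviation)


valid = edges[':TYPE'].map(has_name_and_abbreviation)
bad_types = sorted(edges.loc[~valid, ':TYPE'].unique())
print(bad_types)
```

Any value listed in `bad_types` would need to be renamed in the edge file before it is passed to `MatrixFormattedGraph`.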

In [2]:
%%time
mg = MatrixFormattedGraph('nodes.csv', 'edges.csv', start_kind='Compound', end_kind='Disease', max_length=4)

Reading file information...
Initializing metagraph...
Generating adjacency matrices...


100%|██████████| 24/24 [01:07<00:00,  6.27s/it]



Weighting matrices by degree with dampening factor 0.4...


100%|██████████| 25/25 [00:25<00:00,  1.02s/it]

CPU times: user 1min 48s, sys: 2.77 s, total: 1min 50s
Wall time: 1min 51s





## Extraction of Degree Features

Degree features can also be extracted using matrices. In a machine learning context, we are primarily interested in the degrees of the source and target of the edge we are trying to predict. For this reason, this function has two required parameters, `start_nodes` and `end_nodes`.
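
To illustrate the idea of reading degrees off a matrix, here is a minimal sketch on a hypothetical Compound x Gene adjacency matrix. The node ids and matrix are made up for illustration, and this is not necessarily how `extract_degrees` is implemented internally.

```python
import numpy as np

# Hypothetical Compound x Gene adjacency matrix for one edge type:
# rows are compounds, columns are genes, and 1 marks an edge.
compound_ids = ['C1', 'C2', 'C3']
adjacency = np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])

# The degree of each compound along this edge type is its row sum;
# column sums would give the corresponding gene degrees.
degrees = adjacency.sum(axis=1)
compound_degree = dict(zip(compound_ids, degrees.tolist()))
print(compound_degree)
```

Repeating this for every edge type that touches the start or end node type would give one degree column per edge type.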

In [3]:
%%time
degs = mg.extract_degrees(start_nodes='Compound', end_nodes='Disease')

100%|██████████| 24/24 [00:24<00:00,  1.03s/it]


CPU times: user 22.7 s, sys: 1.99 s, total: 24.7 s
Wall time: 24.7 s


In [4]:
degs.head(2)

| |compound_id|disease_id|CbG|CcSE|CdG|CiPC|CpD|CrC|CtD|CuG|DaG|DdG|DlA|DpC|DpS|DrD|DtC|DuG|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|DB00014|DOID:0050156|2|249|0|1|0|7|2|1|18|250|4|1|8|2|0|250|
|1|DB00014|DOID:0050425|2|249|0|1|0|7|2|1|12|0|16|10|21|6|0|0|


In [5]:
degs.sort_values('disease_id').head(2)

| |compound_id|disease_id|CbG|CcSE|CdG|CiPC|CpD|CrC|CtD|CuG|DaG|DdG|DlA|DpC|DpS|DrD|DtC|DuG|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|DB00014|DOID:0050156|2|249|0|1|0|7|2|1|18|250|4|1|8|2|0|250|
|166181|DB01440|DOID:0050156|5|189|0|0|1|0|0|0|18|250|4|1|8|2|0|250|


In [6]:
import bz2

In [7]:
with bz2.open('degree-features.tsv.bz2', 'wt') as write_file:
    degs.to_csv(write_file, sep='\t', index=False)

## Extraction of DWPC features

DWPC (degree-weighted path count) features are similarly extracted using matrix multiplication. Extraction is fast this way, and it benefits greatly from the speedup provided by multiprocessing.
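
As a rough illustration of the matrix approach, the sketch below degree-weights two toy adjacency matrices and multiplies them along a two-edge metapath. The function name `degree_weight` and the toy matrices are hypothetical, the damping exponent 0.4 is taken from the log output above, and this is not the `extractor` implementation. Note that a plain matrix product counts walks; it equals a path count only when no node can be revisited, as in a metapath whose node types are all distinct.

```python
import numpy as np


def degree_weight(adjacency, damping=0.4):
    """Scale each edge by (row degree * column degree) ** -damping."""
    adjacency = np.asarray(adjacency, dtype=float)
    row_degree = adjacency.sum(axis=1)
    col_degree = adjacency.sum(axis=0)
    # Nodes with degree 0 have no edges, so their weight is left at 0
    # to avoid dividing by zero.
    row_weight = np.zeros_like(row_degree)
    col_weight = np.zeros_like(col_degree)
    row_weight[row_degree > 0] = row_degree[row_degree > 0] ** -damping
    col_weight[col_degree > 0] = col_degree[col_degree > 0] ** -damping
    return adjacency * row_weight[:, None] * col_weight[None, :]


# Toy data: 2 compounds x 2 genes, then 2 genes x 1 disease.
compound_gene = [[1, 1],
                 [0, 1]]
gene_disease = [[1],
                [1]]

# Multiplying the weighted matrices along the metapath yields one
# score per (compound, disease) pair.
dwpc = degree_weight(compound_gene) @ degree_weight(gene_disease)
print(dwpc)
```

Each metapath is an independent chain of products, which is consistent with the computation parallelizing well across metapaths.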

Currently, reformatting the results into a DataFrame takes longer than producing the matrices. This step does not speed up much with parallel processing.
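
The reformatting step can be pictured as flattening each start-by-end matrix into one column of a long table with one row per pair. The following is a minimal sketch under that assumption; `matrices_to_frame` is a hypothetical helper, the values are toy numbers, and the actual code in `extractor` may differ.

```python
import numpy as np
import pandas as pd


def matrices_to_frame(matrices, start_ids, end_ids):
    """Flatten {metapath: (n_start x n_end) matrix} into a long DataFrame."""
    n_start, n_end = len(start_ids), len(end_ids)
    frame = pd.DataFrame({
        # Each start id is repeated once per end id...
        'compound_id': np.repeat(start_ids, n_end),
        # ...while the full list of end ids cycles once per start id.
        'disease_id': np.tile(end_ids, n_start),
    })
    for metapath, matrix in matrices.items():
        # Row-major flattening matches the id ordering built above.
        frame[metapath] = np.asarray(matrix, dtype=float).ravel()
    return frame


# Toy values for a single metapath; rows are compounds, columns diseases.
toy = {'CbGaD': [[0.1, 0.2],
                 [0.3, 0.4]]}
frame = matrices_to_frame(toy, ['DB00132', 'DB00179'],
                          ['DOID:7148', 'DOID:9970'])
print(frame)
```

With many metapaths and hundreds of thousands of pairs, this flattening touches every cell of every matrix, which may help explain why it dominates the runtime.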

In [8]:
%%time
dwpcs = mg.extract_dwpc(start_nodes='Compound', end_nodes='Disease', n_jobs=24)

Calculating DWPCs...


100%|██████████| 1206/1206 [04:39<00:00,  1.78s/it]



Reformating results...


100%|██████████| 1206/1206 [05:58<00:00,  2.69it/s]


CPU times: user 2min 33s, sys: 6min 53s, total: 9min 27s
Wall time: 10min 43s


In [9]:
print(dwpcs.shape)
dwpcs.head()

(212624, 1208)


| |compound_id|disease_id|CuGuAuGdD|CpDpSpDrD|CbGdDaGaD|CuGuDdGuD|CiPCiCbGdD|CrCtDtCpD|CdGr>GcGuD|CiPCiCbGaD|...|CdGcGaD|CbGdD|CbG<rG<rGuD|CuG<rGdCtD|CuG<rGuCtD|CuGr>GbCpD|CdGaDrD|CbGiGuD|CbGuD|CtDtCdGuD|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|DB00014|DOID:0050156|0.001434|0.0|0.0|0.000169|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.007722|
|1|DB00014|DOID:0050425|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|2|DB00014|DOID:0050741|0.000839|0.0|0.0|0.000349|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.000582|0.000511|0.0|0.0|0.0|0.0|0.007271|
|3|DB00014|DOID:0050742|0.001529|0.0|0.0|0.001517|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.001304|0.0|0.0|0.0|0.0|0.0|0.008063|
|4|DB00014|DOID:0060073|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.001569|0.000964|0.0|0.0|0.0|0.0|0.0|


In [10]:
with bz2.open('dwpc-features.tsv.bz2', 'wt') as write_file:
    dwpcs.to_csv(write_file, sep='\t', index=False, float_format='%.4g')
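
For reading such a file back, the sketch below round-trips a small toy table through the same `bz2` and `to_csv` calls in a temporary directory. The file name and the extra digits on the first value are made up for illustration. It relies on `pandas.read_csv` inferring bz2 compression from the `.bz2` extension, and it shows that `float_format='%.4g'` keeps only four significant digits.

```python
import bz2
import os
import tempfile

import pandas as pd

# Toy feature table; the first value has more digits than will survive.
features = pd.DataFrame({
    'compound_id': ['DB00014', 'DB00014'],
    'disease_id': ['DOID:0050156', 'DOID:0050425'],
    'CuGuAuGdD': [0.00143412345, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy-features.tsv.bz2')
    with bz2.open(path, 'wt') as write_file:
        features.to_csv(write_file, sep='\t', index=False,
                        float_format='%.4g')
    # Compression is inferred from the '.bz2' file extension.
    loaded = pd.read_csv(path, sep='\t')

print(loaded)
```

The reloaded values match the originals only to four significant digits, so the compressed file is smaller at the cost of some precision.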

#### Subsetting

A subset of metapaths or of Compound-Disease pairs can also be extracted by passing lists as arguments to the function. Subsetting the metapaths results in faster computation, but subsetting the start and end nodes does not.

In [11]:
subset = mg.extract_dwpc(metapaths=['CbGaD', 'CrCrCrCtD', 'CuGpPWpGaD'], 
                         start_nodes=['DB00132', 'DB00179', 'DB00238'], 
                         end_nodes=['DOID:0060119', 'DOID:7148', 'DOID:9970'])

Calculating DWPCs...


100%|██████████| 3/3 [00:01<00:00,  2.92it/s]



Reformating results...


100%|██████████| 3/3 [00:00<00:00,  6.75it/s]


In [12]:
subset

| |compound_id|disease_id|CbGaD|CrCrCrCtD|CuGpPWpGaD|
|-|-|-|-|-|-|
|0|DB00132|DOID:0060119|0.0|0.0|0.001396|
|1|DB00132|DOID:7148|0.049131|0.0|0.013806|
|2|DB00132|DOID:9970|0.031676|0.014459|0.016529|
|3|DB00179|DOID:0060119|0.0|0.0|0.0|
|4|DB00179|DOID:7148|0.009839|0.0|0.0|
|5|DB00179|DOID:9970|0.004522|0.003562|0.0|
|6|DB00238|DOID:0060119|0.0|0.0|0.0|
|7|DB00238|DOID:7148|0.0|0.0|0.0|
|8|DB00238|DOID:9970|0.0|0.0|0.0|
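
Because subsetting the start and end nodes does not reduce computation, one alternative is to compute the features once and select pairs from the resulting DataFrame afterwards. The sketch below uses plain pandas rather than any `extractor` function, on a few rows copied from the output above; the variable names are hypothetical.

```python
import pandas as pd

# A few rows and one metapath column copied from the output above.
computed = pd.DataFrame({
    'compound_id': ['DB00132', 'DB00132', 'DB00179', 'DB00238'],
    'disease_id': ['DOID:7148', 'DOID:9970', 'DOID:7148', 'DOID:9970'],
    'CbGaD': [0.049131, 0.031676, 0.009839, 0.0],
})

wanted_compounds = ['DB00132', 'DB00179']
wanted_diseases = ['DOID:7148']

# Keep only rows whose compound and disease are both in the wanted lists.
mask = (computed['compound_id'].isin(wanted_compounds)
        & computed['disease_id'].isin(wanted_diseases))
selected = computed[mask].reset_index(drop=True)
print(selected)
```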
