# HNU1 Weighted K-Hop Feature Matrix

Vivek Gopalakrishnan | October 23, 2018


## Overview
Generate a feature matrix from the HNU1 edgelists using scan statistics. Incorporate phenotypic data so that we have labels for age and sex.

## Methods
- Use the k-hop locality scan statistic to make a feature vector for each graph
- Concatenate these vectors to create a feature matrix
- Match patient IDs between the feature matrix and the phenotypic data to get labels for sex and age

## Central Definitions
- k-hop locality: the number of edges (or, in the weighted case, the total edge weight) in the subgraph induced by the vertices within $k$ hops of the node $u$, including $u$ itself
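
As an illustration of the definition above, here is a minimal pure-Python sketch on a toy graph. The function name `khop_locality` and the adjacency-dict representation are assumptions made for this example; the notebook itself relies on NetworkX for the same computation.

```python
from collections import deque

def khop_locality(adj, source, k):
    """Total edge weight of the subgraph induced by nodes within k hops of source.

    adj maps each node to a dict {neighbor: weight} and is assumed to be
    symmetric (undirected graph).
    """
    # Breadth-first search that stops expanding at depth k
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue
        for nbr in adj[node]:
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)

    # Sum every undirected edge between reached nodes exactly once
    total = 0.0
    seen = set()
    for u in depth:
        for v, w in adj[u].items():
            edge = frozenset((u, v))
            if v in depth and edge not in seen:
                seen.add(edge)
                total += w
    return total

# Toy path graph a - b - c - d with unit weights (made-up data)
adj = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
one_hop = khop_locality(adj, "a", 1)   # nodes {a, b}, one edge
two_hop = khop_locality(adj, "a", 2)   # nodes {a, b, c}, two edges
```

With unit weights the result is simply the edge count, which matches the unweighted definition.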

## Important notes
- "Weighted" means that the passed-to-ranks edge weights were used while calculating k-hop locality
- To run this, you need NetworkX v1.9. Create a virtual environment, install NetworkX v1.9 in it, and set up a custom Jupyter kernel that runs in that environment.
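
The `ptr` module is not shown in this notebook, so the following is only a sketch of one common form of pass-to-ranks, not the implementation used here. It assumes each nonzero weight is replaced by its rank divided by the number of nonzero weights, with ties sharing their average rank; the function name `pass_to_ranks_sketch` is hypothetical.

```python
def pass_to_ranks_sketch(weights):
    """Map each nonzero weight to rank / (number of nonzero weights).

    Ties receive the average of the ranks they span; zeros stay zero.
    This is an assumed normalization, not necessarily the one in `ptr`.
    """
    nonzero = sorted(w for w in weights if w != 0)
    n = len(nonzero)

    # Average 1-based rank for each distinct nonzero value
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and nonzero[j] == nonzero[i]:
            j += 1
        # Sorted positions i..j-1 hold ranks i+1..j, whose mean is (i + 1 + j) / 2
        rank[nonzero[i]] = (i + 1 + j) / 2
        i = j

    return [rank[w] / n if w != 0 else 0.0 for w in weights]

# Made-up edge weights; the zero entry stands for a missing edge
ranked = pass_to_ranks_sketch([0, 10, 30, 20])
```

Under this assumption, the transformed weights lie in (0, 1], so a k-hop locality computed from them is bounded by the number of edges in the induced subgraph.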

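### Step 1:
- Import dependencies
- Define the function that computes the weighted k-hop locality feature vector

In [1]: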
import pandas as pd
import networkx as nx
from ptr import pass_to_ranks
from edge_fetch import edge_terrier

In [2]:
def weighted_khop_locality(G, filename):

    # Process the filename to create a patient_id and session number 
    patient_id = int(filename.split('_')[0].split('-')[1])
    session = int(filename.split('_')[1].split('-')[1])
    embed = [patient_id, session]

    # PTR graph
    ptr_G = pass_to_ranks(G)
    G = nx.from_numpy_matrix(ptr_G)

    # Loop through all of the nodes in G and calculate 1-hop and 2-hop locality
    for node in G.nodes():

        for k in [1, 2]:

            k_hop = list(nx.single_source_shortest_path_length(G, node, cutoff=k).keys())
            induced = nx.get_edge_attributes(G.subgraph(k_hop), 'weight')
            embed += [sum(induced.values())]

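    # Keep only graphs with 48 nodes (2 ids + 48 nodes x 2 hops = 98 entries);
    # any other size falls through and implicitly returns None
    if len(embed) == 96 + 2:
        return embed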
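
To make the layout of the returned vector concrete, the sketch below repeats the filename parsing on a made-up filename and computes the expected vector length. The filename `sub-0025427_ses-1_dwi_JHU.gpickle` and the helper name `parse_ids` are hypothetical; only the `sub-<id>_ses-<n>_...` pattern is implied by the function above.

```python
def parse_ids(filename):
    """Extract (patient_id, session) from a 'sub-<id>_ses-<n>_...' style name."""
    parts = filename.split('_')
    patient_id = int(parts[0].split('-')[1])   # 'sub-0025427' -> 25427
    session = int(parts[1].split('-')[1])      # 'ses-1' -> 1
    return patient_id, session

# Hypothetical filename used only for illustration
example = 'sub-0025427_ses-1_dwi_JHU.gpickle'
ids = parse_ids(example)

# The loop appends two values per node (1-hop, then 2-hop), so a graph with
# 48 nodes yields 2 id fields + 48 * 2 locality values
n_nodes, n_hops = 48, 2
expected_length = 2 + n_nodes * n_hops
```

Because the inner loop runs over `k` for each node in turn, feature columns `2j` and `2j + 1` hold the 1-hop and 2-hop locality of the `j`-th node visited.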

### Step 2:
- Make feature vector for each edgelist
- Construct single feature matrix using pandas

In [3]:
# Custom class that pulls graph files from the given path in an S3 bucket
f = edge_terrier(filepath='data/HNU1/ndmg_0-0-48/graphs/JHU/')

In [4]:
# Compute all embeddings
all_embeddings = []

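for file in f.filelist:

    G, filename = f.convert_gpickle(file)

    # Skip files that could not be loaded, so a stale embedding
    # from the previous iteration is never appended twice
    if G is None:
        continue

    # weighted_khop_locality returns None for graphs of unexpected size
    embed = weighted_khop_locality(G, filename)
    if embed is not None:
        all_embeddings.append(embed)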

In [5]:
# Convert list-of-lists to a pandas dataframe
df = pd.DataFrame.from_records(all_embeddings)
df.columns = ['SUBID', 'SESSION'] + list(range(96))

# View the dataframe
print(df.shape)
df.head()

(300, 98)


Unnamed: 0,SUBID,SESSION,0,1,2,3,4,5,6,7,...,86,87,88,89,90,91,92,93,94,95
0,25427,10,16.925975,58.819149,0.097961,16.945479,15.150709,58.588652,38.620567,58.714096,...,5.622784,55.593528,1.969415,49.899823,3.321365,50.572695,22.477394,58.588652,16.933511,58.588652
1,25427,1,9.010638,46.202128,0.722074,23.070922,10.055408,49.246011,34.325798,49.808067,...,3.865691,44.485372,7.484043,45.052305,1.317819,28.198138,21.362589,49.246011,15.629433,46.773936
2,25427,2,9.738475,52.966755,6.678191,56.258865,11.175975,55.811613,36.503989,56.078457,...,5.019504,46.264628,2.594415,50.036791,2.754433,47.044326,20.695035,54.775266,17.614362,53.495567
3,25427,3,9.041667,55.149823,1.246897,26.95656,14.569592,56.579344,38.019947,57.212323,...,8.310284,54.762411,1.042553,42.951684,7.538564,52.139184,22.556294,56.579344,17.912677,55.403812
4,25427,4,15.931738,72.471631,1.218528,35.547429,14.373227,71.721188,54.713652,72.471631,...,7.495567,68.568262,21.39539,70.06383,2.14406,47.284574,31.000887,71.721188,26.534131,71.721188


### Step 3:
- Read in phenotypic data
- Map sex and age labels to data matrix
- Make a single dataframe and export it as a csv

In [6]:
# Read in important columns from phenotypic data
phenotypic = pd.read_csv('HNU1.csv').loc[:,['SUBID', 'SEX', 'AGE_AT_SCAN_1']]

# Remove all duplicates (each patient is listed 10 times because their brains were scanned 10 times)
phenotypic = phenotypic.drop_duplicates(subset='SUBID')
phenotypic.head()

Unnamed: 0,SUBID,SEX,AGE_AT_SCAN_1
0,25427,2,23
10,25428,1,28
20,25441,1,24
30,25442,1,27
40,25443,2,25


In [7]:
# Make single df
df = pd.merge(phenotypic, df, how='inner', on=['SUBID'])

# Recode SEX from {1, 2} to {0, 1}. I am not sure which sex (male or female) each of these numbers encodes
df['SEX'] = df['SEX'].map({2: 1, 1: 0})

# Rename columns
df.columns = ['id', 'sex', 'age_at_scan_1', 'session'] + list(range(96))

# Check that the row count is unchanged (300 rows, now with 2 extra label columns) and inspect the first rows
print(df.shape)
df.head()

(300, 100)


Unnamed: 0,id,sex,age_at_scan_1,session,0,1,2,3,4,5,...,86,87,88,89,90,91,92,93,94,95
0,25427,1,23,10,16.925975,58.819149,0.097961,16.945479,15.150709,58.588652,...,5.622784,55.593528,1.969415,49.899823,3.321365,50.572695,22.477394,58.588652,16.933511,58.588652
1,25427,1,23,1,9.010638,46.202128,0.722074,23.070922,10.055408,49.246011,...,3.865691,44.485372,7.484043,45.052305,1.317819,28.198138,21.362589,49.246011,15.629433,46.773936
2,25427,1,23,2,9.738475,52.966755,6.678191,56.258865,11.175975,55.811613,...,5.019504,46.264628,2.594415,50.036791,2.754433,47.044326,20.695035,54.775266,17.614362,53.495567
3,25427,1,23,3,9.041667,55.149823,1.246897,26.95656,14.569592,56.579344,...,8.310284,54.762411,1.042553,42.951684,7.538564,52.139184,22.556294,56.579344,17.912677,55.403812
4,25427,1,23,4,15.931738,72.471631,1.218528,35.547429,14.373227,71.721188,...,7.495567,68.568262,21.39539,70.06383,2.14406,47.284574,31.000887,71.721188,26.534131,71.721188
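
The merge and recode in the cell above can be sanity-checked on toy data. The sketch below uses made-up values and a stand-in name `features` for the feature matrix; the `validate` argument of `pd.merge` is an optional addition (available in recent pandas versions) that is not used in the notebook.

```python
import pandas as pd

# Toy stand-ins for the two tables (all values are made up)
phenotypic = pd.DataFrame({
    "SUBID": [1, 2],
    "SEX": [2, 1],
    "AGE_AT_SCAN_1": [23, 28],
})
features = pd.DataFrame({
    "SUBID": [1, 1, 2],
    "SESSION": [1, 2, 1],
    0: [0.5, 0.7, 0.9],
})

# One phenotypic row per subject should match many feature rows;
# validate raises an error if SUBID is duplicated on the left side
merged = pd.merge(phenotypic, features, how="inner", on="SUBID",
                  validate="one_to_many")

# Same recoding as in the notebook: 2 -> 1, 1 -> 0
merged["SEX"] = merged["SEX"].map({2: 1, 1: 0})
```

An inner merge drops any feature rows whose `SUBID` is missing from the phenotypic table, which is why comparing the row count before and after the merge is a useful check.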


In [8]:
# Save to csv
df.to_csv('../data/hnu1-weighted_klocality_with_age-sex.csv', index=False)