# HNU1 Feature Matrix

Vivek Gopalakrishnan

October 23, 2018


## Overview
Generate a feature matrix from the HNU1 edgelists using scan statistics. Incorporate phenotypic data so that we have labels for age and sex.

## Methods
- Use the k-hop locality scan statistic to make a feature vector for each graph
- Concatenate these vectors to create a feature matrix
- Match patient IDs between the feature matrix and the phenotypic data to get labels for sex and age

## Central Definitions
- k-hop locality: the number of edges in the subgraph induced by the vertices within $k$ hops of the node $u$ (including $u$ itself)
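
To make the definition concrete, here is a minimal, dependency-free sketch on a toy graph. The function name `k_hop_edge_count` and the adjacency-dict representation are assumptions made for illustration; the notebook itself computes the same quantity with NetworkX.

```python
from collections import deque

def k_hop_edge_count(adj, u, k):
    """Count edges in the subgraph induced by all nodes within k hops of u."""
    # Breadth-first search from u, stopping at depth k
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == k:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    nodes = set(dist)
    # Each undirected edge is listed twice in the adjacency lists
    return sum(1 for a in nodes for b in adj[a] if b in nodes) // 2

# Hypothetical toy graph with edges 0-1, 1-2, 1-3, 2-3
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}

one_hop = k_hop_edge_count(adj, 0, 1)  # nodes {0, 1}
two_hop = k_hop_edge_count(adj, 0, 2)  # nodes {0, 1, 2, 3}
print(one_hop, two_hop)
```

For node 0, the 1-hop neighborhood contains only the edge 0-1, while the 2-hop neighborhood covers the whole toy graph and therefore all four edges.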

## Important notes
- To run this, you need NetworkX v1.9. To get this working, create a virtual environment, install NetworkX v1.9, and set up a custom kernel for Jupyter to run on.

### Step 1:
- Import necessary packages
- Write function to calculate k-hop locality

In [1]:
import pandas as pd
import networkx as nx
from edge_fetch import edge_terrier

In [2]:
# Function to calculate 1-hop and 2-hop locality

def khop_locality(G, filename):
    
    # Process the filename to create a patient_id and session number 
    patient_id = int(filename.split('_')[0].split('-')[1])
    session = int(filename.split('_')[1].split('-')[1])
    embed = [patient_id, session]
    
    # Loop through all of the nodes in G and calculate 1-hop and 2-hop locality
    for node in G.nodes():

        one_hop = list(nx.single_source_shortest_path_length(G, node, cutoff=1).keys())
        two_hop = list(nx.single_source_shortest_path_length(G, node, cutoff=2).keys())

        embed += len(G.subgraph(one_hop).edges()), len(G.subgraph(two_hop).edges())
    
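    # Ensure that there were indeed 48 nodes in the graph G:
    # 2 ID fields + 2 statistics per node * 48 nodes = 98 entries.
    # Graphs of any other size fall through and return None.
    if len(embed) == 98: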
        return embed
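
The filename handling and the length check in `khop_locality` can be traced in isolation. The sketch below assumes a filename of the form `sub-<id>_ses-<n>_...`; the example filename and the helper name `parse_ids` are hypothetical and chosen only to match what the `split` calls expect.

```python
def parse_ids(filename):
    """Extract (patient_id, session) from a 'sub-<id>_ses-<n>_...' filename."""
    subject_part, session_part = filename.split('_')[:2]
    patient_id = int(subject_part.split('-')[1])  # int() drops leading zeros
    session = int(session_part.split('-')[1])
    return patient_id, session

# Hypothetical filename; the real naming scheme is assumed, not confirmed
example = 'sub-0025427_ses-1_dwi_JHU.gpickle'
ids = parse_ids(example)

# Two ID fields plus a 1-hop and a 2-hop count for each of 48 nodes
expected_length = 2 + 2 * 48
print(ids, expected_length)
```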

### Step 2:
- Make feature vector for each edgelist
- Construct single feature matrix using pandas

In [3]:
# Custom class that pulls files from a given s3 bucket
f = edge_terrier(filepath='data/HNU1/ndmg_0-0-48/graphs/JHU/')

In [4]:
# Compute all embeddings
all_embeddings = []

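for file in f.filelist:

    G, filename = f.convert_gpickle(file)

    # Skip files that could not be converted to a graph
    if G is None:
        continue

    # khop_locality returns None for graphs that do not have 48 nodes
    embed = khop_locality(G, filename)
    if embed is not None:
        all_embeddings.append(embed)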

In [5]:
# Convert list-of-lists to a pandas dataframe
df = pd.DataFrame.from_records(all_embeddings)
df.columns = ['SUBID', 'SESSION'] + list(range(96))

# View the dataframe
print(df.shape)
df.head()

(300, 98)


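|  | SUBID | SESSION | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25427 | 10 | 120 | 388 | 5 | 122 | 80 | 377 | 235 | 384 | ... | 27 | 352 | 14 | 310 | 21 | 324 | 122 | 377 | 86 | 377 |
| 1 | 25427 | 1 | 77 | 333 | 11 | 190 | 57 | 344 | 225 | 359 | ... | 21 | 302 | 49 | 299 | 10 | 208 | 127 | 344 | 78 | 324 |
| 2 | 25427 | 2 | 70 | 360 | 51 | 380 | 61 | 369 | 222 | 375 | ... | 26 | 295 | 15 | 321 | 19 | 331 | 110 | 362 | 88 | 355 |
| 3 | 25427 | 3 | 64 | 371 | 15 | 193 | 79 | 367 | 233 | 383 | ... | 44 | 353 | 6 | 256 | 54 | 347 | 126 | 367 | 89 | 357 |
| 4 | 25427 | 4 | 96 | 428 | 14 | 217 | 63 | 413 | 298 | 428 | ... | 31 | 382 | 119 | 403 | 14 | 287 | 177 | 413 | 154 | 413 |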
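
Because `khop_locality` appends a 1-hop count and then a 2-hop count for each node, the even-numbered feature columns hold 1-hop values and the odd-numbered ones hold 2-hop values. A hedged sketch of more descriptive labels follows; the `pos<i>_<k>hop` naming is an assumption, and `pos` refers to a node's position in `G.nodes()`, which is assumed (not confirmed by the source) to be the same for every graph.

```python
n_nodes = 48

# One 1-hop and one 2-hop label per node position, interleaved like the embedding
feature_names = []
for position in range(n_nodes):
    feature_names += [f'pos{position}_1hop', f'pos{position}_2hop']

columns = ['SUBID', 'SESSION'] + feature_names
print(len(columns), columns[:4])
```

Assigning `columns` to `df.columns` in place of the integer range would keep the same 98-column layout while making each statistic identifiable.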


### Step 3:
- Read in phenotypic data
- Map sex and age labels to data matrix
- Make a single dataframe and export it as a csv

In [6]:
# Read in important columns from phenotypic data
phenotypic = pd.read_csv('HNU1.csv').loc[:,['SUBID', 'SEX', 'AGE_AT_SCAN_1']]

# Remove all duplicates (each patient is listed 10 times because their brains were scanned 10 times)
phenotypic = phenotypic.drop_duplicates(subset='SUBID')
phenotypic.head()

|  | SUBID | SEX | AGE_AT_SCAN_1 |
|---|---|---|---|
| 0 | 25427 | 2 | 23 |
| 10 | 25428 | 1 | 28 |
| 20 | 25441 | 1 | 24 |
| 30 | 25442 | 1 | 27 |
| 40 | 25443 | 2 | 25 |


In [7]:
# Make single df
df = pd.merge(phenotypic, df, how='inner', on=['SUBID'])

# Recode SEX from {1, 2} to {0, 1}.
# I am not sure which sex (male or female) each of these numbers encodes
df['SEX'] = df['SEX'].map({2: 1, 1: 0})

# Rename columns
df.columns = ['id', 'sex', 'age_at_scan_1', 'session'] + list(range(96))

# Check that the row count is unchanged (300 rows) and that the merge looks good
print(df.shape)
df.head()

(300, 100)


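|  | id | sex | age_at_scan_1 | session | 0 | 1 | 2 | 3 | 4 | 5 | ... | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25427 | 1 | 23 | 10 | 120 | 388 | 5 | 122 | 80 | 377 | ... | 27 | 352 | 14 | 310 | 21 | 324 | 122 | 377 | 86 | 377 |
| 1 | 25427 | 1 | 23 | 1 | 77 | 333 | 11 | 190 | 57 | 344 | ... | 21 | 302 | 49 | 299 | 10 | 208 | 127 | 344 | 78 | 324 |
| 2 | 25427 | 1 | 23 | 2 | 70 | 360 | 51 | 380 | 61 | 369 | ... | 26 | 295 | 15 | 321 | 19 | 331 | 110 | 362 | 88 | 355 |
| 3 | 25427 | 1 | 23 | 3 | 64 | 371 | 15 | 193 | 79 | 367 | ... | 44 | 353 | 6 | 256 | 54 | 347 | 126 | 367 | 89 | 357 |
| 4 | 25427 | 1 | 23 | 4 | 96 | 428 | 14 | 217 | 63 | 413 | ... | 31 | 382 | 119 | 403 | 14 | 287 | 177 | 413 | 154 | 413 |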
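
The deduplicate-then-merge pattern from Step 3 can be checked on toy data. The values below are made up, and the `validate='one_to_many'` argument is an addition not used in the notebook: it asks pandas to raise an error if a subject ID still appears more than once in the phenotypic table.

```python
import pandas as pd

# Toy phenotypic table: one row per scan, so each subject repeats (made-up values)
phenotypic = pd.DataFrame({
    'SUBID': [1, 1, 2, 2],
    'SEX': [2, 2, 1, 1],
    'AGE_AT_SCAN_1': [23, 23, 28, 28],
})
phenotypic = phenotypic.drop_duplicates(subset='SUBID')

# Toy feature table: one row per (subject, session) with a single feature column
features = pd.DataFrame({
    'SUBID': [1, 1, 2, 2],
    'SESSION': [1, 2, 1, 2],
    0: [10, 11, 12, 13],
})

# Each subject's labels are copied onto every one of that subject's sessions
merged = pd.merge(phenotypic, features, how='inner', on='SUBID',
                  validate='one_to_many')
print(merged.shape)
```

The merged frame keeps one row per session, which is why the real output still has 300 rows after the merge.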


In [8]:
# Save to csv
df.to_csv('../data/hnu1-klocality_with_age-sex.csv', index=False)