# Creating a feature matrix from a networkx graph

In this notebook we will look at a few ways to quickly create a feature matrix from a networkx graph.

In [1]:
import networkx as nx
import pandas as pd

G = nx.read_gpickle('major_us_cities')
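
Note that `nx.read_gpickle` was removed in networkx 3.0. On newer versions, a graph saved this way can usually be loaded with the standard `pickle` module instead, assuming the file was pickled with a compatible networkx version. The sketch below is a minimal round trip on a hypothetical two-node graph; `toy` and the file name `toy_graph.pickle` are made up for illustration and are not part of this notebook's data.

```python
import pickle
import tempfile
from pathlib import Path

import networkx as nx

# Tiny stand-in graph (hypothetical data, not the notebook's file)
toy = nx.Graph()
toy.add_node('A', population=100)
toy.add_node('B', population=200)
toy.add_edge('A', 'B', weight=1.5)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy_graph.pickle'

    # Plain-pickle counterpart of the old nx.write_gpickle
    with open(path, 'wb') as f:
        pickle.dump(toy, f, pickle.HIGHEST_PROTOCOL)

    # Plain-pickle counterpart of the old nx.read_gpickle
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded.nodes['A'])
```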

## Node based features

In [2]:
G.nodes(data=True)

[('San Diego, CA', {'location': (-117, 32), 'population': 1355896}),
 ('Las Vegas, NV', {'location': (-115, 36), 'population': 603488}),
 ('Fort Worth, TX', {'location': (-97, 32), 'population': 792727}),
 ('Baltimore, MD', {'location': (-76, 39), 'population': 622104}),
 ('Tucson, AZ', {'location': (-110, 32), 'population': 526116}),
 ('Phoenix, AZ', {'location': (-112, 33), 'population': 1513367}),
 ('New Orleans, LA', {'location': (-90, 29), 'population': 378715}),
 ('Detroit, MI', {'location': (-83, 42), 'population': 688701}),
 ('San Francisco, CA', {'location': (-122, 37), 'population': 837442}),
 ('Cleveland, OH', {'location': (-81, 41), 'population': 390113}),
 ('Mesa, AZ', {'location': (-111, 33), 'population': 457587}),
 ('Long Beach, CA', {'location': (-118, 33), 'population': 469428}),
 ('Albuquerque, NM', {'location': (-106, 35), 'population': 556495}),
 ('Boston, MA', {'location': (-71, 42), 'population': 645966}),
 ('Washington D.C.', {'location': (-77, 38), 'population'

In [3]:
# Initialize the dataframe, using the nodes as the index
df = pd.DataFrame(index=G.nodes())

### Extracting attributes

Using `nx.get_node_attributes` it's easy to extract the node attributes in the graph into DataFrame columns.

In [7]:
df['location'] = pd.Series(nx.get_node_attributes(G, 'location'))
df['population'] = pd.Series(nx.get_node_attributes(G, 'population'))

df.head()

|  | location | population |
|---|---|---|
| San Diego, CA | (-117, 32) | 1355896 |
| Las Vegas, NV | (-115, 36) | 603488 |
| Fort Worth, TX | (-97, 32) | 792727 |
| Baltimore, MD | (-76, 39) | 622104 |
| Tucson, AZ | (-110, 32) | 526116 |

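A feature matrix generally needs numeric columns, and `location` holds tuples. As a minimal sketch, assuming each tuple is ordered `(longitude, latitude)` (which matches the sample values above but is not stated in the data), the column can be split in two. The graph `G_toy` and dataframe `df_toy` are hypothetical stand-ins built from the first two rows.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in graph using the first two cities from the output above
G_toy = nx.Graph()
G_toy.add_node('San Diego, CA', location=(-117, 32), population=1355896)
G_toy.add_node('Las Vegas, NV', location=(-115, 36), population=603488)

df_toy = pd.DataFrame(index=list(G_toy.nodes()))
df_toy['location'] = pd.Series(nx.get_node_attributes(G_toy, 'location'))
df_toy['population'] = pd.Series(nx.get_node_attributes(G_toy, 'population'))

# Assumption: tuples are (longitude, latitude); unpack them into numeric columns
df_toy['longitude'] = df_toy['location'].map(lambda loc: loc[0])
df_toy['latitude'] = df_toy['location'].map(lambda loc: loc[1])

# Drop the tuple column so that every remaining column is numeric
df_toy = df_toy.drop(columns='location')
print(df_toy)
```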

### Creating node based features

Most networkx functions that compute node-level metrics return a dictionary keyed by node, which can be wrapped in a `pd.Series` and added to our dataframe as a column.

In [8]:
df['clustering'] = pd.Series(nx.clustering(G))
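# dict() keeps this working on networkx 2.0+, where G.degree() returns a view rather than a dict
df['degree'] = pd.Series(dict(G.degree()))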

df

|  | location | population | clustering | degree |
|---|---|---|---|---|
| San Diego, CA | (-117, 32) | 1355896 | 0.745455 | 11 |
| Las Vegas, NV | (-115, 36) | 603488 | 0.666667 | 12 |
| Fort Worth, TX | (-97, 32) | 792727 | 0.763636 | 11 |
| Baltimore, MD | (-76, 39) | 622104 | 0.8 | 10 |
| Tucson, AZ | (-110, 32) | 526116 | 0.75 | 8 |
| Phoenix, AZ | (-112, 33) | 1513367 | 0.694444 | 9 |
| New Orleans, LA | (-90, 29) | 378715 | 0.607143 | 8 |
| Detroit, MI | (-83, 42) | 688701 | 0.672727 | 11 |
| San Francisco, CA | (-122, 37) | 837442 | 1.0 | 8 |
| Cleveland, OH | (-81, 41) | 390113 | 0.659341 | 14 |

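The same pattern extends to other per-node metrics. The sketch below uses a hypothetical four-node graph (`G_toy`, a triangle with one extra node attached) and adds `nx.degree_centrality` as an additional, illustrative feature. Wrapping `G_toy.degree()` in `dict()` is an assumption about the installed version: on networkx 2.0 and later, `degree()` returns a view rather than a dictionary.

```python
import networkx as nx
import pandas as pd

# Hypothetical graph: triangle A-B-C plus a pendant node D attached to C
G_toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

features = pd.DataFrame(index=list(G_toy.nodes()))

# Each metric is a node-keyed mapping, so pandas aligns it on the index labels
features['degree'] = pd.Series(dict(G_toy.degree()))
features['clustering'] = pd.Series(nx.clustering(G_toy))
features['degree_centrality'] = pd.Series(nx.degree_centrality(G_toy))

print(features)
```

Because the values are aligned by node label rather than by position, the order in which a function returns its nodes does not matter.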

## Edge based features

In [9]:
G.edges(data=True)

[('San Diego, CA', 'Long Beach, CA', {'weight': 151.45008247402757}),
 ('San Diego, CA', 'Tucson, AZ', {'weight': 587.0077247254917}),
 ('San Diego, CA', 'Los Angeles, CA', {'weight': 179.29746717246732}),
 ('San Diego, CA', 'Las Vegas, NV', {'weight': 426.17584242277843}),
 ('San Diego, CA', 'Fresno, CA', {'weight': 507.4313769710358}),
 ('San Diego, CA', 'Phoenix, AZ', {'weight': 480.550984974869}),
 ('San Diego, CA', 'San Jose, CA', {'weight': 669.6772910512424}),
 ('San Diego, CA', 'San Francisco, CA', {'weight': 737.1473892523547}),
 ('San Diego, CA', 'Oakland, CA', {'weight': 730.9953136286227}),
 ('San Diego, CA', 'Mesa, AZ', {'weight': 502.32635614606744}),
 ('San Diego, CA', 'Sacramento, CA', {'weight': 760.0327691692984}),
 ('Las Vegas, NV', 'Long Beach, CA', {'weight': 385.2597725411484}),
 ('Las Vegas, NV', 'Albuquerque, NM', {'weight': 779.953416852203}),
 ('Las Vegas, NV', 'Fresno, CA', {'weight': 418.94992726715964}),
 ('Las Vegas, NV', 'San Jose, CA', {'weight': 614.393

In [12]:
# Initialize the dataframe, using the edges as the index
df = pd.DataFrame(index=G.edges())

### Extracting attributes

Using `nx.get_edge_attributes`, it's easy to extract the edge attributes in the graph into DataFrame columns.

In [13]:
df['weight'] = pd.Series(nx.get_edge_attributes(G, 'weight'))

df

|  | weight |
|---|---|
| ('San Diego, CA', 'Long Beach, CA') | 151.450082 |
| ('San Diego, CA', 'Tucson, AZ') | 587.007725 |
| ('San Diego, CA', 'Los Angeles, CA') | 179.297467 |
| ('San Diego, CA', 'Las Vegas, NV') | 426.175842 |
| ('San Diego, CA', 'Fresno, CA') | 507.431377 |
| ('San Diego, CA', 'Phoenix, AZ') | 480.550985 |
| ('San Diego, CA', 'San Jose, CA') | 669.677291 |
| ('San Diego, CA', 'San Francisco, CA') | 737.147389 |
| ('San Diego, CA', 'Oakland, CA') | 730.995314 |
| ('San Diego, CA', 'Mesa, AZ') | 502.326356 |

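`nx.get_edge_attributes` only returns edges that actually carry the attribute. As an alternative sketch, the column can be built positionally by iterating over `G.edges(data=True)` in the same order as the index and supplying a default for missing values. The graph `G_toy`, the `source`/`target` level names, and the NaN default are illustrative choices, not part of the original notebook.

```python
import networkx as nx
import pandas as pd

# Hypothetical graph in which one edge has no 'weight' attribute
G_toy = nx.Graph()
G_toy.add_edge('A', 'B', weight=1.5)
G_toy.add_edge('B', 'C')  # deliberately left without a weight
G_toy.add_edge('A', 'C', weight=4.0)

edges = list(G_toy.edges())
edge_df = pd.DataFrame(
    index=pd.MultiIndex.from_tuples(edges, names=['source', 'target'])
)

# edges(data=True) iterates in the same order as edges(), so a plain list lines up
edge_df['weight'] = [
    data.get('weight', float('nan')) for _, _, data in G_toy.edges(data=True)
]

print(edge_df)
```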

### Creating edge based features

Many of the networkx functions related to edges return nested data structures. For example, `nx.preferential_attachment` yields `(u, v, score)` tuples. We can extract the relevant data using a list comprehension.

In [14]:
df['preferential attachment'] = [score for _, _, score in nx.preferential_attachment(G, df.index)]

df

|  | weight | preferential attachment |
|---|---|---|
| ('San Diego, CA', 'Long Beach, CA') | 151.450082 | 121 |
| ('San Diego, CA', 'Tucson, AZ') | 587.007725 | 88 |
| ('San Diego, CA', 'Los Angeles, CA') | 179.297467 | 121 |
| ('San Diego, CA', 'Las Vegas, NV') | 426.175842 | 132 |
| ('San Diego, CA', 'Fresno, CA') | 507.431377 | 99 |
| ('San Diego, CA', 'Phoenix, AZ') | 480.550985 | 99 |
| ('San Diego, CA', 'San Jose, CA') | 669.677291 | 88 |
| ('San Diego, CA', 'San Francisco, CA') | 737.147389 | 88 |
| ('San Diego, CA', 'Oakland, CA') | 730.995314 | 88 |
| ('San Diego, CA', 'Mesa, AZ') | 502.326356 | 88 |

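To make the nested structure concrete: `nx.preferential_attachment` yields `(u, v, score)` 3-tuples, where the score is the product of the two nodes' degrees. A small sketch on a hypothetical four-node graph (`G_toy`, `pairs`, and `score_by_pair` are made-up names):

```python
import networkx as nx

# Hypothetical graph: triangle A-B-C plus a pendant node D attached to C
G_toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])
pairs = [('A', 'B'), ('C', 'D')]

# Each item is a (u, v, score) tuple; materialise the generator to inspect it
scores = list(nx.preferential_attachment(G_toy, pairs))
print(scores)  # [('A', 'B', 4), ('C', 'D', 3)]

# Keying the scores by node pair avoids relying on positional order
score_by_pair = {(u, v): score for u, v, score in scores}
```

Here `A` and `B` both have degree 2, giving 2 × 2 = 4, while `C` has degree 3 and `D` has degree 1, giving 3.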

When a function expects the two endpoint nodes as separate arguments, we can map a lambda function over the index.

In [15]:
# Each index entry is an edge tuple (u, v); count the neighbors its endpoints share
df['Common Neighbors'] = df.index.map(lambda edge: len(list(nx.common_neighbors(G, edge[0], edge[1]))))

df

|  | weight | preferential attachment | Common Neighbors |
|---|---|---|---|
| ('San Diego, CA', 'Long Beach, CA') | 151.450082 | 121 | 10 |
| ('San Diego, CA', 'Tucson, AZ') | 587.007725 | 88 | 5 |
| ('San Diego, CA', 'Los Angeles, CA') | 179.297467 | 121 | 10 |
| ('San Diego, CA', 'Las Vegas, NV') | 426.175842 | 132 | 10 |
| ('San Diego, CA', 'Fresno, CA') | 507.431377 | 99 | 8 |
| ('San Diego, CA', 'Phoenix, AZ') | 480.550985 | 99 | 6 |
| ('San Diego, CA', 'San Jose, CA') | 669.677291 | 88 | 7 |
| ('San Diego, CA', 'San Francisco, CA') | 737.147389 | 88 | 7 |
| ('San Diego, CA', 'Oakland, CA') | 730.995314 | 88 | 7 |
| ('San Diego, CA', 'Mesa, AZ') | 502.326356 | 88 | 5 |

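Other pairwise measures follow the same two patterns. The sketch below, again on a hypothetical graph `G_toy`, computes common-neighbor counts with a plain list comprehension and adds `nx.jaccard_coefficient` as an extra illustrative feature. Neither the toy data nor the `jaccard` column appears in the original notebook.

```python
import networkx as nx
import pandas as pd

# Hypothetical graph: triangle A-B-C plus a pendant node D attached to C
G_toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

pairs = list(G_toy.edges())
edge_df = pd.DataFrame(index=pd.MultiIndex.from_tuples(pairs))

# Function that takes two nodes: call it once per (u, v) pair
edge_df['common_neighbors'] = [
    len(list(nx.common_neighbors(G_toy, u, v))) for u, v in pairs
]

# Function that takes a list of pairs: keep the third element of each result
edge_df['jaccard'] = [
    score for _, _, score in nx.jaccard_coefficient(G_toy, pairs)
]

print(edge_df)
```

Both columns are built from the same `pairs` list, in order, so they line up with the index without any label alignment.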