## Download data

This step is common to all packages. It downloads the Cora dataset and unpacks it into `~/cora`:

In [1]:
import os
import pandas as pd
import requests

# Keep the dataset in ~/cora; create the directory on first run.
data_dir = os.path.expanduser("~/cora")
os.makedirs(data_dir, exist_ok=True)

# Download the archive and fail early on an HTTP error.
cora_tgz = os.path.join(data_dir, "cora.tgz")
response = requests.get("https://temprl.com/cora.tgz")
response.raise_for_status()
with open(cora_tgz, "wb") as output:
    output.write(response.content)

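import tarfile

# Unpack every regular file straight into data_dir, dropping the archive's folder prefix.
with tarfile.open(cora_tgz) as z:
    for member in z:
        if not member.isfile():
            continue
        fname = os.path.basename(member.name)
        with z.extractfile(member) as src, open(os.path.join(data_dir, fname), "wb") as dst:
            dst.write(src.read())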
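The unpacking loop flattens the archive: each file lands directly in the data directory, whatever folder it sat in inside the tarball. A minimal sketch of that idea as a reusable function follows, exercised on a tiny stand-in archive so that it runs without network access. The helper name `extract_flat` and the demo file contents are assumptions made for illustration, not part of the dataset.

```python
import io
import os
import tarfile
import tempfile


def extract_flat(archive_path, dest_dir):
    """Write every regular file in the archive directly into dest_dir,
    dropping any directory prefix. Returns the sorted file names written."""
    written = []
    with tarfile.open(archive_path) as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            fname = os.path.basename(member.name)
            with archive.extractfile(member) as src, \
                    open(os.path.join(dest_dir, fname), "wb") as dst:
                dst.write(src.read())
            written.append(fname)
    return sorted(written)


# Build a small stand-in archive (placeholder payloads, not real Cora data).
work_dir = tempfile.mkdtemp()
demo_tgz = os.path.join(work_dir, "demo.tgz")
with tarfile.open(demo_tgz, "w:gz") as archive:
    for name, payload in [("cora/cora.cites", b"edges"), ("cora/cora.content", b"nodes")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

out_dir = os.path.join(work_dir, "out")
os.makedirs(out_dir)
extracted = extract_flat(demo_tgz, out_dir)
```

Flattening is convenient here because the later cells read `cora.cites` and `cora.content` straight from the data directory. The trade-off is that two files with the same base name in different folders would overwrite each other.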

## NetworkX

NetworkX is the most widely used graph package in Python. It does no machine learning itself, but it offers a comprehensive graph analysis API and performs well on small and medium-sized datasets.
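To give a feel for that analysis API, here is a short sketch on a toy graph (the graph itself is made up for illustration and has nothing to do with Cora). It uses three standard NetworkX calls: node degrees, connected components, and shortest paths.

```python
import networkx as nx

# Toy undirected graph: a path 1-2-3-4 plus a separate pair 5-6.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (5, 6)])

# Degree of every node as a plain dict.
degrees = dict(G.degree())

# Connected components, each converted to a sorted list of nodes.
components = sorted(sorted(c) for c in nx.connected_components(G))

# Shortest path between two nodes of the same component.
path = nx.shortest_path(G, source=1, target=4)
```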

In [2]:
import networkx as nx

edge_data = pd.read_csv(os.path.join(data_dir, "cora.cites"), sep='\t', header=None, names=["target", "source"])
edge_data["label"] = "cites"

Each row of the edge list is just a source-target pair; apart from the constant `label` added above, the edges carry no payload:

In [3]:
edge_data.sample(frac=1).head(5)

,target,source,label
2438,24530,1109581,cites
2241,20179,91852,cites
3596,73162,714260,cites
3929,100701,1107041,cites
4953,560936,562123,cites


In [4]:
Gnx = nx.from_pandas_edgelist(edge_data, edge_attr="label")
nx.set_node_attributes(Gnx, "paper", "label")

In [5]:
Gnx.nodes[1103985]

{'label': 'paper'}
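By default `nx.from_pandas_edgelist` reads the columns named `source` and `target` and returns an undirected `nx.Graph`, so the direction of a citation is not kept. If the direction matters, a directed graph can be requested with `create_using`. The sketch below uses three of the edge rows sampled earlier; the variable names are assumptions for illustration.

```python
import networkx as nx
import pandas as pd

# Three rows taken from the sampled edge list shown earlier.
toy_edges = pd.DataFrame({
    "target": [24530, 20179, 73162],
    "source": [1109581, 91852, 714260],
})
toy_edges["label"] = "cites"

# Default call: undirected graph, edge direction is lost.
undirected = nx.from_pandas_edgelist(toy_edges, edge_attr="label")

# Directed variant: edges run from the "source" column to the "target" column.
directed = nx.from_pandas_edgelist(
    toy_edges,
    source="source",
    target="target",
    edge_attr="label",
    create_using=nx.DiGraph,
)

# A single non-dict value is assigned to every node under the given name.
nx.set_node_attributes(directed, "paper", "label")
```

In the undirected graph `has_edge` answers the same for both orderings of a pair, whereas the directed graph only contains the edge from source to target.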

In [6]:
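feature_names = ["w_{}".format(ii) for ii in range(1433)]
column_names = feature_names + ["subject"]
# The file has one more column than there are names, so pandas uses the
# leading paper ID column as the index.
node_data = pd.read_csv(os.path.join(data_dir, "cora.content"), sep='\t', header=None, names=column_names)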
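The `read_csv` call passes one fewer name than the file has columns, which is why the paper ID ends up as the DataFrame index rather than as a regular column. The sketch below reproduces this on a two-row, tab-separated string and compares it with naming the index explicitly. The paper IDs and subjects come from the output shown in this section, while the two weight columns are toy values and `paper_id` is a hypothetical column name.

```python
import io

import pandas as pd

# Toy stand-in for cora.content: paper ID, two weights, subject.
raw = "31336\t0\t1\tNeural_Networks\n1061127\t1\t0\tRule_Learning\n"
names = ["w_0", "w_1", "subject"]

# Fewer names than columns: the first column becomes the index implicitly.
implicit = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=names)

# Same result, but with the index column spelled out.
explicit = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["paper_id"] + names,
    index_col="paper_id",
)
```

The explicit form is a little longer, but it makes the role of the first column visible and gives the index a name.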

Each node's payload consists of the 1433 word weights (columns `w_0` to `w_1432`) and the subject label:

In [7]:
node_data.head(5)

,w_0,w_1,w_2,w_3,w_4,w_5,w_6,w_7,w_8,w_9,...,w_1424,w_1425,w_1426,w_1427,w_1428,w_1429,w_1430,w_1431,w_1432,subject
31336,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,Neural_Networks
1061127,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,Rule_Learning
1106406,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Reinforcement_Learning
13195,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Reinforcement_Learning
37879,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Probabilistic_Methods


There are seven subjects:

In [8]:
set(node_data["subject"])

{'Case_Based',
 'Genetic_Algorithms',
 'Neural_Networks',
 'Probabilistic_Methods',
 'Reinforcement_Learning',
 'Rule_Learning',
 'Theory'}
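Two natural follow-ups are counting how many papers fall under each subject and attaching the subject to the graph nodes. A minimal sketch on the five rows shown above follows; the toy graph contains only those five nodes and the variable names are assumptions for illustration.

```python
import networkx as nx
import pandas as pd

# The five paper IDs and subjects from the node table shown above.
toy_nodes = pd.DataFrame(
    {"subject": [
        "Neural_Networks",
        "Rule_Learning",
        "Reinforcement_Learning",
        "Reinforcement_Learning",
        "Probabilistic_Methods",
    ]},
    index=[31336, 1061127, 1106406, 13195, 37879],
)

# Number of papers per subject, most frequent first.
subject_counts = toy_nodes["subject"].value_counts()

# Attach each paper's subject to the matching graph node.
G = nx.Graph()
G.add_nodes_from(toy_nodes.index)
nx.set_node_attributes(G, toy_nodes["subject"].to_dict(), "subject")
```

Passing a dict keyed by node ID sets a different value per node, in contrast to the single string used earlier to label every node as a paper.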

If you would rather not keep the weights in separate columns, you can merge them into a single comma-separated string column:

In [9]:
weight_column_names = node_data.columns[0:-1]
node_data['content'] = node_data[weight_column_names].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
node_data.drop(weight_column_names, axis=1, inplace=True)

In [10]:
node_data.head()

,subject,content
31336,Neural_Networks,"0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,..."
1061127,Rule_Learning,"0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,..."
1106406,Reinforcement_Learning,"0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,..."
13195,Reinforcement_Learning,"0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,..."
37879,Probabilistic_Methods,"0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,..."


Note that the `content` column is not an embedding: it is the encoded article content, that is, the original weights joined into one string.
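Because the merged column is only a string encoding of the weights, it can be split back into numeric columns when a package needs a feature matrix. A minimal sketch of the round trip follows, using three toy weight columns instead of the real 1433; the weight values are made up for illustration.

```python
import pandas as pd

# Toy node table with three weight columns instead of 1433.
toy = pd.DataFrame(
    {
        "w_0": [0, 1],
        "w_1": [1, 0],
        "w_2": [0, 0],
        "subject": ["Neural_Networks", "Rule_Learning"],
    },
    index=[31336, 1061127],
)
weight_cols = toy.columns[:-1]

# Merge: join the weights of each row into one comma-separated string.
toy["content"] = toy[weight_cols].apply(
    lambda row: ",".join(row.astype(str)),
    axis=1,
)

# Split: recover one integer column per weight from the string.
decoded = toy["content"].str.split(",", expand=True).astype(int)
decoded.columns = weight_cols
```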