# Intro

This notebook shows how to convert data obtained from DBpedia via a query into a heterogeneous graph. The data is stored in a CSV file and will be converted to a graph using PyTorch Geometric.

We will build a knowledge graph and treat "NATO membership" as the target variable. The graph will contain the following node types:
1. Country
2. Legislature
3. House (of legislature)
4. Government type
5. Political subject

This structure mirrors the following Neo4j conceptual model:

![img](graph_model.png)

# Lib imports

In [129]:
import pandas as pd
import torch_geometric as tg
import torch_geometric.nn as tgnn
import torch as th
import torch.nn as nn
import networkx as nx
import numpy as np

from sklearn.preprocessing import LabelEncoder

# Reading the data

We will be interested in mapping the following entities to nodes in the graph:
1. Country - the name plus additional features: NATO member, EU member, Three Seas Initiative member.
2. Legislature - the name only.
3. House - identified by its name (`housename`) and connected to a legislature.
4. Government type - the name only.
5. Political subject - a wider concept connected to the government type.

In [2]:
raw_data = pd.read_csv("countries_data.csv")
raw_data.head()

Unnamed: 0,country_name,nato_member,eu_member,three_seas_member,countryid,legname,legislatureid,govtype,govtypeid,political_subject,subjectid,houseid,housename
0,Republic of Slovenia,1,1,1,27338,Slovenian Parliament,5663885,Unitary parliamentary republic,48467292,Unitary state,65734150,1025128.0,National Assembly of the Republic of Slovenia
1,Republic of Poland,1,1,1,22936,Parliament of Poland,2986705,Unitary parliamentary republic,48467292,Unitary state,65734150,462813.0,Senate of the Republic of Poland
2,Hungary,1,1,1,13275,National Assembly,585416,Unitary parliamentary republic,48467292,Unitary state,65734150,,
3,Slovak Republic,1,1,1,26830,National Council of the Slovak Republic,494968,Unitary parliamentary republic,48467292,Unitary state,65734150,,
4,Republic of Bulgaria,1,1,1,3415,National Assembly,2122384,Unitary parliamentary republic,48467292,Unitary state,65734150,,


In [3]:
cols_of_interest = {
    'country': ['country_name', 'nato_member', 'eu_member', 'three_seas_member'],
    'legislature': ['legname'],
    'house': ['housename'],
    'govtype': ['govtype'],
    'political_subject': ['political_subject'],
}

# Flatten the per-entity column lists into one list of columns to keep
data = raw_data.loc[:, [col for cols in cols_of_interest.values() for col in cols]]
data.head(3)

Unnamed: 0,country_name,nato_member,eu_member,three_seas_member,legname,housename,govtype,political_subject
0,Republic of Slovenia,1,1,1,Slovenian Parliament,National Assembly of the Republic of Slovenia,Unitary parliamentary republic,Unitary state
1,Republic of Poland,1,1,1,Parliament of Poland,Senate of the Republic of Poland,Unitary parliamentary republic,Unitary state
2,Hungary,1,1,1,National Assembly,,Unitary parliamentary republic,Unitary state


# Mapping the data to ids

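PyTorch Geometric graphs store numerical tensors and cannot hold textual data directly. We therefore map each name to an integer label, starting from 0.

For this purpose we use the `LabelEncoder` from scikit-learn. Note that missing house names are not dropped here, so the missing value is encoded as a label of its own (in the output below Hungary, which has no house listed, gets `housename` id 61).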

In [4]:
mappers = {
    'country_name': LabelEncoder(),
    'legname': LabelEncoder(),
    'housename': LabelEncoder(),
    'govtype': LabelEncoder(),
    'political_subject': LabelEncoder(),
}

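data_mapped = data.copy()

# Replace each name column with integer ids (0 .. n_classes - 1)
for colname, mapper in mappers.items():
    data_mapped[colname] = mapper.fit_transform(data[colname])

# Membership flags are already numeric and are kept unchanged
data_mapped[cols_of_interest['country'][1:]] = data[cols_of_interest['country'][1:]]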

In [5]:
data_mapped.head(3)

Unnamed: 0,country_name,nato_member,eu_member,three_seas_member,legname,housename,govtype,political_subject
0,131,1,1,1,149,28,17,7
1,123,1,1,1,116,51,17,7
2,32,1,1,1,56,61,17,7
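To make the mapping concrete, here is a minimal sketch of what `LabelEncoder` does on a toy column. The country list is a made-up example and is not taken from `countries_data.csv`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical values, not taken from countries_data.csv)
names = ["Poland", "Hungary", "Poland", "Slovenia"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the sorted unique values; a value's position is its id
print(list(encoder.classes_))   # ['Hungary', 'Poland', 'Slovenia']
print(codes.tolist())           # [1, 0, 1, 2]

# The fitted encoder can translate ids back to names
print(encoder.inverse_transform([0, 2]).tolist())   # ['Hungary', 'Slovenia']
```

Keeping the fitted encoders around (as the `mappers` dictionary does) is what makes it possible to translate node ids back to readable names later.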


# Building a graph

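Heterogeneous graphs in PyTorch Geometric are described by a set of **edge types**. Each edge type is keyed by a `(source node type, relation, target node type)` triplet and stores an `edge_index` tensor of shape `[2, num_edges]`: the first row contains the source node indices, the second row contains the target node indices.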

Additionally, we can provide a dictionary of **node features**. Each key in the dictionary is a node type, and the value is a tensor of node features. 

In our case, a node's feature is just its id (country, legislature, etc.). Country nodes additionally carry the EU and Three Seas Initiative membership flags, while NATO membership is kept aside as the target.
We will use the mapped id labels for each node type, e.g. USA=0, Germany=1, etc. When building a graph model, these ids can be used as indices into an embedding lookup table.
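As a rough illustration of the embedding lookup idea, the sketch below feeds integer ids into `torch.nn.Embedding`. The number of countries and the embedding size are arbitrary assumptions, not values used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

num_countries = 4    # hypothetical number of distinct country ids
embedding_dim = 8    # hypothetical embedding size

# One trainable vector per integer id
country_embedding = nn.Embedding(num_countries, embedding_dim)

ids = torch.tensor([0, 3, 3], dtype=torch.long)   # encoded country ids
vectors = country_embedding(ids)

print(vectors.shape)   # torch.Size([3, 8])
```

The same id always retrieves the same vector, which is why consistent ids per node type matter.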

First we will fill the node features dictionary.

In [173]:
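# One row per country, sorted by encoded id so that row i describes node i
country_features = data_mapped[cols_of_interest['country']].drop_duplicates().sort_values('country_name')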

In [169]:
# Step 1: Create heterogeneous graph
hetero_data = tg.data.HeteroData()

# Step 2: Add country features
hetero_data['country'].x = th.tensor(country_features.drop(columns='nato_member').values).to(th.float)

# Step 3: add y for country = NATO membership
hetero_data['country'].y = th.tensor(country_features['nato_member'].values)

# Step 4: add legislature id mappings
hetero_data['legislature'].x = th.tensor(np.arange(mappers['legname'].classes_.shape[0]))

# Step 5: add house id mappings
hetero_data['house'].x = th.tensor(np.arange(mappers['housename'].classes_.shape[0]))

# Step 6: add govtype id mappings
hetero_data['govtype'].x = th.tensor(np.arange(mappers['govtype'].classes_.shape[0]))

# Step 7: add political_subject id mappings
hetero_data['political_subject'].x = th.tensor(np.arange(mappers['political_subject'].classes_.shape[0]))

hetero_data

HeteroData(
  country={
    x=[173, 3],
    y=[173]
  },
  legislature={ x=[165] },
  house={ x=[62] },
  govtype={ x=[19] },
  political_subject={ x=[8] }
)
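Because edges refer to nodes by position, row `i` of a node feature matrix has to describe the node whose encoded id is `i`. The following is a minimal sketch of such a check on a toy frame; the values are hypothetical, and only the column names mirror the ones in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with encoded ids in arbitrary order (hypothetical values)
toy = pd.DataFrame({
    "country_name": [2, 0, 2, 1],
    "eu_member":    [1, 0, 1, 1],
})

features = (
    toy.drop_duplicates()
       .sort_values("country_name")
       .reset_index(drop=True)
)

# Row i must describe the node with id i, and each id must appear exactly once
aligned = np.array_equal(
    features["country_name"].to_numpy(), np.arange(len(features))
)
print(aligned)   # True
```

If the check fails, either the rows are out of order or some id occurs with more than one distinct feature combination.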

Next we will build the edge types dictionary.

In [170]:
# Building edge indices for the heterogeneous graph
hetero_data['country', 'is_a', 'govtype'].edge_index = th.tensor(data_mapped[['country_name', 'govtype']].drop_duplicates().values.T, dtype=th.long)
hetero_data['country', 'has_a', 'legislature'].edge_index = th.tensor(data_mapped[['country_name', 'legname']].drop_duplicates().values.T, dtype=th.long)
hetero_data['legislature', 'contains', 'house'].edge_index = th.tensor(data_mapped[['legname', 'housename']].drop_duplicates().values.T, dtype=th.long)
hetero_data['govtype', 'concerns', 'political_subject'].edge_index = th.tensor(data_mapped[['govtype', 'political_subject']].drop_duplicates().values.T, dtype=th.long)

In [171]:
hetero_data

HeteroData(
  country={
    x=[173, 3],
    y=[173]
  },
  legislature={ x=[165] },
  house={ x=[62] },
  govtype={ x=[19] },
  political_subject={ x=[8] },
  (country, is_a, govtype)={ edge_index=[2, 261] },
  (country, has_a, legislature)={ edge_index=[2, 193] },
  (legislature, contains, house)={ edge_index=[2, 257] },
  (govtype, concerns, political_subject)={ edge_index=[2, 22] }
)
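The pattern used for each edge type (select two id columns, drop duplicate pairs, transpose) can be seen on a small scale in the sketch below. The ids are invented; only the column names match the notebook.

```python
import pandas as pd

# Toy table of encoded ids, one row per record (hypothetical values)
toy = pd.DataFrame({
    "country_name": [0, 0, 1, 2],
    "legname":      [5, 5, 3, 4],
})

# Each distinct (country, legislature) pair becomes one edge
pairs = toy[["country_name", "legname"]].drop_duplicates()
edge_index = pairs.to_numpy().T   # row 0: sources, row 1: targets

print(edge_index.shape)      # (2, 3)
print(edge_index.tolist())   # [[0, 1, 2], [5, 3, 4]]
```

The repeated pair `(0, 5)` collapses into a single edge, which is why the edge counts in the output above are smaller than the number of rows in the table.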

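Many heterogeneous graph models require every edge type to have a reverse counterpart, so that messages can flow in both directions. We therefore add the reverse edges with the `ToUndirected` transform. The transform should be applied once to a freshly built graph: the `rev_rev_*` edge types in the output below are redundant and result from applying it to a graph that already contained reverse edges.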

In [174]:
to_undir = tg.transforms.ToUndirected()
hetero_data_undir = to_undir(hetero_data)
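For an edge type that connects two different node types, adding the reverse direction amounts to creating a second edge type whose source and target rows are swapped. The sketch below shows that idea in plain PyTorch on made-up ids; it does not call `ToUndirected` itself.

```python
import torch

# Toy edge_index for a relation (country -> govtype), hypothetical ids
edge_index = torch.tensor([[0, 1, 2],
                           [0, 0, 1]])

# Reverse relation (govtype -> country): swap the source and target rows
rev_edge_index = edge_index.flip(0)

print(rev_edge_index.tolist())   # [[0, 0, 1], [0, 1, 2]]
```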

In [175]:
hetero_data_undir

HeteroData(
  country={
    x=[173, 3],
    y=[173]
  },
  legislature={ x=[165] },
  house={ x=[62] },
  govtype={ x=[19] },
  political_subject={ x=[8] },
  (country, is_a, govtype)={ edge_index=[2, 261] },
  (country, has_a, legislature)={ edge_index=[2, 193] },
  (legislature, contains, house)={ edge_index=[2, 257] },
  (govtype, concerns, political_subject)={ edge_index=[2, 22] },
  (govtype, rev_is_a, country)={ edge_index=[2, 261] },
  (legislature, rev_has_a, country)={ edge_index=[2, 193] },
  (house, rev_contains, legislature)={ edge_index=[2, 257] },
  (political_subject, rev_concerns, govtype)={ edge_index=[2, 22] },
  (country, rev_rev_is_a, govtype)={ edge_index=[2, 261] },
  (country, rev_rev_has_a, legislature)={ edge_index=[2, 193] },
  (legislature, rev_rev_contains, house)={ edge_index=[2, 257] },
  (govtype, rev_rev_concerns, political_

# Split data

Now we can automatically split the graph into training, validation, and test sets without data leakage.
Of course, this example contains very few observations for NATO countries, but the idea remains the same regardless of the number of observations or the target variable.

In [163]:
node_split = tg.transforms.RandomNodeSplit(
    split='train_rest',
    num_val=15,
    num_test=15,
)
hetero_data_undir_split = node_split(hetero_data_undir)
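Conceptually, a random node split with `split='train_rest'` assigns fixed numbers of nodes to validation and test and leaves all remaining nodes for training. The sketch below reproduces that idea with plain PyTorch boolean masks; the sizes and the seed are arbitrary assumptions, and this is not the library's own implementation.

```python
import torch

num_nodes, num_val, num_test = 10, 2, 3   # hypothetical sizes
generator = torch.Generator().manual_seed(0)
perm = torch.randperm(num_nodes, generator=generator)

train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)

val_mask[perm[:num_val]] = True
test_mask[perm[num_val:num_val + num_test]] = True
train_mask[perm[num_val + num_test:]] = True   # "train_rest": everything else

print(int(train_mask.sum()), int(val_mask.sum()), int(test_mask.sum()))   # 5 2 3
```

Each node belongs to exactly one of the three masks, which is what keeps evaluation nodes out of the training loss.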

In [164]:
# Sample from the split graph so that the train/val/test masks are available
loader = tg.loader.HGTLoader(
    hetero_data_undir_split,
    num_samples={key: [10] * 2 for key in hetero_data_undir_split.node_types},
    batch_size=32,
    input_nodes=('country', hetero_data_undir_split['country'].train_mask),
)
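A mini-batch produced by such a loader contains the seed nodes plus the neighbours sampled around them, with the seed nodes placed first. When computing a loss, it is therefore common to restrict predictions and labels to the first `batch_size` rows. The sketch below illustrates only that slicing step with random stand-in tensors; the model output, the number of sampled nodes and the two-class setup are assumptions.

```python
import torch
import torch.nn.functional as F

batch_size = 32     # number of seed nodes, as configured in the loader
num_sampled = 64    # seed nodes plus sampled neighbours (hypothetical)
num_classes = 2     # NATO member or not

torch.manual_seed(0)
logits = torch.randn(num_sampled, num_classes)            # stand-in for model output
labels = torch.randint(0, num_classes, (num_sampled,))    # stand-in for batch['country'].y

# Only the first batch_size rows correspond to the seed (training) nodes
loss = F.cross_entropy(logits[:batch_size], labels[:batch_size])
print(loss.dim())   # 0, i.e. a scalar loss
```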

In [165]:
batch = next(iter(loader))
batch

HeteroData(
  country={
    x=[64, 3],
    y=[64],
    train_mask=[64],
    val_mask=[64],
    test_mask=[64],
    n_id=[64],
    input_id=[32],
    batch_size=32
  },
  legislature={
    x=[34, 1],
    n_id=[34]
  },
  house={
    x=[23, 1],
    n_id=[23]
  },
  govtype={
    x=[11, 1],
    n_id=[11]
  },
  political_subject={
    x=[6, 1],
    n_id=[6]
  },
  (country, is_a, govtype)={
    edge_index=[2, 75],
    e_id=[75]
  },
  (country, has_a, legislature)={
    edge_index=[2, 44],
    e_id=[44]
  },
  (legislature, contains, house)={
    edge_index=[2, 85],
    e_id=[85]
  },
  (govtype, concerns, political_subject)={
    edge_index=[2, 11],
    e_id=[11]
  },
  (govtype, rev_is_a, country)={
    edge_index=[2, 95],
    e_id=[95]
  },
  (legislature, rev_has_a, country)={
    edge_index=[2, 44],
    e_id=[44]
  },
  (house, rev_contains, legislature)={
    edge_index=[2, 98],
    e_id=