# Feature Engineering

In this notebook we're going to generate features for our link prediction classifier.

In [49]:
from neo4j import GraphDatabase

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)

from sklearn.ensemble import RandomForestClassifier

In [50]:
bolt_uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(bolt_uri, auth=("neo4j", "neo4jneo4j"))

print(driver.address)

localhost:7687


We start by loading the train and test sets that we saved in the previous notebook:

In [51]:
# Load the CSV files saved in the train/test notebook

df_train_under = pd.read_csv("data/df_train_under.csv")
df_test_under = pd.read_csv("data/df_test_under.csv")

In [52]:
df_train_under.sample(5)

Unnamed: 0,node1,node2,label
4769,3574,8402,0
7390,1610,6192,1
13234,13839,13843,1
7337,1527,1528,1
8742,3447,3492,1


In [53]:
df_test_under.sample(5)

Unnamed: 0,node1,node2,label
3352,9591,6021,0
7412,6786,2324,0
9585,2272,2273,1
1020,9050,9395,0
151,14179,14913,0


# Generating graphy features

We’ll start by creating a simple model that tries to predict whether two authors will have a future collaboration, based on three features: common neighbors, preferential attachment, and total neighbors.


### Common Neighbors
Common neighbors captures the idea that two strangers who have a friend in common are more likely to be introduced than those who don’t have any friends in common.

### Preferential Attachment
Preferential attachment scores a pair of nodes by the product of their degrees. It is based on the idea that the more connected a node is, the more likely it is to receive new links.

### Total Neighbors
Total neighbors scores a pair of nodes by the number of unique neighbors that they have between them, i.e. the size of the union of their neighbor sets.
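
To make the three definitions concrete, here is a minimal pure-Python sketch that computes them on a small toy graph. The `graph` dictionary and the three helper functions are illustrative assumptions, not part of this notebook or of the Graph Data Science library. They assume an undirected graph with no parallel relationships, so the library's scores on real data may differ.

```python
# Toy undirected graph as an adjacency mapping (hypothetical data, for illustration only)
graph = {
    "a": {"c", "d", "e"},
    "b": {"d", "e", "f"},
    "c": {"a"},
    "d": {"a", "b"},
    "e": {"a", "b"},
    "f": {"b"},
}


def common_neighbors(graph, x, y):
    """Number of nodes adjacent to both x and y."""
    return len(graph[x] & graph[y])


def preferential_attachment(graph, x, y):
    """Product of the two nodes' neighbor counts."""
    return len(graph[x]) * len(graph[y])


def total_neighbors(graph, x, y):
    """Number of distinct nodes adjacent to x or y."""
    return len(graph[x] | graph[y])


cn = common_neighbors(graph, "a", "b")
pa = preferential_attachment(graph, "a", "b")
tn = total_neighbors(graph, "a", "b")
print(cn, pa, tn)  # 2 9 4
```

Here `a` and `b` share the neighbors `d` and `e`, each has three neighbors, and together they touch four distinct nodes.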

The following function computes each of these measures for pairs of nodes:

In [54]:
def apply_graphy_features(data, rel_type):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           gds.alpha.linkprediction.commonNeighbors(p1, p2, {
             relationshipQuery: $relType}) AS cn,
           gds.alpha.linkprediction.preferentialAttachment(p1, p2, {
             relationshipQuery: $relType}) AS pa,
           gds.alpha.linkprediction.totalNeighbors(p1, p2, {
             relationshipQuery: $relType}) AS tn
    """
    # Build one {node1, node2} map per row to pass as the $pairs parameter
    pairs = [{"node1": node1, "node2": node2}
             for node1, node2 in data[["node1", "node2"]].values.tolist()]

    with driver.session(database="demo") as session:
        result = session.run(query, {"pairs": pairs, "relType": rel_type})
        features = pd.DataFrame([dict(record) for record in result])
    # Join the computed scores back onto the original rows
    return pd.merge(data, features, on=["node1", "node2"])

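Let's apply the function to our training and test DataFrames. The training features are computed over the `CO_AUTHOR_EARLY` relationships and the test features over the `CO_AUTHOR` relationships: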

In [55]:
df_train_under = apply_graphy_features(df_train_under, "CO_AUTHOR_EARLY")
df_test_under = apply_graphy_features(df_test_under, "CO_AUTHOR")

In [56]:
df_train_under.drop(columns=["node1", "node2"]).sample(5, random_state=42)

Unnamed: 0,label,cn,pa,tn
8715,1,3.0,16.0,5.0
12764,1,2.0,9.0,4.0
4881,0,0.0,64.0,8.0
102,0,0.0,49.0,7.0
6032,0,1.0,8.0,8.0


In [57]:
df_test_under.drop(columns=["node1", "node2"]).sample(5, random_state=42)

Unnamed: 0,label,cn,pa,tn
7924,0,0.0,36.0,13.0
2380,0,0.0,4.0,2.0
13577,1,22.0,529.0,24.0
2252,0,1.0,15.0,7.0
7247,0,1.0,24.0,10.0


In [58]:
# Sort the columns alphabetically, then move label to the end
df_train_under = df_train_under.reindex(columns=sorted(c for c in df_train_under.columns if c != 'label') + ['label'])
df_test_under = df_test_under.reindex(columns=sorted(c for c in df_test_under.columns if c != 'label') + ['label'])


# Save our DataFrames to CSV files for use in the next notebook

df_train_under.to_csv("data/df_train_under_all.csv", index=False)
df_test_under.to_csv("data/df_test_under_all.csv", index=False)

# df_train_under = pd.read_csv("data/df_train_under_all.csv")
# df_test_under = pd.read_csv("data/df_test_under_all.csv")

# Save the samples as CSV files as well
df_train_under.drop(columns=["node1", "node2"]).sample(5, random_state=42).to_csv("data/df_train_under_sample.csv", index=False, float_format='%g')
df_test_under.drop(columns=["node1", "node2"]).sample(5, random_state=42).to_csv("data/df_test_under_sample.csv", index=False, float_format='%g')