<a href="https://colab.research.google.com/github/neohack22/IASD/blob/graphs/WS_22_03_IASD_Train_Test_Split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

Install the necessary libraries in your Colab notebook environment and connect to your hosted Neo4J Sandbox.

In [None]:
!pip install neo4j pyspark matplotlib



In [None]:
ip = "54.174.38.179"
bolt_port = "7687"
username = "neo4j"
password = "spots-carrier-wires"

In [None]:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://" + ip + ":" + bolt_port, auth=(username, password))

print(driver.address) # your-sandbox-ip:your-sandbox-bolt-port

54.174.38.179:7687


In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from neo4j import unit_of_work

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Train / test split

We need to produce train and test datasets on which we can build, and then evaluate, a model. This binary classifier will predict whether the two nodes composing a pair should be linked (1) or not (0).

The classifier will be trained on a dataset of node pairs and tested on a separate dataset of node pairs. We need to be careful when splitting the data, because pairs of nodes in our training set should not be connected to those in the test set.

In order to build the features for each pair of nodes in the training set, we will compute measures that should not contain information from the test set (and vice versa). **The solution is to split our co-authorship graph into two sub graphs of similar structure**. 

We will use the year of first collaboration to split our initial graph in two.

## Year split

We can create train and test graphs by splitting the data on a particular year. Now we need to figure out what year that should be. Let's have a look at the distribution of the first year that co-authors collaborated.

- Count the number of CO_AUTHOR relationships by year of first collaboration

In [None]:
query = """
MATCH ()-[r:CO_AUTHOR]->()
WITH r.year AS year, COUNT(*) AS count
ORDER BY year
RETURN toString(year) AS year, count
"""

- Save the results of your counting in a Spark DataFrame by running the cell below

In [None]:
# Run to execute the query in Colab

with driver.session() as session:
    result = session.run(query)
    # Convert the result to a list of dictionaries
    result_dict = [dict(record) for record in result]
    # Create DataFrame from this list
    co_authorships_by_year = spark.createDataFrame(result_dict)

- Plot the distribution as a bar graph by running the cell below
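
A minimal sketch of such a bar plot, assuming the query rows are available as a list of dictionaries with `year` and `count` keys (the shape of `result_dict` above). The helper names `to_bar_series` and `plot_co_authorships` are hypothetical and not part of the original notebook.

```python
def to_bar_series(rows):
    """Split query rows into parallel lists of year labels and counts, sorted by year."""
    ordered = sorted(rows, key=lambda row: row["year"])
    return [row["year"] for row in ordered], [row["count"] for row in ordered]


def plot_co_authorships(rows):
    """Draw one bar per year of first collaboration."""
    # Imported lazily so that to_bar_series has no plotting dependency
    import matplotlib.pyplot as plt

    years, counts = to_bar_series(rows)
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.bar(years, counts)
    ax.set_xlabel("Year of first collaboration")
    ax.set_ylabel("Number of CO_AUTHOR relationships")
    ax.tick_params(axis="x", rotation=90)
    plt.show()
    return ax


# In the notebook, the rows could come from the Spark DataFrame built above:
# plot_co_authorships([row.asDict() for row in co_authorships_by_year.collect()])
```
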

It looks like 2006 would act as a good year on which to split the data. We'll take all the co-authorships from 2005 and earlier as our train graph, and everything from 2006 onwards as the test graph.

Let's create explicit *CO_AUTHOR_EARLY* and *CO_AUTHOR_LATE* relationships in our graph based on that year. We will then use them to distinguish train and test sets.
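
The splitting rule can be illustrated in plain Python before writing it in Cypher. This is only a sketch on a toy edge list invented for illustration (it is not taken from the dataset); the function name `split_by_year` is hypothetical.

```python
SPLIT_YEAR = 2006  # split year chosen in the text


def split_by_year(edges, split_year=SPLIT_YEAR):
    """Partition (node1, node2, year) edges into early and late lists.

    Early edges have year < split_year (2005 and earlier here),
    late edges have year >= split_year (2006 onwards here).
    """
    early = [edge for edge in edges if edge[2] < split_year]
    late = [edge for edge in edges if edge[2] >= split_year]
    return early, late


# Toy co-authorship edges: (author, other, year of first collaboration)
toy_edges = [("a", "b", 2003), ("b", "c", 2005), ("a", "c", 2006), ("c", "d", 2010)]
early_edges, late_edges = split_by_year(toy_edges)
print(len(early_edges), len(late_edges))
```

Every edge falls in exactly one of the two lists, which is what keeps the train and test graphs separate.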

 - Create the *CO_AUTHOR_EARLY* and *CO_AUTHOR_LATE* relationships

In [None]:
# Test the queries in your Neo4J Browser first

query_early = """
  MATCH (...)-[...]->(...)
  WHERE ...
  MERGE (...)-[:CO_AUTHOR_EARLY {year: r.year}]-(...)
"""

query_late = """
  MATCH (...)-[...]->(...)
  WHERE ...
  MERGE (...)-[:CO_AUTHOR_LATE {year: r.year}]-(...)
"""

In [None]:
# Run to execute the queries in Colab

with driver.session() as session:
    session.run(query_early)

with driver.session() as session:
    session.run(query_late)

- Run the following cell to count how many *CO_AUTHOR* relationships we have in each of these sub graphs.

In [None]:
query = """
  MATCH ()-[r]->()
  WHERE TYPE(r) = "CO_AUTHOR_EARLY" OR TYPE(r) = "CO_AUTHOR_LATE"
  RETURN TYPE(r), COUNT(*) AS count
"""

with driver.session() as session:
    result = session.run(query)
    early_late_count = spark.createDataFrame([dict(record) for record in result])

early_late_count.show()

## Class imbalance considerations

Using 2006 as the split year, we get a 52% - 48% split of *CO_AUTHOR* relationships. We will use this split to separate our **positive examples** (i.e. pairs of nodes that we will label with 1) between the train and test sets.

## Create the training set pairs


- Complete the following cell to extract the **positive examples** for the training set in a Spark DataFrame. That is the pairs of nodes that are linked by a *CO_AUTHOR_EARLY* relationship.

In [None]:
# Test the query in your Neo4J Browser

query = """
  MATCH (...)-[...]->(...)
  RETURN ID(author) AS node1, ID(other) AS node2, 1 AS label
"""

In [None]:
# Run to execute the query in Colab

with driver.session() as session:
    result = session.run(query)
    train_existing_links = spark.createDataFrame([dict(record) for record in result])
    train_existing_links = train_existing_links.dropDuplicates()

We also need to split our **negative examples** (i.e. pairs of nodes that we will label with 0, because no *CO_AUTHOR* link exists between them) between the train and test sets.

The simplest approach would be to use all pairs of nodes within each sub graph that don’t have a relationship. The problem with this approach is that there are significantly more pairs of nodes without a relationship than pairs with one. If we used all of these negative examples in our training set, we would have a massive class imbalance.

We need to reduce the number of negative examples to produce a training dataset with balanced classes. 

To do so, we will:
- use pairs of nodes that are two to three hops away from each other
- further downsample the negative examples if necessary
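
The first of these two steps can be sketched in plain Python. This is an illustration under simplifying assumptions, not the notebook's Cypher query: the graph is treated as undirected, and "two to three hops away" is read as a shortest-path distance of 2 or 3 (which also guarantees the pair has no direct link). The function name `negative_candidates` and the toy graph are hypothetical.

```python
from collections import deque


def negative_candidates(edges, min_hops=2, max_hops=3):
    """Return unordered node pairs whose shortest-path distance is in [min_hops, max_hops]."""
    # Build an undirected adjacency map
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    pairs = set()
    for source in neighbours:
        # Breadth-first search from source, not expanding beyond max_hops
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if distance[node] == max_hops:
                continue
            for nxt in neighbours[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        for target, hops in distance.items():
            if min_hops <= hops <= max_hops:
                pairs.add(tuple(sorted((source, target))))
    return pairs


# Toy path graph a - b - c - d - e, invented for illustration
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(sorted(negative_candidates(toy_edges)))
```

On this toy graph the pair (a, e) is excluded because it is four hops apart, and directly linked pairs such as (a, b) are excluded because their distance is 1.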

- Extract the **negative examples** for the training set in a Spark DataFrame. These are the pairs of *Author* nodes in the *EARLY* sub graph that are two to three hops away and have not collaborated yet.

In [None]:
# Test the query in your Neo4J Browser

query = """
  MATCH (author)-[...*...]->(...) // Two to three hops away
  WHERE NOT((author)-[...]-(...)) // Did not collaborate together
  RETURN ID(author) AS node1, ID(other) AS node2, 0 AS label
"""

In [None]:
# Run to execute the query in Colab

with driver.session() as session:
    result = session.run(query)
    train_missing_links = spark.createDataFrame([dict(record) for record in result])    
    train_missing_links = train_missing_links.drop_duplicates()

We add the positive pairs to the negative ones in order to create our training set.

In [None]:
training_df = train_missing_links.union(train_existing_links)
print('Observations in training set: ', training_df.count())

Observations in training set:  188505


- Complete the following cell to count the number of positive and negative pairs in the training set (use a groupBy on the Spark DataFrame). Do we have class imbalance?


In [None]:
print('Train set class imbalance:')
training_df. ... .show()

- Randomly downsample the **negative examples** to get a balanced training dataset (50% label 0 and 50% label 1).

In [None]:
### Isolate and count the negative examples

train_df_class_0 = training_df.filter(...)
train_count_class_0 = train_df_class_0.count()

### Isolate and count the positive examples

train_df_class_1 = training_df.filter(...)
train_count_class_1 = train_df_class_1.count()

### Compute the proportion of positive examples
fraction = train_count_class_1/train_count_class_0

### Sample the negative examples to reflect the proportion of positive examples
### Hint: use the pyspark.sql.DataFrame.sample function

train_df_class_0_under = train_df_class_0.sample(...)

### Add these samples to the positive examples to get a balanced training set
df_train_under = train_df_class_0_under.union(train_df_class_1)

# Show class imbalance now
print('Train set class imbalance after downsampling:')
df_train_under.groupBy(F.col('label')).count().show()
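
The downsampling logic of the cell above can be illustrated in plain Python on a toy list of labelled pairs. This is a sketch only: each negative pair is kept independently with probability `fraction`, which is an assumption meant to mimic why a fraction-based sample returns an approximate rather than exact count. The function name `downsample_negatives` and the toy pairs are hypothetical.

```python
import random


def downsample_negatives(pairs, seed=42):
    """Keep all positives and a random share of negatives so classes are roughly balanced.

    pairs is a list of (node1, node2, label) tuples with label 0 or 1.
    """
    positives = [pair for pair in pairs if pair[2] == 1]
    negatives = [pair for pair in pairs if pair[2] == 0]
    # Same ratio as in the cell above: positives / negatives, capped at 1
    fraction = min(1.0, len(positives) / len(negatives))
    rng = random.Random(seed)
    # Each negative is kept independently with probability `fraction`
    kept = [pair for pair in negatives if rng.random() < fraction]
    return positives + kept, fraction


# Toy pairs invented for illustration: 2 positives and 8 negatives
toy_pairs = [(0, 1, 1), (1, 2, 1)] + [(i, i + 10, 0) for i in range(8)]
balanced, fraction = downsample_negatives(toy_pairs)
print(fraction, len(balanced))
```
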

## Create the test set pairs



Let's now do the same thing for the test set.

Run the following cell to:

- Extract the **positive examples** for the test set in a Spark DataFrame. That is the pairs of nodes that are linked by a *CO_AUTHOR_LATE* relationship.
- Extract the **negative examples** for the test set in a Spark DataFrame. These are the pairs of *Author* nodes in the *LATE* sub graph that are two to three hops away and have not collaborated yet.

In [None]:
with driver.session() as session:
    result = session.run(
        """
          MATCH (author:Author)-[:CO_AUTHOR_LATE]->(other:Author)
          RETURN ID(author) AS node1, ID(other) AS node2, 1 AS label
        """)
    test_existing_links = spark.createDataFrame([dict(record) for record in result])
    test_existing_links = test_existing_links.dropDuplicates()

    result = session.run(
        """
          MATCH (author)-[:CO_AUTHOR_LATE*2..3]->(other)
          WHERE NOT((author)-[:CO_AUTHOR_LATE]-(other))
          RETURN ID(author) AS node1, ID(other) AS node2, 0 AS label
        """)
    test_missing_links = spark.createDataFrame([dict(record) for record in result])    
    test_missing_links = test_missing_links.drop_duplicates()

We add the positive pairs to the negative ones in order to create our testing set.

In [None]:
testing_df = test_missing_links.union(test_existing_links)
print('Observations in test set: ', testing_df.count())

- Run the following cell to count the number of positive and negative pairs in the test set to evaluate class imbalance.


In [None]:
print('Test set class imbalance:')
testing_df.groupBy(F.col('label')).count().show()

In the test set, we do not want a balanced dataset because we need to model the real structure of observations on which this link prediction model could be tested.

This is why we want a density close to: 

*(# CO_AUTHOR relationships) / (# pairs of Authors without CO_AUTHOR relationship)*.

We know that:

*# pairs of Authors without CO_AUTHOR relationship = (# Authors)² - (# CO_AUTHOR relationships) - (# Authors)*.
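
The equation above can be checked by hand on a small example. The sketch below uses toy counts invented for illustration (4 authors, 2 relationships); the function name `target_density` is hypothetical.

```python
def target_density(nb_authors, nb_relationships):
    """Density as defined in the text: relationships / pairs without a relationship."""
    # (# Authors)^2 - (# CO_AUTHOR relationships) - (# Authors)
    nb_negative_pairs = nb_authors * nb_authors - nb_relationships - nb_authors
    return nb_relationships / nb_negative_pairs


# Toy example: 4 authors and 2 relationships give 2 / (16 - 2 - 4)
print(target_density(4, 2))
```

Note that `nb_authors * nb_authors - nb_authors` counts ordered pairs of distinct authors, so the formula treats (a, b) and (b, a) as two separate pairs.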

- Run the following cell to compute the target density needed in our test set (using the equation above).

In [None]:
with driver.session() as session:

    # Compute the number of Authors
    result = session.run(
        """
          MATCH (a1:Author)
          RETURN COUNT(DISTINCT a1) as nb_authors
       """)
    nb_authors = [dict(record) for record in result][0]['nb_authors']

    # Compute the number of CO_AUTHOR relationships
    result = session.run(
        """
          MATCH (a1:Author)-[r:CO_AUTHOR]-(a2:Author)
          RETURN COUNT(DISTINCT r) as nb_rels
       """)
    nb_co_author_relationships = [dict(record) for record in result][0]['nb_rels']

    # Compute the target density for the test set
    nb_negative_pairs = (nb_authors*nb_authors) - nb_co_author_relationships - nb_authors
    density = nb_co_author_relationships / nb_negative_pairs
    print("Target density in test set: ", density)

Target density in test set:  2.4074343933981664e-05


Given the number of negative pairs that we have extracted, respecting this density would leave only a couple of positive examples in the test set. This is too extreme for evaluating our future classifier.

This is why we will use **an arbitrary density of 0.01** for the sake of illustration.

- Run the following cell to downsample our positive class for the test set to reflect this density of 0.01.

In [None]:
approx_density = 0.01

### Isolate and count the negative examples

test_df_class_0 = testing_df.filter(F.col('label') == 0)
test_count_class_0 = test_df_class_0.count()

### Isolate and count the positive examples

test_df_class_1 = testing_df.filter(F.col('label') == 1)
test_count_class_1 = test_df_class_1.count()

### Sample the positive examples
test_df_class_1_under = test_df_class_1.sample(withReplacement=False, fraction=approx_density, seed=42) # Note: Spark does not guarantee the fraction to be exactly respected

### Add these samples to the negative examples to get a realistic test set
df_test_under = test_df_class_1_under.union(test_df_class_0)

# Show class imbalance now
print('Test set class imbalance after downsampling:')
df_test_under.groupBy(F.col('label')).count().show()

Test set class imbalance after downsampling:
+-----+------+
|label| count|
+-----+------+
|    1|   801|
|    0|128915|
+-----+------+
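
Sampling the positives with `fraction=approx_density` keeps roughly 1% of the positive pairs, which is not the same thing as a positive-to-negative ratio of 0.01: in the output above the ratio is 801 / 128915, about 0.006. If an exact ratio mattered, the sampling fraction could be derived from the class counts, as in the sketch below. The function name `positive_sampling_fraction` and the counts in the second example are hypothetical.

```python
def positive_sampling_fraction(target_ratio, n_positive, n_negative):
    """Fraction of positives to keep so that kept_positives / n_negative is about target_ratio."""
    if n_positive <= 0:
        raise ValueError("n_positive must be greater than zero")
    # Desired number of positives is target_ratio * n_negative; cap the fraction at 1
    return min(1.0, target_ratio * n_negative / n_positive)


# Ratio actually obtained in the output above
observed_ratio = 801 / 128915
print(round(observed_ratio, 4))

# Hypothetical counts, for illustration only: 1000 positives, 50000 negatives
print(positive_sampling_fraction(0.01, 1000, 50000))

# In the notebook, the result could be passed to the existing call, e.g.
# test_df_class_1.sample(withReplacement=False, fraction=..., seed=42)
```
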



# Save our train and test pairs DataFrames to CSV

Let's have a look at the contents of our train and test DataFrames before saving them on Google Drive.

In [None]:
df_train_under.filter(F.col('label') == 0).show(5)
df_train_under.filter(F.col('label') == 1).show(5)

In [None]:
df_test_under.filter(F.col('label') == 0).show(5)
df_test_under.filter(F.col('label') == 1).show(5)

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Save our DataFrames to CSV files for use in the next notebook

save_folder = '/content/gdrive/My Drive/IASD04/IASD_link_prediction/link-prediction/notebooks/data/'

df_train_under.write.csv(save_folder + 'df_train_under.csv', mode='overwrite', header=True)
df_test_under.write.csv(save_folder + 'df_test_under.csv', mode='overwrite', header=True)

Please check that both datasets have been written to your Drive at the desired location, because we will need them later for feature engineering.