# Load [`ogbn-arxiv`](https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv) Graph Into Neo4j Database (~5 Min)
![Neo4j version](https://img.shields.io/badge/Neo4j->=4.4.9-brightgreen)
![GDS version](https://img.shields.io/badge/GDS-2.3-brightgreen)
![GDS Python Client version](https://img.shields.io/badge/GDS_Python_Client-1.6-brightgreen)

This notebook is a prerequisite to [`pyg-gnn.ipynb` - "Sampling a Graph Database to Train a GNN Model"](https://github.com/neo4j-product-examples/graph-machine-learning-examples/blob/main/gnns-with-neo4j/db-sampling-for-gnn-training/pyg-gnn.ipynb).

## Setup

In [1]:
%pip install graphdatascience python-dotenv ogb

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
from graphdatascience import GraphDataScience
from dotenv import load_dotenv
import os
from graph_data.data_import import get_ogbn_arxiv_data, chunks, load_nodes, load_rels
from numpy.typing import ArrayLike

In [3]:
load_dotenv('db-credentials.env', override=True)

# Use the Neo4j URI and credentials that match your setup
gds = GraphDataScience(
    os.getenv('NEO4J_URI'),
    auth=(os.getenv('NEO4J_USERNAME'),
          os.getenv('NEO4J_PASSWORD')),
    # Parse the flag as text instead of calling eval() on an environment value
    aura_ds=os.getenv('AURA_DS', 'false').strip().lower() == 'true')

# Necessary if you enabled Arrow on the db - this is true for AuraDS
gds.set_database("neo4j")
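
The connection cell reads four variables from `db-credentials.env`. A minimal sketch of a pre-flight check is shown here, assuming those four names are the only ones required; `missing_env_vars` is a hypothetical helper and not part of `graphdatascience` or `python-dotenv`.

```python
import os

# Variable names taken from the connection cell above
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "AURA_DS")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"NEO4J_URI": "neo4j://localhost:7687", "NEO4J_USERNAME": "neo4j"}
print(missing_env_vars(environ=example_env))  # ['NEO4J_PASSWORD', 'AURA_DS']
```

Calling `missing_env_vars()` right after `load_dotenv(...)` and before constructing `GraphDataScience` would surface a missing credential as a readable list rather than as a connection error.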

## Get Source Data
Including raw text for titles, abstracts, and subjects

In [4]:
paper_source_df, citation_source_df = get_ogbn_arxiv_data(map_txt_properties=True, node_id_col_name='ogbId')
paper_source_df

Downloading http://snap.stanford.edu/ogb/data/nodeproppred/arxiv.zip


Downloaded 0.08 GB: 100%|██████████| 81/81 [00:02<00:00, 34.92it/s]


Extracting dataset/arxiv.zip
Loading necessary files...
This might take a while.
Processing graphs...


100%|██████████| 1/1 [00:00<00:00, 7557.30it/s]

Saving...





| | paperId | title | abstract | ogbId | textEmbedding | year | subjectId | subjectLabel |
|---|---|---|---|---|---|---|---|---|
| 0 | 630234 | spreadsheets on the move an evaluation of mobi... | The power of mobile devices has increased dram... | 104447 | [0.010079000145196915, -0.028968000784516335, ...] | 2011 | 6 | arxiv cs hc |
| 1 | 16868154 | factors influencing the quality of the user ex... | The use of mobile devices and the rapid growth... | 126951 | [-0.14517700672149658, -0.04205799847841263, -...] | 2014 | 6 | arxiv cs hc |
| 2 | 30955769 | a comprehensive model of usability | Usability is a key quality attribute of succes... | 160133 | [-0.13881300389766693, 0.0757559984922409, -0....] | 2008 | 6 | arxiv cs hc |
| 3 | 54410545 | planning in the wild modeling tools for pddl | Even though there are sophisticated AI plannin... | 148334 | [-0.09789499640464783, -0.0002640000020619482,...] | 2014 | 6 | arxiv cs hc |
| 4 | 62732812 | low cost eye trackers useful for information s... | Research investigating cognitive aspects of in... | 6449 | [-0.26364800333976746, -0.016287999227643013, ...] | 2014 | 6 | arxiv cs hc |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 169338 | 3010964174 | latent space subdivision stable and controllab... | We propose an end-to-end trained neural networ... | 146977 | [-0.11292299628257751, -0.029922999441623688, ...] | 2020 | 17 | arxiv cs gr |
| 169339 | 3011546779 | enabling viewpoint learning through dynamic la... | Optimal viewpoint prediction is an essential t... | 6443 | [-0.23396000266075134, 0.022811999544501305, -...] | 2020 | 17 | arxiv cs gr |
| 169340 | 3011610898 | real time image smoothing via iterative least ... | Edge-preserving image smoothing is a fundament... | 113754 | [-0.06525000184774399, 0.1853809952735901, -0....] | 2020 | 17 | arxiv cs gr |
| 169341 | 3011917630 | geodesic distance field based curved layer vol... | This paper presents a new curved layer volume ... | 132336 | [0.020333999767899513, 0.17771199345588684, -0...] | 2020 | 17 | arxiv cs gr |


## Load Nodes and Relationships

In [5]:
# Create a uniqueness constraint on ogbId before loading; Neo4j backs it with an index, which speeds up node lookups
gds.run_cypher('CREATE CONSTRAINT unique_ogb_id IF NOT EXISTS FOR (n:Paper) REQUIRE n.ogbId IS UNIQUE')
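
A uniqueness constraint rejects a second node with the same `ogbId`, so it can help to confirm on the client side that the ids are unique before loading. The following is a small sketch of such a check; `duplicated_ids` is a hypothetical helper, and the sample ids are illustrative.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the sorted id values that occur more than once in `ids`."""
    counts = Counter(ids)
    return sorted(value for value, n in counts.items() if n > 1)


# Toy input: 126951 appears twice
print(duplicated_ids([104447, 126951, 126951, 160133]))  # [126951]
```

In the notebook this could be called as `duplicated_ids(paper_source_df['ogbId'])`; an empty list means every row maps to a distinct `Paper` node.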

In [6]:
%%time
# using reduced chunk size for loading large raw text abstracts
load_nodes(gds, paper_source_df, 'ogbId', 'Paper', chunk_size=2_000)

staging 169,343 records
Loaded 2,000 of 169,343 nodes
Loaded 4,000 of 169,343 nodes
Loaded 6,000 of 169,343 nodes
Loaded 8,000 of 169,343 nodes
Loaded 10,000 of 169,343 nodes
Loaded 12,000 of 169,343 nodes
Loaded 14,000 of 169,343 nodes
Loaded 16,000 of 169,343 nodes
Loaded 18,000 of 169,343 nodes
Loaded 20,000 of 169,343 nodes
Loaded 22,000 of 169,343 nodes
Loaded 24,000 of 169,343 nodes
Loaded 26,000 of 169,343 nodes
Loaded 28,000 of 169,343 nodes
Loaded 30,000 of 169,343 nodes
Loaded 32,000 of 169,343 nodes
Loaded 34,000 of 169,343 nodes
Loaded 36,000 of 169,343 nodes
Loaded 38,000 of 169,343 nodes
Loaded 40,000 of 169,343 nodes
Loaded 42,000 of 169,343 nodes
Loaded 44,000 of 169,343 nodes
Loaded 46,000 of 169,343 nodes
Loaded 48,000 of 169,343 nodes
Loaded 50,000 of 169,343 nodes
Loaded 52,000 of 169,343 nodes
Loaded 54,000 of 169,343 nodes
Loaded 56,000 of 169,343 nodes
Loaded 58,000 of 169,343 nodes
Loaded 60,000 of 169,343 nodes
Loaded 62,000 of 169,343 nodes
Loaded 64,000 of 16
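
The progress log reflects loading in fixed-size batches. The implementation of the imported `chunks` and `load_nodes` helpers is not shown in this notebook, so the sketch below is only an assumption about how such batching might work; `iter_chunks` is a hypothetical name.

```python
def iter_chunks(records, chunk_size):
    """Yield consecutive slices of `records` with at most `chunk_size` items each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


# Five records in batches of two: the last batch is shorter
print([len(batch) for batch in iter_chunks(list(range(5)), 2)])  # [2, 2, 1]

# Number of batches for 169,343 records at chunk_size=2_000
print(sum(1 for _ in iter_chunks(range(169_343), 2_000)))  # 85
```

Smaller batches keep each transaction's payload modest, which matters here because every record carries a full abstract and an embedding vector.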

In [None]:
%%time
load_rels(gds, citation_source_df, 'Paper', 'Paper', ('ogbId', 'paper'), ('ogbId', 'citedPaper'), 'CITES')

staging 1,166,243 records
Loaded 50,000 of 1,166,243 relationships
Loaded 100,000 of 1,166,243 relationships
Loaded 150,000 of 1,166,243 relationships
Loaded 200,000 of 1,166,243 relationships
Loaded 250,000 of 1,166,243 relationships
Loaded 300,000 of 1,166,243 relationships
Loaded 350,000 of 1,166,243 relationships
Loaded 400,000 of 1,166,243 relationships
Loaded 450,000 of 1,166,243 relationships
Loaded 500,000 of 1,166,243 relationships
Loaded 550,000 of 1,166,243 relationships
Loaded 600,000 of 1,166,243 relationships
Loaded 650,000 of 1,166,243 relationships
Loaded 700,000 of 1,166,243 relationships
Loaded 750,000 of 1,166,243 relationships
Loaded 800,000 of 1,166,243 relationships
Loaded 850,000 of 1,166,243 relationships
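
If a relationship loader matches on existing nodes, a citation whose endpoint id is missing from the node set may be skipped without an error. Whether `load_rels` behaves that way is not shown here, so the following is only a sketch of a client-side check; `dangling_endpoints` is a hypothetical helper.

```python
def dangling_endpoints(node_ids, edges):
    """Return the (source, target) pairs that reference an id missing from node_ids."""
    known = set(node_ids)
    return [(s, t) for s, t in edges if s not in known or t not in known]


# Toy data: id 99 is not a known node
nodes = [0, 1, 2]
citations = [(0, 1), (1, 2), (2, 99)]
print(dangling_endpoints(nodes, citations))  # [(2, 99)]
```

Assuming `citation_source_df` has `paper` and `citedPaper` columns, as the `load_rels` arguments suggest, the call would be `dangling_endpoints(paper_source_df['ogbId'], zip(citation_source_df['paper'], citation_source_df['citedPaper']))`.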


## Set Data Splitting Labels

In [None]:
gds.run_cypher('CREATE INDEX year_ind IF NOT EXISTS FOR (n:Paper) ON n.year')

In [None]:
VALID_YEAR = 2018

In [None]:
gds.run_cypher('''
    MATCH(n:Paper) WHERE n.year < $validYear
    SET n:Train
    RETURN count(n) AS trainSetCount
    ''', params={'validYear': VALID_YEAR})

In [None]:
gds.run_cypher('''
    MATCH(n:Paper) WHERE n.year = $validYear
    SET n:Valid
    RETURN count(n) AS validSetCount
    ''', params={'validYear': VALID_YEAR})

In [None]:
gds.run_cypher('''
    MATCH(n:Paper) WHERE n.year > $validYear
    SET n:Test
    RETURN count(n) AS testSetCount
    ''', params={'validYear': VALID_YEAR})
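
The three queries above implement a time-based split around `VALID_YEAR`. The same rule can be sketched in plain Python to sanity-check expected counts on the client side; `split_label` is a hypothetical helper, and the sample years are illustrative.

```python
from collections import Counter

VALID_YEAR = 2018


def split_label(year, valid_year=VALID_YEAR):
    """Map a publication year to the label applied by the Cypher queries."""
    if year < valid_year:
        return "Train"
    if year == valid_year:
        return "Valid"
    return "Test"


years = [2011, 2014, 2008, 2018, 2020, 2019]
counts = Counter(split_label(y) for y in years)
print(counts["Train"], counts["Valid"], counts["Test"])  # 3 1 2
```

Applied to `paper_source_df['year']`, the resulting counts should match the `trainSetCount`, `validSetCount`, and test-set count returned by the queries, provided every paper has a year.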

In [None]:
gds.close()