# IPO Prediction Using Graph Convolutional Neural Networks
## By Parker Erickson

In this notebook, we will install and run queries on a TigerGraph database to collect data from the Crunchbase Knowledge Graph demo, then feed that data into a Graph Convolutional Neural Network (GCN) to predict whether a company will IPO. The performance of the GCN is not astounding, and we will explore how this follows from the nature of the dataset and from some of the simplifications I make. Other models may fare better, although these avenues haven't been explored yet.

# TigerGraph Setup

We will install the queries found in ../db_scripts onto the TigerGraph database. This creates REST endpoints that the pyTigerGraph package will call to fetch the data for the GCN. If you haven't already done so, create a free TigerGraph cloud instance of the CrunchBase knowledge graph demo. Then, configure your gradle-local.properties file and get an SSL certificate from the server following the directions found [here](https://medium.com/@jon.herke/getting-started-with-giraffle-on-tigergraph-cloud-970ead739943). After that, we will be all set to use gradle and Giraffle to install the necessary queries.

## Installing Queries

In [None]:
%cd ..

!gradle tasks --console=plain

In [None]:
!gradle createCompanyLinks --console=plain
!gradle installCompanyLinks --console=plain

In [None]:
!gradle createGetAllIpo --console=plain
!gradle installGetAllIpo --console=plain

In [None]:
!gradle createGetAllCompanies --console=plain
!gradle installGetAllCompanies --console=plain

# Importing Packages

In [None]:
import pyTigerGraph as tg
import cfg
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
import dgl
import networkx as nx
from heapq import nlargest, nsmallest

from gcn import GCN

# Creating Connection to Database and Getting Data

## Getting API token from Secret

In [None]:
cfg.token = tg.TigerGraphConnection(host="https://crunchml1.i.tgcloud.io", graphname="CrunchBasePre_2013").getToken(cfg.secret, "1000000")[0]

In [None]:
conn = tg.TigerGraphConnection(host="https://crunchml1.i.tgcloud.io", graphname="CrunchBasePre_2013", password=cfg.password, apiToken=cfg.token)
conn.debug=False

## Getting Edges Between Companies

The cell below runs the query companyLinks and formats each edge into a tuple (src, destination). Unfortunately, due to both memory constraints and the imbalanced nature of the dataset, we will not use all of the edges in the graph. Instead, we sample nodes and then search this edge list for edges between the sampled nodes.

#### **Assumption Alert:** We oversimplify the graph here. The query returns pairs of companies that have something in common. This hurts accuracy (a lot). Where TigerGraph comes in is the ease of data extraction, as there are no JOIN operations to create these links between companies.
* Note: It is possible to create a GCN that has multiple types of vertices (known as a Relational Graph Convolutional Network, or R-GCN), but it is more complex. A good way to get started is to simplify until you only have relations between the same type of thing.
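
To make the simplification above concrete, here is a minimal sketch of collapsing "company is connected to some shared entity" records into company-to-company pairs. The real `companyLinks` query lives in ../db_scripts and is not shown in this notebook, so the record layout, the entity names, and the helper `project_to_company_links` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (company, shared_entity) records; the names are made up.
memberships = [
    ("CompanyA", "Investor1"),
    ("CompanyB", "Investor1"),
    ("CompanyC", "Investor1"),
    ("CompanyB", "Investor2"),
    ("CompanyD", "Investor2"),
]

def project_to_company_links(records):
    """Link every pair of companies that share at least one entity."""
    by_entity = defaultdict(set)
    for company, entity in records:
        by_entity[entity].add(company)

    links = set()
    for companies in by_entity.values():
        # Every unordered pair of companies attached to the same entity
        for src, dest in combinations(sorted(companies), 2):
            links.add((src, dest))
    return sorted(links)

company_links = project_to_company_links(memberships)
print(company_links)
```

Note that the resulting pairs no longer say which entity the two companies shared, which is the kind of information an R-GCN could keep.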

In [None]:
edges = [(thing["src"], thing["dest"]) for thing in conn.runInstalledQuery("companyLinks", {}, sizeLimit=512000000, timeout=320000)[0]["@@tupleRecords"]]
print(edges[:3])


## Getting the List of IPOed Companies

In [None]:
print("Getting IPOed List")
ipoed = list(set([thing["src"] for thing in conn.runInstalledQuery("getAllIpo", {}, sizeLimit=512000000, timeout=320000)[0]["@@tupleRecords"]]) - set(['']))
print("Getting Non IPOed List")
nonipo = list(set([thing["src"] for thing in conn.runInstalledQuery("getAllCompanies", {}, sizeLimit=512000000, timeout=320000)[0]["@@tupleRecords"]]) - set(ipoed) - set(['']))

# Creating the Over-Sampled IPO Graph

The code blocks below sample a number of nodes from each of the IPOed and non-IPO lists and determine which edges exist between the sampled nodes. Unfortunately, due to the large number of nodes in the complete graph, the number of edges in the sampled graph is quite small. This lack of edges contributes to the mediocre and highly variable performance of the following GCN. Other graph machine learning approaches such as node2vec might fare better.

In [None]:
numberofnodes = 1000

print("Number of IPOs: ", len(ipoed))
# random.choices samples with replacement, so a company can appear more than once
ipoedsample = random.choices(ipoed, k=numberofnodes)
noniposample = random.choices(nonipo, k=numberofnodes)
print("Total number of nodes: ", len(noniposample)+len(ipoedsample))

allNodes = noniposample+ipoedsample

print(len(allNodes))


finalEdges = []
nodeSet = set(allNodes)  # set lookups avoid rescanning the node list for every edge

print("Computing edges")
for edge in edges:
    if edge[0] in nodeSet and edge[1] in nodeSet:
        finalEdges.append(edge)

print(len(finalEdges))
print(finalEdges[:3])
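
One way to see how sparse the sampled graph is: measure the fraction of sampled nodes that have no edge to any other sampled node. The sketch below is a hypothetical helper (`isolated_fraction` is not part of the original notebook) run on toy data. It deduplicates the node list first, since `random.choices` samples with replacement.

```python
def isolated_fraction(nodes, edge_list):
    """Fraction of unique nodes that no kept edge touches."""
    unique_nodes = set(nodes)
    touched = set()
    for src, dest in edge_list:
        # Only count edges whose endpoints are both in the sample
        if src in unique_nodes and dest in unique_nodes:
            touched.add(src)
            touched.add(dest)
    return 1 - len(touched) / len(unique_nodes)

# Toy data: "d" and "e" are isolated, and the edge to "z" leaves the sample
sample_nodes = ["a", "b", "c", "d", "e", "a"]
sample_edges = [("a", "b"), ("b", "c"), ("a", "z")]
print(isolated_fraction(sample_nodes, sample_edges))
```

In the notebook this could be called as `isolated_fraction(allNodes, edges)` before `edges` is overwritten with numeric pairs. A value close to 1 would mean most nodes receive no information from neighbours.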



In [None]:
compToNum = {} # translation dictionary for company name to number (for dgl)
numToComp = {} # translation dictionary for number to company name

numericalNodes = []

for i in range(0, len(allNodes)):
    compToNum[allNodes[i]] = i
    numericalNodes.append(i)
    numToComp[i] = allNodes[i]

def createEdgeList(result): # returns tuple of number version of edge
    fromKey = compToNum[result[0]]
    toKey = compToNum[result[1]]
    return (fromKey, toKey)

edges = [createEdgeList(thing) for thing in finalEdges]
print("Number of Edges: ", len(edges))
print(edges[:5])

In [None]:
g = nx.Graph()
g.add_nodes_from(numericalNodes)
g.add_edges_from(edges)


G = dgl.DGLGraph(g) # Convert networkx graph to a graph that DGL can work on

In [None]:
G.number_of_nodes()

## One-Hot Encoding of Node Features
We one-hot encode the features of the vertices in the graph. Feature assignment can be done in many different ways; this is just the fastest and easiest.

If you had a graph of documents for example, you could run doc2vec on those documents to create a feature vector and create the feature matrix by concatenating those together.

Another possibility: if you have a graph of songs, artists, albums, etc., you could use tempo, max volume, minimum volume, length, and other numerical descriptions of each song to create the feature matrix.
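
As a rough illustration of the numerical-attribute alternative, the sketch below builds a small feature matrix and standardizes each column so that attributes on different scales contribute comparably. The attribute values and the helper `standardize` are hypothetical and are not used elsewhere in this notebook.

```python
import torch

# Hypothetical per-node attributes: one row per node, one column per attribute.
raw = torch.tensor([
    [3.0, 120.0, 1.5],
    [1.0,  40.0, 0.2],
    [7.0, 300.0, 4.0],
    [2.0,  80.0, 0.9],
])

def standardize(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shift each column to zero mean and scale it to unit standard deviation."""
    mean = features.mean(dim=0, keepdim=True)
    std = features.std(dim=0, keepdim=True)
    return (features - mean) / (std + eps)  # eps guards against constant columns

feat = standardize(raw)
print(feat.shape)
```

A matrix like `feat` could then be assigned to `G.ndata["feat"]` in place of the identity matrix, provided it has one row per node and the GCN's input size is set to the number of columns.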

In [None]:
G.ndata["feat"] = torch.eye(G.number_of_nodes())

print(G.nodes[2].data['feat'])


# Graph Convolutional Neural Network Setup

In [None]:
numEpochs = 100
learningRate = 0.01

## Creating Neural Network and Labelling Relevant Vertices

Here, we create the GCN. A two-layer GCN appears to work better than deeper networks, which is consistent with [this](https://arxiv.org/abs/1609.02907) paper using only a two-layer one. We also label one IPO vertex and one non-IPO vertex and set up the optimizer. Since the GCN is a semi-supervised algorithm, we do not label all of the nodes with their correct classes before training - only two are needed!
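
The `GCN` class imported from `gcn` is not shown in this notebook, so its internals are an assumption here. If it follows the two-layer model from the Kipf and Welling paper linked above, and if its three constructor arguments are the input size, hidden size, and number of classes, the forward pass would look like:

$$
\hat{A} = \tilde{D}^{-1/2}\,(A + I)\,\tilde{D}^{-1/2}, \qquad
Z = \mathrm{softmax}\!\left(\hat{A}\,\mathrm{ReLU}\!\left(\hat{A}\, X\, W^{(0)}\right) W^{(1)}\right)
$$

where $A$ is the adjacency matrix of the sampled graph, $\tilde{D}$ is the diagonal degree matrix of $A + I$, $X$ is the feature matrix, $W^{(0)} \in \mathbb{R}^{N \times 32}$, and $W^{(1)} \in \mathbb{R}^{32 \times 2}$ for $N$ nodes. With the one-hot features used here, $X = I$, so the first layer reduces to $\hat{A}\,W^{(0)}$: each node effectively learns its own 32-dimensional embedding, mixed with those of its neighbours. The actual implementation may normalize or aggregate differently.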

In [None]:
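ipoedSet = set(ipoed)  # set membership is much faster than scanning the list
compIPO = None     # index of the labelled IPO node
compNonIPO = None  # index of the labelled non-IPO node
i = 0
# None is the "not found yet" sentinel: index 0 is a valid node, so truthiness is unreliable
while (compIPO is None or compNonIPO is None) and i < G.number_of_nodes():
    if numToComp[i] in ipoedSet:
        compIPO = i
    else:
        compNonIPO = i
    i += 1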

In [None]:
net = GCN(G.number_of_nodes(), 32, 2) #Two layer GCN
inputs = G.ndata["feat"]
labeled_nodes = torch.tensor([compNonIPO, compIPO])  # only one non-IPO company and one IPO company are labelled
labels = torch.tensor([0, 1])  # 0 = non-IPO, 1 = IPO
optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)

## Training GCN

Below is the training loop that actually trains the GCN. Unlike many traditional deep learning architectures, GCNs don't always need as much training or as large a data set, because they exploit the *structure* of the data as opposed to only its features.
* Note: due to the randomized initial values of the weights in the neural network and our lack of a very well-connected graph, sometimes models don't work very well, or their loss gets stuck at a relatively large number. If that happens, just stop and restart the training process (also rerun the cell above to reset the weights) and hope for better luck! Alternatively, you can run more epochs in hopes of eventually getting out of the rut.
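
If you want a restart to be repeatable rather than a matter of luck, one option is to fix the random seeds before sampling nodes and creating the network. This is a minimal sketch; `set_seed` is a hypothetical helper that is not part of the original notebook, and it only covers Python's `random` module and PyTorch.

```python
import random

import torch

def set_seed(seed: int) -> None:
    """Seed Python's and PyTorch's random number generators."""
    random.seed(seed)        # affects random.choices used for node sampling
    torch.manual_seed(seed)  # affects PyTorch weight initialization

# Same seed, same random draws
set_seed(0)
first = torch.rand(3)
set_seed(0)
second = torch.rand(3)
print(torch.equal(first, second))
```

Calling `set_seed` with a few different values before the sampling and network-creation cells lets you compare runs and return to one that trained well.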

In [None]:
all_logits = []
for epoch in range(numEpochs):
    logits = net(G, inputs)
    # we save the logits for visualization later
    all_logits.append(logits.detach())
    logp = F.log_softmax(logits, 1)
    # we only compute loss for labeled nodes
    loss = F.nll_loss(logp[labeled_nodes], labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('Epoch %d | Loss: %6.3e' % (epoch, loss.item()))

# Get Results

In [None]:
predictions = list(all_logits[numEpochs-1])
predictIPO = []
predictNonIPO = []

a=0
for company in predictions:
    if company[1] >= company[0]:
        predictIPO.append(numToComp[a])
    else:
        predictNonIPO.append(numToComp[a])
    a += 1

trueIPO = 0
falseIPO = 0
trueNonIPO = 0
falseNonIPO = 0


for prediction in predictIPO:
    if prediction in ipoed:
        trueIPO += 1
    else:
        falseIPO += 1

print("True IPO: ", trueIPO)
print("False IPO: ", falseIPO)

for prediction in predictNonIPO:
    if prediction in ipoed:
        falseNonIPO += 1        
    else:
        trueNonIPO += 1
print("True Non-IPO: ", trueNonIPO)
print("False Non-IPO: ", falseNonIPO)

In [None]:
accuracy = (trueNonIPO+trueIPO)/(len(predictions))
print(accuracy)
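
Accuracy alone does not show which class the model gets wrong. As a sketch, precision and recall for the IPO class can be computed from the same counts tallied above. The helper `precision_recall_f1` and the example counts are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall, and F1 for the positive (IPO) class."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up counts, for illustration only
precision, recall, f1 = precision_recall_f1(true_pos=60, false_pos=20, false_neg=40)
print(precision, recall, f1)
```

In this notebook the call would be `precision_recall_f1(trueIPO, falseIPO, falseNonIPO)`.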

## Credits
<p><img alt="Picture of Parker Erickson" height="150px" src="https://avatars1.githubusercontent.com/u/9616171?s=460&v=4" align="right" hspace="20px" vspace="20px"></p>

Demo/tutorial written by Parker Erickson, a student at the University of Minnesota pursuing a B.S. in Computer Science. His interests include graph databases, machine learning, traveling, playing the saxophone, and watching Minnesota Twins baseball. Feel free to reach out! Find him on:

* LinkedIn: [https://www.linkedin.com/in/parker-erickson/](https://www.linkedin.com/in/parker-erickson/)
* GitHub: [https://github.com/parkererickson](https://github.com/parkererickson)
* Medium: [https://medium.com/@parker.erickson](https://medium.com/@parker.erickson)
* Email: [parker.erickson30@gmail.com](mailto:parker.erickson30@gmail.com)
----
GCN Resources:
* DGL Documentation: [https://docs.dgl.ai/](https://docs.dgl.ai/)
* GCN paper by Kipf and Welling: [https://arxiv.org/abs/1609.02907](https://arxiv.org/abs/1609.02907)
* R-GCN paper: [https://arxiv.org/abs/1703.06103](https://arxiv.org/abs/1703.06103)
---- 
Notebook adapted from: [https://docs.dgl.ai/en/latest/tutorials/basics/1_first.html](https://docs.dgl.ai/en/latest/tutorials/basics/1_first.html)