# Question Classification Using Graph Convolutional Networks and TigerGraph
------
## Introduction
This notebook connects to a TigerGraph database that holds Wikipedia article topics and the links between them, as well as various keywords mentioned in the articles. See the README for more information on the graph database side of the project. The purpose of this notebook is to classify quiz-bowl-style questions using a Graph Convolutional Network (GCN), based purely on the links between questions. For a deeper explanation of the GCN, see [this paper](https://arxiv.org/abs/1609.02907).

## Importing Packages
The core packages that need to be installed are PyTorch, dgl, and pyTigerGraph. PyTorch and dgl are used to create and train the GCN, while pyTigerGraph connects to the TigerGraph database. We also import networkx to convert the list of edges from TigerGraph into a graph dgl can work with, and gensim to load the doc2vec model used for vertex features. `gcn` and `cfg` are modules in the `py_scripts` directory. `gcn` already exists, but you must create `cfg` yourself with your TigerGraph API token. The file should look like:
```py
token = "YOUR_API_TOKEN_HERE"
```
To install the other packages, use: ```pip3 install torch dgl networkx pyTigerGraph matplotlib gensim```




In [1]:
import pyTigerGraph as tg
import dgl
import networkx as nx
import torch
from py_scripts.gcn import GCN
from py_scripts.gcn import threeLayerGCN
import torch.nn as nn
import torch.nn.functional as F
import py_scripts.cfg as cfg
import gensim
import pickle
import random

## Configuration
Here we define the training hyperparameters: the number of epochs (30 or fewer is usually enough for a 2-layer GCN, while a GCN with 3 or more layers needs 100+) and the learning rate (0.01 seems to work well).

In [2]:
numEpochs = 100
learningRate = 0.01

## Creating Database Connection and Creating Edge List
This section opens a connection to the TigerGraph database and builds a list of tuples representing directed edges in the form (from, to). Two dictionaries map each article name to a unique numerical id and back, since the GCN needs numerical ids to process the graph.

In [15]:
conn = tg.TigerGraphConnection(ipAddress="https://wikipediagraph.i.tgcloud.us", apiToken=cfg.token, graphname="MyGraph") # connection to TigerGraph database



edges = [createEdge(thing) for thing in conn.runInstalledQuery("trainQuestionTuples", {})["results"][0]["list_of_question_tuples"]] # creates list of edges


219699
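
The cell above relies on `createEdge` and on the name/id dictionaries, whose definitions do not appear in this export. As a rough sketch of the idea only, the following assumes each query result can be unpacked into a (from, to) pair of article names. The real TigerGraph tuples may be shaped differently, and `getId` is a hypothetical helper.

```python
# Hypothetical sketch: the notebook's real createEdge is not shown.
articleToNum = {}  # article name -> integer id
numToArticle = {}  # integer id -> article name

def getId(name):
    """Return the id for an article name, assigning the next free id if unseen."""
    if name not in articleToNum:
        newId = len(articleToNum)
        articleToNum[name] = newId
        numToArticle[newId] = name
    return articleToNum[name]

def createEdge(pair):
    """Turn a (from_name, to_name) pair into a (from_id, to_id) tuple."""
    src, dst = pair  # assumption: each result unpacks into two names
    return (getId(src), getId(dst))

# Made-up sample data standing in for the query results.
sample = [("Ape", "Organism"), ("Organism", "Natural_selection"), ("Ape", "Natural_selection")]
edges = [createEdge(pair) for pair in sample]
print(edges)
```

Assigning ids in first-seen order keeps them contiguous from 0, which is what a node-indexed feature matrix expects.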


## Initializing Graph
This section converts the list of edges into a graph that DGL can process in the GCN. It also converts our wanted and unwanted topics to their corresponding numerical ids that we will use later on.

In [37]:
g = nx.DiGraph()

g.add_edges_from(edges)

G = dgl.DGLGraph(g)

print('We have %d nodes.' % G.number_of_nodes())
print('We have %d edges.' % G.number_of_edges())

We have 4588 nodes.
We have 1198 edges.
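
A simple directed graph stores each (from, to) pair once, so repeated pairs in the edge list collapse into a single edge. The sketch below shows that reduction on made-up data, along with the parallel source/destination lists that newer DGL releases accept through `dgl.graph((src, dst))`. That call is mentioned only as an assumption about your installed version; this notebook does not use it.

```python
# Made-up edge list with one duplicate; ids are illustrative only.
edgeList = [(0, 1), (1, 2), (0, 1), (2, 0)]

# A simple directed graph keeps each (from, to) pair once.
uniqueEdges = list(dict.fromkeys(edgeList))  # preserves first-seen order

# Node ids are every id that appears at either end of an edge.
nodes = sorted({n for edge in uniqueEdges for n in edge})

# Parallel source/destination lists, one entry per edge.
src = [u for u, _ in uniqueEdges]
dst = [v for _, v in uniqueEdges]

print('We have %d nodes.' % len(nodes))
print('We have %d edges.' % len(uniqueEdges))
```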


## Adding Features to Vertices in Graph
We will use the doc2vec model to encode each of the articles into a numerical vector that serves as the feature vector of the corresponding vertex.

In [38]:
try:
    with open("features.pickle", "rb") as f:
        print("Loading pickled features")
        G.ndata["feat"] = pickle.load(f)
except FileNotFoundError:  # no cached features yet, so build them
    print("Creating feature matrix: This may take a while...")
    d2v_model = gensim.models.doc2vec.Doc2Vec.load("doc2vec.model")
    vectors = []
    for article in articleToNum:
        try: 
            words = [] 
            with open(("plaintext_articles/"+article+".txt"), "r") as f: 
                words = f.read().lower().split() 
            vectors.append(list(d2v_model.infer_vector(words)))
        except (OSError, UnicodeDecodeError):
            # article text missing or unreadable: fall back to a zero vector
            vectors.append([0.0] * 300)

    G.ndata["feat"] = torch.tensor(vectors) # doc2vec vectors as node features, one row per article

    with open("features.pickle", "wb") as f:
        pickle.dump(G.ndata["feat"], f)

print(G.nodes[2].data['feat'])

Loading pickled features
tensor([[-15.7746,  -7.8044,  12.4863,  22.5554,   1.3301,  -1.9337,  19.6358,
          -2.2648, -15.7526, -19.0251, -21.6771, -31.0005,  -6.7516,  23.1438,
         -26.6049,   3.5942,   8.5233,  50.1070, -11.7487,   4.3738, -11.2436,
         -10.8022,   1.3211,  15.0990,  -1.3964,  19.6429,   9.0302, -18.7562,
           2.8804, -11.9642,  14.8702, -37.4905,  20.4850,  22.2344,  13.4185,
          23.4811,  -3.5016,  10.3540, -11.3863,  18.8748,  -0.7793,  28.5464,
         -15.9820,  -5.0830,  12.7408, -19.6781,   3.1694, -37.4113,   0.6387,
          14.9999,  -7.0661,  12.2533, -27.0358,  26.8235,   1.2522, -16.5426,
          32.3230, -30.5527,   0.7063,  20.0358,  24.4204, -20.4096,  -6.0147,
           8.5914,  14.3524,  12.8390,  25.4506,  -4.4919,   5.6486,   6.4814,
          10.0003,   4.2661,   1.9836,  -4.4280,  22.0171,   1.1926,  29.6659,
         -53.2439,  11.3238,   4.6576, -18.6014, -13.8002, -56.5608,   3.4562,
          -3.1308, -13.3575
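
Row i of the feature matrix has to describe node i, so the vectors should be collected in id order, with a zero vector standing in for any article whose text cannot be read. The sketch below illustrates that bookkeeping with a hypothetical `infer` function in place of the doc2vec model and a tiny made-up id map. `infer`, `buildFeatures`, `numToName`, and `texts` are all illustrative names, not part of the notebook.

```python
DIM = 4  # illustrative size; the notebook uses 300-dimensional vectors

def infer(words):
    """Hypothetical stand-in for a doc2vec model: returns a fixed-size vector."""
    return [float(len(words))] * DIM

def buildFeatures(numToName, texts):
    """Return one feature row per node id, in id order."""
    rows = []
    for nodeId in range(len(numToName)):
        text = texts.get(numToName[nodeId])
        if text is None:
            rows.append([0.0] * DIM)  # missing article: zero vector
        else:
            rows.append(infer(text.lower().split()))
    return rows

numToName = {0: "Ape", 1: "Organism", 2: "Emotion"}                # made-up id map
texts = {"Ape": "Apes are primates", "Emotion": "A mental state"}  # "Organism" is missing
features = buildFeatures(numToName, texts)
print(features)
```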

## Creating Neural Network and Labelling Relevant Vertices
Here, we create the GCN, with the option of a two-layer or a three-layer network. So far, the two-layer GCN appears to work better, in line with [this](https://arxiv.org/abs/1609.02907) paper, which also used only two layers. We also label the wanted and unwanted vertices and set up the optimizer.

In [39]:

net = GCN(300, 25, 2) #Two layer GCN
# net = threeLayerGCN(300, 75, 25, 2) # Three-layer GCN (input size matches the 300-dimensional features)

inputs = G.ndata["feat"]
labeled_nodes = torch.tensor([wantedNum, unwantedNum])  # only the wanted and unwanted topic nodes are labeled
labels = torch.tensor([0, 1])  # their labels are different


optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)
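
For intuition about what a graph convolution layer computes, the paper linked above propagates features as ReLU(Â H W), where Â is the adjacency matrix with self-loops added and symmetrically normalized by node degree. The following is a plain-Python sketch of that single-layer rule on a two-node undirected graph. It follows the paper's formula, and the `GCN` class in `py_scripts.gcn` may be implemented differently. The helper names are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalizedAdjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    withLoops = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in withLoops]
    return [[withLoops[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcnLayer(adj, h, w):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    out = matmul(matmul(normalizedAdjacency(adj), h), w)
    return [[max(0.0, v) for v in row] for row in out]

adj = [[0, 1], [1, 0]]        # two nodes joined by an undirected edge
h = [[1.0, 0.0], [0.0, 1.0]]  # one feature row per node
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for readability
print(gcnLayer(adj, h, w))    # each node ends up with the average of both rows
```

Stacking two such layers lets each node see information from neighbors up to two hops away, which is why even a small number of labeled nodes can influence the whole graph.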

## Training the Neural Network
This loop trains for the configured number of epochs and stores the logits from each epoch for later visualization.

In [40]:
all_logits = []
for epoch in range(numEpochs):
    logits = net(G, inputs)
    # we save the logits for visualization later
    all_logits.append(logits.detach())
    logp = F.log_softmax(logits, 1)
    # we only compute loss for labeled nodes
    loss = F.nll_loss(logp[labeled_nodes], labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('Epoch %d | Loss: %6.3e' % (epoch, loss.item()))


Epoch 0 | Loss: 3.617e-01
Epoch 1 | Loss: 3.566e-01
Epoch 2 | Loss: 3.515e-01
Epoch 3 | Loss: 3.465e-01
Epoch 4 | Loss: 3.415e-01
Epoch 5 | Loss: 3.366e-01
Epoch 6 | Loss: 3.317e-01
Epoch 7 | Loss: 3.269e-01
Epoch 8 | Loss: 3.222e-01
Epoch 9 | Loss: 3.175e-01
Epoch 10 | Loss: 3.129e-01
Epoch 11 | Loss: 3.083e-01
Epoch 12 | Loss: 3.038e-01
Epoch 13 | Loss: 2.993e-01
Epoch 14 | Loss: 2.949e-01
Epoch 15 | Loss: 2.906e-01
Epoch 16 | Loss: 2.863e-01
Epoch 17 | Loss: 2.821e-01
Epoch 18 | Loss: 2.779e-01
Epoch 19 | Loss: 2.738e-01
Epoch 20 | Loss: 2.697e-01
Epoch 21 | Loss: 2.658e-01
Epoch 22 | Loss: 2.618e-01
Epoch 23 | Loss: 2.580e-01
Epoch 24 | Loss: 2.542e-01
Epoch 25 | Loss: 2.504e-01
Epoch 26 | Loss: 2.467e-01
Epoch 27 | Loss: 2.431e-01
Epoch 28 | Loss: 2.395e-01
Epoch 29 | Loss: 2.360e-01
Epoch 30 | Loss: 2.326e-01
Epoch 31 | Loss: 2.292e-01
Epoch 32 | Loss: 2.258e-01
Epoch 33 | Loss: 2.225e-01
Epoch 34 | Loss: 2.193e-01
Epoch 35 | Loss: 2.161e-01
Epoch 36 | Loss: 2.130e-01
Epoch 37 | 
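
Only the labeled nodes contribute to the loss; every other node is classified through the graph structure alone. The plain-Python sketch below reproduces a log-softmax followed by a mean negative log-likelihood over a labeled subset, using made-up logits, to show the arithmetic behind `F.log_softmax` and `F.nll_loss`. The function names are illustrative.

```python
import math

def logSoftmax(row):
    """Log-softmax of one row of logits, shifted by the max for stability."""
    m = max(row)
    logSum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - logSum for v in row]

def maskedNllLoss(logits, labeledNodes, labels):
    """Mean negative log-likelihood over the labeled nodes only."""
    losses = [-logSoftmax(logits[n])[y] for n, y in zip(labeledNodes, labels)]
    return sum(losses) / len(losses)

logits = [[2.0, 0.0], [0.3, 0.1], [0.0, 2.0]]  # made-up scores: 3 nodes, 2 classes
labeledNodes = [0, 2]                          # node 1 is unlabeled and ignored
labels = [0, 1]
loss = maskedNllLoss(logits, labeledNodes, labels)
print('Loss: %6.3e' % loss)
```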

## Output Predictions
This section takes the logits from the final training epoch and prints the 10 highest-scoring articles for the desired topic.

In [41]:
predictions = all_logits[-1]  # logits from the final epoch

# pair each node id with its score for class 0 (the wanted topic)
predictionsWithIndex = [[i, article[0]] for i, article in enumerate(predictions)]

predictionsWithIndex.sort(key=lambda x: x[1], reverse=True)

topResults = predictionsWithIndex[:10]


for article in topResults:
    print("Article Id: "+str(article[0]))
    print("Article Name: "+str(numToArticle[article[0]]))
    print("Article Score: "+str(article[1]))
    print("")

Article Id: 4
Article Name: Ape
Article Score: tensor(0.8247)

Article Id: 8
Article Name: Organism
Article Score: tensor(0.8247)

Article Id: 9
Article Name: Empiricism
Article Score: tensor(0.8247)

Article Id: 10
Article Name: Poor_Law
Article Score: tensor(0.8247)

Article Id: 13
Article Name: Gal%C3%A1pagos_Islands
Article Score: tensor(0.8247)

Article Id: 17
Article Name: Natural_selection
Article Score: tensor(0.8247)

Article Id: 19
Article Name: Creation-evolution_controversy
Article Score: tensor(0.8247)

Article Id: 21
Article Name: S%C3%B8ren_Kierkegaard
Article Score: tensor(0.8247)

Article Id: 24
Article Name: Sociology
Article Score: tensor(0.8247)

Article Id: 25
Article Name: Emotion
Article Score: tensor(0.8247)
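
The ranking above sorts by the raw class-0 logit. Because a logit is only meaningful relative to the other class's logit, ranking by softmax probability can give a different order. The sketch below contrasts the two on made-up logits. It illustrates an alternative and is not the notebook's method; `softmax` and `topK` are illustrative helpers.

```python
import math

def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def topK(scores, k):
    """Return the ids of the k highest scores (ties keep id order)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

logits = [[2.0, 1.9], [1.0, -1.0], [0.0, 3.0]]  # made-up [wanted, unwanted] logits
byLogit = topK([row[0] for row in logits], 2)
byProb = topK([softmax(row)[0] for row in logits], 2)
print("By raw logit:  ", byLogit)
print("By probability:", byProb)
```

Because Python's sort is stable, tied scores stay in ascending id order, which matches the ordering of the tied results printed above.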



## Credits
- [GCN Tutorial Using DGL](https://docs.dgl.ai/en/latest/tutorials/basics/1_first.html)
- [Using Python With TigerGraph's REST API](https://github.com/tigergraph/ecosys/tree/master/etl/tg-python-wrapper) (Used to create pyTigerGraph)