# Wikipedia Article Suggestion Using Graph Convolutional Networks and TigerGraph
------
## Introduction
This notebook connects to a TigerGraph database that holds Wikipedia article topics, the links between them, and various keywords mentioned in the articles. See the README for more information on the graph database side of the project. The purpose of this notebook is to suggest articles related to a given topic using a Graph Convolutional Network (GCN) trained purely on the links between articles. For a deeper explanation of the GCN, see [this paper](https://arxiv.org/abs/1609.02907).

## Importing Packages
The core packages that need to be installed are PyTorch, dgl, and pyTigerGraph. PyTorch and dgl are used for creating and training the GCN, while pyTigerGraph is used for connecting to the TigerGraph database. We also import networkx to convert the list of edges from TigerGraph into a graph that dgl can work with. `gcn` and `cfg` are modules in the `py_scripts` directory. `gcn` is already provided, but you must create `cfg` yourself with your TigerGraph API token. The file should look like:
```py
token = "YOUR_API_TOKEN_HERE"
```
To install the other packages, simply use: ```pip3 install torch dgl networkx pyTigerGraph matplotlib```




In [1]:
import pyTigerGraph as tg
import dgl
import networkx as nx
import torch
from py_scripts.gcn import GCN
from py_scripts.gcn import threeLayerGCN
import torch.nn as nn
import torch.nn.functional as F
import py_scripts.cfg as cfg
import matplotlib.animation as animation
import matplotlib.pyplot as plt

## Configuration
Here we define some variables: the number of training epochs (30 or fewer is usually enough for a 2-layer GCN, 100+ for a GCN with three or more layers), the learning rate (0.01 seems to work well), the topic you want recommendations for, and a completely unrelated topic. Make sure both of these topics are in the data.

In [2]:
numEpochs = 30
learningRate = 0.01
wantedTopic = "Saxophone"
unwantedTopic = "Falkland_Islands" # In the future, determine automatically through the inverse of PageRank or another centrality algorithm
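
Since both topics must exist in the data, a small guard can fail early with a readable message instead of a bare `KeyError` later on. The following is a minimal sketch, not part of the original notebook: `require_topics` is a hypothetical helper, and the sample mapping stands in for the name-to-id dictionary that is built in the next section.

```python
def require_topics(article_to_num, *topics):
    """Return the numeric id for each topic, failing early if any is missing."""
    missing = [t for t in topics if t not in article_to_num]
    if missing:
        raise KeyError(f"Topics not found in the graph data: {missing}")
    return [article_to_num[t] for t in topics]


# Tiny stand-in mapping; the real one is built from the TigerGraph query results.
sample_mapping = {"Saxophone": 0, "Clarinet": 1, "Falkland_Islands": 2}
wanted_id, unwanted_id = require_topics(sample_mapping, "Saxophone", "Falkland_Islands")
print(wanted_id, unwanted_id)
```

In the notebook itself, a check like this would run once the article-name dictionary has been filled, just before the wanted and unwanted ids are looked up.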

## Creating Database Connection and Creating Edge List
This section instantiates a connection to the TigerGraph database and creates a list of tuples representing directed edges in the form (from, to). Two dictionaries map each article name to a unique numerical id and back; these ids are needed to process the graph in the GCN.

In [3]:
conn = tg.TigerGraphConnection(ipAddress="https://wikipediagraph.i.tgcloud.us", apiToken=cfg.token, graphname="WikipediaGraph") # connection to TigerGraph database

articleToNum = {} # translation dictionary for article name to number (for dgl)
numToArticle = {} # translation dictionary for number to article name
i = 0
def createEdgeList(result): # returns tuple of number version of edge
    global i
    if result["article1"] in articleToNum:
        fromKey = articleToNum[result["article1"]]
    else:
        articleToNum[result["article1"]] = i
        numToArticle[i] = result["article1"]
        fromKey = i
        i+=1
    if result["article2"] in articleToNum:
        toKey = articleToNum[result["article2"]]
    else:
        articleToNum[result["article2"]] = i
        numToArticle[i] = result["article2"]
        toKey = i
        i+=1
    return (fromKey, toKey)
    
edges = [createEdgeList(thing) for thing in conn.runInstalledQuery("tupleArticle", {})["results"][0]["list_of_article_tuples"]] # creates list of edges
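
The same name-to-id mapping can be sketched without a global counter, which makes it easier to test in isolation. This is an illustrative variant, assuming the query rows keep the `article1`/`article2` keys used above; `build_edge_list` and the sample rows are hypothetical and not part of the original code.

```python
def build_edge_list(rows):
    """Map article names to consecutive integer ids and return numeric edges."""
    article_to_num = {}
    num_to_article = {}

    def get_id(name):
        # Assign the next unused id the first time a name is seen.
        if name not in article_to_num:
            new_id = len(article_to_num)
            article_to_num[name] = new_id
            num_to_article[new_id] = name
        return article_to_num[name]

    edges = [(get_id(row["article1"]), get_id(row["article2"])) for row in rows]
    return edges, article_to_num, num_to_article


# Made-up rows in the same shape as the query results.
sample_rows = [
    {"article1": "Saxophone", "article2": "Clarinet"},
    {"article1": "Clarinet", "article2": "Jazz"},
    {"article1": "Jazz", "article2": "Saxophone"},
]
sample_edges, name_to_id, id_to_name = build_edge_list(sample_rows)
print(sample_edges)  # [(0, 1), (1, 2), (2, 0)]
```

Ids are handed out in order of first appearance, so they are consecutive and start at 0, which is what the graph library expects for node indices.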


## Initializing Graph
This section converts the list of edges into a graph that DGL can process in the GCN. It also converts our wanted and unwanted topics to their corresponding numerical ids that we will use later on.

In [4]:
wantedNum = articleToNum[wantedTopic]
unwantedNum = articleToNum[unwantedTopic]

g = nx.DiGraph()

g.add_edges_from(edges)

G = dgl.DGLGraph(g)

print('We have %d nodes.' % G.number_of_nodes())
print('We have %d edges.' % G.number_of_edges())

We have 4588 nodes.
We have 119881 edges.


## Adding Features to Vertices in Graph

In [5]:
G.ndata["feat"] = torch.eye(G.number_of_nodes()) # one hot encode nodes for features (replace with doc2vec in future)

print(G.nodes[2].data['feat'])

tensor([[0., 0., 1.,  ..., 0., 0., 0.]])
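
The printed row shows what the identity matrix provides: node 2 gets a vector with a single 1 in position 2. The sketch below reproduces that idea in plain Python and estimates the memory a dense one-hot matrix needs for this graph, assuming the default 4-byte float32 entries; `one_hot_row` is a hypothetical helper for illustration only.

```python
def one_hot_row(index, num_nodes):
    """Row `index` of an identity matrix: 1.0 at that position, 0.0 elsewhere."""
    return [1.0 if col == index else 0.0 for col in range(num_nodes)]


print(one_hot_row(2, 6))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

# Rough memory cost of a dense float32 identity matrix for this graph.
num_nodes = 4588
bytes_needed = num_nodes * num_nodes * 4  # 4 bytes per float32 entry
print(f"{bytes_needed / 1e6:.1f} MB")  # 84.2 MB
```

The cost grows with the square of the node count, which is one practical reason to replace one-hot features with learned document embeddings on larger graphs.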


## Creating Neural Network and Labelling Relevant Vertices
Here, we create the GCN, with the option of a two-layer or a three-layer model. Thus far, the two-layer GCN appears to work better, which is consistent with [this](https://arxiv.org/abs/1609.02907) paper using a two-layer model. We also label the wanted and unwanted vertices and set up the optimizer.

In [9]:
net = GCN(G.number_of_nodes(), 75, 2) #Two layer GCN
# net = threeLayerGCN(G.number_of_nodes(), 75, 25, 2) # Three Layer GCN 

inputs = torch.eye(G.number_of_nodes())
labeled_nodes = torch.tensor([wantedNum, unwantedNum])  # only the wanted and unwanted topic nodes are labeled
labels = torch.tensor([0, 1])  # class 0 = wanted topic, class 1 = unwanted topic


optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)
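
For reference, the two-layer model described in the linked paper can be written as below. This is the paper's formulation, stated for an undirected graph with adjacency matrix $A$; whether the `GCN` class in `py_scripts` applies the same normalization is an assumption, as is reading its arguments as (input size, hidden size, number of classes).

$$
\tilde{A} = A + I_N, \qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}, \qquad \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}
$$

$$
Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)
$$

With one-hot features, $X = I_N$, so the first layer reduces to $\hat{A} W^{(0)}$: each row of $W^{(0)}$ acts as a learned embedding for one article. Under the assumed argument order, $N = 4588$, $W^{(0)} \in \mathbb{R}^{4588 \times 75}$ and $W^{(1)} \in \mathbb{R}^{75 \times 2}$, which gives $4588 \cdot 75 = 344{,}100$ and $75 \cdot 2 = 150$ weights, not counting any bias terms.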

## Training the Neural Network
This loop stores the logits from each epoch for later visualization and trains until the desired number of epochs is reached. The loss is computed only on the two labeled nodes.

In [10]:
all_logits = []
for epoch in range(numEpochs):
    logits = net(G, inputs)
    # we save the logits for visualization later
    all_logits.append(logits.detach())
    logp = F.log_softmax(logits, 1)
    # we only compute loss for labeled nodes
    loss = F.nll_loss(logp[labeled_nodes], labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('Epoch %d | Loss: %6.3e' % (epoch, loss.item()))


Epoch 0 | Loss: 6.016e-01
Epoch 1 | Loss: 3.918e-03
Epoch 2 | Loss: 1.276e-05
Epoch 3 | Loss: 2.384e-07
Epoch 4 | Loss: 0.000e+00
Epoch 5 | Loss: 0.000e+00
Epoch 6 | Loss: 0.000e+00
Epoch 7 | Loss: 0.000e+00
Epoch 8 | Loss: 0.000e+00
Epoch 9 | Loss: 0.000e+00
Epoch 10 | Loss: 0.000e+00
Epoch 11 | Loss: 0.000e+00
Epoch 12 | Loss: 0.000e+00
Epoch 13 | Loss: 0.000e+00
Epoch 14 | Loss: 0.000e+00
Epoch 15 | Loss: 0.000e+00
Epoch 16 | Loss: 0.000e+00
Epoch 17 | Loss: 0.000e+00
Epoch 18 | Loss: 0.000e+00
Epoch 19 | Loss: 0.000e+00
Epoch 20 | Loss: 0.000e+00
Epoch 21 | Loss: 0.000e+00
Epoch 22 | Loss: 0.000e+00
Epoch 23 | Loss: 0.000e+00
Epoch 24 | Loss: 0.000e+00
Epoch 25 | Loss: 0.000e+00
Epoch 26 | Loss: 0.000e+00
Epoch 27 | Loss: 0.000e+00
Epoch 28 | Loss: 0.000e+00
Epoch 29 | Loss: 0.000e+00
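
Because the loss covers only two labeled nodes, it can be reproduced by hand. The sketch below reimplements log-softmax and the negative log-likelihood in plain Python to show how the value shrinks as the logits for the correct classes pull apart. The logits are made up for illustration and are not taken from the run above.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_sum for v in row]


def nll_loss(logit_rows, labels):
    """Mean negative log-probability of the correct class over labeled rows."""
    picked = [log_softmax(row)[label] for row, label in zip(logit_rows, labels)]
    return -sum(picked) / len(picked)


# Hypothetical logits for the two labeled nodes (wanted -> class 0, unwanted -> class 1).
early = [[0.2, 0.1], [0.3, 0.1]]
late = [[8.0, -8.0], [-8.0, 8.0]]
early_loss = nll_loss(early, [0, 1])
late_loss = nll_loss(late, [0, 1])
print(f"{early_loss:.3e}")
print(f"{late_loss:.3e}")
```

A printed loss of exactly `0.000e+00` most likely reflects 32-bit floating point saturation: once the predicted probability of the correct class rounds to 1, its logarithm is 0. With only two training labels, the loss says little about the quality of the ranking for the unlabeled articles.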


## Output Predictions
This section takes the logits saved from the final epoch, ranks every article by its score for class 0 (the class of the wanted topic), and prints the top 10 results.

In [11]:
predictions = list(all_logits[numEpochs-1])

predictionsWithIndex = []
a = 0
for article in predictions:
    predictionsWithIndex.append([a, article[0]])
    a+=1

predictionsWithIndex.sort(key=lambda x: x[1], reverse=True)

topResults = predictionsWithIndex[:10]


for article in topResults:
    print("Article Id: "+str(article[0]))
    print("Article Name: "+str(numToArticle[article[0]]))
    print("Article Score: "+str(article[1]))
    print("")

Article Id: 1266
Article Name: Saxophone
Article Score: tensor(20.9135)

Article Id: 3591
Article Name: Flemish_people
Article Score: tensor(7.1663)

Article Id: 3071
Article Name: Clarinet
Article Score: tensor(6.4077)

Article Id: 2205
Article Name: Miles_Davis
Article Score: tensor(5.2597)

Article Id: 2899
Article Name: Ragtime
Article Score: tensor(4.5008)

Article Id: 2577
Article Name: Louis_Armstrong
Article Score: tensor(4.4726)

Article Id: 3069
Article Name: Louis_Jordan
Article Score: tensor(3.8125)

Article Id: 2202
Article Name: Synthesizer
Article Score: tensor(3.7143)

Article Id: 2203
Article Name: Billie_Holiday
Article Score: tensor(3.5022)

Article Id: 3262
Article Name: Christina_Aguilera
Article Score: tensor(3.4239)
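
The ranking step can also be sketched with `heapq.nlargest`, which avoids sorting the full list when only a few results are needed. This is an illustrative alternative, not the notebook's code: `top_k` is a hypothetical helper, and the scores and names are made-up sample data.

```python
import heapq


def top_k(scores, id_to_name, k=10):
    """Return (id, name, score) for the k highest-scoring nodes."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return [(node_id, id_to_name[node_id], score) for node_id, score in best]


# Hypothetical class-0 scores for a five-article graph.
sample_scores = [0.4, 6.4, -2.1, 20.9, 5.3]
sample_names = {0: "Synthesizer", 1: "Clarinet", 2: "Falkland_Islands",
                3: "Saxophone", 4: "Miles_Davis"}
ranked = top_k(sample_scores, sample_names, k=3)
for node_id, name, score in ranked:
    print(node_id, name, score)
```

With the notebook's tensors, something like `all_logits[-1][:, 0].tolist()` would supply the scores as plain floats. The wanted topic itself ranks first, as in the output above, so a caller may want to skip that entry.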



## Credits
- [GCN Tutorial Using DGL](https://docs.dgl.ai/en/latest/tutorials/basics/1_first.html)
- [Using Python With TigerGraph's REST API](https://github.com/tigergraph/ecosys/tree/master/etl/tg-python-wrapper) (used to create pyTigerGraph)