# KAGGLE COMPETITION AUEB 2021-2022
## Link Citation Prediction
##### Link prediction is the problem of predicting the existence of a link between two entities in a network.

The problem of link prediction has recently attracted a lot of attention in many domains. For instance, in social networks, one may be interested in predicting friendship links among users, while in biology, predicting interactions between genes and proteins is of paramount importance.

In this challenge, you will deal with the problem of predicting whether a research paper cites another research paper. You are given a citation network consisting of several thousands of research papers, along with their abstracts and their list of authors. The pipeline that is typically followed to deal with the problem is similar to the one applied in any classification problem; the goal is to use edge information to learn the parameters of a classifier and then to use the classifier to predict whether two nodes are linked by an edge or not.

We also provide a simple baseline that you can build on. The key step is constructing the training and test data. For training, first create a NumPy array of zeros with 2m rows and n columns, where m is the number of edges in the edgelist file and n is the number of desired features. Then populate this array with the training features. For example, with 3 features, iterate over all edges in the edgelist file and store the degree of the edge's first node in the 1st column, the degree of its second node in the 2nd column, and their sum in the 3rd column. For the remaining m rows, select pairs of nodes at random and compute the same three features. This gives the training matrix.

Next, create the class labels y: a vector of zeros of size 2m. The first m elements correspond to the edges present in the edgelist, so set them to 1. The remaining m elements stay 0, since they represent non-existing edges. The features and labels can now be passed to a scikit-learn classifier via the `fit` method.

Afterwards, build a feature matrix for the test file in the same way and obtain the predicted probabilities, keeping only the probabilities of the positive class. Check the 'submissionsrandom' file to confirm that your output matches the expected format. The 1st column of the submission file is an integer id for each edge in the test file: 0 refers to the first edge of the test file, and so on. You are now ready to submit!
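
The baseline described above can be sketched roughly as follows. This is a minimal illustration rather than the reference solution: it uses NetworkX's built-in karate club graph as a stand-in for the competition edgelist, and `degree_features` and `test_pairs` are hypothetical names introduced only for this example.

```python
import networkx as nx
import numpy as np
from random import Random
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the competition graph (assumption: a small undirected graph).
G = nx.karate_club_graph()
nodes = list(G.nodes())
m = G.number_of_edges()
rng = Random(0)


def degree_features(graph, u, v):
    """Return [deg(u), deg(v), deg(u) + deg(v)] for a pair of nodes."""
    du, dv = graph.degree(u), graph.degree(v)
    return [du, dv, du + dv]


X = np.zeros((2 * m, 3))
y = np.zeros(2 * m)

# First m rows: pairs that are edges in the graph (label 1).
for i, (u, v) in enumerate(G.edges()):
    X[i] = degree_features(G, u, v)
    y[i] = 1

# Remaining m rows: randomly selected node pairs (label stays 0).
for i in range(m, 2 * m):
    u, v = rng.choice(nodes), rng.choice(nodes)
    X[i] = degree_features(G, u, v)

clf = LogisticRegression(solver="liblinear")
clf.fit(X, y)

# Hypothetical test pairs standing in for the rows of the test file.
test_pairs = [(0, 1), (5, 30), (12, 33)]
X_test = np.array([degree_features(G, u, v) for u, v in test_pairs])

# Column 1 of predict_proba holds the probability of the positive class.
proba = clf.predict_proba(X_test)[:, 1]

# One (id, probability) row per test pair, with ids starting at 0.
submission_rows = list(enumerate(proba))
```

The rows in `submission_rows` would still need to be written out in whatever layout the 'submissionsrandom' file uses; the header names are not reproduced here because they are not given in this text.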

---

In [1]:
import math
import csv

import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from random import randint

from tqdm import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score  # used in the evaluation cell

In [2]:
# Create a graph
G = nx.read_edgelist('data/edgelist.txt', delimiter=',', create_using=nx.Graph(), nodetype=int)
nodes = list(G.nodes())
n = G.number_of_nodes()
m = G.number_of_edges()
print('Number of nodes:', n)
print('Number of edges:', m)

Number of nodes: 138499
Number of edges: 1091955


In [3]:
# Read the abstract of each paper
abstracts = dict()

with open('data/abstracts.txt', 'r', encoding = "UTF-8") as f:
    for line in f:
        node, abstract = line.split('|--|')
        
        abstracts[int(node)] = abstract

for node in abstracts:
    abstracts[node] = set(abstracts[node].split())

In [9]:
authors = dict()
with open('data/authors.txt', 'r', encoding = "UTF-8") as f:
    for line in f:
        node, author = line.split('|--|')
        
        # strip the trailing newline so the last author name matches across papers
        authors[int(node)] = author.strip().split(",")

In [5]:
# Store the edge list as a two-column DataFrame (one row per edge)
df = pd.DataFrame(list(G.edges()), columns=['node_1', 'node_2'])

In [None]:
print(df)

In [None]:
# create graph
G = nx.from_pandas_edgelist(df, "node_1", "node_2", create_using=nx.Graph())
# plot graph
plt.figure(figsize=(10,10))
pos = nx.random_layout(G, seed=23)
nx.draw(G, with_labels=False,  pos = pos, node_size = 40, alpha = 0.6, width = 0.7)
plt.show()



In [None]:
# Build the training set: each row is a pair of nodes, and its class label
# is 1 if the pair corresponds to an edge and 0 otherwise.
# Use the following 4 features for each pair of nodes:
# (1) sum of number of unique terms of the two nodes' abstracts
# (2) absolute value of difference of number of unique terms of the two nodes' abstracts
# (3) number of common terms between the abstracts of the two nodes
# (4) number of common authors between the two nodes

x = np.zeros((2*m, 4))
y = np.zeros(2*m)
n = G.number_of_nodes()
for i,edge in enumerate(G.edges()):
    # an edge
    x[i,0] = len(abstracts[edge[0]]) + len(abstracts[edge[1]])
    x[i,1] = abs(len(abstracts[edge[0]]) - len(abstracts[edge[1]]))
    x[i,2] = len(abstracts[edge[0]].intersection(abstracts[edge[1]]))
    x[i,3] = len(set(authors[edge[0]]).intersection(authors[edge[1]]))
    y[i] = 1

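    # a randomly generated pair of nodes (assumes node ids run from 0 to n-1;
    # the pair is not checked against the edge list, so it may be a real edge)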
    n1 = randint(0, n-1)
    n2 = randint(0, n-1)
    x[m+i,0] = len(abstracts[n1]) + len(abstracts[n2])
    x[m+i,1] = abs(len(abstracts[n1]) - len(abstracts[n2]))
    x[m+i,2] = len(abstracts[n1].intersection(abstracts[n2]))
    x[m+i,3] = len(set(authors[n1]).intersection(authors[n2]))
    y[m+i] = 0
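
Because the random pairs above are drawn without checking the graph, some negative examples may be real edges or self-pairs. One possible way to avoid this is rejection sampling, sketched below. The helper name `sample_non_edges` is hypothetical, and the sketch assumes a simple undirected graph with comparable (for example integer) node ids that is sparse enough for rejection sampling to finish quickly.

```python
import random
import networkx as nx


def sample_non_edges(graph, k, seed=0):
    """Sample k distinct node pairs that are neither edges nor self-pairs."""
    rng = random.Random(seed)
    nodes = list(graph.nodes())
    n = len(nodes)
    # Number of unordered pairs that are not edges (simple graph assumed).
    max_non_edges = n * (n - 1) // 2 - graph.number_of_edges()
    if k > max_non_edges:
        raise ValueError("graph has fewer than k non-edges")

    sampled = set()
    while len(sampled) < k:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u == v or graph.has_edge(u, v):
            continue  # reject self-pairs and existing edges
        sampled.add((min(u, v), max(u, v)))  # store each pair in one orientation
    return sorted(sampled)


# Small usage example on a 6-node path graph (5 edges, 10 non-edges).
toy = nx.path_graph(6)
negatives = sample_non_edges(toy, k=5)
```

With pairs sampled this way, the label 0 holds for every negative row, at the cost of a few rejected draws.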

In [None]:
x, y = shuffle(x, y)
x_train, x_test, y_train, y_test = train_test_split(x, y)

##### Logistic Regression

In [None]:
# Use logistic regression to predict if two nodes are linked by an edge
clf = LogisticRegression(solver='liblinear',random_state=34)
clf.fit(x_train, y_train)
y_pred = clf.predict_proba(x_test)
y_pred = y_pred[:,1]

##### Linear Regression 

In [None]:
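# Linear regression returns a real-valued score rather than a probability;
# the score is thresholded at 0.5 in the evaluation cell.
clf = LinearRegression()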
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

##### Ada-Boost

In [None]:
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1)

model = abc.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [None]:
# Threshold the predicted scores at 0.5 to obtain class labels
y_predictions = [1 if score >= 0.5 else 0 for score in y_pred]


# accuracy_score expects the true labels first, then the predictions
print("Accuracy: ", accuracy_score(y_test, y_predictions))
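
As a rough sketch of how classifiers could be compared side by side, the snippet below fits two of them on the same split, takes the positive-class probability from `predict_proba`, and thresholds it with NumPy. The data comes from `make_classification` and is purely synthetic (an assumption for illustration, not the competition features), and the `models` and `scores` dictionaries are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 4-column feature matrix (assumption).
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(solver="liblinear", random_state=34),
    "ada_boost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class (column 1).
    proba = model.predict_proba(X_test)[:, 1]
    # Vectorized threshold at 0.5 instead of an explicit loop.
    labels = (proba >= 0.5).astype(int)
    scores[name] = accuracy_score(y_test, labels)

for name, score in scores.items():
    print(f"{name}: accuracy = {score:.3f}")
```

Keeping the probabilities separate from the thresholded labels is convenient here, since the submission needs probabilities while accuracy needs labels.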

##### Graph