# KAGGLE COMPETITION AUEB 2021-2022
## Link Citation Prediction
##### Link prediction is the problem of predicting the existence of a link between two entities in a network.

The problem of link prediction has recently attracted a lot of attention in many domains. For instance, in social networks, one may be interested in predicting friendship links among users, while in biology, predicting interactions between genes and proteins is of paramount importance.

In this challenge, you will deal with the problem of predicting whether a research paper cites another research paper. You are given a citation network consisting of several thousands of research papers, along with their abstracts and their list of authors. The pipeline that is typically followed to deal with the problem is similar to the one applied in any classification problem; the goal is to use edge information to learn the parameters of a classifier and then to use the classifier to predict whether two nodes are linked by an edge or not.

We also provide a simple baseline approach that you can build on. The key component is how the training and test data are created. For training, first create a NumPy array of zeros with 2m rows and n columns, where m is the number of edges in the edgelist file and n is the number of desired features. Then populate this array with the training features. For example, with 3 features, iterate over all edges in the edgelist file and store the degree of the first node of the edge in the 1st column, the degree of the second node in the 2nd column, and their sum in the 3rd column. For the remaining m rows, select pairs of nodes at random and compute the same three features. This gives the training matrix.

We also need the class labels y. Create a vector of zeros of size 2m. The first m elements correspond to edges that are present in the edgelist, so set them to 1. The remaining m elements stay 0, as they represent non-existing edges. The features and labels can now be passed to a scikit-learn classifier via the 'fit' method.

Afterwards, create a similar feature matrix for the test file and obtain the predicted probabilities, keeping only the probabilities of the positive class. Check the 'submissionsrandom' file to verify that your output matches the expected format. The 1st column in the submissions file is an integer id of the edge in the test file: 0 refers to the first edge of the test file, and so on. You are now ready to submit!
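
The degree-based baseline described above can be sketched as follows. This is a minimal illustration, not the competition pipeline: it assumes a toy graph (`nx.karate_club_graph()`) in place of the real edgelist, and `degree_features` is a hypothetical helper introduced only for this example.

```python
import random

import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the competition graph (assumption: the real one is read from the edgelist file).
G = nx.karate_club_graph()
nodes = list(G.nodes())
m = G.number_of_edges()
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def degree_features(u, v):
    """Return [deg(u), deg(v), deg(u) + deg(v)] for a node pair."""
    du, dv = G.degree(u), G.degree(v)
    return [du, dv, du + dv]


X = np.zeros((2 * m, 3))
y = np.zeros(2 * m)

# First m rows: pairs that are edges in the graph (label 1).
for i, (u, v) in enumerate(G.edges()):
    X[i] = degree_features(u, v)
    y[i] = 1

# Remaining m rows: randomly chosen pairs (label 0).
for i in range(m, 2 * m):
    u, v = rng.choice(nodes), rng.choice(nodes)
    X[i] = degree_features(u, v)

clf = LogisticRegression()
clf.fit(X, y)

# Keep only the probability of the positive class (column 1 of predict_proba).
proba = clf.predict_proba(X)[:, 1]
print(X.shape, proba[:3])
```

The same `degree_features` function would be applied to each pair in the test file to build the matrix passed to `predict_proba`.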

	---
## Import Statements

In [2]:
import networkx as nx
import pandas as pd
import numpy as np
from random import randint
from collections import Counter
from itertools import combinations

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import accuracy_score

	---
## Data Acquisition
#### Read Edgelist

In [None]:
# Create a graph
G = nx.read_edgelist('../input/linkCitatationData/edgelist.txt', delimiter=',', create_using=nx.Graph(), nodetype=int)
nodes = list(G.nodes())

num_of_nodes = G.number_of_nodes()
num_of_edges = G.number_of_edges()

print('Number of nodes:', num_of_nodes)
print('Number of edges:', num_of_edges)

##### Read Abstracts

In [None]:
# Read the abstract of each paper
abstracts = dict()

with open('../input/linkCitatationData/abstracts.txt', 'r', encoding = "UTF-8") as f:
    for line in f:
        node, abstract = line.split('|--|')
        
        abstracts[int(node)] = abstract

for node in abstracts:
    abstracts[node] = set(abstracts[node].split())

##### Read Authors

In [None]:
authors = dict()
with open('../input/linkCitatationData/authors.txt', 'r', encoding = "UTF-8") as f:
    for line in f:
        node, author = line.split('|--|')

        author = author.replace("\n", "")
        
        authors[int(node)] = author.split(",")

	---
## Data Preparation
#### Baseline Data

In [None]:
# Features:
# (1) sum of number of unique terms of the two nodes' abstracts
# (2) absolute value of difference of number of unique terms of the two nodes' abstracts
# (3) number of common terms between the abstracts of the two nodes

x = np.zeros((2*num_of_edges, 3))
y = np.zeros(2*num_of_edges)

for i, edge in enumerate(G.edges()):
    # an edge
    x[i,0] = len(abstracts[edge[0]]) + len(abstracts[edge[1]])
    x[i,1] = abs(len(abstracts[edge[0]]) - len(abstracts[edge[1]]))
    x[i,2] = len(abstracts[edge[0]].intersection(abstracts[edge[1]]))

    y[i] = 1

    # a randomly generated pair of nodes
    n1 = randint(0, num_of_nodes-1)
    n2 = randint(0, num_of_nodes-1)
    x[num_of_edges+i,0] = len(abstracts[n1]) + len(abstracts[n2])
    x[num_of_edges+i,1] = abs(len(abstracts[n1]) - len(abstracts[n2]))
    x[num_of_edges+i,2] = len(abstracts[n1].intersection(abstracts[n2]))
    
    y[num_of_edges+i] = 0
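
Pairs drawn with `randint` can occasionally coincide with a real edge, or pick the same node twice, which puts mislabeled rows into the negative class. A minimal sketch of rejection sampling that avoids this is shown below. It assumes an undirected graph with comparable (e.g. integer) node ids, and `sample_non_edges` is a hypothetical helper introduced for illustration.

```python
import random

import networkx as nx


def sample_non_edges(G, k, seed=0):
    """Sample k distinct node pairs (u, v), u != v, that are not edges of G.

    Assumes k does not exceed the number of non-edges, otherwise the loop never ends.
    """
    rng = random.Random(seed)
    nodes = list(G.nodes())
    pairs = set()
    while len(pairs) < k:
        u, v = rng.sample(nodes, 2)  # two different nodes
        if G.has_edge(u, v):
            continue  # reject pairs that are real edges
        pairs.add((min(u, v), max(u, v)))  # canonical order, since the graph is undirected
    return sorted(pairs)


G_toy = nx.path_graph(6)  # toy graph with edges 0-1, 1-2, ..., 4-5
negatives = sample_non_edges(G_toy, k=5)
print(negatives)
```

On a sparse citation graph few candidates are rejected, so the extra cost over plain `randint` sampling should be small.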

#### Baseline Data + Common Authors

In [None]:
# Features:
# (1) sum of number of unique terms of the two nodes' abstracts
# (2) absolute value of difference of number of unique terms of the two nodes' abstracts
# (3) number of common terms between the abstracts of the two nodes
# (4) number of common Authors between the Author List of the two nodes

x = np.zeros((2*num_of_edges, 4))
y = np.zeros(2*num_of_edges)

for i, edge in enumerate(G.edges()):
    # an edge
    x[i,0] = len(abstracts[edge[0]]) + len(abstracts[edge[1]])
    x[i,1] = abs(len(abstracts[edge[0]]) - len(abstracts[edge[1]]))
    x[i,2] = len(abstracts[edge[0]].intersection(abstracts[edge[1]]))
    x[i,3] = len(set(authors[edge[0]]).intersection(authors[edge[1]]))

    y[i] = 1

    # a randomly generated pair of nodes
    n1 = randint(0, num_of_nodes-1)
    n2 = randint(0, num_of_nodes-1)
    x[num_of_edges+i,0] = len(abstracts[n1]) + len(abstracts[n2])
    x[num_of_edges+i,1] = abs(len(abstracts[n1]) - len(abstracts[n2]))
    x[num_of_edges+i,2] = len(abstracts[n1].intersection(abstracts[n2]))
    x[num_of_edges+i,3] = len(set(authors[n1]).intersection(authors[n2]))
    
    y[num_of_edges+i] = 0

#### Baseline Data - Sum of Unique Terms + Common Authors


In [None]:
# Features:
# (1) absolute value of difference of number of unique terms of the two nodes' abstracts
# (2) number of common terms between the abstracts of the two nodes
# (3) number of common Authors between the Author List of the two nodes

x = np.zeros((2*num_of_edges, 3))
y = np.zeros(2*num_of_edges)

for i, edge in enumerate(G.edges()):
    # an edge
    x[i,0] = abs(len(abstracts[edge[0]]) - len(abstracts[edge[1]]))
    x[i,1] = len(abstracts[edge[0]].intersection(abstracts[edge[1]]))
    x[i,2] = len(set(authors[edge[0]]).intersection(authors[edge[1]]))

    y[i] = 1

    # a randomly generated pair of nodes
    n1 = randint(0, num_of_nodes-1)
    n2 = randint(0, num_of_nodes-1)
    x[num_of_edges+i,0] = abs(len(abstracts[n1]) - len(abstracts[n2]))
    x[num_of_edges+i,1] = len(abstracts[n1].intersection(abstracts[n2]))
    x[num_of_edges+i,2] = len(set(authors[n1]).intersection(authors[n2]))
    
    y[num_of_edges+i] = 0

## Papers

In [None]:
paper_times_cited = Counter([edge[0] for edge in G.edges])
paper_num_of_citations = Counter([edge[1] for edge in G.edges])

# Features:
# (1) probability of getting cited for the first paper
# (2) probability of citing a paper for the second paper

x = np.zeros((2*num_of_edges, 2))  # two features, matching the list above
y = np.zeros(2*num_of_edges)

for i, edge in enumerate(G.edges()):
    # an edge
    
    x[i,0] = paper_times_cited[edge[0]]/num_of_edges

    x[i,1] = paper_num_of_citations[edge[1]]/num_of_edges

    y[i] = 1

    # a randomly generated pair of nodes
    n1 = randint(0, num_of_nodes-1)
    n2 = randint(0, num_of_nodes-1)

    x[num_of_edges+i,0] = paper_times_cited[n1]/num_of_edges

    x[num_of_edges+i,1] = paper_num_of_citations[n2]/num_of_edges
    
    y[num_of_edges+i] = 0

## Authors

In [None]:
#distinct_authors = list(set([author for author_list in authors.values() for author in author_list]))

times_cited = Counter([author for edge in G.edges for author in authors[edge[0]]])
num_of_citations = Counter([author for edge in G.edges for author in authors[edge[1]]])

In [None]:
# Features:
# (1) probability of getting cited for the first paper
# (2) probability of citing a paper for the second paper
# (3) number of common terms between the abstracts of the two nodes
# (4) number of common Authors between the Author List of the two nodes

x = np.zeros((2*num_of_edges, 4))
y = np.zeros(2*num_of_edges)

for i, edge in enumerate(G.edges()):
    # an edge
    
    x[i,0] = 1
    for author in authors[edge[0]]:
        x[i,0] *= times_cited[author]/num_of_edges

    x[i,1] = 1
    for author in authors[edge[1]]:
        x[i,1] *= num_of_citations[author]/num_of_edges

    x[i,2] = len(abstracts[edge[0]].intersection(abstracts[edge[1]]))
    x[i,3] = len(set(authors[edge[0]]).intersection(authors[edge[1]]))

    y[i] = 1

    # a randomly generated pair of nodes
    n1 = randint(0, num_of_nodes-1)
    n2 = randint(0, num_of_nodes-1)

    x[num_of_edges+i,0] = 1
    for author in authors[n1]:
        x[num_of_edges+i,0] *= times_cited[author]/num_of_edges

    x[num_of_edges+i,1] = 1
    for author in authors[n2]:
        x[num_of_edges+i,1] *= num_of_citations[author]/num_of_edges

    x[num_of_edges+i,2] = len(abstracts[n1].intersection(abstracts[n2]))
    x[num_of_edges+i,3] = len(set(authors[n1]).intersection(authors[n2]))
    
    y[num_of_edges+i] = 0
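
The per-author ratios multiplied above are each much smaller than 1, so for papers with many authors the product can become extremely small, and a single author with a zero count makes the whole feature 0. One possible alternative, shown here only as a sketch, is to sum smoothed log-ratios instead. The `log_author_score` helper and the `smoothing` parameter are assumptions, not part of the notebook.

```python
import math
from collections import Counter


def log_author_score(author_list, counts, total, smoothing=1.0):
    """Sum of log((count + smoothing) / (total + smoothing)) over the authors."""
    return sum(
        math.log((counts[a] + smoothing) / (total + smoothing))
        for a in author_list
    )


# Toy numbers (hypothetical): author "C" never appears, 1000 edges in total.
counts = Counter({"A": 10, "B": 1})
score = log_author_score(["A", "B", "C"], counts, total=1000)
print(score)
```

Because the logarithm is monotonic, the ordering of papers by this score matches the ordering by the smoothed product, while the values stay in a numerically comfortable range.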

	---
#### Shuffle Data and Split into Train/Test

In [None]:
x, y = shuffle(x, y)
x_train, x_test, y_train, y_test = train_test_split(x, y)
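
`train_test_split` already shuffles by default, so the separate `shuffle` call is not strictly required. A sketch of a reproducible, class-balanced split follows; the toy arrays stand in for the `x` and `y` built above, and the `test_size` and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced data (assumption: stands in for the feature matrix and labels built above).
x_toy = np.arange(40, dtype=float).reshape(20, 2)
y_toy = np.array([1] * 10 + [0] * 10)

# stratify keeps the class ratio similar in both parts; random_state fixes the shuffle.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=34
)
print(x_tr.shape, x_te.shape)
```

Fixing `random_state` makes accuracy numbers comparable between runs that differ only in the feature set or model.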

	---
## Models
##### Gaussian Naive Bayes

In [None]:
gnb = GaussianNB()
y_pred = gnb.fit(x_train, y_train).predict(x_test)

# accuracy_score returns a fraction in [0, 1], not a percentage
print("Accuracy: " + str(accuracy_score(y_test, y_pred)))

##### Logistic Regression

In [None]:
# Use logistic regression to predict if two nodes are linked by an edge
clf = LogisticRegression(solver='liblinear',random_state=34)
clf.fit(x_train, y_train)
y_pred = clf.predict_proba(x_test)
y_pred = y_pred[:,1]

getAccuracy()
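
The positive-class probabilities obtained this way are what goes into the submission file. A minimal sketch of writing them out is given below. The column names `id` and `predicted` are hypothetical and should be matched against the provided sample submission file; the `proba` array stands in for the classifier output on the test pairs.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the positive-class probabilities of the test pairs (assumption).
proba = np.array([0.91, 0.12, 0.55])

# The id column starts at 0 for the first edge of the test file.
# Hypothetical column names; check the sample submission file for the real ones.
submission = pd.DataFrame({"id": np.arange(len(proba)), "predicted": proba})

# Written to an in-memory buffer here; pass a file path to to_csv for a real submission.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```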

##### Linear Regression 

In [None]:
clf = LinearRegression()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

getAccuracy()

##### Ada-Boost

In [None]:
abc = AdaBoostClassifier(n_estimators=10, learning_rate=1)

model = abc.fit(x_train, y_train)
y_pred = model.predict(x_test)

getAccuracy()

##### Voting Regressor

In [None]:
r1 = LinearRegression()
r2 = RandomForestRegressor(n_estimators=10, random_state=1)
r3 = KNeighborsRegressor()
er = VotingRegressor([('lr', r1), ('rf', r2), ('r3', r3)])

model = er.fit(x_train, y_train)
y_pred = model.predict(x_test)

getAccuracy()


	---
## Metrics
#### Accuracy Score

In [None]:
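def getAccuracy():
    # Uses the notebook globals y_pred and y_test; run this cell before the model cells.
    # Scores >= 0.5 are mapped to class 1, everything else to class 0.
    y_predictions = [1 if score >= 0.5 else 0 for score in y_pred]

    # accuracy_score returns a fraction in [0, 1], not a percentage
    print("Accuracy: " + str(accuracy_score(y_test, y_predictions)))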
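
A variant that takes its inputs as arguments instead of reading notebook globals is sketched below. `thresholded_accuracy` is a hypothetical name, and the default threshold of 0.5 mirrors the cut-off used above.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def thresholded_accuracy(y_true, y_score, threshold=0.5):
    """Binarise scores at `threshold` and return the accuracy as a fraction in [0, 1]."""
    y_hat = (np.asarray(y_score) >= threshold).astype(int)
    return accuracy_score(y_true, y_hat)


y_true_toy = [1, 0, 1, 0]
y_score_toy = [0.9, 0.4, 0.3, 0.6]
print(thresholded_accuracy(y_true_toy, y_score_toy))  # 0.5
```

Passing the arrays explicitly makes the result independent of cell execution order, and returning the value allows results from several models to be collected in a table.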

	---
## Accuracy Score

Scores are accuracy fractions in [0, 1]. RS = `random_state`, E = `n_estimators`, lR = `learning_rate`.

#### Baseline Data:

	Gaussian Naive Bayes: 0.660876445571067

	Logistic Regression:
		RS 34: 0.7141752964405159 (Baseline)
		RS 54: 0.7129389828894205

	Linear Regression: 0.7173146170724828

	Ada-Boost:
		10 E, 1 lR: 0.7015484140386609
		50 E, 1 lR: 0.7171332910849888
		10 E, 5 lR: 0.3140895787009733
		50 E, 5 lR: 0.3140895787009733

#### Baseline Data + Common Authors:

	Gaussian Naive Bayes: 0.5558905303876713

	Logistic Regression:
		RS 34: 0.7238331947441106 (0.7260475696822949 with \n removed)
		RS 54: 0.7226939547014715

	Linear Regression: 0.7215931777470888 (0.7231866485462785 with \n removed)

	Ada-Boost:
		10 E, 1 lR: 0.7105487766906359
		50 E, 1 lR: 0.7245456776646678
		10 E, 5 lR: 0.3146756829029741
		50 E, 5 lR: 0.3146756829029741

		100 E, 1 lR: 0.7262582008798889

## Ideas

Create "friendly connections" (relations between authors) and use them to produce predictions.
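
One way this idea could be explored is a co-authorship graph, where two authors are connected if they appear on the same paper. The sketch below is only an illustration under that assumption: the toy `authors_toy` dictionary mimics the shape of `authors`, and `coauthor_links` is a hypothetical feature, not something evaluated in this notebook.

```python
from itertools import combinations

import networkx as nx

# Toy author lists keyed by paper id (hypothetical data in the same shape as `authors`).
authors_toy = {
    0: ["A. Smith", "B. Jones"],
    1: ["B. Jones", "C. Lee"],
    2: ["D. Kim"],
}

# Co-authorship graph: one node per author, edge weight = number of joint papers.
A = nx.Graph()
for author_list in authors_toy.values():
    A.add_nodes_from(author_list)
    for a, b in combinations(sorted(set(author_list)), 2):
        w = A.get_edge_data(a, b, default={"weight": 0})["weight"]
        A.add_edge(a, b, weight=w + 1)


def coauthor_links(p, q):
    """Count author pairs (one from each paper) that have written a paper together."""
    return sum(1 for a in authors_toy[p] for b in authors_toy[q] if A.has_edge(a, b))


print(coauthor_links(0, 1))  # 2
```

The resulting count could be appended as an extra column of the feature matrix, next to the number of common authors.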



## Dataframe

#### Reading Files

In [None]:
# Read the abstract of each paper
abstracts = [] #Init List Abstracts

with open('../input/linkCitatationData/abstracts.txt', 'r', encoding = "UTF-8") as f:
    for line in f:
        abstracts.append(line.split('|--|')[1].replace("\n", ""))  
        
authors = [] #Init List Authors
with open('../input/linkCitatationData/authors.txt', 'r', encoding = "UTF-8") as f:
    for line in f:
        authors.append(line.split('|--|')[1].replace("\n", "").split(",")) 
        

##### Dataframe

In [None]:
df = pd.DataFrame(data = {'abstracts': abstracts, 'authors': authors})
print(df.iloc[0,1][0])

In [None]:
# Create a graph

G = nx.read_edgelist('../input/linkCitatationData/edgelist.txt', delimiter=',', create_using=nx.Graph(), nodetype=int)
nodes = list(G.nodes())

num_of_nodes = G.number_of_nodes()
num_of_edges = G.number_of_edges()
adj = nx.adjacency_matrix(G)
print('Number of nodes:', num_of_nodes)
print('Number of edges:', num_of_edges)
#print('Adjacency Matrix:', adj[: 5][: 5])
# Index the sparse adjacency matrix with a (row, column) pair
print(adj[0, 1])

In [None]:
for i in range(num_of_nodes):
    for j in range(num_of_nodes):
        pass  # TODO: loop body not written yet