# Competition — Link Prediction

### Challenge Overview

The dataset contains publications, each described by a binary vector indicating the presence or absence of the corresponding word. The dataset can be represented as a graph where nodes are publications and edges are citations. For simplicity, the graph is treated as undirected. As negative examples, the most similar pairs of unconnected nodes were selected, in the same number as the existing edges. The proximity metric is cosine similarity, computed as the normalized dot product of rows of the adjacency matrix. Existing edges are labeled 1 and the additionally selected pairs are labeled 0.
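To make the proximity metric concrete, here is a minimal sketch of cosine similarity between rows of an adjacency matrix, followed by a ranking of unconnected pairs. The toy graph and the helper name `cosine_similarity_matrix` are illustrative assumptions, not part of the competition code, and the organizers' exact selection procedure may differ.

```python
import numpy as np

def cosine_similarity_matrix(adj: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of an adjacency matrix."""
    norms = np.linalg.norm(adj, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # isolated nodes: avoid division by zero
    unit = adj / norms
    return unit @ unit.T

# Toy undirected graph with 4 nodes and edges 0-1, 0-2, 1-2, 2-3 (made up)
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

sim = cosine_similarity_matrix(adj)

# Candidate negatives: unconnected pairs, most similar first
rows, cols = np.triu_indices_from(adj, k=1)
non_edges = [(i, j) for i, j in zip(rows, cols) if adj[i, j] == 0]
ranked = sorted(non_edges, key=lambda pair: sim[pair], reverse=True)
```

Because the negatives are chosen to look structurally similar to real edges, they are harder to separate than randomly sampled non-edges.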

The dataset is represented by 3 files:
* features.txt contains descriptions of the papers in the format:
    * `<node id> <3703 unique one-hot encoded words>`
* labeled_edges.txt contains labeled pairs of nodes in the format:
    * `<node id> <node id> <label>`
* unlabeled_edges.txt contains unlabeled pairs of nodes in the format:
    * `<node id> <node id>`

Your task is to predict labels for unlabeled pairs of nodes: 0 — disconnected, 1 — connected.

Hints:
* Consider the features only. Transform the sparse feature matrix into low-dimensional dense embeddings. Fit a classifier to predict links.
* Consider the structure only. Create a graph that consists of edges with label 1. Train any structural embedding model to obtain node embeddings. Fit a classifier to predict links.
* Consider both structure and features. Create a graph that consists of edges with label 0, label 1, and no label. Train any GNN model to obtain node embeddings. Minimize the link prediction error by gradient descent.
* Concatenate (multiply, sum up, average) pairs of node embeddings to obtain edge embeddings.
* You can combine embeddings from heterogeneous models.
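
The edge-embedding operators from the hints can be sketched as follows. The function name `edge_embeddings`, the operator names, and the toy embedding values are assumptions made for illustration; any node embedding matrix indexed by node id would work the same way.

```python
import numpy as np

def edge_embeddings(node_emb, src, dst, op="hadamard"):
    """Combine the embeddings of each (src, dst) pair into one edge vector."""
    u, v = node_emb[src], node_emb[dst]
    if op == "concat":
        return np.concatenate([u, v], axis=1)  # order-dependent, width 2 * d
    if op == "hadamard":
        return u * v  # element-wise product
    if op == "sum":
        return u + v
    if op == "average":
        return (u + v) / 2.0
    raise ValueError(f"unknown operator: {op}")

# Toy data: 4 nodes with 3-dimensional embeddings (made-up values)
node_emb = np.array([
    [1.0, 0.0, 2.0],
    [0.5, 1.0, 0.0],
    [2.0, 2.0, 1.0],
    [0.0, 1.0, 1.0],
])
src = np.array([0, 2])
dst = np.array([1, 3])

had = edge_embeddings(node_emb, src, dst, "hadamard")
cat = edge_embeddings(node_emb, src, dst, "concat")
```

Concatenation depends on the order of the two nodes, while the product, sum, and average are symmetric. Since the graph is treated as undirected, a symmetric operator gives the same features for `(u, v)` and `(v, u)`.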

### Evaluation Criteria

The classes are balanced, so plain accuracy is used:

* Accuracy = Correct predictions / All predictions
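
As a worked example of the formula, here is a small pure-Python sketch. The helper name `accuracy` and the label lists are made up for illustration; `sklearn.metrics.accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = accuracy(y_true, y_pred)
print(score)  # 3 of 5 predictions are correct -> 0.6
```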

You can find baselines for grades 4, 6, and 8 on the leaderboard.

### Submission Guidelines

Upload a txt file with your predictions separated by line breaks. For example:
```
1
1
0
1
0
```
... and so on.
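
A file in this format could be produced with a sketch like the one below. The helper name `write_submission` and the file name are hypothetical. It assumes the predictions are in the same order as the rows of unlabeled_edges.txt.

```python
import tempfile
from pathlib import Path

def write_submission(predictions, path):
    """Write one 0/1 label per line, in the given order."""
    labels = [int(p) for p in predictions]
    if any(label not in (0, 1) for label in labels):
        raise ValueError("labels must be 0 or 1")
    Path(path).write_text("\n".join(str(label) for label in labels) + "\n")

# Demonstration in a temporary directory, so no file is left behind
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "submission.txt"
    write_submission([1, 1, 0, 1, 0], out)
    content = out.read_text()
```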

In [29]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings

warnings.filterwarnings('ignore')

In [6]:
# Raw strings keep "\s+" a valid regex separator without invalid-escape warnings
features_df = pd.read_csv('features.txt', delimiter=r"\s+", header=None).set_index(0)
labeled_edges_df = pd.read_csv('labeled_edges.txt', delimiter=r"\s+", names=['from', 'to', 'label'])
unlabeled_edges_df = pd.read_csv('unlabeled_edges.txt', delimiter=r"\s+", names=['from', 'to'])

In [7]:
features_df.head(5)

0,1,2,3,4,5,6,7,8,9,10,...,3694,3695,3696,3697,3698,3699,3700,3701,3702,3703
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
labeled_edges_df.head(5)

,from,to,label
0,2495,598,0
1,1473,1570,1
2,1000,748,1
3,2951,1693,0
4,1000,522,1


In [9]:
unlabeled_edges_df.head(5)

,from,to
0,1429,2808
1,1125,1250
2,2277,2810
3,1623,2808
4,1767,1768


In [12]:
features_from_df = features_df.loc[labeled_edges_df['from']]
features_to_df = features_df.loc[labeled_edges_df['to']]

In [14]:
features_from_to = features_from_df.to_numpy() + features_to_df.to_numpy()

In [24]:
X = np.concatenate([features_from_to,features_from_to.sum(axis=1)[:, None]],axis=1)
y = labeled_edges_df['label']

y.shape, X.shape

((6048,), (6048, 3704))

In [19]:
scaler = preprocessing.StandardScaler().fit(X)

In [20]:
X_scaler = scaler.transform(X)

In [25]:
seed = 7
test_size = 0.33
X_train, X_val, y_train, y_val = train_test_split(X_scaler, y, test_size=test_size, random_state=seed)

In [30]:
model_XGB = XGBClassifier()
model_XGB.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [32]:
y_pred = model_XGB.predict(X_val)
predictions = [round(value) for value in y_pred]

In [34]:
accuracy = accuracy_score(y_val, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 69.89%


In [36]:
model_LR = LogisticRegression(max_iter=10000)
model_LR.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [37]:
y_pred = model_LR.predict(X_val)
predictions = [round(value) for value in y_pred]

In [38]:
accuracy = accuracy_score(y_val, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 62.37%


## Submit solution

In [39]:
model_XGB = XGBClassifier()
model_XGB.fit(X_scaler, y)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [41]:
features_test_from_df = features_df.loc[unlabeled_edges_df['from']]
features_test_to_df = features_df.loc[unlabeled_edges_df['to']]
# Use the test frames here; the original line reused the training features by mistake
features_test_from_to = features_test_from_df.to_numpy() + features_test_to_df.to_numpy()

In [43]:
X_test = np.concatenate([features_test_from_to,features_test_from_to.sum(axis=1)[:, None]],axis=1)
X_test_scaler = scaler.transform(X_test)

In [45]:
predictions = model_XGB.predict(X_test_scaler)

In [46]:
np.savetxt(r'submit_link_prediction_XGB.txt', predictions, fmt='%d')