# Creating the Network

### Undirected Weighted Network

- read in the dataframe with text embeddings
- use the k-nearest neighbours algorithm to find the k nearest neighbours for each node (e.g. k = 15), based on cosine distance
- create a matrix of the KNN edges between nodes, where the weight of the edge is the cosine similarity \* (1 - alpha)
- create a matrix of the citation edges between nodes, where a weight of 1 \* alpha is assigned if a citation exists between the two nodes
- to make the matrix symmetric, we
  - take the average of ij and ji if both are non-zero
  - use ij in cases where ji is zero (and vice versa)
  - set the value to zero (no edge) if ij and ji are both zero
  - use the upper triangle of the resulting matrix as the final adjacency matrix
- we then create a network based on this adjacency matrix
- we add data from the original dataframe to the network (title, year, etc.)
- we save the network as a graphml file; we save the log to the output folder
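
The weighting and symmetrisation steps above can be sketched with NumPy on a toy example. This is a minimal illustration, not the `WeightedNetworkCreator` implementation: the function names (`knn_cosine`, `combine_weights`, `symmetrise`) and the toy data are hypothetical, and adding the KNN and citation weights where both exist for the same pair is an assumption.

```python
import numpy as np


def knn_cosine(embeddings, k):
    """Cosine similarities and indices of the k nearest neighbours of each row."""
    embeddings = np.asarray(embeddings, dtype=float)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a node is never its own neighbour
    idx = np.argsort(-sim, axis=1)[:, :k]
    return np.take_along_axis(sim, idx, axis=1), idx


def combine_weights(similarities, indices, citations, alpha):
    """KNN edges weighted by similarity * (1 - alpha), citation edges by 1 * alpha."""
    n, k = indices.shape
    weights = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    weights[rows, indices.ravel()] = similarities.ravel() * (1 - alpha)
    for i, j in citations:
        weights[i, j] += alpha  # assumption: the two weights add up if both exist
    return weights


def symmetrise(m):
    """Average ij and ji if both are non-zero, otherwise keep the non-zero one."""
    both = (m != 0) & (m.T != 0)
    sym = np.where(both, (m + m.T) / 2, m + m.T)
    return np.triu(sym)  # keep each undirected edge once


# Toy data: three 2-d "embeddings" and one citation from node 0 to node 2
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sims, idx = knn_cosine(emb, k=1)
weights = combine_weights(sims, idx, citations=[(0, 2)], alpha=0.5)
adjacency = symmetrise(weights)
print(adjacency)
```

An upper-triangular matrix like `adjacency` can then be turned into an undirected weighted graph, for example with `networkx.from_numpy_array`.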

### Directed Unweighted Network

- create an edgelist from the dataframe
- create a directed network based on this edgelist
- add data from the original dataframe to the network (title, year, etc.)
- we save the network as a graphml file; we save the log to the output folder
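
The directed case can be sketched in the same spirit. The toy dataframe and its column names (`node_id`, `cites`) are hypothetical and only illustrate the edgelist idea; `DirectedNetworkCreator` may organise its input differently.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy data: each row lists the node ids that the paper cites
toy = pd.DataFrame(
    {
        "node_id": ["A_2001", "B_2003", "C_2005"],
        "cites": [[], ["A_2001"], ["A_2001", "B_2003"]],
        "title": ["Paper A", "Paper B", "Paper C"],
        "year": [2001, 2003, 2005],
    }
)

# One (citing, cited) pair per citation
edgelist = [
    (src, dst)
    for src, targets in zip(toy["node_id"], toy["cites"])
    for dst in targets
]

G_toy = nx.DiGraph()
G_toy.add_nodes_from(toy["node_id"])  # keeps papers without citations as isolated nodes
G_toy.add_edges_from(edgelist)

# Attach selected dataframe columns as node attributes
attrs = toy.set_index("node_id")[["title", "year"]].to_dict(orient="index")
nx.set_node_attributes(G_toy, attrs)

print(G_toy.number_of_nodes(), G_toy.number_of_edges())
```

Adding the nodes before the edges means that papers which neither cite nor are cited still appear in the graph, which is one way isolated nodes can end up in the node count.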


In [1]:
import pandas as pd
from src.network.creation.NetworkCreator import (
    WeightedNetworkCreator,
    DirectedNetworkCreator,
)
import json
import networkx as nx


p = "../data/04-embeddings/df_with_specter2_embeddings.pkl"
df = pd.read_pickle(p)

In [2]:
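# usage: build one weighted network per (alpha, k) combination

alphas = [0.3, 0.5]
ks = [5, 10, 15, 20]  # named `ks` so the loop variable `k` below does not shadow the list
# get all unique combinations of alpha and k
alpha_k_combinations = [(a, b) for a in alphas for b in ks]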

for alpha, k in alpha_k_combinations:
    print(f"Creating weighted knn citation graph with alpha={alpha} and k={k}")
    wnc = WeightedNetworkCreator(df, alpha=alpha)
    similarities, indices = wnc.get_nearest_neighbours(k=k)
    knn_matrix = wnc.link_matrix_knn(similarities, indices)
    knn_citation_matrix = wnc.link_matrix_citations(knn_matrix)
    symmetric_matrix = wnc.make_matrix_symmetric(knn_citation_matrix)
    G = wnc.create_network(symmetric_matrix)
    G = wnc.prettify_network(
        G,
        col_list=["eid", "title", "year", "doi"],
    )
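    weighted_info_log = wnc.get_info_log()

    # NOTE: this path is the same for every (alpha, k) combination, so each
    # iteration overwrites the log written by the previous one
    with open(
        "../output/descriptive-stats-logs/weighted_network_info_log.json", "w"
    ) as f:
        json.dump(weighted_info_log, f)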

    # write graphml
    nx.write_graphml(
        G,
        f"../data/05-graphs/weighted-knn-citation-graph/weighted_alpha{alpha}_k{k}_knn_citation.graphml",
    )
    print("#" * 30)

Creating weighted knn citation graph with alpha=0.3 and k=5
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 336391
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 36975
Number of edges: 425353
Percentage of edges from citations: 79.09%
Percentage of edges from KNN: 20.91%
##############################
Creating weighted knn citation graph with alpha=0.3 and k=10
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 336391
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 36975
Number of edges: 553370
Percentage of edges from citations: 60.79%
Percentage of edges from KNN: 39.21%
##############################
Creating weighted knn citation graph with alpha=0.3 and k=15
Initializing WeightedNetworkCreator


In [3]:
list(G.edges(data=True))[0]

('Hyttel_1982', 'Ofsti_1982', {'weight': 0.7130630612373352})

In [4]:
list(G.nodes(data=True))[0]

('Hyttel_1982',
 {'eid': '2-s2.0-0020416931',
  'title': 'Citalopram - Pharmacological profile of a specific serotonin uptake inhibitor with antidepressant activity',
  'year': 1982,
  'doi': '10.1016/S0278-5846(82)80179-6'})

# Directed Citation Graph


In [5]:
# Create a directed network from the dataframe
dnc = DirectedNetworkCreator(df, data_to_add=["eid", "doi"])
dnc.build_graph()
print(dnc.get_graph_info())
directed_info_log = dnc.get_info_log()
print(directed_info_log)

G = dnc.G

Number of nodes: 36975
Number of edges: 336391
{'num_edges': 336391, 'num_nodes': 36975, 'nr_isolated_nodes': 2044}


In [6]:
# get node attributes
list(G.nodes(data=True))[0]

('Hyttel_1982',
 {'eid': '2-s2.0-0020416931', 'doi': '10.1016/S0278-5846(82)80179-6'})

In [7]:
with open("../output/descriptive-stats-logs/directed_network_info_log.json", "w") as f:
    json.dump(directed_info_log, f)

# write graphml
nx.write_graphml(G, "../data/05-graphs/citation-graph/directed_citation_graph.graphml")