# Creating the Network

### Undirected Weighted Network

- read in the dataframe with text embeddings
- use the k-nearest neighbours algorithm to find the k nearest neighbours for each node (the code below runs k = 5, 10, 15 and 20), based on cosine distance
- create a matrix of the edges between nodes, where the weight of the edge is the cosine similarity \* (1 - alpha)
- create a matrix of the edges between nodes, where a weight of 1 \* alpha is assigned if a citation exists between the two nodes
- to make the matrix symmetric, we
  - take the average of ij and ji if both are non-zero
  - use ij in cases where ji is zero (and vice versa)
  - set the value to zero (no edge) if ij and ji are both zero
  - then use the upper triangle of the matrix to create the final adjacency matrix
- we then create a network based on this adjacency matrix
- we add data from the original dataframe to the network (title, year, etc.)
- we save the network as a GraphML file and the info log to the output folder
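
The weighting and symmetrisation steps above can be sketched on a toy example. This is a minimal illustration, not the code inside `WeightedNetworkCreator`: the function names `combine_weights` and `symmetrize` are hypothetical, adding the similarity and citation parts together is an assumption (the list above does not say how the two matrices are merged), and dropping the diagonal when taking the upper triangle is also an assumption.

```python
import numpy as np


def combine_weights(similarity, citations, alpha):
    """Weight similarity edges by (1 - alpha) and citation edges by alpha.

    Assumption: the two parts are summed where both exist.
    """
    similarity = np.asarray(similarity, dtype=float)
    citations = np.asarray(citations, dtype=float)  # 1 where i cites j, else 0
    return (1 - alpha) * similarity + alpha * citations


def symmetrize(W):
    """Average ij and ji when both are non-zero, otherwise keep the non-zero one."""
    W = np.asarray(W, dtype=float)
    both = (W != 0) & (W.T != 0)
    S = W + W.T        # equals the single non-zero value when only one side exists
    S[both] /= 2.0     # average where both directions carry a weight
    # Keep the upper triangle only; k=1 drops self-loops (an assumption).
    return np.triu(S, k=1)


# Toy directed weight matrix for three nodes
W_toy = np.array(
    [
        [0.0, 0.4, 0.0],
        [0.2, 0.0, 0.5],
        [0.0, 0.0, 0.0],
    ]
)
A_toy = symmetrize(W_toy)
```

Here the pair (0, 1) has weights in both directions, so it receives their average, while the pair (1, 2) keeps its single non-zero weight.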

### Directed Unweighted Network

- create an edgelist from the dataframe
- create a directed network based on this edgelist
- add data from the original dataframe to the network (title, year, etc.)
- save the network as a GraphML file and the info log to the output folder
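
The directed steps can be sketched the same way with plain pandas and networkx calls. This is an illustration under stated assumptions, not the code inside `DirectedNetworkCreator`: the toy dataframe, the column names `id` and `cites`, and the variable names are all hypothetical.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy data: "id" and "cites" are assumed column names
papers = pd.DataFrame(
    {
        "id": ["A_2001", "B_2005", "C_2010"],
        "year": [2001, 2005, 2010],
        "title": ["Paper A", "Paper B", "Paper C"],
        "cites": [[], ["A_2001"], []],
    }
)

# One row per (citing, cited) pair; papers with an empty list drop out here
edgelist = (
    papers[["id", "cites"]]
    .explode("cites")
    .dropna(subset=["cites"])
    .rename(columns={"id": "source", "cites": "target"})
)

G_toy = nx.from_pandas_edgelist(
    edgelist, "source", "target", create_using=nx.DiGraph
)
G_toy.add_nodes_from(papers["id"])  # keep papers that have no citation link

# Copy selected dataframe columns onto the nodes
for col in ["year", "title"]:
    nx.set_node_attributes(G_toy, papers.set_index("id")[col].to_dict(), name=col)

# Nodes with neither incoming nor outgoing edges
isolated = list(nx.isolates(G_toy))
```

Adding all ids as nodes after building the graph from the edgelist keeps papers without any citation link in the network, so they can be counted as isolated nodes.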


In [1]:
import pandas as pd
import sys

sys.path.append("/Users/jlq293/Projects/Study-1-Bibliometrics/")
from src.network.creation.NetworkCreator import (
    WeightedNetworkCreator,
    DirectedNetworkCreator,
)
import json
import networkx as nx
from datetime import datetime


p = "../data/04-embeddings/df_with_specter2_embeddings.pkl"
df = pd.read_pickle(p)

In [2]:
alphas = [0.3, 0.5]
# named `ks` so the loop variable `k` below does not overwrite the list
ks = [5, 10, 15, 20]

# get all unique combinations of alpha and k
alpha_k_combinations = [(a, b) for a in alphas for b in ks]

print(
    f"Creating weighted knn citation graph with {len(alpha_k_combinations)} combinations"
)

for alpha, k in alpha_k_combinations:
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(
        f"[{timestamp}] Creating weighted knn citation graph with alpha={alpha} and k={k}"
    )

    wnc = WeightedNetworkCreator(df, alpha=alpha)
    similarities, indices = wnc.get_nearest_neighbours(k=k)
    knn_matrix = wnc.link_matrix_knn(similarities, indices)
    knn_citation_matrix = wnc.link_matrix_citations(knn_matrix)
    symmetric_matrix = wnc.make_matrix_symmetric(knn_citation_matrix)
    Gtemp = wnc.create_network(symmetric_matrix)
    Gpretty = wnc.prettify_network(
        Gtemp,
        col_list=["year", "title", "eid"],
    )
    weighted_info_log = wnc.get_info_log()

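    # NOTE: the path is the same on every iteration, so the file ends up
    # holding only the log of the last (alpha, k) combination
    with open(
        "../output/descriptive-stats-logs/weighted_network_info_log.json", "w"
    ) as f:
        json.dump(weighted_info_log, f)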

    # write graphml
    nx.write_graphml(
        Gpretty,
        f"../data/05-graphs/weighted-knn-citation-graph/weighted_alpha{alpha}_k{k}_knn_citation.graphml",
    )

    timestamp_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp_end}] Finished processing alpha={alpha}, k={k}")
    print("#" * 30)

Creating weighted knn citation graph with 8 combinations
[2024-10-24 13:50:03] Creating weighted knn citation graph with alpha=0.3 and k=5
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 360645
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 40643
Number of edges: 460489
Percentage of edges from citations: 78.32%
Percentage of edges from KNN: 21.68%
[2024-10-24 13:52:14] Finished processing alpha=0.3, k=5
##############################
[2024-10-24 13:52:14] Creating weighted knn citation graph with alpha=0.3 and k=10
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 360645
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 40643
Number of edges: 602780
Percentage of edges from citations: 59.83%
Percent

In [5]:
list(Gpretty.edges(data=True))[0]

(0, 20, {'weight': 0.45390310883522034})

In [4]:
list(Gpretty.nodes(data=True))[0]

(0,
 {'year': 1982,
  'title': 'Serotonergic mechanism in the control of β-endorphin and acth release in male rats',
  'eid': '2-s2.0-0020316326'})

# Directed Citation Graph


In [9]:
# Create a directed network from the dataframe
dnc = DirectedNetworkCreator(df, data_to_add=["eid", "year", "title"])
dnc.build_graph()
print(dnc.get_graph_info())
directed_info_log = dnc.get_info_log()
print(directed_info_log)

G = dnc.G

Number of nodes: 40643
Number of edges: 360645
{'num_edges': 360645, 'num_nodes': 40643, 'nr_isolated_nodes': 2776}


In [10]:
# get node attributes
list(G.nodes(data=True))[0]

('Bruni_1982',
 {'eid': '2-s2.0-0020316326',
  'year': 1982,
  'title': 'Serotonergic mechanism in the control of β-endorphin and acth release in male rats'})

In [11]:
with open("../output/descriptive-stats-logs/directed_network_info_log.json", "w") as f:
    json.dump(directed_info_log, f)

# write graphml
nx.write_graphml(G, "../data/05-graphs/citation-graph/directed_citation_graph.graphml")