# Creating the Network

## Overview

This notebook creates two types of networks from the bibliometric data:

1. A weighted undirected network that combines citation relationships with semantic similarity
2. A directed unweighted network representing pure citation relationships

## Weighted Undirected Network Creation Process

### Input Data

- DataFrame containing paper information including:
  - Text embeddings (SPECTER2)
  - Citation relationships
  - Metadata (year, title, etc.)

### Network Construction Steps

1. **Semantic Similarity Calculation**

   - Use k-nearest neighbors (kNN) algorithm to find similar papers
   - Calculate cosine similarity between paper embeddings
   - For each paper, find k most similar papers (k ∈ {5, 10, 15, 20})

2. **Edge Weight Calculation**

   - Combine two types of relationships:
     - Semantic similarity: weight = cosine_similarity \* (1 - α)
     - Citation relationship: weight = 1 \* α
   - α ∈ {0.3, 0.5} controls the balance between semantic and citation relationships
     - With α = 0.3, a citation contributes a weight of 0.3, while a semantic link contributes at most 0.7 (at a cosine similarity of 1)

3. **Matrix Symmetrization**

   - Make the adjacency matrix symmetric by:
     - Averaging weights when both directions exist
     - Using the non-zero weight when one direction is zero
     - Setting zero when both directions are zero
   - Keep only the upper triangular part to avoid duplicate edges

4. **Network Creation**
   - Create an undirected weighted network from the symmetric matrix
   - Add paper metadata as node attributes
   - Save network statistics and visualization
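
Steps 1 to 3 above can be sketched on a toy example. This is a minimal illustration, not the `WeightedNetworkCreator` implementation: the function names (`knn_cosine`, `combined_weights`, `symmetrize_upper`) are hypothetical, the dense similarity matrix only suits small inputs, and adding the citation term on top of an existing similarity weight is an assumption about how the two relationship types are combined.

```python
import numpy as np


def knn_cosine(embeddings, k):
    """Return (similarities, indices) of the k most similar rows for each row."""
    # Normalise rows so that a dot product equals the cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # a paper is never its own neighbour
    indices = np.argsort(-sims, axis=1)[:, :k]
    similarities = np.take_along_axis(sims, indices, axis=1)
    return similarities, indices


def combined_weights(similarities, indices, citations, alpha):
    """Build a directed weight matrix from kNN similarities and citation pairs."""
    n, k = indices.shape
    weights = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    # Semantic part: cosine_similarity * (1 - alpha)
    weights[rows, indices.ravel()] = similarities.ravel() * (1 - alpha)
    # Citation part: 1 * alpha, added to any semantic weight (assumption)
    for citing, cited in citations:
        weights[citing, cited] += alpha
    return weights


def symmetrize_upper(weights):
    """Average two-way weights, keep one-way weights, return the upper triangle."""
    total = weights + weights.T
    both = (weights != 0) & (weights.T != 0)
    total[both] /= 2
    return np.triu(total, k=1)


# Toy data: three unit-length "embeddings" and two citations (citing, cited).
embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
citations = [(2, 0), (1, 0)]

similarities, indices = knn_cosine(embeddings, k=1)
weights = combined_weights(similarities, indices, citations, alpha=0.5)
upper = symmetrize_upper(weights)
print(upper)
```

With α = 0.5, papers 0 and 1 are linked in both directions (0.4 and 0.9), so their edge receives the average 0.65; the other two pairs keep their single non-zero weight (0.5 and 0.3). A matrix like `upper` can then be turned into an undirected weighted graph, for example with `networkx.from_numpy_array`.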

### Output

- GraphML file containing the weighted network
- Statistics log with:
  - Number of nodes and edges
  - Percentage of edges from citations vs. semantic similarity
  - Network density and other metrics
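
As a rough illustration of what such a statistics log could contain, the sketch below computes a few of these values for a toy graph with `networkx`. The keys `num_nodes` and `num_edges` follow the info log used later in this notebook; `density`, `mean_weight`, and `max_weight` are hypothetical key names chosen for the example.

```python
import json
import statistics

import networkx as nx

# Toy undirected weighted graph; node 3 has no edges.
G = nx.Graph()
G.add_nodes_from([0, 1, 2, 3])
G.add_weighted_edges_from([(0, 1, 0.65), (0, 2, 0.5), (1, 2, 0.3)])

weights = [data["weight"] for _, _, data in G.edges(data=True)]

stats = {
    "num_nodes": G.number_of_nodes(),
    "num_edges": G.number_of_edges(),
    # For an undirected graph: 2m / (n * (n - 1))
    "density": nx.density(G),
    "mean_weight": statistics.mean(weights),
    "max_weight": max(weights),
}

# Serialise the log in the same way the notebook writes its JSON logs.
stats_json = json.dumps(stats, indent=2)
print(stats_json)
```

Here the density is 2 · 3 / (4 · 3) = 0.5, since three of the six possible edges among four nodes are present.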

## Directed Citation Network Creation Process

### Input Data

- Same DataFrame as above, focusing on citation relationships

### Network Construction Steps

1. **Node Creation**

   - Create nodes for each paper using unique identifiers
   - Add paper metadata as node attributes

2. **Edge Creation**

   - Create directed edges based on citation relationships
   - Edge direction: citing paper → cited paper
   - No weights assigned (unweighted network)

3. **Network Analysis**
   - Calculate basic network statistics
   - Identify isolated nodes
   - Generate network metrics
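
The steps above can be sketched with plain `networkx` calls. This is not the `DirectedNetworkCreator` code: the helper `build_citation_graph` and the toy records are hypothetical, and using `unique_auth_year` as the node identifier and `filtered_reference_eids` as the citation source is an assumption based on the columns and node output shown in this notebook.

```python
import networkx as nx

# Toy records standing in for DataFrame rows (hypothetical values).
papers = [
    {"unique_auth_year": "A_2000", "eid": "e1", "year": 2000,
     "title": "Paper A", "filtered_reference_eids": []},
    {"unique_auth_year": "B_2005", "eid": "e2", "year": 2005,
     "title": "Paper B", "filtered_reference_eids": ["e1"]},
    {"unique_auth_year": "C_2010", "eid": "e3", "year": 2010,
     "title": "Paper C", "filtered_reference_eids": ["e1", "e2", "e9"]},
    {"unique_auth_year": "D_2011", "eid": "e4", "year": 2011,
     "title": "Paper D", "filtered_reference_eids": []},
]


def build_citation_graph(papers):
    """Create a directed, unweighted graph with edges citing -> cited."""
    eid_to_node = {p["eid"]: p["unique_auth_year"] for p in papers}
    G = nx.DiGraph()
    # 1. One node per paper, with metadata as node attributes.
    for p in papers:
        G.add_node(p["unique_auth_year"], eid=p["eid"],
                   year=p["year"], title=p["title"])
    # 2. One edge per citation; references outside the dataset are skipped.
    for p in papers:
        for ref in p["filtered_reference_eids"]:
            if ref in eid_to_node:
                G.add_edge(p["unique_auth_year"], eid_to_node[ref])
    return G


G = build_citation_graph(papers)

# 3. Basic statistics, including nodes without any incoming or outgoing edge.
isolated = list(nx.isolates(G))
print(G.number_of_nodes(), G.number_of_edges(), isolated)
```

The reference `e9` does not belong to any paper in the toy data, so it produces no edge, and `D_2011` neither cites nor is cited, which makes it the only isolated node.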

### Output

- GraphML file containing the directed citation network
- Statistics log with:
  - Number of nodes and edges
  - Number of isolated nodes
  - Network structure metrics
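
GraphML stores node attributes together with the edges, so a saved graph can be reloaded without the original DataFrame. The round trip below is a small sketch that writes to an in-memory buffer rather than the real output directory; the node names and attribute values are made up for the example.

```python
import io

import networkx as nx

# Tiny directed graph with metadata on one node (hypothetical values).
G = nx.DiGraph()
G.add_node("A_2000", eid="e1", year=2000, title="Paper A")
G.add_edge("B_2005", "A_2000")  # B cites A

# Write GraphML to a binary buffer, then read it back.
buffer = io.BytesIO()
nx.write_graphml(G, buffer)
buffer.seek(0)
H = nx.read_graphml(buffer)

print(H.is_directed(), dict(H.nodes["A_2000"]))
```

The reloaded graph keeps its direction and the typed node attributes, which is what downstream notebooks rely on when they read the `.graphml` files.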

## File Structure

- Input: `../data/04-embeddings/2025/df_with_specter2_embeddings.pkl`
- Output:
  - Weighted networks: `../data/05-graphs/2025/weighted-knn-citation-graph/`
  - Directed network: `../data/05-graphs/2025/citation-graph/`
  - Statistics logs: `../output/descriptive-stats-logs/`


In [18]:
import json
import os
from datetime import datetime

import networkx as nx
import numpy as np
import pandas as pd
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Access environment variables
python_path = os.getenv('PYTHONPATH')
data_dir = os.getenv('DATA_DIR')
src_dir = os.getenv('SRC_DIR')
output_dir = os.getenv('OUTPUT_DIR')

# Project-local network builders
from src.network.creation.NetworkCreator import (
    DirectedNetworkCreator,
    WeightedNetworkCreator,
)


In [19]:
p = data_dir + "/04-embeddings/2025/df_with_specter2_embeddings.pkl"
df = pd.read_pickle(p)
df.head(5)


Unnamed: 0,eid,title,date,first_author,abstract,doi,year,auth_year,unique_auth_year,pubmed_id,...,fund_sponsor,article_number,reference_eids,nr_references,filtered_reference_eids,nr_filtered_references,title_abstract,clean_title,clean_abstract,specter2_embeddings
0,2-s2.0-0020425640,Kinetics of citalopram in man; plasma levels i...,1982-01-01,Overø K.,Abstract1.Citalopram is rapidly absorbed and s...,10.1016/S0278-5846(82)80181-4,1982,Overo_1982,Overo_1982_2,6959195.0,...,,,"[2-s2.0-0019179180, 2-s2.0-0020047901, 2-s2.0-...",18,"[2-s2.0-0020431499, 2-s2.0-0020431887, 2-s2.0-...",5,Kinetics of citalopram in man; plasma levels i...,Kinetics of citalopram in man; plasma levels i...,1.Citalopram is rapidly absorbed and slowly el...,"[0.01807364, 0.45536643, -0.6515877, 0.2952828..."
1,2-s2.0-0019951467,Nonpurinergic nature and efficacy of nonadrene...,1982-01-01,Irvin C.G.,The nonadrenergic inhibition of airway smooth ...,,1982,Irvin_1982,Irvin_1982,6121785.0,...,,,,0,[],0,Nonpurinergic nature and efficacy of nonadrene...,Nonpurinergic nature and efficacy of nonadrene...,The nonadrenergic inhibition of airway smooth ...,"[0.2421042, 0.35010815, -0.4101752, 0.56234515..."
2,2-s2.0-0020059465,ANDROGEN‐INDUCED SEXUAL DIMORPHISM IN HIGH AFF...,1982-01-01,JALILIAN‐TEHRANI M.H.,High affinity binding of [3H]‐dopamine and [3H...,10.1111/j.1476-5381.1982.tb08755.x,1982,JalilianTehrani_1982,JalilianTehrani_1982,7074286.0,...,,,"[2-s2.0-0017191716, 2-s2.0-0018835903, 2-s2.0-...",54,[],0,ANDROGENINDUCED SEXUAL DIMORPHISM IN HIGH AFFI...,ANDROGENINDUCED SEXUAL DIMORPHISM IN HIGH AFFI...,High affinity binding of dopamine and 5hydroxy...,"[0.20964071, 1.1354221, -0.08355975, -0.338157..."
3,2-s2.0-0019961783,On the prolactin-inhibiting effect of neuroten...,1982-01-01,Koenig J.,Neurotensin (NT) when injected in a dose of 5 ...,10.1159/000123394,1982,Koenig_1982,Koenig_1982,6983042.0,...,Eunice Kennedy Shriver National Institute of C...,,"[2-s2.0-0001720457, 2-s2.0-0014768366, 2-s2.0-...",22,[],0,On the prolactin-inhibiting effect of neuroten...,On the prolactin-inhibiting effect of neuroten...,Neurotensin (NT) when injected in a dose of 5 ...,"[0.14462437, 0.82366514, -0.36023885, 0.486612..."
4,2-s2.0-0019992213,Effect of amezinium on the release and catabol...,1982-07-15,Steppeler A.,Occipitocortical slices of rats were preincuba...,10.1016/0006-2952(82)90535-4,1982,Steppeler_1982,Steppeler_1982,7126251.0,...,,,"[2-s2.0-0019501271, 2-s2.0-0019487332, 2-s2.0-...",20,[],0,Effect of amezinium on the release and catabol...,Effect of amezinium on the release and catabol...,Occipitocortical slices of rats were preincuba...,"[-0.2999696, 1.100803, -0.7959455, 0.4357793, ..."


In [20]:
# Create a dictionary to store all graphs
graphs_dict = {}

# Get current date for file naming
current_date = datetime.now().strftime("%Y%m%d")

alphas = [0.3, 0.5]
ks = [5, 10, 15, 20]  # named `ks` so the loop variable `k` below does not overwrite the list

# Get all unique combinations of alpha and k
alpha_k_combinations = [(a, b) for a in alphas for b in ks]

print(
    f"Creating weighted knn citation graph with {len(alpha_k_combinations)} combinations"
)

for alpha, k in alpha_k_combinations:
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(
        f"[{timestamp}] Creating weighted knn citation graph with alpha={alpha} and k={k}"
    )

    wnc = WeightedNetworkCreator(df, alpha=alpha)
    similarities, indices = wnc.get_nearest_neighbours(k=k)
    knn_matrix = wnc.link_matrix_knn(similarities, indices)
    knn_citation_matrix = wnc.link_matrix_citations(knn_matrix)
    symmetric_matrix = wnc.make_matrix_symmetric(knn_citation_matrix)
    Gtemp = wnc.create_network(symmetric_matrix)
    Gpretty = wnc.prettify_network(
        Gtemp,
        col_list=["year", "title", "eid"],
    )

    # Calculate weight statistics
    weights = [d['weight'] for _, _, d in Gpretty.edges(data=True)]
    weight_stats = {
        "mean_weight": np.mean(weights),
        "std_weight": np.std(weights),
        "min_weight": np.min(weights),
        "max_weight": np.max(weights),
        "median_weight": np.median(weights),
        "q1_weight": np.percentile(weights, 25),
        "q3_weight": np.percentile(weights, 75)
    }

    # Store the graph in the dictionary with a descriptive key
    key = f"alpha{alpha}_k{k}"
    graphs_dict[key] = {
        "graph": Gpretty,
        "info_log": wnc.get_info_log(),
        "alpha": alpha,
        "k": k,
        "timestamp": timestamp,
        "weight_stats": weight_stats
    }

    weighted_info_log = wnc.get_info_log()

    # Add date to the info log
    weighted_info_log["creation_date"] = current_date
    weighted_info_log["creation_timestamp"] = timestamp

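    # Write log with alpha, k and date in the filename so that
    # each combination gets its own file instead of overwriting the previous one
    log_filename = f"weighted_network_info_log_alpha{alpha}_k{k}_{current_date}.json"
    with open(
        output_dir + "/descriptive-stats-logs/" + log_filename, "w"
    ) as f:
        json.dump(weighted_info_log, f)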

    # Write graphml with date in filename
    graph_filename = f"weighted_alpha{alpha}_k{k}_knn_citation_{current_date}.graphml"
    nx.write_graphml(
        Gpretty,
        data_dir + "/05-graphs/2025/weighted-knn-citation-graph/" + graph_filename,
    )

    timestamp_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp_end}] Finished processing alpha={alpha}, k={k}")
    print("#" * 30)

# Print summary of created graphs
print("\nCreated graphs summary:")
for key, data in graphs_dict.items():
    print(f"\n{key}:")
    print(f"  Number of nodes: {data['info_log']['num_nodes']}")
    print(f"  Number of edges: {data['info_log']['num_edges']}")
    print(f"  Percentage of edges from citations: {data['info_log']['num_edges_from_citations'] / data['info_log']['num_edges'] * 100:.2f}%")
    print(f"  Percentage of edges from KNN: {data['info_log']['num_edges_from_knn'] / data['info_log']['num_edges'] * 100:.2f}%")
    print(f"  Created at: {data['timestamp']}")
    print("\n  Edge Weight Statistics:")
    print(f"    Mean: {data['weight_stats']['mean_weight']:.4f}")
    print(f"    Std: {data['weight_stats']['std_weight']:.4f}")
    print(f"    Min: {data['weight_stats']['min_weight']:.4f}")
    print(f"    Max: {data['weight_stats']['max_weight']:.4f}")
    print(f"    Median: {data['weight_stats']['median_weight']:.4f}")
    print(f"    Q1: {data['weight_stats']['q1_weight']:.4f}")
    print(f"    Q3: {data['weight_stats']['q3_weight']:.4f}")


Creating weighted knn citation graph with 8 combinations
[2025-04-01 20:45:50] Creating weighted knn citation graph with alpha=0.3 and k=5
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 355263
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 38961
Number of edges: 449283
Percentage of edges from citations: 79.07%
Percentage of edges from KNN: 20.93%
[2025-04-01 20:48:22] Finished processing alpha=0.3, k=5
##############################
[2025-04-01 20:48:22] Creating weighted knn citation graph with alpha=0.3 and k=10
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 355263
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 38961
Number of edges: 584576
Percentage of edges from citations: 60.77%
Percent

In [21]:
list(Gpretty.edges(data=True))[0]


(0, 33, {'weight': 0.9682894349098206})

In [22]:
list(Gpretty.nodes(data=True))[0]


(0,
 {'year': np.int32(1982),
  'title': 'Kinetics of citalopram in man; plasma levels in patients',
  'eid': '2-s2.0-0020425640'})

# Directed Citation Graph


In [23]:
# Create a directed network from the dataframe
dnc = DirectedNetworkCreator(df, data_to_add=["eid", "year", "title"])
dnc.build_graph()
print(dnc.get_graph_info())
directed_info_log = dnc.get_info_log()
print(directed_info_log)

G = dnc.G


Number of nodes: 38961
Number of edges: 355263
{'num_edges': 355263, 'num_nodes': 38961, 'nr_isolated_nodes': 2150}


In [24]:
# get node attributes
list(G.nodes(data=True))[0]


('Overo_1982_2',
 {'eid': '2-s2.0-0020425640',
  'year': 1982,
  'title': 'Kinetics of citalopram in man; plasma levels in patients'})

In [25]:
with open(output_dir + "/descriptive-stats-logs/directed_network_info_log.json", "w") as f:
    json.dump(directed_info_log, f)

# write graphml
nx.write_graphml(G, data_dir + "/05-graphs/2025/citation-graph/directed_citation_graph.graphml")
