# Creating the Network

## Overview
This notebook creates two types of networks from the bibliometric data:
1. A weighted undirected network that combines citation relationships with semantic similarity
2. A directed unweighted network representing pure citation relationships

## Weighted Undirected Network Creation Process

### Input Data
- DataFrame containing paper information including:
  - Text embeddings (SPECTER2)
  - Citation relationships
  - Metadata (year, title, etc.)

### Network Construction Steps
1. **Semantic Similarity Calculation**
   - Measure similarity between papers as the cosine similarity of their embeddings
   - Use a k-nearest neighbors (kNN) search to find, for each paper, its k most similar papers
   - Repeat for each k ∈ {5, 10, 15, 20}

2. **Edge Weight Calculation**
   - Combine two types of relationships:
     - Semantic similarity: weight = cosine_similarity * (1 - α)
     - Citation relationship: weight = 1 * α
   - α ∈ {0.3, 0.5} controls the balance between semantic and citation relationships
     - For example, α = 0.3 gives every citation a weight of 0.3, while a semantic link receives at most 0.7 (its cosine similarity × 0.7)

3. **Matrix Symmetrization**
   - Make the adjacency matrix symmetric by:
     - Averaging weights when both directions exist
     - Using the non-zero weight when one direction is zero
     - Setting zero when both directions are zero
   - Keep only the upper triangular part to avoid duplicate edges

4. **Network Creation**
   - Create an undirected weighted network from the symmetric matrix
   - Add paper metadata as node attributes
   - Save network statistics and visualization
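
Steps 1 to 3 can be sketched with plain NumPy on a toy example. This is a minimal illustration under stated assumptions, not the code inside `WeightedNetworkCreator`: the helper names (`knn_cosine`, `combine_weights`, `symmetrize`) are hypothetical, the kNN search is brute force, and a pair that is both a kNN neighbour and a citation is assumed to receive the sum of both contributions, since the description above does not say how the two are merged.

```python
import numpy as np


def knn_cosine(embeddings, k):
    """Brute-force kNN on cosine similarity; returns (similarities, indices)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a paper is not its own neighbour
    indices = np.argsort(-sim, axis=1)[:, :k]
    similarities = np.take_along_axis(sim, indices, axis=1)
    return similarities, indices


def combine_weights(similarities, indices, citations, alpha):
    """Directed weight matrix: kNN links scaled by (1 - alpha), citations by alpha.

    Assumption: contributions are summed when a pair has both link types.
    """
    n = similarities.shape[0]
    weights = np.zeros((n, n))
    rows = np.repeat(np.arange(n), indices.shape[1])
    weights[rows, indices.ravel()] = similarities.ravel() * (1 - alpha)
    for citing, cited in citations:
        weights[citing, cited] += alpha
    return weights


def symmetrize(weights):
    """Average when both directions are non-zero, otherwise keep the non-zero one."""
    both = (weights != 0) & (weights.T != 0)
    sym = np.where(both, (weights + weights.T) / 2, weights + weights.T)
    return np.triu(sym, k=1)  # upper triangle only, so each edge appears once


# Toy data: four papers with 2-d "embeddings" and three citations
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
citations = [(1, 0), (3, 2), (2, 0)]  # (citing, cited) as row positions

similarities, indices = knn_cosine(embeddings, k=1)
directed = combine_weights(similarities, indices, citations, alpha=0.3)
undirected = symmetrize(directed)
print(np.round(undirected, 3))
```

In this toy case the pair (0, 2) is linked only by a citation, so its weight is exactly α = 0.3, while pairs (0, 1) and (2, 3) have weights in both directions and are averaged.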

### Output
- GraphML file containing the weighted network
- Statistics log with:
  - Number of nodes and edges
  - Percentage of edges from citations vs. semantic similarity
  - Network density and other metrics
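
As a rough illustration of the logged metrics, the sketch below computes the density of an undirected graph and the share of edges from one source. The function names are hypothetical, and the counts are copied from the α = 0.3, k = 5 run logged later in this notebook; the density formula 2E / (N(N − 1)) is the standard definition for simple undirected graphs, not necessarily the exact code path used by the project.

```python
def undirected_density(num_nodes, num_edges):
    """Share of all possible undirected node pairs that are connected."""
    possible_pairs = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible_pairs


def edge_share(part, total):
    """Percentage of edges contributed by one source."""
    return part / total * 100


# Counts taken from the alpha=0.3, k=5 log output in this notebook
num_nodes, num_edges, num_citation_edges = 40643, 460488, 360645

density = undirected_density(num_nodes, num_edges)
citation_pct = edge_share(num_citation_edges, num_edges)
print(f"density={density:.6f}, citations={citation_pct:.2f}%")
```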

## Directed Citation Network Creation Process

### Input Data
- Same DataFrame as above, focusing on citation relationships

### Network Construction Steps
1. **Node Creation**
   - Create nodes for each paper using unique identifiers
   - Add paper metadata as node attributes

2. **Edge Creation**
   - Create directed edges based on citation relationships
   - Edge direction: citing paper → cited paper
   - No weights assigned (unweighted network)

3. **Network Analysis**
   - Calculate basic network statistics
   - Identify isolated nodes
   - Generate network metrics
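
The three steps can be sketched with `networkx` on toy records. This is an assumption-laden illustration rather than the `DirectedNetworkCreator` implementation: the records are invented, nodes are keyed by `eid` for simplicity, and `filtered_reference_eids` is assumed to list only references that may point to papers inside the dataset.

```python
import networkx as nx

# Toy records; the field names mirror DataFrame columns shown in this notebook
papers = [
    {"eid": "A", "year": 2001, "title": "Paper A", "filtered_reference_eids": []},
    {"eid": "B", "year": 2005, "title": "Paper B", "filtered_reference_eids": ["A"]},
    {"eid": "C", "year": 2010, "title": "Paper C", "filtered_reference_eids": ["A", "B"]},
    {"eid": "D", "year": 2012, "title": "Paper D", "filtered_reference_eids": []},
]

toy_graph = nx.DiGraph()

# Step 1: one node per paper, with metadata as node attributes
for paper in papers:
    toy_graph.add_node(paper["eid"], year=paper["year"], title=paper["title"])

# Step 2: unweighted edges pointing from the citing paper to the cited paper
known = set(toy_graph.nodes)
for paper in papers:
    for cited in paper["filtered_reference_eids"]:
        if cited in known:  # skip references to papers outside the dataset
            toy_graph.add_edge(paper["eid"], cited)

# Step 3: basic statistics, including nodes with no incoming or outgoing edges
isolated = list(nx.isolates(toy_graph))
print(toy_graph.number_of_nodes(), toy_graph.number_of_edges(), isolated)
```

Paper D neither cites nor is cited, so it is the only isolated node in this toy graph.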

### Output
- GraphML file containing the directed citation network
- Statistics log with:
  - Number of nodes and edges
  - Number of isolated nodes
  - Network structure metrics

## File Structure
- Input: `../data/04-embeddings/df_with_specter2_embeddings.pkl`
- Output:
  - Weighted networks: `../data/05-graphs/weighted-knn-citation-graph/`
  - Directed network: `../data/05-graphs/citation-graph/`
  - Statistics logs: `../output/descriptive-stats-logs/`

In [4]:
import json
import sys
from datetime import datetime

import networkx as nx
import numpy as np
import pandas as pd

# Make the project's `src` package importable (machine-specific absolute path)
sys.path.append("/Users/jlq293/Projects/Study-1-Bibliometrics/")
from src.network.creation.NetworkCreator import (
    WeightedNetworkCreator,
    DirectedNetworkCreator,
)

# Load the DataFrame that contains the SPECTER2 embeddings
p = "../data/04-embeddings/df_with_specter2_embeddings.pkl"
df = pd.read_pickle(p)

In [2]:
df

Unnamed: 0,eid,title,date,first_author,abstract,doi,year,auth_year,unique_auth_year,pubmed_id,...,fund_sponsor,article_number,reference_eids,nr_references,filtered_reference_eids,nr_filtered_references,title_abstract,clean_title,clean_abstract,specter2_embeddings
0,2-s2.0-0020316326,Serotonergic mechanism in the control of β-end...,1982-04-12,Bruni J.F.,The role of the serotonergic mechanism in the ...,10.1016/0024-3205(82)90686-5,1982,Bruni_1982,Bruni_1982,6283286.0,...,National Institutes of Health,,"[2-s2.0-0016795422, 2-s2.0-0000011578, 2-s2.0-...",46,[],0,Serotonergic mechanism in the control of β-end...,Serotonergic mechanism in the control of β-end...,The role of the serotonergic mechanism in the ...,"[-0.38758993, 0.8743463, -0.52714413, 0.029653..."
1,2-s2.0-0019936013,EFFECTS OF PAROXETINE ON SYNAPTOSOMAL NEUROTRA...,1982-01-01,Magnussen I.,,10.1111/j.1600-0404.1982.tb03382.x,1982,Magnussen_1982,Magnussen_1982_4,,...,,,[],0,[],0,EFFECTS OF PAROXETINE ON SYNAPTOSOMAL NEUROTRA...,EFFECTS OF PAROXETINE ON SYNAPTOSOMAL NEUROTRA...,,"[0.329068, 0.23448052, -0.6597941, 0.13635367,..."
2,2-s2.0-0020058010,Treatment of intention myoclonus with paroxeti...,1982-01-01,Magnussen I.,,,1982,Magnussen_1982,Magnussen_1982_3,,...,,,[],0,[],0,Treatment of intention myoclonus with paroxeti...,Treatment of intention myoclonus with paroxeti...,,"[0.406605, 1.0992043, -0.60125256, 0.73224956,..."
3,2-s2.0-0020446870,"Paroxetine, a potent selective long-acting inh...",1982-09-01,Magnussen I.,The high-affinity uptake of tritium labelled t...,10.1007/BF01276577,1982,Magnussen_1982,Magnussen_1982_2,,...,,,"[2-s2.0-0017144720, 2-s2.0-0018872854, 2-s2.0-...",14,[2-s2.0-0019996341],1,Paroxetine a potent selective long-acting inhi...,Paroxetine a potent selective long-acting inhi...,The high-affinity uptake of tritium labelled t...,"[0.14719126, 0.53084964, -0.752622, 0.29364386..."
4,2-s2.0-0019996341,Treatment of myoclonic syndromes with paroxeti...,1982-01-01,Magnussen I.,Paroxetine is a specific presynaptic 5‐hydroxy...,10.1111/j.1600-0404.1982.tb04525.x,1982,Magnussen_1982,Magnussen_1982,6215817.0,...,,,"[2-s2.0-0017883259, 2-s2.0-0017874037, 2-s2.0-...",13,[],0,Treatment of myoclonic syndromes with paroxeti...,Treatment of myoclonic syndromes with paroxeti...,Paroxetine is a specific presynaptic 5hydroxyt...,"[0.11351775, 1.1247323, -0.72639483, 0.6222344..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40638,2-s2.0-85174642163,Repeated seizures lead to progressive ventilat...,2023-10-01,Manis A.D.,Patients with uncontrolled epilepsy experience...,10.1152/JAPPLPHYSIOL.00158.2023,2023,Manis_2023,Manis_2023,37535709.0,...,National Institutes of Health,,"[2-s2.0-85064111657, 2-s2.0-0026360052, 2-s2.0...",66,"[2-s2.0-33645018161, 2-s2.0-84900866360, 2-s2....",4,Repeated seizures lead to progressive ventilat...,Repeated seizures lead to progressive ventilat...,Patients with uncontrolled epilepsy experience...,"[0.2146723, -0.2686365, -0.042400524, 0.058695..."
40639,2-s2.0-85152263916,Response to the comments on “A Prospective Obs...,2023-05-01,Mandal S.,,10.1177/02537176231164752,2023,Mandal_2023,Mandal_2023_2,,...,,,[2-s2.0-85132849740],1,[2-s2.0-85132849740],1,Response to the comments on A Prospective Obse...,Response to the comments on A Prospective Obse...,,"[0.17586017, 0.5648126, -0.934148, 0.14409506,..."
40640,2-s2.0-85132849740,A Prospective Observational Study on Changes i...,2023-01-01,Mandal S.,Background: Depression has emerged as one of t...,10.1177/02537176221101487,2023,Mandal_2023,Mandal_2023,,...,,,"[2-s2.0-85092481765, 2-s2.0-0014011252, 2-s2.0...",21,"[2-s2.0-85020198554, 2-s2.0-33744790628, 2-s2....",8,A Prospective Observational Study on Changes i...,A Prospective Observational Study on Changes i...,Depression has emerged as one of the prime mor...,"[0.22507422, 0.30938473, -0.5567989, 0.2439473..."
40641,2-s2.0-85174424908,Photoelectrocatalytic degradation of pharmaceu...,2023-11-15,Torres-Pinto A.,Graphitic carbon nitride (g-C3N4) recently eme...,10.1016/j.cej.2023.146761,2023,TorresPinto_2023,TorresPinto_2023,,...,"Ministério da Ciência, Tecnologia e Ensino Sup...",146761,"[2-s2.0-85135107332, 2-s2.0-84947475287, 2-s2....",72,"[2-s2.0-85047018684, 2-s2.0-85126959718]",2,Photoelectrocatalytic degradation of pharmaceu...,Photoelectrocatalytic degradation of pharmaceu...,Graphitic carbon nitride (g-C3N4) recently eme...,"[0.88104117, 0.16385362, -0.45470127, 1.125364..."


In [5]:
# Create a dictionary to store all graphs
graphs_dict = {}

# Get current date for file naming
current_date = datetime.now().strftime("%Y%m%d")

alphas = [0.3, 0.5]
ks = [5, 10, 15, 20]  # named `ks` so the loop variable `k` below does not overwrite the list

# Get all combinations of alpha and k
alpha_k_combinations = [(alpha, k) for alpha in alphas for k in ks]

print(
    f"Creating weighted knn citation graph with {len(alpha_k_combinations)} combinations"
)

for alpha, k in alpha_k_combinations:
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(
        f"[{timestamp}] Creating weighted knn citation graph with alpha={alpha} and k={k}"
    )

    wnc = WeightedNetworkCreator(df, alpha=alpha)
    similarities, indices = wnc.get_nearest_neighbours(k=k)
    knn_matrix = wnc.link_matrix_knn(similarities, indices)
    knn_citation_matrix = wnc.link_matrix_citations(knn_matrix)
    symmetric_matrix = wnc.make_matrix_symmetric(knn_citation_matrix)
    Gtemp = wnc.create_network(symmetric_matrix)
    Gpretty = wnc.prettify_network(
        Gtemp,
        col_list=["year", "title", "eid"],
    )
    
    # Calculate weight statistics
    weights = [d['weight'] for _, _, d in Gpretty.edges(data=True)]
    weight_stats = {
        "mean_weight": np.mean(weights),
        "std_weight": np.std(weights),
        "min_weight": np.min(weights),
        "max_weight": np.max(weights),
        "median_weight": np.median(weights),
        "q1_weight": np.percentile(weights, 25),
        "q3_weight": np.percentile(weights, 75)
    }
    
    # Store the graph in the dictionary with a descriptive key
    key = f"alpha{alpha}_k{k}"
    graphs_dict[key] = {
        "graph": Gpretty,
        "info_log": wnc.get_info_log(),
        "alpha": alpha,
        "k": k,
        "timestamp": timestamp,
        "weight_stats": weight_stats
    }
    
    weighted_info_log = wnc.get_info_log()
    
    # Add date to the info log
    weighted_info_log["creation_date"] = current_date
    weighted_info_log["creation_timestamp"] = timestamp

    # Write log with alpha, k, and date in the filename so runs do not overwrite each other
    log_filename = f"weighted_network_info_log_alpha{alpha}_k{k}_{current_date}.json"
    with open(f"../output/descriptive-stats-logs/{log_filename}", "w") as f:
        json.dump(weighted_info_log, f)

    # Write graphml with date in filename
    graph_filename = f"weighted_alpha{alpha}_k{k}_knn_citation_{current_date}.graphml"
    nx.write_graphml(
        Gpretty,
        f"../data/05-graphs/weighted-knn-citation-graph/{graph_filename}",
    )

    timestamp_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp_end}] Finished processing alpha={alpha}, k={k}")
    print("#" * 30)

# Print summary of created graphs
print("\nCreated graphs summary:")
for key, data in graphs_dict.items():
    print(f"\n{key}:")
    print(f"  Number of nodes: {data['info_log']['num_nodes']}")
    print(f"  Number of edges: {data['info_log']['num_edges']}")
    print(f"  Percentage of edges from citations: {data['info_log']['num_edges_from_citations'] / data['info_log']['num_edges'] * 100:.2f}%")
    print(f"  Percentage of edges from KNN: {data['info_log']['num_edges_from_knn'] / data['info_log']['num_edges'] * 100:.2f}%")
    print(f"  Created at: {data['timestamp']}")
    print("\n  Edge Weight Statistics:")
    print(f"    Mean: {data['weight_stats']['mean_weight']:.4f}")
    print(f"    Std: {data['weight_stats']['std_weight']:.4f}")
    print(f"    Min: {data['weight_stats']['min_weight']:.4f}")
    print(f"    Max: {data['weight_stats']['max_weight']:.4f}")
    print(f"    Median: {data['weight_stats']['median_weight']:.4f}")
    print(f"    Q1: {data['weight_stats']['q1_weight']:.4f}")
    print(f"    Q3: {data['weight_stats']['q3_weight']:.4f}")

Creating weighted knn citation graph with 8 combinations
[2025-03-26 11:28:31] Creating weighted knn citation graph with alpha=0.3 and k=5
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 360645
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 40643
Number of edges: 460488
Percentage of edges from citations: 78.32%
Percentage of edges from KNN: 21.68%
[2025-03-26 11:31:20] Finished processing alpha=0.3, k=5
##############################
[2025-03-26 11:31:20] Creating weighted knn citation graph with alpha=0.3 and k=10
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 360645
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 40643
Number of edges: 602779
Percentage of edges from citations: 59.83%
Percent

In [6]:
list(Gpretty.edges(data=True))[0]

(0, 20, {'weight': 0.4539029598236084})

In [7]:
list(Gpretty.nodes(data=True))[0]

(0,
 {'year': np.int64(1982),
  'title': 'Serotonergic mechanism in the control of β-endorphin and acth release in male rats',
  'eid': '2-s2.0-0020316326'})

# Directed Citation Graph


In [9]:
# Create a directed network from the dataframe
dnc = DirectedNetworkCreator(df, data_to_add=["eid", "year", "title"])
dnc.build_graph()
print(dnc.get_graph_info())
directed_info_log = dnc.get_info_log()
print(directed_info_log)

G = dnc.G

Number of nodes: 40643
Number of edges: 360645
{'num_edges': 360645, 'num_nodes': 40643, 'nr_isolated_nodes': 2776}


In [10]:
# get node attributes
list(G.nodes(data=True))[0]

('Bruni_1982',
 {'eid': '2-s2.0-0020316326',
  'year': 1982,
  'title': 'Serotonergic mechanism in the control of β-endorphin and acth release in male rats'})

In [11]:
with open("../output/descriptive-stats-logs/directed_network_info_log.json", "w") as f:
    json.dump(directed_info_log, f)

# write graphml
nx.write_graphml(G, "../data/05-graphs/citation-graph/directed_citation_graph.graphml")
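
A quick way to sanity-check a saved file is to read it back and compare sizes. The sketch below does this for a small stand-in graph written to a temporary directory; the node names and attributes are invented for illustration, and the same pattern should apply to the files written above.

```python
import os
import tempfile

import networkx as nx

# Small stand-in graph (hypothetical nodes, not real papers)
toy = nx.DiGraph()
toy.add_node("paper_a", eid="eid-a", year=1982)
toy.add_node("paper_b", eid="eid-b", year=1990)
toy.add_edge("paper_b", "paper_a")  # citing -> cited

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_citation_graph.graphml")
    nx.write_graphml(toy, path)
    reloaded = nx.read_graphml(path)

# Compare node and edge counts before and after the round trip
same_size = (
    reloaded.number_of_nodes() == toy.number_of_nodes()
    and reloaded.number_of_edges() == toy.number_of_edges()
)
print(same_size, reloaded.is_directed(), reloaded.nodes["paper_a"]["year"])
```

Note that `nx.read_graphml` returns node IDs as strings by default (`node_type=str`). This matters for the weighted graphs, whose node IDs are integers: pass `node_type=int` when reloading them if integer IDs are needed.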