# Creating the Network

### Undirected Weighted Network

- read in the dataframe with text embeddings
- use the k-nearest neighbours algorithm to find the k nearest neighbours of each node (the runs below use k = 5, 10, 15 and 20), based on cosine distance
- create a matrix of the edges between nodes, where the weight of the edge is the cosine similarity \* (1 - alpha)
- create a matrix of the edges between nodes, where a weight of 1 \* alpha is assigned if a citation exists between the two nodes
- to make the matrix symmetric, we
  - calculate the average if ij and ji are non-zero
  - use ij in cases where ji is zero (and vice versa)
  - if ij and ji are both zero, set the value to zero (no edge)
  - we then use the upper triangle of the matrix to create the final adjacency matrix
- we then create a network based on this adjacency matrix
- we add data from the original dataframe to the network (title, year, etc.)
- we save the network as a graphml file; we save the log to the output folder
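
The weighting and symmetrisation steps above can be sketched on a toy example with plain NumPy. This is a minimal illustration, not the code inside `WeightedNetworkCreator`: the function names are hypothetical, the brute-force neighbour search stands in for whatever k-nearest-neighbours routine the class uses, and adding the citation weight on top of the similarity weight (rather than replacing it) is an assumption.

```python
import numpy as np


def knn_similarity_matrix(embeddings, k, alpha):
    """Entry (i, j) is (1 - alpha) * cosine similarity if j is one of the
    k nearest neighbours of i, otherwise 0 (hypothetical helper)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a node is never its own neighbour
    idx = np.argsort(-sim, axis=1)[:, :k]  # k most similar nodes per row
    W = np.zeros_like(sim)
    rows = np.arange(len(sim))[:, None]
    W[rows, idx] = (1 - alpha) * sim[rows, idx]
    return W


def add_citations(W, citations, alpha):
    """Add 1 * alpha to entry (i, j) for each citation i -> j.
    Assumption: the two contributions are summed."""
    W = W.copy()
    for i, j in citations:
        W[i, j] += alpha
    return W


def symmetrize(W):
    """Average ij and ji where both are non-zero; otherwise keep the
    non-zero one (or zero if both are zero)."""
    both = (W != 0) & (W.T != 0)
    return np.where(both, (W + W.T) / 2, W + W.T)


# Three toy "papers" with 2-d embeddings and one citation 0 -> 2.
embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
W = knn_similarity_matrix(embeddings, k=1, alpha=0.5)
W = add_citations(W, citations=[(0, 2)], alpha=0.5)
S = symmetrize(W)

# Only the upper triangle is needed to define the undirected edges.
adjacency = np.triu(S, k=1)
print(adjacency)
```

In this toy case the edge between nodes 0 and 1 comes from mutual nearest neighbours, the edge between 1 and 2 from a one-sided neighbour relation, and the edge between 0 and 2 from the citation alone.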

### Directed Unweighted Network

- create an edgelist from the dataframe
- create a directed network based on this edgelist
- add data from the original dataframe to the network (title, year, etc.)
- we save the network as a graphml file; we save the log to the output folder
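
The edgelist step can be sketched as follows. The records and the `references` field are hypothetical stand-ins for dataframe rows (the actual column names are not shown in this notebook), and dropping citations to papers outside the corpus and self-citations is an assumption about how `DirectedNetworkCreator` behaves.

```python
# Hypothetical records standing in for dataframe rows: each paper has an
# "eid" and a list of cited eids (the field name "references" is assumed).
records = [
    {"eid": "A", "references": ["B", "C"]},
    {"eid": "B", "references": ["C", "X"]},  # "X" is not in the corpus
    {"eid": "C", "references": []},
]


def build_edgelist(records):
    """Return (citing, cited) pairs, keeping only citations between papers
    that are both in the corpus and skipping self-citations (assumed rules)."""
    known = {r["eid"] for r in records}
    edges = []
    for r in records:
        for cited in r["references"]:
            if cited in known and cited != r["eid"]:
                edges.append((r["eid"], cited))
    return edges


edges = build_edgelist(records)
print(edges)
```

A list of tuples like this can be passed directly to `nx.DiGraph(edges)` to obtain the directed network.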


In [1]:
import pandas as pd
from src.network.NetworkCreator import WeightedNetworkCreator, DirectedNetworkCreator
import json
import networkx as nx


p = "../data/04-embeddings/df_with_specter2_embeddings.pkl"
df = pd.read_pickle(p)

In [2]:
# usage

alphas = [0.3, 0.5]
ks = [5, 10, 15, 20]  # named ks so the loop variable k does not shadow the list
# get all unique combinations of alpha and k
alpha_k_combinations = [(a, b) for a in alphas for b in ks]

for alpha, k in alpha_k_combinations:
    print(f"Creating weighted knn citation graph with alpha={alpha} and k={k}")
    wnc = WeightedNetworkCreator(df, alpha=alpha)
    similarities, indices = wnc.get_nearest_neighbours(k=k)
    knn_matrix = wnc.link_matrix_knn(similarities, indices)
    knn_citation_matrix = wnc.link_matrix_citations(knn_matrix)
    symmetric_matrix = wnc.make_matrix_symmetric(knn_citation_matrix)
    G = wnc.create_network(symmetric_matrix)
    G = wnc.add_data_to_nodes(
        G,
        col_list=["eid", "title", "year", "doi", "unique_auth_year"],  # "abstract",
    )
    weighted_info_log = wnc.get_info_log()

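    # Note: the file name is the same on every iteration, so only the log of the last (alpha, k) run is kept
    with open("../output/removal_log/weighted_network_info_log.json", "w") as f: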
        json.dump(weighted_info_log, f)

    # write graphml
    nx.write_graphml(
        G,
        f"../data/05-graphs/weighted-knn-citation-graph/weighted_alpha{alpha}_k{k}_knn_citation.graphml",
    )
    print("#" * 30)

Creating weighted knn citation graph with alpha=0.3 and k=5
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 360645
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 40643
Number of edges: 460488
Percentage of edges from citations: 78.32%
Percentage of edges from KNN: 21.68%
Adding data to nodes...
##############################
Creating weighted knn citation graph with alpha=0.3 and k=10
Initializing WeightedNetworkCreator
Getting nearest neighbours...
Creating KNN link matrix...
Creating citation link matrix...
Number of edges from citations: 360645
Making matrix symmetric...
Creating weighted undirected network...
Number of nodes: 40643
Number of edges: 602779
Percentage of edges from citations: 59.83%
Percentage of edges from KNN: 40.17%
Adding data to nodes...
##############################
Creating weighted knn citation graph with alpha=0

In [3]:
list(G.edges(data=True))[0]

(0, 32, {'weight': 0.463172972202301})

In [4]:
list(G.nodes(data=True))[0]

(0,
 {'eid': '2-s2.0-0019989147',
  'title': 'Forebrain serotonin and avoidance learning: Behavioural and biochemical studies on the acute effect of p-chloroamphetamine on one-way active avoidance learning in the male rat',
  'year': 1982,
  'doi': '10.1016/0091-3057(82)90040-5',
  'unique_auth_year': 'Aagaard_2009'})

# Directed Citation Graph


In [2]:
dnc = DirectedNetworkCreator(df)
G = dnc.create_network()
G = dnc.add_data_to_nodes(
    G,
    col_list=[
        "title",
        "year",
        # "abstract",
        "doi",
        # "unique_auth_year",
        "eid",
    ],
)
directed_info_log = dnc.get_info_log()

Initializing DirectedNetworkCreator
Creating directed network...
Number of nodes: 37867
Number of edges: 360645
Adding data to nodes...


In [3]:
with open("../output/removal_log/directed_network_info_log.json", "w") as f:
    json.dump(directed_info_log, f)

# write graphml
nx.write_graphml(G, "../data/05-graphs/citation-graph/directed_citation_graph.graphml")

# Make Pajek Ready

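1. Keep only the EID as node data
2. Remove loops by merging each strongly connected component into a single family node (as in liu2019)
3. Remove isolates and keep only the largest weakly connected component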
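
The family-node idea in step 2 can be illustrated on a toy graph. The sketch below uses networkx's built-in `nx.condensation`, which is a swapped-in technique and not what the `PajekPrepper` class below does: condensation relabels every component (including single nodes) with an integer and does not carry over the `eid` attribute, whereas the class keeps single nodes as they are and joins the eids of merged nodes. The graph `G_toy` is made up for illustration.

```python
import networkx as nx

# Toy citation graph: a and b cite each other (a loop), b cites c, c cites d.
G_toy = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")])

# Contract every strongly connected component into a single node.
C = nx.condensation(G_toy)
mapping = C.graph["mapping"]  # original node -> component id

# Each condensed node records which original nodes it contains.
for comp_id, data in C.nodes(data=True):
    print(comp_id, sorted(data["members"]))

# The condensed graph has no cycles left, which is the goal of loop removal.
print(nx.is_directed_acyclic_graph(C))
```

Here `a` and `b` end up in the same component, so the mutual citation disappears while the citations to `c` and `d` are preserved.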


In [4]:
import networkx as nx
import pandas as pd


class PajekPrepper:
    def __init__(self, G):
        self.OG = G
        self.NG = G.copy()
        self.log = {}

    def prepare_attributes(self):
        """Keep only the eid attribute on each node and cast it to a string."""
        for node, data in self.NG.nodes(data=True):
            keys_to_remove = [k for k in data if k != "eid"]
            for k in keys_to_remove:
                data.pop(k, None)
        for node, data in self.NG.nodes(data=True):
            for k, v in data.items():
                data[k] = str(v)

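    def remove_loops(self):
        """Merge each strongly connected component with more than one node
        into a single family node and drop the edges inside a family."""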
        sccs = list(nx.strongly_connected_components(self.NG))
        Gloopless = nx.DiGraph()
        self.original_to_family = {}
        removed_eids = []  # To store eids of nodes merged into families

        for scc in sccs:
            if len(scc) > 1:
                eids = ";".join(
                    [
                        str(self.NG.nodes[node]["eid"])
                        for node in scc
                        if "eid" in self.NG.nodes[node]
                    ]
                )
                family_node = "family_" + "_".join(sorted([str(node) for node in scc]))
                Gloopless.add_node(family_node, eid=eids)
                removed_eids.append(eids)  # Log merged eids
                for node in scc:
                    self.original_to_family[node] = family_node
            else:
                node = next(iter(scc))
                Gloopless.add_node(node, **self.NG.nodes[node])
                self.original_to_family[node] = node

        for u, v, data in self.NG.edges(data=True):
            new_u = self.original_to_family.get(u, u)
            new_v = self.original_to_family.get(v, v)
            if new_u != new_v:
                Gloopless.add_edge(new_u, new_v, **data)

        removed_eids_list = [
            eid for sublist in removed_eids for eid in sublist.split(";")
        ]
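        self.NG = Gloopless
        # "count" is the number of merged families (SCCs with more than one node),
        # not the number of original nodes; "eids" lists the eid of every merged node
        self.log["loops_removed"] = {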
            "count": len(removed_eids),
            "eids": removed_eids_list,
        }

    def remove_isolates(self):
        """Drop nodes without any edges and log their eids."""
        isolates = list(nx.isolates(self.NG))
        isolated_eids = [self.NG.nodes[iso]["eid"] for iso in isolates]
        self.NG.remove_nodes_from(isolates)
        isolated_eids_list = [
            eid for sublist in isolated_eids for eid in sublist.split(";")
        ]
        self.log["isolates_removed"] = {
            "count": len(isolates),
            "eids": isolated_eids_list,
        }

    def extract_largest_wcc(self):
        """Keep only the largest weakly connected component and log the eids of dropped nodes."""
        largest_wcc = max(nx.weakly_connected_components(self.NG), key=len)
        removed_nodes = set(self.NG.nodes()) - set(largest_wcc)
        removed_eids = [self.NG.nodes[node]["eid"] for node in removed_nodes]
        self.NG = self.NG.subgraph(largest_wcc).copy()
        removed_eids_list = [
            eid for sublist in removed_eids for eid in sublist.split(";")
        ]

        self.log["largest_wcc_removed"] = {
            "count": len(removed_nodes),
            "eids": removed_eids_list,
        }

    def prepare_pajek(self):
        self.prepare_attributes()
        self.remove_loops()
        self.remove_isolates()
        self.extract_largest_wcc()
        return self.NG, self.log


pp = PajekPrepper(G)
Gpjk, log = pp.prepare_pajek()
print(f"Original number of nodes: {G.number_of_nodes()}")
print(f"Original number of edges: {G.number_of_edges()}")
print(f"New number of nodes: {Gpjk.number_of_nodes()}")
print(f"New number of edges: {Gpjk.number_of_edges()}")

print(f" Removed bc loops: {log['loops_removed']['count']}")
print(f" Removed bc isolates: {log['isolates_removed']['count']}")
print(f" Removed bc largest wcc: {log['largest_wcc_removed']['count']}")

Original number of nodes: 37867
Original number of edges: 360645
New number of nodes: 37587
New number of edges: 358235
 Removed bc loops: 73
 Removed bc isolates: 2
 Removed bc largest wcc: 153


### Save to Pajek


In [5]:
nx.write_pajek(
    Gpjk,
    "../data/05-graphs/citation-graph/directed_citation_graph_loopless_pjk.net",
)

In [6]:
def removed_eid_printer(df, log, key):
    eids = log[key]["eids"]
    sub_df = df[df["eid"].isin(eids)]
    return sub_df[["title", "year", "eid"]]


sub_df = removed_eid_printer(df, log, "largest_wcc_removed")
for i, row in sub_df.iterrows():
    print(row["title"], row["year"], row["eid"])

Suppression of prolactin secretion by benzodiazepines in vivo 1982 2-s2.0-0020055314
Actions of benzodiazepines on the neuroendocrine system 1983 2-s2.0-0021051430
Effect of some serotoninergic agents on the rectal temperature of the domestic fowl (gallus domesticus) 1984 2-s2.0-0021735830
Depletions of central norepinephrine by intraventricular xylamine in rats 1984 2-s2.0-0021323330
The uptake and metabolism of 5-hydroxytryptamine by tissue slices of the cestode Hymenolepis diminuta 1985 2-s2.0-0021951134
Differential effects of neuroleptic and serotonergic drugs on amphetamine-induced hypothermia in mice 1985 2-s2.0-0021985820
Behavioral effects of xylamine-induced depletions of brain norepinephrine: Interaction with LSD 1985 2-s2.0-0022378149
Behavioral effects of xylamine-induced depletions of brain norepinephrine: Interaction with amphetamine 1986 2-s2.0-0022616497
The uptake and metabolism of l-glutamate by tissue slices of the cestode Hymenolepis diminuta 1986 2-s2.0-0022521312