In [2]:
from IPython.display import Markdown, display
display(Markdown(open("./SM_header.md", "r").read()))

Copyright © 2025 Université Paris Cité

Author: [Guillaume Rousseau](https://www.linkedin.com/in/grouss/), Department of Physics, Paris, France (email: guillaume.rousseau@u-paris.fr)

This archive contains the supplemental materials and replication package associated with the preprint, "*Temporal and topological partitioning in real-world growing networks for scale-free properties study*", available on [arXiv](https://arxiv.org/abs/2501.10145) and [ssrn](http://ssrn.com/abstract=5191689).

The current version of the Python scripts and associated resources is available on the [author's GitHub page](https://github.com/grouss/growing-network-study).

This work is currently licensed under the [Creative Commons CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0).

To give appropriate credit and cite this work ([BibTeX entry](./rousseau2025temporal)):
Rousseau, G. (2025). *Temporal and topological partitioning in real-world growing networks for scale-free properties study* [Preprint]. arXiv:2501.10145. https://arxiv.org/abs/2501.10145; also available on SSRN: http://ssrn.com/abstract=5191689

 
# A) Replication Package

[Open the corresponding Replication Package notebook](./Replication_Package.ipynb)

# B) QuickStart Guide

[Open the corresponding QuickStart Guide notebook](./SM00_QuickStart.ipynb)

# C) Table of Contents

- 1. [Function Definitions](./SM01_Functions.ipynb)
- 2. [Dataset Import](./SM02_DatasetImport.ipynb)
- 3. [Building the Transposed Graph](./SM03_BuildingTransposedGraph.ipynb)
- 4. [Temporal Information Quality and Summary Statistics](./SM04_TemporalInformationMainStats.ipynb)
- 5. [Growth Relationship Between Nodes and Edges](./SM05_GrowingRules.ipynb)
- 6. [Topological Partitioning ($RV$ Nodes)](./SM06_TopologicalPartitioning.ipynb)
- 7. [In-Degree and Out-Degree Distributions Over Time](./SM07_DegreeDistributionOverTime.ipynb)
- 8. [Distribution Tail Analysis](./SM08_DistributionTailAnalysis.ipynb)
- 9. [Temporal Partitioning](./SM09_TemporalPartitioning.ipynb)
- 10. [Derived $O-(RV/RL)-O$ Graph Construction](./SM10_DerivedGrowingNetwork.ipynb)
- 11. [Building the $TSL$ Partitioning](./SM11_TSLPartitioning.ipynb)
- 12. [Barabási–Albert Model Use Case](./SM12_BarabasiAlbertUseCase.ipynb)


**NB:** As of 2025/05/07, the QuickStart guide, the replication package, and SM01 to SM12 are available. The Python scripts are also provided in the `local_utils` directory, but they are not in their final form and should be considered an alpha release. The graphs used in the study are available in a separate Zenodo deposit, 10.5281/zenodo.15260640 ($\sim50$ GB), which includes the main dataset $O/RV/RL-O/RV/RL$ (more than 2 billion nodes, $\sim4$ billion edges) and two derived $O-(RV/RL)-O$ graphs ($\sim150$ million nodes and edges each).

In [2]:
%load_ext autoreload
%autoreload 2

import importlib,sys,local_utils
from local_utils import *

print("___ Import data from graphpath=",config.graphpath)
print("___ Export data to exportpath=",config.exportpath)   

DisplayCopyrightInfo()


___ Import data from graphpath= ./ImportData/
___ Export data to exportpath= ./ExportData/



---

# 2. Dataset Import

We describe here how to load the main graph and the two derived graphs. Details about how the derived graphs are built are provided in the other notebooks, particularly with regard to the two distinct inheritance rules.

## a) Main O/RV/RL-O/RV/RL graph 

This graph is built from the original raw export of the Software Heritage (SWH) dataset dated 2021/03/23. See the Replication Package notebook for the raw data formats and for query examples that extract information from the "2021-03-23" dump of the Software Heritage project dataset, available here: https://registry.opendata.aws/software-heritage/



In [3]:
nodes,edges,nodesad,d,Nnodes,Nedges=LoadAllArray()   

Loaded : ./ImportData/nodes_20240310.pkl
Loaded : ./ImportData/edges_20240310.pkl
Loaded : ./ImportData/nodesad_20240310.pkl


In [4]:

# Stats per node/edge type for the main graph loaded above
# NB : A more generic implementation is provided later
output='stats per node/edge type for the main graph'
print('-'*len(output));print(output);print('-'*len(output))
for src in ["O","RL","RV"]:
    # all outgoing edges of the source nodes of type `src` (one contiguous CSR slice)
    alledges=edges[nodes[d[src+"indexMin"]]:nodes[d[src+"indexMax"]+1]]
    nAll=len(alledges)
    # node indexes are ordered by type (O, then RL, then RV), so two comparisons suffice
    nO=np.sum(alledges<=d["OindexMax"])
    nRV=np.sum(alledges>=d["RVindexMin"])
    nRL=nAll-nO-nRV
    output=f'{src:2} #nodes {d[src+"indexMax"]+1-d[src+"indexMin"]:15,} '
    output+=f'|| #edges {src+">O=":>5}{nO:15,} | {src+">RL=":>6}{nRL:15,} | {src+">RV=":>6}{nRV:15,}'
    print(output)
print(f'Total number of nodes = {len(nodes)-1:,}')
print(f'Total number of edges = {len(edges):,}')

-------------------------------------------
stats per node/edge type for the main graph
-------------------------------------------
O  #nodes     139,524,533 || #edges  O>O=              0 |  O>RL=    687,095,698 |  O>RV=  1,017,960,393
RL #nodes      17,275,253 || #edges RL>O=              0 | RL>RL=         41,351 | RL>RV=     17,225,899
RV #nodes   2,067,579,054 || #edges RV>O=              0 | RV>RL=              0 | RV>RV=  2,119,355,702
Total number of nodes = 2,224,378,840
Total number of edges = 3,841,679,043


`nodes` is an array of size `Nnodes`+1, and `edges` is an array of size `Nedges`.

The indexes of the target nodes of a source node of index `i` are given by the slice `edges[nodes[i]:nodes[i+1]]`.

If this slice is empty, the node has no descendants and is a leaf.
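As an illustration of this layout, here is a minimal sketch on a toy graph with four nodes. The names `toy_nodes`, `toy_edges`, `targets` and `is_leaf` are hypothetical and are not part of `local_utils`; only the indexing rule `edges[nodes[i]:nodes[i+1]]` comes from the text above.

```python
import numpy as np

# Toy graph with 4 nodes: 0 -> {1, 2}, 1 -> {2}, 2 -> {} (leaf), 3 -> {0}
toy_nodes = np.array([0, 2, 3, 3, 4], dtype=np.uint32)  # size Nnodes + 1
toy_edges = np.array([1, 2, 2, 0], dtype=np.uint32)     # size Nedges

def targets(nodes, edges, i):
    """Return the target indexes of source node i (empty array for a leaf)."""
    return edges[nodes[i]:nodes[i + 1]]

def is_leaf(nodes, i):
    """A node is a leaf when its slice of `edges` is empty."""
    return nodes[i + 1] == nodes[i]

# Out-degrees are the differences between consecutive offsets
out_degrees = np.diff(toy_nodes.astype(np.int64))

print(targets(toy_nodes, toy_edges, 0).tolist())  # [1, 2]
print(bool(is_leaf(toy_nodes, 2)))                # True
print(out_degrees.tolist())                       # [2, 1, 0, 1]
```

The extra entry at the end of `nodes` is what lets the last node use the same slicing rule as every other node.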

Since the numbers of nodes and edges are both less than $2^{32}-1$, one-dimensional numpy arrays of type `uint32` are sufficient to store this graph and its associated transposed graph.

The graph (`nodes` and `edges`), along with the timestamps array (`nodesad`, which uses the same amount of memory as `nodes`), requires $8.3 \times 2 + 14.3 \approx 31$ GB.

If the graph (`nodes` and `edges`), timestamps (`nodesad`), and transposed graph need to be accessed simultaneously, the memory requirement increases to $8.3 \times 3 + 14.3 \times 2 = 53.5$ GB.
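These figures can be checked with a short calculation from the node and edge counts printed above. This is a sketch under two assumptions: every array entry is a 4-byte `uint32`, and 1 GB is read as $2^{30}$ bytes, which is the convention that reproduces the values 8.3 and 14.3.

```python
N_NODES = 2_224_378_840   # total number of nodes (from the statistics above)
N_EDGES = 3_841_679_043   # total number of edges (from the statistics above)
BYTES_PER_ENTRY = 4       # uint32
GIB = 2**30               # assumption: "GB" is read as 2**30 bytes

# Offsets stored in `nodes` go up to Nedges, and `edges` stores node indexes
fits_uint32 = max(N_NODES, N_EDGES) < 2**32 - 1

nodes_gib = (N_NODES + 1) * BYTES_PER_ENTRY / GIB   # also the size of `nodesad`
edges_gib = N_EDGES * BYTES_PER_ENTRY / GIB

forward_only = 2 * nodes_gib + edges_gib          # nodes + nodesad + edges
with_transposed = 3 * nodes_gib + 2 * edges_gib   # plus the transposed nodes/edges

print(fits_uint32)                                   # True
print(round(nodes_gib, 1), round(edges_gib, 1))      # 8.3 14.3
print(round(forward_only, 1), round(with_transposed, 1))  # 30.9 53.5
```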

This Compressed Sparse Row (CSR, https://en.wikipedia.org/wiki/Sparse_matrix) representation is highly efficient for traversing a graph in the forward direction, as it allows for the quick retrieval of adjacent node indices. Other representations are possible and may be better suited for certain types of graph studies, but this point falls outside the scope of this study.
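Traversing the graph backward requires the transposed graph, stored in the same CSR layout. The sketch below shows one possible numpy-only way to derive it on a toy graph. The function `transpose_csr` is hypothetical, and the construction used in this study (see SM03) may differ, in particular to keep peak memory under control at full scale.

```python
import numpy as np

def transpose_csr(nodes, edges):
    """Return (t_nodes, t_edges) such that t_edges[t_nodes[j]:t_nodes[j+1]]
    lists the source nodes of every edge pointing to node j."""
    n = len(nodes) - 1
    # Source index of each edge, recovered from the out-degrees
    out_deg = np.diff(nodes.astype(np.int64))
    sources = np.repeat(np.arange(n, dtype=np.int64), out_deg)
    # Group edges by target; a stable sort keeps sources in increasing order
    order = np.argsort(edges, kind="stable")
    t_edges = sources[order].astype(edges.dtype)
    # Offsets of the transposed graph are the cumulative in-degrees
    in_deg = np.bincount(edges.astype(np.int64), minlength=n)
    t_nodes = np.zeros(n + 1, dtype=nodes.dtype)
    t_nodes[1:] = np.cumsum(in_deg)
    return t_nodes, t_edges

# Toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {}, 3 -> {0}
toy_nodes = np.array([0, 2, 3, 3, 4], dtype=np.uint32)
toy_edges = np.array([1, 2, 2, 0], dtype=np.uint32)

t_nodes, t_edges = transpose_csr(toy_nodes, toy_edges)
print(t_nodes.tolist())                          # [0, 1, 2, 4, 4]
print(t_edges[t_nodes[2]:t_nodes[3]].tolist())   # predecessors of node 2: [0, 1]
```

This version materializes several temporary arrays of size `Nedges`, so it is meant for small examples rather than for the multi-billion-edge graph.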


Given its index, the type of a node is returned by the `GetNodesType` function.


In [5]:
for i in [0,111,150000000,1000000000]:
    print(f'The node of index {i:15,} is of type {GetNodesType(i):<}')

The node of index               0 is of type O
The node of index             111 is of type O
The node of index     150,000,000 is of type RL
The node of index   1,000,000,000 is of type RV


The abstraction layer over node types can be made more generic without any particular difficulty (whether maintaining the ordering by node type or not). The chosen ordering here has the advantage of not requiring a specific array or extra bits, resulting in memory savings and avoiding additional memory accesses for algorithms that need information about the node or edge types.
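To make the idea concrete, here is a minimal sketch of a type lookup that relies only on the index ordering. It is not the implementation of `GetNodesType`. The boundaries in `FIRST_INDEX` are an assumption: they are inferred from the node counts printed above, supposing that each type occupies one contiguous index range in the order O, RL, RV, starting at index 0.

```python
import numpy as np

# Hypothetical boundaries, inferred from the node counts
# (139,524,533 O nodes, then 17,275,253 RL nodes, then the RV nodes)
TYPE_NAMES = np.array(["O", "RL", "RV"])
FIRST_INDEX = np.array([0, 139_524_533, 156_799_786], dtype=np.int64)

def node_types(indexes):
    """Return the integer type code (0=O, 1=RL, 2=RV) of each node index."""
    indexes = np.asarray(indexes, dtype=np.int64)
    # Position of the last boundary that is <= index
    return np.searchsorted(FIRST_INDEX, indexes, side="right") - 1

codes = node_types([0, 111, 150_000_000, 1_000_000_000])
print(TYPE_NAMES[codes].tolist())  # ['O', 'O', 'RL', 'RV']
```

Under these assumptions, the result agrees with the output of `GetNodesType` shown above, and no per-node type array is stored.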

To facilitate reuse and to simplify the use of numpy methods, we define generic functions that encode node and edge types as `int` values. Adapting these functions to your own typing rules makes the examples in this notebook reusable.
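One possible `int` encoding is sketched below on a toy graph: node types are coded 0, 1, 2, and an edge type is coded as `3 * source_type + target_type`, so that `np.bincount` yields all nine counts at once. The function `edge_type_counts` and its parameters are hypothetical; the encoding actually used by `GetNodesTypesArray` and `GetEdgesTypesArray` is not shown in this notebook and may differ.

```python
import numpy as np

def edge_type_counts(nodes, edges, first_index):
    """Count edges per (source type, target type) pair.

    `first_index[k]` is the first node index of type k (types are assumed
    to occupy contiguous, increasing index ranges)."""
    bounds = np.asarray(first_index, dtype=np.int64)
    n_types = len(bounds)
    n = len(nodes) - 1
    # Source index of each edge, recovered from the CSR offsets
    src = np.repeat(np.arange(n, dtype=np.int64), np.diff(nodes.astype(np.int64)))
    src_t = np.searchsorted(bounds, src, side="right") - 1
    dst_t = np.searchsorted(bounds, edges.astype(np.int64), side="right") - 1
    codes = n_types * src_t + dst_t  # one int per edge
    return np.bincount(codes, minlength=n_types**2).reshape(n_types, n_types)

# Toy graph: nodes 0-1 are O, node 2 is RL, nodes 3-4 are RV
# 0 -> {2, 3}, 1 -> {3}, 2 -> {3}, 3 -> {4}, 4 -> {}
toy_nodes = np.array([0, 2, 3, 4, 5, 5], dtype=np.uint32)
toy_edges = np.array([2, 3, 3, 3, 4], dtype=np.uint32)

counts = edge_type_counts(toy_nodes, toy_edges, first_index=[0, 2, 3])
print(counts.tolist())  # rows: source O, RL, RV -> [[0, 1, 2], [0, 0, 1], [0, 0, 1]]
```

At the scale of the main graph, the per-edge temporary arrays would have to be processed in chunks to stay within memory.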

In [6]:
nodes,edges,nodesad,d,Nnodes,Nedges=LoadAllArray()   
DisplayTypeStats(nodes,edges,d)    


Loaded : ./ImportData/nodes_20240310.pkl
Loaded : ./ImportData/edges_20240310.pkl
Loaded : ./ImportData/nodesad_20240310.pkl
GetNodesTypesArray [Elapse time : 9.0 (s)]
___ O     :     139,524,533 (6.27%)
___ RL    :      17,275,253 (0.78%)
___ RV    :   2,067,579,054 (92.95%)
______________________________
___ Total :   2,224,378,840 (100.0%)

GetEdgesTypesArray [Elapse time : 23.0 (s)]
___ O>O   :               0 (0.0%)
___ O>RL  :     687,095,698 (17.89%)
___ O>RV  :   1,017,960,393 (26.5%)
___ RL>O  :               0 (0.0%)
___ RL>RL :          41,351 (0.0%)
___ RL>RV :      17,225,899 (0.45%)
___ RV>O  :               0 (0.0%)
___ RV>RL :               0 (0.0%)
___ RV>RV :   2,119,355,702 (55.17%)
______________________________
___ Total :   3,841,679,043 (100.0%)



{'O': 139524533,
 'RL': 17275253,
 'RV': 2067579054,
 'O>O': 0,
 'O>RL': 687095698,
 'O>RV': 1017960393,
 'RL>O': 0,
 'RL>RL': 41351,
 'RL>RV': 17225899,
 'RV>O': 0,
 'RV>RL': 0,
 'RV>RV': 2119355702}

In [7]:
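# cleanup: release the large arrays before loading the derived graphs
# ok if rerun (names that are already gone are ignored)
# NB: `del var` inside a `for var in [...]` loop only unbinds the loop variable,
# so the global names are removed explicitly
for name in ["nodes","edges","nodesad","d","Nnodes","Nedges"]:
    globals().pop(name,None)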

## b) Derived O-(RV/RL)-O dataset (with inheritance rule, 'BigO')

In [8]:
# Loading the derived O-(RV/RL)-O dataset
# (~140 million nodes, ~157 million edges)

nodes,edges,nodesad,d,Nnodes,Nedges=LoadAllArray_OO()
# similar to nodes,edges,nodesad,d,Nnodes,Nedges=LoadAllArray_OO(keypath="BigO")
statoutput=DisplayTypeStats(nodes,edges,d)    


Loaded : nodes_o_derived_O-RVRL-O_BigO_20240429.pkl
Loaded : edges_o_derived_O-RVRL-O_BigO_20240429.pkl
Loaded : nodesadderived_derived_O-RVRL-O_BigO_20240429.pkl
GetNodesTypesArray [Elapse time : 1.0 (s)]
___ O     :     139,524,533 (100.0%)
___ RL    :               0 (0.0%)
___ RV    :               0 (0.0%)
______________________________
___ Total :     139,524,533 (100.0%)

GetEdgesTypesArray [Elapse time : 1.0 (s)]
___ O>O   :     156,682,302 (100.0%)
___ O>RL  :               0 (0.0%)
___ O>RV  :               0 (0.0%)
___ RL>O  :               0 (0.0%)
___ RL>RL :               0 (0.0%)
___ RL>RV :               0 (0.0%)
___ RV>O  :               0 (0.0%)
___ RV>RL :               0 (0.0%)
___ RV>RV :               0 (0.0%)
______________________________
___ Total :     156,682,302 (100.0%)



## c) Derived O-(RV/RL)-O dataset (without inheritance rule, 'L0')

In [9]:
# Loading the derived O-(RV/RL)-O dataset
# (~140 million nodes, ~150 million edges)

nodes,edges,nodesad,d,Nnodes,Nedges=LoadAllArray_OO(keypath="L0")
statoutput=DisplayTypeStats(nodes,edges,d)    


Loaded : nodes_o_derived_O-RVRL-O_L0_20240429.pkl
Loaded : edges_o_derived_O-RVRL-O_L0_20240429.pkl
Loaded : nodesadderived_derived_O-RVRL-O_BigO_20240429.pkl
GetNodesTypesArray [Elapse time : 1.0 (s)]
___ O     :     139,524,533 (100.0%)
___ RL    :               0 (0.0%)
___ RV    :               0 (0.0%)
______________________________
___ Total :     139,524,533 (100.0%)

GetEdgesTypesArray [Elapse time : 1.0 (s)]
___ O>O   :     149,732,521 (100.0%)
___ O>RL  :               0 (0.0%)
___ O>RV  :               0 (0.0%)
___ RL>O  :               0 (0.0%)
___ RL>RL :               0 (0.0%)
___ RL>RV :               0 (0.0%)
___ RV>O  :               0 (0.0%)
___ RV>RL :               0 (0.0%)
___ RV>RV :               0 (0.0%)
______________________________
___ Total :     149,732,521 (100.0%)

