In [1]:
from IPython.display import Markdown, display
display(Markdown(open("./SM_header.md", "r").read()))


Copyright © 2025 Université Paris Cité

Author: [Guillaume Rousseau](https://www.linkedin.com/in/grouss/), Department of Physics, Paris, France (email: guillaume.rousseau@u-paris.fr)

This archive contains the supplemental materials and replication package associated with the preprint, "*Temporal and topological partitioning in real-world growing networks for scale-free properties study*", available on [arXiv](https://arxiv.org/abs/2501.10145) and [SSRN](http://ssrn.com/abstract=5191689).

The current version of the Python scripts and associated resources is available on the [author's GitHub page](https://github.com/grouss/growing-network-study).

This work is currently licensed under the [Creative Commons CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0).

To give appropriate credit and cite this work ([BibTeX entry](./rousseau2025temporal)):
Rousseau, G. (2025). *Temporal and topological partitioning in real-world growing networks for scale-free properties study* [Preprint]. arXiv:2501.10145. https://arxiv.org/abs/2501.10145; also available on SSRN: http://ssrn.com/abstract=5191689

 
# A) Replication Packages

[Open the Replication Package notebook related to the datasets.](./Replication_Package_Datasets.ipynb)

[Open the Replication Package notebook related to the figures.](./Replication_Package_Figures.ipynb)

# B) QuickStart Guide

[Open the QuickStart Guide notebook](./SM00_QuickStart.ipynb)

# C) Table of Contents

- 1. [Function Definitions](./SM01_Functions.ipynb)
- 2. [Dataset Import](./SM02_DatasetImport.ipynb)
- 3. [Building the Transposed Graph](./SM03_BuildingTransposedGraph.ipynb)
- 4. [Temporal Information Quality and Summary Statistics](./SM04_TemporalInformationMainStats.ipynb)
- 5. [Growth Relationship Between Nodes and Edges](./SM05_GrowingRules.ipynb)
- 6. [Topological Partitioning($RV$ Nodes)](./SM06_TopologicalPartitioning.ipynb)
- 7. [In-Degree and Out-Degree Distributions Over Time](./SM07_DegreeDistributionOverTime.ipynb)
- 8. [Distribution Tail Analysis](./SM08_DistributionTailAnalysis.ipynb)
- 9. [Temporal Partitioning](./SM09_TemporalPartitioning.ipynb)
- 10. [Derived $O-(RV/RL)-O$ Graph Construction](./SM10_DerivedGrowingNetwork.ipynb)
- 11. [Building the $TSL$ Partitioning](./SM11_TSLPartitioning.ipynb)
- 12. [Barabási–Albert Model Use Case](./SM12_BarabasiAlbertUseCase.ipynb)


**NB:** As of 2025/05/16, the QuickStart guide, the replication packages, and SM01 to SM12 are available. The Python scripts are also provided under the `local_utils` directory, but they are not in their final form and should be considered an alpha release. The graph datasets used in the study are available in a separate Zenodo deposit, 10.5281/zenodo.15260640 ($\sim50$ GB), including the main dataset $O/RV/RL-O/RV/RL$ (over 2 billion nodes, $\sim4$ billion edges) and two derived $O-(RV/RL)-O$ graphs ($\sim150$ million nodes and edges).

More release notes are available in the [dedicated notebook](./SM_ReleaseNote.ipynb).

# **Replication Packages (Datasets)**

This notebook describes the data format and presents query examples for extracting information from the *2021-03-23* snapshot of the Software Heritage project dataset, available at: https://registry.opendata.aws/software-heritage/

The extraction yields the main dataset, denoted as the $O/RV/RL−O/RV/RL$ graph, comprising over 2 billion nodes and approximately 4 billion edges.

Several comments and observations are provided concerning the export procedures for origin, release, and revision nodes, as well as the encoding of timestamps.

Further details regarding the CSR graph format employed are available in a separate notebook (*SM02_DatasetImport.ipynb*).
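
As a minimal sketch of that CSR layout, assuming (as the cells further down suggest) an offsets array `nodes` of length $N+1$ and a flat `edges` array of target indices. The toy graph and the helper names `out_degree` and `successors` are illustrative only and are not taken from the replication package:

```python
import numpy as np

# Toy graph with 4 nodes: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 0 (illustrative data only)
nodes = np.array([0, 2, 3, 3, 4], dtype=np.int64)   # offsets, length N + 1
edges = np.array([1, 2, 2, 0], dtype=np.uint32)     # targets, grouped by source node

def out_degree(nodes, i):
    """Number of outgoing edges of node i."""
    return int(nodes[i + 1] - nodes[i])

def successors(nodes, edges, i):
    """Targets of the outgoing edges of node i, as a view on the edges array."""
    return edges[nodes[i]:nodes[i + 1]]

for i in range(len(nodes) - 1):
    print(i, out_degree(nodes, i), successors(nodes, edges, i).tolist())
```

With this layout, the edges of a contiguous range of nodes `a..b` form the single slice `edges[nodes[a]:nodes[b+1]]`, which is the pattern used in the summary cells at the end of this notebook.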


In [5]:
# connection to the AWS Open Data registry using the AWS CLI
import os
from pyarrow import orc,fs
import pickle
import numpy as np

def GetBucket(currentdump):
    return f'softwareheritage/graph/{currentdump}/orc/'

def GetDatasetType(currentdump):
    bucket = GetBucket(currentdump)
    stream=os.popen('aws s3 ls --no-sign-request  s3://'+bucket)
    return ((stream.read()).replace('PRE','')).replace('/','').split()
    
def GetFileList(currentdump, datasettype):
    bucket = GetBucket(currentdump)
    stream=os.popen('aws s3 ls --no-sign-request  s3://'+bucket+datasettype+'/')
    filelistpertype=[s.split()[-1] for s in (stream.read()).split('\n') if len(s)!=0]
    return filelistpertype
    
def DisplayORCSchema(currentdump,datasettype,filename,Verbose=True,FlagSample=True):
    s3 = fs.S3FileSystem(region='us-east-1',anonymous=True,connect_timeout=180,request_timeout=180)
    bucket = GetBucket(currentdump)
    orc_file=orc.ORCFile(s3.open_input_file(bucket+datasettype+'/'+filename))
    nstripes=orc_file.nstripes
    nrows=orc_file.nrows
    schema=orc_file.schema
    content_length=orc_file.content_length
    if Verbose:
        print(currentdump,datasettype,filename)
        print(f'nstripes={nstripes:,}')
        print(f'nrows={nrows:,}')
        print(f'contenlength={content_length:,} ( {round(content_length/1024**3,3):} GB / orc file)' )
        if FlagSample:
            x=orc_file.read_stripe(0)
            num_rows=x.num_rows
            d=x.take([0]).to_pydict()
            for k,v in d.items():
                print(f'key={k:20} value={v}')
        else:
            print(f'Schema=>\n{schema:}')
    return nrows
    

currentdump="2021-03-23"    

## Raw ORC Files

In [14]:
print("Current Dump :",currentdump)
datasettypelist=GetDatasetType(currentdump)
print("Dataset Type List",datasettypelist)
filelistpertype={}
for datasettype in datasettypelist:
    print("-"*80)
    filelistpertype[datasettype]=sorted(GetFileList(currentdump, datasettype))
    DisplayORCSchema(currentdump,datasettype,filelistpertype[datasettype][0])
    #print(datasettype,filelistpertype[datasettype])
    #break

Current Dump : 2021-03-23
Dataset Type List ['content', 'directory', 'directory_entry', 'origin', 'origin_visit', 'origin_visit_status', 'release', 'revision', 'revision_history', 'skipped_content', 'snapshot', 'snapshot_branch']
--------------------------------------------------------------------------------
2021-03-23 content graph-0418f7e1-c8cb-45aa-a09d-2cb3ba66cb59.orc
nstripes=285
nrows=154,569,080
contenlength=19,176,427,525 ( 17.859 GB / orc file)
key=sha1                 value=['8d36d019d977a180ad6f3dbd9a78c98f9e9d771e']
key=sha1_git             value=['20be9982bab6f00b8cbbe11f686399c32fe696a6']
key=sha256               value=['ee11de9513b84ab9e77244bfb1a19844d965ad898d6635cdd2d68e3d24a9b5cf']
key=blake2s256           value=['beda46256282fcfc85598a5af17d49267b8656aa84bd8d3715212e9d93068be2']
key=length               value=[9892]
key=status               value=['visible']
--------------------------------------------------------------------------------
2021-03-23 directory graph

In [16]:
# count total (raw) number of items

print("Current Dump :",currentdump)
datasettypelist=GetDatasetType(currentdump)

datasettypelist=['origin','release','revision','revision_history']


print("Dataset Type List",datasettypelist)
filelistpertype={}
datasettypestat={}

Nprint=40
for datasettype in datasettypelist:
    filelistpertype[datasettype]=sorted(GetFileList(currentdump, datasettype))
    nrowall=0
    print("__"*Nprint)
    print(f'datasettype={datasettype} NbFile {len(filelistpertype[datasettype])}')    
    for i,filename in enumerate(filelistpertype[datasettype]):
        if i==0:
            print("_ "*Nprint)
            nrow=DisplayORCSchema(currentdump,datasettype,filename,FlagSample=False)
            print("_ "*Nprint)
        else:
            print(".",end="")
            nrow=DisplayORCSchema(currentdump,datasettype,filename,Verbose=False,FlagSample=False)
        nrowall+=nrow
    print()
    print(f'Total Number of Row for datasettype {datasettype} {nrowall:,}')
    print("__"*Nprint)
    datasettypestat[datasettype]=nrowall

Current Dump : 2021-03-23
Dataset Type List ['origin', 'release', 'revision', 'revision_history']
________________________________________________________________________________
datasettype=origin NbFile 64
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2021-03-23 origin graph-0783af40-0b8d-4e2b-8484-cae9e2b86e45.orc
nstripes=1
nrows=2,403,490
contenlength=42,107,507 ( 0.039 GB / orc file)
Schema=>
url: string
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
...............................................................
Total Number of Row for datasettype origin 153,838,272
________________________________________________________________________________
________________________________________________________________________________
datasettype=release NbFile 64
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2021-03-23 release graph-01cfa319-ed0a-43ff-9aaf-3c98cf184df7.orc
nstripes=1
n

In [None]:
Current Dump : 2021-03-23
Total Number of Row for datasettype origin 153,838,272
Total Number of Row for datasettype release 17,269,997
Total Number of Row for datasettype revision 2,067,533,290
Total Number of Row for datasettype revision_history 2,120,402,773

The nodes are ordered by type in the following order: (origin, release, revision). 
The start and end indices (both included) for each type are provided here.

In [6]:
d={
   "RVindexMin": 156799786,
   "RVindexMax": 2224378839,
   "RV": 2067579054,
   "OindexMin": 0,
   "OindexMax": 139524532,
   "O": 139524533,
   "RLindexMin": 139524533,
   "RLindexMax": 156799785,
   "RL": 17275253
    }
# Origins            139 524 533
# Revisions        2 067 579 054
# Releases            17 275 253
# Revision History 2 120 402 773

print(datasettypestat)
#{'origin': 153838272, 'release': 17269997, 'revision': 2067533290, 'revision_history': 2120402773}

{'origin': 153838272, 'release': 17269997, 'revision': 2067533290, 'revision_history': 2120402773}
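
Because the three node types occupy contiguous index ranges, the type of a node can be recovered from its index alone. A minimal sketch, with the bounds copied from the dictionary `d` above; `node_type` is a hypothetical helper introduced here for illustration:

```python
from bisect import bisect_right

# Index ranges copied from the dictionary `d` above (both bounds included).
RANGES = {
    "O": (0, 139_524_532),
    "RL": (139_524_533, 156_799_785),
    "RV": (156_799_786, 2_224_378_839),
}

# First index of each type, in increasing order, for a binary search.
_STARTS = [lo for lo, _ in RANGES.values()]
_NAMES = list(RANGES)

def node_type(index):
    """Return 'O', 'RL' or 'RV' for a node index (hypothetical helper)."""
    if not 0 <= index <= RANGES["RV"][1]:
        raise ValueError(f"node index out of range: {index}")
    return _NAMES[bisect_right(_STARTS, index) - 1]

for i in (0, 139_524_532, 139_524_533, 156_799_786, 2_010_142_611):
    print(f"{i:>13,} -> {node_type(i)}")

# Each range size is max - min + 1, matching the "O", "RL" and "RV" entries of `d`.
sizes = {name: hi - lo + 1 for name, (lo, hi) in RANGES.items()}
print(sizes)
```

The node counts in `d` differ from the raw ORC row counts in `datasettypestat`; the comments in the following sections explain the differences for each node type.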


## Comment about origin nodes

There are only 139,515,969 origins out of 153,838,272 that have at least one complete 'full' visit.

8,564 origins in the node tables are the result of a bug during the import procedure. The associated indices are listed in the file '/Data/id_unreal_origins_20210321.tx'. Like approximately 20 million other origins from the initial dataset, these origins have no descendants.

Finally, in the derived graph we have:

-  119,199,726 origin nodes with one or more edges toward Revisions or Releases 
  
-  687,095,698 edges from Origin nodes to Release nodes
  
- 1,017,960,393 edges from Origin nodes to Revision nodes

## Comment about Release nodes

RL #nodes      17,275,253 || #edges RL>O=              0 | RL>RL=         41,351 | RL>RV=     17,225,899


Of the 17,269,997 releases in the ORC export files, 2,747 target directory or content nodes, which are not taken into account in the derived graph.

A few thousand "extra" releases are known only as targets of other releases.

Finally, in the derived graph we have:

-  17,275,253 Release nodes
  
-  41,351 edges from Release nodes to Release nodes
  
- 17,225,899 edges from Release nodes to Revision nodes

## Comment about Revision nodes

Extra Releases are known only as targets of known Releases from the ORC export files.

Edges that link the same source node to the same target node more than once are deduplicated. This affects 1,047,071 edges, most of which come from a single revision that lists the same parent 1,000,000 times.

Finally, in the derived graph we have:

-  2,067,579,054 Revision nodes
    
- 2,119,355,702 edges from Revision nodes to Revision nodes
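
The deduplication described above can be sketched on a toy edge list. This is only an illustration: `np.unique` is swapped in for whatever procedure was actually used on the full dataset, and the arrays `src` and `dst` are made up. The last lines check that the number of removed edges matches the difference between the `revision_history` row count and the final Revision-to-Revision edge count reported in this notebook.

```python
import numpy as np

# Toy edge list (source, target); the pair (5, 3) appears three times on purpose.
src = np.array([5, 5, 5, 2, 7], dtype=np.int64)
dst = np.array([3, 3, 3, 3, 5], dtype=np.int64)

# Stack into an (n_edges, 2) array and keep one row per distinct (source, target) pair.
pairs = np.stack([src, dst], axis=1)
unique_pairs = np.unique(pairs, axis=0)   # rows come back sorted lexicographically

removed = len(pairs) - len(unique_pairs)
print(unique_pairs.tolist(), "removed:", removed)

# Same bookkeeping with the counts reported in this notebook.
raw_rows = 2_120_402_773        # revision_history rows in the ORC export
kept_edges = 2_119_355_702      # RV>RV edges in the derived graph
print(f"{raw_rows - kept_edges:,}")
```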

## Comment about timestamps

Timestamps correspond to the author commit dates. They are stored in a dedicated `uint32` array as the number of seconds since EPOCH (1970/01/01 00:00 UTC).

Timestamps before EPOCH are set to 0.

Timestamps after EPOCH $+2^{32}-1$ seconds (i.e. Sunday, February 7, 2106, 06:28:15 UTC) are set to 4,294,967,295.

Unknown timestamps (for origin nodes for instance) are set to 4,294,967,295.
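
The three rules above amount to clamping each raw value into the `uint32` range and using the maximum value as a sentinel for unknown dates. A minimal sketch, assuming raw timestamps are given as Python integers (seconds since EPOCH) or `None` when unknown; `encode_timestamps` is a hypothetical helper, not the code used to build the dataset:

```python
import numpy as np
from datetime import datetime, timedelta, timezone

UINT32_MAX = 2**32 - 1   # 4,294,967,295, also used for unknown timestamps

def encode_timestamps(raw):
    """Clamp raw epoch seconds into uint32; None (unknown) maps to UINT32_MAX."""
    filled = [UINT32_MAX if t is None else t for t in raw]
    clipped = np.clip(np.array(filled, dtype=np.int64), 0, UINT32_MAX)
    return clipped.astype(np.uint32)

# Two out-of-range values, one in-range value, one unknown value.
raw = [-5259859740, 4295409069, 1616457600, None]
print(encode_timestamps(raw).tolist())

# Upper bound of the representable range, expressed as a UTC date.
print(datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=UINT32_MAX))
```

One consequence of this encoding is that the stored array alone cannot distinguish an unknown timestamp from one that overflowed, which is presumably why the original out-of-range values are kept in a separate dictionary.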


In [24]:
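# graphpath is repeated here so that this cell also runs in top-to-bottom order
graphpath='./Data/'
nodesad=pickle.load(open(graphpath+"nodesad_20240310.pkl","rb"))
nodesad_outlier=pickle.load(open(graphpath+"tsoutlier_20240310.pkl","rb"))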


In [34]:
print("nodesad shape",nodesad.shape,"type",type(nodesad[0]),"min",np.min(nodesad),"max",np.max(nodesad))
print("nodesad outlier type",type(nodesad_outlier),len(nodesad_outlier))
before=0;after=0;Max=5
for key,value in nodesad_outlier.items():
    if value<0 and before<Max:
        print("___ key",key,"value",value);before+=1
    elif value>2**32-1 and after<Max:
        print("___ key",key,"value",value);after+=1
    elif after==Max and before==Max:
        break


nodesad shape (2224378840,) type <class 'numpy.uint32'> min 0 max 4294967295
4294967295
nodesad outlier type <class 'dict'> 151140758
___ key 164426691 value -5259859740
___ key 179842428 value -5230915740
___ key 182063837 value -5787546300
___ key 189467664 value -601606800
___ key 204969036 value -6105326400
___ key 303413620 value 4295409069
___ key 353874907 value 7766279630
___ key 647517099 value 7465305799
___ key 700453005 value 7956831600
___ key 775285948 value 5700505748


In [49]:
print("Including origin nodes")
print(f'Timestamp==0       {np.sum(nodesad==0):>12,}')
print(f'Timestamp==2**32-1 {np.sum(nodesad==np.uint32(2**32-1)):>12,} including {d["OindexMax"]+1:,} origin nodes')
print()
print("Excluding origin nodes")
print(f'Timestamp==0       {np.sum(nodesad[d["RLindexMin"]:]==0):>12,}')
print(f'Timestamp==2**32-1 {np.sum(nodesad[d["RLindexMin"]:]==np.uint32(2**32-1)):>12,}')


Including origin nodes
Timestamp==0         11,181,312
Timestamp==2**32-1  139,959,446 including 139,524,533 origin nodes

Excluding origin nodes
Timestamp==0         11,181,312
Timestamp==2**32-1      434,913


## Summary statistics for the derived graph and its transpose

In [12]:
graphpath='./Data/'
nodes=pickle.load(open(graphpath+"nodes_20240310.pkl","rb"))
edges=pickle.load(open(graphpath+"edges_20240310.pkl","rb"))

# stats per node/edge type for the derived graph
output=f'stats per node/edge type for the derived graph'
print('-'*len(output));print(output);print('-'*len(output))
for dst in ["O","RL","RV"]:
    alledges=edges[nodes[d[dst+"indexMin"]]:nodes[d[dst+"indexMax"]+1]]
    nAll=len(alledges)
    nO=np.sum(alledges<=d["OindexMax"])
    nRV=np.sum(alledges>=d["RVindexMin"])
    nRL=nAll-nO-nRV
    output=f'{dst:2} #nodes {d[dst+"indexMax"]+1-d[dst+"indexMin"]:15,} '
    output+=f'|| #edges {dst+">O=":>5}{nO:15,} | {dst+">RL=":>6}{nRL:15,} | {dst+">RV=":>6}{nRV:15,}'

    print(output)

# check deduplication of the "one million parent revision"
print()
index=2010142611
print("Degree of the One million parent revision=",nodes[index+1]-nodes[index])
print()

# free arrays no longer needed
del nodes
del edges
del alledges

In [12]:
# stats per node/edge types for the transposed graph
output=f'stats per node/edge type for the transposed derived graph'
print('-'*len(output));print(output);print('-'*len(output))
nodes_transpose=pickle.load(open(graphpath+"nodes_transpose_20240310.pkl","rb"))
edges_transpose=pickle.load(open(graphpath+"edges_transpose_20240310.pkl","rb"))
# nodes transpose and edges transpose graph
for dst in ["O","RL","RV"]:
    alledges=edges_transpose[nodes_transpose[d[dst+"indexMin"]]:nodes_transpose[d[dst+"indexMax"]+1]]
    nAll=len(alledges)
    nO=np.sum(alledges<=d["OindexMax"])
    nRV=np.sum(alledges>=d["RVindexMin"])
    nRL=nAll-nO-nRV
    output=f'{dst:2} #nodes {d[dst+"indexMax"]+1-d[dst+"indexMin"]:15,} '
    output+=f'|| #edges {dst+">O=":>5}{nO:15,} | {dst+">RL=":>6}{nRL:15,} | {dst+">RV=":>6}{nRV:15,}'

    print(output)

# free the transposed derived graph and temporary array
del nodes_transpose
del edges_transpose
del alledges

----------------------------------------------
stats per node/edge type for the derived graph
----------------------------------------------
O  #nodes     139,524,533 || #edges  O>O=              0 |  O>RL=    687,095,698 |  O>RV=  1,017,960,393
RL #nodes      17,275,253 || #edges RL>O=              0 | RL>RL=         41,351 | RL>RV=     17,225,899
RV #nodes   2,067,579,054 || #edges RV>O=              0 | RV>RL=              0 | RV>RV=  2,119,355,702

Degree of the One million parent revision= 1

---------------------------------------------------------
stats per node/edge type for the transposed derived graph
---------------------------------------------------------
O  #nodes     139,524,533 || #edges  O>O=              0 |  O>RL=              0 |  O>RV=              0
RL #nodes      17,275,253 || #edges RL>O=    687,095,698 | RL>RL=         41,351 | RL>RV=              0
RV #nodes   2,067,579,054 || #edges RV>O=  1,017,960,393 | RV>RL=     17,225,899 | RV>RV=  2,119,355,702
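
Each `X>Y` count in the derived graph reappears as `Y>X` in the transposed graph because transposition simply reverses every edge. A minimal sketch on a toy CSR graph, assuming the offsets/targets layout used by `nodes` and `edges`; `transpose_csr` is a hypothetical helper and not the routine documented in SM03:

```python
import numpy as np

def transpose_csr(nodes, edges):
    """Return (nodes_t, edges_t) for the graph with every edge reversed."""
    n = len(nodes) - 1
    # Source of each edge, recovered from the offsets.
    sources = np.repeat(np.arange(n), np.diff(nodes))
    # The in-degree of each node becomes its out-degree in the transposed graph.
    nodes_t = np.concatenate(([0], np.cumsum(np.bincount(edges, minlength=n))))
    # A stable sort by target groups the reversed edges by their new source.
    order = np.argsort(edges, kind="stable")
    return nodes_t, sources[order]

# Toy graph: 0 -> 2, 0 -> 3, 1 -> 3, 2 -> 3
nodes = np.array([0, 2, 3, 4, 4])
edges = np.array([2, 3, 3, 3])

nodes_t, edges_t = transpose_csr(nodes, edges)
print(nodes_t.tolist(), edges_t.tolist())

# Transposing twice gives back the original graph.
nodes_tt, edges_tt = transpose_csr(nodes_t, edges_t)
print(nodes_tt.tolist(), edges_tt.tolist())
```

This in-memory approach is only practical for small graphs; at the scale of the main dataset the sort and the temporary `sources` array would each require several billion entries.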


See the supplemental materials for a more detailed study of the timestamp distribution.