# Cross-validation experiments with networks

[Run notebook in Google Colab](https://colab.research.google.com/github/pathpy/pathpy/blob/master/doc/tutorial/cross_validation.ipynb)

`pathpy` provides basic support for cross-validation experiments. In particular, the `train_test_split` function creates train and test splits of a network. Its semantics and arguments are similar to those of the [corresponding function in `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

To demonstrate the use, we generate a random graph:

In [None]:
pip install git+https://github.com/pathpy/pathpy.git

In [1]:
import pathpy as pp

n = pp.generators.ER_np(100, 0.04)
print(n)
n.plot()

Uid:			0x26c2f33f160
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	214


To generate a test and a training network, where the test network contains a random 25 % of the nodes, we can write:

In [2]:
test, train = pp.algorithms.evaluation.train_test_split(n, test_size=0.25)
print(test)
print(train)

Uid:			0x26c2f33f160_test
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	25
Number of edges:	17
Uid:			0x26c2f33f160_train
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	75
Number of edges:	119


The function returns two new `Network` instances, the test network first and the training network second. Both refer to the same node and edge objects as the original network, so the split requires little additional memory, and the original network instance is not changed. The uids of the new networks are the original uid with the suffix `_test` and `_train`, respectively.
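As a rough illustration of what a node-based split does, the following plain-Python sketch partitions the nodes of a small ring graph and keeps only the edges that lie entirely inside one part. It is not pathpy's implementation: the helper `node_split`, the truncation of the test-set size, and the toy graph are assumptions made for this example. In the output above, 17 + 119 = 136 of the 214 original edges survive, which is consistent with edges between the two node sets being dropped.

```python
import random


def node_split(nodes, edges, test_size=0.25, seed=0):
    """Hypothetical helper: split an undirected edge list by nodes."""
    rng = random.Random(seed)
    # Assumption: the number of test nodes is truncated to an integer.
    n_test = int(test_size * len(nodes))
    test_nodes = set(rng.sample(sorted(nodes), n_test))
    train_nodes = set(nodes) - test_nodes
    # Keep an edge only if both endpoints are in the same part.
    test_edges = [(u, v) for u, v in edges
                  if u in test_nodes and v in test_nodes]
    train_edges = [(u, v) for u, v in edges
                   if u in train_nodes and v in train_nodes]
    return (test_nodes, test_edges), (train_nodes, train_edges)


# Toy graph: a ring with 8 nodes and 8 edges.
nodes = list(range(8))
edges = [(i, (i + 1) % 8) for i in range(8)]

(test_nodes, test_edges), (train_nodes, train_edges) = node_split(nodes, edges)
lost = len(edges) - len(test_edges) - len(train_edges)
print("test nodes:", len(test_nodes), "train nodes:", len(train_nodes))
print("edges lost across the cut:", lost)
```

In a ring, removing two nodes cuts at least two edges, so `lost` is never zero here, regardless of which nodes are sampled.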

By default, the split is made on the nodes, and the train and test networks each contain only the edges whose endpoints both fall into the corresponding node set. Edges that connect a training node with a test node are therefore lost, which is why the two networks above contain fewer edges in total than the original network. To preserve the number of edges, we can set the `split` argument to `'edge'`. This samples a random fraction of the edges and adds all nodes to both networks, i.e. the node sets of the two networks are identical, and the edge counts of the training and test networks add up to the number of edges in the original network.
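The edge-based variant can be sketched in plain Python in the same spirit. Again, this is only an illustration under stated assumptions (the helper `edge_split`, the truncated test-set size, and the toy ring graph are not taken from pathpy): a random subset of the edges goes to the test part, the rest to the training part, and both parts keep the full node set.

```python
import random


def edge_split(nodes, edges, test_size=0.25, seed=0):
    """Hypothetical helper: split an edge list by edges, keeping all nodes."""
    rng = random.Random(seed)
    # Assumption: the number of test edges is truncated to an integer.
    n_test = int(test_size * len(edges))
    test_idx = set(rng.sample(range(len(edges)), n_test))
    test_edges = [e for i, e in enumerate(edges) if i in test_idx]
    train_edges = [e for i, e in enumerate(edges) if i not in test_idx]
    # Both parts share the complete node set.
    return (set(nodes), test_edges), (set(nodes), train_edges)


# Toy graph: a ring with 8 nodes and 8 edges.
nodes = list(range(8))
edges = [(i, (i + 1) % 8) for i in range(8)]

(test_node_set, test_edges), (train_node_set, train_edges) = edge_split(nodes, edges)
print("test edges:", len(test_edges), "train edges:", len(train_edges))
print("identical node sets:", test_node_set == train_node_set)
```

Because every edge lands in exactly one of the two parts, no edge is lost, at the price of test and training networks that share all of their nodes.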

In [3]:
test, train = pp.algorithms.evaluation.train_test_split(n, test_size=0.25, split='edge')
print(test)
print(train)

Uid:			0x26c2f33f160_test
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	53
Uid:			0x26c2f33f160_train
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	161


Alternatively, we can specify the size of the training set via `train_size`:

In [4]:
test, train = pp.algorithms.evaluation.train_test_split(n, train_size=0.25, split='edge')
print(test)
print(train)

Uid:			0x26c2f33f160_test
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	160
Uid:			0x26c2f33f160_train
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	54
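
The counts printed in this tutorial are consistent with a simple rule: take `test_size` as the complement of `train_size` if only the latter is given, then truncate `test_size` times the number of items to an integer. The sketch below spells out that rule. The helper `resolve_test_count` is hypothetical, and the rounding rule is an inference from the outputs shown here, not something confirmed by pathpy's documentation.

```python
def resolve_test_count(n_items, test_size=None, train_size=None):
    """Hypothetical helper: number of items that end up in the test part."""
    if test_size is None and train_size is None:
        raise ValueError("either test_size or train_size must be given")
    if test_size is None:
        # Assumption: the test part is the complement of the training part.
        test_size = 1.0 - train_size
    # Assumption: fractional counts are truncated.
    return int(test_size * n_items)


print(resolve_test_count(100, test_size=0.25))    # 25 test nodes
print(resolve_test_count(214, test_size=0.25))    # 53 test edges
print(resolve_test_count(214, train_size=0.25))   # 160 test edges
```

With 214 edges, this gives 53 test and 161 training edges for `test_size=0.25`, and 160 test and 54 training edges for `train_size=0.25`, matching the outputs above.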


Apart from static networks, we can also create cross-validation sets for temporal networks. For this, we first load a temporal network from the KONECT database:

In [5]:
tn = pp.io.konect.read_konect_name('sociopatterns-hypertext')
print(tn)
tn.plot()

Uid:			0x26c0fb92748
Type:			TemporalNetwork
Directed:		False
Multi-Edges:		True
Number of unique nodes:	113
Number of unique edges:	2196
Number of temp nodes:	113
Number of temp edges:	20818
Observation periode:	1246255220 - 1246467561.0

Network attributes
------------------
category:	HumanContact
code:	HY
name:	Hypertext 2009
description:	Visitor–visitor face-to-face contacts
extr:	sociopatterns
url:	http://www.sociopatterns.org/
long-description:	This is the network of face-to-face contacts of the attendees of the ACM Hypertext 2009 conference. The ACM Conference on Hypertext and Hypermedia 2009 (HT 2009, http://www.ht2009.org/) was held in Turin, Italy over three days from June 29 to July 1, 2009. In the network, a node represents a conference visitor, and an edge represents a face-to-face contact that was active for at least 20 seconds. Multiple edges denote multiple contacts. Each edge is annotated with the time at which the contact took place.
entity-names:	visitor
relationsh

We can call the same function on a temporal network instance. By default, the split will be made based on the observed interactions, i.e. in the following example the first 75 % of all time-stamped interactions will be included in the training network, while the last 25 % will be included in the test network. 
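A split by interaction count can be sketched in plain Python on a list of time-stamped edges: sort the interactions by time and cut the ordered list after the first 75 %. The helper `split_by_interactions` and the toy events are assumptions for illustration only and do not reproduce pathpy's implementation.

```python
def split_by_interactions(events, test_size=0.25):
    """Hypothetical helper: events are (source, target, timestamp) tuples."""
    ordered = sorted(events, key=lambda e: e[2])
    # Assumption: the number of test interactions is truncated to an integer.
    n_test = int(test_size * len(ordered))
    cut = len(ordered) - n_test
    # Earliest interactions form the training part, the latest the test part.
    return ordered[cut:], ordered[:cut]


events = [
    ("a", "b", 10), ("b", "c", 20), ("a", "c", 30), ("c", "d", 40),
    ("a", "d", 50), ("b", "d", 60), ("a", "b", 70), ("c", "d", 80),
]

test_events, train_events = split_by_interactions(events)
print("train:", len(train_events), "last timestamp:", train_events[-1][2])
print("test:", len(test_events), "first timestamp:", test_events[0][2])
```

Each part only contains the nodes that take part in its own interactions, so the node sets of the training and test networks can differ, as they do in the pathpy output that follows.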

In [6]:
test, train = pp.algorithms.evaluation.train_test_split(tn, test_size=0.25)
print(train)
print(test)

Uid:			0x26c0fb92748_train
Type:			TemporalNetwork
Directed:		False
Multi-Edges:		True
Number of unique nodes:	112
Number of unique edges:	1854
Number of temp nodes:	112
Number of temp edges:	15614
Observation periode:	1246255220 - 1246441061.0
Uid:			0x26c0fb92748_test
Type:			TemporalNetwork
Directed:		False
Multi-Edges:		True
Number of unique nodes:	95
Number of unique edges:	713
Number of temp nodes:	95
Number of temp edges:	5204
Observation periode:	1246441080 - 1246467561.0


In [7]:
train.plot()

In [8]:
test.plot()

We can also split based on the observed time, i.e. here we include all interactions occurring within the first 75 % of the observed time period in the training network, while the remaining interactions are included in the test network.
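Read literally, the description above places the boundary at `t_min + 0.75 * (t_max - t_min)`. The plain-Python sketch below implements that reading on a toy event list. The helper `split_by_time` and the handling of interactions exactly on the boundary are assumptions, and pathpy's own choice of boundary is not spelled out in this tutorial and may differ.

```python
def split_by_time(events, test_size=0.25):
    """Hypothetical helper: events are (source, target, timestamp) tuples."""
    times = [t for _, _, t in events]
    t_min, t_max = min(times), max(times)
    # Boundary after the first (1 - test_size) share of the observation period.
    boundary = t_min + (1.0 - test_size) * (t_max - t_min)
    # Assumption: interactions exactly on the boundary go to the training part.
    train_events = [e for e in events if e[2] <= boundary]
    test_events = [e for e in events if e[2] > boundary]
    return test_events, train_events, boundary


# Interactions are not spread evenly over the observation period.
events = [
    ("a", "b", 0), ("b", "c", 10), ("a", "c", 20), ("c", "d", 30),
    ("a", "d", 40), ("b", "d", 50), ("a", "b", 60), ("c", "d", 100),
]

test_events, train_events, boundary = split_by_time(events)
print("boundary:", boundary)
print("train:", len(train_events), "test:", len(test_events))
```

Because interactions are rarely spread evenly over time, a time-based split can produce parts of very different size. In the pathpy output that follows, the test network holds only 3 of the 20818 interactions.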

In [9]:
test, train = pp.algorithms.evaluation.train_test_split(tn, test_size=0.25, split='time')
print(train)
print(test)

Uid:			0x26c0fb92748_train
Type:			TemporalNetwork
Directed:		False
Multi-Edges:		True
Number of unique nodes:	113
Number of unique edges:	2196
Number of temp nodes:	113
Number of temp edges:	20815
Observation periode:	1246255220 - 1246467541.0
Uid:			0x26c0fb92748_test
Type:			TemporalNetwork
Directed:		False
Multi-Edges:		True
Number of unique nodes:	5
Number of unique edges:	3
Number of temp nodes:	5
Number of temp edges:	3
Observation periode:	1246467560 - 1246467561.0
