# Higher-order data

XGI provides four ways to generate higher-order datasets:
1. Building node-by-node and edge-by-edge (described in Tutorial 1)
2. Generating a synthetic network from a model
3. Loading from XGI-DATA
4. Loading from a file

In this tutorial, we will describe methods 2-4.

In [1]:
import matplotlib.pyplot as plt
import xgi

## Generating synthetic data

XGI implements many models for generating synthetic higher-order networks. See the [generators documentation](https://xgi.readthedocs.io/en/stable/api/generators.html) for a full list. We will demonstrate a few common random models here, using the node count, degrees, and edge sizes of a real dataset as inputs.

In [None]:
H = xgi.load_xgi_data("email-enron")
H.cleanup()

n = H.num_nodes

In [None]:
H_rand = xgi.random_hypergraph(n, [0.005, 0.001], seed=2)
k = H.nodes.degree.asdict()
s = H.edges.size.asdict()
H_cl = xgi.chung_lu_hypergraph(k, s, seed=0)
H_hppm = xgi.uniform_HPPM(n, 3, 6, 0.95, seed=1)
H_sun = xgi.sunflower(5, 5, 10)
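
To give some intuition for what the list `[0.005, 0.001]` means in `random_hypergraph`, here is a naive pure-Python sketch of the idea: each entry is treated as the probability of including every possible edge of a given order (pairs first, then triangles). The function `naive_random_hypergraph` is a hypothetical helper written for illustration; it is not XGI's implementation and is far too slow for large `n`.

```python
import itertools
import random


def naive_random_hypergraph(n, ps, seed=None):
    """Return a list of hyperedges (as tuples) on nodes 0..n-1.

    ps[i] is the probability of including each possible edge of order i + 1,
    that is, each set of i + 2 nodes.
    """
    rng = random.Random(seed)
    edges = []
    for i, p in enumerate(ps):
        size = i + 2  # an edge of order i + 1 contains i + 2 nodes
        for candidate in itertools.combinations(range(n), size):
            if rng.random() < p:
                edges.append(candidate)
    return edges


edges = naive_random_hypergraph(10, [0.2, 0.05], seed=2)
print(len(edges), sorted({len(e) for e in edges}))
```

Passing the same seed reproduces the same edges, which is why the cell above fixes `seed` for each generator.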

In [None]:
plt.figure(figsize=(6, 6))
pos = xgi.pca_transform(xgi.pairwise_spring_layout(H_cl))
plt.subplot(221)
plt.title("Chung-Lu")
xgi.draw(H_cl, pos=pos)
plt.subplot(222)
plt.title("Random hypergraph")
xgi.draw(H_rand, pos=pos)
plt.subplot(223)
plt.title("Uniform HPPM")
pos = xgi.pca_transform(xgi.pairwise_spring_layout(H_hppm))
xgi.draw(H_hppm, pos=pos)
plt.subplot(224)
plt.title("Sunflower")
xgi.draw(H_sun, hull=True)
plt.show()

## XGI-DATA

XGI-DATA is an open-source repository of higher-order datasets in a standard JSON format:
* 27 datasets and counting
* A [**table**](https://xgi.readthedocs.io/en/stable/xgi-data.html) of the datasets with their basic statistics
* Hosted on Zenodo 

First, let's see all the datasets that are available:

In [None]:
xgi.load_xgi_data()  # calling this function without arguments returns the list of all datasets

Let's select a dataset.

In [None]:
H = xgi.load_xgi_data("email-eu")
print(H)

We can easily exclude edges above a certain order. The order of an edge is its size minus one, so `max_order=2` keeps only edges with at most three nodes:

In [None]:
H2 = xgi.load_xgi_data("email-eu", max_order=2)
print(H2)
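
As a quick illustration of the order convention, the following plain-Python sketch filters a small hand-made edge list the same way. The helper `filter_by_max_order` is hypothetical and only mirrors the idea; it does not call XGI.

```python
def filter_by_max_order(edges, max_order):
    """Keep only edges whose order (number of nodes minus one) is <= max_order."""
    return [e for e in edges if len(e) - 1 <= max_order]


edges = [{1, 2}, {1, 2, 3}, {2, 3, 4, 5}, {6}]
small = filter_by_max_order(edges, max_order=2)
print(small)  # the four-node edge is dropped; the others are kept
```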

Let's look at some statistics of the original dataset:

In [None]:
print(
    "The dataset is connected"
    if xgi.is_connected(H)
    else "The dataset is not connected"
)
print(f"The unique edge sizes are \n{xgi.unique_edge_sizes(H)}")

### Cleaning up

XGI provides a method called `cleanup` to easily tidy up higher-order datasets. Operations that `cleanup` can perform:
* Removing isolated nodes
* Removing singleton edges
* Removing multiedges
* Renaming nodes and edges to a standard labeling scheme
* Removing nodes and edges that are not part of the giant component

For example:

In [None]:
print(H.nodes)
print(H)
print(xgi.is_connected(H))

In [None]:
H.cleanup()
print(H.nodes)
print(H)
print(xgi.is_connected(H))

Don't worry, the old labels are not lost! After relabeling, each node keeps its original label as a node attribute:

In [None]:
H.nodes[0]
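
To make the cleanup operations listed above more concrete, here is a minimal pure-Python sketch of four of them (singletons, multiedges, isolated nodes, relabeling) on a toy hypergraph. The function `cleanup_sketch` is a hypothetical illustration, not XGI's code, and it leaves out the giant-component step for brevity.

```python
def cleanup_sketch(nodes, edges):
    """Tidy a hypergraph given as a node list and a list of edges.

    Returns (new_edges, old_labels): new_edges use integer labels 0..k-1,
    and old_labels[new_label] gives the original node label.
    """
    # Drop singleton edges (fewer than two distinct nodes).
    kept = [frozenset(e) for e in edges if len(set(e)) > 1]
    # Drop multiedges, keeping the first copy and preserving order.
    unique = list(dict.fromkeys(kept))
    # Drop isolated nodes: keep only nodes that appear in a remaining edge.
    used = set().union(*unique) if unique else set()
    remaining = [v for v in nodes if v in used]
    # Relabel nodes as consecutive integers, remembering the old labels.
    new_label = {old: new for new, old in enumerate(remaining)}
    new_edges = [sorted(new_label[v] for v in e) for e in unique]
    old_labels = {new: old for old, new in new_label.items()}
    return new_edges, old_labels


nodes = ["a", "b", "c", "d", "e"]
edges = [["a", "b"], ["b", "a"], ["c"], ["b", "c", "d"]]
new_edges, old_labels = cleanup_sketch(nodes, edges)
print(new_edges)   # [[0, 1], [1, 2, 3]]
print(old_labels)  # {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
```

Here the duplicate edge, the singleton `["c"]`, and the isolated node `"e"` are all removed, and the mapping back to the old labels is kept alongside the result.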

In [None]:
# Download the dataset and save it to the current directory
xgi.download_xgi_data("email-enron")
# The dataset now lives in email-enron.json

In [None]:
Hlocal = xgi.load_xgi_data("email-enron", read=True)  # now we are loading locally!
print(Hlocal)

## Read and write

XGI can read and write four different formats:
* JSON (same format as XGI-DATA)
* Hyperedge list
* Bipartite edge list
* Incidence matrix

We start with JSON, which is identical to the format of the XGI-DATA datasets. A benefit of this format is that it stores the attributes of nodes, edges, and the hypergraph itself.

In [None]:
# Write the example hypergraph to a JSON file
xgi.write_json(Hlocal, "hypergraph_json.json")
# Load the file just written and store it in a new hypergraph
H_json = xgi.read_json("hypergraph_json.json")

We can also read/write a hyperedge list. In this case, each line tabulates the node ids of each edge. Pros: compact. Cons: Can't store attributes.

In [None]:
# Write the hypergraph to a file as a hyperedge list
xgi.write_edgelist(H, "hyperedge_list.csv", delimiter=",")
# Read the file just written as a new hypergraph
H_el = xgi.read_edgelist("hyperedge_list.csv", delimiter=",", nodetype=int)
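
To show what such a file looks like, here is a small standard-library sketch that writes a toy edge list to an in-memory buffer and parses it back. It is independent of XGI; the toy `edges` and the variable names are made up for illustration.

```python
import csv
import io

edges = [[1, 2, 3], [3, 4], [1, 5, 6, 7]]

# Write: one edge per line, node IDs separated by commas.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(edges)
text = buffer.getvalue()
print(text)

# Read: split each line and cast node IDs back to int,
# which is the role `nodetype=int` plays in the cell above.
parsed = [[int(v) for v in row] for row in csv.reader(io.StringIO(text))]
```

Note that lines can have different lengths, and that there is no place to record edge IDs or attributes.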

Next, we can read/write an incidence matrix. In this format, rows represent the node IDs, and the columns represent the edge IDs. Pros: Easy to convert to a NumPy array or pandas DataFrame. Cons: Non-compact representation, no attributes, not easily readable.

In [None]:
# Write the hypergraph as an incidence matrix
xgi.write_incidence_matrix(H, "incidence.csv", delimiter=",")
# Read the file just written as a new hypergraph
H_im = xgi.read_incidence_matrix("incidence.csv", delimiter=",")
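
For intuition, the sketch below builds the incidence matrix of a toy hypergraph as a nested list, with one row per node and one column per edge. It uses only plain Python, and the edge IDs `"e0"` and `"e1"` are invented for the example.

```python
edges = {"e0": [1, 2, 3], "e1": [3, 4]}

nodes = sorted({v for members in edges.values() for v in members})
edge_ids = list(edges)

# matrix[i][j] is 1 if nodes[i] belongs to edge_ids[j], and 0 otherwise.
matrix = [[int(v in edges[e]) for e in edge_ids] for v in nodes]
for row in matrix:
    print(row)
```

Most entries of such a matrix are zero for real datasets, which is why this format is not compact.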

Finally, we can read/write a bipartite edge list. In this format, each line is composed of two entries: column 1 is the ID of the node, and column 2 is the ID of the edge to which that node belongs. Pros: fixed number of columns, compact. Cons: again, no attributes.
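
In XGI, this format is handled by `xgi.write_bipartite_edgelist` and `xgi.read_bipartite_edgelist`; check the documentation of your installed version for the exact arguments. The sketch below does not call XGI. It only illustrates the two-column layout on a toy hypergraph using plain Python, with made-up edge IDs `0` and `1`.

```python
from collections import defaultdict

edges = {0: [1, 2, 3], 1: [3, 4]}

# Write: one "node ID,edge ID" pair per line.
lines = [
    f"{node},{edge_id}"
    for edge_id, members in edges.items()
    for node in members
]
print("\n".join(lines))

# Read: group node IDs by the edge ID found in the second column.
rebuilt = defaultdict(list)
for line in lines:
    node, edge_id = (int(x) for x in line.split(","))
    rebuilt[edge_id].append(node)
```

Unlike the hyperedge list, every line has exactly two columns, and the edge IDs are preserved.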

## Challenge

* What is the average multiplicity of an edge in the `contact-high-school` dataset? Hint: use `cleanup()`.
* Load the `congress-bills` dataset excluding edges of order 11 and larger and save it as a tab-delimited hyperedge list.