# Properties and statistics

In this notebook, we will explore how to access properties and statistics of the nodes and edges, and of the network as a whole. We will:
* Demonstrate the capabilities of `NodeView` and `EdgeView`
* Present the statistics interface for accessing node/edge statistics
* Present some algorithms for quantifying network structure.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import xgi

In [None]:
H = xgi.load_xgi_data("kaggle-whats-cooking")
print(H)

## NodeView and EdgeView

*Views* offer read-only access to the nodes, edges, and their properties. For example, one can
* See what nodes are contained in an edge
* See which edges are maximal
* Find isolated nodes
* Access nodal attributes

For example:

In [None]:
H.edges.singletons()

In [None]:
H.edges.members("2088")

In [None]:
H.nodes["2974"]

In [None]:
H.edges - H.edges.maximal()

In [None]:
H.edges.members("67")

In [None]:
[H.nodes[n]["name"] for n in H.edges.members("67")]

In [None]:
H.nodes.neighbors("20")
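
To make the notion of a maximal edge concrete, here is a library-independent sketch in plain Python. It assumes the common definition that an edge is maximal when it is not a proper subset of any other edge. The toy edge dictionary and the helper name `maximal_edges` are illustrative only and are not part of XGI.

```python
# Toy hypergraph: edge ID -> set of member nodes (illustrative data, not from the dataset).
edges = {
    "e0": {1, 2, 3},
    "e1": {2, 3},
    "e2": {3, 4, 5},
    "e3": {5},
}


def maximal_edges(edges):
    """Return the IDs of edges that are not a proper subset of any other edge."""
    return {
        eid
        for eid, members in edges.items()
        if not any(members < other for other in edges.values())
    }


maximal = maximal_edges(edges)
non_maximal = set(edges) - maximal  # same idea as H.edges - H.edges.maximal()
print(sorted(maximal))      # ['e0', 'e2']
print(sorted(non_maximal))  # ['e1', 'e3']
```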

## Statistics

You may have noticed that most of the functionality in the `Hypergraph` and `SimplicialComplex` classes is concerned with modifying the underlying structure of the network, and that these classes provide very limited functionality for computing statistics (a.k.a. measures) from the network. Computing statistics is instead handled by the `stats` package, explored here.

The stats package is one of the features that sets `xgi` apart from other libraries.  It
provides a common interface to all statistics that can be computed from a network, its
nodes, or edges.

### Introduction to Stat objects

Consider the degree of the nodes of a network `H`.  After computing the values of the
degrees, one may wish to store them in a dict, a list, an array, a dataframe, etc.
Through the stats package, `xgi` provides a simple interface that seamlessly allows for
this type conversion.  This is done via the `NodeStat` class.

In [None]:
import xgi

H = xgi.Hypergraph([[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]])
H.nodes.degree

This `NodeStat` object is essentially a wrapper over a function that computes the
degrees of all nodes.  One of the main features of `NodeStat` objects is lazy
evaluation: `H.nodes.degree` will not compute the degrees of nodes until a specific
output format is requested.

In [None]:
H.nodes.degree.asdict()

In [None]:
H.nodes.degree.aslist()

In [None]:
H.nodes.degree.asnumpy()
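
The idea of lazy evaluation plus type conversion can be illustrated without XGI. The sketch below is a toy stand-in, not the library's implementation: the class name `LazyStat` and its `calls` counter are hypothetical and exist only to show that nothing is computed until an output format is requested.

```python
class LazyStat:
    """Toy stand-in for a stat object: stores a function and evaluates it on demand."""

    def __init__(self, func, nodes):
        self.func = func
        self.nodes = list(nodes)
        self.calls = 0  # number of times the wrapped function was actually evaluated

    def _compute(self):
        self.calls += 1
        return {n: self.func(n) for n in self.nodes}

    def asdict(self):
        return self._compute()

    def aslist(self):
        return list(self._compute().values())


edges = [[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]]


def degree(node):
    """Number of edges that contain the node."""
    return sum(node in e for e in edges)


stat = LazyStat(degree, nodes=[1, 2, 3, 4, 5])
print(stat.calls)     # 0: nothing has been computed yet
print(stat.asdict())  # {1: 1, 2: 2, 3: 3, 4: 2, 5: 2}
print(stat.aslist())  # [1, 2, 3, 2, 2]
print(stat.calls)     # 2: in this toy, one evaluation per requested format
```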

To compute the degrees of a subset of the nodes, call `degree` from a smaller `NodeView`.

In [None]:
H.nodes([3, 4, 5]).degree.asdict()

Alternatively, to compute the degree of a single node, use square brackets.

In [None]:
H.nodes.degree[4]

Make sure the accessed node is in the underlying view.

In [None]:
# This will raise an exception
# because node 4 is not in the view [1, 2, 3]
#
# H.nodes([1, 2, 3]).degree[4]
#

Positional and keyword arguments may be passed to `NodeStat` objects. They are stored and
used when the evaluation finally takes place.  For example, use the `order` keyword of
`degree` to count only those edges of the specified order.

In [None]:
H.nodes.degree(order=3)

In [None]:
H.nodes.degree(order=3).aslist()
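
The effect of `order` can be checked by hand on the same three edges. The plain-Python sketch below relies on the convention that the order of an edge is its size minus one, so `order=3` selects edges with four members. The helper `degree_by_order` is hypothetical and only mirrors the idea; it is not XGI code.

```python
edges = [[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]]
nodes = [1, 2, 3, 4, 5]


def degree_by_order(node, order=None):
    """Count edges containing `node`; if `order` is given, count only edges of that order."""
    return sum(
        node in e
        for e in edges
        if order is None or len(e) - 1 == order
    )


orders = [len(e) - 1 for e in edges]
degree_order_3 = {n: degree_by_order(n, order=3) for n in nodes}
print(orders)          # [2, 3, 2]
print(degree_order_3)  # {1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
```

Only the edge `[2, 3, 4, 5]` has order 3, so node 1 gets degree 0 and every other node gets degree 1.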

The stats package provides some convenience functions for numerical operations.

In [None]:
H.nodes.degree.max(), H.nodes.degree.min()

In [None]:
import numpy as np

st = H.nodes([1, 2, 3]).degree(order=3)
np.round([st.max(), st.min(), st.mean(), st.median(), st.var(), st.std()], 3)

As a convenience, each node statistic may also be accessed directly through the network itself.

In [None]:
H.degree()

Note, however, that `H.degree` is a method that simply returns a dict, not a `NodeStat` object, and thus does not support the features discussed above.

## Node attributes

Node attributes can be thought of as a node-object mapping, and thus they can also be accessed using the `NodeStat` interface and all its functionality.

In [None]:
H.add_nodes_from(
    [
        (1, {"color": "red", "name": "horse"}),
        (2, {"color": "blue", "name": "pony"}),
        (3, {"color": "yellow", "name": "zebra"}),
        (4, {"color": "red", "name": "orangutan", "age": 20}),
        (5, {"color": "blue", "name": "fish", "age": 2}),
    ]
)

Access all attributes of all nodes by specifying a return type.

In [None]:
H.nodes.attrs.asdict()

Access all attributes of a single node by using square brackets.

In [None]:
H.nodes.attrs[1]

Access a single attribute of all nodes by specifying a return type.

In [None]:
H.nodes.attrs("color").aslist()

If a node does not have the specified attribute, `None` will be used.

In [None]:
H.nodes.attrs("age").asdict()

Use the `missing` keyword argument to change the imputed value.

In [None]:
H.nodes.attrs("age", missing=100).asdict()
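
The imputation behavior described above resembles `dict.get` with a default value. The following plain-Python sketch reproduces it on the same attribute data; the helper `attr_values` is a hypothetical name used for illustration and does not exist in XGI.

```python
attrs = {
    1: {"color": "red", "name": "horse"},
    2: {"color": "blue", "name": "pony"},
    3: {"color": "yellow", "name": "zebra"},
    4: {"color": "red", "name": "orangutan", "age": 20},
    5: {"color": "blue", "name": "fish", "age": 2},
}


def attr_values(attrs, key, missing=None):
    """Map each node to its value for `key`, or to `missing` if the node lacks it."""
    return {n: a.get(key, missing) for n, a in attrs.items()}


print(attr_values(attrs, "age"))               # {1: None, 2: None, 3: None, 4: 20, 5: 2}
print(attr_values(attrs, "age", missing=100))  # {1: 100, 2: 100, 3: 100, 4: 20, 5: 2}
```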

## Filtering

`NodeView` objects can be filtered by the value of any existing `NodeStat` via the `filterby` method.

In [None]:
H.degree()

In [None]:
H.nodes.filterby("degree", 2)  # apply the filter to all nodes

In [None]:
H.nodes([1, 2, 3]).filterby(
    "degree", 2
)  # apply the filter only to the subset of nodes [1, 2, 3]

Nodes can be filtered by attribute via the `filterby_attr` method.

In [None]:
H.nodes.filterby_attr("color", "red")

Since `filterby*` methods return a `NodeView` object, multiple filters can be chained, as well as other `NodeStat` calls. The following call computes the local clustering coefficient of those nodes with degree equal to 2 and "color" attribute equal to "blue", and outputs the result as a dict.

In [None]:
(
    H.nodes.filterby("degree", 2)
    .filterby_attr("color", "blue")
    .clustering_coefficient.asdict()
)

For example, here is how to access the nodes with maximum degree.

In [None]:
H.nodes.filterby("degree", H.nodes.degree.max())
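
Conceptually, filtering by a statistic means computing the statistic and keeping the nodes whose value matches. The sketch below does this in plain Python for the toy hypergraph; `filter_by` is a hypothetical helper that handles equality only and is not the library's implementation.

```python
edges = [[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]]
nodes = [1, 2, 3, 4, 5]

# Degree of each node: the number of edges that contain it.
degree = {n: sum(n in e for e in edges) for n in nodes}


def filter_by(stat, value):
    """Keep the nodes whose stat equals `value`."""
    return [n for n, v in stat.items() if v == value]


print(filter_by(degree, 2))                     # [2, 4, 5]
print(filter_by(degree, max(degree.values())))  # [3]
```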

## Set operations

Another way of chaining multiple results of `filterby*` methods is by using set operations. Indeed, chaining two filters is the same as intersecting the results of two separate calls:

In [None]:
print(H.nodes.filterby("degree", 2).filterby_attr("color", "blue"))
print(H.nodes.filterby("degree", 2) & H.nodes.filterby_attr("color", "blue"))

Other set operations are also supported.

In [None]:
nodes1 = H.nodes.filterby("degree", 2)
nodes2 = H.nodes.filterby_attr("color", "blue")
print(f"nodes1 - nodes2 = {nodes1 - nodes2}")
print(f"nodes2 - nodes1 = {nodes2 - nodes1}")
print(f"nodes1 & nodes2 = {nodes1 & nodes2}")
print(f"nodes1 | nodes2 = {nodes1 | nodes2}")
print(f"nodes1 ^ nodes2 = {nodes1 ^ nodes2}")
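
The equivalence between chaining and intersection can be verified with ordinary Python sets. This is a library-independent sketch on the toy hypergraph and its `color` attribute, assuming the same degrees and colors as above.

```python
edges = [[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]]
color = {1: "red", 2: "blue", 3: "yellow", 4: "red", 5: "blue"}
degree = {n: sum(n in e for e in edges) for n in color}

nodes1 = {n for n in color if degree[n] == 2}      # nodes with degree 2
nodes2 = {n for n in color if color[n] == "blue"}  # nodes colored blue

# Applying the second filter to the result of the first ...
chained = {n for n in nodes1 if color[n] == "blue"}
# ... selects the same nodes as intersecting the two separate results.
assert chained == nodes1 & nodes2

print(sorted(nodes1 - nodes2))  # [4]
print(sorted(nodes2 - nodes1))  # []
print(sorted(nodes1 & nodes2))  # [2, 5]
print(sorted(nodes1 | nodes2))  # [2, 4, 5]
print(sorted(nodes1 ^ nodes2))  # [4]
```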

## Edge statistics

Every feature showcased above (lazy evaluation, type conversion, filtering, and set operations) is supported for edge-quantity or edge-attribute mappings via `EdgeStat` objects. Several statistics can also be combined into a single object with `multi`, as in the last cell of this section.

In [None]:
H.edges.order

In [None]:
H.edges.order.asdict()

In [None]:
H.edges.filterby("order", 3)

In [None]:
H.edges.multi(["order", "size"]).aspandas()

## User-defined statistics

Suppose during the course of your research you come up with a new node-level statistic. For the purpose of this tutorial, we are going to define a statistic called `user_degree`. The `user_degree` of a node is simply its standard degree times 10.

Since this is also a node-quantity mapping, we would like to give it the same interface as `degree` and all the other `NodeStat`s. The stats package provides a simple way to do this. Simply use the `nodestat_func` decorator.

In [None]:
@xgi.nodestat_func
def user_degree(net, bunch):
    """The user degree of a bunch of nodes in net."""
    return {n: 10 * net.degree(n) for n in bunch}

Now `user_degree` is a valid stat that can be computed on any hypergraph:

In [None]:
H.nodes.user_degree.asdict()

Every feature showcased above is available for use with `user_degree`, including filtering nodes and multi-stat objects.

In [None]:
H.nodes.filterby("user_degree", 20)

The `@xgi.nodestat_func` decorator works on any function or callable that admits two parameters: `net` and `bunch`, where `net` is the network and `bunch` is an iterable of nodes in `net`. Additionally, the function must return a dictionary with pairs of the form `node: value`, where `node` is an element of `bunch`. The library will take care of type conversions, but the output value of this function must always be a dict.
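
Because the contract is just "take `net` and `bunch`, return a dict keyed by the nodes in `bunch`", the undecorated function can be checked on its own before registering it. The sketch below does so with a hypothetical stand-in class `ToyNet` that provides only a `degree(n)` method; it is an assumption made for illustration and is not an XGI class.

```python
class ToyNet:
    """Minimal stand-in for a network: it only knows how to report a node's degree."""

    def __init__(self, edges):
        self.edges = edges

    def degree(self, n):
        return sum(n in e for e in self.edges)


def user_degree(net, bunch):
    """The user degree of a bunch of nodes in net."""
    return {n: 10 * net.degree(n) for n in bunch}


net = ToyNet([[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]])
bunch = [2, 3]
result = user_degree(net, bunch)

# The two requirements of the contract: a dict, with one entry per node in the bunch.
assert isinstance(result, dict)
assert set(result) == set(bunch)
print(result)  # {2: 20, 3: 30}
```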

User-defined edge statistics can similarly be defined using the `@xgi.edgestat_func` decorator.

## Algorithms

Generally speaking, the algorithmic methods available in XGI are metrics related to
* clustering coefficient
* assortativity
* connectedness
* path lengths
* centrality
* general measures.

The full list is available on the [Read the Docs](https://xgi.readthedocs.io/en/stable/api/algorithms.html) page.

In [None]:
H = xgi.load_xgi_data("email-enron")

Some generally useful functions:

In [None]:
print(xgi.degree_counts(H))
print(xgi.max_edge_order(H))
print(xgi.unique_edge_sizes(H))

### Connectedness

XGI provides tools for analyzing the connectedness of a hypergraph, that is, whether and how it splits into connected components.

In [None]:
print(xgi.is_connected(H))
print(f"The number of connected components is {xgi.number_connected_components(H)}")
c = xgi.connected_components(H)  # iterator over the components
cc = [len(i) for i in c]
print(cc)
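
For intuition about what is being counted, here is a minimal plain-Python sketch of connected components in a hypergraph, where two nodes belong to the same component if a chain of overlapping edges links them. It is a simple depth-first traversal written for illustration, under the assumption that every edge member appears in `nodes`; it is not XGI's implementation.

```python
def connected_components(nodes, edges):
    """Yield the node set of each connected component."""
    # Map each node to the indices of the edges that contain it.
    memberships = {n: [] for n in nodes}
    for i, e in enumerate(edges):
        for n in e:
            memberships[n].append(i)

    seen = set()
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            # Every co-member of every edge containing n is reachable from n.
            for i in memberships[n]:
                stack.extend(m for m in edges[i] if m not in component)
        seen |= component
        yield component


nodes = [1, 2, 3, 4, 5, 6]
edges = [[1, 2, 3], [3, 4], [5, 6]]
components = list(connected_components(nodes, edges))
print(len(components))                     # 2
print(sorted(len(c) for c in components))  # [2, 4]
```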

Other selected methods:

In [None]:
H.cleanup()
cec = H.nodes.clique_eigenvector_centrality.asnumpy()

kc = H.nodes.katz_centrality.asnumpy()

In [None]:
plt.plot(kc, cec, "ko")
sns.despine()
plt.xlabel("Katz centrality")
plt.ylabel("Clique eigenvector centrality")

## Challenge

* Find the number of recipes in `kaggle-whats-cooking` with more than 5 ingredients
* Find the number of ingredients that are only used once. What are they?
* Make a histogram of the edge size. What is the most common number of ingredients in a recipe?
* What are the maximum and minimum numbers of ingredients in a recipe?
* What is the most popular ingredient?
* Extra: Copy the `user_degree` function above and modify it so that it weights each edge by the inverse of its size, i.e.,
$k_i = \sum_{e\in E} {\bf 1}_{i\in e} / |e|$
Output this custom degree for `kaggle-whats-cooking` in dictionary form.
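
As a sanity check for the formula in the last item, the sum can be worked out by hand on the small three-edge hypergraph used earlier. The plain-Python sketch below uses exact fractions and no XGI calls, so wrapping the computation in a node statistic is still left as the exercise.

```python
from fractions import Fraction

edges = [[1, 2, 3], [2, 3, 4, 5], [3, 4, 5]]
nodes = [1, 2, 3, 4, 5]

# k_i = sum of 1/|e| over the edges e that contain node i
k = {n: sum(Fraction(1, len(e)) for e in edges if n in e) for n in nodes}

print(k[1])  # 1/3
print(k[3])  # 11/12

# Each edge contributes |e| * (1/|e|) = 1 in total, so the values sum to the number of edges.
assert sum(k.values()) == len(edges)
```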
