Structured labelled arrays #4217

Closed · wants to merge 23 commits

Conversation

@ericmjl (Contributor) commented Sep 18, 2020

This PR adds three functions that allow a user to generate labelled adjacency tensors and labelled node feature matrices from a graph.
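
For orientation, here is a rough usage sketch based on the function names and signatures discussed later in this thread (to_node_dataframe and to_adjacency_xarray, with extractor functions f(node, datadict) -> pd.Series and f(G) -> xr.DataArray). The exact call patterns are assumptions, running it requires this PR's branch, and the DataArray is built by hand here rather than with format_adjacency:

import networkx as nx
import numpy as np
import pandas as pd
import xarray as xr

G = nx.erdos_renyi_graph(n=10, p=0.3, seed=0)
nx.set_node_attributes(G, {n: n ** 2 for n in G}, "n_squared")

# Node extractors have signature f(node, datadict) -> pd.Series.
def n_squared(n, d):
    return pd.Series({"n_squared": d["n_squared"]}, name=n)

# Adjacency extractors have signature f(G) -> xr.DataArray.
def adjacency(G):
    nodes = list(G)
    return xr.DataArray(
        np.expand_dims(nx.to_numpy_array(G), axis=-1),
        dims=["n1", "n2", "name"],
        coords={"n1": nodes, "n2": nodes, "name": ["adjacency"]},
    )

node_df = nx.to_node_dataframe(G, [n_squared])          # assumed call pattern
adjacency_da = nx.to_adjacency_xarray(G, [adjacency])   # call pattern as used later in this thread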

@dschult (Member) commented Sep 19, 2020

I'm a little confused by all the pep8 stuff. Shouldn't running black once fix all that?
Is it true that black doesn't mess with comments and docstrings but that flake8 and pycodestyle do?
How can we make this easier for submitters?

@jarrodmillman (Member) commented:

> I'm a little confused by all the pep8 stuff. Shouldn't running black once fix all that?
> Is it true that black doesn't mess with comments and docstrings but that flake8 and pycodestyle do?
> How can we make this easier for submitters?

@dschult We need to set up black to run as part of the CI and remove pep8speaks. @rossbar started looking into how to do this and has a reasonable idea of what to do (I think).

@jarrodmillman (Member) commented:

The example doesn't render correctly:
https://324-890377-gh.circle-artifacts.com/0/doc/build/html/auto_examples/convert/plot_graph_tensors.html#sphx-glr-auto-examples-convert-plot-graph-tensors-py

I think changing

An overview of the steps involved looks like this:

1. Create your graph with data attributes
1. Define functions to extract data.
    - For node DataFrames functions have signature:
    f(node, datadict) -> pd.Series.
    - For adjacency DataArray functions have signature: f(G) -> xr.DataArray.
    Use ``format_adjacency()`` to ease creation of the DataArray.
1. Call the relevant generate function
(``generate_node_dataframe`` or ``generate_adjacency_xarray``)

"""

to

An overview of the steps involved looks like this:

1. Create your graph with data attributes
2. Define functions to extract data.

    - For node DataFrames functions have signature:
      f(node, datadict) -> pd.Series.
    - For adjacency DataArray functions have signature: f(G) -> xr.DataArray.
      Use ``format_adjacency()`` to ease creation of the DataArray.

3. Call the relevant generate function
   (``generate_node_dataframe`` or ``generate_adjacency_xarray``)
"""

should fix it. That is,

  1. use different numbers (so it makes sense to people who read the text file).
  2. use blank lines to mark the beginning and end of the bullet list.
  3. indent text to keep it in the same block (e.g., in the bullet list and item 3).

You can also remove the extra space at the end of the docstring.

I would also use a shorter title and a more descriptive first sentence. The title is used as the image title in the gallery, so 3-4 words keeps it from being overly long. The first sentence is what is displayed when you hover over the image in the gallery (it currently says "In this example, we show how two things:"), so it is worth making it descriptive. Maybe make the current title the first sentence and add a shorter, less descriptive title.

If possible it would be nice to draw a figure. The figure is used in the gallery and makes it more visually interesting. Otherwise, it just displays a default image. You may want to look at

for ideas. I.e., it would be nice if the plot used the fact that you have an xarray.

Should "f(node, datadict) -> pd.Series" be f(node, datadict) -> pd.Series (i.e., put in double backquotes)? And "f(G) -> xr.DataArray" be f(G) -> xr.DataArray?

@ericmjl (Contributor, Author) commented Sep 25, 2020

Thanks @jarrodmillman! Yes, I was sticking with Markdown conventions, and forgot that we're using Sphinx here.

> If possible it would be nice to draw a figure. The figure is used in the gallery and makes it more visually interesting. Otherwise, it just displays a default image.

I'm wondering, do you know if it's possible to display the HTML repr of an xarray object in sphinx? That would be the most rad. But if not, I have an idea of what to put in -- possibly just a heatmap of sorts.
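
For what it's worth, a minimal sketch of the heatmap idea, assuming an adjacency DataArray laid out with the ["n1", "n2", "name"] dims used in this PR; xarray's 2D .plot() defaults to a pcolormesh, which reads as a heatmap:

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import xarray as xr

G = nx.erdos_renyi_graph(n=30, p=0.3, seed=0)
nodes = list(G)
da = xr.DataArray(
    np.expand_dims(nx.to_numpy_array(G), axis=-1),
    dims=["n1", "n2", "name"],
    coords={"n1": nodes, "n2": nodes, "name": ["adjacency"]},
)

# Selecting a single named layer gives a 2D DataArray; .plot() renders it as a heatmap.
da.sel(name="adjacency").plot()
plt.show()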

@ericmjl (Contributor, Author) commented Sep 25, 2020

@jarrodmillman @dschult looks like all tests pass, and the built example looks pretty cool too! 😄

Please let me know if there are other things you'd like to see in the PR. Happy to make it happen.

@dschult (Member) left a comment

What about having a function from_node_dataframe?

Anyway, I think once the name choice and sphinx syntax are worked out, this is ready for merging.

We can return in other PRs to address the questions about whether it is better to construct via lists or numpy.arrays or somehow directly into pandas and xarray. Sorry for the noise...

@ericmjl (Contributor, Author) commented Sep 29, 2020

@dschult to address the final point:

> What about having a function from_node_dataframe?

I like the idea! I think it could be left for another newcomer to contribute, since I think the pattern is similar:

import networkx as nx

def from_node_dataframe(df):
    # df: a pandas DataFrame indexed by node; columns become node attributes.
    G = nx.Graph()
    for n, d in df.iterrows():
        G.add_node(n, **d)
    return G

It could probably be made more efficient and more flexible with custom funcs too, but I think it's worth fleshing out in a different PR, maybe with a design sketch via discussion on the issue tracker! I'll raise it there.
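
As one possible tightening of the sketch above (purely illustrative, not part of this PR), add_nodes_from can drive the loop directly:

import networkx as nx
import pandas as pd

def from_node_dataframe(df):
    # Each row index becomes a node; the row's columns become its attribute dict.
    G = nx.Graph()
    G.add_nodes_from((n, row.to_dict()) for n, row in df.iterrows())
    return G

# Example: a two-node DataFrame with one attribute column.
df = pd.DataFrame({"color": ["red", "blue"]}, index=["a", "b"])
G = from_node_dataframe(df)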

@ericmjl (Contributor, Author) commented Oct 2, 2020

Chiming in here for a gentle nudge on the PR. No pressure though, I do realize we're all busy people; just didn't want the thread to get dropped.

@rossbar (Contributor) left a comment

I took the liberty of doing a bit of rst reformatting & adding intersphinx links for pandas/xarray.

I really like the example! I think it's a great addition showing how to interoperate with other array/data libraries.

After taking a closer look at the functions, I'm not sure that some of them are worth the expanded API surface. For example, to_adjacency_xarray essentially boils down to a one-liner: xr.concat([func(G) for func in funcs], dim="name"). I'd be concerned about adding helper functions that don't really have much to do with NetworkX. Maybe this could be discussed further?

In summary, at this stage I'm:
+1 on the new plot_graph_tensors example
+1 on adding the to_node_dataframe function
-0.5 on to_adjacency_xarray and format_adjacency functions
-1 on adding a pyproject.toml (at least in this PR)

@ericmjl (Contributor, Author) commented Oct 3, 2020

@rossbar Regarding format_adjacency and to_adjacency_xarray, let me see if I understand where you're coming from - if the function doesn't call on core NetworkX functions or objects, then it probably shouldn't be included in the convert_matrix.py library of functions, is that right? That's the only distinguishing mark of these two functions compared to to_node_dataframe.

If so, then yes, there's an opportunity for discussion here! If I may, I'd like to lay out the arguments in favour of keeping the two functions in the library.

I think the two functions do belong inside NetworkX, because they are related to producing the array form(s) of a graph, particularly, the "diffusion matrices" that are used for linear algebra forms of message passing on a graph. Though they don't explicitly use much of the NetworkX API, they are conceptually tied to applied graph theory. Now, as we know from handling high dimensional arrays, even 3D ones get confusing quickly without names, so providing helper functions that construct named arrays (i.e. xarrays and dataframes!) can be super helpful.
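
To make the message-passing point concrete, here is a toy illustration (not code from this PR) of an adjacency matrix acting as a diffusion operator on a node feature matrix:

import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(n=5, p=0.5, seed=42)
A = nx.to_numpy_array(G)                            # adjacency / diffusion matrix
X = np.random.default_rng(0).random((len(G), 3))    # node feature matrix, 3 features per node

# One round of message passing: each node aggregates (sums) its neighbours' features.
X_new = A @ X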

From a practical standpoint, if I take out to_adjacency_xarray and format_adjacency, then the plot_graph_tensors example needs to be heavily modified, since plot_graph_tensors.py relies on those two functions and they would have to be defined explicitly in it. I'd also end up copy/pasting those two functions into another library where they'd live (probably graphein, for which I'm working with the lead developer to target NetworkX graphs as the idiomatic data structure). So we end up with substantial code duplication: one copy of the code lives in an example in NetworkX, and another copy lives in graphein.

As an aside, I also thought a bit about the API. It feels a bit too heavy right now. There's one experiment I did, inside a Jupyter notebook, taking format_adjacency and turning it into a decorator:

from functools import wraps
import numpy as np
import xarray as xr

def format_adjacency(func):
    @wraps(func)
    def inner(G, *args, **kwargs):
        name = kwargs.pop("name", None)
        if name is None:
            name = func.__name__

        adj = func(G, *args, **kwargs)
        expected_shape = (len(G), len(G))
        if adj.shape != expected_shape:
            raise ValueError(
                "Adjacency matrix is not shaped correctly, "
                f"should be of shape {expected_shape}, "
                f"instead got shape {adj.shape}."
            )
        #### THE MAGIC HAPPENS HERE #####
        # Expand to 3D and label the axes so layers can be concatenated along "name".
        adj = np.expand_dims(adj, axis=-1)
        nodes = list(G.nodes())
        return xr.DataArray(
            adj,
            dims=["n1", "n2", "name"],
            coords={"n1": nodes, "n2": nodes, "name": [name]},
        )
    return inner

With this, we can write code that is much more lightweight:

from functools import partial

# Assumes format_adjacency in the library is the decorator version sketched above.
from networkx import to_adjacency_xarray, format_adjacency
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(n=30, p=0.3)

@format_adjacency
def adjacency(G):
    return nx.adjacency_matrix(G).todense()

@format_adjacency
def adjacency_power(G, n):
    a = nx.adjacency_matrix(G).todense()
    return np.linalg.matrix_power(a, n)

# "name" is popped by the decorator and used to label this adjacency layer.
adjacency_2 = partial(adjacency_power, n=2, name="adj_2")
funcs = [adjacency, adjacency_2]

da = to_adjacency_xarray(G, funcs)

Of course I'd have to document that ``name`` is a reserved keyword argument, or use a much more explicit keyword argument name. This way, end-users only have to worry about decorating their numpy-array-generating functions with one decorator, which makes usage much easier. @rossbar @dschult and @jarrodmillman what do you all think?

@rossbar (Contributor) commented Oct 3, 2020

> if the function doesn't call on core NetworkX functions or objects, then it probably shouldn't be included in the convert_matrix.py library of functions, is that right?

No, that's not the concern: since the convert* modules are related to converting graphs to other formats, this does indeed seem like the appropriate place for such conversion functions.

My concern is more about the cost/benefit of expanding the API. The one that really stands out to me is the to_adjacency_xarray function. That essentially boils down to a one-liner: my_adj_ary = xr.concat([func(G) for func in funcs], dim="name"). My concern is this: is it worth adding specialized API to do something that is already achievable in a very "Pythonic" way? For this reason especially, it feels like the function belongs in the example rather than in the library, because the example demonstrates the right way to go about this using the appropriate tools. format_adjacency has a similar feel: half the function is dedicated to error checking, and the part that does the conversion essentially boils down to things that are simply accomplished using the tools provided by the various libraries involved, e.g.

import numpy as np
import xarray as xr

# adj is an (N, N) adjacency array for graph G; name labels this layer.
nodes = list(G.nodes())
my_formatted_adj = xr.DataArray(
    adj[..., np.newaxis],
    dims=["n1", "n2", "name"],
    coords={"n1": nodes, "n2": nodes, "name": [name]},
)

Adding the extra layer of indirection for this functionality feels to me like a bit of a violation of one of the Zen of Python principles: "There should be one—and preferably only one—obvious way to do it."

> That's the only distinguishing mark of these two functions compared to to_node_dataframe.

IMO the difference with to_node_dataframe is that it performs a remapping of the node/attribute data, which is a more complicated procedure that feels more atomic.

I may very well just be thinking too conservatively re: the expanded API. I don't at all doubt the utility of mapping graphs to named nd data structures!

@ericmjl (Contributor, Author) commented Oct 3, 2020

> My concern is this: is it worth adding specialized API to do something that is already achievable in a very "Pythonic" way? For this reason especially, it feels like the function belongs in the example rather than in the library, because the example demonstrates the right way to go about this using the appropriate tools. format_adjacency has a similar feel: half the function is dedicated to error checking, and the part that does the conversion essentially boils down to things that are simply accomplished using the tools provided by the various libraries involved

Got it! I see where you're coming from now.

There is a second perspective I wanted to get your thoughts on. The API surface already provides functions that return specialized data structures. For example, there are functions that target numpy arrays and pandas dataframes. Would a function that returns xarray DataArrays fall in the same category? And if so, would the concerns you raised earlier still rule out the expanded API?

If the consensus is still "let's not keep the function in the library", then I can move it to graphein instead. You made a good point that the bulk of the code is checking, so there's minimal duplication if I just moved the "soul" of the functions into the examples.

@rossbar (Contributor) commented Oct 4, 2020

> The API surface already provides functions that return specialized data structures.

This is a good point. Looking at the other functions that are already in convert_matrix, there are some that fit the "pass-through" description. For example, to_numpy_matrix is essentially: return np.asmatrix(to_numpy_array(*args, **kwargs)). I would vote to deprecate functions like this as well (thanks for the reminder!) for the same reason mentioned above. Most of the other functions in the module have more complex procedures that translate between the Graph structure and the "other" format (DataFrame, ndarray, etc.), and IMO those are worthwhile; I think to_node_dataframe fits this bill.

@jarrodmillman (Member) commented:

I originally (mistakenly) thought this only added xarray as a dependency for the gallery example. This is the first example of adding a new extra dependency since we adopted our new "policy" for extra dependencies:

> Default dependencies are listed in requirements/default.txt and extra (i.e., optional) dependencies are listed in requirements/extra.txt. We don't often add new default and extra dependencies. If you are considering adding code that has a dependency, you should first consider adding a gallery example. Typically, new proposed dependencies would first be added as extra dependencies. Extra dependencies should be easy to install on all platforms and widely-used. New default dependencies should be easy to install on all platforms, widely-used in the community, and have demonstrated potential for wide-spread use in NetworkX.

Could all of the code move to the example (as we recommend for new extra dependencies)? If not, we should take the opportunity to update the policy to better explain our reasoning about when new dependencies can skip first being added as examples.

If we decide to make xarray an extra dependency, then xarray needs to be listed in requirements/extra.txt not requirements/example.txt. We should also mention the addition of a new extra dependency in the release notes.

@jarrodmillman (Member) commented:

We already made some significant changes to our dependencies for the next release. I don't think it will cause any issues, but it makes me a little hesitant to make more dependency changes in general for the next release.

One option would be to add this as an example for the next release and then add it as an extra dependency for the 3.0 release. That might allow us to see if there are any unexpected issues.

This approach might be more desirable if there were a couple of additional dependencies that are added in the same way for the 3.0 release. We should discuss whether there are other potential extra dependencies that we want to add before the 3.0 release.

@dschult added this to the networkx-2.6 milestone (Oct 4, 2020)
@jarrodmillman (Member) commented:

@ericmjl This branch had gotten out-of-sync with master. It was a bit of work to rebase, so I decided to just push the changes to your branch. I was careful and I don't think I messed anything up. But you may want to double-check I didn't make any errors.

@networkx deleted a comment from pep8speaks (Oct 26, 2020)
@jarrodmillman (Member) commented:

I made xarray a default dependency so the tests would run, which is why the tests are failing now. I am still not sure if this should be a default dependency or an extra one. I would also prefer making this a gallery example first, but I don't feel strongly about that. In general, I think xarray is the type of thing it would make sense to add as a default or extra dependency.

I don't have much experience using xarray, so I am not sure whether this is a general solution or whether there are other ways to do it. It would be helpful if we could get feedback from other people using xarray and nx, but there may not be a lot of people with more experience with this than Eric.

@dschult self-requested a review (Oct 27, 2020)
@dschult (Member) commented Oct 27, 2020

I'm looking at this PR again, and I'm getting more excited about its inclusion.

We haven't really made edge and node attributes first-class entities in NetworkX. We allow them and use them, but functions like nx.get/set_node/edge_attributes are pretty minimal. The functions in this PR increase our set of tools for manipulating node and edge attributes.

I wonder if they should be housed with get/set_node/edge_attributes instead of in convert? They aren't converting a graph; they are exporting attributes. I'd like to see functions to import attributes (or at least documentation on how to import them), but that doesn't have to be in this PR.
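
For reference, pulling attributes back onto a graph from a node DataFrame is already possible with the existing nx.set_node_attributes API. A minimal sketch, assuming the DataFrame is indexed by node with one column per attribute:

import networkx as nx
import pandas as pd

G = nx.path_graph(3)
df = pd.DataFrame({"color": ["r", "g", "b"]}, index=list(G))

# Write each DataFrame column back onto the graph as a node attribute.
for col in df.columns:
    nx.set_node_attributes(G, df[col].to_dict(), name=col)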

@dschult (Member) commented Oct 27, 2020

I think the funcs should have reasonable defaults -- like writing all attributes in the edge/node_attr_dicts.
Then the to_node_dataframe output has columns corresponding to the attributes on the nodes.
And the to_adjacency_dataarray output has adjacency arrays for each attribute on the edges.
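
A sketch of what such a default func could look like (hypothetical, not part of the PR): it simply writes out the whole node attribute dict as that node's row, matching the f(node, datadict) -> pd.Series signature.

import pandas as pd

def default_node_func(n, d):
    # Write out every attribute in the node's data dict as that node's row.
    return pd.Series(d, name=n)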

I'm pretty sure we could get a better name for format_adjacency. Perhaps adjacency_array_to_dataarray (or perhaps not) :)

Also, using "n1" and "n2" and "name" for the 3 axes in the xarray could be made more specific:
perhaps something like "source", "target" and "adj_name". But maybe there's others that are better?

@dschult (Member) commented Oct 27, 2020

In support of the functions in this PR, here are the one-liners that they replace.

First: to_adjacency_dataarray with two adjacencies to stack... :

da = xr.concat(
    [
        xr.DataArray(
            np.expand_dims(nx.adjacency_matrix(G, weight="weight").todense(), axis=-1),
            dims=["n1", "n2", "name"],
            coords={"n1": list(G), "n2": list(G), "name": ["weight_adj_matrix"]},
        ),
        xr.DataArray(
            np.expand_dims(nx.adjacency_matrix(G, weight="capacity").todense(), axis=-1),
            dims=["n1", "n2", "name"],
            coords={"n1": list(G), "n2": list(G), "name": ["capacity_adj_matrix"]},
        ),
    ],
    dim="name",
)

Notice that the wrapping code around the nx.adjacency_matrix(G, weight=...) is identical and very specific to naming rows and columns based on the Graph nature of the array. The general DataArray is much more flexible of course, but a graph adjacency version only involves these features -- and it may be helpful for users (and us) not to have people construct their own version of adjacency arrays.

Notice that while it is a one-liner, it's a long one-liner that duplicates the same cruft for each layer of the xarray.
The code in to_adjacency_dataarray hides much of the one-liner cruft in format_adjacency.

Second: to_node_dataframe with two attributes to stack...:

# setup:  nx.set_node_attributes(G, {n: n**2 for n in G}, "n-squared")

df = pd.DataFrame([
    pd.concat([
          pd.Series(d, name=n), 
          pd.Series({"n-fourth": d['n-squared']**2}, name=n)
    ]) 
    for n, d in G.nodes(data=True)
])

This is also a one-liner (though I removed some error checking of the func by inlining). Again, it would be possible for each user to figure out how to do this, but I think it might be helpful to standardize for users a good way to store node information in dataframes (just as we do for to_pandas_edgelist and to_pandas_adjacency).
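
For comparison, the existing standardized exports for edge data look like this (a minimal illustration using the current API):

import networkx as nx

G = nx.Graph()
G.add_edge(0, 1, weight=2.5)
G.add_edge(1, 2, weight=0.5)

edgelist_df = nx.to_pandas_edgelist(G)    # one row per edge: source, target, weight
adjacency_df = nx.to_pandas_adjacency(G)  # node-by-node matrix of edge weights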

It would take me a long time to come up with these one-liners, and I would need to learn a lot about xarray or pandas. I admit that I would need to learn a fair amount about them just to create the funcs needed to customize these attributes. But I think that with default functions that simply write out the node/edge attributes on the Graph, people can ease into understanding how this works.

@dschult modified the milestones: networkx-2.6, networkx-2.7 (Feb 24, 2021)
Base automatically changed from master to main (March 4, 2021)
@jarrodmillman modified the milestones: networkx-2.7, networkx-3.0 (Apr 8, 2021)
@jarrodmillman removed this from the networkx-2.7 milestone (Feb 12, 2022)
@dschult self-assigned this (Jul 31, 2022)
@ericmjl closed this by deleting the head repository (Nov 20, 2022)