-
-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Structured labelled arrays #4217
Conversation
I'm a little confused by all the pep8 stuff. Shouldn't running black once fix all that? |
@dschult We need to set up black to run as part of the CI and remove pep8speaks. @rossbar started looking into how to do this and has a reasonable idea of what to do (I think). |
The example doesn't render correctly: I think changing
to
should fix it. That is,
You can also remove the extra space at the end of the docstring. I would also use a shorter title and a more descriptive first sentence. The short title is the image title in the gallery. I would use 3-4 words so the title in the gallery isn't super long. I would add a descriptive first sentence, since it is what is displayed when you hover over the image in the gallery (it currently says "In this example, we show how two things:"). Maybe make the current title the first sentence and add a shorter less descriptive title. If possible it would be nice to draw a figure. The figure is used in the gallery and makes it more visually interesting. Otherwise, it just displays a default image. You may want to look at for ideas. I.e., it would be nice if the plot used the fact that you have an xarray. Should "f(node, datadict) -> pd.Series" be |
Thanks @jarrodmillman! Yes, I was sticking with Markdown conventions, and forgot that we're using Sphinx here.
I'm wondering, do you know if it's possible to display the HTML repr of an xarray object in sphinx? That would be the most rad. But if not, I have an idea of what to put in -- possibly just a heatmap of sorts. |
@jarrodmillman @dschult looks like all tests pass, and the built example looks pretty cool too! 😄 Please let me know if there’s other things you’d like to see in the PR. Happy to make it happen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about having a function from_node_dataframe
?
Anyway, I think once the name choice and sphinx syntax are worked out, this is ready for merging.
We can return in other PRs to address the questions about whether it is better to construct via lists or numpy.arrays or somehow directly into pandas and xarray. Sorry for the noise...
@dschult to address the final point:
I like the idea! I think it could be left for another newcomer to contribute, since I think the pattern is similar: def from_node_dataframe(df):
G = nx.Graph()
for n, d in df.iterrows():
G.add_node(n, **d)
return G Probably could be made more efficient, more flexible with custom funcs too, but I think it's worth fleshing out in a different PR, maybe with a design sketch via discussion on the issue tracker! I'll raise this up there. |
Chiming in here for a gentle nudge on the PR. No pressure though, I do realize we're all busy people; just didn't want the thread to get dropped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the liberty of doing a bit of rst reformatting & adding intersphinx links for pandas/xarray.
I really like the example! I think it's a great addition showing how to interoperate with other array/data libraries.
After taking a closer look at the functions - I'm not sure that some of the functions are worth the expanded API surface. For example, to_adjacency_xarray
essentially boils down to a one-liner: xr.concat([func(G) for func in funcs], dim="name")
. I'd be concerned about adding helper functions that don't really have much to do with NetworkX. Maybe this could be discussed further?
In summary, at this stage I'm:
+1 on the new plot_graph_tensors
example
+1 on adding the to_node_dataframe
function
-0.5 on to_adjacency_xarray
and format_adjacency
functions
-1 on adding a pyproject.toml (at least in this PR)
@rossbar Regarding If so, then yes, there's an opportunity for discussion here! If I may, I'd like to lay out the arguments in favour of keeping the two functions in the library. I think the two functions do belong inside NetworkX, because they are related to producing the array form(s) of a graph, particularly, the "diffusion matrices" that are used for linear algebra forms of message passing on a graph. Though they don't explicitly use much of the NetworkX API, they are conceptually tied to applied graph theory. Now, as we know from handling high dimensional arrays, even 3D ones get confusing quickly without names, so providing helper functions that construct named arrays (i.e. xarrays and dataframes!) can be super helpful. From a practical standpoint, if I take out As an aside, I also thought a bit about the API. It feels a bit too heavy right now. There's one experiment I did, inside a Jupyter notebook, taking from functools import wraps
import numpy as np
import xarray as xr
import functools
def format_adjacency(func):
@wraps(func)
def inner(G, *args, **kwargs):
name = kwargs.pop("name", None)
if name is None:
name = func.__name__
adj = func(G, *args, **kwargs)
expected_shape = (len(G), len(G))
if adj.shape != expected_shape:
raise ValueError(
"Adjacency matrix is not shaped correctly, "
f"should be of shape {expected_shape}, "
f"instead got shape {adj.shape}."
)
#### THE MAGIC HAPPENS HERE #####
adj = np.expand_dims(adj, axis=-1)
adj = xr.DataArray(adj)
nodes = list(G.nodes())
return xr.DataArray(
adj,
dims=["n1", "n2", "name"],
coords={"n1": nodes, "n2": nodes, "name": [name]},
)
return inner With this, we can write code that is much more lightweight: from functools import partial
from networkx import to_adjacency_xarray, format_adjacency
import networkx as nx
import numpy as np
G = nx.erdos_renyi_graph(n=30, p=0.3)
@format_adjacency
def adjacency(G):
return nx.adjacency_matrix(G).todense()
@format_adjacency
def adjacency_power(G, n):
a = nx.adjacency_matrix(G).todense()
return np.linalg.matrix_power(a, n)
adjacency_2 = partial(adjacency_power, n=2, name="adj_2")
funcs = [
adjacency,
adjacency_2
]
da = to_adjacency_xarray(G, funcs) Of course I'd have to document that |
No, that's not the concern - since the My concern is more about the cost/benefit of expanding the API. The one that really stands out to me is the nodes = list(G.nodes())
my_formatted_adj = xr.DataArray(
adj[..., np.newaxis],
dims=["n1", "n2", "name"],
coords={"n1": nodes, "n2": nodes, "name": [name]},
) Adding the extra layer of indirection for this functionality feels to me like a bit of a violation of one of the zen principles: There should be one—and preferably only one—obvious way to do it.
IMO the difference with I very well may be just be thinking too conservatively re: the expanded API. I don't at all doubt the utility of mapping graphs to named nd data structures! |
Got it! I see where you're coming from now. There is a second perspective, wanted to get your thoughts here. The API surface area provides functions that return specialized data structures already. For example, there's functions that target numpy arrays and pandas dataframes. Would a function that returns xarray DataArrays fall in the same category? And if so, would the previous concerns you raised nullify having the expanded API still? If the consensus is still "let's not keep the function in the library", then I can move it to graphein instead. You made a good point that the bulk of the code is checking, so there's minimal duplication if I just moved the "soul" of the functions into the examples. |
This is a good point. Looking at the other functions that are already in |
I originally (mistakenly) thought this only added xarray as a dependency for the gallery example. This is the first example of adding a new extra dependency since we adopted our new "policy" for extra dependencies:
Could all of the code move to the example (as we recommend for new extra dependencies)? If not, we should take the opportunity to update the policy to better explain our reasoning about when new dependencies can skip first being added as examples. If we decide to make xarray an extra dependency, then xarray needs to be listed in |
We already made some significant changes to our dependencies for the next release. I don't think it will cause any issues, but it makes me a little hesitant to make more dependency changes in general for the next release. One option would be to add this as an example for the next release and then add it as an extra dependency for the 3.0 release. That might allow us to see if there are any unexpected issues. This approach might be more desirable if there were a couple of additional dependencies that are added in the same way for the 3.0 release. We should discuss whether there are other potential extra dependencies that we want to add before the 3.0 release. |
Co-authored-by: Dan Schult <dschult@colgate.edu>
It better describes, in more detail, how to use the function.
@ericmjl This branch had gotten out-of-sync with master. It was a bit of work to rebase, so I decided to just push the changes to your branch. I was careful and I don't think I messed anything up. But you may want to double-check I didn't make any errors. |
I made xarray a default dependency so the tests would run, which is why the tests are failing now. I am still not sure if this should be a default dependency or an extra one. I would also prefer making this a gallery example first, but I don't feel strongly about that. In general, I think xarray is the type of thing it would make sense to add as a default or extra dependency. I don't have much experience using xarray, so I am not sure this is a general solution or if there are other ways to do it. It would be helpful if we could get feedback from other people using xarray and nx. But there may not be a lot of people with more experience with this than Eric. |
I'm looking at this PR again, and I'm getting more excited about its inclusion. We haven't really made edge and node attributes first class entities in NetworkX. I wonder if they should be housed with |
I think the funcs should have reasonable defaults -- like writing all attributes in the I'm pretty sure we could get a better name for Also, using "n1" and "n2" and "name" for the 3 axes in the xarray could be made more specific: |
In support of the functions in this PR, here are the one-liners that they replace. First: da=xr.concat(
[
xr.DataArray(np.expand_dims(nx.adjacency_matrix(G, weight="weight").todense(), axis=-1),
dims=["n1","n2","adj_name"], coords={"n1":list(G), "n2":list(G), "name": ['weight_adj_matrix']}),
xr.DataArray(np.expand_dims(nx.adjacency_matrix(G, weight="capacity").todense(), axis=-1),
dims=["n1","n2","adj_name"], coords={"n1":list(G), "n2":list(G), "name": ['capacity_adj_matrix']})
],
dim="name"
) Notice that the wrapping code around the Notice that while it is a one-liner, it's a long one-liner that duplicates the same cruft for each layer of the xarray. Second: # setup nx.set_node_attributes(G, {n: n**2 for n in G}, "n-squared')
df = pd.DataFrame([
pd.concat([
pd.Series(d, name=n),
pd.Series({"n-fourth": d['n-squared']**2}, name=n)
])
for n, d in G.nodes(data=True)
]) This is also a one-liner (though I removed some error checking of the It would take me a long time to come up with these one-liners and I would need to learn a lot about xarray or pandas. |
This PR adds three functions allow a user to generate labelled adjacency tensors and labelled node feature matrices from a graph.