Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order #3105

Closed
komo-fr opened this issue Aug 1, 2018 · 3 comments

Comments

@komo-fr
Copy link
Contributor

komo-fr commented Aug 1, 2018

When using the function from_pandas_adjacency (networkx 2.1) to create a graph, edges are added focusing on the order of columns and rows in the pandas DataFrame.
Personally, I think that it is more useful by adding edges focusing on the names of columns and rows (indices) rather than the order, because

  1. Personally, when column names and row names are given in the pandas DataFrame, I think it is natural to interpret them as node names.
  2. In the current behavior, it is necessary for the code calling from_pandas_adjacency to ensure that the columns and rows of the pandas DataFrame are in the same order. For example, by writing the following code every time
      df = df[df.index] # Sort columns by row (index) order
    Without this code, there is a possibility that an unintended graph structure may be created.

The examples are as follows.

Current behavior:

In the following case 1 and case 2, the pandas DataFrames have the same structure except column order. (The columns order of case 1 is ["A", "B", "C"]], and the columns order of case 2 is ["A "," C "," B "])

dataframe

However, the graph structure created using from_pandas_adjacency (networkx 2.1) differs between case 1 and case 2.

graph

Case 1: This case is no problem.

# A->B, B->C
data = {
        "A": {"A": 0, "B" : 0, "C": 0},
        "B": {"A": 1, "B" : 0, "C": 0},
        "C": {"A": 0, "B" : 1, "C": 0}
        }
case1_df = pd.DataFrame(data)

G  = nx.from_pandas_adjacency(case1_df, create_using=nx.DiGraph())
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos=pos, node_color="#ffa1a1")
nx.draw_networkx_labels(G, pos=pos)
nx.draw_networkx_edges(G, pos=pos)
plt.title('case 1')
plt.show()

case1_graph

Case 2:
I think that behavior in this case is a little confusing.
Focusing on column names and row names, it is expected that the same graph structure as case 1 will be created. However, in fact, a different graph structure is created. It is because from_pandas_adjacency is currently based on the order, not the names.

# A->B, B->C (Same as case 1)
data = {
        "A": {"A": 0, "B" : 0, "C": 0},
        "B": {"A": 1, "B" : 0, "C": 0},
        "C": {"A": 0, "B" : 1, "C": 0}
        }
case2_df = pd.DataFrame(data)

# !!! Change the column order from ["A", "B", "C"] to ["A", "C", "B"] !!!
case2_df = case2_df[["A", "C", "B"]]  

G  = nx.from_pandas_adjacency(case2_df, create_using=nx.DiGraph())
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos=pos, node_color="#ffa1a1")
nx.draw_networkx_labels(G, pos=pos)
nx.draw_networkx_edges(G, pos=pos)
plt.title('case 2')
plt.show()

case2_graph

Expected result:

I want the result of case 2 to be the same as case 1. The reasons are mentioned at the beginning of this issue.
If the code is changed, I'm thinking of sending a PR like the following.

Before:

A = df.values
G = from_numpy_matrix(A, create_using=create_using)
try:
df = df[df.index]
except:
msg = "%s not in columns"
missing = list(set(df.index).difference(set(df.columns)))
raise nx.NetworkXError("Columns must match Indices.", msg % missing)
nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
return G

After:
My desire is to avoid the situation that always needs to be careful of columns order when using from_pandas_adjacency. I think that there are two approaches.

[Approach 1]
In the original code, the columns are sorted after creating the graph. Change this to sort before creating the graph.

def from_pandas_adjacency(df, create_using=None):
    # …

    try:
        df = df[df.index]
    except:
        msg = "%s not in columns"
        missing = list(set(df.index).difference(set(df.columns)))
        raise nx.NetworkXError("Columns must match Indices.", msg % missing)

    A = df.values
    G = from_numpy_matrix(A, create_using=create_using)

    nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
    return G

[Approach 2]
If the columns and rows are in the different order, throw an exception. Then, before using from_pandas_adjacency, I will be able to remember that the columns and rows are in the same order. And, it is possible to avoid trouble that an unintended graph structure is created.

@dschult
Copy link
Member

dschult commented Aug 1, 2018

Thanks for pointing this out! I agree that we need a fix here -- either raise an exception when order doesn't match or reorder the columns to match the rows. I would lean toward your Approach 1. Do you see advantages of the second approach? I guess my question is also getting at: are there backward compatibility issues we need to worry about here?

Thanks!

@komo-fr
Copy link
Contributor Author

komo-fr commented Aug 4, 2018

Thank you for the comments.

Do you see advantages of the second approach? 

I think there is no big advantage for the second approach. I myself want to adopt Approach 1 rather than Approach 2.
I presented Approach 2 as an alternative to the case where Approach 1 can not be adopted for any reasons. (I thought that from_pandas_adjacency might have been implemented to focus on order after some discussion.)

 I guess my question is also getting at: are there backward compatibility issues we need to worry about here?

I think there is no backward compatibility problem.
If there are cases in which it is necessary to ignore the column name and create a graph structure, it will be a problem, but I think that is not a general case.

If there is no problem, I would like to send a pull request with the contents of Approach 1.

@dschult
Copy link
Member

dschult commented Aug 4, 2018

This looks good -- Put together a pull request for Approach 1.
Thanks very much!

dschult pushed a commit that referenced this issue Aug 11, 2018
#3111)

* When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order. 

Fixes #3105.

* Test for pandas DataFrame which the columns and rows are in the different order.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants