When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order #3105

komo-fr · 2018-08-01T08:18:56Z

When using the function from_pandas_adjacency (networkx 2.1) to create a graph, edges are added focusing on the order of columns and rows in the pandas DataFrame.
Personally, I think that it is more useful by adding edges focusing on the names of columns and rows (indices) rather than the order, because

Personally, when column names and row names are given in the pandas DataFrame, I think it is natural to interpret them as node names.
In the current behavior, it is necessary for the code calling from_pandas_adjacency to ensure that the columns and rows of the pandas DataFrame are in the same order. For example, by writing the following code every time
　　df = df[df.index] # Sort columns by row (index) order
Without this code, there is a possibility that an unintended graph structure may be created.

The examples are as follows.

Current behavior:

In the following case 1 and case 2, the pandas DataFrames have the same structure except column order. (The columns order of case 1 is ["A", "B", "C"]], and the columns order of case 2 is ["A "," C "," B "])

However, the graph structure created using from_pandas_adjacency (networkx 2.1) differs between case 1 and case 2.

Case 1: This case is no problem.

# A->B, B->C
data = {
        "A": {"A": 0, "B" : 0, "C": 0},
        "B": {"A": 1, "B" : 0, "C": 0},
        "C": {"A": 0, "B" : 1, "C": 0}
        }
case1_df = pd.DataFrame(data)

G  = nx.from_pandas_adjacency(case1_df, create_using=nx.DiGraph())
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos=pos, node_color="#ffa1a1")
nx.draw_networkx_labels(G, pos=pos)
nx.draw_networkx_edges(G, pos=pos)
plt.title('case 1')
plt.show()

Case 2:
I think that behavior in this case is a little confusing.
Focusing on column names and row names, it is expected that the same graph structure as case 1 will be created. However, in fact, a different graph structure is created. It is because from_pandas_adjacency is currently based on the order, not the names.

# A->B, B->C (Same as case 1)
data = {
        "A": {"A": 0, "B" : 0, "C": 0},
        "B": {"A": 1, "B" : 0, "C": 0},
        "C": {"A": 0, "B" : 1, "C": 0}
        }
case2_df = pd.DataFrame(data)

# !!! Change the column order from ["A", "B", "C"] to ["A", "C", "B"] !!!
case2_df = case2_df[["A", "C", "B"]]  

G  = nx.from_pandas_adjacency(case2_df, create_using=nx.DiGraph())
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos=pos, node_color="#ffa1a1")
nx.draw_networkx_labels(G, pos=pos)
nx.draw_networkx_edges(G, pos=pos)
plt.title('case 2')
plt.show()

Expected result:

I want the result of case 2 to be the same as case 1. The reasons are mentioned at the beginning of this issue.
If the code is changed, I'm thinking of sending a PR like the following.

Before:

networkx/networkx/convert_matrix.py

Lines 180 to 190 in ab3a6cd

    
           A = df.values 
        
           G = from_numpy_matrix(A, create_using=create_using) 
        
           try: 
        
               df = df[df.index] 
        
           except: 
        
               msg = "%s not in columns" 
        
               missing = list(set(df.index).difference(set(df.columns))) 
        
               raise nx.NetworkXError("Columns must match Indices.", msg % missing) 
        
           nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False) 
        
           return G

After:
My desire is to avoid the situation that always needs to be careful of columns order when using from_pandas_adjacency. I think that there are two approaches.

[Approach 1]
In the original code, the columns are sorted after creating the graph. Change this to sort before creating the graph.

def from_pandas_adjacency(df, create_using=None):
    # …

    try:
        df = df[df.index]
    except:
        msg = "%s not in columns"
        missing = list(set(df.index).difference(set(df.columns)))
        raise nx.NetworkXError("Columns must match Indices.", msg % missing)

    A = df.values
    G = from_numpy_matrix(A, create_using=create_using)

    nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
    return G

[Approach 2]
If the columns and rows are in the different order, throw an exception. Then, before using from_pandas_adjacency, I will be able to remember that the columns and rows are in the same order. And, it is possible to avoid trouble that an unintended graph structure is created.

The text was updated successfully, but these errors were encountered:

dschult · 2018-08-01T13:27:44Z

Thanks for pointing this out! I agree that we need a fix here -- either raise an exception when order doesn't match or reorder the columns to match the rows. I would lean toward your Approach 1. Do you see advantages of the second approach? I guess my question is also getting at: are there backward compatibility issues we need to worry about here?

Thanks!

komo-fr · 2018-08-04T05:37:42Z

Thank you for the comments.

Do you see advantages of the second approach?

I think there is no big advantage for the second approach. I myself want to adopt Approach 1 rather than Approach 2.
I presented Approach 2 as an alternative to the case where Approach 1 can not be adopted for any reasons. (I thought that from_pandas_adjacency might have been implemented to focus on order after some discussion.)

I guess my question is also getting at: are there backward compatibility issues we need to worry about here?

I think there is no backward compatibility problem.
If there are cases in which it is necessary to ignore the column name and create a graph structure, it will be a problem, but I think that is not a general case.

If there is no problem, I would like to send a pull request with the contents of Approach 1.

dschult · 2018-08-04T20:13:50Z

This looks good -- Put together a pull request for Approach 1.
Thanks very much!

#3111) * When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order. Fixes #3105. * Test for pandas DataFrame which the columns and rows are in the different order.

dschult added the type: Maintenance label Aug 4, 2018

komo-fr mentioned this issue Aug 4, 2018

In 'from_pandas_adjacency', use column names rather than column order. #3111

Merged

dschult closed this as completed in #3111 Aug 11, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order #3105

When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order #3105

komo-fr commented Aug 1, 2018

dschult commented Aug 1, 2018

komo-fr commented Aug 4, 2018

dschult commented Aug 4, 2018

When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order #3105

When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order #3105

Comments

komo-fr commented Aug 1, 2018

Current behavior:

Expected result:

dschult commented Aug 1, 2018

komo-fr commented Aug 4, 2018

dschult commented Aug 4, 2018