You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When using the function from_pandas_adjacency (networkx 2.1) to create a graph, edges are added focusing on the order of columns and rows in the pandas DataFrame.
Personally, I think that it is more useful by adding edges focusing on the names of columns and rows (indices) rather than the order, because
Personally, when column names and row names are given in the pandas DataFrame, I think it is natural to interpret them as node names.
In the current behavior, it is necessary for the code calling from_pandas_adjacency to ensure that the columns and rows of the pandas DataFrame are in the same order. For example, by writing the following code every time df = df[df.index] # Sort columns by row (index) order
Without this code, there is a possibility that an unintended graph structure may be created.
The examples are as follows.
Current behavior:
In the following case 1 and case 2, the pandas DataFrames have the same structure except column order. (The columns order of case 1 is ["A", "B", "C"]], and the columns order of case 2 is ["A "," C "," B "])
However, the graph structure created using from_pandas_adjacency (networkx 2.1) differs between case 1 and case 2.
Case 2:
I think that behavior in this case is a little confusing.
Focusing on column names and row names, it is expected that the same graph structure as case 1 will be created. However, in fact, a different graph structure is created. It is because from_pandas_adjacency is currently based on the order, not the names.
# A->B, B->C (Same as case 1)
data = {
"A": {"A": 0, "B" : 0, "C": 0},
"B": {"A": 1, "B" : 0, "C": 0},
"C": {"A": 0, "B" : 1, "C": 0}
}
case2_df = pd.DataFrame(data)
# !!! Change the column order from ["A", "B", "C"] to ["A", "C", "B"] !!!
case2_df = case2_df[["A", "C", "B"]]
G = nx.from_pandas_adjacency(case2_df, create_using=nx.DiGraph())
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos=pos, node_color="#ffa1a1")
nx.draw_networkx_labels(G, pos=pos)
nx.draw_networkx_edges(G, pos=pos)
plt.title('case 2')
plt.show()
Expected result:
I want the result of case 2 to be the same as case 1. The reasons are mentioned at the beginning of this issue.
If the code is changed, I'm thinking of sending a PR like the following.
After:
My desire is to avoid the situation that always needs to be careful of columns order when using from_pandas_adjacency. I think that there are two approaches.
[Approach 1]
In the original code, the columns are sorted after creating the graph. Change this to sort before creating the graph.
def from_pandas_adjacency(df, create_using=None):
# …
try:
df = df[df.index]
except:
msg = "%s not in columns"
missing = list(set(df.index).difference(set(df.columns)))
raise nx.NetworkXError("Columns must match Indices.", msg % missing)
A = df.values
G = from_numpy_matrix(A, create_using=create_using)
nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
return G
[Approach 2]
If the columns and rows are in the different order, throw an exception. Then, before using from_pandas_adjacency, I will be able to remember that the columns and rows are in the same order. And, it is possible to avoid trouble that an unintended graph structure is created.
The text was updated successfully, but these errors were encountered:
Thanks for pointing this out! I agree that we need a fix here -- either raise an exception when order doesn't match or reorder the columns to match the rows. I would lean toward your Approach 1. Do you see advantages of the second approach? I guess my question is also getting at: are there backward compatibility issues we need to worry about here?
I think there is no big advantage for the second approach. I myself want to adopt Approach 1 rather than Approach 2.
I presented Approach 2 as an alternative to the case where Approach 1 can not be adopted for any reasons. (I thought that from_pandas_adjacency might have been implemented to focus on order after some discussion.)
I guess my question is also getting at: are there backward compatibility issues we need to worry about here?
I think there is no backward compatibility problem.
If there are cases in which it is necessary to ignore the column name and create a graph structure, it will be a problem, but I think that is not a general case.
If there is no problem, I would like to send a pull request with the contents of Approach 1.
#3111)
* When using 'from_pandas_adjacency', edges are added focusing on column names rather than the order.
Fixes#3105.
* Test for pandas DataFrame which the columns and rows are in the different order.
When using the function
from_pandas_adjacency
(networkx 2.1) to create a graph, edges are added focusing on the order of columns and rows in the pandas DataFrame.Personally, I think that it is more useful by adding edges focusing on the names of columns and rows (indices) rather than the order, because
from_pandas_adjacency
to ensure that the columns and rows of the pandas DataFrame are in the same order. For example, by writing the following code every timedf = df[df.index] # Sort columns by row (index) order
Without this code, there is a possibility that an unintended graph structure may be created.
The examples are as follows.
Current behavior:
In the following case 1 and case 2, the pandas DataFrames have the same structure except column order. (The columns order of case 1 is ["A", "B", "C"]], and the columns order of case 2 is ["A "," C "," B "])
However, the graph structure created using
from_pandas_adjacency
(networkx 2.1) differs between case 1 and case 2.Case 1: This case is no problem.
Case 2:
I think that behavior in this case is a little confusing.
Focusing on column names and row names, it is expected that the same graph structure as case 1 will be created. However, in fact, a different graph structure is created. It is because
from_pandas_adjacency
is currently based on the order, not the names.Expected result:
I want the result of case 2 to be the same as case 1. The reasons are mentioned at the beginning of this issue.
If the code is changed, I'm thinking of sending a PR like the following.
Before:
networkx/networkx/convert_matrix.py
Lines 180 to 190 in ab3a6cd
After:
My desire is to avoid the situation that always needs to be careful of columns order when using
from_pandas_adjacency
. I think that there are two approaches.[Approach 1]
In the original code, the columns are sorted after creating the graph. Change this to sort before creating the graph.
[Approach 2]
If the columns and rows are in the different order, throw an exception. Then, before using
from_pandas_adjacency
, I will be able to remember that the columns and rows are in the same order. And, it is possible to avoid trouble that an unintended graph structure is created.The text was updated successfully, but these errors were encountered: