v0.7.0 write_graphml changes integer node attribute type and value #796

Closed
macks22 opened this issue Dec 27, 2014 · 5 comments
Labels
high High-priority issue; typically for cases when igraph returns incorrect result for non-corner cases



macks22 commented Dec 27, 2014

I have a graph with 2,146,334 nodes, each with a 'name' attribute that contains a unique integer ID. When I write this graph to a GraphML file using the write_graphml or write_graphmlz methods of the Graph instance, the 'name' attributes are given the type double and many of their values change. The code below illustrates:

import igraph

# ids: the list of 2,146,334 unique integer IDs (not shown here)
g = igraph.Graph()
len(set(ids))  # 2146334

g.add_vertices(ids)
len(set([v['name'] for v in g.vs]))  # 2146334
g.vs[2146331]['name']  # 1347793

g.write_graphml('test.graphml')
g = igraph.Graph.Read_GraphML('test.graphml')

len([v['name'] for v in g.vs])  # 2146334
len(set([v['name'] for v in g.vs]))  # 1114563
g.vs[2146331]['name']  # 1347790.0

The same issue occurs if I create a new attribute for the IDs rather than relying on the default 'name' attribute.

For now I am circumventing this problem by converting the integer IDs to strings, which are handled properly. However, it would be nice to have this resolved.

Member

ntamas commented Dec 27, 2014

We are dealing with two separate issues here. The first is that igraph converts integer attributes to doubles when the graph is saved to GraphML. Unfortunately this is not easy to deal with because the GraphML writer is implemented deep down in igraph's C core, and at the C level igraph distinguishes between only three types of attributes: numbers (which are stored as doubles), strings and Booleans. This means that by the time the GraphML writer function is called, the Python interface has already converted the attribute values to "regular" C doubles, because this is the only way it can pass the values down to the C layer. However, if you are not tied to the GraphML format, you can simply pickle your graphs instead (using the pickle module), which preserves the exact Python type of every attribute (because saving and loading are done in the Python layer and not in C).
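As a side note on the first issue: the conversion to double changes the type but should not, by itself, change values of this size, since a double represents every integer up to 2**53 exactly. The following is a minimal sketch in plain Python that does not use igraph; the sample IDs are taken from this thread and the variable names are only illustrative.

```python
# A Python float is an IEEE 754 double (53-bit significand) on common platforms.
ids = [999999, 1000001, 1347793, 2146334]

# Convert each ID to a double, as happens when values are passed to the C layer.
as_doubles = [float(i) for i in ids]

# Every value survives the conversion; only the type changes (int -> float).
values_preserved = all(int(d) == i for d, i in zip(as_doubles, ids))
print(values_preserved)  # True

# Exactness ends beyond 2**53: two neighbouring integers collapse to one double.
collapses_above_2_53 = float(2**53 + 1) == float(2**53)
print(collapses_above_2_53)  # True
```

So for IDs in the low millions, any change in value has to come from somewhere other than the int-to-double conversion.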

The other problem (the attributes apparently not being loaded back properly) is more interesting, but unfortunately I can investigate it only if you upload a full, self-contained script (and most likely a corresponding GraphML file) somewhere that reproduces the error on your machine. Please post the URL here once you have such a script so I can check what's going on.

@ntamas ntamas added the Python label Dec 27, 2014
@ntamas ntamas self-assigned this Dec 27, 2014
Author

macks22 commented Dec 28, 2014

Thank you for your prompt reply. Actually, the only thing you should need in addition to the code I posted in the comment above is the list of integer IDs. I could give this to you, but it turns out you don't actually need it.

Interestingly, the attributes only change when the integers exceed a certain value. I found the break point in my original graph and then isolated the value above which errors start to occur. It seems any integer value over 1,000,000 gets rounded to the nearest multiple of ten, using something similar to the usual decimal rounding procedure. You can replicate this with the following:

import igraph

g = igraph.Graph()
ids = range(999980, 1000020)
g.add_vertices(ids)
g.write_graphml('test.graphml')
tg = igraph.Graph.Read_GraphML('test.graphml')
zip(g.vs['name'], tg.vs['name'])

This is the output:

[(999980, 999980.0),
 (999981, 999981.0),
 (999982, 999982.0),
 (999983, 999983.0),
 (999984, 999984.0),
 (999985, 999985.0),
 (999986, 999986.0),
 (999987, 999987.0),
 (999988, 999988.0),
 (999989, 999989.0),
 (999990, 999990.0),
 (999991, 999991.0),
 (999992, 999992.0),
 (999993, 999993.0),
 (999994, 999994.0),
 (999995, 999995.0),
 (999996, 999996.0),
 (999997, 999997.0),
 (999998, 999998.0),
 (999999, 999999.0),
 (1000000, 1000000.0),
 (1000001, 1000000.0),
 (1000002, 1000000.0),
 (1000003, 1000000.0),
 (1000004, 1000000.0),
 (1000005, 1000000.0),
 (1000006, 1000010.0),
 (1000007, 1000010.0),
 (1000008, 1000010.0),
 (1000009, 1000010.0),
 (1000010, 1000010.0),
 (1000011, 1000010.0),
 (1000012, 1000010.0),
 (1000013, 1000010.0),
 (1000014, 1000010.0),
 (1000015, 1000020.0),
 (1000016, 1000020.0),
 (1000017, 1000020.0),
 (1000018, 1000020.0),
 (1000019, 1000020.0)]

Oddly, values ending in 5 alternate between rounding down and rounding up: 1000005 rounds down to 1000000, while 1000015 rounds up to 1000020, and so on. This may have something to do with the round-half-to-even rule used in floating-point rounding.

Member

ntamas commented Dec 28, 2014

Okay, this is due to how the standard C library prints floats into the GraphML file. Up to 999999 we are fine because we write the number exactly into GraphML. From 1000000 the standard C library switches to scientific notation with rounding, so we get 1.00000e+06 instead, and of course the least significant digit is lost. I'll post a patch soon.
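The effect can be reproduced without igraph. The sketch below assumes the writer formats numbers with a printf-style "%g" specifier at its default precision of six significant digits; the thread does not name the exact specifier, so treat that as an assumption. Python's % operator follows the same "%g" rules, and roundtrip_via_g is a hypothetical helper name.

```python
def roundtrip_via_g(value):
    """Format a number like printf("%g") would, then parse the text back."""
    text = "%g" % float(value)  # six significant digits by default
    return float(text)

# Same ID range as in the report above.
ids = range(999980, 1000020)
pairs = [(i, roundtrip_via_g(i)) for i in ids]

for original, restored in pairs:
    print(original, restored)

# IDs whose value did not survive the write/read cycle.
changed = [i for i, restored in pairs if restored != i]
print(len(changed))
```

Under this assumption, six-digit values are written exactly, while seven-digit values lose their last digit. Exact ties such as 1000005 and 1000015 are resolved to the even neighbour, which would explain the alternating rounding direction.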

(Fun fact: your code crashes my machine with zsh: illegal hardware instruction python attr_test.py - probably when we try to parse scientific notation back to a C double).

@ntamas ntamas added the C and high (High-priority issue; typically for cases when igraph returns incorrect result for non-corner cases) labels and removed the Python label Dec 28, 2014
@ntamas ntamas added this to the 0.7.2 milestone Dec 28, 2014
Member

ntamas commented Dec 29, 2014

Note to self: we are probably looking for a solution that strives to

  1. represent integers using plain decimals only, without any scientific notation involved (to improve readability of the GraphML file) and
  2. display as many significant digits from non-integers as possible to ensure the smallest loss of precision when a GraphML file is saved and then loaded back
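The two goals above can be sketched in Python as follows. This is only an illustration of the idea, not the code of the actual fix; format_number is a hypothetical name, and the real writer lives in igraph's C core.

```python
def format_number(x):
    """Render a double as text that parses back to the same value."""
    x = float(x)
    if x.is_integer():
        # Goal 1: integer-valued doubles as plain decimals, no exponent.
        return format(x, ".0f")
    # Goal 2: repr() yields the shortest text that round-trips exactly;
    # in C, printing with 17 significant digits ("%.17g") also round-trips.
    return repr(x)

samples = [999999, 1000001, 1347793, 0.1, 1.0 / 3.0, 2.5e-7]
formatted = [format_number(s) for s in samples]
print(formatted)

# Every sample survives a format/parse cycle unchanged.
all_roundtrip = all(float(t) == float(s) for t, s in zip(formatted, samples))
print(all_roundtrip)  # True
```

Infinities and NaN are not integer-valued, so they fall through to repr() here; a real writer would need to decide how to represent them in GraphML.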

ntamas added a commit that referenced this issue Dec 29, 2014
Member

ntamas commented Dec 29, 2014

Fixed in fdcaa14.
