streamNodeProperty() doesn't work with gds.run_cypher() as I guess #179

Closed
MOSSupport opened this issue Sep 16, 2022 · 27 comments

Comments

@MOSSupport

MOSSupport commented Sep 16, 2022

graphdatascience 1.3

I tried a query like this:

query = f'''
   call gds.graph.streamNodeProperty(
      'xxx',
      'xxxx',
      ['xxxxx']
   )
yield nodeId as id, propertyValue as degree
return id, degree limit 100
'''
result = gds.run_cypher(query)

=> KeyError: 'graph_name'

I figured out to make it work like this:

query = f'''
   ...
'''
params = {
   'graph_name': 'xxx',
   'properties': 'xxxx',
   'entities': ['xxxxx'],
   'config': ''
}
result = gds.run_cypher(query, params)

=> No error, but it returned all rows (not limited to 100), with columns named nodeId and propertyValue (not renamed to id and degree).

Other Cypher queries work with gds.run_cypher(query) as expected.

@FlorentinD
Contributor

Hello @MOSSupport ,
I am trying to reproduce your error.

To clarify, are you trying to only return the rows with a degree >= 100 or only show the first 100 rows?

@MOSSupport
Author

MOSSupport commented Sep 17, 2022

Hi,
I was just trying to reduce the number of rows in the result for testing purposes.
Sorry, I cannot copy the full error messages or the code, since this was tested in a closed environment (for security reasons).

Thanks.

@FlorentinD
Contributor

For further help, we need the GDS and Neo4j version you were running the queries against.

When I tried it in my test environment, I could get the expected result.

@MOSSupport
Author

MOSSupport commented Sep 17, 2022

Neo4j 4.4.8 Enterprise image (Debian 11, OpenJDK 11.0.15) from Docker Hub, running on a RHEL 7.9 machine.
GDS 2.1.9

@FlorentinD
Contributor

Hello,
I used the same versions as you, but I still see the rename of the property as well as the limit to only 100 rows.
Could you try running a similar query with our example notebook (https://github.com/neo4j/graph-data-science-client/blob/main/examples/load-data-via-graph-construction.ipynb)?

The query I used based on your description was
gds.run_cypher("CALL gds.graph.streamNodeProperty($graph_name, $property, $nodeLabels) YIELD nodeId AS id, propertyValue AS degree RETURN id, degree LIMIT 10", {"graph_name": G.name(), "property": "subject", "nodeLabels": ["Paper"]})

@MOSSupport
Author

I tried the code you posted above, but it still raises the same KeyError I described earlier.
I also found one more thing:
When I use "result = gds.graph.streamNodeProperty(G, 'xxx', ['xxxx'])", I get a different result from the one I get when I run streamNodeProperty in Neo4j Browser. For example, Neo4j Browser returns nodeIds of the correct label, but the result in Python (I use PyCharm) comes from a different label, even though the Python client returns it without an error.

@FlorentinD
Contributor

Looks like I misunderstood you; I thought you had made it work afterwards (as written at the end of your first description).

Can you share the exact (anonymized) browser query and the Python client equivalent?
Also, can you share the stack trace? As for the first part, KeyError: 'graph_name' does not make sense, as there is no graph_name inside the query.

If you are using a limit, you might also want to add an ORDER BY nodeId so you can compare the results between Neo4j Desktop and the Python client.
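
To illustrate the ORDER BY suggestion: a minimal sketch of an ordered, limited query, assuming `gds` is an already-connected GraphDataScience object. The helper name `stream_first_rows` and the parameter names are made up for this example and are not part of the library.

```python
# Cypher for the procedure discussed in this issue.
# ORDER BY makes the rows kept by LIMIT deterministic, so two clients can be compared.
STREAM_QUERY = """
CALL gds.graph.streamNodeProperty($graph_name, $node_property, $node_labels)
YIELD nodeId AS id, propertyValue AS degree
RETURN id, degree
ORDER BY id
LIMIT $limit
"""


def stream_first_rows(gds, graph_name, node_property, node_labels, limit=100):
    """Run the ordered, limited query through gds.run_cypher (hypothetical helper)."""
    params = {
        "graph_name": graph_name,
        "node_property": node_property,
        "node_labels": list(node_labels),
        "limit": int(limit),
    }
    return gds.run_cypher(STREAM_QUERY, params)
```

The same query text with the same parameters can be pasted into the browser, which should make a row-by-row comparison straightforward.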

@MOSSupport
Author

MOSSupport commented Sep 21, 2022

The original query is the code in my first post. Its error message was like this (beware of typos, since I am retyping it):
Traceback (most recent call last):
File "/usr/local/lib/python3.8/code.py", line 90 in runcode
exec(code, self.locals)
File "", line 10, in
File "/app1/pycharm-code/venv/lib/python3.8/site-packages/graphdatascience/graph_data_science.py", line 134, in run_cypher
return self._query_runner.run_query(query, params)
File "/app1/pycharm-code/venv/lib/python3.8/site-packages/graphdatascience/query_runner/arrow_query_runner.py", line 57, in run_query
graph_name = params["graph_name"]
KeyError: 'graph_name'

Yes, there is no 'graph_name' in my code, but it appears in the error message above.

@MOSSupport
Author

MOSSupport commented Sep 21, 2022

So I changed the first snippet according to the error message, which gave the second snippet. It runs without the error above, and the result is filtered by label (the 3rd parameter), but the nodeId values are not the same as the IDs of the original nodes in the stored graph.

@FlorentinD
Contributor

Ah, you are using GDS Enterprise with Arrow enabled. That explains the difference!

I see the error now as well and will update you once we have fixed the issue.

As a temporary workaround, I can only think of not using gds.run_cypher and instead using gds.graph.streamNodeProperty, then filtering the result afterwards with pandas.

@MOSSupport
Author

But I have to develop this process as a Python application now. How can I get the actual node properties, such as 'name' strings? As I said above, the node IDs returned by gds.graph.streamNodeProperty in Python code are not correct, so I cannot retrieve the right properties.

@FlorentinD
Contributor

I would suggest using pandas to transform the result for now. As mentioned above, you need a workaround for the time being.

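import pandas as pd  # streamNodeProperty returns a pandas DataFrame

G = gds.graph.get("xxx")  # look up the projected graph by name
result = gds.graph.streamNodeProperty(G, "myProperty", ["my_label"])  # labels passed as a list, as in the earlier calls
result = result.rename(columns={"nodeId": "id", "propertyValue": "degree"})
result = result.iloc[:100, :]  # keep only the first 100 rows

display(result)  # display() is available in Jupyter/IPython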

Hope this helps you as a temporary solution.

@MOSSupport
Author

MOSSupport commented Sep 21, 2022

See my 4th post. I already tried that, but I got strange nodeIds. I cannot find the actual nodes in the original graph by matching on the returned nodeIds. They do not look like the correct IDs, so I cannot use them in development for now.

@FlorentinD
Contributor

OK, unfortunately there is another bug on the server side.
Thank you for pointing this out!
We have found a fix, and I will update you when we have published a new version.

For your use case, the only workaround right now is to disable Arrow in the server settings until our next release.

@FlorentinD
Contributor

Another idea is to build this library from the version in #186
(if you can't disable Arrow on the server or wait for the next release).

@MOSSupport
Author

It didn't work in my test even after disabling Arrow. Currently I use the Bolt driver to fetch the result of the streamNodeProperty Cypher query in my Python code. The other functions are developed with the Python client, since everything works except streamNodeProperty.

@FlorentinD
Contributor

Sad to hear the workaround does not work for you.
When you use the Bolt driver, it should work even with Arrow enabled, as we won't go through Arrow in that case.

Without Arrow I could not get a test to fail. Did you make sure to restart the server after disabling Arrow?
Could you be more specific about what didn't work after you disabled Arrow?
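
For reference, the Bolt-driver route mentioned above could look roughly like the following. This is only a sketch: `driver` is assumed to be a driver object from the official `neo4j` Python package, and the function and parameter names are invented for illustration.

```python
def stream_node_property_via_bolt(driver, graph_name, node_property, node_labels, limit=100):
    """Call gds.graph.streamNodeProperty over a plain Bolt session (illustrative helper)."""
    query = (
        "CALL gds.graph.streamNodeProperty($graph_name, $node_property, $node_labels) "
        "YIELD nodeId, propertyValue "
        "RETURN nodeId AS id, propertyValue AS degree "
        "ORDER BY id LIMIT $limit"
    )
    with driver.session() as session:
        result = session.run(
            query,
            graph_name=graph_name,
            node_property=node_property,
            node_labels=list(node_labels),
            limit=limit,
        )
        # Materialise the records before the session closes.
        return [record.data() for record in result]
```

With the `neo4j` package, `driver` would typically come from `GraphDatabase.driver(uri, auth=(user, password))`, where the URI and credentials are whatever your server uses.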

@MOSSupport
Author

Ah, Arrow must not have been disabled at that time. Now I get the correct IDs, after verifying with dbms.listConfig that Arrow is disabled.
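
A check along these lines can be scripted. The sketch below assumes that the Arrow-related server settings share the prefix `gds.arrow` (for example `gds.arrow.enabled`); the prefix and the helper name are assumptions, so adjust the search string to whatever your server actually reports.

```python
def arrow_settings(gds):
    """Return the Arrow-related settings reported by dbms.listConfig (illustrative helper)."""
    return gds.run_cypher(
        # dbms.listConfig takes a search string and yields matching settings.
        "CALL dbms.listConfig($search) YIELD name, value RETURN name, value",
        {"search": "gds.arrow"},  # assumed setting prefix
    )
```

If the returned rows still show Arrow as enabled, the configuration change has probably not taken effect yet, for example because the server was not restarted.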

@MOSSupport
Author

But it takes too long for large data:
df_news = gds.graph.streamNodeProperty(
G,
'dimension12',
['News']
)
news_ids = df_news[0:10]['nodeId'].to_list()

It took 117 seconds with 2M News nodes. That is not practical for development without Arrow.

@FlorentinD
Contributor

With Arrow disabled, you can use your original query and return only the first 10 elements in the Cypher query through run_cypher. This should be practical in development even without Arrow.

For the Arrow fix, you need to wait for 2.1.13, which we plan to release next Thursday.

@MOSSupport
Author

Yes, run_cypher works without the KeyError once Arrow is disabled. The slice of 10 records was just to check the code. My goal is to fetch all the vectors and then process them with NumPy or scikit-learn to find the most similar nodes. I tried Filtered KNN (alpha), but I terminated it midway because it took too long on the large data. So I have to wait for the next release.
Thanks.
Dongho.
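
As an illustration of the NumPy step described above, here is a minimal brute-force cosine-similarity sketch. It assumes the embeddings have already been streamed into memory as one vector per node; the function name and the in-memory layout are assumptions for this example, not something from the GDS client.

```python
import numpy as np


def top_k_cosine(node_ids, vectors, query_index, k=10):
    """Return the k (node_id, score) pairs most similar to one row by cosine similarity."""
    matrix = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # keep all-zero vectors at zero instead of dividing by 0
    unit = matrix / norms
    scores = unit @ unit[query_index]  # cosine similarity of every row to the query row
    scores[query_index] = -np.inf  # exclude the query node itself
    k = min(k, len(scores) - 1)  # never return the excluded query row
    order = np.argsort(-scores)[:k]
    return [(node_ids[i], float(scores[i])) for i in order]
```

With a DataFrame `df` from streamNodeProperty, a call might look like `top_k_cosine(df["nodeId"].to_list(), df["propertyValue"].to_list(), query_index=0)`. This scores every node against the query, so it is a baseline rather than a tuned nearest-neighbour search.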

@FlorentinD
Contributor

FlorentinD commented Sep 22, 2022

That's helpful feedback.
Can you describe your use case, for example how large the sets of sourceNodes/targetNodes are and which configuration you tried?
Also, what is your desired response time?

@MOSSupport
Author

MOSSupport commented Sep 23, 2022

It's a quick PoC project to demonstrate Neo4j and GDSL capabilities. It has no specific requirements to meet.
The graph was built from 2.4M news articles. It has News nodes and Word or Entity (multi-token) nodes generated from the news. The projection has 77M nodes and 800M relationships.
Some code can run as a batch process, but some cases need a real-time response. In any case, the faster the better. I have to finish this demo by around the end of this month.

@FlorentinD
Contributor

Thanks for the details.
The current implementation applies the filter as a post-processing step, so it's good feedback that this is not fast enough for your scenario.

I would like to understand a bit more about your filtering.
For the 77M nodes, what kind of filter are you trying to apply?
Do you want to find the nearest neighbor inside a small subset of nodes?

@MOSSupport
Author

MOSSupport commented Sep 23, 2022

Yes, that is the core issue of the demo. The customer described the kind of news they want to find as a set of several hundred words. Those words can be the source nodes. GDSL has several algorithms that take a source node or nodes as a parameter, such as BFS. But those approaches mostly find News nodes that contain the words or entities, so the results are not very different from an exact-match search.
I tried FastRP to improve the results with vectors, but the list of most similar nodes is very noisy (it contains incorrect results).
I am working on improving the FastRP results or on building a scoring process to select better News nodes.
It's competing with a BERT-based approach (with GPU) from another team.

@FlorentinD
Contributor

Thanks for the detailed response. We will consider your feedback in our future planning!

As a side comment, FastRP is also an algorithm where you need to tune the iterationWeights, so noisy results could be due to the embeddings not being good enough.
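
One way to explore the iterationWeights is a small sweep that writes one embedding property per candidate into the projection. The sketch below assumes `gds` is a connected client and `G` a projected graph object. The helper name, the property naming scheme and the default values are placeholders, not recommendations.

```python
def fastrp_weight_sweep(gds, G, candidate_weights, dimension=128, seed=42):
    """Add one FastRP embedding property per iterationWeights candidate (illustrative)."""
    properties = []
    for index, weights in enumerate(candidate_weights):
        prop = f"fastrp_w{index}"  # hypothetical property naming scheme
        gds.fastRP.mutate(
            G,
            embeddingDimension=dimension,
            iterationWeights=list(weights),
            randomSeed=seed,  # fixed seed so runs differ only in their weights
            mutateProperty=prop,
        )
        properties.append(prop)
    return properties
```

Each returned property name can then be streamed and checked against a handful of known-similar nodes to see which weighting gives the least noisy neighbours.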

@MOSSupport 2.1.13 is released now with the bug fix for Arrow.
For the client we are also planning a release, but had to work on some issues around our CI first.

@FlorentinD
Contributor

@MOSSupport we have now also released a new client version, so after updating your versions it should work.
