Add IMDB data and loaders #224

brs96 · 2022-11-22T14:18:12Z

Co-authored-by: Adam Schill Collberg adam.schill.collberg@protonmail.com

Thank you for your contribution to the Graph Data Science Client project.

Before submitting this PR, please read Contributing to the Neo4j Ecosystem.

Make sure:

You signed the Neo4j CLA (Contributor License Agreement) so that we are allowed to ship your code in our library
Your contribution is covered by tests

netlify · 2022-11-22T14:18:20Z

✅ Deploy Preview for neo4j-graph-data-science-client canceled.

Name	Link
🔨 Latest commit	`908adcc`
🔍 Latest deploy log	https://app.netlify.com/sites/neo4j-graph-data-science-client/deploys/63a30fbfc96d5200095645e6

brs96 · 2022-11-23T09:35:32Z

Adding additional pre-canned datasets. For example: 1 heterogeneous graph and 1 social network.

brs96 · 2022-11-24T17:36:23Z

No merging until it's confirmed that hosting the datasets is fine.

Legal says ok, can merge. Email attached on card.

FlorentinD

Nice additions!

FlorentinD · 2022-11-25T08:48:09Z

graphdatascience/graph/graph_proc_runner.py

+        # load the graph undirected
+        opposite_rels = pd.DataFrame().assign(
+            sourceNodeId=rels["targetNodeId"], targetNodeId=rels["sourceNodeId"], relationshipType="KNOWS"
+        )


I wonder how we want to continue here.
Once we actually have an undirected graph.construct, do we load differently based on the server version?
Maybe having a load_xy_undirected would be an option, or just a parameter to the load_xy method.

I'm mainly making this argument to question whether we really want to load the graph undirected in this method by default. This would make it different from the Cora loader.

Yeah I don't think we want to load it "undirected" at this point. Maybe we can add an optional parameter orientation="NATURAL" in the future when we have the support for it instead
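
For illustration, the "just a parameter" option could look roughly like the sketch below. The helper name `prepare_relationships` and the `undirected` flag are hypothetical and not part of the client API; the mirroring follows the same idea as the `opposite_rels` hunk above, with natural direction kept as the default.

```python
import pandas as pd


def prepare_relationships(rels: pd.DataFrame, undirected: bool = False) -> pd.DataFrame:
    """Return `rels` unchanged, or with a mirrored copy of every relationship appended.

    `undirected` is a hypothetical flag used for illustration only.
    """
    if not undirected:
        return rels
    # Swap source and target to get the opposite direction of every relationship.
    reverse = rels.rename(
        columns={"sourceNodeId": "targetNodeId", "targetNodeId": "sourceNodeId"}
    )
    # concat aligns on column names, so the swapped columns end up in the right place.
    return pd.concat([rels, reverse], ignore_index=True)


rels = pd.DataFrame(
    {"sourceNodeId": [0, 1], "targetNodeId": [1, 2], "relationshipType": "KNOWS"}
)
directed = prepare_relationships(rels)
mirrored = prepare_relationships(rels, undirected=True)
```

With the default, the loader behaves like the other loaders; the mirrored frame simply has twice as many rows.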

graphdatascience/resources/imdb/serialize_imdb.py

FlorentinD · 2022-11-25T08:57:47Z

graphdatascience/resources/imdb/serialize_imdb.py

+)
+person_nodes_with_features["labels"] = "Person"
+# Set 'Person' class as -1.0 since gds.graph.construct only allows one nodeDF for community,
+# and that setting NaN gives value is null error from Arrow flight RPC


That's surprising. There should be a way to load NaN values. I will double-check, but maybe it's worth creating a card.

Talking with Martin: we expect to support NaN.
Setting values to null we cannot support, since there is no primitive version of null, after all.

Tried np.nan and float("NaN"), neither seems to work... Any other way of doing it?

I think it could be nice to somehow separate movies that have class labels from those that do not. For example, in addition to the "Movie" label, they could have either a "Labeled" or an "Unlabeled" label. This way we would be able to use the dataset in our NC pipeline (which is one of the reasons we are doing this), and you don't really have to worry about null/NaN features. Quite simply, only ["Movie", "Labeled"] nodes would have the class label.
This is similar to what we're doing with Cypher in our quality benchmarks on this dataset.
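
A minimal sketch of that labelling idea, using made-up toy data rather than the real IMDB frame. It assumes the node DataFrame has a float "class" column where a missing class shows up as NaN.

```python
import pandas as pd

# Toy stand-in for the movie nodes; node 1 has no known class.
movies = pd.DataFrame({"nodeId": [0, 1, 2], "class": [2.0, float("nan"), 0.0]})

has_class = movies["class"].notna()
# Every movie keeps "Movie" and gets exactly one of "Labeled" / "Unlabeled".
movies["labels"] = has_class.map(
    lambda known: ["Movie", "Labeled"] if known else ["Movie", "Unlabeled"]
)

# Only these rows would carry the class property into an NC pipeline.
labeled_movies = movies[has_class]
```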

Can you card this, @brs96: that np.nan does not work with Arrow? I think it's worth investigating more (but not in this PR, as I agree with Adam).

I think we still need NaN support for this, since Cypher loading does not support using several data frames. So they must all have the label property.

NaN support was expected to work and we will look into why the python NaN does not work
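
To make the single-DataFrame constraint concrete, here is a small sketch with toy data (not the real dataset). Once movie and person nodes share one frame, the persons end up with NaN in "class", which is exactly where NaN support is needed.

```python
import pandas as pd

# Toy node frames; only movies have a "class" column.
movies = pd.DataFrame({"nodeId": [0, 1], "labels": "Movie", "class": [1.0, 2.0]})
persons = pd.DataFrame({"nodeId": [2, 3], "labels": "Person"})

# One frame for all nodes: pandas fills the missing "class" values with NaN.
nodes = pd.concat([movies, persons], ignore_index=True)
```

The "class" column stays float64, so the missing values are float NaN rather than None.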

adamnsch

Did you verify that this works with the new file paths? If you package the library, and then install it on a "fresh system", does it find the resource files?

See this line in setup.py:

package_data={"graphdatascience": ["py.typed", "resources/**/*.pkl"]},

I think it should still be fine.
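
One way to check this on a fresh install might be a small helper like the one below. It is only a sketch: `resource_exists` is a hypothetical name, and it uses `importlib.resources.files` (Python 3.9+) rather than the `path(...)` call used in the PR.

```python
from importlib import resources


def resource_exists(package: str, name: str) -> bool:
    """Return True if `name` is a file inside the installed `package`."""
    try:
        return resources.files(package).joinpath(name).is_file()
    except ModuleNotFoundError:
        # The package itself is not installed or not importable.
        return False
```

After installing the built wheel into a clean environment, calling it with the resource package and each pickle file name would show whether the `package_data` glob picked the files up.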

Other than that, just have some smaller remarks.

Nice work!

graphdatascience/graph/graph_proc_runner.py

graphdatascience/tests/integration/test_database_ops.py

adamnsch · 2022-11-25T10:22:11Z

graphdatascience/graph/graph_proc_runner.py

+        # load the graph undirected
+        opposite_rels = pd.DataFrame().assign(
+            sourceNodeId=rels["targetNodeId"], targetNodeId=rels["sourceNodeId"], relationshipType="KNOWS"
+        )


Yeah I don't think we want to load it "undirected" at this point. Maybe we can add an optional parameter orientation="NATURAL" in the future when we have the support for it instead

graphdatascience/resources/imdb/serialize_imdb.py

adamnsch · 2022-11-25T12:33:17Z

graphdatascience/resources/imdb/serialize_imdb.py

+    adj_matrix = raw_adj_matrix[raw_adj_matrix[0] != 0]
+    edge_list.append(adj_matrix.iloc[:, :-1].rename(columns={"level_0": "sourceNodeId", "level_1": "targetNodeId"}))
+
+edge_list[0]["relationshipType"] = "MovieDirector"


I think "Director" and "Actor" are better names

How about using those as node labels, and using DIRECTED and ACTED_IN instead?
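
A rough sketch of what that renaming could mean for the edge frames. It assumes the raw edges run from movie to person (as the name "MovieDirector" suggests), so they are flipped to read (Director)-[:DIRECTED]->(Movie); the helper name and toy IDs are made up.

```python
import pandas as pd


def to_person_movie_rels(movie_person_rels: pd.DataFrame, rel_type: str) -> pd.DataFrame:
    """Flip movie->person edges into person->movie edges with the given type."""
    return pd.DataFrame(
        {
            "sourceNodeId": movie_person_rels["targetNodeId"],
            "targetNodeId": movie_person_rels["sourceNodeId"],
            "relationshipType": rel_type,
        }
    )


# Toy edges: movies 0 and 1, directors 10 and 11.
movie_director = pd.DataFrame({"sourceNodeId": [0, 1], "targetNodeId": [10, 11]})
directed = to_person_movie_rels(movie_director, "DIRECTED")
```

The same helper would give ACTED_IN edges for the actor adjacency data.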

adamnsch · 2022-11-25T12:35:48Z

graphdatascience/resources/imdb/serialize_imdb.py

+with path("graphdatascience.resources.imdb", "raw/labels.pkl") as labels_resource:
+    class_labels = read_pickle(labels_resource)
+movies = pd.DataFrame([item for sublist in class_labels for item in sublist])
+movies = movies.rename(columns={0: "nodeId", 1: "class"})


Maybe call the property "genre" instead of "class" to give it some semantic meaning. Indeed, it may not be used for classification after all, in which case "class" does not really make sense.

The same comment applies to "feature" below. I would instead call it "plot_keywords" or something similar.
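
Applied to the hunk above, the rename could look like this. The nested `class_labels` value is a toy stand-in; the real layout of raw/labels.pkl is an assumption here.

```python
import pandas as pd

# Toy stand-in for the unpickled labels: lists of (nodeId, genre) pairs.
class_labels = [[(0, 2), (1, 0)], [(2, 1)]]

movies = pd.DataFrame([item for sublist in class_labels for item in sublist])
# "genre" instead of "class", so the property name says what the value means.
movies = movies.rename(columns={0: "nodeId", 1: "genre"})
```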

adamnsch

I don't think we should commit the "raw" data. It's very large for a GitHub repo. Let's upload it to our Google Drive or something for safekeeping?

FlorentinD · 2022-11-25T13:35:56Z

I would suggest our S3 bucket.

brs96 · 2022-11-25T15:54:48Z

What's left I think are:

Rename some fields for IMDB (and then recreate the pickles). Perhaps also clean up 'class' = -1 for 'Person' nodes, by either setting them to NaN or allowing multiple nodeDFs in graph.construct for community users?
Move the 'raw' files to S3.

FlorentinD · 2022-12-06T09:18:12Z

I just merged the null support for float values in graph.construct so this should unblock you from using NaN in the input data for 2.3+
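
A loader that relies on this would need some kind of version gate. The helper below is only a sketch of such a check: the function name is hypothetical, and the client's own server version handling may look quite different.

```python
import re


def supports_nan_in_construct(server_version: str) -> bool:
    """Return True for GDS server versions 2.3 and later."""
    match = re.match(r"(\d+)\.(\d+)", server_version)
    if match is None:
        raise ValueError(f"Unrecognized server version: {server_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles versions like 2.10 correctly, unlike string comparison.
    return (major, minor) >= (2, 3)
```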

Mats-SX · 2022-12-12T08:49:47Z

This is blocked by #225

FlorentinD · 2022-12-12T09:35:11Z

Are you referring to supporting multiple DFs in CE? This is not the idea of this PR but probably a follow-up.
Couldn't we merge the multiple DFs into one for this loader here?

adamnsch · 2022-12-12T11:14:34Z

> Are you referring to supporting multiple DFs in CE? This is not the idea of this PR but probably a follow-up. Couldn't we merge the multiple DFs into one for this loader here?

Ah ok. I don't think we should merge DFs here as we will just have to do unnecessary work. I'd rather wait :)

graphdatascience/tests/integration/test_database_ops.py

Mats-SX · 2022-12-15T15:14:43Z

Extracted karate club stuff to #234.
Rebased this PR on top of it.

Co-authored-by: Florentin Dörre <florentin.dorre@neotechnology.com>
Co-authored-by: Adam Schill Collberg <adam.schill.collberg@protonmail.com>

Co-authored-by: Mats Rydberg <mats@neo4j.org>

brs96 force-pushed the add-karate-club-dataset branch 4 times, most recently from cc64644 to 7ec1172 Compare November 24, 2022 14:41

brs96 changed the title ~~Prevent possible flaky tests due to slow query~~ Add karate club and IMDB data and loaders Nov 24, 2022

brs96 marked this pull request as ready for review November 24, 2022 17:34

Mats-SX assigned adamnsch Nov 25, 2022

FlorentinD reviewed Nov 25, 2022

adamnsch requested changes Nov 25, 2022

adamnsch reviewed Nov 25, 2022

adamnsch mentioned this pull request Nov 25, 2022

Add heterogeneous NC with HashGNN and autotuning notebook #227

Merged

2 tasks

brs96 force-pushed the add-karate-club-dataset branch from dc2de25 to 536430a Compare November 25, 2022 15:32

Mats-SX added the REVIEW OK - MERGE ON HOLD label Dec 12, 2022

Mats-SX reviewed Dec 15, 2022

graphdatascience/tests/integration/test_database_ops.py (outdated)

Mats-SX mentioned this pull request Dec 15, 2022

Add karate club pickle and loader #234

Merged

Mats-SX force-pushed the add-karate-club-dataset branch from 536430a to 57536fd Compare December 15, 2022 15:14

Mats-SX changed the title ~~Add karate club and IMDB data and loaders~~ Add IMDB data and loaders Dec 15, 2022

brs96 force-pushed the add-karate-club-dataset branch from 57536fd to e949ad2 Compare December 20, 2022 14:30

brs96 and others added 8 commits December 21, 2022 14:52

Add imdb pickle and loader

8cd4429

Fix imdb loader for community and with arrow

154e0f5

Use pandas1.3.5 for pickles and fix pyarrow flight rpc null error

67c8c3d

Apply PR comments

e562024

Co-authored-by: Florentin Dörre <florentin.dorre@neotechnology.com>
Co-authored-by: Adam Schill Collberg <adam.schill.collberg@protonmail.com>

Use multi dfs for imdb

7ba9be8

Co-authored-by: Mats Rydberg <mats@neo4j.org>

Fix test since undirected graph construct only added in 2.3

7253a6f

Fail load_imdb when server version before 2.3

c638357

Co-authored-by: Mats Rydberg <mats@neo4j.org>

Add precanned dataset changelog

908adcc

Co-authored-by: Mats Rydberg <mats@neo4j.org>

Mats-SX force-pushed the add-karate-club-dataset branch from fd85701 to 908adcc Compare December 21, 2022 13:53

Mats-SX enabled auto-merge December 21, 2022 14:05

Mats-SX mentioned this pull request Dec 21, 2022

Beautify cypher construct #236

Merged

Mats-SX merged commit d385372 into neo4j:main Dec 21, 2022

brs96 deleted the add-karate-club-dataset branch January 25, 2023 08:54
