Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IMDB data and loaders #224

Merged
merged 8 commits into from Dec 21, 2022
Merged

Conversation

brs96
Copy link
Contributor

@brs96 brs96 commented Nov 22, 2022

Co-authored-by: Adam Schill Collberg adam.schill.collberg@protonmail.com

Thank you for your contribution to the Graph Data Science Client project.

Before submitting this PR, please read Contributing to the Neo4j Ecosystem.

Make sure:

  • You signed the Neo4j CLA (Contributor License Agreement) so that we are allowed to ship your code in our library
  • Your contribution is covered by tests

@netlify
Copy link

netlify bot commented Nov 22, 2022

Deploy Preview for neo4j-graph-data-science-client canceled.

Name Link
🔨 Latest commit 908adcc
🔍 Latest deploy log https://app.netlify.com/sites/neo4j-graph-data-science-client/deploys/63a30fbfc96d5200095645e6

@brs96
Copy link
Contributor Author

brs96 commented Nov 23, 2022

@brs96 brs96 force-pushed the add-karate-club-dataset branch 4 times, most recently from cc64644 to 7ec1172 Compare November 24, 2022 14:41
@brs96 brs96 changed the title Prevent possible flaky tests due to slow query Add karate club and IMDB data and loaders Nov 24, 2022
@brs96 brs96 marked this pull request as ready for review November 24, 2022 17:34
@brs96
Copy link
Contributor Author

brs96 commented Nov 24, 2022

NO merging until it's confirmed that hosting datasets are fine.

Legal says ok, can merge. Email attached on card.

Copy link
Contributor

@FlorentinD FlorentinD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice additions!

Comment on lines 54 to 57
# load the graph undirected
opposite_rels = pd.DataFrame().assign(
sourceNodeId=rels["targetNodeId"], targetNodeId=rels["sourceNodeId"], relationshipType="KNOWS"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how we want to continue here.
Once we actually have undirected graph.construct do we differ on how we load based on the server version.
Maybe having a load_xy_undirected would an option or just a parameter to the load_xy method.

Mainly making this argument, to question whether we really want to load the graph undirected in this method by default. This would make different to the cora loader

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't think we want to load it "undirected" at this point. Maybe we can add an optional parameter orientation="NATURAL" in the future when we have the support for it instead

)
person_nodes_with_features["labels"] = "Person"
# Set 'Person' class as -1.0 since gds.graph.construct only allows one nodeDF for community,
# and that setting NaN gives value is null error from Arrow flight RPC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats surprising. There should be a way to load NaN values. I will double check, but maybe its worth to create a card

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking with Martin, NaN we expect to support.
Setting values to null we cannot, as what is the primitive version of null after all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried np.nan and float("NaN"), neither seems to work... Any other way of doing it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be nice to somehow separate movies that have class labels, and those that have not. For example, in addition to the "Movie" label, they could have either "Labeled" or "Unlabeled" labels. This way we would be able to use the dataset in our NC pipeline (which one of the reasons we are doing this), and you don't really have to worry about null/nan features. Quite simply, only ["Movie", "Labeled"] nodes would have the class label.
This is similar to what we're doing with Cypher in our quality benchmarks on this dataset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you card this @brs96 , that np.nan() does not work with arrow? I think its worth to investigate more (but not on this PR as I agree with Adam)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need NaN support for this since since Cypher loading does not support using several data frames. So they must all have the label property

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaN support was expected to work and we will look into why the python NaN does not work

Copy link
Contributor

@adamnsch adamnsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you verify that this works with the new file paths? If you package the library, and then install it on a "fresh system", does it find the resource files?

See this line in setup.py:

package_data={"graphdatascience": ["py.typed", "resources/**/*.pkl"]},

I think it should still be fine.

Other than that, just have some smaller remarks.

Nice work!

graphdatascience/graph/graph_proc_runner.py Outdated Show resolved Hide resolved
graphdatascience/graph/graph_proc_runner.py Outdated Show resolved Hide resolved
graphdatascience/tests/integration/test_database_ops.py Outdated Show resolved Hide resolved
Comment on lines 54 to 57
# load the graph undirected
opposite_rels = pd.DataFrame().assign(
sourceNodeId=rels["targetNodeId"], targetNodeId=rels["sourceNodeId"], relationshipType="KNOWS"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't think we want to load it "undirected" at this point. Maybe we can add an optional parameter orientation="NATURAL" in the future when we have the support for it instead

adj_matrix = raw_adj_matrix[raw_adj_matrix[0] != 0]
edge_list.append(adj_matrix.iloc[:, :-1].rename(columns={"level_0": "sourceNodeId", "level_1": "targetNodeId"}))

edge_list[0]["relationshipType"] = "MovieDirector"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "Director" and "Actor" are better names

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using those as node labels, and using DIRECTED and ACTED_IN instead?

with path("graphdatascience.resources.imdb", "raw/labels.pkl") as labels_resource:
class_labels = read_pickle(labels_resource)
movies = pd.DataFrame([item for sublist in class_labels for item in sublist])
movies = movies.rename(columns={0: "nodeId", 1: "class"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be call the property "genre" instead of "class" to give it some semantic meaning. And indeed, it may not be used for classification after all, in which "class" does not really make sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment for "feature" below. I would instead call it "plot_keywords" or something similar

Copy link
Contributor

@adamnsch adamnsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should commit the "raw" data. It's very large for a github repo. Let's upload it to our google drive or something for safe keeping?

@FlorentinD
Copy link
Contributor

i would suggest our s3 bucket

@brs96
Copy link
Contributor Author

brs96 commented Nov 25, 2022

What's left I think are:

  1. Rename some fields for imdb (and then recreated the pickles). Perhaps also cleanup 'class' = -1 for 'Person' nodes by either set them to NaN or allowing multiple nodeDFs in graph.construct for community users?
  2. Move the 'raw 'files to s3.

@FlorentinD
Copy link
Contributor

I just merged the null support for float values in graph.construct so this should unblock you from using NaN in the input data for 2.3+

@Mats-SX
Copy link
Contributor

Mats-SX commented Dec 12, 2022

This is blocked by #225

@FlorentinD
Copy link
Contributor

Are you referring to supporting multiple DFs in CE? This is not the idea of this PR but probably a follow-up.
Couldnt we merge the multiple DFs into one for this loader here?

@adamnsch
Copy link
Contributor

Are you referring to supporting multiple DFs in CE? This is not the idea of this PR but probably a follow-up. Couldnt we merge the multiple DFs into one for this loader here?

Ah ok. I don't think we should merge DFs here as we will just have to do unnecessary work. I'd rather wait :)

@Mats-SX Mats-SX changed the title Add karate club and IMDB data and loaders Add IMDB data and loaders Dec 15, 2022
@Mats-SX
Copy link
Contributor

Mats-SX commented Dec 15, 2022

Extracted karate club stuff to #234.
Rebased this PR on top of it.

brs96 and others added 8 commits December 21, 2022 14:52
Co-authored-by: Florentin Dörre <florentin.dorre@neotechnology.com>
Co-authored-by: Adam Schill Collberg <adam.schill.collberg@protonmail.com>
Co-authored-by: Mats Rydberg <mats@neo4j.org>
Co-authored-by: Mats Rydberg <mats@neo4j.org>
Co-authored-by: Mats Rydberg <mats@neo4j.org>
@Mats-SX Mats-SX merged commit d385372 into neo4j:main Dec 21, 2022
@brs96 brs96 deleted the add-karate-club-dataset branch January 25, 2023 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants