Add IMDB data and loaders #224
✅ Deploy Preview for neo4j-graph-data-science-client canceled.
cc64644
to
7ec1172
Compare
Do NOT merge until it's confirmed that hosting the datasets is fine. Update: Legal says it's OK, so this can be merged. The email is attached on the card.
Nice additions!
# load the graph undirected
opposite_rels = pd.DataFrame().assign(
    sourceNodeId=rels["targetNodeId"], targetNodeId=rels["sourceNodeId"], relationshipType="KNOWS"
)
I wonder how we want to continue here.
Once we actually have an undirected graph.construct, do we load differently based on the server version?
Maybe having a load_xy_undirected would be an option, or just a parameter to the load_xy method.
I'm mainly making this argument to question whether we really want to load the graph undirected in this method by default. This would make it different from the cora loader.
Yeah, I don't think we want to load it "undirected" at this point. Maybe we can add an optional parameter orientation="NATURAL" in the future instead, once we have support for it.
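The optional parameter discussed here could look roughly like the sketch below. The helper name `with_orientation` and the toy data are hypothetical and not part of the client; only the column names and the `orientation="NATURAL"` default come from this thread, and `"UNDIRECTED"` is an assumed second value.

```python
import pandas as pd


def with_orientation(rels: pd.DataFrame, orientation: str = "NATURAL") -> pd.DataFrame:
    """Return the relationships to load for the requested orientation.

    Hypothetical helper for illustration, not part of the client API.
    """
    if orientation == "NATURAL":
        return rels
    if orientation == "UNDIRECTED":
        # Emulate an undirected graph by adding every relationship in reverse.
        opposite = rels.rename(columns={"sourceNodeId": "targetNodeId", "targetNodeId": "sourceNodeId"})
        return pd.concat([rels, opposite[rels.columns]], ignore_index=True)
    raise ValueError(f"Unsupported orientation: {orientation}")


rels = pd.DataFrame({"sourceNodeId": [0, 1], "targetNodeId": [1, 2], "relationshipType": "KNOWS"})
natural = with_orientation(rels)
undirected = with_orientation(rels, orientation="UNDIRECTED")
```

With the default, the loader would behave like the cora loader; callers who want the doubled relationships would have to opt in explicitly.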
)
person_nodes_with_features["labels"] = "Person"
# Set 'Person' class as -1.0 since gds.graph.construct only allows one nodeDF for community,
# and that setting NaN gives value is null error from Arrow flight RPC
That's surprising. There should be a way to load NaN values. I will double-check, but maybe it's worth creating a card.
Talked with Martin: we expect to support NaN. We cannot support setting values to null, since there is no primitive version of null, after all.
Tried np.nan and float("NaN"), neither seems to work... Any other way of doing it?
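For context on the two values tried here: np.nan is itself a plain Python float, so both spellings produce the same IEEE 754 NaN. That is why NaN has a primitive representation while null (None) does not. A minimal stdlib-only illustration, which says nothing about how the Arrow endpoint treats these values:

```python
import math

nan = float("NaN")

# NaN is a regular float value, so it fits in a primitive double column.
is_float = isinstance(nan, float)
is_nan = math.isnan(nan)

# NaN never compares equal to itself, so checks must use math.isnan, not ==.
equals_itself = nan == nan

# None is a separate object with no float representation.
none_is_float = isinstance(None, float)
```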
I think it could be nice to somehow separate movies that have class labels from those that do not. For example, in addition to the "Movie" label, they could have either a "Labeled" or an "Unlabeled" label. This way we would be able to use the dataset in our NC pipeline (which is one of the reasons we are doing this), and you don't really have to worry about null/NaN features. Quite simply, only ["Movie", "Labeled"] nodes would have the class label.
This is similar to what we're doing with Cypher in our quality benchmarks on this dataset.
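A minimal sketch of that split, using toy data. The DataFrame contents are made up, and the assumption that a missing `class` value marks an unlabeled movie is mine; the label names come from the comment above.

```python
import pandas as pd

# Toy stand-in for the movie nodes; a missing class marks an unlabeled movie.
movies = pd.DataFrame({"nodeId": [0, 1, 2], "class": [1.0, None, 0.0]})

has_class = movies["class"].notna()
movies["labels"] = [["Movie", "Labeled"] if labeled else ["Movie", "Unlabeled"] for labeled in has_class]

# Separate frames, so only the labeled one needs to carry the class column.
labeled_movies = movies[has_class]
unlabeled_movies = movies[~has_class].drop(columns="class")
```

Note that this relies on passing more than one node data frame, which, per the discussion in this thread, is not supported on every loading path.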
Can you card this, @brs96: that np.nan does not work with Arrow? I think it's worth investigating more (but not on this PR, as I agree with Adam).
I think we still need NaN support for this, since Cypher loading does not support using several data frames. So they must all have the label property.
NaN support was expected to work, and we will look into why the Python NaN does not.
Did you verify that this works with the new file paths? If you package the library and then install it on a "fresh system", does it find the resource files?
See this line in setup.py:
package_data={"graphdatascience": ["py.typed", "resources/**/*.pkl"]},
I think it should still be fine.
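One way to run that check after installing the built package into a clean environment is a small probe like the one below. The helper `resource_exists` is hypothetical; it uses `importlib.resources.files`, which requires Python 3.9 or newer, and is demonstrated on a standard-library package so it runs without the client installed.

```python
from importlib.resources import files


def resource_exists(package: str, resource: str) -> bool:
    """Report whether `resource` is shipped inside the installed `package`."""
    try:
        return files(package).joinpath(resource).is_file()
    except ModuleNotFoundError:
        return False


# Demonstrated on a standard-library package so the snippet runs anywhere.
found = resource_exists("json", "__init__.py")
missing = resource_exists("json", "no_such_file.pkl")

# In a fresh environment with the built package installed, the analogous check
# for this PR would be (package and path taken from the loader code):
# resource_exists("graphdatascience.resources.imdb", "raw/labels.pkl")
```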
Other than that, just have some smaller remarks.
Nice work!
adj_matrix = raw_adj_matrix[raw_adj_matrix[0] != 0]
edge_list.append(adj_matrix.iloc[:, :-1].rename(columns={"level_0": "sourceNodeId", "level_1": "targetNodeId"}))

edge_list[0]["relationshipType"] = "MovieDirector"
I think "Director" and "Actor" are better names
How about using those as node labels, and using DIRECTED and ACTED_IN as relationship types instead?
with path("graphdatascience.resources.imdb", "raw/labels.pkl") as labels_resource:
    class_labels = read_pickle(labels_resource)
movies = pd.DataFrame([item for sublist in class_labels for item in sublist])
movies = movies.rename(columns={0: "nodeId", 1: "class"})
Maybe call the property "genre" instead of "class" to give it some semantic meaning. Indeed, it may not be used for classification after all, in which case "class" does not really make sense.
The same comment for "feature" below. I would instead call it "plot_keywords" or something similar
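Applied to the flattening code under review, the suggested naming might look like the sketch below. The nested toy data is an assumption about the shape of the pickled labels (a list of lists of [nodeId, class] pairs, as implied by the flattening comprehension); "genre" is the name proposed in this thread.

```python
import pandas as pd

# Toy stand-in for the structure read from labels.pkl (assumed shape).
class_labels = [[[0, 1], [1, 2]], [[2, 0]]]

# Flatten the nested lists into one row per movie, as in the PR.
movies = pd.DataFrame([item for sublist in class_labels for item in sublist])

# Use a semantically meaningful property name instead of "class".
movies = movies.rename(columns={0: "nodeId", 1: "genre"})
```

The same pattern would apply to renaming "feature" to something like "plot_keywords".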
I don't think we should commit the "raw" data. It's very large for a GitHub repo. Let's upload it to our Google Drive or something for safekeeping?
I would suggest our S3 bucket.
dc2de25
to
536430a
Compare
What's left I think are:
|
I just merged the null support for float values in graph.construct so this should unblock you from using |
This is blocked by #225.
Are you referring to supporting multiple DFs in CE? That is not the idea of this PR, but probably a follow-up.
Ah, OK. I don't think we should merge DFs here, as we would just be doing unnecessary work. I'd rather wait :)
536430a
to
57536fd
Compare
Extracted karate club stuff to #234.
57536fd
to
e949ad2
Compare
Co-authored-by: Florentin Dörre <florentin.dorre@neotechnology.com> Co-authored-by: Adam Schill Collberg <adam.schill.collberg@protonmail.com>
Co-authored-by: Mats Rydberg <mats@neo4j.org>
fd85701
to
908adcc
Compare
Co-authored-by: Adam Schill Collberg adam.schill.collberg@protonmail.com
Thank you for your contribution to the Graph Data Science Client project.
Before submitting this PR, please read Contributing to the Neo4j Ecosystem.
Make sure: