Add SQL database to graph conversion tool (Db2Graph) #99

Merged
merged 42 commits into from Sep 14, 2022

Collaborator

@ryansun117 ryansun117 commented May 15, 2022

This PR introduces a new Marius feature: Db2Graph, a tool that converts SQL databases to graphs. Db2Graph turns a relational database into a graph represented as sets of triples, which can be used directly as input datasets for Marius. This streamlines preprocessing from database to Marius.

Db2Graph is contained in Marius but can be used as a standalone tool. Db2Graph currently supports graph conversion from three relational database management systems: MySQL, MariaDB, and PostgreSQL. Conversion with Db2Graph is achieved in the following steps:

  1. Users import/create the database locally
  2. Users define the configuration file and entity/edge SQL SELECT queries
  3. Db2Graph executes the SQL SELECT queries
  4. Db2Graph transforms the result set of queries into sets of triples
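
To make steps 3 and 4 concrete, here is a minimal sketch of the general idea: run an edge SELECT query and turn each result row into a (source, relation, destination) triple. It is not the code in this PR. The helper `rows_to_triples`, the toy `customer`/`purchase` schema, and the `bought` relation name are all hypothetical, and the standard-library `sqlite3` module stands in for MySQL/MariaDB/PostgreSQL only so the snippet runs without a database server.

```python
import sqlite3


def rows_to_triples(rows, relation):
    """Map (source, destination) rows to (source, relation, destination) triples.

    Hypothetical helper for illustration; not the function used in db2graph.py.
    """
    return [(str(src), relation, str(dst)) for src, dst in rows]


# Step 1 (stand-in): create a tiny in-memory database instead of importing a real one.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (customer_id INTEGER, product TEXT);
    INSERT INTO customer VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO purchase VALUES (1, 'laptop'), (2, 'phone'), (1, 'phone');
    """
)

# Step 2 (stand-in): an edge query that returns one (source, destination) pair per row.
edge_query = """
    SELECT c.name, p.product
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.id
    ORDER BY c.name, p.product
"""

# Step 3: execute the SELECT query.
rows = conn.execute(edge_query).fetchall()
conn.close()

# Step 4: transform the result set into triples, one tab-separated triple per line.
triples = rows_to_triples(rows, relation="bought")
for triple in triples:
    print("\t".join(triple))
```

In this sketch the relation label is passed alongside the query. How Db2Graph itself pairs queries with relation names is described in docs/db2graph/db2graph.rst.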

This pull request adds the source file src/python/tools/db2graph/db2graph.py and a documentation page, docs/db2graph/db2graph.rst. The documentation describes the requirements, definitions, and steps for using Db2Graph, and walks through a real example use case.

Tests are written with pytest and run through GitHub Actions to validate the correctness of the db2graph functions.

@ryansun117 ryansun117 changed the title from "Migrate db2graph from marius-internal" to "Add SQL database to graph conversion tool (Db2Graph)" on May 16, 2022
Collaborator

@JasonMoho JasonMoho left a comment

First pass, will need more iterations.

Resolved review threads (all outdated): src/python/tools/db2graph/db2graph.py (9), docs/db2graph/db2graph.rst (1)
@JasonMoho JasonMoho requested a review from thodrek May 18, 2022 21:50
ryansun117 and others added 11 commits May 18, 2022 21:23
Renamed the file to marius_db2graph to align with commands like marius_preprocess;
Created two new functions for get_fetch_size to avoid duplicate code;
Added the marius_db2graph command to setup.cfg (but haven't tested it because I'm not sure if pip installing it right now would work);
Added 'my-sql' as an option to use mysql-connector because this wasn't added previously;
Collaborator

@JasonMoho JasonMoho left a comment

Will do another pass once the feature edges are removed and we determine whether generate_uuid is necessary.

Resolved review threads: .github/workflows/db2graph_test_postgres.yml (1, outdated), setup.cfg (1), src/python/tools/db2graph/db2graph.py (1, outdated), docs/db2graph/db2graph.rst (4, outdated), src/python/tools/db2graph/marius_db2graph.py (6, outdated)
Renamed edge_entity_entity_queries to edge_queries, edge_entity_entity_queries_list to edge_queries_list, and edge_entity_entity_rel_list to edge_rel_list
Collaborator

@JasonMoho JasonMoho left a comment

The example needs some work.

There is a lot of setup that is not relevant to the tool and my attempt to run the example as written failed. Also, the fact that the dataset requires creating a Kaggle account and does not have a simple download link is a problem.

Main todos:

  • Create a Dockerfile which performs all setup (Postgres setup, database download and setup, Marius install with pip) up to the creation of the conf/config.yaml and conf/edges_queries.txt files.
  • Find a new dataset (or a version of this dataset) which can be downloaded easily without any account requirements.

Once this is fixed, I will make another attempt.

Resolved review threads (all outdated): docs/db2graph/db2graph.rst (3)
@JasonMoho JasonMoho merged commit e1c0126 into marius-team:main Sep 14, 2022
4 participants