Project metadata lineage with Kettle/PDI and Neo4j
This project tracks metadata lineage using Kettle and Neo4j. Kettle parses various files within a project directory and writes the parsed information to a Neo4j graph. Currently supported are:
- aws (RDS, DMS)
- etl (PDI/Kettle)
- gcp (BigQuery)
- git
- pentaho
- mondrian (cubes, Analyzer)
- reporting (prpt)
- rdbms (relational databases)
The checks are built in Kettle. You'll need
- Java 8
- Kettle/PDI: download the 8.2 Remix or the original 8.2 Kettle/PDI client.
- clone this repository to a location on your local system.
git clone https://github.com/knowbi/knowbi-meta-to-neo4j.git
- make sure you have a Neo4j database running with a graph connection named 'lineage-graph' (included in the metastore folder). Version 4.x is advised; lower versions should also work, but may require some tweaking.
- start Spoon (with the Neo4j plugins or from the Remix) and create a Neo4j connection.
All configuration is done through a single properties file. Copy the template file config/meta-to-neo4j.properties.template to a properties file config/meta-to-neo4j.properties.
This file contains the following options:
Enable or disable modules to run (all "Y" values will be included):
- do.aws= (default: N)
- do.aws.dms= (default: Y, only if do.aws=Y)
- do.aws.rds= (default: Y, only if do.aws=Y)
- do.etl= (default: Y)
- do.gcp= (default: N)
- do.git= (default: Y)
- do.pentaho= (default: N)
- do.pentaho.report= (default: Y, only if do.pentaho=Y)
- do.pentaho.mondrian.schema= (default: Y, only if do.pentaho=Y)
- do.rdbms= (default: Y)
- aws.dms.json.dir= (default: none): the path to read the AWS config from (in JSON; support for YAML and REST may come later)
- git.tmp.dir= (default: /tmp/): path to store temporary git information.
- etl.kettle.properties.dir= (default: none): the path to a kettle.properties file to take into account while parsing
- etl.dir= (default: none): the path where the Kettle/PDI jobs and transformations to be parsed are stored
- do.clean.neo4j.data= (default: Y): delete ALL data in the Neo4j database before processing starts (ALL really means EVERYTHING)
- neo4j.host= (default: localhost): the Neo4j database host
- neo4j.bolt.port= (default: 7687): the Neo4j bolt port
- neo4j.browser.port= (default: 7474): the Neo4j browser port
- neo4j.user= (default: neo4j): the Neo4j database username
- neo4j.pass= (default: none): the Neo4j database password
- pentaho.mondrian.analyzer.dir= (default: none): path to the Pentaho Analyzer reports
- pentaho.mondrian.properties.dir= (default: none): path to the mondrian.properties configuration file
- pentaho.mondrian.schema.dir= (default: none): path to the Mondrian schema files
- pentaho.report.dir= (default: none): path to the Pentaho Report Designer (prpt) files
- do.rdbms.columns= (default: Y): include columns while loading database information. If not set to 'Y', only database, schema and table information is included.
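The interaction between the module flags and their defaults can be sketched in Python. This is only an illustration of the rules listed above, not the project's own logic (which lives in Kettle). The function names are hypothetical, and two behaviours are assumptions: an empty value falls back to the documented default, and only an exact "Y" enables a module.

```python
# Defaults as documented in the options list above.
DEFAULTS = {
    "do.aws": "N",
    "do.aws.dms": "Y",
    "do.aws.rds": "Y",
    "do.etl": "Y",
    "do.gcp": "N",
    "do.git": "Y",
    "do.pentaho": "N",
    "do.pentaho.report": "Y",
    "do.pentaho.mondrian.schema": "Y",
    "do.rdbms": "Y",
}

# Sub-modules that only run when their parent module is set to Y.
PARENTS = {
    "do.aws.dms": "do.aws",
    "do.aws.rds": "do.aws",
    "do.pentaho.report": "do.pentaho",
    "do.pentaho.mondrian.schema": "do.pentaho",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def enabled_modules(props):
    """Return the sorted list of module flags that resolve to enabled."""
    def is_on(key):
        # Assumption: a missing or empty value means "use the default".
        value = props.get(key) or DEFAULTS[key]
        if value != "Y":
            return False
        parent = PARENTS.get(key)
        return parent is None or is_on(parent)

    return sorted(key for key in DEFAULTS if is_on(key))


sample = "do.aws=Y\ndo.aws.rds=N\ndo.git=\n# a comment\n"
print(enabled_modules(parse_properties(sample)))
```

With the sample above, do.aws.dms is picked up through its default because do.aws=Y, while do.pentaho.report stays off because its parent keeps the default N.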
For AWS, only RDS and DMS information is currently supported.
TODO: add detailed graph model description + screenshot
All Kettle jobs and transformations are parsed and written to the graph model. This includes (but is not limited to)
- jobs
- job entries
- steps
- step type
- parameter
- hops (jobs and transformations)
TODO: add detailed graph model description + screenshot
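To give an idea of the information involved, the Python sketch below pulls step names, step types and hops out of a transformation file. The project itself does this parsing inside Kettle; the sample XML is a heavily trimmed, hypothetical .ktr file and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down transformation (.ktr) content.
SAMPLE_KTR = """<transformation>
  <info><name>load_customers</name></info>
  <order>
    <hop><from>Read customers</from><to>Write customers</to><enabled>Y</enabled></hop>
  </order>
  <step><name>Read customers</name><type>TableInput</type></step>
  <step><name>Write customers</name><type>TableOutput</type></step>
</transformation>"""


def parse_transformation(xml_text):
    """Extract the transformation name, its steps and its hops."""
    root = ET.fromstring(xml_text)
    steps = [
        {"name": step.findtext("name"), "type": step.findtext("type")}
        for step in root.findall("step")
    ]
    hops = [
        {
            "from": hop.findtext("from"),
            "to": hop.findtext("to"),
            "enabled": hop.findtext("enabled") == "Y",
        }
        for hop in root.findall("order/hop")
    ]
    return {"name": root.findtext("info/name"), "steps": steps, "hops": hops}


result = parse_transformation(SAMPLE_KTR)
for hop in result["hops"]:
    print(hop["from"], "->", hop["to"])
```

Each step would map naturally to a node and each hop to a relationship between two step nodes, but the actual labels used in the graph are not described here.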
Git is parsed by walking through every commit in the history of the git repository. For each commit, a git diff is performed and the resulting output is written to the Neo4j graph.
This includes (but is not limited to):
- user
- commit
- branches
- tags
For each commit, the files that were touched are stored in a COMMITOPERATION relationship.
Additionally, each commit will have a CONTAINS relationship to all jobs and transformations that were touched.
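The commit walk described above can be approximated in Python with the git command line. This is a sketch under assumptions, not the project's implementation (which runs in Kettle): the function names and dictionary keys are hypothetical, and it relies on `git log` and `git diff-tree --name-status`, which report one status letter and path per touched file. Merge commits produce no output with these options.

```python
import subprocess


def list_commits(repo_dir):
    """Return all commit hashes reachable from HEAD, oldest first."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", "--reverse", "--format=%H"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def parse_name_status(output):
    """Turn `--name-status` output into a list of file operations."""
    operations = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # Renames look like "R100<TAB>old<TAB>new": keep the letter and new path.
        operations.append({"operation": fields[0][0], "file": fields[-1]})
    return operations


def changed_files(repo_dir, commit):
    """List the files touched by one commit (the root commit included)."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "diff-tree", "--root", "--no-commit-id",
         "--name-status", "-r", commit],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_name_status(out)


def kettle_files(operations):
    """Keep only touched jobs (.kjb) and transformations (.ktr)."""
    return [op["file"] for op in operations
            if op["file"].endswith((".kjb", ".ktr"))]
```

Calling `changed_files(repo, sha)` for every hash from `list_commits(repo)` yields the data behind the COMMITOPERATION relationships, and `kettle_files` narrows it to the jobs and transformations a commit would link to through CONTAINS.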
TODO: add detailed graph model description + screenshot