Snitch is a command-line tool whose goal is to copy data from various data sources into a central data warehouse.
It provides a pipeline that extracts data, temporarily stages it on file storage, and eventually loads it into a data warehouse.
- Extract data from multiple databases.
- Extract data from multiple tables.
- Insert data into multiple data warehouses.
- Universal query configuration across relational and NoSQL databases.
- Configurable scheduler with start/end dates and an interval.
- ACID-compliant pipeline.
- Incremental transfer of data, with indices stored in Redis.
Snitch currently supports the following data stores:
- MySQL
- MongoDB
- Postgres
- MsSQL
- RethinkDB
- MariaDB
- Apache Kafka
Snitch supports templates (predetermined tables for popular frameworks):
- Magento
NB: We need help supporting more data sources. Please help!
Snitch can load data into the following data warehouses:
- MySQL / Amazon Aurora
- Postgres / Amazon's Redshift
- Cassandra
- Google BigQuery
NB: We need help supporting more data warehouses. Please help!
Clone this repository:
```
git clone https://github.com/AtlasDev/data-pipeline-service
```
Navigate to the directory and initialize the project:
```
cd data-pipeline-service && npm run setup
```
Run the pipeline:
```
npm start
```
Snitch uses a simple JSON config file.
The default location of the configuration file is ./pipeline.json. The directory the configuration is read from can be changed by setting the PIPELINE_CONF_PATH environment variable.
The JSON file contains an object with the following keys:
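For orientation, a minimal skeleton of pipeline.json might look like the sketch below. Apart from "events" (shown later in this section), the top-level key names are not spelled out here, so the names used in this sketch are assumptions for illustration only; the subsections that follow describe the contents of each block.
```
{
  "schedule": { },      // when and how often to run (assumed key name)
  "events": { },        // success/error hooks
  "redis": { },         // incremental-counter storage (assumed key name)
  "fileStorage": { },   // local or S3 staging (assumed key name)
  "sources": [ ],       // data sources to extract from (assumed key name)
  "warehouse": { }      // destination warehouse (assumed key name)
}
```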
This tells the scheduler when to run the pipeline.
Parameter | Type | Description |
---|---|---|
start | date | Time when the pipeline can start running |
end | date | Time when the pipeline will stop running |
interval | number | Interval at which to run the pipeline |
NB: There is a lock on the pipeline, so one run has to complete or fail before the next scheduled run will start.
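As a sketch, a scheduling block built from the parameters above might look like the following. The date format and the unit of interval are not stated here, so the values shown are placeholders/assumptions:
```
{
  "start": "2018-01-01T00:00:00Z",
  "end": "2018-12-31T00:00:00Z",
  "interval": 3600000
}
```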
You can set up hooks to notify you when the following events take place:
- A pipeline run completes successfully
- An error occurs
Snitch currently supports two event triggers:
- success
- error
You can configure an action to be triggered when an event happens, e.g.:
```
"events" : { "success" : { "type" : "slack", "options" : { "webhook" : process.env.SLACK_WEBHOOK, "channel" : "Pipeline Reports" } }, "error" : ... }
```
The following reporting types are supported:
- slack

Parameter | Type | Description |
---|---|---|
channel | string | Slack channel to post to |
webhook | string | Slack webhook URL |

- sendgrid

Parameter | Type | Description |
---|---|---|
apiKey | string | SendGrid API key |
to | string | Email address to send messages to |
from | string | From email address |

- segment

Parameter | Type | Description |
---|---|---|
write_key | string | Segment write key |

- zapier

Parameter | Type | Description |
---|---|---|
webhook | string | Zapier webhook |

- webhook (a custom endpoint)

Parameter | Type | Description |
---|---|---|
webhook | string | Custom endpoint URL to post to |
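For example, an error hook using the sendgrid reporting type might be configured as below. This is a sketch that reuses the type/options shape from the success example above; the API key and addresses are placeholders:
```
"events": {
  "error": {
    "type": "sendgrid",
    "options": {
      "apiKey": "<SENDGRID_API_KEY>",
      "to": "alerts@example.com",
      "from": "pipeline@example.com"
    }
  }
}
```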
The pipeline uses Redis to keep track of an incremental counter which determines the next set of data to copy.
For example, if you want to incrementally copy 100 records from the "users" table based on the "last_modified_date" field, the pipeline takes the most recent "last_modified_date" value from the returned records and stores it in Redis. The next time the pipeline runs, it reads that "last_modified_date" value from Redis and uses it to fetch the next 100 records.
Parameter | Type | Description |
---|---|---|
url | string | Redis connection URL |
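A minimal sketch of the Redis block, assuming a local Redis instance (the URL is a placeholder):
```
{
  "url": "redis://localhost:6379"
}
```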
When data is extracted from the various data sources, it is staged either on local file storage or on Amazon S3.
Parameter | Type | Description |
---|---|---|
fileMode | remote/local | Indicates whether extracted data is stored remotely (on S3) or locally |
accessKeyId | string | AWS access key ID. Leave empty if "fileMode" is "local" |
secretAccessKey | string | AWS secret access key. Leave empty if "fileMode" is "local" |
defaultPath | string | AWS default path. Leave empty if "fileMode" is "local" |
Bucket | string | AWS S3 bucket. Leave empty if "fileMode" is "local" |
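A sketch of the file-storage block for S3 staging, using the field names from the table above (all values are placeholders):
```
{
  "fileMode": "remote",
  "accessKeyId": "<AWS_ACCESS_KEY_ID>",
  "secretAccessKey": "<AWS_SECRET_ACCESS_KEY>",
  "defaultPath": "snitch/exports",
  "Bucket": "my-pipeline-bucket"
}
```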
Specify the array of data sources to extract data from. Each object contains the following keys:
Parameter | Type | Description |
---|---|---|
type | postgres/mysql/mongodb/rethinkdb/kafka | The type of data source |
credentials | array | Credentials for the data source |
schema | array | Fields to copy |
The credentials array contains objects with the following parameters:
Parameter | Type | Description |
---|---|---|
url | string | Database url |
The schema array contains objects with the following parameters:
Parameter | Type | Description |
---|---|---|
fields | array | List of fields to copy from. Leave an empty array to copy all fields |
table | string | Table to copy from |
limit | number | Number of records to transfer at once |
destination_table_name | string | A name to be used as the destination table name |
primary_key | string | Table's primary key |
incremental_key | string | A field in the table used as an incremental key. This is usually a created_at or updated_at field, depending on your table schema. Leave empty to copy all the records at once. |
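Putting the pieces together, a single entry in the sources array might look like the sketch below; the connection URL, table, and field names are placeholders, and the nesting simply follows the tables above:
```
{
  "type": "postgres",
  "credentials": [
    { "url": "postgres://user:password@localhost:5432/app" }
  ],
  "schema": [
    {
      "fields": [],
      "table": "users",
      "limit": 100,
      "destination_table_name": "warehouse_users",
      "primary_key": "id",
      "incremental_key": "last_modified_date"
    }
  ]
}
```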
- Kafka
Parameter | Type | Description |
---|---|---|
records_limit | string | Number of records to fetch before dumping to the Warehouse |
topic | string | Kafka topic to listen on |
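A Kafka source might then be sketched as below. The broker URL and topic are placeholders, and exactly where records_limit and topic sit relative to the schema block is not spelled out above, so this nesting is an assumption:
```
{
  "type": "kafka",
  "credentials": [
    { "url": "kafka://localhost:9092" }
  ],
  "records_limit": "100",
  "topic": "user-events"
}
```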
Information about the destination warehouse.
- Google BigQuery: the following keys are needed for Google's BigQuery
Parameter | Type | Description |
---|---|---|
type | string | The warehouse type |
projectId | string | Your Google Cloud project ID |
datasetID | string | Your Google BigQuery dataset (database) ID |
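A sketch of a BigQuery warehouse block using the keys above; the "bigquery" type value, project ID, and dataset ID are placeholders/assumptions:
```
{
  "type": "bigquery",
  "projectId": "my-gcp-project",
  "datasetID": "analytics"
}
```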
We use Service Accounts for authentication, so you need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable. It should point to your credentials file (.json or .p12).
It might be good to override the TIMEOUT environment variable (default: 10 seconds) with a value that suits you.
Under the hood, Snitch uses the amazing library:
vm2 - Advanced vm/sandbox for Node.js