kongc-organization/greenplum-streamsets

Pivotal Greenplum

The Pivotal Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte-scale data volumes. Uniquely geared toward big data analytics, Greenplum Database is powered by the world's most advanced cost-based query optimizer, delivering high analytical query performance on large data volumes. https://pivotal.io/pivotal-greenplum

StreamSets

StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data continually arrives on-time and with quality, empowering business-critical analysis and decision-making. https://streamsets.com/

Use Cases:

  1. Loading data from StreamSets data generator into Greenplum
  2. Streaming data from Kafka into Greenplum
  3. Loading data from Hadoop into Greenplum

Loading data from the StreamSets data generator into Greenplum

This example uses the StreamSets Dev Data Generator to generate random data and a JDBC Producer that concurrently writes the data into Greenplum.

The purpose of this use case is to demonstrate how to use the StreamSets ETL solution to load large data sets into a Greenplum database. For more details, see this README.MD

The example below shows the number of records processed and the number of records inserted per second while the Dev Data Generator produces data that is inserted into Greenplum via the JDBC Producer. (Screenshot: pipeline monitoring metrics.)
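The generate-and-insert flow above can be sketched in plain Python. The table name, column names, and record shape below are illustrative assumptions, not taken from the actual pipeline; a real JDBC Producer would use parameterized batch statements rather than string-built SQL.

```python
import random
import string

def generate_records(n):
    """Mimic the Dev Data Generator: produce n random records."""
    return [
        {
            "id": i,
            "name": "".join(random.choices(string.ascii_lowercase, k=8)),
            "amount": round(random.uniform(1, 1000), 2),
        }
        for i in range(n)
    ]

def build_insert(table, records):
    """Build one multi-row INSERT, as a JDBC batch insert would issue.

    String interpolation is for illustration only; production code must
    use parameterized queries to avoid SQL injection.
    """
    cols = ", ".join(records[0].keys())
    rows = ", ".join(
        "({id}, '{name}', {amount})".format(**r) for r in records
    )
    return f"INSERT INTO {table} ({cols}) VALUES {rows};"

batch = generate_records(3)
sql = build_insert("public.demo_events", batch)  # hypothetical table name
print(sql)
```

Batching many rows into a single INSERT is what makes JDBC loading into Greenplum reasonably fast; inserting one row per statement would bottleneck on round trips.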

Loading data from Kafka into Greenplum

This example uses the StreamSets Dev Data Generator to generate random data and write it to Kafka, then loads the data from Kafka into Greenplum.

The purpose of this use case is to demonstrate how to use the StreamSets ETL solution to load large data sets from Kafka into a Greenplum database. For more details, see this README.MD

The example below shows the number of records processed and the number of records inserted per second while a Kafka Consumer reads data and inserts it into GPDB via JDBC. (Screenshot: pipeline monitoring metrics.)
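The consume-and-load loop can be sketched as micro-batching: read messages off the topic, accumulate a batch, and flush each batch to Greenplum as one insert. In this sketch a plain list of JSON strings stands in for the Kafka topic; a real pipeline would poll a consumer (e.g. from a Kafka client library) and commit offsets after each flush.

```python
import json

def consume_in_batches(messages, batch_size):
    """Group messages (stand-in for a Kafka topic) into insert-ready batches."""
    batch = []
    for msg in messages:
        batch.append(json.loads(msg))
        if len(batch) == batch_size:
            yield batch          # flush point: batch insert into Greenplum
            batch = []
    if batch:
        yield batch              # final partial batch

# Simulated topic with 7 JSON messages
topic = [json.dumps({"id": i, "val": i * 10}) for i in range(7)]
batches = list(consume_in_batches(topic, batch_size=3))
print([len(b) for b in batches])  # → [3, 3, 1]
```

The batch size trades latency against insert throughput: larger batches mean fewer round trips to Greenplum but delay each record's arrival.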

Loading data from Hadoop into Greenplum

To be added later

Alternative solution: You can use a Spark-based ETL solution to load data from multiple sources, including Kafka, S3, and others. Using the Greenplum-Spark Connector, you can parallelize data transfer from a Spark cluster to a Greenplum cluster.
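As a sketch of that alternative, the snippet below assembles the JDBC-style URL and option map a Spark job would pass to the connector. The option names, the `format("greenplum")` call, and all host, database, and table names are assumptions to verify against the Greenplum-Spark Connector documentation.

```python
def greenplum_options(host, port, dbname, table, user):
    """Assemble connector options; key names follow common JDBC conventions
    and are assumptions, not confirmed connector option names."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{dbname}",
        "dbtable": table,
        "user": user,
    }

# Illustrative values only
opts = greenplum_options("gpmaster.local", 5432, "analytics",
                         "public.events", "gpadmin")
print(opts["url"])  # → jdbc:postgresql://gpmaster.local:5432/analytics

# Hypothetical usage inside a Spark job (not runnable here):
# df.write.format("greenplum").options(**opts).mode("append").save()
```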

Reference:

  1. Enhancement to StreamSets to use Greenplum native loaders
  2. Greenplum - ETL
  3. Greenplum on GitHub