
mvrpl/big-shipper



Big Shipper

Spark ingestion tool for Big Data written in Scala

Requirements

SBT build tool installed

Spark 1.4 or later

Beta version

0.1 - under development

Stack

  • Scala - Programming language that combines object-oriented and functional programming.
  • SBT - Interactive build tool for Scala.
  • ScalaTest - Testing framework for Scala, built around the suite: a collection of zero or more tests.
  • Scalastyle - Examines Scala code and flags potential problems.
  • Apache Spark - Fast, general-purpose engine for large-scale data processing.
  • Travis CI - Hosted, distributed continuous integration service used to build and test software projects.

Compilation

Run the compiler.sh script:

cd big-shipper
./compiler.sh

This script generates target/scala-[version, like: 2.10]/BigShipper-assembly-0.1.jar, a JAR file with its dependencies embedded.

Usage

spark-submit --class main.Shipper target/scala-[version, like: 2.10]/BigShipper-assembly-0.1.jar -c /path_to/config.json --loglevel debug
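The jar path depends on the Scala version used for the build, so it can help to assemble the command programmatically. The following Python sketch is not part of Big Shipper; the helper name `build_submit_command` is hypothetical, and it only uses the options shown in the command above. Only the `debug` log level appears in this README, so passing any other value is an assumption.

```python
def build_submit_command(scala_version, config_path, log_level="debug"):
    """Return the spark-submit argument list for a given Scala version.

    Hypothetical helper: it reproduces the command from this README and
    fills in the version-dependent part of the jar path.
    """
    jar = f"target/scala-{scala_version}/BigShipper-assembly-0.1.jar"
    return [
        "spark-submit",
        "--class", "main.Shipper",
        jar,
        "-c", config_path,
        "--loglevel", log_level,
    ]


cmd = build_submit_command("2.10", "/path_to/config.json")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `spark-submit` is on the PATH.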

Config example:

{
	"SOURCE":{
		"TYPE": "delimitedfile",
		"FIELDS": [
			{
				"NAME": "field1",
				"TYPE": "int"
			},
			{
				"NAME": "field2",
				"TYPE": "string"
			},
			{
				"NAME": "field3",
				"TYPE": "decimal"
			}
		],
		"DELIMITER_RAW": "|",
		"DIR_RAW_FILES": "/user/NAME/data_201703{2[7-9],3[0-1]}.txt"
	},
	"TARGET":{
		"TYPE": "hive",
		"ACTION": "append",
		"HIVE_TABLE": "table_name_here"
	}
}
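A config like the one above can be checked before submitting a job. The Python sketch below is an illustration only, not Big Shipper's own validation: the function `check_config` is hypothetical, and the allowed values are copied from the SOURCE.TYPE and SOURCE.FIELDS.TYPE lists documented in this README. It does not check TARGET or any other key, since their accepted values are not listed here.

```python
import json

# Allowed values as documented in this README.
SOURCE_TYPES = {"delimitedfile", "json"}
FIELD_TYPES = {
    "bigint", "int", "smallint", "tinyint", "double", "decimal",
    "float", "byte", "string", "date", "timestamp", "boolean",
}


def check_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    source = config.get("SOURCE", {})
    if source.get("TYPE") not in SOURCE_TYPES:
        problems.append(f"SOURCE.TYPE {source.get('TYPE')!r} is not a documented value")
    for i, field in enumerate(source.get("FIELDS", [])):
        if not field.get("NAME"):
            problems.append(f"SOURCE.FIELDS[{i}] has no NAME")
        if field.get("TYPE") not in FIELD_TYPES:
            problems.append(f"SOURCE.FIELDS[{i}].TYPE {field.get('TYPE')!r} is not a documented value")
    return problems


example = json.loads("""
{
    "SOURCE": {
        "TYPE": "delimitedfile",
        "FIELDS": [
            {"NAME": "field1", "TYPE": "int"},
            {"NAME": "field2", "TYPE": "string"},
            {"NAME": "field3", "TYPE": "decimal"}
        ],
        "DELIMITER_RAW": "|",
        "DIR_RAW_FILES": "/user/NAME/data_201703{2[7-9],3[0-1]}.txt"
    },
    "TARGET": {"TYPE": "hive", "ACTION": "append", "HIVE_TABLE": "table_name_here"}
}
""")
problems = check_config(example)
print(problems)  # []

# A config with an undocumented source type and field type yields two problems.
bad = {"SOURCE": {"TYPE": "csv", "FIELDS": [{"NAME": "f", "TYPE": "varchar"}]}}
print(len(check_config(bad)))  # 2
```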

Check more examples

SOURCE.TYPE: Type of source file(s). Values: [delimitedfile, json]

SOURCE.FIELDS.TYPE: Data type of each field. Values: [bigint, int, smallint, tinyint, double, decimal, float, byte, string, date, timestamp, boolean]

SOURCE.DIR_RAW_FILES: HDFS path, optionally containing a pattern that selects several files (the config example uses glob-style syntax such as {a,b} and [7-9]), or a local path prefixed with file://.
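To see which files a pattern such as the one in the config example would select, the matching can be approximated locally. The Python sketch below assumes the pattern follows Hadoop-style glob syntax, which the example resembles; it is not how Big Shipper resolves paths. The helpers `expand_braces` and `matches` are hypothetical, nested braces are not handled, and Python's `fnmatch` differs from Hadoop globs in some details (for instance, `*` also matches `/`).

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b,...} groups into separate patterns (no nested braces)."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def matches(path, pattern):
    """True if the path matches any brace-expanded alternative of the pattern."""
    return any(fnmatch.fnmatchcase(path, p) for p in expand_braces(pattern))


pattern = "/user/NAME/data_201703{2[7-9],3[0-1]}.txt"
# Hypothetical file names for 25 to 31 March 2017.
candidates = [f"/user/NAME/data_201703{day:02d}.txt" for day in range(25, 32)]
selected = [p for p in candidates if matches(p, pattern)]
print(selected)
```

Under these assumptions the pattern selects the files for days 27 through 31 and skips days 25 and 26.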

License

MIT License
