Spark ingestion tool for Big Data, written in Scala.
Build tool: SBT
Requires Spark 1.4+
Version 0.1 - under development
- Scala - an object-oriented meets functional programming language.
- SBT - the interactive build tool for Scala.
- ScalaTest - a testing framework whose central concept is the suite, a collection of zero to many tests.
- Scalastyle - examines Scala code and flags potential problems with it.
- Apache Spark - a fast and general engine for large-scale data processing.
- Travis CI - a hosted, distributed continuous integration service used to build and test software projects.
Run the script compiler.sh:
cd big-shipper
./compiler.sh
This script generates target/scala-[version, e.g. 2.10]/BigShipper-assembly-0.1.jar with all dependencies embedded.
spark-submit --class main.Shipper target/scala-[version, like: 2.10]/BigShipper-assembly-0.1.jar -c /path_to/config.json --loglevel debug
{
  "SOURCE": {
    "TYPE": "delimitedfile",
    "FIELDS": [
      { "NAME": "field1", "TYPE": "int" },
      { "NAME": "field2", "TYPE": "string" },
      { "NAME": "field3", "TYPE": "decimal" }
    ],
    "DELIMITER_RAW": "|",
    "DIR_RAW_FILES": "/user/NAME/data_201703{2[7-9],3[0-1]}.txt"
  },
  "TARGET": {
    "TYPE": "hive",
    "ACTION": "append",
    "HIVE_TABLE": "table_name_here"
  }
}
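For illustration, the SOURCE.FIELDS entries of such a config could be mapped to a Spark SQL DDL-style schema string along these lines (the Field case class and toDdlSchema helper are hypothetical sketches, not part of Big Shipper):

```scala
// Hypothetical helper (not part of Big Shipper): turn the NAME/TYPE pairs
// from SOURCE.FIELDS into a Spark SQL DDL-style schema string.
case class Field(name: String, fieldType: String)

def toDdlSchema(fields: Seq[Field]): String =
  fields.map(f => s"${f.name} ${f.fieldType.toUpperCase}").mkString(", ")

// The three fields from the example config above:
val fields = Seq(
  Field("field1", "int"),
  Field("field2", "string"),
  Field("field3", "decimal")
)

println(toDdlSchema(fields)) // field1 INT, field2 STRING, field3 DECIMAL
```

A string in this shape can then be handed to Spark's schema APIs when registering the target table.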
SOURCE.TYPE: Type of the source file(s). Values: [delimitedfile, json]
SOURCE.FIELDS.TYPE: Data type for each field. Values: [bigint, int, smallint, tinyint, double, decimal, float, byte, string, date, timestamp, boolean]
SOURCE.DIR_RAW_FILES: HDFS path with a glob pattern to select files, or a local path starting with file://.
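The brace pattern in the example config selects files for 2017-03-27 through 2017-03-31. One way to sanity-check such a pattern is to express it as an equivalent regular expression in plain Scala (illustrative only; this is not how Hadoop itself expands globs):

```scala
// Regex equivalent of the glob data_201703{2[7-9],3[0-1]}.txt:
// matches day 27-29 or 30-31 of March 2017.
val dayPattern = """data_201703(2[7-9]|3[01])\.txt""".r

// Candidate filenames for days 25 through 31:
val candidates = (25 to 31).map(d => s"data_201703$d.txt")
val matched = candidates.filter(f => dayPattern.pattern.matcher(f).matches)

println(matched.mkString(", ")) // prints the five filenames for days 27-31
```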
MIT License