PNDA Fork

This fork adds the following:

A new converter to convert messages read from Kafka to the PNDA Avro schema (gobblin.pnda.PNDAConverter)
A new writer writing data with the kitesdk library (gobblin.pnda.PNDAKiteWriterBuilder)

The converter can only work with a Kafka-compatible source. The workflow is:

KafkaSimpleSource ► PNDAConverter ► [SchemaRowCheckPolicy] ► PNDAKiteWriter

Build

Gobblin needs to be compiled against CDH hadoop dependencies with the right version:

$ ./gradlew clean build -PhadoopVersion="2.6.0-cdh5.9.0" -PexcludeHadoopDeps -PexcludeHiveDeps -x gobblin-core:test

Configuration

Kite datasets

Two kite datasets need to be created.

The first one is the main dataset where all the valid PNDA messages are stored. The schema is:

{"namespace": "pnda.entity",
 "type": "record",
 "name": "event",
 "fields": [
     {"name": "timestamp", "type": "long"},
     {"name": "src",       "type": "string"},
     {"name": "host_ip",   "type": "string"},
     {"name": "rawdata",   "type": "bytes"}
 ]
}

The second one is where all data in the wrong format are stored. The schema is:

{"namespace": "pnda.entity",
 "type"     : "record",
 "name"     : "DeserializationError",
 "fields"   : [
   {"name": "topic",     "type": "string"},
   {"name": "timestamp", "type": "long"},
   {"name": "reason",    "type": ["string", "null"]},
   {"name": "payload",   "type": "bytes"}
 ]
}

See the Kite CLI reference for information on creating a Kite dataset.

For example, to create the first dataset with the following partition strategy (partition.json file):

[
    {"type": "identity", "source": "src", "name": "source"},
    {"type": "year",     "source": "timestamp"},
    {"type": "month",    "source": "timestamp"},
    {"type": "day",      "source": "timestamp"},
    {"type": "hour",     "source": "timestamp"}
]

Then execute the following command:

$ kite-dataset create --schema pnda.avsc dataset:hdfs://192.168.0.227:8020/tmp/PNDA_datasets/master --partition-by partition.json

Property file

The source.schema property must be set to deserialize AVRO data coming from kafka:

source.schema={"namespace": "pnda.entity", \
               "type": "record",                            \
               "name": "event",                             \
               "fields": [                                  \
                   {"name": "timestamp", "type": "long"},   \
                   {"name": "src",       "type": "string"}, \
                   {"name": "host_ip",   "type": "string"}, \
                   {"name": "rawdata",   "type": "bytes"}   \
               ]                                            \
              }

It can also be a the filename of an AVRO schema.

The converter.classes property must include the gobblin.pnda.PNDAConverter class. PNDA.quarantine.dataset.uri is optional. If not set, wrong messages will be discarded.

converter.classes=gobblin.pnda.PNDAConverter
PNDA.quarantine.dataset.uri=dataset:hdfs://192.168.0.227:8020/tmp/PNDA_datasets/quarantine

The writer.builder.class property must be set to gobblin.pnda.PNDAKiteWriterBuilder to enable the Kite writer. kite.writer.dataset.uri is mandatory.

writer.builder.class=gobblin.pnda.PNDAKiteWriterBuilder
kite.writer.dataset.uri=dataset:hdfs://192.168.0.227:8020/tmp/PNDA_datasets/master

PNDA Dataset creation

In the context of the PNDA platform, Gobblin is normally run on a regular basis every 30 minutes. Datasets are created dynamically on HDFS by discovering available topics in Kafka.

It's important to note that datasets are not created by topics, but dependending on the src field in PNDA Avro messages.

Datasets are stored on HDFS in the /user/pnda/PNDA_datasets/datasets directory in the Kite format, and are partitioned by source and timestamp.

The current partition strategy will write the data read from Kafka to the following hierarchy:

source=SSS/year=YYYY/month=MM/day=DD/hour=HH

In the path above, SSS represents the value of the src field in the PNDA Avro schema.

For example, the following message read from Kafka:

{
  "timestamp": 1462886335,
  "src": "netflow",
  "host_ip": "192.168.0.15",
  "rawdata": "XXXXXxxxxXXXX"
}

will be written to:

/user/pnda/PNDA_datasets/datasets/source=netflow/year=2016/month=05/day=10/hour=13/random-uuid.avro

Please refer to the Gobblin for more information about Gobblin.

Gobblin

Quick Links

Documentation: Check out the Gobblin documentation for a complete description of Gobblin's features
Powered By: Check out the list of companies known to use Gobblin
Architecture: The Gobblin Architecture page has a full explanation of Gobblin's architecture
Getting Started with Gobblin: Refer to the Getting Started Guide on how to get started with Gobblin
Building Gobblin: Refer to the page Building Gobblin for directions on how to build Gobblin
Javadocs: The full JavaDocs for each released version of Gobblin can be found here

Name		Name	Last commit message	Last commit date
Latest commit History 3,640 Commits
bin		bin
conf		conf
gobblin-PNDA		gobblin-PNDA
gobblin-admin		gobblin-admin
gobblin-api		gobblin-api
gobblin-audit		gobblin-audit
gobblin-aws		gobblin-aws
gobblin-azkaban		gobblin-azkaban
gobblin-cluster		gobblin-cluster
gobblin-compaction		gobblin-compaction
gobblin-config-management		gobblin-config-management
gobblin-core		gobblin-core
gobblin-data-management		gobblin-data-management
gobblin-distribution		gobblin-distribution
gobblin-docker		gobblin-docker
gobblin-docs		gobblin-docs
gobblin-example		gobblin-example
gobblin-hive-registration		gobblin-hive-registration
gobblin-kafka		gobblin-kafka
gobblin-metastore		gobblin-metastore
gobblin-metrics		gobblin-metrics
gobblin-oozie/src/test/resources/local		gobblin-oozie/src/test/resources/local
gobblin-rest-service		gobblin-rest-service
gobblin-runtime-hadoop		gobblin-runtime-hadoop
gobblin-runtime		gobblin-runtime
gobblin-salesforce		gobblin-salesforce
gobblin-test-harness		gobblin-test-harness
gobblin-test/resource		gobblin-test/resource
gobblin-tunnel		gobblin-tunnel
gobblin-utility		gobblin-utility
gobblin-yarn		gobblin-yarn
gradle		gradle
ligradle/findbugs		ligradle/findbugs
maven-sonatype		maven-sonatype
travis		travis
.gitignore		.gitignore
.travis.yml		.travis.yml
CHANGELOG-gobblin.md		CHANGELOG-gobblin.md
CHANGELOG.md		CHANGELOG.md
Jenkinsfile		Jenkinsfile
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
build.gradle		build.gradle
build.sh		build.sh
defaultEnvironment.gradle		defaultEnvironment.gradle
gradle.properties		gradle.properties
gradlew		gradlew
gradlew.bat		gradlew.bat
mkdocs.yml		mkdocs.yml
readthedocs.yml		readthedocs.yml
settings.gradle		settings.gradle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PNDA Fork

Build

Configuration

Kite datasets

Property file

PNDA Dataset creation

Gobblin

Quick Links

About

Releases 1

Packages

Languages

License

pndaproject/gobblin

Folders and files

Latest commit

History

Repository files navigation

PNDA Fork

Build

Configuration

Kite datasets

Property file

PNDA Dataset creation

Gobblin

Quick Links

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages