PNDA is a simple, scalable, open big data platform supporting operational and business intelligence analysis for networks and services. This guide provides an overview of PNDA and explains how to set up and use it in your own environment.
This chapter covers the main components of PNDA, including:
- Data ingress using Logstash, OpenDaylight & the bulk ingest tool
- Data distribution with Kafka & ZooKeeper
- High velocity stream processing with Spark Streaming
- High volume batch processing with Spark
- Free form data exploration with Jupyter
- Structured query over big data with Impala
- Handling time series with OpenTSDB & Grafana
This checklist will get you started setting up a fully operational PNDA cluster, with data flowing in and out.
This chapter describes how to provision a PNDA cluster, and includes some background information on SaltStack, OpenStack Heat and AWS CloudFormation.
For a complete list of all technologies brought together by PNDA
The PNDA console provides a real-time overview of all the components in a PNDA cluster. The home page shows health statistics for each component, color-coded by status. Components are grouped into categories such as data distribution, data processing, data storage, and applications.
Other pages on the console let you view detailed metrics, deploy packages, run applications, and set data retention policies.
Prior to the PNDA 4.0 release, a mixture of fixed IP addresses and individual hostnames was used to wire everything together, with SaltStack inserting the right values into specific config files when PNDA was created. We also relied on /etc/hosts for all hostname resolution, which was neither flexible nor easy to maintain when adding or removing hosts.
In addition, it was not possible to discover services and endpoints externally without knowing in advance the PNDA deployment scheme and which hosts to address.
To solve these problems, we decided to use Consul for endpoint management and service discovery. See the following parts for more details on the current implementation using Consul:
Kafka is the "front door" of PNDA. It handles ingest of data streams from network sources and distributes data to all interested consumers. This chapter covers how to set up PNDA topics and how to integrate and develop "producers", which feed data into the Kafka topics.
- Preparing topics
- Preparing data
- Integrating Logstash
- Integrating OpenDaylight
- Integrating OpenBMP
- Integrating Pmacct
- Developing a custom producer
In addition to streaming ingest via Kafka producers, PNDA also provides an offline bulk ingest tool for those who would like to migrate pre-existing data into the PNDA platform.
Kafka has a simple, clean design that moves complexity traditionally found inside message brokers into its producers and consumers. A Kafka consumer pulls messages from one or more topics, using ZooKeeper for discovery and issuing fetch requests to the brokers leading the partitions it wants to consume. Rather than the broker maintaining state and controlling the flow of data, each consumer controls the rate at which it consumes messages.
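
The pull model described above can be illustrated with a small in-memory sketch. This is not the Kafka client API: the `Partition` and `PullConsumer` classes are hypothetical stand-ins, used only to show that each consumer keeps its own offset and chooses how much to fetch, while the broker side keeps no per-consumer state.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy stand-in for a partition log held by a broker (hypothetical, not a Kafka API)."""
    messages: list = field(default_factory=list)

    def append(self, message):
        self.messages.append(message)

    def fetch(self, offset, max_messages):
        # The broker only answers fetch requests; it tracks no per-consumer state.
        return self.messages[offset:offset + max_messages]


@dataclass
class PullConsumer:
    """Toy consumer that owns its offset and decides how much to fetch."""
    partition: Partition
    offset: int = 0

    def poll(self, max_messages):
        batch = self.partition.fetch(self.offset, max_messages)
        self.offset += len(batch)  # the consumer, not the broker, advances the position
        return batch


partition = Partition()
for i in range(5):
    partition.append(f"event-{i}")

# Two independent consumers read the same log at different rates.
fast = PullConsumer(partition)
slow = PullConsumer(partition)
fast_batch = fast.poll(max_messages=4)
slow_batch = slow.poll(max_messages=1)
```

Because the offset lives in the consumer, a slow consumer never holds back a fast one; each simply resumes from its own position on the next poll.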
Packages are independently deployable units of application layer functionality, and applications are instances of packages. You can use the PNDA console to deploy packages and manage the application lifecycle. The Deployment Manager documentation explains the structure of packages, and the REST API used to deploy them.
- Deployment Manager
- Example Applications
- Spark Streaming and HBase tutorial
- Spark Streaming and OpenTSDB tutorial
Logs from the various component services that make up PNDA, and the applications that run on PNDA, are collected and stored on the logserver node.
Apache Impala is a parallel execution engine for SQL queries. It supports low-latency access and interactive exploration of data in HDFS and HBase. Impala allows data to be stored in raw form, with aggregation performed at query time rather than upfront.
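
As a rough illustration of aggregating at query time, the sketch below stores raw rows and computes totals only when queried. It uses Python's built-in `sqlite3` module purely as a stand-in so that the example runs anywhere; it does not connect to Impala, and the `events` table and its columns are hypothetical.

```python
import sqlite3

# In-memory database used only as a stand-in for a SQL engine over raw data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (host TEXT, bytes INTEGER)")

# Raw, unaggregated records are stored exactly as they arrive.
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("router-a", 100), ("router-b", 250), ("router-a", 50)],
)

# The aggregation happens in the query, not at ingest time.
rows = conn.execute(
    "SELECT host, SUM(bytes) AS total_bytes FROM events GROUP BY host ORDER BY host"
).fetchall()
conn.close()
```

Keeping the raw rows means a different grouping or time window can be chosen later without re-ingesting the data.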
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. In PNDA, it supports exploration and presentation of data from HDFS and HBase.
- Using Jupyter
- Exploratory data analytics tutorial
- Managing application dependencies
- Using dependencies in your streaming and batch applications
OpenTSDB is a scalable time series database that lets you store and serve massive amounts of time series data, without losing granularity. Grafana is a graph and dashboard builder for visualizing time series metrics.
A big data infrastructure like PNDA involves a multitude of technologies and tools, and may be deployed in a multi-tenant environment. Providing enterprise-grade security for such a system is complex, and it is a primary concern for any production deployment. If you are implementing a client for a PNDA interface or developing a PNDA application, this chapter covers security guidelines that you should adhere to when working with individual components.
Hadoop distributions come with a resource management system, YARN. Its ResourceManager has two main components: the Scheduler and the ApplicationsManager. Traditionally, organizations have kept separate sets of compute resources for development workloads and production workloads. This leads not only to poor average resource utilization and the overhead of managing multiple independent clusters but, more importantly, to duplication of data, which represents a considerable cost in a big data platform. Consequently, sharing a data lake between these two activities yields considerable savings in infrastructure resources. However, sharing compute resources between production and development activities must respect the critical SLAs that production workloads have to abide by.
In a default PNDA deployment, the YARN schedulers are configured to prioritize system functionality first (so that no data is lost), then production workloads, and finally development applications, if any resources are still available. Unfortunately, the YARN schedulers, and especially their queue placement tools, are designed more around sharing resources across organizations than around priority-based queueing. For this reason, PNDA complements the queue placement policies with its own tool and configuration options.
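
The priority ordering described above can be sketched as a simple strict-priority allocation. This is only an illustration of the ordering, not how the YARN schedulers or PNDA's own queue placement tool are implemented; the queue names, the `allocate` function, and the resource units are all hypothetical.

```python
# Hypothetical queue names, ordered to mirror the priorities described above.
PRIORITY_ORDER = ["system", "production", "development"]


def allocate(capacity, demands):
    """Grant resources to queues in strict priority order.

    capacity: total resource units available (for example, vcores).
    demands:  mapping of queue name to requested units.
    Returns a mapping of queue name to granted units.
    """
    grants = {}
    remaining = capacity
    for queue in PRIORITY_ORDER:
        # A queue receives what it asks for, capped by what higher priorities left over.
        granted = min(demands.get(queue, 0), remaining)
        grants[queue] = granted
        remaining -= granted
    return grants


demands = {"system": 3, "production": 6, "development": 4}
grants_roomy = allocate(capacity=10, demands=demands)
grants_tight = allocate(capacity=8, demands=demands)
```

With 10 units, development receives only the single unit left after system and production are served; with 8 units, it receives nothing and production itself is capped.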
The PNDA distribution consists of the following source code repositories and sub-projects:
- platform-salt: provisioning logic for creating PNDA
- pnda-cli: orchestration application for creating PNDA on AWS, OpenStack or an existing pre-prepared cluster
- pnda-dib-elements: tools for building disk image templates
- pnda: PNDA release notes and build system
- platform-libraries: libraries for working with interactive notebooks
- platform-tools: tools for operating a cluster
- bulkingest: tools for performing a bulk ingest of data
- platform-console-frontend: “single pane of glass” giving operational overview and access to application and data management functions
- platform-console-backend: APIs that provide data to the console frontend
- platform-testing: modules that test both the end to end platform and individual components and collect metrics
- platform-deployment-manager: API to manage packages and application deployment and lifecycle
- platform-data-mgmnt: tools to manage data retention
- platform-package-repository: manages a simple package repository backed by OpenStack Swift
- gobblin: customized fork of the Gobblin data ingest framework
- prod-odl-kafka: plugin to ingest data from OpenDaylight
- logstash-codec-pnda-avro: patched Avro codec for ingesting data from Logstash
- example-applications: example applications that can be built and run on PNDA
- example-kafka-clients: examples for working with Kafka clients
- pnda-guide: this guide
This guide is the latest version and describes the software found on the pndaproject develop branches.
To refer to guides for specific releases, please navigate to the relevant release tag on GitHub.