
Spark-Pinot Connector

Spark-Pinot connector to read data from and write data to Pinot.

Detailed read model documentation is available here: Spark-Pinot Connector Read Model.

Features

  • Query realtime, offline, or hybrid tables
  • Distributed, parallel scan
  • SQL support instead of PQL
  • Column and filter push-down to optimize performance (see the sketch after this list)
  • For hybrid tables, the overlap between realtime and offline segments is queried exactly once
  • Schema discovery
    • Dynamic inference
    • Static analysis of case classes (illustrated after the quick start)
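
A minimal sketch of how column and filter push-down surface in user code. It reuses the airlineStats table from the quick start below; the Carrier column is a hypothetical example, but select and filter on a Pinot-backed DataFrame are where the connector can prune columns and push predicates to Pinot instead of scanning the whole table:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pushdown-sketch")
  .master("local")
  .getOrCreate()
import spark.implicits._

val flights = spark.read
  .format("pinot")
  .option("table", "airlineStats")   // table name from the quick start
  .option("tableType", "offline")
  .load()

// Only the referenced columns and matching rows should reach Spark:
// the connector can fold this into the Pinot query it issues.
// (Carrier is a hypothetical column here.)
val floridaFlights = flights
  .select($"Carrier", $"DestStateName")
  .filter($"DestStateName" === "Florida")

// Inspect the physical plan to confirm what was pushed down.
floridaFlights.explain(true)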

Quick Start

import org.apache.spark.sql.SparkSession

// Start a local Spark session for the example.
val spark: SparkSession = SparkSession
  .builder()
  .appName("spark-pinot-connector-test")
  .master("local")
  .getOrCreate()

// Needed for the $"column" syntax below.
import spark.implicits._

// Read the offline airlineStats table from Pinot; the filter is pushed
// down to Pinot rather than applied after a full scan.
val data = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "offline")
  .load()
  .filter($"DestStateName" === "Florida")

data.show(100)
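
Building on the quick start, here is a hedged sketch of the case-class route to schema discovery mentioned under Features: map the DataFrame to a typed Dataset and let the case class define the expected columns. FlightRecord and the Origin field are hypothetical and would need to match the actual airlineStats schema:

// Hypothetical record type; field names and types must match the Pinot schema.
case class FlightRecord(DestStateName: String, Origin: String)

val typedData = data
  .select($"DestStateName", $"Origin")
  .as[FlightRecord]

typedData.take(5).foreach(println)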

For more examples, see src/test/scala/example/ExampleSparkPinotConnectorTest.scala.

The Spark-Pinot connector is built on the Spark DataSourceV2 API. For an overview of DataSourceV2, see this Databricks presentation:

https://www.slideshare.net/databricks/apache-spark-data-source-v2-with-wenchen-fan-and-gengliang-wang
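
As a rough illustration of what building on DataSourceV2 means, the skeleton below shows the Spark 3.x read-side interfaces such a connector implements. All class names are illustrative, not the connector's actual sources, and the connector itself may target a different Spark/DataSourceV2 version:

import java.util
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point Spark resolves from .format(...); infers the table schema
// and hands back a Table for Spark to plan reads against.
class ExamplePinotProvider extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = ???
}

// Read-only table: capabilities advertise batch reads; the ScanBuilder is
// where column pruning and filter push-down get wired in.
class ExamplePinotTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "example"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = ???
}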

Future Work

  • Add integration tests for the read path
  • Streaming endpoint support for read operations
  • Add write support (Pinot segment write logic will change in later versions of Pinot)