Spark-Pinot connector for reading data from and writing data to Pinot.
Detailed read model documentation is available here: Spark-Pinot Connector Read Model.
- Query realtime, offline or hybrid tables
- Distributed, parallel scan
- SQL support instead of PQL
- Column and filter push down to optimize performance
- Overlap between realtime and offline segments is queried exactly once for hybrid tables
- Schema discovery
  - Dynamic inference
  - Static analysis of case class (see the schema sketch below)
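
For the static path, the read schema can be derived from a case class rather than inferred from Pinot at runtime. A minimal sketch, assuming a hypothetical `Flight` case class whose fields match columns of the example table used in the quick start below:

```scala
import org.apache.spark.sql.Encoders

// Hypothetical case class; field names are illustrative assumptions.
case class Flight(Carrier: String, DestStateName: String)

// Build the Spark schema statically from the case class fields.
val flightSchema = Encoders.product[Flight].schema
```

The resulting `StructType` can be handed to the reader with `.schema(flightSchema)` before `.load()`, and the loaded DataFrame can be typed with `.as[Flight]`.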
Quick-start example:

```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
  .builder()
  .appName("spark-pinot-connector-test")
  .master("local")
  .getOrCreate()

import spark.implicits._

// Read the offline part of the airlineStats table; the filter below is
// pushed down to Pinot rather than evaluated in Spark.
val data = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "offline")
  .load()
  .filter($"DestStateName" === "Florida")

data.show(100)
```
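
The same read pattern should apply to the realtime part of the table, and selecting only the needed columns lets the connector push column pruning down to Pinot. A sketch, assuming the lowercase "realtime" value mirrors the "offline" one above:

```scala
// Read the realtime part of the table; with column and filter push down,
// only the selected column and matching rows are fetched from Pinot.
val realtimeData = spark.read
  .format("pinot")
  .option("table", "airlineStats")
  .option("tableType", "realtime")
  .load()
  .select($"DestStateName")
  .filter($"DestStateName" === "Florida")

realtimeData.show()
```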
For more examples, see `src/test/scala/example/ExampleSparkPinotConnectorTest.scala`.
The Spark-Pinot connector is built on the Spark DataSourceV2 API. For background on DataSourceV2, see the Databricks presentation on the topic.
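
As a rough illustration of what that API involves, here is a minimal, hypothetical DataSourceV2 source written against the Spark 3.x connector interfaces. Every class name here is invented for the sketch; none of it is the connector's actual code:

```scala
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.unsafe.types.UTF8String

// Entry point: Spark instantiates this class via .format(...).
class DemoSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("DestStateName", "string")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new DemoTable(schema)
}

// Describes the table: name, schema, and supported capabilities.
class DemoTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "demo"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new DemoScan(tableSchema)
}

// ScanBuilder, Scan, and Batch collapsed into one class for brevity.
// A real connector would plan one InputPartition per segment or server,
// which is what makes the scan distributed and parallel.
class DemoScan(tableSchema: StructType) extends ScanBuilder with Scan with Batch {
  override def build(): Scan = this
  override def readSchema(): StructType = tableSchema
  override def toBatch: Batch = this
  override def planInputPartitions(): Array[InputPartition] =
    Array(new InputPartition {})
  override def createReaderFactory(): PartitionReaderFactory = new DemoReaderFactory
}

// Runs on executors; each reader serves the rows of one partition.
// This demo serves a single hard-coded row.
class DemoReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new PartitionReader[InternalRow] {
      private var served = false
      override def next(): Boolean = { val hasNext = !served; served = true; hasNext }
      override def get(): InternalRow = InternalRow(UTF8String.fromString("Florida"))
      override def close(): Unit = ()
    }
}
```

A source like this can be loaded with `spark.read.format(classOf[DemoSource].getName).load()`; registering a short format name like "pinot" additionally requires a `META-INF/services` entry.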
Future work:
- Add integration tests for the read operation
- Streaming endpoint support for the read operation
- Add write support (Pinot segment write logic will be changed in later versions of Pinot)