[MINOR] HDFSBackedStateStore, StateStore and StateStoreRDD

japila-books · Aug 25, 2017 · b6a4a78 · b6a4a78
1 parent d12f767
commit b6a4a78
Show file tree

Hide file tree

Showing 9 changed files with 28 additions and 17 deletions.
diff --git a/SUMMARY.adoc b/SUMMARY.adoc
@@ -29,8 +29,8 @@
 .. link:spark-sql-streaming-Dataset-groupBy.adoc[groupBy Operator -- Streaming Aggregation (with Implicit State Logic)]
 
 . link:spark-sql-streaming-KeyValueGroupedDataset.adoc[KeyValueGroupedDataset -- Streaming Aggregation]
-.. link:spark-sql-streaming-KeyValueGroupedDataset-mapGroupsWithState.adoc[mapGroupsWithState Operator -- Stateful Stream Aggregation (with Explicit State Logic)]
-.. link:spark-sql-streaming-KeyValueGroupedDataset-flatMapGroupsWithState.adoc[flatMapGroupsWithState Operator -- Arbitrary Stateful Stream Aggregation (with Explicit State Logic)]
+.. link:spark-sql-streaming-KeyValueGroupedDataset-mapGroupsWithState.adoc[mapGroupsWithState Operator -- Stateful Streaming Aggregation (with Explicit State Logic)]
+.. link:spark-sql-streaming-KeyValueGroupedDataset-flatMapGroupsWithState.adoc[flatMapGroupsWithState Operator -- Arbitrary Stateful Streaming Aggregation (with Explicit State Logic)]
 
 . link:spark-sql-streaming-DataStreamWriter.adoc[DataStreamWriter -- Writing Datasets To Streaming Data Sinks]
 .. link:spark-sql-streaming-ForeachWriter.adoc[ForeachWriter]
@@ -58,6 +58,10 @@
 .. link:spark-sql-streaming-QueryProgressEvent.adoc[QueryProgressEvent]
 .. link:spark-sql-streaming-QueryTerminatedEvent.adoc[QueryTerminatedEvent]
 
+. link:spark-sql-streaming-webui.adoc[Web UI]
+
+. link:spark-sql-streaming-logging.adoc[Logging]
+
 == Extending Structured Streaming
 
 . link:spark-sql-streaming-StreamSourceProvider.adoc[StreamSourceProvider -- Streaming Data Source Provider]
@@ -119,25 +123,25 @@
 
 . link:spark-sql-streaming-OffsetSeqMetadata.adoc[OffsetSeqMetadata]
 
-=== State in Stateful Stream Processing
+=== State in Stateful Streaming Aggregations
 
-. link:spark-sql-streaming-GroupState.adoc[GroupState -- Contract for State Per Group For Stateful Aggregation]
+. link:spark-sql-streaming-GroupState.adoc[GroupState -- State Per Group in Stateful Streaming Aggregation]
 .. link:spark-sql-streaming-GroupStateImpl.adoc[GroupStateImpl]
 .. link:spark-sql-streaming-GroupStateTimeout.adoc[GroupStateTimeout]
 
 . link:spark-sql-streaming-StateStore.adoc[StateStore]
 .. link:spark-sql-streaming-StateStoreOps.adoc[StateStoreOps -- Implicits Methods for Creating StateStoreRDD]
-.. link:spark-sql-streaming-StateStoreRDD.adoc[StateStoreRDD]
+.. link:spark-sql-streaming-StateStoreProvider.adoc[StateStoreProvider]
 .. link:spark-sql-streaming-StateStoreUpdater.adoc[StateStoreUpdater]
 .. link:spark-sql-streaming-StateStoreWriter.adoc[StateStoreWriter -- Recording Metrics For Writing to StateStore]
-.. link:spark-sql-streaming-StateStoreCoordinator.adoc[StateStoreCoordinator -- Tracking Locations of StateStores for StateStoreRDD]
-... link:spark-sql-streaming-StateStoreCoordinatorRef.adoc[StateStoreCoordinatorRef Interface for Communication with StateStoreCoordinator]
-.. link:spark-sql-streaming-StateStoreProvider.adoc[StateStoreProvider]
 
 . link:spark-sql-streaming-HDFSBackedStateStore.adoc[HDFSBackedStateStore]
 .. link:spark-sql-streaming-HDFSBackedStateStoreProvider.adoc[HDFSBackedStateStoreProvider]
 
+. link:spark-sql-streaming-StateStoreRDD.adoc[StateStoreRDD -- RDD for Updating State (in StateStore)]
+. link:spark-sql-streaming-StateStoreCoordinator.adoc[StateStoreCoordinator -- Tracking Locations of StateStores for StateStoreRDD]
+.. link:spark-sql-streaming-StateStoreCoordinatorRef.adoc[StateStoreCoordinatorRef Interface for Communication with StateStoreCoordinator]
+
 == Varia
 
 . link:spark-sql-streaming-StreamProgress.adoc[StreamProgress Custom Scala Map]
-. link:spark-sql-streaming-logging.adoc[Logging]
diff --git a/spark-sql-streaming-GroupState.adoc b/spark-sql-streaming-GroupState.adoc
@@ -1,4 +1,4 @@
-== [[GroupState]] GroupState -- Contract for State Per Group For Stateful Aggregation
+== [[GroupState]] GroupState -- State Per Group in Stateful Streaming Aggregation
 
 `GroupState` is the <<contract, contract>> for working with a state (of type `S`) per group for arbitrary stateful aggregation (using link:spark-sql-streaming-KeyValueGroupedDataset.adoc#mapGroupsWithState[mapGroupsWithState] or link:spark-sql-streaming-KeyValueGroupedDataset.adoc#flatMapGroupsWithState[flatMapGroupsWithState] operators).
 

diff --git a/spark-sql-streaming-HDFSBackedStateStore.adoc b/spark-sql-streaming-HDFSBackedStateStore.adoc
@@ -1,6 +1,6 @@
 == [[HDFSBackedStateStore]] HDFSBackedStateStore
 
-`HDFSBackedStateStore` is a link:spark-sql-streaming-StateStore.adoc[StateStore] using HDFS-compatible file system for versioned state persistence.
+`HDFSBackedStateStore` is a link:spark-sql-streaming-StateStore.adoc[StateStore] that uses a HDFS-compatible file system for versioned state persistence.
 
 `HDFSBackedStateStore` is created exclusively when `HDFSBackedStateStoreProvider` <<getStore, is requested for a state store>> (which is when...FIXME).
 

diff --git a/spark-sql-streaming-KeyValueGroupedDataset-flatMapGroupsWithState.adoc b/spark-sql-streaming-KeyValueGroupedDataset-flatMapGroupsWithState.adoc
@@ -1,4 +1,4 @@
-== [[flatMapGroupsWithState]] flatMapGroupsWithState Operator -- Arbitrary Stateful Stream Aggregation (with Explicit State Logic)
+== [[flatMapGroupsWithState]] flatMapGroupsWithState Operator -- Arbitrary Stateful Streaming Aggregation (with Explicit State Logic)
 
 [source, scala]
 ----

diff --git a/spark-sql-streaming-KeyValueGroupedDataset-mapGroupsWithState.adoc b/spark-sql-streaming-KeyValueGroupedDataset-mapGroupsWithState.adoc
@@ -1,4 +1,4 @@
-== [[mapGroupsWithState]] mapGroupsWithState Operator -- Stateful Stream Aggregation (with Explicit State Logic)
+== [[mapGroupsWithState]] mapGroupsWithState Operator -- Stateful Streaming Aggregation (with Explicit State Logic)
 
 [source, scala]
 ----

diff --git a/spark-sql-streaming-StateStore.adoc b/spark-sql-streaming-StateStore.adoc
@@ -37,6 +37,8 @@ trait StateStore {
 | [[get]] `get`
 |
 
+Used exclusively when `StateStoreRDD` link:spark-sql-streaming-StateStoreRDD.adoc#compute[is executed].
+
 | [[getRange]] `getRange`
 |
 

diff --git a/spark-sql-streaming-StateStoreProvider.adoc b/spark-sql-streaming-StateStoreProvider.adoc
@@ -14,4 +14,4 @@ CAUTION: FIXME
 
 CAUTION: FIXME
 
-NOTE: `getStore` is used when `StateStore` is link:spark-sql-streaming-StateStore.adoc#get[requested for a state store] (given a `StateStoreProviderId`).
+NOTE: `getStore` is used exclusively when `StateStore` is link:spark-sql-streaming-StateStore.adoc#get[requested for a state store] (given a `StateStoreProviderId`).
diff --git a/spark-sql-streaming-StateStoreRDD.adoc b/spark-sql-streaming-StateStoreRDD.adoc
@@ -1,8 +1,10 @@
-== [[StateStoreRDD]] StateStoreRDD
+== [[StateStoreRDD]] StateStoreRDD -- RDD for Updating State (in StateStore)
 
-`StateStoreRDD` is a specialized `RDD` for <<compute, computations>> using link:spark-sql-streaming-StateStore.adoc[StateStore].
+`StateStoreRDD` is an `RDD` for <<compute, executing storeUpdateFunction>> with link:spark-sql-streaming-StateStore.adoc[StateStore] (and data from partitions of a <<dataRDD, child RDD>>).
 
-`StateStoreRDD` is <<creating-instance, created>> as the result of link:spark-sql-streaming-StateStoreOps.adoc#mapPartitionsWithStateStore[executing physical operators] (i.e. `FlatMapGroupsWithStateExec`, `StateStoreRestoreExec`, `StateStoreSaveExec`, `StreamingDeduplicateExec`).
+`StateStoreRDD` is <<creating-instance, created>> when link:spark-sql-streaming-StateStoreOps.adoc#mapPartitionsWithStateStore[executing physical operators] that work with state (i.e. `FlatMapGroupsWithStateExec`, `StateStoreRestoreExec`, `StateStoreSaveExec`, `StreamingDeduplicateExec`).
+
+CAUTION: FIXME graffle with operator -> logical operator -> planner -> physical operator -> StateStoreRDD
 
 `StateStoreRDD` uses `StateStoreCoordinator` for <<getPreferredLocations, preferred locations>> for job scheduling.
 

diff --git a/spark-sql-streaming-webui.adoc b/spark-sql-streaming-webui.adoc
@@ -0,0 +1,3 @@
+== Web UI
+
+Web UI...FIXME