# [structured streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example)

在 Apache Spark Structured Streaming 中支持的一些主要数据源格式及其使用示例如下：

1. **文本文件（Text）**：
   ```python
   text_df = spark.readStream.format("text").load("/path/to/directory")
   ```

2. **CSV 文件**：
   ```python
   csv_df = spark.readStream.format("csv").option("header", "true").load("/path/to/directory")
   ```

3. **JSON 文件**：
   ```python
   json_df = spark.readStream.format("json").load("/path/to/directory")
   ```

4. **ORC 文件**：
   ```python
   orc_df = spark.readStream.format("orc").load("/path/to/directory")
   ```

5. **Parquet 文件**：
   ```python
   parquet_df = spark.readStream.format("parquet").load("/path/to/directory")
   ```

6. **从 Kafka 读取**：
   ```python
   kafka_df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host:port").option("subscribe", "topic").load()
   ```

7. **套接字（Socket，用于测试）**：
   ```python
   socket_df = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
   ```

8. **Rate（用于测试）**：
   ```python
   rate_df = spark.readStream.format("rate").load()
   ```

在分布式环境中， **确保指定的路径或资源对所有 Spark 节点可访问**。

# csv

## csv1

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# 创建 SparkSession
spark = SparkSession.builder.appName("StructuredStreamingCSV1").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [2]:
file = 'file:///home/jupyter/data/test'

# 定义 CSV 文件的 schema
schema = StructType([
    StructField('time', TimestampType()),
    StructField('app_id', StringType()),
    StructField('store', StringType()),
    StructField('adid', StringType()),
    StructField('openid', StringType()),
    StructField('activity_kind', StringType()),
    StructField('created_at', StringType()),
    StructField('installed_at', StringType()),
    StructField('reattributed_at', StringType()),
    StructField('network_name', StringType()),
    StructField('country', StringType()),
    StructField('device_name', StringType()),
    StructField('device_type', StringType()),
    StructField('os_name', StringType()),
    StructField('timezone', StringType()),
    StructField('event_name', StringType()),
    StructField('revenue_float', StringType()),
    StructField('revenue', StringType()),
    StructField('currency', StringType()),
    StructField('revenue_usd', StringType()),
    StructField('reporting_revenue', StringType())
])


# 读取 CSV 文件
csvDF = spark.readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv(file)

In [3]:
# 定义数据处理逻辑
# 例如，简单的转换或聚合操作

# 定义输出接收器，例如输出到控制台
query = csvDF.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

# 等待流处理结束
query.awaitTermination(timeout=3)
query.status
query.stop()
query.lastProgress
query.status

spark.stop()

23/12/04 05:34:47 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-eb2a25c2-421c-4d6f-98bd-56873952f65c. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/12/04 05:34:48 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------


False

{'message': 'Processing new data',
 'isDataAvailable': True,
 'isTriggerActive': True}

+-------------------+--------------------+------+--------------------+------+-------------+----------+------------+---------------+--------------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|               time|              app_id| store|                adid|openid|activity_kind|created_at|installed_at|reattributed_at|        network_name|country|device_name|device_type|os_name|timezone|event_name|revenue_float|revenue|currency|revenue_usd|reporting_revenue|
+-------------------+--------------------+------+--------------------+------+-------------+----------+------------+---------------+--------------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|2023-10-01 00:00:00|          1456241577|itunes|041bf78c9dc6dd5f5...|  NULL|      session|      NULL|  1636532102|           NULL|             RWD-ady|     jp|       NULL|      

{'id': '9563b4fc-31af-4e1c-b3cc-a37fe40258c1',
 'runId': 'b3418f9a-20c6-4bee-a108-e29f05eb59d3',
 'name': None,
 'timestamp': '2023-12-04T05:34:48.232Z',
 'batchId': 0,
 'numInputRows': 135,
 'inputRowsPerSecond': 0.0,
 'processedRowsPerSecond': 39.72925250147145,
 'durationMs': {'addBatch': 2805,
  'commitOffsets': 36,
  'getBatch': 82,
  'latestOffset': 59,
  'queryPlanning': 347,
  'triggerExecution': 3398,
  'walCommit': 46},
 'stateOperators': [],
 'sources': [{'description': 'FileStreamSource[file:/home/jupyter/data/test]',
   'startOffset': None,
   'endOffset': {'logOffset': 0},
   'latestOffset': None,
   'numInputRows': 135,
   'inputRowsPerSecond': 0.0,
   'processedRowsPerSecond': 39.72925250147145}],
 'sink': {'description': 'org.apache.spark.sql.execution.streaming.ConsoleTable$@66f7501a',
  'numOutputRows': 135}}

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

## csv2 

In [4]:
# 创建 SparkSession
spark = SparkSession.builder.appName("StructuredStreamingCSV2").getOrCreate()

In [5]:
file = 'file:///home/jupyter/data/test'
# 读取 CSV 文件
csvdf = spark.readStream.format("csv") \
        .option("header", "false") \
        .schema(schema) \
        .load(file)

In [6]:
query = csvdf.writeStream.format('console').start()
query.awaitTermination(timeout=10)

query.status
query.stop()
query.lastProgress
query.status

spark.stop()

23/12/04 05:34:59 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-3264e932-2910-4ecc-ae47-a587b47bc1f2. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/12/04 05:34:59 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+--------------------+------+--------------------+------+-------------+----------+------------+---------------+--------------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|               time|              app_id| store|                adid|openid|activity_kind|created_at|installed_at|reattributed_at|        network_name|country|device_name|device_type|os_name|timezone|event_name|revenue_float|revenue|currency|revenue_usd|reporting_revenue|
+-------------------+--------------------+------+--------------------+------+-------------+----------+------------+---------------+--------------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|2023-10-01 00:00:00|          1456241577|itunes|041bf78c9dc6dd5f5...|  NULL|    

False

{'message': 'Getting offsets from FileStreamSource[file:/home/jupyter/data/test]',
 'isDataAvailable': False,
 'isTriggerActive': True}

{'id': 'ec931669-afe0-4e5b-b048-161c0b782c58',
 'runId': '23a4b76b-fe9c-4666-a7e3-4bcfbfe75af1',
 'name': None,
 'timestamp': '2023-12-04T05:34:59.192Z',
 'batchId': 0,
 'numInputRows': 135,
 'inputRowsPerSecond': 0.0,
 'processedRowsPerSecond': 98.39650145772595,
 'durationMs': {'addBatch': 1234,
  'commitOffsets': 26,
  'getBatch': 25,
  'latestOffset': 43,
  'queryPlanning': 16,
  'triggerExecution': 1372,
  'walCommit': 27},
 'stateOperators': [],
 'sources': [{'description': 'FileStreamSource[file:/home/jupyter/data/test]',
   'startOffset': None,
   'endOffset': {'logOffset': 0},
   'latestOffset': None,
   'numInputRows': 135,
   'inputRowsPerSecond': 0.0,
   'processedRowsPerSecond': 98.39650145772595}],
 'sink': {'description': 'org.apache.spark.sql.execution.streaming.ConsoleTable$@66f7501a',
  'numOutputRows': 135}}

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

# output

## outfile 

In [12]:
from pyspark.sql import functions as F

# 创建 SparkSession
spark = SparkSession.builder.appName("StructuredStreamingOutPut").getOrCreate()

file = 'file:///home/jupyter/data/test'
# 读取 CSV 文件
csvdf = spark.readStream.format("csv") \
        .option("header", "false") \
        .schema(schema) \
        .load(file)

In [25]:
_csvdf = csvdf.withWatermark('time', '10 second').groupby('time').agg(
    F.count("time").alias("count"),
    F.last("adid").alias("last_adid"),
    F.max("adid").alias("max_adid")
)
# query = _csvdf.writeStream.outputMode('complete').format('console').start()
query = _csvdf.writeStream \
        .outputMode('append') \
        .format('csv') \
        .option('path', '/home/jupyter/notebook/output/path') \
        .option('checkpointLocation', '/home/jupyter/notebook/output/checkpointLocation') \
        .start()
query.awaitTermination(timeout=10)

query.status
# query.lastProgress
query.stop()
query.status

23/12/04 05:57:42 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


False

{'message': 'Waiting for data to arrive',
 'isDataAvailable': False,
 'isTriggerActive': False}

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

## complete

In [14]:
_csvdf = csvdf.groupby('time').agg(
    F.count("time").alias("count"),
    F.last("adid").alias("last_adid"),
    F.max("adid").alias("max_adid")
)
# query = _csvdf.writeStream.outputMode('complete').format('console').start()
query = _csvdf.writeStream \
        .outputMode('complete') \
        .format('csv') \
        .option('path', 'file:///home/jupyter/notebook/output/path') \
        .option('checkpointLocation', 'file:///home/jupyter/notebook/output/checkpointLocation') \
        .start()
query.awaitTermination(timeout=10)

query.status
# query.lastProgress
query.stop()
query.status

IndentationError: unexpected indent (3943124468.py, line 9)

## append

In [9]:
_csvdf = csvdf.withWatermark("time", "60 second")
_csvdf = _csvdf.groupby('time').agg(
    F.count("time").alias("count"),
    F.last("adid").alias("last_adid"),
    F.max("adid").alias("max_adid")
)
# query = _csvdf.writeStream.outputMode('complete').format('console').start()
query = _csvdf.writeStream \
        .outputMode('append') \
        .format('csv') \
        .option('path', 'file:///home/jupyter/notebook/output/path') \
        .option('checkpointLocation', 'file:///home/jupyter/notebook/output/checkpointLocation') \
        .start()
query.awaitTermination(timeout=10)

query.status
# query.lastProgress
query.stop()
query.status

23/12/04 05:35:26 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

False

{'message': 'Waiting for data to arrive',
 'isDataAvailable': False,
 'isTriggerActive': False}

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

## update

In [10]:
from pyspark.sql import functions as F

_csvdf = csvdf.groupby('time').agg(
    F.count("time").alias("count"),
    F.last("adid").alias("last_adid"),
    F.max("adid").alias("max_adid")
)
query = _csvdf.writeStream.outputMode('update').format('console').start()

query.awaitTermination(timeout=10)

query.status
# query.lastProgress
query.stop()
query.status

23/12/04 05:35:36 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-9d18bcc9-d0b9-4462-8079-91a7fb753ace. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/12/04 05:35:36 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+-----+--------------------+--------------------+
|               time|count|           last_adid|            max_adid|
+-------------------+-----+--------------------+--------------------+
|2023-10-01 00:01:56|    1|7d2a5d84039288900...|7d2a5d84039288900...|
|2023-10-01 00:07:18|    2|e613f2c3335bf8979...|e613f2c3335bf8979...|
|2023-10-01 00:01:55|    1|137e93b9caf60253a...|137e93b9caf60253a...|
|2023-10-01 00:07:59|    1|f20502c58e489b68e...|f20502c58e489b68e...|
|2023-10-01 00:05:57|    1|3c39bd30358828d26...|3c39bd30358828d26...|
|2023-10-01 00:02:12|    1|504092e5ea59fcf82...|504092e5ea59fcf82...|
|2023-10-01 00:01:09|    2|1e8f640fde5c32716...|1f3314dff9eec7e18...|
|2023-10-01 00:02:06|    1|973f078d9832a1d20...|973f078d9832a1d20...|
|2023-10-01 00:07:44|    2|c2497138e365f15bd...|c2497138e365f15bd...|
|2023-10-01 00:03:48|    1|450277e0ce225e9c7...|450277e0ce225e9

False

{'message': 'Getting offsets from FileStreamSource[file:/home/jupyter/data/test]',
 'isDataAvailable': False,
 'isTriggerActive': True}

{'id': 'e2629ff9-f732-4369-a18c-d783d005f15e',
 'runId': '10c07989-39e1-4d48-bc30-19860946c4dc',
 'name': None,
 'timestamp': '2023-12-04T05:35:37.013Z',
 'batchId': 0,
 'numInputRows': 135,
 'inputRowsPerSecond': 0.0,
 'processedRowsPerSecond': 41.74397031539889,
 'durationMs': {'addBatch': 3106,
  'commitOffsets': 31,
  'getBatch': 18,
  'latestOffset': 39,
  'queryPlanning': 15,
  'triggerExecution': 3234,
  'walCommit': 25},
 'stateOperators': [{'operatorName': 'stateStoreSave',
   'numRowsTotal': 117,
   'numRowsUpdated': 117,
   'allUpdatesTimeMs': 324,
   'numRowsRemoved': 0,
   'allRemovalsTimeMs': 0,
   'commitTimeMs': 1947,
   'memoryUsedBytes': 83368,
   'numRowsDroppedByWatermark': 0,
   'numShufflePartitions': 200,
   'numStateStoreInstances': 200,
   'customMetrics': {'loadedMapCacheHitCount': 0,
    'loadedMapCacheMissCount': 0,
    'stateOnCurrentVersionSizeBytes': 54568}}],
 'sources': [{'description': 'FileStreamSource[file:/home/jupyter/data/test]',
   'startOffset':

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

In [11]:
spark.stop()

# window 