# [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example)

The main data source formats supported by Apache Spark Structured Streaming, each with a usage example:

1. **Text files**:
   ```python
   text_df = spark.readStream.format("text").load("/path/to/directory")
   ```

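2. **CSV files** (streaming file sources other than text need an explicit schema unless `spark.sql.streaming.schemaInference` is enabled; `schema` is a `StructType` you define, such as the one in the csv1 section):
   ```python
   csv_df = spark.readStream.format("csv").option("header", "true").schema(schema).load("/path/to/directory")
   ```

3. **JSON files**:
   ```python
   json_df = spark.readStream.format("json").schema(schema).load("/path/to/directory")
   ```

4. **ORC files**:
   ```python
   orc_df = spark.readStream.format("orc").schema(schema).load("/path/to/directory")
   ```

5. **Parquet files**:
   ```python
   parquet_df = spark.readStream.format("parquet").schema(schema).load("/path/to/directory")
   ```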

6. **Reading from Kafka**:
   ```python
   kafka_df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host:port").option("subscribe", "topic").load()
   ```

7. **Socket (for testing)**:
   ```python
   socket_df = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
   ```

8. **Rate (for testing)**:
   ```python
   rate_df = spark.readStream.format("rate").load()
   ```

In a distributed environment, **make sure the specified paths or resources are accessible from every Spark node**.
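The file-based sources above watch a directory, and the Structured Streaming guide asks that files be placed into that directory atomically so a half-written file is never picked up. A minimal sketch of one way to do that with the standard library follows; the function name `drop_csv_atomically`, the staging directory, and the file name are all hypothetical, and the atomicity of the move assumes both directories sit on the same local filesystem.

```python
import csv
import os
import tempfile


def drop_csv_atomically(rows, watched_dir, staging_dir, name):
    """Write rows to staging_dir first, then move the finished file into watched_dir."""
    os.makedirs(watched_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # Write the complete file outside the directory that the stream is watching.
    staged = os.path.join(staging_dir, name)
    with open(staged, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)

    # os.replace is an atomic rename when source and target share a filesystem.
    final = os.path.join(watched_dir, name)
    os.replace(staged, final)
    return final


# Example with throwaway directories; the row values are illustrative only.
base = tempfile.mkdtemp()
path = drop_csv_atomically(
    [["2023-10-01 00:00:00", "1456241577", "itunes"]],
    watched_dir=os.path.join(base, "test"),
    staging_dir=os.path.join(base, "staging"),
    name="part-0001.csv",
)
```

On a shared or object store the move semantics differ, so treat this as a local-filesystem illustration rather than a general recipe.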

# csv

## csv1

In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

In [9]:
file = 'file:///home/jupyter/data/test'

# Define the schema of the CSV files
schema = StructType([
    StructField('time', TimestampType()),
    StructField('app_id', StringType()),
    StructField('store', StringType()),
    StructField('adid', StringType()),
    StructField('openid', StringType()),
    StructField('activity_kind', StringType()),
    StructField('created_at', StringType()),
    StructField('installed_at', StringType()),
    StructField('reattributed_at', StringType()),
    StructField('network_name', StringType()),
    StructField('country', StringType()),
    StructField('device_name', StringType()),
    StructField('device_type', StringType()),
    StructField('os_name', StringType()),
    StructField('timezone', StringType()),
    StructField('event_name', StringType()),
    StructField('revenue_float', StringType()),
    StructField('revenue', StringType()),
    StructField('currency', StringType()),
    StructField('revenue_usd', StringType()),
    StructField('reporting_revenue', StringType())
])

In [10]:
# Create the SparkSession
spark = SparkSession.builder.appName("StructuredStreamingCSV1").getOrCreate()

# Read the CSV files as a stream
csvDF = spark.readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv(file)

# Define the processing logic here,
# for example a simple transformation or aggregation

# Define the output sink, for example the console
query = csvDF.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

# Wait up to 3 seconds for the query to terminate (returns False if it is still running)
query.awaitTermination(timeout=3)
query.status
query.stop()
# query.lastProgress
query.status

spark.stop()

23/12/04 07:04:34 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
23/12/04 07:04:34 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-4ec6c3f1-14a0-4112-bb0a-9385a01c5b52. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/12/04 07:04:34 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+--------------------+------+--------------------+------+-------------+----------+------------+---------------+--------------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|               time|              app_id| store|                adid|openid|activity_kind|created_at|installed_at|reattributed_at|        network_name|country|device_name|device_type|os_name|timezone|event_name|revenue_float|revenue|currency|revenue_usd|reporting_revenue|
+-------------------+--------------------+------+--------------------+------+-------------+----------+------------+---------------+--------------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|2023-10-01 00:00:00|          1456241577|itunes|041bf78c9dc6dd5f5...|  NULL|    

False

{'message': 'Waiting for data to arrive',
 'isDataAvailable': False,
 'isTriggerActive': False}

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

## csv2 

In [11]:
# Create the SparkSession
spark = SparkSession.builder.appName("StructuredStreamingCSV2").getOrCreate()

file = 'file:///home/jupyter/data/test'
# Read the CSV files as a stream
csvdf = spark.readStream.format("csv") \
        .option("header", "false") \
        .schema(schema) \
        .load(file)

In [12]:
query = csvdf.writeStream.format('console').start()
query.awaitTermination(timeout=10)

query.status
query.stop()
# query.lastProgress
query.status

spark.stop()

23/12/04 07:04:45 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-bfe1c123-e309-44bb-ba7e-352a67ba52f5. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/12/04 07:04:45 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+--------------------+------+--------------------+------+-------------+----------+------------+---------------+--------------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|               time|              app_id| store|                adid|openid|activity_kind|created_at|installed_at|reattributed_at|        network_name|country|device_name|device_type|os_name|timezone|event_name|revenue_float|revenue|currency|revenue_usd|reporting_revenue|
+-------------------+--------------------+------+--------------------+------+-------------+----------+------------+---------------+--------------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|2023-10-01 00:00:00|          1456241577|itunes|041bf78c9dc6dd5f5...|  NULL|    

False

{'message': 'Waiting for data to arrive',
 'isDataAvailable': False,
 'isTriggerActive': False}

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

# output

## outfile 

In [16]:
from pyspark.sql import functions as F

# Create the SparkSession
spark = SparkSession.builder.appName("StructuredStreamingOutPutNew").getOrCreate()

file = 'file:///home/jupyter/data/test'
# Read the CSV files as a stream
csvdf = spark.readStream.format("csv") \
        .option("header", "false") \
        .schema(schema) \
        .load(file)

In [18]:
_csvdf = csvdf.withWatermark('time', '100 second').groupby('time').agg(
    F.count("time").alias("count"),
    F.last("adid").alias("last_adid"),
    F.max("adid").alias("max_adid")
)

# The csv sink only supports the append output mode
query = _csvdf.writeStream \
        .format('csv') \
        .option('path', 'file:///home/jupyter/outputdata/csvput/path') \
        .option('checkpointLocation', 'file:///home/jupyter/outputdata/csvput/checkpointLocation') \
        .start()
query.awaitTermination(timeout=10)

query.status
# query.lastProgress
# query.stop()  # calling stop() before the data has been fully written raises an error

23/12/04 07:08:48 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.

False

{'message': 'Processing new data',
 'isDataAvailable': True,
 'isTriggerActive': True}
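The status above shows the query still processing when the 10-second wait expired, which is why stopping it right away is risky. One option is to poll `query.status` until it reports no active trigger and no pending data. The helper below is only a sketch: `wait_until_idle` and `FakeQuery` are hypothetical names, and `FakeQuery` is a stand-in that exists purely so the example runs without Spark. It relies only on the `isTriggerActive` and `isDataAvailable` keys visible in the output above. Note that a query which has not yet started its first batch may also look idle, so this check is a heuristic; PySpark's `StreamingQuery.processAllAvailable()` is the built-in alternative intended for testing.

```python
import time


def wait_until_idle(query, timeout=60.0, poll=0.5):
    """Poll query.status until nothing is running or pending.

    Returns True if the query looked idle before `timeout` seconds passed,
    otherwise False.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = query.status
        if not status["isTriggerActive"] and not status["isDataAvailable"]:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


class FakeQuery:
    """Hypothetical stand-in for a StreamingQuery, used only to exercise the helper."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    @property
    def status(self):
        # Hand out the queued statuses in order, then keep repeating the last one.
        if len(self._statuses) > 1:
            return self._statuses.pop(0)
        return self._statuses[0]


busy = {"message": "Processing new data", "isDataAvailable": True, "isTriggerActive": True}
idle = {"message": "Waiting for data to arrive", "isDataAvailable": False, "isTriggerActive": False}

became_idle = wait_until_idle(FakeQuery([busy, busy, idle]), timeout=5, poll=0.01)
timed_out = wait_until_idle(FakeQuery([busy]), timeout=0.05, poll=0.01)
```

With a real query the call would look like `if wait_until_idle(query): query.stop()`.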

## complete

In [9]:
_csvdf = csvdf.groupby('time').agg(
    F.count("time").alias("count"),
    F.last("adid").alias("last_adid"),
    F.max("adid").alias("max_adid")
)
# query = _csvdf.writeStream.outputMode('complete').format('console').start()
query = _csvdf.writeStream \
        .outputMode('complete') \
        .format('csv') \
        .option('path', 'file:///home/jupyter/notebook/output/path') \
        .option('checkpointLocation', 'file:///home/jupyter/notebook/output/checkpointLocation') \
        .start()
query.awaitTermination(timeout=10)

query.status
# query.lastProgress
query.stop()
query.status

23/12/04 06:28:10 WARN TaskSetManager: Lost task 105.0 in stage 1.0 (TID 133) (worker1 executor 1): TaskKilled (Stage cancelled: Job 0 cancelled part of cancelled job group 8360bb0c-fac7-4ddd-8d52-01ab2c54a374)
23/12/04 06:28:10 WARN TaskSetManager: Lost task 146.0 in stage 1.0 (TID 132) (worker2 executor 2): TaskKilled (Stage cancelled: Job 0 cancelled part of cancelled job group 8360bb0c-fac7-4ddd-8d52-01ab2c54a374)

AnalysisException: Data source csv does not support Complete output mode.
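The exception arises because file sinks can only append new rows and cannot rewrite rows that are already on disk. The lookup below is a small sketch of the sink-to-mode table as summarized from the output sinks section of the Structured Streaming guide; the names `SINK_OUTPUT_MODES` and `sink_supports` are hypothetical, and the table should be checked against the Spark version in use. It only covers the sink side: the query itself must also allow the mode (for example, append on an aggregation requires a watermark).

```python
# File-based sink formats; all of them share the same restriction.
FILE_FORMATS = {"csv", "json", "orc", "parquet", "text"}

# Output modes each built-in sink accepts (assumed from the guide's sink table).
SINK_OUTPUT_MODES = {
    "file": {"append"},
    "kafka": {"append", "update", "complete"},
    "console": {"append", "update", "complete"},
    "memory": {"append", "complete"},
}


def sink_supports(fmt, mode):
    """Return True if the sink format accepts the given output mode."""
    kind = "file" if fmt in FILE_FORMATS else fmt
    if kind not in SINK_OUTPUT_MODES:
        raise KeyError(f"unknown sink format: {fmt}")
    return mode.lower() in SINK_OUTPUT_MODES[kind]


csv_complete = sink_supports("csv", "complete")    # the combination that failed above
csv_append = sink_supports("csv", "append")
console_update = sink_supports("console", "update")
```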

## append

In [None]:
_csvdf = csvdf.withWatermark("time", "60 second")
_csvdf = _csvdf.groupby('time').agg(
    F.count("time").alias("count"),
    F.last("adid").alias("last_adid"),
    F.max("adid").alias("max_adid")
)
# query = _csvdf.writeStream.outputMode('complete').format('console').start()
query = _csvdf.writeStream \
        .outputMode('append') \
        .format('csv') \
        .option('path', 'file:///home/jupyter/notebook/output/path') \
        .option('checkpointLocation', 'file:///home/jupyter/notebook/output/checkpointLocation') \
        .start()
query.awaitTermination(timeout=10)

query.status
# query.lastProgress
query.stop()
query.status

## update

In [None]:
from pyspark.sql import functions as F

_csvdf = csvdf.groupby('time').agg(
    F.count("time").alias("count"),
    F.last("adid").alias("last_adid"),
    F.max("adid").alias("max_adid")
)
query = _csvdf.writeStream.outputMode('update').format('console').start()

query.awaitTermination(timeout=10)

query.status
# query.lastProgress
query.stop()
query.status

In [None]:
spark.stop()
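The three output modes tried above differ in which rows each micro-batch emits. The pure-Python model below is a deliberately simplified sketch of that difference for a count grouped by event time. It is not Spark's implementation: the function `simulate`, the integer event times, and the batches are all made up for illustration, and it ignores details such as the extra no-data batch Spark may run once the watermark advances.

```python
from collections import Counter


def simulate(batches, mode, delay=None):
    """Return the rows emitted per micro-batch for a count-by-event-time query.

    batches: list of lists of integer event times.
    mode:    "complete", "update", or "append".
    delay:   watermark delay; required for "append".
    """
    if mode not in ("complete", "update", "append"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "append" and delay is None:
        raise ValueError("append mode on an aggregation needs a watermark")

    counts = Counter()   # running state: event time -> count
    max_seen = None      # largest event time seen in earlier batches
    watermark = None
    emitted = []

    for batch in batches:
        # The watermark for this batch is derived from data seen in earlier batches.
        if delay is not None and max_seen is not None:
            watermark = max_seen - delay

        changed = set()
        for t in batch:
            if watermark is not None and t <= watermark:
                continue  # too late: dropped
            counts[t] += 1
            changed.add(t)

        if mode == "complete":
            out = sorted(counts.items())                    # whole result table
        elif mode == "update":
            out = sorted((t, counts[t]) for t in changed)   # only rows that changed
        else:
            # append: emit each group once, after the watermark has passed it
            done = sorted(t for t in counts if watermark is not None and t <= watermark)
            out = [(t, counts.pop(t)) for t in done]
        emitted.append(out)

        if batch:
            max_seen = max(batch) if max_seen is None else max(max_seen, max(batch))
    return emitted


batches = [[1, 1, 2], [2, 12], [13]]
complete_rows = simulate(batches, "complete")
update_rows = simulate(batches, "update")
append_rows = simulate(batches, "append", delay=5)
```

In this model, complete re-emits every group on every batch, update emits only the groups touched by the batch, and append stays silent until the watermark moves past a group, which is why the append query above can appear to write nothing for a while.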

# window 