# [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example)

Some of the main data source formats supported by Apache Spark Structured Streaming, with usage examples:

1. **Text files**:
   ```python
   text_df = spark.readStream.format("text").load("/path/to/directory")
   ```

2. **CSV files**:
   ```python
   # `schema` is a StructType defined beforehand; streaming file sources other than text require one by default
   csv_df = spark.readStream.format("csv").option("header", "true").schema(schema).load("/path/to/directory")
   ```

3. **JSON files**:
   ```python
   json_df = spark.readStream.format("json").schema(schema).load("/path/to/directory")
   ```

4. **ORC files**:
   ```python
   orc_df = spark.readStream.format("orc").schema(schema).load("/path/to/directory")
   ```

5. **Parquet files**:
   ```python
   parquet_df = spark.readStream.format("parquet").schema(schema).load("/path/to/directory")
   ```

6. **Reading from Kafka**:
   ```python
   kafka_df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host:port").option("subscribe", "topic").load()
   ```

7. **Socket (for testing)**:
   ```python
   socket_df = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
   ```

8. **Rate (for testing)**:
   ```python
   rate_df = spark.readStream.format("rate").load()
   ```

In a distributed environment, **make sure the specified path or resource is accessible from every Spark node**.

# csv

## csv1

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create the SparkSession
spark = SparkSession.builder.appName("StructuredStreamingCSV1").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [2]:
file = 'file:///home/jupyter/data/test'

# Define the schema of the CSV files
schema = StructType([
    StructField('time', StringType()),
    StructField('app_id', StringType()),
    StructField('store', StringType()),
    StructField('adid', StringType()),
    StructField('openid', StringType()),
    StructField('activity_kind', StringType()),
    StructField('created_at', StringType()),
    StructField('installed_at', StringType()),
    StructField('reattributed_at', StringType()),
    StructField('network_name', StringType()),
    StructField('country', StringType()),
    StructField('device_name', StringType()),
    StructField('device_type', StringType()),
    StructField('os_name', StringType()),
    StructField('timezone', StringType()),
    StructField('event_name', StringType()),
    StructField('revenue_float', StringType()),
    StructField('revenue', StringType()),
    StructField('currency', StringType()),
    StructField('revenue_usd', StringType()),
    StructField('reporting_revenue', StringType())
])


# Read the CSV files as a stream
csvDF = spark.readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv(file)

In [16]:
# Define the processing logic here,
# e.g. a simple transformation or aggregation

# Define the output sink, e.g. the console
query = csvDF.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

# Wait up to 3 seconds for the stream to terminate
query.awaitTermination(timeout=3)
query.status
query.stop()
query.lastProgress
query.status

spark.stop()

23/11/29 02:27:02 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-c1eef534-f62d-467d-beb5-ebef7a2e9062. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/11/29 02:27:02 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+----------+------+--------------------+------+-------------+----------+------------+---------------+------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|               time|    app_id| store|                adid|openid|activity_kind|created_at|installed_at|reattributed_at|network_name|country|device_name|device_type|os_name|timezone|event_name|revenue_float|revenue|currency|revenue_usd|reporting_revenue|
+-------------------+----------+------+--------------------+------+-------------+----------+------------+---------------+------------+-------+-----------+-----------+-------+--------+----------+-------------+-------+--------+-----------+-----------------+
|2023-10-01 00:00:00|1456241577|itunes|041bf78c9dc6dd5f5...|  NULL|      session|      NULL|  1636532102|           NULL|     RWD-ady| 

False

{'message': 'Waiting for data to arrive',
 'isDataAvailable': False,
 'isTriggerActive': False}

{'id': '4476cf60-6d08-4e51-9fc2-b1d2d0a72387',
 'runId': '0c6a985e-4eed-4305-a23e-d6ee23fe50ff',
 'name': None,
 'timestamp': '2023-11-29T02:27:02.809Z',
 'batchId': 0,
 'numInputRows': 1,
 'inputRowsPerSecond': 0.0,
 'processedRowsPerSecond': 4.291845493562231,
 'durationMs': {'addBatch': 113,
  'commitOffsets': 31,
  'getBatch': 15,
  'latestOffset': 40,
  'queryPlanning': 6,
  'triggerExecution': 233,
  'walCommit': 26},
 'stateOperators': [],
 'sources': [{'description': 'FileStreamSource[file:/home/jupyter/data/test]',
   'startOffset': None,
   'endOffset': {'logOffset': 0},
   'latestOffset': None,
   'numInputRows': 1,
   'inputRowsPerSecond': 0.0,
   'processedRowsPerSecond': 4.291845493562231}],
 'sink': {'description': 'org.apache.spark.sql.execution.streaming.ConsoleTable$@3e050195',
  'numOutputRows': 1}}

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}
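
The csv1 cell leaves the processing step as a comment. One possible shape for it is a streaming aggregation, sketched below under several assumptions: a temporary directory with a made-up two-column CSV stands in for the real data, and the `memory` sink together with `processAllAvailable()` is used only so that the result can be inspected from a notebook or a test.

```python
import os
import tempfile

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("StreamingAggregationSketch")
    .getOrCreate()
)

# Hypothetical sample data: two columns only, written to a temp directory.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "part-0.csv"), "w") as f:
    f.write("session,US\ninstall,US\nsession,DE\n")

schema = StructType([
    StructField("activity_kind", StringType()),
    StructField("country", StringType()),
])

events = spark.readStream.schema(schema).csv(data_dir)

# Processing logic: count events per country.
counts = events.groupBy("country").count()

# A streaming aggregation without a watermark cannot use append mode,
# so complete mode is used here.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("country_counts")
    .start()
)

# Block until the files present so far are processed (intended for testing).
query.processAllAvailable()

rows = spark.sql("SELECT country, count FROM country_counts").collect()
result = {row["country"]: row["count"] for row in rows}

query.stop()
spark.stop()
```

With the console sink used in the csv1 cell, the same `counts` DataFrame would print the full result table on every trigger instead of storing it in an in-memory table.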

## csv2 

In [17]:
# Create the SparkSession
spark = SparkSession.builder.appName("StructuredStreamingCSV2").getOrCreate()

In [21]:
file = 'file:///home/jupyter/data/test'
# Read the CSV files without a schema (this raises the exception below)
csvdf = spark.readStream.format("csv").option("header", "false").load(file)

IllegalArgumentException: Schema must be specified when creating a streaming source DataFrame. If some files already exist in the directory, then depending on the file format you may be able to create a static DataFrame on that directory with 'spark.read.load(directory)' and infer schema from it.
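
The exception message itself points at one way out: infer the schema once with a static read and hand it to the streaming reader. Below is a minimal sketch of that approach, assuming some CSV files already exist in the directory; the temporary directory and the sample rows are made up so that the snippet is self-contained.

```python
import os
import tempfile

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("SchemaFromStaticRead")
    .getOrCreate()
)

# Hypothetical sample file standing in for the real data directory.
sample_dir = tempfile.mkdtemp()
with open(os.path.join(sample_dir, "part-0.csv"), "w") as f:
    f.write("2023-10-01 00:00:00,itunes\n2023-10-01 00:00:01,google\n")

# Static (batch) read: schema inference is allowed here.
static_df = spark.read.format("csv").option("header", "false").load(sample_dir)

# Reuse the inferred schema for the streaming source.
stream_df = (
    spark.readStream
    .format("csv")
    .option("header", "false")
    .schema(static_df.schema)
    .load(sample_dir)
)

stream_columns = stream_df.columns  # header-less CSV columns are named _c0, _c1, ...
is_streaming = stream_df.isStreaming

spark.stop()
```

Alternatively, setting `spark.sql.streaming.schemaInference` to `true` lets the file source infer the schema itself; an explicit schema, as in csv1, keeps the column names and types under the author's control.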