Spark processes streaming data in micro-batches, and a trigger defines how often a new micro-batch starts. For example, with a trigger of 1 second, Spark creates a micro-batch every second and processes each one in turn.
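
In PySpark, a processing-time trigger is set on the stream writer with trigger(processingTime=...). The following is a minimal sketch, not a complete application: start_console_query is a hypothetical helper name, and stream_df is assumed to be an existing streaming DataFrame.

```python
def start_console_query(stream_df, interval="1 second"):
    """Start a console query that fires one micro-batch per `interval`.

    `stream_df` is assumed to be a streaming DataFrame; the helper name
    and its default interval are illustrative only.
    """
    return (
        stream_df.writeStream
        .trigger(processingTime=interval)  # one micro-batch per interval
        .outputMode("append")
        .format("console")
        .start()
    )
```

If no trigger is set, Spark starts the next micro-batch as soon as the previous one has finished.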


### Output modes

After processing the streaming data, Spark writes the result to a sink, such as persistent storage. The output mode controls which rows of the result are written on each trigger.

**Append Mode**: In this mode, Spark will output only newly processed rows since the last trigger.

**Update Mode**: In this mode, Spark will output only the rows that were updated since the last trigger. If the query does not aggregate the streaming data (meaning previous records can’t be updated), it behaves like append mode.

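**Complete Mode**: In this mode, Spark will output the entire result table, that is, all the rows it has produced so far, on every trigger. This mode is supported only for queries that use aggregation.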
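
The difference between the three modes is easiest to see on a small example. The sketch below is a plain-Python model, not Spark code: simulate is a hypothetical function that mimics what a sink would receive for a running word count (update and complete) or for a query without aggregation (append).

```python
from collections import Counter


def simulate(batches, mode):
    """Return, per micro-batch, the rows a sink would receive in `mode`."""
    counts = Counter()  # running result table: word -> count
    emitted = []
    for batch in batches:
        counts.update(batch)
        if mode == "append":
            # No aggregation: every input row is final, so only new rows are written.
            emitted.append(list(batch))
        elif mode == "update":
            # Only the aggregate rows touched by this batch are written.
            emitted.append(sorted((w, counts[w]) for w in set(batch)))
        elif mode == "complete":
            # The whole result table is written again on every trigger.
            emitted.append(sorted(counts.items()))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return emitted


batches = [["a", "b"], ["a"], ["c"]]
for mode in ("append", "update", "complete"):
    print(mode, simulate(batches, mode))
```

After the second batch, update mode emits only the changed row for a, while complete mode emits the rows for both a and b.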

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

In [2]:
spark = SparkSession\
        .builder\
        .master("local")\
        .appName("RateSource")\
        .getOrCreate()

In [3]:
spark.sparkContext.setLogLevel("ERROR")

### Create streaming DataFrame
Let’s create our first streaming DataFrame using the rate source. We set the format to rate and rowsPerSecond = 1, so the source generates one row per second, and load the data into the initDF streaming DataFrame. Each generated row has a timestamp column and a value column. We also check whether initDF is a streaming DataFrame.

In [4]:
initDF = spark\
        .readStream\
        .format("rate")\
        .option("rowsPerSecond", 1)\
        .load()

In [5]:
print(f"Streaming DataFrame : {initDF.isStreaming}")

Streaming DataFrame : True
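
The rowsPerSecond option sets the generation rate; how many rows end up in one micro-batch also depends on how often a micro-batch starts. The following plain-Python model (not Spark code) illustrates this under idealized timing. rate_batches and trigger_seconds are hypothetical names, and only the value column is modeled.

```python
def rate_batches(rows_per_second, trigger_seconds, num_batches):
    """Model the `value` column of a rate source, grouped by micro-batch."""
    batches = []
    next_value = 0  # the rate source numbers its rows 0, 1, 2, ...
    rows_in_batch = rows_per_second * trigger_seconds
    for _ in range(num_batches):
        batches.append(list(range(next_value, next_value + rows_in_batch)))
        next_value += rows_in_batch
    return batches


# One row per second, one micro-batch per second: one row per batch.
print(rate_batches(rows_per_second=1, trigger_seconds=1, num_batches=3))
# Same rate, one micro-batch every 5 seconds: five rows per batch.
print(rate_batches(rows_per_second=1, trigger_seconds=5, num_batches=2))
```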


### Basic transformation
Apply a basic transformation to initDF that adds a new column, result, equal to the value column plus 1:

In [6]:
resultDF = initDF.withColumn("result", f.col("value") + f.lit(1))

In [7]:
resultDF.writeStream\
        .outputMode("append")\
        .option("truncate", False)\
        .format("console")\
        .start()\
        .awaitTermination()

Because awaitTermination() blocks until the query stops, this cell keeps running until the kernel is interrupted, which ends it with a KeyboardInterrupt:

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/opt/conda/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt
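
In a notebook it can be more convenient to let the query run for a fixed time and then stop it, rather than interrupting the cell. The following is a minimal sketch, assuming query is the StreamingQuery object returned by start(). run_for is a hypothetical helper name.

```python
def run_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then stop it.

    Returns True if the query had already terminated on its own.
    """
    try:
        # With a timeout, awaitTermination returns instead of blocking forever.
        terminated = query.awaitTermination(seconds)
    finally:
        # Stop the query even if the wait is interrupted.
        query.stop()
    return terminated
```

For the query above, this would mean keeping the result of start() in a variable and calling run_for on it with, say, 10 seconds, instead of chaining awaitTermination() directly.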