## Batch computation

Before we begin with Streaming, let's go back to Spark batch computation. The most basic steps for batch procesing are:
1. Read data from source file
2. Do transformation
3. Write transformed DataFrame to a file

In [1]:
# Import libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, round
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType

In [2]:
# Initialize a Spark session
spark = SparkSession.builder.appName("Batch_Streaming_Comparison").getOrCreate()

your 131072x1 screen size is bogus. expect trouble
24/12/04 15:34:09 WARN Utils: Your hostname, DELEQ0283302041 resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
24/12/04 15:34:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/04 15:34:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Defining schema for `data/batch_resource/real_estate.csv`
real_estate_schema = StructType(
    [StructField('UID', IntegerType()), 
    StructField('Location', StringType(), True), 
    StructField('Price', DecimalType(11,2), True), 
    StructField('Bedrooms', IntegerType(), True), 
    StructField('Bathrooms', IntegerType(), True), 
    StructField('Size', IntegerType(), True), 
    StructField('Price SQ Ft', DecimalType(7,2), True), 
    StructField('Status', StringType(), True)])

### Exercise

As a warmup exercise:
1. Read csv files from `../data/batch_resource`
2. Use earlier defined `real_estate_schema` StructType object
3. Group data by `Location`
4. Get average `Price` per `Location`. Round it to 2 decimal places
5. Sort data by average price in descending order
6. Print output to console

In [4]:
real_estate_batch = (
    spark
    .read
    .schema(real_estate_schema)
    .csv("../data/batch_resource", header = True)
    .groupBy("Location")
    .agg(round(avg("Price"), 2).alias('AveragePrice'))
    .orderBy('AveragePrice', desc=True)).show()

+-------------------+------------+
|           Location|AveragePrice|
+-------------------+------------+
|         New Cuyama|    40900.00|
|    Santa Margarita|    59900.00|
|        Bakersfield|    91500.00|
|          Guadalupe|   117250.00|
|          King City|   131190.00|
|             Lompoc|   149900.00|
|        Out Of Area|   173900.00|
|            Soledad|   184053.33|
|         Greenfield|   184800.00|
|           Coalinga|   202071.43|
| Santa Maria-Orcutt|   231106.18|
|             Lompoc|   241260.77|
|         San Simeon|   274900.00|
|         San Miguel|   283642.86|
|            Creston|   309900.00|
|            Solvang|   325000.00|
| Santa Maria-Orcutt|   332546.08|
|        Paso Robles|   334280.22|
|           Los Osos|   359704.76|
|       Grover Beach|   365615.00|
+-------------------+------------+
only showing top 20 rows



Compare it to how the same process looks like in Spark Structured Streaming:

In [5]:
real_estate_stream = (spark
    .readStream
    .schema(real_estate_schema)
    .csv("../data/batch_resource", header=True)
    .groupBy("Location")
    .agg(round(avg("Price"), 2).alias('AveragePrice'))
    .orderBy('AveragePrice')
    .writeStream
    .outputMode("complete")
    .format("console")
    .start())


24/04/26 11:57:33 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-33e3b714-b66c-4919-9239-090383f3e5cc. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/04/26 11:57:33 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+------------+
|           Location|AveragePrice|
+-------------------+------------+
|         New Cuyama|    40900.00|
|    Santa Margarita|    59900.00|
|        Bakersfield|    91500.00|
|          Guadalupe|   117250.00|
|          King City|   131190.00|
|             Lompoc|   149900.00|
|        Out Of Area|   173900.00|
|            Soledad|   184053.33|
|         Greenfield|   184800.00|
|           Coalinga|   202071.43|
| Santa Maria-Orcutt|   231106.18|
|             Lompoc|   241260.77|
|         San Simeon|   274900.00|
|         San Miguel|   283642.86|
|            Creston|   309900.00|
|            Solvang|   325000.00|
| Santa Maria-Orcutt|   332546.08|
|        Paso Robles|   334280.22|
|           Los Osos|   359704.76|
|       Grover Beach|   365615.00|
+-------------------+------------+
only showing top 20 rows



In [25]:
real_estate_stream.stop()

The similarity between batch and streaming processing is very noticible. `readStream` and `writeStream` are counterparts to `read` and `write` in batch processing methods.