## Batch computation

Let's go back to Spark batch computation. The usual steps for batch processing are:
1. Read data from a source file
2. Apply transformations
3. Write the transformed DataFrame to a file

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, round
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType

In [None]:
spark = SparkSession.builder.appName("Batch_Streaming_Comparison").getOrCreate()
spark.sparkContext.setLogLevel('ERROR')

In [2]:
real_estate_schema = StructType([
    StructField('UID', IntegerType(), True),
    StructField('Location', StringType(), True),
    StructField('Price', DecimalType(15, 2), True),
    StructField('Bedrooms', IntegerType(), True),
    StructField('Bathrooms', IntegerType(), True),
    StructField('Size', IntegerType(), True),
    StructField('Price SQ Ft', DecimalType(10, 2), True),
    StructField('Status', StringType(), True),
])


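# Read the CSV files with an explicit schema, then aggregate per location.
real_estate_batch = (
    spark
    .read
    .schema(real_estate_schema)
    .csv("../data/batch_resource", header=True)
    .groupBy("Location")
    .agg(round(avg("Price"), 2).alias('AveragePrice'))
    .orderBy('AveragePrice')
)

# show() returns None, so call it separately to keep the DataFrame in real_estate_batch.
real_estate_batch.show()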



Compare it to how we process data in Spark Structured Streaming:

In [None]:
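real_estate_stream = (
    spark
    .readStream                      # streaming counterpart of .read
    .schema(real_estate_schema)      # file-based streaming sources need an explicit schema by default
    .csv("../data/batch_resource", header=True)
    .groupBy("Location")
    .agg(round(avg("Price"), 2).alias('AveragePrice'))
    .orderBy('AveragePrice')         # sorting a stream is only supported after aggregation in complete mode
    .writeStream
    .outputMode("complete")          # re-emit the whole result table on every trigger
    .format("console")
    .start()
)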


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+-------------+
|           Location|Average Price|
+-------------------+-------------+
|         New Cuyama|     40900.00|
|    Santa Margarita|     59900.00|
|        Bakersfield|     91500.00|
|          Guadalupe|    117250.00|
|          King City|    131190.00|
|             Lompoc|    149900.00|
|        Out Of Area|    173900.00|
|            Soledad|    184053.33|
|         Greenfield|    184800.00|
|           Coalinga|    202071.43|
| Santa Maria-Orcutt|    231106.18|
|             Lompoc|    241260.77|
|         San Simeon|    274900.00|
|         San Miguel|    283642.86|
|            Creston|    309900.00|
|            Solvang|    325000.00|
| Santa Maria-Orcutt|    332546.08|
|        Paso Robles|    334280.22|
|           Los Osos|    359704.76|
|       Grover Beach|    365615.00|
+-------------------+-------------+
only showing top 20 rows



In [None]:
real_estate_stream.stop()
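
To make the difference from the batch run concrete, here is a minimal plain-Python sketch of what `outputMode("complete")` means for this aggregation: state is kept per `Location` across micro-batches, and the whole sorted result table is re-emitted after each one. It illustrates the semantics only and is not how Spark implements them. The function name `run_complete_mode` and the sample rows are hypothetical and are not taken from the real data set.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP


def run_complete_mode(micro_batches):
    """Yield the full, sorted result table after each micro-batch."""
    # Running state per location: [sum of prices, number of rows].
    totals = defaultdict(lambda: [Decimal("0"), 0])
    for batch in micro_batches:
        for location, price in batch:
            state = totals[location]
            state[0] += Decimal(price)
            state[1] += 1
        # "Complete" mode: rebuild the whole table, not only the changed rows.
        table = [
            (loc, (total / count).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
            for loc, (total, count) in totals.items()
        ]
        yield sorted(table, key=lambda row: row[1])


# Hypothetical sample rows (location, price); assumed for illustration only.
batches = [
    [("A", "100000"), ("B", "300000")],
    [("A", "200000")],
]
results = list(run_complete_mode(batches))
for batch_id, table in enumerate(results):
    print(f"Batch: {batch_id}", table)
```

Feeding every row in as a single micro-batch yields exactly one table, which corresponds to what the batch query computes in one pass.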