# 4. End-to-End Streaming Example

**Official Spark Structured Streaming documentation**: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

**Example**: https://blog.knoldus.com/basic-example-spark-structured-streaming-kafka-integration/

## 4.1 Initial instructions and setup

2. Create a directory named `checkpoint` inside the `data` subdirectory.

3. Make sure you have sufficient permissions to manipulate files inside that directory (you should already have them if you ran the previous examples). If necessary, run `sudo chmod -R 777 data`.

**Input: Kafka queue**

4. Start the Kafka broker, either installed locally, in a local VM, or in a local container (e.g. Docker).

5. Edit the Python script `4-kafka_producer.py` so that it sends the data to the Kafka broker (set the correct IP address and port).

6. If needed, activate the Anaconda Python environment (**important: use Python v3.6+**). Run the Kafka producer with `python p_kafka_producer.py 0.6 1.3 test data/occupancy_data.csv`.

7. From that point on you are ready to run the Spark Streaming *jobs* in this notebook. Let's start the analysis!
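Steps 2 and 3 can also be done from Python instead of the shell. The helper below is a minimal sketch: the function name `ensure_checkpoint_dir` and its default arguments are illustrative assumptions, not part of the original notebook.

```python
import os


def ensure_checkpoint_dir(base_dir="data", name="checkpoint"):
    """Create <base_dir>/<name> if it is missing and check it is writable.

    Hypothetical helper: the defaults mirror the `data/checkpoint`
    layout described in the steps above.
    """
    path = os.path.join(base_dir, name)
    # exist_ok=True makes the call safe to repeat on every notebook run
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError("No write permission on {0}".format(path))
    return path
```

Calling `ensure_checkpoint_dir()` from the notebook's working directory would create `data/checkpoint` and raise an error early if the permissions from step 3 are missing.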

**WebUI**: While the Spark Streaming context is active, the *job* monitoring interface is available at http://localhost:4040.

## 4.2 Imports and context creation

### Creating the SparkContext (first time only)

In [1]:
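import os
from datetime import datetime
from operator import add, sub

from pyspark import SparkContext
from pyspark.sql import SparkSession
# Explicit imports instead of `import *`, which would shadow Python
# built-ins such as sum, min, max and round.
from pyspark.sql.functions import (
    avg, col, explode, first, regexp_replace, split, to_timestamp, window
)
from pyspark.sql.types import (
    DoubleType, IntegerType, StringType, StructField, StructType
)
# The DStream imports (StreamingContext, KafkaUtils) are omitted:
# this notebook only uses the Structured Streaming API.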

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

In [3]:
# Load external packages programmatically
import os

# MANDATORY: provide the Maven coordinates of the Kafka connector.
# This notebook uses Structured Streaming, so it needs the
# spark-sql-kafka-0-10 artifact. The Scala (2.11) and Spark (2.4.5)
# versions in the coordinates must match your Spark installation.
# The commented-out line is the DStream connector; for DStreams only
# the deprecated 0.8.2 version has Python support.
#packages = "org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.5"
packages = "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5"
# Must be set before the SparkSession (and its JVM) is created
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)

# Point Spark at a Java 8 JVM. Comment out the line below if JAVA_HOME
# is already set or you only have a single JVM version on your system
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# OPTIONAL: check the environment variables
print("PYSPARK_SUBMIT_ARGS = ", os.environ["PYSPARK_SUBMIT_ARGS"], "\n")
print("JAVA_HOME = ", os.environ["JAVA_HOME"])

PYSPARK_SUBMIT_ARGS =  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 pyspark-shell 

JAVA_HOME =  /usr/lib/jvm/java-8-openjdk-amd64
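
The package coordinates above encode both the Scala and the Spark version, so they have to change when the installation changes. The sketch below shows one way to build them from the two version strings; the helper names `kafka_sql_package` and `submit_args` are illustrative assumptions, not part of the notebook or of PySpark.

```python
def kafka_sql_package(scala_version, spark_version):
    """Maven coordinates of the Structured Streaming Kafka connector."""
    return "org.apache.spark:spark-sql-kafka-0-10_{0}:{1}".format(
        scala_version, spark_version
    )


def submit_args(packages):
    """Value for PYSPARK_SUBMIT_ARGS; --packages takes a comma-separated list."""
    return "--packages {0} pyspark-shell".format(",".join(packages))


# Same values as in the cell above (Scala 2.11, Spark 2.4.5)
args = submit_args([kafka_sql_package("2.11", "2.4.5")])
print(args)
```

With the versions used in this notebook, this reproduces the `PYSPARK_SUBMIT_ARGS` value printed above.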


In [4]:
#sc = SparkContext(appName="KafkaStreamingEndtoEnd")

In [5]:
spark = SparkSession \
    .builder \
    .appName("prueba") \
    .getOrCreate()

In [6]:
spark

### Creating the streaming context (on every exercise run)

## 4.3 Helper methods

### 4.3.1 Weather data parsing method

This method helps us parse each line of weather data that arrives through the Kafka queue. We use it to access the fields of each event (order) in the input data *stream*.
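
The notebook does not include the body of this method, so the following is only a plain-Python sketch of what such a parser could look like. The field names come from the schema used later in this notebook; the function name `parse_line` and the sample line are illustrative assumptions.

```python
import csv
from datetime import datetime

# Field order assumed from the schema defined in section 4.4
FIELDS = ("row", "date", "Temperature", "Humidity",
          "Light", "CO2", "HumidityRatio", "Occupancy")
NUMERIC = ("Temperature", "Humidity", "Light", "CO2", "HumidityRatio")


def parse_line(line):
    """Parse one CSV line into a dict (hypothetical helper)."""
    # csv.reader strips the double quotes around quoted fields
    values = next(csv.reader([line]))
    if len(values) != len(FIELDS):
        raise ValueError("Expected {0} fields, got {1}".format(
            len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    record["date"] = datetime.strptime(record["date"], "%Y-%m-%d %H:%M:%S")
    for name in NUMERIC:
        record[name] = float(record[name])
    return record


# Illustrative sample line, not taken from the notebook's output
sample = '"1","2015-02-04 17:51:00",23.18,27.272,426,721.25,0.0047,1'
record = parse_line(sample)
```

`Occupancy` and `row` are left as strings here, matching the casts applied in the streaming query below.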

### 4.3.2 Writing methods - Sending data to Kafka

This method contains a *singleton producer* (to avoid having more than one producer sending data to the Kafka broker) and a method to serialize the results in CSV format.
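
Since the notebook omits the code, here is a minimal sketch of the two pieces described above, written without any Kafka dependency. `get_producer`, `to_csv_line` and `FakeProducer` are hypothetical names; in a real run the factory would build an actual Kafka client, and the right constructor depends on the client library in use.

```python
import csv
import io

_producer = None  # module-level singleton, one per Python process


def get_producer(factory):
    """Return the shared producer, creating it on first use only."""
    global _producer
    if _producer is None:
        _producer = factory()
    return _producer


def to_csv_line(values):
    """Serialize a sequence of values as one CSV line, without a newline."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()


class FakeProducer:
    """Stand-in for a Kafka producer, used only for this illustration."""

    def __init__(self):
        self.sent = []

    def send(self, topic, payload):
        self.sent.append((topic, payload))


producer = get_producer(FakeProducer)
line = to_csv_line(["2015-02-04 17:51:00", 721.25])
producer.send("results", line.encode("utf-8"))
```

This sketch is not thread-safe, and because the singleton lives in a module-level variable, each Python worker process would hold its own instance.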

## 4.4 Data source - Reading

In [7]:
from pyspark.sql.types import *

Schema = StructType([
 StructField("id", IntegerType()),
 StructField("date", StringType()),
 StructField("Temperature", DoubleType()),
 StructField("Humidity", DoubleType()),
 StructField("Light", DoubleType()),
 StructField("CO2", DoubleType()),
 StructField("HumidityRatio", DoubleType()),
 StructField("Occupancy", DoubleType())])

In [8]:
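# The "sep" option is dropped: it belongs to the CSV reader and has
# no effect on the Kafka source.
df = spark \
    .readStream \
    .format("kafka")\
    .option("kafka.bootstrap.servers", 'localhost:9092')\
    .option('subscribe', 'test')\
    .load()
# No .schema(Schema) here: the Kafka source always returns its own fixed
# schema, and the CSV payload arrives inside the binary `value` column.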

In [9]:
df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [23]:
# Kafka delivers the payload as binary, so cast it to a string first
df = df.selectExpr('CAST(value AS STRING)')

# Split each CSV line once, then cast and name every field
fields = split(df.value, ',')
df_data = df.select(
        fields[0].cast(StringType()).alias("row"),
        fields[1].cast(StringType()).alias("date"),
        fields[2].cast(DoubleType()).alias("Temperature"),
        fields[3].cast(DoubleType()).alias("Humidity"),
        fields[4].cast(DoubleType()).alias("Light"),
        fields[5].cast(DoubleType()).alias("CO2"),
        fields[6].cast(DoubleType()).alias("HumidityRatio"),
        fields[7].cast(StringType()).alias("Occupancy"))

### Data input from Kafka

## Exercise 1: Compute the average Temperature, relative humidity and CO2 concentration for each micro-batch, and the average of those values since startup

In [11]:
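# from pyspark.sql.functions import col, avg

# 1.1 (first attempt). This cell fails with an AnalysisException: the
# arguments of window() are (timeColumn, windowDuration, slideDuration),
# and here the slide ("5 minutes") is longer than the window ("3 seconds").
# Append mode would also be rejected for an aggregation without a watermark.
result_1_1 = df_data.groupBy(window(df_data.date, "3 seconds", "5 minutes"))\
                    .agg(avg(col("CO2").alias('mean_CO2')))\
                    .writeStream\
                    .format('console')\
                    .trigger(processingTime= '5 seconds')\
                    .outputMode("append")\
                    .start()

#.withWatermark("date", "3 seconds")\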

AnalysisException: cannot resolve 'timewindow(date, 3000000, 300000000, 0)' due to data type mismatch: The slide duration (300000000) must be less than or equal to the windowDuration (3000000).;;
'Aggregate [timewindow(date#38, 3000000, 300000000, 0)], [timewindow(date#38, 3000000, 300000000, 0) AS window#47, avg(CO2#35) AS avg(CO2 AS `mean_CO2`)#58]
+- Project [cast(split(value#21, ,)[0] as string) AS row#31, cast(split(value#21, ,)[1] as timestamp) AS date#38, cast(split(value#21, ,)[2] as double) AS Temperature#32, cast(split(value#21, ,)[3] as double) AS Humidity#33, cast(split(value#21, ,)[4] as int) AS Light#34, cast(split(value#21, ,)[5] as double) AS CO2#35, cast(split(value#21, ,)[6] as double) AS HumidityRatio#36, cast(split(value#21, ,)[7] as string) AS Occupancy#37]
   +- Project [cast(value#8 as string) AS value#21]
      +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider@ec9f01b, kafka, Map(sep -> ,, subscribe -> test, kafka.bootstrap.servers -> localhost:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@16b1f7be,kafka,List(),None,List(),None,Map(sep -> ,, subscribe -> test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]

In [None]:
from pyspark.sql.functions import col, avg 

In [None]:
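# 1.1 (second attempt). Without a window, agg() keeps a single global
# aggregate over all rows received so far, so this prints the running
# average of CO2 rather than one average per micro-batch.
# The alias is applied to the avg() result, not to the input column.
result_1_1 = df_data.agg(avg(col("CO2")).alias('mean_CO2'))\
                    .writeStream\
                    .format('console')\
                    .trigger(processingTime= '5 seconds')\
                    .outputMode("update")\
                    .start()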

In [12]:
result_1_1.stop()

NameError: name 'result_1_1' is not defined

In [15]:
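# 1.2: running average of CO2 since the query started. In complete mode
# the whole result is re-emitted on every trigger. No watermark is used:
# withWatermark() requires a timestamp column (date is still a string
# here) and a global aggregate in complete mode does not need one.
result_1_2 = df_data.agg(avg(col("CO2")).alias('mean_CO2'))\
                 .writeStream\
                 .format('console')\
                 .trigger(processingTime= '5 seconds')\
                 .outputMode("complete")\
                 .start()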

In [24]:
result_1_2.stop()

NameError: name 'result_1_2' is not defined
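
The exercise asks for two different quantities: an average per micro-batch and an average since startup. The plain-Python sketch below (no Spark involved) illustrates the difference on hand-made batches; the function name and the sample values are assumptions made for illustration. In Spark 2.4+, one possible way to obtain the per-batch DataFrame is `writeStream.foreachBatch`, which is not shown here.

```python
def batch_and_running_means(batches):
    """Return (batch_mean, running_mean) for every non-empty batch."""
    total, count = 0.0, 0
    results = []
    for batch in batches:
        if not batch:
            continue  # an empty micro-batch has no average
        batch_mean = sum(batch) / len(batch)
        # Keep a running sum and count instead of storing all values
        total += sum(batch)
        count += len(batch)
        results.append((batch_mean, total / count))
    return results


# Two illustrative micro-batches of CO2 readings
means = batch_and_running_means([[400.0, 600.0], [800.0]])
print(means)
```

The second batch has a mean of 800.0 on its own, while the running mean over all three readings is 600.0.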

## Exercise 2: Compute the average light level in the room over sliding windows of 45 seconds, with a slide of 15 seconds between consecutive windows.

In [26]:
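# Strip the double quotes that surround the date in the CSV payload
df_data = df_data.withColumn("date", regexp_replace(col("date"), '"', ''))
# Parse the cleaned string into a proper timestamp column
df_data = df_data.withColumn('date2',to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))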

In [27]:
# Group by the parsed timestamp (date2): 45-second windows that slide
# every 15 seconds. Inside parentheses no backslashes are needed.
result_2 = (df_data.groupBy(window(col("date2"), "45 seconds", "15 seconds"))
                   .agg(avg('Light').alias('Light_avg'))
                   .writeStream
                   .format('console')
                   .trigger(processingTime='5 seconds')
                   .outputMode("complete")
                   .start())

In [28]:
result_2.stop()
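
With a 45-second window and a 15-second slide, every event falls into 45 / 15 = 3 overlapping windows. The plain-Python sketch below computes those windows for one event time, assuming window starts are aligned to multiples of the slide counted from the Unix epoch with no extra offset. The function name `windows_for` is a hypothetical helper, and times are plain integer seconds to keep the example free of time zone handling.

```python
def windows_for(event_time, size=45, slide=15):
    """List the [start, end) windows, in seconds, that contain event_time."""
    # Latest window start that is not after the event
    start = event_time - (event_time % slide)
    windows = []
    # Walk backwards while the window still covers the event
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


# An event at second 100 belongs to three overlapping windows
print(windows_for(100))
```

Each of the three windows contributes its own row to the aggregation, which is why a single reading affects three averages.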

## Exercise 3: Examining the data, we can see that the interval between the original samples is often not exactly 1 minute. Compute the number of pairs of consecutive samples in each micro-batch whose separation is not exactly 1 minute.

## Start Streaming context

In [15]:
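# Exploratory query: one output row per `row` id, keeping the first value
# seen for each column. It does not count the irregular intervals yet.
# Named result_3 so it does not overwrite the handle from Exercise 2.
result_3 = (df_data.groupBy('row')
                   .agg(first('date'),
                       first('Temperature'),
                       first('Humidity'),
                       first('Light'),
                       first('CO2'),
                       first('HumidityRatio'),
                       first('Occupancy'),
                       first('date2'))
                   .writeStream
                   .format('console')
                   .outputMode("complete")
                   .start())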
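
The query above does not compute the count the exercise asks for. As a reference for the intended logic, here is a plain-Python sketch that counts consecutive pairs whose separation differs from one minute, assuming the timestamps of one micro-batch are already sorted and use the `yyyy-MM-dd HH:mm:ss` layout seen earlier. The function name and sample values are illustrative assumptions.

```python
from datetime import datetime, timedelta


def count_irregular_pairs(timestamps, expected=timedelta(minutes=1)):
    """Count consecutive pairs whose gap is not exactly `expected`."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
    # zip pairs each sample with its successor
    return sum(1 for a, b in zip(parsed, parsed[1:]) if b - a != expected)


# Illustrative batch: gaps of 59 s, 61 s and 60 s
batch = [
    "2015-02-04 17:51:00",
    "2015-02-04 17:51:59",
    "2015-02-04 17:53:00",
    "2015-02-04 17:54:00",
]
irregular = count_irregular_pairs(batch)
```

In Spark, the same comparison between a row and its predecessor is usually written with the `lag` function over an ordered window. Non-time-based windows are not supported directly on a streaming DataFrame, so that approach would need to run on each micro-batch, for example inside `foreachBatch`.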

## Stop Streaming Context