#  4. End-to-End Streaming Example

**Official Spark Structured Streaming documentation**: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

**Example**: https://blog.knoldus.com/basic-example-spark-structured-streaming-kafka-integration/

## 4.1 Initial instructions and setup

2. Create a `checkpoint` directory inside the `data` subdirectory.

3. Make sure you have sufficient permissions to manipulate files inside the directory (this should already be the case if you have run the previous examples). If necessary, run `sudo chmod -R 777 data`.

**Input: Kafka queue**

4. Start the Kafka broker, either installed locally, in a local VM, or in a local container (e.g. Docker).

5. Modify the Python script `4-kafka_producer.py` so that it sends the data to the Kafka broker (set the correct IP and port).

6. If necessary, activate the Anaconda Python environment (**important: use Python v3.6+**). Run the Kafka producer with `python p_kafka_producer.py 0.6 1.3 test data/occupancy_data.csv`.

7. From that point on, you are ready to run the Spark Streaming *jobs* in this notebook. Let's start the analysis!

**WebUI**: While the Spark Streaming context is active, the *job* monitoring interface is available at http://localhost:4040.

## 4.2 Imports and context creation

###  Creating the SparkContext (first time only)

In [1]:
# Import dependencies and functions
from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from operator import add
from operator import sub

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

In [3]:
# Load external packages programmatically
import os
# THIS IS MANDATORY
# You must provide the information about the Maven artifact for the
# Spark Streaming connector to Kafka
# At present time, only the 0.8.2 version (deprecated) has
# Python support
packages = "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.5"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
# THIS IS COMPULSORY
# Comment the line below if JAVA_HOME is already set up or you
# only have a single JVM version in your system
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# OPTIONAL: Check setup of environment variables
print("PYSPARK_SUBMIT_ARGS = ",os.environ["PYSPARK_SUBMIT_ARGS"],"\n")
print("JAVA_HOME = ", os.environ["JAVA_HOME"])

PYSPARK_SUBMIT_ARGS =  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.5 pyspark-shell 

JAVA_HOME =  /usr/lib/jvm/java-8-openjdk-amd64


In [4]:
sc = SparkContext(appName="KafkaStreamingEndtoEnd")

In [5]:
spark = SparkSession \
    .builder \
    .appName("prueba") \
    .getOrCreate()

### Creating the streaming context (on each exercise run)

In [6]:
# Create the Spark Streaming context
# Micro-batch update interval (triggers): 5 s
ssc = StreamingContext(sc, 5)

## 4.3 Helper methods

### 4.3.1 Weather data parsing method

This method helps us parse each line that arrives through the Kafka queue with weather data. We use it to access the data of each event (order) in the input data *stream*.

In [7]:
from datetime import datetime

def parseOrder(line):
  s = line.split(",")
  try:
      return [{"id": s[0],
               "date": s[1],
               "Temperature": s[2], 
               "Humidity": s[3],
               "Light": s[4],
               "CO2": s[5],
               "HumidityRatio": s[6],
               "Occupancy": s[7]}]

  except Exception as err:
      print("Wrong line format (%s): " % line)
      return []

### 4.3.2 Write methods - Sending data to Kafka

This section contains a *singleton producer* (to avoid having more than one producer sending data to the Kafka broker) and a method to serialize the results in CSV format.

In [8]:
# Configure the endpoint used to locate the Kafka broker
# kafkaBrokerIPPort = "172.20.1.21:9092"
kafkaBrokerIPPort = "127.0.0.1:9092"

# Simple producer (singleton!)
# from kafka import KafkaProducer
import kafka
class KafkaProducerWrapper(object):
  # Shared producer instance, created lazily on first use
  producer = None
  @staticmethod
  def getProducer(brokerList):
    if KafkaProducerWrapper.producer is not None:
      return KafkaProducerWrapper.producer
    else:
      KafkaProducerWrapper.producer = kafka.KafkaProducer(bootstrap_servers=brokerList,
                                                          key_serializer=str.encode,
                                                          value_serializer=str.encode)
      return KafkaProducerWrapper.producer
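
The description above also mentions a method that serializes results as CSV, which the cell does not show. Below is a minimal sketch of such a helper, assuming each result is a flat sequence of values. The name `to_csv_line` is hypothetical and not part of the original notebook.

```python
import csv
import io

def to_csv_line(values):
    """Serialize one result (a flat sequence of values) as a single CSV line."""
    buffer = io.StringIO()
    # lineterminator="" keeps the record free of a trailing newline
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()

# Example: a (metric name, value) pair ready to be used as a message value
print(to_csv_line(["Temperature", 20.39]))
```

Because the wrapper configures `str.encode` as the value serializer, a string produced this way could be passed as the `value` argument of the producer's `send` method.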

## 4.4 Data source - Reading

In [17]:
from pyspark.sql.types import *

Schema = StructType([
 StructField("id", IntegerType()),
 StructField("date", StringType()),
 StructField("Temperature", DoubleType()),
 StructField("Humidity", DoubleType()),
 StructField("Light", DoubleType()),
 StructField("CO2", DoubleType()),
 StructField("HumidityRatio", DoubleType()),
 StructField("Occupancy", DoubleType())])

In [21]:
# Text file: reading the data source from files (not used in this example; instead
# we send the data to Kafka to create a more realistic simulation).
# Note: textFileStream monitors a directory for new files, so the path should point to a directory.
stream = ssc.textFileStream("data/occupancy_data.txt")

In [32]:
df = spark.readStream.schema(Schema).csv("data/occupancy_data.csv")

In [23]:
# The records shown in this notebook are comma-separated, so "," is used as the separator
df = spark \
    .readStream \
    .option("sep", ",") \
    .schema(Schema) \
    .csv("data/occupancy_data.csv")  # Equivalent to format("csv").load("/path/to/directory")

In [33]:
# isStreaming is a property, not a method: it is True for DataFrames that have streaming sources.
# Calling it as df.isStreaming() raises "TypeError: 'bool' object is not callable".
df.isStreaming

In [26]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Light: double (nullable = true)
 |-- CO2: double (nullable = true)
 |-- HumidityRatio: double (nullable = true)
 |-- Occupancy: double (nullable = true)



In [29]:
df.select("Temperature").where("Temperature > 15")

DataFrame[Temperature: double]

In [10]:
#"id","date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"

### Data input from Kafka

In [24]:
# Kafka: reading data
kafkaParams = {"metadata.broker.list": kafkaBrokerIPPort}
stream = KafkaUtils.createDirectStream(ssc, ["test"], kafkaParams)

In [10]:
#stream = stream.map(lambda o: str(o[1]))
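
`KafkaUtils.createDirectStream` delivers each record as a `(key, value)` tuple, so the commented-out `map` above keeps only the message value before it reaches `parseOrder`. Here is a plain-Python sketch of that step, with no Spark involved. The sample values are taken from the output shown later in this notebook, and the `None` keys and the helper name `message_value` are assumptions.

```python
# Simulated Kafka records as (key, value) tuples; keys are assumed to be None
records = [
    (None, '"1621","2015-02-05 20:51:00",21,19.7,0,463.5,0.00302284822358619,0\n'),
    (None, '"1622","2015-02-05 20:51:59",21,19.7,0,467.5,0.00302284822358619,0\n'),
]

def message_value(record):
    """Return the payload of a (key, value) record as text."""
    return str(record[1])

# Same transformation that stream.map(lambda o: str(o[1])) applies per record
lines = [message_value(r) for r in records]
print(lines[0].split(",")[2])
```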

## Exercise 1: Compute the average Temperature, relative humidity, and CO2 concentration for each micro-batch, and the running average of those values since startup

**test**

In [11]:
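import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Scalar pandas UDF: receives a pandas Series per Arrow batch and returns a Series
# of the same length, so the mean is computed per batch, not over the whole column
@pandas_udf("double", PandasUDFType.SCALAR)
def mean_score(col):
    return pd.Series([np.mean(col)] * len(col))

def process(rdd):
    # Build a DataFrame from an RDD of JSON strings
    data = spark.read.json(rdd) #, schema = schema
    # 'score' is the column name used in this scratch test
    data = data.withColumn('mean_score', mean_score(data['score']))
    data.show()
    return data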

In [12]:
#stream.foreachRDD(lambda rdd: rdd.foreachPartition(process))

In [13]:
valores = stream.flatMap(parseOrder)

contar_temp = valores.map(lambda o: (o['Temperature'], 1)).reduceByKey(add)

#amountPerClient = valores.map(lambda o: (o['clientId'], o['amount'] * o['price']))

# Accumulate the per-batch counts into a running total per key
conteo_temp = contar_temp.updateStateByKey(
    lambda vals, total_count: sum(vals) + total_count if total_count is not None else sum(vals))
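
The cell above counts how many times each temperature value appears. For an average, a common approach is to carry a `(sum, count)` pair as state and divide only when reporting. Below is a plain-Python sketch of that idea, simulating what an update function passed to `updateStateByKey` would do. The function name `update_mean_state` and the sample values are illustrative assumptions, not part of the original notebook.

```python
def update_mean_state(new_values, state):
    """Merge per-batch (sum, count) pairs into the running (sum, count) state.

    Mirrors the (new_values, previous_state) signature of an updateStateByKey function.
    """
    total, count = state or (0.0, 0)
    for batch_sum, batch_count in new_values:
        total += batch_sum
        count += batch_count
    return (total, count)

# Simulate two micro-batches of temperature readings (made-up values)
batches = [[21.0, 21.0, 20.5], [20.0, 19.5]]
state = None
for batch in batches:
    partial = (sum(batch), len(batch))
    print("micro-batch mean:", partial[0] / partial[1])
    state = update_mean_state([partial], state)
    print("running mean:", state[0] / state[1])
```

Keeping the sum and the count separate avoids averaging averages, which would weight micro-batches of different sizes incorrectly.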


In [22]:
lines.show()

AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nsocket'

In [15]:
#Temp_mean = valores.map(lambda o: (o['Temperature'], o)).reduceByKey(mean_score)
#Temp_mean = (valores.updateStateByKey(lambda vals, 
#                                      totalOpt: sum(vals) + totalOpt if totalOpt != None else sum(vals)))



#Temp_mean.repartition(1).saveAsTextFiles("data/output/metrics", "csv")

#lines = valores.map(lambda x: x[1])
#counts = lines.map(lambda line: line.split("\t")) \
#              .reduceByKey(lambda a, b: a+b)

#Temp_mean.pprint()

In [16]:
ssc.start()

Py4JJavaError: An error occurred while calling o26.start.
: java.lang.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint().
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.streaming.dstream.DStream.validateAtStart(DStream.scala:243)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$validateAtStart$8.apply(DStream.scala:276)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$validateAtStart$8.apply(DStream.scala:276)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.streaming.dstream.DStream.validateAtStart(DStream.scala:276)
	at org.apache.spark.streaming.DStreamGraph$$anonfun$start$4.apply(DStreamGraph.scala:51)
	at org.apache.spark.streaming.DStreamGraph$$anonfun$start$4.apply(DStreamGraph.scala:51)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.streaming.DStreamGraph.start(DStreamGraph.scala:51)
	at org.apache.spark.streaming.scheduler.JobGenerator.startFirstTime(JobGenerator.scala:194)
	at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:100)
	at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:103)
	at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:588)
	at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:583)
	at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:583)
	at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
	at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:583)
	at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:575)
	at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:556)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
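
The error above arises because `updateStateByKey` keeps state across micro-batches, and Spark Streaming requires a checkpoint directory for stateful operations. Below is a minimal sketch of the fix, assuming the `data/checkpoint` directory from the setup steps. The helper name `ensure_checkpoint_dir` is hypothetical.

```python
import os

def ensure_checkpoint_dir(path="data/checkpoint"):
    """Create the checkpoint directory if it is missing and return its absolute path."""
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)

checkpoint_dir = ensure_checkpoint_dir()
print(checkpoint_dir)

# With a StreamingContext `ssc` created as earlier in this notebook,
# register the directory before calling ssc.start():
# ssc.checkpoint(checkpoint_dir)
```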


In [14]:
# Once you are done, stop the StreamingContext
ssc.stop(False)

In [None]:
def aggregate_tags_count(new_values, total_sum):
    return sum(new_values) + (total_sum or 0)

# NOTE: `dataStream` is not defined in this notebook; it is assumed to be a DStream of text lines
# split each tweet into words
words = dataStream.flatMap(lambda line: line.split(" "))
# keep only the hashtags, then map each hashtag to a (hashtag, 1) pair
hashtags = words.filter(lambda w: '#' in w).map(lambda x: (x, 1))
# add each hashtag's count to its previous total
tags_totals = hashtags.updateStateByKey(aggregate_tags_count)
tags_totals.pprint()

In [12]:
ssc.start()

-------------------------------------------
Time: 2020-06-24 22:24:35
-------------------------------------------
{'id': '"1781"', 'date': '"2015-02-05 23:31:00"', 'Temperature': '20.39', 'Humidity': '21.29', 'Light': '0', 'CO2': '444.5', 'HumidityRatio': '0.00314694763447223', 'Occupancy': '0\n'}
{'id': '"1782"', 'date': '"2015-02-05 23:31:59"', 'Temperature': '20.3566666666667', 'Humidity': '21.29', 'Light': '0', 'CO2': '441.666666666667', 'HumidityRatio': '0.00314044248260255', 'Occupancy': '0\n'}
{'id': '"1783"', 'date': '"2015-02-05 23:32:59"', 'Temperature': '20.3566666666667', 'Humidity': '21.29', 'Light': '0', 'CO2': '442.333333333333', 'HumidityRatio': '0.00314044248260255', 'Occupancy': '0\n'}
{'id': '"1784"', 'date': '"2015-02-05 23:34:00"', 'Temperature': '20.39', 'Humidity': '21.29', 'Light': '0', 'CO2': '441', 'HumidityRatio': '0.00314694763447223', 'Occupancy': '0\n'}
{'id': '"1785"', 'date': '"2015-02-05 23:35:00"', 'Temperature': '20.29', 'Humidity': '21.29', 'Light': 

In [12]:
ssc.start()

-------------------------------------------
Time: 2020-06-24 22:22:00
-------------------------------------------
"1621","2015-02-05 20:51:00",21,19.7,0,463.5,0.00302284822358619,0

"1622","2015-02-05 20:51:59",21,19.7,0,467.5,0.00302284822358619,0

"1623","2015-02-05 20:53:00",21,19.7,0,476,0.00302284822358619,0

"1624","2015-02-05 20:54:00",21,19.76,0,474,0.00303209974808914,0


-------------------------------------------
Time: 2020-06-24 22:22:05
-------------------------------------------
"1625","2015-02-05 20:55:00",21,19.79,0,471,0.00303672561304647,0

"1626","2015-02-05 20:55:59",21,19.79,0,472,0.00303672561304647,0

"1627","2015-02-05 20:57:00",21,19.79,0,472,0.00303672561304647,0

"1628","2015-02-05 20:57:59",21,19.79,0,474.25,0.00303672561304647,0

"1629","2015-02-05 20:58:59",20.9725,19.79,0,473,0.00303157119548328,0


-------------------------------------------
Time: 2020-06-24 22:22:10
-------------------------------------------
"1630","2015-02-05 21:00:00",21,19.79,0,469,

## Exercise 2: Compute the average light level in the room over sliding windows of 45 seconds, with a slide of 15 seconds between consecutive windows.
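
With the 5-second micro-batches configured earlier, a 45-second window spans 9 micro-batches and a 15-second slide advances 3 micro-batches at a time. The following plain-Python sketch simulates that windowing logic without Spark. The function name `sliding_window_means` and the sample values are illustrative assumptions, and in a DStream pipeline the same durations would be passed to `window(45, 15)`.

```python
def sliding_window_means(batches, window_batches=9, slide_batches=3):
    """Average the values of the last `window_batches` micro-batches every `slide_batches`.

    `batches` is a list of lists of Light values, one inner list per micro-batch.
    Returns (index_of_last_batch_in_window + 1, mean) pairs; mean is None for empty windows.
    """
    results = []
    for end in range(slide_batches, len(batches) + 1, slide_batches):
        start = max(0, end - window_batches)
        values = [v for batch in batches[start:end] for v in batch]
        mean = sum(values) / len(values) if values else None
        results.append((end, mean))
    return results

# Twelve simulated micro-batches with one made-up Light reading each
sample = [[float(i)] for i in range(12)]
for end, mean in sliding_window_means(sample):
    print("window ending at batch", end, "mean light:", mean)
```

Early windows cover fewer than 9 micro-batches because not enough data has arrived yet.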

## Exercise 3: Examining the data, we can see that the interval between the original samples is often not exactly 1 minute. Compute the number of pairs of consecutive samples in each micro-batch whose separation is not exactly 1 minute.
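
The core of this exercise is comparing each timestamp with the one that follows it. Here is a plain-Python sketch of that comparison for a single micro-batch, using timestamps from the output shown earlier in this notebook. The function name `count_irregular_pairs` is hypothetical, and pairs that straddle two micro-batches are not considered.

```python
from datetime import datetime, timedelta

def count_irregular_pairs(timestamps, expected=timedelta(minutes=1)):
    """Count consecutive pairs whose separation differs from `expected`."""
    ordered = sorted(timestamps)
    return sum(1 for prev, cur in zip(ordered, ordered[1:]) if cur - prev != expected)

# Timestamps of one micro-batch, as they appear in the records above
fmt = "%Y-%m-%d %H:%M:%S"
batch = [datetime.strptime(t, fmt) for t in (
    "2015-02-05 20:51:00",
    "2015-02-05 20:51:59",
    "2015-02-05 20:53:00",
    "2015-02-05 20:54:00",
)]
print(count_irregular_pairs(batch))
```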

## Start Streaming context

In [15]:
ssc.start()
#kafka.errors.UnrecognizedBrokerVersion: UnrecognizedBrokerVersion
#ssc.awaitTerminationOrTimeout(10)  # Wait 10 s before finishing

## Stop Streaming Context

In [16]:
ssc.stop(False)

In [None]:
from datetime import datetime

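def parseOrder(line):
    # Drop the trailing newline, then split the CSV record into fields
    s = line.strip().split(",")
    try:
        return [{"id": s[0],
                 # The date field arrives wrapped in double quotes, so strip them before parsing
                 "date": datetime.strptime(s[1].strip('"'), "%Y-%m-%d %H:%M:%S"),
                 "Temperature": float(s[2]),
                 "Humidity": float(s[3]),
                 "Light": s[4],
                 "CO2": float(s[5]),
                 "HumidityRatio": float(s[6]),
                 "Occupancy": s[7]}]

    except Exception:
        # Malformed records are reported and skipped
        print("Wrong line format (%s)" % line)
        return []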