## Computational Infrastructures for Massive Data Processing
### Module 3 Assignment: Real-Time Data Management (Streaming)
### Author: Jesús Galán Llano
#### Email: jgalan279@alumno.uned.es

Develop a Jupyter notebook named "tweetCount.ipynb" that uses Kafka as its data source, specifically the topic kafkaTwitter, sets a batch duration of one second, and displays, every 5 seconds, the number of tweets received in the last 10 seconds. Around what (approximate) number does the count of tweets processed in the indicated time span (5 seconds) stabilise? Does this make sense? Why?

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

First, we create the Spark session.

In [3]:
spark = SparkSession \
    .builder \
    .appName("tp3") \
    .getOrCreate()

22/01/16 21:38:04 WARN Utils: Your hostname, jesus-Aspire-A514-52 resolves to a loopback address: 127.0.1.1; using 192.168.1.54 instead (on interface wlp2s0)
22/01/16 21:38:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/16 21:38:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Next, we specify the input source, which in this case is Apache Kafka.
It is important to configure the number of Kafka brokers and their addresses correctly. The setup described in the report
consists of two servers, available on ports 9092 and 9093 on the local machine.

In [3]:
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093") \
  .option("subscribe", "kafkaTwitter") \
  .option("includeHeaders", "true") \
  .load()

Next, we configure the window parameters. The exercise asks for a batch duration of 1 second, while
each window lasts 10 seconds and slides every 5 seconds. The window is configured through
the window() function. Note that the "1 seconds" value in the cell below is passed to withWatermark(), which sets how late data may arrive rather than the micro-batch interval.

To obtain the number of tweets in each interval, the records are grouped by their timestamp. The count() function returns the number of tweets
in each window, and the results are then sorted in ascending order by window, that is, chronologically.

In [4]:
# Count tweets per sliding window: 10-second windows that advance every 5 seconds.
windowedCounts = df.withWatermark("timestamp", "1 seconds") \
    .groupBy(
        window(df.timestamp, "10 seconds", "5 seconds")
    ).count() \
    .orderBy('window')
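
To make the window assignment concrete, the following is a minimal pure-Python sketch (no Spark involved) of how a 10-second window with a 5-second slide maps each timestamp to two overlapping windows. The function names `windows_for` and `count_by_window` are hypothetical helpers introduced only for illustration, and timestamps are simplified to plain seconds.

```python
from collections import Counter

WINDOW = 10  # window length in seconds
SLIDE = 5    # slide interval in seconds


def windows_for(ts, window=WINDOW, slide=SLIDE):
    """Return the (start, end) pairs of every sliding window that contains ts."""
    # Latest window start that is <= ts, aligned to a multiple of the slide.
    start = (ts // slide) * slide
    result = []
    # Walk backwards while the half-open window [start, start + window) covers ts.
    while start > ts - window:
        result.append((start, start + window))
        start -= slide
    return sorted(result)


def count_by_window(timestamps):
    """Count events per window, in the spirit of groupBy(window(...)).count()."""
    counts = Counter()
    for ts in timestamps:
        for w in windows_for(ts):
            counts[w] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Second 17 falls in the windows starting at 10 and at 15.
    print(windows_for(17))
    print(count_by_window([16, 17, 21]))
```

Because every tweet is counted in two windows, consecutive rows of the output table overlap by 5 seconds, which is why a row's count keeps growing for two slides before it settles.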

Finally, we start the query so that it processes the tweets as they arrive at the corresponding topic.

In [5]:
windowedCounts\
        .writeStream\
        .outputMode('complete')\
        .format('console')\
        .option('truncate', 'false')\
        .start()

22/01/16 19:59:05 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-508d6779-4ad8-4781-a12d-e81b47273935. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
22/01/16 19:59:05 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
|window|count|
+------+-----+
+------+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|2    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|2    |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|87   |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|81   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|161  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|63   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 4
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|135  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|27   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 5
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|196  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|88   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 6
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|151  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|59   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 7
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|194  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|116  |
|{2022-01-16 19:59:45, 2022-01-16 19:59:55}|14   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 8
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|194  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|170  |
|{2022-01-16 19:59:45, 2022-01-16 19:59:55}|68   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 9
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|194  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|199  |
|{2022-01-16 19:59:45, 2022-01-16 19:59:55}|134  |
|{2022-01-16 19:59:50, 2022-01-16 20:00:00}|37   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 10
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|194  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|199  |
|{2022-01-16 19:59:45, 2022-01-16 19:59:55}|186  |
|{2022-01-16 19:59:50, 2022-01-16 20:00:00}|89   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 11
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|194  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|199  |
|{2022-01-16 19:59:45, 2022-01-16 19:59:55}|189  |
|{2022-01-16 19:59:50, 2022-01-16 20:00:00}|153  |
|{2022-01-16 19:59:55, 2022-01-16 20:00:05}|61   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 12
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|194  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|199  |
|{2022-01-16 19:59:45, 2022-01-16 19:59:55}|189  |
|{2022-01-16 19:59:50, 2022-01-16 20:00:00}|194  |
|{2022-01-16 19:59:55, 2022-01-16 20:00:05}|123  |
|{2022-01-16 20:00:00, 2022-01-16 20:00:10}|21   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 13
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|194  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|199  |
|{2022-01-16 19:59:45, 2022-01-16 19:59:55}|189  |
|{2022-01-16 19:59:50, 2022-01-16 20:00:00}|194  |
|{2022-01-16 19:59:55, 2022-01-16 20:00:05}|191  |
|{2022-01-16 20:00:00, 2022-01-16 20:00:10}|89   |
+------------------------------------------+-----+



                                                                                

-------------------------------------------
Batch: 14
-------------------------------------------
+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|{2022-01-16 19:59:15, 2022-01-16 19:59:25}|6    |
|{2022-01-16 19:59:20, 2022-01-16 19:59:30}|104  |
|{2022-01-16 19:59:25, 2022-01-16 19:59:35}|206  |
|{2022-01-16 19:59:30, 2022-01-16 19:59:40}|200  |
|{2022-01-16 19:59:35, 2022-01-16 19:59:45}|194  |
|{2022-01-16 19:59:40, 2022-01-16 19:59:50}|199  |
|{2022-01-16 19:59:45, 2022-01-16 19:59:55}|189  |
|{2022-01-16 19:59:50, 2022-01-16 20:00:00}|194  |
|{2022-01-16 19:59:55, 2022-01-16 20:00:05}|202  |
|{2022-01-16 20:00:00, 2022-01-16 20:00:10}|157  |
|{2022-01-16 20:00:05, 2022-01-16 20:00:15}|57   |
+------------------------------------------+-----+
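
The query in cell In [5] is started without an explicit trigger, in which case Spark launches each micro-batch as soon as the previous one finishes. As a hedged sketch, the one-second batch duration requested by the exercise could be expressed with `trigger(processingTime=...)` on the writer. The wrapper `start_console_query` and its `interval` parameter are hypothetical names introduced here; the output shown above was not produced with this variant.

```python
def start_console_query(windowed_counts, interval="1 second"):
    """Start the console sink with an explicit micro-batch interval.

    windowed_counts is assumed to be a streaming DataFrame such as
    windowedCounts from cell In [4].
    """
    return (
        windowed_counts.writeStream
        .outputMode("complete")
        .format("console")
        .option("truncate", "false")
        # processingTime sets how often a new micro-batch is triggered.
        .trigger(processingTime=interval)
        .start()
    )
```

Keeping the returned object, for example `query = start_console_query(windowedCounts)`, also makes it possible to end the stream later with `query.stop()`.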



After several batches have been processed, we stop the query in order to answer the questions posed.

As the program output shows, the number of tweets counted per window tends to stabilise
at roughly 200 once a window is complete (206, 200, 194, 199, 189, 194 and 202 in the last batch); the earliest and latest windows show lower counts, presumably because the stream covered them only partially. This is reasonable given the following: the producer waits between 0 and 0.1 seconds before emitting each tweet, which yields 10 tweets per
second in the worst case. If the waits are spread evenly over that range, the mean wait is 0.05 seconds, so the average production rate is 20 tweets per second. Multiplying this rate
by the window size, 10 seconds, gives the number of tweets expected in each window: 200.
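
The arithmetic above can be checked with a small simulation. This is a sketch under stated assumptions: the producer's code is not part of this notebook, so the wait between tweets is assumed to be uniformly distributed in [0, 0.1] seconds, and for simplicity the windows here are consecutive, non-overlapping 10-second intervals. The names `expected_count` and `simulate_counts` are hypothetical.

```python
import random

MAX_DELAY = 0.1  # assumed upper bound of the producer's wait, in seconds
WINDOW = 10      # window length in seconds


def expected_count(max_delay=MAX_DELAY, window=WINDOW):
    """Expected tweets per window when waits are uniform in [0, max_delay]."""
    mean_delay = max_delay / 2  # mean of a uniform wait
    return window / mean_delay  # rate (1 / mean_delay) times window length


def simulate_counts(n_windows=50, max_delay=MAX_DELAY, window=WINDOW, seed=0):
    """Simulate the assumed producer and count tweets in consecutive windows."""
    rng = random.Random(seed)
    counts = [0] * n_windows
    t = 0.0
    while True:
        t += rng.uniform(0, max_delay)  # wait before the next tweet
        index = int(t // window)        # window the tweet falls into
        if index >= n_windows:
            return counts
        counts[index] += 1


if __name__ == "__main__":
    counts = simulate_counts()
    print("expected per window:", expected_count())
    print("simulated mean:", sum(counts) / len(counts))
    print("simulated min/max:", min(counts), max(counts))
```

Under the same assumption, each 5-second slide contributes about 100 new tweets, which is why a window's count roughly doubles between its first complete slide and its final value.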

Note that this number is not exact, as the output confirms, because other factors influence processing performance, such as network latency
or the availability of computational resources.