## PIPELINE 2

This Python script is meant to run immediately after the component that stores hourly departures in MongoDB. It uses Spark to aggregate that data further into a time series per hour and station, stored as aggregated_hourly_departures_per_station.

First, we configure the MongoDB URI with the authentication credentials needed to access the sensor_data database, reading from the archived_hourly_departures collection and writing to aggregated_hourly_departures_per_station. Spark needs this configuration to read from and write to MongoDB with authentication.
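The script hardcodes the credentials in the URI. As a minimal sketch of an alternative, the URI could be assembled from environment variables. The function name `build_mongo_uri` and the variable names `MONGO_USER`, `MONGO_PASSWORD`, `MONGO_HOST`, `MONGO_PORT`, and `MONGO_DATABASE` are assumptions made for illustration and are not part of the original script; the defaults simply mirror the values used in the notebook.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(env=os.environ):
    """Assemble a MongoDB URI from environment variables.

    The variable names are hypothetical; the fallbacks mirror the
    values hardcoded in the notebook.
    """
    # Percent-encode credentials so characters such as "@" or ":" stay valid in a URI
    user = quote_plus(env.get("MONGO_USER", "mongoadmin"))
    password = quote_plus(env.get("MONGO_PASSWORD", "secret"))
    host = env.get("MONGO_HOST", "localhost")
    port = env.get("MONGO_PORT", "27017")
    database = env.get("MONGO_DATABASE", "sensor_data")
    return f"mongodb://{user}:{password}@{host}:{port}/{database}?authSource=admin"


mongo_uri = build_mongo_uri()
```

Escaping the user name and password matters as soon as a real password contains reserved characters, which would otherwise break URI parsing.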

The Spark session is initialized with the MongoDB URI, the database, the input and output collections, and the MongoDB Spark Connector package (mongo-spark-connector_2.12, version 3.0.1) that integrates Spark with MongoDB. This setup lets Spark process the data stored in MongoDB in a distributed fashion.

The core of the script is the process_data function. It reads the archived_hourly_departures collection into a Spark DataFrame, takes the absolute value of cnt_partenze, groups the rows by date, hour (ora), and station name (name), and keeps the maximum absolute departure count for each combination.
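To make the aggregation concrete, here is a plain-Python sketch of the same logic that process_data expresses with Spark: the absolute value of cnt_partenze, then the maximum per (date, ora, name) group. The helper `max_abs_departures` and the sample records are made up for illustration; only the field names come from the script.

```python
def max_abs_departures(records):
    """Return {(date, ora, name): max(|cnt_partenze|)} for a list of dict records."""
    result = {}
    for rec in records:
        key = (rec["date"], rec["ora"], rec["name"])
        value = abs(rec["cnt_partenze"])
        # Keep the largest absolute count seen so far for this group
        if key not in result or value > result[key]:
            result[key] = value
    return result


# Toy input, not taken from the real collection
sample = [
    {"date": "2024-03-02", "ora": 17, "name": "Station A", "cnt_partenze": -4},
    {"date": "2024-03-02", "ora": 17, "name": "Station A", "cnt_partenze": 3},
    {"date": "2024-03-02", "ora": 18, "name": "Station B", "cnt_partenze": 7},
]
aggregated = max_abs_departures(sample)
```

With this toy input, the two rows for Station A at hour 17 collapse into a single entry with value 4, because the negative count is compared by magnitude.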

For each partition of the aggregated DataFrame, we invoke the update_document function, which upserts the results into the aggregated_hourly_departures_per_station collection in MongoDB. Each document is keyed by date, hour, and station name (rather than station ID), so an existing document is updated in place and a missing one is created.
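The upsert keyed on date, hour, and station name is what makes repeated runs safe: writing the same aggregate twice leaves a single document. The sketch below illustrates this with an in-memory dictionary standing in for the MongoDB collection. `AggregatedRow`, `build_upsert`, and `apply_upsert` are hypothetical helpers, not part of the script or of pymongo; the filter and update documents have the same shape the script passes to `update_one(..., upsert=True)`.

```python
from collections import namedtuple

# Stand-in for a Spark Row produced by the aggregation (hypothetical type)
AggregatedRow = namedtuple(
    "AggregatedRow", ["date", "ora", "name", "max_abs_cnt_partenze"]
)


def build_upsert(row):
    """Build the filter and update documents for one aggregated row."""
    query = {"date": row.date, "ora": row.ora, "station_name": row.name}
    update = {"$set": {"cnt_partenze": row.max_abs_cnt_partenze}}
    return query, update


def apply_upsert(store, query, update):
    """Mimic an upsert on a dict: create the document if missing, then apply $set."""
    key = tuple(sorted(query.items()))
    doc = store.setdefault(key, dict(query))
    doc.update(update["$set"])
    return doc


store = {}
for value in (4, 6):  # two successive runs for the same date, hour, and station
    row = AggregatedRow("2024-03-02", 17, "Station A", value)
    apply_upsert(store, *build_upsert(row))
```

After both iterations the store holds one document for the key, carrying the value from the latest run.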

Finally, the infinite loop at the end of the script runs process_data periodically, pausing 1800 seconds (30 minutes) between runs, so that the aggregated data keep up with the latest archived departures and downstream analysis and visualization work on current data.
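The script sleeps a fixed 1800 seconds after each run, so the effective period is the run time plus 30 minutes. As a sketch of one possible variation (not what the script does), the loop below subtracts the time spent in the task from the pause and accepts an optional run limit, which makes it easier to test. The name `run_periodically` and its parameters are assumptions for illustration.

```python
import time


def run_periodically(task, interval_seconds, max_runs=None,
                     sleep=time.sleep, clock=time.monotonic):
    """Call task() repeatedly, aiming for one start every interval_seconds.

    max_runs=None loops forever, like the original script. The sleep and
    clock parameters are injectable so the loop can be exercised without waiting.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = clock()
        task()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break  # no need to wait after the final run
        elapsed = clock() - started
        # Never sleep a negative amount if the task overran the interval
        sleep(max(0.0, interval_seconds - elapsed))
    return runs
```

In the notebook this would be invoked as `run_periodically(process_data, 1800)`, which never returns, just like the existing `while True` loop.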


In [1]:
import json
import time
import pymongo
import pandas as pd
from pymongo import MongoClient
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import abs as pyspark_abs, max as pyspark_max, sum as pyspark_sum


In [None]:
# MongoDB URI with authentication
mongo_uri = "mongodb://mongoadmin:secret@localhost:27017/sensor_data?authSource=admin"
mongo_database = "sensor_data"
mongo_input_collection = "archived_hourly_departures"
mongo_output_collection = "aggregated_hourly_departures_per_station"

# Initialize the SparkSession with the authenticated MongoDB connection
spark = SparkSession.builder \
    .appName("AggregateHourlySalesPerStation") \
    .config('spark.mongodb.input.uri', mongo_uri) \
    .config('spark.mongodb.input.database', mongo_database) \
    .config('spark.mongodb.input.collection', mongo_input_collection) \
    .config('spark.mongodb.output.uri', mongo_uri) \
    .config('spark.mongodb.output.database', mongo_database) \
    .config('spark.mongodb.output.collection', mongo_output_collection) \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1') \
    .getOrCreate()

def update_document(iterator):
    # Runs on the executors: open one MongoDB connection per partition
    from pymongo import MongoClient
    client = MongoClient(mongo_uri)
    try:
        collection = client[mongo_database][mongo_output_collection]
        for row in iterator:
            # Key on the station "name" field instead of "station_id"
            query = {"date": row.date, "ora": row.ora, "station_name": row.name}
            update = {"$set": {"cnt_partenze": row.max_abs_cnt_partenze}}
            collection.update_one(query, update, upsert=True)
    finally:
        # Close the connection even if an update fails
        client.close()

def process_data():
    try:
        df = spark.read.format("mongo").load()

        # Compute the absolute value of cnt_partenze
        df_with_abs = df.withColumn("abs_cnt_partenze", pyspark_abs(df["cnt_partenze"]))

        # Aggregate by "name" instead of "station_id"
        max_values = df_with_abs.groupBy("date", "ora", "name").agg(pyspark_max("abs_cnt_partenze").alias("max_abs_cnt_partenze"))

        max_values.foreachPartition(update_document)

        print("Aggiornamento completato.")

    except Exception as ex:
        print(f"ERRORE: {str(ex)}")

# Infinite loop for periodic execution
while True:
    process_data()
    time.sleep(1800)  # Interval between runs: 1800 seconds (30 minutes)


24/03/02 17:54:21 WARN Utils: Your hostname, MacBook-Pro-di-Giuseppe.local resolves to a loopback address: 127.0.0.1; using 192.168.200.186 instead (on interface en0)
24/03/02 17:54:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/Users/panda/mambaforge/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/panda/.ivy2/cache
The jars for the packages stored in: /Users/panda/.ivy2/jars
org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-127019fd-1586-43c6-9631-494b66d23c41;1.0
	confs: [default]
	found org.mongodb.spark#mongo-spark-connector_2.12;3.0.1 in central
	found org.mongodb#mongodb-driver-sync;4.0.5 in central
	found org.mongodb#bson;4.0.5 in central
	found org.mongodb#mongodb-driver-core;4.0.5 in central
:: resolution report :: resolve 360ms :: artifacts dl 20ms
	:: modules in use:
	org.mongodb#bson;4.0.5 from central in [default]
	org.mongodb#mongodb-driver-core;4.0.5 from central in [default]
	org.mongodb#mongodb-driver-sync;4.0.5 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.12;3.0.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts

Aggiornamento completato.


                                                                                

