<a href="https://colab.research.google.com/github/vaniamv/final-project-edit/blob/main/streaming_batch_ELT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Real-Time Application to Read the Carris API - Group 1 - ELT Approach

This notebook documents the steps to implement a data pipeline leveraging Google Cloud Platform (GCP), following an **ELT (Extract, Load, Transform)** approach. The pipeline processes data in two stages:


1.   *Streaming Ingestion (Extract and Load):*
Data is ingested in real-time from a source bucket on GCP (input bucket) and stored in a bronze layer bucket, preserving the raw format for further processing.
2.   *Batch Transformation (Transform):*
The raw data from the bronze layer is transformed in batches. These transformations clean, standardize, and structure the data, preparing it for analytical use. The transformed data is then stored in the silver layer bucket for downstream consumption.

Because the ELT approach loads the raw data into the bronze layer before transforming it, the original records stay available: transformations can be changed or re-run later as business requirements evolve.


---



---








1. Authentication to Google Cloud Platform (GCP)



In [None]:
!gcloud auth application-default login

Go to the following link in your browser, and complete the sign-in prompts:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=https%3A%2F%2Fsdk.cloud.google.com%2Fapplicationdefaultauthcode.html&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login&state=qaOEoFAUxZRbBQ8TfypE63jacO2pKG&prompt=consent&token_usage=remote&access_type=offline&code_challenge=7KM99PL4_dcZdGrGArfJh3ydf_-r8lVuBZnuXfoNMxw&code_challenge_method=S256

Once finished, enter the verification code provided in your browser: 4/0ASVgi3J4HBTO7INmBkLwQCLAqti9-TduLXYuH9IiCfbWyy8UxLXATSDzZWRW92o2npmcIA

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).
Ca

In [None]:
# Download the GCS connector for Hadoop 3 and save it locally

!wget https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.7/gcs-connector-hadoop3-2.2.7-shaded.jar -P /usr/local/lib/

--2025-01-22 22:39:41--  https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.7/gcs-connector-hadoop3-2.2.7-shaded.jar
Resolving repo1.maven.org (repo1.maven.org)... 199.232.192.209, 199.232.196.209, 2a04:4e42:4c::209, ...
Connecting to repo1.maven.org (repo1.maven.org)|199.232.192.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33831577 (32M) [application/java-archive]
Saving to: ‘/usr/local/lib/gcs-connector-hadoop3-2.2.7-shaded.jar’


2025-01-22 22:39:42 (145 MB/s) - ‘/usr/local/lib/gcs-connector-hadoop3-2.2.7-shaded.jar’ saved [33831577/33831577]





---


---
2. Initialize the SparkSession and set up access to GCS


In [None]:
# import libraries

import os
from pyspark.sql import SparkSession

#spark session
spark = SparkSession.builder \
    .appName('GCS_Spark') \
    .config('spark.jars', '/usr/local/lib/gcs-connector-hadoop3-2.2.7-shaded.jar') \
    .config('spark.hadoop.fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem') \
    .getOrCreate()

# save credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = '/content/.config/application_default_credentials.json'

# Config PySpark to access the GCS
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", '/content/.config/application_default_credentials.json')




---


---
3. Set up the source schema and initialize the readStream

In [None]:
from pyspark.sql.types import *

# create schema
vehicle_schema = StructType([StructField('bearing', IntegerType(), True),
                             StructField('block_id', StringType(), True),
                             StructField('current_status', StringType(), True),
                             StructField('id', StringType(), True),
                             StructField('lat', FloatType(), True),
                             StructField('line_id', StringType(), True),
                             StructField('lon', FloatType(), True),
                             StructField('pattern_id', StringType(), True),
                             StructField('route_id', StringType(), True),
                             StructField('schedule_relationship', StringType(), True),
                             StructField('shift_id', StringType(), True),
                             StructField('speed', FloatType(), True),
                             StructField('stop_id', StringType(), True),
                             StructField('timestamp', TimestampType(), True),
                             StructField('trip_id', StringType(), True)])


#readStreaming
stream = spark.readStream.format("json").schema(vehicle_schema).load("gs://edit-de-project-streaming-data/carris-vehicles")

In [None]:
#confirm that stream is streaming
stream.isStreaming

True



---


---
4. Write the stream in a bronze layer (landing zone)
* Purpose: Raw data ingestion layer.
* Data Characteristics: Raw, unprocessed, and schema-on-read where feasible.
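* Data Storage: Records are stored as ingested, without changing their content. The source files are JSON, but the `writeStream` below sets no format, so Spark falls back to its default (Parquet); this is why step 5 reads the layer with `.parquet()`.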
* Operations: Minimal transformation; only schema enforcement and deduplication.
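
As a rough illustration of what "schema enforcement and deduplication" mean for this layer, the plain-Python sketch below mimics the behaviour on a few hand-made records: fields missing from a record become `None`, fields outside the schema are dropped, and rows that are identical in every column are collapsed. The three-field schema subset, the sample records, and the helper names `enforce_schema` and `drop_exact_duplicates` are assumptions made for this example; the notebook itself relies on Spark's explicit schema and `drop_duplicates()`.

```python
# Illustrative subset of the vehicle schema (assumption: three fields are enough here)
SCHEMA_FIELDS = ("id", "lat", "lon")


def enforce_schema(record, fields=SCHEMA_FIELDS):
    """Keep only the schema fields; missing fields become None (nullable columns)."""
    return tuple(record.get(name) for name in fields)


def drop_exact_duplicates(rows):
    """Remove rows identical in every column, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Hand-made sample records (hypothetical, not taken from the bucket)
raw_records = [
    {"id": "41|1100", "lat": 38.72486, "lon": -9.1829},
    {"id": "41|1100", "lat": 38.72486, "lon": -9.1829},        # exact duplicate
    {"id": "41|1104", "lat": 38.7, "extra_field": "ignored"},  # no lon, one unknown field
]

rows = drop_exact_duplicates([enforce_schema(r) for r in raw_records])
print(rows)
```

Unlike this sketch, Spark does not guarantee which copy of a duplicate survives or the order of the remaining rows.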

In [None]:
bronze_layer="gs://edit-data-eng-project-group1/datalake/stream/ELT/bronze_layer"

# writeStreaming
query = (stream
        .writeStream
        .outputMode("append")
        .option("path", bronze_layer)
        .option('checkpointLocation', 'gs://edit-data-eng-project-group1/datalake/stream/ELT/bronze_layer/checkpoint')
        .start()

        )

query.awaitTermination(60)

False

In [None]:
#check the status of the query
query.status

{'message': 'Processing new data',
 'isDataAvailable': True,
 'isTriggerActive': True}

In [None]:
#check if the write streaming is active
query.isActive

True

In [None]:
query.stop()



---


---
5. Check if the bronze layer received the files and prepare the bronze layer operations (schema enforcement and deduplication)

In [None]:
vehicle_schema = StructType([StructField('bearing', IntegerType(), True),
                             StructField('block_id', StringType(), True),
                             StructField('current_status', StringType(), True),
                             StructField('id', StringType(), True),
                             StructField('lat', FloatType(), True),
                             StructField('line_id', StringType(), True),
                             StructField('lon', FloatType(), True),
                             StructField('pattern_id', StringType(), True),
                             StructField('route_id', StringType(), True),
                             StructField('schedule_relationship', StringType(), True),
                             StructField('shift_id', StringType(), True),
                             StructField('speed', FloatType(), True),
                             StructField('stop_id', StringType(), True),
                             StructField('timestamp', TimestampType(), True),
                             StructField('trip_id', StringType(), True)])

# Read the Parquet files into a DataFrame
parquet_df = spark.read.schema(vehicle_schema).parquet(bronze_layer)

parquet_df = parquet_df.drop_duplicates()

# Show the first few rows
parquet_df.show(truncate=False)

# Print the schema to understand the data structure
parquet_df.printSchema()


+-------+--------+--------------+---+---+-------+---+----------+--------+---------------------+--------+-----+-------+---------+-------+
|bearing|block_id|current_status|id |lat|line_id|lon|pattern_id|route_id|schedule_relationship|shift_id|speed|stop_id|timestamp|trip_id|
+-------+--------+--------------+---+---+-------+---+----------+--------+---------------------+--------+-----+-------+---------+-------+
+-------+--------+--------------+---+---+-------+---+----------+--------+---------------------+--------+-----+-------+---------+-------+

root
 |-- bearing: integer (nullable = true)
 |-- block_id: string (nullable = true)
 |-- current_status: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lat: float (nullable = true)
 |-- line_id: string (nullable = true)
 |-- lon: float (nullable = true)
 |-- pattern_id: string (nullable = true)
 |-- route_id: string (nullable = true)
 |-- schedule_relationship: string (nullable = true)
 |-- shift_id: string (nullable = true)
 |--



---


---
6. Ingest to the silver layer
* Purpose: Cleaned and enriched data layer.
* Data Characteristics: Schema-on-write, normalized/structured, with quality checks applied.
* Operations:
  * Select only the required columns.
  * Join with reference data (historical STOPS) for enrichment.
  * Compute simple derived values and store them in new columns.

In [None]:
from pyspark.sql.functions import lag , col, coalesce
from pyspark.sql.window import Window
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
import math

def haversine_distance(lat1, lon1, lat2, lon2):

    if any(x is None for x in [lat1, lon1, lat2, lon2]):
        return 0.0
    R = 6371  # Earth's radius in kilometers

    # Convert latitude and longitude to radians
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])

    # Calculate differences
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Apply Haversine formula
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))

    # Calculate distance
    distance = R * c

    return distance

# Register the UDF
distance_udf = udf(haversine_distance, FloatType())

# Define a window specification
windowSpec = Window.partitionBy("id").orderBy("timestamp")

#select columns
transform = parquet_df.select('id', 'speed', 'timestamp','line_id','route_id','stop_id','lat', 'lon')

# Add previous_lat / previous_lon with lag(); the first row of each vehicle has no previous row, so it falls back to its own position
transform = transform.withColumn("previous_lat", coalesce(lag("lat", 1).over(windowSpec), col('lat')))
transform = transform.withColumn("previous_lon", coalesce(lag("lon", 1).over(windowSpec), col('lon')))

# Get the dataset from endpoint STOPS that we need to join to our main dataset
df_stops = spark.read.option("header", "true").csv('gs://edit-data-eng-project-group1/LandingZone/GTFS/stops.txt')
df_stops = df_stops.select('stop_id','stop_lat','stop_lon')
df_stops = df_stops.withColumn("stop_lat", df_stops["stop_lat"].cast("float"))
df_stops = df_stops.withColumn("stop_lon", df_stops["stop_lon"].cast("float"))

# Join and add new calculated columns
transform = transform.join(df_stops, on='stop_id', how='left')

transform = transform.withColumn("distance", distance_udf(transform["previous_lat"],transform["previous_lon"],transform["lat"],transform["lon"]))
transform = transform.withColumn("distance_to_stop", distance_udf(transform["lat"],transform["lon"],transform["stop_lat"],transform["stop_lon"]))

transform.show()

+-------+-------+---------+-------------------+-------+--------+---------+---------+------------+------------+---------+---------+------------+----------------+
|stop_id|     id|    speed|          timestamp|line_id|route_id|      lat|      lon|previous_lat|previous_lon| stop_lat| stop_lon|    distance|distance_to_stop|
+-------+-------+---------+-------------------+-------+--------+---------+---------+------------+------------+---------+---------+------------+----------------+
| 030785|41|1100|     15.0|2025-01-03 18:35:39|   1716|  1716_0| 38.72486|  -9.1829|    38.72486|     -9.1829|38.728104|-9.216909|         0.0|       2.9721527|
| 030785|41|1100|11.944445|2025-01-03 18:36:31|   1716|  1716_0| 38.72566|-9.191012|    38.72486|     -9.1829|38.728104|-9.216909|   0.7092681|       2.2628906|
| 030785|41|1100|15.833333|2025-01-03 18:36:49|   1716|  1716_0|38.725082|-9.193826|    38.72566|   -9.191012|38.728104|-9.216909|   0.2523195|       2.0304399|
| 030785|41|1100|4.4444447|2025-01

In [None]:
transform.printSchema()

root
 |-- stop_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- speed: float (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- line_id: string (nullable = true)
 |-- route_id: string (nullable = true)
 |-- lat: float (nullable = true)
 |-- lon: float (nullable = true)
 |-- previous_lat: float (nullable = true)
 |-- previous_lon: float (nullable = true)
 |-- stop_lat: float (nullable = true)
 |-- stop_lon: float (nullable = true)
 |-- distance: float (nullable = true)
 |-- distance_to_stop: float (nullable = true)
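
To make the `distance` column easier to check by hand, here is a minimal plain-Python sketch of the same idea: order one vehicle's positions by timestamp, take the previous position (or the current one for the first row, as `coalesce(lag(...), col(...))` does), and apply the Haversine formula. The three positions are copied from the `transform.show()` output above; the function name `haversine_km` and the loop are illustrative, not the notebook's Spark implementation.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# (timestamp, lat, lon) for vehicle 41|1100, copied from the output above
positions = [
    ("2025-01-03 18:35:39", 38.72486, -9.1829),
    ("2025-01-03 18:36:31", 38.72566, -9.191012),
    ("2025-01-03 18:36:49", 38.725082, -9.193826),
]

positions.sort(key=lambda p: p[0])  # same ordering as orderBy("timestamp")

distances = []
previous = None
for _, lat, lon in positions:
    # The first record has no previous row, so it is compared with itself
    prev_lat, prev_lon = previous if previous is not None else (lat, lon)
    distances.append(haversine_km(prev_lat, prev_lon, lat, lon))
    previous = (lat, lon)

print(distances)
```

The results should land close to the `distance` values shown above (0.0, then roughly 0.709 km and 0.252 km); small differences are expected because Spark stores the coordinates as 32-bit floats.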



In [None]:
transform.write.format("parquet").mode("overwrite").save("gs://edit-data-eng-project-group1/datalake/stream/ELT/silver_layer")

+-------+-------+---------+-------------------+---------+---------+------------+------------+---------+---------+------------+----------------+
|stop_id|     id|    speed|          timestamp|      lat|      lon|previous_lat|previous_lon| stop_lat| stop_lon|    distance|distance_to_stop|
+-------+-------+---------+-------------------+---------+---------+------------+------------+---------+---------+------------+----------------+
| 030785|41|1100|     15.0|2025-01-03 18:35:39| 38.72486|  -9.1829|    38.72486|     -9.1829|38.728104|-9.216909|         0.0|       2.9721527|
| 030785|41|1100|11.944445|2025-01-03 18:36:31| 38.72566|-9.191012|    38.72486|     -9.1829|38.728104|-9.216909|   0.7092681|       2.2628906|
| 030785|41|1100|15.833333|2025-01-03 18:36:49|38.725082|-9.193826|    38.72566|   -9.191012|38.728104|-9.216909|   0.2523195|       2.0304399|
| 030785|41|1100|4.4444447|2025-01-03 18:37:27| 38.72312|-9.199276|   38.725082|   -9.193826|38.728104|-9.216909|   0.5206602|       1.6



---


---
7. Ingest to the gold layer
* Purpose: Aggregated data ready for analytics and reporting.
* Data Characteristics: Pre-aggregated according to business logic and optimized for query performance.
* Operations: Aggregations, windowed calculations. Metrics computations (e.g., averages, counts, sums).
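
The aggregation in this section groups rows into fixed 2-minute tumbling windows. As a sketch of how such a window can be derived from a timestamp, the plain-Python function below aligns window starts to the Unix epoch, which is how Spark's `window()` is documented to behave when no start offset is given. The helper name `tumbling_window` is hypothetical, and treating the sample timestamp as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts, size=timedelta(minutes=2)):
    """Return the [start, end) window of length `size` that contains `ts`."""
    start = ts - (ts - EPOCH) % size  # step back to the previous multiple of `size`
    return start, start + size


# Timestamp of the first sample row, assumed to be in UTC
ts = datetime(2025, 1, 3, 18, 35, 39, tzinfo=timezone.utc)
start, end = tumbling_window(ts)
print(start.time(), end.time())  # 18:34:00 18:36:00
```

Window starts are inclusive and ends are exclusive, so a record stamped exactly 18:36:00 belongs to the next window.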

In [None]:
transform = spark.read.format("parquet").load("gs://edit-data-eng-project-group1/datalake/stream/ELT/silver_layer")

In [None]:
import pyspark.sql.functions as F


agg = transform.groupBy("id", "stop_id", F.window("timestamp", "2 minutes")).agg(
    F.sum('distance').alias("distance_2_min"),
    F.last('distance_to_stop').alias('distance_to_stop')
    )

agg = agg.withColumn('speed', col('distance_2_min')/(2/60))

agg = agg.filter(agg.distance_to_stop.isNotNull() & (agg.distance_to_stop > 0) & (agg.speed.isNotNull()) & (agg.speed > 0)) \
         .withColumn('time_to_stop', (col('distance_to_stop')/col('speed') * 3600))

agg = agg.withColumn(
    'time_to_stop',
    F.from_unixtime(
        F.unix_timestamp(F.lit('00:00:00'), 'HH:mm:ss') + col('time_to_stop'),
        'HH:mm:ss'
    )
)


agg.show()

+-------+-------+--------------------+--------------------+----------------+------------------+------------+
|     id|stop_id|              window|      distance_2_min|distance_to_stop|             speed|time_to_stop|
+-------+-------+--------------------+--------------------+----------------+------------------+------------+
|41|1104| 120083|{2025-01-03 19:26...|0.051517192274332047|       0.2787009|1.5455157682299614|    00:10:49|
|41|1109| 060041|{2025-01-03 20:52...|  0.4222240149974823|    0.0031869775|12.666720449924469|    00:00:00|
|41|1109| 120533|{2025-01-04 00:02...|  0.0678267776966095|      0.41164023| 2.034803330898285|    00:12:08|
|41|1109| 121099|{2025-01-04 01:00...| 0.04669193550944328|       1.0875545|1.4007580652832985|    00:46:35|
|41|1114| 120157|{2025-01-03 20:48...| 0.43597667291760445|     0.007636943|13.079300187528133|    00:00:02|
|41|1114| 121059|{2025-01-03 20:52...|  0.4232720732688904|    0.0060717873|12.698162198066711|    00:00:01|
|41|1114| 121298|{2
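
The arithmetic behind `speed` and `time_to_stop` can be verified against the output above with a few lines of plain Python. This is only a sketch of the calculation, not the Spark code: the function name `estimate_time_to_stop` is hypothetical, and dropping fractional seconds is an assumption that happens to match the displayed values.

```python
WINDOW_HOURS = 2 / 60  # the 2-minute aggregation window expressed in hours


def estimate_time_to_stop(distance_2_min_km, distance_to_stop_km):
    """Return (speed in km/h, time to stop as 'HH:MM:SS')."""
    speed_kmh = distance_2_min_km / WINDOW_HOURS
    # Hours to the stop times 3600 gives seconds; fractional seconds are dropped
    seconds = int(distance_to_stop_km / speed_kmh * 3600)
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return speed_kmh, f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Values copied from the first row of the output above
speed, eta = estimate_time_to_stop(0.051517192274332047, 0.2787009)
print(speed, eta)
```

For that row the sketch gives a speed of about 1.5455 km/h and `00:10:49`, matching the table. One difference to keep in mind: the `HH:mm:ss` formatting used in the notebook wraps around after 24 hours, while this sketch keeps counting hours.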

In [None]:
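# Write the aggregated DataFrame (agg) to the gold layer; writing `transform` would only copy the silver data
agg.write.format("parquet").mode("overwrite").save("gs://edit-data-eng-project-group1/datalake/stream/ELT/gold_layer")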