<a href="https://colab.research.google.com/github/lucprosa/dataeng-basic-course/blob/main/spark/challenges/challenge_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CHALLENGE 1
## Implement the INGESTION process
- Set up path in the "lake"
  - !mkdir -p /content/lake/bronze

- Read data from API https://api.carrismetropolitana.pt/
  - Endpoints:
    - vehicles
    - lines
    - municipalities
  - Use StructFields to enforce schema

- Transformations
  - vehicles
    - create "date" extracted from "timestamp" column (format: hh24miss)

- Write data as PARQUET into the BRONZE layer (/content/lake/bronze)
  - Partition "vehicles" by "date" column
  - Paths:
    - vehicles - path: /content/lake/bronze/vehicles
    - lines - path: /content/lake/bronze/lines
    - municipalities - path: /content/lake/bronze/municipalities
  - Make sure only a single Parquet file is created
  - Use overwrite as write mode

# Setting up PySpark

In [1]:
%pip install pyspark

Note: you may need to restart the kernel to use updated packages.


In [2]:
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import *
import requests




In [3]:
spark = SparkSession.builder.master('local').appName('ETL Program').getOrCreate()


24/12/01 23:18:25 WARN Utils: Your hostname, Iness-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.9 instead (on interface en0)
24/12/01 23:18:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/01 23:18:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/01 23:18:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [4]:
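import json

def get_api_data(endpoint, schema, spark):
    # Fetch the endpoint and fail fast on HTTP errors
    response = requests.get(endpoint, timeout=30)
    response.raise_for_status()
    # spark.read.json expects JSON strings, so re-serialize each record
    rdd = spark.sparkContext.parallelize([json.dumps(record) for record in response.json()])
    df = spark.read.schema(schema).json(rdd)
    return df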
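
A note on the input handed to `spark.read.json`: the reader expects JSON text, while `response.json()` yields Python objects. The plain-Python sketch below shows why the distinction matters. The sample record and the `is_valid_json` helper are made up for illustration and are not real API output.

```python
import json

# Hypothetical record for illustration only (not real API output)
record = {"id": "41|1234", "speed": 0.0, "stop_id": None, "active": True}

# str() produces a Python literal: single quotes, None, True
as_repr = str(record)

# json.dumps() produces valid JSON: double quotes, null, true
as_json = json.dumps(record)

def is_valid_json(text):
    """Return True if text parses as strict JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(as_repr, is_valid_json(as_repr))
print(as_json, is_valid_json(as_json))
```

Under Spark's default permissive parsing, text that is not valid JSON tends to surface as rows of nulls rather than as an error, so the problem is easy to miss.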

In [5]:
vehicle_schema = StructType([StructField('bearing', IntegerType(), True),
                            StructField('block_id', StringType(), True),
                            StructField('current_status', StringType(), True),
                            StructField('id', StringType(), True),
                            StructField('lat', FloatType(), True),
                            StructField('line_id', StringType(), True),
                            StructField('lon', FloatType(), True),
                            StructField('pattern_id', StringType(), True),
                            StructField('route_id', StringType(), True),
                            StructField('schedule_relationship', StringType(), True),
                            StructField('shift_id', StringType(), True),
                            StructField('speed', FloatType(), True),
                            StructField('stop_id', StringType(), True),
                            StructField('timestamp', TimestampType(), True),
                            StructField('trip_id', StringType(), True)])

In [6]:
lines_schema = StructType([StructField('color', StringType(), True),
                        StructField('facilities', ArrayType(StringType(), True), True),
                        StructField('id', StringType(), True),
                        StructField('localities', ArrayType(StringType(), True), True),
                        StructField('long_name', StringType(), True),
                        StructField('municipalities', ArrayType(StringType(), True), True),
                        StructField('patterns', ArrayType(StringType(), True), True),
                        StructField('routes', ArrayType(StringType(), True), True),
                        StructField('short_name', StringType(), True),
                        StructField('text_color', StringType(), True)])

In [7]:
municipalities_schema = StructType([StructField('district_id', StringType(), True),
                            StructField('district_name', StringType(), True),
                            StructField('id', StringType(), True),
                            StructField('name', StringType(), True),
                            StructField('prefix', StringType(), True),
                            StructField('region_id', StringType(), True),
                            StructField('region_name', StringType(), True)
                            ])
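
With an explicit schema, Spark's JSON reader returns null for schema fields that are absent from the data and ignores data fields that are not in the schema, so a typo in a field name does not raise an error. A minimal sketch of a pre-check in plain Python follows. `compare_fields` is a hypothetical helper and the sample record is invented for illustration.

```python
def compare_fields(record, expected_fields):
    """Return (missing, unexpected) field names for one decoded JSON record."""
    actual = set(record)
    expected = set(expected_fields)
    return sorted(expected - actual), sorted(actual - expected)

# Field names copied from municipalities_schema
municipality_fields = ["district_id", "district_name", "id", "name",
                       "prefix", "region_id", "region_name"]

# Hypothetical record for illustration only (not real API output)
sample = {"id": "0000", "name": "Example", "prefix": "00", "extra": 1}

missing, unexpected = compare_fields(sample, municipality_fields)
print("missing:", missing)
print("unexpected:", unexpected)
```

Running the same check against the first element of `response.json()` for each endpoint would show whether the schema and the payload agree.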

In [8]:
# Vehicles: derive the "date" partition column from "timestamp"
get_vehicles = (
    get_api_data("https://api.carrismetropolitana.pt/vehicles", vehicle_schema, spark)
    .withColumn("date", expr("date(timestamp)"))
)
get_municipalities = get_api_data("https://api.carrismetropolitana.pt/municipalities", municipalities_schema, spark)

get_lines = get_api_data("https://api.carrismetropolitana.pt/lines", lines_schema, spark)
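
The "date" column later becomes the partition directory name, since `partitionBy("date")` writes Hive-style directories of the form `date=<value>`. The sketch below mirrors that derivation in plain Python, assuming the API's `timestamp` is a Unix epoch in seconds and that dates are taken in UTC. Both are assumptions, and Spark itself uses the session time zone, so results near midnight may differ. `partition_dir` is a hypothetical helper.

```python
from datetime import datetime, timezone

def partition_dir(epoch_seconds):
    """Return the Hive-style partition directory name for a Unix timestamp (UTC)."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()
    return f"date={day.isoformat()}"

# 1733011200 is midnight UTC on 2024-12-01
print(partition_dir(1733011200))
```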

In [9]:
# Write each dataset to the bronze layer; coalesce(1) yields one Parquet file per output directory
get_vehicles.coalesce(1).write.mode("overwrite").partitionBy("date").format("parquet").save("/content/lake/bronze/vehicles")
get_municipalities.coalesce(1).write.mode("overwrite").format("parquet").save("/content/lake/bronze/municipalities")
get_lines.coalesce(1).write.mode("overwrite").format("parquet").save("/content/lake/bronze/lines")
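
To check the single-file requirement after the write, one option is to count the `.parquet` files in each output directory. This is a minimal sketch using only the standard library. `count_parquet_files` is a hypothetical helper, and the root path should be adjusted to wherever the data was actually written.

```python
from pathlib import Path

def count_parquet_files(root):
    """Map each directory under root (root included) to its number of .parquet files."""
    counts = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return counts
    for path in root_path.rglob("*.parquet"):
        key = str(path.parent)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Each directory should report exactly 1
for directory, n in sorted(count_parquet_files("/content/lake/bronze").items()):
    print(directory, n)
```

For "vehicles", the count applies per `date=...` partition directory, so one file per date is the expected result.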


- Transformations
  - remove any corrupted record

- Write data as PARQUET into the SILVER layer (/content/lake/silver)
  - Partition "vehicles" by "date" (created in the ingestion)
  - Paths:
    - vehicles - path: /content/lake/silver/vehicles
    - lines - path: /content/lake/silver/lines
    - municipalities - path: /content/lake/silver/municipalities