## Solutions: lab05-challenge

Part 1

In [None]:
from pyspark.sql import SparkSession

# Create the Spark session
spark = SparkSession.builder.appName("lab05").getOrCreate()

# Load the data
taxi_zones = spark.read.option("header", True).csv("../../data/taxi_zone_lookup.csv")
trips_df = spark.read.parquet("../../data/yellow_tripdata_2023-01.parquet")

In [None]:
taxi_zones.createOrReplaceTempView("taxi_zones")
trips_df.createOrReplaceTempView("yellow_trips")

In [None]:
taxi_zones.show(5)
trips_df.show(5)

Part 2

In [None]:
spark.sql("""
    SELECT z.Zone AS pickup_zone, COUNT(*) AS num_trips
    FROM yellow_trips t
    JOIN taxi_zones z
    ON t.PULocationID = z.LocationID
    GROUP BY z.Zone
    ORDER BY num_trips DESC
    LIMIT 5
""").show()

Part 3

In [None]:
spark.sql("""
    SELECT z.Zone AS dropoff_zone, AVG(tip_amount) AS prom_tips
    FROM yellow_trips t
    JOIN taxi_zones z
    ON t.DOLocationID = z.LocationID
    GROUP BY z.Zone
    ORDER BY prom_tips DESC
""").show()

Part 4

In [None]:
spark.sql("""
    SELECT z.Zone AS pickup_zone, 
       AVG((unix_timestamp(t.tpep_dropoff_datetime) - unix_timestamp(t.tpep_pickup_datetime)) / 60.0) AS prom_trip_dur_min
    FROM yellow_trips t
    JOIN taxi_zones z
    ON t.PULocationID = z.LocationID
    GROUP BY z.Zone
    ORDER BY prom_trip_dur_min DESC
""").show()

Could we use the following query instead?
```python
spark.sql("""
    SELECT z.Zone AS pickup_zone, 
    AVG(DATEDIFF(minute, tpep_pickup_datetime, tpep_dropoff_datetime)) AS prom_trip_dur_min
    FROM yellow_trips t
    JOIN taxi_zones z
    ON t.PULocationID = z.LocationID
    GROUP BY z.Zone
    ORDER BY prom_trip_dur_min DESC
""").show()
```

**In this environment, yes.** The query runs and returns results similar to the one above, though not necessarily identical: as far as I can tell, `DATEDIFF(minute, ...)` counts whole minutes, whereas the `unix_timestamp()` version keeps the fractional part. Note also that this Dockerized environment is built from a predefined image (pulled from the Internet), so the Spark version or configuration it ships with is probably what makes this three-argument form available.

I mention this because the "canonical", more portable form in Spark SQL is the first one, which calls `unix_timestamp()` explicitly. Keep in mind that the same code may fail on a different Apache Spark installation (for example, on a cluster or in the PySpark CLI), particularly an older release where `DATEDIFF` only accepts two dates and returns a number of days.
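
To make the difference between the two duration expressions concrete, here is a minimal sketch in plain Python (no Spark required). It assumes that the three-argument `DATEDIFF(minute, ...)` counts only whole minutes, which is my understanding of its behavior rather than something verified in this lab. The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timezone


def duration_minutes_fractional(pickup: datetime, dropoff: datetime) -> float:
    """Mirror (unix_timestamp(dropoff) - unix_timestamp(pickup)) / 60.0."""
    seconds = int(dropoff.timestamp()) - int(pickup.timestamp())
    return seconds / 60.0


def duration_minutes_whole(pickup: datetime, dropoff: datetime) -> int:
    """Whole minutes only (ASSUMED behavior of DATEDIFF(minute, ...))."""
    seconds = int(dropoff.timestamp()) - int(pickup.timestamp())
    return int(seconds / 60)  # int() truncates toward zero


# Hypothetical trip lasting 7 minutes and 30 seconds
pickup = datetime(2023, 1, 1, 10, 0, 0, tzinfo=timezone.utc)
dropoff = datetime(2023, 1, 1, 10, 7, 30, tzinfo=timezone.utc)

print(duration_minutes_fractional(pickup, dropoff))  # 7.5
print(duration_minutes_whole(pickup, dropoff))       # 7
```

If that assumption holds, averaging whole minutes would give slightly lower values per zone than averaging fractional minutes, which would explain results that are similar but not identical.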

Part 5

In [None]:
spark.sql("""
    WITH viajes_franja AS (
        SELECT z.Zone AS pickup_zone,
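        HOUR(t.tpep_pickup_datetime) AS hour,
        -- Repeat the HOUR() expression instead of the hour alias: referencing an alias
        -- defined in the same SELECT list is not supported by every Spark version
        CASE
            WHEN HOUR(t.tpep_pickup_datetime) BETWEEN 5 AND 11 THEN 'mañana'
            WHEN HOUR(t.tpep_pickup_datetime) BETWEEN 12 AND 17 THEN 'tarde'
            WHEN HOUR(t.tpep_pickup_datetime) BETWEEN 18 AND 23 THEN 'noche'
            ELSE 'madrugada'
        END AS franja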
        FROM yellow_trips t
        JOIN taxi_zones z ON t.PULocationID = z.LocationID
    )
    SELECT pickup_zone, 
    franja,
    COUNT(*) AS num_trips
    FROM viajes_franja
    GROUP BY pickup_zone, franja
    ORDER BY num_trips DESC
""").show()

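As a quick sanity check on the boundaries of the `CASE` expression in the query above, the same bucketing can be written as a plain Python function and run over all 24 hours. This is only a sketch for checking the logic outside Spark; `franja_for_hour` is a hypothetical helper, not part of the lab.

```python
from collections import Counter


def franja_for_hour(hour: int) -> str:
    """Map an hour of day (0-23) to the same labels as the SQL CASE."""
    if 5 <= hour <= 11:
        return "mañana"
    if 12 <= hour <= 17:
        return "tarde"
    if 18 <= hour <= 23:
        return "noche"
    return "madrugada"  # hours 0-4


# Count how many hours of the day fall into each band
hours_per_band = Counter(franja_for_hour(h) for h in range(24))
print(hours_per_band)
```

The bands are not equally wide (7, 6, 6 and 5 hours respectively), so raw trip counts per band are not directly comparable without normalizing by band length.
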
Part 6

In [None]:
spark.sql("""
    SELECT z.Zone AS pickup_zone, ROUND(SUM(total_amount), 2) AS total_income
    FROM yellow_trips t
    JOIN taxi_zones z
    ON t.PULocationID = z.LocationID
    GROUP BY z.Zone
    HAVING count(*) > 10000
    ORDER BY total_income DESC
""").show()

Part 7

In [None]:
spark.sql("""
    SELECT z.Zone AS dropoff_zone, ROUND(STDDEV(t.tip_amount),2) as tip_var
    FROM yellow_trips t
    JOIN taxi_zones z
    ON t.DOLocationID = z.LocationID
    GROUP BY z.Zone
    ORDER BY tip_var DESC
""").show()

In [None]:
spark.stop()