### **Step 3.2 - Use "GroupBy/Agg" and "Window Functions" to transform the data into an organized ranking by driver**

*   We work with the DataFrame obtained in **Step 3.1**

In [None]:
%run "../includes/configuration"

In [None]:
# The "presentation_folder_path" parameter is defined in the "configuration" notebook
race_results_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

In [None]:
race_results_df.show(truncate=False)

+---------+---------------------+-------------------+----------------+--------------------+-------------+------------------+-----------+----+-----------+-----------+------+--------+-----------------------+
|race_year|race_name            |race_date          |circuit_location|driver_name         |driver_number|driver_nationality|team       |grid|fastest_lap|race_time  |points|position|created_date           |
+---------+---------------------+-------------------+----------------+--------------------+-------------+------------------+-----------+----+-----------+-----------+------+--------+-----------------------+
|2009     |Australian Grand Prix|2009-03-29 06:00:00|Melbourne       |Lewis Hamilton      |44           |British           |McLaren    |18  |39         |\N         |0.0   |null    |2023-06-11 19:27:56.617|
|2009     |Australian Grand Prix|2009-03-29 06:00:00|Melbourne       |Heikki Kovalainen   |null         |Finnish           |McLaren    |12  |null       |\N         |0.0   |null

In [None]:
from pyspark.sql.functions import sum, when, count, col

# Aggregate per season and driver: total points and number of wins.
# when(...) without otherwise() returns null when position is not 1,
# and count() ignores nulls, so only first-place finishes are counted.
driver_standings_df = race_results_df \
    .groupBy("race_year", "driver_name", "driver_nationality", "team") \
    .agg(sum("points").alias("total_points"),
         count(when(col("position") == 1, True)).alias("wins")) \
    .sort(col("wins").desc())
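
The `count(when(col("position") == 1, True))` expression works because of how nulls are handled, not because `True` is added up as 1. The following is a minimal plain-Python sketch of that behavior, with no Spark required. The `when` and `count` helpers and the `positions` list are hypothetical stand-ins written for illustration, not the PySpark functions themselves.

```python
# Plain-Python sketch of the null semantics behind count(when(...)).
# These helpers only imitate the behavior; they are not the PySpark API.

def when(condition, value):
    """Imitate when() without otherwise(): None stands in for null."""
    return value if condition else None


def count(values):
    """Imitate count(column): null (None) entries are skipped."""
    return sum(1 for v in values if v is not None)


# Hypothetical finishing positions for one driver in one season
# (None represents a race without a classified position).
positions = [1, 3, None, 1, 2]

wins = count(when(p == 1, True) for p in positions)
print(wins)  # 2
```

Only the two first-place finishes produce a non-null value, so they are the only rows that `count` sees.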

In [None]:
driver_standings_df.show(truncate=False)

+---------+------------------+------------------+--------+------------+----+
|race_year|driver_name       |driver_nationality|team    |total_points|wins|
+---------+------------------+------------------+--------+------------+----+
|2004     |Michael Schumacher|German            |Ferrari |148.0       |13  |
|2013     |Sebastian Vettel  |German            |Red Bull|397.0       |13  |
|2011     |Sebastian Vettel  |German            |Red Bull|392.0       |11  |
|2020     |Lewis Hamilton    |British           |Mercedes|347.0       |11  |
|2019     |Lewis Hamilton    |British           |Mercedes|413.0       |11  |
|2014     |Lewis Hamilton    |British           |Mercedes|384.0       |11  |
|2018     |Lewis Hamilton    |British           |Mercedes|408.0       |11  |
|2002     |Michael Schumacher|German            |Ferrari |144.0       |11  |
|2015     |Lewis Hamilton    |British           |Mercedes|381.0       |10  |
|2016     |Lewis Hamilton    |British           |Mercedes|380.0       |10  |

In [None]:
driver_standings_df.filter("race_year = 2020").show(truncate=False)

+---------+------------------+------------------+------------+------------+----+
|race_year|driver_name       |driver_nationality|team        |total_points|wins|
+---------+------------------+------------------+------------+------------+----+
|2020     |Lewis Hamilton    |British           |Mercedes    |347.0       |11  |
|2020     |Valtteri Bottas   |Finnish           |Mercedes    |223.0       |2   |
|2020     |Max Verstappen    |Dutch             |Red Bull    |214.0       |2   |
|2020     |Pierre Gasly      |French            |AlphaTauri  |75.0        |1   |
|2020     |Sergio Pérez      |Mexican           |Racing Point|125.0       |1   |
|2020     |Lance Stroll      |Canadian          |Racing Point|75.0        |0   |
|2020     |Kevin Magnussen   |Danish            |Haas F1 Team|1.0         |0   |
|2020     |Antonio Giovinazzi|Italian           |Alfa Romeo  |4.0         |0   |
|2020     |Carlos Sainz      |Spanish           |McLaren     |105.0       |0   |
|2020     |Nicholas Latifi  

In [None]:
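from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rank

# Rank drivers within each season: most points first, wins as the tie-breaker.
# rank() gives tied rows the same rank and leaves a gap after them.
driver_rank_spec = Window.partitionBy("race_year").orderBy(desc("total_points"), desc("wins"))
final_df = driver_standings_df.withColumn("rank", rank().over(driver_rank_spec))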

In [None]:
final_df.show(truncate=False)

+---------+------------------+------------------+------------+------------+----+----+
|race_year|driver_name       |driver_nationality|team        |total_points|wins|rank|
+---------+------------------+------------------+------------+------------+----+----+
|1950     |Nino Farina       |Italian           |Alfa Romeo  |30.0        |3   |1   |
|1950     |Luigi Fagioli     |Italian           |Alfa Romeo  |28.0        |0   |2   |
|1950     |Juan Fangio       |Argentine         |Alfa Romeo  |27.0        |3   |3   |
|1950     |Louis Rosier      |French            |Talbot-Lago |13.0        |0   |4   |
|1950     |Alberto Ascari    |Italian           |Ferrari     |11.0        |0   |5   |
|1950     |Johnnie Parsons   |American          |Kurtis Kraft|9.0         |1   |6   |
|1950     |Bill Holland      |American          |Deidt       |6.0         |0   |7   |
|1950     |Prince Bira       |Thai              |Maserati    |5.0         |0   |8   |
|1950     |Peter Whitehead   |British           |Ferra

In [None]:
final_df.filter("race_year = 2020").show(truncate=False)

+---------+------------------+------------------+------------+------------+----+----+
|race_year|driver_name       |driver_nationality|team        |total_points|wins|rank|
+---------+------------------+------------------+------------+------------+----+----+
|2020     |Lewis Hamilton    |British           |Mercedes    |347.0       |11  |1   |
|2020     |Valtteri Bottas   |Finnish           |Mercedes    |223.0       |2   |2   |
|2020     |Max Verstappen    |Dutch             |Red Bull    |214.0       |2   |3   |
|2020     |Sergio Pérez      |Mexican           |Racing Point|125.0       |1   |4   |
|2020     |Daniel Ricciardo  |Australian        |Renault     |119.0       |0   |5   |
|2020     |Carlos Sainz      |Spanish           |McLaren     |105.0       |0   |6   |
|2020     |Alexander Albon   |Thai              |Red Bull    |105.0       |0   |6   |
|2020     |Charles Leclerc   |Monegasque        |Ferrari     |98.0        |0   |8   |
|2020     |Lando Norris      |British           |McLar
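
In the 2020 output above, Carlos Sainz and Alexander Albon share rank 6 and Charles Leclerc follows at rank 8. That gap is how `rank()` treats ties. The following is a minimal plain-Python sketch of the same rule for a single partition, assuming the ordering used in this step (points descending, then wins descending). The `rank_rows` helper is hypothetical and written for illustration, and the sample rows are a subset copied from the output above.

```python
# Plain-Python sketch of rank() semantics inside one partition (one race_year).

def rank_rows(rows):
    """Return (row, rank) pairs ordered by total_points desc, then wins desc.

    Rows with equal sort keys share a rank, and the next distinct row
    takes its 1-based position, which leaves a gap after a tie.
    """
    ordered = sorted(rows, key=lambda r: (-r["total_points"], -r["wins"]))
    ranked = []
    previous_key = None
    previous_rank = 0
    for index, row in enumerate(ordered, start=1):
        key = (row["total_points"], row["wins"])
        current = previous_rank if key == previous_key else index
        ranked.append((row, current))
        previous_key, previous_rank = key, current
    return ranked


# Subset of the 2020 rows shown above.
season_2020 = [
    {"driver_name": "Charles Leclerc", "total_points": 98.0, "wins": 0},
    {"driver_name": "Carlos Sainz", "total_points": 105.0, "wins": 0},
    {"driver_name": "Daniel Ricciardo", "total_points": 119.0, "wins": 0},
    {"driver_name": "Alexander Albon", "total_points": 105.0, "wins": 0},
]

ranks = {row["driver_name"]: r for row, r in rank_rows(season_2020)}
print(ranks)
# {'Daniel Ricciardo': 1, 'Carlos Sainz': 2, 'Alexander Albon': 2, 'Charles Leclerc': 4}
```

Within this four-row subset the ranks are 1, 2, 2, 4. In the full 2020 table the same rows sit four places lower, which gives the 5, 6, 6, 8 seen in the output.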

In [None]:
# The "presentation_folder_path" parameter is defined in the "configuration" notebook
final_df.write.mode("overwrite").parquet(f"{presentation_folder_path}/driver_standings")