### **Step 3.3 - Use "GroupBy/Agg" and "Window Functions" to transform the data and obtain a ranking of teams for each season**

*   We work with the DataFrame obtained in **Step 3.1**

In [None]:
%run "../includes/configuration"

In [None]:
# The "presentation_folder_path" parameter is defined in the "configuration" notebook
race_results_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

In [None]:
race_results_df.show(truncate=False)

+---------+---------------------+-------------------+----------------+--------------------+-------------+------------------+-----------+----+-----------+-----------+------+--------+-----------------------+
|race_year|race_name            |race_date          |circuit_location|driver_name         |driver_number|driver_nationality|team       |grid|fastest_lap|race_time  |points|position|created_date           |
+---------+---------------------+-------------------+----------------+--------------------+-------------+------------------+-----------+----+-----------+-----------+------+--------+-----------------------+
|2009     |Australian Grand Prix|2009-03-29 06:00:00|Melbourne       |Lewis Hamilton      |44           |British           |McLaren    |18  |39         |\N         |0.0   |null    |2023-06-11 19:27:56.617|
|2009     |Australian Grand Prix|2009-03-29 06:00:00|Melbourne       |Heikki Kovalainen   |null         |Finnish           |McLaren    |12  |null       |\N         |0.0   |null

In [None]:
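from pyspark.sql.functions import sum, when, count, col  # note: this `sum` shadows Python's built-in sum

# Aggregate per season and team: total points and number of race wins
constructor_standings_df = race_results_df \
    .groupBy("race_year", "team") \
    .agg(sum("points").alias("total_points"),
         count(when(col("position") == 1, True)).alias("wins")) \
    .sort(col("wins").desc())
# when(...) without otherwise() returns True when position == 1 and null for every other row;
# count() ignores nulls, so "wins" is the number of first-place finishes.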
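
The `wins` column depends on how `count` treats nulls, which is easy to misread. The following is a minimal plain-Python sketch of that behaviour, not Spark code: the helpers `when_` and `count_non_null` are hypothetical names introduced only for illustration, and the list of positions is made up.

```python
def when_(condition, value):
    """Mimic when(condition, value) with no otherwise(): None when the condition fails."""
    return value if condition else None


def count_non_null(values):
    """Mimic count(column): count only the values that are not None."""
    return sum(1 for v in values if v is not None)


# Made-up finishing positions for one team in one season (None = not classified)
positions = [1, 3, None, 1, 2]

# One flag per result: True for a win, None otherwise
flags = [when_(p == 1, True) for p in positions]

# Only the non-null flags are counted, so this is the number of wins
wins = count_non_null(flags)
print(flags)
print(wins)
```

Because the non-matching rows become null rather than `False`, the count reflects wins only; counting a column of `True`/`False` values would instead count every row.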

In [None]:
constructor_standings_df.show(truncate=False)

+---------+--------+------------+----+
|race_year|team    |total_points|wins|
+---------+--------+------------+----+
|2016     |Mercedes|765.0       |19  |
|2014     |Mercedes|701.0       |16  |
|2015     |Mercedes|703.0       |16  |
|1988     |McLaren |199.0       |15  |
|2004     |Ferrari |262.0       |15  |
|2019     |Mercedes|739.0       |15  |
|2002     |Ferrari |221.0       |15  |
|2020     |Mercedes|573.0       |13  |
|2013     |Red Bull|596.0       |13  |
|1984     |McLaren |143.5       |12  |
|2017     |Mercedes|668.0       |12  |
|2011     |Red Bull|650.0       |12  |
|1996     |Williams|175.0       |12  |
|2018     |Mercedes|655.0       |11  |
|1995     |Benetton|147.0       |11  |
|1992     |Williams|164.0       |10  |
|1989     |McLaren |141.0       |10  |
|1993     |Williams|168.0       |10  |
|2005     |McLaren |182.0       |10  |
|2000     |Ferrari |170.0       |10  |
+---------+--------+------------+----+
only showing top 20 rows



In [None]:
constructor_standings_df.filter("race_year = 2020").show(truncate=False)

+---------+------------+------------+----+
|race_year|team        |total_points|wins|
+---------+------------+------------+----+
|2020     |Mercedes    |573.0       |13  |
|2020     |Red Bull    |319.0       |2   |
|2020     |AlphaTauri  |107.0       |1   |
|2020     |Racing Point|210.0       |1   |
|2020     |Haas F1 Team|3.0         |0   |
|2020     |McLaren     |202.0       |0   |
|2020     |Ferrari     |131.0       |0   |
|2020     |Williams    |0.0         |0   |
|2020     |Alfa Romeo  |8.0         |0   |
|2020     |Renault     |181.0       |0   |
+---------+------------+------------+----+



In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rank

# Rank teams within each season: most points first, with wins as the tie-breaker
constructor_rank_spec = Window.partitionBy("race_year").orderBy(desc("total_points"), desc("wins"))
final_df = constructor_standings_df.withColumn("rank", rank().over(constructor_rank_spec))
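
To make the ranking rule concrete, here is a small plain-Python sketch of what `rank()` does inside a single `race_year` partition: rows are ordered by points and then wins (both descending), tied rows share a rank, and the next rank skips ahead. The function `rank_with_gaps` and the team rows are hypothetical and made up for illustration; this imitates the semantics and is not how Spark implements it.

```python
from itertools import groupby


def rank_with_gaps(rows, key):
    """Assign 1-based ranks in sorted order; ties share a rank and leave a gap after them."""
    ordered = sorted(rows, key=key)
    ranked = []
    position = 0  # number of rows already ranked
    for _, group in groupby(ordered, key=key):
        group = list(group)
        rank = position + 1
        for row in group:
            ranked.append((*row, rank))
        position += len(group)
    return ranked


# Made-up rows for one season: (team, total_points, wins)
season = [
    ("C", 30.0, 0),
    ("A", 50.0, 2),
    ("D", 30.0, 1),
    ("B", 50.0, 2),
]

# Negate both fields to sort by total_points descending, then wins descending
ranked = rank_with_gaps(season, key=lambda r: (-r[1], -r[2]))
for row in ranked:
    print(row)
```

Teams A and B tie on both columns and share rank 1, so D gets rank 3 rather than 2; D outranks C on wins even though their points are equal. If gaps are not wanted, Spark's `dense_rank()` assigns consecutive ranks instead.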

In [None]:
final_df.show(truncate=False)

+---------+------------+------------+----+----+
|race_year|team        |total_points|wins|rank|
+---------+------------+------------+----+----+
|1950     |Alfa Romeo  |89.0        |6   |1   |
|1950     |Ferrari     |21.0        |0   |2   |
|1950     |Talbot-Lago |20.0        |0   |3   |
|1950     |Kurtis Kraft|14.0        |1   |4   |
|1950     |Maserati    |11.0        |0   |5   |
|1950     |Deidt       |10.0        |0   |6   |
|1950     |Simca       |3.0         |0   |7   |
|1950     |Milano      |0.0         |0   |8   |
|1950     |Rae         |0.0         |0   |8   |
|1950     |Langley     |0.0         |0   |8   |
|1950     |Lesovsky    |0.0         |0   |8   |
|1950     |Ewing       |0.0         |0   |8   |
|1950     |Stevens     |0.0         |0   |8   |
|1950     |ERA         |0.0         |0   |8   |
|1950     |Marchese    |0.0         |0   |8   |
|1950     |Nichels     |0.0         |0   |8   |
|1950     |Wetteroth   |0.0         |0   |8   |
|1950     |Watson      |0.0         |0  

In [None]:
final_df.filter("race_year = 2020").show(truncate=False)

+---------+------------+------------+----+----+
|race_year|team        |total_points|wins|rank|
+---------+------------+------------+----+----+
|2020     |Mercedes    |573.0       |13  |1   |
|2020     |Red Bull    |319.0       |2   |2   |
|2020     |Racing Point|210.0       |1   |3   |
|2020     |McLaren     |202.0       |0   |4   |
|2020     |Renault     |181.0       |0   |5   |
|2020     |Ferrari     |131.0       |0   |6   |
|2020     |AlphaTauri  |107.0       |1   |7   |
|2020     |Alfa Romeo  |8.0         |0   |8   |
|2020     |Haas F1 Team|3.0         |0   |9   |
|2020     |Williams    |0.0         |0   |10  |
+---------+------------+------------+----+----+



In [None]:
# The "presentation_folder_path" parameter is defined in the "configuration" notebook
final_df.write.mode("overwrite").parquet(f"{presentation_folder_path}/constructor_standings")