### **Step 6.4.3 - Use "GroupBy/Agg" and "Window Functions" to transform the data into an organized ranking by team**

*   We work with the dataframe obtained in **Step 6.4.1**

1. This notebook was run for the **2021-03-21** directory.
2. It can be reused for the two remaining directories, **2021-03-28** and **2021-04-18**, and for any that arrive in the future.
3. Only the notebook parameter **p_file_date** needs to be changed.

In [None]:
dbutils.widgets.text("p_file_date", "2021-03-21")
v_file_date = dbutils.widgets.get("p_file_date")

In [None]:
v_file_date

Out[24]: '2021-03-21'

In [None]:
%run "../includes/configuration"

In [None]:
%run "../includes/common_functions"

#### Find the **race_year** values for which the data must be reprocessed

In [None]:
# The "presentation_folder_path" parameter is defined in the "configuration" notebook
#race_results_df = spark.read.format('delta').load("/mnt/formula1dl/presentation/race_results") \
race_results_df = spark.read.format('delta').load(f"{presentation_folder_path}/race_results") \
.filter(f"file_date = '{v_file_date}'")

In [None]:
race_results_df.show(truncate=False)

+-------+---------+--------------------+-------------------+----------------+------------------+-------------+------------------+-----------+----+-----------+---------+------+--------+----------+-----------------------+
|race_id|race_year|race_name           |race_date          |circuit_location|driver_name       |driver_number|driver_nationality|team       |grid|fastest_lap|race_time|points|position|file_date |created_date           |
+-------+---------+--------------------+-------------------+----------------+------------------+-------------+------------------+-----------+----+-----------+---------+------+--------+----------+-----------------------+
|873    |2012     |Singapore Grand Prix|2012-09-23 12:00:00|Marina Bay      |Lewis Hamilton    |44           |British           |McLaren    |1   |14         |\N       |0.0   |null    |2021-03-21|2023-06-18 02:14:49.486|
|873    |2012     |Singapore Grand Prix|2012-09-23 12:00:00|Marina Bay      |Narain Karthikeyan|null         |Indian    

In [None]:
# The "df_column_to_list()" function is defined in the "common_functions" notebook
# It becomes available after running %run "../includes/common_functions"
race_year_list = df_column_to_list(race_results_df, 'race_year')

In [None]:
# The "presentation_folder_path" parameter is defined in the "configuration" notebook
#race_results_df = spark.read.format('delta').load("/mnt/formula1dl/presentation/race_results") \
from pyspark.sql.functions import col

race_results_df = spark.read.format('delta').load(f"{presentation_folder_path}/race_results") \
.filter(col("race_year").isin(race_year_list))
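
The two reads above follow an incremental pattern: the first filter finds which seasons the incoming file touches, and the second reloads every result for those seasons so that standings are recomputed over complete years. Below is a minimal plain-Python sketch of that selection, with rows as dictionaries instead of a Spark DataFrame. The sample rows and the helper names `distinct_column_values` and `rows_to_reprocess` are hypothetical. `df_column_to_list()` is not shown in this notebook, so its behavior is assumed here to be "distinct values of one column".

```python
# Hypothetical sample rows standing in for the race_results table.
race_results = [
    {"race_year": 2020, "team": "Mercedes", "points": 25.0, "file_date": "2021-03-21"},
    {"race_year": 2021, "team": "Mercedes", "points": 25.0, "file_date": "2021-03-28"},
    {"race_year": 2021, "team": "Red Bull", "points": 18.0, "file_date": "2021-04-18"},
]


def distinct_column_values(rows, column_name):
    """Return the distinct values of one column (assumed role of df_column_to_list)."""
    return sorted({row[column_name] for row in rows})


def rows_to_reprocess(rows, file_date):
    """Select every row whose race_year appears in the given file_date."""
    # Step 1: keep only the rows delivered by this file and collect their years.
    incoming = [row for row in rows if row["file_date"] == file_date]
    years = distinct_column_values(incoming, "race_year")
    # Step 2: reload all rows for those years, whatever file they arrived in.
    return [row for row in rows if row["race_year"] in years]


selected = rows_to_reprocess(race_results, "2021-04-18")
```

In this sketch, processing the `2021-04-18` file also selects the 2021 row that arrived on `2021-03-28`, because both belong to the same season.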

In [None]:
from pyspark.sql.functions import sum, when, count, col

constructor_standings_df = race_results_df \
.groupBy("race_year", "team") \
.agg(sum("points").alias("total_points"),
     count(when(col("position") == 1, True)).alias("wins")).sort(col('wins').desc()) 
# when() without otherwise() returns null whenever position != 1, and count() ignores nulls,
# so "wins" counts only the rows where position == 1
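
To make the aggregation above concrete, here is a minimal plain-Python sketch of the same idea: group by `(race_year, team)`, sum the points, and count a win only when `position == 1`. The sample rows and the function name `constructor_standings` are hypothetical. It is an illustration of the logic under those assumptions, not the Spark implementation.

```python
from collections import defaultdict

# Hypothetical rows; position is None when no finishing position was recorded.
rows = [
    {"race_year": 2020, "team": "Mercedes", "points": 25.0, "position": 1},
    {"race_year": 2020, "team": "Mercedes", "points": 18.0, "position": 2},
    {"race_year": 2020, "team": "Red Bull", "points": 15.0, "position": 3},
    {"race_year": 2020, "team": "Red Bull", "points": 0.0, "position": None},
]


def constructor_standings(rows):
    """Aggregate total_points and wins per (race_year, team)."""
    totals = defaultdict(lambda: {"total_points": 0.0, "wins": 0})
    for row in rows:
        key = (row["race_year"], row["team"])
        totals[key]["total_points"] += row["points"]
        # Mirrors count(when(position == 1, True)): other rows yield null
        # in Spark and are therefore not counted.
        if row["position"] == 1:
            totals[key]["wins"] += 1
    return dict(totals)


standings = constructor_standings(rows)
```

Rows with a missing position do not raise an error and do not add to `wins`, which is the same outcome the `count(when(...))` expression produces.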

In [None]:
constructor_standings_df.show(truncate=False)

+---------+--------+------------+----+
|race_year|team    |total_points|wins|
+---------+--------+------------+----+
|2016     |Mercedes|765.0       |19  |
|2014     |Mercedes|701.0       |16  |
|2015     |Mercedes|703.0       |16  |
|1988     |McLaren |199.0       |15  |
|2019     |Mercedes|739.0       |15  |
|2004     |Ferrari |262.0       |15  |
|2002     |Ferrari |221.0       |15  |
|2013     |Red Bull|596.0       |13  |
|2020     |Mercedes|573.0       |13  |
|2011     |Red Bull|650.0       |12  |
|2017     |Mercedes|668.0       |12  |
|1984     |McLaren |143.5       |12  |
|1996     |Williams|175.0       |12  |
|2018     |Mercedes|655.0       |11  |
|1995     |Benetton|147.0       |11  |
|1989     |McLaren |141.0       |10  |
|2005     |McLaren |182.0       |10  |
|1992     |Williams|164.0       |10  |
|1993     |Williams|168.0       |10  |
|2000     |Ferrari |170.0       |10  |
+---------+--------+------------+----+
only showing top 20 rows



In [None]:
constructor_standings_df.filter("race_year = 2020").show(truncate=False)

+---------+------------+------------+----+
|race_year|team        |total_points|wins|
+---------+------------+------------+----+
|2020     |Mercedes    |573.0       |13  |
|2020     |Red Bull    |319.0       |2   |
|2020     |AlphaTauri  |107.0       |1   |
|2020     |Racing Point|210.0       |1   |
|2020     |Haas F1 Team|3.0         |0   |
|2020     |McLaren     |202.0       |0   |
|2020     |Ferrari     |131.0       |0   |
|2020     |Williams    |0.0         |0   |
|2020     |Alfa Romeo  |8.0         |0   |
|2020     |Renault     |181.0       |0   |
+---------+------------+------------+----+



In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rank

constructor_rank_spec = Window.partitionBy("race_year").orderBy(desc("total_points"), desc("wins"))
final_df = constructor_standings_df.withColumn("rank", rank().over(constructor_rank_spec))
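
`rank()` gives tied rows the same rank and then skips ahead, which explains the repeated values in the outputs of this notebook. The plain-Python sketch below reproduces that behavior for a single partition, using the 1960 rows that appear in the SQL output of this notebook. The function name `rank_with_gaps` is hypothetical.

```python
# 1960 rows as (team, total_points, wins), taken from the SQL output in this notebook.
standings_1960 = [
    ("Cooper-Climax", 102.0, 6),
    ("Team Lotus", 52.0, 2),
    ("Ferrari", 43.0, 1),
    ("Watson", 14.0, 1),
    ("BRM", 8.0, 0),
    ("Epperly", 4.0, 0),
    ("Phillips", 3.0, 0),
    ("Cooper-Maserati", 3.0, 0),
    ("Cooper-Castellotti", 3.0, 0),
    ("Lesovsky", 2.0, 0),
]


def rank_with_gaps(rows):
    """Rank by total_points desc, then wins desc; ties share a rank and leave a gap."""
    ordered = sorted(rows, key=lambda row: (-row[1], -row[2]))
    ranks = {}
    previous_key = None
    current_rank = 0
    for index, (team, points, wins) in enumerate(ordered, start=1):
        key = (points, wins)
        if key != previous_key:
            # A new (points, wins) pair takes its 1-based position as its rank.
            current_rank = index
            previous_key = key
        ranks[team] = current_rank
    return ranks


ranks = rank_with_gaps(standings_1960)
```

The three teams on 3.0 points and 0 wins all receive rank 7, and Lesovsky receives rank 10, matching the table. If consecutive ranks without gaps were preferred, `dense_rank()` from `pyspark.sql.functions` would assign Lesovsky rank 8 instead.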

In [None]:
final_df.show(truncate=False)

+---------+------------+------------+----+----+
|race_year|team        |total_points|wins|rank|
+---------+------------+------------+----+----+
|1950     |Alfa Romeo  |89.0        |6   |1   |
|1950     |Ferrari     |21.0        |0   |2   |
|1950     |Talbot-Lago |20.0        |0   |3   |
|1950     |Kurtis Kraft|14.0        |1   |4   |
|1950     |Maserati    |11.0        |0   |5   |
|1950     |Deidt       |10.0        |0   |6   |
|1950     |Simca       |3.0         |0   |7   |
|1950     |Rae         |0.0         |0   |8   |
|1950     |Langley     |0.0         |0   |8   |
|1950     |Lesovsky    |0.0         |0   |8   |
|1950     |Ewing       |0.0         |0   |8   |
|1950     |Stevens     |0.0         |0   |8   |
|1950     |Marchese    |0.0         |0   |8   |
|1950     |Nichels     |0.0         |0   |8   |
|1950     |Wetteroth   |0.0         |0   |8   |
|1950     |Watson      |0.0         |0   |8   |
|1950     |Olson       |0.0         |0   |8   |
|1950     |Adams       |0.0         |0  

In [None]:
final_df.filter("race_year = 2020").show(truncate=False)

+---------+------------+------------+----+----+
|race_year|team        |total_points|wins|rank|
+---------+------------+------------+----+----+
|2020     |Mercedes    |573.0       |13  |1   |
|2020     |Red Bull    |319.0       |2   |2   |
|2020     |Racing Point|210.0       |1   |3   |
|2020     |McLaren     |202.0       |0   |4   |
|2020     |Renault     |181.0       |0   |5   |
|2020     |Ferrari     |131.0       |0   |6   |
|2020     |AlphaTauri  |107.0       |1   |7   |
|2020     |Alfa Romeo  |8.0         |0   |8   |
|2020     |Haas F1 Team|3.0         |0   |9   |
|2020     |Williams    |0.0         |0   |10  |
+---------+------------+------------+----+----+



#### Write the data to the **presentation** container in ADLS as a **Delta** table and create the **constructor_standings** table in the **f1_presentation** database

In [None]:
# The "merge_delta_data()" function is defined in the "common_functions" notebook
# It becomes available after running %run "../includes/common_functions"
merge_condition = "tgt.race_year = src.race_year AND tgt.team = src.team"
merge_delta_data(final_df, 'f1_presentation', 'constructor_standings', presentation_folder_path, merge_condition, 'race_year')
#merge_delta_data(final_df, 'f1_presentation', 'constructor_standings', '/mnt/formula1dl/presentation', merge_condition, 'race_year')
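
The body of `merge_delta_data()` is not shown in this notebook. Assuming it performs a Delta Lake merge that updates matched rows and inserts unmatched ones (for example via `whenMatchedUpdateAll()` and `whenNotMatchedInsertAll()`), the effect of the merge condition on `race_year` and `team` can be sketched in plain Python as an upsert keyed on those two columns. The function name `upsert` and the target values are hypothetical.

```python
def upsert(target, source_rows):
    """Merge source rows into a copy of target, keyed on (race_year, team)."""
    merged = dict(target)
    for row in source_rows:
        key = (row["race_year"], row["team"])
        # Key already present: the row is replaced (update).
        # Key absent: the row is added (insert).
        merged[key] = row
    return merged


# Hypothetical existing table content from an earlier run.
target = {
    (2020, "Mercedes"): {"race_year": 2020, "team": "Mercedes",
                         "total_points": 548.0, "wins": 12, "rank": 1},
}

# Recomputed standings for the reprocessed season.
source_rows = [
    {"race_year": 2020, "team": "Mercedes", "total_points": 573.0, "wins": 13, "rank": 1},
    {"race_year": 2020, "team": "Red Bull", "total_points": 319.0, "wins": 2, "rank": 2},
]

result = upsert(target, source_rows)
```

Under this assumption, rerunning the notebook for the same `p_file_date` leaves the table unchanged, because every recomputed row simply overwrites its earlier version.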

In [None]:
spark.table("f1_presentation.constructor_standings").show(truncate=False)

+---------+------------+------------+----+----+
|race_year|team        |total_points|wins|rank|
+---------+------------+------------+----+----+
|1950     |Alfa Romeo  |89.0        |6   |1   |
|1950     |Ferrari     |21.0        |0   |2   |
|1950     |Talbot-Lago |20.0        |0   |3   |
|1950     |Kurtis Kraft|14.0        |1   |4   |
|1950     |Maserati    |11.0        |0   |5   |
|1950     |Deidt       |10.0        |0   |6   |
|1950     |Simca       |3.0         |0   |7   |
|1950     |Rae         |0.0         |0   |8   |
|1950     |Langley     |0.0         |0   |8   |
|1950     |Lesovsky    |0.0         |0   |8   |
|1950     |Ewing       |0.0         |0   |8   |
|1950     |Stevens     |0.0         |0   |8   |
|1950     |Marchese    |0.0         |0   |8   |
|1950     |Nichels     |0.0         |0   |8   |
|1950     |Wetteroth   |0.0         |0   |8   |
|1950     |Watson      |0.0         |0   |8   |
|1950     |Olson       |0.0         |0   |8   |
|1950     |Adams       |0.0         |0  

In [None]:
%sql
SELECT * FROM f1_presentation.constructor_standings;

race_year,team,total_points,wins,rank
1960,Cooper-Climax,102.0,6,1
1960,Team Lotus,52.0,2,2
1960,Ferrari,43.0,1,3
1960,Watson,14.0,1,4
1960,BRM,8.0,0,5
1960,Epperly,4.0,0,6
1960,Phillips,3.0,0,7
1960,Cooper-Maserati,3.0,0,7
1960,Cooper-Castellotti,3.0,0,7
1960,Lesovsky,2.0,0,10
