## **Transform data to produce organized race results**

In [None]:
# Run the notebook that defines the configuration variables
%run "../utils/configuration"

### Step 1 - Read **drivers** from the **processed** layer

In [None]:
from pyspark.sql.functions import current_timestamp

In [None]:
# The "processed_folder_path" variable is defined in the "configuration" notebook
drivers_df = spark.read.parquet(f"{processed_folder_path}/drivers") \
                       .withColumnRenamed("number", "driver_number") \
                       .withColumnRenamed("name", "driver_name") \
                       .withColumnRenamed("nationality", "driver_nationality")
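The chained `withColumnRenamed` calls used in these read steps can also be driven by a dictionary, which keeps each table's renames in one place. The following is a minimal sketch, not the notebook's own method: `rename_columns` and `DRIVER_RENAMES` are hypothetical names introduced here for illustration, and the helper relies only on the `withColumnRenamed(existing, new)` method of the DataFrame it receives.

```python
from functools import reduce


def rename_columns(df, mapping):
    """Apply each old -> new rename in `mapping` to `df`, in order.

    `df` is expected to be a Spark DataFrame; only its
    `withColumnRenamed(existing, new)` method is used.
    """
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        mapping.items(),
        df,
    )


# Hypothetical mapping that mirrors the renames applied to drivers above.
DRIVER_RENAMES = {
    "number": "driver_number",
    "name": "driver_name",
    "nationality": "driver_nationality",
}

# In the notebook this could be used as follows (not executed here):
# drivers_df = rename_columns(
#     spark.read.parquet(f"{processed_folder_path}/drivers"), DRIVER_RENAMES
# )
```

The same helper would cover the other tables by passing a different mapping, for example `{"name": "team"}` for constructors.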

### Step 2 - Read **constructors** from the **processed** layer

In [None]:
# The "processed_folder_path" variable is defined in the "configuration" notebook
constructors_df = spark.read.parquet(f"{processed_folder_path}/constructors") \
                            .withColumnRenamed("name", "team")

### Step 3 - Read **circuits** from the **processed** layer

In [None]:
# The "processed_folder_path" variable is defined in the "configuration" notebook
circuits_df = spark.read.parquet(f"{processed_folder_path}/circuits") \
                        .withColumnRenamed("location", "circuit_location")

### Step 4 - Read **races** from the **processed** layer

In [None]:
# The "processed_folder_path" variable is defined in the "configuration" notebook
races_df = spark.read.parquet(f"{processed_folder_path}/races") \
                     .withColumnRenamed("name", "race_name") \
                     .withColumnRenamed("race_timestamp", "race_date")

### Step 5 - Read **results** from the **processed** layer

In [None]:
# The "processed_folder_path" variable is defined in the "configuration" notebook
results_df = spark.read.parquet(f"{processed_folder_path}/results") \
                       .withColumnRenamed("time", "race_time")

### Step 6 - Join **circuits** and **races**

In [None]:
race_circuits_df = races_df.join(circuits_df, races_df.circuit_id == circuits_df.circuit_id, "inner") \
                           .select(races_df.race_id, races_df.race_year, races_df.race_name, races_df.race_date, circuits_df.circuit_location)
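As an alternative to the equality expression, `DataFrame.join` also accepts the name of a shared column through `on=`, in which case the result keeps a single `circuit_id` column instead of one per side. The sketch below is only illustrative: `join_races_with_circuits` is a hypothetical helper, and it assumes that the five selected column names are unique across both DataFrames.

```python
def join_races_with_circuits(races_df, circuits_df):
    """Inner-join races to circuits on the shared `circuit_id` column.

    Passing the column name (rather than an equality expression) makes
    Spark keep a single `circuit_id` column in the joined result.
    """
    joined = races_df.join(circuits_df, on="circuit_id", how="inner")
    # Selecting by name assumes none of these names exists on both sides.
    return joined.select(
        "race_id", "race_year", "race_name", "race_date", "circuit_location"
    )


# In the notebook this could be used as follows (not executed here):
# race_circuits_df = join_races_with_circuits(races_df, circuits_df)
```

Referencing columns through their parent DataFrame, as the cell above does, remains the safer choice whenever both sides share column names other than the join key.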

### Step 7 - Join the **previous result** with **results**, **drivers** and **constructors**

In [None]:
# No join type is given, so each join uses Spark's default (inner)
race_results_df = results_df.join(race_circuits_df, results_df.race_id == race_circuits_df.race_id) \
                            .join(drivers_df, results_df.driver_id == drivers_df.driver_id) \
                            .join(constructors_df, results_df.constructor_id == constructors_df.constructor_id)
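Because these joins use equality expressions, the join keys from both sides stay in the result, so names such as `race_id` can appear more than once, and selecting a repeated name by string would be ambiguous. A small, hedged sketch for spotting repeated names before selecting follows; `duplicate_columns` is a hypothetical helper and the example list is illustrative, not the real schema.

```python
from collections import Counter


def duplicate_columns(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in dict.fromkeys(columns) if counts[name] > 1]


# Illustrative column list (an assumption, not the real schema).
example_columns = ["race_id", "driver_id", "points", "race_id", "driver_id", "team"]
print(duplicate_columns(example_columns))  # ['race_id', 'driver_id']
```

In the notebook it could be called as `duplicate_columns(race_results_df.columns)`; any name it returns should be referenced through its source DataFrame rather than by string.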

In [None]:
final_df = race_results_df.select("race_year", "race_name", "race_date", "circuit_location", "driver_name", "driver_number", "driver_nationality",
                                 "team", "grid", "fastest_lap", "race_time", "points", "position") \
                          .withColumn("created_date", current_timestamp())

### Step 8 - Write the data to the data lake as **parquet** and create the **race_results** table in the **f1_presentation** database

In [None]:
# Write the data in Parquet format to the "race_results" table in the "f1_presentation" database
final_df.write.mode("overwrite").format("parquet").saveAsTable("f1_presentation.race_results")
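After the write, a quick read-back can confirm that the table exposes the expected columns. This is a minimal sketch under stated assumptions: `missing_columns` is a hypothetical helper, and it uses only `spark.table(name)` and the `columns` attribute of the returned DataFrame.

```python
def missing_columns(spark, table_name, expected_columns):
    """Return the expected columns that are absent from a saved table."""
    actual = set(spark.table(table_name).columns)
    return [name for name in expected_columns if name not in actual]


# In the notebook this could be used as follows (not executed here):
# missing = missing_columns(spark, "f1_presentation.race_results", final_df.columns)
# assert not missing, f"Columns missing from the saved table: {missing}"
```

An empty list means every column of `final_df` is present in the saved table. The check does not compare data types or row counts.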