### **Step 2.2 - Ingesting the "races.csv" file**

Widgets let us define notebook parameters and supply their values at run time.

<center><img src="https://i.postimg.cc/RZyMDFYp/db69.png"></center>

In [None]:
dbutils.widgets.text("p_data_source", "")
v_data_source = dbutils.widgets.get("p_data_source")

In [None]:
v_data_source

Out[2]: 'testing'

In [None]:
dbutils.widgets.text("p_file_date", "2023-06-11")
v_file_date = dbutils.widgets.get("p_file_date")

In [None]:
v_file_date

Out[4]: '2023-06-11'
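
Widget values always come back as plain strings, so a mistyped date would only surface later, when it is used. The following is a minimal sketch of a fail-fast check, assuming the date always follows the `yyyy-MM-dd` form of the default shown above. `parse_file_date` is a hypothetical helper and is not part of the notebook.

```python
from datetime import date, datetime


def parse_file_date(value: str) -> date:
    """Parse a widget value such as '2023-06-11'; raise ValueError otherwise."""
    return datetime.strptime(value.strip(), "%Y-%m-%d").date()


# Hypothetical usage with the default value shown above.
file_date = parse_file_date("2023-06-11")
print(file_date.isoformat())
```

In the notebook this could be called right after `dbutils.widgets.get("p_file_date")`, so that a malformed value stops the run before any data is read.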

In [None]:
%run "../includes/configuration"

In [None]:
%run "../includes/common_functions"

#### Step 1 - Read the CSV file

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

In [None]:
races_schema = StructType(fields=[StructField("raceId", IntegerType(), False),
                                  StructField("year", IntegerType(), True),
                                  StructField("round", IntegerType(), True),
                                  StructField("circuitId", IntegerType(), True),
                                  StructField("name", StringType(), True),
                                  StructField("date", DateType(), True),
                                  StructField("time", StringType(), True),
                                  StructField("url", StringType(), True) 
])

In [None]:
# "raw_folder_path" is defined in the "configuration" notebook
# "v_file_date" is defined in this notebook; its value is supplied at run time
races_df = spark.read \
.option("header", True) \
.schema(races_schema) \
.csv(f"{raw_folder_path}/races.csv")
#.csv(f"{raw_folder_path}/{v_file_date}/races.csv")
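
The two `.csv(...)` lines above differ only in whether the file date is part of the path. One way to make that choice explicit is sketched below. `build_raw_path` is a hypothetical helper, and the `/mnt/formula1dl/raw` value is an assumed stand-in for `raw_folder_path`, whose real value lives in the "configuration" notebook.

```python
from typing import Optional


def build_raw_path(raw_folder_path: str, file_name: str,
                   file_date: Optional[str] = None) -> str:
    """Return '<raw>/<file>' or, when a date is given, '<raw>/<date>/<file>'."""
    parts = [raw_folder_path.rstrip("/")]
    if file_date:
        # A dated sub-folder, as in the commented-out read above.
        parts.append(file_date)
    parts.append(file_name)
    return "/".join(parts)


# Assumed folder value, for illustration only.
full_load_path = build_raw_path("/mnt/formula1dl/raw", "races.csv")
dated_path = build_raw_path("/mnt/formula1dl/raw", "races.csv", "2023-06-11")
print(full_load_path)
print(dated_path)
```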

In [None]:
races_df.show(truncate=False)

+------+----+-----+---------+---------------------+----------+--------+-------------------------------------------------------+
|raceId|year|round|circuitId|name                 |date      |time    |url                                                    |
+------+----+-----+---------+---------------------+----------+--------+-------------------------------------------------------+
|1     |2009|1    |1        |Australian Grand Prix|2009-03-29|06:00:00|http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix|
|2     |2009|2    |2        |Malaysian Grand Prix |2009-04-05|09:00:00|http://en.wikipedia.org/wiki/2009_Malaysian_Grand_Prix |
|3     |2009|3    |17       |Chinese Grand Prix   |2009-04-19|07:00:00|http://en.wikipedia.org/wiki/2009_Chinese_Grand_Prix   |
|4     |2009|4    |3        |Bahrain Grand Prix   |2009-04-26|12:00:00|http://en.wikipedia.org/wiki/2009_Bahrain_Grand_Prix   |
|5     |2009|5    |4        |Spanish Grand Prix   |2009-05-10|12:00:00|http://en.wikipedia.org/wiki/2009

#### Step 2 - Add the "ingestion_date" and "race_timestamp" columns

In [None]:
from pyspark.sql.functions import to_timestamp, concat, col, lit

In [None]:
races_with_timestamp_df = races_df.withColumn("race_timestamp", to_timestamp(concat(col('date'), lit(' '), col('time')), 'yyyy-MM-dd HH:mm:ss')) \
                                  .withColumn("data_source", lit(v_data_source)) \
                                  .withColumn("file_date", lit(v_file_date))

In [None]:
races_with_timestamp_df.show(truncate=False)

+------+----+-----+---------+---------------------+----------+--------+-------------------------------------------------------+-------------------+-----------+----------+
|raceId|year|round|circuitId|name                 |date      |time    |url                                                    |race_timestamp     |data_source|file_date |
+------+----+-----+---------+---------------------+----------+--------+-------------------------------------------------------+-------------------+-----------+----------+
|1     |2009|1    |1        |Australian Grand Prix|2009-03-29|06:00:00|http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix|2009-03-29 06:00:00|testing    |2023-06-11|
|2     |2009|2    |2        |Malaysian Grand Prix |2009-04-05|09:00:00|http://en.wikipedia.org/wiki/2009_Malaysian_Grand_Prix |2009-04-05 09:00:00|testing    |2023-06-11|
|3     |2009|3    |17       |Chinese Grand Prix   |2009-04-19|07:00:00|http://en.wikipedia.org/wiki/2009_Chinese_Grand_Prix   |2009-04-19 07:00:0
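
To make the `race_timestamp` expression concrete, here is a plain-Python sketch of what it computes for a single row: the date and time are joined with a space and parsed with the equivalent of `yyyy-MM-dd HH:mm:ss`. It is an illustration, not the Spark implementation. Returning `None` on a failed parse mirrors Spark's behaviour only when ANSI mode is off, and the `\N` value is a hypothetical placeholder for a missing time.

```python
from datetime import datetime
from typing import Optional


def combine_date_time(date_str: Optional[str],
                      time_str: Optional[str]) -> Optional[datetime]:
    """Join date and time with a space and parse; return None if that fails."""
    if date_str is None or time_str is None:
        # concat() yields null as soon as one input is null.
        return None
    try:
        return datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


# First row of the output above.
print(combine_date_time("2009-03-29", "06:00:00"))
# Hypothetical placeholder for a missing time.
print(combine_date_time("2009-03-29", "\\N"))
```

If the source data can contain such placeholders, it is worth counting the null `race_timestamp` values after this step.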

In [None]:
# The "add_ingestion_date()" function is defined in the "common_functions" notebook
races_with_ingestion_date_df = add_ingestion_date(races_with_timestamp_df)

In [None]:
races_with_ingestion_date_df.show(truncate=False)

+------+----+-----+---------+---------------------+----------+--------+-------------------------------------------------------+-------------------+-----------+----------+-----------------------+
|raceId|year|round|circuitId|name                 |date      |time    |url                                                    |race_timestamp     |data_source|file_date |ingestion_date         |
+------+----+-----+---------+---------------------+----------+--------+-------------------------------------------------------+-------------------+-----------+----------+-----------------------+
|1     |2009|1    |1        |Australian Grand Prix|2009-03-29|06:00:00|http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix|2009-03-29 06:00:00|testing    |2023-06-11|2023-06-11 13:36:52.583|
|2     |2009|2    |2        |Malaysian Grand Prix |2009-04-05|09:00:00|http://en.wikipedia.org/wiki/2009_Malaysian_Grand_Prix |2009-04-05 09:00:00|testing    |2023-06-11|2023-06-11 13:36:52.583|
|3     |2009|3    |17    

#### Step 3 - Select only the required columns and rename them as needed

In [None]:
races_selected_df = races_with_ingestion_date_df.select(
    col('raceId').alias('race_id'), col('year').alias('race_year'), col('round'),
    col('circuitId').alias('circuit_id'), col('name'), col('ingestion_date'), col('race_timestamp'))

In [None]:
races_selected_df.show(truncate=False)

+-------+---------+-----+----------+---------------------+-----------------------+-------------------+
|race_id|race_year|round|circuit_id|name                 |ingestion_date         |race_timestamp     |
+-------+---------+-----+----------+---------------------+-----------------------+-------------------+
|1      |2009     |1    |1         |Australian Grand Prix|2023-06-11 13:36:53.317|2009-03-29 06:00:00|
|2      |2009     |2    |2         |Malaysian Grand Prix |2023-06-11 13:36:53.317|2009-04-05 09:00:00|
|3      |2009     |3    |17        |Chinese Grand Prix   |2023-06-11 13:36:53.317|2009-04-19 07:00:00|
|4      |2009     |4    |3         |Bahrain Grand Prix   |2023-06-11 13:36:53.317|2009-04-26 12:00:00|
|5      |2009     |5    |4         |Spanish Grand Prix   |2023-06-11 13:36:53.317|2009-05-10 12:00:00|
|6      |2009     |6    |6         |Monaco Grand Prix    |2023-06-11 13:36:53.317|2009-05-24 12:00:00|
|7      |2009     |7    |5         |Turkish Grand Prix   |2023-06-11 13:3

#### Step 4 - Write the data to the **processed** container in ADLS as **parquet**

In [None]:
# "processed_folder_path" is defined in the "configuration" notebook
races_selected_df.write.mode('overwrite').parquet(f"{processed_folder_path}/races")

In [None]:
# List the objects in the "races" directory
%fs
ls /mnt/formula1dl/processed/races

In [None]:
# We can partition the output at write time by the distinct years in the "race_year" column
# "processed_folder_path" is defined in the "configuration" notebook
races_selected_df.write.mode('overwrite').partitionBy('race_year').parquet(f"{processed_folder_path}/races")

<center><img src="https://images2.imgbox.com/6b/e6/5V8WqxvV_o.png"></center> <!--db60-->
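
The folder layout that `partitionBy('race_year')` produces can be imitated in plain Python. The sketch below is an illustration only: it writes small CSV files instead of parquet, uses a simplified file name, and the 2010 row is made up. What it keeps is the idea of one `race_year=<value>` sub-directory per distinct year, with the partition column stored in the directory name rather than in the data files.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, base_dir, partition_col):
    """Group rows by partition_col and write each group under '<col>=<value>/'."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)
    base = Path(base_dir)
    for value, group in groups.items():
        part_dir = base / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition column is left out of the file contents.
        lines = [
            ",".join(str(v) for k, v in r.items() if k != partition_col)
            for r in group
        ]
        (part_dir / "part-00000.csv").write_text("\n".join(lines) + "\n")
    return sorted(p.name for p in base.iterdir())


rows = [
    {"race_id": 1, "race_year": 2009, "name": "Australian Grand Prix"},
    {"race_id": 2, "race_year": 2009, "name": "Malaysian Grand Prix"},
    {"race_id": 99, "race_year": 2010, "name": "Hypothetical Grand Prix"},  # made-up row
]
with tempfile.TemporaryDirectory() as tmp:
    partition_dirs = write_partitioned(rows, tmp, "race_year")
    print(partition_dirs)
```

A layout like this lets a reader that filters on `race_year` skip whole directories instead of scanning every file.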

In [None]:
dbutils.notebook.exit("Success")