### Ingesting Multiple Files: CSV and Multi-Line JSON

* Ingesting multiple CSV files
* Ingesting multiple multi-line JSON files


### Ingest lap_times folder

The `lap_times` folder holds one dataset split across several CSV files. This part reads all of them in a single call.

##### Step 1 - Read the CSV files using the Spark DataFrame reader API

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

In [0]:
lap_times_schema = StructType(fields=[StructField("raceId", IntegerType(), False),
                                      StructField("driverId", IntegerType(), True),
                                      StructField("lap", IntegerType(), True),
                                      StructField("position", IntegerType(), True),
                                      StructField("time", StringType(), True),
                                      StructField("milliseconds", IntegerType(), True)
                                     ])
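The same column layout can also be written as a DDL-formatted string, which `DataFrameReader.schema` accepts in place of a `StructType`. The sketch below builds that string from a plain list. `LAP_TIMES_COLUMNS` and `to_ddl` are illustrative names, not part of the notebook, and unlike the `StructType` above this version leaves every column nullable.

```python
# Column names and Spark SQL types mirroring lap_times_schema above.
LAP_TIMES_COLUMNS = [
    ("raceId", "INT"),
    ("driverId", "INT"),
    ("lap", "INT"),
    ("position", "INT"),
    ("time", "STRING"),
    ("milliseconds", "INT"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string, backtick-quoting the names."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


lap_times_ddl = to_ddl(LAP_TIMES_COLUMNS)
print(lap_times_ddl)

# In the notebook, the string could be passed to the reader instead of the StructType:
# lap_times_df = spark.read.schema(lap_times_ddl).csv("/mnt/formula1dl/raw/lap_times")
```

The backticks guard against column names such as `time` or `position` being read as keywords.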

In [0]:
%fs ls /mnt/formula1dl/raw/lap_times

path,name,size,modificationTime
dbfs:/mnt/formula1dl/raw/lap_times/lap_times_split_1.csv,lap_times_split_1.csv,3016498,1703384683000
dbfs:/mnt/formula1dl/raw/lap_times/lap_times_split_2.csv,lap_times_split_2.csv,2959610,1703384684000
dbfs:/mnt/formula1dl/raw/lap_times/lap_times_split_3.csv,lap_times_split_3.csv,2880491,1703384684000
dbfs:/mnt/formula1dl/raw/lap_times/lap_times_split_4.csv,lap_times_split_4.csv,2882624,1703384685000
dbfs:/mnt/formula1dl/raw/lap_times/lap_times_split_5.csv,lap_times_split_5.csv,2806321,1703384685000


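##### Import an entire folder of files

Passing the folder path instead of a single file path makes Spark read every file in that folder into one DataFrame.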

In [0]:
lap_times_df = spark.read \
.schema(lap_times_schema) \
.csv("/mnt/formula1dl/raw/lap_times")
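If the folder ever contains files that should not be ingested, the path given to the reader can be a glob pattern rather than the bare folder. The sketch below previews which names a pattern would select, using Python's `fnmatch` as an approximation of Spark's path globbing for simple patterns. `preview_glob`, `read_matching_csv`, and the stray file `notes.txt` are hypothetical and only serve the illustration.

```python
from fnmatch import fnmatch

# File names taken from the folder listing above, plus one made-up stray file.
LAP_TIME_FILES = [f"lap_times_split_{i}.csv" for i in range(1, 6)]
FOLDER_CONTENTS = LAP_TIME_FILES + ["notes.txt"]  # "notes.txt" is hypothetical


def preview_glob(names, pattern):
    """Return the file names that a glob-style pattern would select."""
    return [name for name in names if fnmatch(name, pattern)]


def read_matching_csv(spark, schema, folder, pattern):
    """Hypothetical helper: read only the files in `folder` whose names match `pattern`."""
    return spark.read.schema(schema).csv(f"{folder}/{pattern}")


print(preview_glob(FOLDER_CONTENTS, "lap_times_split*.csv"))
```

In the notebook, `read_matching_csv(spark, lap_times_schema, "/mnt/formula1dl/raw/lap_times", "lap_times_split*.csv")` would then read the five split files and skip anything else.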

In [0]:
lap_times_df.count()

Out[20]: 490904

In [0]:
lap_times_df.limit(5).show()

+------+--------+---+--------+--------+------------+
|raceId|driverId|lap|position|    time|milliseconds|
+------+--------+---+--------+--------+------------+
|   841|      20|  1|       1|1:38.109|       98109|
|   841|      20|  2|       1|1:33.006|       93006|
|   841|      20|  3|       1|1:32.713|       92713|
|   841|      20|  4|       1|1:32.803|       92803|
|   841|      20|  5|       1|1:32.342|       92342|
+------+--------+---+--------+--------+------------+
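In the sample rows, `time` and `milliseconds` carry the same lap time in two forms. A small pure-Python sketch of the conversion makes the relationship explicit. `lap_time_to_ms` is an illustrative helper, and it assumes the `m:ss.SSS` format seen in the sample, so a value with an hour component would not parse.

```python
def lap_time_to_ms(lap_time: str) -> int:
    """Convert a lap time such as '1:38.109' (m:ss.SSS) to whole milliseconds."""
    minutes, seconds = lap_time.split(":")
    whole_seconds, _, fraction = seconds.partition(".")
    # Pad or trim the fraction to exactly three digits so '5' means 500 ms.
    millis = int(fraction.ljust(3, "0")[:3])
    return int(minutes) * 60_000 + int(whole_seconds) * 1_000 + millis


# Values copied from the first rows of the sample output above.
sample = {"1:38.109": 98109, "1:33.006": 93006, "1:32.713": 92713}
for text, expected_ms in sample.items():
    assert lap_time_to_ms(text) == expected_ms
```

Integer arithmetic is used throughout to avoid floating-point rounding on values like `38.109`.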



##### Step 2 - Rename columns and add new columns
1. Rename `driverId` and `raceId` to `driver_id` and `race_id`
1. Add `ingestion_date` with the current timestamp

In [0]:
from pyspark.sql.functions import current_timestamp

In [0]:
final_df = lap_times_df.withColumnRenamed("driverId", "driver_id") \
.withColumnRenamed("raceId", "race_id") \
.withColumn("ingestion_date", current_timestamp())
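Because every rename in this notebook follows the same camelCase to snake_case rule, the renames could also be derived from the column names instead of being listed one by one. The following is a minimal sketch of that idea. `to_snake_case` and `rename_all_to_snake_case` are hypothetical helpers, not part of the notebook, and the second one relies only on the DataFrame's `columns` attribute and `withColumnRenamed` method.

```python
import re


def to_snake_case(name: str) -> str:
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def rename_all_to_snake_case(df):
    """Rename every column of `df` to snake_case using withColumnRenamed."""
    for column in df.columns:
        df = df.withColumnRenamed(column, to_snake_case(column))
    return df


print([to_snake_case(c) for c in ["raceId", "driverId", "lap", "milliseconds"]])
```

Columns that are already lowercase, such as `lap`, pass through unchanged, so the helper would be safe to apply to the whole DataFrame.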

In [0]:
final_df.limit(5).show()

The result has the columns `race_id`, `driver_id`, `lap`, `position`, `time`, `milliseconds`, and `ingestion_date`. Part 2 reuses the name `final_df`, so running this cell after the Part 2 cells displays the qualifying data instead.



##### Step 3 - Write the output to the processed container in Parquet format

In [0]:
final_df.write.mode("overwrite").parquet("/mnt/formula1dl/processed/lap_times")

In [0]:
spark.read.parquet('/mnt/formula1dl/processed/lap_times').limit(5).show()

+-------+---------+---+--------+--------+------------+--------------------+
|race_id|driver_id|lap|position|    time|milliseconds|      ingestion_date|
+-------+---------+---+--------+--------+------------+--------------------+
|     67|       14| 26|      13|1:25.802|       85802|2023-12-24 04:19:...|
|     67|       14| 27|      13|1:25.338|       85338|2023-12-24 04:19:...|
|     67|       14| 28|      13|1:25.395|       85395|2023-12-24 04:19:...|
|     67|       14| 29|      12|1:26.191|       86191|2023-12-24 04:19:...|
|     67|       14| 30|      11|1:25.439|       85439|2023-12-24 04:19:...|
+-------+---------+---+--------+--------+------------+--------------------+
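The rows shown here differ from the earlier sample because `limit(5)` without an ordering does not guarantee which rows come back, so eyeballing the output is a weak check. A row-count and column comparison is a more reliable one. The sketch below assumes only the `spark.read.parquet`, `count`, and `columns` calls already used in this notebook. `verify_parquet_write` is a hypothetical helper.

```python
def verify_parquet_write(spark, source_df, path):
    """Re-read `path` and check it has the same columns and row count as `source_df`."""
    written_df = spark.read.parquet(path)

    # Compare column names regardless of order.
    if sorted(written_df.columns) != sorted(source_df.columns):
        raise ValueError(
            f"Column mismatch: wrote {source_df.columns}, read back {written_df.columns}"
        )

    expected, actual = source_df.count(), written_df.count()
    if expected != actual:
        raise ValueError(f"Row count mismatch: expected {expected}, found {actual}")
    return actual


# Possible usage in the notebook:
# verify_parquet_write(spark, final_df, "/mnt/formula1dl/processed/lap_times")
```

For the lap times data, the returned count should match the 490904 rows counted in Step 1.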



### Part 2 - Ingest qualifying folder (multiple JSON files)

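Importing multiple multi-line JSON files
* Set the `multiLine` option to `True`. By default Spark expects one JSON object per line (JSON Lines), so a record spread over several lines is not parsed correctly.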
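The difference between line-delimited and multi-line JSON can be reproduced with Python's standard `json` module, without Spark. The sketch below is only an analogy: the two records reuse values from the qualifying data, the exact layout of the source files is an assumption, and Spark in its default mode typically produces null or corrupt-record rows rather than raising an error.

```python
import json

# Two records with values taken from the qualifying sample (other fields omitted).
records = [{"qualifyId": 1, "raceId": 18}, {"qualifyId": 2, "raceId": 18}]

json_lines = "\n".join(json.dumps(r) for r in records)  # one object per line
multi_line = json.dumps(records, indent=2)              # pretty-printed over many lines


def parse_line_by_line(text):
    """Parse every non-empty line as its own JSON document, like a line-delimited reader."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Line-delimited input parses cleanly one line at a time.
assert parse_line_by_line(json_lines) == records

# The pretty-printed input fails line by line, because no single line is valid JSON.
try:
    parse_line_by_line(multi_line)
    line_by_line_ok = True
except json.JSONDecodeError:
    line_by_line_ok = False

# Read as one whole document (the analogue of multiLine=True), it parses fine.
whole_document = json.loads(multi_line)
print(line_by_line_ok, whole_document == records)
```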

##### Step 1 - Read the JSON files using the Spark DataFrame reader API

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

In [0]:
qualifying_schema = StructType(fields=[StructField("qualifyId", IntegerType(), False),
                                      StructField("raceId", IntegerType(), True),
                                      StructField("driverId", IntegerType(), True),
                                      StructField("constructorId", IntegerType(), True),
                                      StructField("number", IntegerType(), True),
                                      StructField("position", IntegerType(), True),
                                      StructField("q1", StringType(), True),
                                      StructField("q2", StringType(), True),
                                      StructField("q3", StringType(), True),
                                     ])

In [0]:
%fs ls /mnt/formula1dl/raw/qualifying/

path,name,size,modificationTime
dbfs:/mnt/formula1dl/raw/qualifying/__qualifying_split_1.json,__qualifying_split_1.json,368,1703389598000
dbfs:/mnt/formula1dl/raw/qualifying/__qualifying_split_2.json,__qualifying_split_2.json,312,1703389598000


In [0]:
qualifying_df = spark.read \
.schema(qualifying_schema) \
.option("multiLine", True) \
.json("/mnt/formula1dl/raw/qualifying")

qualifying_df.limit(5).show()

+---------+------+--------+-------------+------+--------+--------+--------+--------+
|qualifyId|raceId|driverId|constructorId|number|position|      q1|      q2|      q3|
+---------+------+--------+-------------+------+--------+--------+--------+--------+
|        1|    18|       1|            1|    22|       1|1:26.572|1:25.187|1:26.714|
|        2|    18|       9|            2|     4|       2|1:26.103|1:25.315|1:26.869|
|        3|    18|       5|            1|    23|       3|1:25.664|1:25.452|1:27.079|
|        4|    18|      13|            6|     2|       4|1:25.994|1:25.691|1:27.178|
|        5|    18|       2|            2|     3|       5|1:25.960|1:25.518|1:27.236|
+---------+------+--------+-------------+------+--------+--------+--------+--------+



##### Step 2 - Rename columns and add new columns
1. Rename `qualifyId`, `driverId`, `constructorId` and `raceId`
1. Add `ingestion_date` with the current timestamp

In [0]:
from pyspark.sql.functions import current_timestamp

In [0]:
final_df = qualifying_df.withColumnRenamed("qualifyId", "qualify_id") \
.withColumnRenamed("driverId", "driver_id") \
.withColumnRenamed("raceId", "race_id") \
.withColumnRenamed("constructorId", "constructor_id") \
.withColumn("ingestion_date", current_timestamp())
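With four renames in a row, it can be easier to keep the old-to-new mapping in one dictionary. PySpark 3.4 and later offer `DataFrame.withColumnsRenamed`, which takes such a dictionary. The sketch below uses it when available and otherwise falls back to the chained `withColumnRenamed` calls shown above. `QUALIFYING_RENAMES` and `apply_renames` are illustrative names, not part of the notebook.

```python
# Old name -> new name, matching the renames in the cell above.
QUALIFYING_RENAMES = {
    "qualifyId": "qualify_id",
    "driverId": "driver_id",
    "raceId": "race_id",
    "constructorId": "constructor_id",
}


def apply_renames(df, renames):
    """Rename columns from a dict, using withColumnsRenamed when the DataFrame supports it."""
    if hasattr(df, "withColumnsRenamed"):  # available from PySpark 3.4
        return df.withColumnsRenamed(renames)
    for old_name, new_name in renames.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df


# Possible usage in the notebook:
# final_df = apply_renames(qualifying_df, QUALIFYING_RENAMES) \
#     .withColumn("ingestion_date", current_timestamp())
```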

In [0]:
final_df.limit(5).show()

+----------+-------+---------+--------------+------+--------+--------+--------+--------+--------------------+
|qualify_id|race_id|driver_id|constructor_id|number|position|      q1|      q2|      q3|      ingestion_date|
+----------+-------+---------+--------------+------+--------+--------+--------+--------+--------------------+
|         1|     18|        1|             1|    22|       1|1:26.572|1:25.187|1:26.714|2023-12-24 04:48:...|
|         2|     18|        9|             2|     4|       2|1:26.103|1:25.315|1:26.869|2023-12-24 04:48:...|
|         3|     18|        5|             1|    23|       3|1:25.664|1:25.452|1:27.079|2023-12-24 04:48:...|
|         4|     18|       13|             6|     2|       4|1:25.994|1:25.691|1:27.178|2023-12-24 04:48:...|
|         5|     18|        2|             2|     3|       5|1:25.960|1:25.518|1:27.236|2023-12-24 04:48:...|
+----------+-------+---------+--------------+------+--------+--------+--------+--------+--------------------+



##### Step 3 - Write the output to the processed container in Parquet format

In [0]:
final_df.write.mode("overwrite").parquet("/mnt/formula1dl/processed/qualifying")

In [0]:
spark.read.parquet('/mnt/formula1dl/processed/qualifying').limit(5).show()

+----------+-------+---------+--------------+------+--------+--------+--------+--------+--------------------+
|qualify_id|race_id|driver_id|constructor_id|number|position|      q1|      q2|      q3|      ingestion_date|
+----------+-------+---------+--------------+------+--------+--------+--------+--------+--------------------+
|         1|     18|        1|             1|    22|       1|1:26.572|1:25.187|1:26.714|2023-12-24 04:05:...|
|         2|     18|        9|             2|     4|       2|1:26.103|1:25.315|1:26.869|2023-12-24 04:05:...|
|         3|     18|        5|             1|    23|       3|1:25.664|1:25.452|1:27.079|2023-12-24 04:05:...|
|         4|     18|       13|             6|     2|       4|1:25.994|1:25.691|1:27.178|2023-12-24 04:05:...|
|         5|     18|        2|             2|     3|       5|1:25.960|1:25.518|1:27.236|2023-12-24 04:05:...|
+----------+-------+---------+--------------+------+--------+--------+--------+--------+--------------------+

