### Step 1 - Start the Spark session and load configurations and common functions

In [1]:
%run "../includes/configurations"

In [2]:
%run "../includes/common_functions"

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Initialize a Spark session
spark = SparkSession.builder.appName("F1Database")\
    .config("spark.sql.warehouse.dir", "hive/")\
    .enableHiveSupport()\
    .getOrCreate()

24/01/02 16:45:26 WARN Utils: Your hostname, falcao-sys resolves to a loopback address: 127.0.1.1; using 192.168.11.185 instead (on interface wlx7898e8c12476)
24/01/02 16:45:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/02 16:45:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
#spark.stop()

### Step 2 - Create F1 Database

In [5]:
# Create the database if it doesn't exist
spark.sql("CREATE DATABASE IF NOT EXISTS f1_raw")

# Switch to the newly created database
spark.sql("USE f1_raw")

24/01/02 16:45:31 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/01/02 16:45:31 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/01/02 16:45:34 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
24/01/02 16:45:34 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore giu@127.0.1.1
24/01/02 16:45:34 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException


DataFrame[]
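
Because `CREATE DATABASE IF NOT EXISTS` succeeds silently whether or not the database already existed, a quick catalog check can confirm that the session is pointing at the intended database. The helper below is a minimal sketch; the name `database_ready` is hypothetical, and it assumes the `spark` session created in Step 1 is passed in.

```python
def database_ready(spark, db_name="f1_raw"):
    """Return True if db_name exists and is the session's current database."""
    # Catalog.listDatabases() returns objects that expose a .name attribute.
    existing = {db.name for db in spark.catalog.listDatabases()}
    # Catalog.currentDatabase() reflects the most recent USE statement.
    return db_name in existing and spark.catalog.currentDatabase() == db_name
```

In the notebook this would be called as `database_ready(spark)` right after the `USE f1_raw` statement.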

### Step 3 - Create Circuits table and populate it

In [6]:
spark.sql("DROP TABLE IF EXISTS f1_raw.circuits")

spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_raw.circuits (
    circuitId INT,
    circuitRef STRING,
    name STRING,
    location STRING,
    country STRING,
    lat DOUBLE,
    lng DOUBLE,
    alt INT,
    url STRING
  )
  USING csv
  OPTIONS (path '../../data/circuits.csv', header 'true')
""")

24/01/02 16:45:36 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `spark_catalog`.`f1_raw`.`circuits` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
24/01/02 16:45:36 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
24/01/02 16:45:36 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/01/02 16:45:36 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/01/02 16:45:36 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist


+---------+--------------+--------------------+------------+---------+--------+---------+---+--------------------+
|circuitId|    circuitRef|                name|    location|  country|     lat|      lng|alt|                 url|
+---------+--------------+--------------------+------------+---------+--------+---------+---+--------------------+
|        1|   albert_park|Albert Park Grand...|   Melbourne|Australia|-37.8497|  144.968| 10|http://en.wikiped...|
|        2|        sepang|Sepang Internatio...|Kuala Lumpur| Malaysia| 2.76083|  101.738| 18|http://en.wikiped...|
|        3|       bahrain|Bahrain Internati...|      Sakhir|  Bahrain| 26.0325|  50.5106|  7|http://en.wikiped...|
|        4|     catalunya|Circuit de Barcel...|    Montmeló|    Spain|   41.57|  2.26111|109|http://en.wikiped...|
|        5|      istanbul|       Istanbul Park|    Istanbul|   Turkey| 40.9517|   29.405|130|http://en.wikiped...|
|        6|        monaco|   Circuit de Monaco| Monte-Carlo|   Monaco| 43.7347| 

### Step 4 - Create Races table and populate it

In [7]:
spark.sql("DROP TABLE IF EXISTS f1_raw.races")

spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_raw.races(
    raceId INT,
    year INT,
    round INT,
    circuitId INT,
    name STRING,
    date DATE,
    time STRING,
    url STRING
  )
  USING csv
  OPTIONS (path "../../data/races.csv", header true)
""")

24/01/02 16:45:39 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `spark_catalog`.`f1_raw`.`races` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.


DataFrame[]
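
Every step in this notebook repeats the same drop-then-create pattern with a different column list, format, and path. One possible way to reduce the repetition is a small helper that builds the statement from its parts. This is only a sketch: `build_table_ddl` and `recreate_table` are hypothetical names, not functions from `common_functions`, and the quoting is naive (a path containing a single quote would break it).

```python
def build_table_ddl(table, columns, fmt, path, options=None):
    """Build a CREATE TABLE ... USING ... OPTIONS statement.

    columns is a list of (name, type) pairs; options holds extra
    data source options such as {"header": "true"}.
    """
    cols = ",\n    ".join(f"{name} {dtype}" for name, dtype in columns)
    # The path always comes first, followed by any extra options.
    opts = {"path": path, **(options or {})}
    opts_sql = ", ".join(f"{key} '{value}'" for key, value in opts.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n    {cols}\n)\n"
        f"USING {fmt}\n"
        f"OPTIONS ({opts_sql})"
    )


def recreate_table(spark, table, columns, fmt, path, options=None):
    """Drop the table if it exists, then register it again."""
    spark.sql(f"DROP TABLE IF EXISTS {table}")
    spark.sql(build_table_ddl(table, columns, fmt, path, options))
```

With this in place, the races cell could be reduced to a single `recreate_table(spark, "f1_raw.races", [...], "csv", "../../data/races.csv", {"header": "true"})` call.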

### Step 5 - Create Constructors table and populate it

In [8]:
spark.sql("DROP TABLE IF EXISTS f1_raw.constructors")

spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_raw.constructors(
    constructorId INT,
    constructorRef STRING,
    name STRING,
    nationality STRING,
    url STRING
  )
  USING json
  OPTIONS(path "../../data/constructors.json")
""")

24/01/02 16:45:39 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source table `spark_catalog`.`f1_raw`.`constructors` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.


DataFrame[]

### Step 6 - Create Drivers table and populate it

In [9]:
spark.sql("DROP TABLE IF EXISTS f1_raw.drivers")

spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_raw.drivers(
    driverId INT,
    driverRef STRING,
    number INT,
    code STRING,
    name STRUCT<forename: STRING, surname: STRING>,
    dob DATE,
    nationality STRING,
    url STRING
  )
  USING json
  OPTIONS (path "../../data/drivers.json")
""")

24/01/02 16:45:39 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source table `spark_catalog`.`f1_raw`.`drivers` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.


DataFrame[]
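
The `name` column is declared as a `STRUCT`, so its fields are reached with dot notation in Spark SQL. The sketch below shows one way to flatten it when querying; the constant and function names are hypothetical, and the `full_name` column is an illustrative addition that is not part of the table definition.

```python
FLATTEN_DRIVERS_SQL = """
SELECT
  driverId,
  name.forename AS forename,
  name.surname AS surname,
  concat(name.forename, ' ', name.surname) AS full_name
FROM f1_raw.drivers
"""


def flatten_drivers(spark):
    """Return the drivers table with the nested name struct flattened."""
    return spark.sql(FLATTEN_DRIVERS_SQL)
```

Calling `flatten_drivers(spark).show(5)` in the notebook would preview the flattened columns.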

### Step 7 - Create Results table and populate it

In [10]:
spark.sql("DROP TABLE IF EXISTS f1_raw.results")

spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_raw.results(
    resultId INT,
    raceId INT,
    driverId INT,
    constructorId INT,
    number INT,
    grid INT,
    position INT,
    positionText STRING,
    positionOrder INT,
    points FLOAT,
    laps INT,
    time STRING,
    milliseconds INT,
    fastestLap INT,
    rank INT,
    fastestLapTime STRING,
    fastestLapSpeed FLOAT,
    statusId STRING
  )
  USING json
  OPTIONS(path "../../data/results.json")
""")

24/01/02 16:45:39 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source table `spark_catalog`.`f1_raw`.`results` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.


DataFrame[]
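
The declared column types are applied when the files are read, not when the table is created, so a type that does not match the data may only show up later as unexpected NULLs. A simple check, sketched below under that assumption, counts NULLs in a given column; `null_count_sql` is a hypothetical helper name.

```python
def null_count_sql(table, column):
    """Build a query that counts rows where column is NULL."""
    return f"SELECT count(*) AS null_rows FROM {table} WHERE {column} IS NULL"


def count_nulls(spark, table, column):
    """Run the NULL-count query and return the count as an int."""
    return spark.sql(null_count_sql(table, column)).first()["null_rows"]
```

For example, `count_nulls(spark, "f1_raw.results", "points")` would report how many result rows have no readable points value.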

### Step 8 - Create Pit Stops table and populate it

In [11]:
spark.sql("DROP TABLE IF EXISTS f1_raw.pit_stops")

spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_raw.pit_stops(
    driverId INT,
    duration STRING,
    lap INT,
    milliseconds INT,
    raceId INT,
    stop INT,
    time STRING
  )
  USING json
  OPTIONS(path "../../data/pit_stops.json", multiLine true)
""")

24/01/02 16:45:39 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source table `spark_catalog`.`f1_raw`.`pit_stops` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.


DataFrame[]
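
The `multiLine true` option matters because Spark's JSON source expects one complete JSON record per line by default. The `OPTIONS` clause suggests that `pit_stops.json` instead holds records that span several lines. The plain-Python sketch below illustrates the difference using the standard `json` module; the two sample strings are made-up records, not data from the file.

```python
import json

# One JSON object per line (JSON Lines): what Spark expects by default.
json_lines = '{"raceId": 1, "stop": 1}\n{"raceId": 1, "stop": 2}'

# One pretty-printed array spanning several lines: needs multiLine.
multi_line = """[
  {"raceId": 1, "stop": 1},
  {"raceId": 1, "stop": 2}
]"""


def parses_line_by_line(text):
    """Return True if every non-empty line is a complete JSON value."""
    try:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


print(parses_line_by_line(json_lines))  # True
print(parses_line_by_line(multi_line))  # False
print(len(json.loads(multi_line)))      # 2
```

The second string only parses when the whole document is read at once, which is what `multiLine` asks Spark to do for each file.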

### Step 9 - Create Lap Times table and populate it

In [12]:
spark.sql("DROP TABLE IF EXISTS f1_raw.lap_times")

spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_raw.lap_times(
    raceId INT,
    driverId INT,
    lap INT,
    position INT,
    time STRING,
    milliseconds INT
  )
  USING csv
  OPTIONS (path "../../data/lap_times")
""")

24/01/02 16:45:39 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `spark_catalog`.`f1_raw`.`lap_times` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.


DataFrame[]
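
Here the path points at a folder rather than a single file, and no `header` option is set, so every line of every file in the folder is read as data. For comparison, the same read can be expressed with the DataFrame reader. This is a sketch only; `LAP_TIMES_SCHEMA` and `read_lap_times` are hypothetical names, and the schema string simply mirrors the column list in the table definition.

```python
LAP_TIMES_SCHEMA = (
    "raceId INT, driverId INT, lap INT, "
    "position INT, time STRING, milliseconds INT"
)


def read_lap_times(spark, path="../../data/lap_times"):
    """Read every CSV file under path with an explicit schema."""
    # No header option is set, so the first line of each file is data,
    # matching the OPTIONS clause of the table definition.
    return spark.read.schema(LAP_TIMES_SCHEMA).csv(path)
```

Comparing `read_lap_times(spark).count()` with `SELECT count(*) FROM f1_raw.lap_times` is one way to confirm that both routes see the same files.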

### Step 10 - Create Qualifying table and populate it

In [14]:
spark.sql("DROP TABLE IF EXISTS f1_raw.qualifying")

spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_raw.qualifying(
    constructorId INT,
    driverId INT,
    number INT,
    position INT,
    q1 STRING,
    q2 STRING,
    q3 STRING,
    qualifyId INT,
    raceId INT
  )
  USING json
  OPTIONS (path "../../data/qualifying", multiLine true)
""")

24/01/02 16:46:04 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source table `spark_catalog`.`f1_raw`.`qualifying` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.


DataFrame[]

+-------------+--------+------+--------+--------+--------+--------+---------+------+
|constructorId|driverId|number|position|      q1|      q2|      q3|qualifyId|raceId|
+-------------+--------+------+--------+--------+--------+--------+---------+------+
|            1|       1|    22|       1|1:26.572|1:25.187|1:26.714|        1|    18|
|            2|       9|     4|       2|1:26.103|1:25.315|1:26.869|        2|    18|
|            1|       5|    23|       3|1:25.664|1:25.452|1:27.079|        3|    18|
|            6|      13|     2|       4|1:25.994|1:25.691|1:27.178|        4|    18|
|            2|       2|     3|       5|1:25.960|1:25.518|1:27.236|        5|    18|
|            7|      15|    11|       6|1:26.427|1:26.101|1:28.527|        6|    18|
|            3|       3|     7|       7|1:26.295|1:26.059|1:28.687|        7|    18|
|            9|      14|     9|       8|1:26.381|1:26.063|1:29.041|        8|    18|
|            7|      10|    12|       9|1:26.919|1:26.164|1:29.59