# Transformation and Aggregation of Race Data

##### Read all the data as required
* rename ambiguous or conflicting names with `.withColumnRenamed()`


Data source is at: 
* [BBC Sports](https://www.bbc.com/sport/formula1/drivers-world-championship/standings)

Spark Documentation is at: [API Reference](https://spark.apache.org/docs/latest/api/python/reference/index.html)
* **Window Functions**: [pyspark.sql.Window](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Window.html#pyspark.sql.Window)
* **Order By**: [pyspark.sql.DataFrame.orderBy](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.orderBy.html#pyspark.sql.DataFrame.orderBy)

<img src="https://ergast.com/images/ergast_db.png" alt="Ergast DB Image">


In [0]:
%run "./includes/config_file_paths"

In [0]:
drivers_df = spark.read.parquet(f"{processed_folder_path}/drivers") \
.withColumnRenamed("number", "driver_number") \
.withColumnRenamed("name", "driver_name") \
.withColumnRenamed("nationality", "driver_nationality") 

In [0]:
constructors_df = spark.read.parquet(f"{processed_folder_path}/constructors") \
.withColumnRenamed("name", "team") 

In [0]:
circuits_df = spark.read.parquet(f"{processed_folder_path}/circuits") \
.withColumnRenamed("location", "circuit_location") 

In [0]:
races_df = spark.read.parquet(f"{processed_folder_path}/races") \
.withColumnRenamed("name", "race_name") \
.withColumnRenamed("race_timestamp", "race_date") 

In [0]:
results_df = spark.read.parquet(f"{processed_folder_path}/results") \
.withColumnRenamed("time", "race_time") 

##### Join circuits to races

In [0]:
race_circuits_df = races_df.join(circuits_df, races_df.circuit_id == circuits_df.circuit_id, "inner") \
.select(races_df.race_id, races_df.race_year, races_df.race_name, races_df.race_date, circuits_df.circuit_location)

##### Join results to all other dataframes

In [0]:
race_circuits_df.limit(5).show()

+-------+---------+--------------------+-------------------+----------------+
|race_id|race_year|           race_name|          race_date|circuit_location|
+-------+---------+--------------------+-------------------+----------------+
|   1053|     2021|Emilia Romagna Gr...|2021-04-18 13:00:00|           Imola|
|   1052|     2021|  Bahrain Grand Prix|2021-03-28 15:00:00|          Sakhir|
|   1051|     2021|Australian Grand ...|2021-11-21 06:00:00|       Melbourne|
|   1054|     2021|                 TBC|               null|         Nürburg|
|   1055|     2021|  Spanish Grand Prix|2021-05-09 13:00:00|        Montmeló|
+-------+---------+--------------------+-------------------+----------------+



In [0]:
race_results_df = results_df.join(race_circuits_df, results_df.race_id == race_circuits_df.race_id) \
                            .join(drivers_df, results_df.driver_id == drivers_df.driver_id) \
                            .join(constructors_df, results_df.constructor_id == constructors_df.constructor_id)

In [0]:
from pyspark.sql.functions import current_timestamp

In [0]:
final_df = race_results_df.select("race_year", "race_name", "race_date", "circuit_location", "driver_name", "driver_number", "driver_nationality",
                                 "team", "grid", "fastest_lap", "race_time", "points", "position") \
                          .withColumn("created_date", current_timestamp())


Compare against:
https://www.bbc.com/sport/formula1/2020/results

Documentation for `orderBy`:
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.orderBy.html

In [0]:
temp=final_df.filter("race_year == 2020 and race_name == 'Abu Dhabi Grand Prix'").orderBy(final_df.points.desc())
first_eight_columns = temp.select(temp.columns[:8])
first_eight_columns.limit(5).show()

+---------+--------------------+-------------------+----------------+---------------+-------------+------------------+--------+
|race_year|           race_name|          race_date|circuit_location|    driver_name|driver_number|driver_nationality|    team|
+---------+--------------------+-------------------+----------------+---------------+-------------+------------------+--------+
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi| Max Verstappen|           33|             Dutch|Red Bull|
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi| Max Verstappen|           33|             Dutch|Red Bull|
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi|Valtteri Bottas|           77|           Finnish|Mercedes|
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi|Valtteri Bottas|           77|           Finnish|Mercedes|
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi| Lewis Hamilton|           44|     

##### NOTE TO SELF: Fix this duplicate issue!
`nodup_final_df = final_df.dropDuplicates()` works, but I still need to find out where the duplicates came from. Hypothesis: a file got concatenated twice during the ingestion process. To test this, reimport the data and rerun the entire project.
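
One way to narrow down where the duplicates enter is to count rows per natural key in each input before the join. The sketch below models that check in plain Python on made-up rows. The key `(race_id, driver_id)`, the helper `duplicated_keys`, and the sample values are assumptions for illustration, not taken from the project data. In Spark, the analogous check would be a `groupBy` on the key columns followed by `count()` and a filter on counts greater than one.

```python
from collections import Counter


def duplicated_keys(rows, key_fields):
    """Return {key: occurrences} for every key that appears more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical results rows: the second row repeats the first.
sample_results = [
    {"race_id": 1, "driver_id": 10, "points": 25.0},
    {"race_id": 1, "driver_id": 10, "points": 25.0},
    {"race_id": 1, "driver_id": 20, "points": 18.0},
]

dupes = duplicated_keys(sample_results, ["race_id", "driver_id"])
print(dupes)
```

Running the same check on each source table separately (results, drivers, constructors, races) would show which one already carries repeated keys, which is more targeted than rerunning the whole project.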

In [0]:
nodup_final_df = final_df.dropDuplicates()

In [0]:
from pyspark.sql import DataFrame

# Because display() isn't GitHub friendly
def ezView(df: DataFrame, n: int = 5, m: int = 8) -> None:
    """
    Display the first n rows and the first m columns of a DataFrame.

    Parameters:
    df (DataFrame): The DataFrame to display.
    n (int): Number of rows to display. Default is 5.
    m (int): Number of columns to display. Default is 8.
    """
    # List slicing and limit() already cope with m or n exceeding the
    # DataFrame's bounds, so no df.count() (a full Spark job) is needed.
    selected_columns = df.select(df.columns[:m])
    selected_columns.limit(n).show()

In [0]:
temp = nodup_final_df.filter("race_year == 2020 and race_name == 'Abu Dhabi Grand Prix'").orderBy(nodup_final_df.points.desc())
ezView(temp)

+---------+--------------------+-------------------+----------------+---------------+-------------+------------------+--------+
|race_year|           race_name|          race_date|circuit_location|    driver_name|driver_number|driver_nationality|    team|
+---------+--------------------+-------------------+----------------+---------------+-------------+------------------+--------+
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi| Max Verstappen|           33|             Dutch|Red Bull|
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi|Valtteri Bottas|           77|           Finnish|Mercedes|
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi| Lewis Hamilton|           44|           British|Mercedes|
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi|Alexander Albon|           23|              Thai|Red Bull|
|     2020|Abu Dhabi Grand Prix|2020-12-13 13:10:00|       Abu Dhabi|   Lando Norris|            4|     

In [0]:
final_df=nodup_final_df

In [0]:
final_df.write.mode("overwrite").parquet(f"{presentation_folder_path}/race_results")

## Aggregations
Sum, count, min, max, etc.

Documentation:
* https://spark.apache.org/docs/3.1.1/api/python/reference/pyspark.sql.html#functions (search for "Aggregate")

##### Produce driver standings

In [0]:
%run "./includes/config_file_paths"

In [0]:
race_results_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

##### Count the Number of Wins per Season Per Driver

```
count(when(col("position") == 1, True)).alias("wins")
```
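
The expression works because `when` without an `otherwise` branch yields null for non-matching rows, and `count` over a column skips nulls. Below is a plain-Python model of that behaviour, using `None` for null. The helper names `when_model` and `count_non_null` and the finishing positions are made up for illustration; this is not Spark code.

```python
def when_model(condition, value):
    """Model of when(...) with no otherwise(): value if the condition holds, else None."""
    return value if condition else None


def count_non_null(values):
    """Model of count(column): counts only the non-null entries."""
    return sum(1 for v in values if v is not None)


# Hypothetical finishing positions for one driver; None models a race
# with no classified position.
positions = [1, 3, 1, None, 2, 1]

flags = [when_model(p == 1, True) for p in positions]
wins = count_non_null(flags)
print(wins)
```

Only the three first-place finishes survive as non-null flags, so they are the only rows counted.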

In [0]:
from pyspark.sql.functions import sum, when, count, col

driver_standings_df = race_results_df \
.groupBy("race_year", "driver_name", "driver_nationality", "team") \
.agg(sum("points").alias("total_points"),
     count(when(col("position") == 1, True)).alias("wins"))

In [0]:
temp=driver_standings_df.filter("race_year = 2020")
ezView(temp)

+---------+------------------+------------------+------------+------------+----+
|race_year|       driver_name|driver_nationality|        team|total_points|wins|
+---------+------------------+------------------+------------+------------+----+
|     2020|      Lance Stroll|          Canadian|Racing Point|        75.0|   0|
|     2020|   Kevin Magnussen|            Danish|Haas F1 Team|         1.0|   0|
|     2020|Antonio Giovinazzi|           Italian|  Alfa Romeo|         4.0|   0|
|     2020|      Carlos Sainz|           Spanish|     McLaren|       105.0|   0|
|     2020|    Lewis Hamilton|           British|    Mercedes|       347.0|  11|
+---------+------------------+------------------+------------+------------+----+



##### Rank Each Driver

```
.withColumn("rank", rank().over(driver_rank_spec))
```


`driver_rank_spec` is a `WindowSpec` object built with PySpark's `Window` class. It defines the window over which the `rank()` window function is evaluated:
* partition by "race_year"
* order by descending total points
* if drivers are tied on points, the higher rank goes to the driver with the most wins
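
To make the ordering concrete, here is a plain-Python model of `rank()` over that window for a single season. The point totals follow the 2020 figures discussed in this notebook, and Gasly's win count is taken as 1. The helper `rank_rows` is hypothetical and only mirrors the semantics: sort by points descending, then wins descending, give fully tied rows the same rank, and leave a gap after a tie.

```python
def rank_rows(rows):
    """Assign ranks like SQL RANK(): ties share a rank, the next rank skips ahead."""
    ordered = sorted(rows, key=lambda r: (-r["total_points"], -r["wins"]))
    ranked = []
    previous_key = None
    for index, row in enumerate(ordered, start=1):
        key = (row["total_points"], row["wins"])
        # Reuse the previous rank when both ordering columns tie.
        rank = ranked[-1]["rank"] if key == previous_key else index
        ranked.append({**row, "rank": rank})
        previous_key = key
    return ranked


# Three-row sample of one season (one partition of the window).
season = [
    {"driver_name": "Lance Stroll", "total_points": 75.0, "wins": 0},
    {"driver_name": "Pierre Gasly", "total_points": 75.0, "wins": 1},
    {"driver_name": "Lewis Hamilton", "total_points": 347.0, "wins": 11},
]

ranked = rank_rows(season)
for row in ranked:
    print(row["rank"], row["driver_name"])
```

The ranks here are relative to this three-row sample, not the full 2020 table. Gasly is placed ahead of Stroll because the second sort key, wins, breaks the tie on points.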


Documentation is at: [API Reference](https://spark.apache.org/docs/latest/api/python/reference/index.html)
* **Window Functions**: [pyspark.sql.Window](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Window.html#pyspark.sql.Window)
* **Order By**: [pyspark.sql.DataFrame.orderBy](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.orderBy.html#pyspark.sql.DataFrame.orderBy)


The example command:

```
# PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING
window = Window.orderBy("date").partitionBy("country").rangeBetween(-3, 3)
```

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rank, asc

driver_rank_spec = Window.partitionBy("race_year").orderBy(desc("total_points"), desc("wins"))
final_df = driver_standings_df.withColumn("rank", rank().over(driver_rank_spec))

Notice 
* Pierre Gasly
* Lance Stroll

Both have 75 points, but Gasly has more wins, so he comes out ahead.

In [0]:
ezView(final_df.filter("race_year = 2020"))

+---------+----------------+------------------+------------+------------+----+----+
|race_year|     driver_name|driver_nationality|        team|total_points|wins|rank|
+---------+----------------+------------------+------------+------------+----+----+
|     2020|  Lewis Hamilton|           British|    Mercedes|       347.0|  11|   1|
|     2020| Valtteri Bottas|           Finnish|    Mercedes|       223.0|   2|   2|
|     2020|  Max Verstappen|             Dutch|    Red Bull|       214.0|   2|   3|
|     2020|    Sergio Pérez|           Mexican|Racing Point|       125.0|   1|   4|
|     2020|Daniel Ricciardo|        Australian|     Renault|       119.0|   0|   5|
+---------+----------------+------------------+------------+------------+----+----+



In [0]:
final_df.write.mode("overwrite").parquet(f"{presentation_folder_path}/driver_standings")

### Constructor standings

Objective: Figure out the rank of each team

https://www.bbc.com/sport/formula1/constructors-world-championship/standings

In [0]:
%run "./includes/config_file_paths"

In [0]:
race_results_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

In [0]:
from pyspark.sql.functions import sum, when, count, col

constructor_standings_df = race_results_df \
.groupBy("race_year", "team") \
.agg(sum("points").alias("total_points"),
     count(when(col("position") == 1, True)).alias("wins"))

In [0]:
ezView(constructor_standings_df.filter("race_year = 2020"))

+---------+------------+------------+----+
|race_year|        team|total_points|wins|
+---------+------------+------------+----+
|     2020|Haas F1 Team|         3.0|   0|
|     2020|     McLaren|       202.0|   0|
|     2020|     Ferrari|       131.0|   0|
|     2020|    Mercedes|       573.0|  13|
|     2020|  AlphaTauri|       107.0|   1|
+---------+------------+------------+----+




As with the driver standings, `constructor_rank_spec` is a `WindowSpec` built with PySpark's `Window` class. It partitions by `race_year` and orders by total points, then wins, both descending.

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rank, asc

constructor_rank_spec = Window.partitionBy("race_year").orderBy(desc("total_points"), desc("wins"))
final_df = constructor_standings_df.withColumn("rank", rank().over(constructor_rank_spec))

In [0]:
ezView(final_df.filter("race_year = 2020"))

+---------+------------+------------+----+----+
|race_year|        team|total_points|wins|rank|
+---------+------------+------------+----+----+
|     2020|    Mercedes|       573.0|  13|   1|
|     2020|    Red Bull|       319.0|   2|   2|
|     2020|Racing Point|       210.0|   1|   3|
|     2020|     McLaren|       202.0|   0|   4|
|     2020|     Renault|       181.0|   0|   5|
+---------+------------+------------+----+----+



In [0]:
final_df.write.mode("overwrite").parquet(f"{presentation_folder_path}/constructor_standings")