## Notebook Description
This notebook cleans the raw data and combines it into one large table for our data analysis. We clean the following tables:
- `data_combined_events_w_vehicle_map.csv`
- `data_driving_periods_w_vehicle_map.csv`
- `data_inspections_w_vehicle_map.csv`
- `data_idle_events_w_vehicle_map.csv`

Once all of these tables are cleaned, we cache each one before joining. For the join, we plan to combine them into one aggregated table that holds records for every (`driver_id`, `vehicle_id`, `start_date`, event type) key combination, where `start_date` is truncated to a calendar date.
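
To picture the key set described above, here is a small plain-Python sketch (no Spark needed). It pairs every distinct (driver, vehicle, date) triple with each of the four event types. The IDs and dates are made up for illustration, and `build_key_scaffold` is a hypothetical helper, not a function from this notebook.

```python
from itertools import product

EVENT_TYPES = ["hazard", "driving", "inspection", "idle"]

def build_key_scaffold(trips, event_types=EVENT_TYPES):
    """Pair every distinct (driver_id, vehicle_id, trip_date) with every event type."""
    distinct_trips = sorted(set(trips))
    return [trip + (event_type,) for trip, event_type in product(distinct_trips, event_types)]

# Illustrative values only
trips = [
    (101, 7, "2023-07-18"),
    (101, 7, "2023-07-18"),  # a second trip on the same day collapses to one key
    (102, 9, "2023-07-19"),
]
scaffold = build_key_scaffold(trips)
print(len(scaffold))  # 2 distinct trips x 4 event types = 8
```

Each cleaned table can then be attached to this scaffold with a left join, so keys without any matching event still appear in the result.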

## System Setup



In [0]:
# import libraries
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
# Note: min and max imported here shadow the Python built-ins of the same name
from pyspark.sql.functions import col, count, mean, stddev, min, max, when, isnan, countDistinct, lit, to_timestamp, to_date
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
app_name = "final-proj"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()

sc = spark.sparkContext

## Data Cleaning 
### `data_combined_events_w_vehicle_map`
The cleaning steps for this table are:
- Drop rows where `driver_id` is NULL
- Check whether `event_id` is unique and non-null (a primary key); if it is, drop the redundant `id` column
- Create a new column `main_event_type` with the value "hazard"
- Create a `trip_date` column by converting the `start_date` timestamp to a date
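
The primary-key check in the steps above boils down to two conditions: no NULLs and no repeats. Here is a minimal plain-Python sketch of that logic; `is_primary_key` is a hypothetical helper used only for illustration, and the values are made up.

```python
def is_primary_key(values):
    """Return True when no value is None and no value repeats."""
    has_null = any(v is None for v in values)
    all_unique = len(set(values)) == len(values)
    return not has_null and all_unique

print(is_primary_key([10, 11, 12]))    # True
print(is_primary_key([10, 11, 11]))    # False: duplicate value
print(is_primary_key([10, None, 12]))  # False: missing value
```

In Spark, `countDistinct` ignores NULLs, which is why the cell below counts NULL `event_id` values separately before comparing the distinct count with the row count.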



In [0]:
# Load the data
combined_event_data = spark.read.csv('dbfs:/FileStore/tables/data_combined_events_w_vehicle_map.csv', inferSchema=True, header=True)

# drop rows with a NULL driver_id
cleaned_combined_event_data = combined_event_data.dropna(subset=['driver_id'])
# check if event_id is unique
event_id_unique_count = cleaned_combined_event_data.select(countDistinct("event_id")).collect()[0][0]
total_count = cleaned_combined_event_data.count()
# check event_id not null 
null_event_id_count = cleaned_combined_event_data.filter(col("event_id").isNull()).count()

if null_event_id_count == 0 and event_id_unique_count == total_count:
    print("event_id is a primary key")
    # event_id is a primary key, so drop the id column
    cleaned_combined_event_data = cleaned_combined_event_data.drop("id")

# add main_event_type column
cleaned_combined_event_data = cleaned_combined_event_data.withColumn("main_event_type", lit("hazard"))

# add trip_date column
cleaned_combined_event_data = cleaned_combined_event_data.withColumn("trip_date", to_date("start_date"))
# cache the data
cleaned_combined_event_data = cleaned_combined_event_data.cache()
cleaned_combined_event_data.show()

event_id is a primary key
+---------+-------------------+---------+-----------------+----------------+----------+---------------+-------------------+--------+-----+--------------------+--------------------+---------------------+---------------------+-----------+---------+-----------+--------------+--------------------+--------+--------------------+---------------+----------+
| event_id|               type|driver_id|driver_first_name|driver_last_name|vehicle_id|coaching_status|         start_date|severity|month|          created_at|          updated_at|max_over_speed_in_kph|max_over_speed_in_mph|not_current|   number|     status|          make|               model|group_id|          group_name|main_event_type| trip_date|
+---------+-------------------+---------+-----------------+----------------+----------+---------------+-------------------+--------+-----+--------------------+--------------------+---------------------+---------------------+-----------+---------+-----------+------------

In [0]:
cleaned_combined_event_data.printSchema()

root
 |-- event_id: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- driver_id: double (nullable = true)
 |-- driver_first_name: string (nullable = true)
 |-- driver_last_name: string (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- coaching_status: string (nullable = true)
 |-- start_date: timestamp (nullable = true)
 |-- severity: string (nullable = true)
 |-- month: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- updated_at: timestamp (nullable = true)
 |-- max_over_speed_in_kph: double (nullable = true)
 |-- max_over_speed_in_mph: double (nullable = true)
 |-- not_current: boolean (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- group_id: double (nullable = true)
 |-- group_name: string (nullable = true)
 |-- main_event_type: string (nullable = false)
 |-- trip_date: date (nullable = true)



### `data_driving_periods_w_vehicle_map`
The cleaning steps for this table are:
- Drop rows where `driver_id` is NULL
- Keep only rows where `minutes_driving` is non-null and greater than 0
- Keep only rows where `driving_distance` is non-null, greater than 0, and less than 10000 (larger values were identified as outliers during data exploration)
- Count how many trips have a `driving_distance` greater than 500
- Count how many `vehicle_id` values are NULL
- Check whether `event_id` is unique and non-null (a primary key); if it is, drop the redundant `id_x` and `id_y` columns
- Create a new column `main_event_type` with the value "driving"
- Create a `trip_date` column from the `start_date` timestamp
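
The two range filters above can be sketched in plain Python as follows. The rows are invented for illustration, and `keep_driving_period` is a hypothetical helper; the thresholds (0, 10000, 500) are the ones listed in the steps.

```python
def keep_driving_period(row, max_distance=10000):
    """Keep a row only if both measures are present and within the accepted range."""
    minutes = row.get("minutes_driving")
    distance = row.get("driving_distance")
    if minutes is None or distance is None:
        return False
    return minutes > 0 and 0 < distance < max_distance

rows = [
    {"minutes_driving": 35.0, "driving_distance": 12.4},      # kept
    {"minutes_driving": 0.0, "driving_distance": 3.1},        # dropped: zero minutes
    {"minutes_driving": 20.0, "driving_distance": None},      # dropped: missing distance
    {"minutes_driving": 600.0, "driving_distance": 25000.0},  # dropped: outlier distance
    {"minutes_driving": 480.0, "driving_distance": 612.0},    # kept, counted as a long trip
]
kept = [r for r in rows if keep_driving_period(r)]
long_trips = sum(1 for r in kept if r["driving_distance"] > 500)
print(len(kept), long_trips)  # 2 1
```

Note that the long-trip count is taken after the filters, so it only covers rows that survived the cleaning.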


In [0]:
# Load the data
driving_period = spark.read.csv('dbfs:/FileStore/tables/data_driving_periods_w_vehicle_map.csv', inferSchema=True, header=True)

# Drop NULL values from driver_id
cleaned_driving_period = driving_period.dropna(subset=['driver_id'])

# Filter minutes_driving to be non-null and greater than 0
cleaned_driving_period = cleaned_driving_period.filter((col("minutes_driving").isNotNull()) & (col("minutes_driving") > 0))

# Filter driving_distance to be non-null, greater than 0, and less than 10000
cleaned_driving_period = cleaned_driving_period.filter((col("driving_distance").isNotNull()) & (col("driving_distance") > 0) & (col("driving_distance")<10000))

# Count how many trips have driving_distance > 500
high_distance_trips = cleaned_driving_period.filter(col("driving_distance") > 500).count()
print(f"Number of trips with driving_distance > 500: {high_distance_trips}")

# Check how many vehicle_id values are NULL
null_vehicle_id_count = cleaned_driving_period.filter(col("vehicle_id").isNull()).count()
print(f"Number of NULL vehicle_id values: {null_vehicle_id_count}")

# add main_event_type column
cleaned_driving_period = cleaned_driving_period.withColumn("main_event_type", lit("driving"))
# add trip_date column
cleaned_driving_period = cleaned_driving_period.withColumn("trip_date", to_date("start_date"))

# Convert 'end_date' from string to timestamp
cleaned_driving_period = cleaned_driving_period.withColumn("end_date", to_timestamp("end_date"))
# cache data 
cleaned_driving_period = cleaned_driving_period.cache()



Number of trips with driving_distance > 500: 19
Number of NULL vehicle_id values: 0


In [0]:
# import the functions this cell actually uses
from pyspark.sql.functions import col, countDistinct

null_event_id_count = cleaned_driving_period.filter(col("event_id").isNull()).count()
event_id_unique_count = cleaned_driving_period.select(countDistinct("event_id")).collect()[0][0]
total_count = cleaned_driving_period.count()

if null_event_id_count == 0 and event_id_unique_count == total_count:
    print("event_id is a primary key (unique and non-null)")
    # Drop id column since event_id is the primary key
    cleaned_driving_period = cleaned_driving_period.drop("id_y", "id_x")


event_id is a primary key (unique and non-null)


In [0]:
print(cleaned_driving_period.printSchema())
print("number of rows: ", cleaned_driving_period.count())
cleaned_driving_period.show()

root
 |-- event_id: long (nullable = true)
 |-- driver_id: double (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- start_date: timestamp (nullable = true)
 |-- end_date: timestamp (nullable = true)
 |-- driving_distance: double (nullable = true)
 |-- driving_period_type: string (nullable = true)
 |-- driver_company_id: string (nullable = true)
 |-- minutes_driving: double (nullable = true)
 |-- month: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- updated_at: timestamp (nullable = true)
 |-- unassigned: boolean (nullable = true)
 |-- not_current: string (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- group_id: double (nullable = true)
 |-- group_name: string (nullable = true)
 |-- main_event_type: string (nullable = false)
 |-- trip_date: date (nullable = true)

None
number of rows:  2450719
+----------+---------+------

### `data_inspections_w_vehicle_map`
The cleaning steps for this table are:
- Drop rows where `driver_id` is NULL
- Rename column "status" to "inspection_status" and column "status-2" to "status"
- Check whether `inspection_id` is unique and non-null (a primary key); if it is, drop the redundant `id` column
- Create a new column `main_event_type` with the value "inspection"
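
The order of the two renames matters, because the second one reuses the name freed by the first. The sketch below is a simplified plain-Python model of chained renames over a list of column names; `rename_columns` is a hypothetical helper and the column list is shortened for illustration.

```python
def rename_columns(columns, renames):
    """Apply (old, new) renames one after another, like chained withColumnRenamed calls."""
    result = list(columns)
    for old, new in renames:
        result = [new if name == old else name for name in result]
    return result

columns = ["inspection_id", "status", "driver_id", "status-2"]

# Order used in this notebook: free up "status" first, then reuse the name
good = rename_columns(columns, [("status", "inspection_status"), ("status-2", "status")])
print(good)  # ['inspection_id', 'inspection_status', 'driver_id', 'status']

# Reversed order: both columns end up named "inspection_status"
bad = rename_columns(columns, [("status-2", "status"), ("status", "inspection_status")])
print(bad)   # ['inspection_id', 'inspection_status', 'driver_id', 'inspection_status']
```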


In [0]:
inspection_data = spark.read.csv('dbfs:/FileStore/tables/data_inspections_w_vehicle_map.csv', inferSchema=True, header=True)

cleaned_inspection_data = inspection_data.dropna(subset=['driver_id'])

# Rename columns: "status" → "inspection_status" and "status-2" → "status"
cleaned_inspection_data = cleaned_inspection_data.withColumnRenamed("status", "inspection_status").withColumnRenamed("status-2", "status")

# Check if inspection_id is unique and not null
null_inspection_id_count = cleaned_inspection_data.filter(col("inspection_id").isNull()).count()
inspection_id_unique_count = cleaned_inspection_data.select(countDistinct("inspection_id")).collect()[0][0]
total_count = cleaned_inspection_data.count()

if null_inspection_id_count == 0 and inspection_id_unique_count == total_count:
    print("inspection_id is a primary key (unique and non-null)")
    # Drop id column since inspection_id is the primary key
    cleaned_inspection_data = cleaned_inspection_data.drop("id")

# add main_event_type column
cleaned_inspection_data = cleaned_inspection_data.withColumn("main_event_type", lit("inspection"))
# cache data 
cleaned_inspection_data = cleaned_inspection_data.cache()

print(cleaned_inspection_data.printSchema())
print(cleaned_inspection_data.count())
cleaned_inspection_data.show()


inspection_id is a primary key (unique and non-null)
root
 |-- inspection_id: long (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- location: string (nullable = true)
 |-- inspection_status: string (nullable = true)
 |-- inspection_type: string (nullable = true)
 |-- driver_id: integer (nullable = true)
 |-- mechanic_id: string (nullable = true)
 |-- reviewer_id: string (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- main_event_type: string (nullable = false)

None
215292
+-------------+----------+----------+--------------------+-----------------+---------------+---------+-----------+-----------+-------------+-----------+--------------+--------------------+---------------+
|inspection_id|vehicle_id|      date|            location|inspection_status|inspection_type|driver_id|mechanic_id|reviewer_id|       number|   

### `data_idle_events_w_vehicle_map`
The cleaning steps for this table are:
- Drop rows where `driver_id` is NULL
- Remove idling events longer than 500 minutes, based on column `minutes_idling`
- Check whether `event_id` is unique and non-null (a primary key); if it is, drop the redundant `id_x` and `id_y` columns
- Create an `idle_date` column from the `start_time` timestamp (renamed to `trip_date` before joining)
- Create a new column `main_event_type` with the value "idle"
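
As a rough plain-Python picture of these steps, the sketch below drops events without a driver or above the 500-minute cap and derives the calendar date from the start timestamp. The events are invented, and `clean_idle_event` is a hypothetical helper. It also drops events with a missing `minutes_idling`, mirroring how a Spark comparison against NULL does not pass a filter.

```python
from datetime import datetime

MAX_IDLE_MINUTES = 500

def clean_idle_event(event):
    """Return the event with an added idle_date, or None if it should be dropped."""
    minutes = event.get("minutes_idling")
    if event.get("driver_id") is None or minutes is None or minutes > MAX_IDLE_MINUTES:
        return None
    # Keep only the calendar date of the start timestamp
    return {**event, "idle_date": event["start_time"].date(), "main_event_type": "idle"}

events = [
    {"driver_id": 1, "start_time": datetime(2023, 7, 18, 23, 55), "minutes_idling": 12.5},
    {"driver_id": 2, "start_time": datetime(2023, 7, 19, 6, 0), "minutes_idling": 720.0},
    {"driver_id": None, "start_time": datetime(2023, 7, 19, 8, 30), "minutes_idling": 4.0},
]
cleaned = [e for e in map(clean_idle_event, events) if e is not None]
print(cleaned[0]["idle_date"])  # 2023-07-18
```

An event that starts shortly before midnight is assigned to the day it started on, even if most of the idling happens on the next day.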



In [0]:
idle_data = spark.read.csv('dbfs:/FileStore/tables/data_idle_events_w_vehicle_map.csv', inferSchema=True, header=True)

cleaned_idle_data = idle_data.dropna(subset=['driver_id'])

# Remove idling events where minutes_idling > 500 (rows with a NULL minutes_idling are dropped too)
cleaned_idle_data = cleaned_idle_data.filter(col("minutes_idling") <= 500)

# Check if event_id is unique and not null
null_event_id_count = cleaned_idle_data.filter(col("event_id").isNull()).count()
event_id_unique_count = cleaned_idle_data.select(countDistinct("event_id")).collect()[0][0]
total_count = cleaned_idle_data.count()

if null_event_id_count == 0 and event_id_unique_count == total_count:
    print("event_id is a primary key (unique and non-null)")
    # Drop id_x and id_y columns since event_id is the primary key
    cleaned_idle_data = cleaned_idle_data.drop("id_x", "id_y")

# add idle_date column from the start_time timestamp
cleaned_idle_data = cleaned_idle_data.withColumn("idle_date", to_date("start_time"))

# add main_event_type column
cleaned_idle_data = cleaned_idle_data.withColumn("main_event_type", lit("idle"))
# cache data 
cleaned_idle_data = cleaned_idle_data.cache()

print(cleaned_idle_data.printSchema())
print(cleaned_idle_data.count())
cleaned_idle_data.show()

event_id is a primary key (unique and non-null)
root
 |-- event_id: long (nullable = true)
 |-- start_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- driver_id: double (nullable = true)
 |-- driver_company_id: double (nullable = true)
 |-- minutes_idling: double (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- group_id: double (nullable = true)
 |-- group_name: string (nullable = true)
 |-- idle_date: date (nullable = true)
 |-- main_event_type: string (nullable = false)

None
4264634
+----------+-------------------+-------------------+----------+---------+-----------------+------------------+------------+-----------+-------------+--------------------+--------+--------------------+----------+---------------+
|  event_id|         start_time|           end_time|vehicle_id|driver_id|driver_co

## Aggregated tables
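
The cells in this section left-join each cleaned table onto a scaffold of (driver, vehicle, date, event type) keys. A key can match several events on the same day, so the joined row count can exceed the number of scaffold rows. The sketch below shows this with plain Python dictionaries; `left_join` is a hypothetical helper and all values are made up.

```python
from collections import defaultdict

def left_join(left_rows, right_rows, keys):
    """Left-join two lists of dicts on the given key columns."""
    index = defaultdict(list)
    for row in right_rows:
        index[tuple(row[k] for k in keys)].append(row)
    joined = []
    for row in left_rows:
        matches = index.get(tuple(row[k] for k in keys))
        if not matches:
            # Unmatched left row is kept once (Spark would fill right-hand columns with null)
            joined.append(dict(row))
        else:
            joined.extend({**row, **match} for match in matches)
    return joined

keys = ["driver_id", "vehicle_id", "trip_date", "main_event_type"]
scaffold = [
    {"driver_id": 1, "vehicle_id": 7, "trip_date": "2023-07-18", "main_event_type": "driving"},
    {"driver_id": 1, "vehicle_id": 7, "trip_date": "2023-07-18", "main_event_type": "idle"},
]
driving = [
    {"driver_id": 1, "vehicle_id": 7, "trip_date": "2023-07-18",
     "main_event_type": "driving", "event_id_driving_period": 501},
    {"driver_id": 1, "vehicle_id": 7, "trip_date": "2023-07-18",
     "main_event_type": "driving", "event_id_driving_period": 502},
]
result = left_join(scaffold, driving, keys)
print(len(result))  # 3: two driving matches plus the unmatched idle key
```

If exactly one row per key is needed, each event table would have to be aggregated per key before joining.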

In [0]:
# Build the key scaffold that the other tables are joined onto:
# one row per (driver_id, vehicle_id, trip_date, main_event_type)
all_trips_df =  cleaned_driving_period.select("driver_id", "vehicle_id", "trip_date").distinct()
main_event_types = ["hazard", "driving", "inspection", "idle"]
main_event_df = spark.createDataFrame([Row(main_event_type=t) for t in main_event_types])
all_trips_df = all_trips_df.crossJoin(main_event_df)

all_trips_df.show()

+---------+----------+----------+---------------+
|driver_id|vehicle_id| trip_date|main_event_type|
+---------+----------+----------+---------------+
|3757580.0|   1292577|2023-07-18|         hazard|
|4210116.0|   1194292|2023-07-19|         hazard|
|3862130.0|   1123168|2023-07-19|         hazard|
|3895392.0|   1311984|2023-07-19|         hazard|
|4178853.0|   1123166|2023-07-19|         hazard|
|4105456.0|   1123074|2023-07-19|         hazard|
|3601235.0|   1175213|2023-07-19|         hazard|
|3933445.0|   1189499|2023-07-18|         hazard|
|4039747.0|   1123105|2023-07-18|         hazard|
|3601236.0|   1033103|2023-07-19|         hazard|
|3773186.0|   1123248|2023-07-18|         hazard|
|3926379.0|   1163496|2023-07-18|         hazard|
|4134532.0|   1193220|2023-07-19|         hazard|
|4036826.0|   1123065|2023-07-19|         hazard|
|3749443.0|   1086398|2023-07-18|         hazard|
|4165548.0|   1164772|2023-07-18|         hazard|
|3792802.0|   1123233|2023-07-19|         hazard|


In [0]:
from pyspark.sql.functions import col

# Rename driving-period columns that would clash with same-named columns in the other tables
joining_driving_period = (
    cleaned_driving_period
    .withColumnRenamed("created_at", "created_at_driving_period")
    .withColumnRenamed("updated_at", "updated_at_driving_period")
    .withColumnRenamed("start_date", "start_date_driving_period")
    .withColumnRenamed("end_date", "end_date_driving_period")
    .withColumnRenamed("event_id", "event_id_driving_period")
    .withColumnRenamed("not_current", "not_current_driving_period")
)

main_joined_df = all_trips_df.join(
    joining_driving_period,
    on=["driver_id", "vehicle_id", "trip_date", "main_event_type"], 
    how="left"
)

# Rename combined-event columns that would clash with the driving-period columns
joining_combined_event_data = (
    cleaned_combined_event_data
    .withColumnRenamed("created_at", "created_at_combined_event")
    .withColumnRenamed("updated_at", "updated_at_combined_event")
    .withColumnRenamed("start_date", "start_date_combined_event")
    .withColumnRenamed("event_id", "event_id_combined_event")
    .withColumnRenamed("type", "hazard_type")
    .withColumnRenamed("not_current", "not_current_combined_event")
)
# Drop columns that the driving-period table already provides
drop_col = ["number", "status", "make", "model", "group_id", "group_name", "month"]
joining_combined_event_data = joining_combined_event_data.drop(*drop_col)


# join on driver_id, vehicle_id, trip_date and main_event_type
main_joined_df = main_joined_df.join(
    joining_combined_event_data,
    on=["driver_id", "vehicle_id", "trip_date", "main_event_type"], 
    how="left"
)

# Show the joined DataFrame
print("Number of rows in the joined DataFrame:", main_joined_df.count())
print("Schema", main_joined_df.printSchema())
main_joined_df.show()


Number of rows in the joined DataFrame: 3312011
root
 |-- driver_id: double (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- trip_date: date (nullable = true)
 |-- main_event_type: string (nullable = true)
 |-- event_id_driving_period: long (nullable = true)
 |-- start_date_driving_period: timestamp (nullable = true)
 |-- end_date_driving_period: timestamp (nullable = true)
 |-- driving_distance: double (nullable = true)
 |-- driving_period_type: string (nullable = true)
 |-- driver_company_id: string (nullable = true)
 |-- minutes_driving: double (nullable = true)
 |-- month: string (nullable = true)
 |-- created_at_driving_period: timestamp (nullable = true)
 |-- updated_at_driving_period: timestamp (nullable = true)
 |-- unassigned: boolean (nullable = true)
 |-- not_current_driving_period: string (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |--

In [0]:
joining_inspection_data = cleaned_inspection_data.drop(*drop_col)
# rename: 
joining_inspection_data = joining_inspection_data.withColumnRenamed("date", "trip_date") \
                                               .withColumnRenamed("location", "inspection_location")

# join on driver_id, vehicle_id, trip_date and main_event_type
main_joined_df = main_joined_df.join(
    joining_inspection_data,
    on=["driver_id", "vehicle_id", "trip_date", "main_event_type"], 
    how="left"
)

# Show the joined DataFrame
print("Number of rows in the joined DataFrame:", main_joined_df.count())
print("Schema", main_joined_df.printSchema())
main_joined_df.show()



Number of rows in the joined DataFrame: 3355239
root
 |-- driver_id: double (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- trip_date: date (nullable = true)
 |-- main_event_type: string (nullable = true)
 |-- event_id_driving_period: long (nullable = true)
 |-- start_date_driving_period: timestamp (nullable = true)
 |-- end_date_driving_period: timestamp (nullable = true)
 |-- driving_distance: double (nullable = true)
 |-- driving_period_type: string (nullable = true)
 |-- driver_company_id: string (nullable = true)
 |-- minutes_driving: double (nullable = true)
 |-- month: string (nullable = true)
 |-- created_at_driving_period: timestamp (nullable = true)
 |-- updated_at_driving_period: timestamp (nullable = true)
 |-- unassigned: boolean (nullable = true)
 |-- not_current_driving_period: string (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |--

In [0]:
cleaned_idle_data.printSchema()

root
 |-- event_id: long (nullable = true)
 |-- start_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- driver_id: double (nullable = true)
 |-- driver_company_id: double (nullable = true)
 |-- minutes_idling: double (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- group_id: double (nullable = true)
 |-- group_name: string (nullable = true)
 |-- idle_date: date (nullable = true)
 |-- main_event_type: string (nullable = false)



In [0]:
# join with idle event 
joining_idle_data = cleaned_idle_data.drop(*drop_col)
joining_idle_data= joining_idle_data.drop("driver_company_id")
# Rename idle columns so the date lines up with the join key and the rest stay distinguishable
joining_idle_data = joining_idle_data.withColumnRenamed("start_time", "start_time_idle") \
                                    .withColumnRenamed("end_time", "end_time_idle") \
                                    .withColumnRenamed("idle_date", "trip_date") \
                                    .withColumnRenamed("event_id", "event_id_idle")

# join on driver_id, vehicle_id, trip_date and main_event_type
main_joined_df = main_joined_df.join(
    joining_idle_data,
    on=["driver_id", "vehicle_id", "trip_date", "main_event_type"], 
    how="left"
)

# Show the joined DataFrame
print("Number of rows in the joined DataFrame:", main_joined_df.count())
print("Schema", main_joined_df.printSchema())

# cache data for later use 
main_joined_df=main_joined_df.cache()
main_joined_df.show()




Number of rows in the joined DataFrame: 7428575
root
 |-- driver_id: double (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- trip_date: date (nullable = true)
 |-- main_event_type: string (nullable = true)
 |-- event_id_driving_period: long (nullable = true)
 |-- start_date_driving_period: timestamp (nullable = true)
 |-- end_date_driving_period: timestamp (nullable = true)
 |-- driving_distance: double (nullable = true)
 |-- driving_period_type: string (nullable = true)
 |-- driver_company_id: string (nullable = true)
 |-- minutes_driving: double (nullable = true)
 |-- month: string (nullable = true)
 |-- created_at_driving_period: timestamp (nullable = true)
 |-- updated_at_driving_period: timestamp (nullable = true)
 |-- unassigned: boolean (nullable = true)
 |-- not_current_driving_period: string (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |--

In [0]:
main_joined_df.printSchema()

root
 |-- driver_id: double (nullable = true)
 |-- vehicle_id: integer (nullable = true)
 |-- trip_date: date (nullable = true)
 |-- main_event_type: string (nullable = true)
 |-- event_id_driving_period: long (nullable = true)
 |-- start_date_driving_period: timestamp (nullable = true)
 |-- end_date_driving_period: timestamp (nullable = true)
 |-- driving_distance: double (nullable = true)
 |-- driving_period_type: string (nullable = true)
 |-- driver_company_id: string (nullable = true)
 |-- minutes_driving: double (nullable = true)
 |-- month: string (nullable = true)
 |-- created_at_driving_period: timestamp (nullable = true)
 |-- updated_at_driving_period: timestamp (nullable = true)
 |-- unassigned: boolean (nullable = true)
 |-- not_current: string (nullable = true)
 |-- number: string (nullable = true)
 |-- status: string (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- group_id: double (nullable = true)
 |-- group_name: string (nu

In [0]:
# Save this data in DBFS for future use (Spark writes a directory of part files at this path)
main_joined_df.write.mode("overwrite").option("header", "true").csv("dbfs:/FileStore/tables/motive_joined_events.csv")
