# Analysis of v_load_summary_hourly Table

This notebook analyzes the schema and contents of the v_load_summary_hourly table from the AWS Glue Data Catalog, loaded here from a local Parquet file.

In [None]:
# Import required libraries
import pandas as pd
import pyarrow as pa
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the parquet file
file_path = '/local/home/admsia/parquet_analysis/load_summary.parquet'
df = pd.read_parquet(file_path)
print(f"Dataset shape: {df.shape}")

## Schema Overview

The table contains 116 columns covering vehicle routes, schedules, and logistics metrics.

In [None]:
# Display table columns
print("Column Names:")
columns = df.columns.tolist()
for i, col in enumerate(columns, start=1):
    print(f"{i}. {col}")

In [None]:
# Show data types
df.dtypes
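
With 116 columns, the raw dtype listing is long, so grouping column names by dtype can make it easier to scan. The following is a minimal sketch on a small hypothetical frame (the `toy` data and the `columns_by_dtype` helper are illustrative assumptions, not part of the original notebook); the same function could be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df; values are made up for illustration.
toy = pd.DataFrame({
    "miles": [120.5, 300.0, 45.2],
    "stop_count": [2, 3, 1],
    "transit_hours_actual": [4.0, 9.5, 1.5],
})

def columns_by_dtype(frame):
    """Map each dtype name to the list of columns that have that dtype."""
    grouped = {}
    for column, dtype in frame.dtypes.items():
        grouped.setdefault(str(dtype), []).append(column)
    return grouped

by_dtype = columns_by_dtype(toy)
for dtype_name, names in by_dtype.items():
    print(f"{dtype_name}: {len(names)} column(s) -> {names}")
```

Columns keep their original order within each group, which helps when locating them in the full column list.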

## Data Sample

The first five rows give a sense of the data's structure:

In [None]:
# Display a sample of the data
df.head(5)

## Key Field Analysis

Let's examine the distribution of key fields in the dataset.

In [None]:
# Program code distribution
print("Program code distribution:")
print(df['program_code'].value_counts().head(10))

# Shipment mode distribution
print("\nShipment mode distribution:")
print(df['shipment_mode'].value_counts())

# Equipment type distribution
print("\nEquipment type distribution:")
print(df['equipment_type'].value_counts().head(10))

# Vehicle execution status
print("\nVehicle execution status distribution:")
print(df['vehicle_execution_status'].value_counts())
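
Raw counts hide two things: the relative share of each category and the number of missing entries, which `value_counts()` drops by default. The sketch below shows both on a hypothetical series (the status values are invented for illustration and may not match the real codes in the table).

```python
import pandas as pd

# Hypothetical status values; the real categories come from df['vehicle_execution_status'].
status = pd.Series(
    ["COMPLETED", "COMPLETED", "CANCELLED", None],
    name="vehicle_execution_status",
)

# normalize=True returns fractions instead of counts;
# dropna=False keeps missing values as their own category.
shares = status.value_counts(normalize=True, dropna=False)
print(shares)
```

The same two keyword arguments can be passed to any of the `value_counts()` calls in the cell above.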

## Transit Time Analysis

Analysis of scheduled vs. actual transit times.

In [None]:
# Transit time statistics
# describe() is not the last expression in this cell, so print it explicitly
print("Transit time statistics:")
print(df[['transit_hours_actual', 'scheduled_transit_hours']].describe())

# Late arrival analysis
print("\nLate arrival statistics:")
print(f"Origin arrival late hours - average: {df['origin_arrival_late_hrs'].mean():.2f}")
print(f"Origin arrival late hours - median: {df['origin_arrival_late_hrs'].median():.2f}")
print(f"Destination arrival late hours - average: {df['dest_arrival_late_hrs'].mean():.2f}")
print(f"Destination arrival late hours - median: {df['dest_arrival_late_hrs'].median():.2f}")
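
To compare scheduled and actual transit times row by row rather than through separate summaries, one option is to compute the difference per load. This is a rough sketch on hypothetical sample rows; the `transit_delay_summary` helper and the sample values are assumptions for illustration, and only the two column names come from the notebook.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the notebook's df.
sample = pd.DataFrame({
    "transit_hours_actual": [10.0, 12.5, 8.0, None],
    "scheduled_transit_hours": [9.0, 12.5, 9.0, 11.0],
})

def transit_delay_summary(frame):
    """Summarize actual minus scheduled transit hours, skipping incomplete rows."""
    delay = frame["transit_hours_actual"] - frame["scheduled_transit_hours"]
    delay = delay.dropna()  # rows missing either value cannot be compared
    return {
        "rows_compared": int(delay.count()),
        "mean_delay_hours": float(delay.mean()),
        "share_over_schedule": float((delay > 0).mean()),
    }

summary = transit_delay_summary(sample)
print(summary)
```

A positive delay means the load took longer than scheduled. Rows with a missing value in either column are excluded, so `rows_compared` is worth checking against the dataset size.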

## Route Analysis

Examining characteristics of the routes.

In [None]:
# Miles distribution
plt.figure(figsize=(12, 6))
sns.histplot(df['miles'], bins=50)
plt.title('Distribution of Route Miles')
plt.xlabel('Miles')
plt.ylabel('Count')
plt.show()

# Stop count distribution
print("Stop count distribution:")
print(df['stop_count'].value_counts().head(10))

## Data Completeness

Checking for missing values across all columns and listing the 20 columns with the highest share of missing data.

In [None]:
# Calculate missing values percentage
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

# Create a dataframe with the results
missing_info = pd.DataFrame({
    'Missing Values': missing_data,
    'Missing Percent': missing_percent
})

# Sort by missing percent
missing_info = missing_info[missing_info['Missing Values'] > 0].sort_values('Missing Percent', ascending=False)
missing_info.head(20)
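
To focus the completeness check on the key fields examined earlier, the same calculation can be restricted to a chosen list of columns. The sketch below assumes a hypothetical helper `missing_summary` and a toy frame; the key column names are taken from the cells above, and any name not present in the frame is skipped rather than raising an error.

```python
import pandas as pd

# Key fields used earlier in the notebook.
KEY_COLUMNS = ["program_code", "shipment_mode", "equipment_type",
               "vehicle_execution_status", "miles", "stop_count"]

def missing_summary(frame, columns):
    """Return missing counts and percentages for the requested columns that exist."""
    present = [c for c in columns if c in frame.columns]
    counts = frame[present].isnull().sum()
    result = pd.DataFrame({
        "Missing Values": counts,
        "Missing Percent": counts / len(frame) * 100,
    })
    result = result[result["Missing Values"] > 0]
    return result.sort_values("Missing Percent", ascending=False)

# Hypothetical toy data standing in for df.
toy = pd.DataFrame({
    "program_code": ["A", "B", "C", "D"],
    "miles": [100.0, None, 250.0, None],
    "stop_count": [2, 3, None, 4],
})
report = missing_summary(toy, KEY_COLUMNS)
print(report)
```

On an empty frame the percentage is undefined (division by zero rows), so the helper is only meaningful once data has been loaded.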