# Simulated Supply Chain DataFrame

## Project Introduction

In an increasingly complex and interconnected world, efficient logistics and supply chain management are central to the success of businesses and economies. The challenges range from ensuring timely deliveries to controlling operational costs. This project analyzes simulated logistics data to identify key performance indicators, uncover inefficiencies, and propose actionable improvements to supply chain operations.

Our simulated dataset mirrors real-world logistics challenges, with data points that include distance traveled, fuel costs, delivery times, driver consistency, customer satisfaction, traffic conditions, and more. By leveraging machine learning techniques such as clustering, dimensionality reduction, and predictive modeling, this project seeks to provide a comprehensive framework for identifying areas of improvement within the logistics landscape.

Our objective is twofold: first, to develop a deeper understanding of the factors impacting operational costs and delivery efficiency, and second, to construct predictive models capable of guiding decision-making to optimize these factors. This multi-faceted project is divided into several stages, each building upon the previous one to achieve a holistic view of the logistics operation.

In [57]:
import numpy as np
import pandas as pd
from faker import Faker
import sys

In [58]:
import os

In [59]:

# Assuming 'scripts' is in the same parent directory as 'notebooks'
sys.path.append(os.path.abspath("../scripts"))

from data_cleaning import load_data, clean_data

In [60]:
# Initialize Faker
fake = Faker()

# Define the number of records you want to simulate
num_records = 10000  # Increased to 10,000 for larger dataset

# Define lists for categorical variables
weather_conditions = ["Clear", "Light Rain", "Heavy Rain", "Snow", "Fog"]
traffic_conditions = ["Light", "Moderate", "Heavy", "Severe"]
experience_levels = ["Junior", "Intermediate", "Senior"]
delivery_windows = ["Morning", "Afternoon", "Evening", "Overnight"]
package_types = ["Standard", "Fragile", "Perishable", "Oversized"]
route_types = ["Interstate", "Urban", "Suburban"]
truck_types = ["Box Truck", "Semi", "Flatbed"]
satisfaction_levels = ["Very Satisfied", "Satisfied", "Neutral", "Dissatisfied", "Very Dissatisfied"]
fuel_types = ["Diesel", "Gasoline"]

# Generate a range of dates
date_range = pd.date_range(start="2023-01-01", end="2023-12-31", freq='D')

# Generate unique driver IDs
num_drivers = 500  # Assuming 500 unique drivers
driver_ids = [fake.uuid4() for _ in range(num_drivers)]

# Generate random data
data = {
    "Route ID": [fake.uuid4() for _ in range(num_records)],
    "Driver ID": np.random.choice(driver_ids, num_records),
    "Delivery Time (hours)": np.random.uniform(1, 10, num_records),
    "Date": np.random.choice(date_range, num_records),
    "Fuel Costs (USD)": np.random.uniform(50, 1000, num_records),
    "Delivery Start Time": [fake.time(pattern="%H:%M:%S") for _ in range(num_records)],
    "Distance Traveled (miles)": np.random.uniform(50, 3000, num_records),
    "Estimated Distance (miles)": np.random.uniform(50, 3000, num_records),
    "Weather Conditions": np.random.choice(weather_conditions, num_records),
    "Traffic Conditions": np.random.choice(traffic_conditions, num_records),
    "Driver Ratings": np.random.uniform(1, 5, num_records),
    "Customer Satisfaction": np.random.choice(satisfaction_levels, num_records),
    "Delays (hours)": np.random.uniform(0, 5, num_records),
    "Warehouse Storage Costs (USD)": np.random.uniform(100, 500, num_records),
    "Truck Maintenance Costs (USD)": np.random.uniform(500, 2000, num_records),
    "Load Type": np.random.choice(package_types, num_records),
    "Load Weight (tons)": np.random.uniform(0.5, 20, num_records),
    "Route Type": np.random.choice(route_types, num_records),
    "Truck Type": np.random.choice(truck_types, num_records),
    "Driver Experience": np.random.choice(experience_levels, num_records),
    "Delivery Window": np.random.choice(delivery_windows, num_records),
    "Truck Condition": np.random.randint(1, 6, num_records),  # Rating from 1 to 5
    "Labor Costs (USD)": np.random.uniform(20, 200, num_records),
    "Fuel Type": np.random.choice(fuel_types, num_records),
    "Toll Costs (USD)": np.random.uniform(0, 50, num_records),
    "Parking Costs (USD)": np.random.uniform(0, 30, num_records),
    "Idle Time (hours)": np.random.uniform(0, 2, num_records)
}

# Additional calculated columns
data["Distance Difference (miles)"] = data["Distance Traveled (miles)"] - data["Estimated Distance (miles)"]
data["Cost per Gallon (USD)"] = np.where(np.array(data["Fuel Type"]) == "Diesel", 3.5, 3.0)
# np.random.uniform(5, 10) has no size argument, so a single miles-per-gallon value is shared by all rows
data["Total Fuel Cost (USD)"] = data["Distance Traveled (miles)"] / np.random.uniform(5, 10) * data["Cost per Gallon (USD)"]
# NOTE: "Hazardous" is not in package_types, so this condition is never true and every row draws from uniform(20, 100)
data["Insurance Costs (USD)"] = np.where(np.array(data["Load Type"]) == "Hazardous", np.random.uniform(50, 150, num_records), np.random.uniform(20, 100, num_records))
data["Breakdown Repair Costs (USD)"] = np.where(np.array(data["Truck Condition"]) <= 2, np.random.uniform(200, 1000, num_records), 0)
# The overtime rate and the surcharge percentage below are also single scalars shared by all rows
data["Overtime Labor Costs (USD)"] = np.where(data["Delivery Time (hours)"] > 8, (data["Delivery Time (hours)"] - 8) * np.random.uniform(20, 40), 0)
data["Fuel Surcharge (USD)"] = data["Fuel Costs (USD)"] * np.random.uniform(0.05, 0.15)
data["Idle Cost (USD)"] = data["Idle Time (hours)"] * data["Cost per Gallon (USD)"] * 0.5  # Assuming half a gallon per hour idling

# Calculate total operational costs
# (warehouse storage, truck maintenance, and labor costs are not part of this sum)
data["Total Operational Cost (USD)"] = (
    data["Fuel Costs (USD)"] +
    data["Toll Costs (USD)"] +
    data["Insurance Costs (USD)"] +
    data["Parking Costs (USD)"] +
    data["Breakdown Repair Costs (USD)"] +
    data["Overtime Labor Costs (USD)"] +
    data["Fuel Surcharge (USD)"] +
    data["Idle Cost (USD)"]
)

# Additional metrics
data["Fuel Cost per Mile"] = data["Fuel Costs (USD)"] / data["Distance Traveled (miles)"]
data["Delivery Efficiency Score"] = (
    1 / (1 + data["Fuel Cost per Mile"]) *
    (1 / (1 + data["Delays (hours)"])) *
    (1 / (1 + data["Load Weight (tons)"]))
)

# Create a DataFrame
logistics_df = pd.DataFrame(data)

# Display the first few rows to verify
print(logistics_df.head())

                               Route ID                             Driver ID  \
0  3a2d8f81-6054-45fa-b13a-24cd540ee76d  75149212-0e83-4e5b-83d4-3bd0c53f3030   
1  e8e92a39-62e1-4743-baee-c8544fbe59ee  9f47ca00-1ae7-4f76-878a-9a283d2c2ec2   
2  f6b4a887-f0cc-4528-a178-583c6c4d1fc4  fc05b7ca-0b9f-427d-9874-e21be1c6e521   
3  257e5a3f-78d3-48a1-8584-f508f2ee3670  4dd64322-0cc6-44a8-a2b7-b5f2663c9cae   
4  c74ae987-3974-490b-9707-bb325937916f  4ad93d48-9ad6-46a7-933c-57f8603a7f03   

   Delivery Time (hours)       Date  Fuel Costs (USD) Delivery Start Time  \
0               6.880603 2023-02-09        225.206331            22:34:14   
1               4.967601 2023-06-17        570.840791            03:38:06   
2               8.094009 2023-12-18        447.387571            09:01:13   
3               7.297488 2023-04-22        886.741516            14:26:53   
4               7.797437 2023-07-07        871.855361            03:17:26   

   Distance Traveled (miles)  Estimated Distance (
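
The calculated columns in the cell above call `np.random.uniform(5, 10)` without a size argument, which draws one scalar and applies it to every row. If per-row variation is wanted instead, a size argument is needed. The sketch below contrasts the two cases, using a seeded `np.random.default_rng` generator so reruns give the same numbers. It is a minimal illustration, not the notebook's method: the names `rng`, `n`, `shared_mpg`, and `per_row_mpg` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded generator: reruns give identical draws
n = 5  # small illustrative sample size (the notebook uses 10,000)

distance = rng.uniform(50, 3000, n)  # one distance per row
price = np.full(n, 3.5)              # flat diesel price per gallon, as in the notebook

# Scalar draw: a single miles-per-gallon value broadcast to every row
shared_mpg = rng.uniform(5, 10)
cost_shared = distance / shared_mpg * price

# Per-row draw: passing a size gives each row its own miles-per-gallon value
per_row_mpg = rng.uniform(5, 10, n)
cost_per_row = distance / per_row_mpg * price

# Implied mpg recovered from the costs: constant in the first case, varying in the second
implied_shared = distance * price / cost_shared
implied_per_row = distance * price / cost_per_row
print(np.ptp(implied_shared), np.ptp(implied_per_row))
```

The first printed range is zero up to floating-point error, while the second is not. The Faker-generated values can be made repeatable in a similar way by calling `Faker.seed(...)` before generating them.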

In [61]:
# Define paths to save/load data in the `data` directory
raw_data_path = '../data/raw/logistics_df.csv'
raw_excel_path = '../data/raw/logistics_df.xlsx'
cleaned_data_path = '../data/processed/cleaned_logistics_data.csv'
engineered_data_path = '../data/processed/engineered_data.csv'

# Ensure the `raw` directory exists within `data`
raw_directory = os.path.dirname(raw_data_path)
if not os.path.exists(raw_directory):
    os.makedirs(raw_directory)
    print(f"Directory created at {raw_directory}")

# Save the DataFrame as a CSV file in the `data/raw` directory
logistics_df.to_csv(raw_data_path, index=False)
print(f"Data saved successfully to {raw_data_path}")

# Save the DataFrame to an Excel file (requires an Excel writer engine such as openpyxl)
logistics_df.to_excel(raw_excel_path, index=False)  # index=False to avoid saving the index as a column

Data saved successfully to ../data/raw/logistics_df.csv
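
As an alternative to the explicit `os.path.exists` check, the same save step can be sketched with `pathlib`, where `mkdir(parents=True, exist_ok=True)` creates the directory only if it is missing. The example writes to a temporary directory rather than `../data/raw` so it can run anywhere. The two-row `sample_df` and the `out_dir` name are stand-ins assumed for illustration, not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for logistics_df (hypothetical two-row sample)
sample_df = pd.DataFrame({
    "Route Type": ["Urban", "Interstate"],
    "Fuel Costs (USD)": [225.5, 570.25],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data" / "raw"
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if the directory already exists

    csv_path = out_dir / "logistics_df.csv"
    sample_df.to_csv(csv_path, index=False)  # index=False keeps the row index out of the file

    # Read the file back and confirm the round trip preserved the data
    reloaded = pd.read_csv(csv_path)
    round_trip_ok = reloaded.equals(sample_df)

print(round_trip_ok)
```

Reading the file back straight after writing it is a cheap check that column names and values survived the save.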


In [62]:
# Load the dataset
logistics_df = pd.read_csv('../data/raw/logistics_df.csv')

# Display first few rows
logistics_df.head()

|  | Route ID | Driver ID | Delivery Time (hours) | Date | Fuel Costs (USD) | Delivery Start Time | Distance Traveled (miles) | Estimated Distance (miles) | Weather Conditions | Traffic Conditions | ... | Cost per Gallon (USD) | Total Fuel Cost (USD) | Insurance Costs (USD) | Breakdown Repair Costs (USD) | Overtime Labor Costs (USD) | Fuel Surcharge (USD) | Idle Cost (USD) | Total Operational Cost (USD) | Fuel Cost per Mile | Delivery Efficiency Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3a2d8f81-6054-45fa-b13a-24cd540ee76d | 75149212-0e83-4e5b-83d4-3bd0c53f3030 | 6.880603 | 2023-02-09 | 225.206331 | 22:34:14 | 571.318744 | 2339.2718 | Clear | Moderate | ... | 3.0 | 305.072583 | 53.29335 | 0.0 | 0.0 | 15.806772 | 0.368712 | 340.228658 | 0.394187 | 0.015323 |
| 1 | e8e92a39-62e1-4743-baee-c8544fbe59ee | 9f47ca00-1ae7-4f76-878a-9a283d2c2ec2 | 4.967601 | 2023-06-17 | 570.840791 | 03:38:06 | 1593.917947 | 1468.769307 | Heavy Rain | Heavy | ... | 3.5 | 992.973038 | 60.448295 | 0.0 | 0.0 | 40.066147 | 1.623336 | 684.194856 | 0.358137 | 0.0478 |
| 2 | f6b4a887-f0cc-4528-a178-583c6c4d1fc4 | fc05b7ca-0b9f-427d-9874-e21be1c6e521 | 8.094009 | 2023-12-18 | 447.387571 | 09:01:13 | 1127.075475 | 2270.634894 | Clear | Severe | ... | 3.0 | 601.835368 | 36.241719 | 0.0 | 2.384313 | 31.401218 | 0.855647 | 559.776989 | 0.396946 | 0.007551 |
| 3 | 257e5a3f-78d3-48a1-8584-f508f2ee3670 | 4dd64322-0cc6-44a8-a2b7-b5f2663c9cae | 7.297488 | 2023-04-22 | 886.741516 | 14:26:53 | 1607.293364 | 2476.720021 | Heavy Rain | Light | ... | 3.0 | 858.261948 | 75.561809 | 0.0 | 0.0 | 62.238573 | 0.330756 | 1052.419293 | 0.551699 | 0.008019 |
| 4 | c74ae987-3974-490b-9707-bb325937916f | 4ad93d48-9ad6-46a7-933c-57f8603a7f03 | 7.797437 | 2023-07-07 | 871.855361 | 03:17:26 | 379.177698 | 965.593024 | Heavy Rain | Light | ... | 3.5 | 236.218704 | 26.602416 | 0.0 | 0.0 | 61.193744 | 2.769288 | 1008.8488 | 2.299332 | 0.013692 |
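
One way to sanity-check the reloaded data is to recompute a derived column from its inputs. The sketch below rebuilds `Fuel Cost per Mile` for rows 0 and 4, using values copied from the displayed output. Because the display rounds to six decimals, the comparison uses a small absolute tolerance. The `check_df` name is an assumption introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Values copied from rows 0 and 4 of the displayed output
check_df = pd.DataFrame({
    "Fuel Costs (USD)": [225.206331, 871.855361],
    "Distance Traveled (miles)": [571.318744, 379.177698],
    "Fuel Cost per Mile": [0.394187, 2.299332],
})

# Same formula as the notebook: fuel cost divided by distance traveled
recomputed = check_df["Fuel Costs (USD)"] / check_df["Distance Traveled (miles)"]

# Displayed values are rounded to six decimals, so compare with an absolute tolerance
matches = np.isclose(recomputed, check_df["Fuel Cost per Mile"], atol=1e-5)
print(matches.all())
```

The same pattern could be applied to the full `logistics_df` by replacing `check_df` with the loaded frame.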
