## Data Preprocessing - Food Delivery Time Dataset
### Introduction
This notebook transforms the raw dataset into a machine-learning-ready format. We handle missing values, one-hot encode categorical variables, standardize numerical features, and split the data into training and test sets.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Preprocessing Steps:**
1. Import Libraries and Load Data
2. Missing Values Handling
3. Encode Categorical Features
4. Feature Scaling
5. Train-Test Split
6. Save Processed Data

### 1. Import Libraries and Load Data

In [22]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("../data/raw/Food_Delivery_Times.csv")
df.columns = df.columns.str.strip()  # remove stray whitespace from column names

print(f"Dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

df.head()

Dataset loaded: 1000 rows and 9 columns
Memory usage: 280.2 KB


| | Order_ID | Distance_km | Weather | Traffic_Level | Time_of_Day | Vehicle_Type | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 522 | 7.93 | Windy | Low | Afternoon | Scooter | 12 | 1.0 | 43 |
| 1 | 738 | 16.42 | Clear | Medium | Evening | Bike | 20 | 2.0 | 84 |
| 2 | 741 | 9.52 | Foggy | Low | Night | Scooter | 28 | 1.0 | 59 |
| 3 | 661 | 7.44 | Rainy | Medium | Afternoon | Scooter | 5 | 1.0 | 37 |
| 4 | 412 | 19.03 | Clear | Low | Morning | Bike | 16 | 5.0 | 68 |


### 2. Missing Values Handling

In [23]:
missing = df.isnull().sum()
print("Missing values:")
print(missing[missing > 0])

print("\nHandling missing values...")
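# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which is deprecated in recent pandas versions.
# Categorical columns: fill with the most frequent value (mode).
for col in ['Weather', 'Traffic_Level', 'Time_of_Day']:
    df[col] = df[col].fillna(df[col].mode()[0])
# Numerical column: fill with the median.
df['Courier_Experience_yrs'] = df['Courier_Experience_yrs'].fillna(df['Courier_Experience_yrs'].median())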

print("After handling:")
print(df.isnull().sum().sum())

Missing values:
Weather                   30
Traffic_Level             30
Time_of_Day               30
Courier_Experience_yrs    30
dtype: int64

Handling missing values...
After handling:
0
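
The mode/median strategy used in this step can be sketched on a tiny, made-up frame. The helper name `impute_mode_median` and the toy values are assumptions for illustration only; they are not part of the notebook or the dataset.

```python
import numpy as np
import pandas as pd


def impute_mode_median(frame, categorical_cols, numeric_cols):
    """Return a copy with categorical NaNs filled by the mode and numeric NaNs by the median."""
    out = frame.copy()
    for col in categorical_cols:
        # mode() ignores missing values; [0] picks the first mode if there are ties
        out[col] = out[col].fillna(out[col].mode()[0])
    for col in numeric_cols:
        # median() also skips missing values by default
        out[col] = out[col].fillna(out[col].median())
    return out


# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "Weather": ["Clear", "Rainy", None, "Clear"],
    "Courier_Experience_yrs": [1.0, np.nan, 3.0, 5.0],
})

filled = impute_mode_median(toy, ["Weather"], ["Courier_Experience_yrs"])
print(filled)
```

Returning a copy leaves the input frame untouched, which makes it easy to compare the data before and after imputation.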


### 3. Encode Categorical Features

In [24]:
categorical_cols = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
print(f"Before encoding: {df.shape[1]} columns")
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=False)
print(f"After encoding: {df_encoded.shape[1]} columns")

for col in categorical_cols:
    new_cols = [c for c in df_encoded.columns if c.startswith(col + "_")]
    print(f"{col}: {len(new_cols)} new columns : {new_cols}")

Before encoding: 9 columns
After encoding: 20 columns
Weather: 5 new columns : ['Weather_Clear', 'Weather_Foggy', 'Weather_Rainy', 'Weather_Snowy', 'Weather_Windy']
Traffic_Level: 3 new columns : ['Traffic_Level_High', 'Traffic_Level_Low', 'Traffic_Level_Medium']
Time_of_Day: 4 new columns : ['Time_of_Day_Afternoon', 'Time_of_Day_Evening', 'Time_of_Day_Morning', 'Time_of_Day_Night']
Vehicle_Type: 3 new columns : ['Vehicle_Type_Bike', 'Vehicle_Type_Car', 'Vehicle_Type_Scooter']
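
The column count above follows from simple arithmetic: 9 original columns, minus the 4 categorical columns, plus 5 + 3 + 4 + 3 = 15 indicator columns, gives 20. The sketch below checks the same arithmetic on a small, made-up frame and also shows `drop_first=True`, an alternative that drops one level per variable and is sometimes preferred for linear models. The notebook itself keeps all levels, and the toy values are assumptions.

```python
import pandas as pd

# Toy data (hypothetical values) with two categorical columns
toy = pd.DataFrame({
    "Distance_km": [7.93, 16.42, 9.52],
    "Weather": ["Windy", "Clear", "Foggy"],
    "Vehicle_Type": ["Scooter", "Bike", "Scooter"],
})
categorical_cols = ["Weather", "Vehicle_Type"]

# Keep every level (the notebook's choice)
full = pd.get_dummies(toy, columns=categorical_cols, drop_first=False)
# Drop the first level of each variable (alternative)
reduced = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)

# Expected width: original columns - encoded columns + total number of levels
n_levels = sum(toy[c].nunique() for c in categorical_cols)
expected_full = toy.shape[1] - len(categorical_cols) + n_levels

print(f"All levels: {full.shape[1]} columns (expected {expected_full})")
print(f"drop_first=True: {reduced.shape[1]} columns")
```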


### 4. Feature Scaling

In [None]:
numeric_cols = ['Distance_km', 'Preparation_Time_min', 'Courier_Experience_yrs']
scaler = StandardScaler()

df_scaled = df_encoded.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])

print("Numeric features scaled")
print(df_scaled[numeric_cols].describe())

Numeric features scaled
        Distance_km  Preparation_Time_min  Courier_Experience_yrs
count  1.000000e+03          1.000000e+03            1.000000e+03
mean   3.019807e-17          8.526513e-17            9.947598e-17
std    1.000500e+00          1.000500e+00            1.000500e+00
min   -1.663205e+00         -1.663947e+00           -1.600132e+00
25%   -8.702386e-01         -8.307238e-01           -9.032107e-01
50%    2.283710e-02          2.499670e-03            1.421721e-01
75%    8.706882e-01          8.357231e-01            8.390939e-01
max    1.744006e+00          1.668947e+00            1.536016e+00
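
The `std` row reads 1.0005 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With n = 1000 rows the ratio is sqrt(1000 / 999) ≈ 1.0005. The sketch below reproduces this with plain pandas on randomly generated values; the data are an assumption, and the manual z-score stands in for the scaler's default behaviour.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(loc=10.0, scale=3.0, size=1000))

# Standardize with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

sample_std = z.std()            # pandas default is ddof=1, as in describe()
population_std = z.std(ddof=0)  # exactly 1 up to floating-point error
expected_ratio = np.sqrt(1000 / 999)

print(f"sample std: {sample_std:.6f}")
print(f"population std: {population_std:.6f}")
print(f"sqrt(n / (n - 1)): {expected_ratio:.6f}")
```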


### 5. Train-Test Split

In [26]:
X = df_scaled.drop(['Order_ID', 'Delivery_Time_min'], axis=1)
y = df_scaled['Delivery_Time_min']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Features: {X.shape[1]}")

Training set: (800, 18)
Test set: (200, 18)
Features: 18
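
In this notebook the scaler is fitted on all 1000 rows before the split, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering on randomly generated data; the toy values and column subset are assumptions, and this is not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) standing in for the delivery dataset
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Distance_km": rng.uniform(0.5, 20.0, size=100),
    "Preparation_Time_min": rng.uniform(5.0, 30.0, size=100),
    "Delivery_Time_min": rng.uniform(10.0, 120.0, size=100),
})
numeric_cols = ["Distance_km", "Preparation_Time_min"]

X = toy.drop(columns=["Delivery_Time_min"])
y = toy["Delivery_Time_min"]

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# 2. Fit the scaler on the training rows only
scaler = StandardScaler().fit(X_train[numeric_cols])

# 3. Apply the same training statistics to both sets
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train[numeric_cols].mean().round(6))
print(X_test[numeric_cols].mean().round(6))
```

With this ordering the training features have zero mean by construction, while the test features generally do not, which is expected.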


### 6. Save Processed Data

In [27]:
# X_* are already DataFrames and y_* are Series, so they can be written directly.
df_scaled.to_csv("../data/processed/processed_data.csv", index=False)
X_train.to_csv("../data/processed/X_train.csv", index=False)
X_test.to_csv("../data/processed/X_test.csv", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)
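
The save step assumes that the output directory already exists. A minimal sketch of a more defensive variant is shown below: it creates the directory if needed and reloads one file to check the round trip. The helper name `save_splits`, the temporary directory, and the toy values are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_splits(out_dir, **frames):
    """Write each named DataFrame/Series to <out_dir>/<name>.csv and return the file names."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create missing directories
    for name, frame in frames.items():
        frame.to_csv(out_path / f"{name}.csv", index=False)
    return sorted(p.name for p in out_path.glob("*.csv"))


# Toy data (hypothetical values)
X_train = pd.DataFrame({"Distance_km": [0.5, -1.2], "Weather_Clear": [True, False]})
y_train = pd.Series([43, 84], name="Delivery_Time_min")

with tempfile.TemporaryDirectory() as tmp:
    target_dir = Path(tmp) / "processed"
    written = save_splits(target_dir, X_train=X_train, y_train=y_train)
    reloaded = pd.read_csv(target_dir / "y_train.csv")

print(written)
print(reloaded)
```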

### Conclusion
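Preprocessing completed on 1000 records. Missing values (30 in each of four columns, 120 in total) were filled with the mode for the categorical columns and the median for `Courier_Experience_yrs`. One-hot encoding expanded the dataset from 9 to 20 columns, 18 of which remain as model features once `Order_ID` and the target `Delivery_Time_min` are dropped. StandardScaler was applied to the three numerical features. The final 80/20 train-test split (`random_state=42`) yields 800 training rows and 200 test rows. Note that the imputation values and the scaler were computed on the full dataset before the split, so the test set is not fully independent of the training statistics.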