## Using `MissForest` to Impute Missing Values in `MultipleDeliveries`

Missing values in `MultipleDeliveries` are encoded as -1. We will use `MissForest`, which trains a model on the other columns, to predict and fill those missing entries.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from missforest.missforest import MissForest
#pip install lightgbm
#pip install scikit-learn
#pip install MissForest

In [3]:
data = pd.read_csv("train_cleaned.csv")
data.head()

Unnamed: 0,Age,Ratings,RestaurantLat,RestaurantLon,DeliveryLocationLat,DeliveryLocationLon,TimeOrderPickedUp,WeatherConditions,RoadTrafficDensity,VehicleCondition,TypeOfOrder,TypeOfVehicle,MultipleDeliveries,Festival,City,TimeTaken,Distance,Day,Hour
0,37.0,4.9,22.745049,75.892471,22.765049,75.912471,11:45:00,conditions Sunny,High,2,Snack,motorcycle,0,No,Urban,24,3.025149,Saturday,11
1,34.0,4.5,12.913041,77.683237,13.043041,77.813237,19:50:00,conditions Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,33,20.18353,Friday,19
2,23.0,4.4,12.914264,77.6784,12.924264,77.6884,08:45:00,conditions Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,26,1.552758,Saturday,8
3,38.0,4.7,11.003669,76.976494,11.053669,77.026494,18:10:00,conditions Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,21,7.790401,Tuesday,18
4,32.0,4.6,12.972793,80.249982,13.012793,80.289982,13:45:00,conditions Cloudy,High,1,Snack,scooter,1,No,Metropolitian,30,6.210138,Saturday,13


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40353 entries, 0 to 40352
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  40353 non-null  float64
 1   Ratings              40353 non-null  float64
 2   RestaurantLat        40353 non-null  float64
 3   RestaurantLon        40353 non-null  float64
 4   DeliveryLocationLat  40353 non-null  float64
 5   DeliveryLocationLon  40353 non-null  float64
 6   TimeOrderPickedUp    40353 non-null  object 
 7   WeatherConditions    40353 non-null  object 
 8   RoadTrafficDensity   40353 non-null  object 
 9   VehicleCondition     40353 non-null  int64  
 10  TypeOfOrder          40353 non-null  object 
 11  TypeOfVehicle        40353 non-null  object 
 12  MultipleDeliveries   40353 non-null  int64  
 13  Festival             40353 non-null  object 
 14  City                 40353 non-null  object 
 15  TimeTaken            40353 non-null 

Since `MissForest` does not accept time-series values, we write a function `categorize_time` that maps the values in `Hour` to a new categorical variable called `OrderPeriod`.

In [5]:
# Convert hour to categories based on time of day
def categorize_time(hour):
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 18:
        return 'afternoon'
    elif 18 <= hour < 24:
        return 'evening'
    else:
        return 'night'

# Apply categorization function to your hour column
data['OrderPeriod'] = data['Hour'].apply(categorize_time)

# Convert the newly created category column to categorical data type
data['OrderPeriod'] = pd.Categorical(data['OrderPeriod'], categories=['morning', 'afternoon', 'evening', 'night'], ordered=True)
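The same binning can also be sketched in vectorized form with `pd.cut`. This is an alternative to the cell above, not the method it uses, and it assumes `Hour` only holds integers from 0 to 23. The `hours` series is made up for illustration.

```python
import pandas as pd

def categorize_time(hour):
    # Same rules as the notebook cell above
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 18:
        return 'afternoon'
    elif 18 <= hour < 24:
        return 'evening'
    else:
        return 'night'

# Hypothetical input: every hour of the day once
hours = pd.Series(range(24), name="Hour")

# Left-closed bins: [0, 6) night, [6, 12) morning, [12, 18) afternoon, [18, 24) evening
periods = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

# Check that both approaches agree for every hour
matches = periods.astype(str).tolist() == hours.apply(categorize_time).tolist()
print(matches)
```

Note that `pd.cut` orders the categories by bin, so `night` comes first here, whereas the cell above lists it last.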


### Convert missing data back to `np.nan`
Previously we converted missing data to -1 so that the datatype could change from `object` to `int32`. Now we have to convert those placeholders back to `np.nan` so that `MissForest` can recognise and impute them.

In [6]:
# Convert the -1 placeholder back to NaN.
# Assigning the result back avoids the chained inplace call, which pandas
# warns about and which will stop working in pandas 3.0.
data['MultipleDeliveries'] = data['MultipleDeliveries'].replace(-1, np.nan)
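As a small sketch of the placeholder-to-NaN conversion, the snippet below runs it on a made-up frame (`toy` is hypothetical) and counts the resulting missing values. It assigns the result back to the column instead of calling `replace(..., inplace=True)` on a column selection.

```python
import numpy as np
import pandas as pd

# Hypothetical data: -1 marks a missing value
toy = pd.DataFrame({"MultipleDeliveries": [0, 1, -1, 2, -1]})

# Assign the result back to the column rather than modifying a selection in place
toy["MultipleDeliveries"] = toy["MultipleDeliveries"].replace(-1, np.nan)

n_missing = int(toy["MultipleDeliveries"].isna().sum())
print(n_missing)
print(toy["MultipleDeliveries"].dtype)
```

The column becomes floating point after this step because integer columns cannot hold `NaN`.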


### Selecting variables
Now we select the variables that can be used in the prediction.

In [7]:
columnstokeep = ['Age', 'Ratings', 'WeatherConditions', 'RoadTrafficDensity', 'VehicleCondition',
                 'TypeOfOrder', 'MultipleDeliveries', 'Festival', 'City', 'TimeTaken','Distance', 'Day', 'OrderPeriod']

Cleanedforimpute = data.loc[:, columnstokeep].copy()

#### Convert non-numerical values

In [8]:
# Get the list of nominal columns
nominal_columns = list(Cleanedforimpute.select_dtypes(include=["object", "category"]).columns)

# Get the positional indices of the nominal columns (object and category dtypes)
cat_ind = [Cleanedforimpute.columns.get_loc(c) for c in nominal_columns]

# Translate categorical fields to numeric
from sklearn import preprocessing
col_le = {}
Cleanedforimpute_trans = Cleanedforimpute.copy()
for col in nominal_columns:
    le = preprocessing.LabelEncoder()
    le.fit(Cleanedforimpute[col])
    Cleanedforimpute_trans[col] = le.transform(Cleanedforimpute[col])
    col_le[col] = le

# Add back in the NaNs: keep the encoded value where the original was present
# and restore NaN where the original was missing
for col in nominal_columns:
    Cleanedforimpute_trans[col] = Cleanedforimpute_trans[col].where(Cleanedforimpute[col].notna())
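The idea of encoding categories as numbers while keeping missing entries as `NaN` can be illustrated on a toy column. The sketch below deliberately swaps in pandas category codes instead of `LabelEncoder`, so it is not the notebook's method; the `toy` series is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
toy = pd.Series(["Urban", np.nan, "Metropolitian", "Urban"], name="City")

# Category codes are assigned in sorted order; missing values get code -1
cat = pd.Categorical(toy)
codes = pd.Series(cat.codes, index=toy.index, dtype="float64")

# Turn the -1 marker back into NaN so an imputer would see it as missing
encoded = codes.where(codes >= 0)

# Decoding: map the codes back to the original labels
decoded = pd.Categorical.from_codes(
    encoded.fillna(-1).astype(int), categories=cat.categories
)
decoded_list = pd.Series(decoded).tolist()
print(encoded.tolist())
print(decoded_list)
```

Keeping the category list (here `cat.categories`) plays the same role as storing each fitted encoder in `col_le`: it is what lets the numeric codes be translated back to labels later.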


In [9]:
# Initialize the magical forest
imputer = MissForest()

# Impute away
data_imputed = imputer.fit_transform(Cleanedforimpute_trans)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  x[c].fillna(initial_imputations[c], inplace=True)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003668 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 383
[LightGBM] [Info] Number of data points in the train set: 39498, number of used features: 12
[LightGBM] [Info] Start training from score 0.744620
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001547 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 383
[LightGBM] [Info] Number of data points in the train set: 39498, number of used features: 12
[LightGBM] [Info] Start training from score 0.744620
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001622 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins

In [10]:
data_imputed['MultipleDeliveries'].describe()

count    40353.000000
mean         0.742851
std          0.567109
min          0.000000
25%          0.000000
50%          1.000000
75%          1.000000
max          3.000000
Name: MultipleDeliveries, dtype: float64

Now we put the imputed data back into the original dataframe.

In [11]:
data['MultipleDeliveries'] = data_imputed['MultipleDeliveries']
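Before overwriting a column with imputed values, it can be worth checking that only the originally missing rows changed. The sketch below shows one way to do that with a boolean mask; `before` and `after` are made-up stand-ins for the column before and after imputation, not actual imputer output.

```python
import numpy as np
import pandas as pd

# Hypothetical column before imputation (NaN marks missing values)
before = pd.Series([0.0, np.nan, 1.0, np.nan, 2.0], name="MultipleDeliveries")
# Hypothetical stand-in for the imputed column
after = pd.Series([0.0, 0.7, 1.0, 1.2, 2.0], name="MultipleDeliveries")

was_missing = before.isna()

# Observed values should be untouched by the imputation
observed_unchanged = after[~was_missing].equals(before[~was_missing])
# No missing values should remain
no_missing_left = not after.isna().any()

print(observed_unchanged, no_missing_left)
```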

In [12]:
data['MultipleDeliveries'] = data['MultipleDeliveries'].astype(int)
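One thing to keep in mind with this cast: `astype(int)` truncates toward zero rather than rounding. Whether that matters here depends on whether the imputer returned fractional values for `MultipleDeliveries`, which the output above does not show directly. The sketch below uses made-up values to illustrate the difference.

```python
import pandas as pd

# Hypothetical imputed values, some fractional
imputed = pd.Series([0.2, 0.9, 1.6, 2.0])

truncated = imputed.astype(int)        # drops the fractional part
rounded = imputed.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())  # [0, 0, 1, 2]
print(rounded.tolist())    # [0, 1, 2, 2]
```

If fractional imputations are present, rounding before the cast keeps each value closer to the model's prediction.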

In [13]:
data['Age'] = data['Age'].astype(int)

### Exporting data 

In [14]:
data.to_csv(r'train_cleaned_imputed.csv', index=False)
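The `index=False` argument matters when the file is read back later. A column such as the `Unnamed: 0` shown in `data.head()` typically appears when a CSV was written with its index and then read without `index_col`. The sketch below demonstrates this with an in-memory buffer and a made-up frame (`toy` is hypothetical).

```python
import io
import pandas as pd

toy = pd.DataFrame({"Age": [37, 34], "TimeTaken": [24, 33]})

# Written with the default index=True: the index becomes an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# Written with index=False: only the real columns are stored
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```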