# I. Project Team Members

| Prepared by | Email | Prepared for |
| :-: | :-: | :-: |
| **Hardefa Rogonondo** | hardefarogonondo@gmail.com | **E-commerce Shipping Time Prediction Engine** |

# II. Notebook Target Definition

This notebook covers the data preparation stage of the E-commerce Shipping Time Prediction Engine project. Starting from a dataset sourced from Kaggle, we first inspect its shape and column information. We then cross-check each column against its data definition and convert any data type that does not match. Finally, we separate the features from the target label and export both in .pkl format for the subsequent exploratory data analysis (EDA) phase.

# III. Notebook Setup

## III.A. Import Libraries

In [1]:
import pandas as pd
import pickle

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## III.B. Import Data

In [2]:
df = pd.read_csv('../../data/raw/Train.csv')
df.head()

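| | ID | Warehouse_block | Mode_of_Shipment | Customer_care_calls | Customer_rating | Cost_of_the_Product | Prior_purchases | Product_importance | Gender | Discount_offered | Weight_in_gms | Reached.on.Time_Y.N |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1 | D | Flight | 4 | 2 | 177 | 3 | low | F | 44 | 1233 | 1 |
| 1 | 2 | F | Flight | 4 | 5 | 216 | 2 | low | M | 59 | 3088 | 1 |
| 2 | 3 | A | Flight | 2 | 2 | 183 | 4 | low | M | 48 | 3374 | 1 |
| 3 | 4 | B | Flight | 3 | 3 | 176 | 4 | medium | M | 10 | 1177 | 1 |
| 4 | 5 | C | Flight | 2 | 2 | 184 | 3 | medium | F | 46 | 2484 | 1 |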


# IV. Data Preparation

## IV.A. Data Shape Inspection

In [3]:
df.shape

(10999, 12)

## IV.B. Data Information Inspection

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999 entries, 0 to 10998
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   10999 non-null  int64 
 1   Warehouse_block      10999 non-null  object
 2   Mode_of_Shipment     10999 non-null  object
 3   Customer_care_calls  10999 non-null  int64 
 4   Customer_rating      10999 non-null  int64 
 5   Cost_of_the_Product  10999 non-null  int64 
 6   Prior_purchases      10999 non-null  int64 
 7   Product_importance   10999 non-null  object
 8   Gender               10999 non-null  object
 9   Discount_offered     10999 non-null  int64 
 10  Weight_in_gms        10999 non-null  int64 
 11  Reached.on.Time_Y.N  10999 non-null  int64 
dtypes: int64(8), object(4)
memory usage: 1.0+ MB
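
The `Non-Null Count` column above reports 10999 non-null entries for each of the 12 columns, which suggests the dataset has no missing values. The same check can be sketched programmatically. Here `missing_value_report` is a hypothetical helper name, and the toy frame is illustrative rather than taken from the dataset.

```python
import pandas as pd


def missing_value_report(frame):
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame (an assumption for illustration) with one missing Gender value.
sample = pd.DataFrame({"ID": [1, 2, 3], "Gender": ["F", None, "M"]})
report = missing_value_report(sample)
print(report.to_dict())
```

An empty report on the real `df` would agree with the `df.info()` output above.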


## IV.C. Data Definition

| Variable | Definition |
| :-: | :-: |
| ID | Customer ID number. |
| Warehouse_block | The company's large warehouse is divided into blocks such as A, B, C, D, and E (the data preview above also contains block F). |
| Mode_of_Shipment | How the company ships the product: Ship, Flight, or Road. |
| Customer_care_calls | Number of calls made to customer care to enquire about the shipment. |
| Customer_rating | Rating given by the customer: 1 is the lowest (worst), 5 is the highest (best). |
| Cost_of_the_Product | Cost of the product in US dollars. |
| Prior_purchases | Number of prior purchases. |
| Product_importance | Importance category assigned by the company: low, medium, or high. |
| Gender | Customer gender: male (M) or female (F). |
| Discount_offered | Discount offered on that specific product. |
| Weight_in_gms | Product weight in grams. |
| Reached.on.Time_Y.N | Target variable: 1 indicates that the product did NOT arrive on time, and 0 indicates that it arrived on time. |
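
The definitions can be cross-checked against the values that actually occur in the data. The sketch below is one possible way to do this: `EXPECTED_VALUES` and `find_unexpected_values` are hypothetical names, and the allowed sets are transcribed from the table as an assumption, not a verified schema. It uses only the five preview rows shown earlier in this notebook.

```python
import pandas as pd

# Allowed values transcribed from the definitions table; treat this mapping
# as an assumption for illustration, not a verified schema.
EXPECTED_VALUES = {
    "Warehouse_block": {"A", "B", "C", "D", "E"},
    "Mode_of_Shipment": {"Ship", "Flight", "Road"},
    "Product_importance": {"low", "medium", "high"},
}


def find_unexpected_values(frame, expected):
    """Return {column: sorted values present in the data but not in `expected`}."""
    unexpected = {}
    for column, allowed in expected.items():
        extra = set(frame[column].dropna().tolist()) - allowed
        if extra:
            unexpected[column] = sorted(extra)
    return unexpected


# The five preview rows shown by df.head(), restricted to three columns.
preview = pd.DataFrame(
    {
        "Warehouse_block": ["D", "F", "A", "B", "C"],
        "Mode_of_Shipment": ["Flight"] * 5,
        "Product_importance": ["low", "low", "low", "medium", "medium"],
    }
)
print(find_unexpected_values(preview, EXPECTED_VALUES))
# {'Warehouse_block': ['F']}
```

Block F appears in the preview but not in the listed examples, so the warehouse definition is worth confirming against the full column before any encoding step.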

## IV.D. Data Validation

| Variables | Data Types |
| :-: | :-: |
| ID | String |

In [5]:
# ID is an identifier, not a quantity, so store it as a string
df["ID"] = df["ID"].astype(str)

In [6]:
df.head()

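| | ID | Warehouse_block | Mode_of_Shipment | Customer_care_calls | Customer_rating | Cost_of_the_Product | Prior_purchases | Product_importance | Gender | Discount_offered | Weight_in_gms | Reached.on.Time_Y.N |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1 | D | Flight | 4 | 2 | 177 | 3 | low | F | 44 | 1233 | 1 |
| 1 | 2 | F | Flight | 4 | 5 | 216 | 2 | low | M | 59 | 3088 | 1 |
| 2 | 3 | A | Flight | 2 | 2 | 183 | 4 | low | M | 48 | 3374 | 1 |
| 3 | 4 | B | Flight | 3 | 3 | 176 | 4 | medium | M | 10 | 1177 | 1 |
| 4 | 5 | C | Flight | 2 | 2 | 184 | 3 | medium | F | 46 | 2484 | 1 |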


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999 entries, 0 to 10998
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   10999 non-null  object
 1   Warehouse_block      10999 non-null  object
 2   Mode_of_Shipment     10999 non-null  object
 3   Customer_care_calls  10999 non-null  int64 
 4   Customer_rating      10999 non-null  int64 
 5   Cost_of_the_Product  10999 non-null  int64 
 6   Prior_purchases      10999 non-null  int64 
 7   Product_importance   10999 non-null  object
 8   Gender               10999 non-null  object
 9   Discount_offered     10999 non-null  int64 
 10  Weight_in_gms        10999 non-null  int64 
 11  Reached.on.Time_Y.N  10999 non-null  int64 
dtypes: int64(7), object(5)
memory usage: 1.0+ MB
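
After the conversion, `ID` is listed as `object` instead of `int64`, so the dtype summary changes from `int64(8), object(4)` to `int64(7), object(5)`. Presumably the point of the cast is that an identifier should not be treated as a numeric feature. A minimal sketch of asserting this outcome follows; `id_as_string` is a hypothetical helper, and the toy frame reuses a few values from the preview rows.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def id_as_string(frame, column="ID"):
    """Return a copy of `frame` with `column` cast to string."""
    out = frame.copy()
    out[column] = out[column].astype(str)
    return out


# Toy frame built from the first three preview rows (two columns only).
sample = pd.DataFrame({"ID": [1, 2, 3], "Weight_in_gms": [1233, 3088, 3374]})
converted = id_as_string(sample)

# The identifier is no longer numeric; the measurement column is untouched.
assert not is_numeric_dtype(converted["ID"])
assert is_numeric_dtype(converted["Weight_in_gms"])
```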


## IV.E. Data Segregation

In [8]:
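# Features: every column except the target
X = df.drop(columns="Reached.on.Time_Y.N")
# Target: 1 = NOT delivered on time, 0 = delivered on time
y = df["Reached.on.Time_Y.N"]
X.shape, y.shape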

((10999, 11), (10999,))

In [9]:
X.head()

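| | ID | Warehouse_block | Mode_of_Shipment | Customer_care_calls | Customer_rating | Cost_of_the_Product | Prior_purchases | Product_importance | Gender | Discount_offered | Weight_in_gms |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1 | D | Flight | 4 | 2 | 177 | 3 | low | F | 44 | 1233 |
| 1 | 2 | F | Flight | 4 | 5 | 216 | 2 | low | M | 59 | 3088 |
| 2 | 3 | A | Flight | 2 | 2 | 183 | 4 | low | M | 48 | 3374 |
| 3 | 4 | B | Flight | 3 | 3 | 176 | 4 | medium | M | 10 | 1177 |
| 4 | 5 | C | Flight | 2 | 2 | 184 | 3 | medium | F | 46 | 2484 |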


In [10]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Reached.on.Time_Y.N, dtype: int64
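
Because the target is encoded so that 1 means the product did NOT arrive on time, the raw values are easy to misread. One option is to map them to explicit labels when inspecting the class balance. This is a sketch only: the label names `late` and `on_time` are assumptions introduced here, and the toy series does not reflect the real class distribution.

```python
import pandas as pd

# Hypothetical readable labels for the target encoding described in IV.C.
TARGET_LABELS = {0: "on_time", 1: "late"}


def describe_target(target):
    """Count observations per readable target label."""
    return target.map(TARGET_LABELS).value_counts()


# Toy target values for illustration only.
sample_y = pd.Series([1, 1, 0, 1, 0], name="Reached.on.Time_Y.N")
counts = describe_target(sample_y)
print(counts)
```

Running `describe_target(y)` on the real target would show how balanced the two classes are before the EDA phase.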

## IV.F. Export Data

In [11]:
X.to_pickle('../../data/processed/X.pkl')
y.to_pickle('../../data/processed/y.pkl')
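
Since the EDA notebook depends on these files, it can be worth confirming that they load back unchanged. The sketch below writes to a temporary directory so it does not touch `../../data/processed/`; `export_and_verify` is a hypothetical helper, and the toy data stands in for `X` and `y`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_and_verify(features, target, out_dir):
    """Pickle features and target into `out_dir`, then check the round trip."""
    out_dir = Path(out_dir)
    # to_pickle does not create missing directories, so create the folder first.
    out_dir.mkdir(parents=True, exist_ok=True)
    x_path = out_dir / "X.pkl"
    y_path = out_dir / "y.pkl"
    features.to_pickle(x_path)
    target.to_pickle(y_path)
    # equals() compares values, index, and dtypes of the reloaded objects.
    return pd.read_pickle(x_path).equals(features) and pd.read_pickle(y_path).equals(target)


# Toy stand-ins for X and y (illustrative values only).
X_demo = pd.DataFrame({"ID": ["1", "2"], "Weight_in_gms": [1233, 3088]})
y_demo = pd.Series([1, 1], name="Reached.on.Time_Y.N")

with tempfile.TemporaryDirectory() as tmp:
    round_trip_ok = export_and_verify(X_demo, y_demo, Path(tmp) / "processed")
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, which holds here because the notebook writes them itself.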