
# Business Case: Delhivery - Feature Engineering



## About Delhivery

Delhivery is the largest and fastest-growing fully integrated player in India by revenue in Fiscal 2021. They aim to build the operating system for commerce, through a combination of world-class infrastructure, logistics operations of the highest quality, and cutting-edge engineering and technology capabilities.

The Data team builds intelligence and capabilities on top of the company's data, helping to widen the gap in quality, efficiency, and profitability between Delhivery and its competitors.

### How can you help here?



The company wants to understand and process the data coming out of data engineering pipelines:

• Clean, sanitize and manipulate data to get useful features out of raw fields

• Make sense out of the raw data and help the data science team to build forecasting models on it

### Dataset

Dataset Link: [Delhivery data](https://drive.google.com/file/d/1ZkF2gGCDkjwQgOTGVBpsqhPpSGg1Fybb/view?usp=drive_link)



### Column Profiling

The key fields in the dataset are:

- **data**: tells whether the data is testing or training data.
- **trip_creation_time**: Timestamp of trip creation.
- **route_schedule_uuid**: Unique Id for a particular route schedule.
- **route_type**: Transportation type.
  - **FTL**: Full Truck Load - FTL shipments get to the destination sooner, as the truck is making no other pickups or drop-offs along the way.
  - **Carting**: Handling system consisting of small vehicles (carts).
- **trip_uuid**: Unique ID given to a particular trip (A trip may include different source and destination centers).
- **source_center**: Source ID of trip origin.
- **source_name**: Source Name of trip origin.
- **destination_center**: Destination ID.
- **destination_name**: Destination Name.
- **od_start_time**: Trip start time.
- **od_end_time**: Trip end time.
- **start_scan_to_end_scan**: Time taken to deliver from source to destination.
- **is_cutoff**: Unknown field.
- **cutoff_factor**: Unknown field.
- **cutoff_timestamp**: Unknown field.
- **actual_distance_to_destination**: Distance in Kms between source and destination warehouse.
- **actual_time**: Actual time taken to complete the delivery (Cumulative).
- **osrm_time**: Travel time computed by OSRM, an open-source routing engine that finds the shortest path between points on a given map (includes usual traffic and distance through major and minor roads) (Cumulative).
- **osrm_distance**: Distance of the shortest path computed by the same open-source routing engine (includes usual traffic and distance through major and minor roads) (Cumulative).
- **factor**: Unknown field.
- **segment_actual_time**: Actual time taken for one segment (a subset) of the package delivery.
- **segment_osrm_time**: OSRM time for one segment (a subset) of the package delivery.
- **segment_osrm_distance**: OSRM distance covered in one segment (a subset) of the package delivery.
- **segment_factor**: Unknown field.

### Concepts Used

- Feature Creation
- Relationship between Features
- Column Normalization /Column Standardization
- Handling categorical values
- Missing values - Outlier treatment / Types of outliers

### How to begin:

The delivery details of one package are spread over several rows (think of connecting flights taken to reach a particular destination). Think about how each field should be treated when these rows are combined: which aggregation makes sense for each field, and what happens to the numeric fields when the rows are merged?

#### Hint:


You can use inbuilt functions like `groupby` and aggregations like `sum()`, `cumsum()` to merge some rows based on their
1. `Trip_uuid`, `Source ID` and `Destination ID`
2. Further aggregate on the basis of just `Trip_uuid`. You can also keep the first and last values for some numeric/categorical fields if aggregating them won’t make sense.

#### Basic data cleaning and exploration:

- Handle missing values in the data.
- Analyze the structure of the data.
- Try merging the rows using the hint mentioned above.
- Build some features to prepare the data for actual analysis. Extract features from the below fields:
  - **Destination Name**: Split and extract features out of destination. City-place-code (State)
  - **Source Name**: Split and extract features out of source. City-place-code (State)
  - **Trip_creation_time**: Extract features like month, year and day etc
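The date-part extraction described above can be sketched with the pandas `.dt` accessor. This is a minimal illustration on a toy frame, not the notebook's own code; the frame `trips` and the new column names (`trip_year`, `trip_month`, and so on) are hypothetical.

```python
import pandas as pd

# Toy frame; the timestamp format mirrors trip_creation_time in the dataset.
trips = pd.DataFrame({
    "trip_creation_time": pd.to_datetime([
        "2018-09-20 02:35:36.476840",
        "2018-10-03 23:59:42.701692",
    ])
})

# The .dt accessor exposes the calendar parts of a datetime column.
ts = trips["trip_creation_time"].dt
trips["trip_year"] = ts.year
trips["trip_month"] = ts.month
trips["trip_day"] = ts.day
trips["trip_hour"] = ts.hour
trips["trip_dayofweek"] = ts.dayofweek  # Monday = 0, Sunday = 6

print(trips)
```

The same pattern applies to any column already converted with `pd.to_datetime`.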

#### In-depth analysis and feature engineering:

- Calculate the time taken between `od_start_time` and `od_end_time` and keep it as a feature. Drop the original columns, if required
- Compare the time difference calculated in the previous step with `start_scan_to_end_scan`. Do hypothesis testing/ visual analysis to check.
- Do hypothesis testing/ visual analysis between `actual_time` aggregated value and `OSRM time` aggregated value (aggregated values are the values you’ll get after merging the rows on the basis of `trip_uuid`)
- Do hypothesis testing/ visual analysis between `actual_time` aggregated value and `segment actual time` aggregated value (aggregated values are the values you’ll get after merging the rows on the basis of `trip_uuid`)
- Do hypothesis testing/ visual analysis between `osrm distance` aggregated value and `segment osrm distance` aggregated value (aggregated values are the values you’ll get after merging the rows on the basis of `trip_uuid`)
- Do hypothesis testing/ visual analysis between `osrm time` aggregated value and `segment osrm time` aggregated value (aggregated values are the values you’ll get after merging the rows on the basis of `trip_uuid`)
- Find outliers in the numerical variables (you might find outliers in almost all the variables), and check it using visual analysis
- Handle the outliers using the **IQR method**.
- Do one-hot encoding of categorical variables (like `route_type`)
- Normalize/ Standardize the numerical features using `MinMaxScaler` or `StandardScaler`.
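As a rough illustration of the IQR method named in the list above, the sketch below flags and clips values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`. The helper `iqr_bounds`, the multiplier `k = 1.5`, and the toy values are assumptions made for this example; clipping is only one possible treatment (dropping the rows is another).

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences based on the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values with one obvious outlier.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 300.0], name="actual_time")

lower, upper = iqr_bounds(values)
is_outlier = (values < lower) | (values > upper)

# One possible treatment: cap values at the fences instead of dropping rows.
clipped = values.clip(lower=lower, upper=upper)

print(lower, upper)
print(is_outlier.tolist())
print(clipped.tolist())
```

Whether to clip or drop depends on how many rows are affected and whether the extreme values are genuine long-haul trips rather than errors.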

### Evaluation Criteria (100 Points):


- Define Problem Statement and Perform Exploratory Data Analysis (10 points)

  - **Definition of Problem** (as per given problem statement with additional views)
  - **Observations** on:
    - Shape of data
    - Data types of all the attributes
    - Conversion of categorical attributes to 'category' (if required)
    - Missing value detection
    - Statistical summary
  - **Visual Analysis**:
    - Distribution plots of all the continuous variable(s)
    - Boxplots of all the categorical variables
  - **Insights** based on EDA
  - **Comments** on:
    - Range of attributes
    - Outliers of various attributes
    - Distribution of the variables and relationship between them
  - **Comments** for each univariate and bivariate plot

- Feature Creation (10 Points)
- Merging of Rows and Aggregation of Fields (10 Points)
- Comparison & Visualization of Time and Distance Fields (10 Points)
- Missing Values Treatment & Outlier Treatment (10 Points)
- Checking Relationship Between Aggregated Fields (10 Points)
- Handling Categorical Values (10 Points)
- Column Normalization / Column Standardization (10 Points)

- Business Insights (10 Points):
  Should include patterns observed in the data along with what you can infer from it. Examples:
  - Check where most orders are coming from (State, Corridor, etc.)
  - Busiest corridor, average distance between them, average time taken

- Recommendations (10 Points)
  Actionable items for business. No technical jargon. No complications. Simple action items that everyone can understand.

## Solution

### Basic data cleaning and exploration:

In [323]:
#importing the relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [324]:
#importing the csv file from google drive
!gdown 1ZkF2gGCDkjwQgOTGVBpsqhPpSGg1Fybb

Downloading...
From: https://drive.google.com/uc?id=1ZkF2gGCDkjwQgOTGVBpsqhPpSGg1Fybb
To: /content/07_delhivery_data.csv
100% 55.6M/55.6M [00:00<00:00, 89.8MB/s]


In [325]:
#load the csv into dataframe
df = pd.read_csv('07_delhivery_data.csv')

In [326]:
df.head(5)

Unnamed: 0,data,trip_creation_time,route_schedule_uuid,route_type,trip_uuid,source_center,source_name,destination_center,destination_name,od_start_time,...,cutoff_timestamp,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,factor,segment_actual_time,segment_osrm_time,segment_osrm_distance,segment_factor
0,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,...,2018-09-20 04:27:55,10.43566,14.0,11.0,11.9653,1.272727,14.0,11.0,11.9653,1.272727
1,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,...,2018-09-20 04:17:55,18.936842,24.0,20.0,21.7243,1.2,10.0,9.0,9.759,1.111111
2,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,...,2018-09-20 04:01:19.505586,27.637279,40.0,28.0,32.5395,1.428571,16.0,7.0,10.8152,2.285714
3,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,...,2018-09-20 03:39:57,36.118028,62.0,40.0,45.562,1.55,21.0,12.0,13.0224,1.75
4,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,...,2018-09-20 03:33:55,39.38604,68.0,44.0,54.2181,1.545455,6.0,5.0,3.9153,1.2


#### Dropping the unknown columns

In [327]:
# the columns marked as "Unknown field" in the column profiling can be removed

unknown_fields = ['is_cutoff', 'cutoff_factor', 'cutoff_timestamp', 'factor', 'segment_factor']
df.drop(unknown_fields, axis = 1, inplace = True)

The timestamp columns are stored as object dtype, and the time and distance measurements as float.

In [328]:
df.head(5)

Unnamed: 0,data,trip_creation_time,route_schedule_uuid,route_type,trip_uuid,source_center,source_name,destination_center,destination_name,od_start_time,od_end_time,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance
0,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,10.43566,14.0,11.0,11.9653,14.0,11.0,11.9653
1,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,18.936842,24.0,20.0,21.7243,10.0,9.0,9.759
2,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,27.637279,40.0,28.0,32.5395,16.0,7.0,10.8152
3,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,36.118028,62.0,40.0,45.562,21.0,12.0,13.0224
4,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,39.38604,68.0,44.0,54.2181,6.0,5.0,3.9153


#### Analyzing the structure of data

In [329]:
#checking Null values in the columns
df.isna().sum()

Unnamed: 0,0
data,0
trip_creation_time,0
route_schedule_uuid,0
route_type,0
trip_uuid,0
source_center,0
source_name,293
destination_center,0
destination_name,261
od_start_time,0


'Source name' and 'Destination name' have some missing values.

In [330]:
#unique values in the columns
df.nunique()

Unnamed: 0,0
data,2
trip_creation_time,14817
route_schedule_uuid,1504
route_type,2
trip_uuid,14817
source_center,1508
source_name,1498
destination_center,1481
destination_name,1468
od_start_time,26369


In [331]:
# statistical summary of data - numerical columns
df.describe()

Unnamed: 0,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance
count,144867.0,144867.0,144867.0,144867.0,144867.0,144867.0,144867.0,144867.0
mean,961.262986,234.073372,416.927527,213.868272,284.771297,36.196111,18.507548,22.82902
std,1037.012769,344.990009,598.103621,308.011085,421.119294,53.571158,14.77596,17.86066
min,20.0,9.000045,9.0,6.0,9.0082,-244.0,0.0,0.0
25%,161.0,23.355874,51.0,27.0,29.9147,20.0,11.0,12.0701
50%,449.0,66.126571,132.0,64.0,78.5258,29.0,17.0,23.513
75%,1634.0,286.708875,513.0,257.0,343.19325,40.0,22.0,27.81325
max,7898.0,1927.447705,4532.0,1686.0,2326.1991,3051.0,1611.0,2191.4037


In [332]:
#statistical summary of data - categorical columns
df.describe(include = object)

Unnamed: 0,data,trip_creation_time,route_schedule_uuid,route_type,trip_uuid,source_center,source_name,destination_center,destination_name,od_start_time,od_end_time
count,144867,144867,144867,144867,144867,144867,144574,144867,144606,144867,144867
unique,2,14817,1504,2,14817,1508,1498,1481,1468,26369,26369
top,training,2018-09-28 05:23:15.359220,thanos::sroute:4029a8a2-6c74-4b7e-a6d8-f9e069f...,FTL,trip-153811219535896559,IND000000ACB,Gurgaon_Bilaspur_HB (Haryana),IND000000ACB,Gurgaon_Bilaspur_HB (Haryana),2018-09-21 18:37:09.322207,2018-09-24 09:59:15.691618
freq,104858,101,1812,99660,101,23347,23347,15192,15192,81,81


In [333]:
#shape of data
df.shape

(144867, 19)

There are 19 columns and 144867 rows in the dataset.

In [334]:
#checking the datatype of columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144867 entries, 0 to 144866
Data columns (total 19 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   data                            144867 non-null  object 
 1   trip_creation_time              144867 non-null  object 
 2   route_schedule_uuid             144867 non-null  object 
 3   route_type                      144867 non-null  object 
 4   trip_uuid                       144867 non-null  object 
 5   source_center                   144867 non-null  object 
 6   source_name                     144574 non-null  object 
 7   destination_center              144867 non-null  object 
 8   destination_name                144606 non-null  object 
 9   od_start_time                   144867 non-null  object 
 10  od_end_time                     144867 non-null  object 
 11  start_scan_to_end_scan          144867 non-null  float64
 12  actual_distance_

#### Changing the datatype of columns

In [335]:
#converting the data and route_type as categorical columns

df["data"] = df["data"].astype("category")
df["route_type"] = df["route_type"].astype("category")

In [336]:
# A few columns represent time but are stored as object dtype; these can be converted to datetime

datetype_column = ['trip_creation_time', 'od_start_time', 'od_end_time']

for i in datetype_column:
  df[i] = pd.to_datetime(df[i]) #passing the column one by one in the for loop

In [337]:
#checking the column datatypes again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144867 entries, 0 to 144866
Data columns (total 19 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   data                            144867 non-null  category      
 1   trip_creation_time              144867 non-null  datetime64[ns]
 2   route_schedule_uuid             144867 non-null  object        
 3   route_type                      144867 non-null  category      
 4   trip_uuid                       144867 non-null  object        
 5   source_center                   144867 non-null  object        
 6   source_name                     144574 non-null  object        
 7   destination_center              144867 non-null  object        
 8   destination_name                144606 non-null  object        
 9   od_start_time                   144867 non-null  datetime64[ns]
 10  od_end_time                     144867 non-null  datetim

#### Handling missing values

In [338]:
# checking the source_center values for which source_name is null

center_for_missing_source_name = df[df['source_name'].isna()]['source_center'].unique()
center_for_missing_source_name

array(['IND342902A1B', 'IND577116AAA', 'IND282002AAD', 'IND465333A1B',
       'IND841301AAC', 'IND509103AAC', 'IND126116AAA', 'IND331022A1B',
       'IND505326AAB', 'IND852118A1B'], dtype=object)

This gives all the source_center IDs for which source_name is missing.
Let's check whether any of these centers has a source_name in other rows.

In [339]:
# Check whether the source_name for the above source_centers can be found in other rows.
# This selects rows where source_name has a value and source_center is one of the centers with a missing name.
# isin() needs the list of center IDs; passing a DataFrame would compare against its column names instead.

df[(df['source_name'].notnull()) & (df['source_center'].isin(center_for_missing_source_name))]

Unnamed: 0,data,trip_creation_time,route_schedule_uuid,route_type,trip_uuid,source_center,source_name,destination_center,destination_name,od_start_time,od_end_time,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance


This means the missing source_name values cannot be recovered from other rows.
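For reference, here is a small sketch of what such a recovery would look like if some center did appear both with and without a name. The toy frame, its center IDs (`C1` to `C3`), and the column `source_name_filled` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: center C1 has a name in one row and a missing name in another.
toy = pd.DataFrame({
    "source_center": ["C1", "C1", "C2", "C3"],
    "source_name": ["Alpha_Hub (StateA)", None, None, "Gamma_DC (StateB)"],
})

# Centers that have at least one row with a missing name.
centers_missing = toy.loc[toy["source_name"].isna(), "source_center"].unique()

# Rows where one of those centers does carry a name; these could be used for filling.
recoverable = toy[toy["source_name"].notna() & toy["source_center"].isin(centers_missing)]

# Build a center -> name lookup from the named rows and fill the gaps where possible.
lookup = (
    toy.dropna(subset=["source_name"])
       .drop_duplicates("source_center")
       .set_index("source_center")["source_name"]
)
toy["source_name_filled"] = toy["source_name"].fillna(toy["source_center"].map(lookup))

print(recoverable)
print(toy)
```

In the toy data only `C1` can be filled; `C2` stays missing because no row carries its name, which is the situation the check above reports for the real data.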

In [340]:
#checking for destination_center for which destination_name is null
center_for_missing_destination_name = df[df['destination_name'].isna()]['destination_center'].unique()
center_for_missing_destination_name

array(['IND342902A1B', 'IND577116AAA', 'IND282002AAD', 'IND465333A1B',
       'IND841301AAC', 'IND505326AAB', 'IND852118A1B', 'IND126116AAA',
       'IND509103AAC', 'IND221005A1A', 'IND250002AAC', 'IND331001A1C',
       'IND122015AAC'], dtype=object)

In [341]:
# Check whether the destination_name for the above destination_centers can be found in other rows.
# This selects rows where destination_name has a value and destination_center is one of the centers with a missing name.
# isin() needs the list of center IDs; passing a DataFrame would compare against its column names instead.

df[(df['destination_name'].notnull()) & (df['destination_center'].isin(center_for_missing_destination_name))]

Unnamed: 0,data,trip_creation_time,route_schedule_uuid,route_type,trip_uuid,source_center,source_name,destination_center,destination_name,od_start_time,od_end_time,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance


This means the missing destination_name values cannot be recovered from other rows either.

#### Removing incorrect data

In [342]:
# describe() shows negative values in segment_actual_time, which cannot be valid durations, so drop those rows
df.drop(df[df['segment_actual_time']<0].index, inplace = True)

In [343]:
#now checking the describe data
df.describe()

Unnamed: 0,trip_creation_time,od_start_time,od_end_time,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance
count,144846,144846,144846,144846.0,144846.0,144846.0,144846.0,144846.0,144846.0,144846.0,144846.0
mean,2018-09-22 13:34:27.259366400,2018-09-22 18:02:50.434589952,2018-09-23 10:04:33.787580160,961.226537,234.057171,416.908724,213.853002,284.750969,36.207427,18.507304,22.828528
min,2018-09-12 00:00:16.535741,2018-09-12 00:00:16.535741,2018-09-12 00:50:10.814399,20.0,9.000045,9.0,6.0,9.0082,0.0,0.0,0.0
25%,2018-09-17 03:20:51.775845888,2018-09-17 08:05:40.886155008,2018-09-18 01:48:06.410121984,161.0,23.354927,51.0,27.0,29.909925,20.0,11.0,12.0701
50%,2018-09-22 04:24:27.932764928,2018-09-22 08:52:50.639791104,2018-09-23 03:13:03.520212992,449.0,66.126234,132.0,64.0,78.5246,29.0,17.0,23.513
75%,2018-09-27 17:57:56.350054912,2018-09-27 22:41:50.285857024,2018-09-28 12:49:06.054018048,1634.0,286.706673,513.0,257.0,343.062075,40.0,22.0,27.812975
max,2018-10-03 23:59:42.701692,2018-10-06 04:27:23.392375,2018-10-08 03:00:24.353479,7898.0,1927.447705,4532.0,1686.0,2326.1991,3051.0,1611.0,2191.4037
std,,,,1036.993595,344.974984,598.085058,307.997702,421.101831,53.561259,14.77587,17.860268


In [344]:
df.isna().sum()

Unnamed: 0,0
data,0
trip_creation_time,0
route_schedule_uuid,0
route_type,0
trip_uuid,0
source_center,0
source_name,293
destination_center,0
destination_name,261
od_start_time,0


The rows with negative segment_actual_time values are now removed; the missing source_name and destination_name values remain.

#### Merging rows and aggregation of data

In [345]:
df.head(5)

Unnamed: 0,data,trip_creation_time,route_schedule_uuid,route_type,trip_uuid,source_center,source_name,destination_center,destination_name,od_start_time,od_end_time,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance
0,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,10.43566,14.0,11.0,11.9653,14.0,11.0,11.9653
1,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,18.936842,24.0,20.0,21.7243,10.0,9.0,9.759
2,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,27.637279,40.0,28.0,32.5395,16.0,7.0,10.8152
3,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,36.118028,62.0,40.0,45.562,21.0,12.0,13.0224
4,training,2018-09-20 02:35:36.476840,thanos::sroute:eb7bfc78-b351-4c0e-a951-fa3d5c3...,Carting,trip-153741093647649320,IND388121AAA,Anand_VUNagar_DC (Gujarat),IND388620AAB,Khambhat_MotvdDPP_D (Gujarat),2018-09-20 03:21:32.418600,2018-09-20 04:47:45.236797,86.0,39.38604,68.0,44.0,54.2181,6.0,5.0,3.9153


##### Merging rows based on trip_uuid, source center and destination center as df1

In [346]:
# name of columns for aggregation
group_by_columns = ['trip_uuid', 'source_center', 'destination_center']

df1 = df.groupby(by = group_by_columns, as_index = False).agg(
    {
        'data' : 'first',
        'trip_creation_time' : 'first',
        'route_type' : 'first',
        'source_name' : 'first',
        'destination_name' : 'last',
        'od_start_time' : 'first',
        'od_end_time' : 'first',
        'start_scan_to_end_scan' : 'first',
        'actual_distance_to_destination' : 'last', # cumulative distance, so take the last (end) value
        'actual_time' : 'last', # cumulative time up to that point, so take the last (end) value
        'osrm_time' : 'last', # cumulative time, so take the last value
        'osrm_distance' : 'last', # cumulative distance, so take the last value
        'segment_actual_time' : 'sum', # per-segment time, so sum over the leg
        'segment_osrm_time' : 'sum', # per-segment time, so sum over the leg
        'segment_osrm_distance' : 'sum' # per-segment distance, so sum over the leg
    }
)
df1.head()

Unnamed: 0,trip_uuid,source_center,destination_center,data,trip_creation_time,route_type,source_name,destination_name,od_start_time,od_end_time,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance
0,trip-153671041653548748,IND209304AAA,IND000000ACB,training,2018-09-12 00:00:16.535741,FTL,Kanpur_Central_H_6 (Uttar Pradesh),Gurgaon_Bilaspur_HB (Haryana),2018-09-12 16:39:46.858469,2018-09-13 13:40:23.123744,1260.0,383.759164,732.0,329.0,446.5496,728.0,534.0,670.6205
1,trip-153671041653548748,IND462022AAA,IND209304AAA,training,2018-09-12 00:00:16.535741,FTL,Bhopal_Trnsport_H (Madhya Pradesh),Kanpur_Central_H_6 (Uttar Pradesh),2018-09-12 00:00:16.535741,2018-09-12 16:39:46.858469,999.0,440.973689,830.0,388.0,544.8027,820.0,474.0,649.8528
2,trip-153671042288605164,IND561203AAB,IND562101AAA,training,2018-09-12 00:00:22.886430,Carting,Doddablpur_ChikaDPP_D (Karnataka),Chikblapur_ShntiSgr_D (Karnataka),2018-09-12 02:03:09.655591,2018-09-12 03:01:59.598855,58.0,24.644021,47.0,26.0,28.1994,46.0,26.0,28.1995
3,trip-153671042288605164,IND572101AAA,IND561203AAB,training,2018-09-12 00:00:22.886430,Carting,Tumkur_Veersagr_I (Karnataka),Doddablpur_ChikaDPP_D (Karnataka),2018-09-12 00:00:22.886430,2018-09-12 02:03:09.655591,122.0,48.54289,96.0,42.0,56.9116,95.0,39.0,55.9899
4,trip-153671043369099517,IND000000ACB,IND160002AAC,training,2018-09-12 00:00:33.691250,FTL,Gurgaon_Bilaspur_HB (Haryana),Chandigarh_Mehmdpur_H (Punjab),2018-09-14 03:40:17.106733,2018-09-14 17:34:55.442454,834.0,237.43961,611.0,212.0,281.2109,608.0,231.0,317.7408


#### Adding the od_total_time

Calculate the time taken between od_start_time and od_end_time and keep it as a feature. Drop the original columns, if required

In [347]:
#calculating the total trip time
df1['od_total_time'] = df1['od_end_time'] - df1['od_start_time']

#dropping the original columns from the dataframe
df1.drop(['od_start_time', 'od_end_time'], inplace = True, axis = 1)

#convert the total time into minutes (seconds / 60), the same unit as start_scan_to_end_scan
df1['od_total_time'] = ((df1['od_total_time'].dt.total_seconds())/ 60).round(2)

df1.head(2) #check the value in new column for total trip time

Unnamed: 0,trip_uuid,source_center,destination_center,data,trip_creation_time,route_type,source_name,destination_name,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance,od_total_time
0,trip-153671041653548748,IND209304AAA,IND000000ACB,training,2018-09-12 00:00:16.535741,FTL,Kanpur_Central_H_6 (Uttar Pradesh),Gurgaon_Bilaspur_HB (Haryana),1260.0,383.759164,732.0,329.0,446.5496,728.0,534.0,670.6205,1260.6
1,trip-153671041653548748,IND462022AAA,IND209304AAA,training,2018-09-12 00:00:16.535741,FTL,Bhopal_Trnsport_H (Madhya Pradesh),Kanpur_Central_H_6 (Uttar Pradesh),999.0,440.973689,830.0,388.0,544.8027,820.0,474.0,649.8528,999.51
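The conversion can be checked by hand on the first leg shown above. The sketch below reuses that row's timestamps and its `start_scan_to_end_scan` value; the frame `legs` and the column `gap_minutes` are names introduced only for this example.

```python
import pandas as pd

# Timestamps and scan time copied from the first df1 row shown above.
legs = pd.DataFrame({
    "od_start_time": pd.to_datetime(["2018-09-12 16:39:46.858469"]),
    "od_end_time": pd.to_datetime(["2018-09-13 13:40:23.123744"]),
    "start_scan_to_end_scan": [1260.0],
})

# Timedelta -> seconds -> minutes, rounded to two decimals.
elapsed = legs["od_end_time"] - legs["od_start_time"]
legs["od_total_time"] = (elapsed.dt.total_seconds() / 60).round(2)

# Difference between the computed duration and the recorded scan-to-scan time.
legs["gap_minutes"] = legs["od_total_time"] - legs["start_scan_to_end_scan"]

print(legs[["od_total_time", "start_scan_to_end_scan", "gap_minutes"]])
```

For this row the two values differ by less than a minute, which is consistent with `start_scan_to_end_scan` being the same duration truncated to whole minutes; that interpretation is an assumption and should be confirmed on the full data.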


##### Merging rows based on trip_uuid as df2

In [348]:
df2 = df1.groupby(by = 'trip_uuid', as_index = False).agg({'source_center' : 'first',
                                                           'destination_center' : 'last',
                                                           'data' : 'first',
                                                           'route_type' : 'first',
                                                           'trip_creation_time' : 'first',
                                                           'source_name' : 'first',
                                                           'destination_name' : 'last',
                                                           'od_total_time' : 'sum',
                                                           'start_scan_to_end_scan' : 'sum',
                                                           'actual_distance_to_destination' : 'sum',
                                                           'actual_time' : 'sum',
                                                           'osrm_time' : 'sum',
                                                           'osrm_distance' : 'sum',
                                                           'segment_actual_time' : 'sum',
                                                           'segment_osrm_time' : 'sum',
                                                           'segment_osrm_distance' : 'sum'})
df2.head()

Unnamed: 0,trip_uuid,source_center,destination_center,data,route_type,trip_creation_time,source_name,destination_name,od_total_time,start_scan_to_end_scan,actual_distance_to_destination,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance
0,trip-153671041653548748,IND209304AAA,IND209304AAA,training,FTL,2018-09-12 00:00:16.535741,Kanpur_Central_H_6 (Uttar Pradesh),Kanpur_Central_H_6 (Uttar Pradesh),2260.11,2259.0,824.732854,1562.0,717.0,991.3523,1548.0,1008.0,1320.4733
1,trip-153671042288605164,IND561203AAB,IND561203AAB,training,Carting,2018-09-12 00:00:22.886430,Doddablpur_ChikaDPP_D (Karnataka),Doddablpur_ChikaDPP_D (Karnataka),181.61,180.0,73.186911,143.0,68.0,85.111,141.0,65.0,84.1894
2,trip-153671043369099517,IND000000ACB,IND000000ACB,training,FTL,2018-09-12 00:00:33.691250,Gurgaon_Bilaspur_HB (Haryana),Gurgaon_Bilaspur_HB (Haryana),3934.36,3933.0,1927.404273,3347.0,1740.0,2354.0665,3308.0,1941.0,2545.2678
3,trip-153671046011330457,IND400072AAB,IND401104AAA,training,Carting,2018-09-12 00:01:00.113710,Mumbai Hub (Maharashtra),Mumbai_MiraRd_IP (Maharashtra),100.49,100.0,17.175274,59.0,15.0,19.68,59.0,16.0,19.8766
4,trip-153671052974046625,IND583101AAA,IND583119AAA,training,FTL,2018-09-12 00:02:09.740725,Bellary_Dc (Karnataka),Sandur_WrdN1DPP_D (Karnataka),718.34,717.0,127.4485,341.0,117.0,146.7918,340.0,115.0,146.7919
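Note that several trips in the output above end up with the same center as both source and destination (for example `Kanpur_Central_H_6` for the first trip). A likely cause is that the legs in df1 are ordered by the group keys rather than by time, so `first` and `last` do not follow the route. The sketch below reproduces this with the two legs of the first trip and shows one possible remedy, sorting by `od_start_time` before aggregating. It assumes `od_start_time` is still available at that point, which would mean keeping it in df1 instead of dropping it; the names `legs`, `agg_spec`, `unordered` and `ordered` are hypothetical.

```python
import pandas as pd

# The two legs of trip-153671041653548748, in the order they appear in df1.
legs = pd.DataFrame({
    "trip_uuid": ["trip-153671041653548748"] * 2,
    "source_name": [
        "Kanpur_Central_H_6 (Uttar Pradesh)",
        "Bhopal_Trnsport_H (Madhya Pradesh)",
    ],
    "destination_name": [
        "Gurgaon_Bilaspur_HB (Haryana)",
        "Kanpur_Central_H_6 (Uttar Pradesh)",
    ],
    "od_start_time": pd.to_datetime([
        "2018-09-12 16:39:46.858469",
        "2018-09-12 00:00:16.535741",
    ]),
})

agg_spec = {"source_name": "first", "destination_name": "last"}

# As-is: rows are in key-sorted order, so first/last do not follow the route.
unordered = legs.groupby("trip_uuid", as_index=False).agg(agg_spec)

# Sorting the legs chronologically first lets first/last pick the true endpoints.
ordered = (
    legs.sort_values(["trip_uuid", "od_start_time"])
        .groupby("trip_uuid", as_index=False)
        .agg(agg_spec)
)

print(unordered)
print(ordered)
```

With the chronological sort, the trip reads Bhopal to Gurgaon, which matches the order of the `od_start_time` values of its two legs.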


### Feature Generation

#### Extract state, City and Place from Source & Destination

In [349]:
("Kanpur_Central_H_6 (Uttar Pradesh)").split('_')

['Kanpur', 'Central', 'H', '6 (Uttar Pradesh)']

In [350]:
("Kanpur_Central_H_6 (Uttar Pradesh)").split('_')[0] #city

'Kanpur'

In [351]:
("Kanpur_Central_H_6 (Uttar Pradesh)").split('_')[1] #place

'Central'

In [352]:
("Kanpur_Central_H_6 (Uttar Pradesh)").split('(')[1]

'Uttar Pradesh)'

In [353]:
("Kanpur_Central_H_6 (Uttar Pradesh)").split('(')[1][:-1]  #state

'Uttar Pradesh'

In [354]:
# We use the df2 dataframe, which has been grouped at the trip_uuid level.

# Extract state, city and place name from the source_name column value, e.g. Kanpur_Central_H_6 (Uttar Pradesh)

# Source Name: split and extract features out of the source. Format: City_Place_Code (State)
df2['source_state'] = df2['source_name'].apply(lambda x : str(x).split('(')[1][:-1] if '(' in str(x) else None) #state
df2['source_city'] = df2['source_name'].apply(lambda x : str(x).split('_')[0]) #city
#df2['source_place'] = df2['source_name'].apply(lambda x : str(x).split('_')[1]) #place

# Destination Name: split and extract features out of the destination. Format: City_Place_Code (State)
df2['destination_state'] = df2['destination_name'].apply(lambda x : str(x).split('(')[1][:-1] if '(' in str(x) else None) #state
df2['destination_city'] = df2['destination_name'].apply(lambda x : str(x).split('_')[0]) #city
#df2['destination_place'] = df2['destination_name'].apply(lambda x : str(x).split('_')[1]) #place

df2.head()

Unnamed: 0,trip_uuid,source_center,destination_center,data,route_type,trip_creation_time,source_name,destination_name,od_total_time,start_scan_to_end_scan,...,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance,source_state,source_city,destination_state,destination_city
0,trip-153671041653548748,IND209304AAA,IND209304AAA,training,FTL,2018-09-12 00:00:16.535741,Kanpur_Central_H_6 (Uttar Pradesh),Kanpur_Central_H_6 (Uttar Pradesh),2260.11,2259.0,...,1562.0,717.0,991.3523,1548.0,1008.0,1320.4733,Uttar Pradesh,Kanpur,Uttar Pradesh,Kanpur
1,trip-153671042288605164,IND561203AAB,IND561203AAB,training,Carting,2018-09-12 00:00:22.886430,Doddablpur_ChikaDPP_D (Karnataka),Doddablpur_ChikaDPP_D (Karnataka),181.61,180.0,...,143.0,68.0,85.111,141.0,65.0,84.1894,Karnataka,Doddablpur,Karnataka,Doddablpur
2,trip-153671043369099517,IND000000ACB,IND000000ACB,training,FTL,2018-09-12 00:00:33.691250,Gurgaon_Bilaspur_HB (Haryana),Gurgaon_Bilaspur_HB (Haryana),3934.36,3933.0,...,3347.0,1740.0,2354.0665,3308.0,1941.0,2545.2678,Haryana,Gurgaon,Haryana,Gurgaon
3,trip-153671046011330457,IND400072AAB,IND401104AAA,training,Carting,2018-09-12 00:01:00.113710,Mumbai Hub (Maharashtra),Mumbai_MiraRd_IP (Maharashtra),100.49,100.0,...,59.0,15.0,19.68,59.0,16.0,19.8766,Maharashtra,Mumbai Hub (Maharashtra),Maharashtra,Mumbai
4,trip-153671052974046625,IND583101AAA,IND583119AAA,training,FTL,2018-09-12 00:02:09.740725,Bellary_Dc (Karnataka),Sandur_WrdN1DPP_D (Karnataka),718.34,717.0,...,341.0,117.0,146.7918,340.0,115.0,146.7919,Karnataka,Bellary,Karnataka,Sandur


In [355]:
df2['source_state'].unique()

array(['Uttar Pradesh', 'Karnataka', 'Haryana', 'Maharashtra',
       'Tamil Nadu', 'Gujarat', 'Delhi', 'Telangana', 'Rajasthan',
       'Assam', 'Madhya Pradesh', 'West Bengal', 'Andhra Pradesh',
       'Punjab', 'Chandigarh', 'Goa', 'Jharkhand', 'Pondicherry',
       'Orissa', 'Uttarakhand', 'Himachal Pradesh', 'Kerala',
       'Arunachal Pradesh', 'Bihar', 'Chhattisgarh',
       'Dadra and Nagar Haveli', 'Jammu & Kashmir', 'Mizoram', 'Nagaland',
       None], dtype=object)

In [356]:
df2['destination_state'].unique()

array(['Uttar Pradesh', 'Karnataka', 'Haryana', 'Maharashtra',
       'Tamil Nadu', 'Gujarat', 'Delhi', 'Telangana', 'Rajasthan',
       'Madhya Pradesh', 'Assam', 'West Bengal', 'Andhra Pradesh',
       'Punjab', 'Chandigarh', 'Dadra and Nagar Haveli', 'Orissa',
       'Bihar', 'Jharkhand', 'Goa', 'Uttarakhand', 'Himachal Pradesh',
       'Kerala', 'Arunachal Pradesh', 'Mizoram', 'Chhattisgarh',
       'Jammu & Kashmir', 'Nagaland', 'Meghalaya', 'Tripura', None,
       'Daman & Diu'], dtype=object)

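State names for source and destination look good. The only oddity is `None`, which the extraction returns when a name has no parenthesised state (for example, when the name itself is missing).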

In [357]:
df2['source_city'].unique()[:50]

array(['Kanpur', 'Doddablpur', 'Gurgaon', 'Mumbai Hub (Maharashtra)',
       'Bellary', 'Chennai', 'HBR Layout PC (Karnataka)', 'Surat',
       'Delhi', 'Pune', 'FBD', 'Shirala', 'Hyderabad', 'Thirumalagiri',
       'Gulbarga', 'Jaipur', 'Allahabad', 'Guwahati', 'Narsinghpur',
       'Shrirampur', 'Hoogly', 'Madakasira', 'Sonari', 'Bengaluru',
       'Dindigul', 'Jalandhar', 'Faridabad', 'Chandigarh', 'Deoli',
       'Pandharpur', 'CCU', 'Bhandara', 'Kurnool', 'Bhiwandi', 'Bhatinda',
       'RoopNagar', 'Bantwal', 'Lalru', 'Kadi', 'Shahdol', 'Gangakher',
       'Durgapur', 'Vapi', 'Jamjodhpur', 'Jetpur', 'Mehsana', 'Jabalpur',
       'Junagadh', 'Gundlupet', 'Mysore'], dtype=object)

In [358]:
df2['destination_city'].unique()[:50]

array(['Kanpur', 'Doddablpur', 'Gurgaon', 'Mumbai', 'Sandur', 'Chennai',
       'HBR Layout PC (Karnataka)', 'Surat', 'Delhi',
       'PNQ Rahatani DPC (Maharashtra)', 'Faridabad (Haryana)',
       'Ratnagiri', 'Bangalore', 'Hyderabad', 'Aland', 'Jaipur', 'Satna',
       'Janakpuri (Delhi)', 'Guwahati', 'Bareli', 'Nashik', 'Hooghly',
       'Puttaprthi', 'Sivasagar', 'Bengaluru', 'Palani', 'Jalandhar',
       'Chandigarh', 'Yavatmal', 'Sangola', 'Kolkata', 'Savner',
       'Kurnool', 'FBD', 'Bhatinda', 'Bhiwandi', 'Barnala', 'Murbad',
       'Kadaba', 'Gulbarga', 'Naraingarh', 'Ludhiana', 'Kadi', 'Jabalpur',
       'MAA', 'Gangakher', 'Bankura', 'Silvassa', 'Porbandar', 'Jetpur'],
      dtype=object)

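The city names have a problem: when a name contains no underscore (e.g. `Mumbai Hub (Maharashtra)`), splitting on `_` returns the whole string, so the state name stays attached to the city.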
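
A small sketch of why this happens, using two names taken from the output above (the `names` Series and the variable names are illustrative only). A name with no underscore survives `split('_')[0]` untouched, state suffix included, whereas taking the first whitespace-separated token afterwards leaves only the leading word.

```python
import pandas as pd

# Two names copied from the outputs above; the Series itself is illustrative.
names = pd.Series([
    "Kanpur_Central_H_6 (Uttar Pradesh)",  # has underscores
    "Mumbai Hub (Maharashtra)",            # no underscore at all
])

# Step 1: same logic as the extraction above. With no underscore to split on,
# the whole string comes back as the "city".
first_pass = names.apply(lambda x: str(x).split('_')[0])
print(first_pass.tolist())   # ['Kanpur', 'Mumbai Hub (Maharashtra)']

# Step 2: keep only the first whitespace-separated token.
second_pass = first_pass.apply(lambda x: str(x).split()[0])
print(second_pass.tolist())  # ['Kanpur', 'Mumbai']
```

One caveat of the first-token approach: a name such as `HBR Layout PC (Karnataka)` is reduced to `HBR`, which may or may not be the intended city label.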

In [359]:
df2['source_city'] = df2['source_city'].apply(lambda x : str(x).split()[0])
df2['destination_city'] = df2['destination_city'].apply(lambda x : str(x).split()[0])
df2.sample(10)

Unnamed: 0,trip_uuid,source_center,destination_center,data,route_type,trip_creation_time,source_name,destination_name,od_total_time,start_scan_to_end_scan,...,actual_time,osrm_time,osrm_distance,segment_actual_time,segment_osrm_time,segment_osrm_distance,source_state,source_city,destination_state,destination_city
587,trip-153678476032423204,IND562132AAA,IND560300AAA,training,Carting,2018-09-12 20:39:20.324489,Bangalore_Nelmngla_H (Karnataka),Bengaluru_KGAirprt_HB (Karnataka),314.65,314.0,...,55.0,46.0,36.8545,54.0,65.0,65.8588,Karnataka,Bangalore,Karnataka,Bengaluru
1747,trip-153690681610502349,IND110064AAA,IND000000ACB,training,Carting,2018-09-14 06:33:36.105259,Delhi_Mayapuri_PC (Delhi),Gurgaon_Bilaspur_HB (Haryana),720.29,720.0,...,693.0,53.0,57.2795,691.0,50.0,57.2796,Delhi,Delhi,Haryana,Gurgaon
7710,trip-153763761783703072,IND382430AAB,IND421302AAG,training,FTL,2018-09-22 17:33:37.837275,Ahmedabad_East_H_1 (Gujarat),Bhiwandi_Mankoli_HB (Maharashtra),1537.75,1536.0,...,960.0,405.0,541.7943,955.0,412.0,556.2722,Gujarat,Ahmedabad,Maharashtra,Bhiwandi
2330,trip-153697823038467653,IND387001AAA,IND387001AAA,training,Carting,2018-09-15 02:23:50.385028,Nadiad_DC (Gujarat),Nadiad_DC (Gujarat),201.43,200.0,...,128.0,64.0,74.5832,125.0,64.0,74.5831,Gujarat,Nadiad,Gujarat,Nadiad
5066,trip-153731180533627844,IND000000ACB,IND131028AAB,training,Carting,2018-09-18 23:03:25.336546,Gurgaon_Bilaspur_HB (Haryana),Sonipat_Kundli_H (Haryana),317.44,317.0,...,158.0,98.0,122.9206,155.0,191.0,129.1677,Haryana,Gurgaon,Haryana,Sonipat
8548,trip-153774263614858836,IND462022AAA,IND462022AAA,training,FTL,2018-09-23 22:43:56.148841,Bhopal_Trnsport_H (Madhya Pradesh),Bhopal_Trnsport_H (Madhya Pradesh),2872.67,2872.0,...,2510.0,1198.0,1680.0168,2490.0,1534.0,1837.2194,Madhya Pradesh,Bhopal,Madhya Pradesh,Bhopal
3412,trip-153712554195755989,IND421302AAG,IND562132AAA,training,FTL,2018-09-16 19:19:01.957793,Bhiwandi_Mankoli_HB (Maharashtra),Bangalore_Nelmngla_H (Karnataka),1427.13,1427.0,...,1192.0,712.0,970.9006,1174.0,762.0,982.8149,Maharashtra,Bhiwandi,Karnataka,Bangalore
3817,trip-153716795842522836,IND110014AAA,IND110024AAA,training,Carting,2018-09-17 07:05:58.425509,Delhi_Bhogal (Delhi),Delhi_Lajpat_IP (Delhi),57.72,57.0,...,24.0,7.0,9.0729,24.0,7.0,9.0729,Delhi,Delhi,Delhi,Delhi
11908,trip-153817885327966424,IND121004AAB,IND121004AAA,test,Carting,2018-09-28 23:54:13.280015,FBD_Balabhgarh_DPC (Haryana),Faridabad_Blbgarh_DC (Haryana),76.67,76.0,...,20.0,12.0,16.1132,20.0,12.0,16.1132,Haryana,FBD,Haryana,Faridabad
4997,trip-153730763549066095,IND273014AAB,IND277201AAA,training,FTL,2018-09-18 21:53:55.490956,Gorakhpur_Matriprm_IP (Uttar Pradesh),Bariya_BgnprDPP_D (Uttar Pradesh),311.55,311.0,...,245.0,122.0,153.4156,244.0,120.0,153.4155,Uttar Pradesh,Gorakhpur,Uttar Pradesh,Bariya


The state suffix has now been removed from the city names, so the data is ready for EDA.
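
The sample above still shows what look like aliases for the same city (`FBD` next to `Faridabad`, `Bangalore` next to `Bengaluru`). If these should be merged before EDA, one option is an explicit mapping. The sketch below is only an illustration: the `city_aliases` dictionary is an assumption inferred from the rows above, not a verified list, and the small `cities` frame stands in for `df2`.

```python
import pandas as pd

# Stand-in for df2 with a few city values seen in the outputs above.
cities = pd.DataFrame({
    "source_city": ["FBD", "Bangalore", "Kanpur"],
    "destination_city": ["Faridabad", "Bengaluru", "Kanpur"],
})

# Assumed alias -> canonical name mapping (hypothetical; verify before use).
city_aliases = {"FBD": "Faridabad", "Bangalore": "Bengaluru"}

for col in ["source_city", "destination_city"]:
    # replace() leaves values that are not in the mapping untouched
    cities[col] = cities[col].replace(city_aliases)

print(cities)
```

Keeping the mapping in one dictionary makes it easy to review and extend as more aliases turn up in the data.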

#### Extract Date, year, month, day, hour and week from trip creation time

In [360]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14817 entries, 0 to 14816
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   trip_uuid                       14817 non-null  object        
 1   source_center                   14817 non-null  object        
 2   destination_center              14817 non-null  object        
 3   data                            14817 non-null  category      
 4   route_type                      14817 non-null  category      
 5   trip_creation_time              14817 non-null  datetime64[ns]
 6   source_name                     14807 non-null  object        
 7   destination_name                14809 non-null  object        
 8   od_total_time                   14817 non-null  float64       
 9   start_scan_to_end_scan          14817 non-null  float64       
 10  actual_distance_to_destination  14817 non-null  float64       
 11  ac

In [361]:
# 'trip_creation_time' is already a datetime column, so there is no need to call pd.to_datetime again.

df2['trip_creation_date'] = df2['trip_creation_time'].dt.date
df2['trip_creation_year'] = df2['trip_creation_time'].dt.year.astype('int16')
df2['trip_creation_month'] = df2['trip_creation_time'].dt.month.astype('int8')
df2['trip_creation_day'] = df2['trip_creation_time'].dt.day.astype('int8')
df2['trip_creation_week'] = df2['trip_creation_time'].dt.isocalendar().week.astype('int8')
df2['trip_creation_hour'] = df2['trip_creation_time'].dt.hour.astype('int8')

df2.sample(5)

Unnamed: 0,trip_uuid,source_center,destination_center,data,route_type,trip_creation_time,source_name,destination_name,od_total_time,start_scan_to_end_scan,...,source_state,source_city,destination_state,destination_city,trip_creation_date,trip_creation_year,trip_creation_month,trip_creation_day,trip_creation_week,trip_creation_hour
5585,trip-153738482530900806,IND562132AAA,IND560099AAB,training,Carting,2018-09-19 19:20:25.309225,Bangalore_Nelmngla_H (Karnataka),Bengaluru_Bomsndra_HB (Karnataka),434.39,434.0,...,Karnataka,Bangalore,Karnataka,Bengaluru,2018-09-19,2018,9,19,38,19
8632,trip-153774865685588881,IND614620AAA,IND623407AAA,training,Carting,2018-09-24 00:24:16.856131,Manamelkudi_TmpleSrt_D (Tamil Nadu),Thiruvadanai_RamnadRD_D (Tamil Nadu),353.96,353.0,...,Tamil Nadu,Manamelkudi,Tamil Nadu,Thiruvadanai,2018-09-24,2018,9,24,39,0
8139,trip-153767748006206022,IND712311AAA,IND700065AAA,training,Carting,2018-09-23 04:38:00.062339,Kolkata_Dankuni_HB (West Bengal),CCU_Beliaghata_DPC (West Bengal),110.6,110.0,...,West Bengal,Kolkata,West Bengal,CCU,2018-09-23,2018,9,23,38,4
11937,trip-153818067015264507,IND282001AAF,IND282002AAD,test,FTL,2018-09-29 00:24:30.152895,Agra_Central_D_3 (Uttar Pradesh),Dholpur_GtRoad_D (Rajasthan),422.66,420.0,...,Uttar Pradesh,Agra,Rajasthan,Dholpur,2018-09-29,2018,9,29,39,0
4816,trip-153729413047777264,IND834002AAB,IND832109AAB,training,FTL,2018-09-18 18:08:50.478028,Ranchi_Hub (Jharkhand),Jamshedpur_Central_I_3 (Jharkhand),328.04,328.0,...,Jharkhand,Ranchi,Jharkhand,Jamshedpur,2018-09-18,2018,9,18,38,18
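
Note that `dt.isocalendar().week` follows ISO 8601, where weeks start on Monday. That is why, in the sample above, 2018-09-23 (a Sunday) falls in week 38 while 2018-09-24 falls in week 39. A minimal sketch with those two timestamps (the `ts` Series is illustrative):

```python
import pandas as pd

# Two trip creation times taken from the sample rows above.
ts = pd.Series(pd.to_datetime([
    "2018-09-23 04:38:00.062339",
    "2018-09-24 00:24:16.856131",
]))

# isocalendar() returns a DataFrame with year, week and day columns.
week = ts.dt.isocalendar().week.astype("int8")

print(ts.dt.day_name().tolist())  # ['Sunday', 'Monday']
print(week.tolist())              # [38, 39]
```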


In [362]:
df2.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
trip_creation_time,14817.0,2018-09-22 12:44:19.555167744,2018-09-12 00:00:16.535741,2018-09-17 02:51:25.129125888,2018-09-22 04:02:35.066945024,2018-09-27 19:37:41.898427904,2018-10-03 23:59:42.701692,
od_total_time,14817.0,531.69763,23.46,149.93,280.77,638.2,7898.55,658.868223
start_scan_to_end_scan,14817.0,530.810016,23.0,149.0,280.0,637.0,7898.0,658.705957
actual_distance_to_destination,14817.0,164.477838,9.002461,22.837239,48.474072,164.583208,2186.531787,305.388147
actual_time,14817.0,357.143754,9.0,67.0,149.0,370.0,6265.0,561.396157
osrm_time,14817.0,161.384018,6.0,29.0,60.0,168.0,2032.0,271.360995
osrm_distance,14817.0,204.344689,9.0729,30.8192,65.6188,208.475,2840.081,370.395573
segment_actual_time,14817.0,353.95161,9.0,66.0,147.0,367.0,6230.0,556.320988
segment_osrm_time,14817.0,180.921172,6.0,31.0,65.0,185.0,2564.0,314.485624
segment_osrm_distance,14817.0,223.164,9.0729,32.6545,70.1544,218.7102,3523.6324,416.547252


In [363]:
df2.describe(include = 'object').T

Unnamed: 0,count,unique,top,freq
trip_uuid,14817,14817,trip-153671041653548748,1
source_center,14817,938,IND000000ACB,1063
destination_center,14817,1042,IND000000ACB,821
source_name,14807,933,Gurgaon_Bilaspur_HB (Haryana),1063
destination_name,14809,1034,Gurgaon_Bilaspur_HB (Haryana),821
source_state,14807,29,Maharashtra,2714
source_city,14817,717,Gurgaon,1139
destination_state,14809,31,Maharashtra,2561
destination_city,14817,841,Mumbai,1202
trip_creation_date,14817,22,2018-09-18,791
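
In the summary above, `source_city` has 14817 non-null values although `source_name` has only 14807. A likely explanation is that `str(x)` turns a missing name into the literal string `'nan'`, which then passes through as a city. The sketch below reproduces this on a two-row illustrative Series and shows that the vectorised `.str` accessor keeps the value missing instead.

```python
import numpy as np
import pandas as pd

# Illustrative data: one real name from the outputs above and one missing value.
names = pd.Series(["Nadiad_DC (Gujarat)", np.nan])

# apply + str(): the missing value becomes the string 'nan'
via_apply = names.apply(lambda x: str(x).split('_')[0])
print(via_apply.tolist())       # ['Nadiad', 'nan']

# .str accessor: the missing value stays missing
via_str = names.str.split('_').str[0]
print(via_str.isna().tolist())  # [False, True]
```

If the `.str` variant were used on `df2`, the city counts would be expected to match the name counts, and the placeholder `'nan'` would not show up as a city in later group-bys.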


### Univariate & Bi-variate Analysis

### Hypothesis testing

### Outlier Detection

### Outlier Treatment

### Encoding

### Normalisation/ Standardization

## Business Insights

### Recommendations