# Exploratory Data Analysis

In [None]:
%load_ext nb_black

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
pricing_data_df = pd.read_csv("data/test.csv")

During upfront pricing, the following factors are available to us:

1. Type of vehicle: premium, XL, go, etc.
2. Customer profile
    - Fraud score
    - Lifetime value
    - Number of previous cancellations by the driver within the journey
3. Geography
    - Distance
    - Pickup location
    - Dropoff location
    - Tolls
4. Traffic
    - Wait time due to incoming traffic
5. Surge
    - Time of day, i.e. rush hour
    - High demand/low supply
    - Bad weather

In [None]:
pricing_data_df["calc_created"] = pd.to_datetime(pricing_data_df["calc_created"])

* Remove all UIDs and tokens: we can't feed them into a model, and UUIDs are generated uniquely for each session.
* The ticket ID isn't useful either, as we don't have any ticket information.

In [None]:
pricing_data_df.drop(
    ["driver_device_uid_new", "device_token", "ticket_id_new"], axis=1, inplace=True
)

Can we drop the state columns if all the rides are `finished`?

In [None]:
pricing_data_df["b_state"].value_counts()

In [None]:
pricing_data_df["order_state"].value_counts()

In [None]:
pricing_data_df["order_try_state"].value_counts()

In [None]:
pricing_data_df.drop(
    ["b_state", "order_state", "order_try_state"], axis=1, inplace=True
)
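
The question above can also be answered programmatically instead of by reading each `value_counts()` output. The sketch below is a minimal illustration, not the notebook's own method: `constant_columns` is a hypothetical helper, and `toy_df` is a made-up stand-in for `pricing_data_df`.

```python
from typing import List

import pandas as pd


def constant_columns(df: pd.DataFrame, candidates: List[str]) -> List[str]:
    """Return the candidate columns that hold at most one distinct value.

    NaN is counted as a value, so a column mixing "finished" and NaN
    is NOT reported as constant.
    """
    return [col for col in candidates if df[col].nunique(dropna=False) <= 1]


# Toy frame standing in for pricing_data_df; the values are invented.
toy_df = pd.DataFrame(
    {
        "b_state": ["finished", "finished", "finished"],
        "order_state": ["finished", "finished", "finished"],
        "order_try_state": ["finished", "finished", "finished"],
        "upfront_price": [4.2, 10.0, 7.5],
    }
)

state_cols = ["b_state", "order_state", "order_try_state"]
droppable = constant_columns(toy_df, state_cols)

# Drop only the columns that were confirmed to be constant.
toy_df = toy_df.drop(columns=droppable)
```

Dropping only the columns returned by the check keeps any state column that turns out to contain more than one value.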

In [None]:
pricing_data_df.info()

In [None]:
pricing_data_df.groupby(["prediction_price_type", "change_reason_pricing"])[
    "order_id_new"
].count().reset_index()
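
One caveat with the grouping above: by default `groupby` drops rows whose key is missing, so orders with an empty `change_reason_pricing` do not appear in the result. The sketch below shows the difference on a made-up toy frame; the category value `"client_destination_changed"` is invented for illustration, and `dropna=False` assumes pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for pricing_data_df; all values are invented.
toy_df = pd.DataFrame(
    {
        "order_id_new": [1, 2, 3, 4],
        "prediction_price_type": [
            "upfront",
            "upfront",
            "prediction",
            "upfront_destination_changed",
        ],
        "change_reason_pricing": [
            np.nan,
            np.nan,
            np.nan,
            "client_destination_changed",
        ],
    }
)

keys = ["prediction_price_type", "change_reason_pricing"]

# Default behaviour: groups with a missing key are silently dropped.
counts_default = toy_df.groupby(keys)["order_id_new"].count().reset_index()

# dropna=False keeps the missing-key groups as their own rows.
counts_with_missing = (
    toy_df.groupby(keys, dropna=False)["order_id_new"].count().reset_index()
)
```

With `dropna=False`, the counts add up to the number of rows in the frame, which makes it easier to see how many orders had no price change at all.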

## Data Catalog

| Variable                             | Description                                                                                                                                                                                                           |
|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `order_id_new`                       | ID of an order                                                                                                                                                                                                        |
| `order_try_id_new`                   | ID of an order "attempt" (one order can be attempted on multiple drivers, until one accepts)                                                                                                                          |
| `metered_price, distance & duration` | Actual price, distance and duration of a ride                                                                                                                                                                         |
| `upfront_price`                      | Price promised to the rider, based on predicted duration (predicted_duration) and distance (predicted_distance)                                                                                                       |
| `distance`                           | Ride distance                                                                                                                                                                                                         |
| `duration`                           | Ride Duration                                                                                                                                                                                                         |
| `gps_confidence`                     | Indicator for good GPS connection (1 - good one, 0 - bad one)                                                                                                                                                         |
| `entered_by`                         | Who entered the address                                                                                                                                                                                               |
| `b_state`                            | State of a ride (finished implies that the ride was actually done)                                                                                                                                                    |
| `dest_change_number`                 | Number of destination changes by a rider and a driver. It includes the original input of the destination by a rider. That is why the minimum value of it is 1                                                         |
| `predicted_distance`                 | Predicted distance of a ride based on the pickup and dropoff points entered by the rider requesting a car                                                                                                             |
| `predicted_duration`                 | Predicted duration of a ride based on the pickup and dropoff points entered by the rider requesting a car                                                                                                             |
| `prediction_price_type`              | Internal variable for the type of prediction: (1) `upfront`, `prediction` - prediction happened before the ride; (2) `upfront_destination_changed` - prediction happened after rider changed destination during the ride |
| `change_reason_pricing`              | Indicates whose action triggered a change in the price prediction. If it is empty, it means that either nobody changed the destination or that the change has not affected the predicted price                        |
| `ticket_id_new`                      | ID for customer support ticket                                                                                                                                                                                        |
| `device_token, device_token_new`     | ID for a device_token (empty for all the fields)                                                                                                                                                                      |
| `rider_app_version`                  | App version of rider phone                                                                                                                                                                                            |
| `driver_app_version`                 | App version of driver phone                                                                                                                                                                                           |
| `driver_device_uid_new`              | ID for UID of a phone device                                                                                                                                                                                          |
| `device_name`                        | The name of the phone                                                                                                                                                                                                 |
| `eu_indicator`                       | Whether a ride happens in EU                                                                                                                                                                                          |
| `overpaid_ride_ticket`               | Indicator for a rider complaining about the overpaid ride                                                                                                                                                             |
| `fraud_score`                        | Fraud score of a rider. The higher it is the more likely the rider will cheat.                                                                                                                                        |

In [None]:
pricing_data_df.sample(5).T

## Analyzing the columns

### Date Range

In [None]:
pricing_data_df["calc_created"].min(), pricing_data_df["calc_created"].max()

Roughly a month's worth of data from 2020, right before the first COVID-19 lockdown in the UK began on __23 March 2020__.

Additional columns we can create:
* Day of week
* Day of month
* Month
* Holidays
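
The calendar features listed above can be derived from `calc_created` with the pandas `.dt` accessor. This is a minimal sketch on a made-up toy frame; the new column names and the `holidays` calendar are assumptions, and the single date in `holidays` is a placeholder rather than a real public holiday.

```python
import pandas as pd

# Toy frame standing in for pricing_data_df; timestamps are invented.
toy_df = pd.DataFrame(
    {
        "calc_created": pd.to_datetime(
            ["2020-02-14 08:30:00", "2020-03-01 23:15:00", "2020-03-09 17:45:00"]
        )
    }
)

# Placeholder calendar: replace with the real holiday dates for the market.
holidays = pd.to_datetime(["2020-03-09"])

created = toy_df["calc_created"]
toy_df["day_of_week"] = created.dt.dayofweek  # Monday = 0, Sunday = 6
toy_df["day_of_month"] = created.dt.day
toy_df["month"] = created.dt.month
# normalize() sets the time to midnight so only the date is compared.
toy_df["is_holiday"] = created.dt.normalize().isin(holidays)
```

With only about a month of data, `month` will take very few distinct values, so day of week and day of month are likely to be the more informative features.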

### Distribution of Pricing

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.kdeplot(data=pricing_data_df[["upfront_price", "metered_price"]], ax=ax, fill=True)
plt.grid()
plt.show()

Upfront prices typically seem to be lower than metered prices.
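
The visual impression from the density plot can be backed by numbers. The sketch below is one possible way to do that, run on a made-up toy frame: it computes the share of rides where the upfront price is below the metered price and the median relative gap. The variable names are assumptions.

```python
import pandas as pd

# Toy frame standing in for pricing_data_df; prices are invented.
toy_df = pd.DataFrame(
    {
        "upfront_price": [4.0, 10.0, 7.5, 6.0],
        "metered_price": [5.0, 9.0, 9.0, 6.0],
    }
)

# Compare only rides where both prices are present.
priced = toy_df.dropna(subset=["upfront_price", "metered_price"])

# Share of rides where the rider was quoted less than the metered price.
share_underpriced = (priced["upfront_price"] < priced["metered_price"]).mean()

# Relative gap in percent of the upfront price (positive = metered is higher).
pct_gap = (
    (priced["metered_price"] - priced["upfront_price"])
    / priced["upfront_price"]
    * 100
)
median_pct_gap = pct_gap.median()
```

A median gap is used instead of a mean so that a few extreme rides do not dominate the summary.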

### Distribution of Duration

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.kdeplot(data=pricing_data_df[["predicted_duration", "duration"]], ax=ax, fill=True)
plt.grid()
plt.show()

The lower upfront price is likely explained by the predicted duration, which tends to be shorter than the actual duration.

### Distribution of Distances

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.kdeplot(data=pricing_data_df[["predicted_distance", "distance"]], ax=ax, fill=True)
plt.grid()
plt.show()

The same goes for distance: the predicted distance typically lies on the lower end.
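
To compare the three predicted/actual pairs side by side, they can be summarised in one small table. The sketch below is an illustration under assumptions: `prediction_summary` and `PAIRS` are hypothetical names, and `toy_df` contains made-up, unitless values rather than real rides.

```python
import pandas as pd

# Predicted column and actual column for each quantity.
PAIRS = {
    "price": ("upfront_price", "metered_price"),
    "duration": ("predicted_duration", "duration"),
    "distance": ("predicted_distance", "distance"),
}


def prediction_summary(df: pd.DataFrame, pairs: dict) -> pd.DataFrame:
    """Summarise how the predicted values compare with the actual ones."""
    rows = []
    for name, (pred_col, actual_col) in pairs.items():
        # Use only rides where both the prediction and the actual exist.
        both = df[[pred_col, actual_col]].dropna()
        rows.append(
            {
                "quantity": name,
                "median_predicted": both[pred_col].median(),
                "median_actual": both[actual_col].median(),
                "share_underpredicted": (both[pred_col] < both[actual_col]).mean(),
            }
        )
    return pd.DataFrame(rows).set_index("quantity")


# Toy frame standing in for pricing_data_df; all values are invented.
toy_df = pd.DataFrame(
    {
        "upfront_price": [4.0, 10.0, 7.5],
        "metered_price": [5.0, 9.0, 9.0],
        "predicted_duration": [600, 900, 300],
        "duration": [700, 1000, 280],
        "predicted_distance": [3000, 8000, 1500],
        "distance": [3200, 8100, 1500],
    }
)

summary = prediction_summary(toy_df, PAIRS)
```

If `share_underpredicted` is well above 0.5 for duration and distance as well as for price, that supports the reading that low upfront prices follow from low duration and distance predictions.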