# Exploratory Data Analysis

In [None]:
%load_ext nb_black

In [None]:
import scipy.integrate

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
pricing_data_df = pd.read_csv("data/test.csv")

During upfront pricing, the following factors are available to us:

1. Type of vehicle - premium, XL, go, etc.
2. Customer profile
    - Fraud score
    - Lifetime value
    - Number of previous cancellations by the driver within the journey
3. Geography
    - Distance
    - Starting location
    - Ending destination
    - Tolls
4. Traffic
    - Wait time due to incoming traffic
5. Surge
    - Time of day, i.e. rush hour
    - High demand/low supply
    - Bad weather

In [None]:
pricing_data_df["calc_created"] = pd.to_datetime(pricing_data_df["calc_created"])

* We remove all UIDs and tokens: they can't be fed into a model, and UUIDs are generated uniquely for each session.
* The ticket ID for resolution isn't useful as we don't have any ticket information.

In [None]:
pricing_data_df.drop(
    ["driver_device_uid_new", "device_token", "ticket_id_new"], axis=1, inplace=True
)

Can we remove all the states if all the rides are `finished`?

In [None]:
pricing_data_df["b_state"].value_counts()

In [None]:
pricing_data_df["order_state"].value_counts()

In [None]:
pricing_data_df["order_try_state"].value_counts()

In [None]:
pricing_data_df.drop(
    ["b_state", "order_state", "order_try_state"], axis=1, inplace=True
)

All the orders are finished, hence this information is redundant.
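
The manual `value_counts()` checks above can be generalised to any column. The following is a minimal sketch on made-up data, not the real orders table; `toy_df` and the helper `constant_columns` are hypothetical names introduced only for illustration. It lists every column that holds a single distinct value and is therefore a candidate for dropping.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts NaN as its own value, so a column mixing
    # "finished" with missing entries is not reported as constant.
    distinct_counts = df.nunique(dropna=False)
    return distinct_counts[distinct_counts <= 1].index.tolist()


# Hypothetical toy data standing in for the real orders table.
toy_df = pd.DataFrame(
    {
        "order_state": ["finished", "finished", "finished"],
        "b_state": ["finished", "finished", "finished"],
        "distance": [1200, 5400, 800],
    }
)

redundant = constant_columns(toy_df)
print(redundant)
```

Counting missing values as a distinct value is a deliberate choice here: a state column with gaps may still carry information and should be inspected before it is removed.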

We can remove `order_try_id_new` since we already have `order_id_new` available. Furthermore, `dest_change_number` lets us know how many times the destination was changed.

In [None]:
pricing_data_df.drop(["order_try_id_new"], axis=1, inplace=True)

In [None]:
pricing_data_df.drop_duplicates(inplace=True)

In [None]:
pricing_data_df.reset_index(inplace=True, drop=True)

In [None]:
pricing_data_df.info()

## Data Catalog

| Variable                             | Description                                                                                                                                                                                                           |
|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `order_id_new`                       | ID of an order                                                                                                                                                                                                        |
| `Metered_price, distance & duration` | Actual price, distance and duration of a ride                                                                                                                                                                         |
| `upfront_price`                      | Price promised to the rider, based on predicted duration (`predicted_duration`) and distance (`predicted_distance`)                                                                                                     |
| `distance`                           | Ride distance                                                                                                                                                                                                         |
| `duration`                           | Ride Duration                                                                                                                                                                                                         |
| `gps_confidence`                     | Indicator for good GPS connection (1 - good one, 0 - bad one)                                                                                                                                                           |
| `entered_by`                         | Who entered the address                                                                                                                                                                                               |
| `dest_change_number`                 | Number of destination changes by a rider and a driver. It includes the original input of the destination by a rider. That is why the minimum value of it is 1                                                         |
| `predicted_distance`                 | Predicted distance of a ride based on the pickup and dropoff points entered by the rider requesting a car                                                                                                                |
| `predicted_duration`                 | Predicted duration of a ride based on the pickup and dropoff points entered by the rider requesting a car                                                                                                             |
| `prediction_price_type`              | Internal variable for the type of prediction: (1) `upfront`, `prediction` - prediction happened before the ride; (2) `upfront_destination_changed` - prediction happened after rider changed destination during the ride |
| `change_reason_pricing`              | Indicates whose action triggered a change in the price prediction. If it is empty, it means that either nobody changed the destination or that the change has not affected the predicted price                        |
| `rider_app_version`                  | App version of rider phone                                                                                                                                                                                            |
| `driver_app_version`                 | App version of driver phone                                                                                                                                                                                           |
| `device_name`                        | The name of the phone                                                                                                                                                                                                 |
| `eu_indicator`                       | Whether a ride happens in EU                                                                                                                                                                                          |
| `overpaid_ride_ticket`               | Indicator for a rider complaining about the overpaid ride                                                                                                                                                             |
| `fraud_score`                        | Fraud score of a rider. The higher it is the more likely the rider will cheat.                                                                                                                                        |

In [None]:
pricing_data_df.sample(5).T

# Upfront Pricing Exploration

In [None]:
upfront_pricing_data_df = pricing_data_df.loc[
    pricing_data_df["upfront_price"].notnull(), :
].copy()  # copy so later drops and new columns don't act on a slice of pricing_data_df

In [None]:
upfront_pricing_data_df["prediction_price_type"].value_counts()

Since all rows with an upfront price have the prediction price type `upfront`, we can drop `prediction_price_type`.

In [None]:
upfront_pricing_data_df.drop(["prediction_price_type"], axis=1, inplace=True)

## Problem Scope

Does a deviation actually exist?
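
For reference, the percentage deviation computed in the next cell can be written as follows. The symbols $p_{\text{upfront}}$ and $p_{\text{metered}}$ are shorthand introduced here for the `upfront_price` and `metered_price` columns:

$$
d = \frac{p_{\text{upfront}} - p_{\text{metered}}}{p_{\text{upfront}}} \times 100
$$

A negative $d$ means the rider was charged more than the upfront price. Assuming a positive upfront price, $d < -20$ is equivalent to $p_{\text{metered}} > 1.2\,p_{\text{upfront}}$.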

In [None]:
upfront_pricing_data_df["upfront_price_deviation_perc"] = (
    (
        upfront_pricing_data_df["upfront_price"]
        - upfront_pricing_data_df["metered_price"]
    )
    / upfront_pricing_data_df["upfront_price"]
    * 100
)
upfront_pricing_data_df["abs_upfront_price_deviation_perc"] = abs(
    upfront_pricing_data_df["upfront_price_deviation_perc"]
)

In [None]:
upfront_pricing_data_df["abs_upfront_price_deviation_perc"].describe(
    percentiles=[0.25, 0.5, 0.75, 0.85, 0.9, 0.95, 0.99]
)

Roughly 50% of the orders deviate by less than 20% from the upfront price.

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
p = sns.kdeplot(data=upfront_pricing_data_df["upfront_price_deviation_perc"], ax=ax)
x, y = p.get_lines()[0].get_data()
cdf = scipy.integrate.cumulative_trapezoid(y, x, initial=0)
nearest_05 = np.abs(cdf - 0.5).argmin()
x_median = x[nearest_05]
y_median = y[nearest_05]
plt.vlines(x_median, 0, y_median, colors="black")
plt.grid()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
p = sns.kdeplot(data=upfront_pricing_data_df["abs_upfront_price_deviation_perc"], ax=ax)
x, y = p.get_lines()[0].get_data()
cdf = scipy.integrate.cumulative_trapezoid(y, x, initial=0)
nearest_05 = np.abs(cdf - 0.5).argmin()
x_median = x[nearest_05]
y_median = y[nearest_05]
plt.vlines(x_median, 0, y_median, colors="black")
plt.grid()
plt.show()

The distribution of the pricing deviation is right-skewed with a long tail. Our focus will be on identifying the source of the error for the roughly 50% of orders that deviate by 20% or more.
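
The median line in the plots is read off a smoothed KDE curve, so it is only an approximation. The following is a minimal sketch, on made-up numbers, of computing the same quantities directly from the sample; `toy_deviation` and `share_within` are hypothetical names and the values are not taken from the dataset.

```python
import pandas as pd


def share_within(abs_deviation: pd.Series, threshold: float) -> float:
    """Fraction of non-missing values that are strictly below `threshold`."""
    valid = abs_deviation.dropna()
    return float((valid < threshold).mean())


# Hypothetical absolute deviations in percent, with a long right tail.
toy_deviation = pd.Series([2.0, 5.0, 12.0, 18.0, 25.0, 40.0, 90.0, 300.0])

# Sample median, no smoothing involved.
print(toy_deviation.median())
# Share of orders deviating by less than 20%.
print(share_within(toy_deviation, 20))
```

Working on the raw sample avoids any dependence on the KDE bandwidth, which matters for a long-tailed distribution like this one.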

## Problem Impact

How many customers does this deviation impact?

In [None]:
pricing_data_df.shape

We assume that our population consists of 4270 customers.

In [None]:
upfront_pricing_data_df.shape[0] / pricing_data_df.shape[0]

Around 70% of the customers were shown an upfront price, so they are the ones exposed to a deviation between upfront and metered pricing on the app.

In [None]:
upfront_pricing_data_df["upfront_price_deviation_perc"].describe(
    percentiles=[0.25, 0.35, 0.5, 0.55, 0.75, 0.85, 0.9, 0.95, 0.99]
)

In [None]:
upfront_pricing_data_df[
    upfront_pricing_data_df["upfront_price_deviation_perc"] < 0
].shape[0] / upfront_pricing_data_df.shape[0]

Around 60% of these customers end up with a metered price higher than the price shown upfront.

In [None]:
upfront_pricing_data_df[
    upfront_pricing_data_df["upfront_price_deviation_perc"] < -20
].shape[0] / upfront_pricing_data_df.shape[0]

Around 35% of these customers are charged more than 20% above the upfront price at the end of the journey.

We're going to assume that anyone who created an `overpaid_ride_ticket` without paying more for the ride did so by accident.

In [None]:
upfront_pricing_data_df[
    (upfront_pricing_data_df["upfront_price_deviation_perc"] < -20)
]["overpaid_ride_ticket"].value_counts(normalize=True)

A staggering 95% of the customers who were charged more than 20% above the upfront price complained about an overpaid ride.
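
To put that complaint rate in context, it helps to compare it with the rate among rides that were not overcharged. The following is a minimal sketch on made-up data; `toy_df`, `charge_group` and the group labels are hypothetical and the resulting rates are illustrative only.

```python
import pandas as pd

# Hypothetical toy data; the deviation is (upfront - metered) / upfront * 100.
toy_df = pd.DataFrame(
    {
        "upfront_price_deviation_perc": [-45.0, -30.0, -25.0, -10.0, 0.0, 5.0, 15.0, 30.0],
        "overpaid_ride_ticket": [1, 1, 0, 0, 0, 0, 0, 0],
    }
)

# Label each ride by whether the metered price ended up more than 20% above upfront.
charge_group = (
    (toy_df["upfront_price_deviation_perc"] < -20)
    .map({True: "over_20pct", False: "within_20pct"})
    .rename("charge_group")
)

# Row-normalised table: each row sums to 1, so the cells are complaint rates per group.
complaint_rates = pd.crosstab(
    charge_group, toy_df["overpaid_ride_ticket"], normalize="index"
)
print(complaint_rates)
```

Reading both rows side by side shows whether complaints are concentrated among overcharged rides or are common across the board.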

## Analyzing the columns

### Date Range

In [None]:
upfront_pricing_data_df["calc_created"].min(), upfront_pricing_data_df[
    "calc_created"
].max()

Roughly a month's worth of data from 2020, right before the first lockdown of COVID-19 in the UK on __23 March 2020__.

Additional columns we can create:
* Day of week
* Day of month
* Month
* Holidays
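
The columns listed above could be derived with the pandas `.dt` accessor. The following is a minimal sketch on two made-up timestamps; `toy_df` and `hypothetical_holidays` are assumptions, and a real holiday list would have to come from an external calendar for the relevant country.

```python
import pandas as pd

# Hypothetical timestamps standing in for the parsed `calc_created` column.
toy_df = pd.DataFrame(
    {"calc_created": pd.to_datetime(["2020-02-14 08:30:00", "2020-03-01 23:15:00"])}
)

toy_df["day_of_week"] = toy_df["calc_created"].dt.dayofweek  # Monday=0 ... Sunday=6
toy_df["day_of_month"] = toy_df["calc_created"].dt.day
toy_df["month"] = toy_df["calc_created"].dt.month

# Placeholder holiday list (an assumption, not a real calendar).
hypothetical_holidays = [pd.Timestamp("2020-03-01")]
# normalize() resets the time to midnight so only the date is compared.
toy_df["is_holiday"] = toy_df["calc_created"].dt.normalize().isin(hypothetical_holidays)

print(toy_df)
```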

### Distribution of Pricing

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.kdeplot(
    data=upfront_pricing_data_df[["upfront_price", "metered_price"]], ax=ax, fill=True
)
plt.grid()
plt.show()

The upfront price typically seems to be lower than the metered price.

### Distribution of Duration

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.kdeplot(
    data=upfront_pricing_data_df[["predicted_duration", "duration"]], ax=ax, fill=True
)
plt.grid()
plt.show()

That's because the predicted duration seems to be shorter than the actual duration.

### Distribution of Distances

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.kdeplot(
    data=upfront_pricing_data_df[["predicted_distance", "distance"]], ax=ax, fill=True
)
plt.grid()
plt.show()

The same goes for distance: the predicted distance typically lies on the lower end.

## How are pricing, distance and duration correlated?

### Correlation between deviations

In [None]:
upfront_pricing_data_df["predicted_distance_deviation_perc"] = (
    (
        upfront_pricing_data_df["predicted_distance"]
        - upfront_pricing_data_df["distance"]
    )
    / upfront_pricing_data_df["predicted_distance"]
    * 100
)
upfront_pricing_data_df["abs_predicted_distance_deviation_perc"] = abs(
    upfront_pricing_data_df["predicted_distance_deviation_perc"]
)

In [None]:
upfront_pricing_data_df["abs_predicted_distance_deviation_perc"].describe(
    percentiles=[0.25, 0.5, 0.75, 0.85, 0.9, 0.95, 0.99]
)

In [None]:
upfront_pricing_data_df["predicted_duration_deviation_perc"] = (
    (
        upfront_pricing_data_df["predicted_duration"]
        - upfront_pricing_data_df["duration"]
    )
    / upfront_pricing_data_df["predicted_duration"]
    * 100
)

upfront_pricing_data_df["abs_predicted_duration_deviation_perc"] = abs(
    upfront_pricing_data_df["predicted_duration_deviation_perc"]
)

In [None]:
upfront_pricing_data_df["abs_predicted_duration_deviation_perc"].describe(
    percentiles=[0.25, 0.5, 0.75, 0.85, 0.9, 0.95, 0.99]
)

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.kdeplot(
    data=upfront_pricing_data_df[
        ["predicted_distance_deviation_perc", "predicted_duration_deviation_perc"]
    ],
    ax=ax,
    fill=True,
)
plt.grid()
plt.show()

In [None]:
upfront_pricing_data_df[
    [
        "predicted_distance_deviation_perc",
        "predicted_duration_deviation_perc",
        "upfront_price_deviation_perc",
    ]
].corr()

In [None]:
upfront_pricing_data_df[
    [
        "abs_predicted_distance_deviation_perc",
        "abs_predicted_duration_deviation_perc",
        "abs_upfront_price_deviation_perc",
    ]
].corr()

The duration, distance, and price deviations aren't correlated.
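
Since the deviations are long-tailed, the default Pearson coefficient can be dominated by a few extreme rides. A rank-based coefficient is a useful cross-check. The following is a minimal sketch on made-up numbers (the frame `toy_df` and its column names are hypothetical), using the `method` parameter of `DataFrame.corr`.

```python
import pandas as pd

# Hypothetical deviations: perfectly monotonic, but with one extreme outlier.
toy_df = pd.DataFrame(
    {
        "distance_deviation": [1, 2, 3, 4, 5, 6],
        "price_deviation": [1, 2, 3, 4, 5, 600],
    }
)

# Pearson measures linear association and is pulled down by the outlier.
pearson = toy_df.corr(method="pearson")
# Spearman works on ranks, so it only cares about the ordering of the values.
spearman = toy_df.corr(method="spearman")

print(pearson.loc["distance_deviation", "price_deviation"])
print(spearman.loc["distance_deviation", "price_deviation"])
```

If both coefficients stay low on the real data, the conclusion that the deviations are unrelated is better supported than by the Pearson table alone.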

### Correlation between absolute values

In [None]:
upfront_pricing_data_df[
    [
        "predicted_distance",
        "distance",
        "predicted_duration",
        "duration",
        "upfront_price",
        "metered_price",
    ]
].corr()

The predicted distances, durations and prices aren't highly correlated with each other. This suggests that another component is affecting the pricing.

## What are the attributes of highly deviated pricing?

For any pricing deviation of 20% or more, what do the attributes of those orders look like?

In [None]:
# Split on the absolute *price* deviation, matching the section title and variable names.
pricing_deviation_breached_df = upfront_pricing_data_df[
    upfront_pricing_data_df["abs_upfront_price_deviation_perc"] >= 20
]
pricing_deviation_non_breached_df = upfront_pricing_data_df[
    upfront_pricing_data_df["abs_upfront_price_deviation_perc"] < 20
]

### Destination Change Number

If customers change destinations more often, does it impact the pricing deviation?

In [None]:
pricing_deviation_breached_df["dest_change_number"].value_counts(normalize=True) * 100

In [None]:
upfront_pricing_data_df["dest_change_number"].value_counts(normalize=True) * 100

In [None]:
pricing_deviation_non_breached_df["dest_change_number"].value_counts(normalize=True) * 100

The distribution for high-deviation orders doesn't differ much from the global one. If destination changes were driving the deviation, they would be more frequent in the high-deviation subset.

### gps_confidence

Does poor GPS confidence result in poor price prediction?

In [None]:
pricing_deviation_breached_df["gps_confidence"].value_counts(normalize=True) * 100

In [None]:
upfront_pricing_data_df["gps_confidence"].value_counts(normalize=True) * 100

In [None]:
pricing_deviation_non_breached_df["gps_confidence"].value_counts(normalize=True) * 100

* The share of good GPS confidence among high-deviation orders is 4 pp lower than in the global population
* It is 8 pp lower than among orders with less than 20% deviation
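
The three `value_counts` cells used for each attribute can be folded into one side-by-side table, which makes percentage-point gaps like the ones above easy to read off. The following is a minimal sketch on made-up data; `compare_shares`, `toy_df` and the `abs_dev` column are hypothetical names, and the numbers are illustrative only.

```python
import pandas as pd


def compare_shares(df: pd.DataFrame, column: str, breached_mask: pd.Series) -> pd.DataFrame:
    """Percentage share of each value of `column` for breached, non-breached and all rows."""
    shares = pd.DataFrame(
        {
            "breached": df.loc[breached_mask, column].value_counts(normalize=True),
            "non_breached": df.loc[~breached_mask, column].value_counts(normalize=True),
            "all": df[column].value_counts(normalize=True),
        }
    )
    # A value absent from one group becomes NaN after alignment; treat it as 0%.
    return shares.fillna(0.0) * 100


# Hypothetical toy data: GPS flag and absolute price deviation in percent.
toy_df = pd.DataFrame(
    {
        "gps_confidence": [1, 1, 0, 0, 1, 1, 1, 1],
        "abs_dev": [30, 25, 40, 22, 5, 10, 3, 8],
    }
)

shares = compare_shares(toy_df, "gps_confidence", toy_df["abs_dev"] >= 20)
print(shares)
# Percentage-point gap between high-deviation orders and the whole population.
print(shares["breached"] - shares["all"])
```

The same helper could be reused for `dest_change_number` and `eu_indicator` by changing the `column` argument.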

### eu_indicator

In [None]:
pricing_deviation_breached_df["eu_indicator"].value_counts(normalize=True) * 100

In [None]:
upfront_pricing_data_df["eu_indicator"].value_counts(normalize=True) * 100

In [None]:
pricing_deviation_non_breached_df["eu_indicator"].value_counts(normalize=True) * 100

Most of the deviations occur outside the EU.

Is GPS confidence lower on particular devices?