# Getaround Analysis project - Rental Delay Analysis



Contents
--------
1. [Data loading](#loading)
2. [Exploratory data analysis](#eda)
3. [Conclusion and perspectives](#conclusion)



In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

## <a name="loading"></a> Data loading

In [4]:
df = pd.read_excel('./data/get_around_delay_analysis.xlsx')
df

index,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
0,505000,363965,mobile,canceled,,,
1,507750,269550,mobile,ended,-81.0,,
2,508131,359049,connect,ended,70.0,,
3,508865,299063,connect,canceled,,,
4,511440,313932,mobile,ended,,,
...,...,...,...,...,...,...,...
21305,573446,380069,mobile,ended,,573429.0,300.0
21306,573790,341965,mobile,ended,-337.0,,
21307,573791,364890,mobile,ended,144.0,,
21308,574852,362531,connect,ended,-76.0,,


In [5]:
df.describe(include='all')

statistic,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
count,21310.0,21310.0,21310,21310,16346.0,1841.0,1841.0
unique,,,2,2,,,
top,,,mobile,ended,,,
freq,,,17003,18045,,,
mean,549712.880338,350030.603426,,,59.701517,550127.411733,279.28843
std,13863.446964,58206.249765,,,1002.561635,13184.023111,254.594486
min,504806.0,159250.0,,,-22433.0,505628.0,0.0
25%,540613.25,317639.0,,,-36.0,540896.0,60.0
50%,550350.0,368717.0,,,9.0,550567.0,180.0
75%,560468.5,394928.0,,,67.0,560823.0,540.0


The dataset contains 21310 observations, each consisting of data pertaining to a car rental event. The dataset has 7 columns:
- The column `car_id` refers to the car that was rented. In the absence of further information, it is of no use to us.
- The columns `rental_id` and `previous_ended_rental_id` are identifiers of the current and previous rentals of a given car. We will use them to follow car rental sequences.
- The column `checkin_type` indicates whether the rental was made using Getaround connect functionality or by mobile.
- The column `state` indicates whether the rental was canceled or not.
- The column `delay_at_checkout_in_minutes` gives the time difference between the actual and expected checkout times. A negative value indicates that the checkout occurred earlier than expected, and a positive value indicates a late checkout. A late checkout that keeps the next customer waiting is problematic, and this is what we aim to mitigate by introducing a delay before the car becomes available again.
- The column `time_delta_with_previous_rental_in_minutes` represents the expected amount of time between two consecutive rentals. This value is based on the *expected* checkout and checkin times, and does not include the checkout delay. A `NULL` value corresponds to a time delta larger than 12h (720 min), in which case the rental is assumed to be non-consecutive (`previous_ended_rental_id` is also `NULL`).
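
The relation between the last two columns can be checked directly. The sketch below uses a small toy frame with hypothetical values (not the real dataset) to show one way of verifying that both columns are `NULL` on the same rows and that the known time deltas stay within 0 to 720 min. The names `toy`, `same_null_pattern` and `within_12h` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking the two columns of interest
toy = pd.DataFrame({
    'previous_ended_rental_id': [np.nan, 101.0, np.nan, 103.0],
    'time_delta_with_previous_rental_in_minutes': [np.nan, 30.0, np.nan, 720.0],
})

prev_missing = toy['previous_ended_rental_id'].isna()
delta_missing = toy['time_delta_with_previous_rental_in_minutes'].isna()

# Both columns should be NULL on exactly the same rows
same_null_pattern = bool((prev_missing == delta_missing).all())

# Known time deltas should lie between 0 and 720 min (bounds included)
delta = toy['time_delta_with_previous_rental_in_minutes'].dropna()
within_12h = bool(delta.between(0, 720).all())

print(same_null_pattern, within_12h)
```

Applied to `df`, the same two checks would be consistent with the `describe` output above, where both columns have the same count (1841) and the minimum time delta is 0.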

## <a id="eda"></a> Exploratory data analysis

Before determining the impact of the introduction of a rental delay, we first gather some necessary insights about user behavior.


### General user behavior

In [6]:
## Number of rentals using each method
df['checkin_type'].value_counts()

checkin_type
mobile     17003
connect     4307
Name: count, dtype: int64

In [7]:
## Counts of rental states for each checkin type
df_ = df.groupby(['checkin_type', 'state']).count()['rental_id']
df_

checkin_type  state   
connect       canceled      798
              ended        3509
mobile        canceled     2467
              ended       14536
Name: rental_id, dtype: int64

In [8]:
## Probability of rental states for each checkin type
df_ / df_.T.groupby('checkin_type').sum()

checkin_type  state   
connect       canceled    0.185280
              ended       0.814720
mobile        canceled    0.145092
              ended       0.854908
Name: rental_id, dtype: float64

- Customers favor mobile checkin (80%) over Getaround connect (20%). Part of this difference is due to the fact that not all the cars (actually, only 46%) have the Getaround connect option.
- Rental cancellation rates are higher when customers use Getaround connect functionality (18.5%) than with mobile checkin (14.5%). The cancellation process is possibly made easier with Getaround connect.
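
As a cross-check, the percentages quoted above can be recomputed from the counts printed earlier. This is only a sketch: the counts are copied by hand from the `groupby` output, and the names `counts`, `state_prob` and `checkin_share` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series(
    [798, 3509, 2467, 14536],
    index=pd.MultiIndex.from_tuples(
        [('connect', 'canceled'), ('connect', 'ended'),
         ('mobile', 'canceled'), ('mobile', 'ended')],
        names=['checkin_type', 'state']),
    name='rental_id')

# Share of each state within its checkin type
state_prob = counts / counts.groupby(level='checkin_type').transform('sum')

# Share of each checkin type among all rentals
checkin_share = counts.groupby(level='checkin_type').sum() / counts.sum()

print(state_prob.round(3))
print(checkin_share.round(3))
```

Using `transform('sum')` keeps the result aligned with the original two-level index, so the division needs no transposition.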

In [None]:
## number of NULL values
df.groupby(['checkin_type', 'state']).agg(lambda x: x.isnull().sum())


checkin_type,state,rental_id,car_id,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
connect,canceled,0,0,798,667,667
connect,ended,0,0,107,2827,2827
mobile,canceled,0,0,2466,2369,2369
mobile,ended,0,0,1593,13606,13606


- Almost all `delay_at_checkout_in_minutes` values are `NULL` when the rental was canceled (save for 1 value, which is probably an error). This is expected, since no checkout occurs when the rental is canceled.
- Even when the rental ended, the delay at checkout is sometimes unknown. This happens about 11% of the time with mobile checkin, but only about 3% of the time with Getaround connect.
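
The fraction of ended rentals with an unknown checkout delay follows from the two tables above. A minimal sketch, with the counts copied by hand and the names `n_ended`, `n_missing_delay` and `missing_frac` chosen for illustration:

```python
import pandas as pd

# Counts copied from the outputs above (ended rentals only)
n_ended = pd.Series({'connect': 3509, 'mobile': 14536})
n_missing_delay = pd.Series({'connect': 107, 'mobile': 1593})

# Fraction of ended rentals whose checkout delay is NULL
missing_frac = n_missing_delay / n_ended
print(missing_frac.round(3))
```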

In [None]:
## Fraction of rentals which are consecutive (ie with less than 12h between checkout and next checkin)
df.groupby('checkin_type').agg(lambda x: (~x.isnull()).sum())['previous_ended_rental_id'] / df['checkin_type'].value_counts()

checkin_type
connect    0.188762
mobile     0.060460
dtype: float64

Consecutive rentals are much more frequent when customers use Getaround connect functionality (19%) than with mobile checkin (6%). This suggests that Getaround connect is beneficial to the company, as it reduces the time cars spend in the un-rented state.
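
The same fractions can be obtained more compactly by averaging a boolean mask per group. The sketch below does not load the dataset: it rebuilds only the two relevant columns from the counts reported above (813 = 4307 - 667 - 2827 connect rentals and 1028 = 17003 - 2369 - 13606 mobile rentals have a previous rental), and uses `1.0` as a placeholder rental id.

```python
import numpy as np
import pandas as pd

# Rebuild the two relevant columns from the counts reported above.
# The value 1.0 is a placeholder standing for "some previous rental id".
checkin = np.repeat(['connect', 'mobile'], [4307, 17003])
prev_id = np.concatenate([
    np.full(813, 1.0), np.full(4307 - 813, np.nan),     # connect rentals
    np.full(1028, 1.0), np.full(17003 - 1028, np.nan),  # mobile rentals
])
toy = pd.DataFrame({'checkin_type': checkin,
                    'previous_ended_rental_id': prev_id})

# The mean of a boolean mask is the fraction of True values
consecutive_frac = (toy['previous_ended_rental_id'].notna()
                    .groupby(toy['checkin_type']).mean())
print(consecutive_frac.round(4))
```

On the real data, replacing `toy` with `df` gives the same computation as the cell above without the intermediate `agg` and `value_counts` calls.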

### Distribution of checkout delays

In this section, we study the distribution of checkout delays. The range of checkout delays is extremely broad, ranging from -22433 min (about 16 days!) to 71084 min (more than 49 days!). To visualize clearly the distribution of checkout delays, we compute the complementary cumulative distributions of delays. For a given time delay $\tau$ the positive and negative complementary cumulative distributions are respectively:
$$
    \mathrm{Prob}\left( T \geq \tau \right), \quad \mathrm{Prob}\left( T \leq -\tau \right),
$$
where $T$ is the checkout delay.
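
To make the definition concrete, here is a minimal sketch that evaluates both distributions on a handful of toy delays (hypothetical values) using NumPy broadcasting. The names `delays`, `taus`, `ccdf_pos` and `ccdf_neg` are illustrative.

```python
import numpy as np

# Toy delays in minutes (hypothetical values)
delays = np.array([-120.0, -5.0, 0.0, 3.0, 15.0, 200.0])
taus = np.array([1.0, 10.0, 100.0])

# Compare every threshold (rows) with every delay (columns),
# then average over delays to get one probability per threshold
ccdf_pos = (delays[None, :] >= taus[:, None]).mean(axis=1)   # Prob(T >= tau)
ccdf_neg = (delays[None, :] <= -taus[:, None]).mean(axis=1)  # Prob(T <= -tau)

print(ccdf_pos)  # fractions 3/6, 2/6, 1/6
print(ccdf_neg)  # fractions 2/6, 1/6, 1/6
```

Both curves are non-increasing in $\tau$, and a delay of exactly 0 contributes to neither of them as long as $\tau > 0$.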

In [None]:
## Compute the complementary cumulative distribution of checkout delays
delay_vals = np.logspace(0, 5, 21)  # thresholds tau from 1 to 1e5 min
checkout_distrib = {}
avg_checkout_delay, median_checkout_delay = {}, {}
for (checkin,), df_ in df.groupby(['checkin_type']):
    avg_checkout_delay[checkin] = df_['delay_at_checkout_in_minutes'].mean()
    median_checkout_delay[checkin] = df_['delay_at_checkout_in_minutes'].median()
    data = df_['delay_at_checkout_in_minutes'].dropna().to_numpy()
    ## compare each threshold (rows) with each delay (columns)
    checkout_distrib[checkin] = [
        np.mean(data >= delay_vals[:, None], axis=1),   # Prob(T >= tau)
        np.mean(data <= -delay_vals[:, None], axis=1),  # Prob(T <= -tau)
        ]

In [None]:
## summary text to display on the figure
summary_text = {k: (f'avg delay = {avg_checkout_delay[k]:.0f} min\n'
                    f'median delay = {median_checkout_delay[k]:.0f} min\n'
                    # checkout_distrib[k][0][0] is evaluated at tau = 1 min
                    r'$P\,(\mathrm{delay} \geq 1\,\mathrm{min}) '
                    f'= {checkout_distrib[k][0][0]:.3f}$')
                for k in checkout_distrib}

## Figure 1: one panel per checkin type
fig1, axs1 = plt.subplots(
    nrows=1, ncols=2, figsize=(9, 4), dpi=200,
    gridspec_kw={'left': 0.075, 'right': 0.97, 'top': 0.85, 'bottom': 0.13,
                 'wspace': 0.18})
fig1.suptitle('Figure 1: Complementary cumulative distribution of checkout delays',
              x=0.02, ha='left')

labels = [r'$\mathrm{Prob}\,(T \geq \tau)$', r'$\mathrm{Prob}\, (T \leq - \tau)$']

axs1[0].set_title("Mobile checkin")
line0, = axs1[0].plot(delay_vals, checkout_distrib['mobile'][0],
                      linestyle='', marker='.', markersize=10, color='tab:blue')
line1, = axs1[0].plot(delay_vals, checkout_distrib['mobile'][1],
             linestyle='', marker='.', markersize=10, color='tab:orange')
axs1[0].text(0.035, 0.06, summary_text['mobile'],
             transform=axs1[0].transAxes, fontsize=9,
             bbox={'boxstyle': 'round,pad=0.5', 'facecolor': '0.92'})
axs1[0].grid(visible=True, linewidth=0.3)
axs1[0].set_xscale('log')
axs1[0].set_xlim(1, 1e5)
axs1[0].set_yscale('log')
axs1[0].set_ylim(1e-4, 1)
axs1[0].set_xlabel(r"Checkout delay $\tau$ (min)")
axs1[0].set_ylabel('Complementary cum. prob.')
axs1[0].legend(handles=[line0, line1], labels=labels)


axs1[1].set_title("Getaround connect checkin")
line0, = axs1[1].plot(delay_vals, checkout_distrib['connect'][0],
                      linestyle='', marker='.', markersize=10, color='tab:blue')
line1, = axs1[1].plot(delay_vals, checkout_distrib['connect'][1],
             linestyle='', marker='.', markersize=10, color='tab:orange')
axs1[1].text(0.035, 0.06, summary_text['connect'],
             transform=axs1[1].transAxes, fontsize=9,
             bbox={'boxstyle': 'round,pad=0.5', 'facecolor': '0.92'})
axs1[1].grid(visible=True, linewidth=0.3)
axs1[1].set_xscale('log')
axs1[1].set_xlim(1, 1e5)
axs1[1].set_yscale('log')
axs1[1].set_ylim(1e-4, 1)
axs1[1].set_xlabel(r"Checkout delay $\tau$ (min)")
axs1[1].legend(handles=[line0, line1], labels=labels)


plt.show()

Figure 1 presents the complementary cumulative distributions of checkout delays for mobile checkin (left panel) and Getaround connect checkin (right panel). Most delays are rather short, but delays larger than 12 hours are not infrequent: they occur about 10% of the time. There is, however, a significant difference between the two checkin methods. Delays tend to be shorter and to occur less frequently with Getaround connect, and delays larger than a day never occur with it.

### Delay with previous rental

We now turn to the analysis of the delays between consecutive rentals. Recall that two rentals are considered consecutive when the expected time between them is at most 720 min (12 hours), and that consecutive rentals account for only 19% of the cases with Getaround connect and 6% of the cases with mobile checkin.

In [None]:
## Rental delay values: [ended, canceled] time deltas for each checkin type
rental_delay = {'mobile': [0, 0], 'connect': [0, 0]}
for (checkin, state), df_ in df.groupby(['checkin_type', 'state']):
    if state == 'ended':
        rental_delay[checkin][0] = df_['time_delta_with_previous_rental_in_minutes'].to_numpy()
    if state == 'canceled':
        rental_delay[checkin][1] = df_['time_delta_with_previous_rental_in_minutes'].to_numpy()

In [None]:
## histogram bins and center values
delay_bins = np.linspace(-30, 750, 14)
delay_vals = (delay_bins[1:] + delay_bins[:-1]) / 2

## 'mobile' delay histograms : ended, canceled, cancellation prob, prob std error
delay_hist_me, _ = np.histogram(rental_delay['mobile'][0], bins=delay_bins)
delay_hist_mc, _ = np.histogram(rental_delay['mobile'][1], bins=delay_bins)
delay_mfrac = delay_hist_mc / (delay_hist_mc + delay_hist_me)
## binomial standard error: sqrt(p * (1 - p) / n)
delay_mfrac_std = np.sqrt(delay_mfrac * (1 - delay_mfrac) / (delay_hist_mc + delay_hist_me))

## 'connect' delay histograms : ended, canceled, cancellation prob, prob std error
delay_hist_ce, _ = np.histogram(rental_delay['connect'][0], bins=delay_bins)
delay_hist_cc, _ = np.histogram(rental_delay['connect'][1], bins=delay_bins)
delay_cfrac = delay_hist_cc / (delay_hist_cc + delay_hist_ce)
## binomial standard error: sqrt(p * (1 - p) / n)
delay_cfrac_std = np.sqrt(delay_cfrac * (1 - delay_cfrac) / (delay_hist_cc + delay_hist_ce))

In [None]:
fig2, axs2 = plt.subplots(
    nrows=1, ncols=2, figsize=(9, 4), dpi=200,
    gridspec_kw={'left': 0.07, 'right': 0.93, 'top': 0.85, 'bottom': 0.13,
                 'wspace': 0.24})
axs2_twin = [ax.twinx() for ax in axs2]
fig2.suptitle('Figure 2: Distribution of delays with previous rental', x=0.02, ha='left')


handles = [Patch(facecolor='tab:blue', alpha=1, label='ended'),
           Patch(facecolor='tab:orange', alpha=1, label='canceled')]
labels = ['ended', 'canceled', 'cancellation prob.']

axs2[0].set_title("Mobile checkin")
axs2[0].hist(rental_delay['mobile'], bins=np.linspace(-30, 750, 14),
             stacked=False, density=False)

err = axs2_twin[0].errorbar(delay_vals, delay_mfrac, delay_mfrac_std,
                            color='tab:red', marker='o', markersize=5)
axs2_twin[0].set_ylim(0, 0.425)
axs2_twin[0].set_yticks([0, 0.1, 0.2, 0.3, 0.4])
axs2_twin[0].set_yticks([0.05, 0.15, 0.25, 0.35], minor=True)
# axs2_twin[0].set_ylabel('Cancellation prob.', rotation=270, labelpad=12)

axs2[0].grid(visible=True, linewidth=0.3)
axs2[0].set_xlim(-30, 750)
axs2[0].set_xticks(np.linspace(0, 720, 7))
axs2[0].set_xticks(np.linspace(60, 660, 6), minor=True)
axs2[0].set_xlabel("Delay with previous rental (min)")
axs2[0].set_ylim(0, 170)
axs2[0].set_ylabel("Counts")
axs2[0].legend(handles=handles + [err], labels=labels)


axs2[1].set_title("Getaround connect checkin")
axs2[1].hist(rental_delay['connect'], bins=np.linspace(-30, 750, 14),
             stacked=False, density=False)

err = axs2_twin[1].errorbar(delay_vals, delay_cfrac, delay_cfrac_std,
                            color='tab:red', marker='o', markersize=5)
axs2_twin[1].set_ylim(0, 0.425)
axs2_twin[1].set_yticks([0, 0.1, 0.2, 0.3, 0.4])
axs2_twin[1].set_yticks([0.05, 0.15, 0.25, 0.35], minor=True)
axs2_twin[1].set_ylabel('Cancellation prob.', rotation=270, labelpad=14)

axs2[1].grid(visible=True, linewidth=0.3)
axs2[1].set_xlim(-30, 750)
axs2[1].set_xticks(np.linspace(0, 720, 7))
axs2[1].set_xticks(np.linspace(60, 660, 6), minor=True)
axs2[1].set_xlabel("Delay with previous rental (min)")
axs2[1].set_ylim(0, 170)
# axs2[1].set_ylabel("Counts")
axs2[1].legend(handles=handles + [err], labels=labels)


plt.show()

Figure 2 shows histograms of the delay with previous rental for mobile checkin (left panel) and Getaround connect checkin (right panel), distinguishing ended and canceled rentals, together with the associated cancellation probability. The histograms use hourly bins. There is a dip around a rental delay of 6 hours. This is likely a consequence of the users' checkout and checkin schedule: users tend to check out late in the day and do not check in at night.

We note that the counts are similar for both checkin methods even though mobile checkins are 4 times more frequent overall. This confirms that Getaround connect functionality favors consecutive rentals. We also observe the higher cancellation probability with Getaround connect checkin mentioned above. Interestingly, this cancellation probability seems independent of the rental delay.
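
When judging whether the cancellation probability varies from bin to bin, it helps to attach an uncertainty to each estimate. A common choice, assumed here, is the binomial standard error $\sqrt{p(1-p)/n}$, where $p$ is the canceled fraction and $n$ the number of rentals in the bin. The sketch below applies it to hypothetical per-bin counts; the names and values are illustrative and are not taken from the dataset.

```python
import numpy as np

# Hypothetical per-bin counts of ended and canceled rentals
n_ended = np.array([90, 60, 40])
n_canceled = np.array([10, 15, 10])

n_total = n_ended + n_canceled
p_cancel = n_canceled / n_total

# Binomial standard error of each estimated probability
p_cancel_se = np.sqrt(p_cancel * (1 - p_cancel) / n_total)

print(p_cancel)
print(p_cancel_se)
```

Two bins whose probabilities differ by less than a couple of standard errors cannot be told apart with confidence, which is one way to read an apparently flat curve.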

In [15]:
## Pair each consecutive rental with the checkout delay of its previous rental
prev_rental_cols = ['rental_id', 'delay_at_checkout_in_minutes']
curr_rental_cols = ['previous_ended_rental_id', 'checkin_type', 'state',
                    'time_delta_with_previous_rental_in_minutes']
df_prev = df.loc[:, prev_rental_cols]
df_curr = df.loc[:, curr_rental_cols]
df2 = pd.merge(df_prev, df_curr, how='inner', left_on='rental_id',
               right_on='previous_ended_rental_id')
df2 = df2.assign(is_canceled=(df2['state'] == 'canceled'))

df2.describe(include='all')

statistic,rental_id,delay_at_checkout_in_minutes,previous_ended_rental_id,checkin_type,state,time_delta_with_previous_rental_in_minutes,is_canceled
count,1841.0,1729.0,1841.0,1841,1841,1841.0,1841
unique,,,,2,2,,2
top,,,,mobile,ended,,False
freq,,,,1028,1612,,1612
mean,550127.411733,-24.761712,550127.411733,,,279.28843,
std,13184.023111,430.602411,13184.023111,,,254.594486,
min,505628.0,-4624.0,505628.0,,,0.0,
25%,540896.0,-54.0,540896.0,,,60.0,
50%,550567.0,1.0,550567.0,,,180.0,
75%,560823.0,44.0,560823.0,,,540.0,


In [16]:
df2

index,rental_id,delay_at_checkout_in_minutes,previous_ended_rental_id,checkin_type,state,time_delta_with_previous_rental_in_minutes,is_canceled
0,531158,29.0,531158.0,mobile,ended,90.0,False
1,533303,-340.0,533303.0,mobile,ended,600.0,False
2,533380,-167.0,533380.0,connect,ended,690.0,False
3,534820,-576.0,534820.0,connect,ended,150.0,False
4,535313,23.0,535313.0,mobile,ended,720.0,False
...,...,...,...,...,...,...,...
1836,574571,-54.0,574571.0,connect,ended,540.0,False
1837,574596,10.0,574596.0,mobile,ended,30.0,False
1838,567694,-17.0,567694.0,mobile,ended,210.0,False
1839,568465,,568465.0,connect,canceled,60.0,True


In [17]:
df2['delay_at_checkout_in_minutes'].isna().sum()

np.int64(112)

In [20]:
## canceled rentals whose previous rental has no recorded checkout delay
(df2['is_canceled'] & df2['delay_at_checkout_in_minutes'].isna()).sum()


np.int64(23)
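
The self-merge used to build `df2` pairs the checkout delay of a rental with the time delta of the rental that follows it on the same car. The sketch below reproduces the idea on a toy frame with hypothetical values. The column `next_customer_waits` is an assumption introduced here for illustration: it flags pairs where the previous checkout delay exceeds the planned time delta, which is one possible reading of a "problematic" late checkout.

```python
import numpy as np
import pandas as pd

# Toy rentals (hypothetical values): rental 2 follows rental 1, rental 3 follows rental 2
toy = pd.DataFrame({
    'rental_id': [1, 2, 3, 4],
    'delay_at_checkout_in_minutes': [45.0, -10.0, np.nan, 5.0],
    'previous_ended_rental_id': [np.nan, 1.0, 2.0, np.nan],
    'time_delta_with_previous_rental_in_minutes': [np.nan, 30.0, 60.0, np.nan],
})

# Left side: the previous rental and its checkout delay
prev = toy[['rental_id', 'delay_at_checkout_in_minutes']]
# Right side: the current rental and its planned time delta
curr = toy[['previous_ended_rental_id',
            'time_delta_with_previous_rental_in_minutes']]
pairs = pd.merge(prev, curr, how='inner', left_on='rental_id',
                 right_on='previous_ended_rental_id')

# Assumed criterion: the previous checkout ran past the planned next checkin
pairs['next_customer_waits'] = (
    pairs['delay_at_checkout_in_minutes']
    > pairs['time_delta_with_previous_rental_in_minutes'])

print(pairs)
```

Rows with an unknown previous delay compare as `False`, so they would need separate handling before drawing conclusions from such a flag.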