#Predicting Cab Booking Cancellations

###Notebook by [Abhishek Sharma](http://github.com/numb3r33/)
####Data hosted at [Kaggle](https://inclass.kaggle.com/c/predicting-cab-booking-cancellations/data)

##Table of contents

1. [Introduction](#Introduction)

3. [Required libraries](#Required-libraries)

5. [Step 1: Answering the question](#Step-1:-Answering-the-question)

6. [Step 2: Checking the data](#Step-2:-Checking-the-data)

7. [Step 3: Tidying the data](#Step-3:-Tidying-the-data)

    - [Bonus: Testing our data](#Bonus:-Testing-our-data)

8. [Step 4: Exploratory analysis](#Step-4:-Exploratory-analysis)

9. [Step 5: Classification](#Step-5:-Classification)

    - [Cross-validation](#Cross-validation)

    - [Parameter tuning](#Parameter-tuning)

10. [Step 6: Reproducibility](#Step-6:-Reproducibility)

11. [Conclusions](#Conclusions)

12. [Further reading](#Further-reading)

13. [Acknowledgements](#Acknowledgements)

##Introduction

[[ go back to the top ]](#Table-of-contents)

The business problem tackled here is improving customer service for [YourCabs](http://www.yourcabs.com/), a cab company in Bangalore. The problem of interest is booking cancellations by the company due to unavailability of a car. The challenge is that cancellations can occur very close to the trip start time, causing passengers inconvenience.

###Competition Goal

The goal of the competition is to create a predictive model for classifying new bookings as to whether they will eventually get cancelled due to car unavailability. This is a classification task that includes misclassification costs. The winning entry will be the one with the lowest **average-cost-per-booking.**

Participants need to upload their classifications (0=no cancellation or 1=cancellation), and the system will compute the average-cost-per-booking based on these classifications.

##Required libraries

[[ go back to the top ]](#Table-of-contents)

If you don't have Python on your computer, you can use the [Anaconda Python distribution](http://continuum.io/downloads) to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

* **NumPy**: Provides a fast numerical array structure and helper functions.
* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* **scikit-learn**: The essential Machine Learning package in Python.
* **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **Seaborn**: Advanced statistical plotting library.

To make sure you have all of the packages you need, install them with `conda`:

    conda install numpy pandas scikit-learn matplotlib seaborn

`conda` may ask you to update some of them if you don't have the most recent version. Allow it to do so.

##Step 1: Answering the question

[[ go back to the top ]](#Table-of-contents)

The first step to any data analysis project is to define the question or problem we're looking to solve, and to define a measure (or set of measures) for our success at solving that task.

>Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?

The goal of the competition is to create a predictive model for classifying new bookings as to whether they will eventually get cancelled due to car unavailability, so this is a classification task.

>Did you define the metric for success before beginning?

Let's do that now. Since misclassifications carry penalties, we can use [Weighted Mean Absolute Error](https://www.kaggle.com/wiki/WeightedMeanAbsoluteError), which corresponds to the competition's average-cost-per-booking.
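
To make the metric concrete, here is a minimal sketch of how an average cost per booking could be computed. The exact scoring formula used by the competition is not given here, so this assumes that each misclassified booking contributes its `Cost_of_error` and correctly classified bookings contribute nothing. The function name `average_cost_per_booking` and the toy values are ours, for illustration only.

```python
import numpy as np

def average_cost_per_booking(actual, predicted, cost_of_error):
    """Mean cost over all bookings; only misclassified bookings incur their cost.

    Assumption: this mirrors the competition metric, which is not spelled out
    in the notebook.
    """
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    cost = np.asarray(cost_of_error, dtype=float)
    misclassified = actual != predicted
    return float((cost * misclassified).sum() / len(actual))

# Toy example with made-up values (not taken from the competition data)
actual = [0, 0, 1, 1]
predicted = [0, 1, 1, 0]
cost = [1.0, 1.0, 100.0, 50.0]

score = average_cost_per_booking(actual, predicted, cost)
print(score)
```

A missed cancellation close to the trip start dominates the score, which is why a model tuned for plain accuracy may not do well on this metric.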

>Did you understand the context for the question and the scientific or business application?

The business problem tackled here is improving customer service for YourCabs.com, a cab company in Bangalore. The problem of interest is booking cancellations by the company due to unavailability of a car. The challenge is that cancellations can occur very close to the trip start time, causing passengers inconvenience.

>Did you record the experimental design?

The cab bookings data are made available through a collaboration among Prof. Galit Shmueli at the Indian School of Business, YourCabs co-founder Mr. Rajath Kedilaya, and IDRC managing partner Mr. Amit Batra.

>Did you consider whether the question could be answered with the available data?

The training set we currently have contains 43,431 bookings, all from YourCabs. A model built on this data set will only apply to YourCabs bookings of this kind, so we would need more data to create a general cab-cancellation classifier.

##Step 2: Checking the data

[[ go back to the top ]](#Table-of-contents)

The next step is to look at the data we're working with.

Generally, we're looking to answer the following questions:

* Is there anything wrong with the data?
* Are there any quirks with the data?
* Do I need to fix or remove any of the data?

Let's start by reading the data into a pandas DataFrame.

##Data fields

* id - booking *ID*
* user_id - the *ID* of the customer (based on mobile number)
* vehicle_model_id - vehicle model type.
* package_id - type of package (1=4hrs & 40kms, 2=8hrs & 80kms, 3=6hrs & 60kms, 4=10hrs & 100kms, 5=5hrs & 50kms, 6=3hrs & 30kms, 7=12hrs & 120kms)
* travel_type_id - type of travel (1=long distance, 2=point to point, 3=hourly rental).
* from_area_id - unique identifier of area. Applicable only for point-to-point travel and packages
* to_area_id - unique identifier of area. Applicable only for point-to-point travel
* from_city_id - unique identifier of city
* to_city_id - unique identifier of city (only for intercity)
* from_date - time stamp of requested trip start
* to_date - time stamp of trip end
* online_booking - if booking was done on desktop website
* mobile_site_booking - if booking was done on mobile website
* booking_created - time stamp of booking
* from_lat - latitude of from area
* from_long -  longitude of from area
* to_lat - latitude of to area
* to_long - longitude of to area
* Car_Cancellation (available only in training data) - whether the booking was cancelled (1) or not (0) due to unavailability of a car.
* Cost_of_error (available only in training data) - the cost incurred if the booking is misclassified. Misclassifying an uncancelled booking as cancelled costs 1 unit. The cost of misclassifying a cancelled booking as uncancelled is a function of how close the cancellation occurs to the trip start time: the closer to the trip, the higher the cost. Cancellations occurring less than 15 minutes before the trip start incur a fixed penalty of 100 units.

In [1]:
import pandas as pd

In [18]:
cars_cancel_train = pd.read_csv('./data/Kaggle_YourCabs_training.csv')
cars_cancel_train.head()

Unnamed: 0,id,user_id,vehicle_model_id,package_id,travel_type_id,from_area_id,to_area_id,from_city_id,to_city_id,from_date,to_date,online_booking,mobile_site_booking,booking_created,from_lat,from_long,to_lat,to_long,Car_Cancellation,Cost_of_error
0,132512,22177,28,,2,83,448,,,1/1/2013 2:00,,0,0,1/1/2013 1:39,12.92415,77.67229,12.92732,77.63575,0,1
1,132513,21413,12,,2,1010,540,,,1/1/2013 9:00,,0,0,1/1/2013 2:25,12.96691,77.74935,12.92768,77.62664,0,1
2,132514,22178,12,,2,1301,1034,,,1/1/2013 3:30,,0,0,1/1/2013 3:08,12.937222,77.626915,13.047926,77.597766,0,1
3,132515,13034,12,,2,768,398,,,1/1/2013 5:45,,0,0,1/1/2013 4:39,12.98999,77.55332,12.97143,77.63914,0,1
4,132517,22180,12,,2,1365,849,,,1/1/2013 9:00,,0,0,1/1/2013 7:53,12.845653,77.677925,12.95434,77.60072,0,1


**As you can see, several columns have missing values, and some columns need to be converted to date-time.**
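
Rather than eyeballing the first five rows, we can count missing values per column. The sketch below uses a small stand-in frame called `sample` with illustrative values so that it runs on its own; on the real data the same calls would be applied to `cars_cancel_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for cars_cancel_train (values are illustrative only)
sample = pd.DataFrame({
    'package_id': [np.nan, np.nan, 2.0, np.nan],
    'to_area_id': [448.0, np.nan, 1034.0, 398.0],
    'travel_type_id': [2, 2, 3, 2],
})

# Number of missing entries per column, and the share of rows they represent
missing_counts = sample.isnull().sum()
missing_share = sample.isnull().mean()

summary = pd.DataFrame({'missing': missing_counts, 'share': missing_share})
print(summary.sort_values('missing', ascending=False))
```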

In [23]:
# convert from_date column to date-time
# Note: NaN values become NaT (Not a Time) when converted to date-time

cars_cancel_train['from_date'] = pd.to_datetime(cars_cancel_train['from_date'])

In [26]:
# convert to_date column to date-time
cars_cancel_train['to_date'] = pd.to_datetime(cars_cancel_train['to_date'])
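
The `booking_created` column is also a time stamp, and the gap between booking and trip start may be a useful feature. The following is a sketch on a toy frame, not the notebook's own code. It assumes the time stamps are month-first (`%m/%d/%Y %H:%M`), which the sample rows above cannot confirm because `1/1/2013` reads the same either way, so check this against the full data before relying on it.

```python
import pandas as pd

# Toy frame with the same time stamp layout as the training data (illustrative values)
sample = pd.DataFrame({
    'from_date': ['1/1/2013 2:00', '1/1/2013 9:00'],
    'to_date': [None, '1/2/2013 10:30'],
    'booking_created': ['1/1/2013 1:39', '1/1/2013 2:25'],
})

for col in ['from_date', 'to_date', 'booking_created']:
    # Assumed month-first format; errors='coerce' turns missing or
    # unparseable entries into NaT instead of raising
    sample[col] = pd.to_datetime(sample[col], format='%m/%d/%Y %H:%M', errors='coerce')

# Minutes between the booking being created and the requested trip start
lead_time_minutes = (sample['from_date'] - sample['booking_created']).dt.total_seconds() / 60
print(lead_time_minutes)
```

Passing an explicit `format` avoids pandas guessing the layout row by row, which is slower and can silently mix day-first and month-first readings.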

Next, we can print summary statistics for the dataset.

In [27]:
cars_cancel_train.describe()

Unnamed: 0,id,vehicle_model_id,package_id,travel_type_id,from_area_id,to_area_id,from_city_id,to_city_id,online_booking,mobile_site_booking,from_lat,from_long,to_lat,to_long,Car_Cancellation,Cost_of_error
count,43431.0,43431.0,7550.0,43431.0,43343.0,34293.0,16345.0,1588.0,43431.0,43431.0,43338.0,43338.0,34293.0,34293.0,43431.0,43431.0
mean,159206.473556,25.71723,2.030066,2.137252,714.544494,669.490917,14.915081,68.537783,0.351592,0.043241,12.982461,77.636255,13.026648,77.640595,0.072114,8.000509
std,15442.386279,26.79825,1.461756,0.437712,419.883553,400.638225,1.165306,49.880732,0.477473,0.203402,0.085933,0.059391,0.113487,0.064045,0.25868,25.350698
min,132512.0,1.0,1.0,1.0,2.0,2.0,1.0,4.0,0.0,0.0,12.77663,77.38693,12.77663,77.38693,0.0,0.15
25%,145778.0,12.0,1.0,2.0,393.0,393.0,15.0,32.0,0.0,0.0,12.92645,77.593661,12.95185,77.58203,0.0,1.0
50%,159248.0,12.0,2.0,2.0,590.0,541.0,15.0,49.0,0.0,0.0,12.968887,77.63575,12.98275,77.64503,0.0,1.0
75%,172578.5,24.0,2.0,2.0,1089.0,1054.0,15.0,108.0,1.0,0.0,13.00775,77.6889,13.19956,77.70688,0.0,1.0
max,185941.0,91.0,7.0,3.0,1403.0,1403.0,31.0,203.0,1.0,1.0,13.366072,77.78642,13.366072,77.78642,1.0,100.0


**We can see that many columns have missing values: for example, package_id has only 7,550 non-null values and to_area_id has 34,293, out of 43,431 rows.**

Let's also set id as the index of the dataset.

In [28]:
cars_cancel_train.set_index('id', inplace=True)
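
Before using a column as the index, it is worth confirming that its values are unique. A minimal sketch on a toy frame (the `sample` frame and its values are illustrative) could look like this:

```python
import pandas as pd

# Toy frame standing in for cars_cancel_train (illustrative values)
sample = pd.DataFrame({
    'id': [132512, 132513, 132514],
    'user_id': [22177, 21413, 22178],
})

# Check uniqueness explicitly, then let pandas enforce it as well:
# verify_integrity=True raises ValueError if the new index has duplicates
ids_are_unique = sample['id'].is_unique
indexed = sample.set_index('id', verify_integrity=True)

print(ids_are_unique)
print(indexed.loc[132513, 'user_id'])
```

Returning a new frame instead of using `inplace=True` also keeps the original columns available if we want to index by something else later.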

%matplotlib inline

In [20]:
# Note: this replaces whatever index is currently set with user_id
cars_cancel_train.set_index('user_id', inplace=True)

In [22]:
cars_cancel_train.groupby(level=0)['travel_type_id'].size()

user_id
16        3
21        1
24        1
35        4
36        1
37        1
39        1
41        2
42       10
47        3
53        1
55        2
59       10
66        1
70        5
74       15
75        1
80        1
81        3
82        1
88        1
92        7
93        1
95        4
97       23
99       13
104       3
109       1
112       1
113       1
         ..
48690     1
48691     1
48692     1
48693     1
48694     1
48696     1
48697     1
48699     1
48700     1
48701     1
48702     1
48703     1
48704     1
48705     1
48706     1
48707     1
48709     1
48710     1
48711     1
48714     1
48718     1
48719     1
48721     1
48723     1
48724     1
48725     1
48726     1
48727     1
48729     1
48730     1
dtype: int64
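
The same per-user counts can be obtained without changing the index by grouping on the `user_id` column directly. The sketch below uses a toy frame with illustrative values and also computes the share of users with more than one booking; the name `repeat_share` is ours.

```python
import pandas as pd

# Toy frame standing in for cars_cancel_train (illustrative values)
sample = pd.DataFrame({
    'user_id': [16, 16, 16, 21, 35, 35],
    'travel_type_id': [2, 2, 3, 2, 1, 2],
})

# Number of bookings per user, grouping on the column rather than the index
bookings_per_user = sample.groupby('user_id').size()

# Share of users who booked more than once
repeat_share = (bookings_per_user > 1).mean()

print(bookings_per_user)
print(repeat_share)
```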