# Uber Rider Data Case Study

BitTiger DS203

December 2016

## Project overview

Uber is interested in predicting rider retention. To help explore this question, they have provided a sample dataset of a cohort of users who signed up for an account in January 2014. The data was pulled several months later. 

## Dataset description

- city: city this user signed up in
- phone: primary device for this user
- signup_date: date of account registration; in the form ‘YYYY-MM-DD’
- last_trip_date: the last time this user completed a trip; in the form ‘YYYY-MM-DD’
- avg_dist: the average distance (in miles) per trip taken in the first 30 days after signup
- avg_rating_by_driver: the rider’s average rating, as given by their drivers, over all of their trips
- avg_rating_of_driver: the rider’s average rating of their drivers over all of their trips
- surge_pct: the percent of trips taken with surge multiplier > 1
- avg_surge: the average surge multiplier over all of this user’s trips
- trips_in_first_30_days: the number of trips this user took in the first 30 days after signing up
- luxury_car_user: True if the user took a luxury car in their first 30 days; False otherwise
- weekday_pct: the percent of the user’s trips occurring during a weekday

## Load data and browse data

In [None]:
# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# Load data from file
df = pd.read_csv('data/churn.csv')

In [None]:
# Inspect dataset
df.info()

In [None]:
# Browse dataset
df.head(10)

In [None]:
# Show summary stats
df.describe()

In [None]:
# Count missing values by column
df.isnull().sum()

## Explore data

### Numeric variables

In [None]:
df['avg_dist'].plot.hist(bins=20)

In [None]:
df['avg_surge'].plot.hist(bins=20)

In [None]:
df['surge_pct'].plot.hist(bins=20)

In [None]:
df['weekday_pct'].plot.hist(bins=20)

In [None]:
df['avg_rating_by_driver'].plot.hist(bins=20)

In [None]:
df['avg_rating_of_driver'].plot.hist(bins=20)

In [None]:
df['trips_in_first_30_days'].plot.hist(bins=20)

In [None]:
# # Use scatter_matrix from Pandas
# from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas
# scatter_matrix(df[[u'avg_dist', u'avg_rating_by_driver', u'avg_rating_of_driver', u'avg_surge', u'surge_pct', u'trips_in_first_30_days', u'weekday_pct']],
#                alpha=0.2, figsize=(16, 16), diagonal='hist')
# plt.show()

In [None]:
# # Use scatter_matrix from Pandas
# from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas
# scatter_matrix(df[[u'avg_dist', u'trips_in_first_30_days', u'weekday_pct']], 
#                alpha=0.2, figsize=(8, 8), diagonal='kde')
# plt.show()

### Categorical variables

In [None]:
df['city'].value_counts()

In [None]:
df['city'].value_counts().plot.bar()

In [None]:
df['phone'].value_counts()

In [None]:
df['phone'].value_counts(dropna=False).plot.bar()

In [None]:
df['luxury_car_user'].value_counts().plot.bar()

## Clean data - dealing with missing values

In [None]:
# Count missing values by column
df.isnull().sum()

#### Option 1: drop all rows that have missing values

In [None]:
df_dropna = df.dropna(axis=0)

In [None]:
df_dropna.info()

In [None]:
df_dropna.describe()

#### Option 2: fill missing values

In [None]:
# Make a copy of df so that experiments do not modify the original df
df_fillna = df.copy()

In [None]:
# Fill missing value for phone
df_fillna['phone'] = df['phone'].fillna('no_phone')

In [None]:
# Fill missing values with median
df_fillna['avg_rating_by_driver'] = df['avg_rating_by_driver'].fillna(df['avg_rating_by_driver'].median())
df_fillna['avg_rating_of_driver'] = df['avg_rating_of_driver'].fillna(df['avg_rating_of_driver'].median())

In [None]:
df_fillna.info()

In [None]:
df_fillna.describe()

#### Decision
We need to decide whether to exclude rows with missing values or to fill them in. Statistical tools, such as comparing rows with and without missing values, can help us decide.
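
One simple check that can inform this decision is whether rows with missing values look systematically different from complete rows. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and compares `trips_in_first_30_days` between rows where `avg_rating_of_driver` is missing and rows where it is present. The `rating_status` column is a helper introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the Uber dataset)
toy = pd.DataFrame({
    'avg_rating_of_driver': [5.0, np.nan, 4.5, np.nan, 4.0, 4.8],
    'trips_in_first_30_days': [4, 0, 6, 1, 3, 5],
})

# Label each row by whether the rating is missing (hypothetical helper column)
toy['rating_status'] = np.where(toy['avg_rating_of_driver'].isnull(),
                                'missing', 'present')

# Compare another feature between the two groups
trips_by_status = toy.groupby('rating_status')['trips_in_first_30_days'].mean()
print(trips_by_status)
```

If the two groups differ sharply, as they do in this toy example, dropping the incomplete rows would likely bias the sample, which argues for filling values instead.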


In [None]:
# For now we will move on (to be revisited)
df = df_fillna

## Transform data

### Time-series variables

In [None]:
# convert time-series information to datetime data type
df['last_trip_date'] = pd.to_datetime(df['last_trip_date'])
df['signup_date'] = pd.to_datetime(df['signup_date'])

In [None]:
# construct a new df to experiment on the time-series 
df_timestamp = df[['last_trip_date', 'signup_date']].copy()

In [None]:
df_timestamp['count'] = 1

In [None]:
df_timestamp = df_timestamp.set_index('signup_date')
df_timestamp['count'].resample("1D").sum().plot()

In [None]:
df_timestamp = df_timestamp.set_index('last_trip_date')
df_timestamp['count'].resample("1D").sum().plot()

In [None]:
# Experiment block
date_in_string = '2014-06-01'
date_in_datetime = pd.to_datetime(date_in_string)
print(date_in_datetime)
print(date_in_datetime.dayofweek)

In [None]:
# The day of week on which a user signed up for Uber might carry some signal, so let's create a column for it
df['signup_dow'] = df['signup_date'].dt.dayofweek

In [None]:
df.head()

### Converting categorical variables

In [None]:
df.info()

Categorical variables:
* city
* phone
* luxury_car_user
* signup_dow

#### Convert bool columns to int

In [None]:
df['luxury_car_user'] = df['luxury_car_user'].astype(int)

In [None]:
df.head()

#### Encode categorical columns to numeric values

In [None]:
df.head()

In [None]:
col_category = ['signup_dow', 'city', 'phone']

In [None]:
df_dummies = pd.get_dummies(df[col_category], columns=col_category)

In [None]:
df_dummies

In [None]:
df = df.join(df_dummies)

In [None]:
df.head()

In [None]:
df.columns

## Define a label/target/outcome

Add a churn indicator. A user is considered to have churned if they have not taken a trip in the last 30 days. In practice, you will often have to figure out how to generate a reasonable label to train your model. Is the 30-day cutoff reasonable? You may want to test this. Sometimes the correct label is even less obvious; your ability to make a sensible (and defensible) decision in these cases is important.

In [None]:
# Define churn: users who did not take a trip during the last 30 days, i.e. whose last trip date is earlier than 2014-06-01
df['churn'] = (df.last_trip_date < pd.to_datetime('2014-06-01')) * 1
df['active'] = (df.last_trip_date >= pd.to_datetime('2014-06-01')) * 1

df.head()

In [None]:
df['churn'].mean()

In [None]:
df['active'].mean()
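
To probe whether the 30-day cutoff is reasonable, one option is to recompute the churn rate for several window lengths and see how sensitive the label is. The sketch below runs on made-up dates, and it assumes the data pull date equals the latest observed trip date; `churn_rate` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy last-trip dates for illustration only (not the Uber dataset)
last_trip = pd.to_datetime(pd.Series([
    '2014-01-20', '2014-03-15', '2014-05-10',
    '2014-06-05', '2014-06-20', '2014-07-01',
]))

# Assumption: the data was pulled on the latest trip date observed
pull_date = last_trip.max()

def churn_rate(last_trip_dates, pull_date, window_days):
    """Share of users with no trip in the last `window_days` days."""
    cutoff = pull_date - pd.Timedelta(days=window_days)
    return (last_trip_dates < cutoff).mean()

rates = {days: churn_rate(last_trip, pull_date, days) for days in (15, 30, 60)}
for days in sorted(rates):
    print("{}-day window: churn rate = {:.2f}".format(days, rates[days]))
```

A label whose rate swings widely between nearby windows deserves more scrutiny than one that stays roughly stable.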

## EDA with label

### colored scatter_matrix

In [None]:
colors = ['red' if ix else 'blue' for ix in df['active']]

In [None]:
# scatter_matrix(df[[u'avg_dist', u'avg_rating_by_driver', u'avg_rating_of_driver', 
#                   u'avg_surge', u'surge_pct', u'trips_in_first_30_days', u'weekday_pct']],
#                alpha=0.2, figsize=(16, 16), diagonal='hist', c=colors)
# plt.show()

### Explore churn rate split by features 

In [None]:
df[['city', 'churn']].groupby(['city']).mean().plot.bar()

In [None]:
df[['phone', 'churn']].groupby(['phone']).mean().plot.bar()

In [None]:
df[['luxury_car_user', 'active']].groupby(['luxury_car_user']).mean().plot.bar()

In [None]:
df[['trips_in_first_30_days', 'active']].groupby(['active']).mean().plot.bar()

In [None]:
is_active = df['active'] == 1

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].hist(df[is_active]['avg_dist'].values)
axes[1].hist(df[~is_active]['avg_dist'].values)
fig.tight_layout()
plt.show()

#### Abstract out the plotting routine

In [None]:
def hist_active_vs_churn(df, col_name):
    is_active = df['active'] == 1
    fig, axes = plt.subplots(nrows=1, ncols=2)
    axes[0].hist(df[is_active][col_name].values)
    axes[0].set_title("active users")
    axes[0].set_xlabel(col_name)
    axes[0].set_ylabel("counts")
    axes[1].hist(df[~is_active][col_name].values)
    axes[1].set_title("churned users")
    axes[1].set_xlabel(col_name)
    axes[1].set_ylabel("counts")
    fig.tight_layout()
    plt.show()

In [None]:
df.columns

In [None]:
hist_active_vs_churn(df, col_name=u'avg_rating_by_driver')

In [None]:
cols = [u'avg_dist', u'avg_rating_by_driver', u'avg_rating_of_driver', u'avg_surge']

In [None]:
for col in cols:
    hist_active_vs_churn(df, col_name=col)

## Save cleaned data to csv file

### Select which columns to be saved

In [None]:
selected_columns = [u'avg_dist', u'avg_rating_by_driver', u'avg_rating_of_driver', u'avg_surge', 
                     u'surge_pct', u'trips_in_first_30_days', u'luxury_car_user', 
                     u'weekday_pct', u'city_Astapor', u'city_King\'s Landing',u'city_Winterfell', 
                     u'phone_Android', u'phone_iPhone', u'phone_no_phone', u'signup_dow_0', 
                     u'signup_dow_1', u'signup_dow_2', u'signup_dow_3', u'signup_dow_4', 
                     u'signup_dow_5', u'signup_dow_6', u'churn']

### Save to csv file


In [None]:
cleaned_data_csv = 'data/cleaned_data.csv'
df[selected_columns].to_csv(cleaned_data_csv, index=False)

## Build Logistic Regression Model

### Reload data from cleaned csv file

In [None]:
import pandas as pd
cleaned_data_csv = 'data/cleaned_data.csv'
df = pd.read_csv(cleaned_data_csv)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### Define Features and Target

In [None]:
selected_features = [u'avg_dist', u'avg_rating_by_driver', u'avg_rating_of_driver', u'avg_surge', 
                     u'surge_pct', u'trips_in_first_30_days', u'luxury_car_user', 
                     u'weekday_pct', u'city_Astapor', u'city_King\'s Landing',u'city_Winterfell', 
                     u'phone_Android', u'phone_iPhone', u'phone_no_phone', u'signup_dow_0', 
                     u'signup_dow_1', u'signup_dow_2', u'signup_dow_3', u'signup_dow_4', 
                     u'signup_dow_5', u'signup_dow_6']
target = u'churn'

In [None]:
X = df[selected_features].values
y = df[target].values

### Use our own implementation of Logistic Regression Model

In [None]:
# from my_LogisticRegression import *

In [None]:
from my_LogisticRegression import log_likelihood, log_likelihood_gradient, predict, predict_proba
from my_LogisticRegression import GradientAscent
from my_LogisticRegression import precision, accuracy, recall

In [None]:
ga = GradientAscent(cost=log_likelihood, 
                    gradient=log_likelihood_gradient, 
                    predict_func=predict,
                    fit_intercept=True)
ga.run(X, y, alpha=0.1/X.shape[0], num_iterations=5000)
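
The `my_LogisticRegression` module is not shown in this notebook. As a rough idea of what such a module might contain, here is a minimal sketch of gradient ascent on the logistic log-likelihood. All names ending in `_sketch` are hypothetical, and this is an assumption about the approach, not the actual implementation behind `GradientAscent`.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_sketch(X, y, coeffs):
    """Bernoulli log-likelihood of labels y under a logistic model."""
    p = sigmoid(X.dot(coeffs))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_likelihood_gradient_sketch(X, y, coeffs):
    """Gradient of the log-likelihood with respect to coeffs."""
    return X.T.dot(y - sigmoid(X.dot(coeffs)))

def gradient_ascent_sketch(X, y, alpha, num_iterations):
    """Step uphill along the gradient, starting from all-zero coefficients."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
    coeffs = np.zeros(X.shape[1])
    for _ in range(num_iterations):
        coeffs = coeffs + alpha * log_likelihood_gradient_sketch(X, y, coeffs)
    return coeffs

# Toy, non-separable data for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])
coeffs_toy = gradient_ascent_sketch(X_toy, y_toy, alpha=0.01, num_iterations=5000)
print(coeffs_toy)  # [intercept, slope]
```

The step size `alpha` has to be small enough for the updates to converge, which is presumably why the call above scales it by the number of rows.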

In [None]:
y_pred = ga.predict(X)

print("The predicted class vector is \n{}".format(str(y_pred)))
print("The actual class vector is \n{}".format(str(y)))

In [None]:
print("Accuracy of the Logistic Regression is: {}".format(accuracy(y, y_pred)))
print("Precision of the Logistic Regression is: {}".format(precision(y, y_pred)))
print("Recall of the Logistic Regression is: {}".format(recall(y, y_pred)))
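
The `accuracy`, `precision`, and `recall` helpers come from the local `my_LogisticRegression` module, whose source is not shown. Assuming they follow the standard definitions and take `(y_true, y_pred)` in that order, they could look roughly like the sketch below. The `_sketch` names are hypothetical, and the functions do not guard against division by zero.

```python
import numpy as np

def accuracy_sketch(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(y_true == y_pred)

def precision_sketch(y_true, y_pred):
    """Of the users predicted to churn, the fraction that actually churned."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / float(tp + fp)

def recall_sketch(y_true, y_pred):
    """Of the users that actually churned, the fraction that was caught."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / float(tp + fn)

# Toy labels for illustration only
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 1, 0, 1, 1, 1, 0])

print(accuracy_sketch(y_true_toy, y_pred_toy))
print(precision_sketch(y_true_toy, y_pred_toy))
print(recall_sketch(y_true_toy, y_pred_toy))
```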

### Understanding the Estimated Coefficients

In [None]:
df_coeffs = pd.DataFrame(list(zip(selected_features, ga.coeffs))).sort_values(by=[1], ascending=False)
df_coeffs.columns = ['feature', 'coeff']
df_coeffs

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
ax = df_coeffs.plot.barh()
t = np.arange(X.shape[1])
ax.set_yticks(t)
ax.set_yticklabels(df_coeffs['feature'])
plt.show()

### Use standardized features

In [None]:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
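
With its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below reproduces that arithmetic by hand with NumPy on a made-up matrix, to show what the transformed features look like.

```python
import numpy as np

# Toy feature matrix for illustration only: two columns on different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Z-score each column: subtract the column mean, divide by the column std (ddof=0)
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

One consequence is that, after standardizing, a one-unit increase in a feature means an increase of one standard deviation, which matters when reading the fitted coefficients.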

### Use Logistic Regression from sklearn

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=100000, fit_intercept=False)
lr.fit(X, y)

In [None]:
y_pred = lr.predict(X)
print("Accuracy of the Logistic Regression is: {}".format(accuracy(y, y_pred)))
print("Precision of the Logistic Regression is: {}".format(precision(y, y_pred)))
print("Recall of the Logistic Regression is: {}".format(recall(y, y_pred)))

In [None]:
df_coeffs = pd.DataFrame(list(zip(selected_features, lr.coef_.flatten()))).sort_values(by=[1], ascending=False)
df_coeffs.columns = ['feature', 'coeff']
df_coeffs

In [None]:
ax = df_coeffs.plot.barh()
t = np.arange(X.shape[1])
ax.set_yticks(t)
ax.set_yticklabels(df_coeffs['feature'])
plt.show()

### How to interpret the coefficients?

***Recall: increasing the value of $x_i$ by 1 multiplies the odds by a factor of $e^{\beta_i}$ (the odds ratio).***

Say a given user has a 50% probability of churning; in other words, the odds of churning are 1:1 = 1.

In [None]:
default_OR = 1 # 50% chance to churn

If a coefficient is 0.2, then increasing the corresponding variable by 1 unit gives new odds of:

In [None]:
beta = 0.2
increase = np.exp(beta)
OR = default_OR * increase
OR

These odds can be converted to a probability of churning:

In [None]:
p = OR / (1 + OR)
p

If a coefficient is -0.4, then increasing the corresponding variable by 1 unit gives new odds of:

In [None]:
beta = -0.4
increase = np.exp(beta) * 1
OR = default_OR * increase
OR

These odds can be converted to a probability of churning:

In [None]:
p = OR / (1 + OR)
p
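
The two worked cases above start from a 50% baseline. As a hedged generalization, the helper below applies the same odds arithmetic from any baseline probability. The function name `churn_prob_after_unit_increase` is hypothetical and not part of the notebook's code.

```python
import numpy as np

def churn_prob_after_unit_increase(p0, beta):
    """Churn probability after a 1-unit increase in a feature with coefficient beta."""
    odds0 = p0 / (1.0 - p0)          # baseline odds
    odds1 = odds0 * np.exp(beta)     # odds are multiplied by e^beta
    return odds1 / (1.0 + odds1)     # convert odds back to a probability

# The same coefficient moves the probability by different amounts
# depending on the baseline
for p0 in (0.2, 0.5, 0.8):
    print(p0, churn_prob_after_unit_increase(p0, beta=0.2))
```

The effect of a coefficient on the probability is largest near 50% and shrinks toward the extremes, even though its multiplicative effect on the odds is constant.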

### Check the result with our EDA

In [None]:
df[['luxury_car_user', 'churn']].groupby(['churn']).mean().plot.bar()

In [None]:
df[['luxury_car_user', 'churn']].groupby(['luxury_car_user']).mean().plot.bar()

In [None]:
df[['avg_dist', 'churn']].groupby(['churn']).mean().plot.bar()

In [None]:
df[['phone_iPhone', 'churn']].groupby(['churn']).mean().plot.bar()

In [None]:
df[['avg_rating_by_driver', 'churn']].groupby(['churn']).mean().plot.bar()
plt.legend(loc='lower center')

### Use polynomial features - high orders!

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
X_poly = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)

In [None]:
X_poly.shape

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1000000, fit_intercept=True)
lr.fit(X_poly, y)

In [None]:
y_pred = lr.predict(X_poly)
print("Accuracy of the Logistic Regression is: {}".format(accuracy(y, y_pred)))
print("Precision of the Logistic Regression is: {}".format(precision(y, y_pred)))
print("Recall of the Logistic Regression is: {}".format(recall(y, y_pred)))

### Use train and test set

In [None]:
# sklearn.cross_validation was removed in newer scikit-learn; use model_selection instead
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.95, random_state=42)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
lr = LogisticRegression(C=0.1, fit_intercept=True)
lr.fit(X_train, y_train)

In [None]:
y_train_pred = lr.predict(X_train)
print("Training score:")
print("Accuracy of the Logistic Regression is: {}".format(accuracy(y_train, y_train_pred)))
print("Precision of the Logistic Regression is: {}".format(precision(y_train, y_train_pred)))
print("Recall of the Logistic Regression is: {}".format(recall(y_train, y_train_pred)))

In [None]:
y_test_pred = lr.predict(X_test)
print("Test score:")
print("Accuracy of the Logistic Regression is: {}".format(accuracy(y_test, y_test_pred)))
print("Precision of the Logistic Regression is: {}".format(precision(y_test, y_test_pred)))
print("Recall of the Logistic Regression is: {}".format(recall(y_test, y_test_pred)))