# Customer Churn Prediction Case Study 


In [5]:
# Data handling and plotting
import pandas as pd
import numpy as np
from datetime import timedelta
import matplotlib.pyplot as plt

# Model selection and candidate classifiers
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegression as LR  # short alias for the same class
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import AdaBoostClassifier as AD
from sklearn.svm import SVC

# Dimensionality reduction, evaluation, and module reloading
from sklearn.decomposition import PCA
from sklearn.metrics import precision_recall_curve
from importlib import reload

***

# Background

A ride-sharing company (Company X) is interested in **predicting rider retention**. To help explore this question, we have provided a sample dataset of a cohort of users who signed up for an account in **January 2014**. The data was pulled on July 1, 2014; we consider a user retained if they were **“active”** (i.e. took a trip) in the preceding 30 days (from the day the data was pulled). In other words, a user is "active" if they have taken a trip since June 1, 2014. 
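The retention rule above can be turned into a binary label. The following is a minimal sketch, assuming `last_trip_date` parses as a date and that July 1, 2014 is the reference point; the helper `add_churn_label`, the column name `churn`, and the toy rows are illustrative and not part of the original notebook.

```python
import pandas as pd

PULL_DATE = pd.Timestamp("2014-07-01")   # date the data was pulled, per the text
ACTIVE_WINDOW = pd.Timedelta(days=30)    # "the preceding 30 days"


def add_churn_label(df):
    """Return a copy of df with a boolean `churn` column (hypothetical name).

    A user counts as churned when their last trip falls before the
    cutoff (pull date minus 30 days, i.e. June 1, 2014).
    """
    out = df.copy()
    last_trip = pd.to_datetime(out["last_trip_date"])
    cutoff = PULL_DATE - ACTIVE_WINDOW
    out["churn"] = last_trip < cutoff
    return out


# Toy rows for illustration only; the real data lives in the train set.
toy = pd.DataFrame({"last_trip_date": ["2014-05-03", "2014-06-01", "2014-06-30"]})
labeled = add_churn_label(toy)
print(labeled)
```

A trip on June 1 itself counts as active here, which matches "since June 1, 2014"; if the boundary should be exclusive, the comparison would need to change.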


The data, churn.csv, is in the data folder and has been split into train and test sets; this notebook loads the training portion from data/churn_train.csv. You are encouraged to tune your model and estimate its performance on the train set, then see how it does on the unseen data in the test set at the end.
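One common way to estimate performance using only the train set is k-fold cross-validation. The sketch below is an assumption about how that could look, not the approach prescribed by the case study: it uses synthetic data from `make_classification` in place of the real features, and the choice of logistic regression, 5 folds, and accuracy as the metric is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training features and churn labels.
X_train, y_train = make_classification(
    n_samples=500, n_features=8, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Each fold is held out once while the model is fit on the other four,
# so the test set stays untouched until the very end.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("fold accuracies:", cv_scores)
print("mean accuracy:", cv_scores.mean())
```

Keeping the test set out of this loop means the final score on it remains an honest estimate of performance on unseen data.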



***

# Objective

We would like you to use this data set to help understand **what factors are the best predictors for retention**, and offer suggestions to operationalize those insights to help Company X. Therefore, your task is not only to build a model that minimizes error, but also a model that allows you to interpret the factors that contributed to your predictions.


***

Here is a detailed description of the data:

- city: city this user signed up in
- phone: primary device for this user
- signup_date: date of account registration; in the form YYYY-MM-DD
- last_trip_date: the last time this user completed a trip; in the form YYYY-MM-DD
- avg_dist: the average distance (in miles) per trip taken in the first 30 days after signup
- avg_rating_by_driver: the rider’s average rating over all of their trips
- avg_rating_of_driver: the rider’s average rating of their drivers over all of their trips
- surge_pct: the percent of trips taken with surge multiplier > 1
- avg_surge: the average surge multiplier over all of this user’s trips
- trips_in_first_30_days: the number of trips this user took in the first 30 days after signing up
- luxury_car_user: TRUE if the user took a luxury car in their first 30 days; FALSE otherwise
- weekday_pct: the percent of the user’s trips occurring during a weekday
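Several of the columns described above are not numeric, so they usually need some preparation before modelling. The following is a rough sketch under stated assumptions: the helper `prepare_features` is hypothetical, the three rows are copied from the data preview for illustration, and one-hot encoding is only one of several reasonable ways to handle `city` and `phone`.

```python
import pandas as pd

# A few illustrative rows with the non-numeric columns from the description.
rows = pd.DataFrame({
    "city": ["Astapor", "Astapor", "Winterfell"],
    "phone": ["Android", "Android", "iPhone"],
    "signup_date": ["2014-01-12", "2014-01-25", "2014-01-02"],
    "last_trip_date": ["2014-05-03", "2014-01-26", "2014-05-21"],
    "luxury_car_user": [False, True, True],
})


def prepare_features(df):
    """Parse dates, cast the boolean flag to 0/1, one-hot encode categoricals."""
    out = df.copy()
    for col in ("signup_date", "last_trip_date"):
        out[col] = pd.to_datetime(out[col])
    out["luxury_car_user"] = out["luxury_car_user"].astype(int)
    # Creates one indicator column per category, e.g. city_Astapor.
    return pd.get_dummies(out, columns=["city", "phone"])


prepared = prepare_features(rows)
print(prepared.dtypes)
```

Note that `last_trip_date` determines the retention label, so it would normally be used to build the target rather than fed to the model as a feature.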

## EDA

In [2]:
train = pd.read_csv("data/churn_train.csv")

In [3]:
train.head()

| | avg_dist | avg_rating_by_driver | avg_rating_of_driver | avg_surge | city | last_trip_date | phone | signup_date | surge_pct | trips_in_first_30_days | luxury_car_user | weekday_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.94 | 5.0 | 5.0 | 1.0 | Astapor | 2014-05-03 | Android | 2014-01-12 | 0.0 | 0 | False | 100.0 |
| 1 | 8.06 | 5.0 | 5.0 | 1.0 | Astapor | 2014-01-26 | Android | 2014-01-25 | 0.0 | 2 | True | 0.0 |
| 2 | 21.5 | 4.0 | NaN | 1.0 | Winterfell | 2014-05-21 | iPhone | 2014-01-02 | 0.0 | 1 | True | 100.0 |
| 3 | 9.46 | 5.0 | NaN | 2.75 | Winterfell | 2014-01-10 | Android | 2014-01-09 | 100.0 | 1 | False | 100.0 |
| 4 | 13.77 | 5.0 | NaN | 1.0 | Winterfell | 2014-05-13 | iPhone | 2014-01-31 | 0.0 | 0 | False | 100.0 |


In [6]:
train.shape

(40000, 12)
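Rows 2 to 4 of the preview have no `avg_rating_of_driver`, so a natural next EDA step is to count missing values per column. This is a minimal sketch on a small stand-in frame built from the preview rows; the names `sample`, `missing_counts`, and `missing_share` are illustrative, and on the real data the same calls would be applied to `train`.

```python
import numpy as np
import pandas as pd

# Stand-in built from the five preview rows (three columns only).
sample = pd.DataFrame({
    "avg_rating_by_driver": [5.0, 5.0, 4.0, 5.0, 5.0],
    "avg_rating_of_driver": [5.0, 5.0, np.nan, np.nan, np.nan],
    "phone": ["Android", "Android", "iPhone", "Android", "iPhone"],
})

missing_counts = sample.isna().sum()   # number of missing entries per column
missing_share = sample.isna().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```

How to handle the gaps (dropping rows, imputing, or adding a "missing" indicator) is a modelling decision that this sketch does not make.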