# Lyft Data Science Assignment ---- kw

First, I will conduct some exploratory data analysis (EDA) to better understand churn at Lyft and the main factors that affect a driver’s churn rate.

Then, I will generate two datasets to analyze drivers by segment and at some additional levels.

Next, I will build a model to test the attributes I selected.

Finally, I will give a summary and describe the experiment design.

In [None]:
# Libraries for analysis and plotting
import numpy as np
import pandas as pd
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.style.use('fivethirtyeight')

## Load the datasets

In [None]:
drivers = pd.read_csv('Resources/driver_ids.csv')
rides_id = pd.read_csv('Resources/ride_ids.csv')
rides_times = pd.read_csv('Resources/ride_timestamps.csv')

In [None]:
# Let's look at the first few rows
drivers.head(2)

In [None]:
drivers.info()

We can see that there are 937 unique driver IDs with corresponding onboarding dates.
There are no null values, but the timestamp is stored as a string and needs to be converted to a datetime.

In [None]:
# Transform timestamp to datetime
drivers['driver_onboard_date'] = pd.to_datetime(drivers['driver_onboard_date'])

In [None]:
# Let's look at the first few rows
rides_id.head(2)

In [None]:
rides_id.info()

We can see that there is data on 193,502 Lyft rides and there are no null values.

In [None]:
# Let's look at the first few rows
rides_times.head(2)

In [None]:
rides_times.info()

We can see that the dataframe has 194,081 rows and no null values.

In [None]:
# Transform timestamp to datetime
rides_times['ride_picked_up_at'] = pd.to_datetime(rides_times['ride_picked_up_at'])

In [None]:
# Extract features from drivers: week of the year, month and day of the year of onboarding
# .dt.isocalendar().week gives the ISO week number
drivers['driver_onboard_week'] = drivers['driver_onboard_date'].dt.isocalendar().week.astype(int)
drivers['driver_onboard_month'] = drivers['driver_onboard_date'].dt.month
drivers['driver_onboard_day'] = drivers['driver_onboard_date'].dt.dayofyear

In [None]:
# Extract the features from rides_timestamps
rides_times['day_of_week'] = rides_times['ride_picked_up_at'].dt.weekday
rides_times['month'] = rides_times['ride_picked_up_at'].dt.month
rides_times['hour_of_day'] = rides_times['ride_picked_up_at'].dt.hour
rides_times['week_of_year'] = rides_times['ride_picked_up_at'].dt.isocalendar().week.astype(int)
rides_times['day_of_year'] = rides_times['ride_picked_up_at'].dt.dayofyear

## Merge the dataframes

In [None]:
# Merge rides_times with rides_id
rides = pd.merge(rides_id, rides_times, on='ride_id', how='inner')
# Next merge new dataframe with drivers
rides = pd.merge(rides, drivers, on='driver_id', how='inner')

In [None]:
# Let's look at the new dataset
rides.head(2)

In [None]:
rides.info()

### Overview of data fields

- driver_id: Unique identifier for a driver
- ride_id: Unique identifier for a ride that was completed by the driver
- ride_distance: Ride distance in meters
- ride_duration: Ride duration in seconds
- ride_prime_time: PrimeTime applied on the ride
- ride_picked_up_at: Timestamp for when the driver picked up the passenger
- day_of_week: Day of the week the ride took place
- month: Month the ride took place
- hour_of_day: Hour of the day the ride took place
- week_of_year: Week of the year the ride took place
- day_of_year: Day of the year the ride took place
- driver_onboard_date: Date on which the driver was onboarded
- driver_onboard_week: Week of the year in which the driver was onboarded
- driver_onboard_day: Day of the year on which the driver was onboarded
- driver_onboard_month: Month in which the driver was onboarded

## Exploratory data analysis
I will check each column of the merged dataframe to look for outliers and get a better understanding of the data.

In [None]:
rides.describe()

### ride_distance

In [None]:
sns.boxplot(y=rides['ride_distance'])
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
plt.show()

Remove outliers: From the boxplot above we can see that the majority of the journeys are less than 40,000 meters in distance. There are a couple of outliers above this value as well as several very short rides. I will remove rows where ride_distance is 500 meters or less, or 40,000 meters or more.

In [None]:
rides = rides[(rides.ride_distance > 500) & (rides.ride_distance < 40000)]

In [None]:
ax = sns.histplot(rides['ride_distance'], kde=False)
ax.set_title('Distribution of ride_distance\n', size = 20)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Distance (m)', size = 14)
ax.axvline(rides['ride_distance'].mean(), color = 'green', linewidth = 1)
ax.axvline(rides['ride_distance'].median(), color = 'red', linewidth = 1)
ax.legend(['Mean', 'Median'])
plt.show()

From the above histogram we can see that the majority of journeys are less than 10,000 meters, with the average distance being ~6,300 meters.

### ride_duration

In [None]:
rides.ride_duration.describe()

In [None]:
sns.boxplot(y=rides['ride_duration'])
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
plt.show()

Remove outliers: From the boxplot above we can see that the majority of the journeys are less than 3,000 seconds in duration. There are a couple of outliers above this value as well as several very short rides. I will keep only rows where ride_duration is less than 2,000 seconds.

In [None]:
rides = rides[rides.ride_duration < 2000]

In [None]:
ax = sns.histplot(rides['ride_duration'], kde=False)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of ride duration\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Duration (s)', size = 14)
ax.axvline(rides['ride_duration'].mean(), color = 'green', linewidth = 1)
ax.axvline(rides['ride_duration'].median(), color = 'red', linewidth = 1)
ax.legend(['Mean', 'Median'])
plt.show()

From the above histogram we can see that the majority of journeys take less than 2,000 seconds, with the average duration being ~750 seconds.

### ride_prime_time

In [None]:
rides.ride_prime_time.value_counts()

In [None]:
ax = sns.histplot(rides['ride_prime_time'], kde=False)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of PrimeTime\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('PrimeTime', size = 14)
plt.show()

We can see that for the majority of journeys there is no PrimeTime pricing.

### day_of_week

In [None]:
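# sns.tsplot is no longer available in seaborn; sns.lineplot draws the same line
rides_per_day = rides.groupby('day_of_week')['day_of_week'].count()
ax = sns.lineplot(x=rides_per_day.index, y=rides_per_day.values)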
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of journeys per day\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Day of the week', size = 14)
plt.show()

The numbers on the x-axis refer to the days of the week, with 0 = Monday and 6 = Sunday.
From the plot we can see that Monday has the fewest journeys and that the number of journeys increases steadily during the week, peaking on Fridays.

### hour_of_day

In [None]:
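# sns.tsplot is no longer available in seaborn; sns.lineplot draws the same line
rides_per_hour = rides.groupby('hour_of_day')['hour_of_day'].count()
ax = sns.lineplot(x=rides_per_hour.index, y=rides_per_hour.values)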
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of journeys per hour of the day\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Hour of the day', size = 14)
plt.show()

### week_of_year

In [None]:
ax = sns.histplot(rides.week_of_year, bins=13, kde=False)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of journeys per week of the year\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Week of the year', size = 14)
ax.set_xticks(np.arange(13,26,1))
plt.show()

As expected, there is a gradual increase in the number of journeys per week leading up to a peak in week 20. This is in line with the distribution by month, where the peak is in May.

### day_of_year

In [None]:
rides.day_of_year.describe()

In [None]:
ax = sns.histplot(rides.day_of_year, bins=90, kde=False)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of journeys per day of the year\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Day of the year', size = 14)
ax.set_xticks(np.arange(88,180,7))
plt.show()

The distribution is in line with the weeks of the year: the number of journeys per day increases gradually up to a peak on day 135. This is also in line with the distribution by month, where the peak is in May.

### driver_onboard_week

In [None]:
#sns.tsplot(rides.groupby('day_of_year')['day_of_year'].count())
ax = sns.histplot(rides.driver_onboard_week, bins=7, kde=False)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of driver onboarding per week of the year\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Week of the year', size = 14)
ax.set_xticks(np.arange(13,26,1))
plt.show()

From the above we can see that a large number of drivers were onboarded early in the four-month period and that onboarding then decreased over time.

## Generate output datasets ---- lifetime

### Average lifetime of a driver

### Definition of churn
Churn is defined for activated drivers (a driver becomes ‘activated’ once they complete their first ride).
I will assume that a driver has churned once they have not driven for a week, so any driver who has not driven in the last week of the data will be treated as churned.

I am looking for drivers with consistent working hours who can meet demand at peak times.
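
The same rule can also be expressed with dates rather than week numbers. The snippet below is a minimal sketch on made-up data: the toy frame `toy_rides`, its dates, and the names `window_end`, `cutoff` and `churn_flag` are assumptions for illustration, not part of the assignment data.

```python
import pandas as pd

# Toy ride log (hypothetical values, for illustration only)
toy_rides = pd.DataFrame({
    'driver_id': ['a', 'a', 'b', 'c', 'c'],
    'ride_picked_up_at': pd.to_datetime([
        '2016-06-01', '2016-06-25', '2016-05-10', '2016-06-20', '2016-06-26',
    ]),
})

# End of the observation window = latest pickup in the data
window_end = toy_rides['ride_picked_up_at'].max()
cutoff = window_end - pd.Timedelta(days=7)

# Last ride per driver; a driver with no ride after the cutoff is flagged as churned
last_ride = toy_rides.groupby('driver_id')['ride_picked_up_at'].max()
churn_flag = (last_ride < cutoff).astype(int)
print(churn_flag)
```

Working from the last ride date avoids depending on how week numbers happen to fall in the calendar.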

In [None]:
# Last ride day and week for each driver, alongside the onboarding day and week
lifetime0 = pd.DataFrame(rides.pivot_table(index='driver_id', values=['day_of_year', 'week_of_year',
                                                        'driver_onboard_day', 'driver_onboard_week'], aggfunc='max'))
lifetime0.fillna(0, inplace=True)
lifetime0.reset_index(inplace=True)

# First ride day and week for each driver, alongside the onboarding day and week
lifetime1 = pd.DataFrame(rides.pivot_table(index='driver_id', values=['day_of_year', 'week_of_year',
                                                        'driver_onboard_day', 'driver_onboard_week'], aggfunc='min'))
lifetime1.fillna(0, inplace=True)
lifetime1.reset_index(inplace=True)

# Average PrimeTime applied to each driver's rides
lifetime2 = pd.DataFrame(rides.pivot_table(index='driver_id', values='ride_prime_time', aggfunc='mean'))
lifetime2.fillna(0, inplace=True)
lifetime2.reset_index(inplace=True)

In [None]:
#merge
lifetime3 = pd.merge(lifetime0, lifetime1,'left', on = 'driver_id')

In [None]:
#merge
lifetime = pd.merge(lifetime3, lifetime2,'left', on = 'driver_id')

In [None]:
#drop useless columns
lifetime=lifetime.drop('driver_onboard_day_y', axis=1)
lifetime=lifetime.drop('driver_onboard_week_y', axis=1)

In [None]:
lifetime = lifetime.rename(columns={"driver_id":"driver_id", "day_of_year_x": "day_of_year", "driver_onboard_day_x": "driver_onboard_day",
                                                      "driver_onboard_week_x": "driver_onboard_week", "week_of_year_x": "week_of_year",
                                                      "day_of_year_y": "day_of_year_first","week_of_year_y": "week_of_year_first",
                                                      "ride_prime_time": "avg_ride_prime_time"})

In [None]:
lifetime.head(2)

In [None]:
# Determine the amount of time a driver has worked
lifetime['no_weeks'] = lifetime['week_of_year'] - lifetime['driver_onboard_week']
lifetime['no_days'] = lifetime['day_of_year'] - lifetime['driver_onboard_day']

In [None]:
ax = sns.histplot(lifetime['no_days'], kde=False, bins=30)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of num of days working\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Days', size = 14)
ax.axvline(lifetime['no_days'].mean(), color = 'green', linewidth = 1)
ax.axvline(lifetime['no_days'].median(), color = 'red', linewidth = 1)
ax.legend(['Mean', 'Median'])
plt.show()

In the four months over which the data was collected, the average driver worked for a period of 55 days.
It is important to note that onboarding started in earnest in the first month and tapered off as time progressed.

In [None]:
ax = sns.histplot(lifetime['no_weeks'], kde=False, bins=10)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of no. of weeks working\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Weeks', size = 14)
ax.axvline(lifetime['no_weeks'].mean(), color = 'green', linewidth = 1)
ax.axvline(lifetime['no_weeks'].median(), color = 'red', linewidth = 1)
ax.legend(['Mean', 'Median'])
plt.show()

Echoing the above we can see that an average driver has worked for a period of ~7.5 weeks.

In [None]:
# Determine the time between onboarding and a driver's first completed ride
lifetime['no_weeks_first'] = lifetime['week_of_year_first'] - lifetime['driver_onboard_week']
lifetime['no_days_first'] = lifetime['day_of_year_first'] - lifetime['driver_onboard_day']

In [None]:
ax = sns.histplot(lifetime['no_days_first'], kde=False, bins=30)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of days from onboarding to first ride\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Days', size = 14)
ax.axvline(lifetime['no_days_first'].mean(), color = 'green', linewidth = 1)
ax.axvline(lifetime['no_days_first'].median(), color = 'red', linewidth = 1)
ax.legend(['Mean', 'Median'])
plt.show()

In [None]:
lifetime.head(2)

### Flag churn driver

In [None]:
#Any driver that has not driven in the last week will be treated as churned.
churned = lifetime[lifetime['week_of_year'] < 25]
print ('There are', churned.shape[0], 'churned drivers.')

The 323 churned drivers account for ~39% of all drivers.

In [None]:
# Create a new column with a binary indication of churn: 1 for churn and 0 otherwise
lifetime['churn'] = (lifetime['week_of_year'] < 25).astype(int)

In [None]:
lifetime.head(2)

### Driver_segmentation

In [None]:
# Create a new indication of driver_type
    
lifetime.loc[lifetime.no_days <= 30,'driver_type'] = 'new_driver'
lifetime.loc[lifetime.no_days >= 60,'driver_type'] = 'driver_m60d'
lifetime.loc[(lifetime.no_days < 60)&(lifetime.no_days > 30),'driver_type'] = 'driver_l60d'


In [None]:
new = lifetime[lifetime['no_days'] <= 30]
print ('There are', new.shape[0], 'new drivers.')

lifetime.groupby(['driver_type','churn']).size()

### Average hours worked per week

In [None]:
# Total time each driver spent on paid rides (sum of ride durations, in seconds)
time_worked = pd.DataFrame(rides.pivot_table(index='driver_id', values='ride_duration', aggfunc='sum'))
time_worked.fillna(0, inplace=True)
time_worked.reset_index(inplace=True)

In [None]:
#merge
lifetime = pd.merge(lifetime, time_worked,'left', on = 'driver_id')

# Convert from seconds to hours
lifetime['time_worked'] = lifetime.ride_duration/3600

In [None]:
lifetime.head(2)

In [None]:
ax = sns.histplot(lifetime['time_worked'], kde=False, bins=10)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of hours worked\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('Hours', size = 14)
ax.axvline(lifetime['time_worked'].mean(), color = 'green', linewidth = 1)
ax.axvline(lifetime['time_worked'].median(), color = 'red', linewidth = 1)
ax.legend(['Mean', 'Median'])
plt.show()

This only takes into account the time a driver spends on paid journeys; the total time the driver spends working is not represented.
We can see that a good portion of the drivers spent 10 hours or less driving passengers. The average time spent driving was ~50 hours.

### Number of completed journeys

In [None]:
# Take the count of a drivers journeys over their lifetime
no_rides = pd.DataFrame(rides.pivot_table(index='driver_id', values=('ride_id'), aggfunc='count'))
no_rides.fillna(0, inplace=True)
no_rides.reset_index(inplace=True)

In [None]:
no_rides.head(2)

In [None]:
#merge
lifetime = pd.merge(lifetime, no_rides,'left', on = 'driver_id')

lifetime['no_rides'] = lifetime.ride_id

In [None]:
#drop useless columns
lifetime=lifetime.drop('ride_id', axis=1)
lifetime=lifetime.drop('ride_duration', axis=1)

In [None]:
lifetime.head(2)

In [None]:
ax = sns.histplot(lifetime['no_rides'], kde=False, bins=10)
sns.set_context('poster')
sns.set(style='white')
sns.despine(bottom=False)
ax.set_title('Distribution of no. of journeys\n', size = 20)
ax.set_ylabel('Count', size = 14)
ax.set_xlabel('No. of journeys', size = 14)
ax.axvline(lifetime['no_rides'].mean(), color = 'green', linewidth = 1)
ax.axvline(lifetime['no_rides'].median(), color = 'red', linewidth = 1)
ax.legend(['Mean', 'Median'])
plt.show()

The average number of journeys made by a driver in their lifetime is 206.

## Generate output datasets ---- pivot

In [None]:
rides.head(2)

In [None]:
rides['ride_picked_up_at'] = pd.to_datetime(rides['ride_picked_up_at'])
rides['driver_onboard_date'] = pd.to_datetime(rides['driver_onboard_date'])
rides['days_active'] = rides['ride_picked_up_at'] - rides['driver_onboard_date'] 
rides['days_active_in_days'] = rides['days_active']/ pd.Timedelta(days=1)
rides['ride_within_1st_week'] = rides['days_active_in_days'] <= 7
rides.head(3)

In [None]:
pivot = rides.pivot_table(index='driver_id', \
                          values=('days_active_in_days', 'ride_id', 'ride_prime_time','ride_picked_up_at','ride_within_1st_week'),\
                          aggfunc={'days_active_in_days':np.max,'ride_id': 'count', 'ride_prime_time':np.mean,\
                                   'ride_picked_up_at': min,'ride_within_1st_week': sum})\
                                 .reset_index()

In [None]:
pivot['days_active'] = pivot['days_active_in_days']
pivot['number_rides'] = pivot['ride_id']
#drop useless columns
pivot=pivot.drop('ride_id', axis=1)

In [None]:
pivot.head(2)

In [None]:
pivot.describe()

### Churn rate

In [None]:
df_churn = pd.merge(drivers, pivot, how = 'left', on='driver_id')
df_churn = df_churn[['driver_id', 'driver_onboard_date', 'days_active', 'ride_picked_up_at']]

retention_rates_by_day = []

for i in range(43):
    retention = (df_churn['days_active'] >= i).mean()
    retention_rates_by_day.append(retention)
    
plt.plot(retention_rates_by_day, color='fuchsia')

plt.title('Driver retention rate', fontsize = 16)

plt.xlabel('Days')
plt.ylabel('Retention')
sns.despine(offset=10)

### Avg number of rides 1st week

In [None]:
plt.figure(figsize=(4, 5))
x = [11.016018,36.058000]
plt.bar(x = ('<150 rides', '>150 rides'), height = x, align='center', color=('gray', 'fuchsia'))
plt.ylabel('Number of rides 1st week')
plt.title('Avg number of rides 1st week')
sns.despine(offset=10)

### Avg number of trips/day

In [None]:
plt.figure(figsize=(4, 5))
y = [1.742499,5.353718]
plt.bar(x = ('<150 rides', '>150 rides'), height = y, align='center',  color=('gray', 'fuchsia'))
plt.ylabel('Avg number trips/day')
plt.title('Avg number of trips/day')
sns.despine(offset=10)
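
The bar heights in the two charts above are typed in by hand. One way such cohort averages could be derived from a per-driver table like `pivot` is sketched below on toy data. The frame `toy_pivot`, the `cohort` labels and the trips-per-day definition (total rides divided by days active) are assumptions, since the notebook does not show how the numbers were computed.

```python
import numpy as np
import pandas as pd

# Toy per-driver table using the same column names as `pivot` (values are made up)
toy_pivot = pd.DataFrame({
    'driver_id': ['a', 'b', 'c', 'd'],
    'number_rides': [40, 100, 200, 400],
    'days_active': [20.0, 50.0, 40.0, 80.0],
    'ride_within_1st_week': [5, 15, 30, 40],
})

# Split drivers into the two cohorts used in the charts
toy_pivot['cohort'] = np.where(toy_pivot['number_rides'] >= 150, '>=150 rides', '<150 rides')

# Assumed definition: trips per day = total rides / days active
toy_pivot['trips_per_day'] = toy_pivot['number_rides'] / toy_pivot['days_active']

# Average first-week rides and trips per day for each cohort
cohort_means = toy_pivot.groupby('cohort')[['ride_within_1st_week', 'trips_per_day']].mean()
print(cohort_means)
```

Computing the values this way keeps the charts in sync with the data if the outlier filters or the cohort threshold change.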

### Driver tenure

In [None]:
plt.title('Driver tenure histogram, days', fontsize = 16)
pivot['days_active'].hist(bins = 50, facecolor='fuchsia')
plt.xlabel('Number days active')
plt.grid(False)
sns.despine(offset=10)

### Number of rides per driver

In [None]:
plt.figure(figsize=(10, 5))
plt.title('Number of rides per driver', fontsize = 16)
pivot['number_rides'].hist(bins = 50, facecolor='fuchsia')
plt.xlabel('Number of rides')
plt.ylabel('Number of drivers')
plt.grid(False)
sns.despine(offset=10, trim=True)
#plt.xlim((0, 300))

In [None]:
plt.title('Driver tenure histogram (by cohort), days', fontsize = 14)
x = pivot.loc[(pivot['days_active'].notnull()) & (pivot['number_rides'] < 150), 'days_active']
y = pivot.loc[(pivot['days_active'].notnull()) & (pivot['number_rides'] >= 150), 'days_active']
plt.hist([x, y], color=('gray', 'fuchsia'), label = ('< 150','>= 150'))
sns.despine(offset=10)
plt.xlabel('Number of days active')
plt.ylabel('Number of drivers')

### Number of rides per tenure

In [None]:
plt.scatter(pivot['number_rides'], pivot['days_active'], alpha=0.3, color = 'fuchsia') 
plt.xlabel('Number of rides')
plt.ylabel('Driver tenure, days')
plt.title('Number of rides per tenure', fontsize = 14)

### Number of rides in the first week after onboarding

In [None]:
pivot['ride_within_1st_week'].hist( color='fuchsia')
plt.title('Number of rides in the first week after onboarding', fontsize = 16)

plt.xlabel('Number of rides')

sns.despine(offset=10)
plt.grid(False)

### Hypothesis 1
Use pivot as the dataset to analyze whether doubling the number of rides in an activated driver’s first week (the pivot.ride_within_1st_week feature) will decrease driver churn.

### Hypothesis 2
Use lifetime as the dataset to analyze whether increasing the amount of time a driver spends on paid journeys (the lifetime.time_worked feature; possible methods: promotions or subsidies) will decrease driver churn.

### Conclusion based on the above data analysis
I have more confidence in the second hypothesis. This is based not only on the results above: after building a logistic regression model, I confirmed that the factor in hypothesis 2 has a stronger effect on driver churn.

## Modelling
Use lifetime as the dataset to predict whether a driver will churn, in order to better understand which factors most strongly affect churn.

In [None]:
lifetime.head(2)

In [None]:
# Libraries for modelling
from sklearn import datasets, metrics, linear_model, feature_selection, preprocessing
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score, confusion_matrix,
                             classification_report, accuracy_score, roc_curve, auc, make_scorer)
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split, GridSearchCV
from sklearn.linear_model import (LogisticRegression, LogisticRegressionCV, LinearRegression,
                                  RidgeCV, LassoCV, Ridge, Lasso)
from sklearn.preprocessing import StandardScaler, Normalizer
import itertools

I will use logistic regression to predict churn.

The target will be the churn variable I created previously.

The feature set will include:

1) no_days: the number of days a driver has worked
2) no_days_first: the number of days between onboarding and the driver's first completed trip
3) time_worked: the amount of time a driver spends on paid journeys
4) no_rides: the count of a driver's journeys over their lifetime
5) avg_ride_prime_time: the average PrimeTime applied to the driver's rides

In [None]:
# Define target and feature set for modelling
X = lifetime[['no_days', 'no_days_first','time_worked', 'no_rides', 'avg_ride_prime_time']]
y = lifetime.churn

In [None]:
# Standardise the features before fitting the model

# Define the scaler
ss = StandardScaler()

# Scaling the feature set
Xs = ss.fit_transform(X)

Standardisation is necessary for regularized regression because the beta values for each predictor variable must be on the same scale. If betas are different sizes just because of the scale of predictor variables the regularisation term can't determine which betas are more/less important based on their size.
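
The scaler above is fitted on the full dataset. When a held-out test set is used, a common pattern is to fit the scaler on the training rows only, so that no information from the test rows leaks into the scaling. The snippet below is a minimal sketch of that pattern on synthetic data; `X_demo`, `y_demo` and the split settings are assumptions, and it is not the workflow used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# The pipeline fits the scaler on the training rows only, then reuses it on the test rows
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print('Held-out accuracy:', test_accuracy)
```

Wrapping the scaler and the model in one pipeline also means cross-validation refits the scaler inside each fold.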

In [None]:
# Defining our model
# Using the standard parameters of the model
lr = LogisticRegression()

# Fitting the model with our target and features
model = lr.fit(Xs, y)
# Determine predictions from the model and print the accuracy of the model
predictions = model.predict(Xs)
print ('Accuracy:', accuracy_score(y, predictions))

# Coefficients of the fitted model (one per feature)
feat_importance = model.coef_[0]
# Order the features by absolute coefficient size, largest first
indices = np.argsort(np.absolute(feat_importance))[::-1]

# Plotting the feature importance
plt.figure(figsize = (16, 6))
plt.title("Feature importance\n", fontsize = 20)
plt.bar(range(Xs.shape[1]), feat_importance[indices], align="center")
plt.ylabel('Feature coefficient', size = 16)
plt.xlabel('Feature variables', size = 16)
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=45)
plt.xlim([-1, X.shape[1]])
plt.show()

In [None]:
# Create a dataframe to display the odds ratio
odds_ratio = pd.DataFrame(np.transpose(model.coef_), index = X.columns, columns = ["Coefficient value"])
# Create the odds column by calculating the exponential of the coefficients
odds_ratio["Odds"] = odds_ratio["Coefficient value"].apply(np.exp)
# Display the odds ratio
odds_ratio.sort_values(by = 'Odds', ascending = False)

From the above chart and table we can see that time_worked, avg_ride_prime_time and no_days_first come out on top for predicting churn, with an accuracy of 0.84.
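
To read the odds column: because the features were standardised, each value is the factor by which the odds of churn change for a one-standard-deviation increase in that feature. The small example below uses a hypothetical coefficient (`coef = -0.5` is an assumption, not a value from this model).

```python
import math

# Hypothetical standardised coefficient (not a value from this notebook)
coef = -0.5
odds_ratio_demo = math.exp(coef)

# exp(-0.5) is about 0.61: a one-standard-deviation increase in the feature
# multiplies the odds of churn by ~0.61, i.e. lowers them by ~39%
print(round(odds_ratio_demo, 2))
```

Values above 1 raise the odds of churn and values below 1 lower them.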

### Cross-validation
Using cross-validation I will evaluate different metrics:

- Accuracy = (TP+TN)/total
- Precision = TP/(TP+FP)
- Recall = TP/(TP+FN)

I will also plot a confusion matrix to understand these metrics better, as well as a receiver operating characteristic (ROC) curve to look at the area under the curve (AUC).
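
As a worked example of the three formulas, the snippet below applies them to a hypothetical set of confusion-matrix counts (the numbers are made up and are not results from this notebook).

```python
# Hypothetical confusion-matrix counts (not results from this notebook)
TP, FP, FN, TN = 30, 10, 20, 40

total = TP + FP + FN + TN
accuracy = (TP + TN) / total      # share of all predictions that are correct
precision = TP / (TP + FP)        # share of predicted churners who really churned
recall = TP / (TP + FN)           # share of actual churners the model caught

print(accuracy, precision, recall)
```

With these counts the classifier is right 70% of the time, 75% of its churn predictions are correct, and it finds 60% of the drivers who actually churned.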

In [None]:
# List the metrics for evaluation
metrics = ['accuracy', 'precision', 'recall']

# Define the model 
# Using L2 regularisation and 5-fold cross-validation
lg = LogisticRegressionCV(penalty = 'l2', cv = 5, solver = 'liblinear') 

# Loop through the different metrics and print the mean of the metric after the 5-fold cross-validation
for metric in metrics:
    
    scores = cross_val_score(lg, Xs, y, scoring = metric, cv = 5)
    
    print (metric, ':' , scores.mean())

### Confusion Matrix

In [None]:
# Define a function for plotting the confusion matrix
def plot_confusion_matrix(cm, classes, title='Confusion matrix for driver churn\n', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

In [None]:
# Plot confusion matrix
cnf_matrix = confusion_matrix(y, model.predict(Xs))
plot_confusion_matrix(cnf_matrix, classes= ['Active', 'Churned'])

### ROC curve

In [None]:
# Plot ROC curve
FPR, TPR, THR = roc_curve(y, model.predict_proba(Xs)[:,1])
ROC_AUC = auc(FPR, TPR)

plt.figure(figsize=[6,6])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth = 4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate ', fontsize=16)
plt.title('ROC curve \n', fontsize=20)
plt.legend(loc="lower right")
plt.show()

Accuracy answers the question: overall, how often is the classifier correct?

- For this model the accuracy score was 0.83. Since ~39% of drivers churned, always predicting "not churned" would score ~0.61, so the model is well above this baseline and is better at predicting churn than a naive guess.

### Gridsearch
To try and improve the model I will perform a GridSearch using different parameters to try and report the best model.

In [None]:
# Define the GridSearch parameters
params = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
          'penalty': ['l1','l2']
         }

# Define the model (the liblinear solver supports both the l1 and l2 penalties)
grid = GridSearchCV(LogisticRegression(solver='liblinear'), params, cv=5)
# Fit the model
grid.fit(Xs, y)
# Print the best parameters to use
print (grid.best_params_)

{'penalty': 'l2', 'C': 0.01}

The best parameters identified by the GridSearch were:

- penalty of l2 (Ridge)
- C value of 0.01

I will run another regression with these parameters.

In [None]:
# Using L2 regularisation, C = 0.01 and solver = liblinear as this is better for smaller datasets
lggs = LogisticRegression(penalty = 'l2', C = 0.01, solver = 'liblinear')

# Fit the model with our target and features
lggs.fit(Xs, y)

In [None]:
# List the metrics for evaluation again
metrics = ['accuracy', 'precision', 'recall']

# Loop through the different metrics and print the mean of the metric after the 5-fold cross-validation
for metric in metrics:
    
    scores = cross_val_score(lggs, Xs, y, scoring = metric, cv = 5)
    
    print (metric, ':' , scores.mean())

In [None]:
# Plot confusion matrix
cnf_matrix = confusion_matrix(y, lggs.predict(Xs))
plot_confusion_matrix(cnf_matrix, classes= ['Active', 'Churned'])

In [None]:
# Plot ROC curve
FPR, TPR, THR = roc_curve(y, lggs.predict_proba(Xs)[:,1])
ROC_AUC = auc(FPR, TPR)

plt.figure(figsize=[6,6])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth = 4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate ', fontsize=16)
plt.title('ROC curve \n', fontsize=20)
plt.legend(loc="lower right")
plt.show()

## Results analysis
There wasn't a noticeable increase in the metric scores after using the optimised parameters identified by the GridSearch. This suggests that the original model's default parameters were already close to optimal.

By analysing the datasets we were able to derive a lot of useful information about driver behaviour.

Additional insights were obtained by performing some feature engineering on the datasets, and these were used to identify factors that contribute to a driver's Lifetime Value.

1) An optimised Logistic Regression model was used on the following features:

- no_days
- no_days_first
- time_worked
- no_rides
- avg_ride_prime_time

2) With an accuracy score of 0.84, the following features emerged as the best predictors for identifying churn:

- time_worked
- avg_ride_prime_time
- no_days_first

## Summary

● The definition (with justification) for a driver to be considered churned.
Any driver who has not driven in the last week of the data is treated as churned.

● An assessment of the current business impact of churn to Lyft.
Finding out the reasons for churn will help us retain drivers and decrease the churn rate.
Analyzing by driver segment, we learn that:

1) Drivers try the service and realize it's not something they like doing.
Improve: customer service/survey analysis
2) New drivers are the most likely to churn.
Improve: onboarding system/give drivers additional motivation

● Insights on factors affecting churn.
From the above analysis and modelling, the following factors affect churn:

- time_worked
- avg_ride_prime_time
- no_days_first

● Insights on segments of drivers more likely to churn.
From the exploratory data analysis:
1) Pivot table:
    Driver segments:
    drivers with total rides < 150: active on average 40 days, making on average 53 rides
    drivers with total rides >= 150: active on average 66 days, making on average 340 rides
    
2) Lifetime table:
    Driver segments:
    new_driver: 30 or fewer days worked after the onboard date. High percentage of churn.
    driver_l60d: between 30 and 60 days worked.
    driver_m60d: 60 or more days worked. Low percentage of churn.
    

## Experimentation

To test the hypothesis: “eliminating the Prime Time feature will decrease driver churn”.

● What are the primary and secondary metrics you will track.

Set the significance level (alpha) to 0.05.
Primary metric: estimate_arrive_time in prime time

Secondary metrics:
1) ride_duration in prime time
2) ride_cancellation_rate in prime time


### Power analysis

● How long you will run the experiment and how you will choose the winning variant.
Before running the A/B test, run a power analysis to determine the required sample size.
Then estimate how long the experiment needs to run to reach that sample size.
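
As a rough illustration of the power analysis step, the sketch below uses the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_group`, the 80% power, and the hypothetical treatment churn rate of 34% are assumptions; the 39% baseline is the churn rate found earlier in this notebook.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate drivers needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_control - p_treatment) ** 2)

# Assumed inputs: baseline churn of 39%, hypothetical churn of 34% under treatment
n_per_group = sample_size_per_group(0.39, 0.34)
print(n_per_group)
```

Dividing the required sample size by the number of eligible drivers entering the experiment each week gives an estimate of how many weeks the test needs to run.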

### Sampling

● How you will divide observational units into control and treatment, and a description of the treatment and control conditions.

1) After the power analysis, we need to randomly assign users to either the treatment or the control group, for instance by hashing their user IDs into buckets. Randomly select samples from each driver segment to avoid bias.
Distributing covariates evenly helps eliminate statistical bias:
    i) Randomization bias
    ii) Selection bias (avoid the risk appetite effect)

2) To estimate the effect of the treatment on a metric of interest for a random-user experiment:

- Estimate the control values by restricting to users in the control group
- Estimate the treatment values by restricting to users in the treatment group
- Then compute the relative difference between the estimates from control and treatment
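
The two steps above can be sketched as follows. This is a minimal illustration: the function names, the `salt` string, the 50/50 split and the example churn rates are all hypothetical and would need to be adapted to the real experiment setup.

```python
import hashlib

def assign_group(driver_id, salt='primetime_test', treatment_share=0.5):
    """Deterministically map a driver ID to 'treatment' or 'control'."""
    digest = hashlib.sha256(f'{salt}:{driver_id}'.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100          # bucket in 0..99
    return 'treatment' if bucket < treatment_share * 100 else 'control'

def relative_difference(control_value, treatment_value):
    """Relative change of the treatment estimate compared with the control estimate."""
    return (treatment_value - control_value) / control_value

# Hypothetical driver IDs and churn rates, for illustration only
driver_ids = [f'driver_{i}' for i in range(1000)]
groups = [assign_group(d) for d in driver_ids]
print(groups.count('treatment'), groups.count('control'))
print(relative_difference(0.39, 0.34))
```

Hashing with a fixed salt keeps each driver in the same group for the whole experiment, and changing the salt gives a fresh, independent assignment for a later test.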


### Normality assumptions
Check whether the data is normally distributed and check the R-squared of the relationship. Handle skewed and non-skewed data accordingly.

### Hypothesis testing
The hypothesis test will yield a p-value, which is the probability of observing a difference at least as large as the one in our data purely by chance, assuming the treatment has no effect.
1) Variance and standard deviation
2) P-value
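
One possible way to obtain such a p-value for churn rates is a two-proportion z-test, sketched below. The choice of this test, the function name `two_proportion_z_test` and the example counts are assumptions for illustration; the write-up above does not name a specific test.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(churned_a, n_a, churned_b, n_b):
    """Two-sided z-test for a difference between two churn rates."""
    p_a, p_b = churned_a / n_a, churned_b / n_b
    # Pooled churn rate under the null hypothesis of no difference
    p_pool = (churned_a + churned_b) / (n_a + n_b)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 390 of 1000 control drivers churn vs 340 of 1000 in treatment
z_stat, p_val = two_proportion_z_test(390, 1000, 340, 1000)
print(round(z_stat, 2), round(p_val, 3))
```

If the p-value is below the chosen alpha of 0.05, the difference in churn between the two groups would be treated as statistically significant.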

### Business impacts
● What are some potential second-order effects on the experience of drivers and passengers during this experiment.

When there is no Prime Time, a passenger who opens the app and sees a driver available always requests a ride.
When there is Prime Time, the same passenger has a 50% chance of requesting a ride.
Neither drivers nor passengers ever cancel — every request leads to a completed ride.

If we eliminate Prime Time:
1) For drivers:
It will get harder to maintain driver availability. Driver revenue per day may decrease and the driver churn rate may increase.

2) For passengers:
i) passenger waiting time may increase
ii) passenger cancellations may increase
iii) passenger (user) churn rate may increase