
# Part 1

In [0]:
import pandas as pd

In [0]:
logins = pd.read_json('logins.json').set_index('login_time')
logins['count'] = 1
logins.head()

login_time,count
1970-01-01 20:13:18,1
1970-01-01 20:16:10,1
1970-01-01 20:16:37,1
1970-01-01 20:16:36,1
1970-01-01 20:26:21,1


In [0]:
agg = logins.resample('15min').sum()
agg['dayofweek'] = agg.index.dayofweek

A new feature, `dayofweek`, is generated from the index. pandas encodes it as 0 for Monday through 6 for Sunday.
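
As a quick sanity check of the weekday coding, the sketch below looks up the first timestamp shown in the `logins.head()` output. The `names` dictionary is a hypothetical helper for labelling plot legends and is not part of the original notebook.

```python
import pandas as pd

# First login timestamp from the head() output above.
ts = pd.Timestamp('1970-01-01 20:13:18')
print(ts.dayofweek)   # integer code: Monday=0 ... Sunday=6
print(ts.day_name())  # human-readable weekday name

# Hypothetical helper: map each code to its name for one full week.
idx = pd.date_range('1970-01-01', periods=7, freq='D')
names = dict(zip(idx.dayofweek, idx.day_name()))
print(names)
```

Mapping the codes to names before plotting (for example with `agg.index.day_name()`) avoids having to remember the coding when reading the legend.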

In [0]:
import plotly.express as px

In [0]:
px.scatter(agg, x=agg.index, y='count', color='dayofweek',
           title='login counts (every 15 min) from 1970-01-01 to 1970-04-13 by day of week')

The weekly pattern is straightforward when `dayofweek` is read with pandas' coding (0 = Monday, 6 = Sunday): login counts are highest on Sundays, followed by Saturdays. In general, Mondays and Tuesdays have the fewest logins, but in the week of March 15th to 22nd the login count is high every day and peaks on Tuesday.
<br>
Once zoomed in, the plot shows a daily pattern: from Monday to Friday, users tend to log in more around 11:30 am and 10:00 pm. On Saturday and Sunday, logins peak around 4:30 am. On Tuesday, March 17th and Thursday, March 19th, there is an unusually large number of logins around 1:30 am.
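
The daily pattern can also be summarized numerically instead of by zooming into the plot. The following is a minimal sketch on made-up data: `toy` stands in for `agg`, and its counts are arbitrary, so only the shape of the computation carries over.

```python
import pandas as pd

# Toy 15-minute series covering two full weeks (values are arbitrary).
idx = pd.date_range('1970-01-05', periods=4 * 24 * 14, freq='15min')
toy = pd.DataFrame({'count': [i % 7 for i in range(len(idx))]}, index=idx)

# Mean logins per 15-minute bin, by weekday code and hour of day.
profile = (
    toy.groupby([toy.index.dayofweek, toy.index.hour])['count']
       .mean()
       .rename_axis(['dayofweek', 'hour'])
       .unstack(level='dayofweek')
)
print(profile.shape)  # one row per hour, one column per weekday
```

Applied to the real `agg`, each column of `profile` would be the average daily curve for one weekday, which makes the peak hours directly comparable.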

# Part 2
1. The key measure of success is the number of rides that drivers from one city serve in the other city each weekday. If the policy successfully encourages drivers to be available in both cities, then on weekdays a Gotham driver should, on average, offer more rides in Metropolis in the daytime, and a Metropolis driver should serve more rides in Gotham at night, than before the policy was activated.
<br>
<br>
2. A/B testing: Randomly select 50 driver partners from each city and monitor their rides for 2 weeks before and 2 weeks after the activation of the policy (reimbursement of toll costs). Sample 1A contains 50 data points, each the average number of rides that a Gotham driver offers in Metropolis during the day in the 2 weeks before activation. Sample 1B contains the corresponding 50 data points for the same Gotham drivers in Metropolis during the day in the 2 weeks after activation. A one-tailed t test is needed to verify whether the mean of Sample 1B is significantly higher than that of Sample 1A. Sample 2A contains 50 data points, each the average number of rides that a Metropolis driver offers in Gotham at night in the 2 weeks before activation. Sample 2B contains the corresponding 50 data points for the same Metropolis drivers in Gotham at night in the 2 weeks after activation. Another one-tailed t test is needed to verify whether the mean of Sample 2B is significantly higher than that of Sample 2A. If the results of both t tests are statistically significant, then it is reasonable to recommend activating the policy for more driver partners.
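
One of the one-tailed tests described above can be sketched as follows. The data are simulated, not real ride counts, and the choice of a paired test (`scipy.stats.ttest_rel`) is an assumption based on each driver being measured both before and after; the `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated average daily cross-city rides for the same 50 drivers.
before = rng.normal(loc=2.0, scale=1.0, size=50)          # stands in for Sample 1A
after = before + rng.normal(loc=0.5, scale=1.0, size=50)  # stands in for Sample 1B

# H0: mean(after) <= mean(before); H1: mean(after) > mean(before).
t_stat, p_value = stats.ttest_rel(after, before, alternative='greater')
print(t_stat, p_value)
```

If the before and after samples came from different drivers, `stats.ttest_ind` with the same `alternative='greater'` would be the unpaired counterpart.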

# Part 3

In [0]:
ultimate = pd.read_json('ultimate_data_challenge.json')
ultimate['signup_date'] = pd.to_datetime(ultimate.signup_date)
ultimate['last_trip_date'] = pd.to_datetime(ultimate.last_trip_date)
ultimate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   city                    50000 non-null  object        
 1   trips_in_first_30_days  50000 non-null  int64         
 2   signup_date             50000 non-null  datetime64[ns]
 3   avg_rating_of_driver    41878 non-null  float64       
 4   avg_surge               50000 non-null  float64       
 5   last_trip_date          50000 non-null  datetime64[ns]
 6   phone                   49604 non-null  object        
 7   surge_pct               50000 non-null  float64       
 8   ultimate_black_user     50000 non-null  bool          
 9   weekday_pct             50000 non-null  float64       
 10  avg_dist                50000 non-null  float64       
 11  avg_rating_by_driver    49799 non-null  float64       
dtypes: bool(1), datetime64[ns](2), float64(6), int

In [0]:
ultimate['duration'] = (ultimate['last_trip_date'] - 
                        ultimate['signup_date']).dt.days
ultimate['month'] = ultimate.duration // 30 + 1

In [0]:
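# Number of riders whose last trip falls in each month bucket, latest month first.
s = ultimate.groupby('month').count().duration.sort_index(ascending=False)
# Cumulative share (%) of riders whose last trip is in that month or later.
series = pd.Series(s.values.cumsum() / len(ultimate) * 100, index=s.index)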

In [0]:
px.bar(series, x=series.index, y=series.values, 
       title='percentage of riders retained after month (s)')

1. As seen in the plot above, every rider counts as retained in the first month after signing up for Ultimate (by construction, since `month` starts at 1), and approximately 25% of riders are still active in the 6th month after their sign-up date.
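
The retention percentages come from a reverse cumulative sum over the month buckets. A minimal sketch on eight made-up `duration` values (in days) shows the arithmetic:

```python
import pandas as pd

# Made-up durations between sign-up and last trip, in days.
toy = pd.DataFrame({'duration': [0, 10, 29, 30, 45, 95, 150, 181]})
toy['month'] = toy['duration'] // 30 + 1  # same bucketing as above

# Count riders per bucket, then accumulate from the latest month backwards.
counts = toy['month'].value_counts().sort_index(ascending=False)
retained_pct = (counts.cumsum() / len(toy) * 100).sort_index()
print(retained_pct)
```

Here 2 of the 8 toy riders have a last trip in month 6 or later, so `retained_pct.loc[6]` is 25.0, and month 1 is always 100.0.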

In [0]:
ultimate.phone.value_counts()

iPhone     34582
Android    15022
Name: phone, dtype: int64

In [0]:
ultimate['phone'] = ultimate.phone.fillna('iPhone')

The `NaN`s in the `phone` feature are filled with the mode, `iPhone`.

In [0]:
s1 = ultimate.avg_rating_by_driver - ultimate.avg_rating_of_driver
px.histogram(s1, x=s1.values,
             title='value difference between avg_rating_by_driver and avg_rating_of_driver')

For most riders, `avg_rating_by_driver` is within +/- 1 of `avg_rating_of_driver`, so `NaN`s in one column are filled with the value from the other column.
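
The "within +/- 1" observation can be turned into a number rather than read off the histogram. The sketch below uses five made-up rows; on the real data the same expression would be applied to `s1`.

```python
import numpy as np
import pandas as pd

# Made-up ratings; NaN marks a missing value in either column.
toy = pd.DataFrame({
    'avg_rating_by_driver': [5.0, 4.8, np.nan, 4.0, 3.0],
    'avg_rating_of_driver': [4.9, np.nan, 4.5, 4.5, 5.0],
})

# Difference is only defined where both ratings are present.
diff = (toy['avg_rating_by_driver'] - toy['avg_rating_of_driver']).dropna()

# Share of riders whose two ratings differ by at most 1 point.
share_within_one = (diff.abs() <= 1).mean()
print(share_within_one)
```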

In [0]:
ultimate['avg_rating_by_driver'] = ultimate.avg_rating_by_driver.fillna(
    ultimate.avg_rating_of_driver)
ultimate['avg_rating_of_driver'] = ultimate.avg_rating_of_driver.fillna(
    ultimate.avg_rating_by_driver)

In [0]:
ultimate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   city                    50000 non-null  object        
 1   trips_in_first_30_days  50000 non-null  int64         
 2   signup_date             50000 non-null  datetime64[ns]
 3   avg_rating_of_driver    49933 non-null  float64       
 4   avg_surge               50000 non-null  float64       
 5   last_trip_date          50000 non-null  datetime64[ns]
 6   phone                   50000 non-null  object        
 7   surge_pct               50000 non-null  float64       
 8   ultimate_black_user     50000 non-null  bool          
 9   weekday_pct             50000 non-null  float64       
 10  avg_dist                50000 non-null  float64       
 11  avg_rating_by_driver    49933 non-null  float64       
 12  duration                50000 non-null  int64 

There are still 67 rows that contain null values. They can be dropped.

In [0]:
ultimate_updated = ultimate.dropna(subset=['avg_rating_by_driver'])

In [0]:
ultimate_updated['active_in_the_6th_month'] = (ultimate_updated['month'] >= 6
                                               ).astype(int)

In [0]:
ultimate_updated['signup_day'] = ultimate_updated.signup_date.dt.day

In [0]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
ultimate_updated = ultimate_updated.apply(le.fit_transform)
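
Note that `apply(le.fit_transform)` label-encodes every column, so numeric features such as `avg_dist` are replaced by the rank of each distinct value. A possible alternative, sketched here on made-up rows with placeholder city names, is to encode only the categorical columns and leave numeric values untouched.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; 'A' and 'B' are placeholder city names.
toy = pd.DataFrame({
    'city': ['A', 'B', 'A'],
    'phone': ['iPhone', 'Android', 'iPhone'],
    'avg_dist': [3.67, 8.26, 0.77],
})

# What encoding every column does to a numeric feature: values become ranks.
all_encoded = toy.apply(LabelEncoder().fit_transform)
print(all_encoded['avg_dist'].tolist())

# Alternative: encode only the listed categorical columns.
categorical_cols = ['city', 'phone']
encoded = toy.copy()
for col in categorical_cols:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])
print(encoded)
```

Tree-based models split on thresholds, so rank-encoding a numeric column preserves its ordering, but keeping the original values makes the resulting splits easier to interpret.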

In [0]:
y = ultimate_updated['active_in_the_6th_month']
X = ultimate_updated.drop(['signup_date', 'last_trip_date', 
                       'duration', 'month', 'active_in_the_6th_month'], axis=1)

In [0]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, 
                                                test_size=0.2, random_state=7)

In [0]:
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

In [0]:
tree = DecisionTreeClassifier()
tree.fit(Xtrain, ytrain)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [0]:
ypred = tree.predict(Xtest)

In [0]:
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.83      0.82      0.82      7410
           1       0.49      0.51      0.50      2577

    accuracy                           0.74      9987
   macro avg       0.66      0.66      0.66      9987
weighted avg       0.74      0.74      0.74      9987



In [0]:
features = pd.DataFrame(tree.feature_importances_, index=Xtrain.columns, 
             columns=['feature_importance'])

In [0]:
px.bar(features, x=features.index, y='feature_importance')

In [0]:
from sklearn.metrics import accuracy_score
accuracy_score(ypred, ytest)

0.7381596074897366

2. A benchmark decision tree, `tree`, is built, and its accuracy is approximately 74%, although precision and recall for the retained class are only about 0.5. The most important features are `avg_dist`, `signup_day` and `avg_rating_by_driver`. A decision tree classifier is used here because it is simple and needs little tuning. However, an unconstrained decision tree tends to overfit, which limits how well it predicts on unseen data.
<br>
<br>
3. Ultimate can try encouraging new riders to take longer trips, since `avg_dist` is the most important feature. Riders with a higher average rating from drivers tend to stay longer too.
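
The overfitting tendency mentioned in point 2 can be checked by comparing training and test accuracy. This sketch uses a synthetic dataset from `make_classification`, not the Ultimate data, and `max_depth=5` is an arbitrary illustrative setting rather than a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, standing in for a noisy real problem.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=7)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=7)

# Unconstrained tree versus a depth-limited tree.
full = DecisionTreeClassifier(random_state=7).fit(Xtr, ytr)
shallow = DecisionTreeClassifier(max_depth=5, random_state=7).fit(Xtr, ytr)

for name, model in [('full', full), ('max_depth=5', shallow)]:
    print(name, 'train:', model.score(Xtr, ytr), 'test:', model.score(Xte, yte))
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting. On the notebook's data the same comparison would use `tree.score(Xtrain, ytrain)` and `tree.score(Xtest, ytest)`.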

In [0]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(Xtrain, ytrain)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [0]:
ypred = forest.predict(Xtest)

In [0]:
print(classification_report(ypred, ytest))

              precision    recall  f1-score   support

           0       0.92      0.84      0.87      8114
           1       0.48      0.67      0.56      1873

    accuracy                           0.80      9987
   macro avg       0.70      0.75      0.72      9987
weighted avg       0.83      0.80      0.82      9987
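
`classification_report` expects the true labels first and the predictions second. The call above passes `ypred` first, so in this report the precision and recall columns are exchanged relative to the decision tree report, and the support column counts predicted rather than actual classes (8114 and 1873 instead of 7410 and 2577). Accuracy is unaffected. The toy labels below are made up to show the effect of swapping the arguments.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 actual positives, 3 predicted positives, 2 true positives.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Correct order: (y_true, y_pred).
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Swapped order: the two metrics trade places.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(precision, recall)
print(swapped_precision, swapped_recall)
```

Read with the columns exchanged, the random forest's retained class has a precision of about 0.67 and a recall of about 0.48.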



In [0]:
accuracy_score(ypred, ytest)

0.8041453890057074

The random forest predicts better even without tuning (accuracy of about 0.80 versus 0.74 for the single tree). A random forest ensembles multiple randomized decision trees, which reduces overfitting.

In [0]:
forest_features = pd.DataFrame(forest.feature_importances_, index=Xtrain.columns, 
             columns=['feature_importance'])

In [0]:
px.bar(forest_features, x=forest_features.index, y='feature_importance')

`avg_dist` is still the most important feature for predicting whether a rider stays active longer.