## Relax Data Science Challenge

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

In [125]:
# import the necessary modules

import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
import seaborn as sns

# read files

path_engagement ='/Users/kim.jiy/Documents/SpringBoard/Ch20/relax_challenge/takehome_user_engagement.csv'
df_engagement = pd.read_csv(path_engagement)

path_user ='/Users/kim.jiy/Documents/SpringBoard/Ch20/relax_challenge/takehome_users.csv'
df_user = pd.read_csv(path_user, encoding='latin-1')

In [126]:
df_user.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [127]:
df_engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [128]:
# convert 'time_stamp' to datetime

df_engagement['time_stamp'] = pd.to_datetime(df_engagement.time_stamp)

In [129]:
# flag each user as "adopted" if they logged in on three separate days within at least one seven-day period

adopted_users = []

user_unique = list(df_engagement['user_id'].unique())

for user in user_unique:
    # sort first, then reset the index, so that positions 0..n-1 follow time order
    user_time_stamp = (df_engagement[df_engagement['user_id'] == user]
                       .sort_values('time_stamp')
                       .reset_index(drop=True))

    status = False
    # slide over every run of three consecutive logins (empty for fewer than three logins)
    for j in range(len(user_time_stamp) - 2):
        time_diff = user_time_stamp.time_stamp[j+2] - user_time_stamp.time_stamp[j]
        date1 = user_time_stamp.time_stamp[j].date()
        date2 = user_time_stamp.time_stamp[j+1].date()
        date3 = user_time_stamp.time_stamp[j+2].date()
        if time_diff < datetime.timedelta(7) and date1 != date2 and date2 != date3:
            status = True
            break
    adopted_users.append(status)
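
The same definition can also be written as a small per-user helper. This is a minimal sketch, not the method used in this notebook: `is_adopted`, `toy`, `window_days` and `min_days` are hypothetical names, and it assumes that "seven-day period" refers to calendar days, so three distinct login dates must span at most six days from first to last. The loop above compares raw timestamps instead, so the two can differ in edge cases.

```python
import pandas as pd

def is_adopted(timestamps, window_days=7, min_days=3):
    """Return True if `min_days` distinct login dates fall inside some `window_days`-day window."""
    # Reduce each login to its calendar date and keep one entry per date, in time order.
    days = (pd.Series(pd.to_datetime(timestamps))
            .dt.normalize()
            .drop_duplicates()
            .sort_values()
            .reset_index(drop=True))
    if len(days) < min_days:
        return False
    # Distance between each date and the date (min_days - 1) positions later.
    span = days.shift(-(min_days - 1)) - days
    # Trailing NaT values compare as False, so they never trigger adoption.
    return bool((span < pd.Timedelta(days=window_days)).any())

# Toy engagement data (made up for illustration).
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 2, 3, 3, 3],
    "time_stamp": pd.to_datetime([
        "2014-04-22 03:53:30",
        "2014-02-03 03:45:04", "2014-02-05 03:45:04", "2014-02-08 03:45:04",
        "2013-01-01 10:00:00", "2013-01-01 18:00:00", "2013-01-20 09:00:00",
    ]),
})

adopted = toy.groupby("user_id")["time_stamp"].apply(is_adopted).rename("adopted_users")
print(adopted)
```

In the toy data only user 2 qualifies: user 1 has a single login, and user 3 has three logins but only two distinct dates.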
    
     

In [130]:
# create a DataFrame of unique user_id values and a boolean adopted flag

pd_adopted = {'user_id': user_unique, 'adopted_users': adopted_users}

df_adopted = pd.DataFrame(data=pd_adopted)

df_adopted.head()

Unnamed: 0,user_id,adopted_users
0,1,False
1,2,True
2,3,False
3,4,False
4,5,False


The engagement table contains 8823 unique users, of whom 1602 (about 18%) are adopted users.

In [30]:
# number of unique users in the engagement table

len(user_unique)

8823

In [131]:
# number of adopted users

len( df_adopted[df_adopted["adopted_users"] == True] )

1602

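### Missing Values

Two columns of the user table, "last_session_creation_time" and "invited_by_user_id", have missing values. The rates below are taken over all 12000 rows of the user table.


last_session_creation_time: Unix timestamp of the last login (missing rate = 3177/12000 ≈ 26%). The count of 3177 equals 12000 - 8823, the number of users that never appear in the engagement table, which suggests that these users never logged in.

invited_by_user_id: which user invited them to join, if applicable (missing rate = 5583/12000 ≈ 47%).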
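
Missing rates can be read directly from the mean of `isnull()`, which always divides by the number of rows in the frame being inspected. A minimal sketch on made-up data (the `toy_users` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy user table with the two columns that contain gaps (values are made up).
toy_users = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "last_session_creation_time": [1398139000.0, np.nan, 1363735000.0, np.nan],
    "invited_by_user_id": [10803.0, np.nan, np.nan, np.nan],
})

# isnull().sum() counts the gaps; isnull().mean() gives the share of rows that are missing.
missing = pd.DataFrame({
    "missing_values": toy_users.isnull().sum(),
    "missing_rate": toy_users.isnull().mean(),
})
print(missing)
```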



In [100]:
df_missing = df_user.isnull().sum(axis=0).reset_index()
df_missing.columns = ["col_name",'missing_values']

In [101]:
df_missing

Unnamed: 0,col_name,missing_values
0,object_id,0
1,creation_time,0
2,name,0
3,email,0
4,creation_source,0
5,last_session_creation_time,3177
6,opted_in_to_mailing_list,0
7,enabled_for_marketing_drip,0
8,org_id,0
9,invited_by_user_id,5583


Fill the missing values in "last_session_creation_time" with the column median, and fill "invited_by_user_id" with 0 (meaning no inviter).

In [132]:
df_user['last_session_creation_time'] = df_user['last_session_creation_time'].fillna(df_user['last_session_creation_time'].median())
df_user['invited_by_user_id'] = df_user['invited_by_user_id'].fillna(0)

# 'creation_time' has no missing values and 'last_session_creation_time' is already filled above,
# so no further fillna calls are needed

In [103]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    12000 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


### Preprocessing Data

In [51]:
df_user['creation_source'].unique()

array(['GUEST_INVITE', 'ORG_INVITE', 'SIGNUP', 'PERSONAL_PROJECTS',
       'SIGNUP_GOOGLE_AUTH'], dtype=object)

In [145]:
# one-hot encode the categorical column "creation_source"
# (get_dummies returns a new DataFrame, so df_user itself is left unchanged)
df_encoded = pd.get_dummies(df_user, columns=['creation_source'])

In [146]:
df_encoded.head()

Unnamed: 0,object_id,creation_time,name,email,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,1398139000.0,1,0,11,10803.0,1,0,0,0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,1396238000.0,0,0,1,316.0,0,1,0,0,0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,1363735000.0,0,0,94,1525.0,0,1,0,0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,1369210000.0,0,0,1,5151.0,1,0,0,0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,1358850000.0,0,0,193,5240.0,1,0,0,0,0


In [147]:
df_encoded.creation_time = pd.to_datetime(df_encoded.creation_time)
df_encoded.last_session_creation_time = pd.to_datetime(df_encoded.last_session_creation_time, unit='s')


df_encoded['creation_year'] = df_encoded.creation_time.dt.year
df_encoded['creation_month'] = df_encoded.creation_time.dt.month
df_encoded['creation_day'] = df_encoded.creation_time.dt.day

df_encoded['last_session_year'] = df_encoded.last_session_creation_time.dt.year
df_encoded['last_session_month'] = df_encoded.last_session_creation_time.dt.month
df_encoded['last_session_day'] = df_encoded.last_session_creation_time.dt.day
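
The conversion of "last_session_creation_time" relies on `unit='s'`, which tells pandas to read the floats as seconds since the Unix epoch. A minimal sketch using two values copied from the `df_user.head()` output above (the names `ts`, `dt` and `parts` are only for illustration):

```python
import pandas as pd

# Two last_session_creation_time values as shown in the df_user.head() output.
ts = pd.Series([1398139000.0, 1396238000.0])

# unit="s" interprets the numbers as seconds since 1970-01-01 (UTC).
dt = pd.to_datetime(ts, unit="s")

# Split each timestamp into the calendar parts used as model features.
parts = pd.DataFrame({
    "year": dt.dt.year,
    "month": dt.dt.month,
    "day": dt.dt.day,
})
print(parts)
```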



In [148]:

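# a left merge keeps every user; users with no engagement records are marked as not adopted
df_encoded = df_encoded.merge(df_adopted, left_on='object_id', right_on='user_id', how='left')
df_encoded['adopted_users'] = df_encoded['adopted_users'].fillna(False).astype(bool)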
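
To illustrate what this merge does to users that have no engagement rows, here is a minimal sketch on made-up frames (`users`, `adopted` and `merged` are hypothetical stand-ins for the notebook's frames):

```python
import pandas as pd

# Three users, but only users 1 and 2 appear in the adoption table.
users = pd.DataFrame({"object_id": [1, 2, 3]})
adopted = pd.DataFrame({"user_id": [1, 2], "adopted_users": [False, True]})

# A left merge keeps every row of `users`; user 3 gets NaN in the merged columns.
merged = users.merge(adopted, left_on="object_id", right_on="user_id", how="left")

# Users without any engagement record are treated as not adopted.
merged["adopted_users"] = merged["adopted_users"].fillna(False).astype(bool)
print(merged)
```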


In [149]:


# drop raw timestamps, identifiers and free-text columns that are not used as features
df_encoded.drop(['creation_time', 'last_session_creation_time', 'email', 'user_id', 'name', 'object_id'], axis=1, inplace=True)

In [150]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 16 columns):
opted_in_to_mailing_list              12000 non-null int64
enabled_for_marketing_drip            12000 non-null int64
org_id                                12000 non-null int64
invited_by_user_id                    12000 non-null float64
creation_source_GUEST_INVITE          12000 non-null uint8
creation_source_ORG_INVITE            12000 non-null uint8
creation_source_PERSONAL_PROJECTS     12000 non-null uint8
creation_source_SIGNUP                12000 non-null uint8
creation_source_SIGNUP_GOOGLE_AUTH    12000 non-null uint8
creation_year                         12000 non-null int64
creation_month                        12000 non-null int64
creation_day                          12000 non-null int64
last_session_year                     12000 non-null int64
last_session_month                    12000 non-null int64
last_session_day                      12000 non-null int64
adop

In [151]:
df_encoded.head()

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH,creation_year,creation_month,creation_day,last_session_year,last_session_month,last_session_day,adopted_users
0,1,0,11,10803.0,1,0,0,0,0,2014,4,22,2014,4,22,False
1,0,0,1,316.0,0,1,0,0,0,2013,11,15,2014,3,31,True
2,0,0,94,1525.0,0,1,0,0,0,2013,3,19,2013,3,19,False
3,0,0,1,5151.0,1,0,0,0,0,2013,5,21,2013,5,22,False
4,0,0,193,5240.0,1,0,0,0,0,2013,1,17,2013,1,22,False


In [152]:
Y = df_encoded['adopted_users'].values
X = df_encoded.drop("adopted_users",axis=1)

### Predictive Model and Feature Selection

I use a RandomForestClassifier both as the predictive model and to rank the features by importance.



Split the dataset into roughly 75% training data and 25% test data.

In [154]:
# Split the dataset into training and test data with a random boolean mask.
# The mask is kept separate from X so that it is not added as a feature column.

is_train = np.random.uniform(0, 1, len(X)) <= 0.75

train_X = X[is_train]
test_X = X[~is_train]

train_Y = Y[is_train]
test_Y = Y[~is_train]
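
Since adopted users are a minority, a stratified split is one alternative that keeps the adopted share equal in both subsets and is reproducible through a fixed seed. This is a minimal sketch with scikit-learn's `train_test_split` on made-up data; `X_toy`, `y_toy` and the seed value are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature table and an imbalanced boolean target (20% positives), made up for illustration.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "org_id": rng.integers(0, 400, size=200),
    "opted_in_to_mailing_list": rng.integers(0, 2, size=200),
})
y_toy = np.array([True] * 40 + [False] * 160)

# stratify=y_toy preserves the class ratio; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```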

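Grid search selected the model with max_features=10, n_estimators=50 and oob_score=10. Note that oob_score is a boolean flag in scikit-learn (whether to compute an out-of-bag score), so in the version used here the grid values 5, 10 and 20 are all simply treated as true and do not describe different models; only max_features and n_estimators are really being tuned.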

In [155]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Use grid search to look for the best parameters
rf = RandomForestClassifier()

# NOTE: oob_score is a boolean flag, so 5, 10 and 20 are all treated as True here;
# the values are kept so that this cell matches the recorded output below
tuned_parameters = {'n_estimators': [10, 30, 50], 'max_features': [None, 5,10], 'oob_score': [5,10,20]}
rf_RF = GridSearchCV(rf, tuned_parameters, cv=2, n_jobs=-1, verbose=1)


# Fit the grid search model
rf_RF.fit(train_X, train_Y)

Fitting 2 folds for each of 27 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.9s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:   11.9s finished


GridSearchCV(cv=2, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [10, 30, 50], 'max_features': [None, 5, 10], 'oob_score': [5, 10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [156]:
# The best parameters selected from grid search
print (rf_RF.best_params_)

{'max_features': 10, 'n_estimators': 50, 'oob_score': 10}
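
Because `oob_score` is a boolean flag rather than a numeric hyperparameter, a grid restricted to `n_estimators` and `max_features` searches the same model space with fewer fits. The following is a minimal sketch on synthetic data from `make_classification`; the data, the seed and the names `X_syn`, `y_syn` and `search` are assumptions for illustration, so its result says nothing about the user data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the feature matrix (15 columns, like the encoded user table).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)

# Only parameters that actually change the fitted forest are searched.
param_grid = {"n_estimators": [10, 30], "max_features": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_syn, y_syn)

print(search.best_params_, round(search.best_score_, 3))
```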


According to the RandomForestClassifier feature importances, the most important predictors of future user adoption are the year and month in which the account was created and the year of the last login, with the month of the last login close behind.

In other words, both the account creation time and the last login time are related to user adoption.



In [159]:
from sklearn.metrics import accuracy_score, classification_report

# refit with the parameters selected by the grid search
# (oob_score is a boolean flag; the selected value 10 is equivalent to True)
rf_RF = RandomForestClassifier(max_features=10, oob_score=True, n_estimators=50)
rf_RF_fit = rf_RF.fit(train_X, train_Y)
Y_pred = rf_RF_fit.predict(test_X)

# pair each importance with the column it was trained on and show the top five
importances = rf_RF.feature_importances_
fi = pd.DataFrame(list(zip(train_X.columns, importances)), columns=['features', 'Importance'])
fi.sort_values(by='Importance', ascending=False).head(5)

Unnamed: 0,features,Importance
9,creation_year,0.232714
10,creation_month,0.226997
12,last_session_year,0.176033
13,last_session_month,0.163291
14,last_session_day,0.059212
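
The cell above imports `accuracy_score` and `classification_report` and computes predictions, but does not report any score. A minimal sketch of how these metrics could be used next to the importances, on synthetic imbalanced data (the data, the column names `f0`..`f7` and the seed are assumptions, so the numbers it prints do not describe the user data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% positives) standing in for the user table.
cols = [f"f{i}" for i in range(8)]
X_arr, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, weights=[0.85, 0.15], random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=cols)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Accuracy alone can look good on imbalanced data, so the per-class report is printed too.
acc = accuracy_score(y_te, y_pred)
report = classification_report(y_te, y_pred, zero_division=0)
print(acc)
print(report)

# Impurity-based importances, labelled with the training columns; they sum to 1.
fi = pd.Series(model.feature_importances_, index=X_tr.columns).sort_values(ascending=False)
print(fi.head(5))
```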
