# Relax Challenge

The data has two tables:

1. A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years.
2. A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier

In [2]:
users = pd.read_csv('takehome_users.csv', encoding="ISO-8859-1")
usage = pd.read_csv('takehome_user_engagement.csv', encoding="ISO-8859-1")

In [3]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [4]:
usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [5]:
usage.user_id.nunique()

8823

In [6]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [7]:
usage.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


Let's first create labels for adopted users.

In [8]:
usage.time_stamp = pd.to_datetime(usage.time_stamp)

In [9]:
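def is_adopted(arr):
    """Return 1 if three login days fall within a seven-day window, else 0."""
    arr = arr.sort_values()  # the sliding comparison below assumes chronological order
    count = len(arr)

    if count < 3:
        return 0

    # Compare each login with the one two positions later
    for i in range(count - 2):
        if (arr.iloc[i + 2] - arr.iloc[i]) < pd.Timedelta(7, 'D'):
            return 1

    return 0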
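
As a cross-check on the loop above, the same rule can be written without an explicit loop. The sketch below is a calendar-day variant run on made-up data (the `toy` frame and its values are hypothetical). It truncates timestamps to dates before comparing, so it may label a few borderline users differently from the timestamp-based comparison in `is_adopted`.

```python
import pandas as pd

# Hypothetical login data: user 1 logs in on three days within a week,
# user 2 logs in on three widely spaced days, user 3 logs in only twice.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "time_stamp": pd.to_datetime([
        "2014-01-01 09:00", "2014-01-03 10:00", "2014-01-06 08:00",
        "2014-01-01 09:00", "2014-01-10 09:00", "2014-01-20 09:00",
        "2014-01-01 09:00", "2014-01-02 09:00",
    ]),
})

# One row per user per calendar day, in chronological order
days = (
    toy.assign(day=toy["time_stamp"].dt.normalize())
       .drop_duplicates(["user_id", "day"])
       .sort_values(["user_id", "day"])
)

# Gap between each login day and the login day two rows earlier (NaT if none)
span = days.groupby("user_id")["day"].diff(2)

# Three distinct days fit in a seven-day period when first and third are at most 6 days apart
within_week = span <= pd.Timedelta(days=6)
adopted = within_week.groupby(days["user_id"]).any().astype(int)
print(adopted)
```

Working on calendar days makes the "three separate days" condition explicit, at the cost of ignoring the time of day.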

In [10]:
adopted_user=usage.groupby('user_id')[['time_stamp']].agg(is_adopted)

In [11]:
adopted_user.reset_index(inplace=True)
adopted_user.columns=['user_id','adopted_user']
adopted_user.head()

Unnamed: 0,user_id,adopted_user
0,1,0
1,2,1
2,3,0
3,4,0
4,5,0


In [12]:
print(" # of total users: ", len(users))
print(" # of users who created a session: ", len(adopted_user))
print(" # of adopted users: ", adopted_user.adopted_user.sum())

 # of total users:  12000
 # of users who created a session:  8823
 # of adopted users:  1602


The analysis above shows that, of the 12,000 users who signed up for the product in the last two years, only 8,823 ever created a session, and 1,602 of those are adopted users.
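
To put these counts in proportion, here is a quick calculation that uses only the three numbers printed above. It shows that adopted users are a clear minority, which is why class weighting is used when the random forest is defined later.

```python
# Counts taken from the printed output above
total_users = 12000
active_users = 8823      # users with at least one session
adopted_users = 1602

share_of_all = adopted_users / total_users
share_of_active = adopted_users / active_users

print(f"Adopted share of all signups:  {share_of_all:.2%}")
print(f"Adopted share of active users: {share_of_active:.2%}")
```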

Next we clean up the data and merge the labels with the features.

In [13]:
users.rename(columns={'object_id':'user_id'},inplace=True)
users = pd.merge(users, adopted_user, on='user_id', how='outer')
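
Because every `user_id` in the usage table also appears in the user table here, an outer merge and a left merge return the same 12,000 rows. A minimal sketch of the left-merge variant on made-up data (the `toy_users` and `toy_labels` frames are hypothetical) shows how `validate` and `indicator` can be used to check the join:

```python
import pandas as pd

# Hypothetical tables: user 3 signed up but never logged in, so has no label
toy_users = pd.DataFrame({"user_id": [1, 2, 3]})
toy_labels = pd.DataFrame({"user_id": [1, 2], "adopted_user": [0, 1]})

merged = toy_users.merge(
    toy_labels,
    on="user_id",
    how="left",              # keep every signed-up user exactly once
    validate="one_to_one",   # raise if user_id is duplicated on either side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Users without any usage rows cannot be adopted, so their label becomes 0
merged["adopted_user"] = merged["adopted_user"].fillna(0).astype(int)
print(merged)
```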

In [14]:
users.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,1.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,0.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,0.0


In [15]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 11 columns):
user_id                       12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
adopted_user                  8823 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 1.1+ MB


In [16]:
# Flag whether the user ever created a session
users['session'] = users.last_session_creation_time.notna()

# Users who never created a session cannot be adopted, so label them 0
users['adopted_user'] = users['adopted_user'].fillna(0)

# Create a new feature: the delta between the last session time and the signup time
users.creation_time = pd.to_datetime(users.creation_time)
users.last_session_creation_time = pd.to_datetime(users.last_session_creation_time, unit='s')
users['time_delta_session_signup'] = users.last_session_creation_time - users.creation_time
# The observed maximum is ~700 days, so fill missing deltas with a 2000-day sentinel
users['time_delta_session_signup'] = users['time_delta_session_signup'].fillna(pd.Timedelta(days=2000))

# Keep only the information on whether the user was invited
users['invited'] = users['invited_by_user_id'].notna()
users.drop('invited_by_user_id', axis=1, inplace=True)

In [17]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 13 columns):
user_id                       12000 non-null int64
creation_time                 12000 non-null datetime64[ns]
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null datetime64[ns]
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
adopted_user                  12000 non-null float64
session                       12000 non-null bool
time_delta_session_signup     12000 non-null timedelta64[ns]
invited                       12000 non-null bool
dtypes: bool(2), datetime64[ns](2), float64(1), int64(4), object(3), timedelta64[ns](1)
memory usage: 1.1+ MB


In [18]:
def split(string):
    
    return string.split('@')[1].split('.')[0]

In [19]:
users['email_provider']=users.email.agg(split)
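
The same extraction can also be done with the vectorized string accessor. This is a minimal sketch on a small sample; the two addresses come from the table above, and the missing value is added here as an assumption to show how it is handled.

```python
import pandas as pd

emails = pd.Series([
    "AugustCClausen@yahoo.com",
    "MatthewPoole@gustr.com",
    None,  # hypothetical missing address
])

# Capture everything between '@' and the first '.' that follows it
provider = emails.str.extract(r"@([^.]+)", expand=False)
print(provider)
```

Unlike a plain Python function applied element by element, the `.str` accessor returns `NaN` for missing addresses instead of raising an exception.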

In [20]:
users.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted_user,session,time_delta_session_signup,invited,email_provider
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,0.0,True,0 days,True,yahoo
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,1.0,True,136 days,True,gustr
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,0.0,True,0 days,True,gustr
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,0.0,True,1 days,True,yahoo
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,0.0,True,5 days,True,yahoo


In [21]:
users['org_id'].nunique()

417

The organization (group of users) a user belongs to is a categorical feature, so we need to encode it. However, with 417 unique values, one-hot encoding would add too many columns, so we use feature hashing instead.

In [22]:
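from sklearn.feature_extraction import FeatureHasher

# Hash org_id into 10 columns
# Note: each value is passed as a bare string, so FeatureHasher iterates over its characters
users['org_id'] = users['org_id'].astype('str')
fh_1 = FeatureHasher(n_features=10, input_type='string')
hashed_org = fh_1.fit_transform(users['org_id'])
hashed_org = hashed_org.toarray()

# Hash the email provider into 10 columns in the same way
fh_2 = FeatureHasher(n_features=10, input_type='string')
hashed_email = fh_2.fit_transform(users['email_provider'])
hashed_email = hashed_email.toarray()

# Append the hashed columns to the user table (the 10 columns of each group share one name)
X = pd.concat(
    [users,
     pd.DataFrame(hashed_org, columns=['org_id_hash'] * 10),
     pd.DataFrame(hashed_email, columns=['email_hash'] * 10)],
    axis=1)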
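
With `input_type='string'`, `FeatureHasher` expects each sample to be an iterable of strings, and recent scikit-learn releases reject a bare string as a sample. The sketch below is one possible variant, not the method used to produce the outputs in this notebook: it wraps each value in a one-element list so that it is hashed as a single token, and it gives the hashed columns unique names. The sample values and the `email_hash_{i}` naming are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical sample of email providers
providers = pd.Series(["yahoo", "gustr", "yahoo"])

hasher = FeatureHasher(n_features=10, input_type="string")

# Each sample is a list holding one token, so the whole provider name is hashed at once
hashed = hasher.transform([[p] for p in providers]).toarray()

# Unique column names avoid ambiguous lookups later on
cols = [f"email_hash_{i}" for i in range(hashed.shape[1])]
hashed_df = pd.DataFrame(hashed, columns=cols)
print(hashed_df)
```

Hashed this way, every row has exactly one non-zero entry (+1 or -1), and identical providers always map to the same column. Different providers can still collide in one column because only 10 columns are available.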

In [23]:
X.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted_user,...,email_hash,email_hash.1,email_hash.2,email_hash.3,email_hash.4,email_hash.5,email_hash.6,email_hash.7,email_hash.8,email_hash.9
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,0.0,...,1.0,0.0,0.0,2.0,1.0,-1.0,0.0,0.0,0.0,0.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-2.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-2.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,0.0,...,1.0,0.0,0.0,2.0,1.0,-1.0,0.0,0.0,0.0,0.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,0.0,...,1.0,0.0,0.0,2.0,1.0,-1.0,0.0,0.0,0.0,0.0


In [24]:
X.drop(['user_id','creation_time','name','email','last_session_creation_time','org_id','email_provider'],axis=1, inplace=True)

In [25]:
X=pd.get_dummies(X, drop_first=True)

In [26]:
# Convert the timedelta to a float number of whole days
X['time_delta_session_signup'] = X.time_delta_session_signup.dt.days.astype(float)


In [27]:
X.head()

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,adopted_user,session,time_delta_session_signup,invited,org_id_hash,org_id_hash.1,org_id_hash.2,org_id_hash.3,...,email_hash,email_hash.1,email_hash.2,email_hash.3,email_hash.4,email_hash.5,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
0,1,0,0.0,True,0.0,True,0.0,0.0,0.0,0.0,...,1.0,-1.0,0.0,0.0,0.0,0.0,0,0,0,0
1,0,0,1.0,True,136.0,True,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,-2.0,1,0,0,0
2,0,0,0.0,True,0.0,True,0.0,1.0,-1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,-2.0,1,0,0,0
3,0,0,0.0,True,1.0,True,0.0,0.0,0.0,0.0,...,1.0,-1.0,0.0,0.0,0.0,0.0,0,0,0,0
4,0,0,0.0,True,5.0,True,1.0,1.0,0.0,0.0,...,1.0,-1.0,0.0,0.0,0.0,0.0,0,0,0,0


In [28]:
from boruta import BorutaPy

# BorutaPy only accepts NumPy arrays, not DataFrames
x = X.drop('adopted_user', axis=1).values
y = X['adopted_user'].values
df_columns = X.drop('adopted_user', axis=1).columns

# Define the random forest; use balanced class weights because class 1 is the minority in the training set
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced')

# Define the Boruta feature selection method
feat_sel = BorutaPy(rf, n_estimators='auto', verbose=1)

# Fit the Boruta algorithm
feat_sel.fit(x, y)

Iteration: 1 / 100
Iteration: 2 / 100
Iteration: 3 / 100
Iteration: 4 / 100
Iteration: 5 / 100
Iteration: 6 / 100
Iteration: 7 / 100
Iteration: 8 / 100
Iteration: 9 / 100


BorutaPy finished running.

Iteration: 	10 / 100
Confirmed: 	2
Tentative: 	0
Rejected: 	27


BorutaPy(alpha=0.05,
     estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=-1, oob_score=False,
            random_state=<mtrand.RandomState object at 0x11304a948>,
            verbose=0, warm_start=False),
     max_iter=100, n_estimators='auto', perc=100,
     random_state=<mtrand.RandomState object at 0x11304a948>,
     two_step=True, verbose=1)

The fitted selector can now give us a ranking of the features in the dataset. The ranking is numerical: features ranked 1 are the ones Boruta confirmed as important, while higher ranks indicate progressively less important features.
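
To give some intuition for how such a ranking arises, the sketch below illustrates the shadow-feature idea that Boruta builds on. It is a simplified, single-iteration illustration on synthetic data and not the BorutaPy implementation, which repeats the comparison many times and applies a statistical test. All variable names and the toy data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic data: column 0 determines the label, columns 1-3 are pure noise
signal = rng.normal(size=n)
noise = rng.normal(size=(n, 3))
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

# Shadow features: each real column shuffled independently, which destroys
# any relationship with the label while keeping the column's distribution
shadow = np.column_stack([rng.permutation(col) for col in X_toy.T])
X_aug = np.hstack([X_toy, shadow])

forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
forest.fit(X_aug, y_toy)

n_real = X_toy.shape[1]
real_imp = forest.feature_importances_[:n_real]
shadow_imp = forest.feature_importances_[n_real:]

# A real feature scores a "hit" when it beats the best shadow feature
hits = real_imp > shadow_imp.max()
print(hits)
```

A feature that cannot outperform shuffled copies of the data is unlikely to carry real signal, which is the basis on which features are rejected.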

In [29]:
features = pd.DataFrame(feat_sel.ranking_, index=df_columns)
features.columns = ['ranking']
features.sort_values('ranking', ascending=True)

Unnamed: 0,ranking
session,1
time_delta_session_signup,1
org_id_hash,2
org_id_hash,3
org_id_hash,4
creation_source_PERSONAL_PROJECTS,5
email_hash,6
org_id_hash,7
email_hash,8
org_id_hash,9


From the table above, we can see that whether or not a user created a session after signing up (`session`) and the time between signup and the last session created (`time_delta_session_signup`) are the two features ranked 1, making them good predictors of user adoption.