# Objective: 

Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one seven-day period, identify which factors predict future user
adoption.
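
The definition can be checked per user with a rolling window. The following is a minimal sketch on made-up timestamps, not the method used in the cells below; the helper name `is_adopted` and its parameters are hypothetical.

```python
import pandas as pd

def is_adopted(timestamps, days_required=3, window_days=7):
    """True if logins fall on `days_required` distinct days within some `window_days`-day span."""
    # Reduce the logins to sorted, distinct calendar days
    days = (pd.Series(pd.to_datetime(timestamps))
              .dt.normalize()
              .drop_duplicates()
              .sort_values()
              .reset_index(drop=True))
    if len(days) < days_required:
        return False
    # Gap between each login day and the login day (days_required - 1) positions later
    span = days.shift(-(days_required - 1)) - days
    # A gap of fewer than `window_days` days means all of those days fit in one window
    return bool((span < pd.Timedelta(days=window_days)).any())

# Three login days within six days of each other, spread over two calendar weeks
print(is_adopted(['2014-02-07', '2014-02-10', '2014-02-12']))
# Logins roughly two weeks apart
print(is_adopted(['2013-11-15', '2013-11-29', '2013-12-09']))
```

The first call illustrates a case that fixed calendar weeks can miss: the three logins are split 1 and 2 between the weeks ending 2014-02-09 and 2014-02-16.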

---

The goal is not to build the most accurate classifier, but one that tells us how important each feature is. We start by loading the data with pandas.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
userdf = pd.read_csv('takehome_users.csv', encoding='latin_1')
user_eng_df = pd.read_csv('takehome_user_engagement.csv', encoding='latin_1')

In [3]:
userdf.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
user_eng_df.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [5]:
userdf.rename(columns={'object_id':'user_id'},inplace=True)

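The engagement timestamps are parsed as datetimes so that they can be grouped by week below. If a user does not have a referral, `invited_by_user_id` is missing; we fill it with 0 after the merge.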

In [6]:
user_eng_df.time_stamp = pd.to_datetime(user_eng_df.time_stamp)

We need to reduce the engagement data to a single value per user: the largest number of logins in any one calendar week.

In [7]:
grouped_engagement = user_eng_df.groupby(['user_id','time_stamp']).sum()

In [8]:
level_values = grouped_engagement.index.get_level_values
grouped_engagement = grouped_engagement.groupby([level_values(0)] + [pd.Grouper(freq='W', level=-1)]).sum()

In [9]:
max_weekly_visits = grouped_engagement.groupby('user_id').max().reset_index()
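
To see what the weekly grouping does, here is a small sketch on made-up data. It uses `pd.Grouper` with `key=` on a column rather than `level=` on an index, which is an assumption made for brevity; the variable names are illustrative.

```python
import pandas as pd

# Made-up engagement records: user 1 logs in three times, user 2 once
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'time_stamp': pd.to_datetime(['2014-02-03', '2014-02-05', '2014-02-12', '2014-02-04']),
    'visited': [1, 1, 1, 1],
})

# freq='W' bins by calendar week ending on Sunday
weekly = toy.groupby(['user_id', pd.Grouper(key='time_stamp', freq='W')])['visited'].sum()

# Largest weekly total per user
max_weekly = weekly.groupby(level='user_id').max()
print(max_weekly)
```

User 1 has two logins in the week ending 2014-02-09 and one in the week ending 2014-02-16, so the maximum weekly count is 2 even though there are three logins overall.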

In [10]:
grouped_engagement.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,visited
user_id,time_stamp,Unnamed: 2_level_1
1,2014-04-27,1
2,2013-11-17,1
2,2013-12-01,1
2,2013-12-15,1
2,2013-12-29,1
2,2014-01-05,1
2,2014-01-12,1
2,2014-02-09,3
2,2014-02-16,2
2,2014-03-09,1


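We now have a dataframe, `max_weekly_visits`, holding each user's maximum number of logins in any single calendar week. Because the weeks are fixed calendar bins rather than a rolling seven-day window, this approximates the definition above and can miss users whose three logins straddle a week boundary. We need to merge this with the user data; the inner merge keeps only users with at least one engagement record.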

In [11]:
merged = userdf.merge(max_weekly_visits, on='user_id', how='inner')
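
Because the merge uses `how='inner'`, users with no rows in the engagement table are dropped. A sketch of how a left merge would keep them as zero-login users, on made-up data (whether that is preferable depends on the question being asked):

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3]})                     # user 3 never logged in
visits = pd.DataFrame({'user_id': [1, 2], 'visited': [1, 3]})    # max weekly logins

inner = users.merge(visits, on='user_id', how='inner')           # drops user 3
left = users.merge(visits, on='user_id', how='left')             # keeps user 3 with NaN
left['visited'] = left['visited'].fillna(0).astype(int)          # treat missing as zero logins

print(len(inner), len(left))
```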

In [12]:
merged.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,visited
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,1
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,3
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,1
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,1
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,1


In [13]:
merged['invited_by_user_id'] = merged.invited_by_user_id.fillna(0)

The full email address is unique to each user, so it is unlikely to be correlated with adoption, but we may want to look at the email domain.

In [14]:
merged['email_domain'] = merged['email'].apply(lambda x: x.split('@')[-1])
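
The same extraction can be written with the vectorized `.str` accessor, and `value_counts` gives a quick view of which domains are common. A small sketch on made-up addresses:

```python
import pandas as pd

# Made-up addresses for illustration
emails = pd.Series(['a@yahoo.com', 'b@gustr.com', 'c@yahoo.com', 'd@zkbxm.com'])

# Take the part after the last '@'
domains = emails.str.split('@').str[-1]

# Number of users per domain, most common first
counts = domains.value_counts()
print(counts)
```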

We are simply trying to categorize the users as adopted or not, so our target needs to be in this format.

In [15]:
merged['adopted'] = merged.visited.apply(lambda x: int(x >= 3))

In [16]:
merged.drop(columns=['name', 'email', 'visited'],inplace=True)

In [17]:
merged.head()

Unnamed: 0,user_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,email_domain,adopted
0,1,2014-04-22 03:53:30,GUEST_INVITE,1398139000.0,1,0,11,10803.0,yahoo.com,0
1,2,2013-11-15 03:45:04,ORG_INVITE,1396238000.0,0,0,1,316.0,gustr.com,1
2,3,2013-03-19 23:14:52,ORG_INVITE,1363735000.0,0,0,94,1525.0,gustr.com,0
3,4,2013-05-21 08:09:28,GUEST_INVITE,1369210000.0,0,0,1,5151.0,yahoo.com,0
4,5,2013-01-17 10:14:20,GUEST_INVITE,1358850000.0,0,0,193,5240.0,yahoo.com,0


In [18]:
merged = pd.get_dummies(merged, columns=['creation_source', 'email_domain'])

In [19]:
merged.head()

Unnamed: 0,user_id,creation_time,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,...,email_domain_zjwjb.com,email_domain_zkbxm.com,email_domain_zkbzv.com,email_domain_zkcdj.com,email_domain_zkcep.com,email_domain_zkdih.com,email_domain_zpbkw.com,email_domain_zpcop.com,email_domain_zsrgb.com,email_domain_zssin.com
0,1,2014-04-22 03:53:30,1398139000.0,1,0,11,10803.0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,2013-11-15 03:45:04,1396238000.0,0,0,1,316.0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,3,2013-03-19 23:14:52,1363735000.0,0,0,94,1525.0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,4,2013-05-21 08:09:28,1369210000.0,0,0,1,5151.0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,2013-01-17 10:14:20,1358850000.0,0,0,193,5240.0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


Machine learning algorithms work with numbers rather than dates, so we convert each timestamp to the number of seconds between it and the most recent last-session timestamp in the data.

In [20]:
merged.creation_time = pd.to_datetime(merged.creation_time)
merged.last_session_creation_time = pd.to_datetime(merged.last_session_creation_time,unit='s')

In [21]:
merged.creation_time.max()

Timestamp('2014-05-30 23:59:19')

In [22]:
last_ts = merged.last_session_creation_time.max()

In [25]:
merged['creation_time'] = merged['creation_time'].apply(lambda x: (last_ts - x).total_seconds())
merged['last_session_creation_time'] = merged['last_session_creation_time'].apply(lambda x: (last_ts - x).total_seconds())
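
The conversion can also be done without `apply`, since subtracting a datetime column from a single timestamp yields a timedelta column. A minimal sketch on made-up timestamps (the names `ts` and `reference` are illustrative):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2014-05-30 23:59:19', '2014-05-29 23:59:19']))
reference = ts.max()                                  # most recent timestamp in the column

# Seconds elapsed between each timestamp and the reference
seconds_before = (reference - ts).dt.total_seconds()
print(seconds_before.tolist())                        # [0.0, 86400.0]
```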

There are now far too many domain columns. We drop the email-domain columns with fewer than five users, which are probably either fake addresses or personal domains.

In [23]:
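rare_domains = [c for c in merged.columns if c.startswith('email_domain_') and merged[c].sum() < 5]
merged = merged.drop(columns=rare_domains)  # keep only domains shared by at least 5 users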
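
Dropping rare domain columns leaves those users with zeros in every domain column. An alternative, sketched here on made-up data, is to collapse rare domains into a single `other` category before one-hot encoding; the threshold `min_count` and the label `other` are assumptions for illustration.

```python
import pandas as pd

# Made-up domains: two common, two that appear only once
domains = pd.Series(['yahoo.com'] * 3 + ['gmail.com'] * 2 + ['zkbxm.com', 'zpcop.com'])

min_count = 2                                    # hypothetical threshold; the cell above uses 5
counts = domains.value_counts()
common = counts[counts >= min_count].index       # domains that meet the threshold

# Keep common domains, replace everything else with 'other'
collapsed = domains.where(domains.isin(common), 'other')
encoded = pd.get_dummies(collapsed, prefix='email_domain')
print(sorted(encoded.columns))
```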

In [26]:
merged.head()

Unnamed: 0,user_id,creation_time,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH,email_domain_cuvox.de,email_domain_gmail.com,email_domain_gustr.com,email_domain_hotmail.com,email_domain_jourrapide.com,email_domain_yahoo.com
0,1,3927920.0,3927920.0,1,0,11,10803.0,0,1,0,0,0,0,0,0,0,0,0,1
1,2,17579626.0,5829226.0,0,0,1,316.0,1,0,1,0,0,0,0,0,1,0,0,0
2,3,38331838.0,38331838.0,0,0,94,1525.0,0,0,1,0,0,0,0,0,1,0,0,0
3,4,32942962.0,32856562.0,0,0,1,5151.0,0,1,0,0,0,0,0,0,0,0,0,1
4,5,43649070.0,43217070.0,0,0,193,5240.0,0,1,0,0,0,0,0,0,0,0,0,1


In [52]:
merged.set_index('user_id',inplace=True)

We will use a Random Forest classifier, since it gives us an importance score for each feature that is easy to interpret and does not depend on feature scaling.

In [61]:
from sklearn.ensemble import RandomForestClassifier

Y = merged.adopted
X = merged[[x for x in merged.columns if x != 'adopted']]

In [62]:
clf = RandomForestClassifier(50)
clf.fit(X,Y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [63]:
pd.Series(abs(clf.feature_importances_), index=X.columns).sort_values()

creation_source_SIGNUP_GOOGLE_AUTH    0.001545
creation_source_PERSONAL_PROJECTS     0.002195
email_domain_gustr.com                0.002366
creation_source_SIGNUP                0.002520
email_domain_cuvox.de                 0.002830
creation_source_GUEST_INVITE          0.003005
email_domain_hotmail.com              0.003040
email_domain_jourrapide.com           0.003185
creation_source_ORG_INVITE            0.003322
email_domain_yahoo.com                0.003527
email_domain_gmail.com                0.004033
enabled_for_marketing_drip            0.005057
opted_in_to_mailing_list              0.006342
invited_by_user_id                    0.036129
org_id                                0.056386
creation_time                         0.226903
last_session_creation_time            0.637615
dtype: float64

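The most important features for predicting adoption are the time of the user's last session (importance ≈ 0.64) and the account creation time (≈ 0.23). The organization (≈ 0.06) and the referring user (≈ 0.04) are also somewhat important; every other feature scores below 0.01.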
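
Impurity-based importances such as `feature_importances_` are computed on the training data and can favor continuous or high-cardinality features. Permutation importance on held-out data is a common cross-check. The following is a minimal sketch on synthetic data, not a result for this dataset; the data and variable names are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in accuracy
result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=0)
ranking = result.importances_mean.argsort()[::-1]   # feature indices, most important first
print(ranking, result.importances_mean)
```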