## Relax challenge

The goal is to figure out which features predict user adoption. To do this, I first need to use the data on user logins to determine which users have adopted the product. First, I load the user data and the engagement (login) data into pandas data frames:

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
users = pd.read_csv('takehome_users.csv', encoding='latin-1', parse_dates=['creation_time'])
users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit='s')

In [3]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0


In [4]:
engage = pd.read_csv('takehome_user_engagement.csv', encoding='latin-1', parse_dates=['time_stamp'])
engage.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


An "adopted" user has logged into the product on three separate days in at least one seven-day period. Only users with at least three recorded logins can qualify, so I first select those users from the login data:

In [5]:
engage["datetime"] = pd.to_datetime(engage.time_stamp)
engage_counts = engage["user_id"].value_counts()
engage_3_or_more = engage[engage["user_id"].isin(engage_counts[engage_counts > 2].index)]
engage_3_or_more.head()

Unnamed: 0,time_stamp,user_id,visited,datetime
1,2013-11-15 03:45:04,2,1,2013-11-15 03:45:04
2,2013-11-29 03:45:04,2,1,2013-11-29 03:45:04
3,2013-12-09 03:45:04,2,1,2013-12-09 03:45:04
4,2013-12-25 03:45:04,2,1,2013-12-25 03:45:04
5,2013-12-31 03:45:04,2,1,2013-12-31 03:45:04


Next, to get a list of the users who have adopted the product, I iterate through the users, and for each user I step through their logins and check whether any three consecutive logins fall within a seven-day span:

In [6]:
adopted_users = []
for this_user in engage_3_or_more["user_id"].unique():
    # All logins for this user (this assumes the rows are in chronological order)
    this_users_engagement = engage_3_or_more[engage_3_or_more["user_id"] == this_user]
    this_users_datetime = this_users_engagement["datetime"].reset_index()["datetime"]
    # Slide over every run of three consecutive logins
    for i in range(len(this_users_datetime) - 2):
        time_interval = this_users_datetime[i + 2] - this_users_datetime[i]
        # Adopted if the first and third login are less than 7 days apart
        if time_interval < pd.Timedelta("7 days"):
            adopted_users.append(this_user)
            break

len(adopted_users)

1602
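The loop above compares raw timestamps and relies on the row order of the file. As a minimal sketch of a stricter reading of the definition (three distinct calendar days, input order ignored), the check could be written as a small helper. The function name `is_adopted` and its parameters are hypothetical, and the example logins are made up for illustration rather than taken from the data:

```python
from datetime import datetime, timedelta


def is_adopted(timestamps, window_days=7, min_days=3):
    """Return True if the logins cover `min_days` distinct calendar days
    whose first and last day are less than `window_days` days apart."""
    # Collapse logins to distinct calendar dates and sort them, so repeated
    # logins on one day count once and the input order does not matter.
    days = sorted({ts.date() for ts in timestamps})
    span = timedelta(days=window_days)
    # Pair each day with the day (min_days - 1) positions later.
    for first, last in zip(days, days[min_days - 1:]):
        if last - first < span:
            return True
    return False


# Hypothetical login histories (not taken from the dataset)
three_days_in_a_week = [datetime(2014, 1, 1, 9), datetime(2014, 1, 3, 9), datetime(2014, 1, 6, 9)]
same_day_three_times = [datetime(2014, 1, 1, 9), datetime(2014, 1, 1, 12), datetime(2014, 1, 1, 18)]
too_spread_out = [datetime(2014, 1, 1, 9), datetime(2014, 1, 4, 9), datetime(2014, 1, 8, 9)]

print(is_adopted(three_days_in_a_week))
print(is_adopted(same_day_three_times))
print(is_adopted(too_spread_out))
```

Applied per user, for example with `engage.groupby("user_id")["time_stamp"].apply(is_adopted)`, this should give one boolean per user. Because it works on calendar dates rather than raw timestamps, the resulting count may differ somewhat from the 1602 reported above.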

Finally, I create an indicator variable for the users dataframe that is True when the user's ID is among the adopted users. I start by looking at the categorical variables in the users dataframe to see how they are associated with adoption. For each categorical variable, I make a table with the proportion of users in each category who adopt the product:

In [7]:
user_adoption_ind = users["object_id"].isin(adopted_users)
pd.crosstab(user_adoption_ind, users["creation_source"], normalize = "columns")

creation_source,GUEST_INVITE,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
False,0.833564,0.870005,0.922312,0.859607,0.832491
True,0.166436,0.129995,0.077688,0.140393,0.167509


In [8]:
pd.crosstab(user_adoption_ind, users["opted_in_to_mailing_list"], normalize = "columns")

opted_in_to_mailing_list,0,1
False,0.868088,0.861723
True,0.131912,0.138277


In [9]:
pd.crosstab(user_adoption_ind, users["enabled_for_marketing_drip"], normalize = "columns")

enabled_for_marketing_drip,0,1
False,0.867163,0.862723
True,0.132837,0.137277


In [10]:
pd.crosstab(user_adoption_ind, users["org_id"], normalize = "columns")

org_id,0,1,2,3,4,5,6,7,8,9,...,407,408,409,410,411,412,413,414,415,416
False,0.965517,0.939914,0.925373,0.916667,0.899371,0.90625,0.92029,0.865546,0.896907,0.887097,...,0.833333,0.833333,0.882353,0.923077,0.884615,1.0,0.8125,0.9,0.625,1.0
True,0.034483,0.060086,0.074627,0.083333,0.100629,0.09375,0.07971,0.134454,0.103093,0.112903,...,0.166667,0.166667,0.117647,0.076923,0.115385,0.0,0.1875,0.1,0.375,0.0


It looks like there are some differences in adoption rate across org_id. Since many of the cells in the table above may contain only a small number of observations, I run a Fisher's exact test on one of the possible comparisons (adoption rate for org_id == 0 versus org_id == 9) to get a sense of whether the differences in this table are interesting in light of the variability that comes from the moderate cell counts:

In [11]:
users["org_id"].value_counts().head(10)

0     319
1     233
2     201
3     168
4     159
6     138
5     128
9     124
7     119
10    104
Name: org_id, dtype: int64

In [12]:
from scipy.stats import fisher_exact
fisher_exact(pd.crosstab(user_adoption_ind, users["org_id"]).iloc[:, [0, 9]].values)

(3.5636363636363635, 0.0024749430750469316)
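For reference, the 2x2 table behind this result can be reconstructed from the outputs shown earlier: 319 users with about 3.4% adoption for org_id 0 (11 adopted, 308 not) and 124 users with about 11.3% adoption for org_id 9 (14 adopted, 110 not). These counts are inferred from the displayed proportions and are not read directly from the data. The sketch below recomputes the test with the standard library only. The helper `fisher_exact_2x2` is a hypothetical name, and its two-sided p-value uses the usual rule of summing the probabilities of all tables that are no more likely than the observed one:

```python
from math import comb


def fisher_exact_2x2(table):
    """Odds ratio and two-sided Fisher exact p-value for a 2x2 table."""
    (a, b), (c, d) = table
    row1 = a + b          # size of the first row
    col1 = a + c          # size of the first column
    n = a + b + c + d     # total number of observations

    def pmf(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo = max(0, row1 + col1 - n)
    hi = min(row1, col1)
    p_observed = pmf(a)
    # Sum over every table at least as unlikely as the observed one
    # (the small tolerance guards against floating-point ties).
    p_value = sum(p for p in (pmf(x) for x in range(lo, hi + 1))
                  if p <= p_observed * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Rows: not adopted, adopted. Columns: org_id 0, org_id 9.
# Counts are reconstructed from the proportions and group sizes shown above.
table = [[308, 110], [11, 14]]
odds_ratio, p_value = fisher_exact_2x2(table)
print(odds_ratio, p_value)
```

The odds ratio is simply (308 × 14) / (110 × 11), which matches the value reported by `fisher_exact`. The p-value should land close to the reported one, assuming the reconstructed counts are right.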

Although this isn't intended to be a comprehensive statistical analysis of the association between org_id and adoption, the result of the test suggests that the differences here are worth a second look, especially since they are generally of similar or larger magnitude than the differences seen for some of the other variables. The org IDs in these files are just numerical codes. If the data here could be linked with information on the organization behind each org ID, I think that would be a worthwhile next step toward understanding adoption and its relationship to these organizations.

Looking at the categorical variables overall, the mailing list and marketing drip features don't seem to be strongly related to adoption: the adoption rates differ by less than one percentage point between the two levels of each. Users invited to join another user's personal project have the lowest adoption rate (about 8%) among the creation sources, compared with roughly 13% to 17% for the others. There also seems to be substantial variation in adoption by organization, although this feature is quite sparse, which is another reason to link it with organization-level data before drawing conclusions.

To get a look at how time of account creation might impact adoption, I also looked at the adoption rate across year and month categories:

In [13]:
adopted_years = pd.to_datetime(users["creation_time"][user_adoption_ind]).apply(lambda x: x.year).value_counts()
nonadopted_years = pd.to_datetime(users["creation_time"][~user_adoption_ind]).apply(lambda x: x.year).value_counts()

In [14]:
adopted_years/(adopted_years + nonadopted_years)

2012    0.162674
2013    0.150282
2014    0.083357
Name: creation_time, dtype: float64

In [15]:
pd.to_datetime(users["creation_time"]).apply(lambda x: x.year).value_counts()

2013    5676
2014    3527
2012    2797
Name: creation_time, dtype: int64

In [16]:
adopted_months = pd.to_datetime(users["creation_time"][user_adoption_ind]).apply(lambda x: x.month).value_counts()
nonadopted_months = pd.to_datetime(users["creation_time"][~user_adoption_ind]).apply(lambda x: x.month).value_counts()

In [17]:
adopted_months/(adopted_months + nonadopted_months)

1     0.140914
2     0.143469
3     0.133390
4     0.090676
5     0.050172
6     0.178359
7     0.139671
8     0.165493
9     0.153163
10    0.173160
11    0.160256
12    0.138710
Name: creation_time, dtype: float64

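There seem to be fairly large changes in adoption across years: the rate falls from about 16% for accounts created in 2012 to about 8% for accounts created in 2014. There is also some variation across months, from about 5% for accounts created in May to about 18% for accounts created in June.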
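The same per-year and per-month rates can be obtained in one step by grouping the boolean adoption indicator by the creation date and taking the mean, since the mean of a boolean series is the proportion of True values. This is only a sketch on a small made-up frame: `users_demo`, `adopted_demo`, and the dates in it are hypothetical stand-ins for `users` and `user_adoption_ind`:

```python
import pandas as pd

# Hypothetical stand-in for the users frame (six made-up accounts)
users_demo = pd.DataFrame({
    "object_id": [1, 2, 3, 4, 5, 6],
    "creation_time": pd.to_datetime([
        "2012-06-01", "2012-07-01", "2013-01-15",
        "2013-06-20", "2013-06-25", "2014-05-01",
    ]),
})

# Hypothetical adoption indicator: users 1, 4 and 5 count as adopted
adopted_demo = users_demo["object_id"].isin([1, 4, 5])

# Mean of a boolean series within each group = adoption rate of that group
rate_by_year = adopted_demo.groupby(users_demo["creation_time"].dt.year).mean()
rate_by_month = adopted_demo.groupby(users_demo["creation_time"].dt.month).mean()

print(rate_by_year)
print(rate_by_month)
```

Compared with dividing two `value_counts` results, this keeps years or months with no adopted users in the output (as a rate of 0) rather than producing a missing value.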

Overall, based on what I've looked at, account creation time (especially over long spans like years) and the inviting organization are among the most interesting predictors of adoption. I would recommend looking into possible reasons for the apparent decline in adoption over time.