**Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.**

Among the user base, only 13.35% are adopted users; the remaining 86.65% are non-adopted users.

The most relevant features for determining which factors predict future user adoption are 'creation_source', 'opted_in_to_mailing_list', 'enabled_for_marketing_drip', and whether or not a user was invited to sign up by another user.

For the features 'opted_in_to_mailing_list' and 'enabled_for_marketing_drip', the distributions are nearly identical for adopted and non-adopted users (25.84% vs. 24.81% opted in to the mailing list, and 15.36% vs. 14.87% enabled for the marketing drip). However, users invited by another user make up a larger share of adopted users (56.99%) than of non-adopted users (52.93%), which suggests that an invited user has a somewhat higher chance of becoming an adopted user. Similarly, users who invite others to the platform are more common among adopted users (26.72%) than among non-adopted users (20.54%).

When analyzing the creation source, it was found that users who signed up via 'GUEST_INVITE' or 'SIGNUP_GOOGLE_AUTH' represent a higher proportion of adopted users than of non-adopted users (22.47% vs. 17.34% and 14.48% vs. 11.09%, respectively). On the other hand, users who signed up via 'PERSONAL_PROJECTS' are more common among non-adopted users (18.72%) than among adopted users (10.24%).

In conclusion, the most important factors for predicting future user adoption are whether a user was invited by another user and whether the user invites others. In addition, signing up via 'GUEST_INVITE' or 'SIGNUP_GOOGLE_AUTH' is associated with a higher likelihood of a user being adopted.

# Code

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json

In [2]:
# load data frames
engagement = pd.read_csv('./takehome_user_engagement.csv')
users = pd.read_csv('./takehome_users.csv', encoding='latin-1')

In [3]:
engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [4]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [5]:
engagement.head(10)

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1
5,2013-12-31 03:45:04,2,1
6,2014-01-08 03:45:04,2,1
7,2014-02-03 03:45:04,2,1
8,2014-02-08 03:45:04,2,1
9,2014-02-09 03:45:04,2,1


In [6]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


Add a column to the engagement dataframe containing the 7-day rolling sum of visits.

In [7]:
from datetime import timedelta

# Timestamp to datetime format
engagement.time_stamp = engagement.time_stamp.astype('datetime64[ns]')
# Rolling window
delta = timedelta(7)
# Function returning rolling sum
def get_rolling_sum(df, period):
    return df.rolling(period, on='time_stamp')['visited'].sum()
# Apply get_rolling_sum to engagement dataframe
engagement['rolsum_7d'] = engagement.groupby('user_id', as_index=False, group_keys=False)\
                                    .apply(get_rolling_sum, delta)
# Check
engagement.head(10)

Unnamed: 0,time_stamp,user_id,visited,rolsum_7d
0,2014-04-22 03:53:30,1,1,1.0
1,2013-11-15 03:45:04,2,1,1.0
2,2013-11-29 03:45:04,2,1,1.0
3,2013-12-09 03:45:04,2,1,1.0
4,2013-12-25 03:45:04,2,1,1.0
5,2013-12-31 03:45:04,2,1,2.0
6,2014-01-08 03:45:04,2,1,1.0
7,2014-02-03 03:45:04,2,1,1.0
8,2014-02-08 03:45:04,2,1,2.0
9,2014-02-09 03:45:04,2,1,3.0
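
The rolling sum above counts login rows, so it matches the "three separate days" definition only if each user has at most one row per calendar day. As a hedged cross-check, the sketch below first deduplicates logins by calendar day and then tests whether any three consecutive login days fall inside a seven-day span. The helper name `adopted_user_ids` and the toy data are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def adopted_user_ids(engagement, window_days=7, min_days=3):
    """Return ids of users with at least `min_days` distinct login days
    inside some `window_days`-day period (hypothetical helper)."""
    days = (
        engagement
        .assign(day=pd.to_datetime(engagement["time_stamp"]).dt.normalize())
        .drop_duplicates(subset=["user_id", "day"])
        .sort_values(["user_id", "day"])
    )
    # For each login day, look up the same user's login day (min_days - 1) rows later.
    later = days.groupby("user_id")["day"].shift(-(min_days - 1))
    # The days fit in one window when the first and last are < window_days apart.
    hit = (later - days["day"]) < pd.Timedelta(days=window_days)
    return sorted(int(u) for u in days.loc[hit, "user_id"].unique())

# Toy data, made up for illustration
toy = pd.DataFrame({
    "time_stamp": [
        "2014-01-01 09:00:00", "2014-01-03 09:00:00", "2014-01-06 09:00:00",  # user 1
        "2014-01-01 09:00:00", "2014-01-01 18:00:00", "2014-01-02 09:00:00",  # user 2
        "2014-01-01 09:00:00", "2014-01-05 09:00:00", "2014-01-08 09:00:00",  # user 3
    ],
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "visited": 1,
})
adopted = adopted_user_ids(toy)
print(adopted)
```

In the toy data only user 1 qualifies: user 2 logs in on just two distinct days, and user 3's three login days span eight calendar days.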


In [8]:
# Determine adopted users
adopted_users = engagement.user_id[engagement.rolsum_7d >= 3].unique()
# Add column to users dataframe indicating if user is adopted
users['adopted'] = 0
# Use .loc to assign in place and avoid chained-assignment (SettingWithCopy) issues
users.loc[users.object_id.isin(adopted_users), 'adopted'] = 1
# Fill NaN in invited_by_user_id column with zeros (0 means "not invited")
users['invited_by_user_id'] = users['invited_by_user_id'].fillna(0)
# Check
users.head(10)


Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,0
5,6,2013-12-17 03:37:06,Cunha Eduardo,EduardoPereiraCunha@yahoo.com,GUEST_INVITE,1387424000.0,0,0,197,11241.0,0
6,7,2012-12-16 13:24:32,Sewell Tyler,TylerSewell@jourrapide.com,SIGNUP,1356010000.0,0,1,37,0.0,0
7,8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,,1,1,74,0.0,0
8,9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,,0,0,302,0.0,0
9,10,2013-01-16 22:08:03,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,1401833000.0,1,1,318,4143.0,1


In [9]:
prop_adopt = len(users[users.adopted==1]) / len(users) * 100
prop_non_adopt = len(users[users.adopted==0]) / len(users) * 100
print("Adopted users: {:.2f}%".format(prop_adopt))
print("Non-adopted users: {:.2f}%".format(prop_non_adopt))

Adopted users: 13.35%
Non-adopted users: 86.65%


In [10]:
prop = users[users.adopted==1].creation_source.value_counts() / len(users[users.adopted==1]) * 100
print("Creation source proportion among adopted users:")
for cat, val in prop.items():
    print("\t%.2f%% " %(val), cat)

Creation source proportion among adopted users:
	34.52%  ORG_INVITE
	22.47%  GUEST_INVITE
	18.29%  SIGNUP
	14.48%  SIGNUP_GOOGLE_AUTH
	10.24%  PERSONAL_PROJECTS


In [11]:
prop = users[users.adopted==0].creation_source.value_counts() / len(users[users.adopted==0]) * 100
print("Creation source proportion among non-adopted users:")
for cat, val in prop.items():
    print("\t %.2f%% " %(val), cat)

Creation source proportion among non-adopted users:
	 35.59%  ORG_INVITE
	 18.72%  PERSONAL_PROJECTS
	 17.34%  GUEST_INVITE
	 17.25%  SIGNUP
	 11.09%  SIGNUP_GOOGLE_AUTH
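
The two cells above show how creation sources are distributed within each adoption group. For prediction, the reverse view, the adoption rate within each creation source, can be easier to read. The following is a minimal sketch on a made-up stand-in for the `users` dataframe; `toy_users` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the `users` dataframe (values are made up)
toy_users = pd.DataFrame({
    "creation_source": ["GUEST_INVITE", "GUEST_INVITE", "PERSONAL_PROJECTS",
                        "PERSONAL_PROJECTS", "PERSONAL_PROJECTS", "ORG_INVITE"],
    "adopted": [1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of adopted users in each group
adoption_rate = (
    toy_users.groupby("creation_source")["adopted"]
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(adoption_rate.round(2))
```

Applied to the real `users` dataframe, the same `groupby` would give the percentage of adopted users per creation source directly.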


In [12]:
prop = users[users.adopted==1].opted_in_to_mailing_list.value_counts() / len(users[users.adopted==1]) * 100
print("Mailing list proportion among adopted users:")
for cat, val in prop.items():
    print("\t %.2f%% " %(val), cat)

Mailing list proportion among adopted users:
	 74.16%  0
	 25.84%  1


In [13]:
prop = users[users.adopted==0].opted_in_to_mailing_list.value_counts() / len(users[users.adopted==0]) * 100
print("Mailing list proportion among non-adopted users:")
for cat, val in prop.items():
    print("\t %.2f%% " %(val), cat)

Mailing list proportion among non-adopted users:
	 75.19%  0
	 24.81%  1


In [14]:
prop = users[users.adopted==1].enabled_for_marketing_drip.value_counts() / len(users[users.adopted==1]) * 100
print("Enabled marketing proportion among adopted users:")
for cat, val in prop.items():
    print("\t %.2f%% " %(val), cat)

Enabled marketing proportion among adopted users:
	 84.64%  0
	 15.36%  1


In [15]:
prop = users[users.adopted==0].enabled_for_marketing_drip.value_counts() / len(users[users.adopted==0]) * 100
print("Enabled marketing proportion among non-adopted users:")
for cat, val in prop.items():
    print("\t %.2f%% " %(val), cat)

Enabled marketing proportion among non-adopted users:
	 85.13%  0
	 14.87%  1
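
The per-feature cells above repeat the same computation for each feature and each adoption group. As a possible simplification, `pd.crosstab` with `normalize="columns"` produces both breakdowns in one table. The helper name `share_by_adoption` and the toy data below are assumptions for illustration.

```python
import pandas as pd

def share_by_adoption(df, feature, target="adopted"):
    """Percentage breakdown of `feature` within each `target` group
    (hypothetical helper)."""
    return pd.crosstab(df[feature], df[target], normalize="columns").mul(100).round(2)

# Toy stand-in for the `users` dataframe (values are made up)
toy_users = pd.DataFrame({
    "opted_in_to_mailing_list": [0, 0, 1, 0, 1, 1, 0, 0],
    "adopted":                  [0, 0, 0, 0, 1, 1, 1, 1],
})
table = share_by_adoption(toy_users, "opted_in_to_mailing_list")
print(table)
```

Each column of the resulting table sums to 100, so the column for `adopted == 1` corresponds to the "among adopted users" printouts and the column for `adopted == 0` to the "among non-adopted users" ones.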


In [16]:
perc = len(users[(users.invited_by_user_id!=0) & (users.adopted==1)]) / len(users[users.adopted==1]) * 100
print("Proportion of invited users among adopted users: %.2f%%" %(perc))

Proportion of invited users among adopted users: 56.99%


In [17]:
perc = len(users[(users.invited_by_user_id!=0) & (users.adopted==0)]) / len(users[users.adopted==0]) * 100
print("Proportion of invited users among non-adopted users: %.2f%%" %(perc))

Proportion of invited users among non-adopted users: 52.93%


In [18]:
perc = np.sum(users.object_id[users.adopted==1].isin(users.invited_by_user_id.unique())) / \
                                                len(users[users.adopted==1]) * 100
print("Proportion of adopted users that are also inviting users: %.2f%%" %(perc))

Proportion of adopted users that are also inviting users: 26.72%


In [19]:
perc = np.sum(users.object_id[users.adopted==0].isin(users.invited_by_user_id.unique())) / \
                                                len(users[users.adopted==0]) * 100
print("Proportion of non-adopted users that are also inviting users: %.2f%%" %(perc))

Proportion of non-adopted users that are also inviting users: 20.54%
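
The gap between invited shares (56.99% vs. 52.93%) is modest, so it may be worth checking that it is larger than chance alone would explain. The sketch below is a supplementary check, not part of the original analysis: it computes a Pearson chi-square statistic by hand for the invited/adopted contingency table. The counts are an assumption, reconstructed from the rounded percentages printed above and the 12000-row `users` dataframe.

```python
# Counts reconstructed from the rounded percentages above (assumption):
# 12000 users, 13.35% adopted -> 1602 adopted, 10398 non-adopted;
# 56.99% and 52.93% invited -> 913 and 5504 invited users.
observed = {
    ("adopted", "invited"): 913,
    ("adopted", "not_invited"): 1602 - 913,
    ("non_adopted", "invited"): 5504,
    ("non_adopted", "not_invited"): 10398 - 5504,
}

total = sum(observed.values())
row_totals, col_totals = {}, {}
for (row, col), n in observed.items():
    row_totals[row] = row_totals.get(row, 0) + n
    col_totals[col] = col_totals.get(col, 0) + n

# Pearson chi-square statistic (no continuity correction)
chi2 = 0.0
for (row, col), n in observed.items():
    expected = row_totals[row] * col_totals[col] / total
    chi2 += (n - expected) ** 2 / expected

# 3.84 is the 5% critical value of the chi-square distribution with 1 degree of freedom
print("chi2 = %.2f, exceeds 3.84: %s" % (chi2, chi2 > 3.84))
```

The reconstructed invited counts sum to 6417, which agrees with the number of non-null `invited_by_user_id` values reported by `users.info()`.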
