# Relax Inc. User Engagement Data Analysis

### Data

The data is available as two attached CSV files:
* takehome_user_engagement.csv
* takehome_users.csv

The data has the following two tables:
1. A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. This table includes:
 * name: the user's name
 * object_id: the user's id
 * email: email address
 * creation_source: how their account was created. This takes on one of 5 values:
    * PERSONAL_PROJECTS: invited to join another user's personal workspace
    * GUEST_INVITE: invited to an organization as a guest (limited permissions)
    * ORG_INVITE: invited to an organization (as a full member)
    * SIGNUP: signed up via the website
    * SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)
 * creation_time: when they created their account
 * last_session_creation_time: unix timestamp of last login
 * opted_in_to_mailing_list: whether they have opted into receiving marketing emails
 * enabled_for_marketing_drip: whether they are on the regular marketing email drip
 * org_id: the organization (group of users) they belong to
 * invited_by_user_id: which user invited them to join (if applicable).
2. A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.
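
As a quick illustration of the `last_session_creation_time` field described above, the following sketch converts Unix timestamps (seconds since 1970-01-01 UTC) into pandas datetimes. The two sample values and the `last_session` column name are made up for this example and are not read from the data files.

```python
import pandas as pd

# Hypothetical sample: one Unix timestamp in seconds and one missing value.
users = pd.DataFrame({"last_session_creation_time": [1398138810.0, None]})

# unit="s" tells pandas the numbers are seconds since the Unix epoch;
# missing values become NaT.
users["last_session"] = pd.to_datetime(
    users["last_session_creation_time"], unit="s"
)
print(users)
```

Users who never logged in have no last-session timestamp, so a missing value in this column is expected and should not be treated as a data error.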

### Goal
Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

## Import Packages

In [1]:
from glob import glob
import pandas as pd
import numpy as np
from sklearn import preprocessing

try:
    import cPickle as pickle
except ImportError:  # python 3.x
    import pickle

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
sns.set_style("whitegrid")
palette = sns.diverging_palette(220, 20, sep = 20, n = 150)
sns.set_palette(palette)

## Load Data

In [2]:
# Check csv files
data_dir = 'Data/*.csv'
! ls {data_dir}

Data/takehome_user_engagement.csv Data/takehome_users.csv


In [3]:
df_users = pd.read_csv('Data/takehome_users.csv', encoding = "ISO-8859-1")
df_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [5]:
df_engage = pd.read_csv('Data/takehome_user_engagement.csv', encoding = "ISO-8859-1", parse_dates = ['time_stamp'])
df_engage.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [6]:
df_engage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null datetime64[ns]
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 4.8 MB


## Define Adopted Users

An "adopted user" is a user who has logged into the product on three separate days in at least one seven-day period. The cell below flags such users with a rolling seven-day count over each user's login timestamps.
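
To make the definition concrete, here is a minimal sketch on made-up login timestamps. The helper name `is_adopted` and its parameters are hypothetical. It counts distinct calendar days, which is one reasonable reading of "three separate days" and an assumption of this sketch.

```python
import pandas as pd

def is_adopted(timestamps, window="7D", min_days=3):
    """Return True if the logins cover at least `min_days` distinct
    calendar days inside some rolling window of length `window`."""
    days = (
        pd.Series(pd.to_datetime(timestamps))
        .dt.normalize()        # drop the time of day
        .drop_duplicates()     # one entry per calendar day
        .sort_values()         # rolling time windows need sorted dates
    )
    # One marker per login day; the rolling sum counts the login days in
    # the window that ends at each day (the window start is exclusive).
    per_window = pd.Series(1, index=pd.DatetimeIndex(days)).rolling(window).sum()
    return bool(per_window.max() >= min_days)

# Days 1, 4 and 7 fall within one seven-day period.
print(is_adopted(["2014-01-01 09:00:00", "2014-01-04 09:00:00", "2014-01-07 09:00:00"]))
# Days 1, 4 and 8 span eight days, so no seven-day period holds all three.
print(is_adopted(["2014-01-01 09:00:00", "2014-01-04 09:00:00", "2014-01-08 09:00:00"]))
```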

In [7]:
users = set(df_engage['user_id'])
adopted_user = []

for user in users:
    # Keep this user's logins, one row per distinct timestamp.
    # time_stamp is already datetime64 because of parse_dates above.
    df_user = df_engage.loc[df_engage.user_id == user, ['time_stamp', 'user_id']]
    df_user = df_user.drop_duplicates()
    # Time-based rolling windows need a sorted datetime index.
    df_user = df_user.set_index('time_stamp').sort_index()
    # Number of logins in the seven days ending at each login.
    df_user['at_least_3_days_over_7_days_period'] = df_user['user_id'].rolling(window = '7D').count()

    if df_user['at_least_3_days_over_7_days_period'].max() >= 3:
        adopted_user.append(user)
        
print(adopted_user)

[2, 10, 20, 33, 42, 43, 50, 53, 63, 69, 74, 80, 81, 82, 87, 133, 135, 141, 146, 153, 160, 165, 168, 172, 174, 185, 188, 197, 200, 202, 203, 209, 214, 230, 245, 247, 263, 265, 275, 280, 283, 297, 298, 305, 310, 311, 321, 322, 341, 347, 351, 363, 370, 383, 397, 401, 418, 430, 445, 450, 460, 462, 469, 471, 472, 479, 483, 492, 494, 497, 502, 506, 509, 510, 512, 518, 522, 529, 535, 540, 547, 553, 564, 572, 589, 591, 601, 603, 605, 618, 627, 628, 632, 634, 639, 669, 679, 680, 724, 725, 728, 754, 772, 783, 786, 804, 828, 845, 851, 869, 874, 882, 885, 901, 906, 907, 912, 928, 932, 934, 937, 943, 953, 980, 985, 1007, 1009, 1013, 1017, 1018, 1026, 1027, 1035, 1055, 1061, 1072, 1089, 1093, 1094, 1099, 1106, 1107, 1119, 1123, 1124, 1128, 1129, 1136, 1145, 1150, 1151, 1155, 1156, 1163, 1173, 1186, 1196, 1202, 1214, 1222, 1233, 1235, 1238, 1242, 1245, 1250, 1274, 1280, 1290, 1303, 1318, 1319, 1320, 1327, 1339, 1343, 1345, 1350, 1357, 1361, 1368, 1379, 1396, 1407, 1410, 1411, 1421, 1434, 1464, 1472, 

In [8]:
print(len(adopted_user) / len(users) * 100)

18.157089425365523


About 18% of the users who appear in the engagement table (that is, users with at least one recorded login) are adopted users.
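
The per-user loop above filters the whole engagement table once per user, which can be slow. As a sketch of a vectorized alternative, the hypothetical helper `adopted_user_ids` below sorts each user's distinct login days and checks whether any login day and the second login day after it are at most six days apart. It assumes a frame with `user_id` and a datetime `time_stamp` column, and it works on calendar days, so its result may differ slightly from a rolling window over raw timestamps.

```python
import pandas as pd

def adopted_user_ids(df_engage):
    """Return the set of user ids with 3 distinct login days within 7 days."""
    days = (
        df_engage.assign(day=df_engage["time_stamp"].dt.normalize())
        .drop_duplicates(["user_id", "day"])
        .sort_values(["user_id", "day"])
    )
    # For each login day, look up the same user's second-next login day.
    third_day = days.groupby("user_id")["day"].shift(-2)
    span = third_day - days["day"]
    # Three days fit in a seven-day period when first and third are <= 6 days apart.
    return set(days.loc[span <= pd.Timedelta(days=6), "user_id"])

# Made-up data: user 1 logs in on three consecutive days, user 2 every four days.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "time_stamp": pd.to_datetime([
        "2014-01-01 09:00:00", "2014-01-02 09:00:00", "2014-01-03 09:00:00",
        "2014-01-01 09:00:00", "2014-01-05 09:00:00", "2014-01-09 09:00:00",
    ]),
})
print(adopted_user_ids(toy))
```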

## Identify Factors that Predict Future User Adoption

In [9]:
# Map the adopted user information back to the user data frame.
df_users['is_adopted_user'] = df_users['object_id'].isin(adopted_user)
df_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,is_adopted_user
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,False
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,True
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,False
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,False
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,False


In [10]:
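# Keep only the columns used below; .copy() avoids a SettingWithCopyWarning on later assignments.
df_users = df_users[['object_id', 'creation_time', 'creation_source', 'opted_in_to_mailing_list', 'enabled_for_marketing_drip', 'is_adopted_user', 'org_id']].copy()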
df_users.head()

Unnamed: 0,object_id,creation_time,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,is_adopted_user,org_id
0,1,2014-04-22 03:53:30,GUEST_INVITE,1,0,False,11
1,2,2013-11-15 03:45:04,ORG_INVITE,0,0,True,1
2,3,2013-03-19 23:14:52,ORG_INVITE,0,0,False,94
3,4,2013-05-21 08:09:28,GUEST_INVITE,0,0,False,1
4,5,2013-01-17 10:14:20,GUEST_INVITE,0,0,False,193


In [11]:
# Check the data types before fixing them
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 7 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
creation_source               12000 non-null object
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
is_adopted_user               12000 non-null bool
org_id                        12000 non-null int64
dtypes: bool(1), int64(4), object(2)
memory usage: 574.3+ KB


In [12]:
df_users['creation_time'] = pd.to_datetime(df_users['creation_time'])
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 7 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null datetime64[ns]
creation_source               12000 non-null object
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
is_adopted_user               12000 non-null bool
org_id                        12000 non-null int64
dtypes: bool(1), datetime64[ns](1), int64(4), object(1)
memory usage: 574.3+ KB


In [13]:
df_users = pd.get_dummies(df_users, columns = ['creation_source'])
df_users.head()

Unnamed: 0,object_id,creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,is_adopted_user,org_id,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
0,1,2014-04-22 03:53:30,1,0,False,11,1,0,0,0,0
1,2,2013-11-15 03:45:04,0,0,True,1,0,1,0,0,0
2,3,2013-03-19 23:14:52,0,0,False,94,0,1,0,0,0
3,4,2013-05-21 08:09:28,0,0,False,1,1,0,0,0,0
4,5,2013-01-17 10:14:20,0,0,False,193,1,0,0,0,0


In [14]:
# The selected columns have no missing values (see df_users.info() above), so no imputation is needed.
# Outliers would also be handled at this step if any were present.

In [15]:
# Apply L1-penalized logistic regression to select features.
from sklearn.linear_model import LogisticRegression

possible_features = ['opted_in_to_mailing_list',
                     'enabled_for_marketing_drip',
                     'creation_source_GUEST_INVITE',
                     'creation_source_ORG_INVITE',
                     'creation_source_PERSONAL_PROJECTS',
                     'creation_source_SIGNUP',
                     'creation_source_SIGNUP_GOOGLE_AUTH']
X = df_users[possible_features]
y = df_users['is_adopted_user']

for C in [1.0, 0.5, 0.1, 0.01]:
    clf = LogisticRegression(C = C, penalty = 'l1', tol = 0.01, solver = 'saga')
    clf.fit(X, y)
    coef = clf.coef_.ravel()
    print(coef)

[ 0.05296418  0.00425654  0.19625971 -0.08617218 -0.65485385 -0.00220979
  0.20775786]
[ 4.55240910e-02  1.26044253e-04  1.93521272e-01 -8.44295592e-02
 -6.48076570e-01  0.00000000e+00  1.99844046e-01]
[ 0.02068429  0.00333954  0.16374602 -0.06768551 -0.59788032  0.00130011
  0.15935844]
[ 0.          0.          0.          0.         -0.09037984  0.
  0.        ]
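
The raw coefficient arrays above are hard to read without the feature names next to them. The sketch below shows one way to collect the coefficients into a labeled table, with one row per value of `C`. The helper name `l1_coefficient_table`, the synthetic data, and the `max_iter` and `random_state` settings are assumptions made for this example and are not part of the original cell.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def l1_coefficient_table(X, y, Cs=(1.0, 0.5, 0.1, 0.01)):
    """Fit one L1-penalized logistic regression per C and tabulate the coefficients."""
    coefs = {}
    for C in Cs:
        clf = LogisticRegression(C=C, penalty="l1", solver="saga",
                                 tol=0.01, max_iter=1000, random_state=0)
        clf.fit(X, y)
        coefs[C] = clf.coef_.ravel()
    # Rows are values of C, columns are feature names.
    return pd.DataFrame(coefs, index=X.columns).T.rename_axis("C")

# Synthetic binary features; only feat_a is related to the target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 2, size=(500, 3)),
                     columns=["feat_a", "feat_b", "feat_c"])
y_toy = (0.8 * X_toy["feat_a"] + rng.random(500)) > 0.9

table = l1_coefficient_table(X_toy, y_toy)
print(table)
```

Smaller values of `C` mean stronger regularization, so features whose coefficients stay nonzero as `C` shrinks are the ones the model relies on most.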


## Conclusion

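Based on the logistic regression coefficients above, 'opted_in_to_mailing_list', 'creation_source_GUEST_INVITE', and 'creation_source_SIGNUP_GOOGLE_AUTH' have positive coefficients at C = 1.0, 0.5, and 0.1. This suggests that users who opted into the mailing list, users who were invited to an organization as a guest, and users who signed up with Google Authentication are more likely to become adopted users. The largest coefficient in magnitude, however, belongs to 'creation_source_PERSONAL_PROJECTS': it is negative at every value of C and is the only one that remains nonzero at C = 0.01, so joining through another user's personal workspace is the strongest signal here, and it points toward lower adoption.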
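
One way to interpret the coefficients printed above is to exponentiate them into odds ratios. The sketch below does this for the C = 1.0 row, copying the values from the output of cell 15. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

features = ["opted_in_to_mailing_list",
            "enabled_for_marketing_drip",
            "creation_source_GUEST_INVITE",
            "creation_source_ORG_INVITE",
            "creation_source_PERSONAL_PROJECTS",
            "creation_source_SIGNUP",
            "creation_source_SIGNUP_GOOGLE_AUTH"]

# Coefficients printed for C = 1.0, in the same order as the feature list.
coef_c1 = [0.05296418, 0.00425654, 0.19625971, -0.08617218,
           -0.65485385, -0.00220979, 0.20775786]

# exp(coefficient) is the multiplicative change in the odds of adoption
# when the binary feature switches from 0 to 1.
odds_ratio = pd.Series(np.exp(coef_c1), index=features).sort_values()
print(odds_ratio.round(2))
```

An odds ratio below 1 means the feature is associated with lower odds of adoption, and a value above 1 with higher odds, holding the other features fixed.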