The data is available as two attached CSV files:
- takehome_user_engagement.csv
- takehome_users.csv

The data has the following two tables:
1. A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. This table includes:
    - name: the user's name
    - object_id: the user's id
    - email: email address
    - creation_source: how their account was created. This takes on one of 5 values:

        - PERSONAL_PROJECTS: invited to join another user's personal workspace
        - GUEST_INVITE: invited to an organization as a guest (limited permissions)
        - ORG_INVITE: invited to an organization (as a full member)
        - SIGNUP: signed up via the website
        - SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)
    - creation_time: when they created their account
    - last_session_creation_time: unix timestamp of last login
    - opted_in_to_mailing_list: whether they have opted into receiving marketing emails
    - enabled_for_marketing_drip: whether they are on the regular marketing email drip
    - org_id: the organization (group of users) they belong to
    - invited_by_user_id: which user invited them to join (if applicable).
2. A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

In [1]:
import pandas as pd
import numpy as np

In [6]:
users = pd.read_csv("takehome_users.csv",encoding="ISO-8859-1")
user_engagement = pd.read_csv("takehome_user_engagement.csv",encoding="ISO-8859-1")

In [7]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0
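
The `last_session_creation_time` column is a Unix timestamp (seconds since 1970-01-01 UTC), which is why it appears as a large float in the output above. A minimal sketch of converting it to a datetime follows. The values are copied from the displayed (rounded) output plus one missing value, so the exact second may differ from the real data; the variable names are illustrative only.

```python
import pandas as pd

# Two values as displayed in users.head(), plus a missing value (NaN).
last_session = pd.Series([1398139000.0, 1396238000.0, None])

# unit="s" tells pandas the numbers are seconds since the Unix epoch.
# Missing values are carried through as NaT.
last_session_dt = pd.to_datetime(last_session, unit="s")
print(last_session_dt)
```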


In [8]:
user_engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


# Outline
1. First, identify the "adopted users"
2. Aggregate to the users table
3. Fit a decision tree / random forest
4. Get feature importance

## 1. Identify the "adopted users"

In [45]:
user_engagement['time_stamp'] = pd.to_datetime(user_engagement['time_stamp'])
user_engagement = user_engagement.set_index(keys='time_stamp')
user_engagement.head()

time_stamp,user_id,visited
2014-04-22 03:53:30,1,1
2013-11-15 03:45:04,2,1
2013-11-29 03:45:04,2,1
2013-12-09 03:45:04,2,1
2013-12-25 03:45:04,2,1


In [46]:
check = user_engagement.groupby(by='user_id')['visited'].rolling('7d').sum()
check.head()

user_id  time_stamp         
1        2014-04-22 03:53:30    1.0
2        2013-11-15 03:45:04    1.0
         2013-11-29 03:45:04    1.0
         2013-12-09 03:45:04    1.0
         2013-12-25 03:45:04    1.0
Name: visited, dtype: float64

In [50]:
# Get "Adopted Users": users with at least 3 logins inside some 7-day window
adopted_user = check[check>=3].index.get_level_values(0).unique()
adopted_user

Int64Index([    2,    10,    20,    33,    42,    43,    50,    53,    63,
               69,
            ...
            11957, 11958, 11959, 11961, 11964, 11965, 11967, 11969, 11975,
            11988],
           dtype='int64', name='user_id', length=1602)

## 2. Aggregate to users table

In [56]:
# Aggregate to users table
users['adopted'] = 0
users.loc[users['object_id'].isin(adopted_user),'adopted'] = 1


In [116]:
users.head(10)

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,0
5,6,2013-12-17 03:37:06,Cunha Eduardo,EduardoPereiraCunha@yahoo.com,GUEST_INVITE,1387424000.0,0,0,197,11241.0,0
6,7,2012-12-16 13:24:32,Sewell Tyler,TylerSewell@jourrapide.com,SIGNUP,1356010000.0,0,1,37,,0
7,8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,,1,1,74,,0
8,9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,,0,0,302,,0
9,10,2013-01-16 22:08:03,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,1401833000.0,1,1,318,4143.0,1


In [132]:
# Feature engineering
# invited_adopted: whether the user who invited the current user is an adopted user
#   0 = not invited, 1 = invited by a non-adopted user, 2 = invited by an adopted user
def checkInvitedUser(inviter_id):
    if pd.notna(inviter_id):
        # Look up the inviter's own "adopted" flag in the users table
        adopted_value = users.loc[users['object_id'] == inviter_id,'adopted'].values[0]
        return (adopted_value + 1)
    else:
        return 0

users['invited_adopted'] = users['invited_by_user_id'].apply(checkInvitedUser)

In [133]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted,invited_adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0,1
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,1,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0,2
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,0,2
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,0,2
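
The row-by-row lookup above scans the users table once per user. A vectorized alternative is sketched below on a small made-up table (the `toy` frame and its values are illustrative, not taken from the data); it uses the same 0 / 1 / 2 encoding as `invited_adopted`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the users table.
toy = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "invited_by_user_id": [2.0, np.nan, 2.0, 3.0],
    "adopted": [0, 1, 0, 0],
})

# Lookup Series: object_id -> adopted flag.
adopted_by_id = toy.set_index("object_id")["adopted"]

# Map each inviter id to the inviter's adopted flag, shift by 1,
# and use 0 for users who were not invited (NaN inviter id).
toy["invited_adopted"] = (
    toy["invited_by_user_id"].map(adopted_by_id).add(1).fillna(0).astype(int)
)
print(toy)
```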


In [139]:
# Keep only the candidate features and the label (drops the time, name, email and user-ID columns)
users_notime = users.loc[:,['creation_source','opted_in_to_mailing_list','enabled_for_marketing_drip','org_id','invited_adopted','adopted']]
users_notime.head()

Unnamed: 0,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_adopted,adopted
0,GUEST_INVITE,1,0,11,1,0
1,ORG_INVITE,0,0,1,1,1
2,ORG_INVITE,0,0,94,2,0
3,GUEST_INVITE,0,0,1,2,0
4,GUEST_INVITE,0,0,193,2,0


## 3. Fit a decision tree / random forest

In [58]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [140]:
# PERSONAL_PROJECTS: invited to join another user's personal workspace
# GUEST_INVITE: invited to an organization as a guest (limited permissions)
# ORG_INVITE: invited to an organization (as a full member)
# SIGNUP: signed up via the website
# SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)

# Map each creation source to an integer code (codes follow the order of first appearance)
creation_source = users['creation_source'].unique()
creation_source_dict = {name:index for index,name in enumerate(creation_source)}

In [136]:
creation_source_dict

{'GUEST_INVITE': 0,
 'ORG_INVITE': 1,
 'SIGNUP': 2,
 'PERSONAL_PROJECTS': 3,
 'SIGNUP_GOOGLE_AUTH': 4}

In [141]:
#Pre-processing
x = users_notime.iloc[:,:-1].copy()  # copy so the assignment below does not write into a slice
x['creation_source'] = x['creation_source'].map(creation_source_dict)
y = users_notime.iloc[:,-1]

In [142]:
x.head()

Unnamed: 0,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_adopted
0,0,1,0,11,1
1,1,0,0,1,1
2,1,0,0,94,2
3,0,0,0,1,2
4,0,0,0,193,2
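
The integer codes above put the creation sources on an arbitrary numeric scale (GUEST_INVITE < ORG_INVITE < ...), which a tree can only split as ranges. One common alternative, sketched here on made-up rows rather than the real table, is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Made-up rows with the same column names as the feature table.
toy = pd.DataFrame({
    "creation_source": ["GUEST_INVITE", "ORG_INVITE", "SIGNUP", "ORG_INVITE"],
    "org_id": [11, 1, 37, 94],
})

# One indicator column per creation source; other columns are left as-is.
encoded = pd.get_dummies(toy, columns=["creation_source"], dtype=int)
print(encoded.columns.tolist())
```

With one-hot columns, the importance of `creation_source` is spread over several indicator columns, so the per-column values would need to be summed before comparing them with the other features.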


In [143]:
# Fit a decision tree
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [146]:
# Get accuracy (measured on the training data, not on a held-out set)
clf.score(x,y)

0.9045
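
Because the score above is computed on the same rows the tree was fitted on, it says little about how well the model predicts unseen users. A minimal sketch of a held-out evaluation follows. It runs on synthetic data (random features and random labels, generated only so the block is self-contained), so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: three integer features, roughly 13% positive labels.
rng = np.random.default_rng(0)
n = 2000
X_demo = np.column_stack([
    rng.integers(0, 5, n),    # stand-in for a coded categorical feature
    rng.integers(0, 2, n),    # stand-in for a binary flag
    rng.integers(0, 400, n),  # stand-in for a high-cardinality id
])
y_demo = (rng.random(n) < 0.13).astype(int)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

On the real data the same pattern would use `x` and `y` in place of `X_demo` and `y_demo`.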

In [183]:
y_pred = clf.predict(x)
pd.Series(y_pred).value_counts()

0    11438
1      562
dtype: int64
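
The numbers printed so far are enough to put the 0.9045 accuracy in context. The arithmetic below uses only figures shown above (12,000 users, 1,602 adopted users, 562 predicted positives) and assumes every adopted `user_id` appears in the users table, so that exactly 1,602 rows have `adopted == 1`.

```python
n_users = 12000      # rows in the users table
n_adopted = 1602     # length of adopted_user (assumed equal to the number of adopted == 1 rows)
accuracy = 0.9045    # clf.score(x, y) on the training data
n_pred_pos = 562     # rows predicted as adopted

# Accuracy of always predicting "not adopted".
baseline_acc = 1 - n_adopted / n_users

# Misclassified rows = false positives + false negatives
#                    = (n_pred_pos - tp) + (n_adopted - tp)
n_errors = n_users - round(accuracy * n_users)
tp = (n_pred_pos + n_adopted - n_errors) // 2
fp = n_pred_pos - tp
fn = n_adopted - tp

precision = tp / n_pred_pos
recall = tp / n_adopted
print(f"baseline accuracy: {baseline_acc:.4f}")
print(f"TP={tp}, FP={fp}, FN={fn}, precision={precision:.3f}, recall={recall:.3f}")
```

Under that assumption the tree beats the majority-class baseline by only a few points of accuracy and, even on its own training data, recovers roughly a third of the adopted users.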

In [147]:
list(zip(x.columns,clf.feature_importances_))

[('creation_source', 0.039694848787657736),
 ('opted_in_to_mailing_list', 0.09580302581575088),
 ('enabled_for_marketing_drip', 0.09404878337214993),
 ('org_id', 0.7455448029891558),
 ('invited_adopted', 0.02490853903528568)]
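
The values above are impurity-based importances, which scikit-learn documents as potentially misleading for high-cardinality features such as an id column. Permutation importance on held-out rows is one way to cross-check them. The sketch below is not part of the original analysis: it uses synthetic data with made-up feature names (`signal`, `noise_id`), and `sklearn.inspection.permutation_importance` requires a newer scikit-learn than the version whose output is shown in this notebook.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.integers(0, 2, n)      # binary feature that drives the label
noise_id = rng.integers(0, 400, n)  # high-cardinality id unrelated to the label
X_demo = np.column_stack([signal, noise_id])
# The label equals `signal` about 90% of the time and is flipped otherwise.
y_demo = np.where(rng.random(n) < 0.9, signal, 1 - signal)

X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on the held-out rows and measure the accuracy drop.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

for name, gini_imp, perm_imp in zip(
    ["signal", "noise_id"], model.feature_importances_, result.importances_mean
):
    print(f"{name}: impurity-based={gini_imp:.3f}, permutation={perm_imp:.3f}")
```

If `org_id` keeps a high permutation importance on held-out users, that supports the conclusion drawn from the impurity-based values; if it drops towards zero, the tree has probably memorized individual organizations.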

## Answer
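As we can see, the feature "org_id" has by far the largest importance (about 0.75), followed by the two marketing flags (about 0.10 and 0.09), so it plays the most important part in predicting whether a user is "adopted" or not.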