### Relax Inc. mock takehome challenge
Task: Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

#### Read files

In [2]:
engagement_file = "takehome_user_engagement.csv"
users_file = "takehome_users2.csv"

In [3]:
# read_csv already returns a DataFrame; timestamps are parsed explicitly in later cells
engagement = pd.read_csv(engagement_file)
users = pd.read_csv(users_file, encoding='latin1')

In [4]:
print('the total number of users is',len(users))
users[:3]

the total number of users is 12000


Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,4/22/2014 3:53,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,11/15/2013 3:45,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,3/19/2013 23:14,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0


In [5]:
engagement[:3]

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1


In [6]:
engagement['time_stamp'] = pd.to_datetime(engagement['time_stamp'], format='%Y-%m-%d %H:%M:%S')

In [7]:
engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null datetime64[ns]
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 4.8 MB


In [8]:
engagement['date'] = engagement['time_stamp'].dt.date  # keep only the calendar day
engagement.drop(['visited','time_stamp'], axis=1, inplace=True)

In [9]:
engagement.head()

Unnamed: 0,user_id,date
0,1,2014-04-22
1,2,2013-11-15
2,2,2013-11-29
3,2,2013-12-09
4,2,2013-12-25


### Adopted users
Identify which users have logged into the product on three separate days within one seven-day period.
<br>
Use the engagement table, where ```user_id``` corresponds to ```object_id``` in the users table.

Strategy: 
- count visits per day by user
- check there aren't multiple visits/day
- use timedelta to calculate the difference between visits, and check whether it is <= 7 days
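
The first two steps of this strategy (counting visits per day and checking for repeated days) could be sketched as follows. The rows are made up for illustration; on the real data the same calls would apply to the ```engagement``` frame once its ```date``` column exists.

```python
from datetime import date

import pandas as pd

# Toy stand-in for the engagement table (hypothetical rows, not the real data)
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 2],
    "date": [date(2014, 4, 22), date(2013, 11, 15), date(2013, 11, 15), date(2013, 11, 29)],
})

# Count visits per user per day
visits_per_day = toy.groupby(["user_id", "date"]).size()

# True if any user has more than one row on the same day
has_repeat_days = bool((visits_per_day > 1).any())

# Keep one row per user per day so that each day is counted once
toy_unique = toy.drop_duplicates(["user_id", "date"])
```

If repeated days do exist, deduplicating before the adoption check keeps a single busy day from being counted as several separate days.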

In [10]:
from datetime import datetime, timedelta

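def adopted(a):
    if len(a) >= 3:
        a = sorted(a)  # convert the group's dates to a sorted list
        # gaps between consecutive logins; range(len(a)-2) leaves out the final gap
        a = [a[i+1] - a[i] for i in range(len(a)-2)]
        # flag every run of three consecutive gaps (four logins) that sums to at most seven days
        a = [1 for i in range(len(a)-2) if a[i] + a[i+1] + a[i+2] <= timedelta(days=7)]
        if 1 in a:
            return 1
    # falls through and returns None otherwise; filled with 0 after aggregation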

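This function converts the group's dates to a sorted list and returns 1 when three consecutive gaps between logins add up to seven days or less. Because three gaps span four logins and the final gap is never computed, the test is stricter than the stated definition (three separate days in a seven-day period), so the counts below are likely an underestimate.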
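
For comparison, here is a minimal sketch of a check that tests the definition directly: three distinct login days inside one seven-day window. The name ```is_adopted``` and its parameters are hypothetical, and the window boundary is an assumption (days d and d + 6 are treated as the first and last day of one seven-day period).

```python
from datetime import date, timedelta


def is_adopted(dates, min_days=3, window_days=7):
    """Return 1 if `dates` has `min_days` distinct days within one `window_days`-day window."""
    days = sorted(set(dates))  # one entry per calendar day, oldest first
    # Assumed boundary: days d and d + 6 both lie inside one seven-day window
    max_span = timedelta(days=window_days - 1)
    for i in range(len(days) - min_days + 1):
        # days[i] .. days[i + min_days - 1] are min_days consecutive distinct login days
        if days[i + min_days - 1] - days[i] <= max_span:
            return 1
    return 0


example = [date(2014, 1, 1), date(2014, 1, 3), date(2014, 1, 7)]
result = is_adopted(example)
```

On the notebook's data this could be applied with ```engagement.groupby('user_id')['date'].agg(is_adopted)```. The resulting count may differ from the figure reported below, which comes from the original function.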

In [11]:
df = engagement.groupby('user_id').agg(adopted)  # group by user_id and aggregate using custom function
df.fillna(0, inplace=True)  # fill null values with 0
df.columns = ['adopted_user']
df.head()

user_id,adopted_user
1,0.0
2,0.0
3,0.0
4,0.0
5,0.0


In [12]:
print('The number of adopted users is', len(df[df['adopted_user']==1]),'of 12000 total.')

The number of adopted users is 1322 of 12000 total.


In [13]:
print('Adopted users are', round(1322/12000*100),'%.')

Adopted users are 11 %.


### Join to user info df & clean

In [14]:
users['user_id']=users.object_id

In [15]:
data = users.join(df, how='left', on='user_id')
data.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_id,adopted_user
0,1,4/22/2014 3:53,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,1,0.0
1,2,11/15/2013 3:45,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2,0.0
2,3,3/19/2013 23:14,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,3,0.0
3,4,5/21/2013 8:09,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,4,0.0
4,5,1/17/2013 10:14,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,5,0.0


Drop ```email```, ```name```, and ```last_session_creation_time```, which are not used as features.

In [16]:
data.drop(['email','name', 'last_session_creation_time'], axis=1, inplace=True)
data.head()

Unnamed: 0,object_id,creation_time,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_id,adopted_user
0,1,4/22/2014 3:53,GUEST_INVITE,1,0,11,10803.0,1,0.0
1,2,11/15/2013 3:45,ORG_INVITE,0,0,1,316.0,2,0.0
2,3,3/19/2013 23:14,ORG_INVITE,0,0,94,1525.0,3,0.0
3,4,5/21/2013 8:09,GUEST_INVITE,0,0,1,5151.0,4,0.0
4,5,1/17/2013 10:14,GUEST_INVITE,0,0,193,5240.0,5,0.0


In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 9 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
creation_source               12000 non-null object
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
user_id                       12000 non-null int64
adopted_user                  8823 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 843.8+ KB


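```.info()``` shows that ```invited_by_user_id``` and ```adopted_user``` contain NaNs. In both cases it is fair to treat them as 0: a missing ```invited_by_user_id``` presumably means the user was not invited by another user, and a missing ```adopted_user``` means the user has no rows in the engagement table.
```last_session_creation_time``` also had NaNs, but that column was dropped above.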

In [18]:
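# assign back rather than calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about
fill_cols = ['invited_by_user_id', 'adopted_user']
data[fill_cols] = data[fill_cols].fillna(0)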
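
To illustrate why NaN in ```adopted_user``` can be read as 0, the sketch below reproduces the left join on a toy example. The tables are hypothetical: user 3 has no engagement rows, so the join leaves that user's value missing.

```python
import pandas as pd

# Hypothetical toy tables: user 3 never appears in the engagement-derived table
toy_users = pd.DataFrame({"user_id": [1, 2, 3]})
toy_adopted = pd.DataFrame(
    {"adopted_user": [0.0, 1.0]},
    index=pd.Index([1, 2], name="user_id"),
)

# A left join keeps every user; users without a match get NaN
joined = toy_users.join(toy_adopted, how="left", on="user_id")
n_missing = int(joined["adopted_user"].isna().sum())

# Filling with 0 marks users without engagement records as not adopted
joined["adopted_user"] = joined["adopted_user"].fillna(0)
```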

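Drop any remaining rows that contain NaNs. Missing timestamps (in ```last_session_creation_time```) may be due to bad data collection, but could also mean this is not a user at all (never initiated a session). Because that column has already been dropped and the other NaNs have been filled, no rows are removed here; the ```.info()``` output below still shows 12000 entries.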

In [19]:
data.dropna(axis=0, inplace=True)

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 9 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
creation_source               12000 non-null object
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null float64
user_id                       12000 non-null int64
adopted_user                  12000 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 937.5+ KB


In [21]:
data['creation_time'] = pd.to_datetime(data['creation_time'], format='%m/%d/%Y %H:%M')

In [22]:
data.head()

Unnamed: 0,object_id,creation_time,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_id,adopted_user
0,1,2014-04-22 03:53:00,GUEST_INVITE,1,0,11,10803.0,1,0.0
1,2,2013-11-15 03:45:00,ORG_INVITE,0,0,1,316.0,2,0.0
2,3,2013-03-19 23:14:00,ORG_INVITE,0,0,94,1525.0,3,0.0
3,4,2013-05-21 08:09:00,GUEST_INVITE,0,0,1,5151.0,4,0.0
4,5,2013-01-17 10:14:00,GUEST_INVITE,0,0,193,5240.0,5,0.0


### Feature design
- create dummy variables for creation_source
- also use mailing list and marketing drip as predictors


In [23]:
creation_dummies = pd.get_dummies(data['creation_source'], drop_first=True, prefix='creation_', dummy_na=True)
complete = pd.merge(data, creation_dummies, how='outer', left_index=True, right_index=True)
complete.head()

Unnamed: 0,object_id,creation_time,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_id,adopted_user,creation__ORG_INVITE,creation__PERSONAL_PROJECTS,creation__SIGNUP,creation__SIGNUP_GOOGLE_AUTH,creation__nan
0,1,2014-04-22 03:53:00,GUEST_INVITE,1,0,11,10803.0,1,0.0,0,0,0,0,0
1,2,2013-11-15 03:45:00,ORG_INVITE,0,0,1,316.0,2,0.0,1,0,0,0,0
2,3,2013-03-19 23:14:00,ORG_INVITE,0,0,94,1525.0,3,0.0,1,0,0,0,0
3,4,2013-05-21 08:09:00,GUEST_INVITE,0,0,1,5151.0,4,0.0,0,0,0,0,0
4,5,2013-01-17 10:14:00,GUEST_INVITE,0,0,193,5240.0,5,0.0,0,0,0,0,0
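
The all-zero rows in the table above follow from ```drop_first=True```: the first category (```GUEST_INVITE```) gets no column of its own and becomes the baseline. A minimal sketch on three made-up values, using the same arguments as the cell above:

```python
import pandas as pd

# Hypothetical three-row sample of creation_source values
sources = pd.Series(["GUEST_INVITE", "ORG_INVITE", "SIGNUP"], name="creation_source")

# Same arguments as the notebook cell; the prefix 'creation_' plus the default
# separator '_' is what produces the double underscore in the column names
dummies = pd.get_dummies(sources, drop_first=True, prefix="creation_", dummy_na=True)
dummies = dummies.astype(int)  # newer pandas versions return booleans by default

# GUEST_INVITE (row 0) is the dropped baseline, so all of its dummies are 0
baseline_row_sum = int(dummies.iloc[0].sum())
```

```dummy_na=True``` adds the ```creation__nan``` column even when no values are missing, which is why that column is all zeros here and has zero importance later on.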


### Training data set up

In [24]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, confusion_matrix
from sklearn import decomposition
from sklearn.preprocessing import MinMaxScaler

In [25]:
# target
y = complete['adopted_user']

In [26]:
# select features likely to be relevant in predicting user adoption:
x = complete[['opted_in_to_mailing_list','enabled_for_marketing_drip', 'org_id',
              'creation__ORG_INVITE', 'creation__PERSONAL_PROJECTS', 'creation__SIGNUP', 
              'creation__SIGNUP_GOOGLE_AUTH', 'creation__nan']]

In [27]:
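# use MinMaxScaler to scale values to [0,1]
# note: the scaled array is only displayed, not assigned, so the model below is trained on the unscaled x
# (tree splits depend only on the ordering of values, so this does not change the fit)
scaler = MinMaxScaler()
scaler.fit_transform(x)  # fit_transform fits and transforms in one call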

array([[ 1.        ,  0.        ,  0.02644231, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.00240385, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.22596154, ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 1.        ,  1.        ,  0.19951923, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.01442308, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        , ...,  1.        ,
         0.        ,  0.        ]])

In [28]:
# 60% train and 40% test data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=42, stratify=y)

### Train model

In [29]:
tree = DecisionTreeClassifier()

tree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

### Evaluate model (test & scores)

In [30]:
y_pred = tree.predict(X_test)

# scores
print('Feature importance')
for idx, val in enumerate(tree.feature_importances_):
    print("{:30}{:3f}".format(x.columns[idx], val))

Feature importance
opted_in_to_mailing_list      0.125592
enabled_for_marketing_drip    0.062909
org_id                        0.723539
creation__ORG_INVITE          0.010712
creation__PERSONAL_PROJECTS   0.010844
creation__SIGNUP              0.026211
creation__SIGNUP_GOOGLE_AUTH  0.040193
creation__nan                 0.000000


In [31]:
print("{:30}{:3f}".format('F1 score', f1_score(y_test, y_pred)))
print("{:30}{:3f}".format('Test accuracy', accuracy_score(y_test, y_pred)))

F1 score                      0.098237
Test accuracy                 0.850833


In [32]:
print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

Confusion Matrix
[[4045  226]
 [ 490   39]]
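
The scores above can be recomputed by hand from this confusion matrix, which also gives precision and recall for the adopted class. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so the cells are [[TN, FP], [FN, TP]]. A short worked sketch:

```python
# Cells of the confusion matrix printed above
tn, fp = 4045, 226
fn, tp = 490, 39

precision = tp / (tp + fp)  # share of predicted adopters that really adopted
recall = tp / (tp + fn)     # share of real adopters that the model found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print("precision {:.6f}  recall {:.6f}".format(precision, recall))
print("f1 {:.6f}  accuracy {:.6f}".format(f1, accuracy))
```

The tree finds only 39 of the 529 adopted users in the test set, which is why the F1 score is low even though accuracy looks high.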


### Remove less important features

In [33]:
x2 = x.drop(['creation__ORG_INVITE', 'creation__PERSONAL_PROJECTS', 'creation__SIGNUP', 'creation__nan'],axis=1)

In [34]:
x2.head()

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,creation__SIGNUP_GOOGLE_AUTH
0,1,0,11,0
1,0,0,1,0
2,0,0,94,0
3,0,0,1,0
4,0,0,193,0


In [35]:
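# use MinMaxScaler to scale values to [0,1]
# note: the scaled array is only displayed, not assigned, so the model below is trained on the unscaled x2
# (tree splits depend only on the ordering of values, so this does not change the fit)
scaler = MinMaxScaler()
scaler.fit_transform(x2)  # fit_transform fits and transforms in one call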

array([[ 1.        ,  0.        ,  0.02644231,  0.        ],
       [ 0.        ,  0.        ,  0.00240385,  0.        ],
       [ 0.        ,  0.        ,  0.22596154,  0.        ],
       ..., 
       [ 1.        ,  1.        ,  0.19951923,  0.        ],
       [ 0.        ,  0.        ,  0.01442308,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ]])

In [36]:
# 60% train and 40% test data
X2_train, X2_test, y_train, y_test = train_test_split(x2, y, test_size=0.4, random_state=42, stratify=y)

In [37]:
tree2 = DecisionTreeClassifier()

tree2.fit(X2_train, y_train)

y_pred = tree2.predict(X2_test)

# scores
print('Feature importance')
for idx, val in enumerate(tree2.feature_importances_):
    print("{:30}{:3f}".format(x2.columns[idx], val))

Feature importance
opted_in_to_mailing_list      0.159804
enabled_for_marketing_drip    0.079872
org_id                        0.751896
creation__SIGNUP_GOOGLE_AUTH  0.008428


In [38]:
print("{:30}{:3f}".format('F1 score', f1_score(y_test, y_pred)))
print("{:30}{:3f}".format('Test accuracy', accuracy_score(y_test, y_pred)))

F1 score                      0.040752
Test accuracy                 0.872500


In [39]:
print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

Confusion Matrix
[[4175   96]
 [ 516   13]]
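
Because only about 11% of users are adopted, the accuracies are easier to interpret next to a majority-class baseline, that is, a model that always predicts "not adopted". This baseline is an added point of comparison and is not computed in the notebook itself. It follows directly from the confusion matrix:

```python
# Cells of the second confusion matrix printed above: [[TN, FP], [FN, TP]]
tn, fp = 4175, 96
fn, tp = 516, 13

n_total = tn + fp + fn + tp
n_not_adopted = tn + fp  # true "not adopted" users in the test set

# Accuracy of always predicting the majority class ("not adopted")
baseline_accuracy = n_not_adopted / n_total

# Accuracy of the reduced-feature tree
model_accuracy = (tn + tp) / n_total

print("baseline {:.4f}  model {:.4f}".format(baseline_accuracy, model_accuracy))
```

Both trees (0.8508 and 0.8725) score below this baseline, so accuracy alone does not show that the features carry predictive signal.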


## Conclusions

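Organization ID has the highest feature importance in both trees (0.72 and 0.75), and opt-in on the mailing list is the next most important feature (0.13 and 0.16). These rankings should be read with caution: both trees identify few adopted users (F1 of 0.10 and 0.04), and their test accuracy (0.85 and 0.87) is below the roughly 0.89 that always predicting "not adopted" would achieve on this test split.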

<br/>

However, Organization ID may not be the most useful feature, because personal accounts may not be associated with any particular organization. Subsetting the data to organization accounts might give a more detailed picture of how organization membership influences user adoption.