# Problem Statement:

Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one seven-day period, **identify which factors predict future user
adoption**.

We suggest spending 1-2 hours on this, but you're welcome to spend more or less.
Please send us a brief writeup of your findings (the more concise, the better -- no more
than one page), along with any summary tables, graphs, code, or queries that can help
us understand your approach. Please note any factors you considered or investigation
you did, even if they did not pan out. Feel free to identify any further research or data
you think would be valuable.

The data is available as two attached CSV files:
1. takehome_user_engagement.csv
2. takehome_users.csv

The data has the following two tables:

1. A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. This table includes:
    1. name: the user's name
    2. object_id: the user's id
    3. email: email address
    4. creation_source: how their account was created. This takes on one of five values:
        1. PERSONAL_PROJECTS: invited to join another user's personal workspace
        2. GUEST_INVITE: invited to an organization as a guest (limited permissions)
        3. ORG_INVITE: invited to an organization (as a full member)
        4. SIGNUP: signed up via the website
        5. SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)
    5. creation_time: when they created their account
    6. last_session_creation_time: unix timestamp of last login
    7. opted_in_to_mailing_list: whether they have opted into receiving marketing emails
    8. enabled_for_marketing_drip: whether they are on the regular marketing email drip
    9. org_id: the organization (group of users) they belong to
    10. invited_by_user_id: which user invited them to join (if applicable).

2. A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.

# Strategy:

1. Load data
2. Determine which users are "adopted users"
3. Find which features correlate the most with the adopted users
4. Use classification feature selection to determine feature importances

### Imports

In [1]:
import numpy as np
import pandas as pd
import datetime as dt
import time

### Load Data

In [2]:
user_eng_raw = pd.read_csv('takehome_user_engagement.csv')
users_raw = pd.read_csv('takehome_users.csv')

In [3]:
user_eng = user_eng_raw.copy()
user_eng.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [4]:
users = users_raw.copy()
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


### Determine the "Adopted Users"

In [5]:
# order the entries
user_eng.set_index(['user_id'], inplace=True)
user_eng.sort_index(inplace=True)
user_eng.head()

user_id,time_stamp,visited
1,2014-04-22 03:53:30,1
2,2013-11-15 03:45:04,1
2,2013-11-29 03:45:04,1
2,2013-12-09 03:45:04,1
2,2013-12-25 03:45:04,1


In [6]:
# convert time_stamps into datetimes
user_eng['time_stamp'] = user_eng['time_stamp'].apply(lambda time: dt.datetime.strptime(time, '%Y-%m-%d %H:%M:%S'))
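
The row-by-row `strptime` call works, but pandas can parse the whole column in one call. A minimal sketch, assuming the same `%Y-%m-%d %H:%M:%S` layout; the Series `raw` is a hypothetical stand-in for `user_eng['time_stamp']`:

```python
import pandas as pd

# hypothetical stand-in for user_eng['time_stamp'] (strings as read from the CSV)
raw = pd.Series(["2014-04-22 03:53:30", "2013-11-15 03:45:04"])

# parse the whole column at once; an explicit format avoids per-value guessing
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

print(parsed.dt.year.tolist())
```

The result is a datetime column, so the `.dt` accessor (for example `.dt.normalize()` to truncate to the calendar day) becomes available for later steps.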

In [7]:
# initialize users as not adopted
users['adopted'] = False
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,False
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,False
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,False
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,False
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,False


In [8]:
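# convert creation_times into UNIX timestamps
# NOTE: time.mktime() interprets each naive datetime in the machine's local timezone,
# so the result can be offset from last_session_creation_time; in the output below,
# user 1's creation_time lands 18,000 s (5 hours) after their last session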
users['creation_time'] = users['creation_time'].apply(lambda time: dt.datetime.strptime(time, '%Y-%m-%d %H:%M:%S'))
users['creation_time'] = users['creation_time'].apply(lambda date_time: time.mktime(date_time.timetuple()))
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted
0,1,1398157000.0,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,False
1,2,1384509000.0,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,False
2,3,1363753000.0,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,False
3,4,1369142000.0,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,False
4,5,1358439000.0,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,False
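
Because `time.mktime()` depends on the local timezone of the machine running the notebook, the converted values are not reproducible across machines. The sketch below shows one timezone-independent conversion, under the assumption that the `creation_time` strings are recorded in UTC (the source does not say so explicitly); the Series `created` is a hypothetical stand-in for the column:

```python
import pandas as pd

# hypothetical stand-in for users['creation_time'], assumed to be UTC wall-clock strings
created = pd.to_datetime(pd.Series(["2014-04-22 03:53:30", "1970-01-01 00:00:10"]))

# whole seconds since the Unix epoch, with no dependence on the local timezone
epoch_seconds = (created - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(epoch_seconds.tolist())
```

Under that assumption, the first value is 1398138810, which is consistent with the `last_session_creation_time` of about 1398139000 displayed for user 1.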


In [9]:
# drop identifying columns (name, email) and invited_by_user_id, which are not used as features
users.drop(['name', 'email', 'invited_by_user_id'], axis=1, inplace=True)
users.head()

Unnamed: 0,object_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted
0,1,1398157000.0,GUEST_INVITE,1398139000.0,1,0,11,False
1,2,1384509000.0,ORG_INVITE,1396238000.0,0,0,1,False
2,3,1363753000.0,ORG_INVITE,1363735000.0,0,0,94,False
3,4,1369142000.0,GUEST_INVITE,1369210000.0,0,0,1,False
4,5,1358439000.0,GUEST_INVITE,1358850000.0,0,0,193,False


In [10]:
# unique user_id
user_ids = user_eng.index.unique()

# for each user
for i in user_ids:

    # initialize/reset adopted status
    adopted = False

    # copy this user's visits in chronological order
    user_visits = user_eng[user_eng.index == i].sort_values('time_stamp')

    # fewer than 3 visits can never satisfy the definition
    if len(user_visits) >= 3:

        # slide the window start (left index) one visit at a time
        for j in range(len(user_visits)):

            # extend the window end (right index) from the window start onwards
            for k in range(j, len(user_visits)):

                # if window's time delta is greater than 7 days, break loop to slide window start
                timedelta = (user_visits.iloc[k]['time_stamp'] - user_visits.iloc[j]['time_stamp'])
                if timedelta > dt.timedelta(days=7):
                    break

                # count unique calendar days in the window, including visit k
                date_stamps = user_visits.iloc[j:k + 1]['time_stamp'].apply(lambda date: date.strftime('%Y-%m-%d'))
                if len(date_stamps.unique()) >= 3:
                    adopted = True
                    break

            # user considered "adopted"
            if adopted:
                # update user adopted status
                users.loc[users['object_id'] == i, 'adopted'] = True
                # move on to the next user
                break
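
The nested loops above re-slice the frame for every window, which gets slow for users with many logins. A vectorized alternative is sketched below. The function `find_adopted_users` and its parameters are hypothetical names, and it assumes that "seven-day period" means login dates at most six calendar days apart (raise `window_days` for a looser reading):

```python
import pandas as pd

def find_adopted_users(engagement, window_days=7, min_days=3):
    """Return ids of users with `min_days` distinct login days inside one window."""
    logins = engagement[["user_id", "time_stamp"]].copy()
    # truncate each login to its calendar day
    logins["day"] = pd.to_datetime(logins["time_stamp"]).dt.normalize()
    # one row per user per day, in chronological order within each user
    logins = logins.drop_duplicates(["user_id", "day"]).sort_values(["user_id", "day"])
    # the login day (min_days - 1) positions later for the same user (NaT if none)
    later = logins.groupby("user_id")["day"].shift(-(min_days - 1))
    # the window is satisfied when those days fall inside `window_days` calendar days
    hit = (later - logins["day"]) < pd.Timedelta(days=window_days)
    return set(logins.loc[hit, "user_id"])

# hypothetical toy data: only user 2 has three login days within one week
demo = pd.DataFrame({
    "user_id": [1, 2, 2, 2, 3, 3, 3],
    "time_stamp": [
        "2014-04-22 03:53:30",
        "2014-02-03 03:45:04", "2014-02-05 03:45:04", "2014-02-09 03:45:04",
        "2013-11-15 03:45:04", "2013-11-29 03:45:04", "2013-12-09 03:45:04",
    ],
})
adopted_ids = find_adopted_users(demo)
print(adopted_ids)
```

With the notebook's frames, the flag could then be set with something like `users['adopted'] = users['object_id'].isin(adopted_ids)`, provided `user_id` is available as a column rather than as the index.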

### Find which features correlate the most with the adopted users

In [11]:
# check types
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  float64
 2   creation_source             12000 non-null  object 
 3   last_session_creation_time  8823 non-null   float64
 4   opted_in_to_mailing_list    12000 non-null  int64  
 5   enabled_for_marketing_drip  12000 non-null  int64  
 6   org_id                      12000 non-null  int64  
 7   adopted                     12000 non-null  bool   
dtypes: bool(1), float64(2), int64(4), object(1)
memory usage: 668.1+ KB


#### Do I want to pandas.factorize() the categorical data?

In [15]:
users['creation_source'], users_creation_source_uniques = users['creation_source'].factorize()

In [19]:
users.head()

Unnamed: 0,object_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted
0,1,1398157000.0,0,1398139000.0,1,0,11,False
1,2,1384509000.0,1,1396238000.0,0,0,1,False
2,3,1363753000.0,1,1363735000.0,0,0,94,False
3,4,1369142000.0,0,1369210000.0,0,0,1,False
4,5,1358439000.0,0,1358850000.0,0,0,193,False
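
`factorize()` maps each category to an arbitrary integer, which implies an ordering (0 < 1 < 2 ...) that `creation_source` does not have. One common alternative is one-hot encoding. A minimal sketch with `pd.get_dummies`, using a hypothetical `demo` frame and an illustrative `source` prefix:

```python
import pandas as pd

# hypothetical sample of the creation_source column before factorizing
demo = pd.DataFrame(
    {"creation_source": ["GUEST_INVITE", "ORG_INVITE", "SIGNUP", "ORG_INVITE"]}
)

# one 0/1 indicator column per category, so no ordering is implied
one_hot = pd.get_dummies(demo["creation_source"], prefix="source", dtype=int)

print(one_hot)
```

Indicator columns are also non-negative counts, which suits chi-squared scoring better than arbitrary integer codes do.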


In [23]:
users_no_nan = users.dropna(axis=0)
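
According to `users.info()`, only 8,823 of the 12,000 rows have a `last_session_creation_time`, so `dropna` discards 3,177 users. These are presumably users who never logged in (an assumption; the source does not confirm it), and they would all count as non-adopted. A sketch of keeping them by recording the missingness as a feature instead; the `demo` frame and the `has_logged_in` column are hypothetical:

```python
import numpy as np
import pandas as pd

# hypothetical sample with one user who has no recorded session
demo = pd.DataFrame({
    "object_id": [1, 2, 3],
    "last_session_creation_time": [1398138810.0, np.nan, 1363734892.0],
})

# keep every row and record whether a session timestamp exists
demo["has_logged_in"] = demo["last_session_creation_time"].notna()
n_missing = int(demo["last_session_creation_time"].isna().sum())

print(n_missing, demo["has_logged_in"].tolist())
```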

In [24]:
X = users_no_nan.drop('adopted', axis=1)
y = users_no_nan[['adopted']]
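
Before running a feature selector, a per-category adoption rate gives a quick, readable check of how a categorical feature relates to the target. A minimal sketch on hypothetical toy data, with string labels kept for readability (in the notebook the column has already been factorized):

```python
import pandas as pd

# hypothetical toy data standing in for the users table
demo = pd.DataFrame({
    "creation_source": ["SIGNUP", "SIGNUP", "ORG_INVITE", "ORG_INVITE", "GUEST_INVITE"],
    "adopted": [True, False, False, False, True],
})

# group size and share of adopted users per creation source
summary = (
    demo.groupby("creation_source")["adopted"]
    .agg(users="size", adoption_rate="mean")
    .sort_values("adoption_rate", ascending=False)
)

print(summary)
```

Reporting the group size next to the rate helps avoid over-reading rates computed from very few users.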

### Use classification feature selection to determine feature importances

In [35]:
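# chi-squared feature selection (score_func=chi2); pass score_func=f_classif for ANOVA F-scores instead
# note: chi2 expects non-negative, count-like features, so large-valued columns such as
# ids and timestamps tend to receive large scores regardless of their relevance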
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, chi2
# define feature selection
fs = SelectKBest(score_func=chi2, k=4)
# apply feature selection
X_selected = fs.fit_transform(X, np.ravel(y))
print(X_selected.shape)

(8823, 4)


In [36]:
fs.get_feature_names_out()

array(['object_id', 'creation_time', 'last_session_creation_time',
       'org_id'], dtype=object)
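
The four selected columns are also the four with the largest raw magnitudes, which is worth checking before reading them as predictive. The sketch below uses synthetic data (not the take-home data) to illustrate that scikit-learn's `chi2` score grows with the units of a feature, while the ANOVA F-score from `f_classif` does not. Both features are generated independently of the target, and all variable names are illustrative:

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)

# synthetic binary target and two features drawn independently of it
y = np.repeat([0, 1], 50)
X = np.column_stack([
    rng.integers(0, 2, size=100),      # binary flag
    rng.integers(1, 1000, size=100),   # id-like column with large values
])

# score the same features in their original units and multiplied by 1000
chi2_raw, _ = chi2(X, y)
chi2_scaled, _ = chi2(X * 1000, y)
f_raw, _ = f_classif(X, y)
f_scaled, _ = f_classif(X * 1000, y)

print("chi2 ratio:", chi2_scaled / chi2_raw)
print("F ratio:   ", f_scaled / f_raw)
```

Rescaling the features multiplies the chi-squared scores by the same factor but leaves the F-scores unchanged, so a ranking produced by `chi2` across columns with very different scales (ids, Unix timestamps, 0/1 flags) mostly reflects scale. Scoring with `f_classif`, or applying `chi2` only to non-negative count or indicator features, may give a more meaningful ranking.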