Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

We suggest spending 1-2 hours on this, but you're welcome to spend more or less. Please send us a brief writeup of your findings (the more concise, the better; no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.

In [138]:
# import libraries 
import pandas as pd
import numpy as np

from datetime import datetime, timedelta

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

In [2]:
# import data
df_user = pd.read_csv('takehome_users.csv', encoding='ISO-8859-1')
df_usage = pd.read_csv('takehome_user_engagement.csv')

Getting familiar with what is in both dataframes.

In [5]:
# check the first few rows
df_user.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [10]:
# check the last few rows
df_user.tail()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
11995,11996,2013-09-06 06:14:15,Meier Sophia,SophiaMeier@gustr.com,ORG_INVITE,1378448000.0,0,0,89,8263.0
11996,11997,2013-01-10 18:28:37,Fisher Amelie,AmelieFisher@gmail.com,SIGNUP_GOOGLE_AUTH,1358275000.0,0,0,200,
11997,11998,2014-04-27 12:45:16,Haynes Jake,JakeHaynes@cuvox.de,GUEST_INVITE,1398603000.0,1,1,83,8074.0
11998,11999,2012-05-31 11:55:59,Faber Annett,mhaerzxp@iuxiw.com,PERSONAL_PROJECTS,1338638000.0,0,0,6,
11999,12000,2014-01-26 08:57:12,Lima Thaís,ThaisMeloLima@hotmail.com,SIGNUP,1390727000.0,0,1,0,


In [11]:
# check the first few rows
df_usage.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [12]:
# check the last few rows
df_usage.tail()

Unnamed: 0,time_stamp,user_id,visited
207912,2013-09-06 06:14:15,11996,1
207913,2013-01-15 18:28:37,11997,1
207914,2014-04-27 12:45:16,11998,1
207915,2012-06-02 11:55:59,11999,1
207916,2014-01-26 08:57:12,12000,1


In [13]:
# check additional information
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


There are 12000 rows in df_user. Two columns, last_session_creation_time and invited_by_user_id, have missing values.

In [14]:
# check additional information
df_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


There are 207917 rows in df_usage and no missing values.

In [15]:
# check summary statistic
df_user.describe()

Unnamed: 0,object_id,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
count,12000.0,8823.0,12000.0,12000.0,12000.0,6417.0
mean,6000.5,1379279000.0,0.2495,0.149333,141.884583,5962.957145
std,3464.24595,19531160.0,0.432742,0.356432,124.056723,3383.761968
min,1.0,1338452000.0,0.0,0.0,0.0,3.0
25%,3000.75,1363195000.0,0.0,0.0,29.0,3058.0
50%,6000.5,1382888000.0,0.0,0.0,108.0,5954.0
75%,9000.25,1398443000.0,0.0,0.0,238.25,8817.0
max,12000.0,1402067000.0,1.0,1.0,416.0,11999.0


In [16]:
# check summary statistic
df_usage.describe()

Unnamed: 0,user_id,visited
count,207917.0,207917.0
mean,5913.314197,1.0
std,3394.941674,0.0
min,1.0,1.0
25%,3087.0,1.0
50%,5682.0,1.0
75%,8944.0,1.0
max,12000.0,1.0


We can sort the usage table ("takehome_user_engagement") in descending order to see when the most recent login was recorded, and extract the date and time from the time_stamp column.

In [3]:
# sort values in the dataframe
df_usage = df_usage.sort_values(by = 'time_stamp', ascending=False)

In [4]:
# convert from string to datetime and extract date
df_usage['date'] = pd.to_datetime(df_usage['time_stamp']).dt.date

In [5]:
# create a column with time only
df_usage['time'] = pd.to_datetime(df_usage['time_stamp']).dt.time 

In [27]:
# check if changes were made
df_usage.head()

Unnamed: 0,time_stamp,user_id,visited,date,time
70763,2014-06-06 14:58:50,4051,1,2014-06-06,14:58:50
6053,2014-06-04 23:56:26,341,1,2014-06-04,23:56:26
168409,2014-06-04 23:46:31,9558,1,2014-06-04,23:46:31
162633,2014-06-04 23:34:04,9325,1,2014-06-04,23:34:04
84316,2014-06-04 23:32:13,4625,1,2014-06-04,23:32:13


First, let's check whether the most recently created account is at least a week old, by comparing the latest creation_time in df_user with the latest time_stamp in df_usage.

In [6]:
# sort values by creation_time in descending order in df_user
df_user.sort_values(by= 'creation_time', ascending=False)[:10]

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
6052,6053,2014-05-30 23:59:19,Armbruster Kandace,KandacePArmbruster@hotmail.com,GUEST_INVITE,1401494000.0,1,0,7,3421.0
3489,3490,2014-05-30 23:45:01,Oliveira Estevan,EstevanRochaOliveira@gmail.com,PERSONAL_PROJECTS,1401580000.0,0,0,31,
10163,10164,2014-05-30 23:27:30,Walsh Sam,SamWalsh@jourrapide.com,GUEST_INVITE,1401492000.0,0,0,302,5383.0
9687,9688,2014-05-30 23:12:01,Coveny Taj,TajCoveny@yahoo.com,GUEST_INVITE,1401664000.0,0,0,93,7296.0
6944,6945,2014-05-30 23:10:35,Sandoval Matthew,MatthewKSandoval@cuvox.de,SIGNUP,1401578000.0,0,0,5,
9872,9873,2014-05-30 22:41:18,Nissen Amalie,AmalieMNissen@gmail.com,SIGNUP,1401490000.0,1,0,230,
11394,11395,2014-05-30 22:34:31,Iversen Clara,ClaraGIversen@jourrapide.com,PERSONAL_PROJECTS,,0,0,3,
5583,5584,2014-05-30 21:34:28,Moore Sam,SamMoore@gustr.com,GUEST_INVITE,1401572000.0,0,0,86,8152.0
1888,1889,2014-05-30 21:24:54,Dickinson Savannah,SavannahDickinson@gmail.com,SIGNUP_GOOGLE_AUTH,1401485000.0,0,0,144,
9072,9073,2014-05-30 21:20:15,Griffiths Finley,FinleyGriffiths@gmail.com,GUEST_INVITE,1401658000.0,0,0,2,4202.0


In [7]:
# sort values by time_stamp in descending order in df_usage
df_usage.sort_values(by='time_stamp', ascending=False)[:10]

Unnamed: 0,time_stamp,user_id,visited,date,time
70763,2014-06-06 14:58:50,4051,1,2014-06-06,14:58:50
6053,2014-06-04 23:56:26,341,1,2014-06-04,23:56:26
168409,2014-06-04 23:46:31,9558,1,2014-06-04,23:46:31
162633,2014-06-04 23:34:04,9325,1,2014-06-04,23:34:04
84316,2014-06-04 23:32:13,4625,1,2014-06-04,23:32:13
137644,2014-06-04 23:30:50,7859,1,2014-06-04,23:30:50
98415,2014-06-04 23:28:26,5378,1,2014-06-04,23:28:26
33339,2014-06-04 23:21:13,2033,1,2014-06-04,23:21:13
40769,2014-06-04 23:14:30,2474,1,2014-06-04,23:14:30
201595,2014-06-04 23:13:01,11519,1,2014-06-04,23:13:01


We can see that the last account was created on 2014-05-30 and the last login was on 2014-06-06, which is 7 calendar days later, so every user has had an account for about a week and we don't need to eliminate any users for having too short a history.
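
The gap between those two events can be checked directly. A minimal sketch, with both timestamps copied from the tables above (the variable names are only illustrative):

```python
import pandas as pd

last_created = pd.Timestamp("2014-05-30 23:59:19")  # latest creation_time in df_user
last_login = pd.Timestamp("2014-06-06 14:58:50")    # latest time_stamp in df_usage

# elapsed time between the two events
elapsed = last_login - last_created

# difference between the calendar dates only (time of day ignored)
calendar_days = (last_login.normalize() - last_created.normalize()).days

print(elapsed)        # 6 days 14:59:31
print(calendar_days)  # 7
```

The calendar dates are 7 days apart, while the elapsed time is a little under 7 full days.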

Now let's check how many times each user has logged in over the 2 years.

In [8]:
df_usage.groupby('user_id')['visited'].count().sort_values(ascending=False)

user_id
3623    606
906     600
1811    593
7590    590
8068    585
       ... 
7314      1
7315      1
7316      1
7318      1
1         1
Name: visited, Length: 8823, dtype: int64

We can remove users who have logged in fewer than 3 times in total.

In [9]:
# create a copy of the dataframe
df_usage_copy = df_usage.copy()

In [10]:
# group by user_id, count number of visits and sort in descending order
df_usage_copy.groupby('user_id')['visited'].count().sort_values(ascending=False)[2400:]

user_id
2603     2
10682    2
2900     2
2602     2
2600     2
        ..
7314     1
7315     1
7316     1
7318     1
1        1
Name: visited, Length: 6423, dtype: int64

In [11]:
# create a dataframe with users who have at least 3 logged visits
df_user_three_times_all = df_usage_copy[df_usage_copy.groupby('user_id').visited.transform(len)>=3]

In [13]:
df_user_three_times_all.head()

Unnamed: 0,time_stamp,user_id,visited,date,time
6053,2014-06-04 23:56:26,341,1,2014-06-04,23:56:26
168409,2014-06-04 23:46:31,9558,1,2014-06-04,23:46:31
162633,2014-06-04 23:34:04,9325,1,2014-06-04,23:34:04
84316,2014-06-04 23:32:13,4625,1,2014-06-04,23:32:13
137644,2014-06-04 23:30:50,7859,1,2014-06-04,23:30:50


Now we will sort the dataframe by time_stamp in ascending order, to make the next steps easier.

In [14]:
# sort dataframe in ascending order by time_stamp
df_user_three_times_all = df_user_three_times_all.sort_values(by='time_stamp', ascending=True)

In [15]:
df_user_three_times_all.head(10)

Unnamed: 0,time_stamp,user_id,visited,date,time
59486,2012-05-31 15:47:36,3428,1,2012-05-31,15:47:36
26821,2012-05-31 21:58:33,1693,1,2012-05-31,21:58:33
140780,2012-06-01 20:02:35,8068,1,2012-06-01,20:02:35
60374,2012-06-02 00:28:47,3514,1,2012-06-02,00:28:47
126542,2012-06-02 06:23:51,7170,1,2012-06-02,06:23:51
139177,2012-06-03 10:28:01,7991,1,2012-06-03,10:28:01
43193,2012-06-03 16:44:54,2568,1,2012-06-03,16:44:54
108508,2012-06-03 20:33:31,6047,1,2012-06-03,20:33:31
60375,2012-06-04 00:28:47,3514,1,2012-06-04,00:28:47
126543,2012-06-04 06:23:51,7170,1,2012-06-04,06:23:51


Let's see how many users have at least 3 visits in 2 years.

In [16]:
# check number of users
df_user_three_times_all.user_id.nunique()

2248

We have 2248 distinct users who logged into the product at least 3 times over the 2 years.

In [17]:
# create a list of users
users_list  = list(df_user_three_times_all.user_id.unique())

In [122]:
len(users_list)

2248

Create a dataframe with user_id and a new column 'test_three' that flags users who had 3 logins in a 7-day period. For now we fill it with 0 and later set it to 1 where the condition holds.

In [18]:
# initialize data of lists
data_1 = {'user_id': users_list, 'test_three': 0} 
  
# create dataframe 
df_1 = pd.DataFrame(data_1) 
  
# print the output 
df_1 

Unnamed: 0,user_id,test_three
0,3428,0
1,1693,0
2,8068,0
3,3514,0
4,7170,0
5,7991,0
6,2568,0
7,6047,0
8,2973,0
9,9345,0


Now we are ready to find users who have 3 logins in a 7-day period. First we find all unique login dates for each user. Then we check whether any 3 login dates fall within 7 days; if they do, we change the value in column 'test_three' from 0 to 1 in df_1.

In [19]:
# find all unique login dates for a user, sorted in ascending order
def select_dates(df, user):
    return sorted(df.loc[df.user_id == user, 'date'].unique())

# find 3 login dates in a 7-day period:
# compare each date with the date two positions later in the sorted list
def three_days(df, user):
    all_dates = select_dates(df, user)
    x = [1 for d in range(len(all_dates)-2) if pd.to_datetime(all_dates[d+2]) - pd.to_datetime(all_dates[d]) <= timedelta(days=7)]
    return x

# if there are 3 days in a 7-day period change value in column 'test_three' to 1
def users_with_three_days(df, df2, users_list):
    for user in users_list:
        if len(three_days(df, user)) > 0:
            # .loc avoids chained assignment, which pandas may apply to a temporary copy
            df2.loc[df2.user_id == user, 'test_three'] = 1
    return df2

In [22]:
print(users_with_three_days(df_user_three_times_all, df_1, users_list))

      user_id  test_three
0        3428           1
1        1693           1
2        8068           1
3        3514           0
4        7170           0
...       ...         ...
2243    10277           1
2244     5882           0
2245     2940           0
2246    10751           0
2247      479           1

[2248 rows x 2 columns]


In [23]:
# print number of users with and without 3 login days in a 7-day period
df_1.test_three.value_counts()

1    1656
0     592
Name: test_three, dtype: int64

We have 1656 users with 3 login days in a 7-day period.
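
The per-user loop above rescans the dataframe for every user. The same rule (the first and third of three distinct login dates are at most 7 days apart) can also be expressed with a grouped shift. This is a sketch on toy data, not the code that produced the counts above; `adopted_flags`, `window_days` and `toy` are hypothetical names introduced for illustration.

```python
import pandas as pd

def adopted_flags(usage: pd.DataFrame, window_days: int = 7) -> pd.Series:
    """Boolean Series indexed by user_id: True if the user has three distinct
    login dates whose first and third are at most `window_days` apart."""
    days = (
        usage.assign(day=pd.to_datetime(usage["time_stamp"]).dt.normalize())
        .drop_duplicates(["user_id", "day"])   # one row per user per calendar day
        .sort_values(["user_id", "day"])
    )
    # for each login day, the login day two rows later for the same user
    third = days.groupby("user_id")["day"].shift(-2)
    # NaT (fewer than two later logins) compares as False
    hit = (third - days["day"]) <= pd.Timedelta(days=window_days)
    return hit.groupby(days["user_id"]).any()

# toy data: user 1 logs in on three days within a week, users 2 and 3 do not
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3],
    "time_stamp": [
        "2014-01-01 10:00:00", "2014-01-03 09:00:00", "2014-01-06 08:00:00",
        "2014-01-01 10:00:00", "2014-01-20 10:00:00", "2014-02-10 10:00:00",
        "2014-01-01 10:00:00",
    ],
    "visited": 1,
})

flags = adopted_flags(toy)
print(flags)
```

Because the comparison is `<=` on a 7-day difference, logins on day 1 and day 8 still count, exactly as in the functions above; using `window_days=6` would give a stricter reading of "seven-day period".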

Before joining df_user and df_1, let's rename columns in df_1.

In [47]:
# renaming columns
df_user_three = df_1.rename(columns={'user_id': 'object_id', 'test_three': 'adopted_user'})

In [48]:
df_user_three.head()

Unnamed: 0,object_id,adopted_user
0,3428,1
1,1693,1
2,8068,1
3,3514,0
4,7170,0


In [53]:
# number of unique users in df_user
df_user.object_id.nunique()

12000

In [50]:
# join df_user and df_user_three
df_joined2 = df_user.merge(df_user_three, how='left', on='object_id')

In [51]:
# check number of rows and column in joined table
df_joined2.shape

(12000, 11)

In [56]:
# count number of values for each group of users
df_joined2.adopted_user.value_counts()

1.0    1656
0.0     592
Name: adopted_user, dtype: int64

In [57]:
# check unique groups of users
df_joined2.adopted_user.unique()

array([nan,  1.,  0.])

The missing values belong to users who were filtered out earlier because they have fewer than 3 logins in total. We can fill them with 0, since these users cannot have 3 login days in a 7-day period.

In [58]:
# fill empty rows (assign back instead of calling inplace on a column selection)
df_joined2['adopted_user'] = df_joined2['adopted_user'].fillna(0)

In [59]:
# count users who have and don't have 3 login days in a 7-day period
df_joined2.adopted_user.value_counts()

0.0    10344
1.0     1656
Name: adopted_user, dtype: int64

Overall, we have 1656 users with at least 3 login days in some 7-day period and 10344 users without.
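
The join-and-fill step can be illustrated on a tiny example. This is a sketch with made-up frames (`users`, `labels`); it also casts the label back to an integer, since the NaN values introduced by the left join turn the column into floats.

```python
import pandas as pd

# hypothetical stand-ins for df_user and df_user_three
users = pd.DataFrame({"object_id": [1, 2, 3, 4]})
labels = pd.DataFrame({"object_id": [2, 4], "adopted_user": [1, 0]})

# left join keeps every user; users missing from `labels` get NaN
merged = users.merge(labels, how="left", on="object_id")

# treat "no label" as "not adopted" and restore an integer dtype
merged["adopted_user"] = merged["adopted_user"].fillna(0).astype(int)

print(merged)
```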

Before building a model we have to check if columns need any changes.

In [60]:
# get information about the dataframe
df_joined2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 11 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
adopted_user                  12000 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 1.1+ MB


We have 12000 rows, and two columns, last_session_creation_time and invited_by_user_id, have missing values. We also need to change the data type of some columns.

In [61]:
# check first few rows
df_joined2.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,1.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,0.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,0.0


In [62]:
# change column to datetime type
df_joined2['creation_time'] = pd.to_datetime(df_joined2['creation_time'])

Let's also fill invited_by_user_id null values with 0. User IDs start at 1, so 0 can stand for "not invited by anyone".

In [63]:
# fill empty rows (assign back instead of calling inplace on a column selection)
df_joined2['invited_by_user_id'] = df_joined2['invited_by_user_id'].fillna(0)

We can drop name and email columns since these should be unique to every user and even if they aren't, it doesn't make sense to say that your email or name should be a factor in whether or not you are an adopted user.

In [119]:
# drop columns
df_joined3 = df_joined2.drop(['name', 'email'], axis=1)

In [120]:
df_joined3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 9 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null datetime64[ns]
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null float64
adopted_user                  12000 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(4), object(1)
memory usage: 937.5+ KB


Since the creation_time column has a datetime type, which cannot be passed to the model directly, we can replace these values with the number of days between the creation of each user's account and the creation of the first account in the dataframe.

In [122]:
# subtract time when first account was created from all other times when each user account was created
creation_time_new = (pd.to_datetime(df_joined3.creation_time) - pd.to_datetime(min(df_joined3.creation_time))).dt.days

In [98]:
creation_time_new

0        691
1        533
2        292
3        355
4        231
        ... 
11995    463
11996    224
11997    696
11998      0
11999    605
Name: creation_time, Length: 12000, dtype: int64

In [123]:
# change old column with new
df_joined3.creation_time = creation_time_new

In [124]:
# check the creation_source column
df_joined3.creation_source.value_counts() 

ORG_INVITE            4254
GUEST_INVITE          2163
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: creation_source, dtype: int64

Since creation_source is a categorical feature, we need to one-hot-encode it.

In [125]:
# one column for each category in creation_source column
df_joined3_dummy = pd.get_dummies(df_joined3, columns=['creation_source'])

In [111]:
# check data type for each column
df_joined3_dummy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 1 to 12000
Data columns (total 12 columns):
creation_time                         12000 non-null int64
last_session_creation_time            12000 non-null float64
opted_in_to_mailing_list              12000 non-null int64
enabled_for_marketing_drip            12000 non-null int64
org_id                                12000 non-null int64
invited_by_user_id                    12000 non-null float64
adopted_user                          12000 non-null float64
creation_source_GUEST_INVITE          12000 non-null uint8
creation_source_ORG_INVITE            12000 non-null uint8
creation_source_PERSONAL_PROJECTS     12000 non-null uint8
creation_source_SIGNUP                12000 non-null uint8
creation_source_SIGNUP_GOOGLE_AUTH    12000 non-null uint8
dtypes: float64(3), int64(4), uint8(5)
memory usage: 808.6 KB


The last step before building a model is to change the object_id column to index, since it is unique for each user.

In [126]:
# change object_id column into index
df_joined3_dummy.set_index('object_id', inplace=True)
df_joined3_dummy.head()

object_id,creation_time,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
1,691,1398139000.0,1,0,11,10803.0,0.0,1,0,0,0,0
2,533,1396238000.0,0,0,1,316.0,1.0,0,1,0,0,0
3,292,1363735000.0,0,0,94,1525.0,0.0,0,1,0,0,0
4,355,1369210000.0,0,0,1,5151.0,0.0,1,0,0,0,0
5,231,1358850000.0,0,0,193,5240.0,0.0,1,0,0,0,0


Now, we are ready to build a model and find which feature affects adopted_user the most.


We'll use a Random Forest Classifier because it generally performs well in classification tasks, is usually more accurate than a single decision tree, is less prone to overfitting, and reports which features are important.

First we will build a model without the last_session_creation_time column, because we assume that the time of the last login is strongly correlated with being an adopted user.
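
One way to look at this assumption more closely would be to convert last_session_creation_time into a date. The values appear to be Unix timestamps in seconds (1398139000 falls on 2014-04-22, the creation date of user 1). The sketch below uses three made-up rows; `active_days` and `never_logged_in` are hypothetical feature names, not columns used elsewhere in this notebook.

```python
import pandas as pd

# made-up rows in the shape of df_user; the third user has no recorded session
users = pd.DataFrame({
    "object_id": [1, 2, 3],
    "creation_time": ["2014-04-22 03:53:30", "2013-11-15 03:45:04", "2013-03-19 23:14:52"],
    "last_session_creation_time": [1398228810.0, 1396241104.0, None],
})

created = pd.to_datetime(users["creation_time"])
# interpret the float column as seconds since the Unix epoch; NaN becomes NaT
last_seen = pd.to_datetime(users["last_session_creation_time"], unit="s")

# whole days between account creation and the last session (NaN if no session)
users["active_days"] = (last_seen - created).dt.days
users["never_logged_in"] = last_seen.isna()

print(users[["object_id", "active_days", "never_logged_in"]])
```

A long span between creation and last session mostly reflects continued use, which is close to the definition of adoption itself, so a feature like this describes adopted users rather than predicting them in advance.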

In [127]:
# labels are the values we want to predict
labels_1 = np.array(df_joined3_dummy['adopted_user']) 

# remove the labels from the features
# axis 1 refers to the columns
features_1 = df_joined3_dummy.drop(['adopted_user', 'last_session_creation_time'], axis=1)

# saving feature names for later use
feature_list_1 = list(features_1.columns) 

# convert to numpy array
features_1 = np.array(features_1)

# split dataset into training set and test set, 70% training and 30% test
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(features_1, labels_1, test_size=0.3, random_state=42)

We will use a randomized search over a parameter grid to find good hyperparameters.

In [128]:
# number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 500, num = 10)]

# number of features to consider at every split
# ('auto' is equivalent to 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt']

# maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# method of selecting samples for training each tree
bootstrap = [True, False]

# create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [129]:
# first create the base model to tune
rf_1 = RandomForestClassifier()

# random search of parameters, using 3 fold cross validation, 
# search across 5 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf_1, param_distributions = random_grid, n_iter = 5, cv = 3, verbose=2, random_state=42, n_jobs = -1) 

# fit the random search model
rf_random.fit(X_train_1, y_train_1)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    7.8s finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None

In [133]:
rf_random.best_params_

{'n_estimators': 20,
 'min_samples_split': 10,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 50,
 'bootstrap': True}
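
Instead of copying the best parameters by hand, the fitted search object exposes the refit model as `best_estimator_`. The sketch below runs on synthetic data from `make_classification` with roughly the class balance seen here (about 14% positives). The `scoring="roc_auc"`, `class_weight="balanced"` and `stratify` settings are assumptions added for illustration, not settings used in the cells of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# synthetic, imbalanced stand-in for the real feature matrix
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.86, 0.14], random_state=42)

# stratify keeps the share of positives equal in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions={
        "n_estimators": [20, 50],
        "max_depth": [5, 10, None],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=3,
    cv=3,
    scoring="roc_auc",   # rank candidates by ROC AUC instead of accuracy
    random_state=42,
)
search.fit(X_tr, y_tr)

# with the default refit=True, this model is already trained on the full training split
best_model = search.best_estimator_
print(search.best_params_)
```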

In [134]:
# instantiate model 
rf_random_search = RandomForestClassifier(n_estimators = 20, min_samples_split = 10, min_samples_leaf = 2, max_features= 'sqrt', max_depth  = 50, bootstrap = True)

# train the model on training data
rf_random_search.fit(X_train_1, y_train_1)

# use the forest's predict method on the test data
rd_pred_1 = rf_random_search.predict(X_test_1)

In [135]:
print('ROC_AUC_score:', metrics.roc_auc_score(y_test_1, rf_random_search.predict(X_test_1)))
print('F1_score:', metrics.f1_score(y_test_1, rf_random_search.predict(X_test_1)))

ROC_AUC_score: 0.499
F1_score: 0.014953271028037384
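
Note that `roc_auc_score` is called above with hard class predictions. ROC AUC is normally computed from scores or probabilities, which `predict_proba` provides. A sketch on synthetic data (not the notebook's features) showing both calls side by side:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic, imbalanced data standing in for the real features and labels
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.86, 0.14], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# hard 0/1 predictions: the ROC curve collapses to a single threshold
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# probability of the positive class: the ROC curve uses every threshold
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```

With 0/1 predictions the value reduces to the average of the true positive rate and the true negative rate at one threshold, so a model that almost never predicts the positive class scores close to 0.5 even if its probabilities carry some ranking signal.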


In [136]:
# create pandas Series from feature importance
feature_imp_rs = pd.Series(rf_random_search.feature_importances_, index=feature_list_1).sort_values(ascending=False)
feature_imp_rs

creation_time                         0.394369
org_id                                0.350754
invited_by_user_id                    0.192219
opted_in_to_mailing_list              0.018444
enabled_for_marketing_drip            0.013261
creation_source_PERSONAL_PROJECTS     0.010889
creation_source_ORG_INVITE            0.005550
creation_source_SIGNUP                0.005057
creation_source_GUEST_INVITE          0.004835
creation_source_SIGNUP_GOOGLE_AUTH    0.004623
dtype: float64

Now, let's build a model with GridSearch to see if we can get better performance than with Random Search.

In [140]:
# create the parameter grid based on the results of random search 
param_grid_t = {
    'bootstrap': [True],
    'max_depth': [40, 50, 60, 80],
    'n_estimators': [30, 50, 60, 100]
}


# create a base model
RF2 = RandomForestClassifier()

# instantiate the grid search model
grid_search = GridSearchCV(estimator = RF2, param_grid = param_grid_t, cv = 3, n_jobs = -1, verbose = 2)

# fit the grid search to the data
grid_search.fit(X_train_1, y_train_1)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    5.6s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:    7.9s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid

In [141]:
# print scores
print('ROC_AUC_score:', metrics.roc_auc_score(y_test_1, grid_search.predict(X_test_1)))
print('F1_score:', metrics.f1_score(y_test_1, grid_search.predict(X_test_1)))

ROC_AUC_score: 0.5005483870967742
F1_score: 0.0411663807890223


In [143]:
best_grid = grid_search.best_estimator_

In [144]:
# create pandas Series from feature importance
feature_imp_grid = pd.Series(best_grid.feature_importances_, index=feature_list_1).sort_values(ascending=False)
feature_imp_grid

creation_time                         0.415810
org_id                                0.347931
invited_by_user_id                    0.182196
opted_in_to_mailing_list              0.015838
enabled_for_marketing_drip            0.012917
creation_source_PERSONAL_PROJECTS     0.006958
creation_source_ORG_INVITE            0.005302
creation_source_GUEST_INVITE          0.004787
creation_source_SIGNUP_GOOGLE_AUTH    0.004243
creation_source_SIGNUP                0.004017
dtype: float64

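In this model, creation_time and org_id carry the most importance, which suggests that when a user created their account and which organization they belong to affect whether the user becomes an adopted user. However, with a ROC AUC of about 0.5 the model is no better than chance on the test set, so these importances should be read with caution.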
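
Impurity-based importances from `feature_importances_` tend to favor features with many distinct values, such as creation_time, org_id and invited_by_user_id. Permutation importance on held-out data is one way to cross-check them. The sketch below uses synthetic data and placeholder feature names; it was not run on the notebook's dataframe.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic data with placeholder feature names
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=42)
names = [f"feature_{i}" for i in range(X.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# shuffle one column at a time on the test set and measure the drop in ROC AUC
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)

perm_imp = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)
print(perm_imp)
```

A feature whose permutation importance is near zero on the test set contributes little to held-out performance, whatever its impurity-based score.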

Now, let's build a model that includes last_session_creation_time.

For last_session_creation_time, we assume that nulls mean the user never created a session, i.e. never logged in. For these users, we replace the null values with 0, which the model treats as a login a very long time ago.

In [145]:
df_joined3_dummy.loc[df_joined3_dummy.last_session_creation_time.isnull(), 'last_session_creation_time'] = 0

In [146]:
# labels are the values we want to predict
labels_2 = np.array(df_joined3_dummy['adopted_user']) 

# remove the labels from the features
# axis 1 refers to the columns
features_2 = df_joined3_dummy.drop('adopted_user', axis=1)

# saving feature names for later use
feature_list_2 = list(features_2.columns) 

# convert to numpy array
features_2 = np.array(features_2)

# split dataset into training set and test set, 70% training and 30% test
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(features_2, labels_2, test_size=0.3, random_state=42)

In [148]:
# number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 500, num = 10)]

# number of features to consider at every split
# ('auto' is equivalent to 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt']

# maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# method of selecting samples for training each tree
bootstrap = [True, False]

# create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


# use the random grid to search for best hyperparameters
# first create the base model to tune
rf_2 = RandomForestClassifier()

# random search of parameters, using 3 fold cross validation, 
# search across 5 different combinations, and use all available cores
rf_random_2 = RandomizedSearchCV(estimator = rf_2, param_distributions = random_grid, n_iter = 5, cv = 3, verbose=2, random_state=42, n_jobs = -1) 

# fit the random search model
rf_random_2.fit(X_train_2, y_train_2)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    6.4s finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None

In [149]:
# print scores
print('ROC_AUC_score:', metrics.roc_auc_score(y_test_2, rf_random_2.predict(X_test_2)))
print('F1_score:', metrics.f1_score(y_test_2, rf_random_2.predict(X_test_2)))

ROC_AUC_score: 0.8909354838709678
F1_score: 0.8630887185104053
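
The scores above pass hard 0/1 labels from `predict` to `roc_auc_score`, so the ROC curve is evaluated at a single threshold. ROC AUC is more commonly computed from the predicted probability of the positive class. The sketch below shows both variants side by side. It uses synthetic data from `make_classification` as a stand-in for `features_2` and `labels_2`, and the names `X_demo`, `y_demo` and `rf_demo` are illustrative only, so the printed numbers say nothing about the take-home data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# synthetic stand-in for features_2 / labels_2 (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions (the approach used in the cell above)
auc_labels = metrics.roc_auc_score(y_te, rf_demo.predict(X_te))

# AUC from the predicted probability of the positive class (column 1)
auc_proba = metrics.roc_auc_score(y_te, rf_demo.predict_proba(X_te)[:, 1])

print('ROC_AUC (labels):', auc_labels)
print('ROC_AUC (probabilities):', auc_proba)
```

On the real data the same idea would be `rf_random_2.predict_proba(X_test_2)[:, 1]` in place of `rf_random_2.predict(X_test_2)`.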


In [150]:
# create a pandas Series from the feature importances
feature_imp_rs2 = pd.Series(rf_random_2.best_estimator_.feature_importances_, index=feature_list_2).sort_values(ascending=False)
feature_imp_rs2

last_session_creation_time            0.685871
creation_time                         0.208156
org_id                                0.053928
invited_by_user_id                    0.032585
opted_in_to_mailing_list              0.003948
creation_source_PERSONAL_PROJECTS     0.003737
enabled_for_marketing_drip            0.003621
creation_source_GUEST_INVITE          0.002358
creation_source_ORG_INVITE            0.002145
creation_source_SIGNUP_GOOGLE_AUTH    0.001986
creation_source_SIGNUP                0.001666
dtype: float64

As expected, last_session_creation_time has by far the largest feature importance (0.686). The model also performs much better than the previous one: its ROC_AUC score is 0.891, compared with 0.499 before.

Let's check whether we get the same results with GridSearchCV.

In [155]:
# create the parameter grid based on the results of random search 
param_grid_t2 = {
    'bootstrap': [True],
    'max_depth': [40, 50, 60, 80],
    'n_estimators': [30, 50, 60, 100]
}


# create a base model
RF3 = RandomForestClassifier()

# instantiate the grid search model
grid_search2 = GridSearchCV(estimator = RF3, param_grid = param_grid_t2, cv = 3, n_jobs = -1, verbose = 2)

# fit the grid search to the data
grid_search2.fit(X_train_2, y_train_2)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:    7.1s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid

In [156]:
# print scores
print('ROC_AUC_score:', metrics.roc_auc_score(y_test_2, grid_search2.predict(X_test_2)))
print('F1_score:', metrics.f1_score(y_test_2, grid_search2.predict(X_test_2)))

ROC_AUC_score: 0.8889354838709678
F1_score: 0.8605927552140505


In [159]:
# create a pandas Series from the feature importances
feature_imp_grid2 = pd.Series(grid_search2.best_estimator_.feature_importances_, index=feature_list_2).sort_values(ascending=False)
feature_imp_grid2

last_session_creation_time            0.667469
creation_time                         0.208711
org_id                                0.062037
invited_by_user_id                    0.038413
opted_in_to_mailing_list              0.005151
enabled_for_marketing_drip            0.004270
creation_source_PERSONAL_PROJECTS     0.003908
creation_source_GUEST_INVITE          0.002636
creation_source_ORG_INVITE            0.002599
creation_source_SIGNUP                0.002435
creation_source_SIGNUP_GOOGLE_AUTH    0.002369
dtype: float64

Notice that the feature importances and ROC_AUC score (0.889) are similar to those of the model tuned with Random Search alone (0.891).

We can conclude that a user's last login time has the greatest effect on whether that user becomes adopted. The next two features are when the user created their account and which organisation they belong to.
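
The `feature_importances_` values used above are impurity-based, and scikit-learn notes that this measure can favour features with many unique values, such as timestamps and IDs. A possible cross-check is permutation importance on the held-out set. The sketch below assumes scikit-learn 0.22 or newer (where `sklearn.inspection.permutation_importance` is available) and uses synthetic data. The column names `f0` to `f4` and all variable names are hypothetical, not part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic stand-in data with hypothetical column names
X_demo, y_demo = make_classification(n_samples=400, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)
demo_cols = ['f0', 'f1', 'f2', 'f3', 'f4']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# shuffle each column of the test set in turn and record the mean drop in accuracy
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)
perm_imp = pd.Series(perm.importances_mean, index=demo_cols).sort_values(ascending=False)
print(perm_imp)
```

On the real data, `rf_random_2.best_estimator_`, `X_test_2`, `y_test_2` and `feature_list_2` would take the place of the demo objects.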

Let's check how each feature correlates with our adopted_user column.

In [160]:
df_joined3_dummy.corr()['adopted_user'].sort_values(ascending=False)

adopted_user                          1.000000
last_session_creation_time            0.250484
org_id                                0.066995
creation_source_GUEST_INVITE          0.044317
creation_source_SIGNUP_GOOGLE_AUTH    0.036198
invited_by_user_id                    0.021965
creation_source_SIGNUP                0.008920
opted_in_to_mailing_list              0.008838
enabled_for_marketing_drip            0.006578
creation_source_ORG_INVITE           -0.006592
creation_source_PERSONAL_PROJECTS    -0.075717
creation_time                        -0.086207
Name: adopted_user, dtype: float64

A positive value means the feature tends to increase together with adopted_user; a negative value means one tends to decrease as the other increases. Values near zero indicate little linear relationship.
Here every feature except last_session_creation_time (0.250) has an absolute correlation below 0.1, so the linear relationships are weak.
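
To make the sign convention concrete, here is a minimal toy example with made-up numbers (the `toy` DataFrame and its column names are illustrative, not taken from the dataset). One column rises exactly in step with `x` and the other falls exactly in step with it.

```python
import pandas as pd

# made-up data: one column rises linearly with x, the other falls linearly
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'rises_with_x': [2, 4, 6, 8, 10],
    'falls_with_x': [10, 8, 6, 4, 2],
})

# Pearson correlation of every column with x, as in the cell above
toy_corr = toy.corr()['x'].sort_values(ascending=False)
print(toy_corr)
```

Because both relationships are perfectly linear, the correlations come out at the two extremes, 1 and -1. Real features such as those above fall somewhere in between.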