# Relax Challenge

This is a solution to the Relax Challenge. The prompt and task, lightly adapted, are given below.

--------------------------

The data has two tables:
- A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years.
- A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.

Defining an "*adopted user*" as a user who *has logged into the product on three separate days in at least one seven-day period*, **identify which factors predict future user adoption**.

---------------------------

At first glance this looks like a predictive task. It is really a feature selection problem, however: we want to know which features carry the most predictive power. To that end, we will use the [Boruta algorithm](https://github.com/scikit-learn-contrib/boruta_py), a feature selection method built on Random Forests. Now let's start the analysis.
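Before diving in, the core idea behind Boruta can be sketched in a few lines: add a shuffled "shadow" copy of every feature, fit a Random Forest, and keep only the real features whose importance beats the best shadow. The block below is a simplified single round under that assumption, not the BorutaPy implementation; the real algorithm repeats this many times and applies a statistical test to the hit counts. The function name `shadow_feature_round` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_feature_round(X, y, seed=0):
    """One simplified Boruta-style round: which real features beat the best shadow?"""
    rng = np.random.default_rng(seed)
    # Shadow features: each column shuffled independently, which breaks any link to y
    shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    X_aug = np.hstack([X, shadows])

    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X_aug, y)

    n_real = X.shape[1]
    real_importance = rf.feature_importances_[:n_real]
    best_shadow = rf.feature_importances_[n_real:].max()
    # A "hit" means the real feature was more important than every shadow
    return real_importance > best_shadow

# Toy data (made up): column 0 drives the label, columns 1 and 2 are pure noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
hits = shadow_feature_round(X_toy, y_toy)
print(hits)
```

In a single round a noise column can still beat the shadows by chance, which is why Boruta accumulates hits over many iterations before confirming or rejecting a feature.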

--------------------------

In [1]:
#import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

We can read in the users file. 

In [2]:
users = pd.read_csv('takehome_users.csv')
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [3]:
users.head()

| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 |


There are several cleaning steps to take here, which we will do now.

In [4]:
#Convert the time columns to datetime
users.creation_time = pd.to_datetime(users.creation_time)
users.last_session_creation_time = pd.to_datetime(users.last_session_creation_time, unit='s')

#Convert the columns to boolean
users.opted_in_to_mailing_list = users.opted_in_to_mailing_list.astype('bool')
users.enabled_for_marketing_drip = users.enabled_for_marketing_drip.astype('bool')

#Convert to categorical
users.creation_source = users.creation_source.astype('category')

#Re-name user id appropriately
users['user_id'] = users['object_id']
users.drop('object_id', axis=1, inplace=True)

#Keep only info on whether the user was invited (a non-null inviter id means invited)
users['invited'] = users['invited_by_user_id'].notnull()
users.drop('invited_by_user_id', axis=1, inplace=True)

#We don't need personal information
users.drop(['name', 'email'], axis=1, inplace=True)
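The `unit='s'` argument in the cell above treats `last_session_creation_time` as a Unix timestamp, that is, seconds since 1970-01-01 UTC. As a quick sanity check, here is the same conversion with the standard library, assuming the column really holds epoch seconds. The helper name `epoch_to_utc` is hypothetical, and the input is the rounded value displayed in the first row of the table, so the minutes differ slightly from the true stored value.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Interpret a Unix timestamp (seconds since 1970-01-01 UTC) as a datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Rounded value as displayed for the first user in users.head()
first_session = epoch_to_utc(1398139000)
print(first_session.isoformat())  # 2014-04-22T03:56:40+00:00
```

The result falls on the same day as that user's `creation_time`, which is what we would expect for someone whose only session was the sign-up itself.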

Next, we load in the engagement data. 

In [5]:
engagement = pd.read_csv('takehome_user_engagement.csv')
engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [6]:
engagement.head()

| | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |
| 1 | 2013-11-15 03:45:04 | 2 | 1 |
| 2 | 2013-11-29 03:45:04 | 2 | 1 |
| 3 | 2013-12-09 03:45:04 | 2 | 1 |
| 4 | 2013-12-25 03:45:04 | 2 | 1 |


As before, we also clean this dataset up.

In [7]:
#Convert time column to datetime
engagement.time_stamp = pd.to_datetime(engagement.time_stamp)

#Drop visited column, it's always 1
engagement.drop('visited', axis=1, inplace=True)

The heart of this problem is correctly determining which users can be classified as *adopted users*. We can do this by grouping the engagement data by user and looping over each user's sorted time stamps. For each time stamp we construct a 7-day window starting there and require that the next two time stamps fall within that window. Because the engagement table has one row per login day, three time stamps in one window mean three separate days. Users who never meet this criterion are not *adopted users*.

In [8]:
adopted_dict = {x:False for x in range(1, len(users)+1)}

for group in engagement.groupby('user_id'):
    
    #Define useful vars
    user_id = group[0]
    user_times = group[1]['time_stamp'].sort_values().reset_index(drop=True)
    num_engage = len(user_times)
    
    #If there are fewer than 3 engagements, they do not qualify
    if num_engage < 3:
        continue
     
    #Iterate over the engagement timestamps
    for i, stamp in enumerate(user_times):
        
        #Ensure we don't go off the end of the array of timestamps
        if i == num_engage-2:
            break
            
        #Define useful timestamp vars    
        start = stamp
        end = start + pd.Timedelta('7D')
        next1 = user_times[i+1]
        next2 = user_times[i+2]
        
        #Are the next two timestamps within a week of the first?
        if (next1 < end) and (next2 < end):
            adopted_dict[user_id] = True
            break

#Use the built-in sum: np.sum does not add up a dict view in Python 3
print('There are %i adopted users.' % sum(adopted_dict.values()))

There are 1602 adopted users.
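The loop above can be condensed. Since the time stamps are sorted, the second login always lies between the first and the third, so it is enough to check that the third login falls inside the window. The sketch below shows this with the standard library only. It assumes, as the table description states, one time stamp per login day. The function name `is_adopted` and the example dates are made up for illustration.

```python
from datetime import datetime, timedelta

def is_adopted(timestamps, window=timedelta(days=7), min_logins=3):
    """True if any `min_logins` consecutive logins span less than `window`."""
    times = sorted(timestamps)
    span = min_logins - 1
    # Pair each login with the one `span` positions later; zip stops at the shorter list
    return any(later - earlier < window
               for earlier, later in zip(times, times[span:]))

# Made-up examples: three logins within six days, then three logins spread over nine days
dense = [datetime(2014, 1, 1), datetime(2014, 1, 3), datetime(2014, 1, 6)]
sparse = [datetime(2014, 1, 1), datetime(2014, 1, 5), datetime(2014, 1, 9)]
print(is_adopted(dense), is_adopted(sparse))  # True False
```

The strict `<` mirrors the comparison in the loop, so a third login exactly seven days after the first does not count.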


Note that of the 12,000 users, only 1,602 (about 13%) are adopted, so the target variable is highly imbalanced. Again, though, we care less about building a great predictive model than about the underlying variables and their relationships with the target variable.

Let's combine this info with the users data from before.

In [9]:
#Convert dict of adopted users to dataframe for easy merge (list() keeps this working in Python 3)
adopted_df = pd.DataFrame(list(adopted_dict.items()), columns=['user_id', 'adopted'])

#Merge the adopted user info to users dataframe
users = pd.merge(users, adopted_df, on='user_id', how='outer')

Now that we have a fully cleaned and merged dataset, we turn our attention to processing it for our machine learning model. Because we will be (essentially) using a Random Forest, which needs numeric inputs, we have to re-code the datetime information. The datetimes are given to the second, but we can take a cue from the prompt: the problem is concerned with time on the scale of days (recall: engagement on three separate days). So we extract the year, month, and day for both the account creation and last session features. This adds the time information to the model in a simple way.

In [10]:
#Re-code datetime columns
users['creation_year'] = users.creation_time.dt.year
users['creation_month'] = users.creation_time.dt.month
users['creation_day'] = users.creation_time.dt.day
users['last_session_year'] = users.last_session_creation_time.dt.year
users['last_session_month'] = users.last_session_creation_time.dt.month
users['last_session_day'] = users.last_session_creation_time.dt.day

#Drop unnecessary columns
users.drop(['creation_time', 'last_session_creation_time', 'user_id'], axis=1, inplace=True)

Next we fill the null values and one-hot encode the one string/categorical feature.

In [11]:
#Fill null values with something obvious for the model
#(assigning back avoids in-place fillna on a column accessed by attribute)
last_session_cols = ['last_session_year', 'last_session_month', 'last_session_day']
users[last_session_cols] = users[last_session_cols].fillna(0)

#Convert creation_source to indicator variables
users = pd.get_dummies(users, drop_first=True) #Drop first dummy column, since it is fully determined by the others
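To see what `drop_first=True` does, here is a small sketch on a toy frame. The three `creation_source` values appear in the data above, while the frame itself is made up for illustration.

```python
import pandas as pd

# Toy frame with three of the creation_source values seen in the data
toy = pd.DataFrame({
    'creation_source': pd.Categorical(['GUEST_INVITE', 'ORG_INVITE', 'PERSONAL_PROJECTS']),
    'org_id': [11, 1, 94],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
# ['org_id', 'creation_source_ORG_INVITE', 'creation_source_PERSONAL_PROJECTS']
```

The first category in sorted order (`GUEST_INVITE` here) gets no column of its own. A row with zeros in every remaining dummy column belongs to that dropped category, so no information is lost.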

Now we can separate out our x and y data.

In [12]:
#Define the target variable
target = 'adopted'

#Define the x and y data
x = users.drop(target, axis=1).values
y = users[target].values

#Get the column names for x
df_columns = users.drop(target, axis=1).columns.values

We can now run the Boruta algorithm with an instantiated RandomForestClassifier object. Note that the class weight is set to `'balanced'` in an attempt to take the class imbalance into account. We also let the BorutaPy object choose the number of trees automatically (`n_estimators='auto'`).

In [13]:
#Define RF object
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced')

#Define the Boruta feature selection method
feat_sel = BorutaPy(rf, n_estimators='auto', verbose=1)

#Fit the Boruta algo
feat_sel.fit(x, y)

Iteration: 1 / 100
Iteration: 2 / 100
Iteration: 3 / 100
Iteration: 4 / 100
Iteration: 5 / 100
Iteration: 6 / 100
Iteration: 7 / 100
Iteration: 8 / 100


BorutaPy finished running.

Iteration: 	9 / 100
Confirmed: 	7
Tentative: 	0
Rejected: 	7


BorutaPy(alpha=0.05,
     estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=52, n_jobs=-1,
            oob_score=False,
            random_state=<mtrand.RandomState object at 0x110fe3230>,
            verbose=0, warm_start=False),
     max_iter=100, n_estimators='auto', perc=100,
     random_state=<mtrand.RandomState object at 0x110fe3230>,
     two_step=True, verbose=1)
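For reference, scikit-learn's `'balanced'` setting weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts found earlier. The helper name `balanced_class_weights` is hypothetical and is not part of scikit-learn.

```python
def balanced_class_weights(class_counts):
    """Weights under the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Counts from the analysis above: 1,602 adopted users out of 12,000
weights = balanced_class_weights({True: 1602, False: 12000 - 1602})
print({label: round(w, 3) for label, w in weights.items()})
# {True: 3.745, False: 0.577}
```

Each adopted user therefore counts roughly 6.5 times as much as a non-adopted user when the trees are grown.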

The fitted selector can now give us a ranking of the features in the dataset. The ranking is numerical: features ranked 1 are the ones Boruta confirmed as important, while larger numbers mark features that were not selected.

In [14]:
features = pd.DataFrame(feat_sel.ranking_, index=df_columns)
features.columns = ['ranking']
features.sort_values('ranking', ascending=True)

| | ranking |
|---|---|
| org_id | 1 |
| creation_year | 1 |
| creation_month | 1 |
| creation_day | 1 |
| last_session_year | 1 |
| last_session_month | 1 |
| last_session_day | 1 |
| creation_source_PERSONAL_PROJECTS | 2 |
| opted_in_to_mailing_list | 3 |
| creation_source_ORG_INVITE | 3 |


The table above shows the 7 features that Boruta confirmed as important. In essence these come from only three original features, since each of the two datetime columns was split into three. So the answer to the prompt is: **the creation_time and last_session_creation_time features, alongside org_id, are the most important factors for predicting user adoption**, according to the given definition.

---------------------

Et fin.