# 27.2 The Take-Home Challenge (Relax Inc.)
<br>

For this challenge we were tasked with investigating how to classify, predict and explain the relationship between user "activity" and platform adoption for "Relax Inc." We took a very basic approach, but it was enough to label the adopted users and build a baseline model for predicting adoption. First we explored the two datasets we were provided ("takehome_user_engagement.csv" and "takehome_users.csv"), pulling descriptive statistics and various counts with pandas/numpy, and made sure the data was clean and in the proper format for analysis.

Since the main goal of the analysis was to identify which factors predict future user adoption, we first needed to identify the users who had already adopted the platform. To do that, we wrote a function that sorts each user's logins by time and checks whether any three consecutive logins fall within a seven-day window. This gave 1654 adopted users out of the 8809 users retained for modeling. We used adoption status as our y variable and a few columns from the original dataset as features, one of which (`creation_source`) needed to be "one-hot encoded." From there we performed an 80/20 train/test split and built a Random Forest Classifier to see which features contributed to adoption.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# load data
df_engage = pd.read_csv('./takehome_user_engagement.csv')
df_users = pd.read_csv('./takehome_users.csv', encoding='iso-8859-1')

# info() prints its summary directly and returns None, so no print() is needed
df_engage.info()
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_

In [5]:
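# Convert date columns to pandas datetime objects
df_users['creation_time'] = pd.to_datetime(df_users['creation_time'])
# NOTE: last_session_creation_time holds Unix epoch seconds; without unit='s'
# pandas reads the numbers as nanoseconds, hence the 1970 dates in the output
df_users['last_session_creation_time'] = pd.to_datetime(
    df_users['last_session_creation_time'])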
df_engage['time_stamp'] = pd.to_datetime(df_engage['time_stamp'])

# View user engagement
df_engage.head()

| | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |
| 1 | 2013-11-15 03:45:04 | 2 | 1 |
| 2 | 2013-11-29 03:45:04 | 2 | 1 |
| 3 | 2013-12-09 03:45:04 | 2 | 1 |
| 4 | 2013-12-25 03:45:04 | 2 | 1 |


In [40]:
sns.countplot(x=df_engage.user_id)

<AxesSubplot:xlabel='user_id', ylabel='count'>


In [6]:
# View users data
df_users.head()

| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1970-01-01 00:00:01.398138810 | 1 | 0 | 11 | 10803.0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1970-01-01 00:00:01.396237504 | 0 | 0 | 1 | 316.0 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1970-01-01 00:00:01.363734892 | 0 | 0 | 94 | 1525.0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1970-01-01 00:00:01.369210168 | 0 | 0 | 1 | 5151.0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1970-01-01 00:00:01.358849660 | 0 | 0 | 193 | 5240.0 |
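
The `1970-01-01 00:00:01...` values in `last_session_creation_time` are what `pd.to_datetime` returns when it reads epoch seconds as nanoseconds, its default unit for numeric input. The following is a minimal sketch on two of the raw values from the table, assuming the column stores Unix timestamps in seconds; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Raw values as stored in the CSV: float epoch seconds, NaN for users with no session
raw = pd.Series([1398138810.0, 1396237504.0, np.nan])

as_default = pd.to_datetime(raw)            # numeric input is read as nanoseconds
as_seconds = pd.to_datetime(raw, unit='s')  # read as seconds since 1970-01-01

print(as_default.iloc[0])  # a timestamp one second after the epoch
print(as_seconds.iloc[0])  # a calendar date in 2014
print(as_seconds.iloc[2])  # missing values become NaT
```

This column is dropped before modeling, so the conversion does not affect the results reported here.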


In [7]:
# Group by user ID and view descriptive statistics of total logins per user
df_engage.groupby('user_id')[['visited']].sum().describe()

| | visited |
|---|---|
| count | 8823.0 |
| mean | 23.565341 |
| std | 73.988152 |
| min | 1.0 |
| 25% | 1.0 |
| 50% | 1.0 |
| 75% | 3.0 |
| max | 606.0 |


In [8]:
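# Drop rows with duplicate time stamps and sort by time stamp
# NOTE: subset='time_stamp' also drops rows where different users share a time stamp;
# subset=['user_id', 'time_stamp'] would restrict this to true duplicates
df_engage = df_engage.drop_duplicates(
    subset='time_stamp').sort_values('time_stamp')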

In [9]:
# Flag adopted users: returns True if the user is adopted, False otherwise

def user_logins_by_days(df,
                        n_days=7,
                        n_logins=3
                        ):  # 7 days, 3 logins min as requested by docs

    # Time elapsed between each login and the login n_logins - 1 rows earlier
    # (relies on the rows already being sorted by time_stamp)
    passed_days = df['time_stamp'].diff(periods=n_logins - 1)
    # True if any n_logins consecutive logins fall within n_days
    return any(passed_days <= timedelta(days=n_days))


# Apply function to grouped dataset by "user_id"
adopted = df_engage.groupby('user_id').apply(user_logins_by_days)
adopted.name = 'adopted_user'

# View output
adopted.head()

user_id
1    False
2     True
3    False
4    False
5    False
Name: adopted_user, dtype: bool
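
To make the windowing logic concrete, here is a small sketch on hand-made data (the `toy` frame and its dates are invented for illustration). It reuses the same `diff(periods=n_logins - 1)` idea: for each login, measure the time back to the login two rows earlier and check whether that gap is at most seven days.

```python
from datetime import timedelta

import pandas as pd


def user_logins_by_days(df, n_days=7, n_logins=3):
    # Time between each login and the one n_logins - 1 rows earlier
    passed = df['time_stamp'].diff(periods=n_logins - 1)
    # The first n_logins - 1 rows are NaT, which compares as False
    return bool((passed <= timedelta(days=n_days)).any())


# Hypothetical logins for two users
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'time_stamp': pd.to_datetime([
        '2014-01-01', '2014-01-03', '2014-01-06',  # user 1: three logins in 5 days
        '2014-01-01', '2014-01-10', '2014-01-20',  # user 2: three logins over 19 days
    ]),
}).sort_values('time_stamp')

toy_adopted = {int(uid): user_logins_by_days(g) for uid, g in toy.groupby('user_id')}
print(toy_adopted)  # {1: True, 2: False}
```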

In [10]:
print("# of adopted users:", sum(adopted))

# of adopted users: 1654


In [11]:
# Drop columns we don't need for this problem; names and emails won't really benefit us
df_users = df_users.drop(['name', 'email'], axis=1)

In [12]:
# Get descriptive statistics of user data
df_users.describe()

| | object_id | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|
| count | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 6417.0 |
| mean | 6000.5 | 0.2495 | 0.149333 | 141.884583 | 5962.957145 |
| std | 3464.24595 | 0.432742 | 0.356432 | 124.056723 | 3383.761968 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 25% | 3000.75 | 0.0 | 0.0 | 29.0 | 3058.0 |
| 50% | 6000.5 | 0.0 | 0.0 | 108.0 | 5954.0 |
| 75% | 9000.25 | 0.0 | 0.0 | 238.25 | 8817.0 |
| max | 12000.0 | 1.0 | 1.0 | 416.0 | 11999.0 |


In [13]:
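# Calculate Pearson correlation and variance for the numeric columns
# Only the two marketing flags are moderately correlated (0.48); the rest are near zero
numeric_cols = df_users.select_dtypes('number')
print(numeric_cols.corr())
print(numeric_cols.var())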

                            object_id  opted_in_to_mailing_list  \
object_id                    1.000000                 -0.032370   
opted_in_to_mailing_list    -0.032370                  1.000000   
enabled_for_marketing_drip  -0.022040                  0.483529   
org_id                       0.004110                  0.003432   
invited_by_user_id           0.018699                  0.004699   

                            enabled_for_marketing_drip    org_id  \
object_id                                    -0.022040  0.004110   
opted_in_to_mailing_list                      0.483529  0.003432   
enabled_for_marketing_drip                    1.000000  0.009275   
org_id                                        0.009275  1.000000   
invited_by_user_id                            0.003687 -0.057780   

                            invited_by_user_id  
object_id                             0.018699  
opted_in_to_mailing_list              0.004699  
enabled_for_marketing_drip            0.0

In [14]:
# Reformat new dataset to feed to our model
df_users = df_users.set_index('object_id')
df_users.index.name = 'user_id'

# Merge user data with our adopted users
model_data = pd.concat([df_users, adopted], axis=1, join='inner')
model_data.rename(columns={0: 'adopted_user'}, inplace=True)
model_data['adopted_user'] = model_data['adopted_user'].astype(int)

# Fill NaN values with 0
model_data = model_data.fillna(value=0)

# View the dataset
model_data.head()

| user_id | creation_time | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id | adopted_user |
|---|---|---|---|---|---|---|---|---|
| 1 | 2014-04-22 03:53:30 | GUEST_INVITE | 1970-01-01 00:00:01.398138810 | 1 | 0 | 11 | 10803.0 | 0 |
| 2 | 2013-11-15 03:45:04 | ORG_INVITE | 1970-01-01 00:00:01.396237504 | 0 | 0 | 1 | 316.0 | 1 |
| 3 | 2013-03-19 23:14:52 | ORG_INVITE | 1970-01-01 00:00:01.363734892 | 0 | 0 | 94 | 1525.0 | 0 |
| 4 | 2013-05-21 08:09:28 | GUEST_INVITE | 1970-01-01 00:00:01.369210168 | 0 | 0 | 1 | 5151.0 | 0 |
| 5 | 2013-01-17 10:14:20 | GUEST_INVITE | 1970-01-01 00:00:01.358849660 | 0 | 0 | 193 | 5240.0 | 0 |


In [22]:
# Build the feature matrix X and the target y (adopted users)
X = model_data.drop(
    ['adopted_user', 'creation_time', 'last_session_creation_time', 'org_id'],
    axis=1)
y = model_data['adopted_user']

# One-hot encode creation_source as dummy variables
X2 = pd.get_dummies(X, columns=['creation_source'], drop_first=True)

In [23]:
# Check shapes
print(X2.shape, y.shape)

(8809, 7) (8809,)
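
The column order of `X2` matters later, because `feature_importances_` is returned in that same order. As a small sketch on invented rows (`toy_X` is hypothetical and uses only a subset of the real columns), `pd.get_dummies` removes the encoded column and appends its dummy columns after the remaining ones:

```python
import pandas as pd

# Hypothetical rows with the encoded column listed first, as in X
toy_X = pd.DataFrame({
    'creation_source': ['GUEST_INVITE', 'ORG_INVITE', 'GUEST_INVITE'],
    'opted_in_to_mailing_list': [1, 0, 0],
    'invited_by_user_id': [10803.0, 316.0, 5151.0],
})

toy_X2 = pd.get_dummies(toy_X, columns=['creation_source'], drop_first=True)

# The untouched columns keep their order; the dummies are appended at the end
print(list(toy_X2.columns))
# ['opted_in_to_mailing_list', 'invited_by_user_id', 'creation_source_ORG_INVITE']
```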


In [19]:
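# Use sklearn's SelectKBest to select the best features (the result is not used below)
# NOTE: run earlier (In [19]) on a wider X2; k=20 exceeds the 7 columns X2 has now
X2_new = SelectKBest(chi2, k=20).fit_transform(X2, y)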

# Check shape
print(X2_new.shape, y.shape)

(8809, 20) (8809,)


In [24]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X2,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=33)

In [25]:
# Check shape
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(7047, 7) (1762, 7) (7047,) (1762,)


In [27]:
# GridSearchCV for our Random Forest Classifier
# To keep training time short, only n_estimators is searched

rfr_params = {
    #'criterion': ["gini", "entropy"],
    #'max_depth': [2, 3],
    "n_estimators": [100, 300, 500]
}

# Fit the grid search on the training data and keep the best estimator
rfr_grid = GridSearchCV(RandomForestClassifier(),
                        param_grid=rfr_params,
                        verbose=3).fit(X_train, y_train).best_estimator_

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] n_estimators=100 ................................................
[CV] .................... n_estimators=100, score=0.765, total=   0.6s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


[CV] n_estimators=100 ................................................
[CV] .................... n_estimators=100, score=0.746, total=   0.8s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.3s remaining:    0.0s


[CV] n_estimators=100 ................................................
[CV] .................... n_estimators=100, score=0.743, total=   0.8s
[CV] n_estimators=100 ................................................
[CV] .................... n_estimators=100, score=0.743, total=   0.7s
[CV] n_estimators=100 ................................................
[CV] .................... n_estimators=100, score=0.737, total=   0.8s
[CV] n_estimators=300 ................................................
[CV] .................... n_estimators=300, score=0.766, total=   2.4s
[CV] n_estimators=300 ................................................
[CV] .................... n_estimators=300, score=0.748, total=   2.2s
[CV] n_estimators=300 ................................................
[CV] .................... n_estimators=300, score=0.744, total=   2.1s
[CV] n_estimators=300 ................................................
[CV] .................... n_estimators=300, score=0.744, total=   2.1s
[CV] n

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   32.4s finished
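
Chaining `.best_estimator_` onto `fit` discards the search object, and with it the winning parameters and cross-validation score. A minimal sketch of the alternative, on synthetic data from `make_classification` rather than the Relax data (the names `search`, `X_demo` and `y_demo` and the small grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the challenge dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 30]},
    cv=3,
)
search.fit(X_demo, y_demo)

# Keeping the search object makes the selection inspectable
print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best setting
best_model = search.best_estimator_  # refit on all of X_demo, ready for predict()
```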


In [29]:
# Predict on X_test with the trained RFC
rfr_pred = rfr_grid.predict(X_test)

In [31]:
# View parameters used 
rfr_grid.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 500,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [30]:
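# Get model performance scores. NOTE: classification_report expects (y_true, y_pred);
# with the arguments reversed as here, precision and recall are swapped in the report
# and "support" counts predictions per class, not true labels.
print(classification_report(rfr_pred, y_test))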

              precision    recall  f1-score   support

           0       0.91      0.82      0.86      1592
           1       0.10      0.19      0.14       170

    accuracy                           0.76      1762
   macro avg       0.50      0.51      0.50      1762
weighted avg       0.83      0.76      0.79      1762
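
scikit-learn's metric functions take the true labels first and the predictions second. The sketch below uses made-up labels (not the notebook's results) to show the effect of reversing them: precision and recall trade places.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 3 true positives exist, the model predicts 2 positives, 1 is correct
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]

# Correct order: (y_true, y_pred)
p = precision_score(y_true, y_pred)  # 1 correct / 2 predicted positives
r = recall_score(y_true, y_pred)     # 1 correct / 3 actual positives

# Reversed order: each metric now reports the other one's value
p_swapped = precision_score(y_pred, y_true)
r_swapped = recall_score(y_pred, y_true)

print(p, r)
print(p_swapped, r_swapped)
```

The F1 score and accuracy are symmetric in the two arguments, so those figures are unaffected by the order.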



In [35]:
# Feature importances, labelled with the column names the model was trained on
feat_importance = rfr_grid.feature_importances_
feat_importance_df = pd.DataFrame([feat_importance], columns=X_train.columns)

feat_importance_df

| | opted_in_to_mailing_list | enabled_for_marketing_drip | invited_by_user_id | creation_source_ORG_INVITE | creation_source_PERSONAL_PROJECTS | creation_source_SIGNUP | creation_source_SIGNUP_GOOGLE_AUTH |
|---|---|---|---|---|---|---|---|
| 0 | 0.005379 | 0.005129 | 0.977268 | 0.005072 | 0.002339 | 0.003224 | 0.00159 |


## Findings
<br>

Through our analysis, we were able to identify the adopted users from the data provided. A very bare Random Forest Classifier reached an accuracy of 0.76 and a weighted-average F1 of 0.79, but those figures are dominated by the non-adopted majority: the F1 score for adopted users is only 0.14, and the accuracy is below the roughly 0.81 that always predicting "not adopted" would achieve on the full dataset (1654 of 8809 users are adopted). The feature importances follow the column order of `X2`, in which the one-hot columns for `creation_source` come last, so the dominant importance of 0.977 belongs to `invited_by_user_id`, not to `enabled_for_marketing_drip` (0.005). Because `invited_by_user_id` is a high-cardinality identifier (with missing values filled as 0) and impurity-based importances tend to favour such features, this may reflect the model memorizing individual inviters rather than a genuine driver of adoption. There is definitely room for improvement and for more investigation into the factors that contribute to adoption.


**Random Forest Classifier parameters** (scikit-learn defaults, apart from `n_estimators`):

- 'bootstrap': True,
- 'ccp_alpha': 0.0,
- 'class_weight': None,
- 'criterion': 'gini',
- 'max_depth': None,
- 'max_features': 'auto',
- 'max_leaf_nodes': None,
- 'max_samples': None,
- 'min_impurity_decrease': 0.0,
- 'min_impurity_split': None,
- 'min_samples_leaf': 1,
- 'min_samples_split': 2,
- 'min_weight_fraction_leaf': 0.0,
- 'n_estimators': 500,
- 'n_jobs': None,
- 'oob_score': False,
- 'random_state': None,
- 'verbose': 0,
- 'warm_start': False
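
One way to follow up on the feature importances is to compare the forest's impurity-based values with permutation importance measured on held-out data, since impurity-based importances can favour high-cardinality columns such as ID-like numbers. The sketch below uses synthetic data, not the Relax data: `binary_signal` and `random_id` are invented columns, and the 15% label noise is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({
    'binary_signal': rng.integers(0, 2, size=n),                # informative flag
    'random_id': rng.integers(1, 12000, size=n).astype(float),  # ID-like noise
})
# Target follows binary_signal except for 15% flipped labels; random_id is unrelated
flip = rng.random(n) < 0.15
target = np.where(flip, 1 - demo['binary_signal'], demo['binary_signal'])

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances come from the training data only
impurity = pd.Series(model.feature_importances_, index=demo.columns)

# Permutation importance: drop in test accuracy when one column is shuffled
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
permuted = pd.Series(perm.importances_mean, index=demo.columns)

print(pd.DataFrame({'impurity': impurity, 'permutation': permuted}))
```

Running the same comparison on `X_test` and `y_test` would show whether the weight placed on `invited_by_user_id` holds up on unseen users.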

