# Relax Inc. Take-Home Challenge

### Problem Statement

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, **identify which factors predict future user adoption.**

In [1]:
import warnings
import pandas as pd
import numpy as np

warnings.simplefilter(action="ignore", category=FutureWarning)

In [2]:
df_user = pd.read_csv("takehome_users.csv", encoding="ISO-8859-1")
df_user.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [3]:
df_user['last_session_creation_time'] = pd.to_datetime(df_user['last_session_creation_time'],unit='s')

In [4]:
df_user = df_user.rename({"object_id":"user_id"}, axis=1)
df_user.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0


In [5]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   user_id                     12000 non-null  int64         
 1   creation_time               12000 non-null  object        
 2   name                        12000 non-null  object        
 3   email                       12000 non-null  object        
 4   creation_source             12000 non-null  object        
 5   last_session_creation_time  8823 non-null   datetime64[ns]
 6   opted_in_to_mailing_list    12000 non-null  int64         
 7   enabled_for_marketing_drip  12000 non-null  int64         
 8   org_id                      12000 non-null  int64         
 9   invited_by_user_id          6417 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(4), object(4)
memory usage: 937.6+ KB


In [6]:
df_user.shape

(12000, 10)

In [7]:
df_user.describe()

Unnamed: 0,user_id,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
count,12000.0,12000.0,12000.0,12000.0,6417.0
mean,6000.5,0.2495,0.149333,141.884583,5962.957145
std,3464.24595,0.432742,0.356432,124.056723,3383.761968
min,1.0,0.0,0.0,0.0,3.0
25%,3000.75,0.0,0.0,29.0,3058.0
50%,6000.5,0.0,0.0,108.0,5954.0
75%,9000.25,0.0,0.0,238.25,8817.0
max,12000.0,1.0,1.0,416.0,11999.0


In [8]:
df_user[df_user['last_session_creation_time'].isna()].head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
7,8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,NaT,1,1,74,
8,9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,NaT,0,0,302,
11,12,2014-04-17 23:48:38,Mathiesen Lærke,LaerkeLMathiesen@cuvox.de,ORG_INVITE,NaT,0,0,130,9270.0
14,15,2013-07-16 21:33:54,Theiss Ralf,RalfTheiss@hotmail.com,PERSONAL_PROJECTS,NaT,0,0,175,
15,16,2013-02-11 10:09:50,Engel René,ReneEngel@hotmail.com,PERSONAL_PROJECTS,NaT,0,0,211,


In [9]:
df_user['last_session_creation_time'].fillna(df_user['creation_time'],inplace=True)

In [10]:
df_user.nunique()

user_id                       12000
creation_time                 11996
name                          11355
email                         11980
creation_source                   5
last_session_creation_time    11998
opted_in_to_mailing_list          2
enabled_for_marketing_drip        2
org_id                          417
invited_by_user_id             2564
dtype: int64

since user_id is all unique and name, email, org_id and invited_by_user_id are not required for our analysis, lets drop those columns.

In [11]:
df_user = df_user.drop(['name', 'email', 'org_id' , 'invited_by_user_id'], axis=1)
df_user.head()

Unnamed: 0,user_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip
0,1,2014-04-22 03:53:30,GUEST_INVITE,2014-04-22 03:53:30,1,0
1,2,2013-11-15 03:45:04,ORG_INVITE,2014-03-31 03:45:04,0,0
2,3,2013-03-19 23:14:52,ORG_INVITE,2013-03-19 23:14:52,0,0
3,4,2013-05-21 08:09:28,GUEST_INVITE,2013-05-22 08:09:28,0,0
4,5,2013-01-17 10:14:20,GUEST_INVITE,2013-01-22 10:14:20,0,0


In [12]:
df_user_eng = pd.read_csv('takehome_user_engagement.csv', parse_dates=["time_stamp"])

In [13]:
df_user_eng.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [14]:
# defining an 'adopted user' 
df_agg = df_user_eng.set_index("time_stamp")

users = df_agg["user_id"].unique()
adoption = []

for i in users:
    id_filter = df_agg["user_id"] == i
    df_filter = df_agg[id_filter].resample("1D").count()
    df_filter = df_filter.rolling(window=7).sum()
    df_filter = df_filter.dropna()
    adoption.append(any(df_filter["visited"].values >= 7))

# creating a new df using df_user and df_adopt
user_adoption = list(zip(users, adoption))

df_adopt = pd.DataFrame(user_adoption)
df_adopt.columns = ["user_id", "adopted_user"]

df = df_user.merge(df_adopt, on="user_id", how="left")

# mapping 'adopted_user' 
df.loc[:, "adopted_user"] = df["adopted_user"].map({False:0, True:1, np.nan:0})
df.dropna(subset=["adopted_user"], inplace=True)
df["adopted_user"] = df["adopted_user"].astype(int)
df.head()

Unnamed: 0,user_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,adopted_user
0,1,2014-04-22 03:53:30,GUEST_INVITE,2014-04-22 03:53:30,1,0,0
1,2,2013-11-15 03:45:04,ORG_INVITE,2014-03-31 03:45:04,0,0,0
2,3,2013-03-19 23:14:52,ORG_INVITE,2013-03-19 23:14:52,0,0,0
3,4,2013-05-21 08:09:28,GUEST_INVITE,2013-05-22 08:09:28,0,0,0
4,5,2013-01-17 10:14:20,GUEST_INVITE,2013-01-22 10:14:20,0,0,0


In [15]:
# Selecting adopted_user, creation_source, opted_in_to_mailing_list and enabled_for_marketing_drip for our analysis
df = df[["adopted_user", "creation_source", "opted_in_to_mailing_list", "enabled_for_marketing_drip"]]
df.head()

Unnamed: 0,adopted_user,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip
0,0,GUEST_INVITE,1,0
1,0,ORG_INVITE,0,0
2,0,ORG_INVITE,0,0
3,0,GUEST_INVITE,0,0
4,0,GUEST_INVITE,0,0


In [16]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [17]:
X = df[df.columns[1:]]
y = df[df.columns[0]]
X.head()

Unnamed: 0,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip
0,GUEST_INVITE,1,0
1,ORG_INVITE,0,0
2,ORG_INVITE,0,0
3,GUEST_INVITE,0,0
4,GUEST_INVITE,0,0


In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [19]:
pipeline = Pipeline(steps=[("encoder", OneHotEncoder()), \
                           ("rf", RandomForestClassifier(random_state = 42))])

params = {"rf__n_estimators" : [50, 75, 100],
          "rf__max_depth" : [5, 10, 15]}

cv = GridSearchCV(pipeline, param_grid=params, cv=3)
cv.fit(X_train, y_train)

print(f"Best parameters: {cv.best_params_}")
print(f"Training accuracy score from tuned model: \
       {cv.best_score_*100:.1f}%")

Best parameters: {'rf__max_depth': 5, 'rf__n_estimators': 50}
Training accuracy score from tuned model:        94.8%


In [20]:
# test set score #
y_pred = cv.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {test_accuracy*100:.2f}%")

Model accuracy: 94.83%


In [21]:
cv.best_estimator_.named_steps["rf"].feature_importances_

array([0.10811758, 0.08434293, 0.23440629, 0.05680139, 0.18565606,
       0.09569639, 0.08321853, 0.07249834, 0.07926248])

In [22]:
X_ohe = pd.get_dummies(X_test)
pipeline.fit(X_ohe, y_test)

fe = pipeline.named_steps["rf"].feature_importances_

feature_importance = zip(X_ohe.columns, fe)
feature_importance = sorted(feature_importance, key=lambda x:x[1], reverse=True)

for i, j in feature_importance:
    print(f"Weight: {j:.3f} | Feature: {i}")

Weight: 0.130 | Feature: creation_source_ORG_INVITE
Weight: 0.125 | Feature: creation_source_GUEST_INVITE
Weight: 0.109 | Feature: enabled_for_marketing_drip
Weight: 0.100 | Feature: opted_in_to_mailing_list
Weight: 0.049 | Feature: creation_source_SIGNUP
Weight: 0.033 | Feature: creation_source_PERSONAL_PROJECTS
Weight: 0.031 | Feature: creation_source_SIGNUP_GOOGLE_AUTH


## Conclusion

1. creation_source is the most critical feature with more weightage to users created using invite. The company should focus its marketing more towards collaborative user-groups
2. marketing_drip works, so company should keep efforts on it to keep the user base.
3. opted_in_to_mailing_list is also an important feature. Company should invest more into mail based content and marketing.