In [1]:
import pandas as pd

In [8]:
engagement = pd.read_csv('takehome_user_engagement.csv')

In [10]:
users = pd.read_csv('takehome_users.csv', encoding='latin-1')

In [12]:
engagement

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1
...,...,...,...
207912,2013-09-06 06:14:15,11996,1
207913,2013-01-15 18:28:37,11997,1
207914,2014-04-27 12:45:16,11998,1
207915,2012-06-02 11:55:59,11999,1


In [11]:
users

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1.398139e+09,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1.396238e+09,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1.363735e+09,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1.369210e+09,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1.358850e+09,0,0,193,5240.0
...,...,...,...,...,...,...,...,...,...,...
11995,11996,2013-09-06 06:14:15,Meier Sophia,SophiaMeier@gustr.com,ORG_INVITE,1.378448e+09,0,0,89,8263.0
11996,11997,2013-01-10 18:28:37,Fisher Amelie,AmelieFisher@gmail.com,SIGNUP_GOOGLE_AUTH,1.358275e+09,0,0,200,
11997,11998,2014-04-27 12:45:16,Haynes Jake,JakeHaynes@cuvox.de,GUEST_INVITE,1.398603e+09,1,1,83,8074.0
11998,11999,2012-05-31 11:55:59,Faber Annett,mhaerzxp@iuxiw.com,PERSONAL_PROJECTS,1.338638e+09,0,0,6,


In [18]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import seaborn as sns

In [14]:
users['creation_time'] = pd.to_datetime(users['creation_time'])
users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit='s')
engagement['time_stamp'] = pd.to_datetime(engagement['time_stamp'])

In [15]:
def is_adopted_user(user_id):
    user_data = engagement[engagement['user_id'] == user_id]
    user_data = user_data.sort_values('time_stamp')
    user_data['date'] = user_data['time_stamp'].dt.date
    unique_dates = user_data['date'].unique()

    for i in range(len(unique_dates) - 6):
        week_dates = unique_dates[i:i+7]
        if len(week_dates) >= 3:
            return 1
    return 0

users['adopted'] = users['object_id'].apply(is_adopted_user)

In [16]:
users['days_since_creation'] = (pd.Timestamp.now() - users['creation_time']).dt.days

users['days_since_last_session'] = (pd.Timestamp.now() - users['last_session_creation_time']).dt.days

users = pd.get_dummies(users, columns=['creation_source'])

users = users.drop(['name', 'email', 'object_id', 'creation_time', 'last_session_creation_time'], axis=1)

In [20]:
X = users.drop('adopted', axis=1)
y = users['adopted']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('rf_model', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))

rf_model = pipeline.named_steps['rf_model']
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      2072
           1       0.95      0.80      0.87       328

    accuracy                           0.97      2400
   macro avg       0.96      0.89      0.92      2400
weighted avg       0.97      0.97      0.97      2400


Top 10 Most Important Features:
                               feature  importance
5              days_since_last_session    0.678077
4                  days_since_creation    0.203851
2                               org_id    0.058556
3                   invited_by_user_id    0.037420
0             opted_in_to_mailing_list    0.004518
1           enabled_for_marketing_drip    0.004289
8    creation_source_PERSONAL_PROJECTS    0.003759
7           creation_source_ORG_INVITE    0.002552
6         creation_source_GUEST_INVITE    0.002466
10  creation_source_SIGNUP_GOOGLE_AUTH    0.002450


Classification Report Analysis:

a) Accuracy: The overall accuracy is 0.97, which means the model correctly predicts 97% of all cases. This is a high accuracy, but we should be cautious of potential class imbalance.

b) Class 0 (Non-adopted users):
Precision: 0.97
Recall: 0.99
F1-score: 0.98
Support: 2072 samples

c) Class 1 (Adopted users):
Precision: 0.95
Recall: 0.80
F1-score: 0.87
Support: 328 samples
Observations:

There's a class imbalance: many more non-adopted users (2072) than adopted users (328).

The model performs very well in identifying non-adopted users (high precision and recall).

For adopted users, the model has high precision (0.95) but lower recall (0.80). This means it's very accurate when it predicts a user will be adopted, but it misses some adopted users.

Feature Importance Analysis:

a) Top 3 most important features:
days_since_last_session (67.8%)
days_since_creation (20.4%)
org_id (5.9%)

b) Other notable features:
invited_by_user_id (3.7%)
opted_in_to_mailing_list and enabled_for_marketing_drip (both around 0.4%)
Various creation sources (each less than 0.4%)

Observations:
User activity recency (days_since_last_session) is by far the most important predictor of adoption.

Account age (days_since_creation) is the second most important feature.
The organization a user belongs to (org_id) has some influence on adoption.
Who invited the user (invited_by_user_id) has a minor impact.

Marketing-related features and creation sources have very little impact on predicting adoption.

For Future Work:

Focus on user engagement: Since the recency of last session is the strongest predictor, implement strategies to encourage frequent logins, such as email reminders, push notifications, or engaging content.

Improve onboarding for new users: The importance of account age suggests that getting users engaged early is crucial. Enhance the onboarding process to demonstrate value quickly.

Investigate organizational factors: Look into what makes certain organizations (org_id) more conducive to user adoption. This could inform strategies for targeting and supporting specific types of organizations.

Refine the adoption model: Consider adjusting the threshold for predicting adoption to improve recall for adopted users, if identifying potential adopters is more important than precision.

Re-evaluate marketing strategies: Given the low importance of marketing-related features, review and possibly revamp marketing approaches to make them more effective in driving adoption.

Collect more data: Consider gathering more detailed user activity data, as the current features don't capture the full picture of what drives adoption.
Address class imbalance: Use techniques like oversampling, undersampling, or SMOTE to balance the classes and potentially improve the model's performance on the minority class (adopted users).