# Relax Data Science Challenge

## Identifying Adopted Users
An *adopted user* is a user who has logged into the product on three separate days in at least one seven-day period.

Note: I considered every possible 7-day window. For example, Jan 1 to Jan 7 is one window, and Jan 2 to Jan 8 is another. A user who logs in on three separate days between Jan 1 and Jan 7 and never logs in again is still counted as an adopted user.

In [1]:
import pandas as pd
engagement_df = pd.read_csv('takehome_user_engagement.csv')
engagement_df.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [2]:
engagement_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [3]:
# Convert time_stamp to datetime datatype
engagement_df.time_stamp = pd.to_datetime(engagement_df.time_stamp)

In [4]:
# Only the calendar date matters for adoption, so drop the time-of-day part
engagement_df.time_stamp = engagement_df.time_stamp.dt.date

In [5]:
# Explore visited column
engagement_df.visited.value_counts()

1    207917
Name: visited, dtype: int64

In [6]:
# Check for duplicate rows and for users with more than one row on the same day (per the project description there should be none)
print(sum(engagement_df.duplicated()))
sum(engagement_df.duplicated(subset=['time_stamp', 'user_id']))

0


0

In [7]:
# Find out number of unique user_ids
engagement_df.user_id.nunique()

8823

#### Steps for identifying adopted users.

- Get the list of unique users
- Get each user's login dates in ascending order
- Loop through the list of dates. If the date at index i is less than 7 days before the date two elements further on (index i + 2), mark the user as adopted and break out of the loop. If the loop completes without breaking, the user is not adopted.

In [8]:
# Sort the data frame by each user and dates visited
engagement_df_sorted = engagement_df.sort_values(['user_id' ,'time_stamp'])

In [9]:
# Get the list of unique users
user_list = list(engagement_df_sorted.user_id.unique())

In [10]:
# Flag each user as adopted (1) or not adopted (0)
is_adopted = [0] * len(user_list)
for i, v in enumerate(user_list):
    # Login dates for this user, already sorted in ascending order
    sub_df = engagement_df_sorted[engagement_df_sorted.user_id == v]
    dates = list(sub_df.time_stamp)
    for j in range(len(dates) - 2):
        # Three login days share a 7-day window if the 1st and 3rd are < 7 days apart
        if (dates[j + 2] - dates[j]).days < 7:
            is_adopted[i] = 1
            break
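
The loop above filters the data frame once per user. As an alternative, here is a minimal vectorized sketch of the same rule. It is not the code used to produce the results in this notebook; the function name `flag_adopted` and the toy login table are assumptions made for illustration.

```python
import pandas as pd

def flag_adopted(logins: pd.DataFrame) -> pd.Series:
    """Return a 0/1 Series indexed by user_id (hypothetical helper)."""
    df = logins[['user_id', 'time_stamp']].copy()
    # Keep one row per user per calendar day
    df['day'] = pd.to_datetime(df['time_stamp']).dt.normalize()
    df = df.drop_duplicates(['user_id', 'day']).sort_values(['user_id', 'day'])
    # For every login day, look up the same user's login day two rows later
    two_ahead = df.groupby('user_id')['day'].shift(-2)
    # Missing values (fewer than two later logins) compare as False
    within_week = (two_ahead - df['day']) < pd.Timedelta(days=7)
    return within_week.groupby(df['user_id']).any().astype(int).rename('is_adopted')

# Toy data: user 1 logs in on Jan 1, 3, 7; user 2 on Jan 1, 5, 8; user 3 once
toy_logins = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3],
    'time_stamp': ['2014-01-01', '2014-01-03', '2014-01-07',
                   '2014-01-01', '2014-01-05', '2014-01-08',
                   '2014-01-01'],
})
adopted = flag_adopted(toy_logins)
print(adopted)
```

In the toy data, user 1 is flagged because Jan 1 and Jan 7 are six days apart, while user 2 is not because Jan 1 and Jan 8 are seven days apart.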

In [11]:
is_adopted_df = pd.DataFrame({'user_id' : user_list,
                'is_adopted' : is_adopted})
is_adopted_df.head()

Unnamed: 0,user_id,is_adopted
0,1,0
1,2,1
2,3,0
3,4,0
4,5,0


In [12]:
# Fraction of adopted versus non-adopted users
import matplotlib.pyplot as plt
num_ado_users = is_adopted_df.is_adopted.value_counts(normalize=True)
print(num_ado_users)
num_ado_users.plot(kind='bar')
plt.xticks([0, 1], ['Not adopted', 'Adopted'])
plt.ylabel('Fraction of users')
plt.title("Fraction of Adopted and Not Adopted Users");

0    0.818429
1    0.181571
Name: is_adopted, dtype: float64


## Model to Predict Adopted Users

In [13]:
# Read user data
user_data_df = pd.read_csv('takehome_users.csv', encoding='latin-1')
user_data_df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [14]:
user_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [15]:
user_data_df.describe()

Unnamed: 0,object_id,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
count,12000.0,8823.0,12000.0,12000.0,12000.0,6417.0
mean,6000.5,1379279000.0,0.2495,0.149333,141.884583,5962.957145
std,3464.24595,19531160.0,0.432742,0.356432,124.056723,3383.761968
min,1.0,1338452000.0,0.0,0.0,0.0,3.0
25%,3000.75,1363195000.0,0.0,0.0,29.0,3058.0
50%,6000.5,1382888000.0,0.0,0.0,108.0,5954.0
75%,9000.25,1398443000.0,0.0,0.0,238.25,8817.0
max,12000.0,1402067000.0,1.0,1.0,416.0,11999.0


In [16]:
user_data_df.creation_source.value_counts(normalize=True)

ORG_INVITE            0.354500
GUEST_INVITE          0.180250
PERSONAL_PROJECTS     0.175917
SIGNUP                0.173917
SIGNUP_GOOGLE_AUTH    0.115417
Name: creation_source, dtype: float64

In [17]:
# Percentage of users opted for mailing list
user_data_df.opted_in_to_mailing_list.value_counts(normalize=True)

0    0.7505
1    0.2495
Name: opted_in_to_mailing_list, dtype: float64

In [18]:
# Percentage of users who are on marketing drip
user_data_df.enabled_for_marketing_drip.value_counts(normalize=True)

0    0.850667
1    0.149333
Name: enabled_for_marketing_drip, dtype: float64

In [19]:
user_data_df.org_id.nunique()

417

In [20]:
user_data_df.invited_by_user_id.nunique()

2564

### Characteristics of the dataset
The engagement file contains 207,917 logins from 8,823 users and has no missing values. Only 18% of these users are adopted and 82% are not, so the dataset is quite imbalanced.

The users file has 12,000 records with 10 columns. The fields `last_session_creation_time` and `invited_by_user_id` have some missing values. Since I do not use these columns in my analysis, I do not handle these missing values.

25% of users opted in to the mailing list and 15% are on the marketing drip. Users belong to 417 different organizations. 2,564 users invited others to join.

### Which features to use in model training
Features such as timestamps and IDs are too close to unique per user to be useful as predictors. It is possible to extract the weekday and month from the dates and use them as predictors, but I did not do so here because the weekday or month of a user's sign-up or last login seems unlikely to affect adoption. So I used only the following features in model building: `creation_source`, `opted_in_to_mailing_list`, and `enabled_for_marketing_drip`.
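
For reference, here is a minimal sketch of how weekday and month could be extracted from `creation_time` if one wanted to test them as predictors. These features are not used in the model in this notebook; the column names `creation_weekday` and `creation_month` are assumptions, and the two timestamps are the first two rows of the users file.

```python
import pandas as pd

# Two creation timestamps taken from the first rows of takehome_users.csv
signups = pd.DataFrame({
    'creation_time': ['2014-04-22 03:53:30', '2013-11-15 03:45:04'],
})
created = pd.to_datetime(signups['creation_time'])

# dayofweek counts from Monday=0 to Sunday=6
signups['creation_weekday'] = created.dt.dayofweek
signups['creation_month'] = created.dt.month
print(signups)
```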

#### Merging dataframes
The *takehome_users* file lists 12,000 users, but login details exist for only 8,823 of them. I could treat the remaining users as not adopted, on the assumption that they signed up but never logged in. However, I cannot be sure of this, because their login data may simply be missing. Moreover, the dataset is already imbalanced in favor of non-adopted users. So I decided to analyze only the 8,823 users for whom login details are available.
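
The two options can be compared with a left merge and pandas' `indicator=True` flag, which records whether each row found a match. The sketch below uses a made-up four-user table, not the real files, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy tables: user 3 has no login records
users = pd.DataFrame({'object_id': [1, 2, 3, 4]})
labels = pd.DataFrame({'user_id': [1, 2, 4], 'is_adopted': [0, 1, 0]})

merged = users.merge(labels, left_on='object_id', right_on='user_id',
                     how='left', indicator=True)

# Rows marked 'left_only' are users without any login record
no_logins = merged['_merge'] == 'left_only'
print(no_logins.sum())

# Option A: assume users without logins are not adopted
labelled_all = merged.assign(is_adopted=merged['is_adopted'].fillna(0).astype(int))

# Option B (the choice made in this notebook): keep only users with logins
with_logins = merged[~no_logins]
```

Option B is equivalent to an inner merge, which is what the next cell does.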


In [21]:
# Merge dataframes
user_adopted_df = user_data_df.merge(is_adopted_df, right_on='user_id', left_on='object_id', how='inner')
user_adopted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8823 entries, 0 to 8822
Data columns (total 12 columns):
object_id                     8823 non-null int64
creation_time                 8823 non-null object
name                          8823 non-null object
email                         8823 non-null object
creation_source               8823 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      8823 non-null int64
enabled_for_marketing_drip    8823 non-null int64
org_id                        8823 non-null int64
invited_by_user_id            4776 non-null float64
user_id                       8823 non-null int64
is_adopted                    8823 non-null int64
dtypes: float64(2), int64(6), object(4)
memory usage: 896.1+ KB


## Modelling
Since the goal is to learn which factors predict future user adoption, and the features are categorical, I decided to use a decision tree. In sklearn, a fitted decision tree exposes feature importances, and the classifier has a `class_weight` parameter that helps with class imbalance in the target labels.
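
To make the class-imbalance handling concrete: with `class_weight='balanced'`, sklearn weights each class by `n_samples / (n_classes * count_of_class)`. Below is a small pure-Python sketch of that formula. The helper name `balanced_class_weights` and the 82/18 toy split are assumptions chosen to resemble the imbalance seen above; they are not the actual training counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy labels: 82 not adopted (0) and 18 adopted (1)
toy_labels = [0] * 82 + [1] * 18
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these weights, each class contributes the same total weight (50 in this toy case), so the tree is not rewarded for simply predicting the majority class.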

In [22]:
# Prepare features and target variables
features = user_adopted_df[['creation_source', 'opted_in_to_mailing_list', 'enabled_for_marketing_drip']]
target = user_adopted_df['is_adopted']

In [23]:
# One hot encode categorical variables
features_final = pd.get_dummies(features)
features_final.head()

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
0,1,0,1,0,0,0,0
1,0,0,0,1,0,0,0
2,0,0,0,1,0,0,0
3,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0


In [24]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the features and target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_final, 
                                                    target, 
                                                    test_size = 0.25, 
                                                    random_state = 0)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 6617 samples.
Testing set has 2206 samples.


### Decision Tree

In [25]:
from sklearn.metrics import f1_score, make_scorer, accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(class_weight='balanced') # Set class_weight='balanced' to take care of class imbalance.

# Fit the model and predict on the test set.
clf.fit(X_train, y_train)
test_predictions = clf.predict(X_test)

test_accuracy = round(accuracy_score(y_test, test_predictions), 2)
test_f1 = round(f1_score(y_test, test_predictions), 2)

print (f'Test accuracy: {test_accuracy}')
print (f'Test f1: {test_f1}')

Test accuracy: 0.51
Test f1: 0.29


In [26]:
# Print per-class precision, recall and F1 score
print(classification_report(y_test, test_predictions, target_names=['Not Adopted', 'Adopted']))

             precision    recall  f1-score   support

Not Adopted       0.82      0.50      0.62      1787
    Adopted       0.20      0.53      0.29       419

avg / total       0.70      0.51      0.56      2206
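
As a sanity check on these numbers, the short sketch below recomputes the F1 score for the Adopted class from the rounded precision and recall in the report, and computes the accuracy that a trivial "always Not Adopted" classifier would reach on the same test set. The baseline is an addition for context, not something computed elsewhere in this notebook, and because the inputs are rounded the F1 value is only approximate.

```python
# Rounded values taken from the classification report above
precision_adopted = 0.20
recall_adopted = 0.53

# F1 is the harmonic mean of precision and recall
f1_adopted = 2 * precision_adopted * recall_adopted / (precision_adopted + recall_adopted)
print(round(f1_adopted, 2))  # 0.29

# Support counts from the same report
support_not_adopted = 1787
support_adopted = 419

# Accuracy of always predicting the majority class (Not Adopted)
baseline_accuracy = support_not_adopted / (support_not_adopted + support_adopted)
print(round(baseline_accuracy, 2))  # 0.81
```

The balanced tree gives up accuracy relative to this baseline in exchange for recovering about half of the adopted users, which the baseline would miss entirely.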



In [27]:
pd.DataFrame({'Features': features_final.columns,
              'Feature Score': clf.feature_importances_})

Unnamed: 0,Features,Feature Score
0,opted_in_to_mailing_list,0.079595
1,enabled_for_marketing_drip,0.032639
2,creation_source_GUEST_INVITE,0.589778
3,creation_source_ORG_INVITE,0.133066
4,creation_source_PERSONAL_PROJECTS,0.147948
5,creation_source_SIGNUP,0.016728
6,creation_source_SIGNUP_GOOGLE_AUTH,0.000245


## Conclusion
In this project, I first determined whether each user is adopted, based on whether the user logged in on at least three separate days in any 7-day period. Then I built a model to predict user adoption. I used a decision tree, which reached an accuracy of 0.51 and an F1 score of 0.29 on the test set. More sophisticated models such as XGBoost or Random Forest, or neural networks such as a multilayer perceptron, might improve on this performance.

Among the predictors, creation source is the strongest predictor of user adoption according to the decision tree model. In particular, whether or not a user was created through GUEST_INVITE has the highest feature importance score.
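
Feature importance shows how much a feature is used in the tree's splits, not whether it raises or lowers adoption. One way to check the direction is to compare adoption rates per creation source. The sketch below does this on a made-up six-row table; the rows are purely illustrative and say nothing about the real rates.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the real dataset
toy = pd.DataFrame({
    'creation_source': ['GUEST_INVITE', 'GUEST_INVITE', 'ORG_INVITE',
                        'ORG_INVITE', 'SIGNUP', 'SIGNUP'],
    'is_adopted': [1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the adoption rate within each group
rate_by_source = (toy.groupby('creation_source')['is_adopted']
                     .mean()
                     .sort_values(ascending=False))
print(rate_by_source)
```

On the real data, the same `groupby` could be applied to `user_adopted_df`.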