# Predicting User Engagement in a Corporate Collaboration Network

by Mike Yea
DAT7

##1. Background

In 2012, an opt-in, web-based (and mobile-enabled) collaboration network was launched at my organization, a 90,000-employee federal agency.  To date, this tool is the only one of its kind that spans the entire agency.  While the initial roll-out and user adoption were impressive (50 percent of users joined in the first 6 months), the growth rate of the network has slowed.  Among the chief complaints from users is the high number of unanswered "messages," or posts.  Without active collaboration, the network, which is designed to break down organizational silos and link a geographically distributed workforce, will become a shell of its former self.  To prevent such an outcome, my colleagues and I are interested in launching a user engagement campaign to "lift" user engagement.  Rather than inundating users and potential users with mass marketing, we would like to engage users in an informed way.

##2. Problem Statement

**Can I predict a “lift” in user engagement (i.e., replies to public messages) from message attributes (e.g., length of the message, message posed in the form of a question, presence of attachments and hyperlink, key words, message tone/sentiment, message text, information about message poster)?**    

My initial hypotheses are:
1. Message content and metadata have intrinsic value in predicting user engagement
2. Message poster's role within the organization and activity level within the network are predictors of user engagement

##3. Data

###3.1 Data Import

The collaboration network has approximately 3 years of data, which can be exported via a web interface.  I imported both message (messages.csv) and user profile (users.csv) data for the most recent year.  I used pandas to load the data but encountered some encoding errors.  By importing the "sys" library and setting the default encoding to UTF-8, I was able to parse the data:

In [1]:
import pandas as pd
import sys
reload(sys)
sys.setdefaultencoding('utf8')
messages = pd.read_csv('Messages.csv', na_filter=False)
messages.shape
messages.describe()
messages.dtypes
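
As an alternative sketch, assuming Python 3 and a reasonably recent pandas, the encoding can usually be passed straight to `read_csv` instead of changing the interpreter default. The in-memory file and the name `toy_messages` below are stand-ins invented for illustration, not the actual export:

```python
import io
import pandas as pd

# Stand-in for the exported file (assumption): UTF-8 bytes with a non-ASCII character
raw = 'id,body\n1,caf\u00e9 meetup\n'.encode('utf-8')

# Pass the encoding explicitly; na_filter=False keeps blank cells as empty strings
toy_messages = pd.read_csv(io.BytesIO(raw), encoding='utf-8', na_filter=False)
```
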

###3.2 Data Pre-Processing

Because the purpose of this study is to learn about public interaction among users, I 1) removed private messages and messages posted in private groups; and 2) stored the remaining public messages in a data frame:

In [None]:
messages.in_private_group.sum()
messages.in_private_conversation.sum()
public_msgs = messages[(messages.in_private_group == False) & (messages.in_private_conversation == False)]
public_msgs = public_msgs.drop(['in_private_group', 'in_private_conversation'], axis=1)
public_msgs = public_msgs.replace('', 'NA')

The data fields are as follows (filtered to include fields used in the project):

**Message**
* id: unique identifier for each message
* replied_to_id: id of the message to which the subject message is replying; blank if top-level message
* thread_id: id of the top-level message 
* group_id: id of the network group
* group_name: name of the group
* participants: user id(s) of participants in the thread
* in_private_group: whether or not the message is posted to a private group (boolean)
* in_private_conversation: whether or not the message is private (boolean) 
* sender_id: id of the message's author
* body: message text
* attachments: internal identifier
* created_at: date-time group (DTG) when the message was posted

**User**
* id: unique identifier for each user
* job_title: user entered position description
* joined_at: when the user joined the network
* state: whether the user is active or not (boolean)

###3.3 Response Variable

Messages in messages.csv are not normalized: both top-level (or headline) messages and the replies in each thread are stored in the same table.  Because I am primarily interested in how top-level messages induce user engagement (i.e., replies to the initial message), I created a separate data frame that contains only top-level messages and added to it a series containing the number of replies to each top-level message:

In [None]:
'''
1. Create response variable column in a dataframe
'''
#1.1 Initialize a dataframe
y_data = pd.DataFrame()
y_data = y_data.fillna(0)
#1.2 Add number of replies to each original message
#1.2.1 Populate dataframe with top-level message ID
top_msg_id = []
for index, row in public_msgs.iterrows():
    if row.id == row.thread_id:
        top_msg_id.append(row.id)
y_data['id'] = top_msg_id
#1.2.2 Create a function that returns the number of replies to a given top-level message ID
def get_reply_counts(msg_id):
    # Match replied_to_id exactly; a substring match would also count IDs that merely contain msg_id
    return (public_msgs.replied_to_id.astype(str) == str(msg_id)).sum()
    
#1.2.3 Determine the number of replies to each top-level message and store it to the dataframe
cnt_replies = []
for row in y_data.id:
    cnt_replies.append(get_reply_counts(row))
y_data['num_replies'] = cnt_replies 
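
The loop above rescans the whole message table once per top-level message. As a rough alternative sketch, a similar count can be derived in a single pass by grouping on `thread_id`. Note the assumption: this version counts every reply in a thread (including replies to replies), which may differ from the count used in this project. The toy data frame and the names `toy_msgs` and `y_toy` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: messages 1 and 4 are top-level (id == thread_id)
toy_msgs = pd.DataFrame({
    'id':            [1, 2, 3, 4],
    'thread_id':     [1, 1, 1, 4],
    'replied_to_id': ['NA', '1', '2', 'NA'],
})

# Top-level messages are those whose id equals their thread_id
is_top = toy_msgs['id'] == toy_msgs['thread_id']

# Count the non-top-level messages in each thread
reply_counts = toy_msgs[~is_top].groupby('thread_id').size()

# Attach counts to top-level messages; threads without replies get 0
y_toy = toy_msgs.loc[is_top, ['id']].copy()
y_toy['num_replies'] = y_toy['id'].map(reply_counts).fillna(0).astype(int)
```
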

As several active users had suggested, a vast majority of messages in the sample data, 80.02 percent, go unanswered (remarkably, this figure has hovered around 80% throughout the **history of the collaboration network**).  A histogram of the number of replies follows:

<img src="hist_num.png"> 

Because of this obvious class imbalance, rows from the most frequent class (i.e., no reply) were removed at random (without replacement) from the data frame to achieve a roughly even distribution of classes:

In [None]:
#1.2.4 Remove random rows of y_data that have num_replies == 0 to achieve class balance
import numpy as np
# 5774 is the number of zero-reply rows to drop in this sample
rows = np.random.choice(y_data[y_data.num_replies == 0].index.values, 5774, replace=False)
# rows already holds index labels, so drop them directly and keep the returned frame
y_data = y_data.drop(rows)
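
The same undersampling idea can be sketched without a hard-coded row count by computing how many majority-class rows to drop. This is a minimal illustration on invented toy data, with a fixed seed added (an assumption, not part of the original analysis) so the sample is reproducible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy response table: 8 unanswered messages, 2 answered
y_toy = pd.DataFrame({
    'id': range(10),
    'num_replies': [0, 0, 0, 0, 0, 0, 0, 0, 3, 1],
})

zero_idx = y_toy[y_toy.num_replies == 0].index.values
n_positive = int((y_toy.num_replies > 0).sum())

# Drop enough zero-reply rows that both classes end up the same size
rng = np.random.RandomState(1)  # fixed seed for reproducibility (assumption)
to_drop = rng.choice(zero_idx, len(zero_idx) - n_positive, replace=False)
balanced = y_toy.drop(to_drop)
```
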

##4. Feature Analysis and Selection

###4.1 Feature Engineering

Given the volume of unstructured data, transforming the body of the message into a document term matrix (DTM) appears to be a good idea (user-entered job title also was transformed into a DTM).  Additionally, there are 9 hand-engineered features that I hypothesized could correlate with the response variable:

1. message posted in a group (a proxy for collaborating in self-selected group) (binary)
2. attachments (binary)
3. length of message (continuous)
4. hyperlinks included (binary)
5. message tone/sentiment (index between -1 and 1)
6. message posed as a question (binary)
7. number of key words observed over time ("experience", "opportunity", and "interest") that appear to draw user engagement (continuous)
8. message poster's tenure in the collaboration network when a message was posted (number of days; continuous)
9. @mentions one or more users (binary)
    
A number of different approaches were used to engineer the above features.  Shown below as an example is the key word feature (#7), built with regular expressions:

In [None]:
'''
2. Add features to the dataframe
'''

#2.7 Key words ["experience", "opportunity", "interest"]: count matches with apply(search_key_words)
import re
def search_key_words(text):
    return len(re.findall(r"(experience|opportunity|interest)", text))
public_msgs['has_key_word'] = public_msgs.body.apply(search_key_words)
df = public_msgs[['id','has_key_word']]
y_data = pd.merge(y_data,df)  
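
Several of the other features can be derived in a similar way with pandas string methods. The sketch below is illustrative only: the exact definitions used in the project are not shown, so the patterns are assumptions, and `has_link` is a hypothetical column name (the other names match the feature columns used later).

```python
import pandas as pd

# Hypothetical message bodies, for illustration only
toy = pd.DataFrame({'body': [
    'Does anyone have experience with this tool?',
    'Slides are posted at http://example.com/slides',
    '@jane.doe thanks for the update',
]})

toy['msg_len'] = toy.body.str.len()                                 # length in characters
toy['has_qm'] = toy.body.str.contains(r'\?').astype(int)            # contains a question mark
toy['has_link'] = toy.body.str.contains(r'https?://').astype(int)   # contains a hyperlink (hypothetical name)
toy['has_at_mention'] = toy.body.str.contains(r'@\w+').astype(int)  # contains an @mention
```
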

###4.2 Feature Selection

Plotting the response against each feature in a scatter plot and adding a regression line makes it possible to see which features appear to be strongly correlated with the response:

<img src="pair_wise_1.png"> 
<img src="pair_wise_2.png"> 

Judging from the fitted regression lines, all of the features, individually and collectively, seem to have limited explanatory power, and the linear model does not appear to be a good fit.  The negative slope of the "message sentiment/tone" chart was unexpected; I thought this feature and the response would be positively correlated.
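
A numeric complement to the plots is the Pearson correlation of each feature with the response. The following is a minimal sketch on invented toy values (not the project data), using column names that appear elsewhere in this report:

```python
import pandas as pd

# Hypothetical toy table: one feature tracks the response closely, one does not
toy = pd.DataFrame({
    'msg_len':     [10, 20, 30, 40, 50, 60],
    'has_attach':  [0, 1, 0, 1, 1, 0],
    'num_replies': [0, 1, 1, 2, 2, 3],
})

# Pearson correlation of each feature column with the response
corr_with_response = toy.drop('num_replies', axis=1).corrwith(toy['num_replies'])
```

Values near zero would be consistent with the flat regression lines in the charts.
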

##5. Modeling

###5.1 Conversion to a Classification Model

I realized early in the process that an ordinary least squares (OLS) regression model would not be a good candidate (R² of .05 and RMSE of 1.5).  Hence, the continuous response variable was encoded as a class, making this a classification problem.  The response variable is encoded as follows:

* **1**: one or more replies (approximately 50% of all responses)
* **0**: no reply (approximately 50% of all responses)

In [None]:
'''
4. Logistic Regression using DTM of body of the message as a feature and other features
'''
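# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in later releases)
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import scipy as sp
import scipy.sparse  # make sure the sparse submodule is loaded for sp.sparse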
#4.1 Convert the response variable to classes
#4.1 Binary: 0 reply to post: 0; 1 or more replies to post: 1
y_data['num_replies_class'] = np.where(y_data.num_replies>0,1,0)

###5.2 Logistic Regression and Naive Bayes (NB)
Eight models were considered for evaluation at the outset, a logistic regression model and an NB model for each of four feature sets: 1) the message body DTM; 2) the DTM of the message author's job title; 3) both DTMs plus a sparse matrix representing the 9 hand-engineered features; and 4) only the 9 hand-engineered features:

In [None]:
#4.2 Add the body of the message to the dataframe
df = public_msgs[['id','body']]
y_data = pd.merge(y_data,df)
#4.3 Split the new DataFrame into training and testing sets
feature_cols = ['body', 'job_title', 'in_group', 'has_attach','msg_len', 'has_qm', 'has_key_word', 'author_age', 'has_at_mention']
X = y_data[feature_cols]
y = y_data['num_replies_class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#4.4 Use CountVectorizer with body of the message only
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
train_dtm = vect.fit_transform(X_train['body'])
test_dtm = vect.transform(X_test['body'])
#4.4b Use a separate CountVectorizer with job_title of the author only
vect_jt = CountVectorizer(ngram_range=(1, 2), min_df=2)
train_dtm_jt = vect_jt.fit_transform(X_train['job_title'])
test_dtm_jt = vect_jt.transform(X_test['job_title'])
#4.5 Cast other feature columns to float and convert to a sparse matrix
# X_train is a DataFrame, so select the hand-engineered columns by position with .iloc
extra = sp.sparse.csr_matrix(X_train.iloc[:, 2:].astype(float).values)
#4.6 Combine sparse matrices
train_dtm_extra = sp.sparse.hstack((train_dtm, train_dtm_jt, extra))
train_dtm_jt_extra = sp.sparse.hstack((train_dtm_jt, extra))
#4.7 Repeat for testing set
extra = sp.sparse.csr_matrix(X_test.iloc[:, 2:].astype(float).values)
test_dtm_extra = sp.sparse.hstack((test_dtm, test_dtm_jt, extra))
test_dtm_jt_extra = sp.sparse.hstack((test_dtm_jt, extra))
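
To make the column-wise stacking concrete, here is a small self-contained sketch on invented data. The two message bodies and the two numeric columns are hypothetical; the point is only that each row keeps its term counts and gains the extra feature columns.

```python
import numpy as np
import scipy.sparse as sps
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: two messages and two numeric features per message
bodies = ['budget meeting today', 'any experience with budget tools']
dense_features = np.array([[1.0, 20.0],
                           [0.0, 32.0]])

vect_toy = CountVectorizer()
dtm = vect_toy.fit_transform(bodies)        # sparse, shape (2, n_terms)
extra_toy = sps.csr_matrix(dense_features)  # sparse, shape (2, 2)

# Column-wise concatenation of the term counts and the extra features
combined = sps.hstack((dtm, extra_toy)).tocsr()
```
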

In [None]:
#4.8 Use logistic regression with hand engineered features
# X_train and X_test are DataFrames, so select columns by position with .iloc
logreg = LogisticRegression(C=.1)
logreg.fit(X_train.iloc[:, 2:], y_train)
y_pred_class = logreg.predict(X_test.iloc[:, 2:])
y_pred_prob = logreg.predict_proba(X_test.iloc[:, 2:])
metrics.accuracy_score(y_test, y_pred_class) #.610
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.647
#4.8a Use NB with hand engineered features
nb = MultinomialNB()
nb.fit(X_train.iloc[:, 2:], y_train)
y_pred_class = nb.predict(X_test.iloc[:, 2:])
y_pred_prob = nb.predict_proba(X_test.iloc[:, 2:])
metrics.accuracy_score(y_test, y_pred_class) #.454
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.472
#4.9 Use logistic regression with the body of the message
logreg = LogisticRegression(C=.01)
logreg.fit(train_dtm, y_train)
y_pred_class = logreg.predict(test_dtm)
y_pred_prob = logreg.predict_proba(test_dtm)
metrics.accuracy_score(y_test, y_pred_class) #.673
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.709
#4.9a Use logistic regression with the job title of the author
logreg = LogisticRegression(C=1)
logreg.fit(train_dtm_jt, y_train)
y_pred_class = logreg.predict(test_dtm_jt)
y_pred_prob = logreg.predict_proba(test_dtm_jt)
metrics.accuracy_score(y_test, y_pred_class) #.570
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.622
#4.9b Use NB with the body of the message
nb = MultinomialNB()
nb.fit(train_dtm, y_train)
y_pred_class = nb.predict(test_dtm)
y_pred_prob = nb.predict_proba(test_dtm)
metrics.accuracy_score(y_test, y_pred_class) #.670
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.692
#4.9c Use NB with the job title of the author
nb = MultinomialNB()
nb.fit(train_dtm_jt, y_train)
y_pred_class = nb.predict(test_dtm_jt)
y_pred_prob = nb.predict_proba(test_dtm_jt)
metrics.accuracy_score(y_test, y_pred_class) #.574
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.623
#4.10 Use logistic regression with all features
logreg = LogisticRegression(C=.01)
logreg.fit(train_dtm_extra, y_train)
y_pred_class = logreg.predict(test_dtm_extra)
y_pred_prob = logreg.predict_proba(test_dtm_extra)
metrics.accuracy_score(y_test, y_pred_class) #.673
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.711
metrics.confusion_matrix(y_test, y_pred_class)
#4.10a Use logistic regression with job title and hand engineered features
logreg = LogisticRegression(C=.1)
logreg.fit(train_dtm_jt_extra, y_train)
y_pred_class = logreg.predict(test_dtm_jt_extra)
y_pred_prob = logreg.predict_proba(test_dtm_jt_extra)
metrics.accuracy_score(y_test, y_pred_class) #.650
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.682
metrics.confusion_matrix(y_test, y_pred_class)
#4.10b Use NB with all features
nb = MultinomialNB()
nb.fit(train_dtm_extra, y_train)
y_pred_class = nb.predict(test_dtm_extra)
y_pred_prob = nb.predict_proba(test_dtm_extra)
metrics.accuracy_score(y_test, y_pred_class) #.674
metrics.roc_auc_score(y_test, y_pred_prob[:,1]) #.696

###5.3 Model Evaluation

The null accuracy is .511.  Two point metrics, class prediction accuracy and area under the ROC curve (AUC), are the primary evaluation metrics:

<img src="model_performance_1.png"> 
<img src="model_performance_2.png"> 
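
For reference, null accuracy is the accuracy obtained by always predicting the most frequent class. A minimal sketch of that calculation, using invented labels rather than the project's test set:

```python
import numpy as np

# Hypothetical test labels (1 = at least one reply, 0 = no reply)
y_test_toy = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# Share of the positive class, then take the larger of the two class shares
p_positive = y_test_toy.mean()
null_accuracy = max(p_positive, 1 - p_positive)
```
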

The first group of models, the logistic regression and NB models using all features, performed best, outperforming the null model by 16 percentage points (ref: class prediction accuracy).  I nevertheless chose the logistic regression model that used only the hand-engineered features, because it is more interpretable and supports the objective of finding "levers" of user engagement "lift."

##6. Findings and Conclusions

###6.1 Findings/Conclusions

* Do not reject the hypothesis that the body of the message and metadata of the message are predictors of user engagement
* Do not reject the hypothesis that the message author's position/title is a predictor of user engagement
* The model with the hand-engineered features is chosen for further exploration due to its relatively high interpretability
  * After this model was selected for further analysis, it was evaluated on all 256 combinations of features
  * The model with the following features--'has_attach', 'has_qm', 'has_key_word', 'author_age', and 'has_at_mention'--achieved .628 class prediction accuracy and .654 AUC, a marginal improvement over the model using all features (.611 and .647, respectively)
* Training models on class-balanced data did more to improve performance than did any tuning or feature engineering on data with class imbalance
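
How the feature combinations were enumerated is not shown here; one way to run such an exhaustive subset search is sketched below with `itertools.combinations`. Everything in it is an assumption for illustration: the data are synthetic, the feature names `f0` to `f2` are hypothetical, and with n features the loop visits 2**n - 1 non-empty subsets.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 3 hypothetical features, only the first is informative
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)
names = ['f0', 'f1', 'f2']

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

results = []
# Every non-empty subset of columns: 2**3 - 1 = 7 candidates here
for k in range(1, len(names) + 1):
    for cols in combinations(range(len(names)), k):
        cols = list(cols)
        model = LogisticRegression(C=0.1)
        model.fit(X_tr[:, cols], y_tr)
        acc = model.score(X_te[:, cols], y_te)  # class prediction accuracy on held-out data
        results.append((acc, [names[c] for c in cols]))

# Keep the subset with the highest held-out accuracy
best_acc, best_features = max(results, key=lambda r: r[0])
```

Selecting on a single held-out split can overfit to that split, so cross-validation within the training data would be a safer basis for comparing subsets.
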

    
###6.2 Future Work

* Understanding the role of replies to original posts that garner greater response could improve problem definition and solution
* Obtaining message poster reputation, which was not available, can potentially improve the predictive power of model alternatives
* Only about 4% of 18,000 users are "engaged" (i.e., post content, comment on other users' content, click on the like button) in any given reporting period.  Approximately 30% of all users are considered "lurkers."  Lurker data is not available but is obtainable, provided I get approval from the CIO.  Using both active user and lurker data is critical to measuring the true engagement level of users