#Predicting User Engagement in a Corporate Collaboration Network

by Mike Yea
DAT7

##1. Background

In 2012, an opt-in, web-based (and mobile-enabled) collaboration network was launched at my organization, a 90,000-employee federal agency. To date, this tool is the only one of its kind that spans the entire agency. While the initial roll-out and user adoption were impressive (50 percent of users joined in the first 6 months), the growth rate of the network has slowed. Among the chief complaints from users is the high number of unanswered "messages" or posts. Without active collaboration, the network, which is designed to break down organizational silos and link a geographically distributed workforce, will become a shell of its former self. To prevent such an outcome, my colleagues and I are interested in launching a user engagement campaign to induce a "lift" in user engagement. Rather than inundating users and potential users with mass marketing material, we would like to engage users in an informed way.

##2. Problem Statement

**Can I predict a “lift” in user engagement (i.e., replies to public messages) from message attributes (e.g., length of the message, whether the message is posed as a question, presence of attachments and hyperlinks, key words, message tone/sentiment, message text, information about the message poster)?**

My hypotheses are:
1. Message content or references cited in the message are correlated positively with the number of replies.
2. The message poster's tenure or role within the organization is correlated with the number of replies.

##3. Data

###3.1 Data Import

The collaboration network has approximately 3 years of data, and the data can be exported via a web interface. I imported both message (messages.csv) and user profile (users.csv) data for the most recent year. I used Pandas to read the data but encountered some encoding errors. By importing the "sys" library and resetting the default encoding, I was able to parse the data:

In [None]:
import pandas as pd
import sys
reload(sys)
sys.setdefaultencoding('utf8')
messages = pd.read_csv('Messages.csv', na_filter=False)
messages.shape
messages.describe()
messages.dtypes
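
The `reload(sys)` / `sys.setdefaultencoding` approach is specific to Python 2. An alternative that avoids changing the interpreter-wide default is to declare the file encoding through the `encoding` argument of `pd.read_csv`. The sketch below is a minimal illustration, assuming the export is UTF-8 encoded; the sample file and its contents are made up for the demonstration and are not network data.

```python
import io
import os
import shutil
import tempfile

import pandas as pd

# Hypothetical two-row export containing non-ASCII text (not real network data).
sample = u"id,body\n1,caf\u00e9 meeting notes\n2,\n"

# Write the sample as UTF-8 to a temporary file.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_messages.csv")
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(sample)

# Declare the encoding explicitly; na_filter=False keeps blank cells as ''.
sample_msgs = pd.read_csv(sample_path, encoding="utf-8", na_filter=False)
shutil.rmtree(tmp_dir)
print(sample_msgs.shape)
```

If the export turns out to use a different encoding, only the `encoding` value would need to change.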

###3.2 Data Pre-Processing

Because the purpose of this study is to learn about the public interaction of users, I 1) removed private messages and messages posted in private groups; and 2) stored the remaining public messages in a data frame:

In [None]:
messages.in_private_group.sum()
messages.in_private_conversation.sum()
public_msgs = messages[(messages.in_private_group == False) & (messages.in_private_conversation == False)]
public_msgs = public_msgs.drop(['in_private_group', 'in_private_conversation'], axis=1)
public_msgs = public_msgs.replace('', 'NA')

The data fields are as follows (filtered to include fields used in the project):

**messages-**
    id: unique identifier for each message
    replied_to_id: id of the message to which the subject message is replying; blank if top-level message
    thread_id: id of the top-level message 
    group_id: id of the network group
    group_name: name of the group
    participants: user id(s) of participants in the thread
    in_private_group: whether or not the message is posted to a private group (boolean)
    in_private_conversation: whether or not the message is private (boolean) 
    sender_id: id of the message's author
    body: message text
    attachments: internal identifier
    created_at: date-time group (DTG) when the message was posted

**users-**
    id: unique identifier for each user
    job_title: user entered position description
    joined_at: when the user joined the network
    state: whether the user is active or not (boolean)

###3.3 Response Variable

Additionally, messages in messages.csv are not normalized; both top-level (or headline) messages and the replies in each thread are stored in the same table. Since I am primarily interested in how top-level messages induce user engagement (i.e., replies to the initial message), I created a separate data frame that contains only top-level messages and added to it a series containing the number of replies to each top-level message:

In [None]:
'''
1. Create response variable column in a dataframe
'''
#1.1 Initialize a dataframe
y_data = pd.DataFrame()
#1.2 Add number of replies to each original message
#1.2.1 Populate dataframe with top-level message ID (a top-level message is its own thread)
top_msg_id = []
for index, row in public_msgs.iterrows():
    if row.id == row.thread_id:
        top_msg_id.append(row.id)
y_data['id'] = top_msg_id
#1.2.2 Create a function that returns the number of direct replies to a given top-level message ID
def get_reply_counts(msg_id):
    # Exact comparison; a substring test would also match longer IDs that contain msg_id
    is_reply = public_msgs.replied_to_id.astype(str) == str(msg_id)
    return int(is_reply.sum())

#1.2.3 Determine the number of replies to each top-level message and store it to the dataframe
cnt_replies = []
for msg_id in y_data.id:
    cnt_replies.append(get_reply_counts(msg_id))
y_data['num_replies'] = cnt_replies
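
The loop above rescans the whole message table once per top-level message. A single pass that tallies `replied_to_id` values yields the direct-reply counts in one go. The sketch below uses plain Python and made-up rows, so the tuples and IDs are illustrative assumptions rather than network data.

```python
from collections import Counter

# Hypothetical rows: (id, replied_to_id, thread_id); 'NA' marks a top-level message.
rows = [
    ("101", "NA", "101"),
    ("102", "101", "101"),
    ("103", "101", "101"),
    ("104", "NA", "104"),
    ("1015", "NA", "1015"),
    ("1016", "1015", "1015"),
]

# Tally how many times each message ID appears as a reply target.
reply_counts = Counter(replied_to for _, replied_to, _ in rows if replied_to != "NA")

# Keep only top-level messages (id equals thread_id); missing keys count as 0.
num_replies = {
    msg_id: reply_counts[msg_id]
    for msg_id, _, thread_id in rows
    if msg_id == thread_id
}
print(num_replies)
```

Because the lookup is by exact key, the reply to message 1015 is not attributed to message 101. In pandas, the same tally is available through `public_msgs.replied_to_id.value_counts()`.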

As suggested by several active users, a vast majority, 80.02 percent, of messages in the sample data go unanswered (amazingly enough, this number has hovered around 80% throughout the **history of the collaboration network**).  A histogram of the number of replies is depicted as follows:  

<img src="hist_num.png"> 

##4. Feature Analysis and Selection

###4.1 Feature Engineering

Given the volume of unstructured data, transforming the body of the message into a document-term matrix (DTM) appeared to be a good idea (the user-entered job title was also transformed into a DTM). Additionally, there are 8 features that I hypothesized would correlate with the response variable:

    1. message posted in a group (binary)
    2. attachments (binary)
    3. length of message (continuous)
    4. hyperlinks (binary)
    5. message tone/sentiment (index between -1 and 1)
    6. message posed as a question (binary)
    7. number of key words ("experience", "opportunity", and "interest") that, based on observation over time, appear to draw user engagement (continuous)
    8. message poster's tenure in the collaboration network (number of days; continuous)
    
A number of different approaches were used to engineer the above features. As an example, the key word feature (#7), built with regular expressions, is shown below:

In [None]:
'''
2. Add features to the dataframe
'''

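#2.7 Key words ["experience", "opportunity", "interest"]; use apply(search_key_words)
import re
def search_key_words(text):
    # Count case-sensitive occurrences; longer forms such as "interested" also match
    return len(re.findall(r"(experience|opportunity|interest)", text))
# Despite its name, has_key_word holds a count rather than a binary flag
public_msgs['has_key_word'] = public_msgs.body.apply(search_key_words)
# Merge on the shared 'id' column, which keeps only top-level messages
df = public_msgs[['id','has_key_word']]
y_data = pd.merge(y_data,df)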
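
Several of the other features are simple functions of the message text. The sketch below shows one way they might be derived, assuming plain-text bodies. `extract_text_features` and `has_link` are hypothetical names, treating a question mark as the sign of a question is an assumption, and the case-insensitive key word match is a variation on the pattern above rather than what the project necessarily used.

```python
import re

KEY_WORD_RE = re.compile(r"experience|opportunity|interest", re.IGNORECASE)
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def extract_text_features(body):
    """Return simple text-derived features for one message body."""
    return {
        # Feature 3: message length in characters
        "msg_len": len(body),
        # Feature 4: 1 if the body contains an http(s) link, else 0
        "has_link": int(bool(LINK_RE.search(body))),
        # Feature 6: 1 if the body contains a question mark, else 0
        "has_qm": int("?" in body),
        # Feature 7: number of key word matches, ignoring case
        "key_word_count": len(KEY_WORD_RE.findall(body)),
    }

example = "Interested in a detail opportunity? See http://example.com"
print(extract_text_features(example))
```

Applied to a data frame, such a function could be passed to `apply` on the `body` column in the same way as `search_key_words`.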

###4.2 Feature Selection

Plotting the response against each feature as a scatter plot and adding a regression line makes it possible to see which features appear to be strongly correlated with the response:

<img src="pair_wise_1.png"> 
<img src="pair_wise_2.png"> 

Inspecting the linear regression model suggests that all the features, individually and collectively, have limited explanatory power, and the linear model does not appear to be a good fit. The negative slope of the "message sentiment/tone" chart was unexpected; I thought this feature and the response would be positively correlated.

##5. Modeling

###5.1 Conversion to a Classifier Model

I realized early in the process that an ordinary least squares (OLS) regression model would not be a good candidate (R² value of .05 and RMSE of 1.5). Hence, the continuous response variable was converted to classes, making this a classification problem. This conversion is also useful because the goal is not to predict a particular number of replies but rather a degree of user engagement, which the classes represent. The response variable is converted into a 3-class response as follows:
    
    **2**: more than one reply (approximately 10% of all responses)
    **1**: one reply (approximately 10% of all responses)
    **0**: no reply (approximately 80% of all responses)

In [None]:
'''
4. Logistic Regression using DTM of body of the message as a feature and other features
'''
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import scipy as sp
import scipy.sparse  # make sure the sp.sparse submodule is loaded
#4.1 Convert the response variable to classes
def convert_to_class(num_replies):
    # 2: more than one reply; 1: exactly one reply; 0: no reply
    if num_replies > 1:
        return 2
    elif num_replies == 1:
        return 1
    else:
        return 0
y_data['num_replies_class'] = y_data.num_replies.apply(convert_to_class)

###5.2 Logistic Regression
Three models were considered for evaluation at the outset: 1) a logistic regression model using the message body DTM; 2) a logistic regression model using the DTM of the message author's job title; and 3) a logistic regression model using both DTMs and a sparse matrix representing the 8 features developed earlier:

In [None]:
#4.2 Add the body of the message to the dataframe
df = public_msgs[['id','body']]
y_data = pd.merge(y_data,df)
#4.3 Split the new DataFrame into training and testing sets
feature_cols = ['body', 'job_title', 'in_group', 'has_attach','msg_len', 'has_qm', 'has_key_word', 'author_age']
# Use a NumPy array so the positional slices below (e.g., X_train[:, 0]) work
X = y_data[feature_cols].values
y = y_data['num_replies_class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#4.4 Use CountVectorizer with body of the message only (column 0)
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
train_dtm = vect.fit_transform(X_train[:, 0])
test_dtm = vect.transform(X_test[:, 0])
#4.4b Use a separate CountVectorizer with job_title of the author only (column 1)
vect_jt = CountVectorizer(ngram_range=(1, 2), min_df=2)
train_dtm_jt = vect_jt.fit_transform(X_train[:, 1])
test_dtm_jt = vect_jt.transform(X_test[:, 1])
#4.5 Cast other feature columns to float and convert to a sparse matrix
extra = sp.sparse.csr_matrix(X_train[:, 2:].astype(float))
#4.6 Combine sparse matrices
train_dtm_extra = sp.sparse.hstack((train_dtm, train_dtm_jt, extra))
#4.7 Repeat for testing set
extra = sp.sparse.csr_matrix(X_test[:, 2:].astype(float))
test_dtm_extra = sp.sparse.hstack((test_dtm, test_dtm_jt, extra))

###5.3 Model Evaluation

The null accuracy, i.e., the accuracy obtained by always predicting the most frequent class (0), came out to .714, in line with my rough expectation of about .7. Since the ROC curve and AUC are not supported for a multi-class problem, class prediction accuracy is the primary evaluation metric.
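
The null accuracy calculation can be sketched as follows. The labels are made up for illustration and are not the project data, and `null_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / float(len(labels))

# Hypothetical labels: seven 0s, two 1s, one 2 (illustrative only).
toy_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
print(null_accuracy(toy_labels))
```

Any model whose accuracy does not clearly exceed this baseline is doing no better than always guessing "no reply."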

In [None]:
#4.8 Use logistic regression with the body of the message (**Model 1**)
logreg = LogisticRegression(C=1e9)
logreg.fit(train_dtm, y_train)
y_pred_class = logreg.predict(test_dtm)
metrics.accuracy_score(y_test, y_pred_class) #.716
#4.8b Use logistic regression with the job title of the author (**Model 2**)
logreg = LogisticRegression(C=1e9)
logreg.fit(train_dtm_jt, y_train)
y_pred_class = logreg.predict(test_dtm_jt)
metrics.accuracy_score(y_test, y_pred_class) #.785
#4.9 Use logistic regression with all features (**Model 3**)
logreg = LogisticRegression(C=1e9)
logreg.fit(train_dtm_extra, y_train)
y_pred_class = logreg.predict(test_dtm_extra)
y_pred_prob = logreg.predict_proba(test_dtm_extra)
metrics.accuracy_score(y_test, y_pred_class) #.770
metrics.confusion_matrix(y_test, y_pred_class)

The second model listed above (logistic regression model using the job title of the author as DTM) has the best performance among the three.  The first model is only marginally better at making class predictions than the null model.    
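
Because overall accuracy is dominated by the majority class, per-class recall read off the confusion matrix can show whether a model ever identifies classes 1 and 2. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes) and uses an invented matrix, not the project's results; `per_class_recall` is a hypothetical helper.

```python
def per_class_recall(conf_matrix):
    """Recall per class: diagonal entry divided by its row total."""
    recalls = []
    for i, row in enumerate(conf_matrix):
        total = sum(row)
        recalls.append(row[i] / float(total) if total else 0.0)
    return recalls

# Invented 3x3 confusion matrix for classes 0, 1, 2 (illustrative only).
toy_matrix = [
    [70, 2, 1],
    [10, 3, 1],
    [9, 2, 2],
]
print(per_class_recall(toy_matrix))
```

The output of `metrics.confusion_matrix` could be passed in after converting it with `.tolist()`.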

##6. Findings and Conclusions

###6.1 Findings/Conclusions (Interim until final report)

    a. The eight features chosen or engineered are poor predictors of class
    b. However, the imbalance between the response of "0" and the remaining classes resulted in a high standard (i.e., null accuracy over .7).  I would like to pursue some techniques to reduce the impact of class imbalance (OH 8/1)
    c. Although not discussed above, I converted the response variable to a binary class problem (1: all messages with the number of replies greater than 0; 0: all messages with no reply).  However, none of the alternatives exceeded the performance of the null model
    d. The model using the DTM of the job title (which is self-populated by the user and is not a mandatory field during user registration) has the best class prediction accuracy (need to look at other features such as @mention in the body of the message)
    e. Reject the hypothesis that the body of the message and metadata of the message are predictors of user response to the message
    f. Do not reject the hypothesis that the message author's position/title is a strong predictor (office hour: code that summarizes top tokens of job_title DTM)
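
One common technique for the class imbalance noted in (b) is to weight classes inversely to their frequency. The sketch below computes such weights with the n_samples / (n_classes * class_count) heuristic, which is the formula scikit-learn documents for `class_weight='balanced'`. The labels are illustrative, `balanced_class_weights` is a hypothetical helper, and whether weighting would help on this data is untested.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / float(n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative labels mirroring the roughly 80/10/10 class split.
toy_labels = [0] * 80 + [1] * 10 + [2] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

With this split, the two minority classes receive eight times the weight of class 0, so misclassifying a replied-to message costs the model more during fitting.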
    
###6.2 Future Work

    a. Only about 4% of 18,000 users are "engaged" (i.e., post content, comment on other users' content, click on the like button) in any given reporting period.  Approximately 30% of all users are considered "lurkers."  Lurker data is not currently available but is obtainable, provided I get approval from the CIO.  Using both active user and lurker data is critical to measuring the true engagement level of users
    b. Understanding what features are positively correlated with the response is critical to the overall aim of the project.  I hope to find general features that can serve as "levers" of our marketing arm