##Predicting User Engagement in a Corporate Collaboration Network

by Mike Yea
DAT7

##1. Background

* In 2012, an opt-in, web-based (and mobile-enabled) collaboration network was launched at my organization, a 90,000-employee federal agency
* While initial roll-out and user adoption were impressive, the growth rate of the network has slowed
* About 80% of public messages/posts go unanswered
* The goal of this project is to inform future user engagement campaigns

##2. Problem Statement

**Can I predict a “lift” in user engagement from message attributes (e.g., message text, attachment, link, @mention)?**    

My initial hypotheses are:
1. Message content and metadata have intrinsic value in predicting user engagement
2. Message poster's role within the organization and activity level within the network are predictors of user engagement

##3. Data

###3.1 Data Pre-Processing
* Worked with de-normalized data (data was **not** normalized across message hierarchy) 
* Included only top-level messages (ignored subsequent messages in the same thread)
* Removed private messages and messages posted in private boards

###3.2 Response Variable
* Number of replies to top-level messages (see *Histogram of ...*)
* Encoded:
  * no reply (80% of data): 0 
  * 1 or more replies (20% of data): 1
* Randomly removed rows from the no-reply set (undersampling) so that the two classes are roughly balanced
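
The encoding and undersampling steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names, the `replied` key, the toy data, and the exact 50/50 target are all assumptions (the write-up does not state the final class ratio).

```python
import random


def encode_response(num_replies):
    """Map a reply count to the binary response: 0 = no reply, 1 = one or more."""
    return 1 if num_replies > 0 else 0


def undersample_majority(rows, label_key="replied", seed=0):
    """Randomly drop rows from the larger class until both classes are equal in size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row in rows:
        by_class[row[label_key]].append(row)
    n = min(len(by_class[0]), len(by_class[1]))
    balanced = rng.sample(by_class[0], n) + rng.sample(by_class[1], n)
    rng.shuffle(balanced)
    return balanced


# Toy data with an 80/20 split like the one described above (values are made up).
reply_counts = [0] * 80 + [1, 2, 3, 1, 5] * 4
rows = [{"num_replies": c, "replied": encode_response(c)} for c in reply_counts]
balanced = undersample_majority(rows)
```

Fixing the seed keeps the sampled subset reproducible between runs, which makes later model comparisons easier to interpret.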

<img src="hist_num.png"> 

##4. Feature Analysis and Selection

###4.1 Feature Engineering

* The message body is by far the most voluminous component of the data
* Hand-engineered 9 more features:
  1. message posted in a group (a proxy for collaborating in a self-selected group) (binary)
  2. ~~attachments (binary)~~
  3. length of message (continuous)
  4. hyperlinks included (binary)
  5. ~~message tone/sentiment (index between -1 and 1)~~
  6. message posed as a question (binary)
  7. number of **key words** observed over time ("experience", "opportunity", and "interest") that appear to draw user engagement (continuous)
  8. message poster's tenure in the collaboration network when a message was posted (number of days; continuous)
  9. **@mentions** one or more users (binary)
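
As a rough sketch of how several of the features above could be derived from a raw message, the snippet below uses only the standard library. The function signature, the regular expressions, and the feature names `in_group`, `msg_length`, `has_link`, and `key_word_count` are hypothetical; `has_qm`, `author_age`, and `has_at_mention` reuse names that appear later in this write-up.

```python
import re
from datetime import date

KEY_WORDS = ("experience", "opportunity", "interest")
URL_RE = re.compile(r"https?://\S+")
# An "@" that is not preceded by a word character, so e-mail addresses are skipped.
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def extract_features(body, group_id, posted_on, author_joined_on):
    """Build a feature dict for one top-level message (assumed input fields)."""
    text = body.lower()
    return {
        "in_group": int(group_id is not None),
        "msg_length": len(body),
        "has_link": int(bool(URL_RE.search(body))),
        "has_qm": int("?" in body),
        # Substring count, so "interested" also counts toward "interest".
        "key_word_count": sum(text.count(w) for w in KEY_WORDS),
        # Poster's tenure in the network, in days, at posting time.
        "author_age": (posted_on - author_joined_on).days,
        "has_at_mention": int(bool(MENTION_RE.search(body))),
    }


body = "Any interest in a data science group? @jdoe see http://example.com"
features = extract_features(
    body,
    group_id=None,
    posted_on=date(2015, 6, 1),
    author_joined_on=date(2015, 1, 1),
)
```

Treating a question mark anywhere in the body as "posed as a question" is a simplification; a stricter rule (for example, requiring the mark at the end of a sentence) may be closer to what was actually used.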

## 5. Model Evaluation

The **null accuracy is .511**.  
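
Null accuracy is the accuracy obtained by always predicting the most frequent class, so it serves as the baseline any model has to beat. A minimal sketch follows; the label counts are illustrative values chosen to reproduce .511, since the actual class counts are not given here.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy of a classifier that always predicts the majority class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical label vector: 511 of one class and 489 of the other.
labels = [0] * 511 + [1] * 489
baseline = null_accuracy(labels)
```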

Two single-number metrics, class prediction accuracy and area under the ROC curve (AUC), are the primary evaluation metrics:

<img src="model_performance_1.png"> 
<img src="model_performance_2.png"> 
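
For reference, both metrics can be computed without any library. The sketch below is an assumption about how they might be calculated, not the code behind the figures above. It computes AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: predicted probabilities thresholded at 0.5 for class predictions.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]
acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a sort-based implementation scales better on large data.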

## 6. Conclusions 

* Neither hypothesis is rejected: both message attributes and user attributes appear to have some predictive value for engagement
* The model with the **hand-engineered features** is chosen for further exploration due to its relatively high interpretability
  * After this model was selected for further analysis, it was evaluated on all 256 combinations of features
  * The model using only 'has_attach', 'has_qm', 'has_key_word', 'author_age', and 'has_at_mention' achieved .628 class prediction accuracy and .654 AUC, a rather **insignificant** improvement over the model using all features (.611 and .647, respectively)
* Training models on **class-balanced data** did **more** to improve performance than did any other method or combination of methods (e.g., more features, tuning)
* Future work:
  * Interplay between subsequent messages and replies
  * Adding "lurker" activity to response
  * Message author's reputation (e.g., "likes", followers)
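
The exhaustive search over 256 feature combinations mentioned in the conclusions corresponds to every subset of 8 features (2^8 = 256, counting the empty set). A minimal sketch of that enumeration is shown below. The feature list mixes the five names quoted above with three hypothetical ones, and `toy_score` is a made-up stand-in for whatever train-and-validate step was actually used.

```python
from itertools import combinations


def all_subsets(features):
    """Yield every subset of `features`, from the empty set to the full set."""
    for k in range(len(features) + 1):
        for combo in combinations(features, k):
            yield combo


def best_subset(features, score_fn):
    """Return (score, subset) for the highest-scoring non-empty subset."""
    return max((score_fn(s), s) for s in all_subsets(features) if s)


# Five names from the write-up plus three hypothetical ones to reach 8 features.
FEATURES = [
    "has_attach", "has_qm", "has_key_word", "author_age",
    "has_at_mention", "in_group", "msg_length", "has_link",
]

# Stand-in scorer: rewards two arbitrary features and penalizes the rest.
# In practice this would fit a model on the subset and return accuracy or AUC.
USEFUL = {"has_qm", "has_at_mention"}


def toy_score(subset):
    return sum(1 if f in USEFUL else -0.1 for f in subset)


n_subsets = sum(1 for _ in all_subsets(FEATURES))
best_score, best = best_subset(FEATURES, toy_score)
```

Because the subset with the best score is chosen after looking at every combination, its score is best confirmed on held-out data that played no part in the search.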