Title: Predicting Learner Churn in MOOCs
Slug: predicting_learner_churn   
Summary: If you knew your learners were about to drop out of the awesome course you spent months working on, what would you do about it?    
Date: 2018-02-27 12:00     
Category: Projects    
Tags: Deep Learning, Churn, MOOC, Predictive Maintenance    
Authors: Kabir Khan, Software Engineer at Microsoft Worldwide Learning

## **NOTE: This post is in progress...**

> **This model was trained on data from past runs of Microsoft edX courses that are part of the [Microsoft Professional Program](https://academy.microsoft.com)**

> **If you're interested in the code for this project, it's all available (for a limited time) on [GitHub](https://github.com/kabirkhan/edx_learner_attrition/tree/dev)**

## **Context**

On the Microsoft Worldwide Learning team we are constantly striving to provide a better experience and better content to our learners. That's why it's so important to us that learners are engaged and regularly working on and interacting with our course material. 

A few months ago, we started working to better understand our learners' behavior, particularly the tendency to drop out and not complete courses. We decided we wanted to know when learners are struggling or trailing off, and likely to drop out of a course, before they actually do.

We settled on knowing a week in advance. We figured that with this much notice, we could contact learners and attempt to re-engage them in the course, provide struggling learners with extra resources, and even recommend harder courses to learners who were completing content too quickly.

---

Over the last few months, I've built a model to predict with up to 78% accuracy when learners in a course are likely to drop out a week before they do. This model was based on a few key features of learner activity, aggregated per week of each course.

### **Problem:**
Can we predict, a week in advance, whether a learner will drop out of a course?

### **Proposed solution:**
Deep learning model trained on aggregated learner activity information

## **Packages**

    import pandas as pd

## **The data**

All courses in the Microsoft Professional Program are currently hosted on edX, a platform that does a great job of collecting browser and server telemetry data tied to specific learners.

Example of an event collected by edX. Note that we particularly care about the `event_type` field.

<img src="predicting_learner_churn/1_edx-video-example.png"/>

edX cuts logs for all Microsoft courses and ships them to us, where our awesome BI team ingests them into our data warehouse. From there, I aggregated the log events per course, per week for each learner.
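
As a rough illustration of that aggregation step, here is a minimal pandas sketch. The raw-event columns (`user_id`, `course_id`, `time`, `event_type`), the course start date, and the mapping from event types to feature names are all assumptions made for this example; the real warehouse schema and pipeline code are not shown in this post.

```python
import pandas as pd

# Hypothetical raw events; real edX logs carry many more fields.
events = pd.DataFrame({
    "user_id": [9999, 9999, 9999, 9999, 100001],
    "course_id": ["course-a"] * 5,
    "time": pd.to_datetime([
        "2018-01-02 10:00", "2018-01-03 11:30", "2018-01-09 09:15",
        "2018-01-10 14:00", "2018-01-02 08:45",
    ]),
    "event_type": ["play_video", "problem_check", "play_video",
                   "play_video", "play_video"],
})

course_start = pd.Timestamp("2018-01-01")  # assumed course start date

# Week index relative to the course start (week 0 = first seven days).
events["course_week"] = (events["time"] - course_start).dt.days // 7

# Count events of each type per course, per learner, per week,
# then pivot the event types into one column each.
weekly = (
    events
    .groupby(["course_id", "user_id", "course_week", "event_type"])
    .size()
    .unstack("event_type", fill_value=0)
    .rename(columns={"play_video": "num_video_plays",
                     "problem_check": "num_problems_attempted"})
    .reset_index()
)
weekly.columns.name = None

print(weekly)
```

Each row of `weekly` is one learner in one week of one course, which is the same grain as the aggregated data used for the model.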

### **Aggregated Data Example**

    features = pd.DataFrame([
        {'user_id': 9999, 'course_week': 4, 'user_started_week': 3,
         'num_video_plays': 12, 'num_subsections_viewed': 23,
         'num_problems_attempted': 8, 'num_problems_correct': 5,
         'num_forum_posts': 1.0, 'num_forum_up_votes': 3,
         'avg_forum_sentiment': 0.72},
        {'user_id': 100001, 'course_week': 4, 'user_started_week': 0,
         'num_video_plays': 0, 'num_subsections_viewed': 0,
         'num_problems_attempted': 0, 'num_problems_correct': 0,
         'num_forum_posts': 0, 'num_forum_up_votes': 0,
         'avg_forum_sentiment': 0},
    ])

    features[[
        'user_id', 'course_week', 'user_started_week', 'num_video_plays',
        'num_subsections_viewed', 'num_problems_attempted', 'num_problems_correct',
        'num_forum_posts', 'num_forum_up_votes', 'avg_forum_sentiment'
    ]]

| | user_id | course_week | user_started_week | num_video_plays | num_subsections_viewed | num_problems_attempted | num_problems_correct | num_forum_posts | num_forum_up_votes | avg_forum_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9999 | 4 | 3 | 12 | 23 | 8 | 5 | 1.0 | 3 | 0.72 |
| 1 | 100001 | 4 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 |
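
The example rows show only input features, not the target the model learns. One plausible way to derive a "drops out next week" label from this kind of table is sketched below. The rule used here (a learner counts as dropped out once they have no activity in any later week) and the column name `drops_out_next_week` are assumptions for illustration, not necessarily the definition used in the project.

```python
import pandas as pd

# Toy weekly aggregates with zero-filled rows for inactive weeks.
weekly = pd.DataFrame({
    "user_id":                [1, 1, 1, 1, 2, 2, 2, 2],
    "course_week":            [0, 1, 2, 3, 0, 1, 2, 3],
    "num_video_plays":        [5, 3, 0, 0, 4, 0, 2, 6],
    "num_problems_attempted": [2, 1, 0, 0, 1, 0, 0, 3],
})

activity_cols = ["num_video_plays", "num_problems_attempted"]
weekly["active"] = weekly[activity_cols].sum(axis=1) > 0

# Last week in which each learner did anything at all.
last_active = (
    weekly[weekly["active"]]
    .groupby("user_id")["course_week"]
    .max()
    .rename("last_active_week")
)
weekly = weekly.join(last_active, on="user_id")

# Learners who were never active get -1, so every one of their weeks is labeled 1.
weekly["last_active_week"] = weekly["last_active_week"].fillna(-1)

# Assumed label: 1 if the learner is inactive from the following week onward.
weekly["drops_out_next_week"] = (
    weekly["course_week"] >= weekly["last_active_week"]
).astype(int)

print(weekly[["user_id", "course_week", "active", "drops_out_next_week"]])
```

Note that under this rule a learner who is still active in the final week of data is also labeled 1 for that week, because nothing later has been observed yet. A real pipeline would need to treat the most recent week of a running course separately.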


## **Data Pipeline**

All data transformations are done using the Python [pandas](https://pandas.pydata.org) package. I used Spotify's [luigi](https://github.com/spotify/luigi) package to orchestrate the pandas data manipulation on a Kubernetes cluster hosted on Azure through [acs-engine](https://github.com/Azure/acs-engine). 
Final model data (like the example above) is stored in [Azure Data Lake Store](https://azure.microsoft.com/en-us/services/data-lake-store/) for each course.

> luigi gives tasks structure, allowing complex dependency chains in the form of directed acyclic graphs (DAGs). It comes with an excellent scheduler and a visualizer for queued and running tasks, as well as for task dependency graphs.

>The luigi Dashboard for our project, running the same pipeline for all past Microsoft course runs on edX


<img src="predicting_learner_churn/luigi-dashboard.png"/>

> This is a snapshot of the full pipeline for a few courses, represented as a dependency graph. Each of the subtrees starting with a `Pipeline` node is a job run on the Kubernetes cluster. This pipeline runs once a week for every currently running course, aggregating data from the past week for use in active learning and future training.


<img src="predicting_learner_churn/luigi-full-pipeline.png"/>

Now that we have an idea of the dataset and how it is processed, let's make some predictions. We started with some standard classification methods (k-nearest neighbors, SVM, logistic regression) but didn't see very promising results on the metrics we cared about (we were optimizing for **recall** and **accuracy**).
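
To make the two metrics concrete, here is a small plain-Python sketch that computes both for binary dropout predictions. The labels are made up for illustration and the helper name `accuracy_and_recall` is hypothetical; it is not taken from the project code.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 means 'dropped out'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # dropouts we caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers we left alone
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # dropouts we missed
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Made-up labels: 3 of 10 learners actually drop out.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
always_stay = [0] * 10                      # predicts that nobody drops out
model_like = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # catches 2 of the 3 dropouts

baseline = accuracy_and_recall(y_true, always_stay)
candidate = accuracy_and_recall(y_true, model_like)
print("baseline:", baseline)
print("candidate:", candidate)
```

The always-stay baseline reaches 70% accuracy while catching none of the learners who drop out, which is why recall matters alongside accuracy when the goal is to reach at-risk learners in time.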

## **The Model (v0)**

The initial model architecture was a three-layer neural network with `dropout` and `ReLU` activations at each layer, trained with the `Adam` optimizer. It reached reasonably high accuracy **(71%)**, surpassing the standard machine learning methods we tried earlier.
<img src="predicting_learner_churn/Architecture-Dropout.png"/>
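
For readers who want to see the moving parts, below is a NumPy sketch of the forward pass of a small network in this style. It is not the project's model code: the layer sizes, the dropout rate, and the sigmoid output are assumptions, the weights are random rather than trained, and training with Adam is left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dropout(x, rate, training):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


# Hypothetical sizes: 9 weekly features in, two hidden layers, one output.
sizes = [9, 32, 16, 1]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]


def forward(x, training=False, rate=0.5):
    """Return a dropout probability for each row of x."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = dropout(relu(h @ w + b), rate, training)
    # Assumed output layer: a single sigmoid unit.
    return sigmoid(h @ weights[-1] + biases[-1])


batch = rng.random((4, 9))   # 4 learner-weeks, 9 features each
probs = forward(batch)       # inference mode, so dropout is disabled
print(probs.shape)
```

Dropout is only applied when `training=True`, so predictions at inference time are deterministic for a given set of weights.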

## **The Model (v1) - a deeper network**

As I expanded the training dataset to more and more courses during development, I switched over to the [Azure Batch AI](https://azure.microsoft.com/en-us/services/batch-ai/) service for training. This allowed me to stop worrying about the costs of running deep learning jobs on GPUs in the Kubernetes cluster and let Batch AI take care of autoscaling down to 0 machines.

This also made hyperparameter sweeps over network architectures easy. I tested many architectures and chose the following, which increased accuracy significantly **(77%)**.

This new model is a five-layer network, still using `Dropout`, `ReLU`, and `Adam`, and adding L2 regularization and batch normalization at each layer. 
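
The two additions can be sketched in a few lines of NumPy. This is a simplified illustration rather than the project's implementation: the batch normalization shown here is the training-time version only (no running statistics for inference), and the regularization strength `lam` is an arbitrary example value.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def l2_penalty(weights, lam):
    """Sum of squared weights scaled by lam; this term is added to the loss."""
    return lam * sum(np.sum(w ** 2) for w in weights)


rng = np.random.default_rng(0)

# Activations with a large offset and spread, as a layer might produce.
activations = rng.normal(5.0, 3.0, size=(64, 8))
normalized = batch_norm(activations, gamma=np.ones(8), beta=np.zeros(8))

# One 2x2 weight matrix filled with 0.5: four weights, each squared to 0.25.
weights = [np.full((2, 2), 0.5)]
penalty = l2_penalty(weights, lam=0.01)

print(normalized.mean(axis=0).round(6), normalized.std(axis=0).round(3))
print(penalty)
```

Batch normalization keeps each layer's inputs on a comparable scale as the network gets deeper, while the L2 term discourages large weights and so works alongside dropout against overfitting.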

## **The Model (v2) - looking back**

The v1 model is significantly better than v0 and the other machine learning methods, and it provides a solid base for our use case. However, it processes learner data as discrete data points, not as sequences for each user, so it loses a lot of latent information from a learner's activity in previous weeks of the course.

After reading [this work on predictive maintenance with LSTMs](https://azure.microsoft.com/en-us/blog/deep-learning-for-predictive-maintenance/), I realized the use case is very similar to predicting when learners are likely to drop out of a course. Following a similar pattern, I developed an LSTM-based network architecture...
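
Sequence models such as LSTMs typically expect input shaped as (samples, timesteps, features), so the per-week rows first have to be grouped into per-learner windows. The sketch below shows one way to do that with pandas and NumPy. The helper `make_sequences`, the window length of 3 weeks, and the toy data are assumptions for illustration, and the code assumes each learner has one row per consecutive week (inactive weeks zero-filled).

```python
import numpy as np
import pandas as pd


def make_sequences(weekly, feature_cols, seq_len):
    """Build fixed-length windows of consecutive weeks for each learner."""
    windows, owners = [], []
    ordered = weekly.sort_values("course_week")
    for user_id, rows in ordered.groupby("user_id"):
        values = rows[feature_cols].to_numpy(dtype=float)
        # Slide a window of seq_len weeks over this learner's history.
        for end in range(seq_len, len(values) + 1):
            windows.append(values[end - seq_len:end])
            owners.append(user_id)
    if not windows:
        return np.empty((0, seq_len, len(feature_cols))), owners
    return np.stack(windows), owners


weekly = pd.DataFrame({
    "user_id":                [1, 1, 1, 1, 2, 2, 2],
    "course_week":            [0, 1, 2, 3, 0, 1, 2],
    "num_video_plays":        [5, 3, 0, 0, 4, 0, 2],
    "num_problems_attempted": [2, 1, 0, 0, 1, 0, 0],
})

feature_cols = ["num_video_plays", "num_problems_attempted"]
X, owners = make_sequences(weekly, feature_cols, seq_len=3)
print(X.shape, owners)
```

Learners with fewer than `seq_len` weeks of history produce no windows in this version; padding shorter histories would be an alternative design choice.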

# ...