Assorted analytics scripts for edX tracking logs. These are my personal scripts, and may not be useful to others.

In order to use these scripts, create a YAML file in your home directory called ~/.xanalytics. This file should define several directories:

  • public-data-dir -- Obsolete. Non-edX data files (geocoding, CIA World Factbook, etc.). This data is not included in the package itself.
  • edx-data-dir -- edX tracking logs and other read-only source data with PII
  • scratch-dir -- Location for intermediate data.
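
As a rough illustration, the text embedded below shows what a ~/.xanalytics file might look like. Only the three key names come from the list above. The paths are placeholders, and the flat key: value layout is an assumption. The small parser is a stand-in for a real YAML library and only handles this flat form.

```python
# Hypothetical contents of ~/.xanalytics; the paths are placeholders.
EXAMPLE_CONFIG = """\
public-data-dir: /data/public
edx-data-dir: /data/edx
scratch-dir: /data/scratch
"""


def parse_flat_config(text):
    """Parse flat 'key: value' lines (a small subset of YAML)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


settings = parse_flat_config(EXAMPLE_CONFIG)
print(settings["scratch-dir"])
```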

These directories have subdirectories. Each subdirectory will contain either subdirectories, or gzip-compressed files.
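
A layout like this can be walked lazily with the standard library. The sketch below is not the package's own code. The function names gzip_files and read_lines are made up for illustration, and it assumes the compressed files end in .gz and contain UTF-8 text.

```python
import gzip
import os


def gzip_files(root):
    """Yield paths of all .gz files under root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in sorted order
        for name in sorted(filenames):
            if name.endswith(".gz"):
                yield os.path.join(dirpath, name)


def read_lines(root):
    """Lazily yield text lines from every gzip file under root."""
    for path in gzip_files(root):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Because both functions are generators, no file is opened until a consumer asks for its first line.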

These scripts are based on processing log files with generators. This is a nice design pattern for several reasons:

  • Fast. Most things are never written to disk.
  • Easy-to-read.
  • If you have a bug, scripts fail early.
  • Things run lazily. This makes it easy to skip to the first
    step that has unprocessed data.
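
The pattern described above can be sketched as a chain of small generators. This is an illustration only, not the package's API. The function names and the sample events are invented, and the event fields (username, event_type) are assumed for the example.

```python
import json


def parse_events(lines):
    """Turn JSON text lines into dicts, skipping blank lines."""
    for line in lines:
        if line.strip():
            # json.loads raises on a malformed line, so bad input
            # stops the pipeline as soon as it is reached.
            yield json.loads(line)


def only_event_type(events, event_type):
    """Keep only events of the given type."""
    for event in events:
        if event.get("event_type") == event_type:
            yield event


def usernames(events):
    """Project each event down to its username."""
    for event in events:
        yield event["username"]


# Made-up sample input standing in for tracking log lines.
sample_lines = [
    '{"username": "alice", "event_type": "play_video"}',
    '',
    '{"username": "bob", "event_type": "problem_check"}',
    '{"username": "carol", "event_type": "play_video"}',
]

pipeline = usernames(only_event_type(parse_events(sample_lines), "play_video"))
print(list(pipeline))  # prints ['alice', 'carol']
```

Nothing runs until list() pulls from the last stage, and each line passes through every stage before the next line is read, so no intermediate list is ever built.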


I use pull (rather than push) generators, mostly for readability. It's possible to go between the two with queues.
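
One way to bridge the two styles is to run the push-style producer in a thread and hand its items over through a queue. The sketch below assumes a producer that takes a callback. The names push_source and pull_from_push are hypothetical and do not come from the package.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def push_source(callback):
    """A hypothetical push-style producer: calls callback once per item."""
    for item in ["a", "b", "c"]:
        callback(item)


def pull_from_push(source, maxsize=16):
    """Expose a push-style source as a pull generator via a queue."""
    items = queue.Queue(maxsize=maxsize)

    def run():
        try:
            source(items.put)  # each pushed item lands in the queue
        finally:
            items.put(_DONE)  # always signal completion

    threading.Thread(target=run, daemon=True).start()
    while True:
        item = items.get()
        if item is _DONE:
            return
        yield item


print(list(pull_from_push(push_source)))  # prints ['a', 'b', 'c']
```

The bounded queue keeps the producer from running far ahead of the consumer. In this sketch an exception inside the producer ends the stream early rather than being re-raised in the consumer.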

The scripts can process data for two courses in about 3 minutes on a quad-core i7 machine.

Useful things

This is likely very out-of-date by the time you read this.

xanalytics/ -- Allows easy access to directories (as listed above) and similar. We're moving to pyfilesystem, but many functions still take directories. pyfilesystem is a higher-level abstraction and is easier to work with (fewer os.path.join calls and similar). In the future, it can also transparently go to/from S3.

xanalytics/ -- CIA world factbook access. Neat statistics about student countries.

xanalytics/ -- Allows you to split computation among multiple cores. There's a bit of overhead for serializing data, so it's not guaranteed to be a win, but it almost always is. split and join are the functions to look at. The cool thing is that this is completely transparent. Most of the code doesn't have to be aware of these.
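
The general idea of fanning work out to several cores and merging the results can be sketched with the standard multiprocessing module. This is not the package's split and join. The helper names, the chunk size, and the event-counting task are all assumptions made for the example.

```python
import multiprocessing
from collections import Counter
from itertools import islice


def chunked(items, size):
    """Yield successive lists of up to size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def count_event_types(events):
    """Work done inside each worker: count events by type."""
    return Counter(event["event_type"] for event in events)


def merge_counts(partials):
    """Combine the per-chunk counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_count(events, workers=2, chunk_size=1000):
    """Count event types across worker processes.

    Each chunk is pickled on its way to a worker, and each result
    is pickled on the way back; that is the serialization overhead.
    """
    with multiprocessing.Pool(workers) as pool:
        partials = pool.imap_unordered(
            count_event_types, chunked(events, chunk_size)
        )
        return merge_counts(partials)


if __name__ == "__main__":
    # Made-up sample events.
    events = [{"event_type": "play_video"}] * 3 + [{"event_type": "problem_check"}]
    print(parallel_count(events, chunk_size=2))
```

Larger chunks mean fewer round trips between processes, which helps when the per-event work is cheap compared with the cost of serializing it.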

xanalytics/ -- Most of the meat of the code. Various processing operations over tracking logs. read_data is usually the starting point, followed by something like text_to_json if using source files. In most cases, I suggest using BSON files, which give a 4x performance gain. encode_to_bson is helpful here. We still need to update read_bson_file to work correctly with the filesystem (rather than directory) approach.

xanalytics/ -- Make it easier to work with PII without unintentionally violating student privacy.

