Skip to content

pmitros/x-analytics-scripts

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
cmd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

x-analytics-scripts

Assorted analytics scripts for edX tracking logs. These are my personal scripts, and may not be useful to others.

In order to use these scripts, create a YAML file in your home directory called ~/.xanalytics. This file should define several directories:

  • public-data-dir -- Obsolete: Non-edX datafiles. Geocoding. CIA World Factbook. Etc. This is not in the package itself.
  • edx-data-dir -- edX tracking logs and other read-only source data with PII
  • scratch-dir -- Location for intermediate data.

These directories have subdirectories. Each subdirectory will contain either subdirectories, or gzip-compressed files.

These scripts are based on processing log files with generators. This is a nice design pattern for several reasons:

  • Fast. Most things are never written to disk.
  • Easy-to-read.
  • If you have a bug, scripts fail early.
  • Things run lazily. This makes it easy to skip to whichever step
    where there is unprocessed data.

See: http://www.slideshare.net/dabeaz/python-generator-hacking

I use pull (rather than push) generators, mostly for readability. It's possible to go between the two with queues.

The scripts can process data for two courses in about 3 minutes on a quad core i7 machine.

Useful things

This is likely very out-of-date by the time you read this.

xanalytics/settings.py -- Allows easy access to directories (as listed above) and similar. We're moving to pyfilesystem, but many functions still take directories. pyfilesystem is a higher-level abstraction, and is easier to work with (less os.join and similar). In the future, it can also transparently go to/from S3.

xanalytics/cia.py -- CIA world factbook access. Neat statistics about student countries.

xanalytics/multiprocess.py -- Allows you to split computation among multiple cores. There's a bit of overhead for serializing data, so it's not always a win, but it is almost always a win. split and join are the functions to look at. The cool thing is this is completely transparent. Most of the code doesn't have to be aware of these.

xanalytics/streaming.py -- Most of the meat of the code. Various processing operations over tracking logs. read_data is usually the starting point, followed by something like text_to_json if using source files. In most cases, I suggest using BSON files. It's a 4x performance gain. encode_to_bson is helpful here. We do need to move read_bson_file to correctly work with the filesystem (rather than directory) approach.

xanalytics/desensitize.py -- Make it easier to work with PII without unintentionally violating student privacy.

About

edX: Assorted analytics scripts for edX

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published