edX-datascrub is used for scrubbing edX data into a format that is easy to analyze. This repository is forked from HarvardX-Tools.
Output: The final processed data for each class is stored in a CSV file with the following fields:
- seconds to next action
- actor: user
- verb: action
- object_name: in the format of chapter/sequential/vertical/item_name
- result: correct or incorrect if verb is problem_check, empty otherwise
The output rows are partially sorted by time: if you consider only the rows associated with one user, those rows are sorted by time.
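A minimal sketch of this ordering guarantee: rows are only sorted per user, not globally. The `(actor, timestamp)` pairs below are illustrative, not the actual CSV fields.

```python
def is_partially_sorted(rows):
    """True if each actor's timestamps are non-decreasing,
    even when rows of different actors are interleaved."""
    last = {}
    for actor, t in rows:
        if actor in last and t < last[actor]:
            return False
        last[actor] = t
    return True
```

For example, `[("alice", 1), ("bob", 5), ("alice", 2)]` satisfies the guarantee even though the timestamps are not globally sorted.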
- Clone this repository and edX-courseaxis repository.
- Add the following directories to your environment path.
- Make sure all shell scripts and Python scripts inside these 4 directories are executable.
- Decrypt all files from edX and store them in the same directories.
Then, you're ready to go!
Obtaining Course Axis
Every class comes with class information and content packaged in classXXX.xml.tar.gz. Run the
generate_courseaxis script in the directory that contains classXXX.xml.tar.gz:
generate_courseaxis
The script will generate a directory named csv_files containing many files including:
- info.csv collecting course names (e.g. BerkeleyX-CS191x-Spring_2013), start dates, and end dates of all classes
- one course_name_axis.csv for each class
- axis.error logging all errors that occurred while generating course axes. Check the error messages in this file to investigate why the course axis for a particular class was not generated.
info.csv and the course axes will be useful in the next step. Note that if there is an error generating a course axis, or a class's xml.tar.gz lacks a start or end date, that class is excluded from info.csv.
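The columns of info.csv can be read with the standard csv module. The column names and sample rows below are assumptions for illustration; the real layout may differ.

```python
import csv
import io

# Hypothetical info.csv contents -- column names and rows are illustrative.
sample = """course,start_date,end_date
BerkeleyX-CS191x-Spring_2013,2013-01-22,2013-05-10
BerkeleyX-Stat2.1x-Spring_2013,2013-02-01,2013-05-01
"""

def load_info(fileobj):
    """Map course name -> (start_date, end_date)."""
    return {row["course"]: (row["start_date"], row["end_date"])
            for row in csv.DictReader(fileobj)}

info = load_info(io.StringIO(sample))
```

Loading info.csv into a dictionary like this makes it easy to look up the dates needed by the log-processing step.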
Processing Activity Logs
You can choose to process the activity logs of all classes at once, or just the logs of selected classes.
Processing Selected Courses
In the directory that contains the prod-edx* directories (which contain the raw activity logs), run:
processLogData.py course_name1,course_name2 start_date end_date
You can get the start and end dates from info.csv.
The first argument to the script is a comma-separated list of course names; the list can be of any size. Most courses do not have exactly the same start and end dates. However, you can group courses with similar start and end dates together (e.g. those offered in the same semester) and specify start and end dates that cover all of the classes in the list. This makes the overall log processing run faster.
The script will:
- generate a separate log file for each class inside each prod-edx* directory, named after the class;
- combine the separate log files of the same class from the different prod-edx* directories into one log file, stored in the directory in which the script is run;
- update ClassList.csv to keep track of the date ranges for which each course has already been processed.
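The grouping advice above can be sketched as a small helper: given several courses with similar date ranges, compute one window that covers them all, suitable for a single processLogData.py invocation. The `info` dictionary mapping course names to `(start, end)` pairs is an assumption for illustration.

```python
def covering_range(courses, info):
    """Smallest (start, end) window that spans all listed courses.
    ISO-formatted dates compare correctly as plain strings."""
    starts = [info[c][0] for c in courses]
    ends = [info[c][1] for c in courses]
    return min(starts), max(ends)
```

The resulting pair can then be passed as the start_date and end_date arguments to processLogData.py.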
The combined log file course_name.log for each course, along with ClassList.csv, will be generated in the directory in which the script is run.
If you have already processed logs up to date2 and you want to process more logs up to date3, you have to use - as the start date as follows:
processLogData.py courseA - date3
In this case, the script will append the new logs to the existing combined log file.
After you obtain the combined log from processLogData.py, run:
transformOneLog.sh course_name.log path_to_course_axis.csv
transformOneLog.sh takes a combined log file generated by processLogData.py and the corresponding course axis generated in the Obtaining Course Axis section as its inputs. It then transforms the combined log file into a nicely formatted CSV file for the class.
Note that only processLogData.py is incremental (it appends new logs to the ones that have already been processed). transformOneLog.sh is not incremental: it transforms the entire given log file.
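The incremental behaviour implies bookkeeping like the following sketch: each course maps to the date range already processed, and the - start date extends the end of that window instead of reprocessing from scratch. The dictionary format is an assumption; the real ClassList.csv layout may differ.

```python
def extend_processed_range(ranges, course, new_end):
    """ranges: dict mapping course -> (start, end) of dates already
    processed.  Mimics the '-' start-date behaviour by extending the
    end of the processed window for a course."""
    if course not in ranges:
        raise KeyError(f"{course} has not been processed yet")
    start, _ = ranges[course]
    ranges[course] = (start, new_end)
    return ranges
```

Because transformOneLog.sh is not incremental, it must be rerun over the whole combined log after each such extension.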
Processing All Logs
Caution: You can move or rename the directory that contains the course axes and info.csv generated in the previous step, but make sure that all course axes and info.csv remain in the same directory.
Then, in the directory that contains the prod-edx* directories (which contain the raw activity logs), simply run:
The script will call transformOneLog.sh for every course that appears in info.csv. Note that running this script is not as efficient as running the scripts manually, because it does not separate the logs of different courses at the same time, unlike processLogData.py when it is given a list of courses.