This is a basic devstack setup for hacking on edX analytics.
It will set up:
- edx-analytics-pipeline
- edx-analytics-data-api
- edx-analytics-data-api-client
- edx-analytics-dashboard
Make sure you have Vagrant and Ansible available on your system.
Then clone this repository, cd into it, and run `vagrant up`. Magic!
(Note it may take quite a while to provision the devstack at first - about 23 minutes on a MacBook Pro at the time of writing.)
If you want to update the system, try fixing bugs, or resume a failed `vagrant up`, run `vagrant provision` to do so. You can re-provision at any time. The provisioning process does not interfere with the cloned/shared `edx-analytics-x` git repositories in any way, so no changes will be lost.
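For reference, the whole first-run flow looks like this (a sketch; substitute the real URL of this repository for the placeholder):

```bash
# First-time setup (the repository URL is a placeholder):
git clone <this-repository-url> analytics-devstack
cd analytics-devstack
vagrant up          # provisions everything on the first run (~23 minutes)

# Later: update the system, or resume a failed provision
vagrant provision
```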
If you want to use this analytics devstack in concert with a local LMS devstack, you'll need to make a few changes on the LMS devstack:
- As the `vagrant` user, use sudo to edit `/etc/mysql/my.cnf` and change `bind-address` from `127.0.0.1` to `0.0.0.0`.
- Then run these commands (a scripted version of both steps appears after this list):

  ```
  service mysql restart
  mysql -u root -e "GRANT SELECT ON *.* TO 'analytics'@'192.168.33.11' IDENTIFIED BY 'edx';"
  ```
- Next, as the `edxapp` user, edit `lms/envs/private.py` and add:

  ```
  OAUTH_OIDC_ISSUER = 'http://192.168.33.10:8000/oauth2'
  ANALYTICS_DASHBOARD_URL = 'http://192.168.33.11:9999'
  ```
- Finally, while the LMS is running, go to /admin/oauth2/client/ and add a new client with URL `http://192.168.33.11:9999/` and redirect URI `http://192.168.33.11:9999/complete/edx-oidc/`. Set the client type to confidential. Save, and leave the browser tab open, as you'll need the client ID and secret when setting up the Insights dashboard (see "Usage" below).
- If you want an easy way to transfer log files from the LMS into the analytics devstack's HDFS store, copy the included `upload-tracking-logs.sh` script to the LMS devstack and run it as the `vagrant` user.
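If you'd rather script the two MySQL steps than make them by hand, a minimal sketch (assuming the stock Ubuntu MySQL config layout used by the LMS devstack):

```bash
# Run on the LMS devstack as the vagrant user.
# Make MySQL listen on all interfaces instead of just 127.0.0.1:
sudo sed -i 's/^bind-address.*/bind-address = 0.0.0.0/' /etc/mysql/my.cnf
sudo service mysql restart

# Grant the analytics devstack read access to the LMS databases:
mysql -u root -e "GRANT SELECT ON *.* TO 'analytics'@'192.168.33.11' IDENTIFIED BY 'edx';"
```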
To work on the devstack, run `vagrant ssh` and then `sudo su analytics` to run commands.
You'll find all the apps in the `/home/analytics/apps` folder. Each app has its own virtualenv in `/home/analytics/venvs/xxxxxx`, which will be automagically activated for you when you `cd` into the app's folder, e.g. `cd ~/apps/pipeline`.
Ports are not forwarded, so you cannot access the apps via localhost:9999 (too prone to conflicts with, e.g., the edX platform devstack). Instead, the analytics devstack VM has the IP `192.168.33.11`, so just connect to that. For example, on your host computer, go to http://192.168.33.11:9999/
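To confirm the VM is reachable from your host, you can try something like:

```bash
# Run from the host machine; expect a reply from the VM and,
# once a server is running on port 9999, an HTTP status line:
ping -c 1 192.168.33.11
curl -sI http://192.168.33.11:9999/ | head -n 1
```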
Run this command as the `analytics` user:

```
cd ~/apps/data-api/ && ./manage.py runserver 0.0.0.0:9001
```

Access it at http://192.168.33.11:9001/
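As a quick smoke test from the host (the `/docs/` path is the browsable API documentation used later in this guide):

```bash
# Expect an HTTP 200 once the data-api server is up:
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.33.11:9001/docs/
```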
- First, as a one-time setup step, edit `~/apps/dashboard/analytics_dashboard/settings/local.py` and set the values of `SOCIAL_AUTH_EDX_OIDC_KEY` and `SOCIAL_AUTH_EDX_OIDC_SECRET` to the values shown by the LMS in step 4 of "LMS Setup". Make sure the API server is running.
Then, as the `analytics` user:

```
cd ~/apps/dashboard/
make develop migrate
./manage.py runserver 0.0.0.0:9999
```

Access it at http://192.168.33.11:9999/
Notes:
- The data-api and the LMS must both be running for the dashboard to be fully functional.
- If you get timing errors during the OAuth login, run `sudo ntpdate -s time.nist.gov` on both devstacks to fix their clocks.
To run the test suites of each of the four analytics apps:
```
vagrant up && vagrant ssh
sudo su analytics
cd ~/apps/pipeline/; make test
cd ~/apps/data-api/; make validate
cd ~/apps/data-api-client/; make test
cd ~/apps/dashboard/; make requirements.js; ./node_modules/.bin/r.js -o build.js; make validate
```
Let's try using the analytics pipeline to process data.
As the `analytics` user, we need to load a log file to test with. Run this command, which will load `tracking.log-20150101-123456789`:

```
hdfs dfs -put ~/log_files/dummy/ /test_input
```
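You can verify the upload with a quick listing:

```bash
# Should show tracking.log-20150101-123456789 under /test_input:
hdfs dfs -ls /test_input
```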
Next, run these two commands to process this log file and store the results in MySQL:
```
cd ~/apps/pipeline
launch-task AnswerDistributionToMySQLTaskWorkflow --local-scheduler --remote-log-level DEBUG --include *tracking.log* --src hdfs://localhost:9000/test_input --dest hdfs://localhost:9000/test_answer_dist --name test_task --n-reduce-tasks 3
```
Now, to check if the task worked, go to http://192.168.33.11:50070/explorer.html#/test_answer_dist . You should see two folders listed.
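The same check can be done from the shell:

```bash
# Expect two entries under the task's output directory:
hdfs dfs -ls hdfs://localhost:9000/test_answer_dist
```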
Now run:

```
mysql -u root analytics --execute="SELECT COUNT(*) FROM answer_distribution;"
```

If the pipeline task ran successfully, this should show a count of `2`.
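To eyeball the rows themselves rather than just the count, a quick sketch:

```bash
# \G prints each row vertically, one column per line, for readability:
mysql -u root analytics --execute="SELECT * FROM answer_distribution LIMIT 5\G"
```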
Then, run the API server (see "Usage" above), open your browser, and go to http://192.168.33.11:9001/docs/#!/api/Problem_Response_Answer_Distribution . Enter `i4x://edX/DemoX-S/problem/a58470ee54cc49ecb2bb7c1b1c0ab43a` as the `problem_id` (this is based on the dummy log file in `/home/analytics/log_files/dummy`). Click "Try it out!" and ensure a result is displayed.
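The docs page is a thin wrapper around the REST API, so the same query can be issued with curl; a sketch, assuming the v0 route exposed by that docs page and a valid API auth token for your install:

```bash
# <api-token> is a placeholder for a token valid on your data-api instance:
curl -s -H "Authorization: Token <api-token>" \
  "http://192.168.33.11:9001/api/v0/problems/i4x://edX/DemoX-S/problem/a58470ee54cc49ecb2bb7c1b1c0ab43a/answer_distribution/"
```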
Once that's working, we can try running the pipeline to process data from the LMS.
- Make sure the LMS devstack is running on the same host and was configured as described earlier in "LMS Setup" (so the pipeline can connect to its MySQL DB).
- On the LMS devstack, run `upload-tracking-logs.sh` as the `vagrant` user (see "LMS Setup").
- Go to http://192.168.33.11:50070/explorer.html#/edx-analytics-pipeline/input and verify that the log files are present in HDFS.
- On the analytics devstack, as the `analytics` user, run these commands:

  ```
  cd ~/apps/pipeline
  launch-task ImportAllDatabaseTablesTask --local-scheduler
  ```
If that completed successfully, you should be able to see the data stored in Hive, using these commands:
```
$ hive
hive> show tables;
OK
auth_user
auth_userprofile
student_courseenrollment
Time taken: 0.952 seconds, Fetched: 3 row(s)
hive> SELECT * FROM auth_user;
```
You should now see a list of all the users that existed on the LMS system at the time the `ImportAllDatabaseTablesTask` ran:

```
1  honor     2015-06-29 19:50:00  2014-11-19 04:06:46  true  false  false  honor@example.com     2015-07-05
2  audit     2014-11-19 04:06:49  2014-11-19 04:06:49  true  false  false  audit@example.com     2015-07-05
3  verified  2015-06-25 19:06:35  2014-11-19 04:06:52  true  false  false  verified@example.com  2015-07-05
4  staff     2015-07-03 19:17:16  2014-11-19 04:06:54  true  true   true   staff@example.com     2015-07-05
```
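For one-off checks you don't need the interactive shell; Hive's `-e` flag runs a single query and exits:

```bash
# Non-interactive equivalent of a query from the session above:
hive -e "SELECT COUNT(*) FROM auth_user;"
```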
- Now continue to run more parts of the pipeline, processing LMS data and storing it in the analytics database. Use these commands:

  ```
  launch-task CourseActivityWeeklyTask --local-scheduler --n-reduce-tasks 3 --weeks 8
  launch-task ImportEnrollmentsIntoMysql --local-scheduler --n-reduce-tasks 3
  ```
If those tasks completed without error, fire up the `data-api` server and the Insights Dashboard server (see "Usage" above), then go to the dashboard at http://192.168.33.11:9999 and explore the reports that are now available.