
A study of postsecondary graduate employability using topic modeling.

Project Overview

This project studies how the core academic skills taught at the postsecondary level overlap with the skills expected of entry-level workers. In short, it asks how well universities prepare students for the workforce: to what degree do they promote graduate employability?

This repository contains an open source software tool that performs that analysis. It relies on topic modeling, a technique from machine learning and natural language processing, to infer concepts from large datasets of job postings and course descriptions.


Import data
docker-compose up -d elasticsearch
./elasticsearch/bin/import-data small  # {small|medium|large}
Start the server
docker-compose up -d web

You should be able to access the web visualization server at localhost:9000.
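The container may take a little while to start accepting connections after docker-compose returns. As a minimal sketch, a small retry helper can poll the port before you open a browser. The helper name, attempt count, and delay are illustrative choices rather than part of this repository, and the sketch assumes the server answers a plain GET on its root path with a success status.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while ! "$@"; do
    # Give up once the command has failed ATTEMPTS times.
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Example (assumed values): poll the web server up to 30 times, 2 seconds apart.
# retry 30 2 curl -fsS -o /dev/null http://localhost:9000
```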

Data Ingestion

If you would like to run the whole ingestion and analysis process, there are a few more steps.

API token

Register for an account at Data World and export your API token.

Start up background services
docker-compose up -d elasticsearch kibana postgres
Run data pipelines
sbt ingest/run
sbt preprocess/run

Executing LDA

With default parameters
sbt analysis/run

Configuring LDA

You can modify the behavior of LDA (latent Dirichlet allocation) through environment variables. Several predefined configurations are provided.

# Source one of these before running analysis/run.
source ./analysis/config/small
source ./analysis/config/medium
source ./analysis/config/large
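Sourcing a configuration changes the current shell, so settings from one run can carry over into the next. One possible way to avoid that, sketched here under assumptions, is to source the file inside a subshell and run the command there. The helper name run_with_config is hypothetical, and the sketch assumes the configuration files are plain shell fragments that assign variables; set -a is used so the assignments are exported even if the files do not export them explicitly.

```shell
# Run a command with a configuration file sourced in a subshell, so the
# settings do not leak into the calling shell.
# Usage: run_with_config CONFIG_FILE COMMAND [ARGS...]
run_with_config() {
  config_file="$1"
  shift
  (
    set -a              # export every variable the config file assigns
    . "$config_file"
    set +a
    "$@"
  )
}

# Example: run the analysis with the "small" configuration.
# run_with_config ./analysis/config/small sbt analysis/run
```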

Exporting data

./elasticsearch/bin/export-data NEW_SNAPSHOT_ID

This will do the following:

  • create a local snapshot repository in your Elasticsearch cluster
    • this lives on your local filesystem: ./data/elasticsearch-snapshots/local/
  • create a new snapshot NEW_SNAPSHOT_ID in the local repository
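The two steps above correspond to Elasticsearch's snapshot REST API: registering a shared-filesystem repository and then creating a snapshot in it. The following is a rough sketch of equivalent calls, not the contents of export-data. It assumes Elasticsearch is reachable at http://localhost:9200, that the repository is named local, and that the location you pass is a path visible to the Elasticsearch process and listed under its path.repo setting. The function names are hypothetical.

```shell
# Assumed default; override ES_URL if your cluster listens elsewhere.
ES_URL="${ES_URL:-http://localhost:9200}"

# JSON body registering a shared-filesystem ("fs") snapshot repository.
repo_body() {
  printf '{"type":"fs","settings":{"location":"%s"}}' "$1"
}

# URL that creates a snapshot and blocks until it has completed.
snapshot_url() {
  printf '%s/_snapshot/%s/%s?wait_for_completion=true' "$ES_URL" "$1" "$2"
}

# Register the repository, then create the snapshot.
# Usage: export_snapshot REPOSITORY LOCATION SNAPSHOT_ID
export_snapshot() {
  curl -fsS -X PUT "$ES_URL/_snapshot/$1" \
    -H 'Content-Type: application/json' \
    -d "$(repo_body "$2")" &&
  curl -fsS -X PUT "$(snapshot_url "$1" "$3")"
}

# Example (the location and snapshot name are placeholders):
# export_snapshot local /path/visible/to/elasticsearch my-snapshot-1
```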

Project Modules


core

Core components, models, and glue code.

"net.rouly" % "employability-core" % "x.x.x"


elasticsearch

Elasticsearch read/write services. Interaction is defined using Reactive Streams.

"net.rouly" % "employability-elasticsearch" % "x.x.x"


postgres

Postgres read/write services. Interaction is defined using Reactive Streams.

"net.rouly" % "employability-postgres" % "x.x.x"


ingest

Entry point application to ingest raw data into Elasticsearch.

Raw data is accepted from the following data providers:

  • add data set definitions under resources/datasets/

"net.rouly" % "employability-ingest" % "x.x.x"


preprocess

Entry point application to preprocess and clean ingested data. Cleaned and prepared data is exported to Postgres.

"net.rouly" % "employability-preprocess" % "x.x.x"


analysis

Entry point application to read processed data from Postgres and execute the primary topic modeling steps. Topics are output to Elasticsearch.

"net.rouly" % "employability-analysis" % "x.x.x"


web

User-facing web application to explore the generated topics and render various statistics about them.

"net.rouly" % "employability-web" % "x.x.x"