VizML: Training Data, Feature Extraction, and Model Training

This repository provides access to the Plotly dataset-visualization pairs, feature extraction scripts, and model training scripts used in the VizML paper.

Data Description

We provide subsets of the Plotly corpus with 10K and 100K pairs, the full corpus with 1,066,443 pairs (205G), and features extracted from an aggressively deduplicated set of 119,815 pairs (19G). More information about the corpus schema, the extracted features, and the design choices is provided in the paper.


This repository uses Python 3.7.3 and depends on the packages listed in requirements.txt. Create the virtual environment with virtualenv -p python3 venv, enter it with source venv/bin/activate, and install the dependencies with pip install -r requirements.txt.

How do I use this repository?

Accessing Data

To download and unzip the Plotly dataset-visualization pairs or features, run ./ Comment lines to specify which subsets or features you want to use. Then create a symlink to access the data: ln -s data/plotly_plots_with_full_data_with_all_fields_and_header_{ 1k, 100k, full }.tsv data/plot_data.tsv.
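
The symlink step can also be scripted, which makes switching between subsets less error-prone. The following is a minimal sketch, assuming the subset files follow the naming pattern in the command above. The helper name link_subset and its arguments are hypothetical and not part of this repository.

```python
import os

# File-name pattern taken from the ln -s command above; {subset} is filled in.
PATTERN = "plotly_plots_with_full_data_with_all_fields_and_header_{subset}.tsv"


def link_subset(data_dir, subset, link_name="plot_data.tsv"):
    """Point data_dir/link_name at the chosen subset file (hypothetical helper)."""
    target = PATTERN.format(subset=subset)
    if not os.path.exists(os.path.join(data_dir, target)):
        raise FileNotFoundError("{} not found in {}".format(target, data_dir))
    link_path = os.path.join(data_dir, link_name)
    # Remove a stale link so switching subsets is a single call.
    if os.path.islink(link_path):
        os.remove(link_path)
    # The target is relative to the link's own directory, so the link
    # stays valid if the repository directory is moved.
    os.symlink(target, link_path)
    return link_path
```

For example, link_subset("data", "100k") would make data/plot_data.tsv resolve to the 100K subset, provided that file has already been downloaded.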

Preparing Data

Within the data_cleaning directory:

  • To remove charts without all data: python
  • To remove duplicate charts: python

Extracting and Characterizing Features

Within the feature_extraction directory, run python Then use notebooks/Descriptive Statistics.ipynb to characterize features (e.g. the distribution of the number of columns per dataset).
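
As an illustration of this kind of characterization, the sketch below counts how many datasets have a given number of columns. It assumes each dataset is available in memory as a list of columns, which is a simplification for illustration and not the corpus schema. The function name is hypothetical.

```python
from collections import Counter


def column_count_distribution(datasets):
    """Map number of columns -> number of datasets with that many columns.

    `datasets` is an iterable of datasets, each given as a list of columns
    (a hypothetical in-memory form, not the corpus schema).
    """
    return Counter(len(columns) for columns in datasets)


# Three toy datasets with 2, 2, and 3 columns respectively.
toy = [
    [[1, 2, 3], [4, 5, 6]],
    [["a", "b"], [0.1, 0.2]],
    [[1], [2], [3]],
]
dist = column_count_distribution(toy)
for n_cols in sorted(dist):
    print(n_cols, dist[n_cols])
```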

Baseline Model Training

Use notebooks/Plotly Performance.ipynb to train the random forest, K-nearest neighbors, naive Bayes, and logistic regression baseline models. Use notebooks/Model Feature Importances.ipynb to extract feature importances from the random forest baseline model.
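
To make the idea of a baseline concrete, here is a minimal pure-Python K-nearest neighbors sketch that predicts a label from a feature vector and scores it on a hold-out set. It is not the notebook's implementation. The feature values, labels, and function names are hypothetical toy choices.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: squared_distance(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of held-out examples whose predicted label matches the truth."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)


# Toy features (hypothetical): [number of columns, fraction of numeric columns].
train_X = [[2, 1.0], [2, 0.9], [3, 1.0], [2, 0.5], [3, 0.4], [2, 0.3]]
train_y = ["scatter", "scatter", "scatter", "bar", "bar", "bar"]
test_X = [[2, 0.95], [3, 0.35]]
test_y = ["scatter", "bar"]
print(accuracy(train_X, train_y, test_X, test_y))
```

In practice the features would first be scaled, since unscaled features with large ranges dominate the distance.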

Neural Network Training

Within the neural_network directory, run python [LOAD|TRAIN|EVAL] to load features, train models, then evaluate a particular model.


Use notebooks/Benchmarking.ipynb to evaluate serialized models against the crowdsourced consensus ground truth.
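
One way to compare a prediction against a bootstrapped crowdsourced consensus is sketched below. It assumes the consensus for a dataset is the most frequent crowd vote and that the votes are resampled with replacement. The notebook's actual procedure may differ, and all names and vote counts here are hypothetical.

```python
import random
from collections import Counter


def consensus(votes):
    """Most frequent label among the votes for one dataset."""
    return Counter(votes).most_common(1)[0][0]


def bootstrap_agreement(votes, prediction, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples whose consensus equals the prediction."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(n_resamples):
        # Resample the votes with replacement, keeping the sample size fixed.
        resample = [rng.choice(votes) for _ in votes]
        hits += consensus(resample) == prediction
    return hits / n_resamples


# Hypothetical crowd votes for one benchmarking dataset.
votes = ["bar"] * 8 + ["line"] * 2
print(consensus(votes))
print(bootstrap_agreement(votes, "bar"))
```

The agreement fraction indicates how stable the consensus is: a value near 1 means the prediction matches the consensus under almost every resample of the votes.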

What's in this repository

Shell script to download and unzip dataset-visualization pairs and features from Amazon S3 storage
requirements.txt: Python dependencies
data/: Placeholder directory for raw data
features/: Placeholder directory for extracted features
results/: Placeholder directory for intermediate results and figures
models/: Placeholder directory for trained models
        └─── Functions to aggregate single-column features
        └─── Helper functions used in
        └─── Functions to detect and mark dates
        └─── Helper functions used in all feature extraction scripts
        └─── Functions to extract single-column features
        └─── Functions to transform single-column features
        └─── Functions used to detect data types
        └─── Functions to extract design choices of visualizations
        └─── Functions to extract design choices of encodings
    └─── Top-level entry point to extract features and outcomes
    └─── Helpers used in top-level extraction function
    └─── Helper functions when training baseline models
    └─── Helper functions when processing data
    └─── Misc helper functions
    └─── Top-level entry point to load features and train neural network
    └─── Functions to evaluate trained neural network
    └─── Class definitions for neural network
    └─── Script to evaluate best network against benchmarking ground truth
    └─── Script to evaluate best network for Plotly test set
    └─── Script to prepare training, validation, and testing splits
    └─── Helper functions for model training
    └─── Script to train network
    └─── Helper functions
    └───Descriptive Statistics.ipynb: Notebook to generate visualizations of number of charts per user, number of rows per dataset, and number of columns per dataset
    └───Plotly Performance.ipynb: Notebook to train baseline models and assess performance on a hold-out set from the Plotly corpus
    └───Model Feature Importances.ipynb: Notebook to extract feature importances from trained models
    └───Benchmarking.ipynb: Notebook to generate predictions of trained models on benchmarking datasets, bootstrap crowdsourced consensus, and compare predictions
preprocessing/: Scripts to preprocess features before ML modeling
    └─── Helper functions to deduplicate charts
    └─── Helper function to impute missing values
    └─── Helper functions to prepare features for learning
docs/: Landing page and miscellaneous material for documentation


Plotly dataset-visualization pairs, feature extraction scripts, and model training code for VizML (CHI 2019)


