Instructions and license for Detexify's sample data
Latest commit c2db958 Jan 5, 2016 @kirel Update README.md
README.md Update README.md Jan 5, 2016
example.json Create example.json Jul 10, 2014
odbl-10.txt Rename odbl-10.md to odbl-10.txt Jul 10, 2014

Detexify data

This repository contains instructions on how to obtain and use the training data for http://detexify.kirelabs.org. It does not contain the data.

Obtaining the data

Detexify's training data is stored in a CouchDB database hosted on Cloudant.

The database lives at: https://kirelabs.cloudant.com/detexify

NOTE: Please send me an email (danishkirel@gmail.com) to get auth credentials. The database used to be open, but that caused too many expensive calls against the API.

Replication

This does not work right now because of issue #1.

The best way to obtain the data is to set up your own CouchDB and replicate the database to it. This can be done easily via CouchDB's admin interface. Assuming you have a local CouchDB running, visit

http://127.0.0.1:5984/_utils

If you are familiar with CouchDB's HTTP interface, you can instead use the following request to start replication:

POST /_replicate HTTP/1.1
Content-Type: application/json

{"source":"https://kirelabs.cloudant.com/detexify","target":"detexify"}
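The same request can be sketched in Python using only the standard library. This is a minimal sketch, not a tested recipe: it assumes a local CouchDB at the default address http://127.0.0.1:5984, the function name build_replication_request is made up for illustration, and it only builds the request without sending it. The credentials mentioned above are not included, and depending on your CouchDB setup the target database may need to exist first and the local server may require admin authentication.

```python
import json
import urllib.request

SOURCE_DB = "https://kirelabs.cloudant.com/detexify"  # source database from this README
LOCAL_COUCH = "http://127.0.0.1:5984"                 # assumption: default local CouchDB address


def build_replication_request(source=SOURCE_DB, target="detexify", couch=LOCAL_COUCH):
    """Build (but do not send) the POST /_replicate request shown above.

    Hypothetical helper for illustration; credentials for the source
    database are deliberately left out.
    """
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        url=couch + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_replication_request()
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually start replication, send the request:
    # urllib.request.urlopen(req)
```

The network call is left commented out so that running the script does not generate traffic against the hosted database by accident.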

Querying data & data format

You can query data via the view by_id. Have a look at https://kirelabs.cloudant.com/detexify/_design/tools/_view/by_id?keys=%5B%22amssymb-OT1-_bigstar%22%5D&descending=false&include_docs=true&reduce=false&limit=5 for an example.

Please have a look at the CouchDB documentation on views for more information.
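As an illustration, the example URL above can be assembled programmatically. The sketch below uses only the Python standard library; the helper name by_id_url is made up for this example, and the parameter set simply mirrors the example URL rather than documenting everything the view accepts. Note that CouchDB expects the keys parameter to be a JSON-encoded array.

```python
import json
from urllib.parse import urlencode

# View URL taken from the example above.
VIEW_URL = "https://kirelabs.cloudant.com/detexify/_design/tools/_view/by_id"


def by_id_url(keys, limit=5):
    """Return a by_id query URL for the given list of symbol ids.

    Hypothetical helper; the parameters mirror the example URL in this README.
    """
    params = {
        "keys": json.dumps(keys, separators=(",", ":")),  # JSON array, URL-encoded below
        "descending": "false",
        "include_docs": "true",
        "reduce": "false",
        "limit": str(limit),
    }
    return VIEW_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(by_id_url(["amssymb-OT1-_bigstar"]))
```

Remember that requests against the hosted database need the credentials mentioned above; against your own replicated copy, replace the host part of VIEW_URL accordingly.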

A prettyprinted example document can be viewed in example.json. It was obtained via the request https://kirelabs.cloudant.com/detexify/_design/tools/_view/by_id?skip=4242&limit=1&reduce=false&include_docs=true.

The JSON objects contain two relevant keys. The first, key, identifies the LaTeX command this sample is for (see https://github.com/kirel/detexify/blob/master/lib/latex/symbol.rb#L22 for details). The second, data, contains an array of ink strokes. Each ink stroke is an array of objects { x: x-coordinate, y: y-coordinate, t: timestamp }. This data is not preprocessed in any way.
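To make the format concrete, here is a small Python sketch that walks one sample and computes a few summary values. It assumes key and data sit on the same object, as described above; check example.json for the exact nesting. The sample document in the code is made up for illustration, not taken from the database, and the unit of t is not specified here, so the duration is reported in raw timestamp units.

```python
def summarize_sample(doc):
    """Summarize one sample: symbol id, stroke and point counts, bounding box, duration.

    Assumes the document has at least one point; the field layout follows
    the description in this README.
    """
    strokes = doc["data"]                                 # list of strokes
    points = [p for stroke in strokes for p in stroke]    # flatten to one list of points
    xs = [p["x"] for p in points]
    ys = [p["y"] for p in points]
    ts = [p["t"] for p in points]
    return {
        "symbol": doc["key"],
        "strokes": len(strokes),
        "points": len(points),
        "bbox": (min(xs), min(ys), max(xs), max(ys)),     # (min x, min y, max x, max y)
        "duration": max(ts) - min(ts),                    # in raw timestamp units
    }


# Made-up sample with two strokes, for illustration only.
sample = {
    "key": "amssymb-OT1-_bigstar",
    "data": [
        [{"x": 10, "y": 20, "t": 1000}, {"x": 15, "y": 25, "t": 1016}],
        [{"x": 12, "y": 30, "t": 1400}],
    ],
}

if __name__ == "__main__":
    print(summarize_sample(sample))
```

Since the data is not preprocessed, steps such as normalizing coordinates to the bounding box or resampling strokes are left to the consumer.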

License

The database is licensed under the ODbL. This license is also used by OpenStreetMap. A human-readable form of the license can be found at http://opendatacommons.org/licenses/odbl/summary/. I am no expert in database licensing, but this looks reasonable to me. Feel free to contact me if you have questions or concerns.

Note

Please be kind. I pay for database traffic. Replicate the data once and then use your own database.