overhaul the synonym data structure to take up less space #16

coolbutuseless · 2018-11-25T11:27:18Z

This is a monster overhaul of the main "words" data structure.

Instead of storing the raw words, the data is split into two parts:

1. A sorted list of all unique words
2. One integer vector per set of synonyms, converted from the original character vector, with each integer indexing into the list of all words

By storing integer vectors rather than character strings there is about a 50% reduction in memory usage, and the compressed data is now <5MB.
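
A rough way to see where the saving comes from, using toy data rather than the real word list. The vocabulary, the sample size, and all variable names here are made up for illustration, and the exact ratio will differ on the real data:

```r
# Toy data: 100,000 draws from a hypothetical 1,000-word vocabulary.
set.seed(1)
vocab  <- sprintf("word%04d", 1:1000)
as_chr <- sample(vocab, 1e5, replace = TRUE)

# The same data stored as integer positions in the vocabulary.
as_int <- match(as_chr, vocab)

# Compare in-memory sizes in bytes. The integer form still needs
# `vocab` to decode it, so that cost is counted as well.
size_chr <- as.numeric(object.size(as_chr))
size_int <- as.numeric(object.size(as_int)) + as.numeric(object.size(vocab))

cat("character:", size_chr, "bytes\n")
cat("integer + vocabulary:", size_int, "bytes\n")
```

On a 64-bit build each character element costs an 8-byte pointer while each integer costs 4 bytes, which is consistent with a reduction of roughly half once the vocabulary is small relative to the data.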

The downside is that creating the integer vectors from the word lists isn't very fast, and you wouldn't want to do this dynamically.
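
A minimal sketch of that one-off conversion, assuming the raw data is a named list with one character vector of synonyms per word. The toy entries, the variable names, and the commented-out file name are illustrative assumptions, not the contents of the actual `data-raw` script:

```r
# Hypothetical raw data: one character vector of synonyms per root word.
raw <- list(
  happy = c("glad", "joyful", "cheerful"),
  glad  = c("happy", "joyful"),
  sad   = c("unhappy", "gloomy")
)

# 1. A sorted vector of every unique word (root words and synonyms).
all_words <- sort(unique(c(names(raw), unlist(raw, use.names = FALSE))))

# 2. Each synonym vector becomes integer positions in `all_words`.
#    match() is the slow part on a large word list, which is why this
#    would be run once off-line rather than at load time.
encoded <- lapply(raw, match, table = all_words)

# Decoding is plain integer indexing, which is cheap.
decoded <- lapply(encoded, function(i) all_words[i])

# The result could then be saved once for the package to ship, e.g.
# saveRDS(list(words = all_words, index = encoded), "words.rds")
```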

The upsides:

- The package is now <5MB
- The package code consists of just 2 functions, i.e. syn() and syns()
- Removed all the .onLoad() stuff that dynamically loaded data
- No longer have to download data at runtime
- Removed the code to "get" and "parse" the words package (this is now all done off-line in data-raw/download-and-compress-moby.R)
As a side effect of all this, issue #15 (syn() and syns() are ill-behaved when a word doesn't exist) is now fixed.
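
To illustrate how a lookup over this structure can behave sensibly for unknown words, here is one plausible shape. `lookup_synonyms()`, `all_words`, and `index` are hypothetical names with toy data, and this is not the package's actual syn() implementation:

```r
# Toy version of the two-part structure described above.
all_words <- c("cheerful", "glad", "gloomy", "happy", "joyful", "sad", "unhappy")
index <- list(
  happy = c(2L, 5L, 1L),
  sad   = c(7L, 3L)
)

# Hypothetical lookup: decode the integer vector for `word`, or return
# an empty character vector when the word is not in the index.
lookup_synonyms <- function(word) {
  idx <- index[[word]]          # NULL when `word` has no entry
  if (is.null(idx)) {
    return(character(0))
  }
  all_words[idx]
}

lookup_synonyms("happy")
lookup_synonyms("not-a-word")
```

Returning a zero-length character vector for a missing word keeps the return type the same in both cases, so callers do not need a special branch for unknown words.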

coolbutuseless and others added 2 commits November 25, 2018 21:17

overhaul the synonym data structure to take up less space

c374401

Merge branch 'master' into compress-words

4cb1381

njtierney merged commit a2f750b into njtierney:master Nov 27, 2018

This was referenced Nov 27, 2018

Remove dependencies #17

Closed

use rappdirs package to locate safe place to put files #14

Closed

move words.txt into zenodo for safe storage #5

Open
