Restructure data for performance #24

Closed
jmsv opened this Issue Jun 11, 2018 · 5 comments

jmsv commented Jun 11, 2018

At the moment, the JSON dataset is structured as follows:

[
  {
    "a_lang": "eng",
    "a_word": "potato",
    "b_lang": "tnq",
    "b_word": "batata"
  },
  { ...

This is loaded as a list of Python dicts and filtered with a linear scan:

row = list(filter(
    lambda entry: entry['a_word'] == self.word
    and entry['a_lang'] == self.language.iso,
    etymwn_data))

If the data were restructured so that words acted as dict keys, looking up a word would be much faster: Python dicts are hash tables, so lookups are constant-time on average instead of the linear scan above.

Data could instead be structured by language then by word, as follows:

{
    "lang":{
        "word":[
            {
                "origin-word":"origin-lang"
            }
        ]
    }
}

for example,

{
    "eng":{
        "airport":[
            {"air":"eng"},
            {"port":"eng"}
        ],
        "banana":[
            {"banaana":"wol"}
        ]
    },
    "lat":{
        "fructus":[
            {"fruor":"lat"}
        ]
    }
}

Origin words are kept as individual single-entry dicts (rather than merged into one dict) to prevent key collisions when the same origin word appears with more than one language.
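For illustration, a minimal sketch of the conversion, assuming the flat-list format shown above (restructure is a hypothetical helper, not existing code):

from collections import defaultdict

def restructure(etymwn_data):
    # Build {lang: {word: [{origin_word: origin_lang}, ...]}} from the flat list
    data = defaultdict(dict)
    for entry in etymwn_data:
        origins = data[entry['a_lang']].setdefault(entry['a_word'], [])
        origins.append({entry['b_word']: entry['b_lang']})
    return dict(data)

# Lookup then becomes two hash-table accesses instead of a linear scan:
# restructure(etymwn_data)['eng']['banana']  ->  [{'banaana': 'wol'}]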

jmsv commented Jun 11, 2018

Open to suggestions for better ways of structuring it

alxwrd commented Jun 11, 2018

Potentially could expand out the origin dicts:

"eng":{
    "airport":[
        {"word": "air", "lang": "eng"},
        {"word": "port", "lang": "eng"}
    ],
    "banana":[
        {"word": "banaana", "lang": "wol"}
    ]
}

I think it could make the loading of origins slightly clearer:

source_origins = data[self.language.iso][self.word]

origins = [
    ety.Word(origin["word"], origin["lang"]) for origin in source_origins
]

vs

source_origins = data[self.language.iso][self.word]

origins = [
    ety.Word(*info) for origin in source_origins for info in origin.items()
]

The downside to expanding out the dicts is that it'll result in a larger file.

jmsv commented Jun 11, 2018

@alxwrd I think it might be better to keep the smaller file and just comment the code or something to explain what's happening.

Rather than using *info, word and lang could be unpacked by hand, which is probably more readable:

origins = [
    ety.Word(word, lang)
    for origin in source_origins
    for word, lang in origin.items()
]
alxwrd commented Jun 11, 2018

Yea that's nice actually 😋

For creating the new data file, is it going to be rebuilt from the original source, or by transforming the current file?

I think it'd be good to start from the source .tsv and create a build_ety_data script that could live either in the repo root or in ety/wn/. The script would fetch the archived data, unpack it, and perform the transform (roughly along the lines of the sketch below). Then, if there are any updates to the source, the data can easily be rebuilt.
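Something like the following sketch; the source URL, archive layout, column format, relation label, and output path here are placeholders/assumptions, not the real details:

import csv
import json
import urllib.request
import zipfile

SOURCE_URL = "https://example.org/etymwn.zip"  # placeholder, not the real source URL
ARCHIVE = "etymwn.zip"
TSV_NAME = "etymwn.tsv"      # assumed name of the .tsv inside the archive
OUTPUT = "ety/wn/data.json"  # assumed output path

def fetch():
    # Download the archived source data and unpack the .tsv
    urllib.request.urlretrieve(SOURCE_URL, ARCHIVE)
    with zipfile.ZipFile(ARCHIVE) as archive:
        archive.extract(TSV_NAME)

def transform():
    # Convert the assumed three-column .tsv ("lang: word", relation, "lang: word")
    # into the nested {lang: {word: [{origin_word: origin_lang}]}} structure.
    data = {}
    with open(TSV_NAME, newline='', encoding='utf-8') as tsv:
        for a, relation, b in csv.reader(tsv, delimiter='\t'):
            if relation != 'rel:etymology':  # assumed relation label
                continue
            a_lang, a_word = a.split(': ', 1)
            b_lang, b_word = b.split(': ', 1)
            data.setdefault(a_lang, {}).setdefault(a_word, []).append({b_word: b_lang})
    with open(OUTPUT, 'w', encoding='utf-8') as out:
        json.dump(data, out)

if __name__ == '__main__':
    fetch()
    transform()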

jmsv commented Jun 12, 2018

Yeah, I was thinking we'd start from the original source too. I was in touch with the guy who maintains the dataset a couple of weeks ago, and apparently a new version will hopefully be released by August.

A script that stays in the repo is definitely a good idea - this would probably be best kept in ety/wn.

It's probably a good idea to only download the dataset if it's not available locally, but with the option to force a redownload; the original source is quite big, so downloading is time-consuming. Something like the check sketched below.
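A minimal sketch of that check, reusing the fetch() and ARCHIVE names from the hypothetical build script above (the force_download flag is also just an assumption):

import os

def ensure_dataset(force_download=False):
    # Only download the archive if it isn't already cached locally,
    # unless a redownload is explicitly requested.
    if force_download or not os.path.exists(ARCHIVE):
        fetch()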
