This repository was archived by the owner on Jan 31, 2018. It is now read-only.

[bug 1026019] Rework bigram generation #308

Merged: willkg merged 1 commit into mozilla:master from willkg:1026019-bigrams on Jun 17, 2014

@willkg
Member

@willkg willkg commented Jun 16, 2014

This reworks the tokenizing part of bigram generation so that it's no
longer (ab)using Elasticsearch to do the work. Now it does its own
tokenizing.

I also added some really rough tests to make sure the new tokenizing
wasn't really horrible. The tests run roughly the same before and after
the code changes. The new code handles the "add-on" case better--the old
code broke that into "add" and "on" and then ditched "on".
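
As a rough illustration of the "add-on" case described above, here is a minimal sketch of a standalone tokenizer and bigram builder. It is not the code from this pull request: the names `tokenize`, `naive_tokenize`, `bigrams`, the regular expressions, and the stopword list are all assumptions made up for the example, and `naive_tokenize` only imitates the outcome described for the old approach rather than Elasticsearch's actual analysis.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = frozenset(["a", "an", "and", "is", "it", "of", "on", "the", "to"])

# Runs of letters/digits, optionally joined by internal hyphens or
# apostrophes, so "add-on" survives as a single token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")

# Splits on every non-alphanumeric character, so "add-on" becomes
# "add" and "on" (a rough imitation of the behavior described above).
NAIVE_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text, extract tokens, and drop stopwords."""
    tokens = TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


def naive_tokenize(text):
    """Same as tokenize(), but breaks hyphenated words apart first."""
    tokens = NAIVE_TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


def bigrams(tokens):
    """Join each pair of adjacent tokens into a space-separated bigram."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


if __name__ == "__main__":
    text = "The add-on is broken"
    print(bigrams(naive_tokenize(text)))  # "on" is lost as a stopword
    print(bigrams(tokenize(text)))        # "add-on" is kept whole
```

Keeping the hyphen inside the token means the stopword filter never sees a bare "on", which is the difference the description points at.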

This radically improves reindexing: it cuts reindexing times in half or better.

r?

@willkg
Member Author

willkg commented Jun 16, 2014

To test:

  1. run the tests
  2. run `./manage.py esreindex --percent 10`

It's pretty benign, so it won't crash. As far as I know, no one is using the code that consumes the bigrams. I wrote up bug 1026019 to cover making this better. So for now, this mostly just affects how fast we can reindex.

@rlr
Contributor

rlr commented Jun 16, 2014

looks good. tests pass and all that. r+!

@willkg willkg merged commit dccfb76 into mozilla:master Jun 17, 2014
@willkg
Member Author

willkg commented Jun 17, 2014

Landed in dccfb76 [bug 1026019] Rework bigram generation

@willkg willkg deleted the 1026019-bigrams branch June 17, 2014 02:56