Skip to content
This repository has been archived by the owner. It is now read-only.

feat(experiments): add text taggers #4294

Merged
merged 8 commits into from Sep 5, 2018
Merged

feat(experiments): add text taggers #4294

merged 8 commits into from Sep 5, 2018

Conversation

@jonathankoren
Copy link
Contributor

@jonathankoren jonathankoren commented Aug 7, 2018

Update 2

Update

  • Switched to for..of loops
  • Removed double-quote on keys
  • Did not switch from Objects to Maps. We use Object in a lot of places in the follow on PR (which you haven't seen yet). This would be a big change. It would mean converting the fundamental data structure. Honestly, I'm not comfortable with making that change at this time. I'm open to doing it later, especially if we need a speed up, but not today.

Description

This PR adds the ability to classify text. We define two different classifiers, a Naïve Bayes (NB) classifier, and a multiclass nonnegative matrix factorization (NMF) classifier. Both use a bag of words, TF-IDF vectors as features. The purpose of this code is to allow Firefox to classify pages into topics, by examining the text found on the page.

This code is part of the Pocket Personalization v2 experiment which uses content analysis to locally build interest profiles.

This code is dark.

Testing

Unit tests
This code has no current consumers.

Related

We reviewed this internally on PR Pocket#1

See Also

https://docs.google.com/document/d/12OtUZywivIvBnO3hmMNjptmIQ8cQLFOIOzJO4bRLqdg/edit
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

@k88hudson
Copy link
Member

@k88hudson k88hudson commented Aug 7, 2018

@jonathankoren hey, when is the expected ship date for this experiment / i.e. how soon do you need this merged / reviewed?

@jonathankoren
Copy link
Contributor Author

@jonathankoren jonathankoren commented Aug 7, 2018

@k88hudson We're aiming to land the feature before August 23 code freeze. There's more code to be written, but because it's a fairly large amount new code, I thought it was prudent to break it up into smaller PRs rather than hold on to one giant PR. The idea was that it makes it easier to understand, and lowers the overall risk. The feature is going to gated behind an A/B test and will launch dark, and then be turned on for some fraction of users through Shield.

If it's useful, I'm on slack, and have been working with Scott Downe, who I think sits next to you.

@k88hudson
Copy link
Member

@k88hudson k88hudson commented Aug 7, 2018

Ok thanks! Just wanted to know if this was planned for an uplift or not.
We are a little under pressure this week with another feature that needs to get uplifted, but we'll try to get someone to look at this soon. Your approach to break it up into a bunch of PRs sounds great 👍

@jonathankoren
Copy link
Contributor Author

@jonathankoren jonathankoren commented Aug 7, 2018

The plan is to try really really hard to not have to get uplift after the Sept 4 beta release, but if we miss the date, we'll need uplift to beta. Otherwise, we're stuck for a long time. Fortunately, we're going to be behind a flag, so hopefully it should possible.

@ncloudioj
Copy link
Member

@ncloudioj ncloudioj commented Aug 15, 2018

Let's also track this on Bugzilla https://bugzilla.mozilla.org/show_bug.cgi?id=1483667

let bestClassId = -1;
let bestClassLabel = null;
let logSumExp = 0.0; // will be P(x). Used to create a proper probability
for (let classId = 0; classId < this.model.classes.length; classId++) {

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

Nit: for..of statement is more succinct for array iteration. Consider replacing all the raw for loops with for..of in this patch?

This comment has been minimized.

@ncloudioj

ncloudioj Aug 29, 2018
Member

Did you miss this one?

This comment has been minimized.

@jonathankoren

jonathankoren Aug 30, 2018
Author Contributor

No. I need the id so I can check it against the positive class id.

let confident = ((bestClassId === this.model.positive_class_id) &&
(bestLogProb > this.model.positive_class_threshold_log_prob));
return {
"label": bestClassLabel,

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

Nit: redundant double quotes for keys.

This comment has been minimized.

@jonathankoren

jonathankoren Aug 29, 2018
Author Contributor

eslint doesn't like it without it. It thinks I'm trying to do shorthand.

logSumExp += Math.exp(classLogProb);
}

// now normalizae the probability by dividing by P(x)

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

s/normalizae/normalize/

This comment has been minimized.

}

// now project toksInLatentSpace back into class space
let predictions = {};

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

Nit: prefer to use Map


// Javascript doesn't support unicode page mnemonics out of the box, and so
// we have to define them ourselves.
const UNICODE_SPACE = "\x20\xA0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000";

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

All these constants here want some comments to clarify:

  • What they are in general
  • Some extra reference links, which pointing to more details about those unicodes, would be even better

This comment has been minimized.

@jonathankoren

jonathankoren Aug 24, 2018
Author Contributor

Unicode specifies certain mnemonics for code pages and character classes. They call them "character properties" https://en.wikipedia.org/wiki/Unicode_character_property . These mnemonics are have been adopted by many regular expression libraries, however the standard regexp system doesn't support unicode character properties. If you want to use them, you have to pull in a library like XRegExp.

Each of these sections contains the characters values / ranges for specific character property: Whitespace, Symbol (S), Punctuation (P), Number (N), Mark (M), and Letter (L). The tables are defined by Unicode. I think I reformatted the data from XRegExp to get it to work here.

We have to support unicode the tokenizer, as we expect to see (and have seen in our training data) nonenglish and multilingual documents. This allows us to use the same code for any language that uses spaces to separate words.

Interestingly enough, it looks like xregexp is pulled in via two or three levels of dependencies, but I have no idea how / should you'd import it directly.

This comment has been minimized.

@ncloudioj

ncloudioj Aug 24, 2018
Member

Could you add this to the code?

const UNICODE_LETTER = "A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u08A0-\u08B4\u08B6-\u08BD\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C60\u0C61\u0C80\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D54-\u0D56\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u1884\u1887-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1C80-\u1C88\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FD5\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA69D\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA7AE\uA7B0-\uA7B7\uA7F7-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA8FD\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uA9E0-\uA9E4\uA9E6-\uA9EF\uA9FA-\uA9FE\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA7E-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uAB30-\uAB5A\uAB5C-\uAB65\uAB70-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC";

const REGEXP_SPLITS = new RegExp(`[${UNICODE_SPACE}${UNICODE_SYMBOL}${UNICODE_PUNCT}]+`);
const REGEXP_ALPHANUMS = new RegExp(`^[${UNICODE_NUMBER}${UNICODE_MARK}${UNICODE_LETTER}]+$`);

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

Do we also want to filter other stopwords? Which could help reduce the size of result.

This comment has been minimized.

@jonathankoren

jonathankoren Aug 24, 2018
Author Contributor

That wouldn't happen here. This is simply cutting up a stream of bytes into tokens to check against the dictionary. You have to do this even before you do stopword removal, because you don't even know what the words are yet.

We end up doing something much more aggressive than stop word removal because the tagging models use closed vocabulary. If the word isn't in the vocabulary, we simply drop it.

/**
* Downcases the text, and splits it into consecutive alphanumeric characters.
* This is locale aware, and so will not strip accents. This uses "word
* breaks", and os is not appropriate for languages without them

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

what do you mean by "os"?

This comment has been minimized.

@jonathankoren

jonathankoren Aug 24, 2018
Author Contributor

That's just a typo. It looks like I accidentally doubled the word "is", but ended up typing it wrong the first time.

* (e.g. Chinese).
*/
tokenize(text) {
return text.toLocaleLowerCase().split(REGEXP_SPLITS).filter(tok => tok.match(REGEXP_ALPHANUMS));

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

These two pieces of regex processing could be CPU intensive, we might want to keep an eye on it by profiling.

* be dropped.
*/
toksTotfIdfVector(tokens, vocab_idfs) {
let tfidfs = {};

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

Again, a Map would be better than object to store a large number of key/value pairs.

This comment has been minimized.

@jonathankoren

jonathankoren Aug 24, 2018
Author Contributor

What's the difference?

});
l2Norm = Math.sqrt(l2Norm);
Object.keys(tfidfs).forEach(tok => {
tfidfs[tok][1] /= l2Norm;

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018
Member

Any chance to see l2Norm === 0? If so, better add a guard to prevent the divided-by-zero error.

Copy link
Member

@ncloudioj ncloudioj left a comment

It looks good in general. Having a few nits about the code style and other questions in the review comments. Please address them, then I'd like to take another look at this patch. Thanks!

describe("Naive Bayes Tagger", () => {
describe("#tag", () => {
let model = {
"model_type": "nb",

This comment has been minimized.

@ncloudioj

ncloudioj Aug 29, 2018
Member

Let's also remove the double quotes in this files.

};

// RELEASE THE TESTS!
for (let i = 0; i < testCases.length; i++) {

This comment has been minimized.

@ncloudioj

ncloudioj Aug 29, 2018
Member

Let's do for..of for this too.


// RELEASE THE TESTS!
for (let i = 0; i < testCases.length; i++) {
checkTag(testCases[i]);

This comment has been minimized.

};

// run the tests
for (let i = 0; i < testCases.length; i++) {

This comment has been minimized.

let predictions = {};
Object.keys(this.model.document_topic).forEach(topic => {
let score = 0;
for (let i = 0; i < toksInLatentSpace.length; i++) {

This comment has been minimized.

@ncloudioj

ncloudioj Aug 29, 2018
Member

Let's for..of this.

This comment has been minimized.

@jonathankoren

jonathankoren Aug 30, 2018
Author Contributor

It doesn't make a lot of sense to do that here, because i is used as the index across two different arrays. If you used a for...of, you'd still need to keep the i around and increment it outside of the for statement so you can do the lookup on toksInLatentSpace in line 52. I'd rather keep the index explicit here to reinforce the 1:1 correspondence.

let tfidfs = {};

// calcualte the term frequencies
for (let i = 0; i < tokens.length; i++) {

This comment has been minimized.

Copy link
Member

@ncloudioj ncloudioj left a comment

Please resolve all the issues in my comments.

For the Object->Map switch, if you'd like to do that later, please file a follow-up bug for that. It's not only for speed, but also to be consistent in this codebase. If you're looking for a collection structure to store key/value pairs, Map is always preferred.

Copy link
Member

@ncloudioj ncloudioj left a comment

r+! After fixing that final nit.


// Javascript doesn't support unicode page mnemonics out of the box, and so
// we have to define them ourselves.
const UNICODE_SPACE = "\x20\xA0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000";

This comment has been minimized.

@ncloudioj

ncloudioj Aug 30, 2018
Member

Please add the comments that you added in 6eae545?

@ncloudioj
Copy link
Member

@ncloudioj ncloudioj commented Sep 4, 2018

@jonathankoren There is an eslint error reported by Taskcluster, could you do a rebase against master to see if that fixes it?

@ncloudioj ncloudioj merged commit 9c44263 into mozilla:master Sep 5, 2018
2 checks passed
2 checks passed
Taskcluster (pull_request) TaskGroup: success
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants