feat(experiments): add text taggers #4294
Conversation
|
@jonathankoren hey, when is the expected ship date for this experiment / i.e. how soon do you need this merged / reviewed? |
|
@k88hudson We're aiming to land the feature before August 23 code freeze. There's more code to be written, but because it's a fairly large amount new code, I thought it was prudent to break it up into smaller PRs rather than hold on to one giant PR. The idea was that it makes it easier to understand, and lowers the overall risk. The feature is going to gated behind an A/B test and will launch dark, and then be turned on for some fraction of users through Shield. If it's useful, I'm on slack, and have been working with Scott Downe, who I think sits next to you. |
|
Ok thanks! Just wanted to know if this was planned for an uplift or not. |
|
The plan is to try really really hard to not have to get uplift after the Sept 4 beta release, but if we miss the date, we'll need uplift to beta. Otherwise, we're stuck for a long time. Fortunately, we're going to be behind a flag, so hopefully it should possible. |
|
Let's also track this on Bugzilla https://bugzilla.mozilla.org/show_bug.cgi?id=1483667 |
| let bestClassId = -1; | ||
| let bestClassLabel = null; | ||
| let logSumExp = 0.0; // will be P(x). Used to create a proper probability | ||
| for (let classId = 0; classId < this.model.classes.length; classId++) { |
ncloudioj
Aug 21, 2018
•
Member
Nit: for..of statement is more succinct for array iteration. Consider replacing all the raw for loops with for..of in this patch?
Nit: for..of statement is more succinct for array iteration. Consider replacing all the raw for loops with for..of in this patch?
ncloudioj
Aug 29, 2018
•
Member
Did you miss this one?
Did you miss this one?
jonathankoren
Aug 30, 2018
•
Author
Contributor
No. I need the id so I can check it against the positive class id.
No. I need the id so I can check it against the positive class id.
| let confident = ((bestClassId === this.model.positive_class_id) && | ||
| (bestLogProb > this.model.positive_class_threshold_log_prob)); | ||
| return { | ||
| "label": bestClassLabel, |
ncloudioj
Aug 21, 2018
Member
Nit: redundant double quotes for keys.
Nit: redundant double quotes for keys.
jonathankoren
Aug 29, 2018
Author
Contributor
eslint doesn't like it without it. It thinks I'm trying to do shorthand.
eslint doesn't like it without it. It thinks I'm trying to do shorthand.
| logSumExp += Math.exp(classLogProb); | ||
| } | ||
|
|
||
| // now normalizae the probability by dividing by P(x) |
ncloudioj
Aug 21, 2018
Member
s/normalizae/normalize/
s/normalizae/normalize/
ncloudioj
Aug 29, 2018
Member
ditto
ditto
| } | ||
|
|
||
| // now project toksInLatentSpace back into class space | ||
| let predictions = {}; |
|
|
||
| // Javascript doesn't support unicode page mnemonics out of the box, and so | ||
| // we have to define them ourselves. | ||
| const UNICODE_SPACE = "\x20\xA0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000"; |
ncloudioj
Aug 21, 2018
Member
All these constants here want some comments to clarify:
- What they are in general
- Some extra reference links, which pointing to more details about those unicodes, would be even better
All these constants here want some comments to clarify:
- What they are in general
- Some extra reference links, which pointing to more details about those unicodes, would be even better
jonathankoren
Aug 24, 2018
Author
Contributor
Unicode specifies certain mnemonics for code pages and character classes. They call them "character properties" https://en.wikipedia.org/wiki/Unicode_character_property . These mnemonics are have been adopted by many regular expression libraries, however the standard regexp system doesn't support unicode character properties. If you want to use them, you have to pull in a library like XRegExp.
Each of these sections contains the characters values / ranges for specific character property: Whitespace, Symbol (S), Punctuation (P), Number (N), Mark (M), and Letter (L). The tables are defined by Unicode. I think I reformatted the data from XRegExp to get it to work here.
We have to support unicode the tokenizer, as we expect to see (and have seen in our training data) nonenglish and multilingual documents. This allows us to use the same code for any language that uses spaces to separate words.
Interestingly enough, it looks like xregexp is pulled in via two or three levels of dependencies, but I have no idea how / should you'd import it directly.
Unicode specifies certain mnemonics for code pages and character classes. They call them "character properties" https://en.wikipedia.org/wiki/Unicode_character_property . These mnemonics are have been adopted by many regular expression libraries, however the standard regexp system doesn't support unicode character properties. If you want to use them, you have to pull in a library like XRegExp.
Each of these sections contains the characters values / ranges for specific character property: Whitespace, Symbol (S), Punctuation (P), Number (N), Mark (M), and Letter (L). The tables are defined by Unicode. I think I reformatted the data from XRegExp to get it to work here.
We have to support unicode the tokenizer, as we expect to see (and have seen in our training data) nonenglish and multilingual documents. This allows us to use the same code for any language that uses spaces to separate words.
Interestingly enough, it looks like xregexp is pulled in via two or three levels of dependencies, but I have no idea how / should you'd import it directly.
ncloudioj
Aug 24, 2018
Member
Could you add this to the code?
Could you add this to the code?
| const UNICODE_LETTER = "A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u08A0-\u08B4\u08B6-\u08BD\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C60\u0C61\u0C80\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D54-\u0D56\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u1884\u1887-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1C80-\u1C88\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FD5\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA69D\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA7AE\uA7B0-\uA7B7\uA7F7-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA8FD\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uA9E0-\uA9E4\uA9E6-\uA9EF\uA9FA-\uA9FE\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA7E-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uAB30-\uAB5A\uAB5C-\uAB65\uAB70-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC"; | ||
|
|
||
| const REGEXP_SPLITS = new RegExp(`[${UNICODE_SPACE}${UNICODE_SYMBOL}${UNICODE_PUNCT}]+`); | ||
| const REGEXP_ALPHANUMS = new RegExp(`^[${UNICODE_NUMBER}${UNICODE_MARK}${UNICODE_LETTER}]+$`); |
ncloudioj
Aug 21, 2018
Member
Do we also want to filter other stopwords? Which could help reduce the size of result.
Do we also want to filter other stopwords? Which could help reduce the size of result.
jonathankoren
Aug 24, 2018
Author
Contributor
That wouldn't happen here. This is simply cutting up a stream of bytes into tokens to check against the dictionary. You have to do this even before you do stopword removal, because you don't even know what the words are yet.
We end up doing something much more aggressive than stop word removal because the tagging models use closed vocabulary. If the word isn't in the vocabulary, we simply drop it.
That wouldn't happen here. This is simply cutting up a stream of bytes into tokens to check against the dictionary. You have to do this even before you do stopword removal, because you don't even know what the words are yet.
We end up doing something much more aggressive than stop word removal because the tagging models use closed vocabulary. If the word isn't in the vocabulary, we simply drop it.
| /** | ||
| * Downcases the text, and splits it into consecutive alphanumeric characters. | ||
| * This is locale aware, and so will not strip accents. This uses "word | ||
| * breaks", and os is not appropriate for languages without them |
ncloudioj
Aug 21, 2018
Member
what do you mean by "os"?
what do you mean by "os"?
jonathankoren
Aug 24, 2018
Author
Contributor
That's just a typo. It looks like I accidentally doubled the word "is", but ended up typing it wrong the first time.
That's just a typo. It looks like I accidentally doubled the word "is", but ended up typing it wrong the first time.
| * (e.g. Chinese). | ||
| */ | ||
| tokenize(text) { | ||
| return text.toLocaleLowerCase().split(REGEXP_SPLITS).filter(tok => tok.match(REGEXP_ALPHANUMS)); |
ncloudioj
Aug 21, 2018
•
Member
These two pieces of regex processing could be CPU intensive, we might want to keep an eye on it by profiling.
These two pieces of regex processing could be CPU intensive, we might want to keep an eye on it by profiling.
| * be dropped. | ||
| */ | ||
| toksTotfIdfVector(tokens, vocab_idfs) { | ||
| let tfidfs = {}; |
ncloudioj
Aug 21, 2018
•
Member
Again, a Map would be better than object to store a large number of key/value pairs.
Again, a Map would be better than object to store a large number of key/value pairs.
jonathankoren
Aug 24, 2018
Author
Contributor
What's the difference?
What's the difference?
| }); | ||
| l2Norm = Math.sqrt(l2Norm); | ||
| Object.keys(tfidfs).forEach(tok => { | ||
| tfidfs[tok][1] /= l2Norm; |
ncloudioj
Aug 21, 2018
Member
Any chance to see l2Norm === 0? If so, better add a guard to prevent the divided-by-zero error.
Any chance to see l2Norm === 0? If so, better add a guard to prevent the divided-by-zero error.
|
It looks good in general. Having a few nits about the code style and other questions in the review comments. Please address them, then I'd like to take another look at this patch. Thanks! |
| describe("Naive Bayes Tagger", () => { | ||
| describe("#tag", () => { | ||
| let model = { | ||
| "model_type": "nb", |
ncloudioj
Aug 29, 2018
Member
Let's also remove the double quotes in this files.
Let's also remove the double quotes in this files.
| }; | ||
|
|
||
| // RELEASE THE TESTS! | ||
| for (let i = 0; i < testCases.length; i++) { |
ncloudioj
Aug 29, 2018
Member
Let's do for..of for this too.
Let's do for..of for this too.
|
|
||
| // RELEASE THE TESTS! | ||
| for (let i = 0; i < testCases.length; i++) { | ||
| checkTag(testCases[i]); |
ncloudioj
Aug 29, 2018
Member
ditto
ditto
| }; | ||
|
|
||
| // run the tests | ||
| for (let i = 0; i < testCases.length; i++) { |
ncloudioj
Aug 29, 2018
Member
ditto
ditto
| let predictions = {}; | ||
| Object.keys(this.model.document_topic).forEach(topic => { | ||
| let score = 0; | ||
| for (let i = 0; i < toksInLatentSpace.length; i++) { |
ncloudioj
Aug 29, 2018
Member
Let's for..of this.
Let's for..of this.
jonathankoren
Aug 30, 2018
•
Author
Contributor
It doesn't make a lot of sense to do that here, because i is used as the index across two different arrays. If you used a for...of, you'd still need to keep the i around and increment it outside of the for statement so you can do the lookup on toksInLatentSpace in line 52. I'd rather keep the index explicit here to reinforce the 1:1 correspondence.
It doesn't make a lot of sense to do that here, because i is used as the index across two different arrays. If you used a for...of, you'd still need to keep the i around and increment it outside of the for statement so you can do the lookup on toksInLatentSpace in line 52. I'd rather keep the index explicit here to reinforce the 1:1 correspondence.
| let tfidfs = {}; | ||
|
|
||
| // calcualte the term frequencies | ||
| for (let i = 0; i < tokens.length; i++) { |
ncloudioj
Aug 29, 2018
Member
ditto
ditto
|
Please resolve all the issues in my comments. For the Object->Map switch, if you'd like to do that later, please file a follow-up bug for that. It's not only for speed, but also to be consistent in this codebase. If you're looking for a collection structure to store key/value pairs, Map is always preferred. |
|
r+! After fixing that final nit. |
|
|
||
| // Javascript doesn't support unicode page mnemonics out of the box, and so | ||
| // we have to define them ourselves. | ||
| const UNICODE_SPACE = "\x20\xA0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000"; |
|
@jonathankoren There is an eslint error reported by Taskcluster, could you do a rebase against master to see if that fixes it? |
Update 2
Update
Description
This PR adds the ability to classify text. We define two different classifiers, a Naïve Bayes (NB) classifier, and a multiclass nonnegative matrix factorization (NMF) classifier. Both use a bag of words, TF-IDF vectors as features. The purpose of this code is to allow Firefox to classify pages into topics, by examining the text found on the page.
This code is part of the Pocket Personalization v2 experiment which uses content analysis to locally build interest profiles.
This code is dark.
Testing
Unit tests
This code has no current consumers.
Related
We reviewed this internally on PR Pocket#1
See Also
https://docs.google.com/document/d/12OtUZywivIvBnO3hmMNjptmIQ8cQLFOIOzJO4bRLqdg/edit
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
https://en.wikipedia.org/wiki/Tf%E2%80%93idf