New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(experiments): add text taggers #4294

Merged
merged 8 commits into from Sep 5, 2018

Conversation

Projects
None yet
3 participants
@jonathankoren
Contributor

jonathankoren commented Aug 7, 2018

Update 2

Update

  • Switched to for..of loops
  • Removed double-quote on keys
  • Did not switch from Objects to Maps. We use Object in a lot of places in the follow on PR (which you haven't seen yet). This would be a big change. It would mean converting the fundamental data structure. Honestly, I'm not comfortable with making that change at this time. I'm open to doing it later, especially if we need a speed up, but not today.

Description

This PR adds the ability to classify text. We define two different classifiers, a Naïve Bayes (NB) classifier, and a multiclass nonnegative matrix factorization (NMF) classifier. Both use a bag of words, TF-IDF vectors as features. The purpose of this code is to allow Firefox to classify pages into topics, by examining the text found on the page.

This code is part of the Pocket Personalization v2 experiment which uses content analysis to locally build interest profiles.

This code is dark.

Testing

Unit tests
This code has no current consumers.

Related

We reviewed this internally on PR Pocket#1

See Also

https://docs.google.com/document/d/12OtUZywivIvBnO3hmMNjptmIQ8cQLFOIOzJO4bRLqdg/edit
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

@k88hudson

This comment has been minimized.

Show comment
Hide comment
@k88hudson

k88hudson Aug 7, 2018

Member

@jonathankoren hey, when is the expected ship date for this experiment / i.e. how soon do you need this merged / reviewed?

Member

k88hudson commented Aug 7, 2018

@jonathankoren hey, when is the expected ship date for this experiment / i.e. how soon do you need this merged / reviewed?

@jonathankoren

This comment has been minimized.

Show comment
Hide comment
@jonathankoren

jonathankoren Aug 7, 2018

Contributor

@k88hudson We're aiming to land the feature before August 23 code freeze. There's more code to be written, but because it's a fairly large amount new code, I thought it was prudent to break it up into smaller PRs rather than hold on to one giant PR. The idea was that it makes it easier to understand, and lowers the overall risk. The feature is going to gated behind an A/B test and will launch dark, and then be turned on for some fraction of users through Shield.

If it's useful, I'm on slack, and have been working with Scott Downe, who I think sits next to you.

Contributor

jonathankoren commented Aug 7, 2018

@k88hudson We're aiming to land the feature before August 23 code freeze. There's more code to be written, but because it's a fairly large amount new code, I thought it was prudent to break it up into smaller PRs rather than hold on to one giant PR. The idea was that it makes it easier to understand, and lowers the overall risk. The feature is going to gated behind an A/B test and will launch dark, and then be turned on for some fraction of users through Shield.

If it's useful, I'm on slack, and have been working with Scott Downe, who I think sits next to you.

@k88hudson

This comment has been minimized.

Show comment
Hide comment
@k88hudson

k88hudson Aug 7, 2018

Member

Ok thanks! Just wanted to know if this was planned for an uplift or not.
We are a little under pressure this week with another feature that needs to get uplifted, but we'll try to get someone to look at this soon. Your approach to break it up into a bunch of PRs sounds great 👍

Member

k88hudson commented Aug 7, 2018

Ok thanks! Just wanted to know if this was planned for an uplift or not.
We are a little under pressure this week with another feature that needs to get uplifted, but we'll try to get someone to look at this soon. Your approach to break it up into a bunch of PRs sounds great 👍

@jonathankoren

This comment has been minimized.

Show comment
Hide comment
@jonathankoren

jonathankoren Aug 7, 2018

Contributor

The plan is to try really really hard to not have to get uplift after the Sept 4 beta release, but if we miss the date, we'll need uplift to beta. Otherwise, we're stuck for a long time. Fortunately, we're going to be behind a flag, so hopefully it should possible.

Contributor

jonathankoren commented Aug 7, 2018

The plan is to try really really hard to not have to get uplift after the Sept 4 beta release, but if we miss the date, we'll need uplift to beta. Otherwise, we're stuck for a long time. Fortunately, we're going to be behind a flag, so hopefully it should possible.

@ncloudioj

This comment has been minimized.

Show comment
Hide comment
@ncloudioj

ncloudioj Aug 15, 2018

Contributor

Let's also track this on Bugzilla https://bugzilla.mozilla.org/show_bug.cgi?id=1483667

Contributor

ncloudioj commented Aug 15, 2018

Let's also track this on Bugzilla https://bugzilla.mozilla.org/show_bug.cgi?id=1483667

let bestClassId = -1;
let bestClassLabel = null;
let logSumExp = 0.0; // will be P(x). Used to create a proper probability
for (let classId = 0; classId < this.model.classes.length; classId++) {

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Nit: for..of statement is more succinct for array iteration. Consider replacing all the raw for loops with for..of in this patch?

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Nit: for..of statement is more succinct for array iteration. Consider replacing all the raw for loops with for..of in this patch?

This comment has been minimized.

@ncloudioj

ncloudioj Aug 29, 2018

Contributor

Did you miss this one?

@ncloudioj

ncloudioj Aug 29, 2018

Contributor

Did you miss this one?

This comment has been minimized.

@jonathankoren

jonathankoren Aug 30, 2018

Contributor

No. I need the id so I can check it against the positive class id.

@jonathankoren

jonathankoren Aug 30, 2018

Contributor

No. I need the id so I can check it against the positive class id.

let confident = ((bestClassId === this.model.positive_class_id) &&
(bestLogProb > this.model.positive_class_threshold_log_prob));
return {
"label": bestClassLabel,

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Nit: redundant double quotes for keys.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Nit: redundant double quotes for keys.

This comment has been minimized.

@jonathankoren

jonathankoren Aug 29, 2018

Contributor

eslint doesn't like it without it. It thinks I'm trying to do shorthand.

@jonathankoren

jonathankoren Aug 29, 2018

Contributor

eslint doesn't like it without it. It thinks I'm trying to do shorthand.

Show outdated Hide outdated lib/NaiveBayesTextTagger.jsm Outdated
}
// now project toksInLatentSpace back into class space
let predictions = {};

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Nit: prefer to use Map

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Nit: prefer to use Map

Show outdated Hide outdated lib/TfIdfVectorizer.jsm Outdated
const UNICODE_LETTER = "A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u08A0-\u08B4\u08B6-\u08BD\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C60\u0C61\u0C80\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D54-\u0D56\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u1884\u1887-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1C80-\u1C88\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FD5\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA69D\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA7AE\uA7B0-\uA7B7\uA7F7-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA8FD\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uA9E0-\uA9E4\uA9E6-\uA9EF\uA9FA-\uA9FE\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA7E-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uAB30-\uAB5A\uAB5C-\uAB65\uAB70-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC";
const REGEXP_SPLITS = new RegExp(`[${UNICODE_SPACE}${UNICODE_SYMBOL}${UNICODE_PUNCT}]+`);
const REGEXP_ALPHANUMS = new RegExp(`^[${UNICODE_NUMBER}${UNICODE_MARK}${UNICODE_LETTER}]+$`);

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Do we also want to filter other stopwords? Which could help reduce the size of result.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Do we also want to filter other stopwords? Which could help reduce the size of result.

This comment has been minimized.

@jonathankoren

jonathankoren Aug 24, 2018

Contributor

That wouldn't happen here. This is simply cutting up a stream of bytes into tokens to check against the dictionary. You have to do this even before you do stopword removal, because you don't even know what the words are yet.

We end up doing something much more aggressive than stop word removal because the tagging models use closed vocabulary. If the word isn't in the vocabulary, we simply drop it.

@jonathankoren

jonathankoren Aug 24, 2018

Contributor

That wouldn't happen here. This is simply cutting up a stream of bytes into tokens to check against the dictionary. You have to do this even before you do stopword removal, because you don't even know what the words are yet.

We end up doing something much more aggressive than stop word removal because the tagging models use closed vocabulary. If the word isn't in the vocabulary, we simply drop it.

/**
* Downcases the text, and splits it into consecutive alphanumeric characters.
* This is locale aware, and so will not strip accents. This uses "word
* breaks", and os is not appropriate for languages without them

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

what do you mean by "os"?

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

what do you mean by "os"?

This comment has been minimized.

@jonathankoren

jonathankoren Aug 24, 2018

Contributor

That's just a typo. It looks like I accidentally doubled the word "is", but ended up typing it wrong the first time.

@jonathankoren

jonathankoren Aug 24, 2018

Contributor

That's just a typo. It looks like I accidentally doubled the word "is", but ended up typing it wrong the first time.

* (e.g. Chinese).
*/
tokenize(text) {
return text.toLocaleLowerCase().split(REGEXP_SPLITS).filter(tok => tok.match(REGEXP_ALPHANUMS));

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

These two pieces of regex processing could be CPU intensive, we might want to keep an eye on it by profiling.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

These two pieces of regex processing could be CPU intensive, we might want to keep an eye on it by profiling.

* be dropped.
*/
toksTotfIdfVector(tokens, vocab_idfs) {
let tfidfs = {};

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Again, a Map would be better than object to store a large number of key/value pairs.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Again, a Map would be better than object to store a large number of key/value pairs.

This comment has been minimized.

@jonathankoren

jonathankoren Aug 24, 2018

Contributor

What's the difference?

@jonathankoren

jonathankoren Aug 24, 2018

Contributor

What's the difference?

This comment has been minimized.

@ncloudioj

ncloudioj Aug 24, 2018

Contributor
});
l2Norm = Math.sqrt(l2Norm);
Object.keys(tfidfs).forEach(tok => {
tfidfs[tok][1] /= l2Norm;

This comment has been minimized.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Any chance to see l2Norm === 0? If so, better add a guard to prevent the divided-by-zero error.

@ncloudioj

ncloudioj Aug 21, 2018

Contributor

Any chance to see l2Norm === 0? If so, better add a guard to prevent the divided-by-zero error.

@ncloudioj

It looks good in general. Having a few nits about the code style and other questions in the review comments. Please address them, then I'd like to take another look at this patch. Thanks!

};
// run the tests
for (let i = 0; i < testCases.length; i++) {

This comment has been minimized.

@ncloudioj

ncloudioj Aug 29, 2018

Contributor

ditto

@ncloudioj

ncloudioj Aug 29, 2018

Contributor

ditto

let predictions = {};
Object.keys(this.model.document_topic).forEach(topic => {
let score = 0;
for (let i = 0; i < toksInLatentSpace.length; i++) {

This comment has been minimized.

@ncloudioj

ncloudioj Aug 29, 2018

Contributor

Let's for..of this.

@ncloudioj

ncloudioj Aug 29, 2018

Contributor

Let's for..of this.

This comment has been minimized.

@jonathankoren

jonathankoren Aug 30, 2018

Contributor

It doesn't make a lot of sense to do that here, because i is used as the index across two different arrays. If you used a for...of, you'd still need to keep the i around and increment it outside of the for statement so you can do the lookup on toksInLatentSpace in line 52. I'd rather keep the index explicit here to reinforce the 1:1 correspondence.

@jonathankoren

jonathankoren Aug 30, 2018

Contributor

It doesn't make a lot of sense to do that here, because i is used as the index across two different arrays. If you used a for...of, you'd still need to keep the i around and increment it outside of the for statement so you can do the lookup on toksInLatentSpace in line 52. I'd rather keep the index explicit here to reinforce the 1:1 correspondence.

Show outdated Hide outdated lib/TfIdfVectorizer.jsm Outdated
@ncloudioj

Please resolve all the issues in my comments.

For the Object->Map switch, if you'd like to do that later, please file a follow-up bug for that. It's not only for speed, but also to be consistent in this codebase. If you're looking for a collection structure to store key/value pairs, Map is always preferred.

jonathankoren added some commits Aug 30, 2018

@ncloudioj

r+! After fixing that final nit.

Show outdated Hide outdated lib/TfIdfVectorizer.jsm Outdated
@ncloudioj

This comment has been minimized.

Show comment
Hide comment
@ncloudioj

ncloudioj Sep 4, 2018

Contributor

@jonathankoren There is an eslint error reported by Taskcluster, could you do a rebase against master to see if that fixes it?

Contributor

ncloudioj commented Sep 4, 2018

@jonathankoren There is an eslint error reported by Taskcluster, could you do a rebase against master to see if that fixes it?

@ncloudioj ncloudioj merged commit 9c44263 into mozilla:master Sep 5, 2018

2 checks passed

Taskcluster (pull_request) TaskGroup: success
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment