
92% price accuracy #275

Merged 42 commits on Nov 20, 2018

Conversation

@erikrose (Contributor) commented on Nov 16, 2018

As per this morning's email, this bumps the price accuracy up to 92% and the image accuracy to 100%, for a total of 8 percentage points of improvement.

New testing results (samples 26-75):
Entity        Accuracy       95% Confidence Interval
----------------------------------------------------
title         100%           100.0 - 100%
image         100%           100.0 - 100%
price          92%            84.5 - 99.5%

It's also tested on a corpus twice as big:

Old testing results:
Entity        Accuracy       95% Confidence Interval
----------------------------------------------------
image          96%            88.3 - 100%
title         100%           100.0 - 100%
price          88%            75.3 - 100%

Though my goal was a price accuracy bump, my strategy was to attack image recognition first, since everything else hinges on that. And it worked out pretty well! I added rules to…

  • Consider images found as the background-image of divs
  • Punish images with extreme aspect ratios
  • Punish images with "background" in nearby IDs

I also re-expressed (and sometimes rewrote) the rules to compute [0, 1] fuzzy-logic confidences, from which we might be able to derive an overall confidence that a given page is a product page, toward making the menu enable and disable more intelligently.

Please excuse the weird metadata on the commits. These are painstakingly extracted from the fathom-trainees fork where I developed them and retconned into this repo to preserve as much history as I could. One consequence of all this cherry-picking is that I changed some stuff in the first few commits to glue it into fathom-trainees the way I wanted it, and then changed it back in the last "Put the glue code back" commit. So maybe look at the last commit before picking nits with the first few. :-)

Note that reality has suddenly shifted under my feet. I'd love for someone to softlink trainees.js, coefficients.json, and ruleset_factory.js into this from a recent fork of fathom-trainees, as everything has started returning really low numbers on my machine in the past hour or so. I can't even repro my earlier training runs. I suspect (and hope) I broke something local to my machine.

Update: I had mistranscribed a number. See ba89ed4. That's why this is now 92% accurate instead of 90%.

Basically, copy ruleset_factory.js over from webext-commerce, copy the coeffs out of fathom_default_coefficients.json, and write a line or 2 of glue.
…ner menu.

...rather than having to edit the code between training or testing runs.
I'm hoping this sort of thing will help penalties like the cart one have a more consistent effect, since we'll no longer be blowing up bonuses without bound.
I'm not sure what scaling by the viewport size was getting us before. I could just be dense.

Put the coefficients vector back; I had used a shrunken one for faster training when working on image rules.
Title and image are unaffected, since they don't really have many rules to balance atm.
…onfidences. Also the $ rule.

Also break up both "price" rules into separate ones for parent and the actual element so the trainer can come up with optimal coeffs, rather than what I assume is a human guess of .75x difference.

Start those new parent coeffs out at as close to .75x as I could.
Add a trapezoid function, mostly to clamp it to 0..1 before we raise it to the coeff's power.

Re-spell largerImage() for consistency. ("Is" is what we've been using for fuzzy-truth values elsewhere, so we adopt it here.)
Express it in terms of fuzzy truth. Also greatly simplify. I'd like to test whether this does as well or better than the overlap-testing method.

In doing so, change getHighestScoringImage() to return its fnode, not its element.
Also don't require the price to come at the end, in service of the "US $5.00 plus free shipping" use case, which I had misread as "'US $5.00' on eBay". Let's see how that goes.
One of those showed up in 4.html.

Rename imageIsBig() because we're calling it on things other than images now, and it was always general anyway.
…ground images. This gets us to 100% on the training corpus of 75.
…the JSON file.

Order the coeffs alphabetically in the ruleset factory so getCoeffsInOrder() works once more.

Upgrade to Fathom 2.8 so FathomFox's good/bad-clicking features work when you symlink from a fathom-trainees fork to here. (Yarn looks in pricewise, not in fathom-trainees, for a copy of fathom.)
@erikrose changed the title from 90% price to 90% price accuracy on Nov 16, 2018
@Osmose (Contributor) commented on Nov 19, 2018

So I'm trying to understand how exponentiation works for coefficients:

Base: Zeroish (0.08)
Positive exponent: Reduces score (1.0 > score > 0)
Negative exponent: Increases score greatly (0.08 ^ -1 == 12.5)

Base: Oneish (0.9)
Positive exponent: Reduces score weakly
Negative exponent: Increases score weakly

So a rule with a positive exponent will always reduce the score, and rules with a negative exponent will always increase it, and rules are simply expected to return ONEISH for weak matches and ZEROISH for strong matches?

(Which is weird, because the code seems to work the other way around?)

If I'm getting this correctly, that means exponentiation does not allow for a rule that increases the score for matches, and decreases it for elements that don't match, right?
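
For concreteness, a quick numeric illustration of the behavior being asked about (just arithmetic, not code from this PR): each rule's fuzzy value is raised to its coefficient before the per-rule results are multiplied together.

const ZEROISH = 0.08;  // high confidence the signal is absent
const ONEISH = 0.9;    // high confidence the signal is present

// A positive coefficient keeps the value inside (0, 1), shrinking the product:
ZEROISH ** 2;   // 0.0064 -- absent signal, heavily weighted: big penalty
ONEISH ** 2;    // 0.81   -- present signal: barely dents the product

// A negative coefficient flips that and escapes the 0..1 range:
ZEROISH ** -1;  // 12.5   -- an absent signal now boosts the product
ONEISH ** -1;   // ~1.11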

@Osmose (Contributor) left a comment:

I am a fan of your work. I am a fan of resolving dozens of issues at once, if we're lucky with this PR. I'm gonna test this against some of the pages we've had reports for during my next review pass to see if it fixes any of them.

Thanks!

* make a falling trapezoid, where the result is ONEISH to the left and ZEROISH
* to the right, use a zeroAt greater than oneAt.
*/
function trapezoid(number, zeroAt, oneAt) {
Contributor:
trapezoid is an unclear name for what this is doing, or at least it requires some explanation; the first time I ran into this particular function I had to read it like 3 times to understand what it was doing. A signature like linearScale(number, {zeroAt: ZEROISH, oneAt: ONEISH}) would go a long way towards making this understandable.
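
For reference, a minimal sketch of what that linearScale spelling might look like, assuming the same ZEROISH/ONEISH output constants the current trapezoid uses (illustrative, not this PR's code):

function linearScale(number, {zeroAt, oneAt}) {
    const falling = zeroAt > oneAt;  // e.g. {zeroAt: 10, oneAt: 5} for aspect ratios
    if (falling ? number >= zeroAt : number <= zeroAt) {
        return ZEROISH;
    }
    if (falling ? number <= oneAt : number >= oneAt) {
        return ONEISH;
    }
    // Interpolate linearly between the two plateaus.
    const slope = (ONEISH - ZEROISH) / (oneAt - zeroAt);
    return ZEROISH + slope * (number - zeroAt);
}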

Contributor:
Alternatively, we could go with the more semantic meaning of this, which is something like:

return STRONG_MATCH; // rename ONEISH to STRONG_MATCH
return WEAK_MATCH; // rename ZEROISH to WEAK_MATCH
return scaledMatch(number, {weak: 15, strong: 7});

Contributor Author:
I like your linearScale spelling. Changing. I'm not sure what you have in mind with defaulting the zeroAt param to ZEROISH and oneAt to ONEISH. ZEROISH and ONEISH are the output values of linearScale, not typically the input ones.

Contributor:
I didn't mean to default them, rather was just putting them in as rando arguments.

return 0.75 * this.hasPriceInIDCoeff;
}
return DEFAULT_SCORE;
contains(haystack, needle, coeff) {
Contributor:
This is a tricky naming problem, huh. iincludes is a bad name, but I'm wary of shifting away from JavaScript's nomenclature too.

Maybe rename doesContain to icontains (or maybe just contains) and contains to containsScore? I mostly just want to make it clear that one is a standard contains and the other returns a score.
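
To make that concrete, a hypothetical sketch of the split (the names and the exact scoring line are illustrative, not this PR's code):

/** Plain case-insensitive substring test. */
icontains(haystack, needle) {
    return haystack.toLowerCase().includes(needle);
}

/** Fuzzy score derived from that test, weighted by the rule's coefficient. */
containsScore(haystack, needle, coeff) {
    return (this.icontains(haystack, needle) ? ONEISH : ZEROISH) ** coeff;
}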

Contributor Author:
That's a good call. I named those super-fast and never made a second pass to clean them up.

Collaborator:
What's the i for in iincludes and icontains?

Contributor Author:
case-insensitive. But I ended up spelling it out.

return haystack.toLowerCase().includes(needle);
}

/** Return a weighted confidence of whether a substring is within a given
Contributor:
We do a newline after the double stars for multiline doc comments.
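
That is, something like this layout (the comment wording is just an example):

/**
 * Return a weighted confidence of whether a substring is within a given
 * haystack.
 */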

}

hasPriceInParentID(fnode) {
return this.contains(fnode.element.parentElement.id, 'price', this.hasPriceInParentIDCoeff);
Contributor:
parentElement could be null, right? Does the code fail in that case? I guess the old code had this problem too.

Contributor Author:
It actually comes out as "", so it's fine. :-D

return this.contains(fnode.element.className, 'price', this.hasPriceInClassNameCoeff);
}

/** Scores fnode with 'price' in its class name */
Contributor:
In its parent's class name

rule(type('imageish'), score(this.isBig.bind(this))),
// punishment for extreme aspect ratios, to filter out banners or nav elements
rule(type('imageish'), score(fnode => trapezoid(this.aspectRatio(fnode.element), 10, 5)
** this.extremeAspectCoeff)),
Contributor:
Can we move this into a standalone function to be consistent? Same with the other inline functions.

Contributor Author:
Sure. I was experimenting to see if it was clearer inline, and I actually thought it was, but I'll revert for now. I think the New Math will be a good time to re-examine.
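
A sketch of the standalone spelling (the method name is hypothetical):

/** Punish extreme aspect ratios, to filter out banners and nav elements. */
hasModerateAspectRatio(fnode) {
    return trapezoid(this.aspectRatio(fnode.element), 10, 5) ** this.extremeAspectCoeff;
}

// ...and in the ruleset:
// rule(type('imageish'), score(this.hasModerateAspectRatio.bind(this))),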

}
return DEFAULT_SCORE;
isNearImage(fnode) {
const image = this.getHighestScoringImage(fnode);
Contributor:
Maybe we should be explicit about types?

const imageFnode = this.getHighestScoringImageFnode(fnode);

return this.hasPriceishPatternCoeff;
}
return DEFAULT_SCORE;
const regExp = /\$?\d+\.\d{2}(?![0-9])/;
Contributor:
Should the negative lookahead be optional, for price strings at the end of the element text?

Contributor Author:
No, because the negative lookahead succeeds when the string ends. (After all, there is indeed no digit there.)

Collaborator:
So this addition, (?![0-9]), is allowing any non-digits to follow the two digits after the decimal point, instead of requiring that the end of the string be those two digits after the decimal point?

// Trying it out in a REPL...
const regExpBefore = /\${0,1}\d+\.\d{2}$/;
const regExpAfter = /\$?\d+\.\d{2}(?![0-9])/;
const sample = 'US $4.99e';
regExpBefore.test(sample); // false
regExpAfter.test(sample); // true

@@ -264,6 +235,20 @@ export default class RulesetFactory {
return true;
}

hasBackgroundImage(fnode) {
const bgImage = getComputedStyle(fnode.element)['background-image'];
return !!bgImage && bgImage !== 'none';
Contributor:
The !! should be unnecessary.

Contributor Author:
Good point!
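
The !!-free spelling, assuming getComputedStyle always hands back a string here:

hasBackgroundImage(fnode) {
    const bgImage = getComputedStyle(fnode.element)['background-image'];
    return bgImage !== 'none';
}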

@@ -275,40 +260,54 @@ export default class RulesetFactory {
*/
// consider all visible img elements
rule(dom('img').when(this.isVisible.bind(this)), type('imageish')),
// and divs, which sometimes have CSS background-images
// TODO: Consider a bonus for <img> tags.
rule(dom('div').when(fnode => this.isVisible(fnode) && this.hasBackgroundImage(fnode)), type('imageish')),
Contributor:
Do we actually successfully pull the background image from these elements? I think the wrapper code around this factory assumes image matches are img tags and pulls their src.

Contributor Author:
Good question! I had tunnel vision and didn't look beyond upping the recognizer's accuracy. Let me see if I can figure it out.

@erikrose (Contributor Author):

First, let me say I am not in love with the current math and, as I said in IRC yesterday, I have some new, simpler, hopefully much more intuitive and optimization-friendly math coming up. Addition and thresholds-via-sigmoids; no exponentiation. That said, I will lay out how things stand right now. Of course, what's most important is that the math produces accurate recognition. :-)

So a rule with a positive exponent will always reduce the score, and rules with a negative exponent will always increase it,

Right. I was hoping to avoid negatives in practice, since they violate the 0..1 range post-weighting, but it didn't turn out that way.

and rules are simply expected to return ONEISH for weak matches and ZEROISH for strong matches?

The other way around. ONEISH indicates a high confidence that the rule's signal was detected, while ZEROISH indicates a high confidence that it wasn't.

If I'm getting this correctly, that means exponentiation does not allow for a rule that increases the score for matches, and decreases it for elements that don't match, right?

Correct. And this is actually going in the right direction, I believe. Originally, using Fathom's built-in multiplication-based score combination with no auxiliary math, you could scale either way, and I considered that a feature. However, if you think about it in terms of individual rules, rules are simple. They generally can't say anything about the overall confidence that an element is, say, the product's price. Their purview is one tiny dimension: whether a font size is big or a color is bright or that there are decimal digits present. If a color is bright, great. Bonus for that element. But if it isn't, there shouldn't necessarily be a penalty assessed. What do you do in that case? Can't return 0-ish; that'd be a penalty. Can't return 1; that's as high as you can go in Fuzzy Logic Land and should thus represent the highest bonus. And we don't want to go back outside 0..1, because then we're dealing with unbounded ranges and no longer have stable thresholds to pin our confidences on.

So the right answer, I believe, is to have dedicated rules that emit bonuses and other rules that emit penalties and tell the framework which is which. The New Math is going to be that way, and the exponentiation-based scheme we're temporarily using here models it decently as well. Think of the exponentiation-based fuzzy-logic scheme as a multiplication of doubts. If a rule returns 0.9, it doubts that this is the price a little; 0.1, a lot. (Notice that we're operating under the false idea that a rule has purview over the whole type's confidence, which I now believe is a stupid idea.) Those get weighted and then multiplied together, and the result is probably a very small number, but what really matters is the comparison of them, so it all shakes out in the end.

To get intuitively scaled confidences out of this, I had planned to take the nth root of the final result, n being the sum of all the coefficients. This would yield a weighted geometric mean, which is close enough to an arithmetic mean for the purpose of intuitive interpretation. (The whole current math is just a big weighted geometric mean.) However, I'm not sure that works with negative coefficients. At any rate, I want to pave over the whole mess with simpler, more intuitive New Math, which I look forward to running by both you and the optimizer.
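
A toy numeric sketch of that interpretation (made-up rules, not code from this PR):

const rules = [
    {fuzzy: 0.9, coeff: 2},   // signal present, weighted 2
    {fuzzy: 0.08, coeff: 1},  // signal absent, weighted 1
];
// "Multiplication of doubts": weight each fuzzy value by its coeff, then multiply.
const product = rules.reduce((acc, {fuzzy, coeff}) => acc * fuzzy ** coeff, 1);  // ≈ 0.065
// Weighted geometric mean: take the nth root, n being the sum of the coefficients.
const coeffSum = rules.reduce((acc, {coeff}) => acc + coeff, 0);                 // 3
const confidence = product ** (1 / coeffSum);                                    // ≈ 0.40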

Fortunately, the optimizer uses a general-purpose algorithm, so it doesn't really matter what we do, for the purposes of accuracy. The transition in this PR to 0..1 ranges will be required in the New Math anyway; they'll just have different coeffs computed to go along with them.

If that doesn't make it clearer, at least it should give you a good Wikipedia hole to go down. :-)

@erikrose (Contributor Author) commented on Nov 20, 2018

Aw, phooey. As I work on getting background-image extraction going, I see you guys are using scoring thresholds as well (SCORE_THRESHOLD). I'll have to derive a new one of those, since I (1) added new rules and (2) threw all the cards up in the air with the 0..1 ranging.

We don't handle non-url() specifications of images. Extraction will just fail nicely in the (hopefully uncommon) cases of image-sets and such.
@erikrose (Contributor Author):

Re-ran the testing run. Still 92%. Didn't screw anything up on that end. :-)

@Osmose (Contributor) left a comment

}
// The other thing the ruleset can return is an arbitrary element with
// a CSS background image.
return urlFromCssDeclaration(getComputedStyle(element)['background-image']);
Contributor:
A bit of a nit, but I've always found that function declarations at the top break the flow of the function and make it harder to read. The function name helps the readability, but a variable can do just as well without interrupting the order:

const backgroundImage = getComputedStyle(element)['background-image'];
return backgroundImage.substring(5, backgroundImage.length - 2);

Contributor Author:
I wanted this to be ScreechinglyObviousCode. With magic numbers like 5 and -2, it's not otherwise screechingly obvious that what we're trying to do is pull the param out of url("…").
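
For the record, a sketch of what that named helper presumably looks like, reconstructed from the magic numbers described above (so treat it as illustrative):

/** Pull the URL out of a CSS declaration like url("https://example.com/pic.jpg"). */
function urlFromCssDeclaration(declaration) {
    return declaration.substring(5, declaration.length - 2);  // strip the leading url(" and the trailing ")
}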

Contributor:
An extra variable name, then?

const backgroundImage = getComputedStyle(element)['background-image'];
const backgroundImageUrl = backgroundImage.substring(5, backgroundImage.length - 2); // "url('<image_url>')"
return backgroundImageUrl;

(Or not. This is very much yak shaving at this point.)

@biancadanforth (Collaborator) left a comment:

I know this has already merged, but I went through it anyway and can double confirm that this is an improvement from our baseline extraction based on the original 50-page training corpus, with additional confidence provided by the 25-page test set at https://github.com/mozilla/webext-commerce-corpus/. Thank you for these improvements!

In doing the training runs, I noticed you added several changes to Fathom that improved the process from my original notes with Fathom 2.3.0, so that was great to see as well.

I asked a couple clarifying questions here on the ruleset, and I will follow up with you separately on a couple points related to Fathom/FathomFox/Fathom Trainees, etc.

/** Return whether the computed font size of an element is big. */
fontIsBig(fnode) {
const size = parseInt(getComputedStyle(fnode.element).fontSize, 10);
return trapezoid(size, 14, 50) ** this.bigFontCoeff;
Collaborator:
Nit: It would be easier here if these threshold values (14, 50) were stored in const variables at the top for future tweaking. That way they could also be described in terms of what units they're in, and potentially used by other rules.
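
A sketch of that named-constant spelling (the constant names are hypothetical):

const MIN_BIG_FONT_PX = 14;  // below this, a font doesn't read as "big" at all
const MAX_BIG_FONT_PX = 50;  // at or above this, it's fully "big"

fontIsBig(fnode) {
    const size = parseInt(getComputedStyle(fnode.element).fontSize, 10);
    return trapezoid(size, MIN_BIG_FONT_PX, MAX_BIG_FONT_PX) ** this.bigFontCoeff;
}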

}

/** Scores fnode with 'price' in its class name */
hasPriceInParentClassName(fnode) {
Collaborator:
Good idea to break these each out into their own rules with their own coefficients for node versus parent node.

if (top <= viewportHeight) {
return ONEISH * featureCoeff;
}
const viewportHeight = 950;
Collaborator:
The viewport height and/or width shows up in a lot of different places (here, trainees.js, price_updates.js)... Maybe we could have a Fathom config in ./src/extraction/fathom with these values? Not sure what else would go in there though -- what do you think @Osmose ?

Contributor:
I don't think we need a separate config for them, but putting them in the main config seems fine to me.
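
For example (the name and location are hypothetical; only the 950 value comes from this diff):

// in the main config
export const VIEWPORT_HEIGHT_PX = 950;  // shared by the ruleset, trainees.js, and price_updates.js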


// Test nodeRect.top. They're probably not side by side with the title at
// the bottom. Rather, title will be below image.
const bottomDistance = Math.abs(imageRect.bottom - nodeRect.top);
Collaborator:
I can't think of a product page I have seen where the title is below the image -- did you find some from developing your test corpus?

