Improve data extraction method using a static ruleset. Fixes #21. #24

biancadanforth · 2018-07-14T17:43:08Z

This builds off of #6, extending product recognition to also look for product information (title, image and price only so far) in a static dataset, 'product_extraction_data.json'. The fallback is to check Open Graph meta tags.

Note: 'product_extraction_data.json is a manually generated set of CSS selectors for a product page by vendor. As a proof-of-concept, I have only added four of the biggest e-commerce sites to populate this dataset.

biancadanforth · 2018-07-14T18:01:26Z

@Osmose Ready for your review.

I encountered Issue #25 while working on this patch.

Osmose

Thanks for this! This is a big improvement over the extraction we had previously. Nice work!

Osmose · 2018-07-16T21:45:38Z

src/product_info.js

+function getCssSelectors() {
+  const hostname = new URL(window.location.href).host;
+  for (const [vendor, selectors] of Object.entries(extractionData)) {
+    if (hostname.includes(vendor)) {


One vendor could have a hostname that's a subset of another one (e.g. zon.com matching amazon.com).

The alternative is parsing the full domain, though. Otherwise we wouldn't be resilient to amazon.com being different from amazon.co.uk. I can't think of a better way to do this that's worth doing right now. Just wanted to bring it up.

Hmm yeah good observation. I can update the top level keys in the data blob to include the full hostname; what do you think about that?

I.e. this:

{ "amazon": { "thing": "stuff" } }

becomes:

{ "www.amazon.com": { "thing": "stuff" } }

The problem with that is that we stop working on, say, smile.amazon.com despite possibly being able to reuse the selectors there since they have the same UI. Or vendors that use subdomains for some other splitting of the same website.

Osmose · 2018-07-16T21:47:25Z

src/product_info.js

+  const hostname = new URL(window.location.href).host;
+  for (const [vendor, selectors] of Object.entries(extractionData)) {
+    if (hostname.includes(vendor)) {
+      const {title, price, image} = selectors || null;


Destructuring null should throw a TypeError. We'd want an empty object instead of null for this to work.

But right below we're just returning an object with these properties anyway. Why not just return selectors directly?

Osmose · 2018-07-16T21:48:11Z

src/product_info.js

+ * Returns any product information available on the page from CSS
+ * selectors if they exist, otherwise from Open Graph <meta> tags.
+ */
+function extractData(selectors) {


Why pass selectors as a parameter instead of calling getCssSelectors directly?

Osmose · 2018-07-16T21:48:49Z

src/product_info.js

+ * Returns any extraction data found for the vendor based on the URL
+ * for the page.
+ */
+function getCssSelectors() {


CSS is an acronym, so we generally should capitalize it everywhere it's used.

Osmose · 2018-07-16T21:50:32Z

src/product_info.js

+function extractData(selectors) {
+  const data = {};
+  if (selectors) {
+    for (const selector in selectors) {


Object.entries would let us iterate over the key/value pairs in this without having to use hasOwnProperty.

Osmose · 2018-07-16T21:51:36Z

src/product_info.js

+      if (Object.prototype.hasOwnProperty.call(selectors, selector)) {
+        const valueArr = selectors[selector];
+        let element = null;
+        for (let i = 0; i < valueArr.length; i++) {


We're just iterating over the values here, so we can use for (const value of valueArr) instead.

Also, the things we're iterating over are selectors, so we should change the variable names in this function such that selector is available to use here if possible. The thing we're calling a selector above is actually just an attribute of the product that we're building, so maybe renaming it to productAttribute would be useful.

We're just iterating over the values here, so we can use for (const value of valueArr) instead.

I can see that this uses fewer lines of code -- are there other reasons why for..of for iterating over an array is better than a normal for loop?

I can see that this uses fewer lines of code -- are there other reasons why for..of for iterating over an array is better than a normal for loop?

It's semantically the correct thing and thus easier to read and understand. If you're not using the index for anything else besides indexing into the list, then it reduces cognitive load on the reading developer to remove the index entirely and just say "for every element in the list, do this".

Osmose · 2018-07-16T21:54:09Z

src/product_info.js

+      // Avoid iterating over properties inherited through the prototype chain
+      if (Object.prototype.hasOwnProperty.call(selectors, selector)) {
+        const valueArr = selectors[selector];
+        let element = null;


This doesn't need to be mutable; we never use it outside the for loop, so just declare it as a const inside of the loop.

Osmose · 2018-07-16T21:55:46Z

src/product_info.js

+          if (element) {
+            data[selector] = element.getAttribute('content')
+              || element.innerText
+              || element.src;


JavaScript has automatic semicolon insertion, so multi-line statements like this are risky. This particular arrangement would work fine, but it's preferred to be explicit that this is all one statement by wrapping it in parenthesis.

Pulling all three of these attributes and picking the first one regardless of what the selectors point to strikes me as fragile. It's not obvious to me from the docs that innerText is always empty for an image element, for example, that we'd require the src attribute for. This list will also grow the more elements we're extracting.

One approach would be to add another attribute to the selectors that specifies how to extract the data. We could define the extraction data in a JavaScript file instead of JSON and do things like:

{ alibaba: { title: ['.product-title', INNER_TEXT], // or, alternatively title: innerTextFrom('.product-title'), }, }

Another would be to set some requirements per-product-attribute. If the image selectors are always expected to match img elements, then when selector is image you know you want to pull the src property. In a case like that, there's less value in us looping over the properties from the CSS selector object vs just having something like:

const product = { // This example doesn't handle the case where `firstElement` returns null title: firstElement(selectors.title).innerText, image: firstElement(selectors.image).src, };

I'm not entirely sure which is better, and too much work making this sophisticated may not be called for until we know what we're doing for the MVP and product extraction.

Yeah this is a simplified method that we would have to vary by product attribute and even per-vendor, as for example, some vendors use the content attribute for an element to store the price, whereas it is the innerText of an element for other vendors.

For that reason I think your first category of solutions makes the most sense.

biancadanforth · 2018-07-19T20:09:08Z

@Osmose Ready for your re-review! I am not a super fan of object nesting (re: the change to 'product_extraction_data.json'), but that seemed to be the clearest way to do this...

Osmose

r+wc, nice work! The suggested changes are minor, you can merge after making them.

This will make testing a little more convenient. 👍

Osmose · 2018-07-20T17:46:36Z

src/product_extraction_data.json

+  "www.aliexpress.com": {
+    "title": {
+      "selectors": [".product-name"],
+      "extractUsing": "innerText"


This makes it impossible to fallback to an element that doesn't use innerText. Perhaps use tuples in place of the strings you were before?

{ "www.aliexpress.com": { "title": [[".product-name", "innerText"]] } }

This is minor so I'm not gonna block on it. We can refine this extraction process later.

@Osmose : this is an interesting idea (TIL what a tuple is). I wonder though, while the extractUsing value will always be the last index, the length of the tuple data structure will vary by vendor and/or product attribute -- does this matter?

Is a tuple processed the same way as an array when the JSON is parsed into a JavaScript object?

So a tuple is an abstract idea of an immutable sequence of items. Python has language support for them:

this_is_a_tuple = (1, 2, 'a') assert this_is_a_tuple[1] == 2 # true # This line will fail, because tuples are not mutable (they cannot be modified) # Note that if the value at index 2 was an object that was mutable, we could # change it because mutating the object is not mutating the tuple containing a # reference to it. this_is_a_tuple[2] = 'c'

They're nice for small bits of data with few fields.

JavaScript/JSON do not support tuples. You can mimic them with lists, which is what I did in the example up there, but they're mutable and thus not true tuples. The [[ in the example above is not a special syntax for tuples, it's just a list with one element in it, which is also a list.

Which is the answer to your question. The idea is to just use some faux-tuples for each selector. For example, if title had two possible selectors:

{ "www.aliexpress.com": { "title": [ [".product-name", "innerText"], [".other-product", "content"] ] } }

I hope that makes it a little clearer. Honestly I'd recommend we ignore it for now and we can figure it out when we run into a case where we need it.

Thank you for the explanation! That makes a lot of sense.

Per our slack conversation, this will be noted, but left as-is for now, since this is more or less placeholder data, and we will likely have different (and differently structured) data to work with ultimately.

Osmose · 2018-07-20T17:48:33Z

src/product_info.js

+        if (element) {
+          const extractUsing = getProductAttributeInfo('extractUsing');
+          const extractionValue = extractUsing[productAttribute];
+          switch (extractionValue) {


This switch smells like a helper that can be pulled into it's own function to make this one a bit more readable.

Osmose · 2018-07-20T17:49:55Z

src/product_info.js

+      for (const selector of selectorArr) {
+        const element = document.querySelector(selector);
+        if (element) {
+          const extractUsing = getProductAttributeInfo('extractUsing');


Why not just pull the whole object for the matching site from the start? Then we can index into it instead of having to search the array and match the URL twice.

This builds off of #6, extending product recognition to also look for product information (title, image and price only so far) in a static dataset, 'product_extraction_data.json'. The fallback is to check Open Graph meta tags. Note: 'product_extraction_data.json is a manually generated set of CSS selectors for a product page by vendor. As a proof-of-concept, I have only added four of the biggest e-commerce sites to populate this dataset.

Rebased on top of latest master. Changed the structure of 'product_extraction_data.json' to add a third level key, 'extractUsing', so that we can know for a given product attribute (e.g. title) how to pull that information from an element for a particular vendor. For example: For a Walmart product page, to obtain the product title, we want to grab an element using one of these selectors and then extract the title using the innerText value of that element.

Added error handling for the case that: * Undefined or an empty string is returned from 'extractValueFromElement'. * No element is found for a given product attribute for any of the provided selectors.

biancadanforth mentioned this pull request Jul 14, 2018

Sidebar does not refresh on location change when isProductPage goes from true to false #25

Closed

Osmose mentioned this pull request Jul 16, 2018

Fix #14: Show page action when we detect a product. #22

Merged

Osmose self-requested a review July 16, 2018 21:41

Osmose suggested changes Jul 16, 2018

View reviewed changes

biancadanforth mentioned this pull request Jul 17, 2018

Attempt product page recognition earlier than after page load #29

Closed

biancadanforth force-pushed the 21-improve-extraction branch 2 times, most recently from bc0c43b to 9ec7bb3 Compare July 19, 2018 20:06

Osmose approved these changes Jul 20, 2018

View reviewed changes

biancadanforth added 3 commits July 23, 2018 11:07

Updated per osmose's feedback; improved error handling.

1f1f694

Added error handling for the case that: * Undefined or an empty string is returned from 'extractValueFromElement'. * No element is found for a given product attribute for any of the provided selectors.

biancadanforth force-pushed the 21-improve-extraction branch from 9ec7bb3 to 1f1f694 Compare July 23, 2018 23:21

biancadanforth merged commit 00f8c5b into master Jul 23, 2018

biancadanforth deleted the 21-improve-extraction branch July 23, 2018 23:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve data extraction method using a static ruleset. Fixes #21. #24

Improve data extraction method using a static ruleset. Fixes #21. #24

biancadanforth commented Jul 14, 2018

biancadanforth commented Jul 14, 2018

Osmose left a comment

Osmose Jul 16, 2018

biancadanforth Jul 18, 2018 •

edited

Loading

Osmose Jul 19, 2018

Osmose Jul 16, 2018

Osmose Jul 16, 2018

Osmose Jul 16, 2018

Osmose Jul 16, 2018

Osmose Jul 16, 2018

Osmose Jul 16, 2018

biancadanforth Jul 19, 2018 •

edited

Loading

Osmose Jul 19, 2018

Osmose Jul 16, 2018

Osmose Jul 16, 2018

Osmose Jul 16, 2018

biancadanforth Jul 19, 2018

biancadanforth commented Jul 19, 2018

Osmose left a comment

Osmose Jul 20, 2018

biancadanforth Jul 21, 2018 •

edited

Loading

Osmose Jul 21, 2018 •

edited

Loading

biancadanforth Jul 23, 2018 •

edited

Loading

Osmose Jul 20, 2018

Osmose Jul 20, 2018

Improve data extraction method using a static ruleset. Fixes #21. #24

Improve data extraction method using a static ruleset. Fixes #21. #24

Conversation

biancadanforth commented Jul 14, 2018

biancadanforth commented Jul 14, 2018

Osmose left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth Jul 18, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth Jul 19, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth commented Jul 19, 2018

Osmose left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth Jul 21, 2018 • edited Loading

Choose a reason for hiding this comment

Osmose Jul 21, 2018 • edited Loading

Choose a reason for hiding this comment

biancadanforth Jul 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth Jul 18, 2018 •

edited

Loading

biancadanforth Jul 19, 2018 •

edited

Loading

biancadanforth Jul 21, 2018 •

edited

Loading

Osmose Jul 21, 2018 •

edited

Loading

biancadanforth Jul 23, 2018 •

edited

Loading