Skip to content
This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

Improve data extraction method using a static ruleset. Fixes #21. #24

Merged
merged 3 commits into from
Jul 23, 2018

Conversation

biancadanforth
Copy link
Collaborator

This builds off of #6, extending product recognition to also look for product information (title, image and price only so far) in a static dataset, 'product_extraction_data.json'. The fallback is to check Open Graph meta tags.

Note: 'product_extraction_data.json is a manually generated set of CSS selectors for a product page by vendor. As a proof-of-concept, I have only added four of the biggest e-commerce sites to populate this dataset.

@biancadanforth
Copy link
Collaborator Author

@Osmose Ready for your review.

I encountered Issue #25 while working on this patch.

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! This is a big improvement over the extraction we had previously. Nice work!

function getCssSelectors() {
const hostname = new URL(window.location.href).host;
for (const [vendor, selectors] of Object.entries(extractionData)) {
if (hostname.includes(vendor)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One vendor could have a hostname that's a subset of another one (e.g. zon.com matching amazon.com).

The alternative is parsing the full domain, though. Otherwise we wouldn't be resilient to amazon.com being different from amazon.co.uk. I can't think of a better way to do this that's worth doing right now. Just wanted to bring it up.

Copy link
Collaborator Author

@biancadanforth biancadanforth Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah good observation. I can update the top level keys in the data blob to include the full hostname; what do you think about that?

I.e. this:

{
  "amazon": {
    "thing": "stuff"
  }
}

becomes:

{
  "www.amazon.com": {
    "thing": "stuff"
  }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with that is that we stop working on, say, smile.amazon.com despite possibly being able to reuse the selectors there since they have the same UI. Or vendors that use subdomains for some other splitting of the same website.

const hostname = new URL(window.location.href).host;
for (const [vendor, selectors] of Object.entries(extractionData)) {
if (hostname.includes(vendor)) {
const {title, price, image} = selectors || null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Destructuring null should throw a TypeError. We'd want an empty object instead of null for this to work.

But right below we're just returning an object with these properties anyway. Why not just return selectors directly?

* Returns any product information available on the page from CSS
* selectors if they exist, otherwise from Open Graph <meta> tags.
*/
function extractData(selectors) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pass selectors as a parameter instead of calling getCssSelectors directly?

* Returns any extraction data found for the vendor based on the URL
* for the page.
*/
function getCssSelectors() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CSS is an acronym, so we generally should capitalize it everywhere it's used.

function extractData(selectors) {
const data = {};
if (selectors) {
for (const selector in selectors) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Object.entries would let us iterate over the key/value pairs in this without having to use hasOwnProperty.

if (Object.prototype.hasOwnProperty.call(selectors, selector)) {
const valueArr = selectors[selector];
let element = null;
for (let i = 0; i < valueArr.length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're just iterating over the values here, so we can use for (const value of valueArr) instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the things we're iterating over are selectors, so we should change the variable names in this function such that selector is available to use here if possible. The thing we're calling a selector above is actually just an attribute of the product that we're building, so maybe renaming it to productAttribute would be useful.

Copy link
Collaborator Author

@biancadanforth biancadanforth Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're just iterating over the values here, so we can use for (const value of valueArr) instead.

I can see that this uses fewer lines of code -- are there other reasons why for..of for iterating over an array is better than a normal for loop?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that this uses fewer lines of code -- are there other reasons why for..of for iterating over an array is better than a normal for loop?

It's semantically the correct thing and thus easier to read and understand. If you're not using the index for anything else besides indexing into the list, then it reduces cognitive load on the reading developer to remove the index entirely and just say "for every element in the list, do this".

// Avoid iterating over properties inherited through the prototype chain
if (Object.prototype.hasOwnProperty.call(selectors, selector)) {
const valueArr = selectors[selector];
let element = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be mutable; we never use it outside the for loop, so just declare it as a const inside of the loop.

if (element) {
data[selector] = element.getAttribute('content')
|| element.innerText
|| element.src;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JavaScript has automatic semicolon insertion, so multi-line statements like this are risky. This particular arrangement would work fine, but it's preferred to be explicit that this is all one statement by wrapping it in parenthesis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling all three of these attributes and picking the first one regardless of what the selectors point to strikes me as fragile. It's not obvious to me from the docs that innerText is always empty for an image element, for example, that we'd require the src attribute for. This list will also grow the more elements we're extracting.

One approach would be to add another attribute to the selectors that specifies how to extract the data. We could define the extraction data in a JavaScript file instead of JSON and do things like:

{
  alibaba: {
    title: ['.product-title', INNER_TEXT],
    // or, alternatively
    title: innerTextFrom('.product-title'),
  },
}

Another would be to set some requirements per-product-attribute. If the image selectors are always expected to match img elements, then when selector is image you know you want to pull the src property. In a case like that, there's less value in us looping over the properties from the CSS selector object vs just having something like:

const product = {
  // This example doesn't handle the case where `firstElement` returns null
  title: firstElement(selectors.title).innerText,
  image: firstElement(selectors.image).src,
};

I'm not entirely sure which is better, and too much work making this sophisticated may not be called for until we know what we're doing for the MVP and product extraction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a simplified method that we would have to vary by product attribute and even per-vendor, as for example, some vendors use the content attribute for an element to store the price, whereas it is the innerText of an element for other vendors.

For that reason I think your first category of solutions makes the most sense.

@biancadanforth
Copy link
Collaborator Author

@Osmose Ready for your re-review! I am not a super fan of object nesting (re: the change to 'product_extraction_data.json'), but that seemed to be the clearest way to do this...

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+wc, nice work! The suggested changes are minor, you can merge after making them.

This will make testing a little more convenient. 👍

"www.aliexpress.com": {
"title": {
"selectors": [".product-name"],
"extractUsing": "innerText"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it impossible to fallback to an element that doesn't use innerText. Perhaps use tuples in place of the strings you were before?

{
  "www.aliexpress.com": {
    "title": [[".product-name", "innerText"]]
  }
}

This is minor so I'm not gonna block on it. We can refine this extraction process later.

Copy link
Collaborator Author

@biancadanforth biancadanforth Jul 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Osmose : this is an interesting idea (TIL what a tuple is). I wonder though, while the extractUsing value will always be the last index, the length of the tuple data structure will vary by vendor and/or product attribute -- does this matter?

Is a tuple processed the same way as an array when the JSON is parsed into a JavaScript object?

Copy link
Contributor

@Osmose Osmose Jul 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a tuple is an abstract idea of an immutable sequence of items. Python has language support for them:

this_is_a_tuple = (1, 2, 'a')
assert this_is_a_tuple[1] == 2 # true

# This line will fail, because tuples are not mutable (they cannot be modified)
# Note that if the value at index 2 was an object that was mutable, we could
# change it because mutating the object is not mutating the tuple containing a
# reference to it.
this_is_a_tuple[2] = 'c'

They're nice for small bits of data with few fields.

JavaScript/JSON do not support tuples. You can mimic them with lists, which is what I did in the example up there, but they're mutable and thus not true tuples. The [[ in the example above is not a special syntax for tuples, it's just a list with one element in it, which is also a list.

Which is the answer to your question. The idea is to just use some faux-tuples for each selector. For example, if title had two possible selectors:

{
  "www.aliexpress.com": {
    "title": [
      [".product-name", "innerText"],
      [".other-product", "content"]
    ]
  }
}

I hope that makes it a little clearer. Honestly I'd recommend we ignore it for now and we can figure it out when we run into a case where we need it.

Copy link
Collaborator Author

@biancadanforth biancadanforth Jul 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation! That makes a lot of sense.

Per our slack conversation, this will be noted, but left as-is for now, since this is more or less placeholder data, and we will likely have different (and differently structured) data to work with ultimately.

if (element) {
const extractUsing = getProductAttributeInfo('extractUsing');
const extractionValue = extractUsing[productAttribute];
switch (extractionValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch smells like a helper that can be pulled into it's own function to make this one a bit more readable.

for (const selector of selectorArr) {
const element = document.querySelector(selector);
if (element) {
const extractUsing = getProductAttributeInfo('extractUsing');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just pull the whole object for the matching site from the start? Then we can index into it instead of having to search the array and match the URL twice.

This builds off of #6, extending product recognition to also look for product information (title, image and price only so far) in a static dataset, 'product_extraction_data.json'. The fallback is to check Open Graph meta tags.

Note: 'product_extraction_data.json is a manually generated set of CSS selectors for a product page by vendor. As a proof-of-concept, I have only added four of the biggest e-commerce sites to populate this dataset.
Rebased on top of latest master.

Changed the structure of 'product_extraction_data.json' to add a third level key, 'extractUsing', so that we can know for a given product attribute (e.g. title) how to pull that information from an element for a particular vendor.

For example: For a Walmart product page, to obtain the product title, we want to grab an element using one of these selectors and then extract the title using the innerText value of that element.
Added error handling for the case that:
* Undefined or an empty string is returned from 'extractValueFromElement'.
* No element is found for a given product attribute for any of the provided selectors.
@biancadanforth biancadanforth merged commit 00f8c5b into master Jul 23, 2018
@biancadanforth biancadanforth deleted the 21-improve-extraction branch July 23, 2018 23:25
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants