WebsiteAgent: Introduce per-node XPath evaluation in extraction. #412

Merged
merged 6 commits into from Jul 30, 2014

Conversation

Projects
None yet
4 participants
Collaborator

knu commented Jul 23, 2014

The new extraction sub-hash key value holds an XPath expression which
should be applied to each node to get the result value.

The previous directives "text": true and "attr": "src" are now
written as "value": "text()" and "value": "@src", respectively.

With this enhancement, it is now possible for this agent to perform some
basic text processing on its own by making use of XPath functions like
normalize-space, substring-after and translate.

Collaborator

knu commented Jul 23, 2014

@cantino This branch is based on my provide_access_to_agent branch just because there was a merge involved. This is just for my personal convenience, sorry. Please deal with this PR after you have dealt with #410.

Coverage Status

Coverage increased (+0.06%) when pulling f3535e2 on knu:website_agent-per_node_xpath into 67fcfef on cantino:master.

app/models/agents/event_formatting_agent.rb
@@ -104,11 +106,11 @@ def working?
def receive(incoming_events)
incoming_events.each do |event|
- payload = perform_matching(event.payload)
+ agent = Agent.find(event.agent_id)
@cantino

cantino Jul 23, 2014

Owner

Can't this just be event.agent?

@knu

knu Jul 24, 2014

Collaborator

Oh, yup. It was in the original code. 😃

@cantino

cantino Jul 24, 2014

Owner

Interesting... I wonder why.

@knu

knu Jul 24, 2014

Collaborator

This line will be gone when #410 is merged.

app/models/agents/website_agent.rb
- error '"attr" or "text" is required on HTML or XML extraction patterns'
- return
+ value, = node.xpath(extraction_details['value'])
+ if value.is_a?(Float) && value.to_i == value
@cantino

cantino Jul 23, 2014

Owner

Interesting; why are you casting 1.0 to 1?

@knu

knu Jul 24, 2014

Collaborator

#xpath evaluates any numeric return value as float, so you need to convert it to integer if you want count(//li.item) to be expanded to for example "5" instead of "5.0". I'll add a comment.

app/models/agents/website_agent.rb
- else
- error '"attr" or "text" is required on HTML or XML extraction patterns'
- return
+ value, = node.xpath(extraction_details['value'])
@cantino

cantino Jul 23, 2014

Owner

Are you using the , here to throw away extra values? If so, I'd suggest either of these for clarity:

node.xpath(extraction_details['value']).first

or

value, status = node.xpath(extraction_details['value'])

(assuming the other value is status)

@knu

knu Jul 24, 2014

Collaborator

#xpath may return a single value (string, float, boolean, ...) or a node set. This is a shortcut for getting a single value itself or the first node if a node set is returned.

@knu

knu Jul 24, 2014

Collaborator

On second thought, if a node set is returned the result should include all nodes so text() expands to all texts concatenated. I'll fix it.

@knu

knu Jul 24, 2014

Collaborator

Done in 6d97513.

Coverage Status

Coverage increased (+0.09%) when pulling 6d97513 on knu:website_agent-per_node_xpath into 67fcfef on cantino:master.

Owner

cantino commented Jul 26, 2014

To get the migration to run, I had to use send on extraction_type:

next if agent.send(:extraction_type) == 'json'

This migration will fail in the future if WebsiteAgent changes. Can we make it more robust, perhaps like:

  class Agent < ActiveRecord::Base
    include JSONSerializedField
    include LiquidInterpolatable
    json_serialize :options

    def extraction_type
      (interpolated['type'] || begin
        if interpolated['url'] =~ /\.(rss|xml)$/i
          "xml"
        elsif interpolated['url'] =~ /\.json$/i
          "json"
        else
          "html"
        end
      end).to_s
    end
  end

  def up
    Agent.where(type: 'Agents::WebsiteAgent').each do |agent|
      next if agent.extraction_type == 'json'

      agent.options_will_change!
      agent.options['extract'].each { |name, extraction|
        case
        when extraction.delete('text')
          extraction['value'] = './/text()'
        when attr = extraction.delete('attr')
          extraction['value'] = "@#{attr}"
        end
      }
      agent.save!
    end
  end

It's not perfect, but should be a bit more robust against changes to the codebase. What do you think?

app/models/agents/website_agent.rb
}
+ "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and "//text()" is to extract all the enclosed texts. You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-text(.//text())` should actually be `normalize-text(.)`.
@cantino

cantino Jul 26, 2014

Owner

//text() or .//text()?

@knu

knu Jul 27, 2014

Collaborator

It's ".//text()", thanks for pointing it out!

Owner

cantino commented Jul 26, 2014

I like the direction that you're taking this-- XPath is very powerful and now people can do transformations inside of the WebsiteAgent.

I'd like to make sure this transition is approachable for everyone. What do other Huginn users think? /cc @alias1 @dsander @CloCkWeRX @albertsun @snicker etc.

Collaborator

knu commented Jul 27, 2014

@cantino Indeed, the call for extraction_type was wrong, and your code will work. I just thought pulling in the current implementation of extraction_type with interpolation might be a little bit overkill, so I made it look into options['extract'] instead. Hope this should be enough.

Coverage Status

Coverage increased (+0.56%) when pulling 79fca69 on knu:website_agent-per_node_xpath into 67fcfef on cantino:master.

Owner

cantino commented Jul 27, 2014

Sounds good. Testing on my setup for a day or two.

Owner

cantino commented Jul 29, 2014

If you re-merge master, this diff should get simpler.

knu added some commits Jul 23, 2014

WebsiteAgent: Introduce per-node XPath evaluation in extraction.
The new extraction sub-hash key `value` holds an XPath expression which
should be applied to each node to get the result value.

The previous directives `"text": true` and `"attr": "src"` are now
written as `"value": "text()"` and `"value": "@src"`, respectively.

With this enhancement, it is now possible for this agent to perform some
basic text processing on its own by making use of XPath functions like
`normalize-space`, `substring-after` and `translate`.
Look into options['extract'] for checking the extraction type.
WebsiteAgent#extraction_type was private and uncallable, and may not
remain existent in future.
Collaborator

knu commented Jul 29, 2014

Thanks! I've rebased this branch on master.

cantino added a commit that referenced this pull request Jul 30, 2014

Merge pull request #412 from knu/website_agent-per_node_xpath
WebsiteAgent: Introduce per-node XPath evaluation in extraction.

@cantino cantino merged commit 47eee57 into huginn:master Jul 30, 2014

Owner

cantino commented Jul 30, 2014

@knu, I've been running this for a few days and it seems to work well. The migration worked on my setup. Merging!

Owner

cantino commented Jul 30, 2014

Nice addition!

@knu knu deleted the knu:website_agent-per_node_xpath branch Aug 13, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment