
Web Scraping

Philip Ford edited this page Apr 23, 2017 · 5 revisions

Groovy makes screen scraping easy. URL fetching in Groovy uses standard Java classes such as java.net.URL, made more convenient by Groovy extension methods such as withReader.

URL Fetching

Example: Reading content from a web page

// Contents of http://www.mrhaki.com/url.html:
// Simple test document
// for testing URL extensions
// in Groovy.
 
// Convert the URL string to a java.net.URL. No request is made here;
// the page is fetched when its content is read (url.text, eachLine, withReader).
def url = "http://www.mrhaki.com/url.html".toURL()
 
assert '''\
Simple test document
for testing URL extensions
in Groovy.
''' == url.text
 
def result = []
// Looping through each line of the web page.
url.eachLine {
    if (it =~ /Groovy/) {
        result << it
    }
}
assert ['in Groovy.'] == result

// Reading the first line of the web page through a reader,
// which withReader closes automatically when the closure returns.
url.withReader { reader ->
    assert 'Simple test document' == reader.readLine()
}
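
The same extension methods can be tried without network access. The following is a minimal sketch, assuming a temporary local file stands in for the web page (the file name prefix and its contents are made up for illustration); a file: URL exposes the same getText, eachLine and withReader methods as an http: URL.

```groovy
// Hypothetical local stand-in for a web page, so the sketch runs offline.
def page = File.createTempFile('url-demo', '.html')
page.deleteOnExit()
page.setText('Simple test document\nfor testing URL extensions\nin Groovy.\n', 'UTF-8')

// A file: URL supports the same Groovy extension methods as an http: URL.
def localUrl = page.toURI().toURL()

// getText(charset) reads the whole resource in one call.
def fullText = localUrl.getText('UTF-8')

// eachLine passes every line to the closure; keep the ones that mention Groovy.
def matches = []
localUrl.eachLine('UTF-8') { line ->
    if (line.contains('Groovy')) {
        matches << line
    }
}

// withReader closes the reader when the closure returns and yields its result.
def firstLine = localUrl.withReader('UTF-8') { reader -> reader.readLine() }

assert fullText.startsWith('Simple test document')
assert matches == ['in Groovy.']
assert firstLine == 'Simple test document'
```

Passing the charset explicitly, as here, avoids depending on the platform default encoding when the page is not plain ASCII.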

Example: Fetching a page and parsing it with TagSoup

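// Grape downloads TagSoup on first run; annotating the import is the usual idiom.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2')
import org.ccil.cowan.tagsoup.Parser
// On Groovy 4 and later, also add: import groovy.xml.XmlSlurper

String ENCODING = "UTF-8"

// TagSoup turns real-world, often malformed HTML into SAX events XmlSlurper can consume.
def PARSER = new XmlSlurper(new Parser())

def url = "http://www.bing.com/search?q=web+scraping"

new URL(url).withReader(ENCODING) { reader ->
    def document = PARSER.parse(reader)
    // Extracting information
}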

HTML Parsing

HTML parsing can be done with any of the many Java HTML-parsing libraries, such as TagSoup or CyberNeko. The example above uses TagSoup, and Grape's @Grab annotation lets us declare the dependency on the library in a single line.

On top of that, Groovy's XmlSlurper and GPath give convenient access to specific parts of the parsed HTML. For the search results page in the example, a single line of code is enough to extract the titles of the results.

Below are two ways to achieve that. Both first use GPath's '**' (depth-first traversal) to search all descendants of the document for the element whose id is results.

First method

// jQuery selector: $('#results h3 a')
document.'**'.find{ it['@id'] == 'results'}.ul.li.div.div.h3.a.each { println it.text() }

In the first example we specify the full element path from the results element down to the links that hold the titles. This is less convenient than simply saying "I want all h3 descendants", as a jQuery selector does, and it breaks as soon as the page's nesting changes.

Second method

// jQuery selector: $('#results h3 a')
document.'**'.find{ it['@id'] == 'results'}.'**'.findAll{ it.name() == 'h3'}.a.each { println it.text() }

The second example uses '**' a second time to collect all h3 elements beneath the results element, so the intermediate path no longer has to be spelled out. Even so, it remains noticeably more verbose than the equivalent jQuery selector.
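
To compare the two methods side by side without fetching a live page, here is a minimal sketch that runs them against a small inline document. The markup is a made-up, well-formed stand-in for a search results page, so plain XmlSlurper can parse it without TagSoup. The import assumes Groovy 3 or later; on Groovy 2.x, remove it, since groovy.util.XmlSlurper is imported by default there.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; remove this line on Groovy 2.x

// Hypothetical, well-formed stand-in for a search results page.
def html = '''\
<html>
  <body>
    <div id="results">
      <ul>
        <li><div><div><h3><a href="/one">First title</a></h3></div></div></li>
        <li><div><div><h3><a href="/two">Second title</a></h3></div></div></li>
      </ul>
    </div>
    <div id="sidebar"><h3><a href="/ad">Unrelated link</a></h3></div>
  </body>
</html>
'''

def document = new XmlSlurper().parseText(html)

// Locate the element with id="results" anywhere in the tree.
def results = document.'**'.find { it['@id'] == 'results' }

// First method: spell out the full path from the results element to the links.
def byPath = results.ul.li.div.div.h3.a.collect { it.text() }

// Second method: search all descendants for h3 elements, then take their links.
def byDepthSearch = results.'**'.findAll { it.name() == 'h3' }.a*.text()

assert byPath == ['First title', 'Second title']
assert byDepthSearch == ['First title', 'Second title']
```

Both methods return the same titles and skip the h3 in the sidebar, because the search starts from the results element rather than from the document root.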
