peteroupc/HtmlParser

HTML5 parser for Java.

Takes an input stream or a file and returns an HTML document tree. The API currently covers only a subset of the DOM. Example:

IDocument doc=HtmlDocument.parseFile(filename);
for(IElement element : doc.getElementsByTagName("img")){
    System.out.println(element.getAttribute("src"));
}

A more complex example collects every Open Graph image specified on a Web page and, if there are none, falls back to its "image_src" links.

  // Needs java.io.IOException, java.util.ArrayList, and java.util.List.
  public static List<String> getWebpageImages(String url) throws IOException {
    List<String> images = new ArrayList<String>();
    IDocument doc = HtmlDocument.parseURL(url);
    // First pass: Open Graph images declared in <meta> elements.
    for (IElement element : doc.getElementsByTagName("meta")) {
      String property = element.getAttribute("property");
      if ("og:image".equals(property) ||
          "og:image:secure_url".equals(property)) {
        String content = HtmlDocument.getHref(element, element.getAttribute("content"));
        images.add(content);
      }
    }
    // Open Graph images take priority; stop here if any were found.
    if (!images.isEmpty()) return images;
    // Second pass: fall back to <link rel="image_src"> elements.
    for (IElement element : doc.getElementsByTagName("link")) {
      if ("image_src".equals(element.getAttribute("rel"))) {
        String content = HtmlDocument.getHref(element);
        images.add(content);
      }
    }
    return images;
  }
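
The example above passes attribute values through HtmlDocument.getHref before storing them. This README does not say what that method does internally. If what you need is an absolute URL for a possibly relative "content" or "href" value, the following is a minimal sketch of that step using only java.net.URI. The class HrefSketch and the method resolveHref are hypothetical names made up for this illustration; they are not part of this library and may not match its behavior.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of HtmlParser.
final class HrefSketch {
  private HrefSketch() {}

  // Resolves href against baseUrl following java.net.URI's resolution rules.
  // Returns null if either argument is null or is not a syntactically valid URI.
  static String resolveHref(String baseUrl, String href) {
    if (baseUrl == null || href == null) {
      return null;
    }
    try {
      return URI.create(baseUrl).resolve(href.trim()).toString();
    } catch (IllegalArgumentException ex) {
      // URI.create and URI.resolve(String) throw this for malformed input.
      return null;
    }
  }

  public static void main(String[] args) {
    String base = "http://example.com/a/b.html";
    System.out.println(resolveHref(base, "img/x.png"));
    System.out.println(resolveHref(base, "/img/x.png"));
    System.out.println(resolveHref(base, "https://cdn.example.org/x.png"));
  }
}
```

A value that is already absolute comes back unchanged, so the helper can be applied to every attribute value without first checking whether it is relative.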

Copyright (C) 2013 Peter Occil. Licensed under the Expat License.

Sample code on this README file is dedicated to the public domain under CC0: http://creativecommons.org/publicdomain/zero/1.0/
