GitHub - peteroupc/HtmlParser: HTML5 Parser for Java

HTML5 parser for Java.

Takes an input stream or a file and returns an HTML document tree. The API is currently only a subset of the DOM. Example:

// Parse an HTML file and print the "src" attribute of every <img> element.
IDocument doc = HtmlDocument.parseFile(filename);
for (IElement element : doc.getElementsByTagName("img")) {
    System.out.println(element.getAttribute("src"));
}

And here is a more complex example that gets all Open Graph and "image_src" images specified on a Web page.

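  // Standard-library imports used below; the HtmlParser classes
  // (IDocument, IElement, HtmlDocument) must be imported as well.
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  public static List<String> getWebpageImages(String url) throws IOException {
    List<String> images = new ArrayList<String>();
    IDocument doc = HtmlDocument.parseURL(url);
    // First pass: Open Graph images declared in <meta property="..."> elements.
    for (IElement element : doc.getElementsByTagName("meta")) {
      String property = element.getAttribute("property");
      if ("og:image".equals(property) ||
          "og:image:secure_url".equals(property)) {
        String content = HtmlDocument.getHref(element, element.getAttribute("content"));
        images.add(content);
      }
    }
    // Open Graph images take precedence; stop here if any were found.
    if (!images.isEmpty()) {
      return images;
    }
    // Fallback: images declared with <link rel="image_src">.
    for (IElement element : doc.getElementsByTagName("link")) {
      if ("image_src".equals(element.getAttribute("rel"))) {
        String content = HtmlDocument.getHref(element);
        images.add(content);
      }
    }
    return images;
  }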
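
The example above passes attribute values through `HtmlDocument.getHref` before storing them. The README does not say what that method does; assuming it resolves a possibly relative value against the document's base address, the idea can be sketched with the Java standard library alone. `HrefSketch` and `resolveHref` are hypothetical names used only for illustration and are not part of HtmlParser.

```java
import java.net.URI;

// Hypothetical helper, not part of HtmlParser: illustrates resolving an
// attribute value against a base address using only java.net.URI.
class HrefSketch {
    // Returns the absolute form of href relative to baseUrl,
    // or null if href is null.
    static String resolveHref(String baseUrl, String href) {
        if (href == null) {
            return null;
        }
        try {
            return URI.create(baseUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            // Malformed input: hand the value back unchanged.
            return href;
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/articles/post.html";
        System.out.println(resolveHref(base, "thumb.png"));
        System.out.println(resolveHref(base, "/images/cover.png"));
        System.out.println(resolveHref(base, "https://cdn.example.org/x.jpg"));
    }
}
```

Relative values are resolved against the directory of the base address, root-relative values against its host, and absolute values pass through unchanged. Whether `getHref` also honors a `<base>` element or other sources of the base address is not covered by this sketch.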

Sample code in this README file is dedicated to the public domain under CC0: http://creativecommons.org/publicdomain/zero/1.0/

