groupdocs-parser/GroupDocs.Parser-for-Java

Java Parser API to Extract Data

GroupDocs.Parser for Java is a document parsing and data extraction library that supports more than 50 popular document types. Use it in Java-based business applications to parse raw, structured, and formatted text, and to extract images and metadata.

Directory | Description
Examples | Java examples and sample documents to help you get started quickly.

Parse Documents to Extract Text, Images & Metadata

  • Extract plain text from any of the supported documents.
  • Extract HTML or Markdown formatted text for a fast preview.
  • Extract structured text.
  • Extract text areas with coordinates, text style and other information.
  • Search text by keyword or regular expression, and get the text surrounding each match.
  • Extract metadata from supported document formats.
  • Get information about document images and save them.
  • Extract data from containers such as ZIP archives, PDF portfolios, emails, and OST files.
  • Extract table of contents (ToC).
  • Parse form data from PDF documents.
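
The search feature listed above returns matches together with the text around them. As a rough illustration of that idea only, and not the library's own API or implementation, the sketch below runs a regular expression over text that has already been extracted (for example, the string returned by reader.readToEnd()) and keeps a few characters of context on each side of every match. The names KeywordContextSearch, Hit, search, searchKeyword and contextChars are hypothetical and exist only for this example; it uses nothing beyond the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of GroupDocs.Parser: searches already-extracted text.
class KeywordContextSearch {

    // One match: its character offset, the matched text, and the text on either side.
    static final class Hit {
        final int position;
        final String match;
        final String left;
        final String right;

        Hit(int position, String match, String left, String right) {
            this.position = position;
            this.match = match;
            this.left = left;
            this.right = right;
        }
    }

    // Finds every case-insensitive match of the regular expression and
    // captures up to contextChars characters before and after it.
    static List<Hit> search(String text, String regex, int contextChars) {
        List<Hit> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (m.find()) {
            int from = Math.max(0, m.start() - contextChars);
            int to = Math.min(text.length(), m.end() + contextChars);
            hits.add(new Hit(
                    m.start(),
                    m.group(),
                    text.substring(from, m.start()),
                    text.substring(m.end(), to)));
        }
        return hits;
    }

    // Keyword search: the keyword is quoted so that characters such as '.'
    // are matched literally rather than as regex syntax.
    static List<Hit> searchKeyword(String text, String keyword, int contextChars) {
        return search(text, Pattern.quote(keyword), contextChars);
    }

    public static void main(String[] args) {
        String text = "Invoice 2041 was paid. Invoice 2042 is still open.";
        for (Hit h : search(text, "invoice \\d+", 5)) {
            System.out.println(h.position + ": [" + h.left + "]" + h.match + "[" + h.right + "]");
        }
    }
}
```

Quoting the keyword in searchKeyword keeps plain keyword search and regular expression search on one code path, so the context handling is written only once.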

Get Started with GroupDocs.Parser for Java

GroupDocs.Parser for Java requires J2SE 7.0 (1.7), J2SE 8.0 (1.8) or above. Please install Java first if you do not have it already.

GroupDocs hosts all Java APIs on GroupDocs Artifact Repository, so simply configure your Maven project to fetch the dependencies automatically.

Extract Text from PDF Document

// create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // extract a text into the reader
    try (TextReader reader = parser.getText()) {
        // print a text from the document
        // if text extraction isn't supported, a reader is null
        System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
    }
}

Extract Formatted Text from DOCX

// create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // extract a formatted text into the reader
    try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
        // print a formatted text from the document
        // if formatted text extraction isn't supported, a reader is null
        System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
    }
}

Extract Document Metadata via Java

// create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // extract metadata from the document
    Iterable<MetadataItem> metadata = parser.getMetadata();
    // check if metadata extraction is supported
    if (metadata == null) {
        System.out.println("Metadata extraction isn't supported");
    } else {
        // iterate over metadata items (skipped when metadata is null to avoid a NullPointerException)
        for (MetadataItem item : metadata) {
            // print an item name and value
            System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
        }
    }
}

Home | Product Page | Documentation | Demos | API Reference | Examples | Blog | Search | Free Support | Temporary License
