Java Parser API to Extract Data

GroupDocs.Parser for Java is a document parsing and data extraction library that supports more than 50 popular document types. It helps you build Java-based business applications that parse raw, structured and formatted text and extract images and metadata.

Directory	Description
Examples	Java examples and sample documents for you to get started quickly.

Parse Documents to Extract Text, Images & Metadata

Extract plain text from any of the supported documents.
Extract HTML or Markdown formatted text for a fast preview.
Extract structured text.
Extract text areas with coordinates, text style and other information.
Search text by a keyword or regular expression. Also get text around the found word.
Extract metadata from supported document formats.
Get information about document images and save them.
Extract data from containers such as ZIP archives, PDF portfolios, emails, OST files and so on.
Extract table of contents (ToC).
Parse form data from PDF documents.
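
To make the "search by keyword or regular expression, with text around the found word" feature more concrete, here is a minimal plain-Java sketch of the idea applied to text that has already been extracted into a String. It uses only java.util.regex and is not the GroupDocs.Parser search API; the class name ContextSearch, the Hit type and the radius parameter are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: keyword-in-context search over already extracted text.
// This is NOT the GroupDocs.Parser API; all names here are hypothetical.
public class ContextSearch {

    /** One match: its character offset, the matched text and the surrounding text. */
    public static final class Hit {
        public final int position;
        public final String match;
        public final String context;

        Hit(int position, String match, String context) {
            this.position = position;
            this.match = match;
            this.context = context;
        }
    }

    /**
     * Finds every case-insensitive match of regex in text and returns it
     * together with up to radius characters of text on each side.
     */
    public static List<Hit> search(String text, String regex, int radius) {
        List<Hit> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (m.find()) {
            // clamp the context window to the bounds of the text
            int from = Math.max(0, m.start() - radius);
            int to = Math.min(text.length(), m.end() + radius);
            hits.add(new Hit(m.start(), m.group(), text.substring(from, to)));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Invoice 2041 was paid. Invoice 2042 is still open.";
        for (Hit hit : search(text, "invoice\\s+\\d+", 8)) {
            System.out.println(hit.position + ": " + hit.match + " [" + hit.context + "]");
        }
    }
}
```

In a real application the input string would come from the extracted document text, for example the value returned by reader.readToEnd() in the text extraction example in this README.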

Get Started with GroupDocs.Parser for Java

GroupDocs.Parser for Java requires J2SE 7.0 (1.7), J2SE 8.0 (1.8) or above. Please install Java first if you do not have it already.

GroupDocs hosts all Java APIs on GroupDocs Artifact Repository, so simply configure your Maven project to fetch the dependencies automatically.

Extract Text from PDF Document

// create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // extract a text into the reader
    try (TextReader reader = parser.getText()) {
        // print a text from the document
        // if text extraction isn't supported, a reader is null
        System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
    }
}

Extract Formatted Text from DOCX

// create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // extract a formatted text into the reader
    try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
        // print a formatted text from the document
        // if formatted text extraction isn't supported, a reader is null
        System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
    }
}

Extract Document Metadata via Java

// create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // extract metadata from the document
    Iterable<MetadataItem> metadata = parser.getMetadata();
    // check if metadata extraction is supported
    if (metadata == null) {
        System.out.println("Metadata extraction isn't supported");
    } else {
        // iterate over metadata items
        for (MetadataItem item : metadata) {
            // print an item name and value
            System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
        }
    }
}
