How to convert to HTML using go-tika? #27

Open
imqishi opened this issue Jan 6, 2021 · 2 comments

Comments

imqishi commented Jan 6, 2021

In the Java API, we can convert a file to HTML like this:

public static String extractHtml(File file) throws IOException {
    byte[] bytes = Files.toByteArray(file);
    AutoDetectParser tikaParser = new AutoDetectParser();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler;
    try {
        handler = factory.newTransformerHandler();
    } catch (TransformerConfigurationException ex) {
        throw new IOException(ex);
    }
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    handler.setResult(new StreamResult(out));
    ExpandedTitleContentHandler handler1 = new ExpandedTitleContentHandler(handler);
    try {
        tikaParser.parse(new ByteArrayInputStream(bytes), handler1, new Metadata());
    } catch (SAXException | TikaException ex) {
        throw new IOException(ex);
    }
    return new String(out.toByteArray(), "UTF-8");
}

Can go-tika do this? When we use

client := tika.NewClient(nil, s.URL())
body, err := client.Parse(context.Background(), f)

what does body contain, and how should I interpret the return value?

Member

tbpg commented Jan 6, 2021

go-tika is based on Tika server: https://cwiki.apache.org/confluence/display/TIKA/TikaServer

So, we should be able to adjust go-tika to support any endpoint there. client.Parse corresponds to PUT /tika.

Contributor

nathj07 commented Apr 1, 2022

I'm just getting started with this, and it looks like client.Parse returns HTML. I'm actually trying to get it to return text/plain, but I can't find a way to pass a request header.
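One possible workaround, sketched below, is to inject the header at the HTTP client level with a custom http.RoundTripper. The acceptTransport type and echoAccept helper are hypothetical names, and the example only checks the header against a local echo server. Whether this works with go-tika depends on two assumptions I have not verified: that go-tika sends its requests through the *http.Client given to tika.NewClient, and that it does not set its own Accept header.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// acceptTransport (hypothetical) sets the Accept header on every request
// before handing it to the underlying transport.
type acceptTransport struct {
	accept string
	base   http.RoundTripper // nil means http.DefaultTransport
}

func (t acceptTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper must not modify the caller's request, so work on a clone.
	r := req.Clone(req.Context())
	r.Header.Set("Accept", t.accept)

	base := t.base
	if base == nil {
		base = http.DefaultTransport
	}
	return base.RoundTrip(r)
}

// echoAccept starts a local server that echoes the Accept header it receives,
// sends one request through acceptTransport, and returns the echoed value.
func echoAccept(accept string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get("Accept"))
	}))
	defer srv.Close()

	hc := &http.Client{Transport: acceptTransport{accept: accept}}
	resp, err := hc.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	got, err := echoAccept("text/plain")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("server saw Accept:", got)
}
```

If the assumptions hold, a client built this way could be passed as the first argument to tika.NewClient in place of nil.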
