Error with Redetect File Type API #7527

Closed
stevenferey opened this issue Jan 22, 2021 · 13 comments · Fixed by #8835

@stevenferey
Contributor

Hello,

This issue follows a discussion in the Google group:
https://groups.google.com/g/dataverse-community/c/_H8ZdAo85BU

Here is an example that illustrates the problem:

Dataverse version: 5.0 + S3 storage
file extension saved in S3: .tabular
current MIME type for this file: "application/octet-stream"
The .tabular extension is declared in the MimeTypeDetectionByFileExtension.properties file => tabular = text/tab-separated-values

Here is the problem:
When the redetect API resource is called, because the file is remote, its content is first downloaded into a temporary file: tempFileTypeCheck.tmp
The file's extension is then looked up in MimeTypeDetectionByFileExtension.properties, but .tmp is not listed there.
The server returns: "tmp is a file extension Dataverse doesn't know about. Consider adding it to the MimeTypeDetectionByFileExtension.properties file."

In the end, the redetect API resource leaves this file with the "application/octet-stream" MIME type :(
The expected result is "text/tab-separated-values".
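
To illustrate, here is a minimal, self-contained sketch of the failure mode (all names here are hypothetical stand-ins, not the actual Dataverse code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class TempFileDetectionSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for MimeTypeDetectionByFileExtension.properties:
        Properties extMap = new Properties();
        extMap.setProperty("tabular", "text/tab-separated-values");

        // The remote S3 object is first downloaded into a temp file...
        Path temp = Files.createTempFile("tempFileTypeCheck", ".tmp");

        // ...and the extension used for the lookup is taken from that
        // temp file's name, not from the original file name:
        String name = temp.getFileName().toString();            // "tempFileTypeCheck....tmp"
        String ext = name.substring(name.lastIndexOf('.') + 1); // "tmp"

        // "tmp" is not in the properties file, so the fallback type wins:
        System.out.println(extMap.getProperty(ext, "application/octet-stream"));
        // -> application/octet-stream
        Files.delete(temp);
    }
}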

Thanks a lot.
Steven.

@djbrooke
Contributor

Hey @stevenferey, thanks for the report. That's interesting, as we're also running on S3 but have not had the same experience on the Harvard Dataverse Repository, AFAIK. We'll be running this API following the 5.4 release, as we've added some new MIME types; it may give us a chance to gather some more information.

I'm not sure it matters, but are you using AWS S3 or something else?

@stevenferey
Contributor Author

Hello Danny,

Thank you for your reply.
No, our S3 storage is not hosted at Amazon; it is on our internal INRAE network.

I hope this is useful for you.

Thanks a lot.
Steven.

@stevenferey
Contributor Author

Hi,

I should add that if "dryRun" is false, the update cannot be saved (server error in the SQL transaction). The affected dataset then returns an HTTP 500 error, because Dataverse can no longer retrieve the "DerivedOriginalFileName" field for the file (if it is a tabular file): NullPointerException:

[2021-01-26T16:58:15.412+0100] [Payara 5.2020] [SEVERE] [] [javax.enterprise.resource.webcontainer.jsf.application] [tid: _ThreadID=95 _ThreadName=http-thread-pool::jk-connector(5)] [timeMillis: 1611676695412] [levelValue: 1000] [[
Error Rendering View[/dataset.xhtml]
javax.el.ELException: /dataset.xhtml @59,66 value="#{DatasetPage.jsonLd}": java.lang.NullPointerException
at com.sun.faces.facelets.el.TagValueExpression.getValue(TagValueExpression.java:77)
at javax.faces.component.ComponentStateHelper.eval(ComponentStateHelper.java:170)
at javax.faces.component.ComponentStateHelper.eval(ComponentStateHelper.java:157)
at javax.faces.component.UIOutput.getValue(UIOutput.java:140)
at com.sun.faces.renderkit.html_basic.HtmlBasicInputRenderer.getValue(HtmlBasicInputRenderer.java:181)
at com.sun.faces.renderkit.html_basic.HtmlBasicRenderer.getCurrentValue(HtmlBasicRenderer.java:328)
at com.sun.faces.renderkit.html_basic.HtmlBasicRenderer.encodeEnd(HtmlBasicRenderer.java:143)
at javax.faces.component.UIComponentBase.encodeEnd(UIComponentBase.java:595)
at javax.faces.component.UIComponent.encodeAll(UIComponent.java:1654)
at javax.faces.component.UIComponent.encodeAll(UIComponent.java:1650)
at org.primefaces.renderkit.HeadRenderer.encodeBegin(HeadRenderer.java:72)
at javax.faces.component.UIComponentBase.encodeBegin(UIComponentBase.java:540)
at javax.faces.component.UIComponent.encodeAll(UIComponent.java:1644)
at javax.faces.component.UIComponent.encodeAll(UIComponent.java:1650)
at com.sun.faces.application.view.FaceletViewHandlingStrategy.renderView(FaceletViewHandlingStrategy.java:468)
at com.sun.faces.application.view.MultiViewHandler.renderView(MultiViewHandler.java:170)
at javax.faces.application.ViewHandlerWrapper.renderView(ViewHandlerWrapper.java:132)
at org.ocpsoft.rewrite.faces.RewriteViewHandler.renderView(RewriteViewHandler.java:196)
at javax.faces.application.ViewHandlerWrapper.renderView(ViewHandlerWrapper.java:132)
at com.sun.faces.lifecycle.RenderResponsePhase.execute(RenderResponsePhase.java:102)
at com.sun.faces.lifecycle.Phase.doPhase(Phase.java:76)
at com.sun.faces.lifecycle.LifecycleImpl.render(LifecycleImpl.java:199)
at javax.faces.webapp.FacesServlet.executeLifecyle(FacesServlet.java:708)
at javax.faces.webapp.FacesServlet.service(FacesServlet.java:451)
at org.apache.catalina.core.StandardWrapper.service(StandardWrapper.java:1636)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:331)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:211)
at org.ocpsoft.rewrite.servlet.RewriteFilter.doFilter(RewriteFilter.java:226)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:253)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:211)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:257)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:757)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:577)
at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:99)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:158)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:757)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:577)
at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:368)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:238)
at com.sun.enterprise.v3.services.impl.ContainerMapper$HttpHandlerCallable.call(ContainerMapper.java:520)
at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:217)
at org.glassfish.grizzly.http.server.HttpHandler.runService(HttpHandler.java:182)
at org.glassfish.grizzly.http.server.HttpHandler.doHandle(HttpHandler.java:156)
at org.glassfish.grizzly.http.server.HttpServerFilter.handleRead(HttpServerFilter.java:218)
at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:95)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:260)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:177)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:109)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:88)
at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:53)
at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:524)
at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:89)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:94)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:33)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:114)
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:569)
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:549)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.el.ELException: java.lang.NullPointerException
at javax.el.BeanELResolver.getValue(BeanELResolver.java:304)
at com.sun.faces.el.DemuxCompositeELResolver._getValue(DemuxCompositeELResolver.java:156)
at com.sun.faces.el.DemuxCompositeELResolver.getValue(DemuxCompositeELResolver.java:184)
at com.sun.el.parser.AstValue.getValue(AstValue.java:114)
at com.sun.el.parser.AstValue.getValue(AstValue.java:177)
at com.sun.el.ValueExpressionImpl.getValue(ValueExpressionImpl.java:183)
at org.jboss.weld.module.web.el.WeldValueExpression.getValue(WeldValueExpression.java:50)
at com.sun.faces.facelets.el.TagValueExpression.getValue(TagValueExpression.java:73)
... 58 more
Caused by: java.lang.NullPointerException
at edu.harvard.iq.dataverse.DataFile.getDerivedOriginalFileName(DataFile.java:456)
at edu.harvard.iq.dataverse.DataFile.getOriginalFileName(DataFile.java:447)
at edu.harvard.iq.dataverse.util.json.JsonPrinter.json(JsonPrinter.java:608)
at edu.harvard.iq.dataverse.util.json.JsonPrinter.json(JsonPrinter.java:562)
at edu.harvard.iq.dataverse.util.json.JsonPrinter.jsonFileMetadatas(JsonPrinter.java:445)
at edu.harvard.iq.dataverse.util.json.JsonPrinter.json(JsonPrinter.java:365)
at edu.harvard.iq.dataverse.util.json.JsonPrinter.jsonWithCitation(JsonPrinter.java:419)
at edu.harvard.iq.dataverse.util.json.JsonPrinter.jsonAsDatasetDto(JsonPrinter.java:438)
at edu.harvard.iq.dataverse.export.ExportService.exportFormat(ExportService.java:211)
at edu.harvard.iq.dataverse.export.ExportService.getExport(ExportService.java:99)
at edu.harvard.iq.dataverse.export.ExportService.getExportAsString(ExportService.java:118)
at edu.harvard.iq.dataverse.DatasetPage.getJsonLd(DatasetPage.java:5486)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.el.BeanELResolver.getValue(BeanELResolver.java:299)
... 65 more

Application server: Payara
Dataverse version: 5.0

Steven.

@djbrooke
Contributor

djbrooke commented Jan 27, 2021

Thanks @stevenferey for the additional details. After some testing, I was able to reproduce this. We'll be using this API in the next release so I will prioritize this.

edit: we will not be using this API in the next release but we should fix it anyway. :)

@djbrooke
Contributor

  • We expect that the stacktrace above, the JSON-LD issue, is something that has been fixed post 5.0
  • @djbrooke has a stacktrace from the production server from earlier today

@scolapasta scolapasta removed their assignment Jan 27, 2021
@landreev
Contributor

Here is the problem:
When the redetect API resource is called, because the file is remote, its content is first downloaded into a temporary file: tempFileTypeCheck.tmp
The file's extension is then looked up in MimeTypeDetectionByFileExtension.properties, but .tmp is not listed there.

OK, this does look like a problem/bug. Note that it's not specific to S3 or to the fact that we have to create a temp file, though - it doesn't look like it would work for local files either!
The file doesn't have that ".tabular" extension as saved on S3, or on the local filesystem - it's stored under a generated name (kept in the storageidentifier field of the dvobject table); it'll look something like 176f1dbd05a-8f2a0be66225. The "friendly" file name with the extension is only stored in the database (in the FileMetadata table).

I'm seeing this in the code:

return FileUtil.determineFileType(file, file.getName());

... it should of course be something like

return FileUtil.determineFileType(file, fileMetadata.getLabel());

In other words, that redetection API currently only works for the types that we recognize by file content, not by file name/extension.
But this is easy to fix.
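
To make the difference concrete, here is a minimal, self-contained sketch (the detectByExtension helper is a hypothetical stand-in for the extension-lookup step inside FileUtil.determineFileType, not the real method):

import java.util.Map;

public class LabelVsPhysicalName {
    // Hypothetical stand-in for the extension lookup that
    // FileUtil.determineFileType performs on its name argument.
    static String detectByExtension(String fileName, Map<String, String> extToMime) {
        int dot = fileName.lastIndexOf('.');
        String ext = (dot < 0) ? "" : fileName.substring(dot + 1);
        return extToMime.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        Map<String, String> extToMime = Map.of("tabular", "text/tab-separated-values");

        // file.getName(): the physical file, i.e. a generated name or a
        // temp copy of a remote object - no useful extension:
        System.out.println(detectByExtension("tempFileTypeCheck1234.tmp", extToMime));
        // -> application/octet-stream

        // fileMetadata.getLabel(): the user-visible name stored in the
        // database, which still carries the original extension:
        System.out.println(detectByExtension("mydata.tabular", extToMime));
        // -> text/tab-separated-values
    }
}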

@landreev
Contributor

The .tabular extension is declared in the MimeTypeDetectionByFileExtension.properties file => tabular = text/tab-separated-values

Did you add this extension and this type to the properties file above? I'm not seeing it in the version of the file that we distribute.
Please note that you don't want to use the MIME type text/tab-separated-values for UNINGESTED tabular files! text/tsv should be used for text files that have tab-separated columns but haven't gone through the ingest process.
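
For example, assuming you are customizing your local copy of the distributed properties file, an entry for uningested tab-separated files would look like:

tabular = text/tsv

(text/tab-separated-values being reserved for files that have gone through tabular ingest, per the note above.)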

@landreev
Contributor

Looking at the stack trace you posted, it appears to be the same problem as in #7310, which we have fixed since 5.0.
(Your dataset ended up in that condition in a different way than what we saw in #7310, but the problem is the same - the JSON-LD export is not available when the page is trying to load, and the page fails when it tries to generate it.)
An upgrade would solve this problem (not the failing redetect, but the dataset's 500 error). And there is a workaround that you can use now - you can simply export the JSON-LD on the command line:

curl "http://localhost:8080/api/datasets/export?persistentId=DATASET_ID&exporter=schema.org"

and this export will be cached, and the dataset page will start working again.
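
For example, with a hypothetical persistent identifier (the DOI below is a placeholder):

curl "http://localhost:8080/api/datasets/export?persistentId=doi:10.5072/FK2/ABC123&exporter=schema.org"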

@landreev
Contributor

For the local developers:
to summarize, there appear to be 2 separate issues (so far):

  1. Re-detection needs to use the filename saved in the database, not the name of the physical file (temporary or otherwise); i.e., it should be
return FileUtil.determineFileType(file, fileMetadata.getLabel());

instead of

return FileUtil.determineFileType(file, file.getName());

(trivial)
  2. According to the stack trace generated by @djbrooke locally, this API is failing after the redetection, when it tries to re-export the metadata to reflect any changes, with some EJB transaction error. (The stack trace was posted on Slack; we should probably add it here.)
It appears that it was failing for the OP in the same manner - hence the 500 they were getting on the dataset page, since the JSON-LD was no longer available for the dataset (this is the problem solved in #7310 since 5.0).
This one seems more mysterious - the cause wasn't immediately obvious from the stack trace.

@stevenferey
Contributor Author

Hello landreev,

Thank you very much for your answer.
Yes, we have added some file types to the properties file.

The curl command also makes the dataset visible again.
Thank you very much for that.

Steven.

@landreev
Contributor

landreev commented Sep 1, 2021

Hello,
Apologies for letting this issue fall through the cracks.
It just wasn't clear how to address this properly. We found a few (but only a few!) files in our own production where redetect was causing the same problem. But we were never able to reproduce it on any other test system at all. I made a PR that appeared to fix it for the few known failure cases... but it was rejected for other reasons.

I'm assuming this hasn't been much of an issue for you. (With your initial use case - changing the MIME type based on the filename extension - that could easily be done with a database query instead...) But it does look like this is being caused by some underlying EJB issue that may result in problems elsewhere... So ideally, we'd like to understand what's going on.
Could you please re-test it with the same file in your system, under the version of Dataverse you are running now, just to see if it's still failing, with the same exception in the stack trace? - Thank you!

@stevenferey
Contributor Author

Hello,

I have run an example to verify that the problem is still present on our Dataverse v5.3:

I take a .tabular file that was saved in S3 before I customized the MimeTypeDetectionByFileExtension.properties file.

Its MIME type is "application/octet-stream", which is expected.


Next, I customize the MimeTypeDetectionByFileExtension.properties file so that my new .tabular files have the MIME type "text/tab-separated-values":
tabular = text/tab-separated-values

Now my new .tabular files saved in Dataverse have the MIME type "text/tab-separated-values".
Good!


To give my first file the new MIME type as well, I call the following API resource:

curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/files/$ID/redetect?dryRun=true"

server.log:
tmp is a file extension Dataverse doesn't know about. Consider adding it to the MimeTypeDetectionByFileExtension.properties file.

curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/files/$ID/redetect?dryRun=false"

server.log: attachment (server.log)


I hope this helps with the analysis.
Thank you so much,
Steven.

@pdurbin
Member

pdurbin commented Jul 12, 2022

pdurbin pushed a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Sep 19, 2022
@pdurbin pdurbin added this to the 5.12 milestone Sep 19, 2022
poikilotherm pushed a commit to poikilotherm/dataverse that referenced this issue Sep 19, 2022
poikilotherm pushed a commit to poikilotherm/dataverse that referenced this issue Sep 20, 2022
@pdurbin pdurbin moved this from Watching to Done in pdurbin Oct 26, 2022