UPFMT Hackathon #33

Closed
dumitrescustefan opened this issue Apr 13, 2018 · 119 comments
Assignees
Labels
Component Participant is providing component(s) Docker Component/Application provided as a docker image


@dumitrescustefan

Hi, we need help in testing our docker component.
So far we successfully registered it on test.openminted.eu, but we are unable to test it.

The component takes as input a folder where it searches first for xmi files and extracts raw text from it (in .txt format), and also searches for .txt files. All the .txt files get processed (segmented, tokenized, lemmatized, tagged and parsed) and in the output folder we create .conllu and .xmi formats.

We need help:

  1. in having xmi/txt files as input on the platform
  2. testing the actual component by observing its output
  3. specifying parameters in the omtd-share record (I am still unsure we set them right on how parameters have to be specified: we have input, output, and language)

Thank you,
Stefan (Ineosoft)

@gkirtzou gkirtzou self-assigned this Apr 16, 2018
@greenwoodma
Member

Could you possibly attach your existing OMTD-SHARE XML descriptor to this issue, along with a description of the parameters you are trying to include, so we can have a look at this before the online session? Thanks.

@dumitrescustefan
Author

dumitrescustefan commented Apr 16, 2018

Hi, I tried again to register our docker component (xml attached).

We have only basic parameters:

  • input (folder in which the component searches for xmi and/or txt files)
  • output (folder in which the component writes conllu and xmi files)
  • language (any language; currently the docker component has only the English model loaded, so the parameter should be --param:language=en)

For example, this works on my local pc: docker run -v E:_d\in:/input -v E:_d\out:/output upfmt:latest --input=/input --output=/output --param:language=en
(tested on both win and linux)
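
For anyone reproducing the local test, the invocation above can also be assembled programmatically. This is only a minimal sketch: the helper name build_docker_command and the host paths are hypothetical, while the image name and the three flags are the ones quoted in the comment.

```python
from typing import List


def build_docker_command(host_in: str, host_out: str,
                         image: str = "upfmt:latest",
                         language: str = "en") -> List[str]:
    """Assemble the argument list for the local test run (hypothetical helper)."""
    return [
        "docker", "run",
        "-v", f"{host_in}:/input",    # mount the host input folder
        "-v", f"{host_out}:/output",  # mount the host output folder
        image,
        "--input=/input",
        "--output=/output",
        f"--param:language={language}",
    ]


if __name__ == "__main__":
    # Printing only; pass the list to subprocess.run() to actually execute it.
    print(" ".join(build_docker_command("/data/in", "/data/out")))
```

Building a list rather than a single string avoids shell quoting problems with Windows paths.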

The docker image is here: https://hub.docker.com/r/dumitrescustefan/upfmt/
The git files are here: https://github.com/dumitrescustefan/UPFMT

@dumitrescustefan
Author

The component is here : https://test.openminted.eu/landingPage/component/5f796253-c00d-432a-9c3a-d1b4d586ed50

(I already tried registering before, so now there is a UPFMT and a UPFMT2: same component, different XML shares, to see if I did something wrong.)

Could you point me to :

  1. is the share xml okay (meaning input, output, and parameters)?
  2. the next step is to build an app (I think), and then test it? I tried this but I did something wrong ( https://test.openminted.eu/landingPage/application/51173c7d-166c-425f-9927-335f023e8eb7 this was with the first component registration attempt) and when trying to run on a demo corpus (I think it was the one with 20 PDFs) I got a little red message in the middle of the page saying there's an error. I got stuck here.

Thank you!
Stefan

@pennyl67
Collaborator

Hi Stefan
Could you please attach the metadata as a separate file instead of inside the text?
Thanks
Penny

@dumitrescustefan
Author

Here it is. I changed the extension to .txt otherwise attaching says that it can't handle this type of document (?!?).
Thank you!

5f796253-c00d-432a-9c3a-d1b4d586ed50.xml.txt

@pennyl67
Collaborator

@dumitrescustefan
Thanks.
For the metadata, the only improvements I would suggest are:

  • adding a meaningful description for the description element, as this will help users know what the component does
  • adding the resourceCreationInfo with the resourceCreator (can be a group or organization), as it helps with citation.

Technical issues (if any) will be discussed in the hackathon session.

@dumitrescustefan
Author

Definitely. The metadata now is only targeted to get things working; for the final version we will fill everything in fully, including parameter comments, citation, etc. Thanks!

@greenwoodma greenwoodma added the Docker Component/Application provided as a docker image label Apr 18, 2018
@pennyl67 pennyl67 added the Component Participant is providing component(s) label Apr 19, 2018
@mandiayba
Member

Hi @dumitrescustefan

some remarks concerning your metadata

  • the distributionLocation element must contain only "dumitrescustefan/upfmt". That is what is required to pull and run your docker image: docker pull dumitrescustefan/upfmt, then docker run dumitrescustefan/upfmt ...

  • the command element must contain only your executor, i.e. the part required to run your command, excluding the parameters and their respective values.

  • the following parameters are not required because you have defined the input and output with the inputContentResourceInfo and outputResourceInfo elements

        <ns0:parameterInfo>
           <ns0:parameterName>input</ns0:parameterName>
           <ns0:parameterLabel>input folder containing xmi and/or txt files</ns0:parameterLabel>
           <ns0:parameterDescription>input folder containing xmi and/or txt files</ns0:parameterDescription>
           <ns0:parameterType>string</ns0:parameterType>
           <ns0:optional>false</ns0:optional>
           <ns0:multiValue>false</ns0:multiValue>
           <ns0:defaultValue>/input</ns0:defaultValue>
        </ns0:parameterInfo>
        <ns0:parameterInfo>
           <ns0:parameterName>output</ns0:parameterName>
           <ns0:parameterLabel>output folder path where xmi and conllu files will be written</ns0:parameterLabel>
           <ns0:parameterDescription>output folder path where xmi and conllu files will be written</ns0:parameterDescription>
           <ns0:parameterType>string</ns0:parameterType>
           <ns0:optional>false</ns0:optional>
           <ns0:multiValue>false</ns0:multiValue>
           <ns0:defaultValue>/output</ns0:defaultValue>
        </ns0:parameterInfo>

thanks
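
The removal suggested above could be done with a short script instead of by hand. The following is a sketch under stated assumptions: the namespace URI bound to ns0 is not shown in this thread, so NS below is a placeholder, and the componentInfo wrapper element in the sample is hypothetical. Only parameterInfo and parameterName come from the snippet above.

```python
import xml.etree.ElementTree as ET

# Assumption: the real namespace URI bound to ns0 is not shown in the thread,
# so a placeholder is used here.
NS = "http://example.org/omtd-share-placeholder"


def drop_io_parameters(root: ET.Element, names=("input", "output")) -> int:
    """Remove parameterInfo elements whose parameterName is in `names`."""
    removed = 0
    # Snapshot the elements first so the tree is not mutated while iterating it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != f"{{{NS}}}parameterInfo":
                continue
            if child.findtext(f"{{{NS}}}parameterName") in names:
                parent.remove(child)
                removed += 1
    return removed


# Hypothetical, heavily shortened descriptor used only for demonstration.
SAMPLE = f"""
<ns0:componentInfo xmlns:ns0="{NS}">
  <ns0:parameterInfo><ns0:parameterName>input</ns0:parameterName></ns0:parameterInfo>
  <ns0:parameterInfo><ns0:parameterName>output</ns0:parameterName></ns0:parameterInfo>
  <ns0:parameterInfo><ns0:parameterName>language</ns0:parameterName></ns0:parameterInfo>
</ns0:componentInfo>
""".strip()

if __name__ == "__main__":
    tree_root = ET.fromstring(SAMPLE)
    print(drop_io_parameters(tree_root), "parameterInfo elements removed")
```

The language parameter is left in place, since it is not covered by the input and output resource elements.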

@dumitrescustefan
Author

Hi,

I made the changes you suggested above and re-registered as UPFMT3 : https://test.openminted.eu/landingPage/component/5f796253-c00d-432a-9c3a-01b4d586ed50

Could you tell me how to test it? Do I need to create an application?

Thank you,
Stefan
P.S. I did not know that if I set the public flag to false I could re-edit the xml, I'll do that for the next test.

5f796253-c00d-432a-9c3a-01b4d586ed50.zip

@mandiayba
Member

@greenwoodma @galanisd I have tried UPFMT3 in a workflow (omtdImport -> pdfReader -> UPFMT3). I have run the workflow with a corpus (pdf) but it ends up with an error "System error getting execution status (Server responded: undefined)". Could we please have the logs to know what is wrong ?

the workflow is private https://test.openminted.eu/landingPage/application/0ca1e01c-b5c7-4cc6-a625-1f0f9ad117b6

@galanisd
Member

It is possible that the pdfReader was not configured appropriately.
patterns->**/*.pdf

@galanisd
Member

Also, I had a look into our workflow engine. It seems that the UPFMT3 wrapper, which is generated from your OMTD-SHARE record, uses "upfmt:latest" as the command for calling your component. Is this a valid command?

@galanisd galanisd self-assigned this Apr 20, 2018
@dumitrescustefan
Author

dumitrescustefan commented Apr 20, 2018

Hi,

I re-registered the component as UPFMT4 (this time it is private so we can edit it), and put in the command just "upfmt".

In our local tests it works both with and without the :latest tag. UPFMT4 now has just:

<ns0:componentDistributionInfo>
                <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
                <ns0:distributionLocation>dumitrescustefan/upfmt</ns0:distributionLocation>
                <ns0:command>upfmt</ns0:command>
</ns0:componentDistributionInfo>

Just to be sure I was clear: our component looks for all .xmi and/or .txt files in the input folder and writes processed .xmi files to the output folder (as well as other files, for example in .conllu format, to easily check the output). Thank you very much!

@galanisd
Member

Please send me the landing page...

@mandiayba
Member

@galanisd what do you mean by "the pdfReader was not configured appropriately"? Are there any specific things to consider when using the UIMA PdfReader in a workflow?

@greenwoodma
Member

@mandiayba I got caught out by this earlier. It seems that by default the PdfReader doesn't find any documents and so produces no output. This is because it's driven by a patterns param which defaults to blank. The easiest option is to set it to **/*.pdf which will match recursively all PDF files in the folder structure passed to it as input. @galanisd would it make sense to have this as a default value in the component as I would guess 99.9% of the cases this is the required behaviour and I'm sure this won't be the last time it trips someone up.
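
To illustrate what a recursive pattern such as **/*.pdf selects, here is a small sketch using Python's pathlib. It is an analogy only: the PdfReader's own pattern matcher may differ in details, and the helper name match_documents and the folder layout are made up for the example.

```python
import tempfile
from pathlib import Path


def match_documents(root: Path, pattern: str) -> list:
    """Return paths under `root` matching `pattern`, relative to `root`."""
    if not pattern:
        # Mirrors the behaviour described above: a blank pattern selects nothing.
        return []
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "fulltext" / "nested").mkdir(parents=True)
        (root / "top.pdf").touch()
        (root / "fulltext" / "a.pdf").touch()
        (root / "fulltext" / "nested" / "b.pdf").touch()
        (root / "fulltext" / "notes.txt").touch()
        print(match_documents(root, "**/*.pdf"))  # PDFs at every depth
        print(match_documents(root, ""))          # blank pattern: nothing
```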

@galanisd
Member

@mandiayba I got caught out by this earlier. It seems that by default the PdfReader doesn't find any documents and so produces no output. This is because it's driven by a patterns param which defaults to blank.

Exactly!

The easiest option is to set it to **/*.pdf which will match recursively all PDF files in the folder structure passed to it as input.

Exactly!

@galanisd would it make sense to have this as a default value in the component as I would guess 99.9% of the cases this is the required behaviour and I'm sure this won't be the last time it trips someone up.

Default values in the Galaxy XML wrappers come from default values in the OMTD-SHARE record. PdfReader is actually something like a built-in component in our platform; so, yes we can probably manually edit the wrapper and set **/*.pdf as default value for patterns parameter.

The other solution is to have some help & instructions for building workflows where it should be mentioned.

@greenwoodma
Member

The other solution is to have some help & instructions for building workflows where it should be mentioned.

That made me laugh so much!

@reckart
Member

reckart commented Apr 20, 2018

The best thing would be to have the **/*.pdf as a default value in the DKPro Core PdfReader, but it requires changes in uimaFIT - I'm taking note to have a look at it because I need to have a look at uimaFIT in the next days anyway (currently waiting for the project to be migrated to Git/GitHub), but no idea if I will be able to make the changes.

So editing the OMTD-SHARE descriptor before completing the registration process seems a sensible solution for the time being.

@mandiayba
Member

@mandiayba I got caught out by this earlier. It seems that by default the PdfReader doesn't find any documents and so produces no output. This is because it's driven by a patterns param which defaults to blank. The easiest option is to set it to **/*.pdf which will match recursively all PDF files in the folder structure passed to it as input.

Considering that component UPFMT4 takes xmi files as input, could we find another way to run it on the registry? Could we use xmi files from @dumitrescustefan and define an executable workflow, for example omtdImporter -> UPFMT4?

@galanisd
Member

Yes you can.

@mandiayba
Member

@dumitrescustefan could you please attach a sample of input files ? I will try with them

@dumitrescustefan
Author

The component also looks for .txt files (for .xmi input it just extracts the raw text from the xmi and creates a temporary txt, so it is the same as having txts directly). So if you already have a PDF->txt converter or something similar, that might be easier to test.

Also, here is a sample .xmi file.
dummy.zip
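
As a rough illustration of the "extract raw text from the xmi" step, the sketch below reads the document text from the cas:Sofa element of a UIMA XMI file. This is not UPFMT's actual code: the function name extract_raw_text and the sample document are assumptions, and it relies on the standard UIMA XMI layout in which the text is stored in the sofaString attribute.

```python
import xml.etree.ElementTree as ET

# Namespace used by UIMA for CAS elements in XMI serialisations.
CAS_NS = "http:///uima/cas.ecore"


def extract_raw_text(xmi: str) -> str:
    """Return the document text stored in the first cas:Sofa element."""
    root = ET.fromstring(xmi)
    sofa = root.find(f"{{{CAS_NS}}}Sofa")
    if sofa is None:
        raise ValueError("no cas:Sofa element found")
    return sofa.get("sofaString", "")


# Hypothetical minimal XMI document used only for demonstration.
SAMPLE = (
    '<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" '
    'xmlns:cas="http:///uima/cas.ecore" xmi:version="2.0">'
    '<cas:NULL xmi:id="0"/>'
    '<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" '
    'mimeType="text" sofaString="Hello world."/>'
    '</xmi:XMI>'
)

if __name__ == "__main__":
    print(extract_raw_text(SAMPLE))
```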

@mandiayba
Member

@galanisd I have tried UPFMT4 in a workflow (omtdImporter -> UPFMT4) with the corpus sent by @dumitrescustefan in the previous comment, but it does not run. I got the error "There was a problem running the application. Try again in a while. (corpus with ID '23f1d29d-919e-4847-b61d-61aea8967094' is empty)". Could you please check what is wrong with the corpus?

@greenwoodma
Member

@mandiayba did you just upload the zip file when creating the corpus? If so then that's the problem. The input documents need to be in a subfolder called fulltext but that zip file has the file in the root so won't be treated as a document for processing.
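
A corpus archive with the layout described above can be produced with a few lines of Python. This is a minimal sketch: the helper name build_corpus_zip and the sample file name are hypothetical; the only requirement taken from the comment is that documents sit in a subfolder called fulltext.

```python
import io
import zipfile


def build_corpus_zip(documents: dict) -> bytes:
    """Pack {filename: bytes} into a zip with every file under fulltext/."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in documents.items():
            # Place each document in the fulltext/ subfolder, not the zip root.
            archive.writestr(f"fulltext/{name}", payload)
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_corpus_zip({"dummy.xmi": b"<xmi/>"})
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        print(archive.namelist())
```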

@gkirtzou

gkirtzou commented May 2, 2018

Also, did you register your component in the registry multiple times today? I see multiple Galaxy wrapper records for your component with today's date. The Galaxy wrapper records are generated by the OMTD platform when you register a component, so that the Galaxy workflow engine can call your component.

@dumitrescustefan
Author

dumitrescustefan commented May 2, 2018

@gkirtzou Here is the zip with the latest XML:
upfmtV5.zip

I also made the component public so you could test it.

Also, yes, I pressed the button a few times. I did this because nothing happened for ~ 15 seconds the first time I clicked, so I tried again. A couple of times :) Then I saw a bunch of entries in the components list and I cleaned everything by deleting all duplicates. I had no visual feedback that anything was happening after pressing the button, and I became trigger-happy.

@gkirtzou

gkirtzou commented May 2, 2018

Thanks for the metadata, I will check them. Aaaah, I see.. Yes sometimes the response is a little bit slow.

@gkirtzou

gkirtzou commented May 2, 2018

@dumitrescustefan I am happy to announce that we have successfully run your component on the OMTD platform!!! In the attachments you will find the initial corpus with 2 PDFs and the generated output. Could you verify that it is meaningful?

chebiCorpusInput.zip
chebiCorpusOutput.zip

@dumitrescustefan
Author

dumitrescustefan commented May 3, 2018

@gkirtzou Yes, that's the output we should have 👍 I have left the temporary .conllu files in as a debugging aid in case something fails, like the out-of-RAM issue before, but for the final publication I will remove them. Thanks a lot for the help!

@gkirtzou

gkirtzou commented May 3, 2018

That's great news!! That means that we were able to successfully test your component!!! So we are done! The only thing that is left is to upload your component to the services, but we will let you know when to do that.
Please note that if you stop generating conllu files, you should also remove the respective dataFormat from the outputResourceInfo description in the metadata. I would suggest keeping them if they allow users to debug issues such as out-of-memory.

@dumitrescustefan
Author

No, we will leave the final .conllu and .xmi files untouched (so users get both txt and xml-type outputs). What I wanted to say was that I will remove the intermediary conllu file that precedes the parsing process: the file is always named temporary.conllu and exists only in the docker - I copy it out in the /output folder just to see that everything is ok up to that step.

Finally, I am unsure whether to ask in this thread or open a new issue: for the adapt courses should we use the test.openminted platform or wait for the non-test version? And a second question, for you, would be: for the testing process did you create an application? Or how did you perform the testing, as in the tutorial we should show how to run the component on a corpus.

Thanks!

@gkirtzou

gkirtzou commented May 3, 2018

No, we will leave the final .conllu and .xmi files untouched (so users get both txt and xml-type outputs). What I wanted to say was that I will remove the intermediary conllu file that precedes the parsing process: the file is always named temporary.conllu and exists only in the docker - I copy it out in the /output folder just to see that everything is ok up to that step.

Ahh, I see. Sorry, I misunderstood what you sent previously.

Finally, I am unsure whether to ask in this thread or open a new issue: for the adapt courses should we use the test.openminted platform or wait for the non-test version?

You will register your components on a non-test version of the platform. As soon as we are ready to proceed, we will let you know.

And a second question, for you, would be: for the testing process did you create an application? Or how did you perform the testing, as in the tutorial we should show how to run the component on a corpus.

I created a private app via the workflow editor, that contains the following components in that order :

  1. The omtdImporter, a component that fetches the data from the registry to the workflow engine
  2. A pdfReader, a component that takes pdf and generates xmi, with the pattern set to "**/*.pdf"
  3. Your component
    and I connected the components with the "noodle" functionality of Galaxy in order to show the flow. It's pretty simple. You can go to test.openminted.eu and try it yourself.

@gkirtzou gkirtzou mentioned this issue May 3, 2018
@gkirtzou

gkirtzou commented May 9, 2018

Dear @dumitrescustefan, you can now proceed to upload your component at https://services.openminted.eu/home

Some final suggestions for the metadata record, not obligatory but recommended:

  • Add in both input and output resource info the language(s) that your component can handle.
  • Add the resourceCreator, with yourself as contact, for citation reasons.

Please, when you upload your component, create the appropriate workflow so that someone could run your component using the workflow editor. For more info see https://openminted.github.io/releases/workflow-editor/
Please also let me know when you have uploaded your component to the production site.
If you encounter any problems, please let us know.
Thanks

@dumitrescustefan
Author

@gkirtzou Thanks! I added the languages and updated the label for the language parameter in a new (private) component. Please tell me under which section the resourceCreator is, so I can add it as well. As soon as I validate the component on the test server, I'll upload the xml to the services.

@gkirtzou

gkirtzou commented May 9, 2018

@dumitrescustefan when you edit the metadata of a registered component, there is the option "Add Resource Creation". You would find it under the Identification section.

@dumitrescustefan
Author

@gkirtzou I edited the component, and it's public on :
https://services.openminted.eu/landingPage/component/8f47d5d7-22d5-43e1-b790-ec4c44af0a68
Thanks!

@gkirtzou

@dumitrescustefan Thanks for uploading the component. Could you please create a public application as well, so that non-expert users could use it?

Note that when you create an application, you will be asked to fill in a metadata record. Here are some tips for filling it in, so that the application is discoverable by users and so that users can cite you and your resource.

  • Give the application a unique, human-readable name that is reasonably short
  • Give an explanatory description; remember you can re-use the descriptions of the component(s) in the workflow cumulatively
  • Add in the inputProcessingResourceInfo the information of the first component used in the workflow (the one after omtdImporter), and in the outputResourceInfo the information of the last component; if the annotations from previous components are retained in the final output, please add those as well
  • (optionally) You can use the relations set with the relation "hasPart" to document the components used in the workflow - it can be repeated multiple times.

If you encounter any problems, please let us know.
Thanks

@dumitrescustefan
Author

@gkirtzou I am first trying to create an app on the test. server. I edited the metadata with all the above pointers, landing page is:
https://test.openminted.eu/landingPage/application/0c07c957-b001-4bde-8347-b3eec6c89ecb
The app is private so I can edit the metadata, can you see it?

I tried to run it a couple of times, but it says "running" for some time (though for 2 pdfs it should finish in ~1 minute). Is that normal behaviour? Also, I tried editing the workflow, and the save button seems not to work (any changes I make are discarded). I think I need to check the output icon of the last component to make the dataset not hidden (which is the default), but I can't seem to save the changes.

@gkirtzou

The app is private so I can edit the metadata, can you see it?

I cannot see it since it is private. Could you please send me here the xml with the metadata from your app, so I could check them?

I tried to run it a couple of times, but it says "running" for some time (though for 2 pdfs it should finish in ~1 minute). Is that normal behaviour?

I checked the workflow engine and found three successful runs of your workflow. Did you get the final output in the UI? Each experiment took ~10 minutes.
A question: your workflow consists of omtdImport -> PDFReader -> UPFMT component, correct?

Also, I tried editing the workflow, and the save button seems not to work (any changes I make are discarded).

You mean that you made changes in the workflow editor and those changes were not saved when you reopened the app in the workflow editor?

I think I need to check the output icon of the last component to make the dataset not hidden (which is the default), but I can't seem to save the changes.

No you don't need to do this. In fact I think that it should not be available as an option. Right @greenwoodma ?

@dumitrescustefan
Author

Here is the xml zipped:
0c07c957-b001-4bde-8347-b3eec6c89ecb.zip

Regarding the workflow, I have the omtdImporter linked to the PdfReader then to the UPFMT component (the last version, updated one), exactly as you specified. However, on the UI, I get three "Running" tasks.

Lastly, regarding the workflow editor, when pressing Save it does not save the components' x y positions on the flow (I know it is just cosmetic, but it's a hint that re-saving does not work), and also the editor does not allow me to view or edit any component, like changing the pattern for the pdf reader, etc. I tried with both Firefox and Chrome (latest versions) in case it was a browser problem, but they both have the same behavior. Anyway, I brought this up as I believed that the "output" check is what kept the app from completing.

@greenwoodma
Member

@dumitrescustefan yes, the inability to change parameter values on a workflow that you have previously created is a known bug which we are looking into (it's a bug in Galaxy which they are investigating). Currently the only option is to remove the component you want to edit and then re-add it. Sorry about that, I'm aware just how annoying that specific bug is.

The hidden status of the dataset should have no impact on the workflow, as all that does is hide the output in the galaxy UI which you are not using to access the results. The OpenMinTeD platform retrieves the results via the Galaxy API for you and this is not affected by the hidden status.

@gkirtzou

@dumitrescustefan I checked the application metadata and I have the following comments/suggestions:

  • You could add yourself as a resourceCreator for provenance and citation reasons
  • Concerning the relation with the UPFMT component: you have declared resourceIdentifierSchemeName="URL", so the schemeURI should not be filled in. schemeURI must be filled in only when you declare resourceIdentifierSchemeName="other". That's the logic behind this. Unfortunately, this is not reflected yet in the editor. My suggestion here is to drop the schemeURI. Or, even better, you can declare resourceIdentifierSchemeName="OMTD" (I think in the editor you see the option OpenMinTeD Id), and as the value you need to paste the part after the last / of the landing page URL. For example, in your case, if the component has the landing page https://services.openminted.eu/landingPage/component/8f47d5d7-22d5-43e1-b790-ec4c44af0a68, the required value would be 8f47d5d7-22d5-43e1-b790-ec4c44af0a68.
  • Concerning the relation with the PDF reader: I noticed that we don't have a public landing page for it. I would suggest removing the relation, since the information you added is confusing (I imagine that you got the info from the workflow editor, which is not aligned with what the registry holds and uses to recognize the components). I will check whether we can add metadata for the PDF reader, so that people can add it in the relations as I suggested for the UPFMT component.
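
The OpenMinTeD Id mentioned in the second point is simply the last path segment of the landing page URL. A minimal sketch of that extraction, assuming a hypothetical helper name omtd_id_from_landing_page:

```python
from urllib.parse import urlparse


def omtd_id_from_landing_page(url: str) -> str:
    """Return the last non-empty path segment of a landing page URL."""
    path = urlparse(url).path.rstrip("/")  # tolerate a trailing slash
    return path.rsplit("/", 1)[-1]


if __name__ == "__main__":
    print(omtd_id_from_landing_page(
        "https://services.openminted.eu/landingPage/component/"
        "8f47d5d7-22d5-43e1-b790-ec4c44af0a68"
    ))
```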

One question: is your application in service or in test? I thought you were playing in test first, but in the metadata you are using a link to the UPFMT component registered in service.

@dumitrescustefan
Author

Okay, thank you.
I added the resourceCreator info.
I changed the relation for the component to the OpenMinTeD Id. I also removed the PDF Reader relation for now. Even if everything is in test., I added the ID for the UPFMT that's in the services. server because it's one less thing to remember when I move everything to services.

I would like to try to add the app in the services. server, but for some reason I can't find the UPFMT component in the workflow editor.
Here's the landing page for the UPFMT component:
https://services.openminted.eu/landingPage/component/8f47d5d7-22d5-43e1-b790-ec4c44af0a68
Shouldn't the component be visible there? It's visible in the test. workflow, but not in services, though the component is available (& public) in both.

@gkirtzou

I added the ID for the UPFMT that's in the services. server because it's one less thing to remember when I move everything to services.

Ok no problem. Could you send me the metadata, just to be sure?

Shouldn't the component be visible there?

Yes, I will check what is going on and let you know.

@dumitrescustefan
Author

Here is the zip with the app's metadata:
0c07c957-b001-4bde-8347-b3eec6c89ecb (2).zip

@dumitrescustefan
Author

I forgot to pretty-print it, here it is again:
0c07c957-b001-4bde-8347-b3eec6c89ecb (2).zip

@gkirtzou

A few comments, minor corrections:

  • In publicationIdentifier, since publicationIdentifierSchemeName="URL", there is no need for schemeURI; you can remove it. The same logic as I mentioned earlier applies here as well
  • In resourceCreator.personIdentifier you can change the personIdentifierSchemeName from "other" to "URL"; I think it is more appropriate

I will let you know when I figure out what's going on with your component.

@dumitrescustefan
Author

Hi, for the personIdentifier I can choose between: ORCID, ISNI, ResearcherID, ScopusID and other. There's no URL, otherwise I would have chosen that. Are any of these better than "other"?

@gkirtzou

@dumitrescustefan You are right, there is no URL in the person identifier scheme names. I got confused with the generic one that we have in the metadata schema; I am sorry for that.

If you have an ORCID, that would be nice to add. If you want to use the URL of your LinkedIn page, then the "other" value is more appropriate.

@gkirtzou

@dumitrescustefan We figured out what went wrong and we are trying to fix it. I will let you know as soon as we are good to go. Sorry about the trouble.

@dumitrescustefan
Author

No problems, I'm standing by. Thank you very much!

@gkirtzou

gkirtzou commented Jun 1, 2018

@dumitrescustefan We finally resolved the problem we had with the component's registration. I took the liberty of creating an application to make your component available to the non-TDM users of the OMTD platform. You can find the application here:

https://services.openminted.eu/landingPage/application/6aea5b89-e857-4c47-b111-81c441e7a741

The app runs correctly. Since everything works perfectly and nothing else remains open, I am closing the issue.

Cheers!

@gkirtzou gkirtzou closed this as completed Jun 1, 2018

7 participants