#3918 - PubMed Central support
- Added documentation for PMC external repo
- Added documentation for BioC format
reckart committed May 23, 2023
1 parent bad80bc commit 2903d83
Showing 6 changed files with 119 additions and 4 deletions.
@@ -189,6 +189,8 @@ include::{include-dir}external-search-repos-solr.adoc[leveloffset=+3]

include::{include-dir}external-search-repos-pubannotation.adoc[leveloffset=+3]

include::{include-dir}external-search-repos-pmc.adoc[leveloffset=+3]

<<<

include::{include-dir}constraints.adoc[leveloffset=+1]
@@ -239,6 +241,11 @@ data in a particular format. The **feature flag** column shows which flags you c
|====
| Format | Remote API format ID | Feature flag

| <<sect_formats_bioc>>
| `bioc`
| `format.bioc.enabled`


| <<sect_formats_conll2000>>
| `conll2000`
| `format.conll2000.enabled`
@@ -361,6 +368,8 @@ data in a particular format. The **feature flag** column shows which flags you c
|====


include::{include-dir}formats-bioc.adoc[leveloffset=+2]

include::{include-dir}formats-conll2000.adoc[leveloffset=+2]

include::{include-dir}formats-conll2002.adoc[leveloffset=+2]
@@ -1,3 +1,19 @@
// Licensed to the Technische Universität Darmstadt under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The Technische Universität Darmstadt
// licenses this file to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

[[sect_external-search-repos-pubannotation]]
= PubAnnotation

@@ -35,7 +35,7 @@
@Configuration
@AutoConfigureAfter({ ExternalSearchAutoConfiguration.class, BioCAutoConfiguration.class,
PubMedServicesAutoConfiguration.class })
- @ConditionalOnProperty(prefix = "external-search.pubmed", //
+ @ConditionalOnProperty(prefix = "external-search.pmc", //
name = "enabled", havingValue = "true", matchIfMissing = false)
@ConditionalOnBean({ ExternalSearchService.class, BioCFormatSupport.class })
public class PubMedDocumentRepositoryAutoConfiguration
@@ -0,0 +1,31 @@
// Licensed to the Technische Universität Darmstadt under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The Technische Universität Darmstadt
// licenses this file to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

[[sect_external-search-repos-pmc]]
= PubMed Central

====
CAUTION: Experimental feature. To use this functionality, you need to enable it first by adding `external-search.pmc.enabled=true` to the `settings.properties` file (see the <<admin-guide.adoc#sect_settings, Admin Guide>>). You should also add `format.bioc.enabled=true` to enable
support for the BioC format used by this repository connector.
====
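
As a minimal sketch, the two settings named in the caution above could sit together in `settings.properties` like this. Only the two property names come from the text; the comments are illustrative.

```properties
# Enable the experimental PubMed Central repository connector
external-search.pmc.enabled=true
# Enable the BioC format support that the connector relies on
format.bioc.enabled=true
```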

link:https://www.ncbi.nlm.nih.gov/pmc/[PubMed Central]® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). It can be added as an external document repository by
selecting the **PubMed Central** repository type.

NOTE: {product-name} uses the BioC version of the PMC documents for import. The search tries to
consider only documents that have full text available, but the BioC version of these texts may
become available only after a delay. Thus, if you cannot import a recently uploaded document from
PMC into {product-name}, try again a day later.
@@ -0,0 +1,61 @@
// Licensed to the Technische Universität Darmstadt under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The Technische Universität Darmstadt
// licenses this file to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

[[sect_formats_bioc]]
= BioC (experimental)

====
CAUTION: Experimental feature. To use this functionality, you need to enable it first by adding `format.bioc.enabled=true` to the `settings.properties` file (see the <<admin-guide.adoc#sect_settings, Admin Guide>>).
====

Support for the BioC format is new and still experimental.

* Sentence information is supported
* If sentences are present in a BioC document, they are imported. Otherwise, {product-name} will
automatically try to determine sentence boundaries.
* On export, the BioC files are always created with sentence information.
* Passages are imported as `Div` annotations and the passage `type` infon is set as the `type`
feature on these `Div` annotations.
* When reading span or relation annotations, the `type` infon is used to look up a suitable
annotation layer. If a layer exists whose full technical name or simple technical name (the part
after the last dot) matches the type, then an attempt will be made to match the annotation to
that layer. If the annotation has other infons that match features on that layer, they will also
be matched. If no layer matches but the default `SimpleSpan` layer is present, annotations will
be matched to that. Similarly, if only a single infon is present in an annotation and no other
feature matches, then the infon value may be matched to a `value` feature, if one exists.
* When exporting annotations, the `type` infon will always be set to the full layer name and
features will be serialized to infons matching their names.
* If a document has not been imported from a BioC file containing passages and does not contain
`Div` annotations from any other source either, then on export a single passage containing the
entire document is created.
* Cross-passage relations are not supported.
* Passage-level infons are not supported.
* Document-level infons are not supported.
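
The layer lookup described in the list above can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the actual {product-name} implementation: the class `LayerLookupSketch`, its methods, and the layer names used in `main` are hypothetical, and layers are represented as plain strings holding their technical names.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: not part of the INCEpTION code base.
public class LayerLookupSketch
{
    /** Returns the part of a technical layer name after the last dot. */
    public static String simpleName(String fullName)
    {
        // lastIndexOf returns -1 if there is no dot, so the whole name is returned then
        return fullName.substring(fullName.lastIndexOf('.') + 1);
    }

    /**
     * Looks up a layer for a BioC "type" infon value: first by full or simple technical
     * name, then by falling back to a layer whose simple name is "SimpleSpan".
     */
    public static Optional<String> findLayer(List<String> layerNames, String typeInfon)
    {
        Optional<String> direct = layerNames.stream()
                .filter(name -> name.equals(typeInfon) || simpleName(name).equals(typeInfon))
                .findFirst();
        if (direct.isPresent()) {
            return direct;
        }

        // No layer matches the type: use the SimpleSpan layer if the project has one
        return layerNames.stream()
                .filter(name -> simpleName(name).equals("SimpleSpan"))
                .findFirst();
    }

    public static void main(String[] args)
    {
        // Made-up layer names for illustration only
        List<String> layers = List.of("webanno.custom.Gene", "webanno.custom.SimpleSpan");
        System.out.println(findLayer(layers, "Gene"));
        System.out.println(findLayer(layers, "Protein"));
    }
}
```

In this sketch, a `type` infon of `Gene` resolves to the layer with that simple name, while an unknown type such as `Protein` falls back to the `SimpleSpan` layer. Matching further infons to layer features is left out to keep the example short.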


[cols="2,1,1,1,3"]
|====
| Format | Read | Write | Custom Layers | Description

| link:https://raw.githubusercontent.com/2mh/PyBioC/master/BioC.dtd[BioC (experimental)] (`bioc`)
| yes
| yes
| yes
| BioC format

|====
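
To make the passage `type` infon more concrete, the following sketch parses a small BioC document with the JDK's DOM parser and collects the `type` infon of each passage, which is the value that ends up in the `type` feature of a `Div` annotation on import. The sample document and the class `PassageTypeSketch` are made up for illustration; the actual reader in {product-name} is not shown here and may work differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: not part of the INCEpTION code base.
public class PassageTypeSketch
{
    // Made-up sample data following the element names of the BioC DTD
    public static final String SAMPLE = "<collection>"
            + "<source>example</source><date>2023-05-23</date><key>example.key</key>"
            + "<document><id>1</id>"
            + "<passage><infon key=\"type\">title</infon><offset>0</offset>"
            + "<text>Sentence 1.</text></passage>"
            + "<passage><infon key=\"type\">abstract</infon><offset>12</offset>"
            + "<text>Sentence 2.</text></passage>"
            + "</document></collection>";

    /** Collects the value of the "type" infon of every passage, in document order. */
    public static List<String> passageTypes(String xml) throws Exception
    {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<String> types = new ArrayList<>();
        NodeList passages = doc.getElementsByTagName("passage");
        for (int i = 0; i < passages.getLength(); i++) {
            Element passage = (Element) passages.item(i);
            NodeList infons = passage.getElementsByTagName("infon");
            for (int j = 0; j < infons.getLength(); j++) {
                Element infon = (Element) infons.item(j);
                // Only consider infons directly on the passage, not on nested elements
                if (infon.getParentNode() == passage
                        && "type".equals(infon.getAttribute("key"))) {
                    types.add(infon.getTextContent());
                }
            }
        }
        return types;
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(passageTypes(SAMPLE));
    }
}
```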

@@ -17,7 +17,6 @@
*/
package de.tudarmstadt.ukp.inception.io.bioc;

- import static org.apache.uima.cas.SerialFormat.XMI_PRETTY;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReader;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.iteratePipeline;
@@ -26,7 +25,6 @@
import java.util.ArrayList;

import org.apache.uima.fit.factory.CasFactory;
- import org.apache.uima.util.CasIOUtils;
import org.junit.jupiter.api.Test;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence;
@@ -46,7 +44,7 @@ void testRead() throws Exception
var cas = CasFactory.createCas();
reader.getNext(cas);

- CasIOUtils.save(cas, System.out, XMI_PRETTY);
+ // CasIOUtils.save(cas, System.out, XMI_PRETTY);

assertThat(cas.getDocumentText()) //
.contains("Sentence 1.") //