Merged
24 commits
c502b2f
Update Azure Translation character limits. Add NLP TextSplitting Mode…
hhuangMITRE Apr 3, 2024
434f1ca
Code refactor and tooltip update.
hhuangMITRE Apr 4, 2024
303048f
Minor tooltip update.
hhuangMITRE Apr 4, 2024
192fb29
Minor tooltip update.
hhuangMITRE Apr 4, 2024
a74b919
Update edge case for testing text splits.
hhuangMITRE Apr 4, 2024
4c6fd7a
Improve formatting.
jrobble Apr 9, 2024
c786612
Merge remote-tracking branch 'origin' into hhuang/azure-sentence-spli…
hhuangMITRE Apr 12, 2024
ba61756
Tooltip update, and adding additional WtP text splitter support.
hhuangMITRE Apr 12, 2024
bf20678
Merge branch 'develop' into hhuang/azure-sentence-split-and-char-limit
hhuangMITRE Apr 12, 2024
7438731
Tooltip update, and adding additional WtP text splitter support.
hhuangMITRE Apr 12, 2024
983dce7
Tooltip update.
hhuangMITRE Apr 12, 2024
dda479b
Tooltip updates and PyTorch cuda build.
hhuangMITRE Apr 16, 2024
23ae3b3
Tooltip update.
hhuangMITRE Apr 16, 2024
ddad147
Add minor check.
hhuangMITRE Apr 16, 2024
13ee66e
Improved stage builds for gpu/cpu options.
hhuangMITRE Apr 16, 2024
16c7b51
Submitting tested changes for Docker build.
hhuangMITRE Apr 17, 2024
9657eee
Final changes (cleanup + test cpu/gpu staged builds).
hhuangMITRE Apr 17, 2024
d6e59eb
Tooltip update.
hhuangMITRE Apr 18, 2024
28b641b
Set NVIDIA environment variables in Docker image.
hhuangMITRE Apr 18, 2024
cf4375f
Toggling GPU mode (final check).
hhuangMITRE Apr 18, 2024
1a7bad1
GPU mode passed, reverting change back to CPU.
hhuangMITRE Apr 18, 2024
9d4dfcb
Minor Copyright Date Update.
hhuangMITRE Apr 18, 2024
ed91663
Additonal adjustments, test fix.
hhuangMITRE Apr 19, 2024
487b7e7
Tooltip update.
hhuangMITRE Apr 19, 2024
44 changes: 40 additions & 4 deletions python/AzureTranslation/Dockerfile
@@ -7,11 +7,11 @@
# under contract, and is subject to the Rights in Data-General Clause #
# 52.227-14, Alt. IV (DEC 2007). #
# #
# Copyright 2023 The MITRE Corporation. All Rights Reserved. #
# Copyright 2024 The MITRE Corporation. All Rights Reserved. #
#############################################################################

#############################################################################
# Copyright 2023 The MITRE Corporation #
# Copyright 2024 The MITRE Corporation #
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #
@@ -28,17 +28,53 @@

ARG BUILD_REGISTRY
ARG BUILD_TAG=latest
FROM ${BUILD_REGISTRY}openmpf_python_executor_ssb:${BUILD_TAG}

ARG RUN_TESTS=false
# To enable GPU resources, change the line below to BUILD_TYPE=gpu.
ARG BUILD_TYPE=cpu

FROM ${BUILD_REGISTRY}openmpf_python_executor_ssb:${BUILD_TAG} as download_python_packages


RUN pip install --no-cache-dir langcodes

RUN apt-get update && \
apt-get install -y git git-lfs && \
git lfs install && \
rm -rf /var/lib/apt/lists/*

# Install WtP and spaCy
RUN pip install --upgrade pip && \
pip install "spacy>=3.7.4" && \
pip install "wtpsplit>=1.3.0"

# Modify to add downloads for other models of interest.
RUN mkdir /wtp_models && cd /wtp_models && \
git clone https://huggingface.co/benjamin/wtp-bert-mini && \
python3 -m spacy download xx_sent_ud_sm

########################################################################
FROM download_python_packages as cpu_component_build
RUN pip install torch --index-url https://download.pytorch.org/whl/cpu

########################################################################
FROM download_python_packages as gpu_component_build

# Environment variables required by nvidia runtime.
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

RUN pip install torch

########################################################################

FROM ${BUILD_TYPE}_component_build as component_final
# Re-declare RUN_TESTS here: an ARG defined before the first FROM is not
# visible inside a build stage unless it is declared again.
ARG RUN_TESTS

RUN --mount=target=.,readwrite \
install-component.sh; \
if [ "${RUN_TESTS,,}" == true ]; then python tests/test_acs_translation.py; fi


LABEL org.label-schema.license="Apache 2.0" \
org.label-schema.name="OpenMPF Azure Translation" \
org.label-schema.schema-version="1.0" \
74 changes: 74 additions & 0 deletions python/AzureTranslation/LICENSE
@@ -0,0 +1,74 @@
/*****************************************************************************
* Copyright 2024 The MITRE Corporation *
* *
* Licensed under the Apache License, Version 2.0 (the "License"); *
* you may not use this file except in compliance with the License. *
* You may obtain a copy of the License at *
* *
* http://www.apache.org/licenses/LICENSE-2.0 *
* *
* Unless required by applicable law or agreed to in writing, software *
* distributed under the License is distributed on an "AS IS" BASIS, *
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
* See the License for the specific language governing permissions and *
* limitations under the License. *
******************************************************************************/

This project contains content developed by The MITRE Corporation. If this code
is used in a deployment or embedded within another project, it is requested
that you send an email to opensource@mitre.org in order to let us know where
this software is being used.

*****************************************************************************

The WtP, "Where's the Point", sentence segmentation library falls under the MIT License:

https://github.com/bminixhofer/wtpsplit/blob/main/LICENSE

MIT License

Copyright (c) 2024 Benjamin Minixhofer

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

*****************************************************************************

The spaCy Natural Language Processing library falls under the MIT License:

The MIT License (MIT)

Copyright (C) 2016-2024 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
70 changes: 62 additions & 8 deletions python/AzureTranslation/README.md
@@ -35,25 +35,25 @@ must be provided. Neither has a default value.
`https://<custom-translate-host>/translator/text/v3.0`. The URL should
not end with `/translate` because two separate endpoints are
used. `ACS_URL + '/translate'` is used for translation.
`ACS_URL + '/breaksentence'` is used to break up text when it is too long
for a single translation request. This property can also be configured
using an environment variable named `MPF_PROP_ACS_URL`.
This property can also be configured using an environment variable
named `MPF_PROP_ACS_URL`.

- `ACS_SUBSCRIPTION_KEY`: A string containing your Azure Cognitive Services
subscription key. To get one you will need to create an
Azure Cognitive Services account. This property can also be configured
using an environment variable named `MPF_PROP_ACS_SUBSCRIPTION_KEY`.


# Important Job Properties:
- `TO_LANGUAGE`: The BCP-47 language code for language that the properties
# Primary Job Properties
- `TO_LANGUAGE`: The BCP-47 language code for the language that the properties
should be translated to.

- `FEED_FORWARD_PROP_TO_PROCESS`: Comma-separated list of property names indicating
which properties in the feed-forward track or detection to consider
translating. For example, `TEXT,TRANSCRIPT`. If the first property listed is
present, then that property will be translated. If it's not, then the next
property in the list is considered. At most, one property will be translated.

- `FROM_LANGUAGE`: In most cases, this property should not be used. It should
only be used when automatic language detection is detecting the wrong
language: Users can provide a BCP-47 code to force the translation service
@@ -78,9 +78,63 @@ must be provided. Neither has a default value.
to identify the source language of the incoming text.
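
The fallback order described for `FEED_FORWARD_PROP_TO_PROCESS` can be sketched in a few lines of Python. This is a minimal illustration rather than the component's actual code: the function name `pick_property_to_translate` is hypothetical, and treating a blank value as "not present" is an assumption.

```python
from typing import Mapping, Optional, Tuple


def pick_property_to_translate(
    prop_names_csv: str, detection_properties: Mapping[str, str]
) -> Optional[Tuple[str, str]]:
    """Return (name, text) for the first listed property that has text.

    Hypothetical helper: properties are checked in the order they are listed,
    and at most one is returned. Blank values count as missing (assumption).
    """
    for name in (n.strip() for n in prop_names_csv.split(",")):
        text = detection_properties.get(name, "")
        if name and text.strip():
            return name, text
    return None


# TEXT is absent here, so the next listed property (TRANSCRIPT) is chosen.
props = {"TRANSCRIPT": "Hola, mundo.", "OTHER": "not considered"}
print(pick_property_to_translate("TEXT,TRANSCRIPT", props))
```

Returning a single pair keeps the "at most one property" rule explicit in the function's type.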


# Text Splitter Job Properties
The following settings control how input text is divided into acceptable chunks
for processing.

Through preliminary investigation, we identified the [WtP library ("Where's the
Point")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual sentence
detection model](https://spacy.io/models) as suitable tools for identifying sentence
breaks in a large section of text.

WtP models are trained to split multilingual text into sentences without needing an
input language tag. The disadvantage is that the most accurate WtP models need ~3.5
GB of GPU memory. spaCy, on the other hand, has a single multilingual sentence detection
model that appears to work better for splitting English text in certain cases.
Unfortunately, this model lacks support for handling Chinese punctuation.
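
As a rough illustration of dividing text into chunks at sentence boundaries, the sketch below groups sentences greedily under a character limit. It is not the component's implementation: `naive_sentence_split` is a regex stand-in for a WtP or spaCy model, and the names `split_into_chunks` and `max_chars` are hypothetical.

```python
import re
from typing import Callable, List


def naive_sentence_split(text: str) -> List[str]:
    """Regex stand-in for a real sentence model (assumption, not WtP/spaCy).

    Breaks after ., !, ? or their full-width forms and keeps trailing
    whitespace, so joining the pieces reproduces the input exactly.
    """
    return re.findall(r".+?(?:[.!?。！？]+\s*|$)", text, flags=re.S)


def split_into_chunks(
    text: str,
    max_chars: int,
    split_sentences: Callable[[str], List[str]] = naive_sentence_split,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new chunk when this sentence would overflow the current one.
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", max_chars=10))
```

Passing the splitter as a parameter means a model-backed function could replace the regex stand-in without changing the packing logic.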

- `SENTENCE_MODEL`: Specifies the desired WtP or spaCy sentence detection model. For CPU
and runtime considerations, the author of WtP recommends using `wtp-bert-mini`. More
advanced WtP models that use GPU resources (up to ~8 GB) are also available. See the list
of WtP model names
[here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The
only available spaCy model (for text with unknown language) is `xx_sent_ud_sm`.

Review the list of languages supported by WtP
[here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).
Review the models and languages supported by spaCy [here](https://spacy.io/models).

- `SENTENCE_SPLITTER_CHAR_COUNT`: Specifies the maximum number of characters to process
through the sentence/text splitter. Defaults to 500 characters, as only a subsection of
the text needs to be processed to determine an appropriate split. (See the discussion of
potential character lengths
[here](https://discourse.mozilla.org/t/proposal-sentences-lenght-limit-from-14-words-to-100-characters).)

- `SENTENCE_SPLITTER_INCLUDE_INPUT_LANG`: Specifies whether to pass the input language to
the sentence splitter algorithm. Currently, only WtP supports model threshold adjustments
by input language.

- `SENTENCE_MODEL_CPU_ONLY`: If set to TRUE, only use CPU resources for the sentence
detection model. If set to FALSE, allow the sentence model to also use GPU resources.
For most runs using the spaCy `xx_sent_ud_sm` or WtP `wtp-bert-mini` models, GPU resources
are not required. If using more advanced WtP models like `wtp-canine-s-12l`,
it is recommended to set `SENTENCE_MODEL_CPU_ONLY=FALSE` to improve performance.
That model can use up to ~3.5 GB of GPU memory.

Please note, to fully enable this option, you must also rebuild the Docker image
with the following change: within the Dockerfile, set `ARG BUILD_TYPE=gpu`.
Otherwise, PyTorch will be installed without CUDA support and the
component will always default to CPU processing.

- `SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE`: More advanced WtP models
require a target language. This property sets the default language to use for
sentence splitting, and is overridden whenever `FROM_LANGUAGE`, `SUGGESTED_FROM_LANGUAGE`,
or Azure language detection returns a different, WtP-supported language option.
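
The interaction between `SENTENCE_MODEL_CPU_ONLY` and the Docker build type can be pictured with the sketch below. It is an assumption about how such a check might look, not the component's code: `choose_device` is a hypothetical helper, and it falls back to the CPU whenever PyTorch is missing or was installed without CUDA support.

```python
def choose_device(cpu_only: bool) -> str:
    """Return "cuda" only when GPU use is allowed and CUDA is actually usable.

    Hypothetical helper. With a CPU-only PyTorch build (BUILD_TYPE=cpu),
    torch.cuda.is_available() is False, so the result is always "cpu".
    """
    if cpu_only:
        return "cpu"
    try:
        # Imported lazily so this sketch also runs where PyTorch is absent.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(choose_device(cpu_only=True))
```

Under this sketch, setting `SENTENCE_MODEL_CPU_ONLY=FALSE` changes nothing unless the image was also built with `BUILD_TYPE=gpu`.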


# Listing Supported Languages
To list the supported languages replace `${ACS_URL}` and
`${ACS_SUBSCRIPTION_KEY}` in the following command and run it:
To list the supported languages replace `${ACS_URL}` and `${ACS_SUBSCRIPTION_KEY}` in the
following command and run it:
```shell script
curl -H "Ocp-Apim-Subscription-Key: ${ACS_SUBSCRIPTION_KEY}" "https://${ACS_URL}/languages?api-version=3.0&scope=translation"
```
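
For callers who prefer Python to `curl`, the same request can be sketched with the standard library. The function names are hypothetical, and the sketch assumes `acs_url` already includes the scheme, as in the example `https://<custom-translate-host>/translator/text/v3.0` given for `ACS_URL`.

```python
import json
import urllib.parse
import urllib.request


def build_languages_request(acs_url: str, subscription_key: str) -> urllib.request.Request:
    """Build the GET request for the /languages endpoint (hypothetical helper)."""
    query = urllib.parse.urlencode({"api-version": "3.0", "scope": "translation"})
    url = f"{acs_url.rstrip('/')}/languages?{query}"
    return urllib.request.Request(
        url, headers={"Ocp-Apim-Subscription-Key": subscription_key}
    )


def list_supported_languages(acs_url: str, subscription_key: str) -> dict:
    """Send the request and return the decoded JSON response body."""
    request = build_languages_request(acs_url, subscription_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `list_supported_languages(acs_url, key)` with real values performs the network request; building the request separately makes the URL and header easy to inspect first.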
4 changes: 2 additions & 2 deletions python/AzureTranslation/acs_translation_component/__init__.py
@@ -5,11 +5,11 @@
# under contract, and is subject to the Rights in Data-General Clause #
# 52.227-14, Alt. IV (DEC 2007). #
# #
# Copyright 2023 The MITRE Corporation. All Rights Reserved. #
# Copyright 2024 The MITRE Corporation. All Rights Reserved. #
#############################################################################

#############################################################################
# Copyright 2023 The MITRE Corporation #
# Copyright 2024 The MITRE Corporation #
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #