This repository has been archived by the owner on Mar 11, 2021. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 2
Update marian with binary model loading #11
Closed
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This first commit imports files from mts which was repurposed for bergamot translator from https://github.com/browsermt/mts/tree/nuke.
Modifications to SentencePiece are necessary to provide token level string_views. This commit changes marian to an alternate branch which has the feature incorporated.
CMakeLists have been modified with the necessary includes to add browsermt/mts@nuke files to the bergamot-translator library. In addition, adds the ssplit dependency, corresponding includes. Intel MKL fails on compilation, unable to find libraries. To solve this 3rd_party/CMakeLists.txt is modified with @UG's fixes to propogate variables (EXT_LIBS, etc) at a library level.
A faster linesplitter added for benchmarks is removed in favour of @UG's ssplit-cpp. NOTE: ssplit-cpp's regex based implementation is slow for one-line parses, which ideally needs to be improved in upstream ssplit-cpp to trivially reduce to a faster newline character based split.
Commit modifies the example test-code main-mts into the app folder, updating CMakeLists accordingly.
Removed Alignments, too many questions and no concrete answers. Better off removing unused code. History is kept for now, for internal use.
Vocabs was earlier loaded in each thread and copied several times. Modified this to be loaded only once in Service and reference used consistently later on. This change makes Tokenizer as a class rather moot, as there's only one private member and a function. Moved this into TextProcessor. SentenceSplitter, however remains a separate class. utils.{h,cpp} had only a single loadVocabularies function, which is at the moment required only in Service. Making loadVocabularies a function inside Service and getting rid of utils.*.
- Truncating long sentences into those of a specified length for faster processing is now a separate function, for improved readability. - Changes doing push_back -> emplace_back at places to avoid copy. - query_to_segments is renamed as process. - Comments are added in an attempt to bring some sanity.
Only the bergamot-translator library should be linked to main target Any other library (marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a pcre.a) should be linked to bergamot-translator target inside src/translator folder.
Enables Mac and Ubuntu CPU only builds through GitHub CI. CI scripts are copied from marian-dev with necessary changes. 3rd-party/marian-dev is modified to meet C++17 requirements modifying for half_float.
Using std::string for config. Now capable of launching marian translator through API interface. There's a sketchy workaround to convert a string config to marian::Options, with an added note.
- Provide yaml formatted string as model configuration - Remove redundant files
- Print original and translated text - Just add 2 vector entries for texts
Requirement for string_view is the original source string be transferred all the way from input to service to back to TranslationResult. This constraint was violated in several places by means of existence of a copy-constructor. The issue is fixed by deleting copy and assignment constructors in marian::bergamot::TranslationResult and UnifiedAPI::TranslationResult, which demonstrated a few occurances of the same. Replaced the same with move semantics. In addition, future is set and get using move semantics at the moment. Default move-constructor didn't seem to be working, so they're made explicit for TranslationResults. This commit additionally packs a few deletions and improvements made to improve structure (textops.cpp, batcher.cpp) along the process of inspecting and fixing the garbled outputs. They are choose to be kept, in the interest of time, against a prettified atomic commit engineering. Combinations of the following commits in jp/string-view-bug [acfc92 78a588 12d91b 00a277 919e2f 9d3a46 b7e39b 18f67b bf667c]
- WORMHOLE cmake option is set to ON when compiling for WASM - WASM module might not run on Chrome
This reverts commit 3dd7a60.
Turn of assertions and disable exception catching for wasm builds
- Includes try/catch free builds - Has ASSERTION=0 and DISABLE_EXCEPTION_CATCHING=1 for wasm builds
- This combination gives min inference time (~ 200 WPS) on local machine
…he upcoming release
- Clears up the spaghetti of model packaging - Usage instructions - Formatting changes
- PACKAGE_DIR cmake option can now accept relative paths
- Now things are consistent with the top level README instructions that suggest to build in "build-wasm" folder
- Clarified that the Demo and API usage section assumes bergamot models were packaged into wasm binary - Formatting changes
WASM CircleCI builds
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.