Releases: modernmt/DataCollection
Final ModernMT DataCollection release
Final release of the baseline parallel data collection pipeline for the ModernMT project.
The pipeline is documented in the README and in the documents linked from it.
Changes since the initial public release 0.1.0:
- Ensured that the pipeline can be run independently of any ModernMT project infrastructure (we deployed and tested it in the Amazon Web Services us-east region, where the Common Crawl data is hosted)
- Added support for Spanish, Portuguese, Dutch and Russian
- Documentation updates
- Bug fixes
- Documented known issues, limitations and enhancement ideas in the issue tracker
Index files for the 2016_50 Common Crawl for the language pairs en→pt, en→nl and en→ru are attached as compressed files. They omit page pairs that were already present in the 2015_32 indices attached to release 0.1.0. The index files are licensed under a Creative Commons Attribution 4.0 International License.
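The exclusion of page pairs already present in the earlier indices can be pictured as a simple set difference. The following is only a minimal sketch: it assumes one page pair per line in plain text once an index is decompressed, which this release note does not confirm, and `new_page_pairs` is a hypothetical helper name, not part of the pipeline.

```python
from typing import Iterable, Iterator, Set


def new_page_pairs(earlier: Iterable[str], later: Iterable[str]) -> Iterator[str]:
    """Yield entries from `later` that do not appear in `earlier`.

    Assumption: each input line describes exactly one page pair.
    """
    # Collect every entry of the earlier index for constant-time lookups.
    seen: Set[str] = {line.rstrip("\n") for line in earlier}
    for line in later:
        entry = line.rstrip("\n")
        if entry not in seen:
            seen.add(entry)  # also drops duplicates within `later`
            yield entry
```

Passing two open text files (for example, ones opened after decompression) streams the later index while holding only the earlier one in memory.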
Initial public release
Initial public release of the baseline parallel data collection pipeline.
The pipeline is documented in the README and in the documents linked from it.
Phase 1 of the pipeline is an alpha release; Phase 2 is in beta.
Index files for the 2015_32 Common Crawl for the language pairs en↔it, en↔fr, en↔de, en↔es, en↔pt, en↔nl and en↔ru are attached as compressed files. These index files are licensed under a Creative Commons Attribution 4.0 International License.