janissl/bsa-wrapper

bsa-wrapper

A set of scripts to build parallel corpora using Bilingual Sentence Aligner from Microsoft (by R.C.Moore)


Usage

File system structure:

${corpus_title}
|-- source
|   |-- ${title}_${source_lang}.snt
|   |-- ${title}_${target_lang}.snt
|-- work
|   |-- ${source_lang}-${target_lang}
|       |-- ${title}_${source_lang}.snt
|       |-- ${title}_${target_lang}.snt
|       |-- ${title}_${source_lang}.snt.aligned
|       |-- ${title}_${target_lang}.snt.aligned
|-- aligned_idx
|   |-- ${source_lang}-${target_lang}
|       |-- ${title}.${source_lang}.idx
|       |-- ${title}.${target_lang}.idx
|-- result
    |-- ${corpus_title}.${source_lang}-${target_lang}.${source_lang}
    |-- ${corpus_title}.${source_lang}-${target_lang}.${target_lang}
    |-- ${corpus_title}.unique.${source_lang}-${target_lang}.${source_lang}
    |-- ${corpus_title}.unique.${source_lang}-${target_lang}.${target_lang}
  • Additional Python dependency: PyYAML. If it is not installed, run python -m pip install PyYAML.
  • A Perl interpreter must also be installed on your machine.
  • Before running the shell script, put your source files in the ${corpus_title}/source directory.
  • The content of source files must be segmented in sentences (one sentence per line).
  • Filenames of input files must have the following pattern: ${title}_${lang}.snt (e.g. document_en.snt).
  • Parallel files must have identical titles (e.g. article_001_en.snt, article_001_fr.snt).
  • The YAML file specifies two source data directories: 'original_source_data_directory' and 'preprocessed_source_data_directory'. The 'original_source_data_directory' holds files with unmodified natural-language sentences. The 'preprocessed_source_data_directory' holds additionally preprocessed versions of those files (e.g. stemmed or additionally tokenized files). Sentence alignment is computed on the content of 'preprocessed_source_data_directory', whereas the parallel corpora are built from the content of 'original_source_data_directory'. If the source files have not been preprocessed, both paths must be identical.
  • The 'work', 'aligned_idx' and 'result' directories are created automatically.
  • Aligned corpora are placed in the 'result' directory.

Note: The automatically created subdirectories (work, aligned_idx, result) do not have to share the same root, but keeping them together makes the alignment process much easier to track.
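
The naming rules above can be checked before the aligner is run. The following is a minimal sketch and not part of bsa-wrapper: the function name find_parallel_pairs is hypothetical, and the sketch assumes that the language code is the part of the filename after the last underscore.

```python
from pathlib import Path


def find_parallel_pairs(source_dir, source_lang, target_lang):
    """Return (paired, unpaired) titles found in source_dir.

    Assumes filenames follow ${title}_${lang}.snt and that the language
    code is whatever follows the last underscore (assumption of this sketch).
    """
    titles = {source_lang: set(), target_lang: set()}
    for path in Path(source_dir).glob("*.snt"):
        # "article_001_en" -> ("article_001", "_", "en")
        title, _, lang = path.stem.rpartition("_")
        if title and lang in titles:
            titles[lang].add(title)
    # Titles present in both languages can be aligned.
    paired = sorted(titles[source_lang] & titles[target_lang])
    # Titles present in only one language have no parallel file.
    unpaired = sorted(titles[source_lang] ^ titles[target_lang])
    return paired, unpaired
```

Calling find_parallel_pairs on the ${corpus_title}/source directory with the configured language codes lists the titles that have both files and those that are missing a counterpart. Files in other languages or without the .snt extension are ignored.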

An example of a configuration file (YAML):

(for running on Windows OS; replace values in square brackets with actual paths; see also io_args.yml.sample)


source_language: en
target_language: fr

corpus_title: aligned_corpora

original_source_data_directory: [...]\aligned_corpora\source
preprocessed_source_data_directory: [...]\aligned_corpora\source
work_directory: [...]\aligned_corpora\work
alignment_index_directory: [...]\aligned_corpora\aligned_idx
output_data_directory: [...]\aligned_corpora\result
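
A parsed configuration can be sanity-checked before the shell script is started. The sketch below is illustrative only: the helper names missing_keys and result_filenames are hypothetical, and it assumes that every key shown in the sample above is required. The result filenames follow the 'result' layout shown under Usage.

```python
# Keys taken from the sample configuration above; treating all of them
# as required is an assumption of this sketch.
REQUIRED_KEYS = (
    "source_language",
    "target_language",
    "corpus_title",
    "original_source_data_directory",
    "preprocessed_source_data_directory",
    "work_directory",
    "alignment_index_directory",
    "output_data_directory",
)


def missing_keys(config):
    """Return the required keys that are absent or empty in a parsed config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def result_filenames(config):
    """Build the four filenames expected in the 'result' directory."""
    src = config["source_language"]
    tgt = config["target_language"]
    title = config["corpus_title"]
    pair = f"{src}-{tgt}"
    return [
        f"{title}.{pair}.{src}",
        f"{title}.{pair}.{tgt}",
        f"{title}.unique.{pair}.{src}",
        f"{title}.unique.{pair}.{tgt}",
    ]
```

Both helpers take a plain dictionary, for example the value returned by PyYAML's yaml.safe_load for your configuration file.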

Running the shell script

  • Enter the actual values for parameters in the configuration YAML file (see above).
  • Specify the name of the configuration (YAML) file in the run_bsa.bat file (the value of config_file). The YAML file must reside in the script directory.
  • Execute the following command (on Windows):
    .\run_bsa.bat

Notes:

  • These scripts include a version of Bilingual Sentence Aligner that is slightly modified from the original source. The modifications were made to minimize memory issues on larger corpora.
  • The scripts can also be run under UNIX/Linux. To do so, execute a Bash script equivalent to run_bsa.bat.
