# Word Alignment for Hausa texts
In this notebook, word alignment is performed on Hausa texts using `fast_align` and `SimAlign`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Word Alignment with `fast_align`
Find out more [here](https://github.com/clab/fast_align).

First, install the repository:

In [None]:
%%bash
echo "git clone fast_align"
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir -p build
cd build
echo "cmake"
cmake ..
echo "make"
make
echo "test fast align"
/content/fast_align/build/fast_align

git clone fast_align
cmake
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find SparseHash (missing: SPARSEHASH_INCLUDE_DIR) 
-- Configuring done
-- Generating done
-- Build files have been written to: /content/fast_align/build
make
[ 16%] Building CXX object CMakeFiles/fast_align.dir/src/fast_align.cc.o
[ 33%] Building CXX object CMakeFiles/fast_align.dir/src/ttables.cc.o
[ 50%] Linking CXX executable fast_align
[ 50%] Built target fast_align
[ 66%] Building CXX object CMakeFiles/atools.dir/src/alignment_io.cc.o
[ 83%] Buildi

Cloning into 'fast_align'...
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


Usage: /content/fast_align/build/fast_align -i file.fr-en
 Standard options ([USE] = strongly recommended):
  -i: [REQ] Input parallel corpus
  -v: [USE] Use Dirichlet prior on lexical translation distributions
  -d: [USE] Favor alignment points close to the monotonic diagonoal
  -o: [USE] Optimize how close to the diagonal alignment points should be
  -r: Run alignment in reverse (condition on target and predict source)
  -c: Output conditional probability table
 Advanced options:
  -I: number of iterations in EM training (default = 5)
  -q: p_null parameter (default = 0.08)
  -N: No null word
  -a: alpha parameter for optional Dirichlet prior (default = 0.01)
  -T: starting lambda for diagonal distance parameter (default 

### Forward Alignment
Next, I iterate over the `.txt` files in the directory `drive/MyDrive/Data/parallel/` that are listed in `FILES`. Each file is assumed to contain one sentence pair per line in the format `source sentence ||| target sentence`. For instance, Hausa could be the target language and English the source language.
For each file, a forward and a reverse alignment are generated and then symmetrized with `atools`. All three results are stored in the directory `drive/MyDrive/Data/aligned/fast_align`.
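
Before running the aligner, it can be worth checking that every line really follows the `source ||| target` layout. The sketch below is one way to do that. `check_parallel_line` and `find_bad_lines` are hypothetical helpers (not part of `fast_align`), and the check assumes that a valid line contains exactly one `|||` separator with at least one token on each side.

```python
def check_parallel_line(line):
    """Return (src_tokens, trg_tokens) if the line is well formed, else None.

    Hypothetical helper: assumes exactly one '|||' separator and
    whitespace-tokenized text on both sides.
    """
    parts = line.strip().split("|||")
    if len(parts) != 2:
        return None
    src, trg = parts[0].split(), parts[1].split()
    if not src or not trg:
        return None
    return src, trg


def find_bad_lines(lines):
    """Return the 1-based numbers of lines that fail the check."""
    return [n for n, line in enumerate(lines, start=1)
            if check_parallel_line(line) is None]


# Toy input with placeholder tokens (not real corpus data).
sample = ["a b c ||| x y", "no separator here", " ||| x y"]
print(find_bad_lines(sample))
```

Reporting line numbers instead of silently dropping lines keeps the corpus and the alignment output in step, since both are matched up line by line.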

In [None]:
%%bash

declare -a FILES=("drive/MyDrive/Data/parallel/Tanzil-de-ha.txt" "drive/MyDrive/Data/parallel/Tanzil-fr-ha.txt")
outDir="/content/drive/MyDrive/Data/aligned/fast_align"
mkdir -p "$outDir"  # make sure the output directory exists

for f in "${FILES[@]}"
do
  echo "Processing $f file..."
  filename=$(basename -- "$f")
  name=${filename%.txt}
  echo "$name"
  forwardPath="$outDir/$name-forward.align"
  reversePath="$outDir/$name-reverse.align"
  symPath="$outDir/$name-sym.align"
  # Forward alignment, reverse alignment (-r), then symmetrization
  /content/fast_align/build/fast_align -i "$f" -v -d -o > "$forwardPath"
  /content/fast_align/build/fast_align -i "$f" -v -d -o -r > "$reversePath"
  /content/fast_align/build/atools -i "$forwardPath" -j "$reversePath" -c grow-diag-final-and > "$symPath"
done

Processing drive/MyDrive/Data/parallel/Tanzil-de-ha.txt file...
Tanzil-de-ha
Processing drive/MyDrive/Data/parallel/Tanzil-fr-ha.txt file...
Tanzil-fr-ha


ARG=i
ARG=v
ARG=d
ARG=o
INITIAL PASS 
..............................................
expected target length = source length * 1.52442
ITERATION 1
..............................................
  log_e likelihood: -2.15189e+07
  log_2 likelihood: -3.10452e+07
     cross entropy: 29.8974
        perplexity: 1e+09
      posterior p0: 0.08
 posterior al-feat: -0.169145
       size counts: 3590
ITERATION 2
..............................................
  log_e likelihood: -5.40292e+06
  log_2 likelihood: -7.79477e+06
     cross entropy: 7.50657
        perplexity: 181.846
      posterior p0: 0.0622104
 posterior al-feat: -0.147922
       size counts: 3590
  1  model al-feat: -0.137907 (tension=4)
  2  model al-feat: -0.141624 (tension=3.79969)
  3  model al-feat: -0.144029 (tension=3.67371)
  4  model al-feat: -0.145542 (tension=3.59585)
  5  model al-feat: -0.146477 (tension=3.54825)
  6  model al-feat: -0.147048 (tension=3.51934)
  7  model al-feat: -0.147395 (tension=3.50185)
  8  model 
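
The `.align` files written above hold one line per sentence pair in Pharaoh format, e.g. `0-0 1-2`, where each `i-j` links the i-th token on the left side to the j-th token on the right side (zero-indexed). The following sketch parses such lines and combines a forward and a reverse alignment by intersection or union. These are simpler heuristics than the `grow-diag-final-and` method passed to `atools` above and are shown only to illustrate the idea of symmetrization. The function names are hypothetical, and the sketch assumes that both directions use the same left-right index order.

```python
def parse_pharaoh(line):
    """Parse a Pharaoh-format line such as '0-0 1-2' into a set of (i, j) pairs."""
    pairs = set()
    for item in line.split():
        i, j = item.split("-")
        pairs.add((int(i), int(j)))
    return pairs


def to_pharaoh(pairs):
    """Serialize (i, j) pairs back into a sorted Pharaoh-format line."""
    return " ".join("{}-{}".format(i, j) for i, j in sorted(pairs))


def symmetrize(forward_line, reverse_line, method="intersection"):
    """Combine two alignments of the same sentence pair."""
    fwd, rev = parse_pharaoh(forward_line), parse_pharaoh(reverse_line)
    if method == "intersection":
        return fwd & rev
    if method == "union":
        return fwd | rev
    raise ValueError("unknown method: {}".format(method))


# Toy alignments (made up for illustration).
forward = "0-0 1-1 2-1"
reverse = "0-0 1-1 2-2"
print(to_pharaoh(symmetrize(forward, reverse)))           # links found in both directions
print(to_pharaoh(symmetrize(forward, reverse, "union")))  # links found in either direction
```

Intersection keeps only links that both directions agree on, while union keeps every proposed link. Heuristics such as `grow-diag-final-and` sit between these two extremes.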

## Word Alignment with `SimAlign`

`SimAlign` relies on embeddings. First, necessary packages are installed and imported.

In [None]:
%%bash
pip install simalign
pip install sentencepiece

In [None]:
from simalign import SentenceAligner
import os

We also need to specify multilingual embeddings. Here, I use multilingual BERT fine-tuned on Hausa data. The model can be found [here](https://huggingface.co/Davlan/bert-base-multilingual-cased-finetuned-hausa).
`DATA_PATH` is the directory that contains the files with parallel sentences, one pair per line, in the following format: `source language ||| target language`. File names are expected to follow the pattern `CORPUS-SOURCE-TARGET.txt`, for example `Tanzil-ar-ha.txt`.
The alignment pairs are stored in the directory `OUT_PATH`.

In [None]:
models = ["Davlan/bert-base-multilingual-cased-finetuned-hausa",
          ]
DATA_PATH = "PATH/TO/PARALLEL/SENTENCES"
OUT_PATH = "OUT/PATH"

CORPUS = "CORPUS NAME"
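
The `CORPUS-SOURCE-TARGET.txt` naming pattern can be taken apart programmatically. The sketch below shows one way to do so. `parse_corpus_filename` is a hypothetical helper, and it assumes that the corpus name itself contains no hyphen.

```python
import os


def parse_corpus_filename(file_name):
    """Split 'CORPUS-SOURCE-TARGET.txt' into (corpus, source, target).

    Assumption: the corpus name contains no '-' characters.
    """
    stem, ext = os.path.splitext(os.path.basename(file_name))
    parts = stem.split("-")
    if ext != ".txt" or len(parts) != 3:
        raise ValueError("unexpected file name: {}".format(file_name))
    return tuple(parts)


print(parse_corpus_filename("Tanzil-ar-ha.txt"))
```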

Here, the alignment model is loaded.

In [None]:
# Create an instance of the aligner.
# The embedding model and all alignment settings are specified in the constructor.
aligner_mbert_hausa = SentenceAligner(model=models[0],
                                      token_type="bpe",
                                      matching_methods="mai",
                                      device="cuda")  # requires a GPU runtime
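
The `matching_methods` string selects which alignment heuristics are computed, one letter per method. As far as I can tell from the SimAlign README, `m` stands for `mwmf`, `a` for `inter` and `i` for `itermax`, which are the keys used when the results are written out below. The sketch spells out this mapping. `METHOD_KEYS` and `output_keys` are hypothetical names, and the mapping is an assumption that should be checked against the installed version.

```python
# Assumed letter-to-key mapping for the three methods used in this notebook.
METHOD_KEYS = {"m": "mwmf", "a": "inter", "i": "itermax"}


def output_keys(matching_methods):
    """Return the result-dictionary keys expected for a matching_methods string."""
    return [METHOD_KEYS[letter] for letter in matching_methods]


print(output_keys("mai"))
```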

Next, I define some helper functions: `read_file` reads files in the `|||` format described above, and `write_pharao` writes the alignments in Pharaoh format, i.e. as pairs of token indices (`src_idx-trg_idx`), with one output file per matching method.

In [None]:
# Helper functions
def read_file(file_name):
    """Read `source ||| target` lines and return a list of (src_tokens, trg_tokens)."""
    pairs = []
    with open(file_name, encoding="utf-8") as file:
        for line in file:
            # Exactly one "|||" separator is expected per line.
            src, trg = line.strip().split("|||")
            pairs.append((src.split(), trg.split()))
    return pairs


def write_pharao(file_name, alignment_dict):
    """Write one alignment file per matching method, one line per sentence pair."""
    # Source language code, e.g. "de" in "Tanzil-de-ha.txt"
    lang = file_name.split("-")[1]
    types = ["mwmf", "inter", "itermax"]
    for t in types:
        out_file = "{}-{}-ha-{}.align".format(CORPUS, lang, t)
        path = os.path.join(OUT_PATH, out_file)
        with open(path, "w", encoding="utf-8") as f:
            for i in range(len(alignment_dict)):
                alignment = alignment_dict[i][t]
                alignment = ["{}-{}".format(src_idx, trg_idx) for src_idx, trg_idx in alignment]
                f.write("{}\n".format(" ".join(alignment)))
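
Each `src_idx-trg_idx` pair in these files refers to token positions, so turning a line back into word pairs is a matter of indexing into the two token lists. A minimal sketch with placeholder tokens follows. `aligned_words` is a hypothetical helper, the alignment is made up for illustration, and zero-based indices are assumed.

```python
def aligned_words(src_tokens, trg_tokens, pairs):
    """Map (src_idx, trg_idx) pairs to the corresponding (src_word, trg_word) pairs."""
    return [(src_tokens[i], trg_tokens[j]) for i, j in pairs]


# Placeholder tokens and a made-up alignment.
src = ["s0", "s1", "s2"]
trg = ["t0", "t1"]
pairs = [(0, 0), (1, 1), (2, 1)]

# Same line layout as in the alignment files: 'src_idx-trg_idx' separated by spaces.
line = " ".join("{}-{}".format(i, j) for i, j in pairs)
print(line)
print(aligned_words(src, trg, pairs))
```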

Here, the files in `DATA_PATH` that should be aligned are specified. To align all files, set this to `os.listdir(DATA_PATH)`.

In [None]:
# File names for parallel data for each language
file_names = ['Tanzil-de-ha.txt', 'Tanzil-fr-ha.txt']
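
When every file is to be aligned, it may be safer to keep only the `.txt` files, since `os.listdir` also returns subdirectories and any other files in the folder. The sketch below does this. `list_parallel_files` is a hypothetical helper, and the demo runs on a temporary directory with empty placeholder files.

```python
import os
import tempfile


def list_parallel_files(data_path):
    """Return the sorted names of all '.txt' files directly inside data_path."""
    return sorted(name for name in os.listdir(data_path)
                  if name.endswith(".txt")
                  and os.path.isfile(os.path.join(data_path, name)))


# Demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["Tanzil-fr-ha.txt", "Tanzil-de-ha.txt", "notes.md"]:
        open(os.path.join(tmp_dir, name), "w").close()
    found = list_parallel_files(tmp_dir)

print(found)
```

In this notebook, the call would be `file_names = list_parallel_files(DATA_PATH)`.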

Finally, alignment is started.

In [None]:
for file in file_names:
    print(file)
    path = os.path.join(DATA_PATH, file)
    sent_pairs = read_file(path)
    n_sents = len(sent_pairs)
    align_dict = dict()
    for idx, (src, trg) in enumerate(sent_pairs):
        if idx % 100 == 0:
            print("{}/{} sentences aligned.".format(idx, n_sents))
        alignments = aligner_mbert_hausa.get_word_aligns(src, trg)
        align_dict[idx] = alignments
    print("Writing file...")
    write_pharao(file, align_dict)