Commit: v.1

kyubyong park committed Apr 8, 2019
1 parent 98ead2b commit d94ba5d
Showing 11 changed files with 4,066 additions and 1 deletion.
2 changes: 2 additions & 0 deletions LICENSE
@@ -1,3 +1,5 @@
Copyright 2019 KakaoBrain Corp. <https://www.kakaobrain.com> All Rights Reserved.

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
115 changes: 114 additions & 1 deletion README.md
@@ -1 +1,114 @@
[![image](https://img.shields.io/pypi/v/word2word.svg)](https://pypi.org/project/word2word/)
[![image](https://img.shields.io/pypi/l/word2word.svg)](https://pypi.org/project/word2word/)
[![image](https://img.shields.io/pypi/pyversions/word2word.svg)](https://pypi.org/project/word2word/)
[![image](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/kimdwkimdw)

# word2word

Easy-to-use word-to-word translations for 3,564 language pairs.

## Key Features

* A large collection of freely & publicly available word-to-word translations
**for 3,564 language pairs across 62 unique languages.**
* Easy-to-use Python interface.
* Constructed using an efficient approach whose outputs were quantitatively
  evaluated by proficient bilingual human labelers.

## Usage

First, install the package using `pip`:
```bash
pip install word2word
```

Or install from source:

```bash
git clone https://github.com/Kyubyong/word2word.git
cd word2word
python setup.py install
```

Then, in Python, download the model and retrieve the top-5 word translations
of any given word into the desired language:
```python
from word2word import Word2word
en2fr = Word2word("en", "fr")
print(en2fr("apple"))
# out: ['pomme', 'pommes', 'pommier', 'tartes', 'fleurs']
```
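
The same interface works in either direction for any of the 3,564 supported pairs. Below is a small illustrative sketch along the same lines (the comments describe expected behavior, not actual captured output):

```python
from word2word import Word2word

# Reverse direction: the French-to-English lexicon is downloaded if not already present
fr2en = Word2word("fr", "en")
print(fr2en("pomme"))  # top-k English translations of "pomme", e.g. including 'apple'

# Any other supported pair works the same way, e.g. English to Korean
en2ko = Word2word("en", "ko")
print(en2ko("apple"))  # top-k Korean translations of "apple"
```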

![gif](./word2word.gif)

## Supported Languages

We provide top-k word-to-word translations across all available pairs
from [OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php).
This amounts to a total of 3,564 language pairs across 62 unique languages.

The full list is provided [here](word2word/supporting_languages.txt).

## Methodology

Our approach computes the top-k word-to-word translations based on
the co-occurrence statistics between cross-lingual word pairs in a parallel corpus.
We additionally introduce a correction term that controls for any confounding effect
coming from other source words within the same sentence.
The resulting method is an efficient and scalable approach that allows us to
construct large bilingual dictionaries from any given parallel corpus.

For more details, see the Methods section of [our paper draft](word2word-draft.pdf).
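
To make the correction term concrete, here is a minimal, self-contained sketch of the scoring rule (it mirrors the computation in `adjust_dict` of `make.py`; the function and variable names are illustrative and not part of the package API):

```python
def corrected_score(cooc_xy, count_x, collocates_of_x, cooc_with_y, counts):
    """Toy version of the corrected translation score for a source word x
    and a candidate translation y.

    cooc_xy:          co-occurrence count of x and y in the parallel corpus
    count_x:          corpus count of x
    collocates_of_x:  {x2: number of times x2 co-occurs with x in the source language}
    cooc_with_y:      {x2: cross-lingual co-occurrence count of x2 and y}
    counts:           {x2: corpus count of x2}
    """
    score = cooc_xy / float(count_x)  # initial value: p(y | x)
    for x2, cnt_x_x2 in collocates_of_x.items():
        p_x2_given_x = cnt_x_x2 / float(count_x)
        p_y_given_x2 = cooc_with_y.get(x2, 0) / float(counts[x2])
        score -= p_x2_given_x * p_y_given_x2  # remove the confounding contribution of collocate x2
    return score
```

Candidate translations are then reranked by this corrected score, and only the top-k are kept in the final dictionary.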


## Comparisons with Existing Software

A popular publicly available dataset of word-to-word translations is
[`facebookresearch/MUSE`](https://github.com/facebookresearch/MUSE), which
includes 110 bilingual dictionaries that are built from Facebook's internal translation tool.
In comparison to MUSE, `word2word` does not rely on translation software
and covers a much larger set of language pairs (3,564).
`word2word` also provides the top-k word-to-word translations for up to 100k words
(compared to 5~10k words in MUSE) and can be applied to any language pair
for which a parallel corpus is available.

In terms of quality, while a direct comparison between the two methods is difficult,
we did notice that MUSE's bilingual dictionaries involving non-European languages may not be as useful.
For English-Vietnamese, we found that 80% of the 1,500 word pairs in
the validation set simply paired a word with itself
(e.g. crimson-crimson, Suzuki-Suzuki, Randall-Randall).

For more details, see Appendix in [our paper draft](word2word-draft.pdf).


## References

If you use our software for research, please cite:
```bibtex
@misc{word2word2019,
author = {Park, Kyubyong and Kim, Dongwoo and Choe, Yo Joong},
title = {word2word},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Kyubyong/word2word}}
}
```
(We may later update this BibTeX entry with a reference to [our paper draft](word2word-draft.pdf).)

All of our word-to-word translations were constructed from the publicly available
[OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php) dataset:
```bibtex
@article{opensubtitles2016,
title={OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles},
author={Lison, Pierre and Tiedemann, J{\"o}rg},
year={2016},
publisher={European Language Resources Association}
}
```

## Authors

[Kyubyong Park](https://github.com/Kyubyong),
[Dongwoo Kim](https://github.com/kimdwkimdw), and
[YJ Choe](https://github.com/yjchoe)

276 changes: 276 additions & 0 deletions make.py
@@ -0,0 +1,276 @@
# -*- coding: utf-8 -*-
'''
Word2word
authors: Kyubyong Park (kbpark.linguist@gmail.com), YJ Choe (yjchoe33@gmail.com), Dongwoo Kim (kimdwkimdw@gmail.com)
'''
import codecs
import os
import re
import pickle
import operator
from collections import Counter
from itertools import chain
from tqdm import tqdm
import argparse
import logging
from utils import get_savedir


def download(lang1, lang2):
    '''Download corpora from OpenSubtitles 2018'''
    download = f"wget http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/moses/{lang1}-{lang2}.txt.zip -P data"
    unzip = "unzip data/*.zip -d data/"
    rm_zip = "rm data/*.zip"
    rm_ids = "rm data/*.ids"
    rm_readme = "rm README*"
    for cmd in (download, unzip, rm_zip, rm_ids, rm_readme):
        os.system(cmd)

def normalize(tokens, ignore_first_word):
    '''If ignore_first_word is True,
    we drop the first word or token
    because its true case is unclear.'''
    if ignore_first_word:
        tokens = tokens[1:]
    return tokens

def word_segment(sent, lang, tokenizer):
    if lang == "en":
        words = tokenizer(sent)
    elif lang == 'ko':
        words = [word for word, _ in tokenizer.pos(sent)]
    elif lang == 'ja':
        words = [elem for elem in tokenizer.getWS(sent)]
    elif lang == 'th':
        words = tokenizer(sent, engine='mm')
    elif lang == 'vi':
        words = tokenizer.tokenize(sent).split()
    elif lang == 'zh_cn':
        words = [elem for elem in tokenizer.getWS(sent)]
    elif lang == "zh_tw":
        words = list(tokenizer.cut(sent, cut_all=False))
    elif lang == "ar":
        words = tokenizer.tokenize(sent)
    else:  # Mostly European languages
        sent = re.sub("([!.?,])", r" \1", sent)
        words = sent.split()

    return words


def refine(fin, lang, max_lines, tokenizer, ignore_first_word):
    lines = codecs.open(fin, 'r', 'utf-8').read().split("\n")
    lines = lines[:max_lines]
    sents = [normalize(word_segment(sent, lang, tokenizer), ignore_first_word) for sent in tqdm(lines)]
    return sents


def create_conversion_dicts(sents, n_lexicon):
    word2idx, idx2word, idx2cnt = dict(), dict(), dict()
    word2cnt = Counter(tqdm(list(chain.from_iterable(sents))))
    for idx, (word, cnt) in enumerate(word2cnt.most_common(n_lexicon)):
        word2idx[word] = idx
        idx2word[idx] = word
        idx2cnt[idx] = cnt

    return word2idx, idx2word, idx2cnt

def update_monolingual_dict(xs, x2xs, cutoff):
    # Word indices are assigned in frequency order (see create_conversion_dicts),
    # so an index above the cutoff corresponds to an infrequent word.
    for x in xs:
        for _x in xs:  # _x: collocate
            if x == _x: continue
            if _x > cutoff: continue  # Cut off infrequent words to save memory
            if x not in x2xs: x2xs[x] = dict()
            if _x not in x2xs[x]: x2xs[x][_x] = 0
            x2xs[x][_x] += 1
    return x2xs


def adjust_dict(x2ys, x2cnt, x2xs, reranking_width, n_trans):
    _x2ys = dict()
    for x, ys in tqdm(x2ys.items()):
        if x not in x2xs: continue  # if there are no collocates, we don't have to adjust the score.
        cntx = x2cnt[x]
        y_scores = []
        for y, cnty in sorted(ys.items(), key=operator.itemgetter(1), reverse=True)[:reranking_width]:
            ts = cnty / float(cntx)  # translation score: initial value
            for x2, cntx2 in x2xs[x].items():  # collocates
                p_x_x2 = cntx2 / float(cntx)
                p_x2_y2 = 0
                if x2 in x2ys:
                    p_x2_y2 = x2ys[x2].get(y, 0) / float(x2cnt[x2])
                ts -= (p_x_x2 * p_x2_y2)
            y_scores.append((y, ts))
        _ys = sorted(y_scores, key=lambda x: x[1], reverse=True)[:n_trans]
        _ys = [each[0] for each in _ys]
        _x2ys[x] = _ys

    return _x2ys

def load_tokenizer(lang):
    if lang == "en":
        from nltk.tokenize import word_tokenize as wt
        tokenizer = wt
    elif lang == "ko":
        from konlpy.tag import Kkma
        tokenizer = Kkma()
    elif lang == "ja":
        import Mykytea
        opt = "-model jp-0.4.7-1.mod"
        tokenizer = Mykytea.Mykytea(opt)
    elif lang == "zh_cn":
        import Mykytea
        opt = "-model ctb-0.4.0-1.mod"
        tokenizer = Mykytea.Mykytea(opt)
    elif lang == "zh_tw":
        import jieba
        tokenizer = jieba
    elif lang == "vi":
        from pyvi import ViTokenizer
        tokenizer = ViTokenizer
    elif lang == "th":
        from pythainlp.tokenize import word_tokenize
        tokenizer = word_tokenize
    elif lang == "ar":
        import pyarabic.araby as araby
        tokenizer = araby
    else:
        tokenizer = None

    return tokenizer


# def sanity_check(word2x, x2ys, _x2ys, y2word, reranking_width):
#     if "time" not in word2x: return ""
#     time_id = word2x["time"]
#
#     # before adjustment
#     ys = x2ys[time_id]
#     y_cnt = sorted(ys.items(), key=operator.itemgetter(1), reverse=True)[:reranking_width]
#     print("\tbefore adjustment the translations of `time` were =>", " | ".join(y2word[y] for y, cnt in y_cnt))
#
#     # after adjustment
#     ys = _x2ys[time_id]
#     print("\tafter adjustment the translations of `time` are => ", " | ".join(y2word[y] for y in ys))

def main(hp):
    logging.info("Step 0. Download ..")
    lang1, lang2 = sorted([hp.lang1, hp.lang2])
    download(lang1, lang2)

    logging.info("Step 1. Load tokenizer ..")
    tokenizer1 = load_tokenizer(lang1)
    tokenizer2 = load_tokenizer(lang2)

    logging.info("Step 2. Normalize sentences ..")
    logging.info(f"Working on {lang1} ..")
    fin = f'data/OpenSubtitles.{lang1}-{lang2}.{lang1}'
    sents1 = refine(fin, lang1, hp.max_lines, tokenizer1, hp.ignore_first_word1)

    logging.info(f"Working on {lang2} ..")
    fin = f'data/OpenSubtitles.{lang1}-{lang2}.{lang2}'
    sents2 = refine(fin, lang2, hp.max_lines, tokenizer2, hp.ignore_first_word2)

    assert len(sents1) == len(sents2), \
        f"""{lang1} and {lang2} MUST be the same in length.\n
        {lang1} has {len(sents1)} lines, but {lang2} has {len(sents2)} lines"""

    # Create folder
    savedir = get_savedir()
    os.makedirs(savedir, exist_ok=True)

    print("Step 3. Initialize dictionaries")
    # conversion dictionaries
    word2x, x2word, x2cnt = create_conversion_dicts(sents1, hp.n_lexicon)
    word2y, y2word, y2cnt = create_conversion_dicts(sents2, hp.n_lexicon)

    # monolingual collocation dictionaries
    x2xs = dict()  # {x: {x1: cnt, x2: cnt, ...}}
    y2ys = dict()  # {y: {y1: cnt, y2: cnt, ...}}

    # crosslingual collocation dictionaries
    x2ys = dict()  # {x: {y1: cnt, y2: cnt, ...}}
    y2xs = dict()  # {y: {x1: cnt, x2: cnt, ...}}

    print("Step 4. Update dictionaries ...")
    line_num = 1
    for sent1, sent2 in tqdm(zip(sents1, sents2), total=len(sents1)):
        if len(sent1) <= 1 or len(sent2) <= 1: continue

        # To indices
        xs = [word2x[word] for word in sent1 if word in word2x]
        ys = [word2y[word] for word in sent2 if word in word2y]

        # Monolingual dictionary updates
        x2xs = update_monolingual_dict(xs, x2xs, hp.cutoff)
        y2ys = update_monolingual_dict(ys, y2ys, hp.cutoff)

        # Crosslingual dictionary updates
        for x in xs:
            for y in ys:
                if line_num <= hp.lexicon_lines:
                    ## lang1 -> lang2
                    if x not in x2ys: x2ys[x] = dict()
                    if y not in x2ys[x]: x2ys[x][y] = 0
                    x2ys[x][y] += 1

                    ## lang2 -> lang1
                    if y not in y2xs: y2xs[y] = dict()
                    if x not in y2xs[y]: y2xs[y][x] = 0
                    y2xs[y][x] += 1

                else:  # We don't add new words after some point to save memory.
                    ## lang1 -> lang2
                    if x in x2ys and y in x2ys[x] and x2ys[x][y] > 1:
                        x2ys[x][y] += 1

                    ## lang2 -> lang1
                    if y in y2xs and x in y2xs[y] and y2xs[y][x] > 1:
                        y2xs[y][x] += 1
        line_num += 1

    print("Step 5. Adjust ...")
    _x2ys = adjust_dict(x2ys, x2cnt, x2xs, hp.reranking_width, hp.n_trans)
    _y2xs = adjust_dict(y2xs, y2cnt, y2ys, hp.reranking_width, hp.n_trans)

    # print("Step 5. Sanity check")
    # if lang1 == "en":
    #     sanity_check(word2x, x2ys, _x2ys, y2word, hp.reranking_width)
    # elif lang2 == "en":
    #     sanity_check(word2y, y2xs, _y2xs, x2word, hp.reranking_width)
    # else:
    #     pass

    print("Step 6. Save")
    pickle.dump((word2x, y2word, _x2ys), open(f'{savedir}/{lang1}-{lang2}.pkl', 'wb'))
    pickle.dump((word2y, x2word, _y2xs), open(f'{savedir}/{lang2}-{lang1}.pkl', 'wb'))

    print("Done!")

if __name__ == "__main__":
    # arguments setting
    parser = argparse.ArgumentParser()
    parser.add_argument('--lang1', type=str, required=True,
                        help="ISO 639-1 code of language. See `http://opus.lingfil.uu.se/OpenSubtitles2016.php`")
    parser.add_argument('--lang2', type=str, required=True,
                        help="ISO 639-1 code of language. See `http://opus.lingfil.uu.se/OpenSubtitles2016.php`")
    parser.add_argument('--max_lines', type=int, default=1000000, help="maximum number of lines that are used")
    parser.add_argument('--ignore_first_word1', dest="ignore_first_word1", action="store_true",
                        help="Ignore first words in the source lang because we don't know the true case of them.")
    parser.add_argument('--ignore_first_word2', dest="ignore_first_word2", action="store_true",
                        help="Ignore first words in the target lang because we don't know the true case of them.")
    parser.add_argument('--cutoff', type=int, default=1000,
                        help="number of words that are used in calculating collocation")
    parser.add_argument('--lexicon_lines', type=int, default=100000,
                        help="New words are not added after some point to save memory")
    parser.add_argument('--n_lexicon', type=int, default=100000,
                        help="number of words in lexicon")
    parser.add_argument('--reranking_width', type=int, default=100,
                        help="maximum collocates that we consider when reranking them")
    parser.add_argument('--n_trans', type=int, default=10,
                        help="number of final translations")
    hp = parser.parse_args()

    main(hp)
    print("Done!")
