##### About getting the alignments
The problem of English and Chinese is that they are not one word to one charactar mapping, I try to get rid of all multiple-words English phrases in the .dic file into single English words. I hope this dictionary at least helps a bit in training.
- I tried to use the parallel data from UPUS and run it with Fastalign to get the intersections, but the results is really bad. Again because English to Mandarin is mostly one to many mapping (but how many? That varies.)
- So instead I went back to the following method, very problematic but still it's able to abstract a lot of good alignments. The idea is just when there is many to many mapping, I divided the them into one-two mapping because usually Chinese "words" are two characters. The problem is in this way a lot of many (english) to two/three (chinese characters) are mix aligned.

(Alignments command lines: <br>

```
paste en-zh.en en-zh.zh | sed 's/\t/ ||| /' | grep '. ||| .' > en-zh <br>
fast_align/build/fast_align -d -o -v -i en-zh > en-zh.f <br>
fast_align/build/fast_align -d -o -v -r -i en-zh > en-zh.r <br>
fast_align/build/atools -i en-zh.f -j en-zh.r -c intersect > en-zh.i <br>
```

)


In [31]:
new_lines = []
with open("en-zh.dic", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip().split()
        line = line[2:-2]
        if len(line) > 2:
            en = line[:-1]
            zh = line[-1]
            for w in en:
                #print(w)
                if len(zh)>=2:
                    pair = w + " " + zh[:2]
                    zh = zh[2:]
                    new_lines.append(pair)
                elif zh:
                    pair = w + " " + zh
                    new_lines.append(pair)
        elif len(line) == 2 and line[0]!="" and line[1] !="":
            new_lines.append(line[0] + " " + line[1])
        

In [32]:
new_lines[-20:]

['world 世界',
 'order 秩序',
 'world 世界',
 's 领导',
 'leaders 人',
 'would 那样',
 'most 辉煌',
 'writers 作家',
 'wrong 大错特错了',
 'year 一年',
 'ago 前',
 'years 几年',
 'ago 前',
 'yellow 黄祸',
 'yen 日元',
 'yes 是的',
 'yes 没错',
 'young 年轻',
 'people 人',
 'Šefčovič Šefčovič']

In [33]:
with open("en-zh_train.dic", "w", encoding="utf-8") as f:
    for line in new_lines:
        print(line, file=f)

##### Now you run the new dictionary file with the *en* and *zh* vectors over *vecmap* (supervised) and get the crosslingual embeddings from both sides. I took only top 100000 vectors and save them into python dictionaries.
Command line to get crosslingual embeddings: <br>

```
python vecmap/map_embeddings.py --supervised en-zh_train.dic cc.en.300.vec cc.zh.300.vec en_mapped.emb zh_mapped.emb
```


In [34]:
zh_embeds, en_embeds = {}, {}
max_tokens = 100000
with open("zh_mapped.emb", "r", encoding="utf-8") as f, open("en_mapped.emb", "r", encoding="utf-8") as f2:
    i, j = 0, 0
    for line in f:
        line = line.strip().split()
        zh_embeds[line[0]] = [float(n) for n in line[1:]]
        i += 1
        if i == max_tokens:
            break
    for line in f2:
        line = line.strip().split()
        en_embeds[line[0]] = [float(n) for n in line[1:]]
        j += 1
        if j == max_tokens:
            break

2000000 300



##### And then concatenate the two crosslingual embedding sets (remove some punctuation duplicates). And print them into one embedding file called "final....".

In [None]:
merged = en_embeds
for token in zh_embeds:
    if token not in en_embeds:
        merged[token] = zh_embeds[token]
len(merged)

In [None]:
with open("final_en-zh_embeds.txt", "w", encoding="utf-8") as f:
    print(len(merged), sep="", end=" ", file=f)
    print(300, sep="", file=f)
    for pair in merged.items():
        token = pair[0]
        embed = pair[1]
        print(token, sep="", end=" ", file=f)
        print(embed, file=f)

##### Finally train the en udpipe model using the final merged embedding file using the ud English test set from UD. And evaluate the trained udpipe model on the testset of Chinese from UD.
- The result is really bad?! Maybe the dict file to vecmap is not good (not helpful) or is it the udpipe model training set too small because I used only the en UD test set.

**THE RESULT** is as following:<br>
Parsing from gold tokenization with gold tags - forms: 12012, **UAS: 29.44%, LAS: 18.32%**

Command line: <br>

```
../assignment03/udpipe --train en.udpipe --tokenizer=none --tagger=none --parser='embedding_form_file=final_en-zh_embeds.txt' en_ewt-ud-test.conllu

../assignment03/udpipe --parse --accuracy en.udpipe zh_gsdsimp-ud-test.conllu 

```


###### Lets try to compare it with the delexed model trained with the same source data and evaluate on the same target data.

**SO!** The delexed parser worked even much better than the one with embeddings?! What did I do wrong?

**The RESULT:** <br>
Parsing from gold tokenization with gold tags - forms: 12012, **UAS: 36.97%, LAS: 26.62%**

Command line: 

```
../assignment03/udpipe --train en_delex.udpipe --tokenizer=none --tagger=none --parser='embedding_form=0;embedding_feats=0
;' en_ewt-ud-test.conllu 

../assignment03/udpipe --parse --accuracy en_delex.udpipe zh_gsdsimp-ud-test.conllu  
```