### Prepare Data

##### First, get the BERT embedding of each word

- I did this a little differently from the provided script. I passed each sentence (a list of words) through BERT, extracted the embeddings of all subwords belonging to each word from the sentence output, and used their mean as the word embedding.
- To make this possible, I also extracted the span of each word (the range of subword positions in the sentence that represents it) after tokenization, using the method *word_to_tokens*, and saved the spans so that the word embeddings can later be read off the sentence output of the BERT model.
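
The pooling step described above can be sketched without loading a model. This is only a minimal illustration, assuming the subword vectors are already available as plain Python lists; the helper `mean_pool_spans` and the toy 2-dimensional vectors are hypothetical and not part of the notebook.

```python
def mean_pool_spans(subword_vectors, spans):
    """Average the subword vectors inside each half-open span [start, end)."""
    word_vectors = []
    for start, end in spans:
        piece = subword_vectors[start:end]
        if not piece:
            # An empty span has no mean; fail loudly instead of returning NaN.
            raise ValueError(f"empty span ({start}, {end})")
        dim = len(piece[0])
        word_vectors.append([sum(v[d] for v in piece) / len(piece) for d in range(dim)])
    return word_vectors

# Toy input: 4 subwords; the second word covers subwords 1 and 2.
subwords = [[1.0, 0.0], [2.0, 2.0], [4.0, 0.0], [0.0, 3.0]]
spans = [(0, 1), (1, 3), (3, 4)]
pooled = mean_pool_spans(subwords, spans)
print(pooled)
```

Raising on an empty span is a design choice for this sketch: it makes a misaligned span visible immediately instead of letting it turn into a zero vector.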



The following function, *get_sent_words_ids_and_spans*, returns the input ids of a sentence and the subword span of each word in it.

In [4]:
from transformers import AutoModel, AutoTokenizer
import torch

modelName_mono = "bert-base-uncased"
modelName_multi = "bert-base-multilingual-uncased"
# Some valid options:
# bert-base-uncased
# bert-base-cased
# bert-large-cased
# bert-base-multilingual-uncased
# bert-base-multilingual-cased

tokenizer_mono = AutoTokenizer.from_pretrained(modelName_mono)
model_mono = AutoModel.from_pretrained(modelName_mono)

def get_sent_words_ids_and_spans(sent, tokenizer=tokenizer_mono):
    tokenized = tokenizer(sent, return_tensors="pt",
                return_token_type_ids=False, return_attention_mask=True, is_split_into_words=True, add_special_tokens=False)
    input_ids, attention_mask = tokenized["input_ids"], tokenized["attention_mask"]
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    #print("Subwords:", "len:", len(tokens), tokens)
    #print("subwords ids:", input_ids)
    #print("Input_ids:", input_ids.shape)
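    ### Note: word_spans can't be a dictionary keyed by word, as a word can repeat within a sentence, e.g. "the"
    word_spans = []
    # word_ids() has one entry per subword; sorted() keeps the words in sentence order
    for index in sorted(set(tokenized.word_ids())):
        start, end = tokenized.word_to_tokens(index)
        # Caution: the +1 shift assumes a leading [CLS] token, but add_special_tokens=False
        # omits it, so each span points one subword to the right and the last word's span
        # runs past the end of the sequence
        start, end = start+1, end+1
        word_spans.append((sent[index], start, end))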
        #print(sent[index], start, end)
    
    # word_index = 0
    # i = 1
    # span_s, span_e = 1, 1
    #print("Words:", sent, len(sent))
    # for i in range(1, len(tokens)):
    #     if tokens[i].startswith("##"):
    #         print("span", tokens[i])
    #         span_e += 1
        # elif i+1 < len(tokens) and i-1>0 and tokens[i+1].startswith("##") and not tokens[i-1].startswith("##"):
        #     print("span:", tokens[i])
        #     span_e += 1
        # else:
        #     print(i)
        #     print("prev subword:", tokens[i-1])
        #     print("currt subword:", tokens[i])
        #     print("before:", span_s, span_e)
        #     if word_index < len(sent):
        #         if span_s == span_e:
        #             span_e += 1
        #         print("word:", sent[word_index])
        #         print("after:", span_s, span_e)
        #         print("word index", word_index)
        #         print()
        #         word_spans.append((sent[word_index], span_s, span_e))
        #         word_index += 1
        #         span_s = span_e 
    return input_ids, attention_mask, word_spans
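
The same spans can also be derived from a plain list of word indices, one per subword. The sketch below assumes a list shaped like the one `word_ids()` returns in `transformers` (an integer word index per subword, `None` for special tokens); the helper name `spans_from_word_ids` and the sample list are hypothetical.

```python
def spans_from_word_ids(word_ids):
    """Map each word index to a half-open subword span (start, end).

    `word_ids` holds one entry per subword; None marks special tokens.
    """
    spans = {}
    for pos, word_index in enumerate(word_ids):
        if word_index is None:
            continue  # skip special tokens such as [CLS] and [SEP]
        start, _ = spans.get(word_index, (pos, pos))
        spans[word_index] = (start, pos + 1)
    return spans

# Hypothetical word ids for "[CLS] a pl ##at ##yp ##us is [SEP]"
word_ids = [None, 0, 1, 1, 1, 1, 2, None]
spans = spans_from_word_ids(word_ids)
print(spans)
```

Because the positions are taken directly from the enumeration, no manual offset is needed whether or not special tokens are present.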

A one-sentence example to illustrate:

In [14]:
sent = "A platypus is a mammal."
#sent = "But in a break from his past rhetoric about curtailing immigration , the GOP nominee proclaimed that as president he would allow “ tremendous numbers ” of legal immigrants based on a “ merit system . "
#sent = "“ While much of the digital transition is unprecedented in the United States , the peaceful transition of power is not , ” Obama special assistant Kori Schulman wrote in a blog post Monday ."
#sent = "For those who follow social media transitions on Capitol Hill, this will be a little different."
for s in sent:
    if s in [",", ".", "!", "?"]:
        sent = sent.replace(s, " "+s)
sent = sent.split()
input_ids, attention_mask, word_spans = get_sent_words_ids_and_spans(sent)
print()
print("Word spans:")
print(len(word_spans), len(sent))
print(sent)
word_spans

Subwords: len: 9 ['a', 'pl', '##at', '##yp', '##us', 'is', 'a', 'mammal', '.']
subwords ids: tensor([[ 1037, 20228,  4017, 22571,  2271,  2003,  1037, 25476,  1012]])

Word spans:
6 6
['A', 'platypus', 'is', 'a', 'mammal', '.']


[('A', 1, 2),
 ('platypus', 2, 6),
 ('is', 6, 7),
 ('a', 7, 8),
 ('mammal', 8, 9),
 ('.', 9, 10)]
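
The punctuation handling in the example cell can also be written with a single regular expression. This is a sketch of an equivalent approach, not the notebook's code; the helper name `split_words` is hypothetical, and it handles only the same four punctuation marks.

```python
import re

def split_words(text):
    """Insert a space before , . ! ? and split on whitespace."""
    return re.sub(r"([,.!?])", r" \1", text).split()

words = split_words("A platypus is a mammal.")
print(words)
```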

The next function feeds the subword ids of a sentence into the model and uses the word spans to extract each word's embedding from the sentence embedding (the output of the model). All word embeddings of the sentence are collected in a list.

In [5]:
def get_word_embeddings(input_ids, attention_mask, word_spans, model=model_mono): 
    word_embeds = []
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        sent_embeddings = outputs.last_hidden_state
        #print(sent_embeddings.shape)
        #print(sent_embeddings)
        for word, i, j in word_spans:
            #print(word)
            ## get the embeddings of each word in the sentence
            word_embed = sent_embeddings[:, i:j, :]
            ## word embedding is the mean of all its subwords
            word_embed = torch.mean(word_embed, dim=1)
            ## torch.mean over an empty slice (a span past the end of the sequence) gives NaN; replace it with 0.0
            word_embed = torch.nan_to_num(word_embed)
            word_embeds.append(word_embed[0].tolist())
    return word_embeds

Run it on the example sentence for illustration.

Each word is then represented by a vector of 768 numbers.

In [None]:
example = get_word_embeddings(input_ids, attention_mask, word_spans)

##### Finally, get the word embeddings and pair them with their UPOS tags as training data for the MLP classifier.

In [6]:

source, target = "en_pud-ud-test.conllu", "zh_pud-ud-test.conllu"

ID = 0
FORM = 1
LEMMA = 2
UPOS = 3
XPOS = 4
FEATS = 5
HEAD = 6
DEPREL = 7

def prepare_data(model=model_mono, tokenizer=tokenizer_mono, filename=source):
    all_word_embeds, all_tags = [], []
    with open(filename, "r", encoding="utf-8") as f:
        words = []
        tags = []
        for line in f:
            if line == '\n':
                # end of sentence
                input_ids, attn_mask, word_spans = get_sent_words_ids_and_spans(sent=words, tokenizer=tokenizer)
                word_embeds = get_word_embeddings(input_ids, attn_mask, word_spans, model)
                #print(len(words), len(word_embeds), len(tags))
                if len(words) != len(word_embeds):
                    print(" ".join(words))
                all_word_embeds.append(word_embeds)
                all_tags.append(tags)
                words = []
                tags = []
            elif line.startswith('#'):
                continue
            else:
                fields = line.split('\t')
                if fields[ID].isdigit():
                    # standard token
                    words.append(fields[FORM])
                    tags.append(fields[UPOS])
        return all_word_embeds, all_tags
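
The file-reading part of `prepare_data` can be tried on its own, without BERT. The sketch below is an illustration under assumptions: the helper `read_conllu` is hypothetical, and the sample rows are made up and shortened to four columns (real CoNLL-U rows have ten).

```python
ID, FORM, UPOS = 0, 1, 3

def read_conllu(lines):
    """Yield (words, tags) for each sentence in an iterable of CoNLL-U lines."""
    words, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if words:
                yield words, tags
            words, tags = [], []
        elif line.startswith("#"):
            continue  # sentence-level comment
        else:
            fields = line.split("\t")
            # isdigit() skips multiword ranges ("2-3") and empty nodes ("1.1").
            if fields[ID].isdigit():
                words.append(fields[FORM])
                tags.append(fields[UPOS])
    if words:  # input without a trailing blank line
        yield words, tags

# Made-up sample with a multiword token row (2-3) that must be skipped.
sample = (
    "# text = Platypuses don't fly.\n"
    "1\tPlatypuses\tplatypus\tNOUN\n"
    "2-3\tdon't\t_\t_\n"
    "2\tdo\tdo\tAUX\n"
    "3\tn't\tnot\tPART\n"
    "4\tfly\tfly\tVERB\n"
    "5\t.\t.\tPUNCT\n"
    "\n"
)
sentences = list(read_conllu(sample.splitlines(keepends=True)))
print(sentences)
```

Keeping the parsing separate from the model call makes it easy to check that the number of words and tags per sentence match before any embedding is computed.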

In [22]:
en_data = prepare_data()

In [23]:
zh_data = prepare_data(filename=target)

In [24]:
type(zh_data[0][0][0][0])

float

### Train the model

In [25]:
en_embeddings, en_tags = [w for s in en_data[0] for w in s], [t for s in en_data[1] for t in s]
zh_embeddings, zh_tags = [w for s in zh_data[0] for w in s], [t for s in zh_data[1] for t in s]

In [33]:
## saving the embeddings
with open("zh_pud.bert", "w", encoding="utf-8") as f:
    for i in range(len(zh_embeddings)):
        print(zh_tags[i], *zh_embeddings[i], sep="\t", file=f)
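
To check the saved format, the file can be read back line by line. The following is a minimal sketch, assuming the layout written above (gold tag, then the embedding values, all tab-separated); the helper `read_embeddings` is hypothetical, and an in-memory buffer with made-up 3-dimensional vectors stands in for the real file.

```python
import io

def read_embeddings(handle):
    """Read lines of the form 'TAG<TAB>v1<TAB>v2...' into tags and float vectors."""
    tags, vectors = [], []
    for line in handle:
        tag, *values = line.rstrip("\n").split("\t")
        tags.append(tag)
        vectors.append([float(v) for v in values])
    return tags, vectors

# Write two made-up rows in the same way as the cell above, then read them back.
buffer = io.StringIO()
print("NOUN", *[0.5, -1.25, 2.0], sep="\t", file=buffer)
print("VERB", *[1.0, 0.0, -0.75], sep="\t", file=buffer)
buffer.seek(0)
tags, vectors = read_embeddings(buffer)
print(tags, vectors)
```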

In [26]:
len(en_embeddings) == len(en_tags), len(en_tags), len(en_embeddings)

(True, 21176, 21176)

In [38]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
## this is a copy of train_mlp.py
def train_mlp(embeddings, tags, name="en_mlp_bert"):
    # Split into train and test
    embeddings_train, embeddings_test, tags_train, tags_test = train_test_split(embeddings, tags)

    # Train a classifier
    classifier = MLPClassifier(verbose=True)
    classifier.fit(embeddings_train, tags_train)

    # Evaluate the classifier
    score_train = classifier.score(embeddings_train, tags_train)
    score_test  = classifier.score(embeddings_test, tags_test)
    print('Training score:', score_train)
    print('Test score:', score_test)

    # Save the model
    import pickle
    # Use a context manager so the file is closed after writing
    with open(f"{name}.model", 'wb') as f:
        pickle.dump(classifier, f)
    return classifier

##### Train with embeddings from pretrained monolingual BERT.

EN: <br>
Training score: 1.0 <br>
Test score: 0.7083490744238761

In [39]:
classifier_en = train_mlp(en_embeddings, en_tags)

Iteration 1, loss = 2.02892965
Iteration 2, loss = 1.34202194
Iteration 3, loss = 1.05717733
Iteration 4, loss = 0.89092834
Iteration 5, loss = 0.77278894
Iteration 6, loss = 0.68720483
Iteration 7, loss = 0.62339672
Iteration 8, loss = 0.57757470
Iteration 9, loss = 0.53597299
Iteration 10, loss = 0.49261537
Iteration 11, loss = 0.46263466
Iteration 12, loss = 0.42792633
Iteration 13, loss = 0.39668521
Iteration 14, loss = 0.36874044
Iteration 15, loss = 0.34642744
Iteration 16, loss = 0.32270986
Iteration 17, loss = 0.30043696
Iteration 18, loss = 0.28131644
Iteration 19, loss = 0.26334616
Iteration 20, loss = 0.24239579
Iteration 21, loss = 0.22535533
Iteration 22, loss = 0.20816411
Iteration 23, loss = 0.19705617
Iteration 24, loss = 0.17871164
Iteration 25, loss = 0.16847397
Iteration 26, loss = 0.15208206
Iteration 27, loss = 0.14458774
Iteration 28, loss = 0.13031535
Iteration 29, loss = 0.12239279
Iteration 30, loss = 0.11126955
Iteration 31, loss = 0.10501562
Iteration 32, los

ZH: <br>
Training score: 0.8280306332108835 <br>
Test score: 0.5089652596189764

In [40]:
classifier_zh = train_mlp(zh_embeddings, zh_tags, "zh_mlp_bert")

Iteration 1, loss = 2.22141623
Iteration 2, loss = 1.96835295
Iteration 3, loss = 1.79074983
Iteration 4, loss = 1.68530421
Iteration 5, loss = 1.61468521
Iteration 6, loss = 1.57144738
Iteration 7, loss = 1.53015333
Iteration 8, loss = 1.49521472
Iteration 9, loss = 1.47302468
Iteration 10, loss = 1.44944163
Iteration 11, loss = 1.42343265
Iteration 12, loss = 1.40225318
Iteration 13, loss = 1.38153233
Iteration 14, loss = 1.35998540
Iteration 15, loss = 1.34627407
Iteration 16, loss = 1.33042853
Iteration 17, loss = 1.31269609
Iteration 18, loss = 1.30076928
Iteration 19, loss = 1.29428144
Iteration 20, loss = 1.28227003
Iteration 21, loss = 1.26511574
Iteration 22, loss = 1.25699219
Iteration 23, loss = 1.24230846
Iteration 24, loss = 1.24115653
Iteration 25, loss = 1.22263450
Iteration 26, loss = 1.21282847
Iteration 27, loss = 1.20291557
Iteration 28, loss = 1.19812045
Iteration 29, loss = 1.18630005
Iteration 30, loss = 1.17763329
Iteration 31, loss = 1.16715650
Iteration 32, los



Training score: 0.827096693854679
Test score: 0.5033619723571162


##### Train with word embeddings from multilingual BERT.

- With multilingual embeddings, the accuracy on the held-out test split is higher than with monolingual embeddings for both languages: EN rises from about 0.71 to 0.82 and ZH from about 0.51 to 0.80. The gain is particularly large for zh (Mandarin) tagging.

In [8]:
tokenizer_multi = AutoTokenizer.from_pretrained(modelName_multi)
model_multi = AutoModel.from_pretrained(modelName_multi)



In [15]:
en_data_multi = prepare_data(model=model_multi, tokenizer=tokenizer_multi, filename=source)
zh_data_multi = prepare_data(model=model_multi, tokenizer=tokenizer_multi, filename=target)


In [18]:
en_embeddings_multi, en_tags_multi = [w for s in en_data_multi[0] for w in s], [t for s in en_data_multi[1] for t in s]
zh_embeddings_multi, zh_tags_multi = [w for s in zh_data_multi[0] for w in s], [t for s in zh_data_multi[1] for t in s]

In [35]:
with open("en_pud.mbert", "w", encoding="utf-8") as f:
    for i in range(len(en_embeddings_multi)):
        print(en_tags_multi[i], *en_embeddings_multi[i], sep="\t", file=f)

EN: <br>
Training score: 1.0 <br>
Test score: 0.8192293162070268

In [41]:
#Training score: 1.0
#Test score: 0.8192293162070268
classifier_en_multi = train_mlp(en_embeddings_multi, en_tags_multi, name="en_mlp_mbert")

Iteration 1, loss = 1.58806717
Iteration 2, loss = 0.88078622
Iteration 3, loss = 0.67668218
Iteration 4, loss = 0.55691570
Iteration 5, loss = 0.48106512
Iteration 6, loss = 0.42478240
Iteration 7, loss = 0.38715643
Iteration 8, loss = 0.34752500
Iteration 9, loss = 0.32296017
Iteration 10, loss = 0.29551995
Iteration 11, loss = 0.27126419
Iteration 12, loss = 0.25346888
Iteration 13, loss = 0.23190585
Iteration 14, loss = 0.21636442
Iteration 15, loss = 0.19811077
Iteration 16, loss = 0.18602516
Iteration 17, loss = 0.17104799
Iteration 18, loss = 0.15982360
Iteration 19, loss = 0.14707852
Iteration 20, loss = 0.13375349
Iteration 21, loss = 0.12126344
Iteration 22, loss = 0.11251577
Iteration 23, loss = 0.10410979
Iteration 24, loss = 0.09529039
Iteration 25, loss = 0.08764029
Iteration 26, loss = 0.08186889
Iteration 27, loss = 0.07536227
Iteration 28, loss = 0.06865053
Iteration 29, loss = 0.06190947
Iteration 30, loss = 0.05760596
Iteration 31, loss = 0.05247587
Iteration 32, los

ZH: <br>
Training score: 1.0 <br>
Test score: 0.7995890922674636

In [42]:
classifier_zh_multi = train_mlp(zh_embeddings_multi, zh_tags_multi, name="zh_mlp_mbert")

Iteration 1, loss = 1.52302379
Iteration 2, loss = 0.90748541
Iteration 3, loss = 0.73183841
Iteration 4, loss = 0.62573758
Iteration 5, loss = 0.54260005
Iteration 6, loss = 0.48780976
Iteration 7, loss = 0.44333115
Iteration 8, loss = 0.41217469
Iteration 9, loss = 0.38539976
Iteration 10, loss = 0.35823862
Iteration 11, loss = 0.33101715
Iteration 12, loss = 0.31479177
Iteration 13, loss = 0.29127759
Iteration 14, loss = 0.27589426
Iteration 15, loss = 0.25480497
Iteration 16, loss = 0.24167261
Iteration 17, loss = 0.22861887
Iteration 18, loss = 0.21432576
Iteration 19, loss = 0.19765101
Iteration 20, loss = 0.18532332
Iteration 21, loss = 0.17370704
Iteration 22, loss = 0.16329940
Iteration 23, loss = 0.15221348
Iteration 24, loss = 0.14165012
Iteration 25, loss = 0.13387711
Iteration 26, loss = 0.12348031
Iteration 27, loss = 0.11639175
Iteration 28, loss = 0.10710120
Iteration 29, loss = 0.09986017
Iteration 30, loss = 0.09249346
Iteration 31, loss = 0.08612141
Iteration 32, los

### Evaluation

I couldn't run eval_similarity on WSL. It throws "AttributeError: module 'numpy' has no attribute 'typeDict'", and updating all packages didn't help.
Maybe I saved the embeddings in the wrong format? I saved them the way you did: each line is the gold tag followed by the embedding values, separated by "\t". <br>
Maybe you can help me fix this in class.

#### So I used a very simple test instead: <br>
Compare:
- using the classifier trained on monolingual embeddings of English words to predict Mandarin POS tags, and measuring its accuracy;
- using the classifier trained on multilingual embeddings of English words to predict Mandarin POS tags, and measuring its accuracy.

The second one predicts Mandarin tags with much higher accuracy (about 0.45 versus 0.14).

In [48]:
import numpy as np
right = 0
for i in range(len(zh_embeddings)):
    word = np.array(zh_embeddings[i]).reshape(1, -1)
    tag = zh_tags[i]
    pred = classifier_en.predict(word)
    if pred == tag:
        right += 1
print("English model using mono-lingual embeddings for Mandarin Tagging Accuracy:", right/len(zh_tags))

English model using mono-lingual embeddings for Mandarin Tagging Accuracy: 0.13733364464160636


In [49]:
right = 0
for i in range(len(zh_embeddings_multi)):
    word = np.array(zh_embeddings_multi[i]).reshape(1, -1)
    tag = zh_tags_multi[i]
    pred = classifier_en_multi.predict(word)
    if pred == tag:
        right += 1

print("English model using multi-lingual embeddings for Mandarin Tagging Accuracy:", right/len(zh_tags_multi))

English model using multi-lingual embeddings for Mandarin Tagging Accuracy: 0.4542610319869251
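
The overall accuracy can be broken down by tag to see where the transfer fails. This is a sketch under assumptions: the helper `accuracy_report` is hypothetical, the toy tag lists are made up, and in practice the predictions would presumably come from one call such as `classifier_en_multi.predict(zh_embeddings_multi)` on the whole list rather than one word at a time.

```python
from collections import Counter

def accuracy_report(gold, predicted):
    """Return the overall accuracy and a dict of per-tag accuracies."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total, correct = Counter(gold), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            correct[g] += 1
    overall = sum(correct.values()) / len(gold)
    per_tag = {tag: correct[tag] / n for tag, n in total.items()}
    return overall, per_tag

# Made-up gold and predicted tags for four words.
gold = ["NOUN", "VERB", "NOUN", "PUNCT"]
pred = ["NOUN", "NOUN", "NOUN", "PUNCT"]
overall, per_tag = accuracy_report(gold, pred)
print(overall, per_tag)
```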
