### The harmonization rules are listed below:
#### Rules
1. Verbs: every tag with "VB" in upos is converted to "VERB". If its dependency relation is "aux", it is converted to "AUX" instead (e.g. "be" in a passive construction), and if its dependency relation is "acl", it becomes "ADJ" (e.g. verb participles). *PROBLEM*: "be" and "do/did" as auxiliary verbs can also occur with other dependency relations.
    - VBZ/VBD/VB/VBG/VBN/VBP -> VERB (check the verbs "be" and "do/did" to see whether they are AUX)
    - VBZ + cop -> VERB
    - VBZ + aux -> AUX
    - VBN + aux -> AUX
    - VBN + acl -> ADJ
    - VBG + aux -> AUX
2. Nouns: upos contains "NN". The exception is "NNP", a proper noun, which is "PROPN" in UD.
    - NN/NNS -> NOUN
    - NNP -> PROPN
3. Pronouns:
    - PRP/PRP$/EX/WDT/WP$ -> PRON
4. Adjectives: upos that contains "JJ", plus some VBN (see rule 1).
    - JJ/JJS/JJR -> ADJ
5. Adpositions: "IN" covers all adpositions except "to", which has its own tag "TO". However, "IN" is also used for "SCONJ".
    - IN/TO -> ADP
6. Subordinating conjunctions: non-ADV markers that introduce an adverbial clause, like *because* or *since*; non-pronominal relativizers; and complementizers, like *that* or *whether*, e.g. "that" in "I believe that he will come."
    - IN + mark -> SCONJ
7. Coordinating conjunctions: *and*, *or*, *but*
    - CC -> CCONJ
8. Determiners: upos "DT" and "PDT". "WDT" is excluded: words such as *that*, *which*, *who* that have a nominal function in relative clauses are considered relative pronouns in UD for English, not determiners.
    - DT/PDT -> DET
9. Numerals
    - CD -> NUM
10. Adverbs: every upos that contains "RB". The exception is "not", which as a negation should be the particle "PART".
    - RB/WRB/RBS/RBR -> ADV
    - RB + lemma is 'not' -> PART
11. Particles: besides the lemma "not" with upos "RB", "RP" and "POS" (possessive *'s*) are included.
    - RP/POS -> PART
12. Modal verbs: they are "AUX" in UD for English.
    - MD -> AUX
13. Foreign words: "FW" is converted to "X". It is not clear whether "RRB" and "LRB" refer to right and left brackets; they might need to be converted to "X" as well.
    - FW -> X
    - RRB/LRB -> ?

14. Maybe there are more but I missed ...
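
The rules above can be sketched as a single function over one token's tag, dependency relation and lemma. This is only an illustration: the name `ptb_to_upos`, the lookup table `SIMPLE_MAP` and the choice of exact tag matches are assumptions made for this sketch, and tags without a rule are returned unchanged.

```python
# Hypothetical helper illustrating the rules listed above (not the assignment code).
# Tags that depend only on the tag itself are handled by an exact-match table.
SIMPLE_MAP = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN",
    "PRP": "PRON", "PRP$": "PRON", "EX": "PRON", "WDT": "PRON", "WP$": "PRON",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "TO": "ADP",
    "CC": "CCONJ",
    "DT": "DET", "PDT": "DET",
    "CD": "NUM",
    "WRB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "RP": "PART", "POS": "PART",
    "MD": "AUX",
    "FW": "X",
}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def ptb_to_upos(tag, deprel="", lemma=""):
    """Return the UD tag for one token, or the tag itself if no rule applies."""
    if tag in VERB_TAGS:
        if "aux" in deprel:      # auxiliaries, including passive "be"
            return "AUX"
        if "acl" in deprel:      # participles modifying a noun
            return "ADJ"
        return "VERB"
    if tag == "IN":              # adposition unless it marks a clause
        return "SCONJ" if deprel == "mark" else "ADP"
    if tag == "RB":              # adverb unless it is the negation "not"
        return "PART" if lemma == "not" else "ADV"
    return SIMPLE_MAP.get(tag, tag)


print(ptb_to_upos("VBN", "auxpass"))      # AUX
print(ptb_to_upos("IN", "mark"))          # SCONJ
print(ptb_to_upos("RB", "neg", "not"))    # PART
```

Keeping the context-free rules in a table makes it easy to see which tags still have no rule: anything that comes back unchanged has not been harmonized yet.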

### Assignment 03
Accuracy of the original parsing: UAS: 15.21%, LAS: 4.37% <br>
Accuracy after partially harmonizing the UPOS (only verbs, nouns, pronouns, adjectives): UAS: 49.90%, LAS: 34.27% <br>
Accuracy after harmonizing all the UPOS tags I could find: UAS: 62.61%, LAS: 55.87% <br>

The accuracy increases at each step but is still far from good enough. I assume the remaining gap comes from mismatched features, which I did not address.
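
For reference, the two scores reported above are usually defined as follows: UAS is the share of tokens whose predicted head is correct, and LAS is the share whose head and dependency label are both correct. The sketch below assumes every token is scored; the evaluation script actually used for the numbers above is not shown here and may differ (for example in how punctuation is treated). The function name `attachment_scores` is hypothetical.

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) in percent.

    gold, pred: equal-length lists of (head, deprel) pairs, one pair per token.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        return 0.0, 0.0
    # UAS counts tokens with the correct head only.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens with the correct head and the correct label.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Toy four-token sentence: token 3 has a wrong label, token 4 a wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS: {uas:.2f}%, LAS: {las:.2f}%")  # UAS: 75.00%, LAS: 50.00%
```

Since LAS requires the head to be correct as well, it can never exceed UAS, which matches the pattern in all three result lines.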


In [13]:

# Column indexes of the CoNLL-U fields
ID = 0
FORM = 1
LEMMA = 2
UPOS = 3
XPOS = 4
FEATS = 5
HEAD = 6
DEPREL = 7

def harmonize(conllu_file, include_all=False):
    """Convert the PTB-style tags in the UPOS column to UD tags and write a new file."""
    out_lines = []
    with open(conllu_file, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            fields = line.split("\t")
            # Only token lines with an integer ID are harmonized; comments,
            # blank lines and everything else are copied through unchanged.
            if len(fields) > DEPREL and fields[ID].isdigit():
                # Keep the original tag in `upos`; the harmonized tag goes into fields[UPOS].
                upos, lemma, deprel = fields[UPOS], fields[LEMMA], fields[DEPREL]
                if "VB" in upos:
                    fields[UPOS] = "VERB"
                    if "aux" in deprel:
                        fields[UPOS] = "AUX"
                    # TODO: elif lemma == "do":
                    if "acl" in deprel:
                        fields[UPOS] = "ADJ"
                # NOTE: the substring test also sends NNPS to NOUN.
                elif "NN" in upos:
                    fields[UPOS] = "NOUN"
                    if upos == "NNP":
                        fields[UPOS] = "PROPN"
                elif upos in ["PRP", "PRP$", "EX", "WDT", "WP$"]:
                    fields[UPOS] = "PRON"
                elif "JJ" in upos:
                    fields[UPOS] = "ADJ"

                if include_all:
                    if upos == "IN" or upos == "TO":
                        fields[UPOS] = "ADP"
                        if upos == "IN" and deprel == "mark":
                            fields[UPOS] = "SCONJ"
                    elif upos == "DT" or upos == "PDT":
                        fields[UPOS] = "DET"
                    elif upos == "CD":
                        fields[UPOS] = "NUM"
                    # NOTE: the substring test also matches any other tag
                    # that contains "RB" (e.g. -LRB-/-RRB-).
                    elif "RB" in upos:
                        fields[UPOS] = "ADV"
                        if upos == "RB" and lemma == "not":
                            fields[UPOS] = "PART"
                    elif upos == "RP" or upos == "POS":
                        fields[UPOS] = "PART"
                    elif upos == "MD":
                        fields[UPOS] = "AUX"
                    elif upos == "CC":
                        fields[UPOS] = "CCONJ"
                    elif upos == "FW":
                        fields[UPOS] = "X"

                line = "\t".join(fields)
            # Keep every line (harmonized or not) for the new CoNLL-U file
            out_lines.append(line)

    file_name = "en-ud-dev-harmonized-partial.conllu"
    if include_all:
        file_name = "en-ud-dev-harmonized-all.conllu"

    # Every stored line has had its newline removed, so write exactly one back.
    with open(file_name, "w", encoding="utf-8") as out:
        for line in out_lines:
            out.write(line + "\n")

In [14]:
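# Partial harmonization (verbs, nouns, pronouns, adjectives only)
harmonize("en-ud-dev-orig.conllu")
# Full harmonization (all rules)
harmonize("en-ud-dev-orig.conllu", include_all=True)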
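
One way to find rules that are still missing is to count the values in the UPOS column of a harmonized file that are not UD tags. The sketch below is an assumption-laden illustration: the helper name `leftover_tags` is hypothetical, `UD_TAGS` is the 17-tag UD v2 inventory, and the sample lines are made up for the demonstration.

```python
from collections import Counter

# The 17 universal POS tags of UD v2.
UD_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def leftover_tags(lines):
    """Count UPOS values on token lines that are not UD tags."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Column 0 is the token ID and column 3 is UPOS.
        if len(fields) > 3 and fields[0].isdigit() and fields[3] not in UD_TAGS:
            counts[fields[3]] += 1
    return counts


# Made-up sample: an interjection and two punctuation tags have no rule yet.
sample = [
    "# text = Well, thanks.\n",
    "1\tWell\twell\tUH\t_\t_\t3\tdiscourse\t_\t_\n",
    "2\t,\t,\t,\t_\t_\t3\tpunct\t_\t_\n",
    "3\tthanks\tthanks\tNOUN\t_\t_\t0\troot\t_\t_\n",
    "4\t.\t.\t.\t_\t_\t3\tpunct\t_\t_\n",
]
print(leftover_tags(sample))
```

The same function accepts an open file, for example `leftover_tags(open("en-ud-dev-harmonized-all.conllu", encoding="utf-8"))`, and the most frequent leftover tags are the best candidates for the next rule.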