Translation between the tagset of the _Anotación morfosintáctica do Corpus Técnico do Galego_ (AMFCTG) and the [_Spacy_ POS annotation framework](https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/).

Reference: Xavier Gómez Guinovart, Susana López Fernández. _Anotación morfosintáctica do Corpus Técnico do Galego._ Linguamática, ISSN 1647-0818, no. 1, May 2009, pp. 61-71.

# From AMFCTG to _Spacy_

Only the first (coarse-grained) level of _Spacy_ POS tagging is covered here.

An AMFCTG tag is a positional code in which each character encodes one feature. The first character (together with the second, for nouns and conjunctions) determines the coarse _Spacy_ POS label; the remaining characters carry the detailed morphological information.

<center> <i>Table for  AMFCTG to Spacy translation</i> </center>

1st char | 2nd char | Spacy
---- | ---- | ----
N   |  C | NOUN
N   | P  | PROPN
A   |  - | ADJ
V   | -  | VERB - AUX
R   | -  | ADV
M<br>Z   | -  | NUM
G   | -  | DET
P<br>X<br>D<br>T<br>Q<br>I   | -  | PRON
S | - | ADP
C | C | CCONJ
C | S | SCONJ
O | - | INTJ
F | - | PUNCT
Y<br>L | - | SYM
U | - | X  

Some cases are doubtful. Foreign words (AMFCTG 'E') are provisionally mapped to 'X'. Contractions and enclitic forms are treated by _Spacy_ as a single word, and their POS generally follows the prepositional component (--> 'ADP') or the pronominal component (--> 'PRON').  
Another debatable decision is to label a verb as 'AUX' when it is immediately followed by another verb in a nominal (non-finite) form, that is, one whose tag begins with `VN`, `VX` or `VP`.
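
The 'AUX' rule can be sketched in isolation as follows. This is a minimal illustration, not the full conversion: the function name `label_verbs` and the sample tag strings are invented here, and only the `VN`, `VX` and `VP` prefixes come from the code in this notebook.

```python
# Prefixes that trigger the AUX label for the preceding verb
# (the same prefixes tested in process_file).
NON_FINITE_PREFIXES = ('VN', 'VX', 'VP')

def label_verbs(tags):
    """Return 'AUX' or 'VERB' for each verb tag, and None for non-verbs."""
    labels = []
    for i, tag in enumerate(tags):
        if not tag.startswith('V'):
            labels.append(None)
            continue
        # Look at the next tag, if there is one.
        nxt = tags[i + 1] if i + 1 < len(tags) else ''
        labels.append('AUX' if nxt.startswith(NON_FINITE_PREFIXES) else 'VERB')
    return labels

# Hypothetical tag sequence: a finite verb followed by a VN-tagged verb.
print(label_verbs(['VI', 'VN', 'NC']))
```

Because the rule only looks one token ahead, a verb separated from its non-finite complement by an adverb or a clitic would still be labelled 'VERB'.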

Locutions (multiword expressions) are another special case: AMFCTG marks them by joining their words with the symbol "#". Fortunately, the corpus contains only about 650 distinct locutions, so it is feasible to map them with a hand-made dictionary, `expand_loc`.
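
As a rough sketch of how such a dictionary can be used, the snippet below splits a "#"-joined token and pairs each word with a POS tag and a lemma. The entry layout (a list of POS tags followed by a list of lemmas) mirrors how `expand_loc` is indexed in `process_file`; the sample entry, its tags and the helper name `expand_locution` are assumptions made for illustration.

```python
# Hypothetical entry: (POS tags, lemmas), one item per word of the locution.
expand_loc_demo = {
    'a#través#de': (['ADP', 'NOUN', 'ADP'], ['a', 'través', 'de']),
}

def expand_locution(token, table):
    """Split a '#'-joined locution into (word, pos, lemma) triples."""
    words = token.strip('# ').split('#')
    pos, lemmas = table[token]
    # Guard against hand-made entries whose lists do not match the word count.
    if not (len(words) == len(pos) == len(lemmas)):
        raise ValueError(f'misaligned entry for {token!r}')
    return list(zip(words, pos, lemmas))

print(expand_locution('a#través#de', expand_loc_demo))
```

The length check is worth keeping for a hand-written table, since a single misaligned entry would silently shift every later tag in the sentence.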

In [None]:
A2S={'N':{'C':'NOUN','P':'PROPN'},
    'A':'ADJ',
    'V':['VERB','AUX'],
    'R':'ADV',
    'M':'NUM',
    'Z':'NUM',
    'G':'DET',
    'P':'PRON',
    'X':'PRON',
    'D':'PRON',
    'T':'PRON',
    'Q':'PRON',
    'I':'PRON',
    'S':'ADP',
    'C':{'C':'CCONJ','S':'SCONJ'},
    'O':'INTJ',
    'F':'PUNCT',
    'Y':'SYM',
    'L':'SYM',
    'U':'X',
    'E':'PROPN'  #foreign words; note that the text above gives 'X' as the provisional mapping
    }
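
A lookup over this kind of mixed table (plain strings, nested dictionaries and a list for verbs) could look like the sketch below. It uses an abbreviated copy of the mapping so that it runs on its own; the helper name `coarse_pos` and the sample tags are hypothetical.

```python
# Abbreviated copy of the mapping, so the sketch is self-contained.
A2S_DEMO = {
    'N': {'C': 'NOUN', 'P': 'PROPN'},
    'A': 'ADJ',
    'V': ['VERB', 'AUX'],
    'C': {'C': 'CCONJ', 'S': 'SCONJ'},
    'S': 'ADP',
}

def coarse_pos(tag, table=A2S_DEMO):
    """Return the coarse POS label for an AMFCTG tag, or None if unknown."""
    if not tag or tag[0] not in table:
        return None
    entry = table[tag[0]]
    if isinstance(entry, dict):
        # The second character selects the subtype (e.g. common vs proper noun).
        return entry.get(tag[1]) if len(tag) > 1 else None
    if isinstance(entry, list):
        # Verbs default to the first option; choosing AUX needs sentence context.
        return entry[0]
    return entry

print(coarse_pos('NCMS'))  # NOUN
print(coarse_pos('CS'))    # SCONJ
```

Returning `None` for unknown tags, rather than raising, lets the caller decide how to report them.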


In [7]:
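from pathlib import Path

def process_file(path):
    #accept either a str or a pathlib.Path; each line is token<TAB>tag<TAB>lemma
    raw=Path(path).read_text(encoding='utf8').split('\n')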
    raw=[item.split('\t') for item in raw]
    sents=[]
    tmp=[]

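    for item in raw:
        if not item[0]:
            #blank line: close the current sentence (skip consecutive blank lines)
            if tmp:
                sents.append(tmp)
            tmp=[]
        else:
            tmp.append(item)
    #keep the last sentence when the file does not end with a blank line
    if tmp:
        sents.append(tmp)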

    flp=[]
    
    for s in sents:
        tmp=[]
        pmp=[]
        lmp=[]

        for t,p,l in s:
            #Some rows in the train file have lemma and tag fused in the tag column as 'explanaciónkkk<POS_tag>'; split them apart
            if 'explanaciónkkk' in p:
                l,p=p.split('kkk')
            if '#' in t:
                if t not in expand_loc:
                    #unknown locution: report it and stop reading this sentence's tokens
                    print('unexpected locution',t)
                    break
                pmp+=expand_loc[t][0]
                lmp+=expand_loc[t][1]
                tmp+=t.strip('# ').split('#')

            else:

                tmp.append(t)
                pmp.append(p)
                lmp.append(l)
        #rebuild the sentence; tokens (including PUNCT) are space-separated to ensure correct lemmatization
        sent=' '.join(tmp)
        
        #working with POS
        pos=[]
        for i,p in enumerate(pmp):
            
            if not p:
                pos.append('')
            elif p in SpacyTags:
                pos.append(p)
            elif p[0] not in A2S:
                #unknown tag: report it and add a placeholder so pos stays aligned with tmp and lmp
                print(tmp[i],p,lmp[i])
                print(sent)
                pos.append('')
            elif p[0] in ['N','C']:
                pos.append(A2S[p[0]][p[1]])
            elif p[0] == 'V':
                if i<len(pmp)-1 and pmp[i+1] and pmp[i+1][:2] in ['VN','VX','VP']:
                    pos.append(A2S[p[0]][1])
                else:
                    pos.append(A2S[p[0]][0])
            else:
                pos.append(A2S[p[0]])

        #building Spacy input
        flp.append((sent,tuple(zip(tmp,pos,lmp))))
    return flp
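
The sentence-splitting step at the start of `process_file` can also be sketched on its own. The snippet assumes the same layout as above (one token per line, tab-separated columns, blank lines between sentences); the helper name `split_sentences` and the sample rows with their tags are invented for illustration.

```python
def split_sentences(text):
    """Group tab-separated rows into sentences delimited by blank lines."""
    sentences, current = [], []
    for line in text.split('\n'):
        if line.strip():
            current.append(line.split('\t'))
        elif current:
            # A blank line closes the sentence collected so far.
            sentences.append(current)
            current = []
    if current:
        # Last sentence when the text has no trailing blank line.
        sentences.append(current)
    return sentences

# Hypothetical sample: two sentences, columns are token, tag, lemma.
sample = 'O\tGMS\to\ncan\tNCMS\tcan\n\nLadra\tVI\tladrar\n'
for sentence in split_sentences(sample):
    print(sentence)
```

Working on a string rather than a file path makes the splitting logic easy to check without touching the corpus files.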