# Converting Dutch Universal Dependencies data to Parseme

There are currently no Parseme corpora for Dutch. Most existing Parseme corpora build on UD, as explained in [Savary et al 2023](https://nejlt.ep.liu.se/article/view/4453). 

In this project, we automatically enrich Dutch UD data with Parseme annotation. This annotation should then be corrected and validated manually, as not all distinctions made by Parseme are reflected in the UD annotation and some verbal MWEs might not have been annotated as such in UD. (The latter happens when the syntax is regular and only the semantics is idiomatic or non-compositional.)


## Operationalizing Parseme guidelines 

We follow the discussion in Savary et al. 2023 on the relationship between UD and Parseme annotation, and operationalize it as a set of heuristics that identify specific Parseme categories on the basis of (Dutch) UD annotation as accurately as possible. For each heuristic, we formulate a Grew-match query on Dutch UD to check whether it accurately identifies the Parseme category. To do the actual automatic annotation of a UD CoNLL-U file, we use the [parseme.cupt](https://gitlab.com/parseme/cuptlib) Python library.

### Inherent reflexives (IRV)

The reflexive pronoun dependent of an inherently reflexive verb is labeled with the dependency relation expl:pv. This [Grew-match query on UD Dutch Alpino](https://universal.grew.fr/?custom=6540f79d081ba) finds relevant instances. According to the Parseme [guidelines](https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/?page=050_Cross-lingual_tests/040_Inherently_reflexive_verbs__LB_IRV_RB_), verbal expressions that contain an inherent reflexive as one of their parts (i.e. the verb has another dependent that is also part of the verbal expression) are labeled VID. This [query](https://universal.grew.fr/?custom=654508afc6138) finds inherent reflexives that are part of such a VID. As we can accurately identify IRV cases on the basis of UD annotation, no manual verification is necessary.
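
The IRV/VID split described above can be sketched as follows. This is a minimal illustration, assuming tokens are plain dicts with the CoNLL-U fields `id`, `head` and `deprel`; the helper name `classify_reflexive` is hypothetical and not part of cuptlib.

```python
def classify_reflexive(token, sentence):
    """Return 'IRV', 'VID' or None for a token (hypothetical helper)."""
    if token["deprel"] != "expl:pv":
        return None
    # If the same verb has another lexicalized dependent (compound:prt),
    # the reflexive is treated as part of a larger VID.
    for other in sentence:
        if (other["id"] != token["id"]
                and other["head"] == token["head"]
                and other["deprel"] == "compound:prt"):
            return "VID"
    return "IRV"


# Toy sentence: "hij vergist zich" (inherently reflexive 'zich vergissen')
sentence = [
    {"id": 1, "form": "hij", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "vergist", "head": 0, "deprel": "root"},
    {"id": 3, "form": "zich", "head": 2, "deprel": "expl:pv"},
]
labels = [classify_reflexive(t, sentence) for t in sentence]
```

Only the reflexive pronoun receives a label here; collecting the verb and any further components into one MWE is left out of the sketch.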

### Verb-Particle Construction (VPC)

The VPC class refers to verbs that combine with a separable particle (i.e. we find '_ik bel haar op_' and '_dat ik haar op moet bellen_' as well as '_ik moet haar opbellen_'). This is a frequent phenomenon in Germanic languages. In UD, the dependency label compound:prt is used to identify a particle dependent of a verb. However, not all compound:prt cases are Parseme VPCs. We only annotate verb-particle combinations as VPC where the particle has upos ADP, ADJ or ADV, and where the node labeled compound:prt does not have dependents of its own ([Grew-match query](https://universal.grew.fr/?custom=6543b87bf2cd2)). Other instances involving a compound:prt are discussed below.
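
As an illustration of the two conditions just stated (particle upos and no dependents), here is a minimal sketch. It assumes tokens are plain dicts with CoNLL-U field names; `PARTICLE_UPOS` and `is_vpc_particle` are hypothetical names introduced only for this example.

```python
PARTICLE_UPOS = {"ADP", "ADJ", "ADV"}


def is_vpc_particle(token, sentence):
    """True if the token is a compound:prt particle without dependents."""
    if token["deprel"] != "compound:prt":
        return False
    if token["upos"] not in PARTICLE_UPOS:
        return False
    # The particle must not head any other token in the sentence.
    return not any(dep["head"] == token["id"] for dep in sentence)


# Toy sentence: "ik bel haar op"
sentence = [
    {"id": 1, "form": "ik", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "bel", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "haar", "upos": "PRON", "head": 2, "deprel": "obj"},
    {"id": 4, "form": "op", "upos": "ADP", "head": 2, "deprel": "compound:prt"},
]
# (verb id, particle id) pairs for every VPC candidate
vpcs = [(t["head"], t["id"]) for t in sentence if is_vpc_particle(t, sentence)]
```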

If the particle immediately precedes the verb, the two are written as a single word. These cases are nevertheless also labeled as VPC. In the Dutch UD corpora, they can be identified as verbs whose lemma consists of a prefix, a '\_' and a verb lemma, e.g. 'uit_schakelen' ('eliminate, switch off'), and for which there is no compound:prt dependent in the sentence that matches the prefix: [Grew-match](https://universal.grew.fr/?custom=6540fcd3baa28).
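
A minimal sketch of this lemma-based test, assuming tokens are plain dicts with CoNLL-U field names. The helper names are hypothetical, and the toy data assumes that the separated verb in '_ik bel haar op_' also carries the lemma `op_bellen`, as the description above implies.

```python
def split_particle_lemma(lemma):
    """Split 'uit_schakelen' into ('uit', 'schakelen'); None if not applicable."""
    prefix, sep, base = lemma.partition("_")
    if sep and prefix and base:
        return prefix, base
    return None


def is_one_word_vpc(verb, sentence):
    """True if the verb is a particle verb written as a single word."""
    parts = split_particle_lemma(verb["lemma"])
    if verb["upos"] != "VERB" or parts is None:
        return False
    prefix = parts[0]
    # A separate compound:prt dependent matching the prefix means the
    # particle is realized as its own token, so this is not a one-word case.
    for dep in sentence:
        if (dep["head"] == verb["id"]
                and dep["deprel"] == "compound:prt"
                and dep["lemma"] == prefix):
            return False
    return True


joined = [{"id": 1, "lemma": "op_bellen", "upos": "VERB", "head": 0, "deprel": "root"}]
separated = [
    {"id": 1, "lemma": "op_bellen", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 2, "lemma": "op", "upos": "ADP", "head": 1, "deprel": "compound:prt"},
]
results = (is_one_word_vpc(joined[0], joined), is_one_word_vpc(separated[0], separated))
```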

**Issue** Parseme distinguishes between VPC.full, VPCs with a fully non-compositional meaning, and VPC.semi, VPCs where the meaning of the construction can be derived to some extent from the meaning of its parts. As UD is a syntactic annotation framework and this is a semantic distinction, it cannot be derived from UD and requires manual checking. As most VPCs seem to be non-compositional, and approximately 75% of the VPCs in the EN and DE Parseme corpora are VPC.full, we label all VPCs as VPC.full for now.

**Issue** Note that this query also matches cases where the verb is a participle modifying a noun (amod) or cases of nominalisation (_het afgelopen weekeinde, de afgetreden bestuursleden_). The guidelines suggest that such cases should not be included in the annotation. The DE corpus does not have such cases (maybe because participles used as nominal modifiers are not marked as such?), while the EN corpus contains a few (_a marked-up draft, fucked up kids_). On the other hand, there are also cases where the verb has a compound dependent but the result is not marked as an MWE (_above referenced counterparty, freeze dried food, radiation-induced cancers_). For now, we have decided not to mark these as VPC (by excluding participle verbs that have the deprel amod).

### Light verb constructions (LVC)

Light verb constructions consist of a semantically light verb and a predicative noun. In UD, such cases can be identified by searching for nouns that are compound:prt dependents of a verb and that do not have dependents of their own: [Grew-match](https://universal.grew.fr/?custom=65410400cd225)

Note that this set contains a wide variety of verbs, not all of which should probably be seen as semantically light. It may be possible to filter these out, e.g. by creating a list of verbal lemmas that can head a light verb construction. However, a brief inspection of the EN (33 clusters by verb lemma), DE (81), and FR (92) Parseme corpora suggests that quite a few different verbs are used in the LVC construction, and that it does not seem to be a closed class. [Grew-match EN](https://parseme.grew.fr/?custom=65424e3a841e3), [DE](https://parseme.grew.fr/?custom=65424eb04d7b4), [FR](https://parseme.grew.fr/?custom=65424e9102651)

**Issue** A subset of these are cases where a verb has two compound:prt dependents, one a particle and the other a noun. For example, for the query above, cases with bij_zetten are instances of _'luister bijzetten'_ 'add luster to'. We currently analyze these as a single VID (following the guidelines for LVC + IRV). Note, however, that German seems to adopt an approach where these are labeled as overlapping LVC + VPC.

**Issue** The set of LVCs also includes many cases where the noun is an obj dependent of the verb. Assuming predicative nouns in this construction are singular and not introduced by a determiner (the latter requirement seems too restrictive, cf. _geen rekening houden_), this query finds relevant cases (_deel uitmaken (VID?), rekening houden, gebruik maken, ..._): [Grew-match](https://universal.grew.fr/?custom=65421a43602d6). Note that the Alpino lexicon contains rich information on idiomatic verb-object constructions. All or a subset of these, where the verb is one of a small set (_hebben, laten, krijgen, ..._), could be labeled as LVC.
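
The obj-based filter could look roughly like the sketch below. This is not implemented in the annotation script; it only illustrates the conditions mentioned above. Tokens are assumed to be plain dicts with CoNLL-U field names and `feats` as a dict, `is_obj_lvc_candidate` is a hypothetical helper, and `LIGHT_VERBS` contains just the three verbs named in the text, so a real list would have to be extended.

```python
# Only the verbs named above; the actual set is left open in the text.
LIGHT_VERBS = {"hebben", "laten", "krijgen"}


def is_obj_lvc_candidate(noun, sentence, light_verbs=LIGHT_VERBS):
    """True if a bare singular obj noun depends on a verb from light_verbs."""
    if noun["upos"] != "NOUN" or noun["deprel"] != "obj":
        return False
    if (noun.get("feats") or {}).get("Number") != "Sing":
        return False
    # Reject nouns introduced by a determiner (known to be too strict).
    if any(d["head"] == noun["id"] and d["deprel"] == "det" for d in sentence):
        return False
    head = next((t for t in sentence if t["id"] == noun["head"]), None)
    return head is not None and head["lemma"] in light_verbs


# Toy sentence: "zij krijgt kritiek"
bare = [
    {"id": 1, "lemma": "zij", "upos": "PRON", "head": 2, "deprel": "nsubj", "feats": None},
    {"id": 2, "lemma": "krijgen", "upos": "VERB", "head": 0, "deprel": "root", "feats": None},
    {"id": 3, "lemma": "kritiek", "upos": "NOUN", "head": 2, "deprel": "obj",
     "feats": {"Number": "Sing"}},
]
candidate = is_obj_lvc_candidate(bare[2], bare)
```

Such a filter only produces candidates: an ordinary object like the one in _zij krijgt koffie_ would pass as well, so the result still needs manual checking or a lexicon lookup.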

**Issue** As with VPCs, there are two kinds of LVC: LVC.full (non-compositional) and LVC.cause (the verb describes a causal relation to the noun; e.g. for German: [Grew-match Parseme](https://parseme.grew.fr/?custom=6543a5062bdd8)). We initially label all LVCs as LVC.full.


### Verbal Idioms (VID)

If the node labeled compound:prt has dependents itself (staat in het teken van, aan de orde stellen), the construction is most likely a verbal idiom (VID) in Parseme. Also, if the word that is labeled compound:prt is a predicative noun, it is most likely a light verb construction (LVC). [Grew-match](https://universal.grew.fr/?custom=654141a700068)
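
The way a compound:prt dependent is mapped to a category can be summarized as a small decision function. This is a sketch of the heuristics as described in the text, not the annotation script itself: the function name and boolean parameters are hypothetical, the VERB case anticipates the multi-verb constructions discussed below, and returning `None` for any other upos is an assumption of this example.

```python
PARTICLE_UPOS = {"ADP", "ADJ", "ADV"}


def classify_compound_prt(upos, has_dependents, has_sisters):
    """Map properties of a compound:prt token to a Parseme category."""
    if has_sisters:
        # The verb has a second compound:prt or expl:pv dependent.
        return "VID"
    if has_dependents:
        # e.g. 'te maken hebben' (MVC) vs. 'aan de orde stellen' (VID)
        return "MVC" if upos == "VERB" else "VID"
    if upos == "NOUN":
        return "LVC.full"
    if upos == "VERB":
        return "MVC"
    if upos in PARTICLE_UPOS:
        return "VPC.full"
    return None


examples = {
    "aan de orde stellen": classify_compound_prt("NOUN", True, False),
    "plaats maken": classify_compound_prt("NOUN", False, False),
    "op bellen": classify_compound_prt("ADP", False, False),
    "laten zien": classify_compound_prt("VERB", False, False),
}
```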

**Issue** Some verbs have two compound:prt dependents. For now, we assume these are always VID (_er uit zien_). Some of these compounds have dependents themselves, e.g. _te wensen over laten_: compound:prt(laten,wensen), mark(wensen,te), compound:prt(laten,over). Other cases of this kind (_laat zich in de kaart kijken_, _doe geweld aan_, _nieuw leven inblazen_, _luister bij zetten_) are all OK now.

However, one might argue that _te wensen over laten_ is actually an MVC, or even that it is a combination of an MVC and a VPC (as seems to be done often in the other corpora). Also, see the [guidelines for VID+IRV](https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/?page=irv#irv-overlap-vid) and the [guidelines for multiple dependents](https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/?page=structural#test-s2). The guidelines suggest that these cases should actually be annotated as a single VID.

**Issue** In _van plan zijn de vakantie in t Harde door te brengen_, _van plan_ is an svp dependent of _zijn_ in Alpino, but it gets attached to _door-brengen_ in UD. This is a UD conversion mistake (fix: only label _zijn_ as aux if it does not have an svp sister). [Grew-match](https://universal.grew.fr/?custom=6542228d22b44)

### Multi-verb constructions (MVC) 

These are cases where the compound:prt dependent is a verb. This verb often has a _te_ mark dependent, which is also included. Examples: _laten zien_, _heeft te maken_, _geroepen voelen_. Complex cases:

- _te wensen over laten_ (2x compound:prt, and _wensen_ has a dependent)
- _zich laten aan_zien_: here the compound:prt is a VPC (and there is an IRV as well)
- _zich in de kaart laten kijken_: IRV + VID + MVC?
- _raak vuren_? (does not feel like an MVC)
- _laat (niets) van zich horen_: case(zich,van)

[Grew-match](https://universal.grew.fr/?custom=6541429f56a3d)


### Workflow

First, the UD data in CONLLU format are mapped by a shell script to the CUPT format by adding a new column PARSEME:MWE that initially contains '*' for all tokens. 
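
The shell script itself is not shown here. As a rough Python equivalent of the step just described, the sketch below appends a `PARSEME:MWE` column holding `*` to every token line and adds the `global.columns` header line that the CUPT format uses. The function name `conllu_to_cupt` is hypothetical, and the input is assumed not to contain a `global.columns` line already.

```python
CUPT_HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS "
               "HEAD DEPREL DEPS MISC PARSEME:MWE")


def conllu_to_cupt(conllu_text):
    """Add an empty PARSEME:MWE column ('*') to CoNLL-U text."""
    out = [CUPT_HEADER]
    for line in conllu_text.splitlines():
        # Token lines are the non-empty lines that are not comments.
        if line and not line.startswith("#"):
            line = line + "\t*"
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "# text = hij slaapt\n"
    "1\thij\thij\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tslaapt\tslapen\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
cupt_text = conllu_to_cupt(sample)
```

Comment lines and the blank lines that separate sentences are passed through unchanged, so the sentence structure of the input is preserved.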

Next we use the [cuptlib](https://gitlab.com/parseme/cuptlib) python library to read the data, match with specific words, and infer the new annotations. The enriched data is then written to an output file.


In [5]:
import conllu
import parseme.cupt as cupt
from parseme.cupt import MWE

In [42]:
corpus = 'nl_alpino-ud-train'

cupt0 = corpus + '.cupt.0'

with open(cupt0, encoding="utf-8") as f:
    data = f.read()

sentences = conllu.parse(data)

In [48]:
def one_word_particle_verb(verb,sentence) : 
    # boolean test to see whether we are dealing with a particle verb written as one (op_bellen)
    # and where there also are no other idiomatic parts that would already label this verb 
    one_word = False
    if verb['upos'] == 'VERB' and '_' in verb['lemma'] and verb['deprel'] != 'amod' :
        one_word = True
        for prt in sentence :
            if prt['deprel'] in ['compound:prt','expl:pv'] and prt['head'] == verb['id'] :
                one_word = False
    return one_word

def compound_children(id,sentence) :   # cases where annotation uses fixed for additional deps 
    children = []
    for dep in sentence : # ten onder ADP ADP
        if dep['head'] == id and dep['upos'] in ['NOUN','DET','ADP','PRON'] :
            children.append(dep['id'])
    return children

def compound_sisters(id,head,sentence) : # er (goed) uit zien 
    sisters = []
    for cmp in sentence:
        if cmp['id'] != id :
            if cmp['deprel'] in ['compound:prt','expl:pv'] and cmp['head'] == head:
                sisters.append(cmp['id'])
                # include call to compound_children for complex cases ... laat in de kaart kijken 
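                # pass the token id (not the token itself), since compound_children compares it with dep['head']
                children = compound_children(cmp['id'],sentence)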
                sisters = sisters + children
            elif cmp['id'] == head and one_word_particle_verb(cmp,sentence) :
                sisters.append(cmp['id'])
    return sisters

def head_is_compound_verb(id,sentence) :  # test if the verbal head is itself a compound (aan voel komen)
    compound = False
    for token in sentence :
        if token['id'] == id and token['deprel'] == 'compound:prt' :
            compound = True
    return compound
    
# see also https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/?page=irv#irv-overlap-vid
# schreef op zijn naam, also include op zijn? (deps of naam, cmp:prt?) no 
for sentence in sentences:
    mwes = []
    for token in sentence:
        (id,upos,head,deprel) = (token['id'],token['upos'], token['head'],token['deprel'])
        pointers = [id,head]
        mweclass = ''
        sisters = compound_sisters(id,head,sentence) 
        children = compound_children(id,sentence)
        if deprel == 'expl:pv' :  # zich vergissen
            if sisters:  # expl is part of a VID   zich in de kijker spelen, zich niet aan moet stellen
                # todo: zich aanstellen 
                if sisters[0] > id :
                    mweclass = 'VID'
                    pointers = pointers + sisters
                else :  # skip refl as it follows another compound:prt and is already accounted for 
                    continue
            else: 
                mweclass = "IRV"
        elif deprel == 'compound:prt' :
            if sisters :  # er (goed) uit zien  (also skip uit in that case...)
                if sisters[0] > id :
                    mweclass = 'VID'
                    pointers = pointers + sisters + children 
                else :  # skip this prt as it follows another compound:prt and is already accounted for 
                    continue
            elif children :   ## compound:prt with dependents is considered a VID (default) or MVC
                              ## aan voel komen is a special case, compound:prt(komen,aan), compound:prt(voel,komen)
                pointers = pointers + children
                if upos == 'VERB' : # te maken hebben, 
                    mweclass = 'MVC'
                else :
                    mweclass = 'VID' # in de gaten hebben 
            elif upos == 'NOUN' :
                mweclass = 'LVC.full' # naam maken, plaats maken/vinden, gestalte krijgen, spitsroeden lopen, poolshoogte nemen, 
            elif upos == 'VERB' :  # 
                mweclass = 'MVC'
            elif head_is_compound_verb(id,sentence) : 
                continue
            else : 
                mweclass = 'VPC.full'        
        elif one_word_particle_verb(token,sentence) : 
            if sisters :
                continue
            else :
                mweclass = "VPC.full"
                pointers = [id]
        if mweclass :  # only update the list if a MWE was actually found 
            mwe = MWE(mweclass,pointers)
            mwes.append(mwe)
    #print(mwes)
    cupt.replace_mwes(sentence,mwes)
            

In [49]:
outfile = corpus + '.cupt'

with open(outfile, 'w', encoding='utf-8') as f:
    for sentence in sentences:
        f.write(sentence.serialize())
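
As a quick sanity check on the output file, one can count how often each category was assigned. The sketch below reads the last column of CUPT text directly, without cuptlib, and relies on the CUPT convention that only the first token of an MWE carries `id:CATEGORY` while later tokens carry just the id. The function name `count_mwe_categories` is hypothetical.

```python
from collections import Counter


def count_mwe_categories(cupt_text):
    """Count MWE categories in the PARSEME:MWE column of CUPT text."""
    counts = Counter()
    for line in cupt_text.splitlines():
        if not line or line.startswith("#"):
            continue
        mwe_field = line.split("\t")[-1]
        if mwe_field in ("*", "_"):
            continue
        # A token can belong to several MWEs, separated by ';'.
        for code in mwe_field.split(";"):
            if ":" in code:
                counts[code.split(":", 1)[1]] += 1
    return counts


sample = (
    "1\thij\thij\tPRON\t_\t_\t2\tnsubj\t_\t_\t*\n"
    "2\tvergist\tvergissen\tVERB\t_\t_\t0\troot\t_\t_\t1:IRV\n"
    "3\tzich\tzich\tPRON\t_\t_\t2\texpl:pv\t_\t_\t1\n"
    "\n"
)
counts = count_mwe_categories(sample)
```

To apply it to the generated file, read the contents of `outfile` and pass the text to the function.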

## Open Issues and To do 

- enhanced deps can reconstruct elided nodes, also annotate?
- expletive subjects + verb ?
- complex syntax cases, ie where obj has a relative clause with the support verb? is that possible?
- IAV labels (present at the Alpino level, lost at the UD level, but can be included)
- abstract for Naples meeting (deadline: 24 Nov)
- check FLAT annotation tool
- https://universal.grew.fr/?custom=6544ab463854b  This query finds suspect expletives in the context of compound:prt verbs, ie they seem to be attached to the wrong verb.
- in zijn hemd staan, clear VID, not annotated as such
- guidelines suggest that some PPs can also be LVCs

