# A manual classification of dependency relations

Here, I will look into the distributional properties of a manual classification of dependency relations into two classes:

- __Content relations__: Dependencies that link a content word with another content word. 
- __Function relations__: Dependencies that link a word with a function word.

The relations listed under the two headings below fit well within my specification of content and function dependencies. The fuzzier choices are elaborated further down.

### Content relations

acl, advcl, advmod, amod, appos, ccomp, compound, conj, csubj, csubjpass, dislocated, dobj, foreign, iobj, list, name, nmod, nsubj, nsubjpass, nummod, parataxis, remnant, root, vocative, xcomp

### Function relations

aux, auxpass, case, cc, cop, det, expl, mark, neg, punct

## Questionables

Some of these are more questionable than others.

### Content relations

__discourse__: Based on [the specification](http://universaldependencies.github.io/docs/u/dep/discourse.html) and each language's documentation, discourse seems to be used mainly for interjections, exclamations, and emoticons, as well as a few language-specific instances. We would expect its frequency to be fairly consistent across languages, but the [current distribution](/notebooks/Dependency%20distribution%20in%20Universal%20Dependency%20languages.ipynb) says otherwise. The variation probably has more to do with treebank inconsistencies, and with some languages (Bulgarian) having more user-generated text data. As such, we will classify it as a content relation.

__reparandum__: Almost exclusively used in English. I expect its usage to rise as speech treebanks become increasingly common. This would be an important relation when comparing such treebanks.


### Function relations

__dep__: _unspecified dependency_ -- this is the go-to dependency in cases where the parser cannot make a better guess about the relation. In theory, based on [its description](http://universaldependencies.github.io/docs/u/dep/dep.html), I would presume that the relation would only appear in parser output. That said, looking at the [dependency distribution](/notebooks/Dependency%20distribution%20in%20Universal%20Dependency%20languages.ipynb), we can see that it is quite present in the treebanks themselves, most notably in Danish. If it appeared only in parser output, its content/function classification wouldn't make any difference, since we would use the gold standard as the main reference to compare the output against. In practice, though, that doesn't seem to be the case. Since we cannot assume any syntactic information about the relation, we will choose to ignore it (i.e. treat it as a function relation).

__goeswith__: Almost exclusively used in English and Danish. We cannot make any assumptions about its syntactic information, and it is of little significance (in my opinion) when evaluating the quality of the parser.

__mwe__: multi-word expressions are idiomatically coded expressions, often between function words with or without a content word. Since the word order is often quite strict, and [some](http://universaldependencies.github.io/docs/en/dep/mwe.html) [languages](http://universaldependencies.github.io/docs/fi/dep/mwe.html) even list all(?) of them, we expect them to be fairly easy to parse and to give big benefits for languages with many such relations.
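
Taken together, the clear-cut lists and the questionable assignments above can be written down as a simple lookup. The following is only a minimal sketch: the names `CONTENT`, `FUNCTION`, `classify` and `class_shares` are made up for illustration, and the handling of language-specific subtypes (e.g. `nmod:poss`) is my own assumption, not something decided in this note.

```python
from collections import Counter

# Clear-cut assignments, copied from the two lists at the top of this note.
CONTENT = set(
    "acl advcl advmod amod appos ccomp compound conj csubj csubjpass "
    "dislocated dobj foreign iobj list name nmod nsubj nsubjpass nummod "
    "parataxis remnant root vocative xcomp".split()
)
FUNCTION = set("aux auxpass case cc cop det expl mark neg punct".split())

# Questionable assignments, as decided in the "Questionables" section.
CONTENT |= {"discourse", "reparandum"}
FUNCTION |= {"dep", "goeswith", "mwe"}


def classify(deprel):
    """Return 'content' or 'function' for a dependency relation label.

    Assumption: a language-specific subtype such as 'nmod:poss' inherits
    the class of its universal relation, so the subtype is dropped.
    """
    base = deprel.split(":", 1)[0]
    if base in CONTENT:
        return "content"
    if base in FUNCTION:
        return "function"
    raise ValueError(f"unclassified relation: {deprel!r}")


def class_shares(deprels):
    """Return the fraction of the given labels that fall into each class."""
    counts = Counter(classify(d) for d in deprels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}
```

For example, `class_shares(["det", "nsubj", "root", "case", "nmod:poss", "punct"])` gives a share of 0.5 for each class. Raising an error on unknown labels is a deliberate choice here, so that a relation missing from the classification is noticed rather than silently dropped.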


### Found quirks

__compound__: German tokenizes e.g. _T-shirt_ into _T_, _\-\-_, and _shirt_ with a compound relation between _shirt_ and _T_, while Swedish keeps it as one token. This causes German to have quite a few more compound relations than similar languages (1114 in the German training data versus 6 in the Swedish).
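
Counts like the ones above can be reproduced with a few lines of Python. This is a rough sketch that assumes the treebanks are available as CoNLL-U text, where the relation label is the eighth of ten tab-separated columns. The function name `count_deprels` and the sample sentence are invented for illustration and are not taken from any treebank.

```python
from collections import Counter


def count_deprels(conllu_text):
    """Count DEPREL values (8th column) in CoNLL-U formatted text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed lines, multiword ranges, empty nodes
        counts[cols[7]] += 1
    return counts


# Hypothetical two-token sentence, invented for illustration only.
sample = (
    "1\tT\t_\tNOUN\t_\t_\t2\tcompound\t_\t_\n"
    "2\tshirt\t_\tNOUN\t_\t_\t0\troot\t_\t_\n"
)
print(count_deprels(sample)["compound"])
```

To run this on a real treebank, read the training file into a string and pass it in, then compare `counts["compound"]` between the languages of interest.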

As soon as possible, I will continue with a closer analysis of what such a classification could mean, and whether we can learn anything from the previous analyses on [Dependency distribution in UD languages](http://localhost:8888/notebooks/Dependency%20distribution%20in%20Universal%20Dependency%20languages.ipynb) and [Type-token ratio in Universal Dependencies treebanks](http://localhost:8888/notebooks/Type-token%20ratio%20in%20Universal%20Dependencies%20treebanks.ipynb).