# Training a Dependency Parser with SpaCy and Universal Dependencies

- Download and unzip the latest Universal Dependency release.
- Convert the CoNLL-U file to .spacy format using SpaCy.
- (Optional) Create a custom language class for languages SpaCy doesn't support (not needed for Basque).
- Instantiate a project with a configuration file for tokenization, tagging, and dependency parsing.
- Train the dependency parser.
- Evaluate the trained parser.
- Use the parser on example sentences.

First, download the Universal Dependencies dataset. We will use the latest release (v2.14). Choose a language with more than 1k samples. The walkthrough text uses Basque as its example, while the cells below run on the Spanish GSD treebank (`es_gsd`).
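
To check that a treebank clears the 1k threshold before converting it, the sentences in a CoNLL-U file can be counted directly. The following is a minimal sketch: the function name `conllu_stats` and the inline sample are illustrative and not part of the notebook, and in practice the text would be read from a downloaded `.conllu` file.

```python
from collections import Counter

def conllu_stats(conllu_text):
    """Count sentences, tokens and dependency labels in CoNLL-U formatted text."""
    n_sents = 0
    n_tokens = 0
    deprels = Counter()
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):
            # Comment lines carry metadata such as sent_id or text
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if not in_sentence:
            n_sents += 1
            in_sentence = True
        n_tokens += 1
        deprels[cols[7]] += 1  # column 8 (DEPREL) holds the dependency label
    return n_sents, n_tokens, deprels

# Tiny hand-written sample (hypothetical, not taken from the treebank)
sample = "\n".join([
    "# text = Subo.",
    "\t".join(["1", "Subo", "subir", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
    "",
    "# text = Voy a casa.",
    "\t".join(["1", "Voy", "ir", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "a", "a", "ADP", "_", "_", "3", "case", "_", "_"]),
    "\t".join(["3", "casa", "casa", "NOUN", "_", "_", "1", "obl", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
    "",
])

n_sents, n_tokens, deprels = conllu_stats(sample)
print(n_sents, n_tokens, deprels.most_common(2))
```

For a real file, pass `open(path, encoding="utf-8").read()` instead of `sample`.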

In [15]:
%load_ext autoreload
%autoreload 2


In [2]:
from collections import Counter
import statistics

Now that you have the treebank, convert the CoNLL-U files (.conllu) into SpaCy’s binary format (.spacy), which is the format `spacy train` reads. The `-n` flag (short for `--n-sents`) groups the sentences into documents of 10 sentences each. Because each training document then contains several sentences, the parser also sees sentence boundaries and can learn to predict them (reported as SENTS_F in the training log).

In [3]:
!python -m spacy convert es_gsd-ud-train.conllu . --converter conllu -n 10
!python -m spacy convert es_gsd-ud-dev.conllu . --converter conllu -n 10
!python -m spacy convert es_gsd-ud-test.conllu . --converter conllu -n 10

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1419 documents): es_gsd-ud-train.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (140 documents): es_gsd-ud-dev.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (43 documents): es_gsd-ud-test.spacy
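
The document counts reported by the converter follow from this grouping. As a rough sketch of the idea (not SpaCy's converter code), consecutive sentences are cut into chunks of at most `n_sents`, so the number of documents is the sentence count divided by 10, rounded up. The helper `group_sentences` and the sentence count of 426 are hypothetical values chosen for illustration.

```python
def group_sentences(sentences, n_sents=10):
    """Split a list of sentences into consecutive chunks of at most n_sents."""
    return [sentences[i:i + n_sents] for i in range(0, len(sentences), n_sents)]

# Hypothetical corpus of 426 sentences
sentences = [f"sentence {i}" for i in range(426)]
docs = group_sentences(sentences, n_sents=10)

print(len(docs))      # 43 documents: 42 full ones plus one remainder
print(len(docs[-1]))  # 6 sentences in the last document
```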


Next, initialize a SpaCy project with a pipeline consisting of tok2vec, tagger, and parser. The `init config` command generates a config.cfg file that defines all the parameters needed to train the model; `--optimize efficiency` selects settings that favor speed and model size over accuracy. The `--lang` code should match the language of the treebank (`eu` for Basque, `es` for Spanish).

In [4]:
!python -m spacy init config config.cfg --lang eu --pipeline tok2vec,tagger,parser --optimize efficiency  --force

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: eu
- Pipeline: tagger, parser
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


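With the data and configuration in place, you can train the dependency parser on the .spacy files generated earlier. The training and development paths can be set in the `[paths]` section of config.cfg, shown here with the Basque BDT file names:

```ini
[paths]
train = "./eu_bdt-ud-train.spacy"
dev = "./eu_bdt-ud-dev.spacy"
```

Alternatively, pass them on the command line, where `--paths.train` and `--paths.dev` override the values in config.cfg. The cell below does this with the Spanish GSD files: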

In [6]:
!python -m spacy train config.cfg --output ./output --paths.train ./es_gsd-ud-train.spacy --paths.dev ./es_gsd-ud-dev.spacy

ℹ Saving to output directory: output
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger', 'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS PARSER  TAG_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -----------  -----------  -------  -------  -------  -------  ------
  0       0          0.00       254.84       625.43    41.58    15.46     6.37     0.01    0.26
  0     200       4462.01     13528.87     40530.98    89.32    72.49    64.56    57.79    0.79
  0     400       6080.50      6299.56     26860.51    91.64    77.86    71.43    94.74    0.83
  0     600       5988.69      5075.02     23343.75    92.35    80.08    73.96    94.86    0.85
  0     800       6447.55      4772.35     22275.40    92.59    80.21    74.59    95.48    0.85
  0    1000       6590.91      4411.91     21142.72    92.85    81.34    75.85    96.89    0.86
  0    1200  

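Once training is done, evaluate the parser on the held-out test set to measure its accuracy.
The output reports UAS (Unlabeled Attachment Score), the share of tokens that are assigned the correct head, and LAS (Labeled Attachment Score), the share of tokens that receive both the correct head and the correct dependency label. It also breaks precision, recall, and F-score down per dependency label.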
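
To make the two attachment scores concrete, here is a minimal sketch that computes them for a single sentence. It assumes gold and predicted analyses share the same tokenization; SpaCy's own scorer is more involved (for example, it has to align predicted and gold tokens), so this only illustrates the definitions. The function name `attachment_scores` and the predicted analysis are made up for the example.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for aligned lists of (head_index, dep_label) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    # UAS: only the head has to match
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: head and label both have to match
    correct_both = sum(g == p for g, p in zip(gold, pred))
    return correct_heads / total, correct_both / total

# "Subo a la montaña ." with 0-based head indices; the root is its own head
gold = [(0, "ROOT"), (3, "case"), (3, "det"), (0, "obl"), (0, "punct")]
# Hypothetical prediction: one wrong label (obj) and one wrong head (punct)
pred = [(0, "ROOT"), (3, "case"), (3, "det"), (0, "obj"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.80 LAS=0.60
```

A wrong label with the right head lowers only LAS, while a wrong head lowers both scores, which is why LAS can never exceed UAS.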

In [8]:
!python -m spacy evaluate ./output/model-best ./es_gsd-ud-test.spacy

ℹ Using CPU

TOK      97.68
TAG      93.24
UAS      82.92
LAS      78.91
SENT P   95.43
SENT R   92.97
SENT F   94.19
SPEED    12357


                 P       R       F
case         93.83   84.11   88.71
advmod       85.03   78.59   81.68
root         89.24   85.48   87.32
det          96.87   85.12   90.61
nsubj        80.45   76.96   78.67
nmod         79.34   80.00   79.67
amod         86.53   85.81   86.17
appos        60.33   59.68   60.00
flat         73.10   82.76   77.63
expl:pv      98.33   81.94   89.39
acl:relcl    71.74   70.21   70.97
obl          70.40   68.97   69.68
cop          79.64   83.12   81.35
cc           87.47   84.41   85.91
conj         56.46   56.11   56.28
mark         87.58   84.35   85.94
xcomp        66.14   74.34   70.00
obj          76.80   78.21   77.50
fixed        68.67   50.00   57.87
advcl        56.52   43.33   49.06
ccomp        67.53   63.41   65.41
acl          62.26   60.00   61.11
obl:arg      84.38   57.45   68.35
de
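
The F column in the per-label table is the harmonic mean of the P and R columns. A small sketch recomputes it for a few rows copied from the table; because the printed P and R are already rounded, the result can differ from the printed F in the last digit. The helper name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R) pairs copied from the per-label table
rows = {
    "appos": (60.33, 59.68),
    "conj": (56.46, 56.11),
    "advcl": (56.52, 43.33),
}

for label, (p, r) in rows.items():
    print(label, round(f_score(p, r), 2))
```

Labels such as advcl, where recall is much lower than precision, end up with an F-score closer to the lower of the two values.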

In [20]:

import spacy
from spacy.scorer import Scorer
from spacy.tokens import DocBin
from spacy.training import Example

def evaluate_model(model_path, test_data_path):
    """Score a trained pipeline against the gold-standard docs in a .spacy file."""
    nlp = spacy.load(model_path)

    # Load the gold-standard documents
    doc_bin = DocBin().from_disk(test_data_path)
    examples = []
    for gold_doc in doc_bin.get_docs(nlp.vocab):
        # Re-parse the raw text and pair the prediction with its reference
        pred_doc = nlp(gold_doc.text)
        examples.append(Example(pred_doc, gold_doc))

    scorer = Scorer()
    scores = scorer.score(examples)

    print(scores)

    return scores

model_path = "./output/model-best"
test_data_path = "./es_gsd-ud-test.spacy"

evaluation_results = evaluate_model(model_path, test_data_path)


{'token_acc': 0.976773864216437, 'token_p': 0.9733707673983325, 'token_r': 0.953257790368272, 'token_f': 0.9632092944940226, 'sents_p': 0.9543269230769231, 'sents_r': 0.9297423887587822, 'sents_f': 0.9418742586002372, 'tag_acc': 0.9323615905894387, 'pos_acc': 0.0, 'morph_acc': 0.0, 'morph_micro_p': 0.0, 'morph_micro_r': 0.0, 'morph_micro_f': 0.0, 'morph_per_feat': {'Mood': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Number': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Person': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Tense': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'VerbForm': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Definite': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Gender': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'PronType': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'PunctType': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Poss': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Case': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'PrepCase': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'Reflex': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'PunctSide': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'NumForm': {'p': 0.0, 'r': 0.0, 'f': 

Use the trained parser to parse some example sentences.

In [3]:
import spacy

nlp = spacy.load("./output/model-best")

doc = nlp("Subo a la montaña.")
for token in doc:
    print(token.text, token.dep_, token.head.text)

Subo ROOT Subo
a case montaña
la det montaña
montaña obl Subo
. punct Subo
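
The flat token/label/head listing can be turned into an indented tree, which makes the structure easier to read. The sketch below works on plain tuples rather than SpaCy tokens so that it runs on its own; the head indices are filled in by hand from the output above (in SpaCy they would come from `token.head.i`, and the root is its own head). The function name `render_tree` is illustrative and unrelated to the `printTree` helper used later.

```python
from collections import defaultdict

# (text, dependency label, index of head); the root points to itself
tokens = [
    ("Subo", "ROOT", 0),
    ("a", "case", 3),
    ("la", "det", 3),
    ("montaña", "obl", 0),
    (".", "punct", 0),
]

def render_tree(tokens):
    """Return the dependency tree as a list of indented lines."""
    children = defaultdict(list)
    root = None
    for i, (_, _, head) in enumerate(tokens):
        if head == i:
            root = i
        else:
            children[head].append(i)
    if root is None:
        raise ValueError("no root token found")

    lines = []

    def visit(i, depth):
        text, dep, _ = tokens[i]
        lines.append("  " * depth + f"{text} ({dep})")
        for child in children[i]:
            visit(child, depth + 1)

    visit(root, 0)
    return lines

print("\n".join(render_tree(tokens)))
# Subo (ROOT)
#   montaña (obl)
#     a (case)
#     la (det)
#   . (punct)
```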


In [4]:
from spacy import displacy

nlp = spacy.load("./output/model-best")

doc = nlp("Todavía te queda un largo camino hasta llegar a casa.")
#doc = nlp("Voy a la casa.")
displacy.render(doc, style="dep", jupyter=True, options={"distance": 110})

## Analyse Trees


In [6]:
from src.helper import printTree
from src.helper import getDepthAndDegree
from src.helper import avDict
from src.helper import printDict
from src.helper import get_lines_without_number
from src.helper import regex_clean
from src.helper import clean_spanish_text
import re

In [9]:
# Load the Spanish text; the helper already strips the leading line numbers
spanish_text_path = "spa_wikipedia_2021_30K-sentences.txt"
whole_text = get_lines_without_number(spanish_text_path)
# Apply the cleaning regexes
whole_text = clean_spanish_text(whole_text)

# Structure to look for: a `root` label together with the child labels in `chil`
root = 'nmod' #'obj'
chil = ["nmod","obj"]
tree_to_check = (root,chil)

In [11]:
from src.treeCounter import tree_counter

(
    av_depth,
    mean_degree_dict,
    mean_distance_dict,
    leave_counter,
    count_ancestors,
    count_descendants,
    count_head,
    count_childreen,
    all_sent_with_structure,
) = tree_counter(whole_text, nlp, tree_to_check)

 Progress:  29950 / 30000

In [5]:

print("\nTotal sentences: ",len(whole_text)," Average Depth: ",av_depth)


Total sentences:  30000  Average Depth:  5.5502
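
The implementation of `tree_counter` is not shown here. As one plausible reading of "depth", the sketch below takes it to be the number of edges on the longest path from the root to any token and averages that over sentences. It assumes well-formed trees given as head-index lists in which the root is its own head; the function name `tree_depth` and the two example sentences are made up.

```python
from statistics import mean

def tree_depth(heads):
    """Longest root-to-token path length, in edges, for one sentence."""
    def distance_to_root(i):
        steps = 0
        while heads[i] != i:  # the root is its own head
            i = heads[i]
            steps += 1
        return steps

    return max((distance_to_root(i) for i in range(len(heads))), default=0)

# Head indices for two small hypothetical sentences
corpus = [
    [0, 3, 3, 0, 0],  # "Subo a la montaña ."
    [0, 0],           # a root with a single dependent
]

depths = [tree_depth(heads) for heads in corpus]
print(depths, mean(depths))  # [2, 1] 1.5
```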


In [6]:
print("Average Number of degrees")
printDict(mean_degree_dict)

#count_degree_dict

Average Number of degrees
ccomp=>4.8273
ROOT=>4.5942
acl:relcl=>4.0224
parataxis=>3.9012
csubj=>3.7811
obl:agent=>3.6477
advcl=>3.6326
conj=>3.5416
obl=>3.3491
nmod=>2.9557
obj=>2.7288
nsubj:pass=>2.6123
nsubj=>2.5203
xcomp=>2.3137
acl=>2.0663
obl:arg=>2.0391
appos=>1.7688
dep=>1.4551
compound=>1.3029
nummod=>1.2225
advmod=>1.1521
amod=>1.1260
mark=>1.1257
flat=>1.0472
cc=>1.0414
fixed=>1.0337
cop=>1.0308
case=>1.0211
aux=>1.0146
det=>1.0045
aux:pass=>1.0011
punct=>1.0008
expl:pv=>1.0001


In [7]:
print("Average Distance To Root")
printDict(mean_distance_dict)

Average Distance To Root
cc=>3.5967
case=>3.4964
fixed=>3.4219
flat=>3.2543
amod=>3.1072
mark=>3.0915
det=>3.0862
nummod=>3.0816
compound=>2.9599
nmod=>2.8517
appos=>2.8015
acl:relcl=>2.7451
acl=>2.6501
conj=>2.6236
advmod=>2.4473
dep=>2.3739
obl:agent=>2.2709
obj=>2.2477
obl:arg=>2.1528
aux=>2.0989
xcomp=>2.0989
obl=>1.9273
expl:pv=>1.8152
cop=>1.8028
advcl=>1.7952
ccomp=>1.7487
aux:pass=>1.6511
nsubj=>1.5906
csubj=>1.4576
nsubj:pass=>1.3910
parataxis=>1.0323
punct=>1.0111
ROOT=>0.0000
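
A table like this can be produced by walking from every token up to the root and averaging the number of steps per dependency label. The following is a sketch under the same assumptions as before (head-index lists, root is its own head), not the code behind `mean_distance_dict`; the function name is illustrative.

```python
from collections import defaultdict
from statistics import mean

def mean_distance_by_label(sentences):
    """Average number of edges between a token and the root, per label."""
    distances = defaultdict(list)
    for heads, labels in sentences:
        for i, label in enumerate(labels):
            steps, j = 0, i
            while heads[j] != j:  # climb until the root is reached
                j = heads[j]
                steps += 1
            distances[label].append(steps)
    return {label: mean(values) for label, values in distances.items()}

# "Subo a la montaña ." as (heads, labels)
sentences = [([0, 3, 3, 0, 0], ["ROOT", "case", "det", "obl", "punct"])]

result = mean_distance_by_label(sentences)
for label, value in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{label}=>{value:.4f}")
```

Function words such as case and det sit below the nouns they attach to, which is why they show up with the largest distances in the table above.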


In [8]:
print("Most Common Leaf Nodes")
for label, count in leave_counter.most_common(10):
    print(label," count: ",count)

Most Common Leaf Nodes
case  count:  106678
det  count:  93745
amod  count:  36251
punct  count:  29748
cc  count:  19642
mark  count:  19085
advmod  count:  16601
flat  count:  11980
expl:pv  count:  9169
nummod  count:  8051


In [9]:
def print_count_dict(cd):
    """Print the two most common entries of every Counter in a dictionary.

    `cd` maps each key (a dependency label) to a Counter, for example of how
    often each label occurred among the ancestors of that key.
    """
    for key in cd:
        # Take the (up to) two most common entries of the Counter for this key
        most_common = cd[key].most_common(2)

        # An empty Counter is reported as "Empty"
        if len(most_common) == 0:
            print(key, "- Empty")

        # Otherwise print the one or two most common items
        elif len(most_common) == 1:
            print(key, "-", most_common[0][0],":",most_common[0][1])
        else:
            print(key, "-", most_common[0][0],":",most_common[0][1],"--",most_common[1][0],":",most_common[1][1] )

print("Most Common Ancestors")
# count_ancestors counts every node on the path up to the root;
# count_head counts only the direct head
print_count_dict(count_ancestors)
#print_count_dict(count_head)

Most Common Ancestors
nummod - ROOT : 9878 -- obl : 4676
nsubj - ROOT : 29799 -- advcl : 3088
case - ROOT : 108843 -- nmod : 78984
nmod - ROOT : 59397 -- obl : 22865
cop - ROOT : 8210 -- conj : 1130
det - ROOT : 94132 -- obl : 35805
ROOT - Empty
cc - conj : 21522 -- ROOT : 20475
conj - ROOT : 20311 -- nmod : 5332
punct - ROOT : 29766 -- parataxis : 103
obl - ROOT : 41910 -- advcl : 6825
amod - ROOT : 40490 -- nmod : 16100
obj - ROOT : 25681 -- advcl : 8167
advmod - ROOT : 19146 -- conj : 4002
flat - ROOT : 12429 -- nmod : 5051
appos - ROOT : 7950 -- obl : 2931
dep - ROOT : 468 -- conj : 87
mark - ROOT : 21264 -- advcl : 12200
acl:relcl - ROOT : 8254 -- obl : 2798
acl - ROOT : 4450 -- obj : 1443
expl:pv - ROOT : 9170 -- conj : 1631
fixed - ROOT : 5458 -- case : 2187
obl:arg - ROOT : 2480 -- conj : 618
xcomp - ROOT : 7009 -- conj : 1461
advcl - ROOT : 11799 -- conj : 1736
aux - ROOT : 5956 -- acl:relcl : 1323
parataxis - ROOT : 1518 -- conj : 11
obl:agent - ROOT : 2765 -- acl : 824
aux:p
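
Since every non-root token has the root among its ancestors, ROOT leads almost every row; the second entry is usually the more informative one. The sketch below shows how ancestor labels could be collected per label on head-index lists. It is an illustration of the idea, not the code in `tree_counter`, and the function name is made up.

```python
from collections import Counter, defaultdict

def count_ancestor_labels(sentences):
    """For each label, count the labels of all tokens on the path to the root."""
    counts = defaultdict(Counter)
    for heads, labels in sentences:
        for i, label in enumerate(labels):
            label_counts = counts[label]  # creates an empty Counter for roots
            j = i
            while heads[j] != j:
                j = heads[j]
                label_counts[labels[j]] += 1
    return counts

# "Subo a la montaña ." as (heads, labels)
sentences = [([0, 3, 3, 0, 0], ["ROOT", "case", "det", "obl", "punct"])]

ancestors = count_ancestor_labels(sentences)
print(dict(ancestors["case"]))  # {'obl': 1, 'ROOT': 1}
print(len(ancestors["ROOT"]))   # 0, the root has no ancestors
```

Counting only the first step of the climb instead of the whole path gives the direct-head variant.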

In [10]:
print("Most Common descendants")
# count_descendants counts the complete subtree;
# count_childreen counts only the direct children
print_count_dict(count_descendants)
#print_count_dict(count_childreen)

Most Common descendants
nummod - nummod : 10368 -- case : 1001
nsubj - nsubj : 30666 -- det : 28921
case - case : 108890 -- fixed : 2187
nmod - case : 78984 -- nmod : 76887
cop - cop : 8216 -- case : 139
det - det : 94164 -- case : 269
ROOT - case : 108843 -- det : 94132
cc - cc : 20476 -- fixed : 831
conj - conj : 23072 -- cc : 21522
punct - punct : 29766 -- case : 11
obl - case : 66242 -- obl : 44795
amod - amod : 40955 -- advmod : 1961
obj - det : 28200 -- obj : 27854
advmod - advmod : 19990 -- case : 1356
flat - flat : 12664 -- case : 157
appos - appos : 8165 -- flat : 3464
dep - dep : 470 -- case : 134
mark - mark : 21274 -- fixed : 1645
acl:relcl - case : 13345 -- det : 11962
acl - case : 5643 -- acl : 4574
expl:pv - expl:pv : 9170 -- case : 1
fixed - fixed : 5480 -- case : 82
obl:arg - obl:arg : 2489 -- case : 1629
xcomp - case : 7485 -- xcomp : 7433
advcl - case : 16592 -- det : 15584
aux - aux : 5959 -- det : 39
parataxis - det : 2974 -- case : 2935
obl:agent - case : 4467 -- 
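
Rows such as `nummod - nummod` suggest that each token is counted as part of its own subtree, which is also how SpaCy's `Token.subtree` behaves. Under that assumption, a sketch of the subtree count on head-index lists could look like this (the function name is illustrative, and this is not the code in `tree_counter`):

```python
from collections import Counter, defaultdict

def count_subtree_labels(sentences):
    """For each label, count the labels in the subtree, the token itself included."""
    counts = defaultdict(Counter)
    for heads, labels in sentences:
        # Build a head -> children map once per sentence
        children = defaultdict(list)
        for i, h in enumerate(heads):
            if h != i:
                children[h].append(i)
        for i, label in enumerate(labels):
            stack = [i]  # start at the token itself
            while stack:
                j = stack.pop()
                counts[label][labels[j]] += 1
                stack.extend(children[j])
    return counts

# "Subo a la montaña ." as (heads, labels)
sentences = [([0, 3, 3, 0, 0], ["ROOT", "case", "det", "obl", "punct"])]

subtrees = count_subtree_labels(sentences)
print(dict(subtrees["obl"]))
```

Starting the stack with `children[i]` instead of `[i]` would exclude the token itself and leave only true descendants.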

In [14]:
from src.helper import print_all_senteces_with_structure_indication
print(f"Total number sentences with this structure {len(all_sent_with_structure)}")
print_all_senteces_with_structure_indication(all_sent_with_structure,tree_to_check,nlp)

Total number sentences with this structure 3


NameError: name 'get_label' is not defined

## Bonus Tasks (Optional):
- Fine-tune the model: Adjust hyperparameters like batch size, learning rate, or the number of training iterations in config.cfg to optimize the model.
- Use more data: Extend your experiment by training on larger treebanks from other languages or domains and compare the analyses.
- Analyze errors: Evaluate where the parser struggles by comparing predicted dependencies with the gold standard in the development set.
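
For the error analysis task, one simple starting point is to count which gold labels get replaced by which predicted labels. The sketch below assumes gold and predicted label sequences that are already aligned token by token; the function name `label_confusions` and the example labels are hypothetical.

```python
from collections import Counter

def label_confusions(gold_labels, pred_labels):
    """Count (gold, predicted) label pairs for tokens whose labels differ."""
    if len(gold_labels) != len(pred_labels):
        raise ValueError("label sequences must be aligned and of equal length")
    return Counter(
        (g, p) for g, p in zip(gold_labels, pred_labels) if g != p
    )

# Hypothetical aligned labels for one sentence
gold = ["nsubj", "ROOT", "obl", "obj", "advcl"]
pred = ["nsubj", "ROOT", "obj", "obj", "obl"]

confusions = label_confusions(gold, pred)
for (g, p), n in confusions.most_common():
    print(f"gold={g} predicted={p} count={n}")
```

Accumulating this counter over the whole development set shows which label pairs the parser mixes up most often.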