# Training a Dependency Parser with spaCy and Universal Dependencies

- Download and unzip the latest Universal Dependencies release.
- Convert the CoNLL-U files to .spacy format using spaCy.
- (Optional) Create a custom language class for languages spaCy doesn't support (not needed for Basque).
- Instantiate a project with a configuration file for tokenization, tagging, and dependency parsing.
- Train the dependency parser.
- Evaluate the trained parser.
- Use the parser on example sentences.
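The steps above map onto spaCy's command-line interface. The following is a minimal sketch that only builds the commands so they can be inspected before running them. The helper name `build_spacy_commands`, the directory names, and the assumption that the treebank files were renamed to `train.conllu`, `dev.conllu`, and `test.conllu` are all illustrative and not part of the original notebook.

```python
def build_spacy_commands(lang, conllu_dir, corpus_dir, output_dir, config_path="config.cfg"):
    """Return the spaCy CLI calls for convert -> init config -> train -> evaluate.

    Hypothetical helper: paths and file names are assumptions for illustration.
    """
    spacy_cli = ["python", "-m", "spacy"]
    commands = []

    # 1. Convert each CoNLL-U split to spaCy's binary .spacy format
    for split in ("train", "dev", "test"):
        commands.append(spacy_cli + [
            "convert", f"{conllu_dir}/{split}.conllu", corpus_dir,
            "--converter", "conllu", "--n-sents", "10",
        ])

    # 2. Create a config with a tagger and a dependency parser
    commands.append(spacy_cli + [
        "init", "config", config_path, "--lang", lang, "--pipeline", "tagger,parser",
    ])

    # 3. Train; the best checkpoint is written to <output_dir>/model-best
    commands.append(spacy_cli + [
        "train", config_path, "--output", output_dir,
        "--paths.train", f"{corpus_dir}/train.spacy",
        "--paths.dev", f"{corpus_dir}/dev.spacy",
    ])

    # 4. Evaluate the best checkpoint on the held-out test split
    commands.append(spacy_cli + [
        "evaluate", f"{output_dir}/model-best", f"{corpus_dir}/test.spacy",
    ])
    return commands


commands = build_spacy_commands("eu", "ud_basque", "corpus", "output")
for cmd in commands:
    print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` once the paths match the local setup.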

First, download the Universal Dependencies dataset. We will use the latest release (v2.14). Choose a language with more than 1,000 sentences. The following examples use Basque.
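One way to check the size of a treebank before committing to it is to count the sentences in its CoNLL-U file. In CoNLL-U, sentences are separated by blank lines and comment lines start with `#`. The function name and the toy rows below are illustrative assumptions, not data from a real treebank.

```python
def count_conllu_sentences(lines):
    """Count sentence blocks in CoNLL-U formatted lines."""
    count = 0
    in_sentence = False
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence block
            in_sentence = False
        elif not line.startswith("#") and not in_sentence:
            # First token line of a new block
            count += 1
            in_sentence = True
    return count


# Toy input with two sentences (made-up rows, 10 tab-separated columns each)
sample = (
    "# sent_id = 1\n"
    "1\tA\ta\tX\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tB\tb\tX\t_\t_\t2\tdep\t_\t_\n"
    "2\tC\tc\tX\t_\t_\t0\troot\t_\t_\n"
)
print(count_conllu_sentences(sample.splitlines()))
```

For a real file, the same function can be called on an open file handle, for example `count_conllu_sentences(open(path, encoding="utf-8"))`.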

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from collections import Counter
import statistics

In [16]:
import spacy

nlp = spacy.load("./output_pers/model-best")


In [37]:
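from spacy import displacy

# Load the best checkpoint written by the training run
nlp = spacy.load("./output_pers/model-best")

# all_sent_with_structure is returned by the tree_counter cell further down,
# so that cell has to be executed before this one
doc = nlp(str(all_sent_with_structure[0]))

displacy.render(doc, style="dep", jupyter=True, options={"distance": 120})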

## Analyse Trees


In [6]:
import re

from src.helper import (
    printTree,
    getDepthAndDegree,
    avDict,
    printDict,
    get_lines_without_number,
    regex_clean,
    clean_spanish_text,
)

In [12]:
import re
from collections import Counter
from src.helper import clean_persian_text
persian_text_path = "fas_news_2020_100K/fas_news_2020_100K-sentences.txt"
text = get_lines_without_number(persian_text_path)

# Clean the Persian text
whole_text = clean_persian_text(text)

# Subtree pattern to search for: a node with the label `root`
# whose children carry the labels listed in `chil`
root = 'nmod'  # 'obj'
chil = ["nmod", "obj"]
tree_to_check = (root, chil)
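How `tree_counter` matches this pattern is not shown in the notebook. The sketch below is one plausible reading, assuming a sentence is given as a list of dependency labels plus a list of head indices (with the root pointing to itself, as spaCy does): a token matches if it has the root label and its direct children cover every label in the child list. The function name `find_matches` is hypothetical.

```python
from collections import Counter


def find_matches(deps, heads, pattern):
    """Return indices of tokens that match a (root_label, child_labels) pattern."""
    root_label, child_labels = pattern
    needed = Counter(child_labels)
    matches = []
    for i, dep in enumerate(deps):
        if dep != root_label:
            continue
        # Labels of the direct children of token i (the root points to itself)
        have = Counter(deps[j] for j, h in enumerate(heads) if h == i and j != i)
        if all(have[label] >= n for label, n in needed.items()):
            matches.append(i)
    return matches


# Toy sentence: token 1 is an nmod with children nmod, obj and case
deps = ["ROOT", "nmod", "nmod", "obj", "case"]
heads = [0, 0, 1, 1, 1]
print(find_matches(deps, heads, ("nmod", ["nmod", "obj"])))
```

Extra children such as `case` do not prevent a match in this sketch, which is consistent with the matched sentences printed later in the notebook.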


In [15]:
from src.treeCounter import tree_counter
(av_depth, mean_degree_dict, mean_distance_dict, leave_counter,
 count_ancestors, count_descendants, count_head, count_childreen,
 all_sent_with_structure) = tree_counter(whole_text, nlp, tree_to_check)

 Progress:  99950 / 100000

In [15]:

print("\nTotal sentences: ",len(whole_text)," Average Depth: ",av_depth)


Total sentences:  100000  Average Depth:  5.93683
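The notebook does not show how the average depth is computed. As a rough sketch, assume the depth of a tree is the largest number of edges between any token and the root, and that each sentence is given as a list of head indices in which the root points to itself. With spaCy, such a list could be built from `token.head.i`. The function names are illustrative.

```python
from statistics import mean


def distance_to_root(heads, i):
    """Number of edges between token i and the root."""
    steps = 0
    while heads[i] != i:
        i = heads[i]
        steps += 1
    return steps


def tree_depth(heads):
    """Depth of a tree: the longest token-to-root path, counted in edges."""
    return max(distance_to_root(heads, i) for i in range(len(heads)))


# Two toy sentences, each given as a list of head indices
toy_heads = [
    [1, 1, 1, 4, 2],  # token 3 -> 4 -> 2 -> 1 (root): depth 3
    [0, 0],           # token 1 -> 0 (root): depth 1
]
depths = [tree_depth(h) for h in toy_heads]
average_depth = mean(depths)
print(depths, average_depth)
```

If depth is counted in nodes rather than edges, every value shifts by one, so the convention should be stated alongside the reported average.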


In [16]:
print("Average Number of degrees")
printDict(mean_degree_dict)

#count_degree_dict

Average Number of degrees
ccomp=>4.3151
acl:relcl=>4.1853
ROOT=>3.8443
advcl=>3.8416
conj=>3.4540
nmod=>2.9839
obl=>2.9435
obj=>2.4991
xcomp=>2.2606
nsubj:pass=>2.1563
vocative=>2.0455
nsubj=>2.0429
dep=>2.0309
appos=>2.0033
dislocated=>1.8467
nmod:poss=>1.7897
parataxis=>1.7141
advmod=>1.6264
flat:foreign=>1.4508
flat=>1.4119
punct=>1.2592
compound=>1.2394
fixed=>1.2331
amod=>1.1021
cop=>1.1002
compound:lvc=>1.0938
compound:prt=>1.0513
nummod=>1.0509
mark=>1.0494
det:predet=>1.0451
aux=>1.0420
cc=>1.0384
case=>1.0339
aux:pass=>1.0205
cc:preconj=>1.0133
det=>1.0052


In [17]:
print("Average Distance To Root")
printDict(mean_distance_dict)

Average Distance To Root
cc=>4.6010
flat=>4.4632
flat:foreign=>4.2459
nmod:poss=>4.1007
amod=>4.0790
fixed=>4.0649
case=>3.7923
det=>3.7735
cc:preconj=>3.7200
conj=>3.4758
nmod=>3.3125
acl:relcl=>3.2316
det:predet=>3.1944
mark=>3.0085
dep=>2.6004
aux=>2.5920
compound=>2.5797
appos=>2.5673
compound:prt=>2.5214
aux:pass=>2.4717
compound:lvc=>2.4491
cop=>2.4026
obj=>2.3981
vocative=>2.3182
xcomp=>2.2437
obl=>2.2160
nummod=>2.1607
advcl=>2.1070
nsubj:pass=>1.9262
nsubj=>1.9209
advmod=>1.7381
ccomp=>1.6655
parataxis=>1.6031
dislocated=>1.4733
punct=>1.3062
ROOT=>0.0000
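A table like the one above can be produced by grouping token-to-root distances by dependency label and averaging each group. This is a minimal sketch under the same assumptions as before (label list plus head indices, root pointing to itself). The function name is hypothetical and the actual `tree_counter` implementation may differ.

```python
from collections import defaultdict
from statistics import mean


def mean_distance_by_label(sentences):
    """Average number of edges to the root, grouped by dependency label."""
    distances = defaultdict(list)
    for deps, heads in sentences:
        for i, dep in enumerate(deps):
            steps, node = 0, i
            # Walk up the tree until the root (which points to itself)
            while heads[node] != node:
                node = heads[node]
                steps += 1
            distances[dep].append(steps)
    return {dep: mean(values) for dep, values in distances.items()}


# Toy sentence: nsubj <- ROOT -> obl -> case
toy = [(["nsubj", "ROOT", "case", "obl"], [1, 1, 3, 1])]
result = mean_distance_by_label(toy)
for dep, value in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{dep}=>{value:.4f}")
```

The root always has distance 0 under this definition, which matches the `ROOT=>0.0000` row.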


In [18]:
print("Most Common Leave Nodes")
for label, count in leave_counter.most_common(10):
    print(label," count: ",count)

Most Common Leave Nodes
case  count:  257375
amod  count:  144834
nmod:poss  count:  128476
nummod  count:  105477
cc  count:  89433
compound:lvc  count:  58403
mark  count:  54009
det  count:  47536
advmod  count:  40748
nsubj  count:  31082
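Leaf nodes are tokens that are not the head of any other token. The sketch below counts their labels with a `Counter`, again assuming the label-list and head-index representation used for illustration. It is not the notebook's own implementation.

```python
from collections import Counter


def count_leaf_labels(sentences):
    """Count the dependency labels of tokens that have no children."""
    leaves = Counter()
    for deps, heads in sentences:
        # Every index that some other token points to has at least one child
        has_child = {h for i, h in enumerate(heads) if h != i}
        leaves.update(dep for i, dep in enumerate(deps) if i not in has_child)
    return leaves


# Toy sentence: nsubj <- ROOT -> obl -> case
toy = [(["nsubj", "ROOT", "case", "obl"], [1, 1, 3, 1])]
leaf_counts = count_leaf_labels(toy)
for label, count in leaf_counts.most_common(10):
    print(label, " count: ", count)
```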


In [25]:
# print_count_dict takes a dictionary that maps each dependency label to a Counter.
# Each Counter records how often other labels occurred in a given relation to the key
# (for example, as its head). For every key, the one or two most common entries are printed.
def print_count_dict(cd):
    # Iterate over every key in the dictionary
    for key in cd:
        # Take the two most common entries from the Counter for this key
        most_common = cd[key].most_common(2)

        # An empty Counter is reported as "Empty"
        if len(most_common) == 0:
            print(key, "- Empty")

        # Otherwise print the one or two most common entries with their counts
        elif len(most_common) == 1:
            print(key, "-", most_common[0][0],":",most_common[0][1])
        else:
            print(key, "-", most_common[0][0],":",most_common[0][1],"--",most_common[1][0],":",most_common[1][1] )
print("Most Common Ancestors")
# count_ancestors counts every label on the path up to the root;
# count_head counts only the direct head (the node immediately above).
# The table below uses count_head, despite the "Ancestors" heading.
#print_count_dict(count_ancestors)
print_count_dict(count_head)

Most Common Ancestors
nsubj - ROOT : 51652 -- ccomp : 27358
amod - nmod:poss : 59420 -- nmod : 23462
ROOT - ROOT : 100000
nmod:poss - nmod:poss : 127332 -- nmod : 61148
conj - nmod:poss : 23247 -- ROOT : 20296
aux - ccomp : 10408 -- ROOT : 7715
cop - ROOT : 9167 -- ccomp : 6695
dep - ROOT : 2291 -- conj : 1355
case - nmod : 121099 -- obl : 101987
fixed - case : 6820 -- cc : 2682
nmod - nmod:poss : 24429 -- ROOT : 19209
det:predet - obl : 73 -- nsubj : 71
det - nmod:poss : 12721 -- nsubj : 9684
obl - ROOT : 42932 -- ccomp : 31913
cc - conj : 87639 -- ROOT : 1028
acl:relcl - nmod:poss : 4507 -- nmod : 3988
advmod - ROOT : 62792 -- ccomp : 9886
nummod - advmod : 48495 -- ROOT : 29255
mark - ccomp : 25078 -- acl:relcl : 14612
compound - ccomp : 8054 -- ROOT : 7568
appos - ROOT : 3347 -- nsubj : 1950
obj - ROOT : 13629 -- ccomp : 13445
compound:lvc - ROOT : 22070 -- ccomp : 18537
ccomp - ROOT : 39175 -- ccomp : 12960
flat - nmod:poss : 16396 -- flat : 4820
parataxis - ROOT : 364 -- nummod :

In [22]:
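print("Most Common descendants")
# count_descendants counts the complete subtree below a node;
# count_childreen counts only the direct children.
# The table below uses count_childreen, despite the "descendants" heading.
#print_count_dict(count_descendants)
print_count_dict(count_childreen)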

Most Common descendants
nsubj - nmod:poss : 34548 -- amod : 16178
amod - nmod : 4022 -- conj : 3379
ROOT - advmod : 62792 -- nsubj : 51652
nmod:poss - nmod:poss : 127332 -- amod : 59420
conj - cc : 87639 -- nmod:poss : 29143
aux - obl : 256 -- obj : 163
cop - obl : 1398 -- nsubj : 275
dep - case : 1651 -- nmod:poss : 1045
case - fixed : 6820 -- conj : 1558
fixed - nmod:poss : 886 -- nmod : 532
nmod - case : 121099 -- nmod:poss : 61148
det:predet - fixed : 6 -- case : 5
det - det : 81 -- case : 62
obl - case : 101987 -- nmod:poss : 51012
cc - fixed : 2682 -- case : 219
acl:relcl - mark : 14612 -- obl : 11909
advmod - nummod : 48495 -- nmod : 4770
nummod - amod : 1604 -- punct : 873
mark - fixed : 2001 -- nmod : 201
compound - nummod : 3510 -- conj : 1169
appos - nmod:poss : 4307 -- amod : 1518
obj - case : 21560 -- nmod:poss : 15480
compound:lvc - case : 4321 -- nmod:poss : 483
ccomp - obl : 31913 -- nsubj : 27358
flat - nmod:poss : 5251 -- flat : 4820
parataxis - nsubj : 58 -- nmod : 5
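The two tables are mirror images of each other: every head relation counted for a label is also a child relation counted for its head, which is why `nsubj - ROOT : 51652` appears in the first table and `ROOT - ... nsubj : 51652` in the second. A minimal sketch of such a pair of counters follows, assuming the illustrative label and head-index representation. Following spaCy's convention that the root is its own head, the root is counted as having the head `ROOT`, which would explain the `ROOT - ROOT : 100000` row.

```python
from collections import Counter, defaultdict


def count_heads_and_children(sentences):
    """For each label, count the labels of its direct head and of its direct children."""
    head_counts = defaultdict(Counter)
    child_counts = defaultdict(Counter)
    for deps, heads in sentences:
        for i, dep in enumerate(deps):
            head_label = deps[heads[i]]
            head_counts[dep][head_label] += 1
            # The root points to itself and is not its own child
            if heads[i] != i:
                child_counts[head_label][dep] += 1
    return head_counts, child_counts


# Toy sentence: nsubj <- ROOT -> obl -> case
toy = [(["nsubj", "ROOT", "case", "obl"], [1, 1, 3, 1])]
head_counts, child_counts = count_heads_and_children(toy)
print(dict(head_counts))
print(dict(child_counts))
```

Both dictionaries map labels to `Counter` objects, so they can be passed directly to a function like `print_count_dict`.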

In [24]:
from src.helper import print_all_senteces_with_structure_indication
print(f"Total number sentences with this structure {len(all_sent_with_structure)}")
print_all_senteces_with_structure_indication(all_sent_with_structure,tree_to_check,nlp)

Total number sentences with this structure 15
The sentences is: 
این بدان معنا است که برزیل کانون اصلی شیوع کروناویروس در مریکای لاتین است متخصصین می گویند به دلیل عدم تست گیری کافی در این کشور مارهای واقعی بیشتر از ن چیزی است که اعلام می شود

Root:  کانون Root Type:  nmod
برزیل  :  obj
اصلی  :  amod
شیوع  :  nmod:poss
مریکای  :  nmod
است  :  cop

------------------

The sentences is: 
این تحلیلگر بازار سرمایه با اشاره به اینکه شاخص هم وزن همچنان ممکن ست منفی بماند در پاسخ به این سوال که یا افزایش قیمت دلار و سکه نشانه جابجایی نقدینگی از بورس به بازار ارز و سکه است

Root:  پاسخ Root Type:  nmod
در  :  case
سوال  :  nmod
نشانه  :  obj

------------------


The sentences is: 
به گزارش همشهری نلاین به نقل از ایرنا بهنام محمدی شنبه هفدهم خرداد افزود این تشسوزی که ساعت  امروز رخ داد با همکاری ماموران  پاسگاه انتظامی و همچنین مشارکت اهالی بومی منطقه یادشده در کمتر از  دقیقه به طول کامل مهار شد

Root:  کمتر Root Type:  nmod
در  :  case
   :  nmod
دقیقه  :  obj

-----------

## Bonus Tasks (Optional):
- Fine-tune the model: Adjust hyperparameters like batch size, learning rate, or the number of training iterations in config.cfg to optimize the model.
- Use more data: Extend your experiment by training on larger treebanks from other languages or domains and compare the analyses.
- Analyze errors: Evaluate where the parser struggles by comparing predicted dependencies with the gold standard in the development set.
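For the error analysis task, a common starting point is to compute the unlabeled and labeled attachment scores (UAS and LAS) and to tally which gold labels are most often mispredicted. The sketch below assumes that gold and predicted parses share the same tokenization and that each sentence is given as a pair of head indices and labels. The function name and the toy data are illustrative.

```python
from collections import Counter


def attachment_scores(gold, pred):
    """Return (UAS, LAS, label_errors) for parallel gold and predicted parses."""
    total = correct_heads = correct_both = 0
    label_errors = Counter()
    for (g_heads, g_deps), (p_heads, p_deps) in zip(gold, pred):
        for g_head, g_dep, p_head, p_dep in zip(g_heads, g_deps, p_heads, p_deps):
            total += 1
            if g_head == p_head:
                correct_heads += 1
                if g_dep == p_dep:
                    correct_both += 1
            if g_dep != p_dep:
                # Record (gold label, predicted label) confusions
                label_errors[(g_dep, p_dep)] += 1
    return correct_heads / total, correct_both / total, label_errors


# One toy sentence: the parser attaches "case" to the wrong head
# and labels the last token obj instead of obl
gold = [([1, 1, 3, 1], ["nsubj", "ROOT", "case", "obl"])]
pred = [([1, 1, 1, 1], ["nsubj", "ROOT", "case", "obj"])]
uas, las, label_errors = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
print(label_errors.most_common(5))
```

spaCy's own `evaluate` command reports the same two scores for a trained pipeline; the confusion counter is an addition that helps to see where the parser struggles.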