## Trankit overview.

<div id="div1"> <img src="trankit_model.png" width="700"> </div>


> Trankit outperforms other toolkits over all remaining tasks (e.g., POS
> and morphological tagging) in which the improvement boost is substantial and significant for sentence segmentation and dependency parsing. For
> example, English enjoys a 7.22% improvement for
> sentence segmentation, a 3.92% and 4.37% improvement for UAS and LAS in dependency parsing. 

*Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing*  
https://doi.org/10.48550/arXiv.2101.03289

---

![](perf.png)
![](perf2.png)

---

## Initializing the English pipeline from trankit

In [1]:
from trankit import Pipeline

p = Pipeline(lang='english', gpu=False, cache_dir='./cache')

Loading pretrained XLM-Roberta, this may take a while...
Loading tokenizer for english
Loading tagger for english
Loading lemmatizer for english
Loading NER tagger for english
Active language: english


**Notable performance problems (observed only on my hardware):**
- parsing large datasets on the CPU is slow,
- the GPU runs out of memory (my GPU has only 3 GiB, and it fills up fast).

![](perf3.png)

---

### Example dependency parsing using simple sentences from the presentation (20230123) and some random ones generated with ChatGPT.

In [2]:
text = '''I invited John, Mary, and Tom.
I played for an hour in the most famous jazz club in the whole of USA.
I played in the most famous jazz club in the whole of USA for an hour.
I invited to the club absolutely everybody in my address book.
I invited absolutely everybody in my address book to the club.
I took a shower, brushed my teeth, and went to bed.
I love the smell of freshly brewed coffee in the morning.
I prefer reading books to watching TV shows or movies.
The Simpsons is a beloved animated sitcom that has been entertaining audiences.
The Simpsons is the longest-running American primetime TV series.
Springfield is home to memorable characters.
Marge Simpson is the loving and supportive wife of Homer.
'''


all = p.posdep(text)

---

While parsing, Trankit builds a dictionary that holds every sentence together with its token-level annotations.

**Example of generated dictionary:**

In [3]:
all

{'text': 'I invited John, Mary, and Tom.\nI played for an hour in the most famous jazz club in the whole of USA.\nI played in the most famous jazz club in the whole of USA for an hour.\nI invited to the club absolutely everybody in my address book.\nI invited absolutely everybody in my address book to the club.\nI took a shower, brushed my teeth, and went to bed.\nI love the smell of freshly brewed coffee in the morning.\nI prefer reading books to watching TV shows or movies.\nThe Simpsons is a beloved animated sitcom that has been entertaining audiences.\nThe Simpsons is the longest-running American primetime TV series.\nSpringfield is home to memorable characters.\nMarge Simpson is the loving and supportive wife of Homer.\n',
 'sentences': [{'id': 1,
   'text': 'I invited John, Mary, and Tom.',
   'tokens': [{'id': 1,
     'text': 'I',
     'upos': 'PRON',
     'xpos': 'PRP',
     'feats': 'Case=Nom|Number=Sing|Person=1|PronType=Prs',
     'head': 2,
     'deprel': 'nsubj',
     'dsp

**Note that this output** does not include dependency length, but it can be derived from each token's `id` and `head` fields.
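As a minimal sketch of that extraction, assuming each token carries integer `id` and `head` fields as in the output above, the length of a dependency in words can be taken as `abs(head - id)`. The helper name `dependency_lengths` and the hand-made `sample` dictionary are illustrative only and are not part of Trankit.

```python
def dependency_lengths(parsed):
    """Return (token text, deprel, length in words) for every non-root token."""
    rows = []
    for sentence in parsed["sentences"]:
        for token in sentence["tokens"]:
            if token["head"] == 0:  # the root has no governor
                continue
            length = abs(token["head"] - token["id"])
            rows.append((token["text"], token["deprel"], length))
    return rows


# Hand-made stand-in for Trankit output (only the fields used here).
sample = {"sentences": [{"tokens": [
    {"id": 1, "text": "I", "head": 2, "deprel": "nsubj"},
    {"id": 2, "text": "invited", "head": 0, "deprel": "root"},
    {"id": 3, "text": "John", "head": 2, "deprel": "obj"},
]}]}

print(dependency_lengths(sample))
```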

**Auxiliary functions that:**  
(a) build a dictionary with the data needed to generate a $LaTeX$ file with dependency trees,  
(b) add other useful information, such as the placement of the governor.

In [51]:
def parse_sentence(dep_parsed):
    """Convert Trankit output into deptext rows and (head, id, deprel) triples."""
    result = {'parsed_latex': []}
    for i, sentence in enumerate(dep_parsed["sentences"]):
        parsed = ""
        deps = []
        for token in sentence["tokens"]:
            word = token["text"]
            # Escape characters that are special in LaTeX.
            if word in ("$", "%", "&"):
                word = "\\" + word
            # Every deptext cell is followed by the column separator " \&".
            parsed += word + " \\&"
            deps.append((token["head"], token["id"], token["deprel"]))

        parsed += " \\\\"
        result["parsed_latex"].append({'id': i + 1, 'sentence': parsed, 'dependencies': deps})

    return result

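def extract_info(dep_parsed):
    """Collect dependency lengths (in words), split by the side the governor is on."""
    direction = {"left": [], "right": []}
    for entry in dep_parsed["parsed_latex"]:
        for head, dependent, deprel in entry["dependencies"]:
            # Skip the root (head 0 means no governor) and punctuation;
            # the punctuation relation is reported in lower case ("punct").
            if head == 0 or deprel == "punct":
                continue
            dependency_length_words = abs(head - dependent)
            if dependent > head:    # governor precedes the dependent
                direction["left"].append(dependency_length_words)
            elif dependent < head:  # governor follows the dependent
                direction["right"].append(dependency_length_words)
    return direction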

**Function generating the $LaTeX$ file with dependency trees.**

In [53]:
def generate_latex(template, file_id, dep_parsed):
    """Write dependency_<file_id>.tex with one dependency tree per sentence."""
    filename = f'dependency_{file_id}.tex'
    latex_dependency_begin = '''\\hspace*{-4cm}
    \\begin{dependency}[theme = simple]\n'''

    latex_dependency_end = "\\end{dependency}\n"
    latex_sentence_begin = "\\begin{deptext}[column sep = 0.5em]\n"
    latex_sentence_end = "\\end{deptext}\n"
    code = ""

    for entry in dep_parsed["parsed_latex"]:
        depends = entry["dependencies"]
        edges = ""
        if len(depends) < 18:  # draw only sentences with fewer than 18 dependencies
            for head, dependent, label in depends:
                if head == 0:
                    # Doubled braces are literal braces inside an f-string.
                    edges += f"\\deproot[edge unit distance=4ex, edge style=dotted]{{{dependent}}}{{}}\n"
                elif label != "punct":
                    # Edge label: dependency relation (deprel).
                    # For the distance in words, use abs(head - dependent) instead.
                    edges += f"\\depedge{{{head}}}{{{dependent}}}{{{label}}}\n"
            code += latex_dependency_begin + latex_sentence_begin + entry["sentence"] + "\n" + latex_sentence_end + edges + "\n" + latex_dependency_end + "\n"

    with open(filename, "w") as latex_file:
        latex = template.replace(r'\VAR{sentence}', code)
        latex_file.write(latex)


# Load the LaTeX template once; it is reused by every generate_latex call.
with open("template_dep.tex") as template_file:
    latex_code = template_file.read()

**Parsing the example text and creating a $LaTeX$ file for insight.**

In [54]:
dep_parsed = parse_sentence(all)
dirs = extract_info(dep_parsed)
generate_latex(latex_code, "example", dep_parsed)

**Attempt at calculating basic statistics about the data.**

In [None]:
import statistics as st
print("Mean distance in words when governor on the left: " + str(round(st.mean(dirs["left"]), 2)) 
      + "\nMean distance in words when governor on the right: " + str(round(st.mean(dirs["right"]), 2)))

---

## Using a short snippet of COCA for sample parsing.

https://www.corpusdata.org/formats.asp

In [55]:
import re

# Read lines 2-5 of the sample file; the context manager closes the file.
with open("text_blog.txt", "r") as example_corpus:
    snippet = example_corpus.readlines()[1:5]
plain = "".join(snippet)

# Clean the data a bit: "@" followed by any 8 characters, headers, paragraph tags
patterns = [r"@........", r"<h>.*?<p>", r"<p>"]
result = plain
for pat in patterns:
    result = re.sub(pat, "", result)

print(result)

  Emerging from the fog , as it were . I always forget just how all-consuming the first throes of rehearsal for an opera can be . Especially if , as in this case , the lead role is not yet cast ( it 's a hellishly difficult sing , and the intended singer pulled out just before musical rehearsals began . He is a consummate professional , so he must have had very solid reasons , but that still leaves other colleagues ( luckily not me : the role I am covering is not so big ) a bit adrift in a piece which requires long dialogues with the lead ... Let 's just say that does n't lead to happy relaxation , on the whole .  And I also have to admit to having been nervous before the initial music call . I had n't sung for this music director before ; and no matter what anyone says , it FEELS like an audition when you sing for someone in a position of power for the first time . It would of course have helped to have had masses of   had a more voice-friendly opera to start with , but hey , beggars 

**Parsing the sample.**

In [None]:
par = p.posdep(result)

**Generating $LaTeX$ file for insight.**

In [56]:
parp = parse_sentence(par)
generate_latex(latex_code, "coca_snippet", parp)

**Basic statistics**

In [None]:
dirs = extract_info(parp) 
print("Mean distance in words when governor on the left: " + str(round(st.mean(dirs["left"]), 2)) 
      + "\nMean distance in words when governor on the right: " + str(round(st.mean(dirs["right"]), 2)))

---

**Noteworthy issues:**
- the data must be cleaned more thoroughly than is done here,
- all headers must be removed (this may be a problem only for this sample),
- some anomalies may be hard to remove automatically (although the full version of COCA may not have similar problems),
- contracted words in this corpus are split into separate tokens (should they be, for our needs?),
- it is not yet clear whether parsing according to paradigms other than Universal Dependencies ("Stanford") is possible,
- **(*)** some sentences are too long to fit on a standard A4 page in the $LaTeX$ file,
- **(*)** special characters in the plain text cause a lot of trouble when building the $LaTeX$ file.


**(*)** = not relevant to our purposes.
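The special-character problem from the list above could be reduced with a small escaping helper. The sketch below is one possible approach, not part of the notebook's own functions: `latex_escape` and its replacement table are assumptions, and the table covers only the ten characters that LaTeX reserves in ordinary text. Inside `deptext`, `\&` also serves as the column separator, so an ampersand token would still need separate handling there; the sketch does not solve that case.

```python
# Replacement table for LaTeX's reserved characters (illustrative only).
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def latex_escape(text):
    """Escape reserved characters in one pass, so nothing is escaped twice."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


print(latex_escape("100% of A&B cost $5_#1"))
```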

---

**To do:**
1. Acquire the full COCA corpus.
2. Clean the corpus so that there won't be any anomalies.
3. Parse the whole corpus.
4. Find out if it's possible to use other paradigms for parsing.
5. Figure out a reliable way of counting dependency distance in words, syllables and letters.
6. Conduct a statistical analysis of the results.
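For item 5, one possible starting point is sketched below. It assumes that the distance in words is `abs(head - id)` and that the distance in letters is the number of alphabetic characters in the tokens lying strictly between the dependent and its governor. Both definitions, and the name `distances`, are assumptions made for illustration. Syllable counting is left out because it needs a language-specific syllabifier.

```python
def distances(tokens):
    """Yield (dependent, governor, words, letters) for each non-root token.

    `tokens` is one sentence's token list with integer `id` and `head` fields.
    """
    by_id = {tok["id"]: tok for tok in tokens}
    for tok in tokens:
        if tok["head"] == 0:  # root: no governor, hence no distance
            continue
        low, high = sorted((tok["id"], tok["head"]))
        # Tokens strictly between the dependent and its governor.
        between = [by_id[i]["text"] for i in range(low + 1, high)]
        letters = sum(ch.isalpha() for word in between for ch in word)
        yield tok["text"], by_id[tok["head"]]["text"], high - low, letters


# Hand-made tokens (only the fields used here).
tokens = [
    {"id": 1, "text": "I", "head": 2},
    {"id": 2, "text": "took", "head": 0},
    {"id": 3, "text": "a", "head": 4},
    {"id": 4, "text": "shower", "head": 2},
]

for row in distances(tokens):
    print(row)
```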

---

## Drawing a sample from a Polish corpus.

In [None]:
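# Read the first line of the results file; the context manager closes the file.
# (`sentence` is not used below: the parsed sentence is given as a literal.)
with open("results", "r") as res:
    sentence = res.readline()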

dictionary = {"text": 
            "", "sentences": 
            [{"id": 1, "text": "", "tokens": [{'id': 1, 'text': 'Brakowało', 'upos': 'VERB', 'xpos': 
            'praet:sg:n:imperf', 'feats': 'Aspect=Imp|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act', 
            'head': 0, 'deprel': 'root', 'dspan': (0, 9), 'span': (0, 9), 'lemma': 'brakować'}, 
            {'id': 2, 'text': 'nam', 'upos': 'PRON', 'xpos': 'ppron12:pl:dat:m1:pri', 
            'feats': 'Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=1|PronType=Prs', 
            'head': 1, 'deprel': 'iobj', 'dspan': (10, 13), 'span': (10, 13), 'lemma': 'my'}, 
            {'id': 3, 'text': 'i', 'upos': 'CCONJ', 'xpos': 'conj', 'head': 4, 'deprel': 'advmod:emph', 
            'dspan': (14, 15), 'span': (14, 15), 'lemma': 'i'}, 
            {'id': 4, 'text': 'im', 'upos': 'PRON', 'xpos': 'ppron3:pl:dat:f:ter:akc:npraep', 
            'feats': 'Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long', 
            'head': 1, 'deprel': 'conj', 'dspan': (16, 18), 'span': (16, 18), 'lemma': 'on'}, 
             {'id': 5, 'text': 'do', 'upos': 'ADP', 'xpos': 'prep:gen', 'feats': 'AdpType=Prep',
            'head': 6, 'deprel': 'case', 'dspan': (19, 21), 'span': (19, 21), 'lemma': 'do'}, 
             {'id': 6, 'text': 'szczęścia', 'upos': 'NOUN', 'xpos': 'subst:sg:gen:n:ncol',
            'feats': 'Case=Gen|Gender=Neut|Number=Sing', 'head': 1, 'deprel': 'obl', 'dspan': (22, 31),
            'span': (22, 31), 'lemma': 'szczęście'}, 
             {'id': 7, 'text': 'ładnej', 'upos': 'ADJ', 'xpos': 'adj:sg:gen:f:pos',
            'feats': 'Case=Gen|Degree=Pos|Gender=Fem|Number=Sing',
            'head': 8, 'deprel': 'amod', 'dspan': (32, 38), 'span': (32, 38), 'lemma': 'ładny'},
            {'id': 8, 'text': 'pogody', 'upos': 'NOUN', 'xpos': 'subst:sg:gen:f',
            'feats': 'Case=Gen|Gender=Fem|Number=Sing', 'head': 1, 'deprel': 'iobj',
            'dspan': (39, 45), 'span': (39, 45), 'lemma': 'pogoda'},
            {'id': 9, 'text': 'oraz', 'upos': 'CCONJ', 'xpos': 'conj', 'head': 11,
            'deprel': 'cc', 'dspan': (46, 50), 'span': (46, 50), 'lemma': 'oraz'},
            {'id': 10, 'text': 'lepszego', 'upos': 'ADJ', 'xpos':
            'adj:sg:gen:m3:com', 'feats': 'Animacy=Inan|Case=Gen|Degree=Cmp|Gender=Masc|Number=Sing',
            'head': 11, 'deprel': 'amod', 'dspan': (51, 59), 'span': (51, 59), 'lemma': 'dobry'}, 
            {'id': 11, 'text': 'nastroju', 'upos': 'NOUN', 'xpos': 'subst:sg:gen:m3', 'feats':
            'Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing', 'head': 8, 'deprel': 'conj',
            'dspan': (60, 68), 'span': (60, 68), 'lemma': 'nastrój'}, {'id': 12, 'text':
            '.', 'upos': 'PUNCT', 'xpos': 'interp', 'feats': 'PunctType=Peri',
            'head': 1, 'deprel': 'punct', 'dspan': (68, 69), 'span': (68, 69), 'lemma': '.'}]}]}
dictionary
parse = parse_sentence(dictionary)
generate_latex(latex_code, "polish", parse)

---

## CoNLL-U conversion test.

In [20]:
all["sentences"][0]["tokens"][6]

{'id': 7,
 'text': 'and',
 'upos': 'CCONJ',
 'xpos': 'CC',
 'head': 8,
 'deprel': 'cc',
 'dspan': (22, 25),
 'span': (22, 25)}

In [47]:
import csv

with open("columns.csv", "r") as csvfile:
    reader = csv.reader(csvfile, dialect='excel-tab')
    cols = next(reader)

print(cols)

['governor.position', 'governor.word', 'governor.nkjp.tag', 'governor.pos', 'governor.ms', 'conjunction.word', 'conjunction.nkjp.tag', 'conjunction.pos', 'conjunction.ms', '1.conjunct', '1.dep.label', '1.head.word', '1.head.nkjp.tag', '1.head.pos', '1.head.ms', '1.words', '1.tokens', '1.chars', '2.conjunct', '2.dep.label', '2.head.word', '2.head.nkjp.tag', '2.head.pos', '2.head.ms', '2.words', '2.tokens', '2.chars', 'sentence', 'sent_id', 'genre', 'converted.from.file']


In [48]:
# newline="" lets the csv module control line endings itself.
with open("conll-test.csv", "w", newline="") as conll:
    writer = csv.writer(conll)
    writer.writerow(cols)
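
For comparison, a token dictionary like the one shown above can be turned into a CoNLL-U line directly. The sketch below assumes the ten standard CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and writes `_` for any field the token lacks. `token_to_conllu` is a hypothetical helper, not a Trankit function.

```python
# Token keys in CoNLL-U column order (ID ... DEPREL).
CONLLU_KEYS = ["id", "text", "lemma", "upos", "xpos", "feats", "head", "deprel"]


def token_to_conllu(token):
    """Format one token dict as a tab-separated CoNLL-U line."""
    fields = [str(token.get(key, "_")) for key in CONLLU_KEYS]
    fields += ["_", "_"]  # DEPS and MISC are left unspecified
    return "\t".join(fields)


# Token copied from the output above.
token = {'id': 7, 'text': 'and', 'upos': 'CCONJ', 'xpos': 'CC',
         'head': 8, 'deprel': 'cc', 'dspan': (22, 25), 'span': (22, 25)}

print(token_to_conllu(token))
```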