## Universal Dependencies!
### Notes on **Python** and Universal Dependencies!

Background reading: these [notes](https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_iii/05_text_linguistics.html)

Library: [conllu](https://github.com/EmilStenstrom/conllu/)

In [50]:
!pip install conllu



In [51]:
import conllu as cll

data = """
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

"""
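Each token line in the sample above has ten columns. As a rough illustration only (plain Python, not how conllu is implemented), the sketch below names those columns with the same keys that the parsed tokens display further down. `split_columns` is a hypothetical helper, and it leaves every value as a string.

```python
import re

# The ten CoNLL-U fields, using the key names shown for a parsed token
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def split_columns(line):
    """Hypothetical helper: split one token line on tabs or runs of 2+ spaces."""
    values = re.split(r"\t| {2,}", line.strip())
    return dict(zip(FIELDS, values))


row = split_columns(
    "4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _"
)
print(row["form"], row["upos"], row["head"], row["deprel"])
```

An underscore in a column means the field is empty for that token.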

In [52]:
sentences = cll.parse(data)
sentences

[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>]

In [53]:
## access the first sentence (the only one in this case!)

sentence = sentences[0]

In [54]:
## a sentence works like a list whose elements are dictionary-like tokens!

token = sentence[0]

In [55]:
token

{'id': 1,
 'form': 'The',
 'lemma': 'the',
 'upos': 'DET',
 'xpos': 'DT',
 'feats': {'Definite': 'Def', 'PronType': 'Art'},
 'head': 4,
 'deprel': 'det',
 'deps': None,
 'misc': None}

An aside on **dictionaries!** :)
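As a quick refresher, the sketch below rebuilds the token printed above as a plain `dict` and walks through the usual access patterns. The real object comes from conllu and, judging by the output above, behaves like a dictionary; the hand-written copy here is only for illustration.

```python
# Hand-written copy of the first token, for illustration only
token = {
    "id": 1, "form": "The", "lemma": "the", "upos": "DET", "xpos": "DT",
    "feats": {"Definite": "Def", "PronType": "Art"},
    "head": 4, "deprel": "det", "deps": None, "misc": None,
}

# Look up a value by key
print(token["form"])

# 'feats' is itself a dictionary, so lookups can be chained
print(token["feats"]["PronType"])

# .get() returns a default instead of raising KeyError for a missing key
degree = token["feats"].get("Degree", "n/a")
print(degree)

# Iterate over key/value pairs, skipping the empty fields
filled = {key: value for key, value in token.items() if value is not None}
print(sorted(filled))
```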

Now, back to **UD**

In [56]:
sentence

TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>

In [57]:
## How can we filter the tokens of a sentence?

## by form
sentence.filter(form="quick")

TokenList<quick>

In [58]:
## by upos
sentence.filter(upos="VERB")

TokenList<jumps>

In [59]:
## the verb!

token = sentence[4]

In [60]:
token

{'id': 5,
 'form': 'jumps',
 'lemma': 'jump',
 'upos': 'VERB',
 'xpos': 'VBZ',
 'feats': {'Mood': 'Ind',
  'Number': 'Sing',
  'Person': '3',
  'Tense': 'Pres',
  'VerbForm': 'Fin'},
 'head': 0,
 'deprel': 'root',
 'deps': None,
 'misc': None}

In [61]:
## by feature: note that we are reaching into a nested dictionary
sentence.filter(feats__Degree="Pos")

TokenList<quick, brown, lazy>
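To make the `feats__Degree` lookup less magical, here is a plain-Python sketch of the same kind of selection. It is not conllu's implementation: `tokens` is a hand-copied subset of the sentence, and the `or {}` guard assumes that an empty feature column (`_`) ends up as `None`, like the `deps` and `misc` fields shown earlier.

```python
# Hand-copied subset of the sentence, keeping only the keys used below
tokens = [
    {"form": "quick", "upos": "ADJ", "feats": {"Degree": "Pos"}},
    {"form": "fox", "upos": "NOUN", "feats": {"Number": "Sing"}},
    {"form": "jumps", "upos": "VERB", "feats": {"Mood": "Ind", "Tense": "Pres"}},
    {"form": "over", "upos": "ADP", "feats": None},  # assumption: "_" becomes None
    {"form": "lazy", "upos": "ADJ", "feats": {"Degree": "Pos"}},
]

# Top-level key: the same idea as filtering by upos
verbs = [t["form"] for t in tokens if t["upos"] == "VERB"]

# Nested key: step into 'feats' first, guarding against None
positives = [
    t["form"] for t in tokens
    if (t["feats"] or {}).get("Degree") == "Pos"
]

print(verbs)
print(positives)
```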

In [62]:
## let's look up the heads!
heads = {}

for token in sentence.filter(feats__Degree="Pos"):
    heads[token['id']] = token['head']

In [63]:
heads

{2: 4, 3: 4, 8: 9}

In [64]:
## let's look up the heads of all the tokens
heads_all = {}

for token in sentence:
    heads_all[token['id']] = token['head']

In [65]:
heads_all

{1: 4, 2: 4, 3: 4, 4: 5, 5: 0, 6: 9, 7: 9, 8: 9, 9: 5, 10: 5}
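The mapping above goes from each token to its head. Inverting it gives the dependents of every head, which is often the more convenient direction when walking a tree. This is a small sketch in plain Python that starts from the printed dictionary; the name `children` is just an illustrative choice.

```python
from collections import defaultdict

# Token id -> head id, copied from the output above
heads_all = {1: 4, 2: 4, 3: 4, 4: 5, 5: 0, 6: 9, 7: 9, 8: 9, 9: 5, 10: 5}

# Head id -> list of dependent ids
children = defaultdict(list)
for token_id, head_id in heads_all.items():
    children[head_id].append(token_id)

# Head 0 is not a real token: it marks the root of the sentence
root_id = children[0][0]

print(root_id)
print(children[root_id])
print(children[9])
```

Here token 5 (`jumps`) is the root, and its direct dependents are tokens 4, 9 and 10.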

In [66]:
## another way :)

## store each token's form together with its head
heads_all_form = {}

for token in sentence:
    heads_all_form[token['id']] = [token['form'], token['head']]

In [67]:
heads_all_form

{1: ['The', 4],
 2: ['quick', 4],
 3: ['brown', 4],
 4: ['fox', 5],
 5: ['jumps', 0],
 6: ['over', 9],
 7: ['the', 9],
 8: ['lazy', 9],
 9: ['dog', 5],
 10: ['.', 5]}
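Since this dictionary is keyed by token id, a head id can be turned back into a word by looking it up in the same dictionary. The sketch below does that in plain Python, starting from the printed output; the label `"ROOT"` for head 0 is an illustrative choice.

```python
# Token id -> [form, head id], copied from the output above
heads_all_form = {
    1: ['The', 4], 2: ['quick', 4], 3: ['brown', 4], 4: ['fox', 5],
    5: ['jumps', 0], 6: ['over', 9], 7: ['the', 9], 8: ['lazy', 9],
    9: ['dog', 5], 10: ['.', 5],
}

pairs = []
for token_id, (form, head_id) in heads_all_form.items():
    # Head 0 has no entry of its own: it marks the root of the sentence
    head_form = heads_all_form[head_id][0] if head_id != 0 else "ROOT"
    pairs.append((form, head_form))

for form, head_form in pairs:
    print(f"{form} -> {head_form}")
```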

In [68]:
## How can we change the data?

sentence[3]["form"] = "cat"

In [69]:
sentence

TokenList<The, quick, brown, cat, jumps, over, the, lazy, dog, .>

In [70]:
## And what if we want to go back to CoNLL-U format?

sentence.serialize()

'# text = The quick brown fox jumps over the lazy dog.\n1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_\n2\tquick\tquick\tADJ\tJJ\tDegree=Pos\t4\tamod\t_\t_\n3\tbrown\tbrown\tADJ\tJJ\tDegree=Pos\t4\tamod\t_\t_\n4\tcat\tfox\tNOUN\tNN\tNumber=Sing\t5\tnsubj\t_\t_\n5\tjumps\tjump\tVERB\tVBZ\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_\n6\tover\tover\tADP\tIN\t_\t9\tcase\t_\t_\n7\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t9\tdet\t_\t_\n8\tlazy\tlazy\tADJ\tJJ\tDegree=Pos\t9\tamod\t_\t_\n9\tdog\tdog\tNOUN\tNN\tNumber=Sing\t5\tnmod\t_\tSpaceAfter=No\n10\t.\t.\tPUNCT\t.\t_\t5\tpunct\t_\t_\n\n'
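Note that in the output above only the second column of token 4 changed: the lemma is still `fox` and the `# text` comment still holds the original sentence, because only `form` was reassigned. The sketch below is a simplified picture, in plain Python, of how one token dictionary maps back to a tab-separated line. `token_to_line` is a hypothetical helper, not conllu's serializer, and it only handles the value types that appear in this notebook.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def token_to_line(token):
    """Hypothetical helper: render one token dict as a CoNLL-U line."""
    cells = []
    for key in FIELDS:
        value = token[key]
        if value is None:
            cells.append("_")  # empty field
        elif isinstance(value, dict):
            # Feature-style fields are written as Key=Value pairs joined by "|"
            cells.append("|".join(f"{k}={v}" for k, v in value.items()))
        else:
            cells.append(str(value))
    return "\t".join(cells)


# Hand-written copy of token 4, for illustration only
token = {
    "id": 4, "form": "fox", "lemma": "fox", "upos": "NOUN", "xpos": "NN",
    "feats": {"Number": "Sing"}, "head": 5, "deprel": "nsubj",
    "deps": None, "misc": None,
}

# Update form and lemma together so the two columns stay consistent
token["form"] = "cat"
token["lemma"] = "cat"

line = token_to_line(token)
print(line)
```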

In [71]:
with open('sentence.conllu', 'w', encoding='utf-8') as f:
    # serialize() returns a single string, so write() is enough;
    # to save several sentences: f.writelines(s.serialize() for s in sentences)
    f.write(sentence.serialize())

In [72]:
## now, we read it back :)

# open() is a builtin, so no extra import is needed;
# the with-block closes the file once it has been read
with open("sentence.conllu", "r", encoding="utf-8") as data_file:
    sentences = cll.parse(data_file.read())

In [73]:
sentences[0]

TokenList<The, quick, brown, cat, jumps, over, the, lazy, dog, .>

In [74]:
## look at the first token again, this time from the sentence we just re-read

token = sentences[0][0]

In [75]:
token

{'id': 1,
 'form': 'The',
 'lemma': 'the',
 'upos': 'DET',
 'xpos': 'DT',
 'feats': {'Definite': 'Def', 'PronType': 'Art'},
 'head': 4,
 'deprel': 'det',
 'deps': None,
 'misc': None}