**Constituency Parsing**

In [1]:
!pip3 install sentencepiece
!pip3 install benepar
!pip3 install -U pip setuptools wheel
!pip3 install -U spacy[cuda102]
!python3 -m spacy download en_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.98-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.98
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5
  Downloading torch_struct-0.5-py3-none-any.whl (34 kB)
Collecting tokenizers>=0.9.4
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting

Importing the dependencies and initializing the models

In [2]:
import benepar, spacy
benepar.download('benepar_en3')

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.


True

In [3]:
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})

<benepar.integrations.spacy_plugin.BeneparComponent at 0x7f1a5a2d0760>

Extracting the constituency tree

In [4]:
texto = 'Hal, switch to manual hibernation control.'

doc = nlp(texto)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [5]:
sent = list(doc.sents)[0]
sent._.parse_string

'(FRAG (INTJ (UH Hal)) (, ,) (VP (VB switch) (PP (IN to) (NP (JJ manual) (NN hibernation) (NN control)))) (. .))'
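
The bracketed string above is compact but hard to read. As a minimal sketch (plain Python, no benepar or spaCy required), the helpers below parse it into nested tuples and print one node per line. The names `parse_brackets`, `leaves`, and `format_tree` are hypothetical helpers introduced only for this illustration; they are not part of benepar.

```python
def parse_brackets(s):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Leaves (the words) are plain strings.
    """
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of the tree, left to right."""
    _, children = node
    out = []
    for child in children:
        out.extend([child] if isinstance(child, str) else leaves(child))
    return out


def format_tree(node, depth=0):
    """Return one indented line per node; a POS tag and its word share a line."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return ['  ' * depth + f'{label} {children[0]}']
    lines = ['  ' * depth + label]
    for child in children:
        lines.extend(format_tree(child, depth + 1))
    return lines


# The parse string printed by the cell above.
parse_string = ('(FRAG (INTJ (UH Hal)) (, ,) (VP (VB switch) (PP (IN to) '
                '(NP (JJ manual) (NN hibernation) (NN control)))) (. .))')

tree = parse_brackets(parse_string)
print('\n'.join(format_tree(tree)))
```

This sketch assumes that brackets never appear as words, which holds for this sentence.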

Iterating over the constituents

In [6]:
for const in sent._.constituents:
  if len(const._.labels) != 0:
    print(const._.labels, const)

('FRAG',) Hal, switch to manual hibernation control.
('INTJ',) Hal
('VP',) switch to manual hibernation control
('PP',) to manual hibernation control
('NP',) manual hibernation control
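
The labelled spans listed above can also be recovered directly from the bracketed parse string. The following is a rough sketch in plain Python that walks the string once with a stack and keeps only phrase-level nodes, that is, nodes with at least one child node. It rests on the assumption that POS-tag nodes such as `(UH Hal)` are what the loop above filters out through the empty-labels check. `labeled_spans` is a hypothetical helper, not a benepar function.

```python
def labeled_spans(parse_string):
    """Return (label, words) for every phrase-level node of a bracketed tree."""
    tokens = parse_string.replace('(', ' ( ').replace(')', ' ) ').split()
    words = []   # words read so far
    stack = []   # open nodes as [label, start index, has a child node?]
    spans = []   # closed phrase-level nodes as (start, end, label)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == '(':
            if stack:
                stack[-1][2] = True  # the enclosing node has a child node
            stack.append([tokens[i + 1], len(words), False])
            i += 2  # skip the bracket and the label
        elif tok == ')':
            label, start, is_phrase = stack.pop()
            if is_phrase:
                spans.append((start, len(words), label))
            i += 1
        else:
            words.append(tok)
            i += 1
    # Order by start position, longer spans first.
    spans.sort(key=lambda span: (span[0], -span[1]))
    return [(label, words[start:end]) for start, end, label in spans]


parse_string = ('(FRAG (INTJ (UH Hal)) (, ,) (VP (VB switch) (PP (IN to) '
                '(NP (JJ manual) (NN hibernation) (NN control)))) (. .))')

spans = labeled_spans(parse_string)
for label, span_words in spans:
    print(label, span_words)
```

Unlike the spaCy spans, the words come back as a list of tokens, so the original spacing around punctuation is not reproduced.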


**Dependency Parsing**



In [32]:
!pip3 install tabulate
!pip install spacy-transformers 
!python3 -m spacy download pt_core_news_lg
!python3 -m spacy download en_core_web_trf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pt-core-news-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_lg-3.5.0/pt_core_news_lg-3.5.0-py3-none-any.whl (568.2 MB)
✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_lg')
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-trf==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.5.0/en_core_web_trf-3.5.0-py3-none-any.whl (460.3 MB)


Parsing in Portuguese


In [8]:
import spacy

spacy.prefer_gpu()
nlp = spacy.load('pt_core_news_lg')

In [9]:
doc = nlp('Hal, passe para o controle de hibernação.')

In [10]:
import tabulate

# One row per token: index, lemma, POS tag, morphology, dependency label, head.
data = []
for token in doc:
    data.append((token.i, token.lemma_, token.pos_, token.morph, token.dep_, token.head))

header = ['Idx', 'Lemma', 'Part of Speech', 'Morphology', 'Dependency', 'Governor']
print(tabulate.tabulate(data, header))

  Idx  Lemma       Part of Speech    Morphology                                             Dependency    Governor
-----  ----------  ----------------  -----------------------------------------------------  ------------  ----------
    0  Hal         PROPN             Gender=Masc|Number=Sing                                nsubj         passe
    1  ,           PUNCT                                                                    punct         passe
    2  passe       VERB              Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  ROOT          passe
    3  para        ADP                                                                      case          controle
    4  o           DET               Definite=Def|Gender=Masc|Number=Sing|PronType=Art      det           controle
    5  controle    NOUN              Gender=Masc|Number=Sing                                nmod          passe
    6  de          ADP

Finding substructures

In [18]:
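def get_subfrase(root, tokens, subfrase):
    """Collect (index, text) pairs for root and all of its descendants."""
    # tokens is unused; it is kept so the existing three-argument call still works.
    subfrase.append((root.i, str(root)))

    for token in root.children:
        subfrase = get_subfrase(token, tokens, subfrase)
    return subfrase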

In [19]:
r = get_subfrase(doc[5], doc, [])

[w[1] for w in sorted(r, key=lambda x: x[0])]

['para', 'o', 'controle', 'de', 'hibernação']

Parsing in English

In [34]:
import spacy
import spacy_transformers

spacy.prefer_gpu()
nlp = spacy.load('en_core_web_trf')

In [37]:
doc = nlp('Hal, switch to manual hibernation control.')

In [38]:
import tabulate

# One row per token: index, lemma, POS tag, morphology, dependency label, head.
data = []
for token in doc:
    data.append((token.i, token.lemma_, token.pos_, token.morph, token.dep_, token.head))

header = ['Idx', 'Lemma', 'Part of Speech', 'Morphology', 'Dependency', 'Governor']
print(tabulate.tabulate(data, header))

  Idx  Lemma        Part of Speech    Morphology      Dependency    Governor
-----  -----------  ----------------  --------------  ------------  ----------
    0  Hal          PROPN             Number=Sing     npadvmod      switch
    1  ,            PUNCT             PunctType=Comm  punct         switch
    2  switch       VERB              VerbForm=Inf    ROOT          switch
    3  to           ADP                               prep          switch
    4  manual       ADJ               Degree=Pos      amod          control
    5  hibernation  NOUN              Number=Sing     compound      control
    6  control      NOUN              Number=Sing     pobj          to
    7  .            PUNCT             PunctType=Peri  punct         switch
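
The substructure search shown earlier for Portuguese applies to this English parse as well. As a small sketch that runs without spaCy, the heads in the last column of the table above are written out below as token indices (the root points to itself, as in the table), and the subtree is collected with an explicit stack instead of recursion. `children_of` and `subtree` are hypothetical helper names; on a real `Doc`, spaCy also exposes `Token.subtree` for the same purpose.

```python
# Words and head indices transcribed from the table above.
words = ['Hal', ',', 'switch', 'to', 'manual', 'hibernation', 'control', '.']
heads = [2, 2, 2, 2, 6, 6, 3, 2]  # heads[i] is the index of the head of token i


def children_of(heads):
    """Invert the head list into a map from token index to child indices."""
    children = {i: [] for i in range(len(heads))}
    for i, head in enumerate(heads):
        if head != i:  # the root is its own head and is nobody's child
            children[head].append(i)
    return children


def subtree(root, heads):
    """Return the sorted indices of root and all of its descendants."""
    children = children_of(heads)
    stack, found = [root], []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children[node])
    return sorted(found)


# Subtree rooted at "to" (index 3): the whole prepositional phrase.
pp = [words[i] for i in subtree(3, heads)]
print(pp)
```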


**Stanza**

In [22]:
!pip3 install stanza

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stanza
  Downloading stanza-1.5.0-py3-none-any.whl (802 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.2.0.tar.gz (240 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.2.0-py3-none-any.whl size=234926 sha256=2045dc3482c228680a9210db2d82c6d8a1d3b0998cf46699197268a91c21dea3
  Stored in directory: /root/.cache/pip/wheels/9a/b8/0f/f580817231cbf59f6ade9fd132ff60ada1de9f7dc85521f857
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully instal

Parsing in Portuguese

In [23]:
import stanza 
stanza.download('pt')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Downloading default packages for language: pt (Portuguese) ...


Downloading https://huggingface.co/stanfordnlp/stanza-pt/resolve/v1.5.0/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


In [24]:
nlp = stanza.Pipeline('pt')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Loading these models for language: pt (Portuguese):
| Processor    | Package |
--------------------------
| tokenize     | bosque  |
| mwt          | bosque  |
| pos          | bosque  |
| lemma        | bosque  |
| constituency | cintil  |
| depparse     | bosque  |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


In [25]:
doc = nlp('Hal, passe para o controle de hibernação')

In [27]:
import tabulate

data = []

for snt in doc.sentences:
    for word in snt.words:
        # Stanza head ids are 1-based; 0 marks the root of the sentence.
        head = snt.words[word.head - 1].text if word.head > 0 else 'root'
        data.append((word.id, word.text, word.upos, word.feats, word.deprel, head))

header = ['Idx', 'Token', 'Part of Speech', 'Morphology', 'Dependency', 'Governor']
print(tabulate.tabulate(data, header))

  Idx  Token       Part of Speech    Morphology                                             Dependency    Governor
-----  ----------  ----------------  -----------------------------------------------------  ------------  ----------
    1  Hal         PROPN             Gender=Masc|Number=Sing                                nsubj         passe
    2  ,           PUNCT                                                                    punct         Hal
    3  passe       VERB              Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  root          root
    4  para        ADP                                                                      case          controle
    5  o           DET               Definite=Def|Gender=Masc|Number=Sing|PronType=Art      det           controle
    6  controle    NOUN              Gender=Masc|Number=Sing                                obj           passe
    7  de          ADP

Parsing in English

In [28]:
import stanza 
stanza.download('en')
nlp = stanza.Pipeline('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| constituency | wsj       |
| depparse     | combined  |
| sentiment    | sstplus   |
| ner          | ontonotes |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


In [30]:
doc = nlp('Hal, switch to manual hibernation control')

data = []

for snt in doc.sentences:
    for word in snt.words:
        # Stanza head ids are 1-based; 0 marks the root of the sentence.
        head = snt.words[word.head - 1].text if word.head > 0 else 'root'
        data.append((word.id, word.text, word.upos, word.feats, word.deprel, head))

header = ['Idx', 'Token', 'Part of Speech', 'Morphology', 'Dependency', 'Governor']
print(tabulate.tabulate(data, header))

  Idx  Token        Part of Speech    Morphology    Dependency    Governor
-----  -----------  ----------------  ------------  ------------  ----------
    1  Hal          PROPN             Number=Sing   nsubj         switch
    2  ,            PUNCT                           punct         switch
    3  switch       VERB              VerbForm=Fin  root          root
    4  to           ADP                             case          control
    5  manual       ADJ               Degree=Pos    amod          control
    6  hibernation  NOUN              Number=Sing   compound      control
    7  control      NOUN              Number=Sing   obl           switch
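
The spaCy (`en_core_web_trf`) and Stanza analyses of this sentence differ in a few places. The sketch below transcribes the dependency label and governor of each token from the two English tables in this notebook and lists the tokens on which the parsers disagree. The final period is left out because the sentence passed to Stanza had none. `normalize` is a hypothetical helper that only reconciles how the two tables display the root. Part of the disagreement is presumably due to the two models following different label schemes, so a differing label does not by itself mean that one parser is wrong.

```python
# (token, dependency label, governor) transcribed from the spaCy table.
spacy_rows = [
    ('Hal', 'npadvmod', 'switch'),
    (',', 'punct', 'switch'),
    ('switch', 'ROOT', 'switch'),
    ('to', 'prep', 'switch'),
    ('manual', 'amod', 'control'),
    ('hibernation', 'compound', 'control'),
    ('control', 'pobj', 'to'),
]

# The same columns transcribed from the Stanza table.
stanza_rows = [
    ('Hal', 'nsubj', 'switch'),
    (',', 'punct', 'switch'),
    ('switch', 'root', 'root'),
    ('to', 'case', 'control'),
    ('manual', 'amod', 'control'),
    ('hibernation', 'compound', 'control'),
    ('control', 'obl', 'switch'),
]


def normalize(dep, head):
    """Lower-case the label and show the root's governor as 'root' in both tables."""
    dep = dep.lower()
    if dep == 'root':
        head = 'root'
    return dep, head


differences = []
for s_row, t_row in zip(spacy_rows, stanza_rows):
    token = s_row[0]
    s_analysis = normalize(s_row[1], s_row[2])
    t_analysis = normalize(t_row[1], t_row[2])
    if s_analysis != t_analysis:
        differences.append((token, s_analysis, t_analysis))

for token, s_analysis, t_analysis in differences:
    print(f'{token}: spaCy={s_analysis} Stanza={t_analysis}')
```

In these tables the attachment of the preposition is the structural difference: spaCy makes `to` the governor of `control`, while Stanza attaches `to` to `control` and `control` directly to the verb.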
