
* @author: krakowiakpawel9@gmail.com  
* @site: e-smartdata.org

### spaCy
Library website: [https://spacy.io/](https://spacy.io/)  

A core library for natural language processing in Python.

To install spaCy, run the command below:
```
!pip install spacy
```
To upgrade to the latest version, run the command below:
```
!pip install --upgrade spacy
```
This course is based on version `2.1.9`.

### Table of contents:
1. [Importing libraries](#0)
2. [Part-of-speech tagging](#1)



### <a name='0'></a> Importing libraries

In [0]:
import spacy

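# The 'en' shortcut link only exists in spaCy 2.x; the full model name also works in 3.x.
nlp = spacy.load('en_core_web_sm')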

### <a name='1'></a> Part-of-speech tagging

In [0]:
from spacy import displacy

doc = nlp('She likes red apples.')

displacy.render(doc, style='dep', jupyter=True)

In [0]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})

In [0]:
displacy.render(doc, style='dep', jupyter=True, options={'color': '#6b7bc9', 'bg': '#d0facd'})

In [0]:
displacy.render(doc, style='dep', jupyter=True, options={'color': '#ffffff', 'bg': '#3692f5'})


In [0]:
displacy.render(doc, style='dep', jupyter=True, options={'compact': True})

In [0]:
for name in ['PRON', 'VERB', 'ADJ', 'NOUN']:
    print(f'{name.ljust(6)}:{spacy.explain(name)}')

PRON  :pronoun
VERB  :verb
ADJ   :adjective
NOUN  :noun


In [0]:
for name in ['nsubj', 'dobj', 'amod']:
    print(f'{name.ljust(6)}:{spacy.explain(name)}')

nsubj :nominal subject
dobj  :direct object
amod  :adjectival modifier


In [0]:
for token in doc:
    print(f'{token.text.ljust(7)}:{token.dep_.ljust(6)}:{token.head.text.ljust(7)}:{list(token.children)}')

# token | dependency | head | children    

She    :nsubj :likes  :[]
likes  :ROOT  :likes  :[She, apples, .]
red    :amod  :apples :[]
apples :dobj  :likes  :[red]
.      :punct :likes  :[]
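
The `children` column above is simply the inverse of the `head` column. A minimal sketch of that relationship, using the rows printed above as hard-coded data rather than a live spaCy `Doc` (the names `rows` and `children` are illustrative, and keying by token text assumes every word in the sentence is unique):

```python
from collections import defaultdict

# (token, dependency, head) rows copied from the output above
rows = [
    ("She", "nsubj", "likes"),
    ("likes", "ROOT", "likes"),
    ("red", "amod", "apples"),
    ("apples", "dobj", "likes"),
    (".", "punct", "likes"),
]

# Invert the head relation: every non-root token is a child of its head.
children = defaultdict(list)
for text, dep, head in rows:
    if dep != "ROOT":  # the root is its own head, so it is nobody's child
        children[head].append(text)

for text, dep, head in rows:
    print(f"{text.ljust(7)}:{children.get(text, [])}")
```

In a real `Doc`, tokens would be keyed by `token.i` instead of their text, since the same word can occur more than once.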


In [0]:
doc = nlp('Other countries with large car-making industries that have been hit by the virus include South Korea '
    'which is the worst-affected nation after China, and Japan.')

In [0]:
# token.text - the token text
# token.lemma_ - the base form of the word
# token.pos_ - coarse-grained part of speech
# token.tag_ - fine-grained part of speech
# token.dep_ - syntactic dependency
# token.shape_ - word shape (capitalization, punctuation, digits)
# token.is_alpha - whether the token consists of alphabetic characters
# token.is_stop - whether the token is a stop word
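
To make the `shape_` attribute more concrete, here is a rough pure-Python approximation. It is a sketch, not spaCy's implementation: the function name `word_shape` and the `max_run` parameter are hypothetical. It assumes lowercase letters map to `x`, uppercase letters to `X`, digits to `d`, other characters stay unchanged, and runs of the same shape character are cut off after four.

```python
def word_shape(text: str, max_run: int = 4) -> str:
    """Approximate a word-shape string such as 'Xxxxx' for 'Other'."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and other symbols are kept as-is
        if mapped == last:
            run += 1
        else:
            run = 1
            last = mapped
        if run <= max_run:  # truncate long runs of the same shape character
            shape.append(mapped)
    return "".join(shape)


for word in ["Other", "countries", "car", "-", "by"]:
    print(f"{word.ljust(10)}:{word_shape(word)}")
```

Under these assumptions, long words collapse to a short shape, so `countries` and `industries` share the shape `xxxx`.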

In [0]:
for token in doc:
    print(f'{token.text.ljust(10)}:{token.lemma_.ljust(9)}:{token.pos_.ljust(6)}:{token.tag_.ljust(5)}:'
    f'{token.dep_.ljust(10)}:{token.shape_.ljust(6)}:{str(token.is_alpha).ljust(6)}:{token.is_stop}')

# token | lemma | part-of-speech | tag | dependency | shape | is_alpha | is_stop    

Other     :other    :ADJ   :JJ   :amod      :Xxxxx :True  :True
countries :country  :NOUN  :NNS  :nsubj     :xxxx  :True  :False
with      :with     :ADP   :IN   :prep      :xxxx  :True  :True
large     :large    :ADJ   :JJ   :amod      :xxxx  :True  :False
car       :car      :NOUN  :NN   :npadvmod  :xxx   :True  :False
-         :-        :PUNCT :HYPH :punct     :-     :False :False
making    :make     :VERB  :VBG  :amod      :xxxx  :True  :False
industries:industry :NOUN  :NNS  :pobj      :xxxx  :True  :False
that      :that     :DET   :WDT  :nsubjpass :xxxx  :True  :True
have      :have     :VERB  :VBP  :aux       :xxxx  :True  :True
been      :be       :VERB  :VBN  :auxpass   :xxxx  :True  :True
hit       :hit      :VERB  :VBN  :relcl     :xxx   :True  :False
by        :by       :ADP   :IN   :agent     :xx    :True  :True
the       :the      :DET   :DT   :det       :xxx   :True  :True
virus     :virus    :NOUN  :NN   :pobj      :xxxx  :True  :False
include   :include  :VERB  :VBP 

In [0]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 140, 'color': '#6b7bc9', 'bg': '#d0facd'})

Noun chunks

In [0]:
# chunk.text - the chunk text
# chunk.root.text - the root noun of the chunk
# chunk.root.dep_ - the dependency relation of the root

In [0]:
for chunk in doc.noun_chunks:
    print(f'{chunk.text.ljust(28)}:{chunk.root.text.ljust(11)}:{chunk.root.dep_.ljust(6)}')

# chunk | root text | dependency

Other countries             :countries  :nsubj 
large car-making industries :industries :pobj  
the virus                   :virus      :pobj  
South Korea                 :Korea      :dobj  
the worst-affected nation   :nation     :attr  
China                       :China      :pobj  
Japan                       :Japan      :conj  


In [0]:
spacy.explain('pobj')

'object of preposition'

countries
which


In [0]:
%%time
for token in doc:
    if token.pos_ == 'VERB':
        print(token)

making
have
been
hit
include
is
affected
CPU times: user 2.66 ms, sys: 135 µs, total: 2.79 ms
Wall time: 3.74 ms


In [0]:
%%time
from spacy.symbols import VERB

for token in doc:
    if token.pos == VERB:
        print(token)

making
have
been
hit
include
is
affected
CPU times: user 1.88 ms, sys: 1.77 ms, total: 3.65 ms
Wall time: 5.77 ms
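
The two `%%time` cells above each measure a single run that also includes printing, so the figures are dominated by noise and say little about string versus integer comparison. A sketch of a steadier measurement with the standard-library `timeit` module follows. The token tuples, the ID values, and the helper `best_time` are hypothetical stand-ins rather than real spaCy objects. To compare `token.pos_` with `token.pos`, pass functions that iterate over `doc` instead.

```python
import timeit

# Hypothetical stand-in for a parsed document: (text, pos_string, pos_id) tuples.
VERB_ID = 100  # arbitrary placeholder, not spaCy's real symbol value
NOUN_ID = 92   # arbitrary placeholder as well
tokens = [
    ("making", "VERB", VERB_ID),
    ("virus", "NOUN", NOUN_ID),
    ("hit", "VERB", VERB_ID),
] * 10


def verbs_by_string():
    """Select verbs by comparing the string label."""
    return [t for t in tokens if t[1] == "VERB"]


def verbs_by_id():
    """Select verbs by comparing the integer ID."""
    return [t for t in tokens if t[2] == VERB_ID]


def best_time(func, number=1000, repeat=5):
    """Return the best average time per call, in seconds."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


for func in (verbs_by_string, verbs_by_id):
    print(f"{func.__name__.ljust(16)}: {best_time(func) * 1e6:.2f} µs per call")
```

Taking the minimum over several repeats, with no printing inside the timed function, reduces the influence of one-off delays on the result.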
