## spaCy

+ A library for advanced NLP
+ Supports many languages, including English, Chinese, German, French, Italian, Polish and Spanish; models for new languages keep being added
+ About spaCy: https://spacy.io/usage

+ Installation for English:
```
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```
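
Before loading a model it can help to check that the download step actually succeeded. The sketch below is only an illustration: `load_model` and `MODEL_NAME` are hypothetical names introduced here, while `spacy.util.is_package` and `spacy.load` are regular spaCy functions.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"  # the model installed by the commands above


def load_model(name: str):
    """Load an installed spaCy model, or fail with an actionable message."""
    # is_package() reports whether a Python package with this name is installed
    if not is_package(name):
        raise OSError(
            f"Model '{name}' is not installed; run: python -m spacy download {name}"
        )
    return spacy.load(name)


print("spaCy version:", spacy.__version__)
print("model installed:", is_package(MODEL_NAME))
```

Calling `load_model(MODEL_NAME)` then either returns the pipeline or tells you which command to run.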

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [4]:
d = nlp("Colorless green ideas sleep furiously")
for token in d:
    print(token.text, token.pos_)

Colorless ADJ
green ADJ
ideas NOUN
sleep VERB
furiously ADV


### Features

#### Part-of-speech tagging

https://spacy.io/usage/linguistic-features

In [5]:
d = nlp("Colorless green ideas sleep furiously")
for token in d:
    print(token.text, token.pos_)

Colorless ADJ
green ADJ
ideas NOUN
sleep VERB
furiously ADV
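
If a label is unfamiliar, `spacy.explain` returns a short description from spaCy's built-in glossary. A minimal sketch using the four tags printed above (no model is needed for this lookup):

```python
import spacy

# Coarse-grained POS tags that appear in the output above
labels = ["ADJ", "NOUN", "VERB", "ADV"]

for label in labels:
    # spacy.explain() looks the label up in spaCy's glossary
    print(label, "->", spacy.explain(label))
```

The same call also works for dependency labels such as `nsubj` or `amod`, which appear later in this section.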


#### Lemmatization, tokenization, morphological analysis

In [6]:
d = nlp("Colorless green ideas sleep furiously")
for token in d:
    print(token.text, token.lemma_)

Colorless colorless
green green
ideas idea
sleep sleep
furiously furiously


In [31]:
d = nlp("Colorless green ideas sleep furiously")
print(d[2].morph)  # morphological features of the third token, "ideas"
print(d[2].pos_)

Number=Plur
NOUN


In [7]:
d = nlp("To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer")
for token in d:
    print(token.text, token.lemma_)

To to
be be
, ,
or or
not not
to to
be be
, ,
that that
is be
the the
question question
: :
Whether whether
' '
tis tis
nobler nobler
in in
the the
mind mind
to to
suffer suffer
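
Tokenization itself does not need a trained model. As a sketch, the example below uses `spacy.blank("en")`, which creates a pipeline containing only the English tokenizer; the variable name `nlp_blank` is an arbitrary choice made here so that it does not overwrite `nlp`.

```python
import spacy

# A blank pipeline: English tokenizer rules only, no tagger, parser or NER
nlp_blank = spacy.blank("en")

text = "I'm glad I'm a man, and so is Lola."
doc = nlp_blank(text)

tokens = [token.text for token in doc]
print(tokens)

# Tokenization is non-destructive: each token remembers its trailing
# whitespace, so the original string can be rebuilt exactly
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Note how the contraction "I'm" is split into two tokens, much like "'tis" in the output above.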


#### Syntax: dependency tree

In [8]:
d = nlp("Colorless green ideas sleep furiously")
for token in d:
    print(token.text, token.lemma_, token.pos_, token.dep_)

Colorless colorless ADJ amod
green green ADJ amod
ideas idea NOUN nsubj
sleep sleep VERB ROOT
furiously furiously ADV advmod


#### Navigating the tree: heads and dependents

spaCy uses the terms "head" for the governing token and "child" for a dependent.

In [9]:
for token in d:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Colorless amod ideas NOUN []
green amod ideas NOUN []
ideas nsubj sleep VERB [Colorless, green]
sleep ROOT sleep VERB [ideas, furiously]
furiously advmod sleep VERB []
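
The head/child relations can also be walked recursively to print the whole tree as an indented outline. The sketch below is an illustration, not part of the notebook: to stay independent of a downloaded model it builds the `Doc` by hand (spaCy v3 constructor), copying the heads and labels from the output above, and `tree_lines` is a hypothetical helper name.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Colorless", "green", "ideas", "sleep", "furiously"]
# Absolute index of each token's head; the root points to itself
heads = [2, 2, 3, 3, 3]
deps = ["amod", "amod", "nsubj", "ROOT", "advmod"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)


def tree_lines(token, depth=0):
    """Return the subtree of `token` as indented 'text (dep)' lines, head first."""
    lines = ["  " * depth + f"{token.text} ({token.dep_})"]
    for child in token.children:
        lines.extend(tree_lines(child, depth + 1))
    return lines


# The root is the only token that is its own head
root = next(token for token in doc if token.head == token)
print("\n".join(tree_lines(root)))
```

With a parsed document from `nlp(...)` the same helper can be applied directly to the root token.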


In [10]:
d = nlp("If a farmer owns a donkey, he beats it.")
for token in d:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

If mark owns VERB []
a det farmer NOUN []
farmer nsubj owns VERB [a]
owns advcl beats VERB [If, farmer, donkey]
a det donkey NOUN []
donkey dobj owns VERB [a]
, punct beats VERB []
he nsubj beats VERB []
beats ROOT beats VERB [owns, ,, he, it, .]
it dobj beats VERB []
. punct beats VERB []


Let's find the children of "beats" on the left and on the right:

In [12]:
d = nlp("If a farmer owns a donkey, he beats it.")
print([token.text for token in d[8].lefts]) 
print([token.text for token in d[8].rights])  
print(d[8].n_lefts)  
print(d[8].n_rights) 

['owns', ',', 'he']
['it', '.']
3
2


Now take the first left child of the root and, for every token in its subtree, print the dependency label, the number of left and right children, and the ancestors:

In [13]:
root = [token for token in d if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
            descendant.n_rights,
            [ancestor.text for ancestor in descendant.ancestors])

If mark 0 0 ['owns', 'beats']
a det 0 0 ['farmer', 'owns', 'beats']
farmer nsubj 1 0 ['owns', 'beats']
owns advcl 2 1 ['beats']
a det 0 0 ['donkey', 'owns', 'beats']
donkey dobj 1 0 ['owns', 'beats']


Print the part of speech, the dependency label, and the head:

In [14]:
d = nlp("If a farmer owns a donkey, he beats it.")
for token in d:
    print(token.text, token.pos_, token.dep_, token.head.text)

If SCONJ mark owns
a DET det farmer
farmer NOUN nsubj owns
owns VERB advcl beats
a DET det donkey
donkey NOUN dobj owns
, PUNCT punct beats
he PRON nsubj beats
beats VERB ROOT beats
it PRON dobj beats
. PUNCT punct beats


Visualize the tree:

In [16]:
from spacy import displacy

In [17]:
d = nlp("If a farmer owns a donkey, he beats it.")

displacy.render(d, style='dep')
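
Outside a notebook, `displacy.render` can return the SVG markup as a string instead of displaying it, which makes it easy to save the picture. The following is a sketch under a few assumptions: the `Doc` is built by hand from the parse shown earlier (so no model is required), the output file name is arbitrary, and `collapse_punct` is switched off here only to keep punctuation as separate tokens.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["If", "a", "farmer", "owns", "a", "donkey", ",", "he", "beats", "it", "."]
heads = [3, 2, 3, 8, 5, 3, 8, 8, 8, 8, 8]  # absolute head indices
deps = ["mark", "det", "nsubj", "advcl", "det", "dobj",
        "punct", "nsubj", "ROOT", "dobj", "punct"]
pos = ["SCONJ", "DET", "NOUN", "VERB", "DET", "NOUN",
       "PUNCT", "PRON", "VERB", "PRON", "PUNCT"]
doc = Doc(vocab, words=words, heads=heads, deps=deps, pos=pos)

# jupyter=False forces render() to return the markup instead of displaying it
svg = displacy.render(doc, style="dep", jupyter=False,
                      options={"collapse_punct": False})

# Hypothetical output location: a file in the system temp directory
out_path = Path(tempfile.gettempdir()) / "donkey_parse.svg"
out_path.write_text(svg, encoding="utf-8")
print(out_path)
```

The saved file can be opened in any browser.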


Exercise:

1. Let's test spaCy and refresh our knowledge of syntax.
2. Parse a sentence with a syntactic ambiguity. Which analysis does spaCy propose?
    + John saw the man on the mountain with a telescope.
    + I'm glad I'm a man, and so is Lola.
    + more examples: https://en.wikipedia.org/wiki/Syntactic_ambiguity

3. Try island examples (listed below with their traditional acceptability judgments). Which analysis does spaCy propose?
    + What_i does Sarah believe that Susan thinks that John bought \_i?
    + Complex NPs
      - *Who_i did Mary see the report that was about _i? (cf. Mary saw the report that was about the senator.)
      - ?*Which sportscar_i did the color of _i delight the baseball player? (cf. The color of the sportscar delighted the baseball player.)

    + Complements of manner-of-speaking verbs
       - ??What_i did Mary whisper that Bill liked _i? (cf. Mary whispered that Bill liked wine.)

    + Complex NPs (both noun complements and relative clauses)
        - *The senator who_i Mary saw the report that was about _i bothered many people on the committee.

    + Subject islands
        - ?* The sportscar which_i the color of _i delighted the baseball player…

    + Complements of manner-of-speaking verbs
        - ?The wine which_i Mary whispered that Bill liked _i was a Cabernet.

4. Does spaCy cope with ellipsis?
    + John can play the guitar; Mary can, too.



In [18]:
d = nlp("I'm glad I'm a man, and so is Lola.")

displacy.render(d, style='dep')

In [19]:
d = nlp("Who did Mary see the report that was about?")

displacy.render(d, style='dep')

In [20]:
d = nlp("John can play the guitar; Mary can, too.")

displacy.render(d, style='dep')

#### Named Entity Recognition (NER)

In [21]:
doc = nlp("Twitter permanently suspends President Donald Trump")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Donald Trump 39 51 PERSON


In [22]:
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("Twitter permanently suspends President Donald Trump")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "President" as an entity :(

president_ent = Span(doc, 3, 4, label="PE") # create a Span for the new entity, PE = political entity
doc.ents = list(doc.ents) + [president_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# the new ('President', 29, 38, 'PE') entity is now listed alongside 'Donald Trump'

Before [('Donald Trump', 39, 51, 'PERSON')]
After [('President', 29, 38, 'PE'), ('Donald Trump', 39, 51, 'PERSON')]
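
Instead of creating the `Span` by hand for every document, the same label can be assigned by a rule. The sketch below uses spaCy v3's `entity_ruler` component on a blank English pipeline, so it does not need a trained model and therefore does not find "Donald Trump". The label "PE" is taken from the cell above; `nlp_rules` is a name chosen here for illustration.

```python
import spacy

# Blank pipeline with a single rule-based component
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# A phrase pattern: the exact string "President" gets the label "PE"
ruler.add_patterns([{"label": "PE", "pattern": "President"}])

doc = nlp_rules("Twitter permanently suspends President Donald Trump")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
```

In a trained pipeline the ruler could instead be added next to the statistical `ner` component, for example with `nlp.add_pipe("entity_ruler", before="ner")`; how the two interact on overlapping spans is worth checking on your own data.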


#### What about other languages?

Load the model for German:

In [None]:
! python -m spacy download de_core_news_sm

In [34]:
nlp = spacy.load("de_core_news_sm")
doc = nlp("schöne rote Äpfel auf dem Baum")
print([token.text for token in doc[2].lefts])  # ['schöne', 'rote']
print([token.text for token in doc[2].rights])  # ['auf']

['schöne', 'rote']
['auf']


In [36]:
doc = nlp("schöne rote Äpfel auf dem Baum") 
print(doc[2].morph)  
print(doc[2].pos_) 

Case=Acc|Gender=Masc|Number=Plur
NOUN
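
The `morph` attribute is more than a string: individual features can be queried. The sketch below avoids loading the German model by setting the analysis printed above manually with `Token.set_morph` (spaCy v3), so the feature values are copied from that output rather than predicted here.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["schöne", "rote", "Äpfel", "auf", "dem", "Baum"])

# Assumption: the analysis is copied from the model output shown above
doc[2].set_morph("Case=Acc|Gender=Masc|Number=Plur")

morph = doc[2].morph
print(morph.get("Case"))   # values of one feature, as a list
print(morph.get("Tense"))  # a feature that is absent gives an empty list
print(morph.to_dict())     # all features as a dictionary
```

With a real pipeline the same `get` and `to_dict` calls work on `token.morph` for any token.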


Load the model for Russian:

In [None]:
! python -m spacy download ru_core_news_sm

In [37]:
from spacy.lang.ru.examples import sentences 

nlp = spacy.load("ru_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд
Apple PROPN nsubj
рассматривает VERB ROOT
возможность NOUN obj
покупки NOUN nmod
стартапа NOUN nmod
из ADP case
Соединённого ADJ amod
Королевства PROPN nmod
за ADP case
$ SYM nmod
1 NUM appos
млрд NOUN punct


In [38]:
doc = nlp("Блины бабушка очень уж вкусные печет.") 
print(doc[1].morph)  
print(doc[1].pos_) 

Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing
NOUN


In [27]:
r = nlp("Какой прекрасный пень!")

displacy.render(r, style='dep')

In [28]:
r = nlp("Семантику я люблю, а вот синтаксис не очень.")

displacy.render(r, style='dep')

In [29]:
r = nlp("Блины бабушка очень уж вкусные печет.")

displacy.render(r, style='dep')

In [30]:
r = nlp("Ваня нарисовал чудесную картину.")

displacy.render(r, style='dep')