<a href="https://colab.research.google.com/github/krakowiakpawel9/ml_course/blob/master/spc/01_intro_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* @author: krakowiakpawel9@gmail.com  
* @site: e-smartdata.org

### spaCy
Library website: [https://spacy.io/](https://spacy.io/)  

A fundamental library for natural language processing in Python.

To install spaCy, use the command below:
```
!pip install spacy
```
To upgrade to the latest version, use the command below:
```
!pip install --upgrade spacy
```
This course is based on version `2.1.9`.

### Table of contents:
1. [Importing libraries](#0)
2. [The Vocab class](#1)
3. [The StringStore class](#2)
4. [The Language class](#3)
5. [The Doc class](#4)
6. [The Token class](#5)
7. [The Span class](#6)



### <a name='0'></a> Importing libraries

In [59]:
import numpy as np
import spacy

np.set_printoptions(precision=6, linewidth=200)
spacy.__version__

'2.1.9'

### <a name='1'></a> The Vocab class
Vocab - creates a vocabulary for a given language.

In [0]:
from spacy.vocab import Vocab

Vocab?

In [3]:
vocab = Vocab(strings=['hello', 'world'])
vocab

<spacy.vocab.Vocab at 0x7f7e21c53f48>

In [4]:
len(vocab)

2

In [5]:
vocab.strings

<spacy.strings.StringStore at 0x7f7e71e90268>

### <a name='2'></a> The StringStore class
StringStore - looks up strings by their 64-bit hashes, and maps strings to those hashes.

In [0]:
from spacy.strings import StringStore

StringStore?

In [7]:
stringstore = StringStore(['python', 'java'])
stringstore

<spacy.strings.StringStore at 0x7f7e2027f730>

In [8]:
len(stringstore)

2

In [9]:
stringstore['python']

17956708691072489762

In [10]:
stringstore['java']

1049729868293614729

In [11]:
'scala' in stringstore

False

In [12]:
'java' in stringstore

True

In [13]:
for string in stringstore:
    print(f'{type(string)}: {string}')

<class 'str'>: python
<class 'str'>: java


In [14]:
stringstore.add('scala')

13115865354982139976

In [15]:
len(stringstore)

3

In [16]:
stringstore[13115865354982139976]

'scala'

Hash function

In [17]:
from spacy.strings import hash_string

hash_string('scala')

13115865354982139976

In [18]:
hash_string('apple')

8566208034543834098
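
The round trip shown above (a string goes in, a 64-bit integer comes out, and the integer leads back to the string) can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_hash` and `ToyStringStore` are hypothetical names, and BLAKE2b from `hashlib` stands in for spaCy's own hash function, so the integers it produces will not match the values printed in this notebook.

```python
import hashlib


def toy_hash(string):
    """Map a string to a 64-bit integer (BLAKE2b stand-in, NOT spaCy's hash)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Hypothetical two-way lookup between strings and their 64-bit hashes."""

    def __init__(self, strings=()):
        self._strings = {}  # hash -> string, kept in insertion order
        for string in strings:
            self.add(string)

    def add(self, string):
        # Store the string under its hash and return the hash.
        key = toy_hash(string)
        self._strings[key] = string
        return key

    def __getitem__(self, item):
        # A string is hashed; an integer is looked up and yields the string.
        if isinstance(item, str):
            return toy_hash(item)
        return self._strings[item]

    def __contains__(self, string):
        return toy_hash(string) in self._strings

    def __len__(self):
        return len(self._strings)

    def __iter__(self):
        return iter(self._strings.values())


store = ToyStringStore(["python", "java"])
key = store.add("scala")
print(len(store))                # 3
print(store[key])                # scala
print(key == toy_hash("scala"))  # True
print("kotlin" in store)         # False
```

Because the hash depends only on the string, the key returned by `add` and the value returned by the standalone hash function agree, which is the same behaviour the cells above show for `stringstore.add('scala')` and `hash_string('scala')`.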

### <a name='3'></a> The Language class

In [0]:
from spacy.language import Language

Language?

In [20]:
nlp = Language(Vocab())
nlp

<spacy.language.Language at 0x7f7e8f763898>

In [0]:
from spacy.lang.en import English

English?

In [22]:
nlp = English()
nlp

<spacy.lang.en.English at 0x7f7e1f2a7ac8>

In [23]:
nlp('Python is becoming more and more popular.')

Python is becoming more and more popular.

In [24]:
doc = nlp('Python is becoming more and more popular.')
doc

Python is becoming more and more popular.

In [25]:
nlp.vocab

<spacy.vocab.Vocab at 0x7f7e1dc59480>

In [26]:
nlp.lang

'en'

In [27]:
nlp.meta

{'author': '',
 'description': '',
 'email': '',
 'lang': 'en',
 'license': '',
 'name': 'model',
 'pipeline': [],
 'spacy_version': '>=2.1.9',
 'url': '',
 'vectors': {'keys': 0, 'name': None, 'vectors': 0, 'width': 0},
 'version': '0.0.0'}

### <a name='4'></a> The Doc class
A container for linguistic annotations. A sequence of Token objects.

In [28]:
type(doc)

spacy.tokens.doc.Doc

In [29]:
doc = nlp('Sample Doc object')
doc

Sample Doc object

In [0]:
from spacy.tokens import Doc

Doc?

In [31]:
doc = Doc(nlp.vocab, words=['Hello', 'world', 'in', 'spaCy', '!'], spaces=[True, True, True, False, False])
doc

Hello world in spaCy!
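
Each entry in `spaces` says whether the corresponding word is followed by a single space, which is how the original text is recovered from the word list. A minimal sketch of that rule, assuming nothing beyond the two lists used above; `rebuild_text` is a hypothetical helper and not part of spaCy:

```python
def rebuild_text(words, spaces):
    """Join words, appending one space after each word whose flag is True."""
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )


words = ["Hello", "world", "in", "spaCy", "!"]
spaces = [True, True, True, False, False]

text = rebuild_text(words, spaces)
print(text)  # Hello world in spaCy!
```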

In [32]:
doc[0]

Hello

In [33]:
doc[-1]

!

In [34]:
type(doc[0])

spacy.tokens.token.Token

In [35]:
len(doc)

5

In [0]:
doc.lang_

In [36]:
for token in doc:
    print(f'type: {type(token)}: {token}')

type: <class 'spacy.tokens.token.Token'>: Hello
type: <class 'spacy.tokens.token.Token'>: world
type: <class 'spacy.tokens.token.Token'>: in
type: <class 'spacy.tokens.token.Token'>: spaCy
type: <class 'spacy.tokens.token.Token'>: !


In [37]:
doc[:2]

Hello world

In [38]:
type(doc[:2])

spacy.tokens.span.Span

Similarity

In [39]:
!python -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [0]:
nlp = spacy.load('en_core_web_md')

In [47]:
doc1 = nlp('I like cars')
doc2 = nlp('I like bikes')
doc3 = nlp('She hates horror movies')

doc1.similarity(doc2)

0.8884695422092167

In [48]:
doc1.similarity(doc3)

0.5628471172779665
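
With a model that ships word vectors, the similarity score is, by default, the cosine similarity between the two documents' vectors, where a document vector is the average of its token vectors. The sketch below reproduces that calculation in plain Python. The three-dimensional vectors are made up for illustration (the vectors in `en_core_web_md` have 300 dimensions), and `mean_vector` and `cosine_similarity` are hypothetical helper names.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, NOT taken from any spaCy model.
toy_vectors = {
    "I": [0.9, 0.1, 0.0],
    "like": [0.1, 0.9, 0.1],
    "cars": [0.0, 0.2, 0.9],
    "bikes": [0.1, 0.3, 0.8],
}

doc_a = mean_vector([toy_vectors[word] for word in ["I", "like", "cars"]])
doc_b = mean_vector([toy_vectors[word] for word in ["I", "like", "bikes"]])

similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 3))
```

Since two of the three words are shared, the averaged vectors point in almost the same direction, which is why sentences such as 'I like cars' and 'I like bikes' score high.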

Export to JSON

In [51]:
doc_json = doc.to_json()
doc_json

{'text': 'Hello world in spaCy!',
 'tokens': [{'end': 5, 'id': 0, 'start': 0},
  {'end': 11, 'id': 1, 'start': 6},
  {'end': 14, 'id': 2, 'start': 12},
  {'end': 20, 'id': 3, 'start': 15},
  {'end': 21, 'id': 4, 'start': 20}]}

In [55]:
import json

print(json.dumps(doc_json, indent=4, sort_keys=True))

{
    "text": "Hello world in spaCy!",
    "tokens": [
        {
            "end": 5,
            "id": 0,
            "start": 0
        },
        {
            "end": 11,
            "id": 1,
            "start": 6
        },
        {
            "end": 14,
            "id": 2,
            "start": 12
        },
        {
            "end": 20,
            "id": 3,
            "start": 15
        },
        {
            "end": 21,
            "id": 4,
            "start": 20
        }
    ]
}
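
The `start` and `end` values are character offsets into `text`, with `end` exclusive. A short sketch of how they can be used, working on a copy of the dictionary printed above; the variable names are only illustrative:

```python
doc_json = {
    "text": "Hello world in spaCy!",
    "tokens": [
        {"id": 0, "start": 0, "end": 5},
        {"id": 1, "start": 6, "end": 11},
        {"id": 2, "start": 12, "end": 14},
        {"id": 3, "start": 15, "end": 20},
        {"id": 4, "start": 20, "end": 21},
    ],
}

text = doc_json["text"]

# Slice each token's text out of the full string.
token_texts = [text[tok["start"]:tok["end"]] for tok in doc_json["tokens"]]
print(token_texts)  # ['Hello', 'world', 'in', 'spaCy', '!']

# A token has trailing whitespace if the character right after it is a space.
trailing_space = [text[tok["end"]:tok["end"] + 1] == " " for tok in doc_json["tokens"]]
print(trailing_space)  # [True, True, True, False, False]
```

The two lists recovered here are the same `words` and `spaces` that were passed to the `Doc` constructor earlier.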


In [65]:
from spacy.attrs import LOWER, IS_ALPHA

doc.to_array([LOWER, IS_ALPHA])

array([[ 5983625672228268878,                    1],
       [ 1703489418272052182,                    1],
       [ 3002984154512732771,                    1],
       [10639093010105930009,                    1],
       [17494803046312582752,                    0]], dtype=uint64)
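
Each row of the array corresponds to one token: the first column holds the hash of the lowercased token text (`LOWER`), the second a 0/1 flag (`IS_ALPHA`). The plain-Python sketch below shows the same information in readable form. It assumes that `str.lower` and `str.isalpha` are close enough stand-ins for the two attributes, and it keeps the lowercase strings instead of hashing them.

```python
words = ["Hello", "world", "in", "spaCy", "!"]

# One (lowercase text, alphabetic flag) pair per token, mirroring the array rows.
rows = [(word.lower(), int(word.isalpha())) for word in words]

for lower, is_alpha in rows:
    print(f"{lower.ljust(6)}: {is_alpha}")
```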

NER - Named Entity Recognition

In [76]:
doc = nlp('She is going to fly to London next week')
doc.ents

(London, next week)

In [77]:
for entity in doc.ents:
    print(f'{entity.text.ljust(13)}: {entity.label_}')

London       : GPE
next week    : DATE


In [78]:
spacy.explain('GPE')

'Countries, cities, states'

In [79]:
spacy.explain('DATE')

'Absolute or relative dates or periods'

In [80]:
for chunk in doc.noun_chunks:
    print(chunk)

She
London


In [83]:
doc = nlp('Python is awesome. It is really good.')

for sentence in doc.sents:
    print(f'type: {type(sentence)}: {sentence}')

type: <class 'spacy.tokens.span.Span'>: Python is awesome.
type: <class 'spacy.tokens.span.Span'>: It is really good.


### <a name='5'></a> The Token class

A single token (a word, punctuation mark, whitespace, etc.)

In [122]:
doc = nlp('NLP Bootcamp in Python. 2020')

for token in doc:
    print(f'type: {type(token)}: {token}')

type: <class 'spacy.tokens.token.Token'>: NLP
type: <class 'spacy.tokens.token.Token'>: Bootcamp
type: <class 'spacy.tokens.token.Token'>: in
type: <class 'spacy.tokens.token.Token'>: Python
type: <class 'spacy.tokens.token.Token'>: .
type: <class 'spacy.tokens.token.Token'>: 2020


In [123]:
token1 = doc[0]
token2 = doc[1]
token1

NLP

In [124]:
len(token1)

3

In [125]:
token1.similarity(token2)

0.18886003

In [126]:
token1.nbor()

Bootcamp

In [127]:
token1.nbor(2)

in
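
`nbor` is relative indexing: the neighbour at offset `i` is the token whose position in the document is the current position plus `i`. A minimal sketch of that idea; `ToyToken` is a hypothetical class that only stores the parent list of token texts and its own index:

```python
class ToyToken:
    """Hypothetical token: a position inside a shared list of token texts."""

    def __init__(self, tokens, i):
        self.tokens = tokens  # parent sequence of token texts
        self.i = i            # this token's index in the parent sequence

    @property
    def text(self):
        return self.tokens[self.i]

    def nbor(self, offset=1):
        # Move `offset` positions from this token; reject out-of-range indices.
        index = self.i + offset
        if not 0 <= index < len(self.tokens):
            raise IndexError(f"no token at position {index}")
        return ToyToken(self.tokens, index)


tokens = ["NLP", "Bootcamp", "in", "Python", ".", "2020"]
first = ToyToken(tokens, 0)

print(first.nbor().text)   # Bootcamp
print(first.nbor(2).text)  # in
```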

In [128]:
token1.vector

array([-0.008069,  0.17201 , -0.36143 , -0.52395 , -0.15279 , -0.34861 , -0.036767, -0.53732 ,  0.44094 , -0.36079 , -0.4784  , -0.056812, -0.24766 ,  0.10597 ,  0.41365 , -1.088   ,  0.19286 ,
        0.28928 , -0.087171,  0.77045 ,  0.25926 ,  0.060863, -0.12351 , -0.38785 , -0.42876 ,  0.48097 ,  0.48476 , -0.42188 , -0.18591 ,  0.067789,  0.51135 , -0.63264 ,  0.072284,  0.91496 ,
        0.3255  ,  0.99644 , -0.24083 ,  0.11028 , -0.54889 ,  0.36591 , -0.09808 , -0.3543  , -0.23814 ,  0.49636 ,  0.10786 ,  0.031344, -0.27557 ,  0.28765 , -0.63515 , -0.20711 , -0.44874 ,
       -0.50543 ,  0.10289 ,  0.27115 , -0.074156,  0.57831 , -0.25995 ,  0.18628 , -0.22099 , -0.46459 , -0.11055 ,  0.39854 ,  0.15971 , -0.8858  ,  0.29029 , -0.24326 , -0.047822, -0.097751,
       -0.49836 ,  0.24001 , -0.22119 , -0.15564 ,  0.42348 , -0.036272,  1.1548  ,  0.31605 ,  0.088542, -0.031927, -0.063066, -0.048259,  0.020346,  0.16526 ,  0.48541 ,  0.25581 , -0.51612 ,
       -0.2273  ,  1.0575  , -

In [129]:
token1.vector.shape

(300,)

In [130]:
# attributes
token1.doc

NLP Bootcamp in Python. 2020

In [131]:
token1.sent

NLP Bootcamp in Python.

In [133]:
token2.text

'Bootcamp'

In [134]:
token2.pos_

'PROPN'

In [135]:
spacy.explain('PROPN')

'proper noun'

### <a name='6'></a> The Span class
A slice of a Doc object.

In [139]:
span = doc[1:4]
span

Bootcamp in Python

In [140]:
type(span)

spacy.tokens.span.Span

In [141]:
for token in span:
    print(f'type: {type(token)}: {token}')

type: <class 'spacy.tokens.token.Token'>: Bootcamp
type: <class 'spacy.tokens.token.Token'>: in
type: <class 'spacy.tokens.token.Token'>: Python
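
A span can be thought of as a view over its document: it records a start and an end token index (end exclusive) and reads tokens from the parent on demand. The sketch below illustrates this with a plain list of token texts; `ToySpan` is a hypothetical class, not spaCy's implementation.

```python
class ToySpan:
    """Hypothetical view over tokens[start:end] that does not copy the tokens."""

    def __init__(self, tokens, start, end):
        self.tokens = tokens  # parent sequence of token texts
        self.start = start    # index of the first token in the span
        self.end = end        # index one past the last token in the span

    def __len__(self):
        return self.end - self.start

    def __iter__(self):
        # Read tokens from the parent sequence one by one.
        return (self.tokens[i] for i in range(self.start, self.end))


tokens = ["NLP", "Bootcamp", "in", "Python", ".", "2020"]
span = ToySpan(tokens, 1, 4)

print(len(span))   # 3
print(list(span))  # ['Bootcamp', 'in', 'Python']
```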
