# NLP with spaCy (https://spacy.io/) - Introduction
1. Language model (installation: https://spacy.io/usage, models: https://spacy.io/usage/models)
2. Processing pipeline
3. Linguistic features
4. Word vectors

In [2]:
import spacy

## Create the nlp object - load a language model

In [3]:
nlp = spacy.load("en_core_web_sm")

## Processing pipelines
https://spacy.io/usage/processing-pipelines

In [36]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
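
The list above names the components that run, in order, on every text passed to `nlp`. As a rough mental model only, a pipeline can be pictured as a list of named functions applied one after another to the same object. The sketch below uses plain Python and a dictionary as a stand-in for a `Doc`; the component names `lowercase` and `count_tokens` and the `toy_doc` structure are hypothetical and are not part of spaCy.

```python
# Toy illustration of a processing pipeline; this is NOT spaCy's implementation.

def lowercase(doc):
    """Hypothetical component: lowercases every token."""
    doc["tokens"] = [t.lower() for t in doc["tokens"]]
    return doc

def count_tokens(doc):
    """Hypothetical component: stores the number of tokens."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc

# Components are stored as (name, callable) pairs and run in this order.
pipeline = [("lowercase", lowercase), ("count_tokens", count_tokens)]

# A dictionary stands in for the Doc object here (an assumption for the sketch).
toy_doc = {"tokens": "Guido van Rossum".split()}
for name, component in pipeline:
    toy_doc = component(toy_doc)

print([name for name, _ in pipeline])
print(toy_doc)
```

Each component receives the output of the previous one, which is why the order of the names in `nlp.pipe_names` matters.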


## Create a variable (string type)

In [37]:
text = 'Guido van Rossum (born 31 January 1956) is a Dutch programmer best known as the creator of the Python programming language, for which he was the "benevolent dictator for life" (BDFL) until he stepped down from the position in July 2018. He remained a member of the Python Steering Council through 2019, and withdrew from nominations for the 2020 election.'

## Print the variable text

In [38]:
print(text)

Guido van Rossum (born 31 January 1956) is a Dutch programmer best known as the creator of the Python programming language, for which he was the "benevolent dictator for life" (BDFL) until he stepped down from the position in July 2018. He remained a member of the Python Steering Council through 2019, and withdrew from nominations for the 2020 election.


## Create the doc object and print it

In [39]:
doc = nlp(text)
print(doc)

Guido van Rossum (born 31 January 1956) is a Dutch programmer best known as the creator of the Python programming language, for which he was the "benevolent dictator for life" (BDFL) until he stepped down from the position in July 2018. He remained a member of the Python Steering Council through 2019, and withdrew from nominations for the 2020 election.


## Check the length of both text and doc

In [40]:
print(len(text))
print(len(doc))

355
70


## Iterate over text: a plain string yields characters, not tokens

In [41]:
for char in text[0:10]:
    print(char)

G
u
i
d
o
 
v
a
n
 


## Print tokens in doc:

In [42]:
for token in doc[0:10]:
    print(token)

Guido
van
Rossum
(
born
31
January
1956
)
is


## Index into the Doc to get a single token:

In [43]:
token = doc[2]
print(token.text) # accessing token as str

Rossum


## Print tokens in text.split()

In [44]:
for token in text.split()[:10]:
    print(token)  # almost the same, but parentheses stay attached to the neighbouring words

Guido
van
Rossum
(born
31
January
1956)
is
a
Dutch
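
To make the difference concrete, the sketch below contrasts whitespace splitting with a punctuation-aware split written in plain Python. The regular expression is only a crude approximation chosen for illustration; spaCy's tokenizer relies on language-specific rules and exceptions and is considerably more careful.

```python
import re

text = "Guido van Rossum (born 31 January 1956) is a Dutch programmer"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# Rough approximation only: runs of word characters, or single punctuation marks.
# This is not how spaCy tokenizes; it just shows why punctuation needs its own token.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens[3])  # (born
print(regex_tokens[3:5])     # ['(', 'born']
```

The first ten items of `regex_tokens` match the tokens spaCy printed for `doc[0:10]` earlier, although the two approaches would likely diverge on harder cases such as abbreviations or contractions.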


## Print sentences in doc object:

In [45]:
for sent in doc.sents:
    print('Separate sentence:')
    print(sent)

Separate sentence:
Guido van Rossum (born 31 January 1956) is a Dutch programmer best known as the creator of the Python programming language, for which he was the "benevolent dictator for life" (BDFL) until he stepped down from the position in July 2018.
Separate sentence:
He remained a member of the Python Steering Council through 2019, and withdrew from nominations for the 2020 election.


## Create a variable which contains the 1st sentence from doc object

In [46]:
sentence1 = doc.sents[0]  # fails: doc.sents is a generator and cannot be indexed

TypeError: 'generator' object is not subscriptable

## Convert doc.sents to a list so it can be indexed

In [47]:
sentence1 = list(doc.sents)[0]
print(sentence1)

Guido van Rossum (born 31 January 1956) is a Dutch programmer best known as the creator of the Python programming language, for which he was the "benevolent dictator for life" (BDFL) until he stepped down from the position in July 2018.


In [48]:
sentence2 = list(doc.sents)[1]
print(sentence2)

He remained a member of the Python Steering Council through 2019, and withdrew from nominations for the 2020 election.
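
Calling `list(doc.sents)` twice, as above, builds the full list each time. The sketch below shows some alternatives using a plain Python generator as a stand-in; `fake_sents` is a hypothetical helper, not a spaCy function, and it only mimics the one-item-at-a-time behaviour that caused the `TypeError`.

```python
from itertools import islice

def fake_sents():
    """Hypothetical stand-in for doc.sents: yields items one at a time."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

# next() pulls only the first item and then stops.
first = next(fake_sents())

# islice() skips ahead to position 1 without building a full list.
second = next(islice(fake_sents(), 1, None))

# list() materialises everything once; the result supports indexing.
all_sents = list(fake_sents())

print(first)
print(second)
print(all_sents[-1])
```

With a real `Doc`, storing `sentences = list(doc.sents)` once and indexing into it would presumably be the simplest option when more than one sentence is needed.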


## Tokens contain more metadata (linguistic features)

In [49]:
token2 = sentence1[2]
print(token2)

Rossum


In [50]:
token2.text

'Rossum'

In [51]:
token2.pos_

'PROPN'

In [52]:
token2.left_edge

Guido

In [53]:
token2.right_edge

)

In [54]:
# Show lemma 
print(sentence1)
print(sentence1[9])
print(sentence1[9].lemma_)

Guido van Rossum (born 31 January 1956) is a Dutch programmer best known as the creator of the Python programming language, for which he was the "benevolent dictator for life" (BDFL) until he stepped down from the position in July 2018.
is
be


In [55]:
print(sentence1[13])
print(sentence1[13].lemma_)

best
well


In [56]:
token2.morph

Number=Sing

## Part-of-speech (POS) tagging and dependency parsing

In [57]:
# Print each token's text, coarse POS tag, and dependency label
for token in doc[:10]:
    print(token.text, token.pos_, token.dep_)

Guido PROPN compound
van PROPN compound
Rossum PROPN nsubj
( PUNCT punct
born VERB acl
31 NUM nummod
January PROPN npadvmod
1956 NUM nummod
) PUNCT punct
is AUX ROOT


In [58]:
from spacy import displacy

In [59]:
displacy.render(doc, style = "dep")

## Named Entity Recognition (NER)

In [27]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Guido van Rossum PERSON
31 January 1956 DATE
Dutch NORP
July 2018 DATE
the Python Steering Council ORG
2019 DATE
2020 DATE
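
A common next step is to group the entities by label. The sketch below does this in plain Python on the (text, label) pairs copied from the output above, so it runs without a model; with a real `Doc`, the list could presumably be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output above.
entities = [
    ("Guido van Rossum", "PERSON"),
    ("31 January 1956", "DATE"),
    ("Dutch", "NORP"),
    ("July 2018", "DATE"),
    ("the Python Steering Council", "ORG"),
    ("2019", "DATE"),
    ("2020", "DATE"),
]

# Collect entity texts under their label, preserving the order of appearance.
by_label = defaultdict(list)
for ent_text, label in entities:
    by_label[label].append(ent_text)

for label, texts in by_label.items():
    print(label, texts)
```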


In [28]:
displacy.render(doc, style = "ent")

## Word vectors

In [29]:
word = nlp('table')
# print(word.vector.shape)  # uncomment to inspect the vector's dimensionality
print(word.vector)

[-0.58313864  1.0630852   0.38540393  0.44942492 -0.11450165  0.44385102
 -0.55115414  0.11430165 -0.8289989   0.3883373  -0.1464433  -0.1114676
  0.96859455  0.8757668   0.43654776 -0.31346455 -0.61303586  2.1882706
  0.02898124 -0.0783288   1.592581   -0.5172376   0.98317546  0.29783285
  0.28142393 -0.30727428 -0.8090403  -0.45026407  0.08394663 -0.48521546
  2.0606382  -2.185183    0.4335337   0.01390551 -0.97075903  0.01180416
 -0.4132477  -0.56985474 -0.58562505  0.8963891  -1.223198   -1.455324
  0.11601293  0.47334346 -0.8372756  -0.51900864  0.2746813  -0.20313692
  1.0556079   0.4269778  -0.9994365  -0.09648305 -1.1062945  -0.02543956
 -0.25719407  0.8337139  -0.4959442   0.54176724 -0.18229069  0.25719252
 -0.66506255 -0.02189845  0.6129887  -0.6857461  -0.36937773 -0.3326221
 -0.20762897 -0.58286655  0.4092842  -0.80683696  0.29174507  0.48075396
 -0.28493685 -0.23907119  0.48551935  0.11707528  0.53606135  2.224441
 -0.8839178  -1.2027037   2.2877827   0.6073051  -0.047986

In [30]:
doc1 = nlp("I like apples and oranges.")
doc2 = nlp("Fruit tastes very good.")

In [31]:
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like apples and oranges. <-> Fruit tastes very good. 0.2295838973287422


## Repeat with a larger model: en_core_web_sm ships without static word vectors, so its similarity scores are less reliable


In [32]:
nlp2 = spacy.load("en_core_web_lg")

In [33]:
doc1 = nlp2("I like apples and oranges.")
doc2 = nlp2("Fruit tastes very good.")

In [34]:
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like apples and oranges. <-> Fruit tastes very good. 0.8026200751227274
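
By default, spaCy computes `similarity` as the cosine similarity between the two objects' vectors. The sketch below shows that calculation in plain Python; the function name `cosine_similarity` and the three-dimensional vectors are made up for illustration and are far smaller than real word vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction; return 0.0 by convention here.
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up vectors, for illustration only.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]    # points in the same direction as vec_a
vec_c = [3.0, 0.0, -1.0]   # orthogonal to vec_a (dot product is 0)

print(cosine_similarity(vec_a, vec_b))  # close to 1.0
print(cosine_similarity(vec_a, vec_c))  # close to 0.0
```

Scores near 1 mean the vectors point in almost the same direction. The jump from roughly 0.23 to 0.80 between the two models reflects the different vectors each model provides for the same sentences.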
