This notebook covers basic spaCy functionality. Some cells are based on the [spaCy documentation](https://spacy.io/) and others on [the spaCy course](https://course.spacy.io/).

## Installation and imports

In [16]:
%pip install spacy > /dev/null
!python -m spacy download en_core_web_sm > /dev/null
!python -m spacy download es_core_news_md > /dev/null


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [17]:
import spacy

from spacy import displacy

## Linguistic annotations

In [18]:
nlp_en = spacy.load("en_core_web_sm")

In [19]:
doc_en = nlp_en("Apple is looking at buying U.K. startup for $1 billion")

Calling the loaded pipeline (here `nlp_en`) on a text runs every pipeline component over it and returns a `Doc` object that holds the tokens and their annotations.
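As a rough sketch of what a `Doc` offers as a container: it behaves like a sequence of tokens, so it can be measured, indexed, and sliced. The example below assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without downloading a model; the name `nlp_blank` is made up for this illustration and is not used elsewhere in the notebook.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_sm, so no linguistic annotations are available here.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Apple is looking at buying U.K. startup for $1 billion")

first_token = doc[0]   # indexing a Doc yields a Token
some_span = doc[0:3]   # slicing a Doc yields a Span

print(type(doc).__name__, len(doc))
print(type(first_token).__name__, first_token.text)
print(type(some_span).__name__, some_span.text)
```

With a trained pipeline such as `nlp_en`, the same tokens additionally carry the part-of-speech, dependency, and entity annotations printed in the cells that follow.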

In [20]:
nlp_en.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [21]:
for token in doc_en:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


In [22]:
for entity in doc_en.ents:
    print(entity.text, entity.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In [23]:
displacy.render(doc_en, style="ent")

Spanish example

In [51]:
nlp_es = spacy.load("es_core_news_md")
doc_es = nlp_es("Apple está viendo de comprar un emprendimiento por muchos dólares, unos $100.000 o cien mil.")
for token in doc_es:
    print(token.text, token.lemma_, token.pos_, token.tag_,
            token.shape_, token.is_alpha, token.is_stop, token.head.lemma_, token.dep_)

Apple Apple PROPN PROPN Xxxxx True False ver nsubj
está estar AUX AUX xxxx True True ver aux
viendo ver VERB VERB xxxx True False ver ROOT
de de ADP ADP xx True True comprar mark
comprar comprar VERB VERB xxxx True False ver xcomp
un uno DET DET xx True True emprendimiento det
emprendimiento emprendimiento NOUN NOUN xxxx True False comprar obj
por por ADP ADP xxx True True dólar case
muchos mucho DET DET xxxx True True dólar det
dólares dólar NOUN NOUN xxxx True False comprar obl
, , PUNCT PUNCT , False False $ punct
unos uno DET DET xxxx True True $ det
$ $ NUM NUM $ False False dólar appos
100.000 100000 NUM NUM ddd.ddd False False $ nummod
o o CCONJ CCONJ x True True cien cc
cien cien NUM NUM xxxx True False 100000 conj
mil mil NUM NUM xxx True False $ appos
. . PUNCT PUNCT . False False ver punct


In [50]:
displacy.render(doc_es, style='dep')

In [25]:
for ent in doc_es.ents:
    print(ent.text, ent.label_)

Apple ORG


In [26]:
displacy.render(doc_es, style="ent")


## Word vectors and semantic similarity

In [27]:
tokens = nlp_es("perro gato banana afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

perro True 32.17176 False
gato True 35.150936 False
banana True 25.973137 False
afskfsd False 0.0 True


We compare texts with spaCy's `similarity` method, which by default scores the cosine similarity of their vectors (for a `Doc` or `Span`, the average of its token vectors). As the documentation notes, *"spaCy’s similarity implementation usually assumes a pretty general-purpose definition of similarity"*.

In [48]:
doc1 = nlp_es("Me gustan las berenjenas al horno y la comida hecha con soja.")
doc2 = nlp_es("La comida vegetariana sabe muy bien.")

# Similarity of two documents
print(f"Comparing two documents. Doc 1: {doc1}\n Doc 2: {doc2}\n Similarity: {doc1.similarity(doc2)} \n")


# Similarity of two spans (slices of doc1)
berenjenas = doc1[3:6]
soja = doc1[8:12]
print(f"Comparing tokens and spans: {berenjenas} <-> {soja} {berenjenas.similarity(soja)}")

Comparing two documents. Doc 1: Me gustan las berenjenas al horno y la comida hecha con soja.
 Doc 2: La comida vegetariana sabe muy bien.
 Similarity: 0.4613224950137207 

Comparing tokens and spans: berenjenas al horno <-> comida hecha con soja 0.21418216824531555
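To make the scores above easier to interpret, here is a minimal pure-Python sketch of the default computation as described in the spaCy documentation: average the token vectors, then take the cosine similarity. The helper names (`cosine_similarity`, `mean_vector`) and the three-dimensional `toy_vectors` are hypothetical and invented for this illustration; they are not the vectors from `es_core_news_md`, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction (e.g. a token without a vector).
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical toy "word vectors", made up for illustration only.
toy_vectors = {
    "perro": [1.0, 0.0, 0.0],
    "gato": [1.0, 1.0, 0.0],
    "banana": [0.0, 0.0, 2.0],
}

# Token-level comparison: one vector against another.
print(cosine_similarity(toy_vectors["perro"], toy_vectors["gato"]))

# Span-level comparison: average the token vectors first, then compare.
span_vector = mean_vector([toy_vectors["perro"], toy_vectors["gato"]])
print(cosine_similarity(span_vector, toy_vectors["banana"]))
```

Because the score depends only on the angle between vectors, it measures how alike their directions are in the embedding space, which is why two texts on related topics can still score well below 1.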
