# Spacy (https://spacy.io/)

Spacy is a Python library for many NLP tasks, such as tokenization, PoS tagging, entity detection and parsing.

Another important advantage of Spacy is that it can generate word vectors (word embeddings) for every word in a sentence, which we can use to compute semantic similarity. Spacy already ships with some pre-trained models that you can use right away.
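
Semantic similarity between word vectors is usually measured with cosine similarity, and spaCy's default similarity estimate is also based on it. The following is a minimal sketch in plain Python that only illustrates the computation: the three-dimensional vectors in `toy_vectors` are made up for this example and are not spaCy output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings"; real pre-trained vectors have many more dimensions
toy_vectors = {
    "dollar": [0.9, 0.1, 0.0],
    "sterling": [0.8, 0.2, 0.1],
    "iphone": [0.0, 0.3, 0.9],
}

sim_currencies = cosine_similarity(toy_vectors["dollar"], toy_vectors["sterling"])
sim_unrelated = cosine_similarity(toy_vectors["dollar"], toy_vectors["iphone"])
print(sim_currencies, sim_unrelated)
```

With these toy values, the two currency words score much higher than the unrelated pair, which is the behaviour we expect from real embeddings.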
 
  
## Install spacy
First, you must install Spacy in your Google Colab environment.



In [None]:
# Install spaCy and download the English model.
# 'en' is a spaCy 2.x shortcut link; as the output below shows, it resolves to en_core_web_sm.
!pip install spacy
!python -m spacy download en
import spacy
nlp = spacy.load('en')


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
text = '''Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''


document = nlp(text)



## Sentence splitting

Spacy splits the text into sentences, which are available through the property **sents** of the document. For each sentence, you can traverse its tokens and read their properties.

In the following cell, we print each sentence together with its index:


In [None]:
for i,s in enumerate(document.sents):
    print(i,s)
    

0 Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets.
1 MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent.
2 Japan’s
3 Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts.
4 Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.


## Tokenization 

Once you have loaded a text, you can parse it with Spacy. You only have to pass the text as an argument to the object **nlp**, which was created above. This returns a sequence of objects that represent the tokens of the text and their properties. These properties are very useful for representing instances in NLP tasks such as Named Entity Recognition or Text Classification.

In the next cell, we iterate over the tokens and show some of their properties:
- original text
- shape (a pattern of its uppercase and lowercase characters)
- PoS tag
- lowercased form
- lemma
- prefix (the first character)
- suffix (the last three characters)

Each token also has a Brown cluster id (the cluster to which the token belongs), which is not printed below.

You can find the description of the token properties at the following link: https://spacy.io/api/token#attributes

In [None]:
for i, token in enumerate(document):
    print("original:", token.orth_)
    print("shape:", token.shape_)
    print("PoS tag:", token.pos_)
    print("lowercased:", token.lower_)
    print("lemma:", token.lemma_)
    print("prefix:", token.prefix_)
    print("suffix:", token.suffix_)
    print("----------------------------------------")
    # stop after the first 7 tokens (i = 0..6)
    if i > 5:
        break

original: Asian
shape: Xxxxx
PoS tag: ADJ
lowercased: asian
lemma: asian
prefix: A
suffix: ian
----------------------------------------
original: shares
shape: xxxx
PoS tag: NOUN
lowercased: shares
lemma: share
prefix: s
suffix: res
----------------------------------------
original: skidded
shape: xxxx
PoS tag: VERB
lowercased: skidded
lemma: skid
prefix: s
suffix: ded
----------------------------------------
original: on
shape: xx
PoS tag: ADP
lowercased: on
lemma: on
prefix: o
suffix: on
----------------------------------------
original: Tuesday
shape: Xxxxx
PoS tag: PROPN
lowercased: tuesday
lemma: Tuesday
prefix: T
suffix: day
----------------------------------------
original: after
shape: xxxx
PoS tag: ADP
lowercased: after
lemma: after
prefix: a
suffix: ter
----------------------------------------
original: a
shape: x
PoS tag: DET
lowercased: a
lemma: a
prefix: a
suffix: a
----------------------------------------
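
To make the shape, prefix and suffix values in the output above concrete, here is a rough approximation in plain Python. It is a sketch that reproduces the values shown for these tokens, not Spacy's own implementation; the helper names `word_shape`, `prefix` and `suffix` are hypothetical.

```python
def word_shape(text):
    """Map letters to X/x and digits to d, keeping at most 4 repeats of a symbol."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation is kept as-is
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= 4:
            shape.append(symbol)
    return "".join(shape)


def prefix(text):
    """First character of the token."""
    return text[:1]


def suffix(text):
    """Last three characters of the token (fewer if the token is shorter)."""
    return text[-3:]


for word in ["Asian", "shares", "Tuesday", "a"]:
    print(word, word_shape(word), prefix(word), suffix(word))
```

Long runs of the same symbol are truncated, which is why both `shares` and `after` end up with the shape `xxxx`.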


## Entity Detection
Spacy can also identify the named entities that occur in the text. The named entities in the document are stored in its property **ents**.
For each entity, you can access the following properties:
- text: the whole mention of the named entity (the older property string also includes any trailing whitespace).
- label_: the entity type.
- start_char and end_char: the character offsets of the mention in the text.
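
The offsets are plain character indices into the original string, with the end offset exclusive, so slicing the text with them recovers the mention. A minimal sketch in plain Python, using the beginning of the example text and two (label, start, end) triples copied by hand for illustration:

```python
# Beginning of the example text (enough for the two mentions used here)
snippet = "Asian shares skidded on Tuesday after a rout in tech stocks"

# (label, start_char, end_char) triples, written by hand for this illustration
mentions = [("NORP", 0, 5), ("DATE", 24, 31)]

recovered = []
for label, start, end in mentions:
    # end_char is exclusive, so a plain slice returns the mention
    mention = snippet[start:end]
    recovered.append(mention)
    print(label, repr(mention))
```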






In [None]:
print('Original Sentence: {}'.format(text))
print()

for entity in document.ents:
    print('Type: {}, Value: {}, start: {}, end: {}'.format(entity.label_, entity.text, entity.start_char, entity.end_char))


Original Sentence: Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.

Type: NORP, Value: Asian, start: 0, end: 5
Type: DATE, Value: Tuesday, start: 24, end: 31
Type: LOC, Value: Europe, start: 147, end: 153
Type: LOC, Value: Asia-Pacific, start: 252, end: 264
Type: GPE, Value: Japan, start: 280, end: 285
Type: PERCENT, Value

In [None]:
for i, s in enumerate(document.sents):
    print("Text sentence: ", i, s)
    print('Named entities for the sentence: ', i)
    for e in s.ents:
        # text_with_ws is the mention plus any trailing whitespace
        # (the same value as the deprecated e.string)
        print('\t', e.text_with_ws, e.label_, e.start_char, e.end_char)


Text sentence:  0 Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets.
Named entities for the sentence:  0
	 Asian  NORP 0 5
	 Tuesday  DATE 24 31
	 Europe  LOC 147 153
Text sentence:  1 MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent.
Named entities for the sentence:  1
	 Asia-Pacific  LOC 252 264
	 Japan  GPE 280 285
	 1.7 percent  PERCENT 294 305
	 1-1/2  CARDINAL 311 316
	 week trough DATE 318 329
	 Australian  NORP 336 346
	 1.6 percent PERCENT 362 373
Text sentence:  2 Japan’s
Named entities for the sentence:  2
	 Japan GPE 375 380
Text sentence:  3 Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts.
Named entities for the sentence:  3
	 3.1 percent  PERCEN

Spacy also includes a visualizer, displacy, that highlights the entity mentions in the text:

In [None]:
from spacy import displacy

displacy.render(nlp(str(text)), jupyter=True, style='ent')
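
displacy renders the highlighted mentions as HTML inside the notebook. As a rough text-only analogue (not how displacy is implemented), the following sketch wraps each mention in brackets together with its label, using character offsets. The function `highlight` and the hand-written `spans` are hypothetical, and the spans are assumed not to overlap.

```python
def highlight(text, spans):
    """Return text with each (start, end, label) span wrapped as [mention|LABEL]."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])  # text before the mention
        out.append("[" + text[start:end] + "|" + label + "]")
        cursor = end
    out.append(text[cursor:])  # remaining text after the last mention
    return "".join(out)


snippet = "Asian shares skidded on Tuesday after a rout in tech stocks"
spans = [(0, 5, "NORP"), (24, 31, "DATE")]
print(highlight(snippet, spans))
# prints: [Asian|NORP] shares skidded on [Tuesday|DATE] after a rout in tech stocks
```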


## Exercise: Spacy for Spanish

Use Spacy to detect the named entities in the following text:

In [None]:
text = '''Junts per Catalunya opta ahora por no poner palos en las ruedas para que Esquerra facilite la investidura de Pedro Sánchez. 
La formación que lidera Carles Puigdemont —a la espera de que la justicia belga decida sobre su extradición— anunció este 
martes que retira una moción sobre la autodeterminación, que tenía que ser votada hoy miércoles en el Parlament y que ponía a ERC
 en una situación comprometida. La decisión, que generó mucho debate interno, se gestó en la reunión que tuvieron 
 el expresident y varios cargos electos de Junts, el pasado lunes en Bélgica..'''




First, you must download and load the model for Spanish:

In [None]:
!python -m spacy download es
nlp = spacy.load('es')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.2.5/es_core_news_sm-2.2.5.tar.gz (16.2 MB)
     |████████████████████████████████| 16.2 MB 4.7 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/es_core_news_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/es
You can now load the model via spacy.load('es')


In [None]:
# write your code here
document = nlp(text)

for entity in document.ents:
    print('Type: {}, Value: {}, start: {}, end: {}'.format(entity.label_, entity.text, entity.start_char, entity.end_char))


Type: PER, Value: Junts per Catalunya, start: 1, end: 20
Type: ORG, Value: Esquerra, start: 74, end: 82
Type: PER, Value: Pedro Sánchez, start: 110, end: 123
Type: MISC, Value: La formación, start: 126, end: 138
Type: PER, Value: Carles Puigdemont, start: 150, end: 167
Type: MISC, Value: Parlament, start: 351, end: 360
Type: ORG, Value: ERC, start: 375, end: 378
Type: MISC, Value: La decisión, start: 411, end: 422
Type: PER, Value: Junts, start: 537, end: 542
Type: LOC, Value: Bélgica, start: 563, end: 570


Please use displacy to highlight them:

In [None]:
# write your code here
from spacy import displacy

displacy.render(nlp(str(text)), jupyter=True, style='ent')


Now we are going to use Spacy to perform NER on texts from the biomedical domain. Does it work well?

In [None]:
#text='''Benz(a)anthracene is a polycyclic aromatic hydrocarbon. 
#The phosphoinositide, phosphatidylinositol-3,4,5-trisphosphate
#(PI(3,4,5)P3), is a key signaling lipid.'''

text='''Monkeypox is a rare disease that is caused by infection 
with monkeypox virus. Monkeypox virus belongs to the Orthopoxvirus 
genus in the family Poxviridae. The Orthopoxvirus genus also 
includes variola virus (which causes smallpox), vaccinia virus 
(used in the smallpox vaccine), and cowpox virus.'''
nlp = spacy.load('en')
doc = nlp(text)
# for i, t in enumerate(doc):
#     print(t.orth_ + '\t' + '\t' + t.pos_)
for entity in doc.ents:
    print('Type: {}, Value: {}, start: {}, end: {}'.format(entity.label_, entity.text, entity.start_char, entity.end_char))


Type: ORG, Value: Orthopoxvirus, start: 110, end: 123
Type: PRODUCT, Value: Poxviridae, start: 145, end: 155
Type: GPE, Value: Orthopoxvirus, start: 161, end: 174
