# NLP model using spaCy

## 1. Introduction to spaCy

Link: https://spacy.io/ On the site, open the Models page and pick the trained pipeline for the language we want to use. Here we download the small Spanish pipeline, `es_core_news_sm`.

In [None]:
!python -m spacy download es_core_news_sm

2022-09-01 08:23:40.141958: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.4.0/es_core_news_sm-3.4.0-py3-none-any.whl (12.9 MB)
     |████████████████████████████████| 12.9 MB 8.0 MB/s
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [None]:
import spacy

In [None]:
nlp = spacy.load("es_core_news_sm") #loading the Spanish language package and storing it in a variable

In [None]:
nlp

<spacy.lang.es.Spanish at 0x7fc07e552e50>

In [None]:
document = """La situación de la pandemia en el
mundo cada vez se vuelve más compleja. En la actualidad, cada país está afrontando
distintos problemas sociales, sanitarios y económicos, lo que ha llevado a una merma
en la calidad de vida de las personas, principalmente de bajos recursos. Dicho esto, 
es fundamental tener políticas adecuadas para enfrentar la incertidumbre."""

In [None]:
document_spacy = nlp(document)

In [None]:
document_spacy

La situación de la pandemia en el
mundo cada vez se vuelve más compleja. En la actualidad, cada país está afrontando
distintos problemas sociales, sanitarios y económicos, lo que ha llevado a una merma
en la calidad de vida de las personas, principalmente de bajos recursos. Dicho esto, 
es fundamental tener políticas adecuadas para enfrentar la incertidumbre.

In [None]:
type(document_spacy) # `Doc` is spaCy's container class: it holds the text plus every annotation
# (e.g. the tokenization) that the pipeline produces when it processes the text

spacy.tokens.doc.Doc

## 2. Text preprocessing


In [None]:
for token in document_spacy:
  print(token)

La
situación
de
la
pandemia
en
el


mundo
cada
vez
se
vuelve
más
compleja
.
En
la
actualidad
,
cada
país
está
afrontando


distintos
problemas
sociales
,
sanitarios
y
económicos
,
lo
que
ha
llevado
a
una
merma


en
la
calidad
de
vida
de
las
personas
,
principalmente
de
bajos
recursos
.
Dicho
esto
,


es
fundamental
tener
políticas
adecuadas
para
enfrentar
la
incertidumbre
.


Detect and remove stop words. For example, in the first sentence, words such as 'la' and 'de' add little meaning. Note that **commas and periods are not stop words**, but they also need to be removed.

In [None]:
for token in document_spacy:
  print(token.text, '--', token.is_stop, '--', token.is_punct)

La -- True -- False
situación -- False -- False
de -- True -- False
la -- True -- False
pandemia -- False -- False
en -- True -- False
el -- True -- False

 -- False -- False
mundo -- False -- False
cada -- True -- False
vez -- True -- False
se -- True -- False
vuelve -- False -- False
más -- True -- False
compleja -- False -- False
. -- False -- True
En -- True -- False
la -- True -- False
actualidad -- False -- False
, -- False -- True
cada -- True -- False
país -- False -- False
está -- True -- False
afrontando -- False -- False

 -- False -- False
distintos -- False -- False
problemas -- False -- False
sociales -- False -- False
, -- False -- True
sanitarios -- False -- False
y -- True -- False
económicos -- False -- False
, -- False -- True
lo -- True -- False
que -- True -- False
ha -- True -- False
llevado -- False -- False
a -- True -- False
una -- True -- False
merma -- False -- False

 -- False -- False
en -- True -- False
la -- True -- False
calidad -- False -- False
de -- T

In [None]:
document_spacy_clean = [token for token in document_spacy if not token.is_stop and not token.is_punct]
# keep only the tokens that are neither stop words nor punctuation

In [None]:
document_spacy_clean

[situación,
 pandemia,
 ,
 mundo,
 vuelve,
 compleja,
 actualidad,
 país,
 afrontando,
 ,
 distintos,
 problemas,
 sociales,
 sanitarios,
 económicos,
 llevado,
 merma,
 ,
 calidad,
 vida,
 personas,
 principalmente,
 bajos,
 recursos,
 ,
 fundamental,
 políticas,
 adecuadas,
 enfrentar,
 incertidumbre]
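
The list above still contains whitespace tokens (the line breaks of the original string), because they are neither stop words nor punctuation. The following is a minimal sketch of one way to drop them as well, using the `is_space` token attribute. To stay self-contained it assumes a blank Spanish pipeline (`spacy.blank("es")`, tokenizer only) instead of `es_core_news_sm`, and the names `nlp_blank`, `sample` and `clean_tokens` are hypothetical.

```python
import spacy

# Assumption: a blank Spanish pipeline is enough here, because is_stop,
# is_punct and is_space are lexical attributes that need no trained model.
nlp_blank = spacy.blank("es")

sample = nlp_blank(
    "La situación de la pandemia en el\nmundo cada vez se vuelve más compleja."
)

# Keep a token only if it is not a stop word, not punctuation and not whitespace.
clean_tokens = [
    token.text
    for token in sample
    if not (token.is_stop or token.is_punct or token.is_space)
]
print(clean_tokens)
```

The same extra condition (`and not token.is_space`) could be added to the list comprehension used for `document_spacy_clean`.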

LEMMATIZATION
Lemmatization reduces each word to its base (dictionary) form, the lemma; for example, 'vuelve' becomes 'volver'. It is a good strategy when we want to compare texts and check whether they deal with similar issues.

In [None]:
for token in document_spacy_clean:
  print(token.lemma_)

situación
pandemia


mundo
volver
complejo
actualidad
país
afrontar


distinto
problema
social
sanitario
económico
llevar
merma


calidad
vida
persona
principalmente
bajo
recurso


fundamental
política
adecuado
enfrentar
incertidumbre


## 3. Named-entity recognition


In [None]:
employees = "nombre: Nestor Campos email: nestor@micorreo.com país: Chile nombre: Juana Perez email: juana@micorreo.com país:Argentina nombre: Francisca Leiva email: francisca.leiva@gmail.cl país: Uruguay "

In [None]:
employees_spacy = nlp(employees)

In [None]:
for token in employees_spacy:
  if token.like_email: # print the token if it looks like an email address (related attributes: like_num, like_url, is_digit, is_upper)
    print(token.text)

nestor@micorreo.com
juana@micorreo.com
francisca.leiva@gmail.cl
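
The other `like_*` attributes work the same way. As a small sketch, assuming a blank Spanish pipeline (these attributes are rule-based, so no trained model should be needed) and a made-up sentence, we can collect emails, URLs and numbers in one pass:

```python
import spacy

# Assumption: tokenizer-only pipeline; like_email, like_url and like_num
# are computed from the token text alone.
nlp_blank = spacy.blank("es")

# Hypothetical example sentence, not part of the employee data above.
contact = nlp_blank("Escribe a nestor@micorreo.com o visita https://spacy.io antes de 2023")

emails = [token.text for token in contact if token.like_email]
urls = [token.text for token in contact if token.like_url]
numbers = [token.text for token in contact if token.like_num]

print(emails, urls, numbers)
```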


In [None]:
print(employees_spacy.ents) #extract entities, but what kind of entities?

(Nestor Campos, Chile, Juana Perez, Argentina, Francisca Leiva, Uruguay)


In [None]:
for entity in employees_spacy.ents:
  print(entity.text, '--', entity.label_) #the label is what we need to work with in order to get the type of entity

Nestor Campos -- PER
Chile -- LOC
Juana Perez -- PER
Argentina -- LOC
Francisca Leiva -- PER
Uruguay -- LOC
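
If a label is not self-explanatory, `spacy.explain` returns a short description from spaCy's built-in glossary. The sketch below looks up the two labels seen above plus `ORG` and `MISC`, which are assumed here to be the remaining labels of the Spanish news pipelines; the dictionary name `descriptions` is hypothetical.

```python
import spacy

# Look up the human-readable description of each entity label.
labels = ("PER", "LOC", "ORG", "MISC")
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "--", description)
```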


In [None]:
# Rendering: highlight the recognized entities inline in the notebook
from spacy import displacy
displacy.render(employees_spacy, style="ent", jupyter=True)

## 4. Matching based on patterns


In [None]:
my_trip = """Hace un par de años visité Cuzco para conocer Machu Picchu.
           También he visitado Argentina un par de veces. 
           La última vez, en 2020, pude visitar Boston.
           """
# The three sentences share a pattern: a form of the verb 'visitar' followed by a place name

In [None]:
my_trip_spacy = nlp(my_trip)

In [None]:
from spacy.matcher import Matcher

In [None]:
matcher = Matcher(nlp.vocab)

In [None]:
pattern = [{"LEMMA": "visitar"}, {"POS": "PROPN"}] # POS = part of speech, PROPN = proper noun

In [None]:
matcher.add("lugares_visitados", [pattern])

In [None]:
matches_trip = matcher(my_trip_spacy)
# each match is (match_id, start, end): the hash of the pattern name (the same for all three),
# then the start and end token indices of the matched span (end is exclusive)
matches_trip

[(16290618549988947147, 5, 7),
 (16290618549988947147, 15, 17),
 (16290618549988947147, 31, 33)]

In [None]:
for id_value, start, end in matches_trip:
  print(my_trip_spacy[start:end].text)

visité Cuzco
visitado Argentina
visitar Boston
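
The long integer in each match is the hash of the pattern name, and it can be turned back into the name through `nlp.vocab.strings`. Below is a minimal, self-contained sketch. It assumes a blank Spanish pipeline, which has no lemmatizer or tagger, so the hypothetical pattern matches the literal word "visitar" followed by a capitalized token instead of using `LEMMA` and `POS`.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("es")  # assumption: tokenizer-only pipeline
matcher_demo = Matcher(nlp_blank.vocab)

# Hypothetical pattern: the word "visitar" followed by a title-cased token.
matcher_demo.add("lugares_visitados", [[{"LOWER": "visitar"}, {"IS_TITLE": True}]])

doc_demo = nlp_blank("La última vez pude visitar Boston.")

# Resolve each match ID to its pattern name and slice out the matched span.
results = [
    (nlp_blank.vocab.strings[match_id], doc_demo[start:end].text)
    for match_id, start, end in matcher_demo(doc_demo)
]
print(results)
```

Resolving the name is mostly useful once several patterns are registered on the same matcher and we need to know which one fired.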


## 5. Similarity
In Machine Learning and NLP, there are three popular similarity or distance metrics: **Euclidean distance, dot product, and cosine similarity**.
Each word is mapped to a numeric vector, so two texts are considered similar when their vectors are close to each other. By default, spaCy's `similarity` method returns the cosine similarity of the averaged word vectors.
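
To make the three metrics concrete, here is a small pure-Python sketch on two toy vectors. The vectors and function names are illustrative assumptions; real word vectors have hundreds of dimensions.

```python
import math

def dot_product(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    """Straight-line distance between two vectors (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, ignoring their lengths."""
    return dot_product(u, v) / (
        math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    )

# Toy vectors: v points in the same direction as u but is twice as long.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]

print(dot_product(u, v))         # grows with both direction and length
print(euclidean_distance(u, v))  # non-zero, because the lengths differ
print(cosine_similarity(u, v))   # close to 1, because the directions match
```

Cosine similarity depends only on direction, which is why it is a common choice for comparing texts of different lengths.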

In [None]:
text1= nlp('malo')

In [None]:
text2= nlp('horrible')

In [None]:
text1.similarity(text2)
# pay attention to the warning message: the small model ships without word vectors, so this score is unreliable.
# The words 'horrible' and 'malo' are clearly more similar than a score of about 0.25 suggests


0.24709503388408582

In [None]:
!python -m spacy download es_core_news_md # the medium model includes word vectors, which the similarity score relies on

2022-09-01 09:44:55.441656: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es-core-news-md==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.4.0/es_core_news_md-3.4.0-py3-none-any.whl (42.3 MB)
     |████████████████████████████████| 42.3 MB 1.9 MB/s
Installing collected packages: es-core-news-md
Successfully installed es-core-news-md-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_md')


In [None]:
nlp2 = spacy.load("es_core_news_md")

In [None]:
text1= nlp2('malo')

In [None]:
text2= nlp2('horrible')

In [None]:
text1.similarity(text2) #now we get better results

0.5889570376208377

In [None]:
#let's compare some larger snippets of text
revision_1 = nlp2('La comida estaba deliciosa')
revision_2 = nlp2('La comida estaba excelente')
revision_3 = nlp2('No me gustó la comida')
revision_4 = nlp2('No me gustó la experiencia')
revision_5 = nlp2('La comida no estaba buena')

In [None]:
revision_1.similarity(revision_2)

0.9841328990920907

In [None]:
revision_3.similarity(revision_4)

0.9809007926808403

In [None]:
revision_3.similarity(revision_5)

0.6037055715091896