<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center>    

# spaCy (https://spacy.io/)

spaCy is a Python library for many NLP tasks, such as tokenization, part-of-speech tagging, and named entity recognition.

Another important advantage of spaCy is that some of its pre-trained models include word embeddings (vectors that represent words). These vectors can be used in many NLP applications, for example to measure the semantic similarity between texts.
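
To make the idea of semantic similarity concrete, here is a minimal sketch of cosine similarity, a common measure for comparing word vectors. The three-dimensional vectors in `toy_vectors` are invented for illustration and do not come from any spaCy model; real embeddings usually have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, hand-written for this example only
toy_vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_banana = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print("king vs queen :", round(sim_king_queen, 3))
print("king vs banana:", round(sim_king_banana, 3))
```

Vectors that point in similar directions score close to 1, while unrelated ones score closer to 0.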
  




## Installing spaCy

In [1]:
!pip install spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


The next step is to download one of the models that spaCy offers. The [models page](https://spacy.io/usage/models) lists all available models.

In this tutorial we use the **en_core_web_sm** model, which supports tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. Note that the small (`sm`) models do not ship with static word vectors: similarity scores computed with this model rely on context-sensitive tensors and are less reliable. When you need real word embeddings, use a medium or large model such as `en_core_web_md`.


In [2]:
!python3 -m spacy download en_core_web_sm


2023-01-27 08:26:33.232037: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 31.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


We create an object, here called `nlp`, that holds the loaded model. The `nlp` object is then used to analyze texts.

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')           # load model package "en_core_web_sm"
print('en_core_web_sm has been loaded')



en_core_web_sm has been loaded
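
If the model might not be installed (for example, on a fresh machine), a defensive loading pattern can help. The sketch below assumes spaCy v3. `MODEL_NAME` and `pipeline` are illustrative names, and falling back to a blank English pipeline is a choice made for this example only: a blank pipeline tokenizes text but cannot tag, parse, or recognize entities.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"  # the model used in this tutorial

if is_package(MODEL_NAME):
    # the model is installed as a Python package, so it can be loaded by name
    pipeline = spacy.load(MODEL_NAME)
else:
    # assumption for this sketch: fall back to a tokenizer-only pipeline
    pipeline = spacy.blank("en")

# pipe_names lists the processing components that run after tokenization
print("components:", pipeline.pipe_names)
```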



## Sentence segmentation
When spaCy analyzes a text, the `nlp` object returns a document (`Doc`) object. The sentences of the text are available through the document's **sents** property, which yields them one at a time.
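
As a point of comparison, sentence boundaries can also be set with simple punctuation rules instead of a trained model. The sketch below assumes spaCy v3 and uses its built-in `sentencizer` component on a blank English pipeline; `rule_nlp`, `rule_doc`, and the sample text are illustrative names and data.

```python
import spacy

# A blank pipeline only contains a tokenizer; no model download is needed
rule_nlp = spacy.blank("en")
# The sentencizer splits sentences on punctuation such as '.', '!' and '?'
rule_nlp.add_pipe("sentencizer")

sample = "Sánchez was sworn in on 2 June. He led his party to gain 38 seats."
rule_doc = rule_nlp(sample)

sentences = [sent.text for sent in rule_doc.sents]
for i, sentence in enumerate(sentences):
    print(i, sentence)
```

A rule-based splitter is fast, but it may be less robust than a trained model on text with unusual punctuation.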


In [4]:
text='''Pedro Sánchez Pérez-Castejón (born 29 February 1972) is a Spanish politician serving as Prime Minister of Spain since 2 June 2018. On 7 January 2020, Pedro Sanchez was confirmed by the Congress of Deputies as Prime Minister with a lead of just two votes (167 to 165), at the helm of the first coalition government since the restoration of democracy in the 1970s, ending the political deadlock that included two inconclusive elections.He has also been Secretary-General of the Spanish Socialist Workers' Party (PSOE) since June 2017, having previously held that office from 2014 to 2016.
He served as a Madrid city councillor from 2004 to 2009, before being elected to the Congress of Deputies. In 2014, he became Secretary-General of the PSOE, becoming the party's candidate for Prime Minister in the 2015 and 2016 general elections. Sánchez resigned as Secretary-General after disagreements with the party's executive, and was re-elected the following year during a series of primaries, defeating Susana Díaz and Patxi López.
On 31 May 2018, the PSOE filed a no confidence motion in the Rajoy II Government in the Congress of Deputies, which was passed the following day with the support of Unidas Podemos and the leader of the party Pablo Iglesias, as well as various regionalist and nationalist parties. Sánchez was subsequently sworn in as Prime Minister of Spain by Felipe VI on 2 June. He led his party to gain 38 seats in the April 2019 general election, the PSOE's first national victory since 2008, a victory re-validated at the subsequent November 2019 general election.'''

document = nlp(text)

for i,s in enumerate(document.sents):
    print(i,s)
    

0 Pedro Sánchez Pérez-Castejón (born 29 February 1972) is a Spanish politician serving as Prime Minister of Spain since 2 June 2018.
1 On 7 January 2020, Pedro Sanchez was confirmed by the Congress of Deputies as Prime Minister with a lead of just two votes (167 to 165), at the helm of the first coalition government since the restoration of democracy in the 1970s, ending the political deadlock that included two inconclusive elections.
2 He has also been Secretary-General of the Spanish Socialist Workers' Party (PSOE) since June 2017, having previously held that office from 2014 to 2016.

3 He served as a Madrid city councillor from 2004 to 2009, before being elected to the Congress of Deputies.
4 In 2014, he became Secretary-General of the PSOE, becoming the party's candidate for Prime Minister in the 2015 and 2016 general elections.
5 Sánchez resigned as Secretary-General after disagreements with the party's executive, and was re-elected the following year during a series of primaries

## Tokenization 

From the document object we can also access its tokens and inspect their properties. The following cell iterates over the tokens of the document and shows some of these properties:
- the original text of the token.
- its shape (a pattern of its uppercase and lowercase letters).
- its part-of-speech (PoS) tag.
- the lowercased word.
- the lemma of the token.
- its prefix (by default, the first character).
- its suffix (by default, the last three characters).

The [Token API reference](https://spacy.io/api/token#attributes) describes these and other token properties.

These properties are very useful as features when representing instances for NLP tasks such as named entity recognition or text classification.
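
To illustrate how such properties can become features, here is a minimal pure-Python sketch. `word_shape` is only an approximation of the idea behind the shape property (it is not spaCy's own code), and `token_features` is a hypothetical helper written for this example.

```python
def word_shape(text):
    """Approximate a token shape: letters -> X/x, digits -> d, runs capped at 4."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mark = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mark = "d"
        else:
            mark = ch  # punctuation and other symbols are kept as they are
        run = run + 1 if mark == last else 1
        last = mark
        if run <= 4:  # ignore repetitions beyond the fourth
            shape.append(mark)
    return "".join(shape)


def token_features(text):
    """Build a small feature dictionary for one token (hypothetical helper)."""
    return {
        "lower": text.lower(),
        "shape": word_shape(text),
        "prefix": text[:1],    # first character
        "suffix": text[-3:],   # last three characters
    }


features = token_features("Sánchez")
print(features)
```

A list of such dictionaries, one per token, is a common input format for classical sequence labeling and classification models.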

In [5]:
for i, token in enumerate(document):
    print("original:", token.orth_)
    print("shape:", token.shape_)
    print("PoS tag:", token.pos_)
    print("lowercased:", token.lower_)
    print("lemma:", token.lemma_)
    print("prefix:", token.prefix_)
    print("suffix:", token.suffix_)
    print("----------------------------------------")
    # stop after the first 7 tokens (indices 0 to 6)
    if i > 5:
        break

original: Pedro
shape: Xxxxx
PoS tag: PROPN
lowercased: pedro
lemma: Pedro
prefix: P
suffix: dro
----------------------------------------
original: Sánchez
shape: Xxxxx
PoS tag: PROPN
lowercased: sánchez
lemma: Sánchez
prefix: S
suffix: hez
----------------------------------------
original: Pérez
shape: Xxxxx
PoS tag: PROPN
lowercased: pérez
lemma: Pérez
prefix: P
suffix: rez
----------------------------------------
original: -
shape: -
PoS tag: PUNCT
lowercased: -
lemma: -
prefix: -
suffix: -
----------------------------------------
original: Castejón
shape: Xxxxx
PoS tag: PROPN
lowercased: castejón
lemma: Castejón
prefix: C
suffix: jón
----------------------------------------
original: (
shape: (
PoS tag: PUNCT
lowercased: (
lemma: (
prefix: (
suffix: (
----------------------------------------
original: born
shape: xxxx
PoS tag: VERB
lowercased: born
lemma: bear
prefix: b
suffix: orn
----------------------------------------


## Named entity recognition
spaCy can also recognize the named entities that appear in a text. To access them, iterate over the **ents** property of the document object (or of a single sentence). Each entity exposes the following properties:
- text: the full mention of the named entity.
- label_: the entity type (for example, PERSON or DATE).
- start_char and end_char: the character offsets of the entity in the text. start_char is the position of its first character, and end_char is the position just after its last character.
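
The character offsets follow Python's slicing convention, so `text[start_char:end_char]` recovers the mention. The sketch below checks this in plain Python; the `(label, start, end)` tuples are copied by hand from the entity output for the first sentence of this tutorial's text.

```python
text = "Pedro Sánchez Pérez-Castejón (born 29 February 1972) is a Spanish politician"

# (label, start_char, end_char), transcribed from the tutorial's entity output
entities = [
    ("PERSON", 0, 28),
    ("CARDINAL", 35, 37),
    ("DATE", 38, 51),
    ("NORP", 58, 65),
]

mentions = []
for label, start, end in entities:
    mention = text[start:end]  # the end offset is exclusive, as in any slice
    mentions.append(mention)
    print(label, repr(mention), "length:", end - start)
```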



In [6]:
for i,s in enumerate(document.sents):
  print("Sentence: ", s)
  print("Entities:")
  for e in s.ents:
    print('\t',e.text,e.label_,e.start_char, e.end_char)
  print()



Sentence:  Pedro Sánchez Pérez-Castejón (born 29 February 1972) is a Spanish politician serving as Prime Minister of Spain since 2 June 2018.
Entities:
	 Pedro Sánchez Pérez-Castejón PERSON 0 28
	 29 CARDINAL 35 37
	 February 1972 DATE 38 51
	 Spanish NORP 58 65
	 Spain GPE 106 111
	 2 June 2018 DATE 118 129

Sentence:  On 7 January 2020, Pedro Sanchez was confirmed by the Congress of Deputies as Prime Minister with a lead of just two votes (167 to 165), at the helm of the first coalition government since the restoration of democracy in the 1970s, ending the political deadlock that included two inconclusive elections.
Entities:
	 7 January 2020 DATE 134 148
	 Pedro Sanchez PERSON 150 163
	 the Congress of Deputies ORG 181 205
	 two CARDINAL 244 247
	 167 to 165 CARDINAL 255 265
	 first ORDINAL 287 292
	 the 1970s DATE 352 361
	 two CARDINAL 407 410

Sentence:  He has also been Secretary-General of the Spanish Socialist Workers' Party (PSOE) since June 2017, having previously held that 

spaCy includes a visualizer, displaCy, that highlights the entities in the text:

In [7]:
from spacy import displacy

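# reuse the document parsed above instead of parsing the text again
displacy.render(document, jupyter=True, style='ent')
# outside Jupyter, displacy.serve starts a local web server instead:
#displacy.serve(document, style="ent")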
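
Outside a notebook, `displacy.render` can return the markup as a string, which is convenient for saving a report. The sketch below assumes spaCy v3 and uses displaCy's manual mode, so no model is required; `manual_doc` and its entity offsets are hand-written for illustration.

```python
from spacy import displacy

# Hand-written input for displaCy's manual mode (no model needed)
manual_doc = {
    "text": "Pedro Sanchez was confirmed by the Congress of Deputies.",
    "ents": [
        {"start": 0, "end": 13, "label": "PERSON"},
        {"start": 31, "end": 55, "label": "ORG"},
    ],
    "title": None,
}

# jupyter=False makes render return the HTML instead of displaying it
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(len(html), "characters of HTML generated")
```

The resulting string can be written to an `.html` file and opened in any browser.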



## Noun chunk recognition

spaCy also provides the noun chunks (base noun phrases) of a text. They are available through the **noun_chunks** property of the document object.

For each noun chunk, the following attributes are available:
- text: the full text of the noun chunk.
- root: the head word of the noun chunk.
- dep: the grammatical relation that links the noun chunk to the rest of the sentence. More information about these relations (dependencies) is available at https://nlp.stanford.edu/software/dependencies_manual.pdf. Some of these dependencies are:
  - *nsubj*: the syntactic subject. For example, 'Clinton defeated Dole' -> nsubj(defeated, Clinton).
  - *dobj*: the direct object of a verb phrase. For example, 'She gave me a raise' -> dobj(gave, raise).
  - *pobj*: the object of a preposition. For example, 'I sit on the chair' -> pobj(on, chair).
- head: the word that the noun chunk's root depends on. For example, 'The boy' stands in a subject relation with 'ran'.
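
When a label is unfamiliar, spaCy's built-in glossary can provide a short description. This small sketch relies only on `spacy.explain`, which needs no model; the list of labels is chosen for illustration.

```python
import spacy

# Look up short descriptions of a few dependency labels in spaCy's glossary
labels = ["nsubj", "dobj", "pobj", "det", "amod"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "->", description)
```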
 
 




In [8]:
import spacy
text= "The boy with the spotted dog quickly ran after the firetruck."
doc = nlp(text)

for chunk in doc.noun_chunks:
    print('text chunk:',chunk.text)
    print('root chunk:',chunk.root.text)
    print('grammatical dependency:',chunk.root.dep_)
    print('head chunk:',chunk.root.head.text)
    print('---------------------------------')
    

text chunk: The boy
root chunk: boy
grammatical dependency: nsubj
head chunk: ran
---------------------------------
text chunk: the spotted dog
root chunk: dog
grammatical dependency: pobj
head chunk: with
---------------------------------
text chunk: the firetruck
root chunk: firetruck
grammatical dependency: pobj
head chunk: after
---------------------------------


## Dependency parsing
spaCy provides a fast and accurate syntactic dependency parser. The parser identifies the grammatical relations between the tokens of a sentence. This information is crucial for NLP tasks such as relation extraction.
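
To see why dependencies help with relation extraction, consider this minimal pure-Python sketch. The `rows` are hand-written for illustration and describe a plausible analysis of the sentence "The boy with the spotted dog quickly ran after the firetruck."; `extract_relations` is a hypothetical helper, not part of spaCy.

```python
# (token index, word, dependency label, head index), hand-written for illustration
rows = [
    (0, "The", "det", 1),
    (1, "boy", "nsubj", 7),
    (2, "with", "prep", 1),
    (3, "the", "det", 5),
    (4, "spotted", "amod", 5),
    (5, "dog", "pobj", 2),
    (6, "quickly", "advmod", 7),
    (7, "ran", "ROOT", 7),
    (8, "after", "prep", 7),
    (9, "the", "det", 10),
    (10, "firetruck", "pobj", 8),
    (11, ".", "punct", 7),
]


def extract_relations(rows):
    """Return (subject, verb + preposition, object) triples around the root verb."""
    words = {index: word for index, word, _, _ in rows}
    root = next(index for index, _, dep, _ in rows if dep == "ROOT")
    subjects = [i for i, _, dep, head in rows if dep == "nsubj" and head == root]
    relations = []
    for prep_index, _, dep, head in rows:
        if dep != "prep" or head != root:
            continue  # only prepositions attached directly to the root verb
        for obj_index, _, obj_dep, obj_head in rows:
            if obj_dep == "pobj" and obj_head == prep_index:
                for subj_index in subjects:
                    predicate = words[root] + " " + words[prep_index]
                    relations.append((words[subj_index], predicate, words[obj_index]))
    return relations


relations = extract_relations(rows)
print(relations)  # [('boy', 'ran after', 'firetruck')]
```

Note that 'with the spotted dog' is skipped because its preposition attaches to 'boy' rather than to the verb.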



The following cell shows the dependency analysis of a sentence:


In [9]:
# Let's look at the dependencies of this example:
example = "The boy with the spotted dog quickly ran after the firetruck."
doc = nlp(example)
# shown as: original token, dependency tag, head word
for token in doc:
    print("word:",token.orth_)
    print("grammatical relation:", token.dep_)
    print("connected word (head):", token.head.orth_)
    print('------------------------------------------')

word: The
grammatical relation: det
connected word (head): boy
------------------------------------------
word: boy
grammatical relation: nsubj
connected word (head): ran
------------------------------------------
word: with
grammatical relation: prep
connected word (head): boy
------------------------------------------
word: the
grammatical relation: det
connected word (head): dog
------------------------------------------
word: spotted
grammatical relation: amod
connected word (head): dog
------------------------------------------
word: dog
grammatical relation: pobj
connected word (head): with
------------------------------------------
word: quickly
grammatical relation: advmod
connected word (head): ran
------------------------------------------
word: ran
grammatical relation: ROOT
connected word (head): ran
------------------------------------------
word: after
grammatical relation: prep
connected word (head): ran
------------------------------------------
word: the
grammatical re

In [10]:
displacy.render(doc, jupyter=True, style='dep')



To compare against the analysis produced by spaCy, you can use the online parser provided by Stanford: http://nlp.stanford.edu:8080/corenlp/process