# Basic NLP with spaCy

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ohtar10/icesi-nlp/blob/main/Sesion1/1-spacy-basics.ipynb)

## References
* [NLP - Natural Language Processing With Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python)
* [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

This notebook contains basic examples of using the spaCy library for natural language processing with classical techniques. Working through them will help us get familiar with the classical methods.

## Environment setup
Assuming the library is already installed, we still need to download a trained pipeline (model) for the language we want to process. The right one depends on the task; for English, for example:

In [2]:
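import pkg_resources
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Detect Google Colab by checking whether the `google-colab` package is installed
installed_packages = [package.key for package in pkg_resources.working_set]
IN_COLAB = 'google-colab' in installed_packages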

In [3]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/requirements.txt && pip install -r requirements.txt

In [19]:
import sys

# Use the notebook's own interpreter; a bare `python` may not be on the PATH
!{sys.executable} -m spacy download en_core_web_sm


We then need to load it (installing spaCy first if it is missing):

In [7]:
pip install spacy


In [9]:
import spacy

# Download the model if not already present
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    from spacy.cli import download
    download('en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')


## Creating a simple document
spaCy processes this document automatically, using the pipeline of the selected language.

In [10]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

From here, we can inspect the different elements of the document.

In [11]:
col1 = "Token"
col2 = "POS" # Part of Speech
col3 = "S-dep" # Syntactic dependency

print(f"{col1:{20}}{col2:{20}}{col3:{20}}")
for token in doc:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_}")

Token               POS                 S-dep               
Tesla               PROPN               nsubj
is                  AUX                 aux
looking             VERB                ROOT
at                  ADP                 prep
buying              VERB                pcomp
U.S.                PROPN               dobj
startup             VERB                advcl
for                 ADP                 prep
$                   SYM                 quantmod
6                   NUM                 compound
million             NUM                 pobj


We have printed the tokens (words, in this case), the part of speech (POS) each one represents, and the syntactic dependency each token has within the sentence. The predictions are not perfect: the small model tags `startup` as a verb here, although it is used as a noun.

Classical NLP relies on a specialized taxonomy for each element of language. These categories came out of many linguistic studies that aimed to model language systematically, and linguists were directly involved in defining them.

Libraries such as spaCy make it easy to work with this taxonomy.
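The labels printed above are abbreviations from that taxonomy. As a minimal sketch, `spacy.explain` can be used to look up a short description for a label; the three labels chosen here are just examples taken from the output above.

```python
import spacy

# spacy.explain looks a label up in spaCy's built-in glossary.
# It needs no trained pipeline and returns None for unknown labels.
labels = ["PROPN", "NNP", "nsubj"]
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(f"{label:10}{description}")
```

This is handy when a POS tag or dependency label in the output is unfamiliar.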

## A simple spaCy pipeline

The core of spaCy is the pipeline: the sequence of NLP processing steps that the original text goes through once it has been tokenized.

In [12]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x134c41f40>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x134c41c40>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x134ae6350>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x134ccb9c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x134c97140>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x134ae6200>)]

As we can see, the default pipeline is made up of several components, some of which should look familiar:

* tok2vec: converts the tokens into vectors.
* lemmatizer: reduces each word to its base form (lemma).
* ner: named entity recognition, which identifies the named entities (people, organizations, places, amounts) mentioned in the documents.
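To see how a pipeline is put together, the following sketch starts from an empty one and adds a single component. It assumes a blank English pipeline (`spacy.blank("en")`) and the built-in rule-based `sentencizer`, neither of which is used in the original notebook; the names `nlp_blank` and `doc_blank` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline, so no trained model is required.
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)  # empty: the tokenizer is not listed as a component

# Components are added by their registered name and run in order.
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

# The tokenizer always runs first, then each component in the list.
doc_blank = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")
print([token.text for token in doc_blank])
```

A trained pipeline such as `en_core_web_sm` works the same way, only with more components already in place.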

A document is iterable, and its tokens can be accessed by index.

In [13]:
n = 0
print(f"The {n}th token in the document is: {doc[n]}")

The 0th token in the document is: Tesla
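A few more ways of navigating a document, as a small sketch. It assumes a blank English pipeline (tokenizer only) so that it runs without a downloaded model; tokenization is the only step it relies on.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), no model download needed.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

print(len(doc_blank))   # number of tokens in the document
print(doc_blank[-1])    # negative indices count from the end

# token.i is the position of the token inside its document
positions = [(token.i, token.text) for token in doc_blank]
print(positions[:3])
```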


## Exploring the transformed elements

In [14]:
from spacy.tokens.doc import Doc
import pandas as pd

def get_doc_elements(doc: Doc):
    elements = ["text", "lemma", "pos", "tag", "shape", "alpha", "stop"]
    rows = [ [token.text, token.lemma_, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop] 
            for token in  doc]
    return pd.DataFrame(rows, columns=elements)

In [15]:
doc_elements = get_doc_elements(doc)
doc_elements

| |text|lemma|pos|tag|shape|alpha|stop|
|:------|:------|:------|:------|:------|:------|:------|:------|
|0|Tesla|Tesla|PROPN|NNP|Xxxxx|True|False|
|1|is|be|AUX|VBZ|xx|True|True|
|2|looking|look|VERB|VBG|xxxx|True|False|
|3|at|at|ADP|IN|xx|True|True|
|4|buying|buy|VERB|VBG|xxxx|True|False|
|5|U.S.|U.S.|PROPN|NNP|X.X.|False|False|
|6|startup|startup|VERB|VBD|xxxx|True|False|
|7|for|for|ADP|IN|xxx|True|True|
|8|`$`|`$`|SYM|`$`|`$`|False|False|
|9|6|6|NUM|CD|d|False|False|


Where:

|Attribute|Description|Value for `doc[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`Tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape: capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|
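Not all of these attributes come from trained components. The sketch below, which assumes a blank English pipeline rather than `en_core_web_sm`, suggests that `shape_`, `is_alpha` and `is_stop` are available from the tokenizer and language data alone, while `pos_` stays empty until a trained component fills it in.

```python
import spacy

# Assumption: a blank English pipeline, i.e. tokenizer and language data only.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

# Lexical attributes are computed from the token text itself.
lexical = {
    token.text: (token.shape_, token.is_alpha, token.is_stop)
    for token in doc_blank
}
for text, attrs in lexical.items():
    print(f"{text:10}{attrs}")

# Attributes predicted by trained components are empty in a blank pipeline.
print(repr(doc_blank[0].pos_))
```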

## Span objects
A span is a slice of a document: it runs from one token index up to another. This makes it easier to process a text in chunks instead of handling the whole document at once.

In [16]:
# Definition of NLP according to Wikipedia
doc = nlp(
    "Natural language processing (NLP) is a subfield of computer science, "
    "information engineering, and artificial intelligence concerned with the "
    "interactions between computers and human (natural) languages, in particular "
    "how to program computers to process and analyze large amounts of natural language data. "
    "Challenges in natural language processing frequently involve speech recognition, natural "
    "language understanding, and natural language generation."
)

quote = doc[10:30]
quote

computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural)

Note that the slice is taken over tokens, not over individual characters. This is very useful because it guarantees that a token is never cut in half.
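The difference between token-based and character-based slicing can be sketched as follows. The example assumes a blank English pipeline and a short made-up text, so it runs without a trained model.

```python
import spacy

# Assumption: a blank English pipeline; only tokenization is needed here.
nlp_blank = spacy.blank("en")
text = "This is the first sentence. This is the second sentence."
doc_blank = nlp_blank(text)

span = doc_blank[6:10]   # tokens at positions 6, 7, 8 and 9
print(span.text)         # slice over tokens
print(text[6:10])        # the same indices applied to the raw string

# A span also knows where it sits in the original text.
print(span.start, span.end, span.start_char, span.end_char)
```

The character offsets `start_char` and `end_char` make it possible to map a span back to the raw string when needed.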

## Working with sentences
We can iterate over the sentences of a document. In this pipeline the sentence boundaries come from the dependency parser, not from simply splitting on the period ".".

In [17]:
doc = nlp("This is the first sentence. This is the second sentence. And this is the last sentence.")
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is the second sentence.
And this is the last sentence.


**Note:** Each period counts as a token, so the second "This" in the document above is at index `6`, not `5`.

In [18]:
print(f"Token 5: {doc[5]}")
print(f"Token 6: {doc[6]}")
print(f"Is token 6 a sentence start? {doc[6].is_sent_start}")

Token 5: .
Token 6: This
Is token 6 a sentence start? True
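When no trained parser is available, sentence boundaries can also be set with rules. The sketch below assumes a blank English pipeline with spaCy's built-in `sentencizer`, which splits on punctuation; this is a swapped-in technique, not what `en_core_web_sm` does above.

```python
import spacy

# Assumption: blank pipeline plus the rule-based sentencizer (no trained parser).
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules(
    "This is the first sentence. This is the second sentence. "
    "And this is the last sentence."
)

sentences = [sent.text for sent in doc_rules.sents]
# Indices of the tokens that open a sentence
starts = [token.i for token in doc_rules if token.is_sent_start]

print(sentences)
print(starts)
```

On clean text like this both approaches agree; on text with abbreviations or missing punctuation the results may differ.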


In [22]:
# Install the Spanish model
%pip install spacy
import sys
# Use the notebook's own interpreter; a bare `python` may not be on the PATH
!{sys.executable} -m spacy download es_core_news_sm

# Load it in your code
import spacy

try:
    nlp = spacy.load("es_core_news_sm")
except OSError:
    from spacy.cli import download
    download("es_core_news_sm")
    nlp = spacy.load("es_core_news_sm")

# Process Spanish text
doc = nlp("El lenguaje natural es fascinante.")
for token in doc:
    print(token.text, token.pos_, token.dep_)


El DET det
lenguaje NOUN nsubj
natural ADJ amod
es AUX cop
fascinante ADJ ROOT
. PUNCT punct


In [23]:
col1 = "Token"
col2 = "POS" # Part of Speech
col3 = "S-dep" # Syntactic dependency

print(f"{col1:{20}}{col2:{20}}{col3:{20}}")
for token in doc:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_}")

Token               POS                 S-dep               
El                  DET                 det
lenguaje            NOUN                nsubj
natural             ADJ                 amod
es                  AUX                 cop
fascinante          ADJ                 ROOT
.                   PUNCT               punct


In [24]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x15117bf40>),
 ('morphologizer',
  <spacy.pipeline.morphologizer.Morphologizer at 0x15117bb80>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1511a4ac0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1511753c0>),
 ('lemmatizer', <spacy.lang.es.lemmatizer.SpanishLemmatizer at 0x1511dd140>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1513440b0>)]

In [25]:
n = 0
print(f"The {n}th token in the document is: {doc[n]}")

The 0th token in the document is: El


In [27]:
doc_elements = get_doc_elements(doc)
doc_elements

| |text|lemma|pos|tag|shape|alpha|stop|
|:------|:------|:------|:------|:------|:------|:------|:------|
|0|El|el|DET|DET|Xx|True|True|
|1|lenguaje|lenguaje|NOUN|NOUN|xxxx|True|False|
|2|natural|natural|ADJ|ADJ|xxxx|True|False|
|3|es|ser|AUX|AUX|xx|True|True|
|4|fascinante|fascinante|ADJ|ADJ|xxxx|True|False|
|5|.|.|PUNCT|PUNCT|.|False|False|


In [34]:
# A short summary of this notebook, written in Spanish
doc = nlp(
    "En resumen, el notebook te enseña que la tokenización es el primer paso crucial "
    "para que una máquina procese texto, y que herramientas como Spacy no solo dividen "
    "el texto, sino que también lo analizan para entender su estructura y significado."
)

quote = doc[10:20]
quote

es el primer paso crucial para que una máquina procese