### Performing NLP tasks with spaCy

In [18]:
#!pip install -U pip setuptools wheel
#!pip install -U 'spacy'
#!python -m spacy download en_core_web_sm


First, we import spaCy and load the core English model:

In [19]:
import warnings
warnings.filterwarnings('ignore')

import spacy
nlp = spacy.load("en_core_web_sm")

**Tokenization**

In [20]:
# Tokenization
oracion = nlp.tokenizer("We live in Paris.")

# Length of the sentence in tokens
print("Number of tokens: ", len(oracion))

# Print the individual words (tokens)
print("The tokens: ")
for palabras in oracion:
    print(palabras)

Number of tokens:  5
The tokens: 
We
live
in
Paris
.
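
Besides its text, each token carries attributes such as its character offset in the original string. The following is a minimal sketch; it assumes a blank English pipeline (`spacy.blank("en")`, named `nlp_blank` here for illustration), which is enough because only the rule-based tokenizer is involved and no trained model is needed.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm,
# since only the rule-based tokenizer is used here.
nlp_blank = spacy.blank("en")
doc = nlp_blank("We live in Paris.")

for token in doc:
    # token.idx is the character offset of the token in the original text;
    # token.is_punct flags punctuation tokens
    print(token.text, token.idx, token.is_punct)
```

The offsets make it possible to map every token back to its position in the source string.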


Another example

In [21]:
import pandas as pd

# Load the Jeopardy dataset
data = pd.read_csv('jeopardy.csv')

# Normalize column names: lowercase, no surrounding whitespace
data.columns = data.columns.str.lower().str.strip()

# Keep the first 1000 rows; copy so a new column can be added safely
data = data[0:1000].copy()

# Run the spaCy pipeline on every Jeopardy question
data["question_tokens"] = data["question"].apply(nlp)
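
When many texts have to be processed, spaCy also offers `nlp.pipe`, which streams texts through the pipeline in batches instead of calling `nlp` once per text. The sketch below is only illustrative: it assumes a blank English pipeline (`nlp_blank`) so that nothing has to be downloaded, and the two questions are made-up placeholders rather than rows from the Jeopardy file.

```python
import spacy

# Assumptions: a blank pipeline stands in for en_core_web_sm, and the
# questions below are hypothetical examples, not taken from jeopardy.csv.
nlp_blank = spacy.blank("en")
questions = [
    "Who wrote Hamlet?",
    "What is the capital of France?",
]

# nlp.pipe yields one Doc per input text, processing the texts in batches
docs = list(nlp_blank.pipe(questions, batch_size=50))

token_counts = [len(doc) for doc in docs]
print(token_counts)
```

With a trained pipeline and thousands of rows, batching like this is usually faster than a row-by-row `apply`.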

In [22]:
pregunta_ejemplo = data.question[0]
tokens_pregunta_ejemplo = data.question_tokens[0]
print("The first question is:")
print(pregunta_ejemplo)

The first question is:
For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory


We print the individual tokens of the first question:

In [23]:
print("The tokens of the first question are:")
for tokens in tokens_pregunta_ejemplo:
    print(tokens)

The tokens of the first question are:
For
the
last
8
years
of
his
life
,
Galileo
was
under
house
arrest
for
espousing
this
man
's
theory


**Part-of-speech tagging**

In [24]:
print("Part-of-speech tags for each token in the first question:")
for token in tokens_pregunta_ejemplo:
    print(token.text, token.pos_, spacy.explain(token.pos_))

Part-of-speech tags for each token in the first question:
For ADP adposition
the DET determiner
last ADJ adjective
8 NUM numeral
years NOUN noun
of ADP adposition
his PRON pronoun
life NOUN noun
, PUNCT punctuation
Galileo PROPN proper noun
was AUX auxiliary
under ADP adposition
house NOUN noun
arrest NOUN noun
for ADP adposition
espousing VERB verb
this DET determiner
man NOUN noun
's PART particle
theory NOUN noun
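
The descriptions in the third column come from `spacy.explain`, which looks a label up in spaCy's built-in glossary. As a small sketch, the function can be called directly on a few of the tags from the output above; it needs no trained pipeline.

```python
import spacy

# spacy.explain maps a tag or label string to a short description
labels = ["ADP", "PROPN", "AUX"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "->", description)
```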


**Dependency parsing**

Let's look at the dependency labels for each token in the first question:

In [25]:
for token in tokens_pregunta_ejemplo:
    print(token.text, token.dep_, spacy.explain(token.dep_))

For prep prepositional modifier
the det determiner
last amod adjectival modifier
8 nummod numeric modifier
years pobj object of preposition
of prep prepositional modifier
his poss possession modifier
life pobj object of preposition
, punct punctuation
Galileo nsubj nominal subject
was ROOT root
under prep prepositional modifier
house compound compound
arrest pobj object of preposition
for prep prepositional modifier
espousing pcomp complement of preposition
this det determiner
man poss possession modifier
's case case marking
theory dobj direct object


The dependency parse is hard to read as plain text, so let's use spaCy's built-in visualizer, `displacy`, to get a better picture of the dependencies between the tokens:


In [26]:
from spacy import displacy

displacy.render(tokens_pregunta_ejemplo, style='dep',
                jupyter=True, options={'distance': 120})
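
The arcs drawn by the visualizer can also be followed in code through `token.head`. The sketch below assumes spaCy v3, where `Doc` accepts `heads` as absolute token indices, and it builds a tiny parse by hand for illustration; the heads and labels are written manually and are not model output.

```python
import spacy
from spacy.tokens import Doc

# Assumption: spaCy v3. The parse below is hand-written for illustration.
vocab = spacy.blank("en").vocab
words = ["We", "live", "in", "Paris", "."]
heads = [1, 1, 1, 2, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "prep", "pobj", "punct"]
doc_manual = Doc(vocab, words=words, heads=heads, deps=deps)

for token in doc_manual:
    # token.head is the token that this token attaches to in the tree
    print(token.text, token.dep_, token.head.text)
```

With a Doc produced by a trained pipeline, the same attributes are filled in by the parser.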

**Chunking**

We print the tokens of the sentence `My parents live in New York City` without chunking.

In [27]:
for token in nlp("My parents live in New York City."):
    print(token.text)

My
parents
live
in
New
York
City
.


We print the tokens of the sentence `My parents live in New York City` with chunking: iterating over `noun_chunks` yields noun phrases instead of individual tokens.

In [28]:
for chunk in nlp("My parents live in New York City.").noun_chunks:
    print(chunk.text)

My parents
New York City


**Lemmatization**

In [29]:
# Collect each token's surface form and its lemma in a DataFrame
lematizacion = pd.DataFrame(
    [(token.text, token.lemma_) for token in tokens_pregunta_ejemplo],
    columns=["original", "lemmatized"],
)

lematizacion

| | original | lemmatized |
|---|---|---|
| 0 | For | for |
| 1 | the | the |
| 2 | last | last |
| 3 | 8 | 8 |
| 4 | years | year |
| 5 | of | of |
| 6 | his | his |
| 7 | life | life |
| 8 | , | , |
| 9 | Galileo | Galileo |


**Named entity recognition (NER)**

In [30]:
# NER
oracion_ejemplo = "George Washington was an American political leader, \
military general, statesman, and Founding Father who served as the \
first president of the United States from 1789 to 1797.\n"

print(oracion_ejemplo)

print("Text Start End Label")
doc = nlp(oracion_ejemplo)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

George Washington was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States from 1789 to 1797.

Text Start End Label
George Washington 0 17 PERSON
American 25 33 NORP
first 119 124 ORDINAL
the United States 138 155 GPE
1789 to 1797 161 173 DATE
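
The start and end values are character offsets into the original string, with the end offset exclusive. As a quick sanity check in plain Python, the offsets printed above can be used to slice the sentence; the variable names `text` and `entities` are chosen here for illustration only.

```python
# The sentence and the (start, end, label) triples are copied from the output above
text = ("George Washington was an American political leader, "
        "military general, statesman, and Founding Father who served as the "
        "first president of the United States from 1789 to 1797.\n")

entities = [(0, 17, "PERSON"), (25, 33, "NORP"), (119, 124, "ORDINAL"),
            (138, 155, "GPE"), (161, 173, "DATE")]

for start, end, label in entities:
    # start is inclusive and end is exclusive, as in ordinary Python slicing
    print(label, repr(text[start:end]))
```

Each slice reproduces the entity text from the first column of the output.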


Let's use spaCy's built-in visualizer to display this sentence with its entity labels.

In [31]:
# The 'distance' option only affects the dependency visualizer, so it is omitted here
displacy.render(doc, style='ent', jupyter=True)

### Basic operations with spaCy
Let's start by running a chain of basic NLP operations, which we call a processing pipeline. spaCy performs all of these operations behind the scenes, letting you concentrate on the logic specific to your application. The pipeline looks like this:

```
Text input -> tokenization -> lemmatization -> tagging -> parsing -> named entity recognition -> output
```
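
The components that a given `nlp` object actually runs, and their order, can be inspected through `nlp.pipe_names`. The sketch below assumes spaCy v3 (where `add_pipe` takes a component name) and uses a blank English pipeline, `nlp_blank`, so that nothing has to be downloaded; it adds only the rule-based `sentencizer` component as an illustration.

```python
import spacy

# Assumption: spaCy v3 and a blank English pipeline (tokenizer only)
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)  # empty: no components have been added yet

# Add spaCy's rule-based sentence splitter as the only pipeline component
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

doc = nlp_blank("I have flown to LA. Now I am flying to Frisco.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Calling `pipe_names` on a loaded trained pipeline lists its components in the order in which they are applied.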

### Readings:

- https://spacy.io/usage/linguistic-features
- https://spacy.io/usage/spacy-101
- https://spacy.io/api/data-formats#pos-tagging 

### Exercises

For each of the following code snippets, identify the NLP tasks being performed and then indicate the corresponding spaCy pipeline.

In [34]:
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
for token in doc:
  print(token.text, token.pos_, token.dep_)

I PRON nsubj
have AUX aux
flown VERB ROOT
to ADP prep
LA PROPN pobj
. PUNCT punct
Now ADV advmod
I PRON nsubj
am AUX aux
flying VERB ROOT
to ADP prep
Frisco PROPN pobj
. PUNCT punct


In [None]:
## Your answer

In [35]:
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
for sent in doc.sents:
  print([w.text for w in sent if w.dep_ == 'ROOT' or w.dep_ == 'pobj'])

['flown', 'LA']
['flying', 'Frisco']


In [40]:
## Your answer

In [36]:
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
for token in doc:
  if token.ent_type != 0:
    print(token.text, token.ent_type_)

LA GPE
Frisco ORG


In [41]:
## Your answer

In [37]:
doc = nlp(u'this product integrates both libraries for downloading and applying patches')
for token in doc:
  print(token.text, token.lemma_)

this this
product product
integrates integrate
both both
libraries library
for for
downloading download
and and
applying apply
patches patch


In [None]:
## Your answer

In [38]:
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
print([w.text for w in doc if w.tag_== 'VBG' or w.tag_== 'VB'])

['flying']


In [None]:
# Your answer

In [39]:
doc = nlp(u'I am flying to Frisco')
print([w.text for w in doc])

['I', 'am', 'flying', 'to', 'Frisco']


In [None]:
# Your answer