# Lemmatization in NLP using WordNetLemmatizer

Lemmatization is the process of converting a word to its base form (its lemma). It differs from stemming in that lemmatization takes context into account, such as the word's part of speech, and converts the word to its base form accordingly, whereas stemming reduces the word using simple rules.
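
To make the contrast concrete, here is a toy sketch in plain Python. Neither function is NLTK's implementation: the suffix list `SUFFIXES` and the mini-dictionary `LEMMAS` are hypothetical and exist only to show rule-based stripping next to a lookup that depends on the part of speech.

```python
# Toy illustration only: neither function is NLTK's implementation.

SUFFIXES = ("ing", "es", "ed", "s")  # hypothetical rule list, checked in order


def toy_stem(word):
    """Rule-based: strip the first matching suffix, with no vocabulary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMAS = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
}


def toy_lemmatize(word, pos):
    """Dictionary-based: look the word up together with its part of speech."""
    return LEMMAS.get((word, pos), word)


examples = [("studies", "v"), ("better", "a"), ("meeting", "n"), ("meeting", "v")]
for word, pos in examples:
    print(word, pos, "stem:", toy_stem(word), "lemma:", toy_lemmatize(word, pos))
```

The stemmer always cuts "meeting" down to "meet" and turns "studies" into the non-word "studi", while the lookup returns a real word and keeps "meeting" intact when it is tagged as a noun.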

References:
- https://aparnamishra144.medium.com/lemmatization-in-nlp-using-wordnetlemmatizer-420a444a50d
- GitHub Copilot

In [8]:
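import nltk
# WordNetLemmatizer performs the lemmatization
from nltk.stem import WordNetLemmatizer
# stopwords provides lists of very common words (of, we, the, them, ...) to filter out
from nltk.corpus import stopwords
# one-time setup if the data is missing: nltk.download('punkt'), nltk.download('wordnet'),
# nltk.download('stopwords') (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')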
para = """Yoga develops inner awareness. \
It focuses your attention on your body's abilities at the present moment. \
It helps develop breath and strength of mind and body. \
It's not about physical appearance. \
Yoga studios typically don't have mirrors. \
This is so people can focus their awareness inward rather than how a pose — or the people around them — looks.
"""



In [9]:
para.split(". ")

['Yoga develops inner awareness',
 "It focuses your attention on your body's abilities at the present moment",
 'It helps develop breath and strength of mind and body',
 "It's not about physical appearance",
 "Yoga studios typically don't have mirrors",
 'This is so people can focus their awareness inward rather than how a pose — or the people around them — looks.\n']

In [3]:
# tokenize the paragraph into sentences
# TODO: a regex tool that checks that every period is followed by a space.
sentences = nltk.sent_tokenize(para)
for sentence in sentences:
    print(sentence)

print("\nlength of sentences after tokenization:",len(sentences))

Yoga develops inner awareness.
It focuses your attention on your body's abilities at the present moment.
It helps develop breath and strength of mind and body.
It's not about physical appearance.
Yoga studios typically don't have mirrors.
This is so people can focus their awareness inward rather than how a pose — or the people around them — looks.
length of sentences after tokenization: 6
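
The TODO in the cell above asks for a regex check on the spacing after periods. A minimal sketch of one could look like this, using only the standard `re` module. The heuristic is an assumption rather than anything NLTK provides: a period sitting directly between a lowercase and an uppercase letter is treated as a sentence boundary with a missing space. The names `MISSING_SPACE`, `find_missing_spaces` and `fix_missing_spaces` are hypothetical.

```python
import re

# Heuristic (an assumption, not part of NLTK): a period directly between a
# lowercase letter and an uppercase letter is treated as a sentence boundary
# that is missing its space. Decimals such as "3.14" and initialisms such as
# "U.S." do not match this pattern and are left alone.
MISSING_SPACE = re.compile(r"(?<=[a-z])\.(?=[A-Z])")


def find_missing_spaces(text):
    """Return the character offsets of periods flagged by the heuristic."""
    return [match.start() for match in MISSING_SPACE.finditer(text)]


def fix_missing_spaces(text):
    """Insert a space after each period flagged by the heuristic."""
    return MISSING_SPACE.sub(". ", text)


sample = "Yoga develops inner awareness.It costs 3.14 dollars."
print(find_missing_spaces(sample))
print(fix_missing_spaces(sample))
```

Running `fix_missing_spaces` on the paragraph before `nltk.sent_tokenize` is one possible place to use it. The heuristic will miss boundaries where the next sentence starts with a digit or a quotation mark.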


In [10]:
# WordNetLemmatizer is the NLTK class that performs lemmatization
lemmatizer = WordNetLemmatizer()
# build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
# tokenize each sentence, drop stop words, lemmatize the rest, and rejoin
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)
# print(words)
for sentence in sentences:
    print(sentence)

Yoga develops inner awareness .
It focus attention body 's ability present moment .
It help develop breath strength mind body .
It 's physical appearance .
Yoga studio typically n't mirror .
This people focus awareness inward rather pose — people around — look .
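
In the output above, `It` and `This` survive the filter because the membership test is case-sensitive and NLTK's English stop-word list is lowercase. Punctuation tokens such as `.` are also kept. Below is a minimal sketch of a stricter filter. To keep it runnable without downloading NLTK data, `STOP_WORDS` is a small hypothetical stand-in for `set(stopwords.words('english'))`, and `filter_tokens` is an illustrative helper name.

```python
import string

# Stand-in for set(stopwords.words('english')): a hypothetical subset so the
# sketch runs without downloading NLTK data.
STOP_WORDS = {
    "it", "this", "is", "so", "can", "their", "than", "how", "a", "or",
    "the", "them", "your", "on", "at", "of", "and", "not", "about",
    "have", "do",
}


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively and drop pure-punctuation tokens."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, whatever its capitalization
        if all(ch in string.punctuation for ch in token):
            continue  # token made only of ASCII punctuation
        kept.append(token)
    return kept


tokens = ["It", "helps", "develop", "breath", "and", "strength",
          "of", "mind", "and", "body", "."]
print(filter_tokens(tokens))
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic dashes would need to be handled separately.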


In [11]:
for word in words:
    print(word)

This
people
focus
awareness
inward
rather
pose
—
people
around
—
look
.


**Conclusion:**

- nltk.sent_tokenize(para): splits a string into a list of sentences. It can go wrong when the text itself has errors, for example a period that is not followed by a space before the next sentence begins.
- nltk.word_tokenize(sentences[i]): splits a sentence into a list of tokens (words and punctuation marks).
- WordNetLemmatizer(): converts each word to its base form. Called without a part-of-speech argument, lemmatize() treats every word as a noun, which is why "develops" is unchanged in the output above.
- (to investigate) nltk.pos_tag(words): tags each word with its part of speech (noun, verb, etc.).
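
As a starting point for the `nltk.pos_tag` item, the tags it returns follow the Penn Treebank scheme, while `lemmatize(word, pos=...)` expects WordNet's single-letter codes (`'n'`, `'v'`, `'a'`, `'r'`). A small mapping helper bridges the two. The function name `penn_to_wordnet` is hypothetical, and the tagged pairs below are hand-written for illustration, not actual tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the WordNet POS letter used by lemmatize()."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also lemmatize()'s default


# Hand-written (word, tag) pairs for illustration; not real pos_tag output.
tagged = [("Yoga", "NNP"), ("develops", "VBZ"), ("inner", "JJ"), ("awareness", "NN")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With the NLTK data installed, the idea would be to call `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair from `nltk.pos_tag(words)`. Verbs such as "develops" should then be reduced to "develop" instead of being left as they are under the noun default.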

