In [21]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

words = ["eating","eats","eat","ate","adjustable","rafting","ability","meeting"]

for word in words:
    print(word,"|",stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item, the lemma. It is similar to stemming, but instead of simply chopping off suffixes it uses a vocabulary and the word's part of speech, so related forms are mapped to a valid dictionary word.
Text preprocessing often includes stemming, lemmatization, or both, and the two terms are frequently confused or treated as the same thing. Lemmatization is usually preferred over stemming when accuracy matters because it performs morphological analysis of the words.
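To make the difference concrete, here is a toy sketch in plain Python. It is not how NLTK implements either technique: the suffix list, the lemma table, and the function names toy_stem and toy_lemmatize are all hypothetical and deliberately tiny. It only illustrates that a stemmer strips suffixes by rule, while a lemmatizer looks up a word together with its part of speech.

```python
# Toy illustration only: real stemmers and lemmatizers use far richer rules and data.

SUFFIXES = ("ing", "s")  # hypothetical, deliberately tiny suffix list

# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("eating", "v"): "eat",
    ("eats", "v"): "eat",
    ("ate", "v"): "eat",
    ("better", "a"): "good",
    ("corpora", "n"): "corpus",
}


def toy_stem(word):
    """Strip the first matching suffix without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


for word, pos in [("eating", "v"), ("ate", "v"), ("better", "a"), ("corpora", "n")]:
    print(word, "|", toy_stem(word), "|", toy_lemmatize(word, pos))
```

The rule-based function cannot relate "ate" to "eat" or "better" to "good" because no suffix rule connects them, whereas the lookup can, provided the entry exists and the right part of speech is supplied.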
Applications of lemmatization include:

1. Comprehensive retrieval systems such as search engines.
2. Compact indexing.

Examples of lemmatization:

-> rocks : rock
-> corpora : corpus
-> better : good

One major difference from stemming is that the lemmatize method takes a part-of-speech parameter, "pos". If it is not supplied, the default is "n" (noun).
Below is an implementation of lemmatization using NLTK:

In [22]:
import nltk
# nltk.download("wordnet")  # run once to fetch the WordNet data the lemmatizer needs

In [23]:
# import these modules 

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :",lemmatizer.lemmatize("rocks"))
print("eating :",lemmatizer.lemmatize("eating"))

# a denotes adjective in "pos"

print("better :", lemmatizer.lemmatize("better"))
print("better :", lemmatizer.lemmatize("better",pos="a"))



rocks : rock
eating : eating
better : better
better : good
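In the output above, "eating" comes back unchanged because it is looked up as a noun by default. When the words are tagged first, the tag can be translated into the "pos" letter that lemmatize expects. The helper below is a minimal sketch of that translation; the name penn_to_wordnet_pos is hypothetical, and it assumes the tags are Penn Treebank tags such as those produced by a tagger like nltk.pos_tag, which the original cells do not use.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. "VBG") to a WordNet pos letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


# Hand-written (word, tag) pairs for illustration; a tagger would normally supply them.
for word, tag in [("eating", "VBG"), ("better", "JJR"), ("rocks", "NNS")]:
    print(word, tag, "->", penn_to_wordnet_pos(tag))
```

The returned letter can then be passed along, for example lemmatizer.lemmatize("eating", pos=penn_to_wordnet_pos("VBG")), which should give "eat" instead of "eating".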



# Python | PoS Tagging and Lemmatization using spaCy



spaCy is one of the most popular text analysis libraries. It excels at large-scale information extraction tasks, is designed for speed, and is a common choice for preparing text for deep learning. For POS tagging it is generally faster, and often more accurate, than NLTKTagger and TextBlob.



Top Features of spaCy:
1. Non-destructive tokenization
2. Named entity recognition
3. Support for 49+ languages
4. 16 statistical models for 9 languages
5. Pre-trained word vectors
6. Part-of-speech tagging
7. Labeled dependency parsing
8. Syntax-driven sentence segmentation

In [24]:
!pip install spacy
!python -m spacy download en_core_web_sm

Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [26]:
import spacy

# Load the small English pipeline: tokenizer, tagger,
# parser, lemmatizer and NER
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("""My name is Shaurya Uppal. 
I enjoy writing articles on GeeksforGeeks checkout
my other article by going to my profile section.""")


doc = nlp(text)

# Token and tag
for token in doc:
    print(token,token.pos_)
# Collect the verb tokens into a list
print("Verbs :",[token.text for token in doc if token.pos_ == "VERB" ])





My PRON
name NOUN
is AUX
Shaurya PROPN
Uppal PROPN
. PUNCT

 SPACE
I PRON
enjoy VERB
writing VERB
articles NOUN
on ADP
GeeksforGeeks PROPN
checkout VERB

 SPACE
my PRON
other ADJ
article NOUN
by ADP
going VERB
to ADP
my PRON
profile NOUN
section NOUN
. PUNCT
Verbs : ['enjoy', 'writing', 'checkout', 'going']


In [28]:
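# Customizing the lemmatizer through the attribute ruler
nlp.pipe_names  # lists the pipeline components; 'attribute_ruler' should be among them

ar = nlp.get_pipe('attribute_ruler')

# Assign the lemma "Brother" to the tokens "Bro" and "Brah"
ar.add([[{"TEXT":"Bro"}],[{"TEXT":"Brah"}]],{"LEMMA":"Brother"})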

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust
