# Basic Preprocessing in NLP

## Case Folding, Stop Word Removal and Lemmatization

In [1]:
# checking spacy's version
!python -m spacy info

spaCy version    3.7.6                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.85+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.7.1)        



In [2]:
# importing the spacy library
import spacy

In [3]:
# loading the spaCy model (also known as a trained pipeline)
nlp = spacy.load('en_core_web_sm')

In [4]:
# sample sentence
corpus = "Energy cannot be created or destroyed, it can only be changed from one form to another. - Albert Einstein"

# processing the text (tokenizing) to return a Doc object
doc = nlp(corpus)

## Case Folding

Case folding converts all text to lower case. This normalizes the text by removing case variations (for example, "Energy" and "energy" become the same string), which helps with tasks such as text comparison and search.

In [5]:
# case folding uses the lower_ attribute
print([t.lower_ for t in doc])

['energy', 'can', 'not', 'be', 'created', 'or', 'destroyed', ',', 'it', 'can', 'only', 'be', 'changed', 'from', 'one', 'form', 'to', 'another', '.', '-', 'albert', 'einstein']
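To make the comparison use case concrete, here is a minimal sketch in plain Python (no spaCy required). The helper `contains_term` is hypothetical and not part of this notebook. It uses `str.casefold()`, which behaves like `lower()` for ASCII text but also normalizes a few non-ASCII cases such as the German "ß".

```python
def contains_term(text: str, term: str) -> bool:
    """Return True if `term` occurs in `text`, ignoring case.

    Hypothetical helper for illustration; it does plain substring
    matching, not token-level matching.
    """
    return term.casefold() in text.casefold()


quote = "Energy cannot be created or destroyed."

print(contains_term(quote, "ENERGY"))  # True
print(contains_term(quote, "matter"))  # False

# casefold() is more aggressive than lower() for some characters
print("Straße".casefold() == "STRASSE".casefold())  # True
print("Straße".lower() == "STRASSE".lower())        # False
```

For English-only text, `lower()` and `casefold()` usually give the same result, so the `lower_` attribute used above is a reasonable choice.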


## Stop Word Removal

The spaCy library ships with a default stop word list. You can view the list, add your own words to it, and remove words from it.

In [6]:
# viewing the default stop word list by spacy
print(nlp.Defaults.stop_words)
print(f"total = {len(nlp.Defaults.stop_words)}")

{'none', 'me', 'although', '‘s', 'we', 'ever', 'it', 'own', 'becoming', 'whereas', 'fifteen', 'somehow', 'wherever', 'under', 'front', 'many', 'or', 'unless', 'nowhere', 'to', 'nor', 'now', 'sixty', 'has', 'around', 'up', 're', 'ourselves', 'since', 'behind', 'whether', 'beyond', '‘re', '’s', 'everyone', 'between', 'by', 'fifty', 'they', 'themselves', 'already', 'thence', 'various', 'afterwards', 'mostly', 'formerly', 'so', 'which', 'how', 'doing', 'may', 'i', 'too', 'at', 'into', 'less', 'becomes', 'being', 'should', 'been', 'towards', 'really', 'several', 'where', 'as', 'another', 'made', 'down', 'latter', 'whom', '’m', 'except', 'most', 'n‘t', 'why', 'go', 'onto', 'part', 'name', 'be', 'move', 'regarding', 'these', 'nine', 'using', 'somewhere', 'became', 'else', 'forty', 'about', 'mine', 'those', 'perhaps', 'but', 'a', 'have', 'thru', 'them', 'there', 'within', 'herself', 'this', 'against', 'throughout', 'per', 'three', 'amongst', 'thereafter', 'toward', 'each', 'because', 'besides'

In [7]:
# adding my own stop word "hello"
nlp.Defaults.stop_words.add("hello")
print(nlp.Defaults.stop_words)
print(f"new total = {len(nlp.Defaults.stop_words)}")

{'none', 'me', 'although', '‘s', 'we', 'ever', 'it', 'own', 'becoming', 'whereas', 'fifteen', 'somehow', 'wherever', 'under', 'front', 'many', 'or', 'unless', 'nowhere', 'to', 'nor', 'now', 'sixty', 'has', 'around', 'up', 're', 'ourselves', 'since', 'behind', 'whether', 'beyond', '‘re', '’s', 'everyone', 'between', 'by', 'fifty', 'they', 'themselves', 'already', 'thence', 'various', 'afterwards', 'mostly', 'formerly', 'so', 'which', 'how', 'doing', 'may', 'i', 'too', 'at', 'into', 'less', 'becomes', 'being', 'should', 'been', 'towards', 'really', 'several', 'where', 'as', 'another', 'made', 'down', 'latter', 'whom', '’m', 'except', 'most', 'n‘t', 'why', 'go', 'onto', 'part', 'name', 'be', 'move', 'regarding', 'these', 'nine', 'using', 'somewhere', 'became', 'else', 'forty', 'about', 'mine', 'those', 'perhaps', 'but', 'a', 'have', 'thru', 'them', 'there', 'within', 'herself', 'this', 'against', 'throughout', 'per', 'three', 'amongst', 'thereafter', 'toward', 'each', 'because', 'besides'

In [8]:
# returning everything back to normal by removing my stop word
nlp.Defaults.stop_words.remove("hello")
print(nlp.Defaults.stop_words)
print(f"total = {len(nlp.Defaults.stop_words)}")

{'none', 'me', 'although', '‘s', 'we', 'ever', 'it', 'own', 'becoming', 'whereas', 'fifteen', 'somehow', 'wherever', 'under', 'front', 'many', 'or', 'unless', 'nowhere', 'to', 'nor', 'now', 'sixty', 'has', 'around', 'up', 're', 'ourselves', 'since', 'behind', 'whether', 'beyond', '‘re', '’s', 'everyone', 'between', 'by', 'fifty', 'they', 'themselves', 'already', 'thence', 'various', 'afterwards', 'mostly', 'formerly', 'so', 'which', 'how', 'doing', 'may', 'i', 'too', 'at', 'into', 'less', 'becomes', 'being', 'should', 'been', 'towards', 'really', 'several', 'where', 'as', 'another', 'made', 'down', 'latter', 'whom', '’m', 'except', 'most', 'n‘t', 'why', 'go', 'onto', 'part', 'name', 'be', 'move', 'regarding', 'these', 'nine', 'using', 'somewhere', 'became', 'else', 'forty', 'about', 'mine', 'those', 'perhaps', 'but', 'a', 'have', 'thru', 'them', 'there', 'within', 'herself', 'this', 'against', 'throughout', 'per', 'three', 'amongst', 'thereafter', 'toward', 'each', 'because', 'besides'

In [9]:
# keeping only the tokens that are not stop words
print([t for t in doc if not t.is_stop])

[Energy, created, destroyed, ,, changed, form, ., -, Albert, Einstein]
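The result above still contains punctuation tokens (`,`, `.`, `-`), because `is_stop` only checks membership in the stop word list. A minimal sketch that also drops punctuation with the `is_punct` attribute could look like the following. It assumes a blank English pipeline created with `spacy.blank("en")` instead of `en_core_web_sm`, since both attributes are lexical and need no trained components. The names `nlp_blank` and `content_tokens` are hypothetical.

```python
import spacy

# A blank pipeline contains only the tokenizer, which is enough here
# because is_stop and is_punct are lexical attributes.
nlp_blank = spacy.blank("en")


def content_tokens(text):
    """Return the token texts that are neither stop words nor punctuation."""
    return [t.text for t in nlp_blank(text) if not t.is_stop and not t.is_punct]


sentence = (
    "Energy cannot be created or destroyed, "
    "it can only be changed from one form to another."
)
print(content_tokens(sentence))
```

For the sample sentence this should leave only content words such as "Energy", "created" and "destroyed". Whether punctuation should be dropped depends on the downstream task, so treat the extra filter as optional.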


## Lemmatization

Lemmatization reduces each word to its base (dictionary) form, called the lemma. For example, "created" becomes "create" and "destroyed" becomes "destroy".

In [10]:
# printing each token next to its lemma in two aligned columns
for t in doc:
    # lemmatization uses the lemma_ attribute
    print(f"{t.text:<15}{t.lemma_:<15}")

Energy         energy         
can            can            
not            not            
be             be             
created        create         
or             or             
destroyed      destroy        
,              ,              
it             it             
can            can            
only           only           
be             be             
changed        change         
from           from           
one            one            
form           form           
to             to             
another        another        
.              .              
-              -              
Albert         Albert         
Einstein       Einstein       
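
To illustrate the idea independently of spaCy, the sketch below implements a toy lookup lemmatizer in plain Python. This is not how spaCy's lemmatizer works internally. The table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical and cover only the inflected forms seen in the sample sentence.

```python
# Toy lookup table: inflected form -> lemma (pairs taken from the output above).
LEMMA_TABLE = {
    "created": "create",
    "destroyed": "destroy",
    "changed": "change",
}


def toy_lemmatize(token: str) -> str:
    """Return the lemma for a token, or the token unchanged if it is unknown."""
    return LEMMA_TABLE.get(token.lower(), token)


for word in ["Energy", "created", "destroyed", "changed", "form"]:
    print(f"{word:<15}{toy_lemmatize(word):<15}")
```

Unlike spaCy, this sketch returns unknown tokens as they are (so "Energy" is not lowercased), and it cannot handle words outside the table. A real lemmatizer relies on much larger lookup tables or rules, often combined with part-of-speech information.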
