# Stemming in NLP

Stemming is the process of removing affixes from a word so that only its stem remains. For example, the words ‘run’, ‘running’, and ‘runs’ all reduce to the stem ‘run’. One crucial point is that a stem need not be a meaningful word: the Porter stemmer, for instance, reduces ‘happily’ to ‘happili’, which is not an English word.



## Why use Stemming in NLP?
The benefits of using a stemming algorithm in an NLP project can be summarised as follows:
1. It reduces the number of distinct words that serve as input to the Machine Learning/Deep Learning model.
2. It maps inflected variants of the same word to a single token, so the model does not treat them as unrelated words.
3. It lowers the complexity of the input space.
4. In applications that search for specific text in documents, indexing by stem helps retrieve relevant documents even when the query uses a different word form.
5. It helps mitigate the out-of-vocabulary (OOV) problem. For example, if the vocabulary contains ‘orange’ but not ‘oranges’, both reduce to the same stem, so the unseen plural still maps onto a known entry.
6. It can improve the accuracy of the ML/DL model, since the model does not have to deal with inflected word forms separately.
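
The indexing and vocabulary-reduction points can be illustrated with a small sketch. It assumes NLTK is installed; the three-document corpus and the `search` helper are invented for this example and are not part of the original notebook.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Toy corpus (hypothetical, for illustration only)
documents = {
    0: "he runs every morning",
    1: "running is good exercise",
    2: "she reads books",
}

# Build an inverted index keyed by stem instead of by surface form
index = defaultdict(set)
raw_vocab = set()
for doc_id, text in documents.items():
    for token in text.lower().split():
        raw_vocab.add(token)
        index[stemmer.stem(token)].add(doc_id)


def search(query):
    """Return the ids of documents that contain any form of the query word."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


print(search("run"))                 # matches both 'runs' and 'running'
print(len(raw_vocab), len(index))    # raw vocabulary size vs. stemmed vocabulary size
```

Because ‘runs’ and ‘running’ share the stem ‘run’, a query for any of the three forms retrieves the same documents, and the stemmed vocabulary is smaller than the raw one.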

## Types of Stemming in NLP
Let us discuss three popular stemmers:
1. Porter
2. Snowball
3. Lancaster

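### 1. Porter Stemming in NLP
The Porter stemmer strips suffixes in a sequence of rule-based steps. Its first step (step 1a) handles plural endings with the following rules, where the longest matching suffix wins:
1. SSES to SS (e.g. ‘caresses’ to ‘caress’)
2. IES to I (e.g. ‘ponies’ to ‘poni’)
3. SS to SS (e.g. ‘caress’ stays ‘caress’)
4. S to _ (e.g. ‘cats’ to ‘cat’)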
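
The four rules above can be sketched in plain Python. This is a minimal illustration of step 1a only, not the full Porter algorithm, and NLTK's implementation may differ on some short words; the function name `porter_step_1a` is made up for this example.

```python
def porter_step_1a(word: str) -> str:
    """Apply the four plural-suffix rules, trying the longest suffix first."""
    word = word.lower()
    # (suffix, replacement) pairs, ordered from longest to shortest suffix
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            # Only the first matching rule is applied
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The SS to SS rule looks redundant, but it stops words such as ‘caress’ from falling through to the S rule and losing their final letter.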

In [1]:
from nltk.stem import PorterStemmer

In [2]:
# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()

In [3]:
# Example words for stemming
words = ["running", "jumps", "happily", "running", "happily"]

In [4]:
# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]

In [6]:
# Print the results
print("Original words:", words)
print("Stemmed words:", stemmed_words)

Original words: ['running', 'jumps', 'happily', 'running', 'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']


#### Applying stemming to a Word document (.docx) and saving it as a text file

In [7]:
import docx

In [8]:
doc = docx.Document(r"C:\Users\bhaka\Documents\NLP AS-2.docx")

In [26]:
# creating an empty list
fullText=[]

In [27]:
for para in doc.paragraphs:
    fullText.append(para.text)
print('\n'.join(fullText)) 




Que. Word net- Word Sense Disambiguation- Novel Word Sense detection.
Word net:
WordNet is a lexical database of the English language that organizes words into groups called synsets, which stand for "synonym sets." These synsets contain words that are considered to be cognitive synonyms, meaning they express similar or related concepts. WordNet covers different parts of speech including nouns, verbs, adjectives, and adverbs.
Here's a breakdown of the key components and features of WordNet:
Synsets: These are sets of words that are considered synonymous or closely related in meaning. Each synset represents a distinct concept. For example, the synset for "car" might include words like "automobile," "vehicle," and "motorcar," as they all refer to the same concept.
Conceptual-Semantic Relations: Synsets in WordNet are interconnected through conceptual and semantic relationships. These relationships help to organize and link related concepts together. Some common types of conceptual-seman

In [29]:
from nltk.tokenize import word_tokenize
# Tokenize each paragraph into words
tokenized_words = []
for para in fullText:
    tokenized_words.extend(word_tokenize(para))


In [14]:
# Create a Porter Stemmer instance
ps = PorterStemmer()
 

In [30]:
# Apply stemming to each word
stemmed_words = [ps.stem(word) for word in tokenized_words]
print(stemmed_words)

['que', '.', 'word', 'net-', 'word', 'sens', 'disambiguation-', 'novel', 'word', 'sens', 'detect', '.', 'word', 'net', ':', 'wordnet', 'is', 'a', 'lexic', 'databas', 'of', 'the', 'english', 'languag', 'that', 'organ', 'word', 'into', 'group', 'call', 'synset', ',', 'which', 'stand', 'for', '``', 'synonym', 'set', '.', "''", 'these', 'synset', 'contain', 'word', 'that', 'are', 'consid', 'to', 'be', 'cognit', 'synonym', ',', 'mean', 'they', 'express', 'similar', 'or', 'relat', 'concept', '.', 'wordnet', 'cover', 'differ', 'part', 'of', 'speech', 'includ', 'noun', ',', 'verb', ',', 'adject', ',', 'and', 'adverb', '.', 'here', "'s", 'a', 'breakdown', 'of', 'the', 'key', 'compon', 'and', 'featur', 'of', 'wordnet', ':', 'synset', ':', 'these', 'are', 'set', 'of', 'word', 'that', 'are', 'consid', 'synonym', 'or', 'close', 'relat', 'in', 'mean', '.', 'each', 'synset', 'repres', 'a', 'distinct', 'concept', '.', 'for', 'exampl', ',', 'the', 'synset', 'for', '``', 'car', "''", 'might', 'includ', 
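
The cells above stop at printing the stems. A minimal sketch of the remaining step, writing them to a text file, might look like the following. The short `stemmed_words` list here is a stand-in copied from the output above so that the block runs on its own, and the file name `stemmed_output.txt` is an assumption.

```python
import tempfile
from pathlib import Path

# Stand-in for the full list produced in the notebook (a few stems from the output above)
stemmed_words = ["wordnet", "is", "a", "lexic", "databas", "of", "the", "english", "languag"]

# Hypothetical output location; any writable path would do
output_path = Path(tempfile.gettempdir()) / "stemmed_output.txt"

# Join the stems with spaces and write them out as UTF-8 text
output_path.write_text(" ".join(stemmed_words), encoding="utf-8")

# Read the file back to confirm what was written
print(output_path.read_text(encoding="utf-8"))
```

In the notebook itself, the stand-in list would simply be replaced by the `stemmed_words` variable computed in the previous cell.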

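### 2. Snowball Stemming in NLP
The English Snowball stemmer is also known as the Porter2 algorithm: it is a revised version of the Porter algorithm, written in the Snowball framework, which also provides stemmers for many languages other than English. It is generally regarded as an improvement over the original Porter stemmer. In NLTK the language is chosen when the stemmer is created, and the stemmer accepts ordinary (Unicode) strings as input.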
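
As a rough illustration of the multi-language support and of how the two stemmers can be compared, here is a short sketch assuming NLTK is installed. The sample words are chosen for this example only; run it to see where the two columns differ.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Languages supported by NLTK's Snowball implementation
print(SnowballStemmer.languages)

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Illustrative words (not from the original notebook)
for word in ["running", "fairly", "generously"]:
    print(f"{word:12} porter={porter.stem(word):12} snowball={snowball.stem(word)}")
```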

In [40]:
from nltk.stem import SnowballStemmer

In [42]:
s=SnowballStemmer('english')

In [43]:
tokenized_word=[]
for para in fullText:
    tokenized_word.extend(word_tokenize(para))
print(tokenized_word)

['Que', '.', 'Word', 'net-', 'Word', 'Sense', 'Disambiguation-', 'Novel', 'Word', 'Sense', 'detection', '.', 'Word', 'net', ':', 'WordNet', 'is', 'a', 'lexical', 'database', 'of', 'the', 'English', 'language', 'that', 'organizes', 'words', 'into', 'groups', 'called', 'synsets', ',', 'which', 'stand', 'for', '``', 'synonym', 'sets', '.', "''", 'These', 'synsets', 'contain', 'words', 'that', 'are', 'considered', 'to', 'be', 'cognitive', 'synonyms', ',', 'meaning', 'they', 'express', 'similar', 'or', 'related', 'concepts', '.', 'WordNet', 'covers', 'different', 'parts', 'of', 'speech', 'including', 'nouns', ',', 'verbs', ',', 'adjectives', ',', 'and', 'adverbs', '.', 'Here', "'s", 'a', 'breakdown', 'of', 'the', 'key', 'components', 'and', 'features', 'of', 'WordNet', ':', 'Synsets', ':', 'These', 'are', 'sets', 'of', 'words', 'that', 'are', 'considered', 'synonymous', 'or', 'closely', 'related', 'in', 'meaning', '.', 'Each', 'synset', 'represents', 'a', 'distinct', 'concept', '.', 'For', 

In [45]:
stemmed_wrd = [s.stem(word) for word in tokenized_word]
print(stemmed_wrd)

['que', '.', 'word', 'net-', 'word', 'sens', 'disambiguation-', 'novel', 'word', 'sens', 'detect', '.', 'word', 'net', ':', 'wordnet', 'is', 'a', 'lexic', 'databas', 'of', 'the', 'english', 'languag', 'that', 'organ', 'word', 'into', 'group', 'call', 'synset', ',', 'which', 'stand', 'for', '``', 'synonym', 'set', '.', "''", 'these', 'synset', 'contain', 'word', 'that', 'are', 'consid', 'to', 'be', 'cognit', 'synonym', ',', 'mean', 'they', 'express', 'similar', 'or', 'relat', 'concept', '.', 'wordnet', 'cover', 'differ', 'part', 'of', 'speech', 'includ', 'noun', ',', 'verb', ',', 'adject', ',', 'and', 'adverb', '.', 'here', "'s", 'a', 'breakdown', 'of', 'the', 'key', 'compon', 'and', 'featur', 'of', 'wordnet', ':', 'synset', ':', 'these', 'are', 'set', 'of', 'word', 'that', 'are', 'consid', 'synonym', 'or', 'close', 'relat', 'in', 'mean', '.', 'each', 'synset', 'repres', 'a', 'distinct', 'concept', '.', 'for', 'exampl', ',', 'the', 'synset', 'for', '``', 'car', "''", 'might', 'includ', 

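### 3. Lancaster Stemming in NLP
The Lancaster stemmer (also known as the Paice/Husk stemmer) is fast and is the most aggressive of the three. Unlike the Snowball and Porter stemmers, its stems are often not intuitive: in the output below, ‘lexical’ becomes ‘lex’, ‘English’ becomes ‘engl’, and ‘concepts’ becomes ‘conceiv’.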
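
To make the difference in aggressiveness concrete, the sketch below puts Porter and Lancaster stems side by side for a few words taken from the sample document. It assumes NLTK is installed; the `sample` and `comparison` names are introduced only for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Words taken from the sample document used in this notebook
sample = ["lexical", "database", "language", "organizes", "similar"]

# Map each word to its (Porter stem, Lancaster stem) pair
comparison = {word: (porter.stem(word), lancaster.stem(word)) for word in sample}

for word, (p_stem, l_stem) in comparison.items():
    print(f"{word:10} porter={p_stem:10} lancaster={l_stem}")
```

Lancaster tends to produce shorter stems, which merges more word forms together at the cost of readability and a higher risk of conflating unrelated words.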

In [46]:
from nltk.stem import LancasterStemmer

In [52]:
ls=LancasterStemmer()

In [53]:
tokens=[]
for para in fullText:
    tokens.extend(word_tokenize(para))
print(tokens)

['Que', '.', 'Word', 'net-', 'Word', 'Sense', 'Disambiguation-', 'Novel', 'Word', 'Sense', 'detection', '.', 'Word', 'net', ':', 'WordNet', 'is', 'a', 'lexical', 'database', 'of', 'the', 'English', 'language', 'that', 'organizes', 'words', 'into', 'groups', 'called', 'synsets', ',', 'which', 'stand', 'for', '``', 'synonym', 'sets', '.', "''", 'These', 'synsets', 'contain', 'words', 'that', 'are', 'considered', 'to', 'be', 'cognitive', 'synonyms', ',', 'meaning', 'they', 'express', 'similar', 'or', 'related', 'concepts', '.', 'WordNet', 'covers', 'different', 'parts', 'of', 'speech', 'including', 'nouns', ',', 'verbs', ',', 'adjectives', ',', 'and', 'adverbs', '.', 'Here', "'s", 'a', 'breakdown', 'of', 'the', 'key', 'components', 'and', 'features', 'of', 'WordNet', ':', 'Synsets', ':', 'These', 'are', 'sets', 'of', 'words', 'that', 'are', 'considered', 'synonymous', 'or', 'closely', 'related', 'in', 'meaning', '.', 'Each', 'synset', 'represents', 'a', 'distinct', 'concept', '.', 'For', 

In [54]:
stem_word=[ls.stem(word) for word in tokens]
print(stem_word)

['que', '.', 'word', 'net-', 'word', 'sens', 'disambiguation-', 'novel', 'word', 'sens', 'detect', '.', 'word', 'net', ':', 'wordnet', 'is', 'a', 'lex', 'databas', 'of', 'the', 'engl', 'langu', 'that', 'org', 'word', 'into', 'group', 'cal', 'synset', ',', 'which', 'stand', 'for', '``', 'synonym', 'set', '.', "''", 'thes', 'synset', 'contain', 'word', 'that', 'ar', 'consid', 'to', 'be', 'cognit', 'synonym', ',', 'mean', 'they', 'express', 'simil', 'or', 'rel', 'conceiv', '.', 'wordnet', 'cov', 'diff', 'part', 'of', 'speech', 'includ', 'noun', ',', 'verb', ',', 'adject', ',', 'and', 'adverb', '.', 'her', "'s", 'a', 'breakdown', 'of', 'the', 'key', 'compon', 'and', 'feat', 'of', 'wordnet', ':', 'synset', ':', 'thes', 'ar', 'set', 'of', 'word', 'that', 'ar', 'consid', 'synonym', 'or', 'clos', 'rel', 'in', 'mean', '.', 'each', 'synset', 'repres', 'a', 'distinct', 'conceiv', '.', 'for', 'exampl', ',', 'the', 'synset', 'for', '``', 'car', "''", 'might', 'includ', 'word', 'lik', '``', 'automob