## Stemming
Stemming is the process of reducing a word to its stem, the part of the word to which affixes such as suffixes and prefixes attach, or to its root. (The dictionary form of a word is known as a lemma; a stem, unlike a lemma, need not be a valid word.) Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

Ref: https://www.ibm.com/topics/stemming

In other words,
Stemming in Natural Language Processing (NLP) is the process of reducing a word to its base or root form by removing prefixes, suffixes, or inflections. The goal of stemming is to simplify words for text processing by reducing different forms of a word to a common base form. This is particularly useful in search engines, information retrieval, and text mining to group similar words together.

For example:
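"running", "runs", and "runner" can ideally all be reduced to the stem "run", although real stemmers vary: the Porter Stemmer leaves "runner" unchanged, and an irregular form such as "ran" cannot be reduced to "run" by suffix stripping.
"better" and "best" are irregular forms of "good"; a rule-based stemmer does not connect them to it, and an overly aggressive one could even cut both down to a meaningless stem such as "bet".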

### Importance of Stemming

Helps to normalize text data by reducing words to their root form.

Reduces the complexity of language in text data.

Can improve search engine performance by allowing different forms of a word to match the same search query.
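
The search-engine benefit can be illustrated with a minimal sketch. It assumes a deliberately naive, hypothetical `toy_stem` function (not a real stemming algorithm) and a tiny made-up document collection; the point is only that indexing and querying by stem lets different word forms match each other.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical, deliberately naive stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index


# Made-up example documents.
docs = {
    1: "she walks to work",
    2: "walking is healthy",
    3: "he walked home",
}
index = build_index(docs)


def search(query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(toy_stem(query.lower()), set()))


print(search("walk"))     # [1, 2, 3]
print(search("walking"))  # [1, 2, 3]
```

Without the stemming step, the query "walk" would match none of the three documents, because each contains a different inflected form.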

### Common Stemming Algorithms

Porter Stemmer: One of the most widely used stemming algorithms. It follows a set of rules to remove common suffixes.

Lancaster Stemmer: A more aggressive algorithm that tends to reduce words to shorter stems.
    
Snowball Stemmer: An improved version of the Porter Stemmer, providing better language support.

In [1]:
## Classification problem:
## is a product review comment a positive review or a negative review?
## Reviews ----> [eating, eat, eaten] ---> eat, [going, gone, goes] ---> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

### PorterStemmer
 The Porter Stemmer is one of the most commonly used stemming algorithms in Natural Language Processing (NLP). It was introduced by Martin Porter in 1980, and it is designed to reduce words to their base or root form (known as the "stem") by systematically removing common suffixes from English words. The algorithm applies a series of transformation rules to words, which allows different forms of the same word to be treated as identical in various NLP tasks like search engines, text mining, and information retrieval.
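
To give a feel for what a "transformation rule" looks like, here is a minimal sketch of only the first rule group (Step 1a) from Porter's 1980 paper, which handles plural endings. The function name `porter_step_1a` is chosen here for illustration; the full algorithm has several further steps with conditions on the remaining stem, and NLTK's implementation includes additional refinements, so this is not a substitute for `PorterStemmer`.

```python
def porter_step_1a(word):
    """Sketch of Step 1a of the Porter algorithm (plural endings only).

    The rules are tried in order and only the first match is applied.
    """
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (cats -> cat)
    return word


for w in ["caresses", "ponies", "caress", "cats", "eats"]:
    print(w, "---->", porter_step_1a(w))
```

Note that the "ies" rule already produces non-words such as "poni"; the algorithm aims for consistent stems, not dictionary forms.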

In [3]:
from nltk.stem import PorterStemmer

In [9]:
stemming=PorterStemmer()

In [24]:
for word in words:
    print(word+" ---->"+stemming.stem(word))
    
# Another way: a list comprehension
porter_stemmed = [stemming.stem(word) for word in words]
print("porter_stemmed:", porter_stemmed)

eating ---->eat
eats ---->eat
eaten ---->eaten
writing ---->write
writes ---->write
programming ---->program
programs ---->program
history ---->histori
finally ---->final
finalized ---->final
porter_stemmed: ['eat', 'eat', 'eaten', 'write', 'write', 'program', 'program', 'histori', 'final', 'final']


In [13]:
stemming.stem('congratulations')  # Disadvantage of the Porter Stemmer (and stemming in general): the stem is not a real word; lemmatization addresses this

'congratul'

In [15]:
stemming.stem("sitting")

'sit'

### RegexpStemmer class
NLTK provides a RegexpStemmer class for building simple stemmers from regular expressions. It takes a single regular expression and removes every part of the word that matches it, so anchors such as $ (end of word) and ^ (start of word) decide whether suffixes or prefixes are stripped. Words shorter than min characters are left unchanged. Let us see an example:

In [8]:
from nltk.stem import RegexpStemmer

In [22]:
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [23]:
reg_stemmer.stem('eating')

'eat'

In [24]:
reg_stemmer.stem('ingeating')

'ingeat'
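
The behaviour above can be approximated with Python's standard `re` module. The sketch below is a simplified stand-in written for illustration (the name `regexp_stem` and its parameters are assumptions, not NLTK's API); it shows why the leading "ing" in "ingeating" survives: every alternative in the pattern is anchored to the end of the word with `$`.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Remove every match of `pattern` from `word`.

    Words shorter than `min_length` are returned unchanged.
    """
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


suffix_only = r"ing$|s$|e$|able$"       # same pattern as in the cells above
prefix_and_suffix = r"^ing|ing$"        # hypothetical variant that also strips a leading "ing"

print(regexp_stem("eating", suffix_only, min_length=4))           # eat
print(regexp_stem("ingeating", suffix_only, min_length=4))        # ingeat
print(regexp_stem("ingeating", prefix_and_suffix, min_length=4))  # eat
print(regexp_stem("is", suffix_only, min_length=4))               # is (too short to stem)
```

Because the rules are purely textual, such a stemmer is only as good as its pattern: it will happily turn "sing" into "s".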

### Snowball Stemmer
The Snowball Stemmer for English is also known as the Porter2 stemming algorithm. It is a revised version of the Porter Stemmer, written by Martin Porter in his Snowball language, that fixes some issues of the original algorithm.

Improved Accuracy: Snowball Stemmer handles some cases that the Porter Stemmer gets wrong, often producing stems that are closer to actual root words. For example, it reduces "fairly" to "fair", where the Porter Stemmer produces "fairli".

In [25]:
from nltk.stem import SnowballStemmer

In [26]:
snowballsstemmer=SnowballStemmer('english')

In [27]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [28]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [31]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [33]:
snowballsstemmer.stem('goes')

'goe'

In [34]:
stemming.stem('goes')

'goe'

In [28]:
# To overcome the disadvantages of stemming, we use lemmatization

While stemming is a useful technique in Natural Language Processing (NLP) to reduce words to their base forms, it has several disadvantages and limitations:

1. Over-stemming:
Stemming can be too aggressive, cutting off so much of a word that unrelated words end up with the same stem.
Over-stemming leads to false positives, where different words with distinct meanings are treated as the same.
Example:

Words like "universe", "universal" and "university" are typically all stemmed to the same root "univers", though they have different meanings.
2. Under-stemming:
Sometimes stemming algorithms fail to reduce words to their true root form, leaving words partially stemmed.
This results in missed matches between words that should be considered equivalent in meaning.
Example:

The words "ran" and "running" are not reduced to the same stem by the Porter Stemmer ("ran" stays "ran", while "running" becomes "run"), missing the fact that both relate to "run."
3. Loss of Semantics:
Stemming focuses purely on the form of the word and ignores its context or meaning.
By stripping suffixes and prefixes, the algorithm often loses the semantic information that may be important for understanding the full context of the word.
Example:

Words like "better" and "good" are related in meaning but have very different stems, so stemming won't help recognize their synonymy.
4. Language-Specific Limitations:
Stemming algorithms, such as Porter Stemmer, are typically designed for specific languages (like English), and may not work well for other languages with different morphological rules.
Even though Snowball Stemmer supports multiple languages, it still might not capture all linguistic nuances for highly inflected languages like Finnish or Turkish.
5. Inconsistent Results:
Different stemming algorithms can produce different stems for the same word. For example, in the outputs above the Porter Stemmer stems "fairly" to "fairli", while the Snowball Stemmer reduces it to "fair".
This inconsistency can make the output difficult to interpret and apply in some NLP tasks.
6. Non-Words as Stems:
Stemming often produces stems that are not valid words in the language, making it harder to interpret or use these stems in a human-readable context.
Example:

The Porter Stemmer stems "history" to "histori" and "congratulations" to "congratul", neither of which is an English word.
7. Lack of Handling of Compound Words:
Stemming algorithms are generally not well-suited to handle compound words or multi-word expressions.
They work on one token at a time, so they neither split a compound into its parts nor treat a multi-word expression as a unit.
Example:

The word "notebook" is stemmed as a single token and is not related to "note" or "book", while an expression such as "ice cream" is stemmed word by word, losing the combined meaning.
8. Not Suitable for Complex NLP Tasks:
Stemming is typically used for basic text normalization tasks, such as search engines or information retrieval.
For more complex NLP tasks, like machine translation, sentiment analysis, or language understanding, stemming can be too simplistic and may lead to poor performance.
9. Alternative Techniques (e.g., Lemmatization):
Lemmatization, which reduces words to their dictionary or canonical form, is often preferred over stemming for more advanced NLP tasks.
Lemmatization considers the part of speech and context of the word, providing more accurate root forms than stemming.
Example:

Lemmatization would reduce "better" to "good" (depending on context), while stemming would not make this connection.
Summary of Stemming Disadvantages:

| Disadvantage | Description |
|---|---|
| Over-stemming | Stems too aggressively, leading to loss of meaning. |
| Under-stemming | Fails to reduce words to their correct root form. |
| Loss of Semantics | Ignores the meaning and context of the word, leading to ambiguity. |
| Language-Specific Issues | Often designed for specific languages, lacking flexibility. |
| Inconsistent Results | Different stemming algorithms produce varying outputs. |
| Non-Words as Stems | Produces stems that are not valid or recognizable words. |
| Handling Compound Words | Struggles with compound words and multi-word expressions. |
| Limited NLP Use | Not ideal for complex NLP tasks like translation or sentiment analysis. |
| Better Alternatives | Lemmatization often provides better results in many scenarios. |

In many cases, lemmatization is preferred over stemming because it is more context-aware and produces more accurate root forms of words. However, stemming is still useful for quick and computationally inexpensive text preprocessing tasks.
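
Over-stemming and under-stemming can be made concrete by comparing a stemmer's output with a hand-made grouping of words by meaning. The sketch below is illustrative only: the stems are hard-coded (assumed to match typical Porter-style output) so that it runs without NLTK, and `CONCEPTS` is a hypothetical gold standard invented for this example.

```python
from itertools import combinations

# Stems hard-coded for illustration (assumed typical Porter-style output).
# Replace with {w: stemmer.stem(w) for w in ...} to check a real stemmer.
STEMS = {
    "universal": "univers",
    "university": "univers",
    "running": "run",
    "ran": "ran",
}

# Hypothetical gold standard: words in the same set share a meaning.
CONCEPTS = [{"universal"}, {"university"}, {"running", "ran"}]


def concept_of(word):
    """Return the index of the concept group that contains `word`."""
    for i, group in enumerate(CONCEPTS):
        if word in group:
            return i
    raise KeyError(word)


def stemming_errors(stems):
    """Return (over_stemmed_pairs, under_stemmed_pairs)."""
    over, under = [], []
    for a, b in combinations(sorted(stems), 2):
        same_concept = concept_of(a) == concept_of(b)
        same_stem = stems[a] == stems[b]
        if same_stem and not same_concept:
            over.append((a, b))    # distinct meanings merged into one stem
        elif same_concept and not same_stem:
            under.append((a, b))   # related forms left with different stems
    return over, under


over, under = stemming_errors(STEMS)
print("over-stemmed pairs:", over)    # [('universal', 'university')]
print("under-stemmed pairs:", under)  # [('ran', 'running')]
```

Counting such pairs over a larger word list gives a rough way to compare how aggressive different stemmers are.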