# Stemming - Text Preprocessing

**Goal**: This notebook explores stemming, a key text preprocessing technique used in NLP. Stemming reduces words to their base or root form (known as the "stem"), which helps in minimizing variations of a word in textual data.

**Context**: Stemming is useful for tasks such as information retrieval and search, where inflected forms of a word should be treated as the same term (e.g., "running" and "runs" both become "run"). Irregular forms such as "ran" are generally left unchanged by rule-based stemmers; mapping them to "run" requires lemmatization. This notebook introduces the concept of stemming and demonstrates several stemming algorithms, highlighting their benefits and limitations.


### Stemming
Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and, less often, prefixes). Unlike a lemma, which is a word's dictionary form, a stem need not be a valid word (for example, "history" becomes "histori"). Stemming is widely used in natural language processing (NLP) and natural language understanding (NLU).
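
To make the idea of affix stripping concrete, here is a deliberately naive sketch. The `toy_stem` function and its suffix list are hypothetical and written only for illustration; this is not the Porter algorithm or any NLTK API, and real stemmers apply many more rules and conditions.

```python
# Toy suffix stripper (illustrative assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["eating", "eats", "eaten", "writes"]:
    print(w + "--->" + toy_stem(w))
# eating--->eat
# eats--->eat
# eaten--->eaten
# writes--->writ
```

Even this small example shows the typical trade-off: "eaten" is left untouched (under-stemming) and "writes" is cut down to the non-word "writ".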

In [1]:
## Motivating example: classifying product reviews as positive or negative
## Inflected forms in reviews should ideally map to one base form:
## eating, eats, eaten ---> eat ; going, gone, goes ---> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

### PorterStemmer


In [2]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [3]:
for word in words:
    print(word+"--->"+stemmer.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalized--->final


In [4]:
stemmer.stem('congratulations')

'congratul'

In [5]:
stemmer.stem('sitting')

'sit'

### RegexpStemmer class
NLTK provides the `RegexpStemmer` class for building a stemmer from a regular expression. It takes a single regular expression and removes every part of the word that matches it. The `min` argument sets the minimum word length for stemming to apply, so shorter words are returned unchanged. For example:

In [6]:
from nltk.stem import RegexpStemmer

In [10]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [11]:
reg_stemmer.stem('eating')

'eat'

In [12]:
reg_stemmer.stem('ingeating')

'ingeat'

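The pattern `ing$` is anchored to the end of the word, so it does not remove "ing" at the beginning; that is why `reg_stemmer` returns 'ingeat'. Adding an unanchored `ing` alternative removes "ing" wherever it occurs. Note that this also strips "ing" from the middle of a word; to target only a leading "ing", use `^ing` instead.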

In [23]:
reg_stemmer = RegexpStemmer('ing|ing$|s$|e$|able$', min=4)

In [24]:
reg_stemmer.stem('ingeating')

'eat'
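
The difference between an unanchored and an anchored pattern can be sketched with the standard `re` module. The helper `regexp_stem` below is a hypothetical stand-in that assumes a regex stemmer simply deletes every match of the pattern from words of at least `min_length` characters; it is not NLTK's implementation.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Hypothetical helper: delete every match of `pattern` from `word`."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

unanchored = r"ing|s$|e$|able$"        # 'ing' may match anywhere in the word
anchored = r"^ing|ing$|s$|e$|able$"    # 'ing' only at the start or the end

for w in ["ingeating", "singer", "kingdoms"]:
    print(w, regexp_stem(w, unanchored, 4), regexp_stem(w, anchored, 4))
# ingeating eat eat
# singer ser singer
# kingdoms kdom kingdom
```

Both patterns handle 'ingeating', but only the anchored one leaves "singer" and "kingdom" intact.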

### Snowball Stemmer
The Snowball stemmer, also known as the Porter2 stemming algorithm, is a revised version of the Porter stemmer that fixes some of its issues. NLTK's `SnowballStemmer` takes the language as an argument.

In [15]:
from nltk.stem import SnowballStemmer
sb_stemmer = SnowballStemmer('english')

In [25]:
for word in words:
    print(word+"--->"+sb_stemmer.stem(word))
# For this word list, the stems match the PorterStemmer output

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalized--->final


The two stemmers do behave differently on other words, such as adverbs ending in "-ly":

In [19]:
stemmer.stem("fairly"), stemmer.stem("sportingly")

('fairli', 'sportingli')

In [20]:
sb_stemmer.stem("fairly"), sb_stemmer.stem("sportingly")

('fair', 'sport')

In [21]:
stemmer.stem('goes')

'goe'

In [22]:
sb_stemmer.stem('goes')

'goe'

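The Snowball stemmer handles words like "fairly" and "sportingly" better than the Porter stemmer, but both still fail on "goes", returning 'goe' instead of 'go'.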
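
The comparison above can be collected into a single side-by-side table. This is a minimal sketch that assumes NLTK is installed; `compare_stemmers` is a hypothetical helper name, while `PorterStemmer` and `SnowballStemmer` are the same classes used earlier in this notebook.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) rows for each word."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]

rows = compare_stemmers(["fairly", "sportingly", "goes", "eating"])
for word, p, s in rows:
    flag = "" if p == s else "  <-- differs"
    print(f"{word:12}{p:12}{s:12}{flag}")

# Words on which the two stemmers disagree
differing = [word for word, p, s in rows if p != s]
print(differing)  # ['fairly', 'sportingly']
```

Running such a table over a sample of your own vocabulary is a quick way to decide which stemmer suits a given dataset.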

# Stemming Use Cases

Stemming is a crucial step in many Natural Language Processing (NLP) applications. Below are some common use cases where stemming is applied:

## 1. **Search Engines and Information Retrieval**
   - **Use Case:** Improve search results by matching different forms of a word.
   - **Example:** In a search query, the words "running", "runs", and "run" can all be stemmed to "run", allowing the search engine to match documents containing any of these forms. Derived words such as "runner" are typically left unchanged by the Porter and Snowball stemmers.
   - **Benefit:** Reduces the complexity of search algorithms and enhances the recall of relevant documents.
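
As a rough sketch of how stemming raises recall in retrieval, the example below builds a tiny inverted index keyed by stems. The `STEMS` lookup table is copied from the PorterStemmer output shown earlier in this notebook and stands in for a call to `stemmer.stem`; the documents and the helper names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

# Stems taken from the PorterStemmer output earlier in this notebook.
# In a real pipeline, stem() would call stemmer.stem(token) instead.
STEMS = {"eating": "eat", "eats": "eat", "writing": "write", "writes": "write"}

def stem(token):
    return STEMS.get(token, token)

def identity(token):
    return token

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index

def search(query, index, normalize):
    """Return ids of documents that contain every normalized query token."""
    hits = [index.get(normalize(t), set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()

docs = {1: "eating pizza", 2: "she eats pasta", 3: "writing code"}

stemmed_index = build_index(docs, stem)
exact_index = build_index(docs, identity)

print(sorted(search("eating", stemmed_index, stem)))    # [1, 2]
print(sorted(search("eating", exact_index, identity)))  # [1]
```

With exact matching, the query finds only the document containing the literal word; with stemming, it also finds the document containing "eats".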

## 2. **Text Classification**
   - **Use Case:** Simplify text data by converting words to their root forms, which helps to reduce the size of the vocabulary for classifiers.
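   - **Example:** A sentiment analysis model can benefit from stemming because it generalizes terms like "excited" and "exciting" into "excit". The Snowball stemmer also reduces "excitedly" to "excit", whereas NLTK's Porter stemmer leaves the "-ly" ending in place, as with "fairly" above.
   - **Benefit:** Can improve the accuracy and performance of classification algorithms by shrinking the feature space and reducing noise.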
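
The vocabulary reduction can be illustrated with a small bag-of-words count. The `crude_stem` function below is a hypothetical regex-based stand-in used only to keep the sketch dependency-free; in practice you would pass a real stemmer's `stem` method instead.

```python
import re
from collections import Counter

def crude_stem(token):
    """Illustrative stand-in for a real stemmer: strip a few verb/adverb endings."""
    return re.sub(r"(edly|ing|ed)$", "", token)

def bag_of_words(tokens, stem=None):
    """Count tokens, optionally normalizing each one with `stem` first."""
    if stem is None:
        return Counter(tokens)
    return Counter(stem(t) for t in tokens)

tokens = ["excited", "exciting", "excitedly", "bored", "boring"]

raw = bag_of_words(tokens)
stemmed = bag_of_words(tokens, crude_stem)

print(len(raw), len(stemmed))  # 5 2
print(dict(stemmed))           # {'excit': 3, 'bor': 2}
```

Five distinct features collapse into two, so a classifier sees three occurrences of one "excit" feature rather than three unrelated words.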

## 3. **Natural Language Processing Pipelines**
   - **Use Case:** Pre-process large text datasets to normalize word variations.
   - **Example:** In topic modeling (e.g., Latent Dirichlet Allocation), stemming reduces vocabulary size, enhancing the model's ability to group related terms together.
   - **Benefit:** Efficiently processes text for downstream tasks like entity recognition or machine translation.

## 4. **Content Recommendation Systems**
   - **Use Case:** Increase the relevance of recommended content by unifying different word forms.
   - **Example:** In e-commerce, stemming helps recommend products related to search terms like "buy", "buys", or "buying", which all stem to "buy". Irregular forms such as "bought" are not reduced by a stemmer and require lemmatization.
   - **Benefit:** Ensures more accurate matching and improves the relevance of recommendations.

## 5. **Question Answering Systems**
   - **Use Case:** Enhance the ability to find answers by stemming query terms to their root form.
   - **Example:** A user asking "What is the definition of computing?" will have a better chance of receiving an answer that also includes variations like "compute", "computed", and "computer".
   - **Benefit:** Increases the coverage and accuracy of the answers returned by the system.

## 6. **Plagiarism Detection**
   - **Use Case:** Identify paraphrased content by reducing words to their base form for comparison.
   - **Example:** "Plagiarism" and "plagiarized" can be reduced to a shared stem ("plagiar" under the Porter algorithm), helping to detect similar content in different forms.
   - **Benefit:** Improves detection by focusing on conceptual similarity rather than exact word matching.

## 7. **Spam Detection**
   - **Use Case:** Normalize variations in word forms to identify spam messages effectively.
   - **Example:** Words like "earn", "earning", and "earned" can be reduced to "earn" in order to more easily flag spam messages offering quick earnings.
   - **Benefit:** Reduces false negatives and improves the recall of spam filters.

## 8. **Opinion Mining and Sentiment Analysis**
   - **Use Case:** Stemming helps in identifying sentiments related to different forms of words that share the same base meaning.
   - **Example:** Words like "happy" and "happiness" both stem to "happi", improving the analysis of sentiments in user reviews. Note that "happily" becomes "happili" with both the Porter and Snowball stemmers, so not every related form is merged.
   - **Benefit:** Provides better generalization across different forms of expression in sentiment analysis models.

## 9. **Machine Translation**
   - **Use Case:** Simplify the translation process by reducing words to their root forms before translating.
   - **Example:** In translating between languages, stemming helps reduce ambiguity by simplifying terms like "translated", "translating", and "translates" into "translat".
   - **Benefit:** Increases the efficiency and accuracy of the translation model.

## 10. **Document Summarization**
   - **Use Case:** Summarization models benefit from stemming by generalizing word forms, which helps in identifying key points.
   - **Example:** The words "analyze", "analyzed", and "analyzing" can be stemmed to "analyz", ensuring better extraction of relevant information.
   - **Benefit:** Reduces redundancy and improves the clarity of summarized text.

