## Tokenization: a process in which a paragraph is split into sentences or words.

In [4]:
corpus="""Hello Everyone my name is Md Aftab And I am fron Araria.
i am a 3rd year student of B.Tech in Computer Science and Engineering.
"""

print(corpus)

Hello Everyone my name is Md Aftab And I am fron Araria.
i am a 3rd year student of B.Tech in Computer Science and Engineering.



### **Types of Stemming in NLP**  

Stemming algorithms use different approaches to reduce words to their root forms. Below are the most common types of stemming algorithms:  

---

### **1. Porter Stemmer** (Most Common)  
- Developed by **Martin Porter** in 1980.  
- Uses a **set of rules** to remove common suffixes (e.g., "ing", "ed", "s").  
- Sometimes produces stems that are **not dictionary words**.  
- Good for general-purpose NLP tasks.  

**Example:**  
- "running" → "run"  
- "flies" → "fli"  
- "happiness" → "happi"  



### **2. Snowball Stemmer** (Improved Porter Stemmer)  
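- The English Snowball stemmer is also called **"Porter2"**, a revised version of the Porter Stemmer.  
- **Supports multiple languages** (English, Spanish, French, etc.).  
- Tends to produce cleaner stems than the original Porter Stemmer, for example on adverbs ending in "-ly".  

**Example:**  
- "running" → "run"  
- "generously" → "generous" (the original Porter algorithm gives "gener")  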
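
As a minimal sketch of the Snowball points above, the snippet below uses NLTK's `SnowballStemmer`. It assumes NLTK is installed; the word list is only illustrative and is not taken from the notebook's later cells.

```python
from nltk.stem import SnowballStemmer

# Languages that NLTK's Snowball implementation ships with.
print(SnowballStemmer.languages)

# The "english" stemmer is the Porter2 algorithm.
snowball = SnowballStemmer("english")
for word in ["running", "generously"]:
    print(word, "->", snowball.stem(word))
```

Passing another name from `SnowballStemmer.languages` (for example `"spanish"`) selects the stemmer for that language.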


### **3. Lancaster Stemmer** (More Aggressive)  
- More **aggressive** than Porter and Snowball.  
- Can **over-stem** (reduce words too much).  
- Often cited as fast, but its aggressive rules conflate more unrelated words, so precision is lower.  

**Example:**  
- "running" → "run"  
- "happiness" → "happy"  
- "organization" → "org"  


### **4. Regex-based Stemmer (Regular Expression Stemmer)**  
- Uses **regular expressions** to define stemming rules.  
- Good for **custom stemming** when specific suffixes need to be removed.  

**Example:**  
- Removing **common suffixes** like "ing", "ed", "ly".  



### **5. Krovetz Stemmer**  
- A **lightweight stemmer** that focuses on **minimizing over-stemming**.  
- Uses a **dictionary lookup** to ensure valid words.  
- Less aggressive than Porter, Snowball, and Lancaster.  

**Example:**  
- "running" → "run"  
- "cats" → "cat"  
- "better" → "better" (Keeps correct base form)  

> **Note:** Krovetz Stemmer is mainly used in **Information Retrieval (IR)** systems.
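
The dictionary-lookup idea can be illustrated with a toy sketch. This is **not** the Krovetz algorithm: the vocabulary, the suffix list, and the function name `dictionary_checked_stem` are all hypothetical, chosen only to show how a lookup can stop a rule from producing a non-word.

```python
# Toy vocabulary standing in for a real dictionary (hypothetical data).
VOCABULARY = {"run", "cat", "better", "walk"}

# Hypothetical suffixes to try, longest first.
SUFFIXES = ("ning", "ing", "er", "s")


def dictionary_checked_stem(word: str) -> str:
    """Strip a suffix only when the result is a known word."""
    if word in VOCABULARY:
        return word  # already a valid base form, so leave it alone
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)]
            if candidate in VOCABULARY:
                return candidate
    return word  # no rule produced a dictionary word


for w in ["running", "cats", "better", "bus"]:
    print(w, "->", dictionary_checked_stem(w))
```

Here "better" is kept because it is already in the vocabulary, and "bus" is kept because stripping "s" would give "bu", which is not a known word.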

---

### **Comparison of Stemmers**  

| **Stemmer**       | **Aggressiveness** | **Accuracy** | **Supports Multiple Languages?** | **Common Use Cases** |
|------------------|-----------------|------------|----------------------------|----------------|
| **Porter**       | Medium          | Medium     | No                         | General NLP |
| **Snowball**     | Medium          | High      | Yes                        | Multilingual NLP |
| **Lancaster**    | High (Over-stemming) | Low   | No                         | Fast Stemming |
| **Regex**        | Customizable    | Varies     | No                         | Simple, Rule-Based NLP |
| **Krovetz**      | Low             | High      | No                         | Information Retrieval |
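
To see how the stemmers in the table differ in practice, a short sketch like the following runs the three NLTK stemmers side by side. It assumes NLTK is installed; Krovetz is left out because it is not used elsewhere in this notebook, and the word list simply reuses examples from the sections above.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

words = ["running", "flies", "happiness", "organization"]

# Header row, then one row per word with the stem from each algorithm.
print(f"{'word':<14}" + "".join(f"{name:<12}" for name in stemmers))
for word in words:
    row = "".join(f"{stemmer.stem(word):<12}" for stemmer in stemmers.values())
    print(f"{word:<14}{row}")
```

Running it on your own vocabulary is usually the quickest way to decide which level of aggressiveness is acceptable for a task.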


In [3]:
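import nltk
from nltk.tokenize import sent_tokenize
# One-time setup if the models are missing: nltk.download("punkt")
# (newer NLTK releases ask for "punkt_tab" instead)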



In [6]:
corpus="""Hello Everyone my name is Md Aftab And I am fron Araria.
i am a 3rd year student of B.Tech in Computer Science and Engineering.
"""

docs=sent_tokenize(corpus) ## making two sentences

In [7]:
docs

['Hello Everyone my name is Md Aftab And I am fron Araria.',
 'i am a 3rd year student of B.Tech in Computer Science and Engineering.']

In [8]:
type(docs)

list

In [11]:
## converting paragraph into words.
from nltk.tokenize import word_tokenize
words=word_tokenize(corpus)

In [12]:
words

['Hello',
 'Everyone',
 'my',
 'name',
 'is',
 'Md',
 'Aftab',
 'And',
 'I',
 'am',
 'fron',
 'Araria',
 '.',
 'i',
 'am',
 'a',
 '3rd',
 'year',
 'student',
 'of',
 'B.Tech',
 'in',
 'Computer',
 'Science',
 'and',
 'Engineering',
 '.']

In [13]:
type(words)

list

In [14]:
for sentence in docs:
    print(word_tokenize(sentence))

['Hello', 'Everyone', 'my', 'name', 'is', 'Md', 'Aftab', 'And', 'I', 'am', 'fron', 'Araria', '.']
['i', 'am', 'a', '3rd', 'year', 'student', 'of', 'B.Tech', 'in', 'Computer', 'Science', 'and', 'Engineering', '.']


In [17]:
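## Treebank word tokenizer. It expects one sentence at a time, so on a
## multi-sentence string only the final period becomes its own token.
from nltk.tokenize import TreebankWordTokenizer
tbwt=TreebankWordTokenizer()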

In [18]:
tbwt.tokenize(corpus)

['Hello',
 'Everyone',
 'my',
 'name',
 'is',
 'Md',
 'Aftab',
 'And',
 'I',
 'am',
 'fron',
 'Araria.',
 'i',
 'am',
 'a',
 '3rd',
 'year',
 'student',
 'of',
 'B.Tech',
 'in',
 'Computer',
 'Science',
 'and',
 'Engineering',
 '.']
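
In the output above, 'Araria.' keeps its period because the Treebank tokenizer only separates a period at the very end of its input. A possible workaround, sketched below, is to tokenize one sentence at a time. The `sentences` list is copied by hand from the `sent_tokenize` output earlier in the notebook so the snippet does not depend on the Punkt models; the variable names are illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer

# Sentences as produced by sent_tokenize earlier in this notebook.
sentences = [
    "Hello Everyone my name is Md Aftab And I am fron Araria.",
    "i am a 3rd year student of B.Tech in Computer Science and Engineering.",
]

tokenizer = TreebankWordTokenizer()

# Tokenize each sentence separately so every final period is split off.
tokens_per_sentence = [tokenizer.tokenize(s) for s in sentences]
for tokens in tokens_per_sentence:
    print(tokens)
```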

## Stemming

In [19]:
words=["eating","eats","eaten","writing","write","written","programming","program"]

## Porter Stemmer

In [21]:
from nltk.stem import PorterStemmer


In [22]:
stemming=PorterStemmer()


In [24]:
for word in words:
    print(word+"----->"+stemming.stem(word))

eating----->eat
eats----->eat
eaten----->eaten
writing----->write
write----->write
written----->written
programming----->program
program----->program


## Regexp Stemmer 

In [25]:
from nltk.stem import RegexpStemmer

In [26]:
stemmer=RegexpStemmer(regexp='ing$|s$|e$|able$', min=4)

In [27]:
stemmer.stem('eating')

'eat'

In [28]:
stemmer.stem('reading')

'read'

In [None]:
stemmer.stem('Table')  # the leftmost match wins, so 'able$' is removed, not just 'e$'

'T'
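
The mechanics can be reproduced with the standard `re` module. The sketch below assumes that `RegexpStemmer` amounts to a minimum-length check followed by deleting whatever the pattern matches; `regexp_stem` and `SUFFIX_PATTERN` are hypothetical names used only for this illustration.

```python
import re

# Same pattern as the RegexpStemmer cell above.
SUFFIX_PATTERN = re.compile(r"ing$|s$|e$|able$")


def regexp_stem(word: str, min_length: int = 4) -> str:
    """Delete the leftmost suffix match; leave short words untouched."""
    if len(word) < min_length:
        return word
    return SUFFIX_PATTERN.sub("", word)


for w in ["eating", "reading", "Table", "TAble", "is"]:
    print(w, "->", repr(regexp_stem(w)))
```

The pattern is case-sensitive and the regex engine takes the leftmost match, so "Table" loses "able" and becomes "T", while "TAble" only loses the final "e" and becomes "TAbl". Words shorter than `min_length`, such as "is", are returned unchanged.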

## Lancaster Stemmer

In [35]:
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
print(ls.stem("organization"))

org
