### Stemming

Stemming is a text normalization technique in Natural Language Processing (NLP) that reduces a word to its root form (its stem). The stem may or may not be a meaningful word on its own.


### Importance of Stemming:

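1.  It reduces the inflected forms of a word to a common root, which may or may not be a meaningful word.

2.  It shrinks the vocabulary, which can speed up NLP tasks.

3.  A smaller vocabulary can speed up model training and may improve accuracy, although aggressive stemming can also merge unrelated words.

4.  It improves search, because a query can match different forms of the same word.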


### PorterStemmer

The Porter stemmer is a widely used stemming algorithm that reduces a word to its root by stripping suffixes according to a fixed set of rules.
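To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm: the function name `strip_suffix`, the suffix list, and the minimum stem length of 3 are all assumptions chosen for illustration.

```python
def strip_suffix(word: str, suffixes=("ing", "ed", "s")) -> str:
    """Toy stemmer: remove the first matching suffix (NOT the Porter algorithm)."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when at least 3 characters remain, so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["runs", "jumped", "running", "is"]:
    print(w, "->", strip_suffix(w))
```

The toy version turns "running" into "runn", which shows why a real stemmer needs extra rules (for example, for doubled consonants) on top of plain suffix removal.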


### Steps used in this workflow:

1.  Import all the necessary libraries

2.  Include the necessary Resource required for nltk

3.  Create a Sample Text

4.  Perform the tokenization of  the text

5.  Load the English Stopwords

6.  Remove all the Stop words from the text

7.  Perform stemming on all the filtered words
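The steps above can be sketched as one small function. This is a minimal sketch under stated assumptions: `stem_pipeline` and `SAMPLE_STOPWORDS` are hypothetical names, the stop-word set is a tiny hand-picked stand-in for NLTK's English list (so no download is needed), and the stop-word check is lowercased, which differs slightly from the case-sensitive check used in Step 6.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Assumption: a tiny hand-picked stop-word set standing in for NLTK's full list
SAMPLE_STOPWORDS = {"the", "is", "are", "to", "and", "a", "of"}


def stem_pipeline(text, stopwords=SAMPLE_STOPWORDS):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    # Tokenize on runs of word characters, which drops punctuation
    tokens = RegexpTokenizer(r"\w+").tokenize(text)
    # Lowercase each token before the stop-word check
    filtered = [t for t in tokens if t.lower() not in stopwords]
    # Stem every remaining token
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in filtered]


print(stem_pipeline("People are communicating using texts and emails"))
```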

In [88]:
from nltk.stem import PorterStemmer

# Create a Porter stemmer object
ps = PorterStemmer()

# Create a list of words
words = ["running", "runs", "ran", "studies", "studying"]

# Stem every word in the list
res = [ps.stem(x) for x in words]

print(res)

['run', 'run', 'ran', 'studi', 'studi']


### OBSERVATIONS:

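1.  Here "running" ----------------> "run"  (the root is a meaningful word)

2.  Here "studies" ----------------> "studi" (the root is not a dictionary word)

3.  Here "ran" --------------------> "ran"  (the stemmer only strips suffixes, so an irregular form is not mapped to "run")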
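A stem that is not a dictionary word is still useful, because different forms of a word collapse to the same key. The sketch below illustrates this with a toy search; the `documents` list and the helper names `stems_of` and `search` are assumptions made for the example.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical mini-corpus for illustration
documents = ["She is studying", "He runs daily", "They ran home"]


def stems_of(text):
    """Return the set of stems for a whitespace-separated text."""
    return {ps.stem(w) for w in text.split()}


def search(query, docs):
    """Return the documents that contain a word sharing the query's stem."""
    q = ps.stem(query)
    return [d for d in docs if q in stems_of(d)]


print(search("studies", documents))
print(search("running", documents))
```

The query "studies" finds the document containing "studying", and "running" finds "runs", but "ran" is missed because its stem stays "ran".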

### Step 1: Import all the necessary libraries

In [89]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk

from nltk.tokenize import word_tokenize, sent_tokenize

from nltk.stem import PorterStemmer

from nltk.corpus import stopwords

### OBSERVATIONS:

1.   numpy  ----------->  Numerical computation on arrays

2.   pandas ----------->  Data creation and manipulation

3.   matplotlib ------->  Data visualization

4.   seaborn  --------->  Statistical data visualization built on matplotlib

5.   nltk   ----------->  NLP library used here for text preprocessing

6.   corpus  ---------->  nltk package that provides access to bundled text collections and word lists

7.   stopwords -------->  very common words (such as "the", "is", "and") that carry little meaning on their own

8.   tokenize  ---------> breaks the text into smaller parts

9.   sent_tokenize -----> breaks a paragraph into sentences

10.  word_tokenize -----> breaks a sentence into words and punctuation tokens

11.  PorterStemmer -----> reduces a word to its root by removing suffixes

### Step 2: Include the necessary Resource required for nltk

In [90]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### OBSERVATIONS:

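1. punkt_tab is an NLTK data resource that contains the pre-trained Punkt models, which help the tokenizer split text accurately into sentences.

2. The stop-word list loaded in Step 5 comes from a separate resource; if it is not already installed, it can be fetched with nltk.download('stopwords').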
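Since the download log above reports that the package is already up to date, it can be convenient to check for a resource before downloading it. The following is a minimal sketch: `ensure_nltk_resource` is a hypothetical helper name, and it assumes the resources live under the usual `tokenizers/` and `corpora/` paths.

```python
import nltk


def ensure_nltk_resource(resource_path: str, package: str) -> bool:
    """Download `package` only if `resource_path` is not already installed."""
    try:
        # nltk.data.find raises LookupError when the resource is missing
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # quiet=True suppresses the [nltk_data] progress messages
        return nltk.download(package, quiet=True)
```

For this notebook the calls would be `ensure_nltk_resource("tokenizers/punkt_tab", "punkt_tab")` and `ensure_nltk_resource("corpora/stopwords", "stopwords")`.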

### Step 3: Create a Sample Text

In [91]:
# Define sample paragraph
text = """
Natural Language Processing allows computers to understand human language.
People are communicating using texts, emails, and social media every day.
We are building systems that can automatically analyze and respond intelligently.
"""

In [92]:
text

'\nNatural Language Processing allows computers to understand human language.\nPeople are communicating using texts, emails, and social media every day.\nWe are building systems that can automatically analyze and respond intelligently.\n'

### Step 4: Perform the tokenization of  the text

In [93]:
from  nltk.tokenize import word_tokenize, sent_tokenize

### perform the sentence tokenization

sentences = sent_tokenize(text)

print(sentences)

['\nNatural Language Processing allows computers to understand human language.', 'People are communicating using texts, emails, and social media every day.', 'We are building systems that can automatically analyze and respond intelligently.']


In [94]:
### perform the word tokenization

words = word_tokenize(text)

print(words)

['Natural', 'Language', 'Processing', 'allows', 'computers', 'to', 'understand', 'human', 'language', '.', 'People', 'are', 'communicating', 'using', 'texts', ',', 'emails', ',', 'and', 'social', 'media', 'every', 'day', '.', 'We', 'are', 'building', 'systems', 'that', 'can', 'automatically', 'analyze', 'and', 'respond', 'intelligently', '.']


### Step 5: Load the English Stopwords

In [95]:
from nltk.corpus  import stopwords


### define the set of all the stopwords used in english

english_stopwords = set(stopwords.words("english"))

print(english_stopwords)

{'isn', 'm', 'my', "aren't", 'yourself', "should've", 've', 'am', 'herself', 'those', 'couldn', 'does', 'should', 'her', "hadn't", 'its', 'hasn', 'down', 'between', 'ma', 'until', 'who', 'him', 'an', 'out', 'wouldn', 'own', 'and', 'be', 'than', "didn't", 'i', 'you', "shouldn't", 'are', 'mustn', "you're", 'most', 'yourselves', 'himself', 'themselves', 're', 'as', 'no', 'before', 'this', 'against', 'further', 'so', 'can', 'won', 'there', 'shan', 'them', 'not', 'ourselves', 'myself', 'from', 'more', 'which', 'if', 'of', 'we', 'did', 'needn', 'after', "you've", "you'll", 'had', 'do', 's', 'y', 'on', 'again', 'what', 'other', 'have', 'each', "she's", 'with', 'few', 'll', 'were', 'under', 'when', 'has', 'off', 'wasn', "weren't", "mightn't", 'shouldn', 'these', 'mightn', 'a', 'how', 'through', 'their', 'our', 'he', "don't", 'didn', 'too', 'some', "couldn't", "that'll", 'because', 'here', 'for', 'only', 'by', "shan't", 'aren', 'why', "haven't", 'been', 'but', 'theirs', 'any', "wasn't", 'all', 

In [96]:
# Re-tokenize with a regular expression so that punctuation is dropped
from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits, and underscores
reg = RegexpTokenizer(r'\w+')

ans = reg.tokenize(text)

print(ans)

['Natural', 'Language', 'Processing', 'allows', 'computers', 'to', 'understand', 'human', 'language', 'People', 'are', 'communicating', 'using', 'texts', 'emails', 'and', 'social', 'media', 'every', 'day', 'We', 'are', 'building', 'systems', 'that', 'can', 'automatically', 'analyze', 'and', 'respond', 'intelligently']
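For a simple pattern like `\w+`, the same kind of tokenization can be illustrated with the standard library alone. This sketch uses `re.findall` as a stand-in to show what the pattern matches; it is not how the notebook itself tokenizes.

```python
import re

sentence = "People are communicating using texts, emails, and social media every day."

# \w+ keeps runs of word characters, so commas and periods are dropped
tokens = re.findall(r"\w+", sentence)
print(tokens)

# Side effect of the pattern: apostrophes split contractions into pieces
print(re.findall(r"\w+", "don't"))
```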


### Step 6: Remove all the Stop words from the text

In [97]:
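# Keep only the tokens that are not stop words
# (the check is case-sensitive, so 'We' is kept because the list is lowercase)
res = [x for x in ans if x not in english_stopwords]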

print(res)

['Natural', 'Language', 'Processing', 'allows', 'computers', 'understand', 'human', 'language', 'People', 'communicating', 'using', 'texts', 'emails', 'social', 'media', 'every', 'day', 'We', 'building', 'systems', 'automatically', 'analyze', 'respond', 'intelligently']
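The output above still contains 'We' because NLTK's stop words are lowercase and the comparison is case-sensitive. The sketch below contrasts the two checks; `stop_subset` is a small hand-picked subset of the English list, used here only so the example runs without downloads.

```python
tokens = ["We", "are", "building", "systems", "that", "can", "analyze"]

# Assumption: a small subset of NLTK's English stop words, for illustration
stop_subset = {"we", "are", "that", "can"}

# Case-sensitive check: 'We' does not equal 'we', so it is kept
case_sensitive = [t for t in tokens if t not in stop_subset]

# Lowercasing each token first removes capitalized stop words as well
case_insensitive = [t for t in tokens if t.lower() not in stop_subset]

print(case_sensitive)
print(case_insensitive)
```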


### Step 7: Perform stemming on all the filtered words

In [98]:
ps = PorterStemmer()

# Stem every filtered word (PorterStemmer also lowercases its output by default)
ans = [ps.stem(x) for x in res]

print(ans)

['natur', 'languag', 'process', 'allow', 'comput', 'understand', 'human', 'languag', 'peopl', 'commun', 'use', 'text', 'email', 'social', 'media', 'everi', 'day', 'we', 'build', 'system', 'automat', 'analyz', 'respond', 'intellig']


### OBSERVATIONS:

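1.  Here all the words have been converted into their root forms, which may or may not be meaningful words (for example, "natur" and "languag").

2.  The stem "we" remains in the output because the stop-word check in Step 6 is case-sensitive, so the capitalized token "We" was not removed.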
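One way to inspect the result is to group the original words by their stem, which shows which surface forms were merged. This is a minimal sketch on a few of the filtered tokens; the variable names `filtered` and `groups` are assumptions for the example.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# A few tokens taken from the filtered list in Step 6
filtered = ["Natural", "Language", "Processing", "language", "computers"]

# Map each stem to the original words that produced it
groups = defaultdict(list)
for word in filtered:
    groups[ps.stem(word)].append(word)

for stem, originals in groups.items():
    print(stem, "<-", originals)
```

Here "Language" and "language" end up under the same stem, "languag", because the stemmer lowercases its input before stripping suffixes.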