# TEXT PREPROCESSING - STEMMING

**Stemming** is a text preprocessing technique in which the tokens generated from the corpus are reduced to their base units (stems). The stems need not be meaningful words. Because stemming is a manual, rule-based way of cutting words down, it is less complex and faster than dictionary-based approaches such as lemmatization.
* _connect, connected, connecting --> connect_
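
To make the "rule-based cutting" idea concrete, here is a minimal sketch of a suffix stripper. The function `toy_stem` and its rule list are hypothetical and far cruder than the Porter algorithm; they only illustrate that a stem is produced by string rules and need not be a real word.

```python
# Toy suffix-stripping stemmer: a hypothetical illustration, NOT the Porter algorithm.
SUFFIX_RULES = ["ing", "ed", "es", "s"]  # tried in this order


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix,
    keeping at least two characters of stem."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for w in ["Paragraphs", "building", "defined", "sentences"]:
    print(w, "->", toy_stem(w))
# Paragraphs -> paragraph
# building -> build
# defined -> defin
# sentences -> sentenc
```

Note that `defin` and `sentenc` are not dictionary words, which is exactly the trade-off described above.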


**Overstemming** is when too much of the token is cut off, so that words with different meanings collapse to the same stem.
* _universe, university, universities --> univers_

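**Understemming** is when too little of the token is cut off, so that related words do not reduce to the same stem.
* _datum, data_ may be left as two separate stems. Cutting both down to _dat_ would merge them, but then _date_ would also reduce to _dat_, which is overstemming.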
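
The tension between the two failure modes can be sketched with a deliberately crude stand-in for a stemmer: plain prefix truncation. The helper `group_by_stem` is hypothetical, and truncation is an assumption made for illustration only; real stemmers use suffix rules rather than fixed lengths.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


words = ["datum", "data", "date"]

# Aggressive rule: keep only the first 3 characters.
aggressive = group_by_stem(words, lambda w: w[:3])
# Conservative rule: keep the first 4 characters.
conservative = group_by_stem(words, lambda w: w[:4])

print(aggressive)    # {'dat': ['datum', 'data', 'date']}
print(conservative)  # {'datu': ['datum'], 'data': ['data'], 'date': ['date']}
```

The aggressive rule merges _datum_ and _data_ but also pulls in the unrelated _date_ (overstemming); the conservative rule keeps _date_ apart but no longer merges _datum_ and _data_ (understemming).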

**Stop words** are common words that add little meaning to a sentence and are often removed before further processing.
* a, an, the, is 
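
As a rough sketch of stop-word removal, the snippet below filters tokens against a small hand-picked set. The set `STOP_WORDS` and the helper `remove_stop_words` are hypothetical; library lists such as NLTK's are much longer. Tokens are lowercased before the comparison so that a capitalised stop word like _A_ is also dropped.

```python
# Small hand-picked stop-word set for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "at"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = ["A", "paragraph", "is", "a", "group", "of", "at", "least", "five", "sentences"]
print(remove_stop_words(tokens))
# ['paragraph', 'group', 'least', 'five', 'sentences']
```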

In [1]:
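# We will use NLTK to demonstrate stemming.
# One-time setup if the data is missing: nltk.download('punkt') and nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').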
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

paragraph = """Paragraphs are the building blocks of papers. Many students define paragraphs \
in terms of length. A paragraph is a group of at least five sentences. Paragraph \
is half a page long, etc."""

In [16]:
# Check the list of English stop words
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
#generate sentences from the paragraph
sentences = nltk.sent_tokenize(paragraph)
print(sentences)

['Paragraphs are the building blocks of papers.', 'Many students define paragraphs in terms of length.', 'A paragraph is a group of at least five sentences.', 'Paragraph is half a page long, etc.']


In [12]:
# Initialise the stemmer, then stem each word and drop stop words
stemmer = PorterStemmer()
# Build the stop-word set once instead of on every loop iteration
stop_words = set(stopwords.words('english'))
stem_sentences = []

for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print("Words before stemming : ", words)

    stem_words = []
    for word in words:
        # Note: this check is case-sensitive, so a capitalised stop word such as 'A' is kept
        if word not in stop_words:
            stem_word = stemmer.stem(word)
            stem_words.append(stem_word)

    stem_sentence = ' '.join(stem_words)
    stem_sentences.append(stem_sentence)

    print("Words after stemming : ", stem_words)

print("Sentences after stemming : ", stem_sentences)

Words before stemming :  ['Paragraphs', 'are', 'the', 'building', 'blocks', 'of', 'papers', '.']
Words after stemming :  ['paragraph', 'build', 'block', 'paper', '.']
Words before stemming :  ['Many', 'students', 'define', 'paragraphs', 'in', 'terms', 'of', 'length', '.']
Words after stemming :  ['mani', 'student', 'defin', 'paragraph', 'term', 'length', '.']
Words before stemming :  ['A', 'paragraph', 'is', 'a', 'group', 'of', 'at', 'least', 'five', 'sentences', '.']
Words after stemming :  ['A', 'paragraph', 'group', 'least', 'five', 'sentenc', '.']
Words before stemming :  ['Paragraph', 'is', 'half', 'a', 'page', 'long', ',', 'etc', '.']
Words after stemming :  ['paragraph', 'half', 'page', 'long', ',', 'etc', '.']
Sentences after stemming :  ['paragraph build block paper .', 'mani student defin paragraph term length .', 'A paragraph group least five sentenc .', 'paragraph half page long , etc .']


**Notes :**

* Stemming can produce stems that are not real words and are hard for humans to read (for example, _mani_, _defin_ and _sentenc_ in the output above).
* This is counterintuitive when the task needs human-readable words as input, as in NLU, chatbots or text generation.
* Lemmatization solves this issue, at the cost of more complexity and processing time.
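
The contrast in the last note can be sketched without any library. Both helpers below are hypothetical: `crude_stem` strips a suffix by rule, while `toy_lemmatize` looks the word up in a tiny hand-written table that stands in for the full lexicon a real lemmatizer would consult. That lexicon is where the extra complexity and time come from.

```python
def crude_stem(word):
    """Rule-based: strip a known suffix; the result may not be a real word."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a full lexicon.
LEMMAS = {"sentences": "sentence", "universities": "university", "data": "datum"}


def toy_lemmatize(word):
    """Dictionary-based: return the listed base form, else the word unchanged."""
    return LEMMAS.get(word, word)


for w in ["sentences", "universities", "data"]:
    print(f"{w}: stem={crude_stem(w)!r}, lemma={toy_lemmatize(w)!r}")
# sentences: stem='sentenc', lemma='sentence'
# universities: stem='universiti', lemma='university'
# data: stem='data', lemma='datum'
```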