# Question 2

using Data_1.txt

## Explain the importance of stemming in text analytics

In English, a word often takes several forms besides its base form. Through inflectional morphology, a word changes shape depending on its grammatical role in the sentence, for example `display`, `displays` and `displayed`. The base word and its inflected forms share the same core meaning and differ only in grammatical context, such as past versus present tense. Stemming reduces inflected words to a common base form, which normalizes the text regardless of how each word was inflected. This helps information retrieval systems such as search engines treat the different forms of a word as the same term, so that they can return relevant and comparable results.

https://towardsdatascience.com/stemming-vs-lemmatization-in-nlp-dea008600a0
https://www.analyticsvidhya.com/blog/2021/11/an-introduction-to-stemming-in-natural-language-processing/
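
To make the retrieval argument concrete, here is a minimal sketch of a search that matches on stems instead of exact words. The `simple_stem` and `search` helpers, the three-suffix list and the sample documents are all assumptions made for illustration; this is not a real stemmer, only enough to show why normalization helps matching.

```python
import re

# Hypothetical suffix list, for illustration only (not a complete stemmer).
_SUFFIXES = re.compile(r"(?:ing|ed|s)$")

def simple_stem(word):
    """Lowercase the word and strip one common inflectional suffix."""
    return _SUFFIXES.sub("", word.lower())

def search(query, documents):
    """Return the documents that share at least one stem with the query."""
    query_stems = {simple_stem(w) for w in query.split()}
    return [doc for doc in documents
            if query_stems & {simple_stem(w) for w in doc.split()}]

# Made-up sample documents.
docs = [
    "each stripe represents an instance of a word",
    "the plot displays positional information",
    "we can determine the location",
]

# "displayed" never appears literally, but it shares a stem with "displays".
print(search("displayed", docs))
```

An exact-word search for `displayed` would return nothing here, while the stem-based search finds the document that contains `displays`.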

In [2]:
with open("Data_1.txt", 'r') as f:
    data = f.read()
print(data)

from textblob import TextBlob
blob = TextBlob(data)
words_blob = blob.words
words_blob

It is one thing to automatically detect that a particular word occurs in a text, and to
display some words that appear in the same context. However, we can also determine
the location of a word in the text: how many words from the beginning it appears. This
positional information can be displayed using a dispersion plot. Each stripe represents
an instance of a word, and each row represents the entire text.


WordList(['It', 'is', 'one', 'thing', 'to', 'automatically', 'detect', 'that', 'a', 'particular', 'word', 'occurs', 'in', 'a', 'text', 'and', 'to', 'display', 'some', 'words', 'that', 'appear', 'in', 'the', 'same', 'context', 'However', 'we', 'can', 'also', 'determine', 'the', 'location', 'of', 'a', 'word', 'in', 'the', 'text', 'how', 'many', 'words', 'from', 'the', 'beginning', 'it', 'appears', 'This', 'positional', 'information', 'can', 'be', 'displayed', 'using', 'a', 'dispersion', 'plot', 'Each', 'stripe', 'represents', 'an', 'instance', 'of', 'a', 'word', 'and', 'each', 'row', 'represents', 'the', 'entire', 'text'])

## Demonstrate word stemming using Regular Expression, Porter Stemmer and Lancaster Stemmer and report the output

### Using RegEx

In [3]:
import re

# Inflectional suffixes to strip:
# s, s', 's, ed, ing, en, er, est

# Translate the suffixes into a regular expression anchored at the end of the word
suffixes = r"s$|s'$|'s$|ed$|ing$|en$|er$|est$"
for word in words_blob:
    stem = re.sub(suffixes, '', word)
    print(f'{word.ljust(15)} ==> {stem}')


It              ==> It
is              ==> i
one             ==> one
thing           ==> th
to              ==> to
automatically   ==> automatically
detect          ==> detect
that            ==> that
a               ==> a
particular      ==> particular
word            ==> word
occurs          ==> occur
in              ==> in
a               ==> a
text            ==> text
and             ==> and
to              ==> to
display         ==> display
some            ==> some
words           ==> word
that            ==> that
appear          ==> appear
in              ==> in
the             ==> the
same            ==> same
context         ==> context
However         ==> Howev
we              ==> we
can             ==> can
also            ==> also
determine       ==> determine
the             ==> the
location        ==> location
of              ==> of
a               ==> a
word            ==> word
in              ==> in
the             ==> the
text            ==> text
how             ==> how
m

The RegEx approach does not always stem words correctly. For example, `is` was stemmed to `i`, which is wrong, and `thing` was stemmed to `th`, which has no meaning. This happens because the regular expression only matches a pattern at the end of the word. It does not check whether the matched characters are really a suffix or whether the result is a proper word.
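
One possible way to reduce these errors, sketched below, is to reject a stem that becomes too short. The `guarded_stem` helper and the threshold of three characters are assumptions chosen for illustration. It is only a heuristic and still does not check that the result is a real word.

```python
import re

# Same suffixes as above, anchored at the end of the word.
SUFFIX_RE = re.compile(r"(?:s'|'s|ing|est|ed|en|er|s)$")

# Assumed threshold: stems shorter than this are rejected.
MIN_STEM_LEN = 3

def guarded_stem(word, min_len=MIN_STEM_LEN):
    """Strip a suffix only if the remaining stem is at least min_len characters."""
    stem = SUFFIX_RE.sub("", word)
    return stem if len(stem) >= min_len else word

for word in ["is", "thing", "words", "occurs", "However"]:
    print(f"{word.ljust(15)} ==> {guarded_stem(word)}")
```

With this guard `is` and `thing` are left alone, but `However` is still cut to `Howev`, so the length check only removes the most obvious mistakes.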

### Using Porter Stemmer

In [4]:
from nltk.stem import PorterStemmer

p = PorterStemmer()
for word in words_blob:
    stem = p.stem(word)
    print(f'{word.ljust(15)} ==> {stem}')

It              ==> it
is              ==> is
one             ==> one
thing           ==> thing
to              ==> to
automatically   ==> automat
detect          ==> detect
that            ==> that
a               ==> a
particular      ==> particular
word            ==> word
occurs          ==> occur
in              ==> in
a               ==> a
text            ==> text
and             ==> and
to              ==> to
display         ==> display
some            ==> some
words           ==> word
that            ==> that
appear          ==> appear
in              ==> in
the             ==> the
same            ==> same
context         ==> context
However         ==> howev
we              ==> we
can             ==> can
also            ==> also
determine       ==> determin
the             ==> the
location        ==> locat
of              ==> of
a               ==> a
word            ==> word
in              ==> in
the             ==> the
text            ==> text
how             ==> how
many   

Porter Stemmer works better than RegEx: words like `is` and `thing` are not mistakenly stemmed as they were with RegEx. It is also easier to use, because we do not need to define the inflectional suffixes manually. It also stems more words, such as `automatically` (`automat`), `location` (`locat`) and `information`.
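
As a small illustration of how the Porter output can be used, the sketch below groups tokens that share the same stem. The `group_by_stem` helper and the short token list are assumptions for illustration; the stems it relies on are the ones shown in the output above.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

def group_by_stem(tokens, stemmer):
    """Map each stem to the sorted list of distinct tokens that reduce to it."""
    groups = defaultdict(set)
    for token in tokens:
        groups[stemmer.stem(token)].add(token)
    return {stem: sorted(forms) for stem, forms in groups.items()}

# A few tokens taken from the text above.
tokens = ["word", "words", "occurs", "text"]
groups = group_by_stem(tokens, PorterStemmer())
print(groups["word"])
```

Here `word` and `words` end up in the same group, which is the normalization that stemming is meant to provide.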

### Using Lancaster Stemmer

In [5]:
from nltk.stem import LancasterStemmer
l = LancasterStemmer()
for word in words_blob:
    stem = l.stem(word)
    print(f'{word.ljust(15)} ==> {stem}')

It              ==> it
is              ==> is
one             ==> on
thing           ==> thing
to              ==> to
automatically   ==> autom
detect          ==> detect
that            ==> that
a               ==> a
particular      ==> particul
word            ==> word
occurs          ==> occ
in              ==> in
a               ==> a
text            ==> text
and             ==> and
to              ==> to
display         ==> display
some            ==> som
words           ==> word
that            ==> that
appear          ==> appear
in              ==> in
the             ==> the
same            ==> sam
context         ==> context
However         ==> howev
we              ==> we
can             ==> can
also            ==> also
determine       ==> determin
the             ==> the
location        ==> loc
of              ==> of
a               ==> a
word            ==> word
in              ==> in
the             ==> the
text            ==> text
how             ==> how
many            ==

Like Porter Stemmer, Lancaster Stemmer does not require us to define the suffixes manually. However, compared to Porter Stemmer it is more aggressive and over-stems some words, such as `occurs` (`occ`), `location` (`loc`) and `positional`. These stems are cut so short that they are hard to recognize.
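
To see where Lancaster cuts more than Porter, the sketch below lists only the tokens for which the Lancaster stem is shorter. The `shorter_lancaster_stems` helper is an assumption for illustration, and the sample tokens are taken from the text above.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

def shorter_lancaster_stems(tokens):
    """Return (token, porter_stem, lancaster_stem) where Lancaster's stem is shorter."""
    porter, lancaster = PorterStemmer(), LancasterStemmer()
    rows = []
    for token in tokens:
        p_stem, l_stem = porter.stem(token), lancaster.stem(token)
        if len(l_stem) < len(p_stem):
            rows.append((token, p_stem, l_stem))
    return rows

for token, p_stem, l_stem in shorter_lancaster_stems(
        ["one", "occurs", "some", "location", "word"]):
    print(f"{token.ljust(15)} Porter: {p_stem.ljust(10)} Lancaster: {l_stem}")
```

A shorter stem is not automatically worse, but a long list here is a hint that the stemmer may be too aggressive for the text.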

## Justify the most suitable stemming operation for text analytics. Support your answer using the obtained output.

In [8]:
# comparing the results side by side
print(f"{'Original'.ljust(15)} ==> {'RegEx'.ljust(15)} {'Porter'.ljust(15)} {'Lancaster'.ljust(15)} ")
for word in words_blob:
    print(f'{word.ljust(15)} ==> {re.sub(suffixes, "", word).ljust(15)} {p.stem(word).ljust(15)} {l.stem(word).ljust(15)}')

Original        ==> RegEx           Porter          Lancaster       
It              ==> It              it              it             
is              ==> i               is              is             
one             ==> one             one             on             
thing           ==> th              thing           thing          
to              ==> to              to              to             
automatically   ==> automatically   automat         autom          
detect          ==> detect          detect          detect         
that            ==> that            that            that           
a               ==> a               a               a              
particular      ==> particular      particular      particul       
word            ==> word            word            word           
occurs          ==> occur           occur           occ            
in              ==> in              in              in             
a               ==> a               a          

Porter Stemmer is the most suitable choice. The RegEx approach is limited to the suffixes we list by hand, so it leaves words such as `positional` and `information` unchanged, and it mis-stems short words such as `is` (`i`) and `thing` (`th`). Lancaster Stemmer, on the other hand, sometimes over-stems words until they lose their meaning: `occurs` becomes `occ` and `location` becomes `loc`. Porter Stemmer performs best of the three, although it still makes some stemming errors (for example, `However` becomes `howev`), which is unavoidable.
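
As a rough numeric summary of the comparison, the sketch below tallies how many words each method changed and how many characters it removed. The rows are copied from the visible part of the side-by-side output above, and the `summarize` helper is an assumption for illustration. Characters removed is only a crude proxy for aggressiveness and does not measure correctness.

```python
# Rows copied from the visible part of the side-by-side output above:
# (original, RegEx, Porter, Lancaster)
rows = [
    ("It", "It", "it", "it"),
    ("is", "i", "is", "is"),
    ("one", "one", "one", "on"),
    ("thing", "th", "thing", "thing"),
    ("to", "to", "to", "to"),
    ("automatically", "automatically", "automat", "autom"),
    ("detect", "detect", "detect", "detect"),
    ("that", "that", "that", "that"),
    ("a", "a", "a", "a"),
    ("particular", "particular", "particular", "particul"),
    ("word", "word", "word", "word"),
    ("occurs", "occur", "occur", "occ"),
    ("in", "in", "in", "in"),
]

def summarize(rows):
    """Per method: (number of words changed, total characters removed), ignoring case."""
    names = ("RegEx", "Porter", "Lancaster")
    summary = {}
    for column, name in enumerate(names, start=1):
        changed = [r for r in rows if r[column].lower() != r[0].lower()]
        removed = sum(len(r[0]) - len(r[column]) for r in changed)
        summary[name] = (len(changed), removed)
    return summary

for name, (changed, removed) in summarize(rows).items():
    print(f"{name.ljust(10)} changed {changed} words, removed {removed} characters")
```

On these rows Lancaster changes the most words and removes the most characters, while the words RegEx changes include the two clear mistakes (`is` and `thing`).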