<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Stemming and Lemmatization</p>

In [39]:
from typing import List

### Definition of Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root.
Put more bluntly, it crudely chops off the end of a word to leave only the base.

Examples of stemming:<br>
<table>
    <tr>
        <th>id</th>
        <th>Words</th>
        <th>Stem</th>
    </tr>
    <tr>
        <td>1</td>
        <td>Stemming/Stemmed</td>
        <td>Stem</td>
    </tr>
    <tr>
        <td>2</td>
        <td>Electricity/Electrical</td>
        <td>Elect</td>
    </tr>
    <tr>
        <td>3</td>
        <td>Berries/Berry</td>
        <td>Berri</td>
    </tr>
    <tr>
        <td>4</td>
        <td>Meanness/Meaning</td>
        <td>Mean</td>
    </tr>
</table>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Stemming uses very crude rules, so it isn't perfect. In example 4 above, the two words are not closely related, yet both are reduced to the same stem, "mean".
   </font>
</p>
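
To make "crude rules" concrete, here is a minimal sketch of a suffix-stripping stemmer. The function name `crude_stem`, the suffix list and the minimum stem length are all made up for illustration; real stemmers such as Porter's apply many more rules and conditions.

```python
# Toy suffix stripper: an illustration only, not the Porter algorithm.
# The suffix list and the minimum stem length are arbitrary assumptions.
SUFFIXES = ("ness", "ing", "ed", "s")


def crude_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["meanness", "meaning", "stemmed", "berries"]:
    print(word, "->", crude_stem(word))
```

Both `meanness` and `meaning` come out as `mean`, the same over-merging seen in example 4 of the table.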

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Stemming helps reduce the size of a corpus vocabulary, because several inflected forms collapse into a single token. Commonly used stemmers, all available in NLTK, include:
       <ol>
           <li>Porter Stemmer</li>
           <li>Snowball Stemmer</li>
           <li>Lancaster Stemmer</li>
           <li>Regex-Based Stemmer</li>
       </ol>
   </font>
</p>
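
The four stemmers listed above can be tried side by side. This is a rough sketch, assuming the classes exported by `nltk.stem`; the sample words and the regular expression passed to `RegexpStemmer` are illustrative choices, not recommended settings.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, RegexpStemmer, SnowballStemmer

# The regular expression and min length below are illustrative assumptions.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
    "regexp": RegexpStemmer("ing$|s$|ed$", min=4),
}

words = ["running", "berries", "electricity"]

# Collect the stem each stemmer produces for each word.
stems = {name: [stemmer.stem(word) for word in words] for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print(f"{name:10s} {result}")
```

None of these stemmers needs a corpus download, so the comparison runs with a plain NLTK install. The regex-based stemmer only removes what its pattern matches, which is why it tends to leave doubled consonants behind.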

### Demo of Stemming

In [12]:
import nltk

In [13]:
ps = nltk.PorterStemmer()

In [14]:
print(ps.stem("grows"))
print(ps.stem("growing"))
print(ps.stem("grow"))

grow
grow
grow


In [15]:
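# "running" is stemmed to "run", while the noun "runner" is left unchanged, so the two stay distinct.
# The stemmer does this with suffix rules; it has no knowledge of parts of speech.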
print(ps.stem("run"))
print(ps.stem("running"))
print(ps.stem("runner"))

run
run
runner


### Reading Raw Text

In [16]:
import pandas as pd
import string
import re

nltk.data.path.append(r"D:\Artificial_Intelligence\nat_lang_proc\nltk_data")
stopwords = nltk.corpus.stopwords.words("english")

data_df = pd.read_csv(r"D:/Artificial_Intelligence/nat_lang_proc/data/SMSSpamCollection.tsv", delimiter="\t", header=None)
data_df.columns = ["labels", "body_text"]
data_df.head()

Unnamed: 0,labels,body_text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [17]:
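def clean_data(text):
    # Drop punctuation characters, then split the remainder into word tokens
    without_punc = "".join([char for char in text if char not in string.punctuation])
    tokenized_text = re.findall(r"\w+", without_punc)
    # Remove English stopwords (the caller is expected to lowercase the text first)
    text_nostopwords = [word for word in tokenized_text if word not in stopwords]
    return text_nostopwords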
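
To see the three cleaning steps on a single message, here is a self-contained sketch. A tiny hand-picked stopword set stands in for the NLTK list so the snippet runs without the corpus; `DEMO_STOPWORDS` and `clean_data_demo` are hypothetical names used only for this illustration.

```python
import re
import string

# Stand-in for nltk.corpus.stopwords.words("english"); an assumption for this demo.
DEMO_STOPWORDS = {"i", "a", "on", "with", "have", "will"}


def clean_data_demo(text: str) -> list:
    """Same three steps as clean_data: strip punctuation, tokenize, drop stopwords."""
    without_punc = "".join(char for char in text if char not in string.punctuation)
    tokens = re.findall(r"\w+", without_punc)
    return [word for word in tokens if word not in DEMO_STOPWORDS]


message = "I HAVE A DATE ON SUNDAY WITH WILL!!"
cleaned = clean_data_demo(message.lower())
print(cleaned)
```

Because the text is lowercased first, the name "WILL" becomes the stopword "will" and is dropped along with the other function words, leaving only `date` and `sunday`.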

In [18]:
data_df["text_no_stopwords"] = data_df["body_text"].apply(lambda x: clean_data(x.lower()))

In [19]:
data_df.head()

Unnamed: 0,labels,body_text,text_no_stopwords
0,ham,I've been searching for the right words to tha...,"[ive, searching, right, words, thank, breather..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, 2, wkly, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, goes, usf, lives, around, t..."
3,ham,Even my brother is not like to speak with me. ...,"[even, brother, like, speak, treat, like, aids..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]"


### Stemming Text

In [20]:
def stem_text(words_list):
    text = [ps.stem(word) for word in words_list]
    return text

In [21]:
data_df["stem_words"] = data_df["text_no_stopwords"].apply(stem_text)

In [22]:
data_df.head()

Unnamed: 0,labels,body_text,text_no_stopwords,stem_words
0,ham,I've been searching for the right words to tha...,"[ive, searching, right, words, thank, breather...","[ive, search, right, word, thank, breather, pr..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entri, 2, wkli, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, goe, usf, live, around, tho..."
3,ham,Even my brother is not like to speak with me. ...,"[even, brother, like, speak, treat, like, aids...","[even, brother, like, speak, treat, like, aid,..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"
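
One way to check how much stemming shrinks the vocabulary is to count distinct tokens before and after. The helper below is a minimal sketch; `vocab_size` is a hypothetical name, and the sample lists simply reuse the words and Porter stems printed in the demo cells earlier in this notebook.

```python
from typing import Iterable, List


def vocab_size(token_lists: Iterable[List[str]]) -> int:
    """Count distinct tokens across an iterable of token lists."""
    return len({token for tokens in token_lists for token in tokens})


# Tokens and their Porter stems, copied from the demo cells earlier in this notebook.
raw_tokens = [["grows", "growing", "grow"], ["run", "running", "runner"]]
stemmed_tokens = [["grow", "grow", "grow"], ["run", "run", "runner"]]

print(vocab_size(raw_tokens), "->", vocab_size(stemmed_tokens))
```

The same helper should work on the DataFrame columns, for example `vocab_size(data_df["text_no_stopwords"])` against `vocab_size(data_df["stem_words"])`, since iterating over a Series of lists yields one token list per row.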


### Lemmatization

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single term.
The goal of lemmatization is to reduce a word to its dictionary form, also called a lemma. It uses vocabulary analysis to remove inflectional endings while still returning a real word. By linking words with similar meaning to one term, it makes tools such as chatbots and search engines more effective and accurate.
   </font>
</p>


#### Stemming vs Lemmatization

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       <ul>
            <li>The goal of both is to condense derived words into their base forms.</li>
            <li>Stemming is typically faster, as it simply chops off the end of the word using heuristics, without any understanding of
                the context in which the word is used.</li>
            <li>Lemmatization is typically more accurate, as it uses more informed analysis to group words with similar meaning based on the context around the word.</li>
        </ul>
   </font>
</p>
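
The role of context can be sketched with a toy lookup table. Everything here (`TOY_LEMMAS`, `toy_lemmatize`, the part-of-speech labels) is hypothetical and far smaller than a real lexical database such as WordNet; it only illustrates why a lemmatizer can map the same surface form to different lemmas, something a suffix rule cannot do.

```python
# Toy lemma dictionary keyed by (word, part of speech); hypothetical data for illustration.
TOY_LEMMAS = {
    ("lives", "noun"): "life",
    ("lives", "verb"): "live",
    ("geese", "noun"): "goose",
    ("went", "verb"): "go",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("lives", "noun"))    # the plural of "life"
print(toy_lemmatize("lives", "verb"))    # third person of "live"
print(toy_lemmatize("meaning", "noun"))  # not in the table, returned as is
```

NLTK's `WordNetLemmatizer.lemmatize` follows the same idea: it accepts an optional `pos` argument and treats the word as a noun when that argument is omitted.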

In [23]:
wn = nltk.WordNetLemmatizer()

In [33]:
dir(wn)[-5:]

['__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'lemmatize']

In [34]:
# The two words have different meanings, but the stemmer reduces both to "mean"
print(ps.stem("meanness"))
print(ps.stem("meaning"))

mean
mean


In [35]:
# The lemmatizer keeps the two words distinct
print(wn.lemmatize("meanness"))
print(wn.lemmatize("meaning"))

meanness
meaning


In [36]:
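# The stemmer treats the irregular plural as an unrelated word
print(ps.stem("goose"))
print(ps.stem("geese"))

goos
gees


In [37]:
# The lemmatizer maps the irregular plural back to its singular form
print(wn.lemmatize("goose"))
print(wn.lemmatize("geese"))

goose
goose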


#### Applying lemmatization

In [40]:
def lemmatization(tokenized_text: List) -> List:
    """
    :param tokenized_text: Tokens list
    :type tokenized_text: List
    :returns: A list of lemmatized tokens
    :rtype: List
    """
    text = [wn.lemmatize(word) for word in tokenized_text]
    return text

In [43]:
data_df["lemmatized_words"] = data_df["text_no_stopwords"].apply(lemmatization)
data_df

Unnamed: 0,labels,body_text,text_no_stopwords,stem_words,lemmatized_words
0,ham,I've been searching for the right words to tha...,"[ive, searching, right, words, thank, breather...","[ive, search, right, word, thank, breather, pr...","[ive, searching, right, word, thank, breather,..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entri, 2, wkli, comp, win, fa, cup, fin...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, goe, usf, live, around, tho...","[nah, dont, think, go, usf, life, around, though]"
3,ham,Even my brother is not like to speak with me. ...,"[even, brother, like, speak, treat, like, aids...","[even, brother, like, speak, treat, like, aid,...","[even, brother, like, speak, treat, like, aid,..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]","[date, sunday]"
...,...,...,...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u...,"[2nd, time, tried, 2, contact, u, u, 750, poun...","[2nd, time, tri, 2, contact, u, u, 750, pound,...","[2nd, time, tried, 2, contact, u, u, 750, poun..."
5564,ham,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]","[ü, b, go, esplanad, fr, home]","[ü, b, going, esplanade, fr, home]"
5565,ham,"Pity, * was in mood for that. So...any other s...","[pity, mood, soany, suggestions]","[piti, mood, soani, suggest]","[pity, mood, soany, suggestion]"
5566,ham,The guy did some bitching but I acted like i'd...,"[guy, bitching, acted, like, id, interested, b...","[guy, bitch, act, like, id, interest, buy, som...","[guy, bitching, acted, like, id, interested, b..."
