<a href="https://www.kaggle.com/sid9300/learning-stemming-lemmatization?scriptVersionId=84370057" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

* [Prerequisite](#pre)
    * [Importing Libraries](#importing)
    * [Reading Sample Texts](#reading)
    * [Extracting Tokens](#extracting)
* [Tokenisation](#tokenisation)
    * [Word Tokenisation](#word)
    * [Sentence Tokenisation](#sentence)
    * [Tweet Tokenisation](#tweet)
    * [Custom Tokenisation (using Regex)](#regex)
* [Stemmer](#stemmer) 
    * [Porter Stemmer](#porter)
    * [Snowball Stemmer](#snowball)
* [Lemmatizer](#lemmatizer)
    * [Wordnet Lemmatizer](#wordnet)
* [Conclusion](#conclusion)

## <font color='#4a8bad'>Prerequisite</font>
***
<a id="pre"></a>

#### <font color='#4a8bad'>Importing Libraries</font>
<a id="importing"></a>

In [1]:
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import regexp_tokenize

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

from nltk.stem import WordNetLemmatizer

#### <font color='#4a8bad'>Reading Sample Texts</font>
<a id="reading"></a>

In [2]:
text = "Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire."

print("Text : ", len(text))
print("------------------")
print(text)

Text :  250
------------------
Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire.


In [3]:
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
print("Text : ", len(document))
print("------------------")
print(document)

Text :  113
------------------
At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God.


In [4]:
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"
print("Text : ", len(message))
print("------------------")
print(message)

Text :  117
------------------
i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎


#### <font color='#4a8bad'>Extracting Tokens</font>
<a id="extracting"></a>

In [5]:
tokens = word_tokenize(text.lower())
print("Tokens : ", len(tokens))
print("------------------")
print(tokens)

Tokens :  47
------------------
['very', 'orderly', 'and', 'methodical', 'he', 'looked', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'ticking', 'a', 'sonorous', 'sermon', 'under', 'his', 'flapped', 'newly', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pitted', 'its', 'gravity', 'and', 'longevity', 'against', 'the', 'levity', 'and', 'evanescence', 'of', 'the', 'brisk', 'fire', '.']


## <font color='#4a8bad'>Tokenisation</font>
***
<a id="tokenisation"></a>

#### <font color='#4a8bad'>Word Tokenisation</font>
<a id="word"></a>

NLTK's word tokeniser does more than split on whitespace: it separates punctuation into its own tokens and breaks contractions such as "he'll" into "he" and "'ll". On the other hand, it does not break "o'clock" and keeps it as a single token.

In [6]:
words = word_tokenize(document)
print("Words : ", len(words))
print("------------------")
print(words)

Words :  25
------------------
['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself', '.', 'It', 'looks', 'like', 'religious', 'mania', ',', 'and', 'he', "'ll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God', '.']
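To see what the word tokeniser adds over plain whitespace splitting, the sketch below splits the same document with Python's built-in str.split. It uses only the standard library; the names whitespace_tokens and glued are introduced here for illustration and are not part of the original notebook.

```python
# Same sample text as in the cells above
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."

# Plain whitespace splitting: punctuation stays glued to the neighbouring word
whitespace_tokens = document.split()

print("Words : ", len(whitespace_tokens))
print("------------------")
print(whitespace_tokens)

# Tokens that still carry trailing punctuation or an unsplit contraction
glued = [t for t in whitespace_tokens if t[-1] in ".," or "'ll" in t]
print(glued)
```

Whitespace splitting yields 21 tokens against the 25 produced by word_tokenize, because "myself.", "mania,", "he'll" and "God." are each left as one piece.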


#### <font color='#4a8bad'>Sentence Tokenisation</font>
<a id="sentence"></a>

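A naive way to tokenise by sentence is to split on the period ('.'), but that also splits after abbreviations and inside decimal numbers. NLTK's sentence tokeniser (sent_tokenize) uses the pre-trained Punkt model to decide which punctuation marks actually end a sentence. Let's use it on the document.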

In [7]:
sentences = sent_tokenize(document)
print("Sentences : ", len(sentences))
print("------------------")
print(sentences)

Sentences :  2
------------------
["At nine o'clock I visited him myself.", "It looks like religious mania, and he'll soon think that he himself is God."]
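The limits of splitting on sentence-ending punctuation alone can be sketched with the standard library. The helper naive_sent_split and the sentence in tricky are hypothetical, made up for this illustration; the sketch says nothing about how sent_tokenize itself would split that sentence.

```python
import re

document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
# Hypothetical sentence containing abbreviations, invented for this example
tricky = "Dr. Seward visited him at 9 p.m. He looked calm."


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Works on the sample document, which has no abbreviations
print(naive_sent_split(document))

# Over-splits: "Dr." is cut off as if it were a sentence of its own
print(naive_sent_split(tricky))
```

The naive rule returns the same two sentences for the sample document, but it cuts the second text into three pieces, which is the kind of case a trained sentence tokeniser is meant to handle.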


#### <font color='#4a8bad'>Tweet Tokenisation</font>
<a id="tweet"></a>

A problem with the word tokeniser is that it does not handle emojis, emoticons and other special constructs, such as words with hashtags, well. Emojis are common these days, and they matter in areas like sentiment analysis, where a happy face or a sad face can alone prove to be a really good predictor of the sentiment. Similarly, the word tokeniser breaks a hashtag into two tokens ("#" and the word). Hashtags are used for searching specific topics or photos on social media apps such as Instagram and Facebook, so there you want to keep the hashtag as is.

Let's use NLTK's tweet tokeniser to tokenise this message.

In [8]:
tweets = TweetTokenizer().tokenize(message)
print("Tweets : ", len(tweets))
print("------------------")
print(tweets)

Tweets :  23
------------------
['i', 'recently', 'watched', 'this', 'show', 'called', 'mindhunters', ':)', '.', 'i', 'totally', 'loved', 'it', '😍', '.', 'it', 'was', 'gr8', '<3', '.', '#bingewatching', '#nothingtodo', '😎']
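The idea of keeping hashtags and emoticons whole can be sketched with a hand-written regular expression. This is not how TweetTokenizer is implemented; TWEET_PATTERN is a hypothetical pattern written for this one message, and it covers only a handful of emoticons.

```python
import re

message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

# Alternatives are tried left to right, so the more specific ones come first
TWEET_PATTERN = re.compile(
    r"""
    \#\w+              # hashtags, kept together with their leading symbol
    | [:;]-?[()DP]     # a few western emoticons, with an optional nose
    | <3               # the heart emoticon
    | \w+              # ordinary words and alphanumerics such as gr8
    | [^\w\s]          # any other single non-space symbol (punctuation, emojis)
    """,
    re.VERBOSE,
)

tokens = TWEET_PATTERN.findall(message)
print("Tokens : ", len(tokens))
print("------------------")
print(tokens)
```

On this message the sketch produces the same 23 tokens as the tweet tokeniser above. A general-purpose tokeniser has to cover far more cases (URLs, mentions, multi-character emojis), which is a good reason to prefer the library version.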


#### <font color='#4a8bad'>Custom Tokenisation (using Regex)</font>
<a id="regex"></a>

NLTK also provides a tokeniser that takes a regular expression and returns only the tokens that match the pattern.

Let's use the regular expression tokeniser to extract the hashtags from the message.

In [9]:
# Raw string so that \w reaches the regex engine unchanged
pattern = r"#\w+"

words_regex = regexp_tokenize(message, pattern)
print("Words : ", len(words_regex))
print("------------------")
print(words_regex)

Words :  2
------------------
['#bingewatching', '#nothingtodo']
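For a simple pattern without capturing groups, the same hashtags can be pulled out with the standard library's re.findall, which is a convenient way to prototype a pattern before passing it to regexp_tokenize. This is a minimal sketch; the names hashtags and topics are introduced here for illustration.

```python
import re

message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

# Every non-overlapping match of the pattern becomes one token
hashtags = re.findall(r"#\w+", message)
print(hashtags)

# Drop the leading '#' to keep only the topic names
topics = [tag.lstrip("#") for tag in hashtags]
print(topics)
```

Here re.findall returns the same two hashtags as the cell above, and stripping the "#" leaves the bare topic names for further processing.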


## <font color='#4a8bad'>Stemmer</font>
***
<a id="stemmer"></a>

#### <font color='#4a8bad'>Porter Stemmer</font>
<a id="porter"></a>

In [10]:
stemmer = PorterStemmer()
porter_stemmed = [stemmer.stem(token) for token in tokens]

print("Tokens : ", len(porter_stemmed))
print("------------------")
print(porter_stemmed)

Tokens :  47
------------------
['veri', 'orderli', 'and', 'method', 'he', 'look', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'tick', 'a', 'sonor', 'sermon', 'under', 'hi', 'flap', 'newli', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pit', 'it', 'graviti', 'and', 'longev', 'against', 'the', 'leviti', 'and', 'evanesc', 'of', 'the', 'brisk', 'fire', '.']


#### <font color='#4a8bad'>Snowball Stemmer</font>
<a id="snowball"></a>

In [11]:
stemmer = SnowballStemmer("english")
snowball_stemmed = [stemmer.stem(token) for token in tokens]

print("Tokens : ", len(snowball_stemmed))
print("------------------")
print(snowball_stemmed)

Tokens :  47
------------------
['veri', 'order', 'and', 'method', 'he', 'look', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'tick', 'a', 'sonor', 'sermon', 'under', 'his', 'flap', 'newli', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pit', 'it', 'graviti', 'and', 'longev', 'against', 'the', 'leviti', 'and', 'evanesc', 'of', 'the', 'brisk', 'fire', '.']


## <font color='#4a8bad'>Lemmatizer</font>
***
<a id="lemmatizer"></a>

#### <font color='#4a8bad'>Wordnet Lemmatizer</font>
<a id="wordnet"></a>

In [12]:
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(token) for token in tokens]

print("Tokens : ", len(lemmatized))
print("------------------")
print(lemmatized)

Tokens :  47
------------------
['very', 'orderly', 'and', 'methodical', 'he', 'looked', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'ticking', 'a', 'sonorous', 'sermon', 'under', 'his', 'flapped', 'newly', 'bought', 'waist-coat', ',', 'a', 'though', 'it', 'pitted', 'it', 'gravity', 'and', 'longevity', 'against', 'the', 'levity', 'and', 'evanescence', 'of', 'the', 'brisk', 'fire', '.']


## <font color='#4a8bad'>Conclusion</font>
***
<a id="conclusion"></a>

In [13]:
df = pd.DataFrame({"tokens": tokens, "porter_stemmed" : porter_stemmed, "snowball_stemmed" : snowball_stemmed, "lemmatized" : lemmatized})
# Keep only the rows where at least one method changed the original token
df_new = df[(df.tokens != df.porter_stemmed) | (df.tokens != df.snowball_stemmed) | (df.tokens != df.lemmatized)]

print("Differences : ", len(df_new))
print("------------------")
df_new

Differences :  16
------------------


Unnamed: 0,tokens,porter_stemmed,snowball_stemmed,lemmatized
0,very,veri,veri,very
1,orderly,orderli,order,orderly
3,methodical,method,method,methodical
5,looked,look,look,looked
18,ticking,tick,tick,ticking
20,sonorous,sonor,sonor,sonorous
23,his,hi,his,his
24,flapped,flap,flap,flapped
25,newly,newli,newli,newly
29,as,as,as,a
