Stemming:
Stemming is the process of reducing words to a stem, base, or root form by stripping suffixes (and sometimes prefixes). Stemming algorithms are typically rule-based, which makes them fast and efficient, but the stem they produce is not always a valid word. For example, the Porter stemmer reduces "running" to "run", but it also reduces "flies" to "fli".
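
To make the rule-based idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy and not the Porter algorithm: the function name `toy_stem` and its handful of rules are assumptions chosen only to illustrate how such rules can yield a non-word like "fli".

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes (illustrative rules only)."""
    word = word.lower()
    if word.endswith("sses"):          # "classes" -> "class"
        return word[:-2]
    if word.endswith("ies"):           # "flies" -> "fli" (not a real word)
        return word[:-2]
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]               # "running" -> "runn"
        # Undo a doubled final consonant: "runn" -> "run"
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "lsz":
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]               # "words" -> "word"
    return word


for w in ["running", "flies", "words"]:
    print(w, "->", toy_stem(w))
# running -> run
# flies -> fli
# words -> word
```

The rules are applied blindly to the spelling, with no dictionary lookup, which is why the result is fast to compute but not guaranteed to be a valid word.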

Lemmatization:
Lemmatization, on the other hand, reduces words to their canonical dictionary form, or lemma.
Unlike stemming, lemmatization takes the word's part of speech (POS) into account. It looks words up in a lexicon (such as WordNet) and applies morphological analysis to determine the lemma, so the result is a valid word in the language.
For example, "ran" is lemmatized to "run" when treated as a verb, and "better" is lemmatized to "good" when treated as an adjective.
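
The lookup idea can be sketched with a tiny hand-written lexicon. Everything here is hypothetical and for illustration only: `TOY_LEXICON` and `toy_lemmatize` are made-up names, and a real lemmatizer would use a full dictionary such as WordNet instead of five entries.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
# POS letters follow the WordNet convention: n=noun, v=verb, a=adjective, r=adverb.
TOY_LEXICON = {
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("better", "r"): "well",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("ran", "v"))      # run
print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("better", "n"))   # better (no noun entry in the toy lexicon)
```

The last call shows why the part of speech matters: the same surface form maps to different lemmas, or to itself, depending on the POS supplied.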

In [21]:
!pip install nltk
import nltk
import numpy as np
import pandas as pd

Defaulting to user installation because normal site-packages is not writeable


In [29]:
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...


True

In [30]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.stem import PorterStemmer,WordNetLemmatizer
from nltk import pos_tag

In [31]:
sentence1 = "Stemming and lemmatization are different techniques used to reduce words to their root form, but they produce varying results. Lemmatization is better than stemming"

In [32]:
import string

def Tokenise(sentence: str):
    """Replace punctuation with spaces, lowercase, and split on whitespace."""
    # string.punctuation already contains []{}()<>, so no extra characters are needed
    for char in string.punctuation:
        sentence = sentence.replace(char, " ")
    sentence = sentence.lower()
    tokens = sentence.split()
    return tokens

tokens=Tokenise(sentence1)
print(tokens)
print(len(tokens))

['stemming', 'and', 'lemmatization', 'are', 'different', 'techniques', 'used', 'to', 'reduce', 'words', 'to', 'their', 'root', 'form', 'but', 'they', 'produce', 'varying', 'results', 'lemmatization', 'is', 'better', 'than', 'stemming']
24


In [33]:
def RemStopWord(token):
    stop_word=set(stopwords.words('english'))
    filtered=[word for word in token if word not in stop_word]
    return filtered

tokens=RemStopWord(tokens)
print(tokens)

['stemming', 'lemmatization', 'different', 'techniques', 'used', 'reduce', 'words', 'root', 'form', 'produce', 'varying', 'results', 'lemmatization', 'better', 'stemming']


In [34]:
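# Tag each token with its Penn Treebank part-of-speech tag.
# Note: stop words were removed above, so the tagger sees less context
# and some tags (for example 'root' as VBP) are questionable.
pos_tag_list=pos_tag(tokens)
pos_tag_list
# Tags that appear in the output:
# VBG: Verb, Gerund/Present Participle
# NN: Noun, Singular or Mass
# JJ: Adjective
# NNS: Noun, Plural
# VBN: Verb, Past Participle
# VB: Verb, Base Form
# VBP: Verb, Non-3rd Person Singular Present
# RBR: Adverb, Comparative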

[('stemming', 'VBG'),
 ('lemmatization', 'NN'),
 ('different', 'JJ'),
 ('techniques', 'NNS'),
 ('used', 'VBN'),
 ('reduce', 'VB'),
 ('words', 'NNS'),
 ('root', 'VBP'),
 ('form', 'NN'),
 ('produce', 'VBP'),
 ('varying', 'VBG'),
 ('results', 'NNS'),
 ('lemmatization', 'NN'),
 ('better', 'RBR'),
 ('stemming', 'NN')]

In [35]:
#stemming
stemmer=PorterStemmer()
for w in tokens:
    print(f"{w} : {stemmer.stem(w)}")

stemming : stem
lemmatization : lemmat
different : differ
techniques : techniqu
used : use
reduce : reduc
words : word
root : root
form : form
produce : produc
varying : vari
results : result
lemmatization : lemmat
better : better
stemming : stem


In [36]:
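# Lemmatization with WordNet. lemmatize() defaults to pos='n' (noun),
# so verbs and adjectives such as 'used', 'varying', and 'better' come back unchanged.
lemmatizer = WordNetLemmatizer()

for w in tokens:
    print(f"{w} : {lemmatizer.lemmatize(w)}")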

stemming : stemming
lemmatization : lemmatization
different : different
techniques : technique
used : used
reduce : reduce
words : word
root : root
form : form
produce : produce
varying : varying
results : result
lemmatization : lemmatization
better : better
stemming : stemming
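
The output above changes only the plural nouns, because no part of speech is passed to `lemmatize()`. One way to use the tags from `pos_tag` is sketched below. The helper names `penn_to_wordnet` and `lemmatize_tagged` are assumptions introduced here, not part of NLTK; the single-letter codes are the POS values WordNet uses ('n', 'v', 'a', 'r').

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # fall back to noun, which is also lemmatize()'s default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, passing the mapped POS to the lemmatizer."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


# Usage with the objects defined in this notebook (requires the WordNet data):
# lemmas = lemmatize_tagged(pos_tag_list, WordNetLemmatizer())
```

The quality of the lemmas then depends on the quality of the tags, so tagging the full sentence before removing stop words is likely to give better results than tagging the filtered token list.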


In [45]:
def CalcTF(tokens):
    term_freq = {}
    for word in tokens:
        if word not in term_freq:
            term_freq[word]=tokens.count(word)/len(tokens)
            
    return term_freq

CalcTF(tokens)
# TF(t, d) = (occurrences of t in d) / (total number of tokens in d)

# IDF(t, D) = log(|D| / number of documents in D that contain t),
# where |D| is the number of documents in the corpus

{'stemming': 0.13333333333333333,
 'lemmatization': 0.13333333333333333,
 'different': 0.06666666666666667,
 'techniques': 0.06666666666666667,
 'used': 0.06666666666666667,
 'reduce': 0.06666666666666667,
 'words': 0.06666666666666667,
 'root': 0.06666666666666667,
 'form': 0.06666666666666667,
 'produce': 0.06666666666666667,
 'varying': 0.06666666666666667,
 'results': 0.06666666666666667,
 'better': 0.06666666666666667}
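
As an alternative sketch, the same term frequencies can be computed in a single pass with `collections.Counter` instead of calling `tokens.count()` once per distinct word. The function name `calc_tf_counter` and the sample list are assumptions made for this example.

```python
from collections import Counter


def calc_tf_counter(tokens):
    """Term frequency per word: occurrences / total tokens, counted in one pass."""
    if not tokens:
        return {}                      # avoid dividing by zero on empty input
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


sample = ["stemming", "lemmatization", "better", "stemming"]
print(calc_tf_counter(sample))
# {'stemming': 0.5, 'lemmatization': 0.25, 'better': 0.25}
```

For a token list as short as the one in this notebook the difference is negligible; the single-pass version mainly matters for long documents.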

In [44]:
def calculateTF_IDF(documents):
    # Treat each sentence of the input text as one document
    documents = sent_tokenize(documents)
    document_map = {}
    document_tf = {}
    unique_words = set()
    word_idf = {}

    for i, document in enumerate(documents):
        tokenizedWords  = Tokenise(document)
        document_map[i] = tokenizedWords

        document_tf[i] = CalcTF(tokenizedWords)

        for word in tokenizedWords:
            unique_words.add(word)

    for word in unique_words:
        count = 0
        for _, tokenedWords in document_map.items():
            if word in tokenedWords:
                count += 1

        # NOTE: this stores the document frequency (number of documents that
        # contain the word), not yet the IDF, which is log(len(documents) / count)
        word_idf[word] = count

    return word_idf, document_tf


word_idf, document_tf = calculateTF_IDF(sentence1)
print(word_idf)
print()
print(document_tf)

{'words': 1, 'results': 1, 'their': 1, 'are': 1, 'produce': 1, 'to': 1, 'stemming': 2, 'varying': 1, 'techniques': 1, 'reduce': 1, 'is': 1, 'and': 1, 'better': 1, 'used': 1, 'different': 1, 'they': 1, 'but': 1, 'form': 1, 'than': 1, 'root': 1, 'lemmatization': 2}

{0: {'stemming': 0.05263157894736842, 'and': 0.05263157894736842, 'lemmatization': 0.05263157894736842, 'are': 0.05263157894736842, 'different': 0.05263157894736842, 'techniques': 0.05263157894736842, 'used': 0.05263157894736842, 'to': 0.10526315789473684, 'reduce': 0.05263157894736842, 'words': 0.05263157894736842, 'their': 0.05263157894736842, 'root': 0.05263157894736842, 'form': 0.05263157894736842, 'but': 0.05263157894736842, 'they': 0.05263157894736842, 'produce': 0.05263157894736842, 'varying': 0.05263157894736842, 'results': 0.05263157894736842}, 1: {'lemmatization': 0.2, 'is': 0.2, 'better': 0.2, 'than': 0.2, 'stemming': 0.2}}
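
The first dictionary printed above holds document frequencies (how many sentences contain each word), so one more step is needed to reach TF-IDF. A minimal sketch of that step follows, assuming the natural logarithm and the unsmoothed formula IDF(t, D) = log(|D| / df(t)). The function name `compute_tf_idf` and the small hand-made inputs are illustrative only.

```python
import math


def compute_tf_idf(doc_freq, document_tf):
    """Combine document frequencies and per-document TF into TF-IDF scores."""
    n_docs = len(document_tf)
    # IDF(t, D) = log(|D| / number of documents containing t)
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    tf_idf = {
        doc_id: {word: tf * idf[word] for word, tf in tf_map.items()}
        for doc_id, tf_map in document_tf.items()
    }
    return idf, tf_idf


# Small hand-made example with two documents
doc_freq = {"stemming": 2, "better": 1, "root": 1}
document_tf = {0: {"stemming": 0.5, "root": 0.5},
               1: {"stemming": 0.5, "better": 0.5}}
idf, tf_idf = compute_tf_idf(doc_freq, document_tf)
print(idf["stemming"])       # 0.0, because the word appears in every document
print(tf_idf[1]["better"])   # 0.5 * log(2), roughly 0.347
```

With the notebook's own variables the call would be `compute_tf_idf(word_idf, document_tf)`. Because the corpus has only two sentences, any word that occurs in both (here "stemming" and "lemmatization") gets an IDF of 0 and therefore a TF-IDF of 0.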
