Stemming:
Stemming reduces a word to a stem by stripping affixes, most often suffixes.
Stemming algorithms are typically rule-based, which makes them fast, but the stem they produce is not always a valid word. For example, the Porter stemmer reduces "running" to "run" but reduces "flies" to "fli".
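
To make the rule-based idea concrete, here is a minimal sketch of a suffix stripper. It is a toy with three hand-picked rules, not the Porter algorithm, and the name `toy_stem` is hypothetical. It only illustrates how blunt rules can yield a real word in one case and a non-word in another.

```python
def toy_stem(word: str) -> str:
    """Strip a few common suffixes. A toy illustration, not the Porter algorithm."""
    word = word.lower()
    # Rule 1: "...ies" -> "...i" (e.g. "flies" -> "fli", which is not a real word)
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"
    # Rule 2: drop "ing", then undouble a trailing consonant ("runn" -> "run")
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    # Rule 3: drop a plural "s", but leave "ss" endings alone
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


for w in ["running", "flies", "words"]:
    print(f"{w} : {toy_stem(w)}")
```

A real stemmer such as Porter's applies many more rules in ordered phases, but the trade-off is the same: speed and simplicity in exchange for stems that may not be dictionary words.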

Lemmatization:
Lemmatization, on the other hand, reduces a word to its canonical dictionary form, or lemma.
Unlike stemming, lemmatization takes the word's context and part of speech (POS) into account. It looks the word up in a lexicon (such as WordNet) and applies morphological analysis to determine the lemma, so the result is a valid word in the language.
For example, "ran" is lemmatized to "run" when it is treated as a verb, and "better" is lemmatized to "good" when it is treated as an adjective.

In [17]:
!pip install nltk
!pip install python-docx
import nltk
import numpy as np
import pandas as pd



In [18]:
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [19]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.stem import PorterStemmer,WordNetLemmatizer
from nltk import pos_tag
from docx import Document
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
# sentence1 = "Stemming and lemmatization are different techniques used to reduce words to their root form, but they produce varying results. Lemmatization is better than stemming"
# sentence1 = docx2txt.process('asgn1.docx')

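# Read both input documents; the context managers close the files afterwards
with open('hello.txt', "r") as f:
    sentence1 = f.read()
with open('hello2.txt', "r") as f:
    sentence2 = f.read()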
sentence1

'Stemming and lemmatization are different techniques used ran better to reduce words to their root form, but they produce varying results. Lemmatization is better than stemming\n'

In [21]:
import string

def Tokenise(sentence: str):
    """Replace punctuation with spaces, lowercase the text, and split on whitespace."""
    # string.punctuation already contains the brackets []{}()<>
    for char in string.punctuation:
        sentence = sentence.replace(char, " ")
    sentence = sentence.lower()
    tokens = sentence.split()
    return tokens

tokens = Tokenise(sentence1)
print(tokens)
print(len(tokens))
# This simple tokeniser is used here in place of nltk's word_tokenize
token1 = Tokenise(sentence1)
token2 = Tokenise(sentence2)
print(token2)
print(len(token2))

['stemming', 'and', 'lemmatization', 'are', 'different', 'techniques', 'used', 'ran', 'better', 'to', 'reduce', 'words', 'to', 'their', 'root', 'form', 'but', 'they', 'produce', 'varying', 'results', 'lemmatization', 'is', 'better', 'than', 'stemming']
26
['helping', 'others', 'is', 'the', 'best', 'thing', 'in', 'the', 'running', 'world', 'keeps', 'you', 'busy', 'and', 'helping', 'always', 'be', 'grateful']
18


In [22]:
def RemStopWord(token):
    stop_word=set(stopwords.words('english'))
    filtered=[word for word in token if word not in stop_word]
    return filtered

token1=RemStopWord(token1)
token2=RemStopWord(token2)
print(token1)
print(token2)

['stemming', 'lemmatization', 'different', 'techniques', 'used', 'ran', 'better', 'reduce', 'words', 'root', 'form', 'produce', 'varying', 'results', 'lemmatization', 'better', 'stemming']
['helping', 'others', 'best', 'thing', 'running', 'world', 'keeps', 'busy', 'helping', 'always', 'grateful']


In [23]:
pos_tag_list=pos_tag(tokens)
pos_tag_list
# VBG: Verb, Gerund/Present Participle
# NN: Noun, Singular or Mass
# JJ: Adjective
# JJR: Adjective, Comparative
# NNS: Noun, Plural
# VBD: Verb, Past Tense
# VBN: Verb, Past Participle
# VB: Verb, Base Form
# VBP: Verb, Non-3rd Person Singular Present
# VBZ: Verb, 3rd Person Singular Present
# RBR: Adverb, Comparative

[('stemming', 'VBG'),
 ('and', 'CC'),
 ('lemmatization', 'NN'),
 ('are', 'VBP'),
 ('different', 'JJ'),
 ('techniques', 'NNS'),
 ('used', 'VBN'),
 ('ran', 'VBD'),
 ('better', 'RBR'),
 ('to', 'TO'),
 ('reduce', 'VB'),
 ('words', 'NNS'),
 ('to', 'TO'),
 ('their', 'PRP$'),
 ('root', 'JJ'),
 ('form', 'NN'),
 ('but', 'CC'),
 ('they', 'PRP'),
 ('produce', 'VBP'),
 ('varying', 'VBG'),
 ('results', 'NNS'),
 ('lemmatization', 'NN'),
 ('is', 'VBZ'),
 ('better', 'JJR'),
 ('than', 'IN'),
 ('stemming', 'VBG')]

In [24]:
# Stemming with the Porter stemmer
stemmer = PorterStemmer()
stemmed = []
for w in tokens:
    stem = stemmer.stem(w)  # stem each token once and reuse the result
    print(f"{w} : {stem}")
    stemmed.append(stem)

stemming : stem
and : and
lemmatization : lemmat
are : are
different : differ
techniques : techniqu
used : use
ran : ran
better : better
to : to
reduce : reduc
words : word
to : to
their : their
root : root
form : form
but : but
they : they
produce : produc
varying : vari
results : result
lemmatization : lemmat
is : is
better : better
than : than
stemming : stem


In [25]:
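lemmatizer = WordNetLemmatizer()

# lemmatize() defaults to pos='n' (noun), so verbs and adjectives
# such as "ran" and "better" come back unchanged here
for w in tokens:
    print(f"{w} : {lemmatizer.lemmatize(w)}")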

stemming : stemming
and : and
lemmatization : lemmatization
are : are
different : different
techniques : technique
used : used
ran : ran
better : better
to : to
reduce : reduce
words : word
to : to
their : their
root : root
form : form
but : but
they : they
produce : produce
varying : varying
results : result
lemmatization : lemmatization
is : is
better : better
than : than
stemming : stemming
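
The output above leaves "ran" and "better" unchanged because every token was looked up as a noun. A minimal sketch of POS-aware lemmatization follows, assuming the `wordnet` and `averaged_perceptron_tagger` data downloaded earlier are available. The helper names `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical, and the tag mapping is a common simplification rather than an official NLTK API.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (simplified mapping)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # default to noun, as lemmatize() itself does


def lemmatize_with_pos(tokens):
    """Tag the tokens, then lemmatize each one using its mapped POS."""
    # Imported here so the mapping above can be used without NLTK installed
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        (word, lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))
        for word, tag in pos_tag(tokens)
    ]
```

Calling `lemmatize_with_pos(tokens)` on the token list from this notebook would look up "ran" (tagged VBD) as a verb. The result for each occurrence of "better" depends on whether the tagger labels it as an adjective (JJR) or an adverb (RBR).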


In [37]:
def CalcTF(tokens):
    term_freq = {}
    for word in tokens:
        if word not in term_freq:
            term_freq[word]=tokens.count(word)/len(tokens)
            
    return term_freq

CalcTF(tokens)
tk1=CalcTF(token1)
tk2=CalcTF(token2)
# TF(t, d) = (occurrences of t in d) / (total number of tokens in d)

# IDF(t, D) = log(|D| / number of documents in D that contain term t),
# where |D| is the number of documents in the corpus

In [38]:
import math


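def calculate_idf(docList):
    # Tokenise each document once and reuse the results
    tokenised = [Tokenise(d) for d in docList]
    all_tokens = [t for doc in tokenised for t in doc]
    print(all_tokens)
    nodocs = len(docList)
    doc_sets = [set(doc) for doc in tokenised]
    idf = dict()
    for t in all_tokens:
        if t not in idf:
            # f = number of documents that contain term t
            f = sum(1 for s in doc_sets if t in s)
            idf[t] = math.log(nodocs / f)
    return idf, all_tokens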

In [39]:
idf,all_tokens=calculate_idf([sentence1,sentence2])
idf

['stemming', 'and', 'lemmatization', 'are', 'different', 'techniques', 'used', 'ran', 'better', 'to', 'reduce', 'words', 'to', 'their', 'root', 'form', 'but', 'they', 'produce', 'varying', 'results', 'lemmatization', 'is', 'better', 'than', 'stemming', 'helping', 'others', 'is', 'the', 'best', 'thing', 'in', 'the', 'running', 'world', 'keeps', 'you', 'busy', 'and', 'helping', 'always', 'be', 'grateful']


{'stemming': 0.6931471805599453,
 'and': 0.0,
 'lemmatization': 0.6931471805599453,
 'are': 0.6931471805599453,
 'different': 0.6931471805599453,
 'techniques': 0.6931471805599453,
 'used': 0.6931471805599453,
 'ran': 0.6931471805599453,
 'better': 0.6931471805599453,
 'to': 0.6931471805599453,
 'reduce': 0.6931471805599453,
 'words': 0.6931471805599453,
 'their': 0.6931471805599453,
 'root': 0.6931471805599453,
 'form': 0.6931471805599453,
 'but': 0.6931471805599453,
 'they': 0.6931471805599453,
 'produce': 0.6931471805599453,
 'varying': 0.6931471805599453,
 'results': 0.6931471805599453,
 'is': 0.0,
 'than': 0.6931471805599453,
 'helping': 0.6931471805599453,
 'others': 0.6931471805599453,
 'the': 0.6931471805599453,
 'best': 0.6931471805599453,
 'thing': 0.6931471805599453,
 'in': 0.6931471805599453,
 'running': 0.6931471805599453,
 'world': 0.6931471805599453,
 'keeps': 0.6931471805599453,
 'you': 0.6931471805599453,
 'busy': 0.6931471805599453,
 'always': 0.6931471805599453,
 'be

In [43]:
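# TF-IDF
# Note: a word found in tk1 takes its score from document 1 only; tk2 is
# consulted only for words absent from tk1. Stop words are skipped because
# they were removed before tk1 and tk2 were computed.
tfidf = {}
for word in all_tokens:
    if word not in tfidf:
        if word in tk1:
            tfidf[word] = tk1[word] * idf[word]
        elif word in tk2:
            tfidf[word] = tk2[word] * idf[word]
for key, vl in tfidf.items():
    print(f"{key} : {vl}")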

stemming : 0.08154672712469944
lemmatization : 0.08154672712469944
different : 0.04077336356234972
techniques : 0.04077336356234972
used : 0.04077336356234972
ran : 0.04077336356234972
better : 0.08154672712469944
reduce : 0.04077336356234972
words : 0.04077336356234972
root : 0.04077336356234972
form : 0.04077336356234972
produce : 0.04077336356234972
varying : 0.04077336356234972
results : 0.04077336356234972
helping : 0.12602676010180824
others : 0.06301338005090412
best : 0.06301338005090412
thing : 0.06301338005090412
running : 0.06301338005090412
world : 0.06301338005090412
keeps : 0.06301338005090412
busy : 0.06301338005090412
always : 0.06301338005090412
grateful : 0.06301338005090412
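
The loop above keeps a single score per word, so a word that occurs in both documents is scored from the first document only. TF-IDF is normally computed per (term, document) pair. The following is a minimal sketch of that variant, using the same TF and IDF definitions as this notebook; the function name `tf_idf_per_document` and the small sample lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf_per_document(docs):
    """docs is a list of token lists; returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


docs = [
    ["helping", "others", "helping", "best"],
    ["best", "thing", "world"],
]
scores = tf_idf_per_document(docs)
for i, doc_scores in enumerate(scores):
    print(i, doc_scores)
```

In this sample, "best" appears in both documents, so its IDF is log(2/2) = 0 and it scores zero in each one, while "helping" scores (2/4) * log(2) in the first document only.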


In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Stop-word-filtered tokens of the first document
tokens_array = token1

# Join each token list back into a single string, one per document
document = " ".join(tokens_array)
document2 = " ".join(token2)

# Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer on both documents and transform them into a sparse TF-IDF matrix
tfidf_representation = tfidf_vectorizer.fit_transform([document,document2])

# Get the feature names (terms) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the TF-IDF representation
print("TF-IDF representation:")
print(tfidf_representation)

# Print the feature names (terms)
print("\nFeature names:")
print(feature_names)


TF-IDF representation:
  (0, 14)	0.2085144140570748
  (0, 21)	0.2085144140570748
  (0, 11)	0.2085144140570748
  (0, 5)	0.2085144140570748
  (0, 15)	0.2085144140570748
  (0, 22)	0.2085144140570748
  (0, 13)	0.2085144140570748
  (0, 2)	0.4170288281141496
  (0, 12)	0.2085144140570748
  (0, 20)	0.2085144140570748
  (0, 18)	0.2085144140570748
  (0, 4)	0.2085144140570748
  (0, 9)	0.4170288281141496
  (0, 17)	0.4170288281141496
  (1, 6)	0.2773500981126146
  (1, 0)	0.2773500981126146
  (1, 3)	0.2773500981126146
  (1, 8)	0.2773500981126146
  (1, 23)	0.2773500981126146
  (1, 16)	0.2773500981126146
  (1, 19)	0.2773500981126146
  (1, 1)	0.2773500981126146
  (1, 10)	0.2773500981126146
  (1, 7)	0.5547001962252291

Feature names:
['always' 'best' 'better' 'busy' 'different' 'form' 'grateful' 'helping'
 'keeps' 'lemmatization' 'others' 'produce' 'ran' 'reduce' 'results'
 'root' 'running' 'stemming' 'techniques' 'thing' 'used' 'varying' 'words'
 'world']
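
The scikit-learn values differ from the hand-computed ones because `TfidfVectorizer`, with its default settings, uses raw counts as term frequency, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each document row to unit L2 length. The sketch below reproduces that calculation in plain Python on the two filtered token lists printed earlier. The function name `sklearn_style_tfidf` is hypothetical, and the sketch skips the vectorizer's own tokenisation step.

```python
import math
from collections import Counter


def sklearn_style_tfidf(docs):
    """Raw count * smoothed IDF, then L2-normalise each document."""
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
        }
        # Guard against an empty document, whose norm would be zero
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return rows


doc1 = ['stemming', 'lemmatization', 'different', 'techniques', 'used', 'ran',
        'better', 'reduce', 'words', 'root', 'form', 'produce', 'varying',
        'results', 'lemmatization', 'better', 'stemming']
doc2 = ['helping', 'others', 'best', 'thing', 'running', 'world', 'keeps',
        'busy', 'helping', 'always', 'grateful']

rows = sklearn_style_tfidf([doc1, doc2])
print(rows[0]["different"], rows[0]["stemming"])
print(rows[1]["busy"], rows[1]["helping"])
```

Because no term is shared between the two filtered documents, every term has the same IDF and it cancels out in the normalisation. A term that occurs once in the first document therefore scores 1/sqrt(23), about 0.2085, and "helping" scores 2/sqrt(13), about 0.5547, in the second, in line with the matrix printed above.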
