Stemming:
Stemming is the process of reducing words to their stem, base, or root form by stripping affixes, most often suffixes.
Stemming algorithms are typically rule-based, which makes them fast, but the result is not always a valid word. For example, the Porter stemmer reduces "running" to "run", but it reduces "flies" to "fli".
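
To make the rule-based idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy with three hand-picked rules, not the Porter algorithm, and the name `toy_stem` is hypothetical. It only illustrates why a stem such as "fli" can come out of purely mechanical rules.

```python
VOWELS = set("aeiou")

def toy_stem(word: str) -> str:
    """Strip a few common English suffixes (toy illustration, not Porter)."""
    word = word.lower()
    # Rule 1: "-ies" becomes "-i", so "flies" turns into the non-word "fli"
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"
    # Rule 2: drop "-ing", then collapse a doubled final consonant ("runn" -> "run")
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
            stem = stem[:-1]
        return stem
    # Rule 3: drop a plural "-s" but leave words ending in "-ss" alone
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for w in ["running", "flies", "words"]:
    print(f"{w} : {toy_stem(w)}")
```

A real stemmer applies many more rules in ordered steps, but the trade-off is the same: no dictionary lookup, so it is fast, and nothing guarantees the output is a real word.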

Lemmatization:
Lemmatization, on the other hand, reduces words to their canonical dictionary form, or lemma.
Unlike stemming, lemmatization takes the word's part of speech (POS) into account. It looks the word up in a lexicon (such as WordNet) and applies morphological analysis to determine the lemma, so the result is a valid word in the language. NLTK's WordNetLemmatizer does not infer the POS itself: it treats a word as a noun unless a POS is passed explicitly.
For example, "ran" is lemmatized to "run" when treated as a verb, and "better" is lemmatized to "good" when treated as an adjective.

In [1]:
!pip install nltk
!pip install python-docx docx2txt scikit-learn
import nltk
import numpy as np
import pandas as pd






In [30]:
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [44]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from docx import Document  # python-docx exposes the Document class from the docx package
import docx2txt
from sklearn.feature_extraction.text import TfidfVectorizer

In [46]:
# sentence1 = "Stemming and lemmatization are different techniques used to reduce words to their root form, but they produce varying results. Lemmatization is better than stemming"
# sentence1 = docx2txt.process('asgn1.docx')

# Read the input text; the context manager closes the file once it has been read
with open('hello.txt', "r") as f:
    sentence1 = f.read()
sentence1

'Stemming and lemmatization are different techniques used ran better to reduce words to their root form, but they produce varying results. Lemmatization is better than stemming\n'

In [47]:
import string

def Tokenise(sentence: str):
    """Replace punctuation with spaces, lowercase the text, and split on whitespace."""
    # string.punctuation already contains the bracket characters []{}()<>
    for char in string.punctuation:
        sentence = sentence.replace(char, " ")
    sentence = sentence.lower()
    tokens = sentence.split()
    return tokens

tokens = Tokenise(sentence1)
print(tokens)
print(len(tokens))

['stemming', 'and', 'lemmatization', 'are', 'different', 'techniques', 'used', 'ran', 'better', 'to', 'reduce', 'words', 'to', 'their', 'root', 'form', 'but', 'they', 'produce', 'varying', 'results', 'lemmatization', 'is', 'better', 'than', 'stemming']
26


In [48]:
def RemStopWord(token):
    """Return the tokens that are not in NLTK's English stop word list."""
    stop_word = set(stopwords.words('english'))
    filtered = [word for word in token if word not in stop_word]
    return filtered

tokens = RemStopWord(tokens)
print(tokens)

['stemming', 'lemmatization', 'different', 'techniques', 'used', 'ran', 'better', 'reduce', 'words', 'root', 'form', 'produce', 'varying', 'results', 'lemmatization', 'better', 'stemming']


In [49]:
pos_tag_list=pos_tag(tokens)
pos_tag_list
# Penn Treebank tags that appear in the output:
# VBG: Verb, Gerund/Present Participle
# NN: Noun, Singular or Mass
# JJ: Adjective
# NNS: Noun, Plural
# VBN: Verb, Past Participle
# VBD: Verb, Past Tense
# VB: Verb, Base Form
# VBP: Verb, Non-3rd Person Singular Present
# RBR: Adverb, Comparative

[('stemming', 'VBG'),
 ('lemmatization', 'NN'),
 ('different', 'JJ'),
 ('techniques', 'NNS'),
 ('used', 'VBN'),
 ('ran', 'VBD'),
 ('better', 'RBR'),
 ('reduce', 'VB'),
 ('words', 'NNS'),
 ('root', 'VBP'),
 ('form', 'NN'),
 ('produce', 'VBP'),
 ('varying', 'VBG'),
 ('results', 'NNS'),
 ('lemmatization', 'NN'),
 ('better', 'RBR'),
 ('stemming', 'NN')]

In [50]:
# Stemming with the Porter stemmer
stemmer = PorterStemmer()
stemmed = []
for w in tokens:
    stem = stemmer.stem(w)  # stem each token once and reuse the result
    print(f"{w} : {stem}")
    stemmed.append(stem)

stemming : stem
lemmatization : lemmat
different : differ
techniques : techniqu
used : use
ran : ran
better : better
reduce : reduc
words : word
root : root
form : form
produce : produc
varying : vari
results : result
lemmatization : lemmat
better : better
stemming : stem


In [51]:
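lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun unless a pos argument is passed,
# which is why verbs such as "ran" and "used" come back unchanged below
for w in tokens:
    print(f"{w} : {lemmatizer.lemmatize(w)}")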

stemming : stemming
lemmatization : lemmatization
different : different
techniques : technique
used : used
ran : ran
better : better
reduce : reduce
words : word
root : root
form : form
produce : produce
varying : varying
results : result
lemmatization : lemmatization
better : better
stemming : stemming
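
The output above leaves "ran" and "used" unchanged because no part of speech was passed to the lemmatizer. One common way to address this is to map the Penn Treebank tags from `pos_tag` to the single-letter POS codes WordNet uses and pass them through the `pos` argument of `lemmatize`. The sketch below assumes that approach; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, not part of NLTK.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter: 'a', 'v', 'r' or 'n'."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # fall back to noun, the lemmatizer's own default

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs with any object that offers lemmatize(word, pos=...)."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]

# The mapping alone, applied to a few of the tags produced earlier
sample = [("used", "VBN"), ("ran", "VBD"), ("better", "RBR"), ("words", "NNS")]
print([(word, penn_to_wordnet(tag)) for word, tag in sample])
```

In this notebook the call would be `lemmatize_tagged(pos_tag_list, lemmatizer)`. The result depends on the tagger: "better" is tagged RBR (adverb) here, so it would be looked up as an adverb rather than as an adjective.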


In [52]:
def CalcTF(tokens):
    """Return each term's count divided by the total number of tokens."""
    term_freq = {}
    for word in tokens:
        if word not in term_freq:
            term_freq[word] = tokens.count(word) / len(tokens)
    return term_freq

CalcTF(tokens)
# Term frequency: TF(t, d) = (occurrences of t in d) / (total number of terms in d)
# Inverse document frequency: IDF(t, D) = log(|D| / number of documents in D that contain t)

{'stemming': 0.11764705882352941,
 'lemmatization': 0.11764705882352941,
 'different': 0.058823529411764705,
 'techniques': 0.058823529411764705,
 'used': 0.058823529411764705,
 'ran': 0.058823529411764705,
 'better': 0.11764705882352941,
 'reduce': 0.058823529411764705,
 'words': 0.058823529411764705,
 'root': 0.058823529411764705,
 'form': 0.058823529411764705,
 'produce': 0.058823529411764705,
 'varying': 0.058823529411764705,
 'results': 0.058823529411764705}
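
The TF and IDF definitions noted in the cell above can be combined into a small from-scratch TF-IDF. This is a minimal sketch that uses the plain, unsmoothed formulas; the three tiny token lists in `docs` are made up for illustration and the function names are hypothetical.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """TF(t, d): occurrences of t in d divided by the number of tokens in d."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def inverse_document_frequency(docs):
    """IDF(t, D): log of |D| over the number of documents that contain t."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}

def tf_idf(docs):
    """Return one {term: TF * IDF} dictionary per document."""
    idf = inverse_document_frequency(docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in docs
    ]

# Hypothetical mini-corpus: each inner list is one already-tokenized document
docs = [
    ["stemming", "lemmatization", "reduce", "words"],
    ["lemmatization", "better", "stemming"],
    ["better", "results"],
]
scores = tf_idf(docs)
for i, doc_scores in enumerate(scores):
    print(i, doc_scores)
```

Terms that occur in fewer documents, such as "words", get a larger IDF than terms shared across documents, such as "stemming". A term that appears in every document would score zero under this unsmoothed formula.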

In [53]:
# def calculateTF_IDF(documents):
#     documents = sent_tokenize(documents)
#     document_map = {}
#     document_tf = {}
#     unique_words = set()
#     word_idf = {}

#     for i, document in enumerate(documents):
#         tokenizedWords  = Tokenise(document)
#         document_map[i] = tokenizedWords

#         document_tf[i] = CalcTF(tokenizedWords)

#         for word in tokenizedWords:
#             unique_words.add(word)

#     for word in unique_words:
#         count = 0
#         for _, tokenedWords in document_map.items():
#             if word in tokenedWords:
#                 count += 1

#         word_idf[word] = count

#     return word_idf, document_tf


# word_idf, document_tf = calculateTF_IDF(sentence1)
# print(word_idf)
# print()
# print(document_tf)
from sklearn.feature_extraction.text import TfidfVectorizer

# Example tokens array (list of strings)
tokens_array = tokens

# Convert the tokens array into a single string (document)
document = " ".join(tokens_array)

# Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to the document and transform the document into TF-IDF representation
tfidf_representation = tfidf_vectorizer.fit_transform([document])

# Get the feature names (terms) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the TF-IDF representation
print("TF-IDF representation:")
print(tfidf_representation)

# Print the feature names (terms)
print("\nFeature names:")
print(feature_names)


TF-IDF representation:
  (0, 7)	0.20851441405707477
  (0, 12)	0.20851441405707477
  (0, 4)	0.20851441405707477
  (0, 2)	0.20851441405707477
  (0, 8)	0.20851441405707477
  (0, 13)	0.20851441405707477
  (0, 6)	0.20851441405707477
  (0, 0)	0.41702882811414954
  (0, 5)	0.20851441405707477
  (0, 11)	0.20851441405707477
  (0, 10)	0.20851441405707477
  (0, 1)	0.20851441405707477
  (0, 3)	0.41702882811414954
  (0, 9)	0.41702882811414954

Feature names:
['better' 'different' 'form' 'lemmatization' 'produce' 'ran' 'reduce'
 'results' 'root' 'stemming' 'techniques' 'used' 'varying' 'words']
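
Because the vectorizer was fitted on a single document, every term has the same IDF, so the values above are just the raw term counts scaled to unit length. The sketch below reproduces them by hand, assuming scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`); the token list is copied from the stop-word-filtered output earlier in the notebook.

```python
import math
from collections import Counter

tokens = ['stemming', 'lemmatization', 'different', 'techniques', 'used', 'ran',
          'better', 'reduce', 'words', 'root', 'form', 'produce', 'varying',
          'results', 'lemmatization', 'better', 'stemming']

counts = Counter(tokens)
n_docs = 1  # the vectorizer saw exactly one document

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1. With one document df is 1 for
# every term, so the IDF is ln(2 / 2) + 1 = 1 across the board.
idf = {term: math.log((1 + n_docs) / (1 + 1)) + 1 for term in counts}

# Raw TF-IDF is count * idf; the L2 norm then rescales the vector to length 1
raw = {term: counts[term] * idf[term] for term in counts}
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}

print(round(weights['ran'], 4))     # terms that occur once: 1 / sqrt(23), about 0.2085
print(round(weights['better'], 4))  # terms that occur twice: 2 / sqrt(23), about 0.417
```

To get IDF values that actually differ between terms, the vectorizer needs several documents, for example one string per sentence passed to `fit_transform`.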
