<h3>Objective: Take a document as input, tokenize it, remove its stop words, and finally lemmatize it</h3>

<h3>Importing Libraries</h3>

In [4]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer

<h3>Document Loading</h3>

In [5]:
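# The context manager closes the file automatically, even if reading fails.
# readline() returns only the first line; use file.read() for a multi-line document.
with open("sample.txt", mode="r") as file:
    text = file.readline()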

print("Original Text: " + text)

Original Text: The meaning of NLP is Natural Language Processing (NLP) which is a fascinating and rapidly evolving field that intersects computer science, artificial intelligence, and linguistics. NLP focuses on the interaction between computers and human language, enabling machines to understand, interpret, and generate human language in a way that is both meaningful and useful. With the increasing volume of text data generated every day, from social media posts to research articles, NLP has become an essential tool for extracting valuable insights and automating various tasks.


<h3>Tokenization</h3>

In [6]:
tokens=word_tokenize(text)

print(tokens)

['The', 'meaning', 'of', 'NLP', 'is', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', 'which', 'is', 'a', 'fascinating', 'and', 'rapidly', 'evolving', 'field', 'that', 'intersects', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'linguistics', '.', 'NLP', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', ',', 'enabling', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', 'in', 'a', 'way', 'that', 'is', 'both', 'meaningful', 'and', 'useful', '.', 'With', 'the', 'increasing', 'volume', 'of', 'text', 'data', 'generated', 'every', 'day', ',', 'from', 'social', 'media', 'posts', 'to', 'research', 'articles', ',', 'NLP', 'has', 'become', 'an', 'essential', 'tool', 'for', 'extracting', 'valuable', 'insights', 'and', 'automating', 'various', 'tasks', '.']


<h3>Stop Words Removal</h3>

In [7]:
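# NLTK stores the corpus file as "english"; "English" can fail on case-sensitive file systems.
# A set gives constant-time membership checks for both stop words and punctuation.
stop_words = set(stopwords.words("english")) | set(punctuation)

# Compare in lowercase so capitalized stop words such as "The" are removed too.
tokens_without_stop_words = [word for word in tokens if word.lower() not in stop_words]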

print(tokens_without_stop_words)

['meaning', 'NLP', 'Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'rapidly', 'evolving', 'field', 'intersects', 'computer', 'science', 'artificial', 'intelligence', 'linguistics', 'NLP', 'focuses', 'interaction', 'computers', 'human', 'language', 'enabling', 'machines', 'understand', 'interpret', 'generate', 'human', 'language', 'way', 'meaningful', 'useful', 'increasing', 'volume', 'text', 'data', 'generated', 'every', 'day', 'social', 'media', 'posts', 'research', 'articles', 'NLP', 'become', 'essential', 'tool', 'extracting', 'valuable', 'insights', 'automating', 'various', 'tasks']
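
The `word.lower()` call in the filter matters because NLTK's English stop-word list is entirely lowercase, so a case-sensitive comparison would let sentence-initial words such as "The" through. The following is a minimal sketch of that difference; `demo_stop_words` and `demo_tokens` are hypothetical, hand-written stand-ins and not NLTK's actual list or the notebook's tokens.

```python
from string import punctuation

# Hypothetical miniature stop-word set, for illustration only (not NLTK's list)
demo_stop_words = {"the", "of", "is", "a"} | set(punctuation)
demo_tokens = ["The", "meaning", "of", "NLP", "is", "a", "field", "."]

# Case-sensitive check: "The" does not match "the" and survives
case_sensitive = [t for t in demo_tokens if t not in demo_stop_words]

# Case-insensitive check: the token is lowercased only for the comparison,
# so surviving tokens keep their original casing
case_insensitive = [t for t in demo_tokens if t.lower() not in demo_stop_words]

print(case_sensitive)    # ['The', 'meaning', 'NLP', 'field']
print(case_insensitive)  # ['meaning', 'NLP', 'field']
```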


<h3>Lemmatization</h3>

In [8]:
wordnet_lemmatizer = WordNetLemmatizer()

lemmatized_words = []
print("{0:20}{1:20}\n".format("WORD", "LEMMA"))
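for word in tokens_without_stop_words:
    # pos="v" treats every token as a verb, so plural nouns that are not
    # also verbs in WordNet (e.g. "computers") are left unchanged.
    lemma = wordnet_lemmatizer.lemmatize(word, pos="v")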
    lemmatized_words.append(lemma)
    print("{0:20}{1:20}".format(word, lemma))

processed_text=" ".join(lemmatized_words)

WORD                LEMMA               

meaning             mean                
NLP                 NLP                 
Natural             Natural             
Language            Language            
Processing          Processing          
NLP                 NLP                 
fascinating         fascinate           
rapidly             rapidly             
evolving            evolve              
field               field               
intersects          intersect           
computer            computer            
science             science             
artificial          artificial          
intelligence        intelligence        
linguistics         linguistics         
NLP                 NLP                 
focuses             focus               
interaction         interaction         
computers           computers           
human               human               
language            language            
enabling            enable              
machines            machine             

In [9]:
processed_text

'mean NLP Natural Language Processing NLP fascinate rapidly evolve field intersect computer science artificial intelligence linguistics NLP focus interaction computers human language enable machine understand interpret generate human language way meaningful useful increase volume text data generate every day social media post research article NLP become essential tool extract valuable insights automate various task'

<h3>Label Encoding</h3>

In [10]:
from sklearn.preprocessing import LabelEncoder

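label_encoder = LabelEncoder()
# The whole processed text is passed as one item, so there is exactly one class
# and the encoder can only return [0].
labels = label_encoder.fit_transform([processed_text])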

print("Label Encoded Output:", labels)

Label Encoded Output: [0]
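
Label encoding maps each distinct value to an integer index over the sorted set of values, which is why a single input string can only ever produce `[0]`. If per-token labels are wanted instead (an assumption, since the notebook encodes the whole text), the idea can be sketched in plain Python as below. The `lemmas` list is copied from the first six lemmas of the processed text; `classes` and `class_to_index` are illustrative names, not scikit-learn attributes.

```python
# First six lemmas from the processed text above
lemmas = ["mean", "NLP", "Natural", "Language", "Processing", "NLP"]

# Sorted unique values play the role of the encoder's class list.
# Python sorts strings by code point, so uppercase letters come before lowercase.
classes = sorted(set(lemmas))

# Each class gets its position in the sorted list as its integer label
class_to_index = {label: index for index, label in enumerate(classes)}
encoded = [class_to_index[lemma] for lemma in lemmas]

print(classes)  # ['Language', 'NLP', 'Natural', 'Processing', 'mean']
print(encoded)  # [4, 1, 2, 0, 3, 1]
```

Repeated tokens such as "NLP" receive the same integer, which is the property that makes label encoding useful for categorical values.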


<h3>TF-IDF</h3>

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

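vectorizer = TfidfVectorizer()
# With a single document every IDF factor is identical, so each value below is
# simply the term count divided by the L2 norm of all counts.
tfidf_matrix = vectorizer.fit_transform([processed_text]).toarray()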

np.save("tfidf.npy", tfidf_matrix)
print("TF-IDF Matrix:\n", tfidf_matrix)

TF-IDF Matrix:
 [[0.11470787 0.11470787 0.11470787 0.11470787 0.11470787 0.11470787
  0.11470787 0.11470787 0.11470787 0.11470787 0.11470787 0.11470787
  0.11470787 0.11470787 0.11470787 0.11470787 0.22941573 0.22941573
  0.11470787 0.11470787 0.11470787 0.11470787 0.11470787 0.11470787
  0.3441236  0.11470787 0.11470787 0.11470787 0.11470787 0.11470787
  0.11470787 0.45883147 0.11470787 0.11470787 0.11470787 0.11470787
  0.11470787 0.11470787 0.11470787 0.11470787 0.11470787 0.11470787
  0.11470787 0.11470787 0.11470787 0.11470787 0.11470787]]
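
The values in this matrix can be checked by hand. With only one document, the default smoothed IDF is the same for every term, so each entry reduces to the term count divided by the L2 norm of all counts. The sketch below reproduces the numbers with the standard library only; it assumes whitespace splitting and lowercasing match the vectorizer's tokenization for this particular text, which holds here because every token has at least two characters.

```python
import math
from collections import Counter

processed_text = (
    "mean NLP Natural Language Processing NLP fascinate rapidly evolve field "
    "intersect computer science artificial intelligence linguistics NLP focus "
    "interaction computers human language enable machine understand interpret "
    "generate human language way meaningful useful increase volume text data "
    "generate every day social media post research article NLP become essential "
    "tool extract valuable insights automate various task"
)

# Term counts over lowercased tokens
counts = Counter(processed_text.lower().split())

# L2 norm of the count vector: sqrt(43*1 + 2*2**2 + 3**2 + 4**2) = sqrt(76)
norm = math.sqrt(sum(c * c for c in counts.values()))

# Columns are ordered alphabetically by term, as in the matrix above
weights = {term: count / norm for term, count in sorted(counts.items())}

print(len(weights))                    # 47
print(round(weights["task"], 8))       # 0.11470787
print(round(weights["nlp"], 8))        # 0.45883147
print(list(weights).index("nlp"))      # 31
```

The four values above 0.11470787 in the matrix correspond to the repeated terms: "generate" and "human" (twice each), "language" (three times), and "nlp" (four times).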
