# Assignment 10

Name : Ghanashyam Patil  
Roll No : 31162  
Subject : DSBDAL  

Problem Statement :
1. Extract a sample document and apply the following document preprocessing methods: tokenization, POS tagging, stop word removal, stemming, and lemmatization.
2. Create representation of documents by calculating Term Frequency and Inverse Document Frequency.

In [33]:
import nltk
from nltk.corpus import stopwords,wordnet
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer

In [34]:
# Read the sample document from a text file
file_path = "patil_gm.txt"
with open(file_path, "r") as file:
    document = file.read()

# Tokenization
The Punkt tokenizer is used for tokenization in the Natural Language Toolkit (NLTK).  
Tokenization is the process of dividing a text into individual units, called tokens, which can be words, sentences, or subwords.
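
To make the difference between word-level and sentence-level tokens concrete, here is a rough sketch using only Python's `re` module. It is a regex approximation, not the Punkt algorithm, and the sample string is an assumption based on the start of the document used in this notebook.

```python
import re

# Sample text (assumption: the opening of the document used in this notebook).
text = "Hello people...! how is going your preparation"

# Rough word-level tokens: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Rough sentence-level tokens: split on whitespace that follows '.', '!' or '?'.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(sentence_tokens)
```

Unlike `word_tokenize`, this sketch emits each dot of `...` as a separate token, which is one reason a trained tokenizer is usually preferable to a plain regex.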

In [35]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ghanashyam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [36]:
# Tokenization
tokens = word_tokenize(document)
print("Tokens : ",tokens)

Tokens :  ['Hello', 'people', '...', '!', 'how', 'is', 'going', 'your', 'preparation', 'of', 'DSBDAL', 'Practicles', 'hope', 'this', 'repository', 'helping', 'you', 'out']


In [37]:
sentences=sent_tokenize(document)
print("Sentences : ",sentences)

Sentences :  ['Hello people...!', 'how is going your preparation of DSBDAL Practicles\nhope this repository helping you out']


# POS Tagging

averaged_perceptron_tagger is the part-of-speech (POS) tagger model that NLTK's pos_tag uses by default for English.  
POS tagging is the process of assigning a grammatical tag to each word in a sentence, indicating its syntactic category or part of speech (noun, pronoun, adjective, verb, etc.). The tags printed below, such as NNP and VBG, follow the Penn Treebank tag set.

In [38]:
# POS Tagging
pos_tags = pos_tag(tokens)
print(pos_tags)

[('Hello', 'NNP'), ('people', 'NNS'), ('...', ':'), ('!', '.'), ('how', 'WRB'), ('is', 'VBZ'), ('going', 'VBG'), ('your', 'PRP$'), ('preparation', 'NN'), ('of', 'IN'), ('DSBDAL', 'NNP'), ('Practicles', 'NNP'), ('hope', 'VBP'), ('this', 'DT'), ('repository', 'NN'), ('helping', 'VBG'), ('you', 'PRP'), ('out', 'IN')]


# Stopwords

Stopwords typically include common words such as "a," "an," "the," "is," "and," "of," etc.  
These words are often removed from text during data preprocessing  
to focus on the more important and meaningful words for analysis or modeling.
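
One detail worth illustrating: NLTK's English stop word list is lowercase, so comparing raw tokens against it misses capitalized stop words. The sketch below uses a small hand-picked stop word set and a made-up token list (both assumptions for illustration) to show the difference.

```python
# Small hand-picked stop word set (assumption; NLTK's English list is much longer).
stop_words = {"a", "an", "the", "is", "and", "of"}

# Hypothetical token list with a capitalized stop word at the start.
tokens = ["The", "goal", "of", "analysis", "is", "insight"]

# Case-sensitive comparison keeps "The" because only "the" is in the set.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing before the lookup removes it as intended.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'goal', 'analysis', 'insight']
print(case_insensitive)  # ['goal', 'analysis', 'insight']
```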

In [39]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ghanashyam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
# Stop Words Removal
stop_words = set(stopwords.words('english'))
# Compare in lowercase so that capitalized stop words (e.g. "The") are removed too
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)

['Hello', 'people', '...', '!', 'going', 'preparation', 'DSBDAL', 'Practicles', 'hope', 'repository', 'helping']


# Stemming
The Porter stemming algorithm applies a series of rules that progressively strip common word endings until only the stem remains. The stem does not have to be a dictionary word.  
e.g. ('running', 'ran', 'runs', 'happiness', 'happier', 'happiest') is converted to
     ('run', 'ran', 'run', 'happi', 'happier', 'happiest')
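
The idea of stripping endings by rule can be sketched with a toy stemmer. This is not the Porter algorithm: the three rules, the minimum stem length, and the helper name `toy_stem` are all assumptions chosen only to reproduce a few of the examples above.

```python
# Ordered (suffix, replacement) rules. Hypothetical and far smaller than Porter's rule set.
SUFFIX_RULES = [("iness", "i"), ("ing", ""), ("s", "")]

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            # Undo the consonant doubling left behind by "-ing" ("runn" -> "run").
            if suffix == "ing" and len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word  # no rule applied, e.g. the irregular form "ran"

words = ["running", "ran", "runs", "happiness"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['run', 'ran', 'run', 'happi']
```

A real stemmer such as `PorterStemmer` uses many more rules and conditions on the shape of the stem, so its results will differ from this sketch on most other words.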

In [41]:
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)

['hello', 'peopl', '...', '!', 'go', 'prepar', 'dsbdal', 'practicl', 'hope', 'repositori', 'help']


# WordNet
In natural language processing (NLP), WordNet is a lexical database that provides a structured collection of words and their semantic relationships. It is widely used for NLP tasks including word sense disambiguation, synonym identification, semantic analysis, and information retrieval.

# Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma.  
The lemma represents the canonical or uninflected form of a word that conveys its meaning.  
The goal of lemmatization is to convert different inflected or derived word forms into a common base form, enabling easier analysis and interpretation of text data.
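
`WordNetLemmatizer.lemmatize` takes an optional `pos` argument and treats every word as a noun when it is omitted, so verb forms are often left unchanged. A possible way to use the POS tags computed earlier is to map each Penn Treebank tag to the single-letter codes WordNet uses (`n`, `v`, `a`, `r`). The helper name `penn_to_wordnet_pos` and the prefix rules below are assumptions for illustration, not part of NLTK.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if penn_tag.startswith("JJ"):
        return "a"  # adjective
    if penn_tag.startswith("VB"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# A few (token, tag) pairs in the shape that pos_tag returns.
tagged = [("going", "VBG"), ("preparation", "NN"), ("helping", "VBG")]
wordnet_pos = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(wordnet_pos)  # [('going', 'v'), ('preparation', 'n'), ('helping', 'v')]
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos=letter)`. With `pos="v"`, the lemmatizer should reduce `going` to `go`, whereas the default noun lookup leaves it as `going`.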

In [42]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ghanashyam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [43]:
# Lemmatization
lemmatizer = WordNetLemmatizer()
# Without a pos argument, lemmatize() treats every token as a noun
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF, short for term frequency-inverse document frequency, is a numeric measure used to score the importance of a word in a document, based on how often it appears in that document and in a given collection of documents. The intuition is: if a word appears frequently in a document, it is likely important to that document and should receive a high score. But if the word also appears in many other documents, it is probably not a distinguishing term, so it should receive a lower score.

# Inverse document frequency
The inverse document frequency is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word: divide the total number of documents by the number of documents containing the term, then take the logarithm of that quotient.
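
The definition above can be worked through by hand on a tiny corpus. The sketch below is plain Python and follows the textbook form `tf * log(N / df)`; the three-document corpus and the choice of term frequency (raw count divided by document length) are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized and lowercased.
corpus = [
    ["data", "science", "is", "fun"],
    ["big", "data", "is", "big"],
    ["science", "needs", "data"],
]
n_docs = len(corpus)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in corpus for term in set(doc))

# IDF as described above: log(total documents / documents containing the term).
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(doc):
    """Score each term of one document as (count / document length) * IDF."""
    counts = Counter(doc)
    return {term: (count / len(doc)) * idf[term] for term, count in counts.items()}

scores = [tf_idf(doc) for doc in corpus]
print(scores[1])
```

In the second document, `data` scores 0 because it occurs in every document, while `big` scores highest because it is frequent there and absent elsewhere. scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row, so its numbers will differ from this sketch.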

In [44]:
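# Term Frequency-Inverse Document Frequency (TF-IDF) representation
# Note: the corpus here holds a single document, so every term gets the same IDF
# and the scores reflect only the (normalized) term frequency.
documents = [document]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)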

In [45]:
# Print the results
print("Tokens:", tokens)
print("Sentences:", sentences)
print("POS Tags:", pos_tags)
print("Filtered Tokens (after stop words removal):", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())
# print("Feature Names:", vectorizer.get_feature_names_out())

Tokens: ['Hello', 'people', '...', '!', 'how', 'is', 'going', 'your', 'preparation', 'of', 'DSBDAL', 'Practicles', 'hope', 'this', 'repository', 'helping', 'you', 'out']
Sentences: ['Hello people...!', 'how is going your preparation of DSBDAL Practicles\nhope this repository helping you out']
POS Tags: [('Hello', 'NNP'), ('people', 'NNS'), ('...', ':'), ('!', '.'), ('how', 'WRB'), ('is', 'VBZ'), ('going', 'VBG'), ('your', 'PRP$'), ('preparation', 'NN'), ('of', 'IN'), ('DSBDAL', 'NNP'), ('Practicles', 'NNP'), ('hope', 'VBP'), ('this', 'DT'), ('repository', 'NN'), ('helping', 'VBG'), ('you', 'PRP'), ('out', 'IN')]
Filtered Tokens (after stop words removal): ['Hello', 'people', '...', '!', 'going', 'preparation', 'DSBDAL', 'Practicles', 'hope', 'repository', 'helping']
Stemmed Tokens: ['hello', 'peopl', '...', '!', 'go', 'prepar', 'dsbdal', 'practicl', 'hope', 'repositori', 'help']
Lemmatized Tokens: ['Hello', 'people', '...', '!', 'going', 'preparation', 'DSBDAL', 'Practicles', 'hope', '