# **Experiment-8**

### Objective: 
Write a program to read text data from a file and perform part-of-speech (POS) tagging, stop-word removal, stemming (using different stemmers), and lemmatization, and identify the difference between stemming and lemmatization.

In [5]:
%pip install nltk pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
import pandas as pd

# Download required NLTK data files
nltk.download("punkt", download_dir="C:/nltk_data")
nltk.download("stopwords", download_dir="C:/nltk_data")
nltk.download("averaged_perceptron_tagger", download_dir="C:/nltk_data")
nltk.download("wordnet", download_dir="C:/nltk_data")
nltk.download("omw-1.4", download_dir="C:/nltk_data")

[nltk_data] Downloading package punkt to C:/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to C:/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [7]:
def read_text_file(file_path):
    """Read text data from the specified file."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Specify the file path
file_path = 'input.txt'

# Read the content of input.txt
text = read_text_file(file_path)
print("Content of input.txt:")
print(text)


Content of input.txt:
Microsoft announced a new product launch event in New York City on December 1st, 2024. 
Elon Musk, the CEO of Tesla, hinted at a partnership with NASA for a Mars mission. 
The GDP of India grew by 7% in the first quarter of 2023, according to a report by Reuters. 
Amazon is planning to open a new data center in Dublin, Ireland next year.
Barack Obama gave a keynote speech at Stanford University last Thursday.
Bitcoin reached an all-time high of $68,000 in November 2021.
The Louvre Museum in Paris saw record-breaking attendance last summer.
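
`read_text_file` raises `FileNotFoundError` when `input.txt` is not in the working directory. A minimal sketch of a more forgiving reader is shown here, assuming it is acceptable to return `None` for a missing file; the name `read_text_or_none` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional


def read_text_or_none(file_path: str) -> Optional[str]:
    """Return the file's text as UTF-8, or None if the file does not exist."""
    path = Path(file_path)
    if not path.is_file():
        return None
    return path.read_text(encoding="utf-8")


text_or_none = read_text_or_none("input.txt")
if text_or_none is None:
    print("input.txt not found; check the working directory.")
```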



In [8]:
# Tokenize the text
tokens = word_tokenize(text.lower())

# Load English stopwords
stop_words = set(stopwords.words("english"))

# Remove stopwords and filter non-alphabetic words
filtered_words = [word for word in tokens if word.isalpha() and word not in stop_words]

print("\nFiltered Words (After Stop-word Removal):")
print(filtered_words)


Filtered Words (After Stop-word Removal):
['microsoft', 'announced', 'new', 'product', 'launch', 'event', 'new', 'york', 'city', 'december', 'elon', 'musk', 'ceo', 'tesla', 'hinted', 'partnership', 'nasa', 'mars', 'mission', 'gdp', 'india', 'grew', 'first', 'quarter', 'according', 'report', 'reuters', 'amazon', 'planning', 'open', 'new', 'data', 'center', 'dublin', 'ireland', 'next', 'year', 'barack', 'obama', 'gave', 'keynote', 'speech', 'stanford', 'university', 'last', 'thursday', 'bitcoin', 'reached', 'high', 'november', 'louvre', 'museum', 'paris', 'saw', 'attendance', 'last', 'summer']
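
Note that the `isalpha()` filter also discards hyphenated tokens: the input contains "all-time" and "record-breaking", yet neither appears in the filtered list. The sketch below shows one possible looser filter, assuming the tokens are already lowercased. The name `keep_token`, the regular expression, and the sample data are illustrative assumptions, and the token list is hand-written rather than produced by `word_tokenize`.

```python
import re

# Letters, optionally joined by single hyphens: "high", "all-time", "record-breaking".
WORD_PATTERN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def keep_token(token, stop_words):
    """Keep alphabetic or hyphenated tokens that are not stop words."""
    return bool(WORD_PATTERN.fullmatch(token)) and token not in stop_words


# Hand-written tokens standing in for tokenizer output on one input sentence.
sample_tokens = ["bitcoin", "reached", "an", "all-time", "high", "of",
                 "$", "68,000", "in", "november", "2021", "."]
sample_stop_words = {"an", "of", "in"}  # tiny stand-in for stopwords.words("english")

kept = [token for token in sample_tokens if keep_token(token, sample_stop_words)]
print(kept)  # ['bitcoin', 'reached', 'all-time', 'high', 'november']
```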


In [9]:
# Perform POS tagging (the input is lowercased with stop words removed, which can reduce tagging accuracy)
pos_tags = nltk.pos_tag(filtered_words)

# Display the POS tags
df_pos_tags = pd.DataFrame(pos_tags, columns=["Word", "POS Tag"])
df_pos_tags

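|   | Word | POS Tag |
|---|------|---------|
| 0 | microsoft | RB |
| 1 | announced | VBD |
| 2 | new | JJ |
| 3 | product | NN |
| 4 | launch | JJ |
| 5 | event | NN |
| 6 | new | JJ |
| 7 | york | NN |
| 8 | city | NN |
| 9 | december | VBP |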
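
Some of these tags look doubtful: `microsoft` is tagged as an adverb (RB) and `december` as a verb (VBP). A likely cause is that the tagger receives lowercased words with the stop words already removed, so it loses capitalization and sentence context. One alternative, sketched here under the assumption that tags are first computed with `nltk.pos_tag(word_tokenize(text))` on the original text, is to filter after tagging. The helper `filter_tagged` is hypothetical, and the tagged pairs below are hand-written for illustration, not real tagger output.

```python
def filter_tagged(tagged_tokens, stop_words):
    """Drop stop words and non-alphabetic tokens from (word, tag) pairs, keeping each tag."""
    return [
        (word, tag)
        for word, tag in tagged_tokens
        if word.isalpha() and word.lower() not in stop_words
    ]


# Hand-written (word, tag) pairs for illustration only; in the notebook they
# would come from nltk.pos_tag(word_tokenize(text)) on the original-case text.
example_tagged = [("Microsoft", "NNP"), ("announced", "VBD"), ("a", "DT"),
                  ("new", "JJ"), ("product", "NN"), (".", ".")]
example_stop_words = {"a"}  # tiny stand-in for stopwords.words("english")

print(filter_tagged(example_tagged, example_stop_words))
```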


In [10]:
# Initialize stemmers
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer("english")

# Apply stemming using different stemmers
porter_stems = [porter_stemmer.stem(word) for word in filtered_words]
lancaster_stems = [lancaster_stemmer.stem(word) for word in filtered_words]
snowball_stems = [snowball_stemmer.stem(word) for word in filtered_words]

# Create a DataFrame to compare results
df_stemming = pd.DataFrame({
    "Original Word": filtered_words,
    "Porter Stemmer": porter_stems,
    "Lancaster Stemmer": lancaster_stems,
    "Snowball Stemmer": snowball_stems
})
df_stemming

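|   | Original Word | Porter Stemmer | Lancaster Stemmer | Snowball Stemmer |
|---|---------------|----------------|-------------------|------------------|
| 0 | microsoft | microsoft | microsoft | microsoft |
| 1 | announced | announc | annount | announc |
| 2 | new | new | new | new |
| 3 | product | product | produc | product |
| 4 | launch | launch | launch | launch |
| 5 | event | event | ev | event |
| 6 | new | new | new | new |
| 7 | york | york | york | york |
| 8 | city | citi | city | citi |
| 9 | december | decemb | decemb | decemb |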
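
In these rows the Lancaster stemmer cuts words more aggressively than the other two (`event` becomes `ev`, `product` becomes `produc`), while Porter and Snowball agree on every word. A rough way to quantify this, sketched below, is to count how many characters each stemmer removes. The helper `chars_removed` is a hypothetical name; the data is copied from the ten rows shown above.

```python
def chars_removed(words, stems):
    """Total number of characters by which the stems are shorter than the words."""
    return sum(len(word) - len(stem) for word, stem in zip(words, stems))


# The ten rows displayed above.
words = ["microsoft", "announced", "new", "product", "launch",
         "event", "new", "york", "city", "december"]
stems_by_stemmer = {
    "Porter": ["microsoft", "announc", "new", "product", "launch",
               "event", "new", "york", "citi", "decemb"],
    "Lancaster": ["microsoft", "annount", "new", "produc", "launch",
                  "ev", "new", "york", "city", "decemb"],
    "Snowball": ["microsoft", "announc", "new", "product", "launch",
                 "event", "new", "york", "citi", "decemb"],
}

for name, stems in stems_by_stemmer.items():
    print(name, chars_removed(words, stems))
# Porter 4
# Lancaster 8
# Snowball 4
```

In the notebook itself, the same function could be called as `chars_removed(filtered_words, lancaster_stems)` to cover the full word list.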


In [11]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization (with no pos argument, every word is treated as a noun)
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

# Create a DataFrame to compare original words with lemmatized words
df_lemmatization = pd.DataFrame({
    "Original Word": filtered_words,
    "Lemmatized Word": lemmatized_words
})
df_lemmatization

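|   | Original Word | Lemmatized Word |
|---|---------------|-----------------|
| 0 | microsoft | microsoft |
| 1 | announced | announced |
| 2 | new | new |
| 3 | product | product |
| 4 | launch | launch |
| 5 | event | event |
| 6 | new | new |
| 7 | york | york |
| 8 | city | city |
| 9 | december | december |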
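
None of these rows change, including `announced`. `WordNetLemmatizer.lemmatize` defaults to `pos="n"`, so a verb form is looked up as a noun and returned unchanged. A common workaround, sketched below, is to map each Penn Treebank tag from `nltk.pos_tag` to a WordNet POS letter and pass it along. The helpers `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names; the letters match the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.NOUN`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (wordnet.ADJ)
    if tag.startswith("V"):
        return "v"  # verb (wordnet.VERB)
    if tag.startswith("R"):
        return "r"  # adverb (wordnet.ADV)
    return "n"      # noun (wordnet.NOUN)


def lemmatize_tagged(tagged_words, lemmatizer):
    """Lemmatize (word, tag) pairs, passing each word's mapped POS to the lemmatizer."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_words
    ]
```

With the notebook's own variables this would be called as `lemmatize_tagged(pos_tags, lemmatizer)`, so that a word tagged VBD, such as `announced`, is looked up as a verb rather than a noun. The result is only as good as the tags it is given.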


In [12]:
# Display a comparison of stemming vs. lemmatization
df_comparison = pd.DataFrame({
    "Original Word": filtered_words,
    "Porter Stemmer": porter_stems,
    "Lemmatized Word": lemmatized_words
})
df_comparison

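|   | Original Word | Porter Stemmer | Lemmatized Word |
|---|---------------|----------------|-----------------|
| 0 | microsoft | microsoft | microsoft |
| 1 | announced | announc | announced |
| 2 | new | new | new |
| 3 | product | product | product |
| 4 | launch | launch | launch |
| 5 | event | event | event |
| 6 | new | new | new |
| 7 | york | york | york |
| 8 | city | citi | city |
| 9 | december | decemb | december |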
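
To make the difference easier to see, the rows where the two methods disagree can be pulled out. The sketch below does this with plain lists; `stem_lemma_differences` is a hypothetical helper, and the data is copied from the ten rows shown above.

```python
def stem_lemma_differences(words, stems, lemmas):
    """Return (word, stem, lemma) rows where the stem and the lemma differ."""
    return [
        (word, stem, lemma)
        for word, stem, lemma in zip(words, stems, lemmas)
        if stem != lemma
    ]


# The ten rows displayed above.
words = ["microsoft", "announced", "new", "product", "launch",
         "event", "new", "york", "city", "december"]
porter = ["microsoft", "announc", "new", "product", "launch",
          "event", "new", "york", "citi", "decemb"]
lemmas = ["microsoft", "announced", "new", "product", "launch",
          "event", "new", "york", "city", "december"]

for row in stem_lemma_differences(words, porter, lemmas):
    print(row)
# ('announced', 'announc', 'announced')
# ('city', 'citi', 'city')
# ('december', 'decemb', 'december')
```

These three rows illustrate the distinction: the Porter stemmer strips suffixes by rule and can produce strings that are not dictionary words (`announc`, `citi`, `decemb`), whereas the lemmatizer looks words up in WordNet and returns valid words. In this run it left `announced` unchanged because no part-of-speech tag was supplied.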
