# **Experiment-7**

### Objective: 
Write a program that reads text data from a file and performs part-of-speech tagging, stop-word removal, stemming (using different stemmers), and lemmatization, then identifies the differences between stemming and lemmatization.

In [1]:
%pip install nltk pandas

Note: you may need to restart the kernel to use updated packages.





### **Importing Libraries**

In [2]:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
import pandas as pd


### **Downloading Necessary NLTK Data**

In [3]:

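# Download the NLTK resources used below: tokenizer models, POS tagger, stop-word list, WordNet.
# Note: NLTK 3.9 and later may instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng';
# if a LookupError names one of those, download it the same way.
nltk.download('punkt', download_dir="C:/nltk_data")
nltk.download('averaged_perceptron_tagger', download_dir="C:/nltk_data")
nltk.download('stopwords', download_dir="C:/nltk_data")
nltk.download('wordnet', download_dir="C:/nltk_data")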


[nltk_data] Downloading package punkt to C:/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to C:/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### **Reading Text from a Text File**

In [4]:
def read_text_file(file_path):
    """Read text data from the specified file."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Specify the file path
file_path = 'input.txt'

# Read the content of input.txt
text = read_text_file(file_path)
print("Content of input.txt:")
print(text)


Content of input.txt:
Microsoft announced a new product launch event in New York City on December 1st, 2024. 
Elon Musk, the CEO of Tesla, hinted at a partnership with NASA for a Mars mission. 
The GDP of India grew by 7% in the first quarter of 2023, according to a report by Reuters. 
Amazon is planning to open a new data center in Dublin, Ireland next year.
Barack Obama gave a keynote speech at Stanford University last Thursday.
Bitcoin reached an all-time high of $68,000 in November 2021.
The Louvre Museum in Paris saw record-breaking attendance last summer.
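
The reader above assumes that `input.txt` exists and is not empty. A minimal sketch of a stricter variant follows; the name `read_text_file_safe` and its error messages are hypothetical and not part of the original experiment.

```python
from pathlib import Path


def read_text_file_safe(file_path, encoding="utf-8"):
    """Return the file's text, raising a clear error if it is missing or empty."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path.resolve()}")
    text = path.read_text(encoding=encoding)
    if not text.strip():
        raise ValueError(f"Input file is empty: {path}")
    return text


# Possible use in the notebook (not executed here):
# text = read_text_file_safe('input.txt')
```

Failing early here keeps a missing file from surfacing later as an empty token list.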



### **Tokenization and Stopword Removal**

In [5]:

# Tokenize the text into words
tokens = word_tokenize(text.lower())

# Stopword removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words and word.isalpha()]
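
The list comprehension above applies two conditions at once: the token must not be a stop word, and it must be purely alphabetic. The sketch below separates the two so it is visible what each one drops. The mini stop list and the hand-written token list are assumptions made for illustration; the notebook itself uses `stopwords.words('english')` and `word_tokenize`.

```python
# Hypothetical mini stop list; the notebook uses stopwords.words('english') instead.
demo_stop_words = {"a", "an", "the", "of", "in", "on", "by", "at", "to"}

# Tokens written by hand to resemble tokenizer output for one input sentence.
demo_tokens = ["bitcoin", "reached", "an", "all-time", "high", "of",
               "$", "68,000", "in", "november", "2021", "."]


def split_by_filter(tokens, stop_words):
    """Return (kept, dropped_as_stop_word, dropped_as_non_alphabetic)."""
    kept, dropped_stop, dropped_non_alpha = [], [], []
    for tok in tokens:
        if tok in stop_words:
            dropped_stop.append(tok)
        elif not tok.isalpha():
            dropped_non_alpha.append(tok)  # digits, punctuation, hyphenated words
        else:
            kept.append(tok)
    return kept, dropped_stop, dropped_non_alpha


kept, dropped_stop, dropped_non_alpha = split_by_filter(demo_tokens, demo_stop_words)
print(kept)               # ['bitcoin', 'reached', 'high', 'november']
print(dropped_stop)       # ['an', 'of', 'in']
print(dropped_non_alpha)  # ['all-time', '$', '68,000', '2021', '.']
```

Note that `str.isalpha()` also rejects hyphenated words such as `all-time`, so they disappear from every later step.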


### **Part of Speech Tagging**

In [6]:

# Part of Speech Tagging (applied to the lowercased, stop-word-filtered tokens)
pos_tags = pos_tag(filtered_tokens)
print("POS Tagging:")
print(pos_tags)


POS Tagging:
[('microsoft', 'RB'), ('announced', 'VBD'), ('new', 'JJ'), ('product', 'NN'), ('launch', 'JJ'), ('event', 'NN'), ('new', 'JJ'), ('york', 'NN'), ('city', 'NN'), ('december', 'VBP'), ('elon', 'NN'), ('musk', 'NN'), ('ceo', 'NN'), ('tesla', 'NN'), ('hinted', 'VBD'), ('partnership', 'NN'), ('nasa', 'JJ'), ('mars', 'NNS'), ('mission', 'NN'), ('gdp', 'NN'), ('india', 'NN'), ('grew', 'VBD'), ('first', 'JJ'), ('quarter', 'NN'), ('according', 'VBG'), ('report', 'NN'), ('reuters', 'NNS'), ('amazon', 'VBP'), ('planning', 'VBG'), ('open', 'JJ'), ('new', 'JJ'), ('data', 'NNS'), ('center', 'NN'), ('dublin', 'NN'), ('ireland', 'NN'), ('next', 'JJ'), ('year', 'NN'), ('barack', 'NN'), ('obama', 'NN'), ('gave', 'VBD'), ('keynote', 'VBN'), ('speech', 'NN'), ('stanford', 'NN'), ('university', 'NN'), ('last', 'JJ'), ('thursday', 'JJ'), ('bitcoin', 'NN'), ('reached', 'VBD'), ('high', 'JJ'), ('november', 'NN'), ('louvre', 'NN'), ('museum', 'NN'), ('paris', 'NN'), ('saw', 'VBD'), ('attendance', '
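
The tagger above receives tokens that were already lowercased and stripped of stop words, so it sees neither capitalization nor the original sentence context. That may explain tags such as `('microsoft', 'RB')`. One possible alternative, sketched below under that assumption, is to tag first and filter afterwards. The helper name `filter_after_tagging` is hypothetical, and the demo pairs are hand-labelled for illustration, not tagger output.

```python
def filter_after_tagging(tagged_tokens, stop_words):
    """Drop stop words and non-alphabetic tokens from (token, tag) pairs, keeping the tags."""
    return [
        (token, tag)
        for token, tag in tagged_tokens
        if token.lower() not in stop_words and token.isalpha()
    ]


# Hand-labelled pairs for illustration only; these are not tagger output.
demo_tagged = [("Microsoft", "NNP"), ("announced", "VBD"), ("a", "DT"),
               ("new", "JJ"), ("product", "NN"), (".", ".")]
demo_stop_words = {"a", "the", "of"}  # hypothetical mini stop list

print(filter_after_tagging(demo_tagged, demo_stop_words))
# [('Microsoft', 'NNP'), ('announced', 'VBD'), ('new', 'JJ'), ('product', 'NN')]
```

In the notebook, `tagged_tokens` could come from `pos_tag(word_tokenize(text))` and `stop_words` from the earlier cell.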

### **Stemming Using PorterStemmer**

In [7]:

# Stemming using PorterStemmer
porter_stemmer = PorterStemmer()
porter_stems = [porter_stemmer.stem(word) for word in filtered_tokens]
print("\nStemming (PorterStemmer):")
print(porter_stems)



Stemming (PorterStemmer):
['microsoft', 'announc', 'new', 'product', 'launch', 'event', 'new', 'york', 'citi', 'decemb', 'elon', 'musk', 'ceo', 'tesla', 'hint', 'partnership', 'nasa', 'mar', 'mission', 'gdp', 'india', 'grew', 'first', 'quarter', 'accord', 'report', 'reuter', 'amazon', 'plan', 'open', 'new', 'data', 'center', 'dublin', 'ireland', 'next', 'year', 'barack', 'obama', 'gave', 'keynot', 'speech', 'stanford', 'univers', 'last', 'thursday', 'bitcoin', 'reach', 'high', 'novemb', 'louvr', 'museum', 'pari', 'saw', 'attend', 'last', 'summer']


### **Stemming Using SnowballStemmer**

In [8]:

# Stemming using SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
snowball_stems = [snowball_stemmer.stem(word) for word in filtered_tokens]
print("\nStemming (SnowballStemmer):")
print(snowball_stems)



Stemming (SnowballStemmer):
['microsoft', 'announc', 'new', 'product', 'launch', 'event', 'new', 'york', 'citi', 'decemb', 'elon', 'musk', 'ceo', 'tesla', 'hint', 'partnership', 'nasa', 'mar', 'mission', 'gdp', 'india', 'grew', 'first', 'quarter', 'accord', 'report', 'reuter', 'amazon', 'plan', 'open', 'new', 'data', 'center', 'dublin', 'ireland', 'next', 'year', 'barack', 'obama', 'gave', 'keynot', 'speech', 'stanford', 'univers', 'last', 'thursday', 'bitcoin', 'reach', 'high', 'novemb', 'louvr', 'museum', 'pari', 'saw', 'attend', 'last', 'summer']
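
For this input the two printed lists are identical, so the comparison between stemmers is easy to miss. A small helper like the one below lists only the words on which two stemming functions disagree. The helper name is hypothetical, and the two toy functions are stand-ins so that the sketch runs without NLTK; they are not real stemming algorithms.

```python
def stem_disagreements(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) for words where the two results differ."""
    rows = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            rows.append((word, a, b))
    return rows


# Toy stand-ins, not real stemmers.
def strip_s(word):
    return word[:-1] if word.endswith("s") else word


def strip_ly(word):
    return word[:-2] if word.endswith("ly") else word


print(stem_disagreements(["cats", "quickly", "run"], strip_s, strip_ly))
# [('cats', 'cat', 'cats'), ('quickly', 'quickly', 'quick')]
```

In the notebook, the call would be `stem_disagreements(filtered_tokens, porter_stemmer.stem, snowball_stemmer.stem)`, which should return an empty list for this text since the two outputs above match.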


### **Lemmatization**

In [9]:

# Lemmatization (no POS argument is passed, so every token is treated as a noun)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("\nLemmatization:")
print(lemmas)



Lemmatization:
['microsoft', 'announced', 'new', 'product', 'launch', 'event', 'new', 'york', 'city', 'december', 'elon', 'musk', 'ceo', 'tesla', 'hinted', 'partnership', 'nasa', 'mar', 'mission', 'gdp', 'india', 'grew', 'first', 'quarter', 'according', 'report', 'reuters', 'amazon', 'planning', 'open', 'new', 'data', 'center', 'dublin', 'ireland', 'next', 'year', 'barack', 'obama', 'gave', 'keynote', 'speech', 'stanford', 'university', 'last', 'thursday', 'bitcoin', 'reached', 'high', 'november', 'louvre', 'museum', 'paris', 'saw', 'attendance', 'last', 'summer']
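
In the output above, verb forms such as `announced`, `hinted`, and `planning` are unchanged because `lemmatize` treats a word as a noun unless a part of speech is given. A common workaround, sketched here, is to map each Penn Treebank tag from the POS step to a WordNet part-of-speech letter and pass it along. The function names are hypothetical; the letters `'a'`, `'v'`, `'r'`, and `'n'` are the values WordNet uses for adjective, verb, adverb, and noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r', or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, as the lemmatizer itself does


def lemmatize_with_pos(tagged_tokens, lemmatize):
    """Lemmatize (token, tag) pairs; `lemmatize` is any callable taking (word, pos)."""
    return [lemmatize(token, penn_to_wordnet(tag)) for token, tag in tagged_tokens]


print(penn_to_wordnet("VBD"))  # v
print(penn_to_wordnet("NNS"))  # n
```

In the notebook, this could be called as `lemmatize_with_pos(pos_tags, lemmatizer.lemmatize)`. The result depends on the tags being right, so tagging errors carry over into the lemmas.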


### **Comparing Stemming and Lemmatization**

In [10]:

# Difference between Stemming and Lemmatization in table format
data = {
    "Original Word": filtered_tokens,
    "PorterStem": porter_stems,
    "SnowballStem": snowball_stems,
    "Lemma": lemmas
}

# Create a DataFrame
df_comparison = pd.DataFrame(data)
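
Most rows of the comparison are identical across all columns, so the informative rows are those where the stem and the lemma differ. The sketch below filters for them in plain Python, using the first ten entries copied from the outputs printed earlier. The helper name `differing_rows` is hypothetical.

```python
# First ten entries copied from the outputs printed earlier in this notebook.
original = ["microsoft", "announced", "new", "product", "launch",
            "event", "new", "york", "city", "december"]
porter = ["microsoft", "announc", "new", "product", "launch",
          "event", "new", "york", "citi", "decemb"]
lemma = ["microsoft", "announced", "new", "product", "launch",
         "event", "new", "york", "city", "december"]


def differing_rows(words, stems, lemmas):
    """Return (word, stem, lemma) triples where the stem and the lemma differ."""
    return [(w, s, l) for w, s, l in zip(words, stems, lemmas) if s != l]


rows = differing_rows(original, porter, lemma)
for word, stem, lem in rows:
    print(f"{word:10} stem={stem:10} lemma={lem}")
```

These rows show the core difference: stems such as `announc`, `citi`, and `decemb` are truncated forms that need not be real words, while the lemmas stay dictionary words. With the DataFrame, a comparable filter would be `df_comparison[df_comparison["PorterStem"] != df_comparison["Lemma"]]`.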


### **Comparison Table**

In [11]:
df_comparison

Index,Original Word,PorterStem,SnowballStem,Lemma
0,microsoft,microsoft,microsoft,microsoft
1,announced,announc,announc,announced
2,new,new,new,new
3,product,product,product,product
4,launch,launch,launch,launch
5,event,event,event,event
6,new,new,new,new
7,york,york,york,york
8,city,citi,citi,city
9,december,decemb,decemb,december
