# ASSIGNMENT 9

### Q1. Write a unique paragraph (5-6 sentences) about your favorite topic (e.g., sports, technology, food, books, etc.).
#### 1. Convert text to lowercase and remove punctuation.
#### 2. Tokenize the text into words and sentences.
#### 3. Remove stopwords (using NLTK's stopwords list).
#### 4. Display word frequency distribution (excluding stopwords).

In [7]:
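import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import FreqDist

# One-time setup if the NLTK data is not installed yet
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
# nltk.download('punkt'); nltk.download('stopwords')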

paragraph = """Agentic AI is an exciting advancement in artificial intelligence where systems can act independently. 
These AI systems are capable of making decisions, learning from experiences, and adapting to their environment. 
They function with minimal human intervention and have great potential in various fields such as healthcare, 
education, and robotics. As AI continues to evolve, agentic systems could become companions and collaborators. 
The rise of Agentic AI also raises important questions about responsibility, safety, and ethics."""

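# Step 1: lowercase the text and strip all ASCII punctuation.
text_cleaned = paragraph.lower().translate(str.maketrans('', '', string.punctuation))

# Step 2: tokenize. Words come from the cleaned text; sentences come from the
# original paragraph, because sentence splitting relies on the punctuation.
words = word_tokenize(text_cleaned)
sentences = sent_tokenize(paragraph)

# Step 3: drop English stopwords and any non-alphabetic tokens.
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words and word.isalpha()]

# Step 4: build the frequency distribution over the remaining words.
fdist = FreqDist(filtered_words)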

print("Original Sentences:\n", sentences, "\n")
print("Filtered Words (no stopwords):\n", filtered_words, "\n")
print("Top Word Frequencies:")
for word, freq in fdist.most_common(10):
    print(f"{word}: {freq}")


Original Sentences:
 ['Agentic AI is an exciting advancement in artificial intelligence where systems can act independently.', 'These AI systems are capable of making decisions, learning from experiences, and adapting to their environment.', 'They function with minimal human intervention and have great potential in various fields such as healthcare, \neducation, and robotics.', 'As AI continues to evolve, agentic systems could become companions and collaborators.', 'The rise of Agentic AI also raises important questions about responsibility, safety, and ethics.'] 

Filtered Words (no stopwords):
 ['agentic', 'ai', 'exciting', 'advancement', 'artificial', 'intelligence', 'systems', 'act', 'independently', 'ai', 'systems', 'capable', 'making', 'decisions', 'learning', 'experiences', 'adapting', 'environment', 'function', 'minimal', 'human', 'intervention', 'great', 'potential', 'various', 'fields', 'healthcare', 'education', 'robotics', 'ai', 'continues', 'evolve', 'agentic', 'systems', 
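
As a cross-check that does not depend on NLTK data, the frequency count for Q1 can be sketched with the standard library alone. The stopword set below is a small hand-picked subset written for this paragraph (an assumption, not NLTK's list), and `word_frequencies` is a hypothetical helper name, so results on other text may differ from the NLTK version.

```python
import string
from collections import Counter

paragraph = (
    "Agentic AI is an exciting advancement in artificial intelligence where systems can act independently. "
    "These AI systems are capable of making decisions, learning from experiences, and adapting to their environment. "
    "They function with minimal human intervention and have great potential in various fields such as healthcare, "
    "education, and robotics. As AI continues to evolve, agentic systems could become companions and collaborators. "
    "The rise of Agentic AI also raises important questions about responsibility, safety, and ethics."
)

# Hypothetical mini stopword list covering only the function words in this paragraph.
MINI_STOPWORDS = {
    "is", "an", "in", "where", "can", "these", "are", "of", "from", "and", "to",
    "their", "they", "with", "have", "such", "as", "could", "the", "also", "about",
}

def word_frequencies(text, stop_words):
    """Lowercase, strip ASCII punctuation, split on whitespace, count non-stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in cleaned.split() if w.isalpha() and w not in stop_words]
    return Counter(words)

freq = word_frequencies(paragraph, MINI_STOPWORDS)
for word, count in freq.most_common(3):
    print(f"{word}: {count}")
```

Whitespace splitting is enough here only because punctuation has already been removed; `word_tokenize` remains the better choice for raw text.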

### Q2. Stemming and Lemmatization
#### 1. Take the tokenized words from Question 1 (after stopword removal).
#### 2. Apply stemming using NLTK's PorterStemmer and LancasterStemmer.
#### 3. Apply lemmatization using NLTK's WordNetLemmatizer.
#### 4. Compare and display results of both techniques.

In [9]:
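from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lancaster = LancasterStemmer()
# WordNetLemmatizer needs the WordNet corpus: nltk.download('wordnet')
# lemmatize() treats every word as a noun unless pos= is given, so verb forms
# such as "making" come back unchanged.
lemmatizer = WordNetLemmatizer()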

print(f"{'Word':<13} | {'PorterStem':<12} | {'LancasterStem':<14} | Lemmatized\n")
for word in filtered_words[:15]:
    print(f"{word:<13} | {porter.stem(word):<12} | {lancaster.stem(word):<14} | {lemmatizer.lemmatize(word)}")


Word          | PorterStem   | LancasterStem  | Lemmatized

agentic       | agent        | ag             | agentic
ai            | ai           | ai             | ai
exciting      | excit        | excit          | exciting
advancement   | advanc       | adv            | advancement
artificial    | artifici     | art            | artificial
intelligence  | intellig     | intellig       | intelligence
systems       | system       | system         | system
act           | act          | act            | act
independently | independ     | independ       | independently
ai            | ai           | ai             | ai
systems       | system       | system         | system
capable       | capabl       | cap            | capable
making        | make         | mak            | making
decisions     | decis        | decid          | decision
learning      | learn        | learn          | learning
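
The comparison above comes down to two strategies: stemmers strip suffixes by rule and may produce non-words ("mak", "artifici"), while a lemmatizer looks words up and returns dictionary forms. The sketch below illustrates that contrast with deliberately tiny toy functions. It is not the Porter or Lancaster algorithm and not WordNet; the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize` are all assumptions made for illustration.

```python
# Toy illustration only: NOT the Porter/Lancaster algorithms and NOT WordNet.
TOY_SUFFIXES = ("ations", "ation", "ments", "ment", "ings", "ing", "ly", "s")  # hypothetical rules

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexical database.
TOY_LEMMAS = {
    "systems": "system",
    "decisions": "decision",
    "making": "make",
    "learning": "learn",
}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["systems", "making", "independently", "ai"]:
    print(f"{w:<13} | {toy_stem(w):<12} | {toy_lemmatize(w)}")
```

With NLTK itself, passing a part of speech, for example `lemmatizer.lemmatize("making", pos="v")`, is the usual way to get verb lemmas instead of the noun default.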


### Q3. Regular Expressions and Text Splitting
#### 1. Take the original text from Question 1.
#### 2. Use regular expressions to:
#### a. Extract all words with more than 5 letters.
#### b. Extract all numbers (if any exist in the text).
#### c. Extract all capitalized words.
#### 3. Use text splitting techniques to:
#### a. Split the text into words containing only alphabets (removing digits and special characters).
#### b. Extract words starting with a vowel.

In [10]:
import re

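# 2a. Words with more than 5 letters, i.e. 6 or more (letters only, so digits and underscores do not count).
words_5plus = re.findall(r'\b[a-zA-Z]{6,}\b', paragraph)

# 2b. Standalone integers (the paragraph contains none).
numbers = re.findall(r'\b\d+\b', paragraph)

# 2c. Capitalized words. This pattern requires a lowercase letter after the
# capital, so all-caps tokens such as "AI" are not matched.
capitalized_words = re.findall(r'\b[A-Z][a-z]+\b', paragraph)

# 3a. Purely alphabetic words.
alpha_words = re.findall(r'\b[a-zA-Z]+\b', paragraph)

# 3b. Words that start with a vowel (either case).
vowel_words = re.findall(r'\b[aeiouAEIOU][a-zA-Z]*\b', paragraph)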

print("Words with >5 letters:", words_5plus)
print("Numbers:", numbers)
print("Capitalized Words:", capitalized_words)
print("Alphabetic Words Only:", alpha_words[:10])  # show first 10
print("Words starting with vowels:", vowel_words)


Words with >5 letters: ['Agentic', 'exciting', 'advancement', 'artificial', 'intelligence', 'systems', 'independently', 'systems', 'capable', 'making', 'decisions', 'learning', 'experiences', 'adapting', 'environment', 'function', 'minimal', 'intervention', 'potential', 'various', 'fields', 'healthcare', 'education', 'robotics', 'continues', 'evolve', 'agentic', 'systems', 'become', 'companions', 'collaborators', 'Agentic', 'raises', 'important', 'questions', 'responsibility', 'safety', 'ethics']
Numbers: []
Capitalized Words: ['Agentic', 'These', 'They', 'As', 'The', 'Agentic']
Alphabetic Words Only: ['Agentic', 'AI', 'is', 'an', 'exciting', 'advancement', 'in', 'artificial', 'intelligence', 'where']
Words starting with vowels: ['Agentic', 'AI', 'is', 'an', 'exciting', 'advancement', 'in', 'artificial', 'intelligence', 'act', 'independently', 'AI', 'are', 'of', 'experiences', 'and', 'adapting', 'environment', 'intervention', 'and', 'in', 'as', 'education', 'and', 'As', 'AI', 'evolve',
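
Part 3 of the question asks for text splitting, whereas the cell above extracts matches with `re.findall`. A minimal sketch of the splitting approach follows, using `re.split` on runs of non-letters and a plain list comprehension for the vowel filter. The sample sentence is made up for illustration so that digits and symbols are actually present.

```python
import re

# Hypothetical sample containing digits and special characters.
text = "Agentic AI adapts in 2024: safety, ethics & 3 open questions!"

# 3a. Split on any run of non-letters; drop the empty strings that
# re.split leaves when the text starts or ends with a delimiter.
alpha_words = [w for w in re.split(r"[^A-Za-z]+", text) if w]

# 3b. Keep the words whose first letter is a vowel, ignoring case.
vowel_words = [w for w in alpha_words if w[0].lower() in "aeiou"]

print(alpha_words)
print(vowel_words)
```

Splitting and matching give the same word list for this kind of input; the split form simply states the delimiter instead of the token.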

### Q4. Custom Tokenization & Regex-based Text Cleaning
#### 1. Take original text from Question 1.
#### 2. Write a custom tokenization function that:
#### a. Removes punctuation and special symbols, but keeps contractions (e.g., "isn't" should not be split into "is" and "n't").
#### b. Handles hyphenated words as a single token (e.g., "state-of-the-art" remains a single token).
#### c. Tokenizes numbers separately but keeps decimal numbers intact (e.g., "3.14" should remain as is).
#### 3. Use Regex Substitutions (re.sub) to:
#### a. Replace email addresses with '<EMAIL>' placeholder.
#### b. Replace URLs with '<URL>' placeholder.
#### c. Replace phone numbers (formats: 123-456-7890 or +91 9876543210) with '<PHONE>' placeholder.

In [12]:
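def custom_tokenizer(text):
    # Order matters: the regex takes the first alternative that matches, so
    # decimals and contractions must come before plain words and integers.
    pattern = (
        r"\d+\.\d+"                        # decimals such as 3.14
        r"|[a-zA-Z]+(?:['’][a-zA-Z]+)+"    # contractions, straight or curly apostrophe
        r"|[a-zA-Z]+(?:-[a-zA-Z]+)*"       # words, including hyphenated compounds
        r"|\d+"                            # remaining integers
    )
    return re.findall(pattern, text)

sample_text = "Contact us at ai@example.com or visit https://agentic.ai. Call 123-456-7890 or +91 9876543210. This state-of-the-art model isn’t bad. It costs 3.14 dollars!"

# Tokens are taken from the raw text, so the email, URL and phone numbers are split into pieces.
tokens = custom_tokenizer(sample_text)

# Placeholder substitutions. Note that \S+ also swallows the period after the URL.
text_cleaned = re.sub(r'\S+@\S+', '<EMAIL>', sample_text)
text_cleaned = re.sub(r'https?://\S+', '<URL>', text_cleaned)
text_cleaned = re.sub(r'(\+91\s?\d{10}|\d{3}[-.\s]?\d{3}[-.\s]?\d{4})', '<PHONE>', text_cleaned)

print("Custom Tokens:", tokens)
print("Cleaned Text:", text_cleaned)


Custom Tokens: ['Contact', 'us', 'at', 'ai', 'example', 'com', 'or', 'visit', 'https', 'agentic', 'ai', 'Call', '123', '456', '7890', 'or', '91', '9876543210', 'This', 'state-of-the-art', 'model', 'isn’t', 'bad', 'It', 'costs', '3.14', 'dollars']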
Cleaned Text: Contact us at <EMAIL> or visit <URL> Call <PHONE> or <PHONE>. This state-of-the-art model isn’t bad. It costs 3.14 dollars!
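
One possible variation on the cell above is to substitute the placeholders first and tokenize afterwards, so that each email, URL, and phone number survives as a single token. The sketch below is one way to do that under stated assumptions: the tighter email and URL patterns (which leave trailing sentence punctuation in place), the extra placeholder alternative in the token pattern, and the names `clean` and `tokenize` are not from the original cell.

```python
import re

# Assumed patterns: stricter than \S+ so a sentence-final period is not swallowed.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"https?://\S+?(?=[.,;:!?)]*(?:\s|$))")
PHONE = re.compile(r"\+91\s?\d{10}|\d{3}[-.\s]?\d{3}[-.\s]?\d{4}")

TOKEN = re.compile(
    r"<(?:EMAIL|URL|PHONE)>"                 # placeholders stay whole
    r"|\d+\.\d+"                             # decimals such as 3.14
    r"|[A-Za-z]+(?:['\u2019][A-Za-z]+)+"     # contractions, straight or curly apostrophe
    r"|[A-Za-z]+(?:-[A-Za-z]+)*"             # words and hyphenated compounds
    r"|\d+"                                  # remaining integers
)

def clean(text):
    """Replace emails, URLs and phone numbers with placeholders, in that order."""
    text = EMAIL.sub("<EMAIL>", text)
    text = URL.sub("<URL>", text)
    return PHONE.sub("<PHONE>", text)

def tokenize(text):
    """Tokenize the cleaned text so each placeholder is one token."""
    return TOKEN.findall(clean(text))

sample = (
    "Contact us at ai@example.com or visit https://agentic.ai. "
    "Call 123-456-7890 or +91 9876543210. "
    "This state-of-the-art model isn\u2019t bad. It costs 3.14 dollars!"
)

cleaned = clean(sample)
tokens = tokenize(sample)
print(cleaned)
print(tokens)
```

Running the substitutions before tokenizing keeps the token pattern simple: it only needs one extra alternative for the three placeholder strings rather than full email, URL, and phone grammars.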
