
Concordance

In [2]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [6]:
import nltk
from nltk.text import Text
from nltk.tokenize import word_tokenize

nltk.download('punkt')

custom_text = """
Life is what happens when you're busy making other plans.
Life can be beautiful if you know how to live it.
Everyone's life has a purpose and a meaning.
The life of a sailor is full of adventure.
She dedicated her life to science and discovery.
"""

tokens = word_tokenize(custom_text)
text_obj = Text(tokens)
word = input("Enter the word to find concordance for: ")
# Note: concordance() treats width as the total line width in characters, not a token count
width = int(input("Enter the window size (e.g., 40): "))
print(f"\nConcordance for '{word}' with window size {width}:\n")
text_obj.concordance(word, width=width)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Enter the word to find concordance for: life
Enter the window size (e.g., 40): 10

Concordance for 'life' with window size 10:

Displaying 5 of 5 matches:
   Life is
 . Life ca
's life ha
he life of
er life to


In [5]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

custom_text = """
Life is what happens when you're busy making other plans.
Life can be beautiful if you know how to live it.
Everyone's life has a purpose and a meaning.
The life of a sailor is full of adventure.
She dedicated her life to science and discovery.
"""
tokens = word_tokenize(custom_text)
target_word = input("Enter the word to search: ").lower()
window_size = int(input("Enter the window size: "))
tokens_lower = [word.lower() for word in tokens]
print(f"\nConcordance for '{target_word}' with window size {window_size}:\n")

for i, word in enumerate(tokens_lower):
    if word == target_word:
        left = tokens[max(0, i - window_size): i]
        right = tokens[i + 1: i + 1 + window_size]
        center = tokens[i]
        print("... " + ' '.join(left) + " [" + center + "] " + ' '.join(right) + " ...")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Enter the word to search: life
Enter the window size: 2

Concordance for 'life' with window size 2:

...  [Life] is what ...
... plans . [Life] can be ...
... Everyone 's [life] has a ...
... . The [life] of a ...
... dedicated her [life] to science ...


Counting Vocabulary

In [8]:
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

nltk.download('punkt')

custom_text = """
Life is what happens when you're busy making other plans.
Life can be beautiful if you know how to live it.
Everyone's life has a purpose and a meaning.
The life of a sailor is full of adventure.
She dedicated her life to science and discovery.
"""

tokens = word_tokenize(custom_text.lower())  # Normalize to lowercase
# Note: these counts include punctuation tokens such as '.'
total_tokens = len(tokens)
unique_tokens = len(set(tokens))

# ----------- Frequency Analysis ------------
freq_dist = Counter(tokens)
the_count = freq_dist["the"]
the_percentage = (the_count / total_tokens) * 100
ttr = unique_tokens / total_tokens

# ----------- Display Results ------------
print(f"1. Total number of words (tokens): {total_tokens}")
print(f"2. Number of different words (types): {unique_tokens}")
print(f"3. Occurrences of the word 'the': {the_count}")
print(f"4. Percentage of 'the' in the text: {the_percentage:.2f}%")
print(f"5. Type-Token Ratio (TTR): {ttr:.3f}")
print("\n6. Occurrence of each word in the text:\n")

# ----------- Print Word Frequencies ------------
for word, count in freq_dist.items():
    print(f"{word}: {count}")


Counter({'life': 5, '.': 5, 'a': 3, 'is': 2, 'you': 2, 'to': 2, 'and': 2, 'of': 2, 'what': 1, 'happens': 1, 'when': 1, "'re": 1, 'busy': 1, 'making': 1, 'other': 1, 'plans': 1, 'can': 1, 'be': 1, 'beautiful': 1, 'if': 1, 'know': 1, 'how': 1, 'live': 1, 'it': 1, 'everyone': 1, "'s": 1, 'has': 1, 'purpose': 1, 'meaning': 1, 'the': 1, 'sailor': 1, 'full': 1, 'adventure': 1, 'she': 1, 'dedicated': 1, 'her': 1, 'science': 1, 'discovery': 1})
1. Total number of words (tokens): 53
2. Number of different words (types): 38
3. Occurrences of the word 'the': 1
4. Percentage of 'the' in the text: 1.89%
5. Type-Token Ratio (TTR): 0.717

6. Occurrence of each word in the text:

life: 5
is: 2
what: 1
happens: 1
when: 1
you: 2
're: 1
busy: 1
making: 1
other: 1
plans: 1
.: 5
can: 1
be: 1
beautiful: 1
if: 1
know: 1
how: 1
to: 2
live: 1
it: 1
everyone: 1
's: 1
has: 1
a: 3
purpose: 1
and: 2
meaning: 1
the: 1
of: 2
sailor: 1
full: 1
adventure: 1
she: 1
dedicated: 1
her: 1
science: 1
discovery: 1


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
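The totals above count punctuation marks such as '.' as tokens. The following is a minimal sketch, assuming word-only statistics are wanted, that recomputes the figures from the frequency table printed in the output. `is_word` is a hypothetical helper, and treating any token with at least one letter or digit as a word is an assumption, not part of the original cell.

```python
from collections import Counter

# Frequencies copied from the Counter shown in the cell output above.
freq_dist = Counter({
    'life': 5, '.': 5, 'a': 3, 'is': 2, 'you': 2, 'to': 2, 'and': 2, 'of': 2,
    'what': 1, 'happens': 1, 'when': 1, "'re": 1, 'busy': 1, 'making': 1,
    'other': 1, 'plans': 1, 'can': 1, 'be': 1, 'beautiful': 1, 'if': 1,
    'know': 1, 'how': 1, 'live': 1, 'it': 1, 'everyone': 1, "'s": 1,
    'has': 1, 'purpose': 1, 'meaning': 1, 'the': 1, 'sailor': 1, 'full': 1,
    'adventure': 1, 'she': 1, 'dedicated': 1, 'her': 1, 'science': 1,
    'discovery': 1,
})

def is_word(token):
    """Hypothetical helper: keep tokens that contain a letter or digit."""
    return any(ch.isalnum() for ch in token)

# Drop pure-punctuation tokens, then recompute the same statistics.
word_freq = Counter({tok: n for tok, n in freq_dist.items() if is_word(tok)})

total_words = sum(word_freq.values())
word_types = len(word_freq)
word_ttr = word_types / total_words

print(f"Tokens without punctuation: {total_words}")
print(f"Types without punctuation: {word_types}")
print(f"Type-Token Ratio without punctuation: {word_ttr:.3f}")
```

Clitic tokens such as 're and 's stay in the counts because they contain letters; whether they should count as words depends on the analysis.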


Preprocessing

In [9]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# ----------- Custom Text ------------
text = """
Life is what happens when you're busy making other plans.
Life can be beautiful if you know how to live it.
Everyone's life has a purpose and a meaning.
The life of a sailor is full of adventure.
She dedicated her life to science and discovery.
"""

# ----------- Step 1: Tokenization ------------
tokens = word_tokenize(text)
print("Tokens:")
print(tokens)

# ----------- Step 2: Filtration (remove punctuation, numbers) ------------
filtered_tokens = [word for word in tokens if word.isalpha()]
print("\nAfter Filtration (alphabetic only):")
print(filtered_tokens)

# ----------- Step 3: Stop Word Removal ------------
stop_words = set(stopwords.words('english'))
tokens_no_stopwords = [word for word in filtered_tokens if word.lower() not in stop_words]
print("\nAfter Stop Word Removal:")
print(tokens_no_stopwords)

# ----------- Step 4: Stemming ------------
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in tokens_no_stopwords]
print("\nAfter Stemming:")
print(stemmed)

# ----------- Step 5: Lemmatization ------------
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word.lower()) for word in tokens_no_stopwords]
print("\nAfter Lemmatization:")
print(lemmatized)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Tokens:
['Life', 'is', 'what', 'happens', 'when', 'you', "'re", 'busy', 'making', 'other', 'plans', '.', 'Life', 'can', 'be', 'beautiful', 'if', 'you', 'know', 'how', 'to', 'live', 'it', '.', 'Everyone', "'s", 'life', 'has', 'a', 'purpose', 'and', 'a', 'meaning', '.', 'The', 'life', 'of', 'a', 'sailor', 'is', 'full', 'of', 'adventure', '.', 'She', 'dedicated', 'her', 'life', 'to', 'science', 'and', 'discovery', '.']

After Filtration (alphabetic only):
['Life', 'is', 'what', 'happens', 'when', 'you', 'busy', 'making', 'other', 'plans', 'Life', 'can', 'be', 'beautiful', 'if', 'you', 'know', 'how', 'to', 'live', 'it', 'Everyone', 'life', 'has', 'a', 'purpose', 'and', 'a', 'meaning', 'The', 'life', 'of', 'a', 'sailor', 'is', 'full', 'of', 'adventure', 'She', 'dedicated', 'her', 'life', 'to', 'science', 'and', 'discovery']

After Stop Word Removal:
['Life', 'happens', 'busy', 'making', 'plans', 'Life', 'beautiful', 'know', 'live', 'Everyone', 'life', 'purpose', 'meaning', 'life', 'sailor',

In [10]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import unicodedata

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# ----------- Custom Text ------------
text = """
Life is what happens when you're busy making other plans.
人生は、他の計画を立てているときに起こるものです。
Life can be beautiful if you know how to live it.
Everyone's life has a purpose and a meaning.
The life of a sailor is full of adventure.
She dedicated her life to science and discovery.
"""

# ----------- Step 1: Tokenization ------------
tokens = word_tokenize(text)
print("Tokens:")
print(tokens)

# ----------- Step 2: Filtration (keep only alphabetic words) ------------
alphabetic_tokens = [word for word in tokens if word.isalpha()]

# ----------- Step 3: Script Validation (only Latin/English script) ------------
def is_latin(word):
    for ch in word:
        if 'LATIN' not in unicodedata.name(ch, ''):
            return False
    return True

validated_tokens = [word for word in alphabetic_tokens if is_latin(word)]
print("\nAfter Script Validation:")
print(validated_tokens)

# ----------- Step 4: Stop Word Removal ------------
stop_words = set(stopwords.words('english'))
tokens_no_stopwords = [word for word in validated_tokens if word.lower() not in stop_words]
print("\nAfter Stop Word Removal:")
print(tokens_no_stopwords)

# ----------- Step 5: Stemming ------------
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in tokens_no_stopwords]
print("\nAfter Stemming:")
print(stemmed)

# ----------- Step 6: Lemmatization ------------
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word.lower()) for word in tokens_no_stopwords]
print("\nAfter Lemmatization:")
print(lemmatized)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Tokens:
['Life', 'is', 'what', 'happens', 'when', 'you', "'re", 'busy', 'making', 'other', 'plans', '.', '人生は、他の計画を立てているときに起こるものです。', 'Life', 'can', 'be', 'beautiful', 'if', 'you', 'know', 'how', 'to', 'live', 'it', '.', 'Everyone', "'s", 'life', 'has', 'a', 'purpose', 'and', 'a', 'meaning', '.', 'The', 'life', 'of', 'a', 'sailor', 'is', 'full', 'of', 'adventure', '.', 'She', 'dedicated', 'her', 'life', 'to', 'science', 'and', 'discovery', '.']

After Script Validation:
['Life', 'is', 'what', 'happens', 'when', 'you', 'busy', 'making', 'other', 'plans', 'Life', 'can', 'be', 'beautiful', 'if', 'you', 'know', 'how', 'to', 'live', 'it', 'Everyone', 'life', 'has', 'a', 'purpose', 'and', 'a', 'meaning', 'The', 'life', 'of', 'a', 'sailor', 'is', 'full', 'of', 'adventure', 'She', 'dedicated', 'her', 'life', 'to', 'science', 'and', 'discovery']

After Stop Word Removal:
['Life', 'happens', 'busy', 'making', 'plans', 'Life', 'beautiful', 'know', 'live', 'Everyone', 'life', 'purpose', 'meaning',
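In the run above, the Japanese sentence arrives as a single token that contains the punctuation marks '、' and '。', so `isalpha()` already discards it before `is_latin` is reached. The sketch below uses made-up tokens (not taken from the notebook's text) to show a case where the script check matters: `str.isalpha()` is true for letters of any script, so it does not filter non-Latin words on its own.

```python
import unicodedata

def is_latin(word):
    """Return True if every character's Unicode name mentions LATIN."""
    return all('LATIN' in unicodedata.name(ch, '') for ch in word)

# Hypothetical tokens for illustration only.
sample_tokens = ['life', 'caf\u00e9', '人生', 'жизнь', 'plans']

# Step 1: alphabetic filter. Every sample token passes, whatever its script.
alphabetic = [tok for tok in sample_tokens if tok.isalpha()]

# Step 2: script validation. Only tokens made of Latin letters remain.
latin_only = [tok for tok in alphabetic if is_latin(tok)]

print("Alphabetic:", alphabetic)
print("Latin only:", latin_only)
```

Accented Latin letters such as 'é' are kept because their Unicode names also contain LATIN.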

Parsing

In [12]:
import nltk
from nltk import CFG, PCFG
from nltk.parse import ChartParser
from nltk.parse import pchart

sentence = "the cat chased the mouse".split()

cfg_grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'mouse'
V -> 'chased'
""")

print("=== Constituency Parsing (CFG) ===")
cfg_parser = ChartParser(cfg_grammar)
for tree in cfg_parser.parse(sentence):
    print(tree)
    tree.pretty_print()

pcfg_grammar = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [1.0]
VP -> V NP [1.0]
Det -> 'the' [1.0]
N -> 'cat' [0.5] | 'mouse' [0.5]
V -> 'chased' [1.0]
""")

print("\n=== Probabilistic Parsing (PCFG) ===")
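# InsideChartParser is a bottom-up probabilistic chart parser (distinct from nltk.parse.ViterbiParser)
inside_parser = pchart.InsideChartParser(pcfg_grammar)
for tree in inside_parser.parse(sentence):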
    print(tree)
    tree.pretty_print()


=== Constituency Parsing (CFG) ===
(S (NP (Det the) (N cat)) (VP (V chased) (NP (Det the) (N mouse))))
              S                 
      ________|_____             
     |              VP          
     |         _____|___         
     NP       |         NP      
  ___|___     |      ___|____    
Det      N    V    Det       N  
 |       |    |     |        |   
the     cat chased the     mouse


=== Probabilistic Parsing (PCFG) ===
(S
  (NP (Det the) (N cat))
  (VP (V chased) (NP (Det the) (N mouse)))) (p=0.25)
              S                 
      ________|_____             
     |              VP          
     |         _____|___         
     NP       |         NP      
  ___|___     |      ___|____    
Det      N    V    Det       N  
 |       |    |     |        |   
the     cat chased the     mouse
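The value p=0.25 reported for the PCFG parse is the product of the probabilities of every rule used in the tree. As a minimal sketch, the tree is written by hand as nested tuples (an assumption made so the example runs without NLTK), and `tree_probability` is a hypothetical helper that multiplies the rule probabilities.

```python
from math import prod

# Rule probabilities copied from the PCFG grammar above.
rule_prob = {
    ('S', ('NP', 'VP')): 1.0,
    ('NP', ('Det', 'N')): 1.0,
    ('VP', ('V', 'NP')): 1.0,
    ('Det', ('the',)): 1.0,
    ('N', ('cat',)): 0.5,
    ('N', ('mouse',)): 0.5,
    ('V', ('chased',)): 1.0,
}

# The parse tree as nested tuples: (label, child, ...); leaves are strings.
tree = ('S',
        ('NP', ('Det', 'the'), ('N', 'cat')),
        ('VP', ('V', 'chased'),
               ('NP', ('Det', 'the'), ('N', 'mouse'))))

def tree_probability(node):
    """Multiply the probability of every rule used in the tree."""
    if isinstance(node, str):  # a word contributes no rule of its own
        return 1.0
    label, *children = node
    # Right-hand side of the rule: child labels, or the word itself for leaves.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return rule_prob[(label, rhs)] * prod(tree_probability(c) for c in children)

p = tree_probability(tree)
print(f"p = {p}")
```

Only the two N rules have probability below 1.0, so the product is 0.5 × 0.5.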



Bag of Words

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Sample documents
documents = [
    "I love Natural Language Processing!",
    "Language processing with Python is fun.",
    "Natural language processing includes text mining and NLP."
]

# Step 1: Vectorization using Bag of Words
vectorizer = CountVectorizer(lowercase=True, stop_words='english', token_pattern=r'\b\w+\b')
X = vectorizer.fit_transform(documents)

# Step 2: Convert to DataFrame (for viewing)
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Step 3: Compute Cosine Similarity
cosine_sim = cosine_similarity(X)

# Step 4: Display
print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\n🧾 Bag of Words (Document-Term Matrix):")
print(bow_df)

print("\n📏 Cosine Similarity Between Sentences:")
cosine_df = pd.DataFrame(cosine_sim, index=[f"Doc{i+1}" for i in range(len(documents))],
                                      columns=[f"Doc{i+1}" for i in range(len(documents))])
print(cosine_df)


Vocabulary:
['fun' 'includes' 'language' 'love' 'mining' 'natural' 'nlp' 'processing'
 'python' 'text']

🧾 Bag of Words (Document-Term Matrix):
   fun  includes  language  love  mining  natural  nlp  processing  python  \
0    0         0         1     1       0        1    0           1       0   
1    1         0         1     0       0        0    0           1       1   
2    0         1         1     0       1        1    1           1       0   

   text  
0     0  
1     0  
2     1  

📏 Cosine Similarity Between Sentences:
          Doc1      Doc2      Doc3
Doc1  1.000000  0.500000  0.566947
Doc2  0.500000  1.000000  0.377964
Doc3  0.566947  0.377964  1.000000
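The similarity scores above can be checked by hand: cosine similarity is the dot product of two count vectors divided by the product of their lengths. The sketch below copies the rows of the document-term matrix from the output and recomputes the three off-diagonal values with the standard library only. `cosine` is a hypothetical helper, not the scikit-learn function.

```python
from math import sqrt

# Rows of the document-term matrix printed above.
# Columns: fun, includes, language, love, mining, natural, nlp,
#          processing, python, text
doc_vectors = {
    "Doc1": [0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    "Doc2": [1, 0, 1, 0, 0, 0, 0, 1, 1, 0],
    "Doc3": [0, 1, 1, 0, 1, 1, 1, 1, 0, 1],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_12 = cosine(doc_vectors["Doc1"], doc_vectors["Doc2"])
sim_13 = cosine(doc_vectors["Doc1"], doc_vectors["Doc3"])
sim_23 = cosine(doc_vectors["Doc2"], doc_vectors["Doc3"])

print(f"Doc1 vs Doc2: {sim_12:.6f}")
print(f"Doc1 vs Doc3: {sim_13:.6f}")
print(f"Doc2 vs Doc3: {sim_23:.6f}")
```

For example, Doc1 and Doc2 share two terms (language, processing) and each has four terms, giving 2 / (2 × 2) = 0.5.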


Similar Sentence

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: Read the file
filename = "/content/similar.txt"  # Ensure this file exists
with open(filename, 'r', encoding='utf-8') as file:
    sentences = [line.strip() for line in file if line.strip()]

# Step 2: Get user input
input_sentence = input("Enter a sentence: ")

# Step 3: Add the input sentence to the list
all_sentences = sentences + [input_sentence]

# Step 4: Vectorize using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(all_sentences)

# Step 5: Compute cosine similarity between input sentence and all others
similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])

# Step 6: Find the index of the most similar sentence
most_similar_index = similarities.argmax()
most_similar_score = similarities[0][most_similar_index]
most_similar_sentence = sentences[most_similar_index]

# Step 7: Print result
print("\nMost similar sentence from the file:")
print(f"> {most_similar_sentence}")
print(f"\nCosine Similarity Score: {most_similar_score:.4f}")


Enter a sentence: natural language

Most similar sentence from the file:
> Natural language processing is a field of artificial intelligence.

Cosine Similarity Score: 0.4168


NER

In [19]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [21]:
import nltk
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True

In [10]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."

tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
print(pos_tags)
named_entities = ne_chunk(pos_tags)
print("Named Entities in the text:\n")
print(named_entities)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


[('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Hawaii', 'NNP'), ('and', 'CC'), ('served', 'VBD'), ('as', 'IN'), ('the', 'DT'), ('44th', 'CD'), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('.', '.')]
Named Entities in the text:

(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP)
  and/CC
  served/VBD
  as/IN
  the/DT
  44th/CD
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  ./.)


In [24]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download only what's necessary
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

# Input text
text = "Elon Musk founded SpaceX and Tesla in the United States."

# Tokenize and tag
tokens = word_tokenize(text)
tags = pos_tag(tokens)

# Named Entity Recognition
tree = ne_chunk(tags)

# Display named entities
for subtree in tree:
    if isinstance(subtree, nltk.Tree):
        entity = " ".join(word for word, tag in subtree.leaves())
        entity_type = subtree.label()
        print(f"{entity} => {entity_type}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


Elon => PERSON
Musk => PERSON
SpaceX => ORGANIZATION
Tesla => GPE
United States => GPE
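In the output above, "Elon" and "Musk" are reported as two separate PERSON chunks. A minimal sketch of one possible post-processing step follows, assuming that directly adjacent chunks with the same label belong to one entity. That assumption can wrongly join two distinct neighbouring entities. Plain tuples stand in for the chunk tree so the example runs without NLTK, and `merge_adjacent` is a hypothetical helper.

```python
# Stand-in for iterating over the ne_chunk tree: an entity is (text, label),
# and every token outside an entity is represented as None.
chunks = [
    ("Elon", "PERSON"), ("Musk", "PERSON"),
    None,                        # founded
    ("SpaceX", "ORGANIZATION"),
    None,                        # and
    ("Tesla", "GPE"),
    None, None,                  # in the
    ("United States", "GPE"),
    None,                        # .
]

def merge_adjacent(chunks):
    """Join directly adjacent entities that share the same label."""
    merged = []
    can_extend = False  # True only when the previous item was an entity
    for item in chunks:
        if item is None:
            can_extend = False
            continue
        text, label = item
        if can_extend and merged[-1][1] == label:
            merged[-1] = (merged[-1][0] + " " + text, label)
        else:
            merged.append((text, label))
        can_extend = True
    return merged

entities = merge_adjacent(chunks)
for text, label in entities:
    print(f"{text} => {label}")
```

"Tesla" and "United States" are both GPE but are not merged, because other tokens sit between them.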


TF-IDF

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
documents = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun."
]

# Create the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the TF-IDF scores as a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Display the TF-IDF values
print("TF-IDF values for each word in each document:\n")
print(tfidf_df)


TF-IDF values for each word in each document:

       blue    bright       can        in        is       see   shining  \
0  0.659191  0.000000  0.000000  0.000000  0.420753  0.000000  0.000000   
1  0.000000  0.522109  0.000000  0.000000  0.522109  0.000000  0.000000   
2  0.000000  0.321846  0.000000  0.504235  0.321846  0.000000  0.000000   
3  0.000000  0.239102  0.374599  0.000000  0.000000  0.374599  0.374599   

        sky       sun       the        we  
0  0.519714  0.000000  0.343993  0.000000  
1  0.000000  0.522109  0.426858  0.000000  
2  0.397544  0.321846  0.526261  0.000000  
3  0.000000  0.478204  0.390963  0.374599  
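The values in the table follow from the default settings of `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency idf = ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The sketch below reproduces the rows with the standard library only. The regular expression imitates the default tokenizer (lowercased tokens of two or more word characters), and `tokenize` and `tfidf_row` are hypothetical helper names.

```python
import math
import re
from collections import Counter

documents = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

def tokenize(text):
    """Lowercase and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(documents)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed idf, as if one extra document contained every term once.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in doc_freq.items()}

def tfidf_row(tokens):
    """Term count times idf, then scale the row to unit Euclidean length."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]

for term, value in sorted(rows[0].items()):
    print(f"{term}: {value:.6f}")
```

The word "the" occurs in all four documents, so its idf is ln(5/5) + 1 = 1, the lowest possible value under this formula.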


POS tagging

In [27]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import treebank
from nltk.tag import hmm
from nltk.probability import LaplaceProbDist

nltk.download('treebank')

tag_expansion = {
    'CC': 'Coordinating Conjunction',
    'CD': 'Cardinal Number',
    'DT': 'Determiner',
    'EX': 'Existential There',
    'FW': 'Foreign Word',
    'IN': 'Preposition/Subordinating Conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, Comparative',
    'JJS': 'Adjective, Superlative',
    'LS': 'List Item Marker',
    'MD': 'Modal',
    'NN': 'Noun, Singular',
    'NNS': 'Noun, Plural',
    'NNP': 'Proper Noun, Singular',
    'NNPS': 'Proper Noun, Plural',
    'PDT': 'Predeterminer',
    'POS': 'Possessive Ending',
    'PRP': 'Personal Pronoun',
    'PRP$': 'Possessive Pronoun',
    'RB': 'Adverb',
    'RBR': 'Adverb, Comparative',
    'RBS': 'Adverb, Superlative',
    'RP': 'Particle',
    'SYM': 'Symbol',
    'TO': 'To',
    'UH': 'Interjection',
    'VB': 'Verb, Base Form',
    'VBD': 'Verb, Past Tense',
    'VBG': 'Verb, Gerund or Present Participle',
    'VBN': 'Verb, Past Participle',
    'VBP': 'Verb, Non-3rd Person Singular Present',
    'VBZ': 'Verb, 3rd Person Singular Present',
    'WDT': 'Wh-determiner',
    'WP': 'Wh-pronoun',
    'WP$': 'Possessive Wh-pronoun',
    'WRB': 'Wh-adverb'
}

def expand_tag(tag):		# Function to expand the tag
    return tag_expansion.get(tag, "Unknown Tag")

text = input("Enter a sentence: ")		# Word tokenization
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)

lemmatizer = WordNetLemmatizer()		# Lemmatization
l_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized tokens:", l_tokens)


# Train the HMM POS Tagger using Treebank corpus
train_sents = treebank.tagged_sents()[:4000]    # Up to the first 4000 sentences for training
test_sents = treebank.tagged_sents()[4000:]     # Any remaining sentences for testing (unused below)

trainer = hmm.HiddenMarkovModelTrainer()	    # Train the HMM POS Tagger
hmm_tagger = trainer.train(train_sents, estimator=LaplaceProbDist)	 # Use Laplace smoothing

tags = hmm_tagger.tag(l_tokens)			        # Use the trained tagger on the lemmatized tokens

print("\nPOS tags for the lemmatized sentence using HMM tagger:")
for l_token, tag in tags:
    expanded_tag = expand_tag(tag)
    print(f"{l_token}: {tag} ({expanded_tag})")

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


Enter a sentence: thq brown fox jumbs
Tokens: ['thq', 'brown', 'fox', 'jumbs']
Lemmatized tokens: ['thq', 'brown', 'fox', 'jumbs']

POS tags for the lemmatized sentence using HMM tagger:
thq: DT (Determiner)
brown: NN (Noun, Singular)
fox: . (Unknown Tag)
jumbs: '' (Unknown Tag)


In [28]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import treebank
from nltk.tag import hmm
from nltk.probability import LaplaceProbDist

text = input("Enter a sentence: ")		        # Word tokenization
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)

lemmatizer = WordNetLemmatizer()		        # Lemmatization
l_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized tokens:", l_tokens)

train_sents = treebank.tagged_sents()[:4000]
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train(train_sents, estimator=LaplaceProbDist)

tags = hmm_tagger.tag(l_tokens)

print("\nPOS tags for the lemmatized sentence using HMM tagger:")
for l_token, tag in tags:
    print(f"{l_token}: {tag}")


Enter a sentence: the brown fox jumps
Tokens: ['the', 'brown', 'fox', 'jumps']
Lemmatized tokens: ['the', 'brown', 'fox', 'jump']

POS tags for the lemmatized sentence using HMM tagger:
the: DT
brown: NN
fox: .
jump: ''


Basic Chatbot


In [29]:
import random

qa_pairs = {
    "hi": ["Hello!", "Hi there!", "Hey! How can I help you?"],
    "how are you": ["I'm doing well, thank you!", "Great, and you?"],
    "what is your name": ["I'm a simple chatbot.", "You can call me ChatBuddy!"],
    "bye": ["Goodbye!", "See you later!", "Have a great day!"],
    "what is python": ["Python is a popular programming language.", "A powerful, easy-to-learn programming language."],
    "who created you": ["I was created by a programmer using Python.", "That's a secret"],
    "thank you": ["You're welcome!", "No problem!", "Glad I could help!"]
}

def get_response(user_input):
    user_input = user_input.lower().strip()
    for question in qa_pairs:
        if question in user_input:
            return random.choice(qa_pairs[question])
    return "I'm not sure how to respond to that. Can you ask something else?"

print("ChatBot: Hello! Ask me something or type 'bye' to exit.")
while True:
    user_input = input("You: ")
    if user_input.lower() == "bye":
        print("ChatBot:", random.choice(qa_pairs["bye"]))
        break
    response = get_response(user_input)
    print("ChatBot:", response)


ChatBot: Hello! Ask me something or type 'bye' to exit.
You: hi
ChatBot: Hello!
You: what is python
ChatBot: A powerful, easy-to-learn programming language.
You: bye
ChatBot: Goodbye!
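The check `question in user_input` in the cell above is a substring test, so the key "hi" also matches inside words such as "this". A minimal sketch of a whole-word alternative using the standard `re` module follows. `known_questions` is a reduced copy of the keys above, and `find_question` is a hypothetical helper, not part of the original chatbot.

```python
import re

# A reduced copy of the question keys used above, for illustration.
known_questions = ["hi", "how are you", "what is python", "bye"]

def find_question(user_input):
    """Return the first known question that appears as whole words, else None."""
    text = user_input.lower().strip()
    for question in known_questions:
        # \b anchors the match at word boundaries; re.escape guards the key.
        if re.search(rf"\b{re.escape(question)}\b", text):
            return question
    return None

print("hi" in "what is this")          # plain substring test
print(find_question("what is this"))   # whole-word test
print(find_question("Hi there!"))
```

The matched key can then be used to look up a reply in `qa_pairs` exactly as before.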


Text Classification

In [32]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Dataset
data = {
    'text': [
        "I loved the movie, it was fantastic!",
        "What a terrible film. I will never watch it again.",
        "Absolutely wonderful acting and storyline.",
        "The plot was dull and the characters were boring.",
        "An excellent movie with great performances.",
        "Worst movie I've seen in years.",
        "It was okay, not the best but not the worst.",
        "The cinematography was beautiful, loved it.",
        "The film lacked depth and emotion.",
        "Great direction and a touching story."
    ],
    'label': ['positive', 'negative', 'positive', 'negative', 'positive',
              'negative', 'positive', 'positive', 'negative', 'positive']
}

df = pd.DataFrame(data)

# Text Vectorization (TF-IDF)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Labels
y = df['label']

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# --------- User Input Section ---------
while True:
    user_input = input("\nEnter a sentence to predict sentiment (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    input_vector = vectorizer.transform([user_input])  # Transform input using same vectorizer
    prediction = model.predict(input_vector)
    print("Predicted Sentiment:", prediction[0])



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Accuracy: 0.0

Classification Report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       2.0
    positive       0.00      0.00      0.00       0.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0


Enter a sentence to predict sentiment (or type 'exit' to quit): i love this movie
Predicted Sentiment: positive

Enter a sentence to predict sentiment (or type 'exit' to quit): exit
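The accuracy of 0.0 is less alarming than it looks. The report shows a test set of two samples, both labelled negative, which leaves six positive and two negative samples for training. The sketch below, using only the label list from the cell above, shows that a baseline which always predicts the majority training label would score the same. Which rows ended up in the test set is inferred from the report, not confirmed.

```python
from collections import Counter

# Labels copied from the dataset in the cell above.
labels = ['positive', 'negative', 'positive', 'negative', 'positive',
          'negative', 'positive', 'positive', 'negative', 'positive']

# Per the classification report above, both held-out samples are negative.
test_labels = ['negative', 'negative']

# Whatever is not held out remains for training.
train_counts = Counter(labels)
train_counts.subtract(test_labels)

# Baseline: always predict the most frequent training label.
majority_label = train_counts.most_common(1)[0][0]
baseline_accuracy = (
    sum(label == majority_label for label in test_labels) / len(test_labels)
)

print("Training label counts:", dict(train_counts))
print("Majority label:", majority_label)
print("Baseline accuracy on the test labels:", baseline_accuracy)
```

With so few samples, a single split says little about the model; a larger dataset or a stratified split would give a more meaningful estimate.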


In [31]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample dataset with 3 classes
data = {
    'text': [
        "I love this product",
        "This is the best thing ever!",
        "I am very happy with the results",
        "This is okay, not great but not bad",
        "It is fine",
        "Nothing special about it",
        "I hate it",
        "This is the worst purchase",
        "Totally disappointed"
    ],
    'label': [
        'positive', 'positive', 'positive',
        'neutral', 'neutral', 'neutral',
        'negative', 'negative', 'negative'
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.3, random_state=42)

# Build pipeline: TF-IDF + Logistic Regression
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(solver='liblinear'))
])

# Train model
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Classification Report:\n")
print(classification_report(y_test, y_pred))

# Predict user input
while True:
    user_input = input("\nEnter a sentence (or 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    prediction = model.predict([user_input])[0]
    print(f"Predicted Sentiment: {prediction}")


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Classification Report:

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         1
     neutral       0.00      0.00      0.00         1
    positive       0.50      1.00      0.67         1

    accuracy                           0.33         3
   macro avg       0.17      0.33      0.22         3
weighted avg       0.17      0.33      0.22         3


Enter a sentence (or 'exit' to quit): i dont like this
Predicted Sentiment: positive

Enter a sentence (or 'exit' to quit): i like this 
Predicted Sentiment: positive

Enter a sentence (or 'exit' to quit): exit


Translator

In [33]:
!pip install Translate

Collecting Translate
  Downloading translate-3.6.1-py2.py3-none-any.whl.metadata (7.7 kB)
Collecting libretranslatepy==2.1.1 (from Translate)
  Downloading libretranslatepy-2.1.1-py3-none-any.whl.metadata (233 bytes)
Downloading translate-3.6.1-py2.py3-none-any.whl (12 kB)
Downloading libretranslatepy-2.1.1-py3-none-any.whl (3.2 kB)
Installing collected packages: libretranslatepy, Translate
Successfully installed Translate-3.6.1 libretranslatepy-2.1.1


In [35]:
from translate import Translator

text = input("Enter the text to translate: ")
target_lang = input("Enter target language code (e.g., 'fr' for French): ")
translator = Translator(from_lang='en', to_lang=target_lang)
translation = translator.translate(text)
print(f"\nTranslated text (en → {target_lang}): {translation}")

Enter the text to translate: i like ice cream
Enter target language code (e.g., 'fr' for French): ml

Translated text (en → ml): എനിക്ക് ഐസ്ക്രീം ഇഷ്ടമാണ്


Preprocessing

In [36]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')

# Read input from user
text = input("Enter a line of text: ")

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Stopword Removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Tokens after removing stopwords:", filtered_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Enter a line of text: heloo good morning my name in manu
Tokens: ['heloo', 'good', 'morning', 'my', 'name', 'in', 'manu']
Tokens after removing stopwords: ['heloo', 'good', 'morning', 'name', 'manu']


In [37]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary resources
nltk.download('punkt')
nltk.download('wordnet')

# Read input from user
text = input("Enter a line of text: ")

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print("Stemmed Tokens:", stemmed_tokens)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized Tokens:", lemmatized_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Enter a line of text: those people were eating leaves
Tokens: ['those', 'people', 'were', 'eating', 'leaves']
Stemmed Tokens: ['those', 'peopl', 'were', 'eat', 'leav']
Lemmatized Tokens: ['those', 'people', 'were', 'eating', 'leaf']


In [39]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Downloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (118 kB)

In [44]:
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import contractions

# Download required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Helper: Get WordNet POS tag
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Helper: Get synonym
def get_synonym(word, pos=None):
    synsets = wordnet.synsets(word, pos=pos) if pos else wordnet.synsets(word)
    if synsets:
        lemmas = synsets[0].lemmas()
        for lemma in lemmas:
            if lemma.name().lower() != word.lower():
                return lemma.name().replace('_', ' ')
    return word

# Helper: Get antonym
def get_antonym(word):
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name().replace('_', ' ')
    return word

# --- Main Program ---

text = input("Enter a sentence: ")

# Step 1: Expand contractions
expanded_text = contractions.fix(text)
print("\nExpanded Text:", expanded_text)

# Step 2: Tokenize and POS tag
tokens = word_tokenize(expanded_text)
tagged = pos_tag(tokens)

# Step 3: Replace with synonyms and antonyms
modified_tokens = []
i = 0
while i < len(tagged):
    word, tag = tagged[i]
    wn_pos = get_wordnet_pos(tag)
    if word.lower() == "not" and i + 1 < len(tagged):
        # Look for antonym of next word
        next_word, next_tag = tagged[i + 1]
        antonym = get_antonym(next_word)
        modified_tokens.append(antonym)
        i += 2  # Skip "not" and next word
    else:
        # Replace with synonym
        synonym = get_synonym(word, wn_pos)
        modified_tokens.append(synonym)
        i += 1

print("\nModified Text:")
print(' '.join(modified_tokens))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Enter a sentence: i can't say if he is not happy

Expanded Text: i cannot say if he is not happy

Modified Text:
iodine tin say if helium be unhappy


Regex

In [45]:
import re

# Sample input
text = input("Enter a sentence: ")

# Define regex patterns and their replacements
replacements = {
    r"\bdog\b": "cat",            # Replace the word 'dog' with 'cat'
    r"\brun(ning)?\b": "walk",    # Replace 'run' or 'running' with 'walk'
    r"\b[A-Z]{2,}\b": "ABBR",     # Replace all uppercase words (abbreviations) with 'ABBR'
    r"\d{4}": "YEAR"              # Replace any 4-digit number with 'YEAR'
}

# Apply all replacements
for pattern, repl in replacements.items():
    text = re.sub(pattern, repl, text)

# Output
print("\nModified text:")
print(text)


Enter a sentence: the dog is running

Modified text:
the cat is walk
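
Two details of the replacement table above are easy to miss: the rules run in dictionary insertion order, and `\d{4}` has no word boundaries, so it also fires inside longer numbers. A minimal sketch, using a made-up sample sentence and a hypothetical `apply_rules` helper, compares the year rule as written with a bounded variant (the `\b...\b` version is an assumed fix, not the original author's):

```python
import re

def apply_rules(text, rules):
    # Apply each (pattern, replacement) pair in order; later rules see earlier output.
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text

base_rules = [
    (r"\bdog\b", "cat"),
    (r"\brun(ning)?\b", "walk"),
    (r"\b[A-Z]{2,}\b", "ABBR"),
]
unbounded = base_rules + [(r"\d{4}", "YEAR")]      # rule as written in the cell above
bounded = base_rules + [(r"\b\d{4}\b", "YEAR")]    # assumed fix: standalone 4-digit numbers only

sample = "NASA sent the dog running in 1998, order 123456"
print(apply_rules(sample, unbounded))
print(apply_rules(sample, bounded))
```

With the unbounded rule, the first four digits of `123456` are rewritten as well; the bounded rule leaves that number alone. The rules are kept in a list so the order stays explicit: if the year rule ran before the abbreviation rule, the inserted `YEAR` would itself be replaced by `ABBR`.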


In [46]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = input("Enter some text:\n")
tokens = word_tokenize(text)

# Keep alphabetic tokens only, lowercased
words = [word.lower() for word in tokens if word.isalpha()]

# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

# Deduplicate and sort to form the word list
word_list_corpus = sorted(set(filtered_words))

print("\nWord List Corpus:")
print(word_list_corpus)


Enter some text:
the brown fox jumps over the crazy cat on his back

Word List Corpus:
['back', 'brown', 'cat', 'crazy', 'fox', 'jumps']


In [47]:
# Write a Python program to create a part-of-speech tagged word corpus after tokenizing a line of text,
# filtering out stopwords, performing lemmatization and then performing part-of-speech tagging.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Uncomment if running for the first time
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')

# Function to convert POS tag to WordNet format for lemmatization
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Input text
text = input("Enter a line of text:\n")

# Tokenization
tokens = word_tokenize(text)

# Remove stopwords and non-alphabetic tokens (build the stopword set once)
stop_words = set(stopwords.words('english'))
words = [w.lower() for w in tokens if w.isalpha() and w.lower() not in stop_words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(w, get_wordnet_pos(tag)) for w, tag in nltk.pos_tag(words)]

# POS Tagging
pos_tagged = nltk.pos_tag(lemmatized_words)

# Output the result
print("\nPOS-Tagged Word Corpus:")
for word, tag in pos_tagged:
    print(f"{word}: {tag}")


Enter a line of text:
he was playing with ball

POS-Tagged Word Corpus:
play: NN
ball: NN


In [48]:
# Write a Python program to tag proper names.

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

text = input("Enter a sentence: ")
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
named_entities = ne_chunk(pos_tags)
print("\nProper Names Tagged:")
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        print(f"{chunk.label()}: {' '.join(c[0] for c in chunk)}")


Enter a sentence: Barack was born in Hawaii

Proper Names Tagged:
PERSON: Barack
GPE: Hawaii


In [49]:
# Write a Python program to perform tagging using regular expressions.
import nltk
from nltk import word_tokenize
from nltk.tag import RegexpTagger

# Sample text
text = input("Enter a sentence: ")

# Tokenize the input
tokens = word_tokenize(text)

# Define regular expression patterns for tagging
patterns = [
    (r'^[A-Z][a-z]+$', 'NNP'),       # Proper noun (initial capital)
    (r'.*ing$', 'VBG'),              # Gerunds
    (r'.*ed$', 'VBD'),               # Past tense verbs
    (r'.*es$', 'VBZ'),               # 3rd person singular verbs
    (r'.*ould$', 'MD'),              # Modals
    (r'.*\'s$', 'POS'),              # Possessive nouns
    (r'.*s$', 'NNS'),                # Plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),# Cardinal numbers
    (r'.*', 'NN')                    # Default noun
]

# Create the RegexpTagger
regexp_tagger = RegexpTagger(patterns)

# Tag the tokens
tagged = regexp_tagger.tag(tokens)

# Output the result
print("\nTagged Output:")
for word, tag in tagged:
    print(f"{word}: {tag}")


Enter a sentence: manu is playing

Tagged Output:
manu: NN
is: NNS
playing: VBG
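
In the output above, `is` ends up as `NNS` because it matches no earlier rule and then hits the plural-suffix rule `.*s$`; `RegexpTagger` assigns the tag of the first pattern that matches. The sketch below mimics that first-match behavior with plain `re`. The `tag_tokens` helper and the extra closed-class rule are illustrative assumptions, not part of the cell above.

```python
import re

# Suffix rules copied from the cell above, in the same order.
patterns = [
    (r"^[A-Z][a-z]+$", "NNP"),
    (r".*ing$", "VBG"),
    (r".*ed$", "VBD"),
    (r".*es$", "VBZ"),
    (r".*ould$", "MD"),
    (r".*'s$", "POS"),
    (r".*s$", "NNS"),
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),
    (r".*", "NN"),
]

# Hypothetical extra rule: tag a few common 3rd-person verbs before any suffix rule fires.
closed_class_rule = (r"^(is|has|does)$", "VBZ")

def tag_tokens(tokens, rules):
    # First match wins: each token gets the tag of the first pattern it matches.
    tagged = []
    for token in tokens:
        for pattern, tag in rules:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
    return tagged

tokens = ["manu", "is", "playing"]
print(tag_tokens(tokens, patterns))
print(tag_tokens(tokens, [closed_class_rule] + patterns))
```

Putting specific rules ahead of broad suffix rules is the main lever in this kind of tagger; the catch-all `.*` rule has to stay last.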


In [5]:
import nltk
from nltk.corpus import treebank
from nltk.tag.sequential import ClassifierBasedTagger
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
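nltk.download('punkt_tab')
nltk.download('wordnet')
# nltk.download('treebank')  # uncomment if the Treebank sample is not installed yet

# Feature extractor: describes the token at `index` by its shape, affixes, and neighbors
def feature_detector(tokens, index, history):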
    word = tokens[index]
    features = {
        'word': word,
        'is_first': index == 0,
        'is_last': index == len(tokens) - 1,
        'is_capitalized': word[0].upper() == word[0],
        'is_all_caps': word.upper() == word,
        'is_all_lower': word.lower() == word,
        'prefix-1': word[0],
        'prefix-2': word[:2],
        'suffix-1': word[-1],
        'suffix-2': word[-2:],
        'prev_word': '' if index == 0 else tokens[index - 1],
        'next_word': '' if index == len(tokens) - 1 else tokens[index + 1],
    }
    return features

# Load training data
train_sents = treebank.tagged_sents()[:3000]

# Train the tagger
print("Training classifier-based tagger...")
tagger = ClassifierBasedTagger(train=train_sents, feature_detector=feature_detector)
print("Training complete!\n")

# Input and processing
text = input("Enter a sentence: ")
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]

# Tagging
tagged = tagger.tag(lemmatized)

# Output
print("\nPOS Tags:")
for word, tag in tagged:
    print(f"{word}: {tag}")


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


Training classifier-based tagger...
Training complete!

Enter a sentence: brown fox

POS Tags:
brown: JJ
fox: NN


In [9]:
# Write a Python program to create NER tagged word corpus.
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
# Input sentence
text = input("Enter a sentence: ")

# Tokenize the sentence
tokens = word_tokenize(text)

# POS tagging
pos_tags = pos_tag(tokens)

# Perform Named Entity Recognition
ner_tree = ne_chunk(pos_tags)
print(ner_tree)

# Display NER-tagged words
print("\nNER Tagged Word Corpus:")
for subtree in ner_tree:
    if hasattr(subtree, 'label'):
        # Named Entity (e.g., PERSON, ORGANIZATION)
        entity_name = " ".join([token for token, pos in subtree.leaves()])
        print(f"{entity_name}: {subtree.label()}")
    else:
        # Not a named entity
        word, tag = subtree
        print(f"{word}: {tag}")


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


Enter a sentence: Obama in Hawaii
(S (GPE Obama/NNP) in/IN (GPE Hawaii/NNP))

NER Tagged Word Corpus:
Obama: GPE
in: IN
Hawaii: GPE


In [53]:
# Write a Python program to rank the words in a document using TF-IDF.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample input: list of documents (you can replace or modify this)
documents = [
    "The movie was fantastic and full of emotions.",
    "The film had great direction and brilliant acting.",
    "What a terrible movie. I won't recommend it.",
    "A fantastic plot with superb performance."
]

# Create TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame with TF-IDF scores
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Rank words in each document by TF-IDF score
for idx, row in df.iterrows():
    print(f"\nDocument {idx + 1} Word Rankings:")
    sorted_row = row.sort_values(ascending=False)
    for word, score in sorted_row.items():
        if score > 0:
            print(f"{word}: {score:.4f}")



Document 1 Word Rankings:
emotions: 0.3926
of: 0.3926
full: 0.3926
was: 0.3926
movie: 0.3096
fantastic: 0.3096
and: 0.3096
the: 0.3096

Document 2 Word Rankings:
acting: 0.3716
brilliant: 0.3716
direction: 0.3716
film: 0.3716
great: 0.3716
had: 0.3716
and: 0.2929
the: 0.2929

Document 3 Word Rankings:
terrible: 0.4218
won: 0.4218
what: 0.4218
recommend: 0.4218
it: 0.4218
movie: 0.3325

Document 4 Word Rankings:
with: 0.4652
plot: 0.4652
performance: 0.4652
superb: 0.4652
fantastic: 0.3667
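
The scores above can be traced by hand. With its default settings, `TfidfVectorizer` lowercases the text, keeps tokens of two or more word characters, uses the smoothed inverse document frequency `idf = ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row. A minimal pure-Python sketch of that computation (the `tokenize` and `tfidf_row` helpers are hypothetical names) reproduces the Document 4 ranking:

```python
import math
import re
from collections import Counter

documents = [
    "The movie was fantastic and full of emotions.",
    "The film had great direction and brilliant acting.",
    "What a terrible movie. I won't recommend it.",
    "A fantastic plot with superb performance.",
]

def tokenize(doc):
    # Lowercase and keep tokens of at least two word characters,
    # mirroring the vectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(d) for d in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    # Raw term count times smoothed IDF, followed by L2 normalization of the row.
    counts = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

row = tfidf_row(tokenized[3])
for term, score in sorted(row.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.4f}")
```

Terms that occur in only one document ("plot", "superb") outrank "fantastic", which also appears in Document 1. Stop words such as "the" and "and" are ranked too, because the vectorizer does not remove them by default; passing `stop_words='english'` to `TfidfVectorizer` would filter them out.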
