
In [6]:
# NLTK text preprocessing pipeline

# a) Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from autocorrect import Speller  # third-party spelling corrector

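# Download required NLTK data (only needed once per environment).
# Newer NLTK releases (3.9+) may additionally need 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'.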
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# b) Load the text corpus
with open("file.txt", "r", encoding="utf-8") as f:
    text = f.read()

print("Original Text:\n", text)
print("\n" + "-"*60)

# c) Tokenization
tokens = word_tokenize(text)
print("\nFirst 30 Tokens:")
print(tokens[:30])
print("\n" + "-"*60)

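# d) Spelling correction
# autocorrect fixes each token in isolation, without sentence context, so some
# corrections are wrong (in the output below 'Ths' becomes 'The', not 'This').
spell = Speller(lang='en')
corrected_tokens = [spell(token) for token in tokens]

print("\nFirst 10 Corrected Tokens:")
print(corrected_tokens[:10])

# Joining on spaces is only a rough reconstruction: punctuation tokens end up
# separated from the word before them.
corrected_text = " ".join(corrected_tokens)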
print("\nCorrected Text Corpus:")
print(corrected_text)
print("\n" + "-"*60)

# e) POS Tagging
pos_tags = nltk.pos_tag(corrected_tokens)

print("\nPOS Tags for Corrected Tokens:")
print(pos_tags[:30])
print("\n" + "-"*60)

# f) Stopword Removal
stop_words = set(stopwords.words("english"))
filtered_tokens = [tok for tok in corrected_tokens if tok.lower() not in stop_words]

print("\nFirst 20 Tokens After Stopword Removal:")
print(filtered_tokens[:20])
print("\n" + "-"*60)

# g) Stemming and Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

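stemmed = [stemmer.stem(tok) for tok in filtered_tokens]
# lemmatize() treats every token as a noun unless a `pos` argument is passed,
# so verb forms such as 'running' are left unchanged here.
lemmatized = [lemmatizer.lemmatize(tok) for tok in filtered_tokens]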

print("\nFirst 20 Stemmed Tokens:")
print(stemmed[:20])

print("\nFirst 20 Lemmatized Tokens:")
print(lemmatized[:20])
print("\n" + "-"*60)

# h) Sentence Boundary Detection
# Run on the original text: the re-joined corrected text has altered spacing
# around punctuation.
sentences = sent_tokenize(text)
print("\nTotal Number of Sentences:", len(sentences))
print("\nSentences:")
print(sentences)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original Text:
 Ths is an exampel txt corpus. It has sme mistakes, bad formating and impropr sentnces!!
Natural languege processin is very importnt for AI, ML and data scince applcatons.

Tokeniztion helps in breakng down the text. Lemmatiztion and steming also hlps.
This txt contains multiple lines ,    weird spacings, and som wrong spellings.

the cat siting on the mat was lookng at the brd. Meanwhile the dog dog was runing behind the carrr.

how many sentnces are here?? maybe 3 Or 4 I am not sure. Let's see...!!


------------------------------------------------------------

First 30 Tokens:
['Ths', 'is', 'an', 'exampel', 'txt', 'corpus', '.', 'It', 'has', 'sme', 'mistakes', ',', 'bad', 'formating', 'and', 'impropr', 'sentnces', '!', '!', 'Natural', 'languege', 'processin', 'is', 'very', 'importnt', 'for', 'AI', ',', 'ML', 'and']

------------------------------------------------------------

First 10 Corrected Tokens:
['The', 'is', 'an', 'example', 'txt', 'corpus', '.', 'It', 'has',
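The corrector in step d) is applied to every token, including punctuation and acronyms such as 'AI' and 'ML'. A minimal sketch of a guard that only passes ordinary alphabetic tokens to the corrector is shown here. The helper name `correct_tokens` and the toy corrector are hypothetical and used for illustration only; in the cell above, the `spell` object would be passed in as `correct`.

```python
def correct_tokens(tokens, correct):
    """Apply `correct` only to alphabetic tokens that are not all-caps.

    `correct` is any callable mapping a word to its corrected form
    (assumption: in the notebook this would be the autocorrect `spell` object).
    """
    out = []
    for tok in tokens:
        if tok.isalpha() and not tok.isupper():
            out.append(correct(tok))
        else:
            # Punctuation, numbers, and acronyms are passed through untouched.
            out.append(tok)
    return out


# Stand-in corrector for the demo only; it is not the autocorrect library.
toy_fixes = {"exampel": "example", "sme": "some"}
demo = correct_tokens(
    ["AI", ",", "exampel", "sme", "ML", "!"],
    lambda word: toy_fixes.get(word, word),
)
print(demo)
```

Skipping all-caps tokens is a simple heuristic; it also skips single capital letters such as 'I', which is usually harmless.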

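Step g) lemmatizes every token with the default noun part of speech. One common refinement, sketched here under the assumption that tokens have been tagged with `nltk.pos_tag` (Penn Treebank tags), is to map each tag to the single-letter WordNet POS that `WordNetLemmatizer.lemmatize(word, pos)` accepts. The helper name `penn_to_wordnet` is hypothetical, and the tagged pairs are hand-written examples rather than real tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
tagged = [("running", "VBG"), ("better", "JJR"), ("cats", "NNS"), ("quickly", "RB")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

In the pipeline above this could be used as `lemmatizer.lemmatize(tok, penn_to_wordnet(tag))` for each `(tok, tag)` pair returned by `nltk.pos_tag(filtered_tokens)`.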
In [7]:
# a) Import the necessary packages
import re

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# b) Fetch the dataset and store in a DataFrame
newsgroups = fetch_20newsgroups(subset='all',
                                remove=('headers', 'footers', 'quotes'))

df = pd.DataFrame({
    "text": newsgroups.data,
    "target": newsgroups.target
})

print("Dataset Loaded!")
print(df.head())
print("\nNumber of documents:", len(df))

# c) Clean the text data
def clean_text(text):
    text = text.lower()                                  # Lowercase
    text = re.sub(r'[^\w\s]|_', ' ', text)               # Replace punctuation (and underscores, which \w matches) with spaces
    text = re.sub(r'\d+', ' ', text)                     # Replace digit runs with spaces
    text = re.sub(r'\s+', ' ', text).strip()             # Collapse whitespace
    return text

df["clean_text"] = df["text"].apply(clean_text)

print("\nCleaned Text Sample:")
print(df["clean_text"].iloc[0])

# d) Create a Bag-of-Words (BoW) model
bow_vectorizer = CountVectorizer(stop_words='english')
bow_matrix = bow_vectorizer.fit_transform(df["clean_text"])

print("\nBoW Shape:", bow_matrix.shape)

# Sum frequency of each word
bow_word_counts = bow_matrix.sum(axis=0).A1
bow_vocab = bow_vectorizer.get_feature_names_out()

bow_freq_df = pd.DataFrame({
    "word": bow_vocab,
    "count": bow_word_counts
}).sort_values(by="count", ascending=False)

print("\nTop 20 words (BoW):")
print(bow_freq_df.head(20))

# e) Create a TF-IDF model
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df["clean_text"])

print("\nTF-IDF Shape:", tfidf_matrix.shape)

# Sum TF-IDF scores for each word across all documents (a rough corpus-level weight)
tfidf_word_scores = tfidf_matrix.sum(axis=0).A1
tfidf_vocab = tfidf_vectorizer.get_feature_names_out()

tfidf_freq_df = pd.DataFrame({
    "word": tfidf_vocab,
    "score": tfidf_word_scores
}).sort_values(by="score", ascending=False)

print("\nTop 20 words (TF-IDF):")
print(tfidf_freq_df.head(20))

# f) Compare both models
print("\n================= COMPARISON =================")
print("\nTop 20 BoW Words:")
print(list(bow_freq_df.head(20)["word"]))

print("\nTop 20 TF-IDF Words:")
print(list(tfidf_freq_df.head(20)["word"]))


Dataset Loaded!
                                                text  target
0  \n\nI am sure some bashers of Pens fans are pr...      10
1  My brother is in the market for a high-perform...       3
2  \n\n\n\n\tFinally you said what you dream abou...      17
3  \nThink!\n\nIt's the SCSI card doing the DMA t...       3
4  1)    I have an old Jasmine drive which I cann...       4

Number of documents: 18846

Cleaned Text Sample:
i am sure some bashers of pens fans are pretty confused about the lack of any kind of posts about the recent pens massacre of the devils actually i am bit puzzled too and a bit relieved however i am going to put an end to non pittsburghers relief with a bit of praise for the pens man they are killing those devils worse than i thought jagr just showed you why he is much better than his regular season stats he is also a lot fo fun to watch in the playoffs bowman should let jagr have a lot of fun in the next couple of games since the pens are going to beat the pulp
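To make the BoW versus TF-IDF comparison in step f) concrete, here is a small pure-Python sketch of how a TF-IDF weight can be computed. It assumes the smoothed IDF formula and L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The tokenisation is a plain whitespace split with no stop-word removal, so the numbers will not match the notebook's vectorizer. The function name `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalise each document vector (guard against empty documents).
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


toy_docs = ["the cat sat", "the dog ran", "the cat ran fast"]
vocab, rows = tfidf_rows(toy_docs)
first_doc = dict(zip(vocab, rows[0]))
for word in ("the", "cat", "sat"):
    print(word, round(first_doc[word], 3))
```

In the first toy document every word occurs once, so a raw count treats them equally, while TF-IDF gives 'the' (present in all three documents) a lower weight than 'cat' (two documents) and 'sat' (one document).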