# **Part 1: Text Preprocessing**



**Data:** The corpus is defined in the code cell below.

**1.1: Preprocessing From Scratch**

**Goal:** Write a function clean_text_scratch(text) that performs the following steps without using NLTK or spaCy:

1. Lowercasing: Convert text to lowercase.

2. Punctuation Removal: Use Python's re (regex) library or string methods to remove special characters (!, ., ,, :, ;, ..., ').

3. Tokenization: Split the string into a list of words based on whitespace.

4. Stopword Removal: Filter out words found in this list: ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or'].

5. Simple Stemming: Create a helper function that removes suffixes 'ing', 'ly', 'ed', and 's' from the end of words.


Note: This is a "Naive" stemmer. It will break words like "sing" -> "s". This illustrates why we need libraries!

**Task:** Run this function on the first sentence of the corpus and print the result.

In [None]:
corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it."
]

In [13]:
import re

stop_words = ['the', 'is', 'in', 'to', 'of', 'and', 'a', 'it', 'was', 'but', 'or']

def simple_stemmer(word):
    if word.endswith('ing'):
        return word[:-3]
    elif word.endswith('ly'):
        return word[:-2]
    elif word.endswith('ed'):
        return word[:-2]
    elif word.endswith('s'):
        return word[:-1]
    return word

def clean_text_scratch(text):
    # 1. Lowercasing
    text = text.lower()

    # 2. Punctuation removal (straight and curly apostrophes included)
    text = re.sub(r"[!.,;:'\[\]‘’\-]", '', text)

    # 3. Tokenization on whitespace
    tokens = text.split()

    # 4. Stopword removal
    tokens = [word for word in tokens if word not in stop_words]

    # 5. Stemming
    tokens = [simple_stemmer(word) for word in tokens]

    return tokens

# Run on the first sentence of the corpus
sentence1 = corpus[0]
cleaned_sentence1 = clean_text_scratch(sentence1)
print(cleaned_sentence1)


['artificial', 'intelligence', 'transform', 'world', 'however', 'ethical', 'concern', 'remain']
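
The over-stemming mentioned in the note above is easy to reproduce. The sketch below repeats the naive suffix rule and compares it with a guarded variant that refuses to strip a suffix when fewer than three characters would remain. The names naive_stem and guarded_stem and the min_stem_len threshold are illustrative assumptions, not part of the assignment.

```python
SUFFIXES = ("ing", "ly", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, with no safeguards."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def guarded_stem(word, min_stem_len=3):
    """Strip a suffix only if at least min_stem_len characters remain.

    min_stem_len is an arbitrary illustrative threshold.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["sing", "transforming", "concerns", "was"]:
    print(w, "->", naive_stem(w), "|", guarded_stem(w))
```

Even the guarded version is still a heuristic: it would, for example, turn "family" into "fami", which is why dictionary-based lemmatizers are used in the next section.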


**1.2: Preprocessing Using Tools**

**Goal:** Use the nltk library to perform the same cleaning on the entire corpus.

**Steps:**

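1. Use nltk.tokenize.word_tokenize.
2. Use nltk.corpus.stopwords.
3. Use nltk.stem.WordNetLemmatizer to convert words to their root (e.g., "jumps" $\to$ "jump", "transforming" $\to$ "transform"). Note that lemmatize() treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as "transforming" are only reduced when it is called with pos='v'.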


**Task:** Print the cleaned, lemmatized tokens for the second sentence (The pizza review).

In [14]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean_text_nltk(text):
    # Lowercasing
    text = text.lower()

    # Tokenization
    tokens = word_tokenize(text)

    # Punctuation removal: keep purely alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]

    # Stopword removal
    stop_words_nltk = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words_nltk]

    # Lemmatization (the default part of speech is noun)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

# Get the second sentence from the corpus
sentence2 = corpus[1]

# Clean the second sentence
cleaned_sentence2 = clean_text_nltk(sentence2)

print(cleaned_sentence2)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['pizza', 'absolutely', 'delicious', 'service', 'terrible', 'wo', 'go', 'back']
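
The token 'wo' in the output comes from the contraction "won't": word_tokenize splits it into "wo" and "n't", and the isalpha() filter then drops "n't". One possible workaround, sketched below, is to expand contractions before tokenizing. The contraction map is a small hand-written assumption for illustration and is far from exhaustive.

```python
# Hypothetical, deliberately tiny contraction map; order matters because
# specific entries must be applied before the generic "n't" fallback.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "'tis": "it is",
    "n't": " not",  # generic fallback, e.g. "isn't" -> "is not"
}


def expand_contractions(text):
    """Lowercase the text and expand the contractions listed above."""
    text = text.lower().replace("’", "'")  # normalize curly apostrophes
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    return text


print(expand_contractions("I won't go back."))
print(expand_contractions("Whether 'tis nobler in the mind."))
```

Note that "not" is in NLTK's English stopword list, so the negation would still be dropped by the stopword step unless that list is adjusted.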


# **Part 2: Text Representation**

**2.1: Bag of Words (BoW)**

**Logic:**

**Build Vocabulary:** Create a list of all unique words in the entire corpus (after cleaning). Sort them alphabetically.

**Vectorize:** Write a function that takes a sentence and returns a list of numbers. Each number represents the count of a vocabulary word in that sentence.

**Task:** Print the unique Vocabulary list. Then, print the BoW vector for: "The quick brown fox jumps over the lazy dog."

In [17]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def clean_text_nltk(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha()]

    stop_words_nltk = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words_nltk]

    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

# Clean whole corpus
cleaned_corpus = []
for sentence in corpus:
    cleaned_corpus.extend(clean_text_nltk(sentence))

# Build the vocabulary list
vocab = sorted(set(cleaned_corpus))
print("Unique VocabList:", vocab)

# Function to vectorize a sentence
def vectorize_bow(sentence, vocab):
    sentence_tokens = clean_text_nltk(sentence)

    vector = [0] * len(vocab)

    for token in sentence_tokens:
        if token in vocab:
            vector[vocab.index(token)] += 1
    return vector

# The sentence to vectorize
sentence_to_vectorize = "The quick brown fox jumps over the lazy dog."
bow_vector = vectorize_bow(sentence_to_vectorize, vocab)
print(f"\nBoW vector for '{sentence_to_vectorize}': {bow_vector}")


Unique VocabList: ['absolutely', 'algebra', 'artificial', 'back', 'behind', 'brown', 'concern', 'data', 'delicious', 'dog', 'ethical', 'fox', 'go', 'hate', 'however', 'intelligence', 'involves', 'jump', 'lazy', 'learning', 'linear', 'love', 'machine', 'math', 'mind', 'nobler', 'pizza', 'question', 'quick', 'remain', 'science', 'service', 'statistic', 'terrible', 'transforming', 'whether', 'wo', 'world']

BoW vector for 'The quick brown fox jumps over the lazy dog.': [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
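
The vectorizer above calls vocab.index(token) for every token, which scans the list each time. For larger vocabularies, a word-to-position dictionary avoids that scan. The following is a minimal sketch that works on already-cleaned tokens; the helper names build_vocab and bow_vector and the two sample token lists are assumptions for illustration.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Return the sorted list of unique tokens across all documents."""
    return sorted({tok for doc in tokenized_docs for tok in doc})


def bow_vector(tokens, vocab_index):
    """Count tokens and place each count at the token's vocabulary position."""
    vector = [0] * len(vocab_index)
    for tok, count in Counter(tokens).items():
        idx = vocab_index.get(tok)  # None for out-of-vocabulary tokens
        if idx is not None:
            vector[idx] = count
    return vector


# Two already-cleaned sample documents (illustrative input)
docs = [
    ["quick", "brown", "fox", "jump", "lazy", "dog"],
    ["love", "machine", "learning", "hate", "math", "behind"],
]
vocab = build_vocab(docs)
vocab_index = {word: i for i, word in enumerate(vocab)}

vector = bow_vector(docs[0], vocab_index)
print(vocab)
print(vector)
```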


**2.2: BoW Using Tools**

**Task:** Use sklearn.feature_extraction.text.CountVectorizer.

**Steps:**

1. Instantiate the vectorizer.

2. fit_transform the raw corpus.

3. Convert the result to an array (.toarray()) and print it.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate vectorizer
vectorizer = CountVectorizer()

# Fit and transform the raw corpus
bow_matrix = vectorizer.fit_transform(corpus)

# Convert to array
print(bow_matrix.toarray())




[[0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 1 0 1 0 2 0 0 0 2 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
  0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0
  1 0 0 0 0 0 0 1 2 1 2 0 0 1 0 0]
 [0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
  0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]
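
The matrix above has 52 columns, compared with 38 entries in the hand-built vocabulary, because CountVectorizer by default neither removes stopwords nor lemmatizes. The sketch below approximates its default behaviour in plain Python (lowercasing plus the token pattern `(?u)\b\w\w+\b`, which keeps tokens of two or more word characters) to show which word each column stands for. It is an illustration under those assumptions, not a replacement for the library; on the fitted vectorizer, get_feature_names_out() should give the same column labels.

```python
import re
from collections import Counter

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]

# Approximation of CountVectorizer's default tokenization
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


# Columns are the sorted unique tokens of the whole corpus
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row of counts per document (Counter returns 0 for missing words)
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocab])

print("Number of columns:", len(vocab))
print("Words counted twice in the pizza review:",
      [word for word, count in zip(vocab, matrix[1]) if count == 2])
```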


**2.3: TF-IDF From Scratch (The Math)**

**Goal:** Manually calculate the score for the word "machine" in the last sentence:

"I love machine learning, but I hate the math behind it."

**Formula:**

*TF (Term Frequency):* $\frac{\text{Count of 'machine' in sentence}}{\text{Total words in sentence}}$

*IDF (Inverse Document Frequency):* $\log(\frac{\text{Total number of documents}}{\text{Number of documents containing 'machine'}})$ (Use math.log).

**Result:** TF * IDF.

**Task:** Print your manual calculation result.

In [None]:
import math
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


def clean_text_nltk(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha()]

    stop_words_nltk = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words_nltk]

    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

# Target word and sentence
target_word = "machine"
sentence_index = 5
target_sentence = corpus[sentence_index]

print(f"Target word: '{target_word}'")
print(f"Target sentence: '{target_sentence}'")

# 1. TF calculation
cleaned_target_sentence = clean_text_nltk(target_sentence)
print(f"Cleaned tokens: {cleaned_target_sentence}")

target_word_count = cleaned_target_sentence.count(target_word)
total_words = len(cleaned_target_sentence)

# Guard against an empty sentence after cleaning
tf = target_word_count / total_words if total_words > 0 else 0.0

print(f"Count of '{target_word}': {target_word_count}")
print(f"Words in cleaned sentence: {total_words}")
print(f"TF for '{target_word}': {tf:.4f}")

# 2. IDF calculation
total_docs = len(corpus)

# Number of documents whose cleaned tokens contain the target word
docs_containing_word = sum(
    1 for doc_text in corpus if target_word in clean_text_nltk(doc_text)
)

print(f"Total documents: {total_docs}")
print(f"Number of documents containing '{target_word}': {docs_containing_word}")

# Guard against a word that appears in no document
if docs_containing_word > 0:
    idf = math.log(total_docs / docs_containing_word)
else:
    idf = 0.0

print(f"IDF for '{target_word}': {idf:.4f}")

# 3. TF-IDF calculation
tf_idf = tf * idf

print(f"TF-IDF for '{target_word}' in sentence: {tf_idf:.4f}")
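
As a sanity check on the calculation above, the same numbers can be worked out without NLTK. The sketch below hard-codes the cleaned token lists, reconstructed from the vocabulary and outputs shown earlier (an assumption; rerun clean_text_nltk to confirm them), and applies the TF and IDF formulas from this section.

```python
import math

# Cleaned tokens per document, reconstructed from the earlier outputs
# (assumed to match what clean_text_nltk returns for each sentence).
cleaned_docs = [
    ["artificial", "intelligence", "transforming", "world", "however",
     "ethical", "concern", "remain"],
    ["pizza", "absolutely", "delicious", "service", "terrible", "wo", "go", "back"],
    ["quick", "brown", "fox", "jump", "lazy", "dog"],
    ["question", "whether", "nobler", "mind"],
    ["data", "science", "involves", "statistic", "linear", "algebra",
     "machine", "learning"],
    ["love", "machine", "learning", "hate", "math", "behind"],
]

target_word = "machine"
target_doc = cleaned_docs[5]

# TF: occurrences of the word divided by the number of tokens in the sentence
tf = target_doc.count(target_word) / len(target_doc)

# IDF: natural log of (total documents / documents containing the word)
docs_with_word = sum(1 for doc in cleaned_docs if target_word in doc)
idf = math.log(len(cleaned_docs) / docs_with_word)

tf_idf = tf * idf
print(f"TF = {tf:.4f}, IDF = {idf:.4f}, TF-IDF = {tf_idf:.4f}")
```

Under these assumptions TF is 1/6, IDF is ln(6/2) = ln 3, and the product is roughly 0.1831.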


**2.4: TF-IDF Using Tools**

**Task:** Use sklearn.feature_extraction.text.TfidfVectorizer.

**Steps:** Fit it on the corpus and print the vector for the first sentence.

**Observation:** Compare the score of unique words (like "Intelligence") vs common words (like "is"). Which is higher?

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer
vectorizer = TfidfVectorizer()

# Fit it on the corpus
tfidf_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Get the TF-IDF vector for the first sentence (corpus[0])
# .toarray() converts the sparse matrix row to a dense NumPy array
first_sentence_tfidf_vector = tfidf_matrix[0].toarray()[0]

print("TF-IDF Vector for the first sentence:")

# Map each word with a non-zero score to that score
word_tfidf_scores = {}
for word_idx, score in enumerate(first_sentence_tfidf_vector):
    if score > 0:
        word_tfidf_scores[feature_names[word_idx]] = score

# Sort by score for easier comparison
sorted_scores = sorted(word_tfidf_scores.items(), key=lambda x: x[1], reverse=True)
for word, score in sorted_scores:
    print(f"  {word}: {score:.4f}")


TF-IDF Vector for the first sentence:
  artificial: 0.3345
  concerns: 0.3345
  ethical: 0.3345
  however: 0.3345
  intelligence: 0.3345
  remain: 0.3345
  transforming: 0.3345
  world: 0.3345
  is: 0.2743
  the: 0.1714
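
These scores do not follow the textbook formula from 2.3 exactly. With its default settings, TfidfVectorizer uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below recomputes the first sentence's scores in plain Python under those defaults, approximating the default tokenization with the pattern `(?u)\b\w\w+\b`. It is meant as a check on the numbers above, not as a reimplementation of the library.

```python
import math
import re
from collections import Counter

corpus = [
    "Artificial Intelligence is transforming the world; however, ethical concerns remain!",
    "The pizza was absolutely delicious, but the service was terrible ... I won't go back.",
    "The quick brown fox jumps over the lazy dog.",
    "To be, or not to be, that is the question: Whether 'tis nobler in the mind.",
    "Data science involves statistics, linear algebra, and machine learning.",
    "I love machine learning, but I hate the math behind it.",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)

# Document frequency: number of documents each word appears in
df = Counter(tok for doc in docs for tok in set(doc))


def tfidf_row(tokens):
    """Smoothed TF-IDF for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


scores = tfidf_row(docs[0])
for word in ("intelligence", "is", "the"):
    print(f"{word}: {scores[word]:.4f}")
```

Words that occur in a single document ("intelligence") end up with a higher score than words shared across documents ("is", "the"), which answers the observation question.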


# **Part 3: Word Embeddings**

**3.1: Word2Vec Using Tools**

**Task:** Train a model using gensim.models.Word2Vec.

**Steps:**

1. Pass your cleaned tokenized corpus (from Part 1.2) to Word2Vec.

2. Set min_count=1 (since our corpus is small, we want to keep all words).

3. Set vector_size=10 (small vector size for easy viewing).

**Experiment:** Print the vector for the word "learning".

In [42]:
!pip install gensim
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build the cleaned, tokenized corpus for Word2Vec
tokenized_corpus_for_word2vec = []
for sentence in corpus:
    tokenized_corpus_for_word2vec.append(clean_text_nltk(sentence))

# Train the model
model = Word2Vec(sentences=tokenized_corpus_for_word2vec, vector_size=10, min_count=1, workers=4)

# Experiment: print the vector for the word "learning"
word = "learning"

if word in model.wv:
    print(f"Vector of '{word}':")
    print(model.wv[word])
else:
    print(f"'{word}' not found.")


Vector of 'learning':
[-0.00535678  0.00238785  0.05107836  0.09016657 -0.09301379 -0.07113771
  0.06464887  0.08973394 -0.05023384 -0.03767424]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**3.3: Pre-trained GloVe (Understanding Global Context)**

**Task:** Use gensim.downloader to load 'glove-wiki-gigaword-50'

**Analogy Task:** Compute the famous analogy: $\text{King} - \text{Man} + \text{Woman} = ?$

Use model.most_similar(positive=['woman', 'king'], negative=['man']).

**Question:** Does the model correctly guess "Queen"?

In [38]:
import gensim.downloader as api

# Load the pre-trained GloVe vectors
glove_model = api.load('glove-wiki-gigaword-50')

# Analogy task: King - Man + Woman = ?
analogy_result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'])

print("Analogy: King - Man + Woman = ?")
print(analogy_result)

# Check whether the model ranks "queen" first
if analogy_result and analogy_result[0][0].lower() == 'queen':
    print("\nObservation: Yes, the model correctly guesses 'Queen' as the top result.")
elif analogy_result:
    print("\nObservation: The model did not guess 'Queen' as the top result. The top result is: " + analogy_result[0][0])
else:
    print("\nObservation: The model returned no result.")


Analogy: King - Man + Woman = ?
[('queen', 0.8523604273796082), ('throne', 0.7664334177970886), ('prince', 0.7592144012451172), ('daughter', 0.7473883628845215), ('elizabeth', 0.7460219860076904), ('princess', 0.7424570322036743), ('kingdom', 0.7337412238121033), ('monarch', 0.721449077129364), ('eldest', 0.7184861898422241), ('widow', 0.7099431157112122)]

Observation: Yes, the model correctly guesses 'Queen' as the top result.
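
To see what the analogy query does conceptually, the sketch below runs the same vector arithmetic on hand-made three-dimensional toy vectors. The vectors and the analogy helper are hypothetical and only mimic the general idea (add the positive words, subtract the negative ones, rank the remaining words by cosine similarity); they are not GloVe values and not gensim's implementation.

```python
import math

# Hypothetical toy vectors with axes (royalty, maleness, femaleness)
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.7, 1.0, 0.0],
    "apple": [0.0, 0.1, 0.1],
}


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(positive, negative, vectors, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(target, vectors[w]), reverse=True)
    return [(w, cosine(target, vectors[w])) for w in ranked[:topn]]


result = analogy(["woman", "king"], ["man"], toy_vectors)
print(result)
```

With these toy vectors "queen" ranks first because it shares the royalty and femaleness components of the target vector.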


# **Part 5: Sentiment Analysis (The Application)**

**Concept:** Sentiment Analysis determines whether a piece of text is Positive, Negative, or Neutral. We will use VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK. VADER is specifically designed for social media text; it understands that capital letters ("LOVE"), punctuation ("!!!"), and emojis change the sentiment intensity.

**Task:**

1. Initialize the SentimentIntensityAnalyzer.

2. Pass the Pizza Review (corpus[1]) into the analyzer.

3. Pass the Math Complaint (corpus[5]) into the analyzer.

**Analysis:** Look at the compound score for both.

**Compound Score Range:** -1 (Most Negative) to +1 (Most Positive).

Does the model correctly identify that "delicious" and "terrible" in the same sentence result in a mixed or neutral score?

In [47]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon
nltk.download('vader_lexicon')

# Initialize the SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Pizza review (corpus[1])
pizza_review = corpus[1]
print(f"Pizza review: '{pizza_review}'")

pizza_sent = sia.polarity_scores(pizza_review)
print(f"Sentiment for pizza review: {pizza_sent}")

# Math complaint (corpus[5])
math_complaint = corpus[5]
print(f"Math complaint: '{math_complaint}'")

math_sent = sia.polarity_scores(math_complaint)
print(f"Sentiment for math complaint: {math_sent}")

print(f"Pizza review compound score: {pizza_sent['compound']:.2f}")
print(f"Math complaint compound score: {math_sent['compound']:.2f}")

# A compound score within [-0.05, 0.05] is treated as mixed/neutral
if -0.05 <= pizza_sent['compound'] <= 0.05:
    print("Observation: The model scores the pizza review as mixed/neutral.")
else:
    print("Observation: The model does not score the pizza review as mixed/neutral.")


Pizza review: 'The pizza was absolutely delicious, but the service was terrible ... I won't go back.'
Sentiment for pizza review: {'neg': 0.223, 'neu': 0.644, 'pos': 0.134, 'compound': -0.3926}
Math complaint: 'I love machine learning, but I hate the math behind it.'
Sentiment for math complaint: {'neg': 0.345, 'neu': 0.478, 'pos': 0.177, 'compound': -0.5346}
Pizza review compound score: -0.39
Math complaint compound score: -0.53
Observation: The model does not score the pizza review as mixed/neutral.


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
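
The ±0.05 band used in the check above follows the thresholds commonly recommended for VADER's compound score: 0.05 or higher is positive, -0.05 or lower is negative, and anything in between is neutral. A minimal sketch of that rule, applied to the two compound scores printed above (the helper name label_from_compound is an assumption for illustration):

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the output above
for name, score in [("pizza review", -0.3926), ("math complaint", -0.5346)]:
    print(f"{name}: {score:+.4f} -> {label_from_compound(score)}")
```

Both sentences come out negative rather than mixed. A likely reason is that VADER's rules give more weight to the clause after "but", so "terrible" and "hate" outweigh "delicious" and "love".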
