# 🧑‍🏫 Lab Session: Text Representation in NLP (Bag of Words & TF-IDF)

## 🎯 Learning Objectives
- Understand text representation in NLP.  
- Extract unique words (vocabulary) from text.  
- Implement **Bag of Words (BoW)** manually in Python.  
- Implement **TF-IDF** manually in Python.  
- Compare the manual implementations with the library versions (`sklearn`).  

---

## 🔹 Part 1: Working with Text in Python




In [2]:
# Sample documents
docs = [
    "I love NLP",
    "I love Machine Learning"
]

# Convert to lowercase and split
for doc in docs:
    words = doc.lower().split()
    print(words)

['i', 'love', 'nlp']
['i', 'love', 'machine', 'learning']
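Note that `split()` only splits on whitespace, so a sentence such as "I love NLP!" would produce the token `nlp!`. The following is a minimal sketch of a slightly more robust tokenizer. The helper name `tokenize` and the punctuated sample sentences are illustrative assumptions, and the pattern keeps only ASCII letters and digits, which would not be enough for accented or non-Latin text.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical sample sentences with punctuation added
punctuated_docs = ["I love NLP!", "I love Machine Learning."]
for doc in punctuated_docs:
    print(tokenize(doc))
```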


## 🔹 Part 2: Finding Unique Words (Vocabulary)

In [4]:
all_words = []
for doc in docs:
    words = doc.lower().split()
    all_words.extend(words)

# Unique words
vocab = sorted(set(all_words))
print("Vocabulary:", vocab)


Vocabulary: ['i', 'learning', 'love', 'machine', 'nlp']
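Each position in the sorted vocabulary becomes one column of the matrices built in the later parts. As a small sketch, the vocabulary can also be stored as a word-to-column lookup; the name `word_to_index` is an assumption introduced here for illustration.

```python
docs = ["I love NLP", "I love Machine Learning"]

# Collect every lowercase token, then keep the sorted unique words
all_words = [word for doc in docs for word in doc.lower().split()]
vocab = sorted(set(all_words))

# Map each word to its column position in the vocabulary
word_to_index = {word: index for index, word in enumerate(vocab)}
print(word_to_index)
```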


## 🔹 Part 3: Bag of Words (BoW)

In [5]:
# Initialize BoW matrix
bow_matrix = []

for doc in docs:
    words = doc.lower().split()
    row = []
    for word in vocab:
        row.append(words.count(word))  # count frequency
    bow_matrix.append(row)

print("BoW Matrix:")
for row in bow_matrix:
    print(row)


BoW Matrix:
[1, 0, 1, 0, 1]
[1, 1, 1, 1, 0]
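Calling `words.count(word)` rescans the document once per vocabulary word. An alternative sketch, assuming the same two sample documents, counts each document once with `collections.Counter` and then looks up each vocabulary word. It should produce the same matrix as the loop above.

```python
from collections import Counter

docs = ["I love NLP", "I love Machine Learning"]
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})

bow_matrix = []
for words in tokenized:
    counts = Counter(words)  # one pass over the document
    # A Counter returns 0 for words that are not in the document
    bow_matrix.append([counts[word] for word in vocab])

for row in bow_matrix:
    print(row)
```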


## 🔹 Part 4: TF-IDF (Manual Implementation)

### Step 1: Term Frequency (TF)

In [6]:
tf_matrix = []

for doc in docs:
    words = doc.lower().split()
    row = []
    for word in vocab:
        row.append(words.count(word) / len(words))  # normalized frequency
    tf_matrix.append(row)

print("TF Matrix:")
for row in tf_matrix:
    print(row)

TF Matrix:
[0.3333333333333333, 0.0, 0.3333333333333333, 0.0, 0.3333333333333333]
[0.25, 0.25, 0.25, 0.25, 0.0]


### Step 2: Inverse Document Frequency (IDF)

In [7]:
import math

N = len(docs)
idf = []

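for word in vocab:
    # Document frequency: number of documents that contain the word
    count = sum(1 for doc in docs if word in doc.lower().split())
    # The +1 avoids division by zero, but it makes the IDF zero or negative
    # for words that appear in most or all documents (see 'i' and 'love')
    idf_value = math.log(N / (1 + count))
    idf.append(round(idf_value, 3))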

print("IDF Values:", dict(zip(vocab, idf)))

IDF Values: {'i': -0.405, 'learning': 0.0, 'love': -0.405, 'machine': 0.0, 'nlp': 0.0}
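The negative values for `i` and `love` come from the `1 + count` term in the denominator. For comparison, the sketch below computes the lab's IDF next to the smoothed variant that scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`, namely `ln((1 + N) / (1 + df)) + 1`, which never drops below 1. The names `idf_lab` and `idf_smooth` are assumptions introduced for this comparison.

```python
import math

docs = ["I love NLP", "I love Machine Learning"]
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})
n_docs = len(docs)

idf_lab = {}     # formula used in this lab: ln(N / (1 + df))
idf_smooth = {}  # smoothed variant: ln((1 + N) / (1 + df)) + 1
for word in vocab:
    df = sum(1 for words in tokenized if word in words)  # document frequency
    idf_lab[word] = math.log(n_docs / (1 + df))
    idf_smooth[word] = math.log((1 + n_docs) / (1 + df)) + 1

for word in vocab:
    print(f"{word:10s} lab={idf_lab[word]:.3f}  smooth={idf_smooth[word]:.3f}")
```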


### Step 3: TF-IDF = TF × IDF

In [8]:
tfidf_matrix = []

for row in tf_matrix:
    tfidf_row = []
    for i in range(len(vocab)):
        tfidf_row.append(round(row[i] * idf[i], 3))
    tfidf_matrix.append(tfidf_row)

print("TF-IDF Matrix:")
for row in tfidf_matrix:
    print(row)

TF-IDF Matrix:
[-0.135, 0.0, -0.135, 0.0, 0.0]
[-0.101, 0.0, -0.101, 0.0, 0.0]
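Written as formulas, the three steps of this part compute the following. The symbols are notation introduced here, not part of the code: $t$ is a word, $d$ a document, $|d|$ the number of tokens in $d$, $N$ the number of documents, and $\mathrm{df}(t)$ the number of documents containing $t$.

$$
\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{1 + \mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

As a worked check for the word `i` in the first document:

$$
\mathrm{tfidf}(\text{i}, d_1) = \frac{1}{3} \cdot \ln\frac{2}{1 + 2} \approx 0.333 \times (-0.405) \approx -0.135
$$

This matches the first entry of the TF-IDF matrix above.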


## 🔹 Part 5: Using Libraries (Preview)

In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag of Words
cv = CountVectorizer()
print("BoW (sklearn):")
print(cv.fit_transform(docs).toarray())
print("Vocabulary:", cv.get_feature_names_out())

# TF-IDF
tv = TfidfVectorizer()
print("\nTF-IDF (sklearn):")
print(tv.fit_transform(docs).toarray())
print("Vocabulary:", tv.get_feature_names_out())

BoW (sklearn):
[[0 1 0 1]
 [1 1 1 0]]
Vocabulary: ['learning' 'love' 'machine' 'nlp']

TF-IDF (sklearn):
[[0.         0.57973867 0.         0.81480247]
 [0.6316672  0.44943642 0.6316672  0.        ]]
Vocabulary: ['learning' 'love' 'machine' 'nlp']
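The sklearn numbers differ from the manual ones for three reasons, all defaults of the vectorizers: the default token pattern keeps only tokens of two or more characters (so `i` is dropped), the IDF is the smoothed form `ln((1 + N) / (1 + df)) + 1`, and each row is L2-normalized. The sketch below tries to reproduce the sklearn TF-IDF output in plain Python under those assumptions; the helper `tokenize` and the variable names are illustrative, not part of sklearn.

```python
import math
import re

docs = ["I love NLP", "I love Machine Learning"]

def tokenize(text):
    """Mimic sklearn's default token pattern: words of 2+ characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]
vocab = sorted({word for words in tokenized for word in words})
n_docs = len(docs)

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {}
for word in vocab:
    df = sum(1 for words in tokenized if word in words)
    idf[word] = math.log((1 + n_docs) / (1 + df)) + 1

tfidf = []
for words in tokenized:
    # Raw counts are used as TF here (no division by document length)
    row = [words.count(word) * idf[word] for word in vocab]
    # L2-normalize the row so its Euclidean length is 1
    norm = math.sqrt(sum(value * value for value in row))
    tfidf.append([value / norm for value in row])

print("Vocabulary:", vocab)
for row in tfidf:
    print([round(value, 4) for value in row])
```

For these two documents the printed rows should agree with the sklearn matrix above to four decimal places.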


## 📝 Lab Exercise

### Use these 3 sentences:

- "Data Science is fun"
- "I love Data Science"
- "I love Python"

👉 Do the following:

1. Extract the vocabulary.
2. Build BoW manually.
3. Compute TF manually.
4. Compute TF-IDF manually.
5. Compare results with sklearn's `CountVectorizer` and `TfidfVectorizer`.