<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Word counts with bag-of-words</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Introduction to Natural Language Processing in Python)</span></div>

## Table of Contents

1. [Bag-of-words Concept](#section-1)
2. [Bag-of-words in Python](#section-2)
3. [Simple Text Preprocessing](#section-3)
4. [Text Preprocessing with Python](#section-4)
5. [Introduction to Gensim](#section-5)
6. [Creating a Gensim Corpus](#section-6)
7. [Tf-idf with Gensim](#section-7)
8. [Conclusion](#section-8)

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Bag-of-words Concept</span><br>

### What is Bag-of-words?
The **Bag-of-words (BoW)** model is a fundamental method in Natural Language Processing (NLP) for finding topics within a text. It simplifies text data by disregarding grammar and word order, focusing entirely on **word frequency**.

**Key Principles:**
*   **Tokenization:** You must first break the text into individual units called tokens (words).
*   **Counting:** You count the occurrence of every token.
*   **Frequency = Importance:** The underlying assumption is that the more frequent a word is, the more significant it is to the meaning of the text.

### Example
Consider the following text:
> "The cat is in the box. The cat likes the box. The box is over the cat."

If we strip punctuation and count the words case-sensitively (so "The" and "the" are separate tokens), we get the following distribution:

| Word | Count |
| :--- | :--- |
| "The" | 3 |
| "the" | 3 |
| "box" | 3 |
| "cat" | 3 |
| "is" | 2 |
| "in" | 1 |
| "likes" | 1 |
| "over" | 1 |

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> In a raw Bag-of-words model, the sentence structure is lost. "The cat eats the fish" and "The fish eats the cat" are represented identically, because both contain exactly the same words the same number of times. </div>
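
The loss of word order can be checked directly. The sketch below is a minimal illustration using only the standard library; the helper name `bag_of_words` and the whitespace-based tokenization are assumptions made for brevity, not part of the original material.

```python
from collections import Counter

def bag_of_words(sentence):
    """Hypothetical helper: lowercase, split on whitespace, count tokens."""
    return Counter(sentence.lower().split())

bow_a = bag_of_words("The cat eats the fish")
bow_b = bag_of_words("The fish eats the cat")

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations compare equal.
print(bow_a)
print(bow_a == bow_b)
```

Any model built only on these counts cannot tell which animal is doing the eating.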

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Bag-of-words in Python</span><br>

To implement this in Python, we typically use the `nltk` library for tokenization and the `Counter` class from the standard-library `collections` module to handle the counting.

### Original Code (From Document)


In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

# Note: The input string contains newlines in the original document
Counter(word_tokenize("""The cat is in the box. The cat likes the box. 
The box is over the cat."""))

# To find the most common words
# counter.most_common(2)


Counter({'The': 3,
         'cat': 3,
         'the': 3,
         'box': 3,
         '.': 3,
         'is': 2,
         'in': 1,
         'likes': 1,
         'over': 1})


### Enhanced Executable Code
Below is a complete, runnable example. We ensure the necessary NLTK data is downloaded and print the results clearly.



In [2]:
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

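# Ensure the tokenizer data is downloaded.
# Assumption: recent NLTK releases look for 'punkt_tab' rather than 'punkt',
# so both are requested here.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)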

# Input text
text = """The cat is in the box. The cat likes the box. 
The box is over the cat."""

# 1. Tokenize the text
tokens = word_tokenize(text)

# 2. Create the Bag-of-words counter
counter = Counter(tokens)

# Display the full counter
print("Full Counter Object:")
print(counter)

# 3. Retrieve the top 2 most common words
print("\nTop 2 most common words:")
print(counter.most_common(2))


Full Counter Object:
Counter({'The': 3, 'cat': 3, 'the': 3, 'box': 3, '.': 3, 'is': 2, 'in': 1, 'likes': 1, 'over': 1})

Top 2 most common words:
[('The', 3), ('cat', 3)]



**Explanation:**
1.  `word_tokenize`: Splits the string into a list of words and punctuation.
2.  `Counter`: Creates a dictionary-like object where keys are words and values are their frequencies.
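3.  `.most_common(n)`: A method of `Counter` that returns the `n` most frequent elements as `(element, count)` pairs. Elements with equal counts are returned in the order they were first encountered, which is why `'The'` and `'cat'` come before `'the'` and `'box'` above.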
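
Two further `Counter` behaviours are useful when counting words across several texts. The sketch below is a small standard-library illustration; the token lists are made up for the example.

```python
from collections import Counter

counts = Counter()

# update() adds counts from another iterable, so a single Counter
# can accumulate tokens from several sentences or documents.
for sentence_tokens in (["the", "cat"], ["the", "box"]):
    counts.update(sentence_tokens)

print(counts["the"])           # token seen in both lists
print(counts["dog"])           # missing tokens count as 0 instead of raising KeyError
print(counts.most_common(1))
```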

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Simple Text Preprocessing</span><br>

### Why Preprocess?
Raw text is often messy. Preprocessing helps create better input data for machine learning models or statistical analysis.

**Common Preprocessing Steps:**
1.  **Tokenization:** Splitting the text into tokens, the raw material for the bag of words.
2.  **Lowercasing:** Ensuring "Cat" and "cat" are treated as the same word.
3.  **Lemmatization/Stemming:** Shortening words to their root stems (e.g., "walking" $\rightarrow$ "walk").
4.  **Filtering:** Removing stop words (common words like "the", "is", "and"), punctuation, or unwanted tokens.

### Preprocessing Example
*   **Input Text:** "Cats, dogs and birds are common pets. So are fish."
*   **Target Output Tokens:** `cat`, `dog`, `bird`, `common`, `pet`, `fish`.

Notice how punctuation is removed, plurals are reduced to their singular form (the job of lemmatization or stemming), and common words like "are", "and", and "so" are dropped as stop words.
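
The example above can be reproduced with a rough standard-library sketch. The three-word stop list and the `naive_singular` helper are assumptions made purely for illustration. A real pipeline would use a full stop-word list and a proper lemmatizer or stemmer, since stripping a trailing "s" fails on many English words.

```python
import re

# Tiny illustrative stop-word list (an assumption, not NLTK's list)
STOP_WORDS = {"and", "are", "so"}

def naive_singular(word):
    """Hypothetical, crude plural handling: drop one trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    # Lowercase, then keep only runs of letters (this drops punctuation)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Filter stop words, then reduce plurals
    return [naive_singular(t) for t in tokens if t not in STOP_WORDS]

cleaned = preprocess("Cats, dogs and birds are common pets. So are fish.")
print(cleaned)
```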

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Text Preprocessing with Python</span><br>

We can combine list comprehensions with NLTK's stopword list to clean our text efficiently.

### Original Code (From Document)


In [3]:
from nltk.corpus import stopwords

text = """The cat is in the box. The cat likes the box. 
The box is over the cat."""

# Tokenize and lowercase, keep only alphabetic tokens
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]

# Remove stopwords
no_stops = [t for t in tokens if t not in stopwords.words('english')]

# Count results
Counter(no_stops).most_common(2)


[('cat', 3), ('box', 3)]


### Enhanced Executable Code


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter

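# Download necessary NLTK data.
# Assumption: recent NLTK releases look for 'punkt_tab' rather than 'punkt',
# so both are requested here.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)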

text = """The cat is in the box. The cat likes the box. 
The box is over the cat."""

# 1. Tokenize and Lowercase
# We also check w.isalpha() to remove punctuation like '.'
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]

# 2. Remove Stopwords
# We load the English stopword list once, as a set for fast membership tests
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in english_stops]

# 3. Analyze results
print(f"Original Tokens (subset): {tokens[:5]}...")
print(f"Cleaned Tokens: {no_stops}")

# 4. Most Common
print("\nMost Common Cleaned Words:")
print(Counter(no_stops).most_common(2))


Original Tokens (subset): ['the', 'cat', 'is', 'in', 'the']...
Cleaned Tokens: ['cat', 'box', 'cat', 'likes', 'box', 'box', 'cat']

Most Common Cleaned Words:
[('cat', 3), ('box', 3)]



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> Notice that "is", "in", "the", and "over" were removed because they are in the standard English stopword list. This leaves us with the content-heavy words: "cat", "box", and "likes". </div>

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Introduction to Gensim</span><br>

### What is Gensim?
**Gensim** is a popular open-source NLP library designed for topic modeling and document similarity analysis. It implements well-known models from the research literature (such as LDA and word2vec) to perform complex tasks such as:
*   Building document or word vectors.
*   Performing topic identification.
*   Document comparison.

### Word Vectors
Gensim is well known for working with word vectors (embeddings). A word vector represents a word as a point in a multi-dimensional space, which makes arithmetic on words possible, as in the well-known analogy:
$$ \text{King} - \text{Man} + \text{Woman} \approx \text{Queen} $$
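
To make the analogy concrete, here is a toy sketch with hand-made two-dimensional vectors. The numbers are invented for illustration (first coordinate "royalty", second coordinate "gender"). Real embeddings are learned from data, have many more dimensions, and only satisfy such analogies approximately.

```python
import math

# Made-up toy vectors: [royalty, gender]
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed coordinate by coordinate
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# The word whose vector points in the most similar direction
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(target, nearest)
```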

### Building a Dictionary
The first step in using Gensim for Bag-of-words is creating a **Dictionary**. This maps every unique word in the corpus to a unique integer ID.

### Enhanced Executable Code: Creating a Dictionary


In [5]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

# A corpus of movie reviews/descriptions
my_documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]

# Tokenize all documents
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]

# Create the Gensim Dictionary
dictionary = Dictionary(tokenized_docs)

# View the token to ID mapping
print("Token ID Mapping (first 10):")
# We iterate to show just a few, as dictionaries can be large
for k, v in list(dictionary.token2id.items())[:10]:
    print(f"Word: '{k}' -> ID: {v}")


Token ID Mapping (first 10):
Word: '.' -> ID: 0
Word: 'a' -> ID: 1
Word: 'about' -> ID: 2
Word: 'aliens' -> ID: 3
Word: 'and' -> ID: 4
Word: 'movie' -> ID: 5
Word: 'spaceship' -> ID: 6
Word: 'the' -> ID: 7
Word: 'was' -> ID: 8
Word: '!' -> ID: 9



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Creating a Gensim Corpus</span><br>

Once we have a dictionary, we can convert our documents into a **Corpus**. In Gensim, a corpus is a list of bag-of-words vectors.

Instead of storing the strings, Gensim stores a list of tuples: `(word_id, word_frequency)`.
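
The idea behind the `(word_id, word_frequency)` format can be sketched in plain Python. This is a simplified stand-in, not Gensim's implementation: the helpers `build_token2id` and `to_bow` are hypothetical, and IDs are assigned here in first-seen order, which may differ from the IDs Gensim assigns.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Hypothetical helper: give each new token the next free integer ID."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document to sorted (word_id, frequency) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["space", "is", "cool", "space"], ["more", "space", "films"]]
token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bows)
```

Only tokens that occur in a document appear in its vector, so each document stays short even when the vocabulary is large.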

### Original Code (From Document)


In [6]:
# corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
# corpus



### Enhanced Executable Code


In [7]:
# Create the corpus using the dictionary from the previous section
# doc2bow converts a collection of words to its bag-of-words representation
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

print("Gensim Corpus (Bag-of-words representation):")
for i, doc_vector in enumerate(corpus):
    print(f"Doc {i}: {doc_vector}")


Gensim Corpus (Bag-of-words representation):
Doc 0: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
Doc 1: [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
Doc 2: [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)]
Doc 3: [(0, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (20, 1), (21, 1), (22, 1), (23, 1)]
Doc 4: [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)]
Doc 5: [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]



**Understanding the Output:**
If the output for a document is `[(0, 1), (1, 1)]`:
*   The word with ID `0` appears `1` time.
*   The word with ID `1` appears `1` time.

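This sparse format stores only the tokens that actually occur in each document, which keeps it compact compared with a dense vector covering the whole vocabulary. Gensim models built on such a corpus can be easily saved, updated, and reused.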

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. Tf-idf with Gensim</span><br>

### What is Tf-idf?
**Tf-idf** stands for **Term Frequency - Inverse Document Frequency**. It is a numerical statistic intended to reflect how important a word is to a document in a collection (corpus).

*   **Problem with raw counts:** Common words like "the" or "movie" (in a movie corpus) appear frequently but carry little specific meaning.
*   **Solution:** Tf-idf down-weights words that appear in *many* documents and up-weights words that appear frequently in *one* document but rarely elsewhere.

### The Formula
The weight $w_{i,j}$ for token $i$ in document $j$ is calculated as:

$$ w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right) $$

Where:
*   $tf_{i,j}$ = number of occurrences of token $i$ in document $j$.
*   $df_i$ = number of documents that contain token $i$.
*   $N$ = total number of documents.
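
As a worked example of the formula, the sketch below recomputes the weights for "I really liked the movie!" by hand, using the six movie documents from the Gensim sections. It makes three assumptions: a regex tokenizer stands in for `word_tokenize`, the logarithm is base 2, and each document's weights are scaled to unit length. The last two mirror what are, to my knowledge, Gensim's defaults, so the numbers can be compared with the `TfidfModel` output in this section.

```python
import math
import re

documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]

def tokenize(text):
    # Stand-in tokenizer (assumption): words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)  # N in the formula

# df_i: number of documents that contain token i
df = {}
for doc in tokenized:
    for token in set(doc):
        df[token] = df.get(token, 0) + 1

def tfidf(doc):
    # tf_{i,j} * log2(N / df_i) for every distinct token in the document
    raw = {t: doc.count(t) * math.log2(n_docs / df[t]) for t in set(doc)}
    # Scale to unit length (L2 normalization)
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw

weights = tfidf(tokenized[1])
for token, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{token!r}: {weight:.4f}")
```

"really" occurs in only one of the six documents, so it receives the largest weight, while "movie", "the", and "!" each occur in four documents and receive the smallest.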

### Implementing Tf-idf in Gensim
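Gensim allows you to transform a simple Bag-of-words corpus into a Tf-idf corpus. By default, `TfidfModel` uses a base-2 logarithm and scales each document's vector to unit length, so the weights it reports are normalized versions of the formula above rather than the raw products.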

### Enhanced Executable Code


In [8]:
from gensim.models.tfidfmodel import TfidfModel

# 1. Initialize the Tf-idf model using the existing corpus
tfidf = TfidfModel(corpus)

# 2. Apply the model to a specific document (e.g., the second document at index 1)
# doc_vector is the BoW representation of "I really liked the movie!"
doc_vector = corpus[1]
tfidf_weights = tfidf[doc_vector]

print(f"Original BoW Vector for Doc 1: {doc_vector}")
print(f"Tf-idf Weighted Vector for Doc 1: {tfidf_weights}")

# Let's see what words these IDs correspond to
print("\nInterpretation:")
for word_id, weight in tfidf_weights:
    print(f"Word: '{dictionary[word_id]}' \t Weight: {weight:.4f}")


Original BoW Vector for Doc 1: [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
Tf-idf Weighted Vector for Doc 1: [(5, np.float64(0.1746298276735174)), (7, np.float64(0.1746298276735174)), (9, np.float64(0.1746298276735174)), (10, np.float64(0.29853166221463673)), (11, np.float64(0.47316148988815415)), (12, np.float64(0.7716931521027908))]

Interpretation:
Word: 'movie' 	 Weight: 0.1746
Word: 'the' 	 Weight: 0.1746
Word: '!' 	 Weight: 0.1746
Word: 'i' 	 Weight: 0.2985
Word: 'liked' 	 Weight: 0.4732
Word: 'really' 	 Weight: 0.7717



**Observation:**
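Words that appear in many documents get lower weights, while words specific to this sentence get higher ones. Here "movie", "the", and "!" each occur in four of the six documents and score 0.1746, whereas "really" occurs only in this document and scores 0.7717.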

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Conclusion</span><br>

In this notebook, we explored the foundational techniques for processing text data in Python:

1.  **Bag-of-words (BoW):** We learned that BoW converts text into word counts, discarding grammar but keeping frequency information.
2.  **Preprocessing:** We utilized `nltk` to clean data by tokenizing, lowercasing, and removing stopwords to improve data quality.
3.  **Gensim Dictionary & Corpus:** We moved beyond simple lists to using Gensim's efficient `Dictionary` (mapping words to IDs) and `Corpus` (sparse vector representations).
4.  **Tf-idf:** We applied the Tf-idf transformation to down-weight common words and highlight significant, document-specific terms.

**Next Steps:**
These techniques form the basis for more advanced NLP tasks such as:
*   **Topic Modeling:** Using Latent Dirichlet Allocation (LDA) on the Gensim corpus.
*   **Document Similarity:** Calculating Cosine Similarity between Tf-idf vectors to find similar documents.
*   **Sentiment Analysis:** Using the weighted word vectors as features for machine learning classifiers.
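
As a pointer toward the document-similarity step, cosine similarity between two sparse vectors in the `(word_id, weight)` format can be sketched as follows. The function and the three example vectors are made up for illustration and do not come from the corpus above. Gensim ships its own similarity utilities, which are not shown here.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse [(word_id, weight), ...] vectors."""
    a, b = dict(vec_a), dict(vec_b)
    # Only IDs present in both vectors contribute to the dot product
    dot = sum(weight * b[word_id] for word_id, weight in a.items() if word_id in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented example vectors
doc_x = [(0, 0.6), (1, 0.8)]
doc_y = [(1, 0.8), (2, 0.6)]
doc_z = [(3, 1.0)]

sim_xy = cosine_similarity(doc_x, doc_y)  # share word 1
sim_xz = cosine_similarity(doc_x, doc_z)  # share nothing
print(sim_xy, sim_xz)
```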
