<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Bag-of-Words Sentiment Analysis in Python</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Text Processing, N-Grams, and Feature Engineering)</span></div>

## Table of Contents

1. [Introduction to Bag-of-Words (BOW)](#section-1)
2. [The Bag-of-Words Model Example](#section-2)
3. [Implementing BOW with Scikit-Learn](#section-3)
4. [Getting Granular with N-Grams](#section-4)
5. [Specifying Vocabulary Size](#section-5)
6. [Feature Engineering from Text](#section-6)
7. [Language Detection](#section-7)
8. [Conclusion](#section-8)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Introduction to Bag-of-Words (BOW)</span><br>

### What is a Bag-of-Words?

The **Bag-of-Words (BOW)** model is a fundamental concept in Natural Language Processing (NLP) used to transform text into numerical features that machine learning algorithms can understand.

**Key Characteristics:**
*   **Occurrence Description**: It describes the occurrence of words within a document or a collection of documents (known as a **corpus**).
*   **Vocabulary Building**: It builds a vocabulary of all unique words found in the corpus and creates a measure of their presence (usually a count).
*   **Loss of Structure**: As the name implies, it treats text as a "bag" of words, disregarding grammar rules and word order.

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> The core idea is that similar documents will have similar word counts. If two documents contain the word "excellent" multiple times, they are likely to have a similar positive sentiment. </div>

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. The Bag-of-Words Model Example</span><br>

### Amazon Product Reviews Data

To understand BOW, let's look at a sample dataset of Amazon product reviews. The dataset typically contains a sentiment score and the text review.

| | score | review |
|:---|:---:|:---|
| **0** | 1 | Stuning even for the non-gamer: This sound tr... |
| **1** | 1 | The best soundtrack ever to anything.: I'm re... |
| **2** | 1 | Amazing!: This soundtrack is my favorite musi... |
| **3** | 1 | Excellent Soundtrack: I truly like this sound... |
| **4** | 1 | Remember, Pull Your Jaw Off The Floor After H... |
| **5** | 1 | an absolute masterpiece: I am quite sure any ... |
| **6** | 0 | Buyer beware: This is a self-published book, ... |
| **7** | 1 | Glorious story: I loved Whisper of the wicked... |
| **8** | 1 | A FIVE STAR BOOK: I just finished reading Whi... |
| **9** | 1 | Whispers of the Wicked Saints: This was a eas... |

### How BOW Works Conceptually

Consider the following positive review:
> *"This is the best book ever. I loved the book and highly recommend it!!!"*

A Bag-of-Words model converts this sentence into a frequency dictionary (or vector). Most implementations strip punctuation and lowercase the text before counting tokens; the conceptual dictionary below keeps the original casing for readability.



In [1]:
# Conceptual representation of the review above
bow_representation = {
    'This': 1,
    'is': 1,
    'the': 2,
    'best': 1,
    'book': 2,
    'ever': 1,
    'I': 1,
    'loved': 1,
    'and': 1,
    'highly': 1,
    'recommend': 1,
    'it': 1
}

print("Frequency Dictionary:", bow_representation)


Frequency Dictionary: {'This': 1, 'is': 1, 'the': 2, 'best': 1, 'book': 2, 'ever': 1, 'I': 1, 'loved': 1, 'and': 1, 'highly': 1, 'recommend': 1, 'it': 1}
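
As a rough sketch, a frequency dictionary like this can be computed with the standard library alone. The regular expression `[a-z']+` and the lowercasing step are simplifying assumptions made for this illustration, not the tokenization rules of any particular library.

```python
import re
from collections import Counter

review = "This is the best book ever. I loved the book and highly recommend it!!!"

# Lowercase the text, then keep runs of letters/apostrophes as tokens
# (punctuation such as "." and "!!!" is dropped by the pattern)
tokens = re.findall(r"[a-z']+", review.lower())

# Counter maps each token to the number of times it occurs
bow = Counter(tokens)
print(bow)
```

Because the text is lowercased first, the keys differ slightly from the hand-written dictionary above (`'this'` and `'i'` instead of `'This'` and `'I'`), but the counts are the same.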



**The Resulting Matrix:**
When applied to a whole corpus, the output looks like a matrix where:
*   **Rows** represent individual documents (reviews).
*   **Columns** represent every unique word in the vocabulary.
*   **Values** represent the count of that word in that document.

| | ... | wrong | wrote | year | years | yes | yet | you | young | your | yourself |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **0** | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| **1** | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| **2** | ... | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
| **3** | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| **4** | ... | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 |
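
The rows, columns, and values layout described above can be built by hand for a toy corpus. This is a minimal sketch: the three short documents and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

docs = [
    "the best book ever",
    "i loved the book",
    "the worst book",
]

# Tokenize naively on whitespace (an assumption for this sketch)
tokenized = [doc.split() for doc in docs]

# Vocabulary: every unique token, sorted so the column order is stable
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary word
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```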

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Implementing BOW with Scikit-Learn</span><br>

We use the `CountVectorizer` from the `sklearn.feature_extraction.text` module to automate this process.

### Step 1: Setup and Data Creation
First, let's create a small DataFrame to mimic the Amazon reviews dataset so the code is executable.



In [2]:
import pandas as pd

# Creating a dummy dataset to mimic the slides
data = {
    'score': [1, 1, 1, 1, 0],
    'review': [
        "Stunning even for the non-gamer: This sound track is beautiful",
        "The best soundtrack ever to anything.",
        "Amazing!: This soundtrack is my favorite music",
        "Excellent Soundtrack: I truly like this soundtrack",
        "Buyer beware: This is a self-published book and it is not good."
    ]
}
reviews = pd.DataFrame(data)
print(reviews.head())


   score                                             review
0      1  Stunning even for the non-gamer: This sound tr...
1      1              The best soundtrack ever to anything.
2      1     Amazing!: This soundtrack is my favorite music
3      1  Excellent Soundtrack: I truly like this soundt...
4      0  Buyer beware: This is a self-published book an...



### Step 2: Using CountVectorizer
We instantiate the vectorizer, fit it to our data, and transform the data into a sparse matrix.



In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the vectorizer
# max_features=1000 limits the vocabulary to the top 1000 most frequent words
vect = CountVectorizer(max_features=1000)

# Fit the vectorizer to the data (learn the vocabulary)
vect.fit(reviews.review)

# Transform the data (create the matrix)
X = vect.transform(reviews.review)

# Output the shape and type of the matrix
print(f"Shape of X: {X.shape}")
print(X)


Shape of X: (5, 32)
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 40 stored elements and shape (5, 32)>
  Coords	Values
  (0, 3)	1
  (0, 8)	1
  (0, 12)	1
  (0, 13)	1
  (0, 15)	1
  (0, 20)	1
  (0, 24)	1
  (0, 26)	1
  (0, 27)	1
  (0, 28)	1
  (0, 30)	1
  (1, 2)	1
  (1, 4)	1
  (1, 9)	1
  (1, 25)	1
  (1, 27)	1
  (1, 29)	1
  (2, 0)	1
  (2, 11)	1
  (2, 15)	1
  (2, 18)	1
  (2, 19)	1
  (2, 25)	1
  (2, 28)	1
  (3, 10)	1
  (3, 17)	1
  (3, 25)	2
  (3, 28)	1
  (3, 31)	1
  (4, 1)	1
  (4, 5)	1
  (4, 6)	1
  (4, 7)	1
  (4, 14)	1
  (4, 15)	2
  (4, 16)	1
  (4, 21)	1
  (4, 22)	1
  (4, 23)	1
  (4, 28)	1



### Step 3: Transforming the Output to a DataFrame
The output `X` is a sparse matrix (which saves memory by only storing non-zero values). To visualize it or use it in standard pandas workflows, we can convert it to a dense array and then a DataFrame.



In [None]:
# Transform to an array
my_array = X.toarray()

# Transform back to a dataframe, assign column names
# get_feature_names_out() is the modern sklearn equivalent of get_feature_names()
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())

# Display the first few rows and columns
print(X_df.iloc[:, :10].head())

   amazing  and  anything  beautiful  best  beware  book  buyer  even  ever
0        0    0         0          1     0       0     0      0     1     0
1        0    0         1          0     1       0     0      0     0     1
2        1    0         0          0     0       0     0      0     0     0
3        0    0         0          0     0       0     0      0     0     0
4        0    1         0          0     0       1     1      1     0     0
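
To see why the sparse format matters, one can compare the number of stored entries with the number of cells a dense array would need. This sketch rebuilds the five dummy reviews so it runs on its own; `density` is just a name chosen for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Stunning even for the non-gamer: This sound track is beautiful",
    "The best soundtrack ever to anything.",
    "Amazing!: This soundtrack is my favorite music",
    "Excellent Soundtrack: I truly like this soundtrack",
    "Buyer beware: This is a self-published book and it is not good.",
]

X = CountVectorizer().fit_transform(docs)
n_rows, n_cols = X.shape

# nnz is the number of explicitly stored (non-zero) entries
density = X.nnz / (n_rows * n_cols)

print(f"Stored non-zero entries: {X.nnz}")
print(f"Cells in the dense array: {n_rows * n_cols}")
print(f"Density: {density:.2%}")
```

On a real corpus with thousands of reviews and a large vocabulary, the share of non-zero cells is typically far smaller than in this toy example, which is where the memory savings come from.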



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Getting Granular with N-Grams</span><br>

### Why Context Matters
Standard Bag-of-Words considers words in isolation (Unigrams). However, context often changes the meaning entirely, especially with negation.

*   **Example 1**: "I am *happy*, **not** *sad*."
*   **Example 2**: "I am *sad*, **not** *happy*."

If we only count "happy" and "sad", both sentences look identical. We need to capture the sequence.

### N-Gram Types
*   **Unigrams**: Single tokens (e.g., "The", "weather").
*   **Bigrams**: Pairs of tokens (e.g., "The weather", "weather today").
*   **Trigrams**: Triples of tokens (e.g., "The weather today").

**Example Sentence**: "The weather today is wonderful."

*   **Unigrams**: `{The, weather, today, is, wonderful}`
*   **Bigrams**: `{The weather, weather today, today is, is wonderful}`
*   **Trigrams**: `{The weather today, weather today is, today is wonderful}`
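
The lists above can be reproduced with a few lines of plain Python. The helper name `ngrams` and the whitespace split are choices made for this sketch; the trailing period is left out of the sentence to keep the tokens clean.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The weather today is wonderful".split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence with `k` tokens yields `k - n + 1` n-grams, so longer n-grams are fewer per sentence but far more varied across a corpus.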

### Implementing N-Grams
We can capture these sequences using the `ngram_range` argument in `CountVectorizer`.



In [None]:
# Example: Capturing Unigrams and Bigrams
# ngram_range=(min_n, max_n)

# Only unigrams (Default)
vect_uni = CountVectorizer(ngram_range=(1, 1))

# Uni- and Bigrams
vect_bi = CountVectorizer(ngram_range=(1, 2))

# Fit and transform to see the difference
vect_bi.fit(reviews.review)
print("Vocabulary with Bigrams (First 10):")
print(vect_bi.get_feature_names_out()[:10])

Vocabulary with Bigrams (First 10):
['amazing' 'amazing this' 'and' 'and it' 'anything' 'beautiful' 'best'
 'best soundtrack' 'beware' 'beware this']



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> <b>What is the best n?</b> Longer sequences produce more features and can capture more context, but they sharply increase the risk of <b>overfitting</b> because specific long phrases may appear only in the training set. </div>
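
To get a feel for how quickly the feature count grows, the sketch below fits `CountVectorizer` on three of the dummy reviews with an increasing upper bound for `ngram_range`. The `sizes` dictionary is just a convenience for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The best soundtrack ever to anything.",
    "Amazing!: This soundtrack is my favorite music",
    "Excellent Soundtrack: I truly like this soundtrack",
]

# Map the largest n-gram length to the resulting vocabulary size
sizes = {}
for max_n in (1, 2, 3):
    vect = CountVectorizer(ngram_range=(1, max_n)).fit(docs)
    sizes[max_n] = len(vect.vocabulary_)

print(sizes)
```

Even on three short sentences the vocabulary grows with every step, while most of the added phrases occur only once.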

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Specifying Vocabulary Size</span><br>

When working with large corpora, the vocabulary can become massive. We can control the size using three key parameters in `CountVectorizer`.

1.  **`max_features`**:
    *   If specified (int), the vocabulary keeps only the `max_features` most frequent terms across the corpus.
    *   If `None`, all terms are included.
2.  **`max_df`** (Maximum Document Frequency):
    *   Ignores terms that appear in *too many* documents (likely stop words like "the", "is").
    *   **Float (0.0 - 1.0)**: A proportion of documents (e.g., 0.95 means ignore terms appearing in more than 95% of documents).
    *   **Integer**: An absolute number of documents.
3.  **`min_df`** (Minimum Document Frequency):
    *   Ignores terms that appear in *too few* documents (likely typos or extremely rare words).
    *   **Float (0.0 - 1.0)**: A proportion of documents.
    *   **Integer**: An absolute number of documents.



In [None]:
# Example configuration
vect_limited = CountVectorizer(
    max_features=500,  # Keep top 500 words
    max_df=0.95,       # Ignore words appearing in > 95% of reviews
    min_df=2           # Ignore words appearing in < 2 reviews
)

# Note: With our tiny dummy dataset, min_df=2 might filter out almost everything, 
# but this is the syntax for real-world data.


---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Feature Engineering from Text</span><br>

### Goal: Enriching the Dataset
Beyond just word counts, we can extract meta-features from the text to help our model. Common features include:
*   Length of the review.
*   Number of sentences.
*   Parts of speech involved.
*   Count of punctuation marks (e.g., many "!" might indicate strong emotion).
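
A few of these meta-features can be sketched with the standard library alone. The helper `meta_features` is hypothetical, and the sentence split is deliberately naive (it breaks after `.`, `!`, or `?` followed by whitespace), so treat the sentence count as an approximation.

```python
import re

def meta_features(text):
    """Compute a few simple descriptors of a review."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "n_chars": len(text),
        "n_sentences": len(sentences),
        "n_exclamations": text.count("!"),
    }

review = "This is the best book ever. I loved the book and highly recommend it!!!"
features = meta_features(review)
print(features)
```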

### Tokenizing Strings
We use `nltk` (Natural Language Toolkit) to split text into tokens accurately.



In [None]:
import nltk
from nltk import word_tokenize

# Ensure the tokenizer data is downloaded
# (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        nltk.download(resource)

# Example string
anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

# Tokenize
tokens = word_tokenize(anna_k)
print(tokens)

['Happy', 'families', 'are', 'all', 'alike', ',', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way', '.']



### Creating a "Number of Tokens" Feature
We can apply this logic to the entire `review` column of the `reviews` DataFrame using a list comprehension.



In [None]:
# 1. Create a list of tokens for every review
word_tokens = [word_tokenize(review) for review in reviews.review]

print(f"Type of word_tokens: {type(word_tokens)}")
print(f"Type of first element: {type(word_tokens[0])}")

# 2. Calculate the length of each token list
len_tokens = [len(review_tokens) for review_tokens in word_tokens]

# 3. Assign this new list as a column in the DataFrame
reviews['n_tokens'] = len_tokens

# View the result
print(reviews[['score', 'review', 'n_tokens']].head())

Type of word_tokens: <class 'list'>
Type of first element: <class 'list'>
   score                                             review  n_tokens
0      1  Stunning even for the non-gamer: This sound tr...        11
1      1              The best soundtrack ever to anything.         7
2      1     Amazing!: This soundtrack is my favorite music         9
3      1  Excellent Soundtrack: I truly like this soundt...         8
4      0  Buyer beware: This is a self-published book an...        14



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. Language Detection</span><br>

### The Problem
In a global dataset, reviews might be in different languages. We want to detect the language of each string and capture the most likely language in a new column. We use the `langdetect` library.

### Detecting Language of a Single String



In [10]:
from langdetect import detect_langs

# Example Spanish sentence
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'

# Detect language
# Returns a list of candidate languages with their probabilities, e.g., [es:0.999...]
result = detect_langs(foreign)
print(result)

[es:0.9999941038205692]



### Building a Feature for the Language
We iterate through the DataFrame rows and apply detection.



In [11]:
from langdetect.lang_detect_exception import LangDetectException

languages = []

# Loop through the reviews
# Note: In a real scenario, ensure 'langdetect' is installed via pip
# We select the column by name so the loop does not depend on column order
for text in reviews['review']:
    # Very short text or strings without letters can raise LangDetectException
    try:
        languages.append(detect_langs(text))
    except LangDetectException:
        languages.append("unknown")

print("Raw detection results (first 5):", languages[:5])


Raw detection results (first 5): [[en:0.9999957297420001], [en:0.9999972643875168], [en:0.999994856018333], [en:0.9999970741645459], [en:0.999996468162091]]



### Cleaning the Output
The output of `detect_langs` is a list of objects. To get a clean string like 'es' or 'en', we need to parse it. The slides suggest converting to string and splitting, though accessing the object attributes directly is also possible. Here we follow the slide's logic:



In [12]:
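# Logic demonstrated in slides:
# 1. Convert the list to a string -> "[es:0.999]"
# 2. Split on the colon -> ["[es", "0.999]"]
# 3. Take the first element -> "[es"
# 4. Slice off the leading bracket -> "es"
#
# Here we index the list first (lang[0]), so str() gives "es:0.999"
# and no bracket slicing is needed.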

# Applying this logic to the list
clean_languages = []
for lang in languages:
    if lang != "unknown":
        # Convert the first detection result to string and parse
        # Note: lang[0] gets the most probable language object
        code = str(lang[0]).split(':')[0]
        clean_languages.append(code)
    else:
        clean_languages.append('unknown')

# Assign to DataFrame
reviews['language'] = clean_languages

print(reviews[['review', 'language']].head())


                                              review language
0  Stunning even for the non-gamer: This sound tr...       en
1              The best soundtrack ever to anything.       en
2     Amazing!: This soundtrack is my favorite music       en
3  Excellent Soundtrack: I truly like this soundt...       en
4  Buyer beware: This is a self-published book an...       en



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Conclusion</span><br>

In this notebook, we covered the essential techniques for preparing text data for sentiment analysis:

1.  **Bag-of-Words (BOW)**: We learned how to convert unstructured text into a numeric matrix of token counts using `CountVectorizer`.
2.  **N-Grams**: We explored how to capture context (like "not happy") by including Bigrams and Trigrams, rather than just single words.
3.  **Vocabulary Management**: We discussed using `max_features`, `max_df`, and `min_df` to optimize the size of our feature set and reduce noise.
4.  **Feature Engineering**: We extracted meta-features like review length (`n_tokens`) to enrich the dataset.
5.  **Language Detection**: We used `langdetect` to identify the language of reviews, which is crucial for filtering or specific processing in multi-lingual datasets.

**Next Steps:**
With these numeric features (the BOW matrix and the extra columns), you are now ready to train machine learning models (like Logistic Regression or Naive Bayes) to predict the sentiment score of new reviews.
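
As a rough sketch of that next step, the BOW matrix and an extra numeric column can be stacked side by side and passed to a classifier. Several things here are assumptions made for the illustration: the whitespace-based token count stands in for `n_tokens`, `LogisticRegression` is used with default settings, and the model is fitted and evaluated on the same five dummy reviews, which says nothing about real-world accuracy.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews_text = [
    "Stunning even for the non-gamer: This sound track is beautiful",
    "The best soundtrack ever to anything.",
    "Amazing!: This soundtrack is my favorite music",
    "Excellent Soundtrack: I truly like this soundtrack",
    "Buyer beware: This is a self-published book and it is not good.",
]
y = [1, 1, 1, 1, 0]

# Bag-of-words counts (sparse matrix)
X_bow = CountVectorizer().fit_transform(reviews_text)

# Extra column: whitespace token count, standing in for n_tokens
n_tokens = np.array([len(text.split()) for text in reviews_text]).reshape(-1, 1)

# Stack the sparse counts and the extra column side by side
X_all = hstack([X_bow, n_tokens]).tocsr()

model = LogisticRegression().fit(X_all, y)
preds = model.predict(X_all)

print(X_all.shape)
print(preds)
```

With real data, the vectorizer and the model would be fitted on a training split only and evaluated on held-out reviews.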
