## Bag of Words Encoding Exercise

This notebook demonstrates the Bag of Words (BoW) representation by applying scikit-learn's CountVectorizer to a small set of custom text documents.

In [11]:
# Import required libraries
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
# Define the text documents
documents = [
    "Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model.",
    "Getting started in applied machine learning can be difficult, especially when working with real-world data.",
    "One good example is to use a one-hot encoding on categorical data."
]

# Preprocess: lowercase, drop commas and periods, and replace hyphens with spaces
processed_docs = [doc.lower().replace(",", "").replace(".", "").replace("-", " ") for doc in documents]

print("Processed documents:")
for i, doc in enumerate(processed_docs, 1):
    print(f"Doc {i}: {doc}")

Processed documents:
Doc 1: often machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model
Doc 2: getting started in applied machine learning can be difficult especially when working with real world data
Doc 3: one good example is to use a one hot encoding on categorical data
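
The manual cleanup above is not strictly needed for this corpus. With its default settings, CountVectorizer lowercases the input and keeps only runs of two or more word characters, which already discards commas, periods, and hyphens. The sketch below reuses the three documents from this notebook and checks that the raw and the manually cleaned texts produce the same vocabulary. The names `raw_docs`, `manual_docs`, `vocab_raw`, and `vocab_manual` are illustrative and do not appear elsewhere in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = [
    "Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model.",
    "Getting started in applied machine learning can be difficult, especially when working with real-world data.",
    "One good example is to use a one-hot encoding on categorical data.",
]

# Same manual cleanup as in the cell above
manual_docs = [
    d.lower().replace(",", "").replace(".", "").replace("-", " ") for d in raw_docs
]

# Fit one vectorizer per variant and compare the learned vocabularies
vocab_raw = sorted(CountVectorizer().fit(raw_docs).vocabulary_)
vocab_manual = sorted(CountVectorizer().fit(manual_docs).vocabulary_)

print("Same vocabulary:", vocab_raw == vocab_manual)
print("Vocabulary size:", len(vocab_raw))
```

Explicit preprocessing is still worth keeping when the defaults are changed, for example with a custom `token_pattern` or `lowercase=False`.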


## Bag of Words with Frequency Count

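BoW represents each document as a vector with one element per vocabulary word; each element holds the number of times that word occurs in the document. Word order is discarded. Note that CountVectorizer's default tokenizer keeps only tokens of two or more characters, so the single-letter word "a" does not appear in the vocabulary below.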
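
To make the idea concrete, here is a minimal from-scratch sketch of the same counting scheme using only the standard library. It is an approximation for illustration, not how CountVectorizer is implemented: it splits on whitespace and drops single-character tokens to roughly mimic the default tokenizer. The toy sentences and the names `toy_docs`, `tokenized`, `vocabulary`, and `vectors` are made up for this example.

```python
from collections import Counter

toy_docs = ["the cat sat on the mat", "the dog sat"]

# Tokenize on whitespace; drop single-character tokens to roughly mirror
# CountVectorizer's default token pattern.
tokenized = [
    [tok for tok in doc.lower().split() if len(tok) > 1] for doc in toy_docs
]

# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index.
all_tokens = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {word: idx for idx, word in enumerate(all_tokens)}

# One count vector per document, with columns in vocabulary order.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Each row has one entry per vocabulary word, so "the" appearing twice in the first sentence shows up as a 2 in the last column of the first row.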

In [13]:
# Build a Bag of Words representation for the corpus
count_vect = CountVectorizer()

# Fit and transform the documents
bow_rep = count_vect.fit_transform(processed_docs)

# Look at the vocabulary mapping
print("Our vocabulary:")
print(count_vect.vocabulary_)
print(f"\nVocabulary size: {len(count_vect.vocabulary_)} unique words")

Our vocabulary:
{'often': 19, 'machine': 17, 'learning': 16, 'tutorials': 31, 'will': 35, 'recommend': 25, 'or': 22, 'require': 26, 'that': 29, 'you': 39, 'prepare': 23, 'your': 40, 'data': 5, 'in': 14, 'specific': 27, 'ways': 33, 'before': 2, 'fitting': 10, 'model': 18, 'getting': 11, 'started': 28, 'applied': 0, 'can': 3, 'be': 1, 'difficult': 6, 'especially': 8, 'when': 34, 'working': 37, 'with': 36, 'real': 24, 'world': 38, 'one': 21, 'good': 12, 'example': 9, 'is': 15, 'to': 30, 'use': 32, 'hot': 13, 'encoding': 7, 'on': 20, 'categorical': 4}

Vocabulary size: 41 unique words


In [14]:
# Display vocabulary in sorted order
print("Vocabulary (sorted by index):")
sorted_vocab = sorted(count_vect.vocabulary_.items(), key=lambda x: x[1])
for word, idx in sorted_vocab:
    print(f"  {idx:2d}: {word}")

Vocabulary (sorted by index):
   0: applied
   1: be
   2: before
   3: can
   4: categorical
   5: data
   6: difficult
   7: encoding
   8: especially
   9: example
  10: fitting
  11: getting
  12: good
  13: hot
  14: in
  15: is
  16: learning
  17: machine
  18: model
  19: often
  20: on
  21: one
  22: or
  23: prepare
  24: real
  25: recommend
  26: require
  27: specific
  28: started
  29: that
  30: to
  31: tutorials
  32: use
  33: ways
  34: when
  35: will
  36: with
  37: working
  38: world
  39: you
  40: your


In [15]:
# See the BoW representation for each document
print("BoW representation for Document 1:")
print(bow_rep[0].toarray())
print(f"Shape: {bow_rep[0].toarray().shape}")

print("\nBoW representation for Document 2:")
print(bow_rep[1].toarray())

print("\nBoW representation for Document 3:")
print(bow_rep[2].toarray())

BoW representation for Document 1:
[[0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 2 2 1 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1
  0 0 0 1 1]]
Shape: (1, 41)

BoW representation for Document 2:
[[1 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0
  1 1 1 0 0]]

BoW representation for Document 3:
[[0 0 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 0 0 0 1 2 0 0 0 0 0 0 0 0 1 0 1 0 0 0
  0 0 0 0 0]]
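
The calls to `.toarray()` above are needed because `fit_transform` returns a SciPy sparse matrix that stores only the non-zero counts. The sketch below, on a toy corpus with illustrative names (`toy_docs`, `X`, `dense`), shows how to inspect the sparse result before converting it. For large vocabularies, converting to a dense array can use a lot of memory, so it is usually reserved for small examples like this one.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["data data machine", "learning is good"]
X = CountVectorizer().fit_transform(toy_docs)

print("Sparse:", issparse(X))       # whether X is a SciPy sparse matrix
print("Shape:", X.shape)            # (documents, vocabulary size)
print("Stored non-zeros:", X.nnz)   # number of explicitly stored entries

# Dense view, suitable only for small matrices
dense = X.toarray()
print(dense)
```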


In [16]:
# Show which words appear in each document
feature_names = count_vect.get_feature_names_out()

for i, doc in enumerate(processed_docs, 1):
    print(f"\nDocument {i}: '{doc[:60]}...'")
    bow_vector = bow_rep[i-1].toarray()[0]
    print("Words and their frequencies:")
    for word, freq in zip(feature_names, bow_vector):
        if freq > 0:
            print(f"  {word:15s}: {int(freq)}")


Document 1: 'often machine learning tutorials will recommend or require t...'
Words and their frequencies:
  before         : 1
  data           : 1
  fitting        : 1
  in             : 1
  learning       : 2
  machine        : 2
  model          : 1
  often          : 1
  or             : 1
  prepare        : 1
  recommend      : 1
  require        : 1
  specific       : 1
  that           : 1
  tutorials      : 1
  ways           : 1
  will           : 1
  you            : 1
  your           : 1

Document 2: 'getting started in applied machine learning can be difficult...'
Words and their frequencies:
  applied        : 1
  be             : 1
  can            : 1
  data           : 1
  difficult      : 1
  especially     : 1
  getting        : 1
  in             : 1
  learning       : 1
  machine        : 1
  real           : 1
  started        : 1
  when           : 1
  with           : 1
  working        : 1
  world          : 1

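Document 3: 'one good example is to use a one hot encoding on categorical...'
Words and their frequencies:
  categorical    : 1
  data           : 1
  encoding       : 1
  example        : 1
  good           : 1
  hot            : 1
  is             : 1
  on             : 1
  one            : 2
  to             : 1
  use            : 1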

In [17]:
# Transform a new text using the learned vocabulary
new_text = ["machine learning is often difficult but good tutorials can help with data preparation"]
new_bow = count_vect.transform(new_text)

print(f"New text: '{new_text[0]}'")
print(f"\nBoW representation: {new_bow.toarray()}")
print("\nWords found in vocabulary:")
feature_names = count_vect.get_feature_names_out()
for word, freq in zip(feature_names, new_bow.toarray()[0]):
    if freq > 0:
        print(f"  {word:15s}: {int(freq)}")

New text: 'machine learning is often difficult but good tutorials can help with data preparation'

BoW representation: [[0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
  1 0 0 0 0]]

Words found in vocabulary:
  can            : 1
  data           : 1
  difficult      : 1
  good           : 1
  is             : 1
  learning       : 1
  machine        : 1
  often          : 1
  tutorials      : 1
  with           : 1
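
Words in the new text that were not seen during fitting (here "but", "help", and "preparation") are silently ignored by `transform`. One way to see which tokens are dropped is to run the vectorizer's own analyzer and compare against `vocabulary_`. The following is a small sketch on a toy corpus; the training sentences and the names `vec`, `analyzer`, `tokens`, and `unknown` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer().fit(["machine learning is good", "data can be difficult"])

# build_analyzer() returns the preprocessing + tokenization callable
# that the vectorizer applies internally.
analyzer = vec.build_analyzer()

new_text = "machine learning can help"
tokens = analyzer(new_text)
unknown = [tok for tok in tokens if tok not in vec.vocabulary_]

print("Tokens:", tokens)
print("Out of vocabulary:", unknown)
print("Total count kept:", int(vec.transform([new_text]).sum()))
```

If many tokens end up out of vocabulary, the fitted vocabulary may be too small for the texts being transformed.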


## Bag of Words with Binary Vectors

Instead of word frequencies, we can use binary vectors, where 1 indicates that a word is present in the document and 0 that it is absent. Passing `binary=True` to CountVectorizer sets every non-zero count to 1.

In [18]:
# BoW with binary vectors
count_vect_binary = CountVectorizer(binary=True)
bow_rep_binary = count_vect_binary.fit_transform(processed_docs)

print("Binary BoW representation for all documents:")
print(bow_rep_binary.toarray())
print(f"\nShape: {bow_rep_binary.shape} (documents x vocabulary)")

Binary BoW representation for all documents:
[[0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1
  0 0 0 1 1]
 [1 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0
  1 1 1 0 0]
 [0 0 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0
  0 0 0 0 0]]

Shape: (3, 41) (documents x vocabulary)
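
The binary matrix is the count matrix with every non-zero entry replaced by 1. The sketch below checks this on a toy corpus by thresholding the counts with NumPy; the sentences and the names `toy_docs`, `counts`, `binary`, and `thresholded` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["data data data machine learning", "machine learning is good"]

# Both vectorizers see the same corpus, so their columns line up
counts = CountVectorizer().fit_transform(toy_docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(toy_docs).toarray()

# Threshold the counts: any positive count becomes 1
thresholded = (counts > 0).astype(int)

print(counts)
print(binary)
print("Binary equals thresholded counts:", np.array_equal(binary, thresholded))
```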


In [19]:
# Transform new text with binary representation
new_text_with_repetition = ["machine learning machine learning is good"]
new_bow_binary = count_vect_binary.transform(new_text_with_repetition)

print(f"New text: '{new_text_with_repetition[0]}'")
print(f"\nBinary BoW representation: {new_bow_binary.toarray()}")
print("\nNote: Even though 'machine' and 'learning' appear twice, they are represented as 1 (present) not 2.")

New text: 'machine learning machine learning is good'

Binary BoW representation: [[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]]

Note: Even though 'machine' and 'learning' appear twice, they are represented as 1 (present) not 2.


## Comparison: Frequency vs Binary

Let's compare how the same text is represented with frequency counting vs binary encoding.

In [20]:
# Compare frequency vs binary for a text with repeated words
test_text = ["data data data machine learning"]

# Frequency-based
freq_bow = count_vect.transform(test_text)
print(f"Test text: '{test_text[0]}'")
print(f"\nFrequency BoW: {freq_bow.toarray()}")

# Binary
binary_bow = count_vect_binary.transform(test_text)
print(f"Binary BoW:    {binary_bow.toarray()}")

print("\nDifference: Frequency BoW shows 'data' appears 3 times, Binary BoW just shows it's present (1).")

Test text: 'data data data machine learning'

Frequency BoW: [[0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]]
Binary BoW:    [[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]]

Difference: Frequency BoW shows 'data' appears 3 times, Binary BoW just shows it's present (1).
