In [4]:
import re
from collections import Counter


class BagOfWords:
    def __init__(self, stopwords=None):
        """
        Initializes the Bag of Words model.

        Parameters:
        - stopwords (set): A set of words to ignore. Default is None.
        """
        self.stopwords = stopwords if stopwords else set()
        self.vocab = set()
        self.word_to_index = {}

    def preprocess(self, text):
        """
        Preprocesses the input text (tokenization, lowercasing, removing
        stopwords).

        Parameters:
        - text (str): The raw text string.

        Returns:
        - tokens (list): A list of preprocessed tokens.
        """
        # Lowercase, then replace runs of non-word characters (punctuation, symbols) with a space
        text = text.lower()
        text = re.sub(r"\W+", " ", text)
        tokens = text.split()

        # Remove stopwords
        tokens = [word for word in tokens if word not in self.stopwords]
        return tokens

    def build_vocab(self, corpus):
        """
        Builds the vocabulary from a corpus of documents.

        Parameters:
        - corpus (list of str): A list of documents (each document is a
        string).
        """
        for document in corpus:
            tokens = self.preprocess(document)
            self.vocab.update(tokens)

        # Assign an index to each unique word
        self.word_to_index = {w: i for i, w in enumerate(sorted(self.vocab))}

    def vectorize(self, document):
        """
        Converts a document into its Bag of Words vector representation.

        Parameters:
        - document (str): A single document (string).

        Returns:
        - vector (list): A list representing the BoW vector for the document.
        """
        tokens = self.preprocess(document)
        vector = [0] * len(self.vocab)

        token_counts = Counter(tokens)
        for token, count in token_counts.items():
            if token in self.word_to_index:
                index = self.word_to_index[token]
                vector[index] = count

        return vector

    def transform(self, corpus):
        """
        Transforms a corpus into BoW vectors.

        Parameters:
        - corpus (list of str): A list of documents.

        Returns:
        - vectors (list of list): A list of BoW vectors.
        """
        return [self.vectorize(doc) for doc in corpus]


In [5]:
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "A quick movement of the enemy will jeopardize six gunboats"
]

# Initialize Bag of Words model with some stopwords
stopwords = {"the", "over", "a", "will"}
bow_model = BagOfWords(stopwords=stopwords)

# Build vocabulary and vectorize the corpus
bow_model.build_vocab(corpus)
vectors = bow_model.transform(corpus)

# Print vocabulary and vectors
print("Vocabulary:", bow_model.word_to_index)
print("\nVectors:")
for i, vector in enumerate(vectors):
    print(f"Document {i+1}: {vector}")

Vocabulary: {'brown': 0, 'dog': 1, 'enemy': 2, 'fox': 3, 'gunboats': 4, 'jeopardize': 5, 'jump': 6, 'jumps': 7, 'lazy': 8, 'movement': 9, 'never': 10, 'of': 11, 'quick': 12, 'quickly': 13, 'six': 14}

Vectors:
Document 1: [1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
Document 2: [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
Document 3: [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]


## Notes:

Why does Bag of Words return the frequency of each word in the final vector?

The Bag of Words (BoW) model represents text data in a way that's easy for machine learning algorithms to work with. Here's why it specifically uses word frequency in the final vector:

### 1. Capturing the Importance of Words
- Within a single document, how often a word appears is a rough proxy for how central it is to the content. Words that recur are likely, though not guaranteed, to be relevant to the document's topic.
- For example, in a document about "sports," words like "game," "team," and "player" may appear more often. By counting their frequency, the model captures these as important features.

### 2. Simplicity and Interpretability
BoW is a simple and interpretable approach:
- Each dimension in the vector corresponds to a specific word from the vocabulary.
- The value in each dimension is the number of times that word appears in the document.
- This representation makes it straightforward to use as input for algorithms.

Example: for a vocabulary of ['apple', 'banana', 'orange']:
- Document: "apple banana apple"
- Vector: [2, 1, 0] (2 occurrences of "apple", 1 of "banana", and 0 of "orange").
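
The apple/banana example above can be reproduced in a few lines. This is only a minimal sketch: `count_vector` is a hypothetical helper (it is not part of the `BagOfWords` class), and it assumes plain whitespace tokenization.

```python
from collections import Counter


def count_vector(document, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(document.lower().split())
    # A Counter returns 0 for words that never appear in the document
    return [counts[word] for word in vocabulary]


vocabulary = ["apple", "banana", "orange"]
print(count_vector("apple banana apple", vocabulary))  # [2, 1, 0]
```

Because the vocabulary order is fixed, position `i` always refers to the same word in every document's vector.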

### 3. Preserving Word Weight in the Document
- Words with higher counts produce larger feature values, so they have more influence on a model's output. Word order and surrounding context are not preserved; only the counts are.
- Algorithms (like Naive Bayes or logistic regression) can use word counts to learn associations between certain words and specific outcomes, such as classifying documents by topic or sentiment.
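
To make the role of counts concrete, here is a rough sketch of a multinomial Naive Bayes classifier over count vectors. Everything in it is an assumption for illustration: the four-word vocabulary, the toy training vectors, the `fit` and `predict` helper names, and the add-one smoothing.

```python
import math

# Toy training data: count vectors over the hypothetical vocabulary
# ["game", "team", "election", "vote"], grouped by class label.
training = {
    "sports": [[3, 2, 0, 0], [1, 4, 0, 1]],
    "politics": [[0, 1, 3, 2], [0, 0, 2, 4]],
}


def fit(training, alpha=1.0):
    """Estimate log P(class) and log P(word | class) with add-alpha smoothing."""
    total_docs = sum(len(docs) for docs in training.values())
    model = {}
    for label, docs in training.items():
        # Sum the counts of each word over all documents of this class
        word_totals = [sum(column) for column in zip(*docs)]
        denom = sum(word_totals) + alpha * len(word_totals)
        model[label] = (
            math.log(len(docs) / total_docs),
            [math.log((c + alpha) / denom) for c in word_totals],
        )
    return model


def predict(model, vector):
    """Pick the class with the highest log-posterior for a count vector."""
    def score(label):
        log_prior, log_likelihoods = model[label]
        # Each occurrence of a word adds its log-likelihood once,
        # so repeated words pull the score harder
        return log_prior + sum(n * ll for n, ll in zip(vector, log_likelihoods))
    return max(model, key=score)


model = fit(training)
print(predict(model, [2, 1, 0, 0]))  # sports
print(predict(model, [0, 0, 1, 3]))  # politics
```

The count `n` multiplies the word's log-likelihood in `score`, which is exactly where a frequency-based vector carries more signal than a binary one.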

### 4. Compatibility with Machine Learning
Machine learning models often work with numerical data. By converting text into vectors of word frequencies:
- The algorithms can treat text data just like any other numeric feature.
- Word frequency provides an initial way to quantify text in a meaningful manner.
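
Once documents are count vectors, ordinary numeric operations apply to them. As one illustration, the sketch below computes cosine similarity between the three vectors printed in the notebook output above; the `cosine` helper is hypothetical and not part of the `BagOfWords` class.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against all-zero vectors (e.g. a document of only stopwords)
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc1 = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
doc2 = [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
doc3 = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]

print(round(cosine(doc1, doc2), 3))  # 0.365 (shared: "dog", "lazy")
print(round(cosine(doc1, doc3), 3))  # 0.154 (shared: "quick")
```

Documents 1 and 2 score higher because they share two vocabulary words, while documents 1 and 3 share only one.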

### 5. Why Frequency and Not Just Presence?
Using frequency (instead of just 1 or 0 to indicate presence/absence) allows the model to:
- Weigh more frequent words higher.
- Differentiate between documents with similar words but varying amounts of repetition.

Presence-only (binary) vectors lose the information conveyed by repetition. E.g., with "I" removed as a stopword and the vocabulary ['love', 'cats'], "I love cats" and "I love cats cats cats" both become [1, 1] in binary form, but [1, 1] and [1, 3] in frequency-based BoW.
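
A short sketch of that comparison, assuming whitespace tokenization and a two-word vocabulary in which "I" has been dropped as a stopword. The helper names `to_counts` and `to_binary` are made up for this example.

```python
from collections import Counter

vocabulary = ["love", "cats"]  # assumption: "i" is dropped as a stopword


def to_counts(document):
    """Frequency-based BoW vector over the fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


def to_binary(document):
    """Presence-only vector: clip every count to 0 or 1."""
    return [min(count, 1) for count in to_counts(document)]


for doc in ["I love cats", "I love cats cats cats"]:
    print(doc, "-> binary", to_binary(doc), "counts", to_counts(doc))
# I love cats -> binary [1, 1] counts [1, 1]
# I love cats cats cats -> binary [1, 1] counts [1, 3]
```

The two documents are indistinguishable in binary form; only the count vectors record the repetition.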

### Potential Extensions of BoW:
TF-IDF: Extends BoW by weighting each word's frequency by how rare the word is across all documents. This reduces the weight of words that appear in most documents (like "the," "and").
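
As a rough sketch of the idea, the code below uses one common unsmoothed definition, raw count times `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. The toy corpus and the `tfidf` helper are assumptions for illustration, and libraries typically use smoothed or normalized variants of this formula.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized on whitespace
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# One common (unsmoothed) definition: idf = log(N / df)
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf(tokens):
    """Weight each raw count by the word's inverse document frequency."""
    counts = Counter(tokens)
    return {word: count * idf[word] for word, count in counts.items()}


weights = tfidf(tokenized[0])
print(weights["the"])            # 0.0: "the" appears in every document
print(round(weights["mat"], 3))  # 1.099: "mat" appears in only one
```

Although "the" occurs twice in the first document, its weight drops to zero because it carries no information that separates one document from another.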

In summary, BoW uses word frequency because it captures the importance of words in a document and provides a simple numeric representation of text that is effective for many basic machine learning models.