
## 1️⃣ CountVectorizer

### **Formula**

For a document $d$ and a word $w$:

$$
\text{Count}(w, d) = \text{number of times } w \text{ appears in } d
$$

---

### **Small Example**

**Documents**

```
D1: "machine learning is fun machine"
D2: "machine learning is powerful"
```

**Vocabulary (V)**

```
["machine", "learning", "is", "fun", "powerful"]
```

**Count Vectors**

| Word     | D1 | D2 |
| -------- | -- | -- |
| machine  | 2  | 1  |
| learning | 1  | 1  |
| is       | 1  | 1  |
| fun      | 1  | 0  |
| powerful | 0  | 1  |

So:

```
D1 → [2, 1, 1, 1, 0]
D2 → [1, 1, 1, 0, 1]
```
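
The table above can be reproduced in a few lines with `collections.Counter`. This is only a minimal sketch: the whitespace tokenizer and the first-appearance vocabulary order are assumptions chosen to match the hand-built vocabulary in this example.

```python
from collections import Counter

docs = {
    "D1": "machine learning is fun machine",
    "D2": "machine learning is powerful",
}

# Assumption: tokens are whitespace-separated, and the vocabulary keeps
# first-appearance order so it matches the hand-built list above.
vocabulary: list[str] = []
for text in docs.values():
    for token in text.split():
        if token not in vocabulary:
            vocabulary.append(token)

# Count(w, d) for every vocabulary word w in each document d.
# Counter returns 0 for words that do not occur in the document.
count_vectors: dict[str, list[int]] = {}
for name, text in docs.items():
    counts = Counter(text.split())
    count_vectors[name] = [counts[word] for word in vocabulary]

print(vocabulary)
print(count_vectors)
```

Each position in a vector corresponds to one vocabulary word, so changing the vocabulary order changes the column order but not the counts themselves.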

---

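Bag of Words (BoW) → the general idea: represent a document by the words it contains, ignoring word order

CountVectorizer → a BoW implementation that stores raw counts

TF-IDF → BoW counts reweighted so that words common to many documents count less

Binary BoW → 0/1 per word (present or not)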
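
To make the relationship between these variants concrete, here is a rough sketch that derives binary and TF-IDF vectors from the count vectors of the small example. The TF-IDF weighting used here, raw count times `ln(N / df)`, is an assumption: it is one common unsmoothed textbook variant, and library implementations often use smoothed or normalized forms instead.

```python
import math

# Count vectors from the small example
# (vocabulary order: machine, learning, is, fun, powerful).
count_vectors = [
    [2, 1, 1, 1, 0],  # D1
    [1, 1, 1, 0, 1],  # D2
]

# Binary BoW: 1 if the word is present, 0 otherwise.
binary_vectors = [[1 if c > 0 else 0 for c in vec] for vec in count_vectors]

# TF-IDF (assumed variant): raw count * ln(N / df), where df is the number
# of documents that contain the word. Every vocabulary word occurs in at
# least one document here, so df is never zero.
n_docs = len(count_vectors)
n_words = len(count_vectors[0])
doc_freq = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_words)]
idf = [math.log(n_docs / df) for df in doc_freq]
tfidf_vectors = [[c * w for c, w in zip(vec, idf)] for vec in count_vectors]

print(binary_vectors)
print(tfidf_vectors)
```

Under this variant, a word that appears in every document ("machine", "learning", "is") gets weight 0, so only "fun" and "powerful" carry any weight in the TF-IDF vectors.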

```python
from collections import Counter


class CountVectorizer:
    """
    A simple Count Vectorizer implemented from scratch.
    """

    def __init__(self) -> None:
        self.vocabulary_: dict[str, int] = {}  # word -> column index
        self.fitted: bool = False

    def _tokenize(self, text: str) -> list[str]:
        # Lowercase, then split on whitespace (punctuation is not stripped).
        return text.lower().split()

    def fit(self, documents: list[str]) -> None:
        """
        Learn the vocabulary from the documents.
        """
        tokenized_docs = [self._tokenize(doc) for doc in documents]

        vocab = set()
        for tokens in tokenized_docs:
            vocab.update(tokens)

        # Map each word to its index, in alphabetical order.
        self.vocabulary_ = {
            word: idx for idx, word in enumerate(sorted(vocab))
        }

        self.fitted = True

    def transform(self, documents: list[str]) -> list[list[int]]:
        """
        Transform documents to count vectors.
        """
        if not self.fitted:
            raise ValueError("The vectorizer must be fitted before calling transform().")

        vectors: list[list[int]] = []

        for doc in documents:
            tokens = self._tokenize(doc)
            token_counts = Counter(tokens)

            vector = [0] * len(self.vocabulary_)

            # Words that were not seen during fit() are ignored.
            for word, count in token_counts.items():
                if word in self.vocabulary_:
                    idx = self.vocabulary_[word]
                    vector[idx] = count

            vectors.append(vector)

        return vectors

    def fit_transform(self, documents: list[str]) -> list[list[int]]:
        """
        Fit to data, then transform it.
        """
        self.fit(documents)
        return self.transform(documents)
```

```python
documents = [
    "I love machine learning love",
    "I love deep learning",
    "Machine learning is fun"
]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(documents)

for vec in count_matrix:
    print(vec)
```

**Output**

```
[0, 0, 1, 0, 1, 2, 1]
[1, 0, 1, 0, 1, 1, 0]
[0, 1, 0, 1, 1, 0, 1]
```

The columns follow the alphabetically sorted vocabulary: `deep, fun, i, is, learning, love, machine`.
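
One behavior worth seeing in isolation is what happens to words that were not present when the vocabulary was learned. The sketch below mirrors the class logic with two standalone functions so that it runs on its own; `build_vocabulary` and `count_vector` are hypothetical helper names introduced only for this illustration.

```python
from collections import Counter


def build_vocabulary(documents: list[str]) -> dict[str, int]:
    """Hypothetical helper: map each lowercased token to a sorted index."""
    tokens = {tok for doc in documents for tok in doc.lower().split()}
    return {word: idx for idx, word in enumerate(sorted(tokens))}


def count_vector(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Hypothetical helper: count in-vocabulary tokens and ignore the rest."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(text.lower().split()).items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


train_docs = [
    "I love machine learning love",
    "I love deep learning",
    "Machine learning is fun",
]
vocabulary = build_vocabulary(train_docs)
print(vocabulary)

# "reinforcement" and "too" were never seen while building the vocabulary,
# so they contribute nothing to the vector.
new_vector = count_vector("I love reinforcement learning too", vocabulary)
print(new_vector)
```

Because the tokenizer only lowercases and splits on whitespace, a token such as `fun!` would be treated as a different word from `fun`; stripping punctuation would be a separate preprocessing step.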


Here N is the number of documents, L is the average document length, and V is the vocabulary size.

| Function        | Time Complexity          | Space Complexity | Explanation                                                 |
| --------------- | ------------------------ | ---------------- | ----------------------------------------------------------- |
| `_tokenize`     | O(L)                     | O(L)             | Splits one document into tokens                             |
| `fit`           | O(N·L + V log V)         | O(N·L + V)       | Tokenizes all documents up front, then sorts the vocabulary |
| `transform`     | O(N·(L + V))             | O(N·V)           | Counts tokens and builds dense vectors                      |
| `fit_transform` | O(N·(L + V) + V log V)   | O(N·(L + V))     | `fit` + `transform` combined                                |
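
The V term in `transform` comes from allocating a dense vector of length V for every document, even though most entries are usually zero. One possible way to avoid it is to store only the nonzero entries. The sketch below assumes a `{column_index: count}` dictionary per document; `transform_sparse` is a hypothetical function and not part of the class above.

```python
from collections import Counter


def transform_sparse(
    documents: list[str], vocabulary: dict[str, int]
) -> list[dict[int, int]]:
    """Hypothetical sparse variant: one {column_index: count} dict per document."""
    rows: list[dict[int, int]] = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Keep only in-vocabulary words; absent columns are implicitly zero.
        rows.append(
            {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}
        )
    return rows


# Vocabulary as learned from the three example documents.
vocabulary = {
    "deep": 0, "fun": 1, "i": 2, "is": 3,
    "learning": 4, "love": 5, "machine": 6,
}
rows = transform_sparse(["I love machine learning love"], vocabulary)
print(rows)
```

With this representation, the work and memory per document scale with the number of tokens in that document rather than with V, at the cost of a less convenient format for dense linear algebra.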




