# What is CountVectorizer?

`CountVectorizer` is a tool provided by libraries such as scikit-learn in Python. It converts a collection of text documents into a matrix of token counts: it builds a vocabulary of all the unique words across the documents and counts how often each word occurs in each document.

### Example

Suppose we have the following three simple documents:

1. "The cat sat on the mat"
2. "The dog sat on the log"
3. "The cat and the dog"

#### Step-by-Step Process

1. **Tokenization**:
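   - Lowercase each document and split it into words (tokens). scikit-learn's `CountVectorizer` lowercases text by default, which is why "The" and "the" count as the same token.
   - Document 1: `["the", "cat", "sat", "on", "the", "mat"]`
   - Document 2: `["the", "dog", "sat", "on", "the", "log"]`
   - Document 3: `["the", "cat", "and", "the", "dog"]`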

2. **Creating the Vocabulary**:
   - Identify all unique words in the documents.
   
   Vocabulary: `["and", "cat", "dog", "log", "mat", "on", "sat", "the"]`
   
3. **Creating the Count Matrix**:
   - For each document, count the occurrences of each word in the vocabulary.

   |    | and | cat | dog | log | mat | on | sat | the |
   |----|-----|-----|-----|-----|-----|----|-----|-----|
   | D1 |  0  |  1  |  0  |  0  |  1  | 1  |  1  |  2  |
   | D2 |  0  |  0  |  1  |  1  |  0  | 1  |  1  |  2  |
   | D3 |  1  |  1  |  1  |  0  |  0  | 0  |  0  |  2  |

   Each row in the matrix corresponds to a document, and each column corresponds to a word in the vocabulary. The values indicate the count of each word in each document.

- The word "the" appears twice in each document, so the count for "the" is 2 in all rows.
- The word "cat" appears once in the first and third documents but not in the second, hence the counts `1, 0, 1` for "cat".
- The word "dog" appears once in the second and third documents but not in the first, hence the counts `0, 1, 1` for "dog".
- Similarly, other words are counted based on their occurrences in each document.
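The three steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not scikit-learn's actual implementation; the variable names (`tokenized`, `vocabulary`, `count_matrix`) are chosen here for clarity.

```python
from collections import Counter

docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog",
]

# Step 1: tokenize (lowercase, then split on whitespace).
tokenized = [doc.lower().split() for doc in docs]

# Step 2: the vocabulary is the sorted set of unique tokens.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Step 3: one row per document, one column per vocabulary word.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

With scikit-learn installed, `CountVectorizer().fit_transform(docs).toarray()` should produce the same matrix for these documents, with the column order given by `get_feature_names_out()`. Note that its default tokenizer also strips punctuation and drops single-character tokens, which the whitespace split above does not.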



# TF-IDF Vectorizer

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is the product of two statistics, Term Frequency (TF) and Inverse Document Frequency (IDF).

1. **Term Frequency (TF)**:
   - Measures how frequently a term appears in a document.
   - Formula: $ \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $

2. **Inverse Document Frequency (IDF)**:
   - Measures how rare, and therefore how informative, a term is across the entire corpus: terms that appear in fewer documents receive a higher weight.
   - Formula: $ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right) $

3. **TF-IDF**:
   - Combines TF and IDF to give a weight for each term in each document.
   - Formula: $ \text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t) $

### Example Documents

1. Document 1 (D1): "The cat sat on the mat"
2. Document 2 (D2): "The dog sat on the log"
3. Document 3 (D3): "The cat and the dog"

### Step-by-Step Calculation

#### Step 1: Calculate Term Frequency (TF)

For each term in each document:

**Document 1 (D1):**
- TF("the", D1) = 2/6 = 0.333
- TF("cat", D1) = 1/6 = 0.167
- TF("sat", D1) = 1/6 = 0.167
- TF("on", D1) = 1/6 = 0.167
- TF("mat", D1) = 1/6 = 0.167

**Document 2 (D2):**
- TF("the", D2) = 2/6 = 0.333
- TF("dog", D2) = 1/6 = 0.167
- TF("sat", D2) = 1/6 = 0.167
- TF("on", D2) = 1/6 = 0.167
- TF("log", D2) = 1/6 = 0.167

**Document 3 (D3):**
- TF("the", D3) = 2/5 = 0.4
- TF("cat", D3) = 1/5 = 0.2
- TF("and", D3) = 1/5 = 0.2
- TF("dog", D3) = 1/5 = 0.2

#### Step 2: Calculate Inverse Document Frequency (IDF)

For each term in the vocabulary, using base-10 logarithms:
- IDF("the") = log(3/3) = log(1) = 0
- IDF("cat") = log(3/2) = 0.176
- IDF("sat") = log(3/2) = 0.176
- IDF("on") = log(3/2) = 0.176
- IDF("mat") = log(3/1) = 0.477
- IDF("dog") = log(3/2) = 0.176
- IDF("log") = log(3/1) = 0.477
- IDF("and") = log(3/1) = 0.477
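The IDF values listed here correspond to base-10 logarithms. A short sketch, assuming the same lowercase whitespace tokenization as the earlier example, computes the document frequency of each term and then its IDF:

```python
import math

docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog",
]

# Use a set per document: IDF only cares whether a term occurs, not how often.
doc_terms = [set(doc.lower().split()) for doc in docs]
n_docs = len(doc_terms)
vocabulary = sorted(set().union(*doc_terms))

# Document frequency: number of documents containing each term.
doc_freq = {term: sum(term in terms for terms in doc_terms) for term in vocabulary}

# IDF with a base-10 logarithm, matching the worked values.
idf = {term: math.log10(n_docs / doc_freq[term]) for term in vocabulary}

for term in vocabulary:
    print(f"{term}: df={doc_freq[term]}, idf={idf[term]:.3f}")
```

Using the natural logarithm instead would multiply every IDF value by the same constant (about 2.303), so the ranking of terms would not change.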

#### Step 3: Calculate TF-IDF

Multiply the TF by the corresponding IDF for each term in each document.

**Document 1 (D1):**
- TF-IDF("the", D1) = 0.333 * 0 = 0
- TF-IDF("cat", D1) = 0.167 * 0.176 = 0.029
- TF-IDF("sat", D1) = 0.167 * 0.176 = 0.029
- TF-IDF("on", D1) = 0.167 * 0.176 = 0.029
- TF-IDF("mat", D1) = 0.167 * 0.477 = 0.080

**Document 2 (D2):**
- TF-IDF("the", D2) = 0.333 * 0 = 0
- TF-IDF("dog", D2) = 0.167 * 0.176 = 0.029
- TF-IDF("sat", D2) = 0.167 * 0.176 = 0.029
- TF-IDF("on", D2) = 0.167 * 0.176 = 0.029
- TF-IDF("log", D2) = 0.167 * 0.477 = 0.080

**Document 3 (D3):**
- TF-IDF("the", D3) = 0.4 * 0 = 0
- TF-IDF("cat", D3) = 0.2 * 0.176 = 0.035
- TF-IDF("and", D3) = 0.2 * 0.477 = 0.095
- TF-IDF("dog", D3) = 0.2 * 0.176 = 0.035

#### Summary Table

|    | and  | cat  | dog  | log  | mat  | on   | sat  | the |
|----|------|------|------|------|------|------|------|-----|
| D1 | 0    | 0.029| 0    | 0    | 0.080| 0.029| 0.029| 0   |
| D2 | 0    | 0    | 0.029| 0.080| 0    | 0.029| 0.029| 0   |
| D3 | 0.095| 0.035| 0.035| 0    | 0    | 0    | 0    | 0   |


- The values in the table represent the TF-IDF scores for each term in each document.
- Higher scores indicate higher importance of the term in the respective document relative to the corpus.

TF-IDF down-weights words that occur in every document (here, "the" scores 0 everywhere) and up-weights words that are distinctive to a document. This often makes it a more useful feature representation for machine learning models than raw counts.
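The full calculation can be reproduced with a few lines of standard-library Python. This is a sketch of the textbook formulas given above (length-normalized TF, base-10 IDF); the helper names `tf` and `idf` are chosen here for illustration.

```python
import math

docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog",
]
tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)


def tf(term, tokens):
    """Occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Base-10 log of (number of documents / documents containing the term)."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log10(n_docs / df)


# One row per document, one column per vocabulary word.
tfidf = [[tf(term, tokens) * idf(term) for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for name, row in zip(["D1", "D2", "D3"], tfidf):
    print(name, [round(value, 3) for value in row])
```

scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers with its default settings: it uses raw counts for TF, a smoothed natural-log IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row.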


# Limitations of Bag of Words Models

1. **Ignoring Word Order**:
   - BoW models do not capture the order of words in the text. This means they miss out on the syntactic and semantic relationships between words, which can be crucial for understanding the context and meaning.

2. **Ignoring Context**:
   - These models treat each word independently and ignore the context in which words appear. For example, the word "bank" in "river bank" and in "savings bank" is mapped to the same feature.

3. **High Dimensionality**:
   - The feature space can become very large, especially for large corpora with extensive vocabularies. This leads to high computational costs and increased memory usage.

4. **Sparse Matrices**:
   - The resulting matrices are often sparse, with many zero entries, since most words do not appear in most documents. They need sparse storage formats to be handled efficiently; converting them to dense arrays can exhaust memory.

5. **Lack of Semantic Understanding**:
   - BoW models do not capture the meaning of words. Words with similar meanings (synonyms) and different inflected forms of the same word (for example, "run" and "running") are treated as distinct features unless stemming or lemmatization is applied first.

6. **Fixed Vocabulary**:
   - The vocabulary is fixed at the time of model training. New words that appear in future documents but were not present in the training set are not handled well.

7. **Sensitivity to Frequent Words**:
   - Common words (stop words) can dominate the feature space if not removed, potentially drowning out less frequent but more informative terms.

8. **IDF Sensitivity to Rare Words**:
   - While TF-IDF reduces the impact of frequent words, it can overly amplify the importance of rare words that may not be relevant.

### Example of Limitations

Consider the sentences:
1. "The cat sat on the mat."
2. "The mat sat on the cat."

BoW models will treat these sentences as identical because they have the same words with the same frequencies, despite having different meanings due to word order.
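This can be checked directly. The sketch below builds a bag of words for each sentence with a hypothetical helper, `bag_of_words`, that lowercases, strips punctuation, and counts tokens:

```python
import string
from collections import Counter


def bag_of_words(sentence):
    """Lowercase, remove punctuation, and count whitespace-separated tokens."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


s1 = "The cat sat on the mat."
s2 = "The mat sat on the cat."

bow1 = bag_of_words(s1)
bow2 = bag_of_words(s2)

print(bow1)
print(bow1 == bow2)  # True: the two bags are indistinguishable
```

Any model that sees only these counts receives identical input for both sentences, so it cannot tell which one sat on the other.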



Feature extraction using word embedding models like Word2Vec, GloVe, and BERT helps overcome the limitations of Bag of Words (BoW) models in several ways. Here's how these embeddings improve upon BoW models, along with their additional advantages:

### Overcoming Limitations of BoW Models

1. **Capturing Semantic Meaning**:
   - **BoW Limitation**: BoW models treat each word independently and ignore semantic similarities.
   - **Embeddings Solution**: Word embeddings represent words in a continuous vector space where semantically similar words are mapped to nearby points. This allows the model to recognize that "king" and "queen" are related. Static embeddings such as Word2Vec and GloVe still assign a single vector to each word, so telling apart "bank" in "river bank" and "savings bank" requires contextual models (see the next point).

2. **Contextual Understanding**:
   - **BoW Limitation**: BoW and TF-IDF ignore the context in which words appear.
   - **Embeddings Solution**: Contextualized embeddings like BERT create different vectors for the same word depending on its context, capturing the nuanced meaning of words based on their surrounding text.

3. **Handling Synonyms and Polysemy**:
   - **BoW Limitation**: Synonyms are treated as different features, and polysemous words (words with multiple meanings) are treated the same.
   - **Embeddings Solution**: Embeddings place synonyms close together in the vector space, and contextual models like BERT additionally disambiguate polysemous words by taking the surrounding text into account.

4. **Dimensionality Reduction**:
   - **BoW Limitation**: BoW models result in very high-dimensional sparse vectors.
   - **Embeddings Solution**: Word embeddings produce dense vectors of fixed, much lower dimensions, reducing the computational load and making the feature space more manageable.

5. **Fixed Vocabulary**:
   - **BoW Limitation**: New words not seen during training are not handled well.
   - **Embeddings Solution**: Pre-trained embedding models have extensive vocabularies and can handle a wide range of words. Additionally, models like FastText can generate embeddings for out-of-vocabulary words by using subword information.
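The difference between sparse one-hot features and dense embeddings can be illustrated with cosine similarity. In the sketch below, the three-dimensional "embedding" vectors are hand-picked toy values, not output from Word2Vec, GloVe, or any trained model; they only show the kind of geometry a trained model is meant to learn.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# BoW-style one-hot vectors over the vocabulary ["car", "automobile", "banana"].
one_hot = {
    "car": [1, 0, 0],
    "automobile": [0, 1, 0],
    "banana": [0, 0, 1],
}

# Hypothetical toy dense vectors (made up for illustration, NOT from a trained model).
toy_embedding = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal, even for synonyms.
print(cosine(one_hot["car"], one_hot["automobile"]))  # 0.0

# Dense vectors can express graded similarity.
print(cosine(toy_embedding["car"], toy_embedding["automobile"]))
print(cosine(toy_embedding["car"], toy_embedding["banana"]))
```

With real pre-trained vectors the same `cosine` function applies unchanged; only the lookup table would come from the model.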

### Additional Advantages of Embedding Models

1. **Transfer Learning**:
   - Pre-trained embedding models like BERT can be fine-tuned on specific tasks with relatively small amounts of task-specific data, leveraging large-scale pre-training on vast corpora.

2. **Efficient Representation**:
   - Dense, low-dimensional embeddings give downstream models far fewer input features than vocabulary-sized BoW vectors, which can reduce memory usage and computational requirements.

3. **Handling Variable-Length Inputs**:
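   - Embedding models produce one vector per token; pooling these (for example, by averaging them or by using BERT's `[CLS]` vector) yields a fixed-size vector representation regardless of the input text length, making them suitable for downstream machine learning models that require fixed-size inputs.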

4. **Capturing Long-Range Dependencies**:
   - Models like BERT and GPT, based on the Transformer architecture, are particularly effective at capturing long-range dependencies and relationships between words over long contexts.

5. **Improved Performance on Downstream Tasks**:
   - Embedding models have shown superior performance on a wide range of natural language processing tasks, including sentiment analysis, named entity recognition, machine translation, and question answering.
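As an illustration of the fixed-size representation mentioned under "Handling Variable-Length Inputs", the sketch below averages per-token vectors (mean pooling). The lookup table `toy_embedding` and the helpers `mean_pool` and `embed` are hypothetical, and the vector values are made up for the example rather than taken from a trained model.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length vectors into a single vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Hypothetical 3-dimensional toy vectors (NOT from a trained model).
toy_embedding = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.7, 0.3, 0.1],
    "sat": [0.2, 0.8, 0.4],
    "dog": [0.6, 0.4, 0.0],
}


def embed(sentence):
    """Look up each token's vector and mean-pool them into one vector."""
    tokens = sentence.lower().split()
    return mean_pool([toy_embedding[token] for token in tokens])


short_vec = embed("The cat")
long_vec = embed("The cat sat the dog")

# Both inputs map to vectors of the same length, whatever the token count.
print(len(short_vec), len(long_vec))
print(short_vec)
```

Plain averaging discards word order, just as BoW does; Transformer models such as BERT encode order and context into each token vector before any pooling happens.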