
# CountVectorizer in NLP

`CountVectorizer` is a tool in Python's **scikit-learn** library used to convert a collection of text documents into a matrix of token counts. It essentially creates a **Bag of Words (BoW)** representation, where each unique word becomes a feature, and the value for each document-feature pair is the frequency of that word in the document.

This process transforms text data into a numerical format that can be used by machine learning models.

---

## What CountVectorizer Does

- **Tokenization:**  
  `CountVectorizer` splits each document into individual tokens (words). By default it also lowercases the text and strips punctuation.
  
- **Vocabulary Building:**  
  It identifies all the unique words (or tokens) across the entire text corpus.
  
- **Counting:**  
  For each document, it counts how many times each unique word appears.
  
- **Matrix Creation:**  
  It organizes these counts into a matrix where:
  - Rows represent documents.
  - Columns represent unique words.
  - Cells contain the word counts.
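
The four steps above can be sketched by hand with only the standard library. This is a simplified illustration, not scikit-learn's actual implementation: the mini-corpus `docs` and the helper `tokenize` are hypothetical, and the regular expression is assumed to mirror the documented default `token_pattern`.

```python
import re
from collections import Counter

# Hypothetical mini-corpus, used only for illustration.
docs = ["The cat sat.", "The cat saw the dog."]

# Assumed tokenizer: lowercase, then keep runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Split one document into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())

# Step 1: tokenization
tokenized = [tokenize(doc) for doc in docs]

# Step 2: vocabulary building (sorted unique tokens, each mapped to a column index)
unique_tokens = sorted({tok for toks in tokenized for tok in toks})
vocabulary = {tok: col for col, tok in enumerate(unique_tokens)}

# Steps 3 and 4: count tokens per document and arrange the counts as rows
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts.get(tok, 0) for tok in vocabulary])

print(list(vocabulary))  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)            # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column is one vocabulary entry, so the `2` in the second row records that "the" occurs twice in the second document.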

---

## Why Use CountVectorizer?

- **Text as Input:**  
  Machine learning models generally require numerical input. `CountVectorizer` converts text into a numerical format that models can understand.

- **Feature Extraction:**  
  It identifies and quantifies the frequency of words, allowing you to extract meaningful features from text data.

- **Efficiency:**  
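  `scikit-learn`'s implementation returns the counts as a SciPy sparse matrix, which stores only the non-zero entries. This keeps memory use manageable on large corpora, where most words are absent from most documents.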

---


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

Text=['My name is Minhaz.',
      'I am an Engineering student.',
      'And I am studying at RUET.',
      'Minhaz is practicing ML.']

In [2]:
# Create a CountVectorizer object with the default settings
vectorizer = CountVectorizer()

In [3]:
# Learn the vocabulary from the corpus and return the document-term count matrix
X = vectorizer.fit_transform(Text)

In [17]:
# Get the feature names
feature_names=vectorizer.get_feature_names_out()
feature_names

array(['am', 'an', 'and', 'at', 'engineering', 'is', 'minhaz', 'ml',
       'my', 'name', 'practicing', 'ruet', 'student', 'studying'],
      dtype=object)

The single-character word "I" is missing from the `get_feature_names_out()` output because, by default, `CountVectorizer` ignores all tokens that are only one character long. Its default `token_pattern` is `r"(?u)\b\w\w+\b"`, which requires at least two word characters.

In [7]:
# Also include single-character tokens such as "I"
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(Text)
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['am', 'an', 'and', 'at', 'engineering', 'i', 'is', 'minhaz', 'ml',
       'my', 'name', 'practicing', 'ruet', 'student', 'studying'],
      dtype=object)
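
The effect of the two token patterns can be checked directly with Python's `re` module, without scikit-learn. The sketch below assumes the documented default pattern `(?u)\b\w\w+\b` and applies lowercasing manually; the variable names are illustrative only.

```python
import re

# Second document from the corpus above
sentence = "I am an Engineering student."

default_pattern = r"(?u)\b\w\w+\b"  # assumed default: two or more word characters
relaxed_pattern = r"(?u)\b\w+\b"    # pattern used above: one or more word characters

# CountVectorizer lowercases by default, so lowercase before matching
default_tokens = re.findall(default_pattern, sentence.lower())
relaxed_tokens = re.findall(relaxed_pattern, sentence.lower())

print(default_tokens)  # ['am', 'an', 'engineering', 'student']
print(relaxed_tokens)  # ['i', 'am', 'an', 'engineering', 'student']
```

Only the relaxed pattern keeps `i`, which is why it appears in the vocabulary once `token_pattern` is overridden.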

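`X` is a SciPy sparse matrix, not a NumPy array. Sparse matrices provide a `toarray()` method that converts them to a dense array; NumPy arrays have no such method, so the cell below checks the type before calling it.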

In [15]:
from scipy.sparse import issparse

if issparse(X):
    print(f"X = {X.toarray()}")
else:
    print(f"X = {X}")


X = [[0 0 0 0 0 0 1 1 0 1 1 0 0 0 0]
 [1 1 0 0 1 1 0 0 0 0 0 0 0 1 0]
 [1 0 1 1 0 1 0 0 0 0 0 0 1 0 1]
 [0 0 0 0 0 0 1 1 1 0 0 1 0 0 0]]
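
The dense matrix is easier to read when each row is paired with the feature names. The following sketch hard-codes the feature names and counts printed above (rather than calling scikit-learn) and lists only the non-zero entries per document; `dense`, `nonzero_per_doc`, and `total_nonzero` are illustrative names.

```python
# Feature names and counts copied from the outputs above
feature_names = ['am', 'an', 'and', 'at', 'engineering', 'i', 'is', 'minhaz', 'ml',
                 'my', 'name', 'practicing', 'ruet', 'student', 'studying']
dense = [
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0],
]

# Keep only the words that actually occur in each document
nonzero_per_doc = [
    {name: count for name, count in zip(feature_names, row) if count}
    for row in dense
]
for doc_index, counts in enumerate(nonzero_per_doc):
    print(doc_index, counts)

# Count how many cells are non-zero out of the full matrix
total_nonzero = sum(len(counts) for counts in nonzero_per_doc)
total_cells = len(dense) * len(feature_names)
print(f"{total_nonzero} of {total_cells} cells are non-zero")
```

Most cells are zero even for this tiny corpus, which is the reason the counts are returned in a sparse format that records only the non-zero entries.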
