
### CountVectorizer

`CountVectorizer` converts a collection of documents into a matrix of token counts, which allows the text data to be used with machine learning models. Each row corresponds to a document, each column to a word in the vocabulary, and each entry is the number of times that word appears in that document.

In [4]:
# Import the CountVectorizer class from the sklearn.feature_extraction.text module
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
docs = ["Mayur is a nice boy.", "Mayur rocks! Wohoo", "My name is Mayur, and I am a Pythonista!"]

# Initialize the CountVectorizer
cv = CountVectorizer()

# Fit and transform the documents
x = cv.fit_transform(docs)

# Print the dense representation of the matrix
print(x.todense())

# Print the vocabulary
print(cv.vocabulary_)


[[0 0 1 1 1 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1]
 [1 1 0 1 1 1 1 0 1 0 0]]
{'mayur': 4, 'is': 3, 'nice': 7, 'boy': 2, 'rocks': 9, 'wohoo': 10, 'my': 5, 'name': 6, 'and': 1, 'am': 0, 'pythonista': 8}
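The columns of the matrix follow the indices stored in `vocabulary_`. The default tokenizer only keeps tokens of two or more characters, which is why "a" and "I" do not appear. The sketch below recovers the column order and transforms a new sentence; `new_doc` is a made-up example, and words that were not seen during `fit` are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Mayur is a nice boy.", "Mayur rocks! Wohoo", "My name is Mayur, and I am a Pythonista!"]

cv = CountVectorizer()
cv.fit(docs)

# Sort the vocabulary by column index to recover the column order of the matrix
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(columns)

# Hypothetical unseen document: "the" and "guitar" were never seen during fit,
# so only "mayur" and "rocks" are counted
new_doc = ["Mayur rocks the guitar"]
new_counts = cv.transform(new_doc).toarray()
print(new_counts)
```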


### DictVectorizer

`DictVectorizer` converts a list of mappings (dicts of feature names to values) into vectors, with one column per feature name.


In [6]:
from sklearn.feature_extraction import DictVectorizer

# Sample documents as dictionaries
docs = [
    {"Mayur": 1, "is": 1, "awesome": 2},
    {"No": 1, "I": 1, "dont": 2, "wanna": 3, "fall": 1, "in": 2, "love": 3}
]

# Initialize the DictVectorizer
dv = DictVectorizer()

# Fit and transform the documents
x = dv.fit_transform(docs)

# Print the dense representation of the transformed documents
print(x.todense())


[[0. 1. 0. 2. 0. 0. 0. 1. 0. 0.]
 [1. 0. 1. 0. 2. 1. 2. 0. 3. 3.]]
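Columns are ordered by sorted feature name, which can be checked through `feature_names_`. `DictVectorizer` also one-hot encodes string values as `name=value` columns. The records below are hypothetical and only meant to illustrate that behaviour.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records mixing a categorical (string) and a numeric feature
records = [
    {"language": "python", "years": 5},
    {"language": "rust", "years": 2},
]

# sparse=False returns a dense NumPy array directly
dv = DictVectorizer(sparse=False)
x = dv.fit_transform(records)

# String values become one column per (name, value) pair; numbers are kept as-is
print(dv.feature_names_)
print(x)
```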


### TfidfVectorizer

In many text analytics applications, we need to convert text into vectors to use with machine learning algorithms. This is known as the Vector Space Model. While `CountVectorizer` could be a solution, words like "the", "a", and "in" are common and appear in all kinds of documents. `CountVectorizer` gives such words high counts even though they are rarely relevant. You could work around this with `stop_words="english"` to filter out common words, but let's say you have a different vocabulary. For instance, a conversation between two Computer Science students might frequently include words like "RAM", "processor", and "GPU", and you'd have to add these stop words manually for every problem you solve.
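As a rough illustration of the stop-word approach, the sketch below compares the built-in English list with a hand-written domain list. The second sentence and the custom stop-word list are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The second document is a hypothetical example of domain-specific text
docs = ["Mayur is a nice boy.", "The GPU and the RAM are on the processor board."]

# The built-in English stop list removes words such as "is", "the", and "and"
cv_english = CountVectorizer(stop_words="english")
cv_english.fit(docs)
print(sorted(cv_english.vocabulary_))

# Domain-specific stop words have to be supplied by hand
# (lowercase, because tokens are lowercased before filtering)
cv_custom = CountVectorizer(stop_words=["ram", "gpu", "processor"])
cv_custom.fit(docs)
print(sorted(cv_custom.vocabulary_))
```

The built-in list leaves "gpu" and "ram" in the vocabulary, while the custom list leaves "the" and "is" in, which is the maintenance burden described above.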

In such scenarios, it is recommended to use `TfidfVectorizer`, which takes care of these issues. Each word in each document is assigned a weight according to the following formula:

$$
\text{tfidf}(\text{word}, \text{document}_i) = \text{tf}(\text{word}, \text{document}_i) \cdot \text{idf}(\text{word})
$$

Where:
1. **tf(word, document_i)** = Term Frequency of a word in the specific document $i$.
2. **idf(word)** = Inverse Document Frequency of the word.

Inverse Document Frequency is defined as the log of the ratio of the total number of documents to the number of documents that contain the word:

$$
\text{idf}(w) = \log\left(\frac{n_d}{df(w)}\right)
$$

Where:
1. **n_d** = total number of documents.
2. **df(w)** = number of documents in which the word occurs at least once.
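To make the formula concrete, here is a minimal sketch that computes the unsmoothed idf by hand, taking df(w) to be the number of documents containing the word. The whitespace tokenization is a simplification for illustration and is not what scikit-learn does internally.

```python
import math

docs = [
    "Mayur is a Guitarist",
    "Mayur is a Musician",
    "Mayur is also a programmer",
]

# Naive tokenization for illustration: lowercase and split on whitespace
tokenized = [set(doc.lower().split()) for doc in docs]
n_d = len(docs)

def idf(word):
    # df(w): number of documents that contain the word at least once
    df = sum(word in tokens for tokens in tokenized)
    return math.log(n_d / df)

# "mayur" occurs in every document, "guitarist" in only one
for word in ["mayur", "also", "guitarist"]:
    print(word, round(idf(word), 4))
```

A word that occurs in every document gets an idf of 0, and a word that occurs in no document would make `df` zero and raise a `ZeroDivisionError`.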

Intuitively, a word that occurs in many documents (common words like "the" or "is") receives a lower weight, while a word that occurs frequently in one document but rarely in the others receives a higher weight. Such a word is likely to be an important feature of that document.

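Note that `sklearn` adds `1` to both the numerator and the denominator to avoid division by zero, e.g., when the document frequency is 0. With the default settings it also adds `1` to the resulting idf, so that words occurring in every document are not ignored entirely.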
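With the default `smooth_idf=True`, the smoothed form documented by scikit-learn is `idf(w) = ln((1 + n_d) / (1 + df(w))) + 1`. The sketch below recomputes it by hand and compares it against the fitted `idf_` attribute. The whitespace-based document frequency count is a shortcut that happens to work for these simple sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Mayur is a Guitarist",
    "Mayur is a Musician",
    "Mayur is also a programmer",
]

tfidf = TfidfVectorizer()  # smooth_idf=True by default
tfidf.fit(docs)

n_d = len(docs)

# Vocabulary words ordered by column index
words = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

# Document frequency: number of documents containing each word
# (naive whitespace split, adequate for these sentences only)
df = np.array([sum(w in d.lower().split() for d in docs) for w in words])

# Smoothed idf: add 1 to numerator and denominator, then add 1 to the result
smoothed_idf = np.log((1 + n_d) / (1 + df)) + 1

print(dict(zip(words, smoothed_idf.round(4))))
print(np.allclose(smoothed_idf, tfidf.idf_))
```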

`sklearn` additionally normalizes each document's `tfidf` vector to have a Euclidean (L2) norm of 1. This is important since we're interested in similarities; vectors like (1,1) and (3,3) are really the same (they point in the same direction and only differ in magnitude). This is achieved by dividing each component by the length of the vector:

$$
\hat{v}_i = \frac{v_i}{\|v\|_2} = \frac{v_i}{\sqrt{v_1^2 + v_2^2 + v_3^2 + \ldots + v_n^2}}
$$
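A quick NumPy check of the claim about (1,1) and (3,3): after dividing by their lengths, both become the same unit vector. The helper name `l2_normalize` is made up for this sketch.

```python
import numpy as np

def l2_normalize(v):
    # Divide every component by the Euclidean length of the vector
    v = np.asarray(v, dtype=float)
    return v / np.sqrt(np.sum(v ** 2))

a = l2_normalize([1, 1])
b = l2_normalize([3, 3])

print(a, b)
print(np.allclose(a, b))
```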

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Initialize TfidfVectorizer and CountVectorizer
tfidf_vectorizer = TfidfVectorizer()
cv_vectorizer = CountVectorizer()

# Sample documents
docs = [
    "Mayur is a Guitarist",
    "Mayur is a Musician",
    "Mayur is also a programmer"
]

# Fit and transform the documents using TfidfVectorizer
x_idf = tfidf_vectorizer.fit_transform(docs)

# Fit and transform the documents using CountVectorizer
x_cv = cv_vectorizer.fit_transform(docs)

# Print the dense representation of the TF-IDF transformed documents
print(x_idf.todense())

# Print the vocabulary learned by TfidfVectorizer
print(tfidf_vectorizer.vocabulary_)

# Print the dense representation of the CountVectorizer transformed documents
print(x_cv.todense())


[[0.         0.76749457 0.45329466 0.45329466 0.         0.        ]
 [0.         0.         0.45329466 0.45329466 0.76749457 0.        ]
 [0.6088451  0.         0.35959372 0.35959372 0.         0.6088451 ]]
{'mayur': 3, 'is': 2, 'guitarist': 1, 'musician': 4, 'also': 0, 'programmer': 5}
[[0 1 1 1 0 0]
 [0 0 1 1 1 0]
 [1 0 1 1 0 1]]
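The two matrices above are related in a simple way. As a sketch, assuming the default settings of both vectorizers, the TF-IDF matrix can be rebuilt from the raw counts by multiplying each column by its learned idf and then L2-normalizing each row. Because the rows have unit length, their dot products are cosine similarities between documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "Mayur is a Guitarist",
    "Mayur is a Musician",
    "Mayur is also a programmer",
]

counts = CountVectorizer().fit_transform(docs).toarray()

tfidf = TfidfVectorizer()
x_idf = tfidf.fit_transform(docs).toarray()

# Weight the raw counts by the learned idf values, then L2-normalize each row
weighted = counts * tfidf.idf_
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
print(np.allclose(manual, x_idf))

# Rows are unit length, so the dot products are cosine similarities
similarity = x_idf @ x_idf.T
print(similarity.round(3))
```

The shared words "mayur" and "is" get the smallest weights in every row, while the words unique to one document ("guitarist", "musician", "also", "programmer") get the largest.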
