In [1]:
from sklearn.feature_extraction.text import CountVectorizer

This line imports the CountVectorizer class from the feature_extraction.text module of the scikit-learn library. CountVectorizer converts a collection of text documents into a matrix of token counts, the kind of numerical input that most machine learning models require.


In [2]:
documents = ["the quick brown fox", "jumped over the lazy dog"]

Here we define a Python list named documents that contains two strings. Each string is treated as a separate document. In a real-world scenario, these could be paragraphs, emails, or any other collection of texts.

In [3]:
vectorizer = CountVectorizer()

This line creates an instance of the CountVectorizer class with its default settings and assigns it to the variable vectorizer. Once fitted, this object tokenizes the documents and counts the occurrences of each token (word). By default it lowercases the text and keeps only tokens of two or more alphanumeric characters.
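To make the tokenization step concrete, here is a minimal sketch that asks a default CountVectorizer for its analyzer (the function it applies to each document) and runs it on a sample sentence. The sample sentence and the names analyzer and tokens are illustrative assumptions, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A vectorizer with default settings; no fitting is needed to inspect its analyzer.
vectorizer = CountVectorizer()

# build_analyzer() returns the callable that preprocesses and tokenizes one document.
analyzer = vectorizer.build_analyzer()

# Hypothetical sample sentence, chosen to show lowercasing and punctuation handling.
tokens = analyzer("The quick brown fox, a dog!")
print(tokens)  # ['the', 'quick', 'brown', 'fox', 'dog']
```

Note that the single-character word "a" is dropped, because the default token pattern only matches tokens of two or more characters.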

In [4]:
bow_matrix = vectorizer.fit_transform(documents)

The fit_transform method is called on the vectorizer object with our documents list as the argument. It performs two steps:
fit part: learns the vocabulary of the entire text corpus (in this case, the two documents combined).
transform part: converts the documents into a sparse matrix in which each row corresponds to a document and each column corresponds to a term (word) in the learned vocabulary. Each value is the number of times that term occurs in that document.
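The two parts can also be run separately. The sketch below is an illustration under the assumption that the same two documents are used; the variable names combined, two_step, separate, same, and new_counts are made up for this example. It checks that calling fit and then transform gives the same counts as fit_transform, and shows that a fitted vectorizer can transform text it has not seen before.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["the quick brown fox", "jumped over the lazy dog"]

# One call: learn the vocabulary and build the count matrix.
combined = CountVectorizer().fit_transform(documents)

# Two calls: fit learns the vocabulary, transform builds the count matrix.
two_step = CountVectorizer()
two_step.fit(documents)
separate = two_step.transform(documents)

# Both routes should give identical counts.
same = bool((combined.toarray() == separate.toarray()).all())
print(same)            # True
print(combined.shape)  # (2, 8): 2 documents, 8 vocabulary terms

# Words outside the learned vocabulary ("and", "cat") are ignored.
new_counts = two_step.transform(["the fox and the cat"]).toarray()
print(new_counts)      # [[0 0 1 0 0 0 0 2]]
```

Splitting the steps matters in practice when the vocabulary is learned on training data and then reused unchanged on new data.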

In [5]:
print(vectorizer.get_feature_names_out())

['brown' 'dog' 'fox' 'jumped' 'lazy' 'over' 'quick' 'the']


After the vectorizer has been fitted, get_feature_names_out() returns the feature names as a NumPy array, which in this case are the words extracted from the documents. The printed vocabulary is sorted alphabetically, and the position of each word is the index of its column in the matrix.

In [6]:
print(bow_matrix.toarray())

[[1 0 1 0 0 0 1 1]
 [0 1 0 1 1 1 0 1]]


Finally, toarray() is called on bow_matrix to convert the sparse matrix into a regular (dense) NumPy array for easy viewing. Each row of the printed array corresponds to one document, and each column holds the count of the vocabulary term at that position. For example, the last column belongs to 'the', which occurs once in each document, so both rows end in 1.