# Problem statement
We are learning feature engineering, and here we look at how to extract features from text data. The problem at hand is how to convert text into numerical data so that we can run it through ML algorithms.
One way to do this is the "bag of words" method.

We start by importing CountVectorizer from the sklearn (scikit-learn) Python library.

In [5]:

from sklearn.feature_extraction.text import CountVectorizer

We then define the text data that we want to vectorize. Our goal is to convert these two sentences into a numerical representation.

In [16]:
text_data = ["We will use short sentences to illustrate the concepts.",
             "Once we get the concept of bag of words, we can move on."]

CountVectorizer is a class in the sklearn library that creates a bag of words representation of text data. It does this by taking a corpus of text and building a vocabulary of words from it.

In [17]:
vectorizer = CountVectorizer()

On the CountVectorizer instance we call the fit_transform method, which learns the vocabulary and converts the text data into bag of words vectors.

In [21]:
bag_of_words = vectorizer.fit_transform(text_data)

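We can now see the full vocabulary. Here the vocabulary contains every unique word in our text data.

The vocabulary is created by taking each unique word from our text data and assigning it a number. With the default settings the words are lowercased and punctuation is dropped, and the number is the word's position in the alphabetically sorted list of words.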

The result can be seen below.

In [19]:
print("Vocabulary:", vectorizer.vocabulary_)

Vocabulary: {'we': 15, 'will': 16, 'use': 14, 'short': 11, 'sentences': 10, 'to': 13, 'illustrate': 5, 'the': 12, 'concepts': 3, 'once': 9, 'get': 4, 'concept': 2, 'of': 7, 'bag': 0, 'words': 17, 'can': 1, 'move': 6, 'on': 8}
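To make the numbering concrete, here is a minimal pure-Python sketch that rebuilds the same vocabulary by hand. It is an illustration only, not how sklearn is implemented. It assumes CountVectorizer's default settings (lowercasing, and tokens made of two or more word characters), and the helper names `tokenize` and `build_vocabulary` are made up for this example.

```python
import re

text_data = ["We will use short sentences to illustrate the concepts.",
             "Once we get the concept of bag of words, we can move on."]

def tokenize(sentence):
    # Assumption: mimic the default tokenization by lowercasing and keeping
    # runs of two or more word characters, which drops punctuation.
    return re.findall(r"\b\w\w+\b", sentence.lower())

def build_vocabulary(sentences):
    # Collect the unique words, sort them alphabetically, and number them.
    unique_words = sorted({word for s in sentences for word in tokenize(s)})
    return {word: index for index, word in enumerate(unique_words)}

vocabulary = build_vocabulary(text_data)
print(vocabulary)
```

The dictionary prints in alphabetical order rather than in order of first appearance, but the word-to-number pairs should match the ones shown above, for example 'bag' at 0 and 'words' at 17.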


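Finally, here is the "bag of words" representation of our text data.

We go through each sentence in our text data. For each word in the sentence we look up its position in the vocabulary and increment the count at that position.

This way we get one vector per sentence. From this representation we can tell how many times each word occurs in each sentence. For example, "of" (position 7) and "we" (position 15) each occur twice in the second sentence, so those positions hold a 2. Because fit_transform returns a sparse matrix, we call toarray() to print it as a regular array.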

In [20]:
print("Bag of words:", bag_of_words.toarray())

Bag of words: [[0 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 0]
 [1 1 1 0 1 0 1 2 1 1 0 0 1 0 0 2 0 1]]
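The counting step can also be sketched in plain Python to check the result by hand. This is a rough illustration under the same assumptions as CountVectorizer's defaults (lowercasing, tokens of two or more word characters, alphabetical numbering), not sklearn's actual implementation. The names `tokenize` and `to_count_vector` are hypothetical helpers.

```python
import re
from collections import Counter

text_data = ["We will use short sentences to illustrate the concepts.",
             "Once we get the concept of bag of words, we can move on."]

def tokenize(sentence):
    # Assumption: lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", sentence.lower())

# Vocabulary: every unique word, numbered in alphabetical order.
all_words = sorted({word for s in text_data for word in tokenize(s)})
vocabulary = {word: index for index, word in enumerate(all_words)}

def to_count_vector(sentence, vocabulary):
    # Start with a zero for every vocabulary word, then write each word's
    # count into the position the vocabulary assigns to it.
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokenize(sentence)).items():
        vector[vocabulary[word]] = count
    return vector

manual_bag_of_words = [to_count_vector(s, vocabulary) for s in text_data]
for row in manual_bag_of_words:
    print(row)
```

Each printed row should line up with the corresponding row of the array above, with one column per vocabulary word.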
