The <b>bag-of-words</b> model is a way of representing text data when modeling text with machine learning algorithms.

In [6]:
docs = ['Ram is in eighth grade and ready to go to ninth grade',
        'Shanti is in sixth grade']

The objective is to turn each free-text document into a vector that can serve as input or output for a machine learning model. The simplest scoring method marks the presence of each word as a boolean value: 0 for absent, 1 for present. <br>

A binary vector can be created by setting the `binary` parameter to <b>True</b> in `CountVectorizer`.
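
As a minimal sketch of this binary scoring, the snippet below fits a `CountVectorizer` with `binary=True` on the two example documents. The names `binary_vectorizer` and `binary_matrix` are illustrative choices, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Ram is in eighth grade and ready to go to ninth grade',
        'Shanti is in sixth grade']

# binary=True records only presence (1) or absence (0), not how often a word occurs
binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(docs).toarray()

print(binary_matrix)
# [[1 1 1 1 1 1 1 1 1 0 0 1]
#  [0 0 0 1 1 1 0 0 0 1 1 0]]
```

Words that occur twice in the first document ("grade" and "to") are still encoded as 1.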

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer">Count Vectorizer</a>

The `CountVectorizer` provides a simple way to <b>tokenize</b> a collection of text documents, <b>build a vocabulary</b> of known words, and <b>encode</b> new documents using that vocabulary.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
vectorizer = CountVectorizer() #Create an instance of the CountVectorizer class
vectorizer.fit(docs) #Call the fit() function in order to learn a vocabulary from one or more documents.
vectorizer.vocabulary_

{'ram': 7,
 'is': 5,
 'in': 4,
 'eighth': 1,
 'grade': 3,
 'and': 0,
 'ready': 8,
 'to': 11,
 'go': 2,
 'ninth': 6,
 'shanti': 9,
 'sixth': 10}

A vocabulary of 12 words is learned from the documents and each word is assigned a unique integer index. All words were made lowercase by default ("Ram" became "ram"). Punctuation is also ignored by the default tokenizer, although these two documents contain none.
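
To see the lowercasing and punctuation handling directly, one option is to call the analyzer that `CountVectorizer` builds internally. The sketch below uses an illustrative sentence that is not part of the corpus. With the default settings, tokens are runs of two or more alphanumeric characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenization function the vectorizer applies
analyzer = CountVectorizer().build_analyzer()

# Illustrative sentence (not from the corpus) with capitals and punctuation
tokens = analyzer('Shanti, is in SIXTH grade!')

print(tokens)
# ['shanti', 'is', 'in', 'sixth', 'grade']
```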

In [20]:
x = vectorizer.transform(docs)
x

<2x12 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [21]:
x.toarray()[:5]

array([[1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 2],
       [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0]], dtype=int64)
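
The fitted vocabulary can also encode documents it has not seen before. The following sketch assumes a hypothetical new sentence, `new_doc`. Words outside the learned vocabulary are dropped silently rather than raising an error.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Ram is in eighth grade and ready to go to ninth grade',
        'Shanti is in sixth grade']
vectorizer = CountVectorizer().fit(docs)

# 'Gita' and 'seventh' are not in the learned vocabulary, so they are ignored
new_doc = ['Gita is in seventh grade']  # hypothetical unseen document
encoded = vectorizer.transform(new_doc).toarray()

print(encoded)
# [[0 0 0 1 1 1 0 0 0 0 0 0]]
```

Only "grade", "in" and "is" (indices 3, 4 and 5) are counted.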

In [10]:
vectorizer2 = CountVectorizer(ngram_range=(2,2)) #ngram_range=(2,2) keeps only bigrams (pairs of adjacent words) as features
x2 = vectorizer2.fit_transform(docs)
x2

<2x14 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [11]:
vectorizer2.get_feature_names_out().tolist() #get_feature_names() was removed in scikit-learn 1.2

['and ready',
 'eighth grade',
 'go to',
 'grade and',
 'in eighth',
 'in sixth',
 'is in',
 'ninth grade',
 'ram is',
 'ready to',
 'shanti is',
 'sixth grade',
 'to go',
 'to ninth']

In [12]:
x2.toarray()

array([[1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0]], dtype=int64)

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TF-IDF Vectorizer</a>

The <b>TfidfVectorizer</b> will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternatively, if you already have a learned `CountVectorizer`, you can use it with a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TfidfTransformer</a> to just calculate the inverse document frequencies and start encoding documents.
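
A minimal sketch of the two routes, assuming default parameters throughout: a `CountVectorizer` followed by a `TfidfTransformer` should give the same matrix as a `TfidfVectorizer` applied directly. The three short documents in `sample_docs` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Illustrative documents, not part of the original notebook
sample_docs = ['the cat sat', 'the dog sat down', 'a cat and a dog']

# Route 1: raw counts first, then reweight them with TF-IDF
counts = CountVectorizer().fit_transform(sample_docs)
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# Route 2: tokenize, count and reweight in a single step
tfidf_direct = TfidfVectorizer().fit_transform(sample_docs)

same = np.allclose(tfidf_from_counts.toarray(), tfidf_direct.toarray())
print(same)  # True
```

The two-step route is convenient when the raw counts are also needed elsewhere.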

<b>TF:</b> `Term Frequency`, which measures how frequently a term occurs in a document. Because documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a form of normalization:

`TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).`

<b>IDF:</b> `Inverse Document Frequency`, which measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

`IDF(t) = log_e(Total number of documents / Number of documents with term t in it).`

<b>Example:</b>

Consider a document containing 100 words wherein the word `cat` appears 3 times. The term frequency (i.e., tf) for cat is then `(3 / 100) = 0.03`. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm to keep the numbers round, the inverse document frequency (i.e., idf) is `log_10(10,000,000 / 1,000) = 4`; with the natural logarithm from the formula above it would be about 9.21. Thus, the <b>Tf-idf weight is the product of tf and idf:</b> `0.03 * 4 = 0.12`.
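
The arithmetic of this example can be checked with a few lines of plain Python. The sketch computes the idf with both logarithm bases, since the value 4 only arises with base 10. All variable names are illustrative.

```python
import math

tf = 3 / 100                 # 'cat' appears 3 times in a 100-word document
n_documents = 10_000_000     # total number of documents
n_with_term = 1_000          # documents that contain 'cat'

idf_base10 = math.log10(n_documents / n_with_term)   # 4.0
idf_natural = math.log(n_documents / n_with_term)    # about 9.21

tfidf_base10 = tf * idf_base10
tfidf_natural = tf * idf_natural

print(round(tfidf_base10, 2))   # 0.12
print(round(tfidf_natural, 2))  # 0.28
```

The choice of base rescales every idf value by the same constant, so it does not change how terms rank against each other.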

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

In [22]:
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus) #Tokenize and build vocabulary
vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [23]:
vectorizer.idf_

array([1.91629073, 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.91629073, 1.        , 1.91629073, 1.        ])

The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed words: "is", "the" and "this" at indices 3, 6 and 8.
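
These values do not follow the textbook formula exactly. With its default `smooth_idf=True`, scikit-learn computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing `t`. The sketch below recomputes the values by hand under that assumption; `manual_idf` and `doc_freq` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs at least once
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, as used by scikit-learn when smooth_idf=True (the default)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

sklearn_idf = TfidfVectorizer().fit(corpus).idf_
print(np.allclose(manual_idf, sklearn_idf))  # True
```

A term that occurs in every document has `df(t) = n`, which gives `ln(1) + 1 = 1.0`, the lowest possible score.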

In [24]:
x3 = vectorizer.transform(corpus)
x3

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [25]:
x3.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

The scores are normalized with the `l2` norm, so every value lies between 0 and 1, and the encoded document vectors can then be used directly with most machine learning algorithms.

Each output row will have unit norm, as determined by the `norm` parameter: <br>
`l2`: Sum of squares of vector elements is 1. <br>
`l1`: Sum of absolute values of vector elements is 1. <br>

In [26]:
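round(0.2116 + 0.3364 + 0.1444 + 0.1444 + 0.1444, 4) #l2 norm - sum of squares of the first row in x3, with entries truncated to two decimals (0.46, 0.58, 0.38), so the result is only approximately 1

0.9812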
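
The manual sum above uses values truncated to two decimals, which is why it falls slightly short of 1. As a sketch of an exact check, the per-row sums can be computed from the full-precision matrices for both `norm` settings. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

l2_rows = TfidfVectorizer(norm='l2').fit_transform(corpus).toarray()
l1_rows = TfidfVectorizer(norm='l1').fit_transform(corpus).toarray()

# l2: the squares of each row sum to 1; l1: the absolute values of each row sum to 1
sum_of_squares = (l2_rows ** 2).sum(axis=1)
sum_of_abs = np.abs(l1_rows).sum(axis=1)

print(np.allclose(sum_of_squares, 1.0))  # True
print(np.allclose(sum_of_abs, 1.0))      # True
```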

### Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on your specific text data. Nevertheless, it suffers from some shortcomings, such as:

<b>Vocabulary:</b> The vocabulary requires careful design, specifically to manage its size, which impacts the sparsity of the document representations.<br>
<b>Sparsity:</b> Sparse representations are harder to model both for computational reasons (space and time complexity) and for information reasons, where the challenge is for the models to harness so little information in such a large representational space.<br>
<b>Meaning:</b> Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). Context and meaning can offer a lot to the model: if modeled, they could tell the difference between the same words arranged differently (“this is interesting” vs “is this interesting”), recognize synonyms (“old bike” vs “used bike”), and much more.
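
The word-order limitation can be illustrated with the two phrases from the text. In this sketch, unigram counts cannot tell them apart, while bigram features, one partial remedy, can. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

phrases = ['this is interesting', 'is this interesting']

# Unigram bag-of-words: both phrases contain exactly the same words
unigram = CountVectorizer().fit_transform(phrases).toarray()

# Bigram features keep some local word order
bigram = CountVectorizer(ngram_range=(2, 2)).fit_transform(phrases).toarray()

print(np.array_equal(unigram[0], unigram[1]))  # True: indistinguishable
print(np.array_equal(bigram[0], bigram[1]))    # False: word order is visible
```

Bigrams enlarge the vocabulary, so this remedy comes at the cost of the sparsity issue described above.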

### When to use Bag-of-Words

If your dataset is `small and the context is domain specific`, BoW may work better than word embeddings. When the vocabulary is highly domain specific, many of its terms have no corresponding vector in pre-trained word embedding models (GloVe, fastText, etc.).