## Bag of words

Bag of words (1-gram) counts how often each word occurs in a document and uses those counts as the features.

It takes 2 steps:

Tokenizing: Segment the corpus into "words".

Counting: Count how often each word appears.
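The two steps above can be sketched in plain Python before turning to sklearn. This is only an illustration: `tokenize` and `bag_of_words` are hypothetical helper names, and the regular expression assumes a "word" is a lowercased run of two or more word characters, which is roughly what CountVectorizer does by default.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1 (tokenizing): lowercase, then keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # Step 2 (counting): count each token per document
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document
    vocab = sorted(set().union(*counts))
    # One row per document, one column per vocabulary word (0 if absent)
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = bag_of_words(["The cat sat.", "The cat saw the dog."])
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```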

CountVectorizer from sklearn combines tokenizing and counting. You can look at its definition at the following link.
 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer

Let's create a corpus (a collection of documents) with four documents.

In [2]:
corpus = ['This is the first document or first entry.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?'
         ]

# Learn the vocabulary dictionary and return the document-term matrix
X_txt = vectorizer.fit_transform(corpus)

# Check the feature names after fitting: one entry per word in the vocabulary
vectorizer.get_feature_names_out()

array(['and', 'document', 'entry', 'first', 'is', 'one', 'or', 'second',
       'the', 'third', 'this'], dtype=object)

In [3]:
# X_txt is the BoW feature matrix of the corpus: one row per document, one column per word
print(X_txt.toarray())

[[0 1 1 2 1 0 1 0 1 0 1]
 [0 1 0 0 1 0 0 2 1 0 1]
 [1 0 0 0 0 1 0 0 1 1 0]
 [0 1 0 1 1 0 0 0 1 0 1]]


In [4]:
# mapping from each word to its column index
print(vectorizer.vocabulary_)

{'this': 10, 'is': 4, 'the': 8, 'first': 3, 'document': 1, 'or': 6, 'entry': 2, 'second': 7, 'and': 0, 'third': 9, 'one': 5}


In [5]:
# unseen words are ignored for test data
vectorizer.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
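To make the "unseen words are ignored" behaviour concrete, here is a minimal plain-Python sketch of transforming a new document against a fixed vocabulary. The `transform` helper is hypothetical and only mimics the idea; the `vocabulary` dictionary is copied from the `vocabulary_` output above, and the tokenizing rule (lowercased runs of two or more word characters) is an assumption.

```python
import re

# Word -> column index, as printed by vectorizer.vocabulary_ above
vocabulary = {'this': 10, 'is': 4, 'the': 8, 'first': 3, 'document': 1,
              'or': 6, 'entry': 2, 'second': 7, 'and': 0, 'third': 9, 'one': 5}

def transform(text, vocabulary):
    row = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        index = vocabulary.get(token)
        if index is not None:  # tokens outside the vocabulary are skipped
            row[index] += 1
    return row

print(transform('Something completely new.', vocabulary))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(transform('This document is new.', vocabulary))
# [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
```

Because the vocabulary is fixed at fit time, a document made up entirely of new words maps to the all-zero vector.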

### N-gram (N > 1)

Bag of words features can't capture local information such as word order.

>E.g. "believe or not" has the same features as "not or believe".

A bi-gram model preserves more local information by treating every 2 contiguous words as one entry in the vocabulary. In the example, "believe or", "or not", "not or" and "or believe" are counted as separate entries.
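A small plain-Python sketch of this point: the two phrases have identical 1-gram counts but different bi-grams. The `ngrams` helper is a hypothetical name, and it assumes simple whitespace tokenization rather than sklearn's tokenizer.

```python
def ngrams(text, n):
    # Split on whitespace, then join every run of n contiguous tokens
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a, b = "believe or not", "not or believe"

# Same bag of 1-grams, so the two phrases are indistinguishable
print(sorted(ngrams(a, 1)) == sorted(ngrams(b, 1)))  # True

# Different bi-grams, so word order is partly preserved
print(ngrams(a, 2))  # ['believe or', 'or not']
print(ngrams(b, 2))  # ['not or', 'or believe']
```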

The code below shows the bi-gram tokens produced by CountVectorizer.

In [6]:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2),       # (min_n, max_n): extract only sequences of exactly 2 words
                                    token_pattern=r'\b\w+\b', # a 'word' is 1+ word characters (letters, digits, '_') between word boundaries
                                    min_df=1)                 # ignore terms that appear in fewer than `min_df` documents

analyze = bigram_vectorizer.build_analyzer() # Return a callable that handles preprocessing and tokenization
print(analyze('believe or not a b c d e'))

['believe or', 'or not', 'not a', 'a b', 'b c', 'c d', 'd e']


Extract bi-gram features for the corpus

In [7]:
X_txt_2 = bigram_vectorizer.fit_transform(corpus).toarray()
print(X_txt_2)

[[0 1 1 1 1 0 1 0 0 1 0 0 0 1 0]
 [0 0 0 0 1 0 0 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 0 0 0 0 1 1 0 0]
 [0 0 1 0 0 1 0 0 0 1 0 0 0 0 1]]


In [9]:
bigram_vectorizer.get_feature_names_out()

array(['and the', 'document or', 'first document', 'first entry',
       'is the', 'is this', 'or first', 'second document',
       'second second', 'the first', 'the second', 'the third',
       'third one', 'this is', 'this the'], dtype=object)

### tf-idf
Some words have very high frequency (e.g. “the”, “a”, “which”) and therefore carry little meaningful information about the actual contents of the document.

We need to down-weight them so that high-frequency words do not overshadow the other words. tf-idf (term frequency-inverse document frequency) is used to reflect the importance of a word to a document in a collection or corpus. There are two terms in the tf-idf weight: term frequency (tf) and inverse document frequency (idf).

Tf measures how frequently a term occurs in a document. It is often divided by the document length (the total number of terms in the document) as a way of normalization.

$$tf(t,d)=\frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$


Idf measures how important a term is. In tf, all terms are treated as equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following.

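$$idf(t) = \log\left(\frac{1 + n_d}{1 + df(t)}\right) + 1$$

- $n_d$ is the number of documents.
- $df(t)$ is the number of documents containing $t$.

This is the smoothed form that sklearn's TfidfVectorizer uses by default (`smooth_idf=True`), with the natural logarithm. With `smooth_idf=False`, the two 1s added inside the logarithm are dropped, giving $idf(t) = \log\left(\frac{n_d}{df(t)}\right) + 1$.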

Finally, tf-idf is calculated as follows.

$$tf\text{-}idf(t, d) = tf(t, d) \times idf(t)$$
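As a worked illustration of the three formulas above, the following plain-Python sketch applies them to the four-document corpus from earlier. The helper names (`tf`, `df`, `idf`, `tf_idf`) are hypothetical, the natural logarithm is assumed, and the tokenizing rule (lowercased runs of two or more word characters) is also an assumption. It follows the formulas literally, so its numbers are not expected to match a library that uses a different tf definition or normalization.

```python
import math
import re

corpus = ['This is the first document or first entry.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Tokenize each document into lowercased words
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
n_d = len(docs)

def tf(term, doc):
    # Term count divided by document length
    return doc.count(term) / len(doc)

def df(term):
    # Number of documents containing the term
    return sum(term in doc for doc in docs)

def idf(term):
    # Smoothed inverse document frequency, natural log
    return math.log((1 + n_d) / (1 + df(term))) + 1

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

second_doc = docs[1]  # 'This is the second second document.'
for term in ('the', 'second'):
    print(term, round(idf(term), 4), round(tf_idf(term, second_doc), 4))
# the 1.0 0.1667
# second 1.9163 0.6388
```

"the" appears in every document, so its idf is the minimum value of 1, while "second" appears in only one document and is weighted almost twice as heavily.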

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [11]:
print(corpus)

['This is the first document or first entry.', 'This is the second second document.', 'And the third one.', 'Is this the first document?']


In [12]:
# norm='l2' normalizes each row (document) to have a unit l2 norm
# smooth_idf=False uses idf(t) = log(n_d / df(t)) + 1; if True, 1 is added to the numerator and denominator inside the log to prevent zero divisions
vectorizer = TfidfVectorizer(norm='l2', smooth_idf=False)

X_txt_3 = vectorizer.fit_transform(corpus)
print(X_txt_3.toarray())

[[0.         0.23981982 0.444427   0.63066849 0.23981982 0.
  0.444427   0.         0.18624148 0.         0.23981982]
 [0.         0.24014568 0.         0.         0.24014568 0.
  0.         0.89006176 0.18649454 0.         0.24014568]
 [0.56115953 0.         0.         0.         0.         0.56115953
  0.         0.         0.23515939 0.56115953 0.        ]
 [0.         0.43306685 0.         0.56943086 0.43306685 0.
  0.         0.         0.33631504 0.         0.43306685]]


In [13]:
# check that each row has unit l2 norm (the sum of squares is 1)
np.square(X_txt_3.toarray()).sum(axis=1)

array([1., 1., 1., 1.])
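The third row of the matrix printed above can be reproduced by hand. The sketch below is a plain-Python check under a few assumptions: the raw term count is used as tf (rather than the length-normalized tf defined earlier), idf is the unsmoothed form with the natural logarithm to match `smooth_idf=False`, and the row is then scaled to unit l2 norm. The dictionaries are filled in by reading the corpus, not taken from sklearn.

```python
import math

n_d = 4  # number of documents in the corpus

# Third document: 'And the third one.' (each term occurs once)
counts = {'and': 1, 'the': 1, 'third': 1, 'one': 1}
# Number of corpus documents containing each of those terms
doc_freq = {'and': 1, 'the': 4, 'third': 1, 'one': 1}

# Unnormalized weight: raw count times unsmoothed idf
weights = {t: counts[t] * (math.log(n_d / doc_freq[t]) + 1) for t in counts}

# Scale the row to unit l2 norm
norm = math.sqrt(sum(w * w for w in weights.values()))
tfidf = {t: w / norm for t, w in weights.items()}

print({t: round(v, 4) for t, v in tfidf.items()})
# {'and': 0.5612, 'the': 0.2352, 'third': 0.5612, 'one': 0.5612}
```

These agree with the values 0.56115953 and 0.23515939 in the third row of the printed matrix.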