# Text Feature Extraction using Bag of Words Model

In [18]:
import numpy as np

Let's use the Zen of Python as our data:

In [19]:
zen = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""

lines = [line.lower() for line in zen.splitlines()]

In [20]:
lines

['beautiful is better than ugly.',
 'explicit is better than implicit.',
 'simple is better than complex.',
 'complex is better than complicated.',
 'flat is better than nested.',
 'sparse is better than dense.',
 'readability counts.',
 "special cases aren't special enough to break the rules.",
 'although practicality beats purity.',
 'errors should never pass silently.',
 'unless explicitly silenced.',
 'in the face of ambiguity, refuse the temptation to guess.',
 'there should be one-- and preferably only one --obvious way to do it.',
 "although that way may not be obvious at first unless you're dutch.",
 'now is better than never.',
 'although never is often better than *right* now.',
 "if the implementation is hard to explain, it's a bad idea.",
 'if the implementation is easy to explain, it may be a good idea.',
 "namespaces are one honking great idea -- let's do more of those!"]

# Count Vectorizer

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer().fit(lines)
cvec.vocabulary_

{'although': 0,
 'ambiguity': 1,
 'and': 2,
 'are': 3,
 'aren': 4,
 'at': 5,
 'bad': 6,
 'be': 7,
 'beats': 8,
 'beautiful': 9,
 'better': 10,
 'break': 11,
 'cases': 12,
 'complex': 13,
 'complicated': 14,
 'counts': 15,
 'dense': 16,
 'do': 17,
 'dutch': 18,
 'easy': 19,
 'enough': 20,
 'errors': 21,
 'explain': 22,
 'explicit': 23,
 'explicitly': 24,
 'face': 25,
 'first': 26,
 'flat': 27,
 'good': 28,
 'great': 29,
 'guess': 30,
 'hard': 31,
 'honking': 32,
 'idea': 33,
 'if': 34,
 'implementation': 35,
 'implicit': 36,
 'in': 37,
 'is': 38,
 'it': 39,
 'let': 40,
 'may': 41,
 'more': 42,
 'namespaces': 43,
 'nested': 44,
 'never': 45,
 'not': 46,
 'now': 47,
 'obvious': 48,
 'of': 49,
 'often': 50,
 'one': 51,
 'only': 52,
 'pass': 53,
 'practicality': 54,
 'preferably': 55,
 'purity': 56,
 're': 57,
 'readability': 58,
 'refuse': 59,
 'right': 60,
 'rules': 61,
 'should': 62,
 'silenced': 63,
 'silently': 64,
 'simple': 65,
 'sparse': 66,
 'special': 67,
 'temptation': 68,
 'than

In [22]:
bow = cvec.transform(lines)
bow.shape

(19, 79)
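To make the vocabulary and the `(19, 79)` count matrix less opaque, here is a minimal pure-Python sketch of the same idea. It assumes the default `CountVectorizer` tokenization (lowercasing, then keeping runs of two or more word characters), which is why `aren't` yields `aren` and one-letter tokens such as `a` are dropped. The helper names `tokenize` and `bag_of_words` and the two sample sentences are illustrative only, not part of scikit-learn.

```python
import re

# Assumption: mirrors CountVectorizer's default token pattern,
# i.e. tokens are runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    tokenized = [tokenize(doc) for doc in docs]
    # Vocabulary: every distinct token, indexed in alphabetical order
    distinct = sorted({tok for toks in tokenized for tok in toks})
    vocab = {tok: idx for idx, tok in enumerate(distinct)}
    # One row per document, one column per vocabulary entry
    matrix = [[0] * len(vocab) for _ in docs]
    for row, toks in zip(matrix, tokenized):
        for tok in toks:
            row[vocab[tok]] += 1
    return vocab, matrix

sample = ["Special cases aren't special enough.", "It's a bad idea."]
vocab, matrix = bag_of_words(sample)
print(vocab)
print(matrix)
```

Each row counts how often every vocabulary word occurs in one document, so word order is discarded entirely; that is the "bag" in bag of words.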

In [23]:
cvec.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['although',
 'ambiguity',
 'and',
 'are',
 'aren',
 'at',
 'bad',
 'be',
 'beats',
 'beautiful',
 'better',
 'break',
 'cases',
 'complex',
 'complicated',
 'counts',
 'dense',
 'do',
 'dutch',
 'easy',
 'enough',
 'errors',
 'explain',
 'explicit',
 'explicitly',
 'face',
 'first',
 'flat',
 'good',
 'great',
 'guess',
 'hard',
 'honking',
 'idea',
 'if',
 'implementation',
 'implicit',
 'in',
 'is',
 'it',
 'let',
 'may',
 'more',
 'namespaces',
 'nested',
 'never',
 'not',
 'now',
 'obvious',
 'of',
 'often',
 'one',
 'only',
 'pass',
 'practicality',
 'preferably',
 'purity',
 're',
 'readability',
 'refuse',
 'right',
 'rules',
 'should',
 'silenced',
 'silently',
 'simple',
 'sparse',
 'special',
 'temptation',
 'than',
 'that',
 'the',
 'there',
 'those',
 'to',
 'ugly',
 'unless',
 'way',
 'you']

In [14]:
bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## Applying Tfidf (Term-Frequency Inverse-Document Frequency) Encoding

Tf-idf encoding down-weights terms that appear in many documents, so common words contribute less to a document's feature vector than rarer, more distinctive ones.

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer().fit(lines)
tvec.transform(lines).toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

Tf-idf vectors are one way to represent documents as feature vectors. They modify the raw term frequency (tf), which counts how often a particular word occurs in a given document, by down-weighting each term according to the number of documents in which it occurs. The idea is that terms occurring in many different documents are likely to carry little useful information for Natural Language Processing tasks such as document classification.
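The weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than scikit-learn's actual code: it uses the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each row, which is how the `TfidfVectorizer` defaults are documented. The function name `tfidf` and the three pre-tokenized documents are made up for the example.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return (vocab, idf, rows) for a list of already-tokenized documents."""
    n_docs = len(tokenized_docs)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed idf (assumed to match the smooth_idf=True default)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts in this document
        weights = [tf[tok] * idf[tok] for tok in vocab]
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, idf, rows

docs = [
    ["flat", "is", "better"],
    ["sparse", "is", "better"],
    ["readability", "counts"],
]
vocab, idf, rows = tfidf(docs)
print(idf["is"], idf["flat"])
```

Here `is` occurs in two of the three documents and `flat` in only one, so `is` gets the smaller idf and therefore the smaller weight in the first row.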

## Word Order using n-grams

So far, we've used unigram tokenization: each token is a single element with respect to the splitting criterion (here, a single word).

In some cases, word order matters. In such cases, we can use n-grams, which treat **n** consecutive words as a single token.

For example, consider the sentence "this is how you get ants":
- 1-gram: "this", "is", "how", "you", "get", "ants"
- 2-gram: "this is", "is how", "how you", "you get", "get ants"
- 3-gram: "this is how", "is how you", "how you get", "you get ants"
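The list above can be reproduced with a short sliding-window helper. The name `ngrams` is hypothetical, and splitting on whitespace is a simplification of what a real tokenizer does.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one n-gram string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is how you get ants".split()
for n in (1, 2, 3):
    print("{}-gram: {}".format(n, ngrams(tokens, n)))
```

A sentence with `k` tokens yields `k - n + 1` n-grams, and none at all when `n` exceeds `k`.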

Now, let's try a bigram model with `CountVectorizer`:

In [27]:
bivec = CountVectorizer(ngram_range=(2,2)).fit(lines)
bivec.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['although never',
 'although practicality',
 'although that',
 'ambiguity refuse',
 'and preferably',
 'are one',
 'aren special',
 'at first',
 'bad idea',
 'be good',
 'be obvious',
 'be one',
 'beats purity',
 'beautiful is',
 'better than',
 'break the',
 'cases aren',
 'complex is',
 'do it',
 'do more',
 'easy to',
 'enough to',
 'errors should',
 'explain it',
 'explicit is',
 'explicitly silenced',
 'face of',
 'first unless',
 'flat is',
 'good idea',
 'great idea',
 'hard to',
 'honking great',
 'idea let',
 'if the',
 'implementation is',
 'in the',
 'is better',
 'is easy',
 'is hard',
 'is often',
 'it bad',
 'it may',
 'let do',
 'may be',
 'may not',
 'more of',
 'namespaces are',
 'never is',
 'never pass',
 'not be',
 'now is',
 'obvious at',
 'obvious way',
 'of ambiguity',
 'of those',
 'often better',
 'one and',
 'one honking',
 'one obvious',
 'only one',
 'pass silently',
 'practicality beats',
 'preferably only',
 're dutch',
 'readability counts',
 'refuse the

In [28]:
bivec.transform(lines).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

### What value of n should you consider?

*It depends on the algorithm, dataset, and goal. Treat n as a tuning parameter and adjust it accordingly.*

For example, the snippet below finds the most common unigram, bigram, and trigram:

In [35]:
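for n in range(1, 4):
    # Fit a vectorizer that extracts only n-grams of length n
    vec = CountVectorizer(ngram_range=(n, n)).fit(lines)
    data = vec.transform(lines)
    # Sum each column to get corpus-wide counts, then pick the largest
    most_common = np.argmax(data.sum(axis=0))

    # get_feature_names() was removed in scikit-learn 1.2
    feature = vec.get_feature_names_out()[most_common]
    print("Most common {}-gram: {}".format(n, feature))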

Most common 1-gram: is
Most common 2-gram: better than
Most common 3-gram: is better than


As you can see, it all depends on the context of what you are trying to achieve.
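The same kind of count can be cross-checked without scikit-learn by using `collections.Counter`. This is a rough sketch: the helper names are hypothetical, tokens are split on whitespace only, and the three sample lines are a punctuation-free subset of the Zen of Python, so its counts are not those of the full corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one n-gram string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_common_ngram(docs, n):
    """Return the (n-gram, count) pair with the highest count across docs."""
    counts = Counter()
    for doc in docs:
        # n-grams are built per document, so they never span two lines
        counts.update(ngrams(doc.lower().split(), n))
    return counts.most_common(1)[0]

sample = [
    "beautiful is better than ugly",
    "explicit is better than implicit",
    "simple is better than complex",
]
for n in range(1, 4):
    gram, count = most_common_ngram(sample, n)
    print("Most common {}-gram: {} ({} times)".format(n, gram, count))
```

When several n-grams tie for the highest count, which one is reported depends on the tie-breaking rule of the counting method, so ties are worth checking before reading much into a single "most common" result.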