# The Bag-of-Words Model

In language processing, the vectors x are derived from textual data, in order to
reflect various linguistic properties of the text.

This process is called feature extraction or feature encoding. A popular and simple method of feature
extraction with text data is the bag-of-words model of text.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
    
1) A vocabulary of known words.

2) A measure of the presence of known words.

#### Mechanism of Bag of Words

In [None]:
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,

In [None]:
# find the unique words (the vocabulary)
it
was
the
best
of
times
worst
age
wisdom
foolishness

In [None]:
#create document vector
it = 1
was = 1
the = 1
best = 1
of = 1
times = 1
worst = 0
age = 0
wisdom = 0
foolishness = 0

In [None]:
#for the first sentence
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]


# for the other three sentences
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

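The worked example above can be reproduced in a few lines of Python. This is a minimal sketch, not a definitive implementation: the helper names `tokenize`, `build_vocabulary`, and `binary_vector` are hypothetical, and the tokenizer simply lowercases the text and keeps runs of letters, which is an assumption that suits this small example.

```python
import re

corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Collect the unique words in order of first appearance."""
    vocabulary = []
    for document in documents:
        for word in tokenize(document):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def binary_vector(document, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0) in the document."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = build_vocabulary(corpus)
vectors = [binary_vector(document, vocabulary) for document in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Keeping the vocabulary in order of first appearance is a choice made here so that the vector positions line up with the word list shown above.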
#### Scoring Words

Once a vocabulary has been chosen, the occurrence of words in example documents needs to be
scored. In the worked example, we have already seen one very simple approach to scoring: a
binary scoring of the presence or absence of words. Some additional simple scoring methods
include:
    
1) Counts: Count the number of times each word appears in a document.

2) Frequencies: Calculate how often each word appears in a document as a fraction of all the words in that document.
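
The two scoring methods listed above can be sketched as follows. The function names `count_vector` and `frequency_vector` are hypothetical, and the sample document (the first two lines of the worked example joined together) is chosen only so that some counts are greater than one.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Number of times each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def frequency_vector(tokens, vocabulary):
    """Each count divided by the total number of words in the document."""
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / total for word in vocabulary]

vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

# Hypothetical document: two lines of the worked example joined together.
tokens = "it was the best of times it was the worst of times".split()

counts = count_vector(tokens, vocabulary)
frequencies = frequency_vector(tokens, vocabulary)

print(counts)
print(frequencies)
```

Because every word of this document is in the vocabulary, the frequencies sum to one; with out-of-vocabulary words they would sum to less.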

#### Word Hashing

#### TF-IDF

A problem with scoring word frequency is that highly frequent words start to dominate the
document's representation (they receive larger scores), even though they may carry less
informational content for the model than rarer but perhaps domain-specific words. One approach
is to rescale the frequency of words by how often they appear across all documents, so that the
scores for words like "the", which are frequent in every document, are penalized. This approach
to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:
    
1) Term Frequency: a score of how frequently the word appears in the current document.

2) Inverse Document Frequency: a score of how rare the word is across all documents.
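
As a rough illustration of the two components, the sketch below scores each word as its within-document frequency multiplied by `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. This particular formula is an assumption: it is one common variant, and libraries often apply smoothing or normalization on top of it. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document.

    documents is a list of token lists. Assumed scoring:
    tf = count / document length, idf = log(N / document frequency).
    """
    n_documents = len(documents)

    # Number of documents in which each word appears at least once.
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_documents / document_frequency[word])
            for word, count in counts.items()
        })
    return scores

documents = [
    "it was the best of times".split(),
    "it was the worst of times".split(),
    "it was the age of wisdom".split(),
    "it was the age of foolishness".split(),
]

scores = tf_idf(documents)
for word, score in scores[0].items():
    print(f"{word}: {score:.3f}")
```

Under this variant, a word that appears in every document, such as "the", gets a score of zero, while a word unique to one document, such as "best", gets the highest score in that document.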

#### Limitations of Bag-of-Words

The bag-of-words model is very simple to understand and implement, and it offers a lot of flexibility
for customization on your specific text data. It has been used with great success on prediction
problems like language modeling and document classification. Nevertheless, it suffers from
some shortcomings, such as:
    
1) Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.

2) Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity) and for information reasons, where the challenge is for the models to harness so little information in such a large representational space.

3) Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). Context and meaning can offer a lot to the model: if captured, they could tell the difference between the same words arranged differently ("this is interesting" vs "is this interesting"), recognize synonyms ("old bike" vs "used bike"), and much more.
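
The loss of word order and meaning can be seen directly with a small check. This is only an illustrative sketch; `bag_of_words` is a hypothetical helper that counts whitespace-separated, lowercased words.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercased, whitespace-separated words, ignoring their order."""
    return Counter(text.lower().split())

# Same words in a different order produce identical bags.
same_bag = bag_of_words("this is interesting") == bag_of_words("is this interesting")
print(same_bag)

# Synonyms are treated as unrelated words: only "bike" is shared.
shared = set(bag_of_words("old bike")) & set(bag_of_words("used bike"))
print(shared)
```

The statement and the question collapse to the same representation, and nothing in the representation signals that "old" and "used" are related.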