### Vectorizing
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect __numerical feature vectors with a fixed size__ rather than __raw text documents with variable length__.

In order to address this issue, early researchers came up with some methods to extract numerical features from text content. All these methods have the following steps in common:

1. **Tokenizing** strings, for instance by using white-space and punctuation as token separators, and then assigning an integer id to each possible token.

2. **Counting** the occurrences of tokens in each string/sentence/document.

3. **Normalizing and weighting**, giving diminishing importance to tokens that occur in the majority of samples/documents.

A set of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
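
The first two steps and the resulting matrix can be sketched in plain Python. This is only an illustrative sketch: the `tokenize` helper and its regular expression are assumptions made for this example, not the exact rules used by any particular library.

```python
import re
from collections import Counter

docs = ["The sky is blue.", "The sun is bright."]

def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: tokenize, then give each distinct token an integer id (alphabetical order)
tokens_per_doc = [tokenize(d) for d in docs]
all_tokens = sorted({tok for toks in tokens_per_doc for tok in toks})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

# Step 2: count occurrences; one row per document, one column per token
matrix = []
for toks in tokens_per_doc:
    counts = Counter(toks)
    matrix.append([counts[tok] for tok in all_tokens])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length (the vocabulary size) no matter how long the original document was, which is what the learning algorithms need.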

In NLP, a document can be a string, an English sentence, or a complete Word file, and a __corpus__ is nothing but a collection of documents. The research community likes to use a lot of technical, domain-specific jargon. Your job is to not let these words scare you!

I (Ankur) often try to map these words to something that I already know. This makes them easier to remember, and in the process of mapping (creating an analogy) I end up understanding the topic better: if an analogy doesn't work (i.e. doesn't capture the meaning), I am forced to look for a new one, and so on…

We call __vectorization__ the general process of turning a collection of text documents into numerical feature vectors. Documents are described by __word occurrences__ while completely __ignoring the relative position information__ of the words in the document.

We want an algebraic model that represents textual information as a vector. The components of this vector could represent the absence or presence of a term in a document (Bag of Words), or the importance of a term in the document (tf–idf).
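
As a small sketch of the difference between counting terms and only recording their presence, the example below uses scikit-learn's `CountVectorizer` (the class discussed in the next part) with and without its `binary=True` option. The two short documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, just for this illustration
docs = ["the sun, the bright sun", "the sky"]

count_vec = CountVectorizer()              # components are raw term counts
binary_vec = CountVectorizer(binary=True)  # components are 1 (present) or 0 (absent)

counts = count_vec.fit_transform(docs).toarray()
presence = binary_vec.fit_transform(docs).toarray()

print(sorted(count_vec.vocabulary_))
print(counts)
print(presence)
```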

__CountVectorizer__

The first step in modeling a document as a vector is to create a dictionary of the terms present in the documents. To do that, you can simply tokenize all the documents and select the unique terms. However, some words (stop words such as “the”, “are”, etc.) are present in almost all documents, and what we are doing is extracting important features from documents. Terms like “the”, “at”, and “on” are not going to help us differentiate documents, so they are often ignored. Note that __CountVectorizer__ keeps stop words by default; they are removed only when its stop_words argument is set.
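
To make the effect of stop-word removal concrete, here is a minimal sketch comparing the default behaviour with `stop_words="english"`, which uses scikit-learn's built-in English stop-word list. The variable names are chosen for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

# Default: stop words are kept in the vocabulary
default_vec = CountVectorizer().fit(train_set)

# With the built-in English list, words such as "the" and "is" are dropped
filtered_vec = CountVectorizer(stop_words="english").fit(train_set)

print(sorted(default_vec.vocabulary_))
print(sorted(filtered_vec.vocabulary_))
```

The rest of this section uses the default settings, so “the” and “is” remain in the vocabulary.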

**Implementation**
Let's say you have two sentences in train_set as well as in test_set.

In [1]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", "We can see the shining sun, the bright sun.")


In [7]:
## We use CountVectorizer to convert these sentences into vectors

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

By default, __CountVectorizer__ uses the analyzer called __word__ (in Jupyter, press Shift + TAB to see the list of arguments), which is responsible for converting the text to lowercase and extracting tokens. Accent removal and stop-word filtering are also supported, but they are switched off unless you set the strip_accents or stop_words arguments. You can see more information by printing the class information.
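
One way to inspect these settings, sketched below, is to read them from `get_params()` and to call the function returned by `build_analyzer()` on a sample string. The selection of parameter names printed here is just a convenient subset chosen for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# Look at a few of the constructor arguments and their default values
params = vectorizer.get_params()
for name in ("analyzer", "lowercase", "stop_words", "strip_accents", "token_pattern"):
    print(name, "=", repr(params[name]))

# build_analyzer() returns the callable that turns one string into tokens
analyze = vectorizer.build_analyzer()
tokens = analyze("The sky is blue.")
print(tokens)
```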

In [21]:
vectorizer.fit(train_set)
vectorizer.vocabulary_

{'the': 5, 'sky': 3, 'is': 2, 'blue': 0, 'sun': 4, 'bright': 1}

vocabulary_ is just a normal Python dictionary. The key is the token and the value is its column index. We can try sorting it by index value.

In [26]:
vocab_dict = vectorizer.vocabulary_.copy()
dict(sorted(vocab_dict.items(), key=lambda item: item[1]))

{'blue': 0, 'bright': 1, 'is': 2, 'sky': 3, 'sun': 4, 'the': 5}

In [36]:
vocab_dict.items()

dict_items([('the', 5), ('sky', 3), ('is', 2), ('blue', 0), ('sun', 4), ('bright', 1)])

In [38]:
test_set

('The sun in the sky is bright.',
 'We can see the shining sun, the bright sun.')

In [40]:
test_vec = vectorizer.transform(test_set)
test_vec.toarray()

array([[0, 1, 1, 1, 1, 2],
       [0, 1, 0, 0, 2, 2]], dtype=int64)

In [42]:
vectorizer.transform(['The ball is red']).toarray()

array([[0, 0, 1, 0, 0, 1]], dtype=int64)
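
In the output above, “ball” and “red” leave no trace because they were not in the training vocabulary. The sketch below pushes this to the extreme with a made-up sentence that shares no token with the training set, so every column should come out as zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
vectorizer = CountVectorizer().fit(train_set)

# None of these tokens were seen during fit, so they are silently ignored
unseen = vectorizer.transform(["A red ball bounces"]).toarray()
print(unseen)
```

This is worth keeping in mind when the test data uses very different words from the training data: the vectors can end up mostly empty.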

We can also reverse the transformation by using the inverse_transform method.

In [45]:
vectorizer.inverse_transform(test_vec)

[array(['bright', 'is', 'sky', 'sun', 'the'], dtype='<U6'),
 array(['bright', 'sun', 'the'], dtype='<U6')]

As we can see, we only get the tokens back, not their order. That information was lost in the transformation process. This is a clear limitation of vectorization techniques, because we as humans know how important order is in natural languages.

However, the main problem with this approach (also known as **term-frequency**) is that it **scales up frequent terms** and **scales down rare terms**, even though rare terms are empirically more informative than high-frequency ones. The basic intuition is that a term that occurs frequently in many documents is not a good discriminator. The important question here is: why would you, in a classification problem for instance, emphasize a term that is present in almost all the documents in the corpus?
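
A quick way to see the loss of word order is to vectorize two sentences that use the same words in a different order. The sentences below are made up for this illustration; under a bag-of-words model they should map to identical vectors even though their meanings differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the dog bites the man", "the man bites the dog"]

vectorizer = CountVectorizer()
vec_a, vec_b = vectorizer.fit_transform(sentences).toarray()

print(sorted(vectorizer.vocabulary_))
print(vec_a)
print(vec_b)
print("identical vectors:", bool((vec_a == vec_b).all()))
```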

So, technically speaking, there are two problems with this approach:

- Document size is not taken into consideration (normalization).

- The document frequency of words is also ignored (weighting).
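
The first problem (document size) can be illustrated with a small sketch: a document repeated three times gets counts three times as large, although its content is the same. Scaling each row to unit length, here with `sklearn.preprocessing.normalize` and the L2 norm, removes that effect. The documents are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

short_doc = "the sun is bright"
long_doc = " ".join([short_doc] * 3)  # same content, three times as long

counts = CountVectorizer().fit_transform([short_doc, long_doc]).toarray()
print(counts)  # second row is three times the first

# After L2 normalization both rows point in the same direction with length 1
unit = normalize(counts, norm="l2")
print(unit)
```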

__Tf–idf vectorizer__
The tf-idf weight comes to solve this problem. What tf-idf gives is a measure of how important a word is to a document in a collection, which is why tf-idf incorporates both local and global information: it takes into consideration not only the isolated term but also the term within the document collection.

__tf-idf__ scales down the frequent terms while scaling up the rare terms. A term that occurs 10 times more often than another is not 10 times more important, which is why tf-idf uses a logarithmic scale.

Using simple term frequency alone can lead to problems like __keyword spamming__, which is when a term is repeated in a document with the purpose of improving its ranking in an IR (Information Retrieval) system. It can also create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.

Tf-idf, as the name suggests, combines two different components: __term frequency__ and __inverse document frequency__. The term-frequency vector of a document is generally normalized (__vector normalization__). Let's see now how idf (inverse document frequency) is defined:


$$
\text{idf}(t) = \log\left(\frac{|D|}{1 + |\{d : t \in d\}|}\right)
$$

Where:
* $|D|$ represents the total number of documents in the corpus.
* $|\{d : t \in d\}|$ represents the number of documents containing the term $t$. The '+1' in the denominator is often added to prevent division by zero for terms not present in any document and to smooth the values.
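
The sketch below computes idf for the two training sentences used in this section, once with the formula above and once with the smoothed variant that scikit-learn documents for its default `smooth_idf=True` setting, $\ln\frac{1 + |D|}{1 + \text{df}(t)} + 1$. The variable names are chosen for this example; the comparison with the fitted `idf_` attribute is there to check the second formula against the library.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

counts = CountVectorizer().fit_transform(train_set).toarray()
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Formula from the text: log(|D| / (1 + df))
idf_textbook = np.log(n_docs / (1 + df))

# Smoothed variant used by scikit-learn by default
idf_smoothed = np.log((1 + n_docs) / (1 + df)) + 1

idf_sklearn = TfidfVectorizer().fit(train_set).idf_

print(df)
print(idf_textbook)
print(idf_smoothed)
print(np.allclose(idf_smoothed, idf_sklearn))
```

On such a tiny corpus the textbook formula gives zero or negative weights, which is one reason a library may prefer a smoothed version that always stays positive.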

The formula for tf-idf is then:
$$\text{tfidf}(t) = \text{tf}(t) \times \text{idf}(t)$$

and this formula has an important consequence: a high weight of the tf-idf calculation is reached when you have a high term frequency (tf) in the given document (local parameter) and a low document frequency of the term in the whole collection (global parameter).

Don’t worry if things don’t make complete sense yet. Just remember: __term frequency__ represents “the frequency of the term in the document”, and __inverse document frequency__ is derived from “in how many documents the term was present”: the more documents contain the term, the lower its idf.

The first captures local information (document level) and the second captures global information (collection level). For each term, the weight is calculated by combining (here, multiplying) these two pieces of information.

__Implementation__

The first step is to create our training and testing document sets and fit the vectorizer on the training set:

In [55]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", "We can see the shining sun, the bright sun.")

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(train_set)
vectorizer.vocabulary_

{'the': 5, 'sky': 3, 'is': 2, 'blue': 0, 'sun': 4, 'bright': 1}

In [61]:
## transform the sentences to vectors
vectorizer.transform(test_set).toarray()

array([[0.        , 0.42519636, 0.30253071, 0.42519636, 0.42519636,
        0.60506143],
       [0.        , 0.37729199, 0.        , 0.        , 0.75458397,
        0.53689271]])
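
The numbers above can be reproduced step by step, which may help connect them to the formulas. This sketch assumes scikit-learn's default settings (raw counts as tf, smoothed idf, and L2 normalization of each row); the intermediate variable names are chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", "We can see the shining sun, the bright sun.")

count_vec = CountVectorizer().fit(train_set)
tfidf_vec = TfidfVectorizer().fit(train_set)

# Step 1: term frequencies (raw counts) of the test documents
tf = count_vec.transform(test_set).toarray()

# Step 2: weight each column by the idf learned from the training set
weighted = tf * tfidf_vec.idf_

# Step 3: scale every row to unit Euclidean length
by_hand = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

expected = tfidf_vec.transform(test_set).toarray()
print(by_hand)
print(np.allclose(by_hand, expected))
```

Note how “the” has the largest count in both test sentences, yet its final weight is pulled down relative to its count because it appears in every training document.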