# Vectorizing

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect **numerical feature vectors with a fixed size** rather than **raw text documents with variable length**.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

   * **tokenizing** strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
   * **counting** the occurrences of tokens in each document.
   * **normalizing and weighting** with diminishing importance tokens that occur in the majority of samples / documents.

A set of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. Documents are described by **word occurrences** while completely **ignoring the relative position information** of the words in the document.

We want an algebraic model that represents textual information as a vector. The components of this vector could represent the absence or presence of a term in a document (Bag of Words), or even the importance of the term in the document (tf–idf).
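To make the matrix picture concrete, here is a minimal pure-Python sketch of the tokenize, index and count steps described above. The helper names `tokenize` and `count_matrix`, the regular expression and the two sample sentences are assumptions made for illustration; this is not how scikit-learn implements it internally.

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters,
    # so whitespace and punctuation act as token separators.
    return re.findall(r"\b\w\w+\b", text.lower())

def count_matrix(documents):
    # Give every distinct token an integer id (alphabetical order).
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    vocabulary = {tok: i for i, tok in enumerate(tokens)}
    # Build one row per document and one column per token.
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            row[vocabulary[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical sample documents, used only for this sketch.
docs = ["The cat sat.", "The cat saw the dog."]
vocabulary, matrix = count_matrix(docs)
print(vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Note that word order is lost: each row only records how often each token occurs.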



## CountVectorizer

The first step in modeling a document as a vector is to create a dictionary of the terms present in the documents. To do that, you can simply select all terms from the documents and convert each one to a dimension in the vector space. However, some kinds of words (stop words) are present in almost all documents. What we’re doing is extracting important features from documents, features that identify them among other similar documents, so terms like “the”, “at”, “on”, etc. aren’t going to help us, and in the information extraction we’ll just ignore them.


Let’s take the documents below to define our (stupid) document space:

>**Train Document Set:**

>d1: The sky is blue.<br>
>d2: The sun is bright.

>**Test Document Set:**

>d3: The sun in the sky is bright.<br>
>d4: We can see the shining sun, the bright sun.

Once the stop words are ignored, the training set leaves us with the following vocabulary:

```python
['sky', 'blue', 'sun', 'bright']
```
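The list above can be derived by hand with a few lines of plain Python. This is only a sketch: the `STOP_WORDS` set below is a tiny hypothetical list made up for this example, not the stop-word list shipped with any library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this example only.
STOP_WORDS = {"the", "is", "at", "on", "in"}

train_set = ("The sky is blue.", "The sun is bright.")

vocabulary = []
for document in train_set:
    for token in re.findall(r"\w+", document.lower()):
        # Keep each non-stop word once, in order of first appearance.
        if token not in STOP_WORDS and token not in vocabulary:
            vocabulary.append(token)

print(vocabulary)  # ['sky', 'blue', 'sun', 'bright']
```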
### Indexed Vocabulary or Dictionary vectorizer (not very important, but I like talking about it)

We’re going to use the **term frequency** to represent each term in our vector space; the term frequency is nothing more than a measure of how many times the terms in our vocabulary appear in the documents.



In [1]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.")

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,1))

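By default, the **CountVectorizer** uses the `'word'` analyzer, which converts the text to lowercase and extracts tokens of two or more alphanumeric characters. Accent removal and stop-word filtering are also available through the `strip_accents` and `stop_words` parameters (for example `stop_words='english'`), but they are off by default, which is why “the” and “is” show up in the vocabulary below. You can see more information by printing the class documentation, for instance with `help(CountVectorizer)`.

Now let’s create the vocabulary index: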

In [25]:
vectorizer.fit(train_set)
vectorizer.vocabulary_

{'the': 5, 'sky': 3, 'is': 2, 'blue': 0, 'sun': 4, 'bright': 1}

Let’s use the same vectorizer now to create the vectors of our test_set documents:

In [26]:
print(test_set)

('The sun in the sky is bright.', 'We can see the shining sun, the bright sun.')


In [27]:
test_vec = vectorizer.transform(test_set)
#print(test_vec)
test_vec.toarray()
# type(test_vec)

array([[0, 1, 1, 1, 1, 2],
       [0, 1, 0, 0, 2, 2]], dtype=int64)

In [28]:
vectorizer.transform(['The ball is red']).toarray()

array([[0, 0, 1, 0, 0, 1]], dtype=int64)

In [7]:
vectorizer.inverse_transform(test_vec)

[array(['bright', 'is', 'sky', 'sun', 'the'], dtype='<U6'),
 array(['bright', 'sun', 'the'], dtype='<U6')]

However, the main problem with the term-frequency approach is that it **scales up frequent terms** and **scales down rare terms**, which are empirically more informative than the high-frequency terms. The basic intuition is that a term that occurs frequently in many documents is not a good discriminator, and this really makes sense (at least in many experimental tests). The important question here is: why would you, in a classification problem for instance, emphasize a term that is present in almost the entire corpus of your documents?

Two problems with this approach:
   * document size is not taken into consideration (normalization).
   * the document frequency of words is also ignored (weighting).

## Tf–idf term weighting

The tf-idf weight comes to solve this problem. What tf-idf gives is how important a word is to a document in a collection, and that’s why tf-idf incorporates local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection. To solve that problem, tf-idf scales down the frequent terms while scaling up the rare terms. A term that occurs 10 times more than another isn’t 10 times more important than it, which is why tf-idf uses a logarithmic scale.

Using this simple term frequency could lead us to problems like keyword spamming, which is when a term is repeated in a document with the purpose of improving its ranking in an IR (Information Retrieval) system. It can also create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.

To overcome this problem, the term frequency of a document in a vector space is usually normalized (**vector normalization**).
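As a rough illustration of that normalization, the sketch below divides a count vector by its Euclidean (L2) length. The choice of the L2 norm and the helper name `l2_normalize` are assumptions made for this example; other norms are possible.

```python
import math

def l2_normalize(vector):
    # Divide every component by the Euclidean length of the vector.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]

# Counts for "The sun in the sky is bright." over the vocabulary
# (blue, bright, is, sky, sun, the).
short_doc = [0, 1, 1, 1, 1, 2]
# The same text repeated ten times: every count is ten times larger.
long_doc = [x * 10 for x in short_doc]

print(l2_normalize(short_doc))
print(l2_normalize(long_doc))
```

Both calls print the same unit-length vector, so simply repeating a text no longer makes a document look more important.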

Let’s see now how idf (inverse document frequency) is defined:

![](img/tf-idf.png)

where:
   * |D| is the number of documents in the collection.
   * |{d : t in d}| is the number of documents where the term t appears, i.e. where the term-frequency function satisfies tf(t,d) != 0. We add 1 in the denominator of the formula only to avoid division by zero.

The formula for the tf-idf is then:

![](img/tf-idf2.png)

This formula has an important consequence: a high tf-idf weight is reached when you have a high term frequency (tf) in the given document (local parameter) and a low document frequency of the term in the whole collection (global parameter).
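Here is a small sketch of the definitions above, assuming the idf formula reads log(|D| / (1 + |{d : t in d}|)) and that tf is the raw count of the term in the document. Treating the four example sentences as one collection is also an assumption made just for this illustration.

```python
import math
import re

# Assumption: all four example documents form one collection D.
corpus = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]
tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]

def tf(term, tokens):
    # Raw term frequency: number of occurrences in one document.
    return tokens.count(term)

def idf(term):
    # log(|D| / (1 + number of documents containing the term))
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / (1 + df))

def tf_idf(term, tokens):
    # Local weight (tf) times global weight (idf).
    return tf(term, tokens) * idf(term)

for term in ("the", "sun", "sky", "blue"):
    print(term, round(idf(term), 4))
```

With this collection the rarer a term is, the larger its idf: "blue" (one document) scores highest, while "the" (every document) ends up slightly below zero because of the added 1 in the denominator.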

### Let’s try implementing it in Python

The first step is to create our training and testing document sets and compute the term frequency matrix:

In [29]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")


In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(train_set)
vectorizer.vocabulary_

{'the': 5, 'sky': 3, 'is': 2, 'blue': 0, 'sun': 4, 'bright': 1}

In [18]:
vectorizer.transform(train_set).toarray()

array([[0.57615236, 0.        , 0.40993715, 0.57615236, 0.        ,
        0.40993715],
       [0.        , 0.57615236, 0.40993715, 0.        , 0.57615236,
        0.40993715]])

In [10]:
vectorizer.transform(test_set).toarray()

array([[0.        , 0.42519636, 0.30253071, 0.42519636, 0.42519636,
        0.60506143],
       [0.        , 0.37729199, 0.        , 0.        , 0.75458397,
        0.53689271]])
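The numbers above do not come from the textbook formula exactly as written earlier. As far as I understand the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses a smoothed idf, ln((1 + n) / (1 + df)) + 1, multiplies it by the raw counts, and then L2-normalizes each row. The sketch below redoes that arithmetic by hand under those assumptions, using only the standard library:

```python
import math

# Columns follow the vocabulary order: blue, bright, is, sky, sun, the
train_counts = [
    [1, 0, 1, 1, 0, 1],  # "The sky is blue."
    [0, 1, 1, 0, 1, 1],  # "The sun is bright."
]
n_docs = len(train_counts)

# Document frequency of each term in the training set.
df = [sum(1 for row in train_counts if row[j] > 0) for j in range(6)]
# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(counts):
    # Weight the raw counts by idf, then scale the row to unit length.
    weighted = [c * w for c, w in zip(counts, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]

# First training document, then "The sun in the sky is bright."
print([round(x, 8) for x in tfidf_row(train_counts[0])])
print([round(x, 8) for x in tfidf_row([0, 1, 1, 1, 1, 2])])
```

The printed rows should agree with the first rows of the two arrays above, which is a handy sanity check that the idf is learned from the training set only and then reused on the test set.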

### Vectorizing a large text corpus with the hashing trick (self reading) !!

# References

   * [Feature Extraction sklearn docs](http://scikit-learn.org/dev/modules/feature_extraction.html)
   * Reference documentation for [CountVectorizer](http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), [TfidfVectorizer](http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) and [DictVectorizer](http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer).