### Prepare Text Data for Machine Learning with scikit-learn

__Text data requires special preparation before you can start using it for predictive modeling.__

The text must first be parsed into words (tokens), a step called tokenization.
The words are then encoded as integers or floating-point values that can be passed as input to a machine learning algorithm, a step called feature extraction (or vectorization).

The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction from your text data.

In this activity, you will learn how to prepare your text data for predictive modeling in Python with scikit-learn:
- How to convert text to word frequency vectors with TfidfVectorizer

### TF-IDF Explanation:



- TF (Term Frequency) measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (the total number of terms in the document) as a form of normalization:

    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

- IDF (Inverse Document Frequency) measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore weigh down the frequent terms and scale up the rare ones by computing:

    IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

Note that scikit-learn's TfidfVectorizer uses a slightly different variant by default: raw counts for TF, a smoothed IDF of log_e((1 + Total number of documents) / (1 + Number of documents with term t in it)) + 1, and L2 normalization of each document vector. Its scores therefore differ from the values these textbook formulas produce.
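
The two textbook formulas above can be sketched in plain Python. This is only an illustration: the `tf` and `idf` helper names are made up for this example, and the documents are a simplified, lowercased copy of the sentences used in this activity with punctuation removed.

```python
import math

# Simplified copies of the example sentences (assumption: already
# lowercased, punctuation removed, split on whitespace).
docs = [
    "time flies like an arrow",
    "fruit flies like a banana",
    "sam sat on the cat",
    "the cat is white",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Occurrences of `term` divided by the number of tokens in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Natural log of (number of documents / documents containing `term`).

    Raises ZeroDivisionError if no document contains `term`.
    """
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)


tf_flies = tf("flies", tokenized[0])   # 1 occurrence out of 5 tokens
idf_flies = idf("flies", tokenized)    # appears in 2 of 4 documents
idf_arrow = idf("arrow", tokenized)    # appears in 1 of 4 documents

print("TF(flies, doc 0) =", tf_flies)
print("IDF(flies)       =", idf_flies)
print("IDF(arrow)       =", idf_arrow)
print("TF-IDF(flies, doc 0) =", tf_flies * idf_flies)
```

The rarer term ("arrow") receives a larger IDF than the term shared by two documents ("flies"), which is the behaviour the IDF definition is meant to produce.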

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [23]:
data = '''Time flies like an arrow. 
Fruit flies like a banana. 
Sam sat on the cat.
The cat is a b c d e white.'''

print(data)

Time flies like an arrow. 
Fruit flies like a banana. 
Sam sat on the cat.
The cat is a b c d e white.


### Consider each sentence as a document. Split the data into sentences based on new line.

In [24]:
dataset = data.split('\n')
dataset

['Time flies like an arrow. ',
 'Fruit flies like a banana. ',
 'Sam sat on the cat.',
 'The cat is a b c d e white.']

In [25]:
type(dataset)

list

In [26]:
len(dataset)

4

#### TF-IDF


A popular method for scoring words is TF-IDF. This is an acronym that stands for "Term Frequency – Inverse Document Frequency", the two components of the score assigned to each word.

- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear in many documents.

TF-IDF scores are word frequency scores that highlight the more informative words, e.g. words that are frequent within a document but rare across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. 

Below is an example of using the TfidfVectorizer to learn the vocabulary and inverse document frequencies across 4 small documents and then encode those documents.

In [27]:
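# sklearn.feature_extraction.text.TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features
# IDF weights are calculated by TfidfTransformer's fit()
# TF-IDF scores are calculated by TfidfTransformer's transform()

# lowercase=True (the default) matches the lowercased vocabulary shown in the outputs;
# with lowercase=False, 'The' and 'the' would be counted as separate features.
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)  # optionally: stop_words='english'
tfidf = tfidf_vectorizer.fit_transform(dataset)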
print(type(tfidf))
print(tfidf)

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 1)	0.4854606118156975
  (0, 0)	0.4854606118156975
  (0, 7)	0.3827427224171519
  (0, 4)	0.3827427224171519
  (0, 12)	0.4854606118156975
  (1, 2)	0.5552826649411127
  (1, 5)	0.5552826649411127
  (1, 7)	0.43779123108611473
  (1, 4)	0.43779123108611473
  (2, 3)	0.3827427224171519
  (2, 11)	0.3827427224171519
  (2, 8)	0.4854606118156975
  (2, 10)	0.4854606118156975
  (2, 9)	0.4854606118156975
  (3, 13)	0.5552826649411127
  (3, 6)	0.5552826649411127
  (3, 3)	0.43779123108611473
  (3, 11)	0.43779123108611473


In [28]:
dataset

['Time flies like an arrow. ',
 'Fruit flies like a banana. ',
 'Sam sat on the cat.',
 'The cat is a b c d e white.']

Finally, each of the 4 documents is encoded as a 14-element sparse vector, and we can review the score assigned to each word in the vocabulary.

Each document vector is L2-normalized (scaled to unit length), so every score lies between 0 and 1, and the encoded document vectors can then be used directly with machine learning algorithms.
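
To see where a score such as 0.4854606 comes from, the first document can be recomputed by hand. This is a sketch that assumes the TfidfVectorizer defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); the `smooth_idf` helper and the hand-written document frequencies are introduced here only for illustration.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

dataset = [
    "Time flies like an arrow. ",
    "Fruit flies like a banana. ",
    "Sam sat on the cat.",
    "The cat is a b c d e white.",
]
n_docs = len(dataset)


def smooth_idf(df, n):
    """Smoothed IDF used by scikit-learn by default: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


# Document 0 is "Time flies like an arrow."; every term occurs once (TF = 1).
# Values are document frequencies counted by hand: "flies" and "like" also
# occur in document 1, the other terms occur only here.
doc0_df = {"time": 1, "flies": 2, "like": 2, "an": 1, "arrow": 1}

weights = {term: 1 * smooth_idf(df, n_docs) for term, df in doc0_df.items()}
length = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
manual = {term: w / length for term, w in weights.items()}

# Compare with what scikit-learn computes for the same document.
vectorizer = TfidfVectorizer()
row0 = vectorizer.fit_transform(dataset).toarray()[0]
for term, score in manual.items():
    sk_score = row0[vectorizer.vocabulary_[term]]
    print(f"{term:6s} manual={score:.6f} sklearn={sk_score:.6f}")
```

The terms unique to the document ("time", "an", "arrow") end up with the higher score, and the terms shared with another document ("flies", "like") with the lower one.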

#### Write tfidf as dataframe

In [29]:
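# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
feature_names = tfidf_vectorizer.get_feature_names_out().tolist()
print(len(feature_names))
print(feature_names)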

14
['an', 'arrow', 'banana', 'cat', 'flies', 'fruit', 'is', 'like', 'on', 'sam', 'sat', 'the', 'time', 'white']
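
The single-letter words "a", "b", "c", "d" and "e" from the last sentence are missing from the vocabulary. This is because the default `token_pattern` of the vectorizer only keeps tokens of two or more word characters. The sketch below shows one way to keep single-character tokens as well; the alternative pattern is an illustrative choice, not something the original activity uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["The cat is a b c d e white."]

# Default token_pattern is r"(?u)\b\w\w+\b": tokens need at least 2 characters.
default_vectorizer = TfidfVectorizer()
default_vectorizer.fit(doc)
default_vocab = sorted(default_vectorizer.vocabulary_)
print(default_vocab)

# Illustrative alternative: accept tokens of one or more word characters.
single_char_vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vectorizer.fit(doc)
single_char_vocab = sorted(single_char_vectorizer.vocabulary_)
print(single_char_vocab)
```

With the default pattern the vocabulary contains only "cat", "is", "the" and "white"; with the relaxed pattern the single letters are included too.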


In [11]:
dataset

['Time flies like an arrow. ',
 'Fruit flies like a banana. ',
 'Sam sat on the cat.',
 'The cat is a b c d e white.']

In [8]:
print(tfidf.toarray())

[[0.48546061 0.48546061 0.         0.         0.38274272 0.
  0.         0.38274272 0.         0.         0.         0.
  0.48546061 0.        ]
 [0.         0.         0.55528266 0.         0.43779123 0.55528266
  0.         0.43779123 0.         0.         0.         0.
  0.         0.        ]
 [0.         0.         0.         0.38274272 0.         0.
  0.         0.         0.48546061 0.48546061 0.48546061 0.38274272
  0.         0.        ]
 [0.         0.         0.         0.43779123 0.         0.
  0.55528266 0.         0.         0.         0.         0.43779123
  0.         0.55528266]]


In [9]:
# Column names are the feature names, ordered by feature index.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
text_df = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
text_df['text'] = dataset
text_df

| | an | arrow | banana | cat | flies | fruit | is | like | on | sam | sat | the | time | white | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.485461 | 0.485461 | 0.0 | 0.0 | 0.382743 | 0.0 | 0.0 | 0.382743 | 0.0 | 0.0 | 0.0 | 0.0 | 0.485461 | 0.0 | Time flies like an arrow. |
| 1 | 0.0 | 0.0 | 0.555283 | 0.0 | 0.437791 | 0.555283 | 0.0 | 0.437791 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Fruit flies like a banana. |
| 2 | 0.0 | 0.0 | 0.0 | 0.382743 | 0.0 | 0.0 | 0.0 | 0.0 | 0.485461 | 0.485461 | 0.485461 | 0.382743 | 0.0 | 0.0 | Sam sat on the cat. |
| 3 | 0.0 | 0.0 | 0.0 | 0.437791 | 0.0 | 0.0 | 0.555283 | 0.0 | 0.0 | 0.0 | 0.0 | 0.437791 | 0.0 | 0.555283 | The cat is a b c d e white. |



__Reference__: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html