### Why do we need TF-IDF?
* Text Analysis is a major application field for machine learning algorithms. 
* Raw data, being a sequence of symbols, cannot be fed directly to the algorithms, because most of them expect fixed-size numerical feature vectors rather than raw text documents of variable length.
* In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
    * **tokenizing** strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
    * **counting** the occurrences of tokens in each document.
    * **normalizing** and weighting with diminishing importance tokens that occur in the majority of samples / documents.
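
The tokenizing and counting steps listed above can be sketched with scikit-learn's `CountVectorizer`. This is a minimal illustration using the two-sentence corpus from later in these notes; the variable names are only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The sky is blue', 'The sky is not blue']

# Tokenize (lowercase, split into word tokens) and count occurrences.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# Each token is mapped to an integer column id.
vocabulary = dict(sorted(vectorizer.vocabulary_.items()))
count_matrix = counts.toarray()

print(vocabulary)
print(count_matrix)
```

Each row of `count_matrix` is one document (a sample) and each column is one token (a feature).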

#### What are features and samples?
* **feature** = each individual token occurrence frequency (normalized or not).
* **sample** = the vector of all the token frequencies for a given document is considered a multivariate sample.


#### What are vectorization and Bag of Words?
* We call vectorization the general process of turning a collection of text documents into numerical feature vectors. 
* This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. 
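
As a rough sketch of the "Bag of n-grams" idea, `CountVectorizer` can be asked to count pairs of adjacent words in addition to single words through its `ngram_range` parameter. The corpus and variable names below are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The sky is blue', 'The sky is not blue']

# ngram_range=(1, 2) keeps single words and adds pairs of adjacent words.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_counts = bigram_vectorizer.fit_transform(corpus)

feature_names = sorted(bigram_vectorizer.vocabulary_)
print(feature_names)
print(bigram_counts.toarray())
```

Bigrams such as 'not blue' keep a little word-order information that a plain bag of words discards.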


### TF-IDF (term frequency–inverse document frequency)

* Term Frequency–Inverse Document Frequency (tf-idf) is a method to evaluate the importance of a specific word in a document.
* It converts the textual representation of information into a Vector Space Model (VSM), that is, into sparse features.
* A VSM is an algebraic model that represents textual information as a vector. The components of this vector can represent the importance of a term (tf–idf) or simply its presence or absence (Bag of Words) in a document.
* This is a common term weighting scheme in information retrieval.
* The goal is to scale down the impact of tokens that occur very frequently in a given corpus.


In [5]:
import pandas as pd

In [6]:
corpus = [
    'The sky is blue',
    'The sky is not blue',
]


In [7]:
df = pd.DataFrame({'Text':corpus})
df

|   |Text|
|---|---|
|0|The sky is blue|
|1|The sky is not blue|


---
<img src="https://github.com/iAmKankan/MachineLearning_With_Python/blob/master/IMG2.jpg?raw=true">

#### Our dataset or corpus:

Document a='The sky is blue'     
Document b='The sky is not blue'

#### Step 1:

|   |tf(a)|tf(b)|
|---|---|---|
|the|1|1|
|sky|1|1|
|is|1|1|
|blue|1|1|
|not|0|1|
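
The raw counts in Step 1 can be reproduced with a few lines of plain Python. This sketch assumes a simple lowercase-and-split tokenizer, which is enough for this punctuation-free corpus.

```python
from collections import Counter

documents = {'a': 'The sky is blue', 'b': 'The sky is not blue'}

# Lowercase and split on whitespace, then count each term per document.
term_counts = {name: Counter(text.lower().split())
               for name, text in documents.items()}

for term in ['the', 'sky', 'is', 'blue', 'not']:
    print(term, term_counts['a'][term], term_counts['b'][term])
```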

#### Step 2:


* In a long document, term counts tend to be much higher than in a short one. Hence we normalize each count by the size of its document (the total number of terms): N(a) = tf(a) / 4 and N(b) = tf(b) / 5.

|   |tf(a)|tf(b)|N(a)|N(b)| 
|---|---|---|---|---|
|the|1|1|1/4|1/5| 
|sky|1|1|1/4|1/5| 
|is|1|1|1/4|1/5| 
|blue|1|1|1/4|1/5| 
|not|0|1|0|1/5| 
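
A possible way to compute the normalized frequencies of Step 2 is shown below. The helper name `normalized_tf` is made up for this sketch, and `Fraction` is used only so the results print as 1/4 and 1/5 like the table.

```python
from collections import Counter
from fractions import Fraction

documents = {'a': 'The sky is blue', 'b': 'The sky is not blue'}

def normalized_tf(text):
    """Return {term: count / number of tokens in the document}."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: Fraction(count, len(tokens)) for term, count in counts.items()}

tf = {name: normalized_tf(text) for name, text in documents.items()}
print(tf['a']['sky'], tf['b']['sky'])
```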

#### Step 3:


* IDF(t) = 1 + loge(total number of documents / number of documents in which term t appears), where loge is the natural logarithm.

|   |tf(a)|tf(b)|N(a)|N(b)| IDF|
|---|--|---|---|---|---|
|the|1|1|1/4|1/5|1+loge(2/2)|
|sky|1|1|1/4|1/5|1+loge(2/2)|
|is|1|1|1/4|1/5|1+loge(2/2)|
|blue|1|1|1/4|1/5|1+loge(2/2)|
|not|0|1|0|1/5|1+loge(2/1)|
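
The IDF column can be checked with the short sketch below. It implements the Step 3 formula directly; the function name `idf` is illustrative, and it assumes every term it is asked about occurs in at least one document.

```python
import math

documents = ['The sky is blue', 'The sky is not blue']
tokenized = [set(text.lower().split()) for text in documents]

def idf(term):
    """1 + ln(total documents / documents containing the term)."""
    containing = sum(1 for tokens in tokenized if term in tokens)
    return 1 + math.log(len(tokenized) / containing)

idf_values = {term: idf(term) for term in ['the', 'sky', 'is', 'blue', 'not']}
print(idf_values)
```

Terms that appear in every document get an IDF of exactly 1, while 'not', which appears in only one of the two documents, gets 1 + ln 2, roughly 1.693.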


---
#### Final Step:

* Finally, multiply the normalized term frequency by the IDF: tf * idf = N * IDF.

|   |tf(a)|tf(b)|N(a)|N(b)|IDF|tf * idf (Document a)|tf * idf (Document b)|
|---|---|---|---|---|---|---|---|
|the|1|1|1/4|1/5|1+loge(2/2)|.25|.2|
|sky|1|1|1/4|1/5|1+loge(2/2)|.25|.2|
|is|1|1|1/4|1/5|1+loge(2/2)|.25|.2|
|blue|1|1|1/4|1/5|1+loge(2/2)|.25|.2|
|not|0|1|0|1/5|1+loge(2/1)|0|.3386294|

Or, in matrix form:

|  | blue |is  |  not| sky |the|
|---|---|---|---|---|---|
|Document a|.25|.25|0|.25|.25|
|Document b|.2|.2|.3386294|.2|.2|
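
Putting the steps together, the sketch below builds the whole matrix by hand, using the normalized term frequency from Step 2 and the formula IDF = 1 + loge(N / df) from Step 3. With that formula, the entry for 'not' in Document b is 1/5 × (1 + loge 2), roughly 0.3386. The helper names are made up for this example.

```python
import math
from collections import Counter

documents = ['The sky is blue', 'The sky is not blue']
tokenized = [text.lower().split() for text in documents]
vocabulary = sorted(set(token for tokens in tokenized for token in tokens))

def tf(term, tokens):
    """Term count divided by the number of tokens in the document."""
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    """1 + ln(total documents / documents containing the term)."""
    containing = sum(1 for tokens in tokenized if term in tokens)
    return 1 + math.log(len(tokenized) / containing)

# One row per document, one column per vocabulary term.
tfidf_matrix = [[tf(term, tokens) * idf(term) for term in vocabulary]
                for tokens in tokenized]

print(vocabulary)
for row in tfidf_matrix:
    print([round(value, 7) for value in row])
```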


In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

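* We can pass custom stop words: "stop_words = [list of words we want to exclude]". Recent scikit-learn versions expect a plain list here (or the string 'english'), not a frozenset.

In [51]:
tfidf = TfidfVectorizer()  # e.g. TfidfVectorizer(stop_words=['sky', 'the'])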

In [52]:
tfidf.fit_transform(df.Text).toarray()

array([[ 0.5       ,  0.5       ,  0.        ,  0.5       ,  0.5       ],
       [ 0.4090901 ,  0.4090901 ,  0.57496187,  0.4090901 ,  0.4090901 ]])

In [53]:
tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

array(['blue', 'is', 'not', 'sky', 'the'], dtype=object)

In [54]:
tfidf.idf_

array([ 1.        ,  1.        ,  1.40546511,  1.        ,  1.        ])
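
These numbers differ from the hand-computed tables because `TfidfVectorizer` uses different defaults: it works on raw counts (no division by document length), it uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1 (`smooth_idf=True`), and it then scales each row to unit Euclidean length (`norm='l2'`). The NumPy sketch below reproduces the output above from the raw counts under those assumptions; the variable names are illustrative.

```python
import numpy as np

# Raw counts for the columns ['blue', 'is', 'not', 'sky', 'the'].
counts = np.array([[1, 1, 0, 1, 1],
                   [1, 1, 1, 1, 1]], dtype=float)

n_documents = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
smoothed_idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# Weight the raw counts by IDF, then scale each row to unit L2 norm.
weighted = counts * smoothed_idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(smoothed_idf)
print(manual_tfidf)
```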

---
#### Default Stop Words
scikit-learn's built-in English stop-word list, ENGLISH_STOP_WORDS, is a frozenset that currently contains 318 words.

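In [13]:
# The sklearn.feature_extraction.stop_words module was removed in newer releases;
# the list is now imported from sklearn.feature_extraction.text.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [14]:
print(ENGLISH_STOP_WORDS)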

frozenset({'never', 'fifteen', 'such', 'because', 'thereby', 'on', 'every', 'rather', 'amongst', 'describe', 'itself', 'cant', 'six', 'once', 'ten', 'whereas', 'themselves', 'amoungst', 'these', 'mostly', 'found', 'if', 'are', 'beforehand', 'him', 'meanwhile', 'done', 'into', 'ever', 'more', 'same', 'hers', 'towards', 'who', 'alone', 'back', 'down', 'along', 'each', 'bottom', 'due', 'from', 'less', 'someone', 'without', 'you', 'moreover', 'those', 'ourselves', 'first', 'further', 'hereby', 'fire', 'had', 'her', 'inc', 'my', 'always', 'hundred', 'seeming', 'get', 're', 'though', 'even', 'con', 'until', 'onto', 'we', 'somehow', 'there', 'sometime', 'i', 'almost', 'whose', 'that', 'yourself', 'fill', 'mill', 'where', 'a', 'seems', 'although', 'none', 'very', 'whatever', 'beyond', 'other', 'while', 'cannot', 'only', 'be', 'thin', 'amount', 'except', 'under', 'detail', 'across', 'whereafter', 'latterly', 'me', 'thus', 'anyone', 'namely', 'twenty', 'everywhere', 'un', 'might', 'not', 'otherw

---

#### Further study: 

* Tutorials:
    * https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
    * http://jonathansoma.com/lede/foundations/classes/text%20processing/tf-idf/
* Punctuation tokenizer:
    * https://sirinnes.wordpress.com/2015/01/22/custom-vectorizer-for-scikit-learn/
* Documentation:
    * http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    * http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

### Summary:
* The output of this program is meant to be fed to an algorithm (e.g. KMeans, kNN, Naive Bayes) for further processing.
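
As one possible illustration of that last step, the sketch below feeds the tf-idf matrix to KMeans. The two extra documents about cats and mice are invented for this example so that there is something to separate, and the clustering settings (`n_clusters=2`, `n_init=10`, `random_state=0`) are arbitrary choices, not part of the original notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# The last two documents are hypothetical additions for this sketch.
documents = [
    'The sky is blue',
    'The sky is not blue',
    'Cats chase mice',
    'Mice fear cats',
]

# The sparse tf-idf matrix can be passed to the estimator directly.
features = TfidfVectorizer().fit_transform(documents)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)
print(labels)
```

The two sky sentences share most of their vocabulary, so they should end up in one cluster and the two cat sentences in the other.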

---
