In [1]:
import numpy as np

In [2]:
import textblob as tb

In [3]:
print(dir(tb))

['Blobber', 'PACKAGE_DIR', 'Sentence', 'TextBlob', 'Word', 'WordList', '__all__', '__author__', '__builtins__', '__cached__', '__doc__', '__file__', '__license__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_text', 'base', 'blob', 'compat', 'decorators', 'en', 'exceptions', 'inflect', 'mixins', 'np_extractors', 'os', 'parsers', 'sentiments', 'taggers', 'tokenizers', 'translate', 'utils']


### Before sparse matrix computation

* Convert the text into lower case
* Remove all stopwords
* Apply stemming (to reduce each word to its base form)
* Check for semantics

> Lemmatization, tokenization, semantics, and stemming are all important for text preprocessing.

* Lemmatization reduces a word to its dictionary form (lemma)
    - 'go', 'went', 'gone' → 'go'
* Tokenization splits the entire text into words (tokens)
    - 'I am going' → 'I', 'am', 'going'
* Semantics captures the relationship in meaning between two or more words
    - 'tasty' & 'delicious' → both mean nearly the same
* Stemming strips suffixes to get a base form (stem), which need not be a dictionary word
    - 'studies', 'studying' → 'studi' (e.g. with the Porter stemmer)
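
The preprocessing steps listed above can be sketched in plain Python. This is only an illustration: the stopword list is a tiny hypothetical one, and `naive_stem` is a made-up suffix stripper standing in for a real stemmer, not how `textblob` or any other library does it.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"i", "am", "is", "a", "an", "the", "to", "of", "in", "and", "it"}

def naive_stem(word):
    """Crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least two characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())       # lowercase + tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
    return [naive_stem(t) for t in tokens]              # stem what is left

print(preprocess("I am going to the markets and testing the tasty dishes"))
# ['go', 'market', 'test', 'tasty', 'dish']
```

A rule-based stripper like this breaks quickly on irregular words ('went' stays 'went'), which is where lemmatization helps.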

### BoW

link - https://www.youtube.com/watch?v=IRKDrrzh4dE

* BoW → Bag of Words, used to convert a text into numerical vectors
    - Uni-grams
    - Bi-grams
    - Tri-grams
    - n-grams
    - mainly used in text pre-processing

> **Example**

* text 1 → Rome is not built in a day.
    - Here there are no repeated words, so:
    - unigrams → not, in, day, is, a, built, Rome (7)
    - bigrams → Rome is, is not, not built, built in, in a, a day (6)
    - trigrams → Rome is not, is not built, not built in, built in a, in a day (5)

* Since no word is repeated,
    - number of unigrams > number of bigrams > number of trigrams (7 > 6 > 5).

---

* text 2 → horse is a horse, of course, of course. accept it
    - Here there are repeated words
    - unigrams → horse, of, course, a, is, it, accept (7)
    - bigrams → horse is, is a, a horse, horse of, of course, course of, course accept, accept it (8)
    - trigrams → horse is a, is a horse, a horse of, horse of course, of course of, course of course, of course accept, course accept it (8)
* Words are repeated in this sentence, so the ten tokens collapse into only seven distinct unigrams. Hence
    - number of unigrams < number of bigrams = number of trigrams (7 < 8 = 8).
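
The n-gram counts for both example texts can be checked with a short sketch. The helper `ngrams` is a hypothetical function written for this note; like the lists above, it ignores punctuation and sentence boundaries and counts each distinct n-gram once.

```python
import re

def ngrams(text, n):
    """Return the distinct n-grams of `text`, in order of first appearance."""
    tokens = re.findall(r"\w+", text)  # split on punctuation and whitespace
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return list(dict.fromkeys(grams))  # drop duplicates, keep order

text_1 = "Rome is not built in a day."
text_2 = "horse is a horse, of course, of course. accept it"

for text in (text_1, text_2):
    # Number of distinct unigrams, bigrams, trigrams
    print([len(ngrams(text, n)) for n in (1, 2, 3)])
# [7, 6, 5]
# [7, 8, 8]
```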

### TF-IDF

Imagine we have `N` reviews (documents), each of which is a sequence of words:

$$r_1 \rightarrow w_1, w_2, w_3, w_2, w_5$$

and

$$r_2 \rightarrow w_1, w_3, w_4, w_5, w_6, w_2$$

then

* TF → Term Frequency

    - $$TF(w_i, r_j) = \frac{\text{count of} \ w_i \ \text{in} \ r_j}{\text{total words in} \ r_j} \implies 0 \leq TF(w_i, r_j) \leq 1$$
    
    - $TF(w_2, r_1) = \frac{2}{5}$
    - $TF(w_2, r_2) = \frac{1}{6}$

* IDF → Inverse Document Frequency

    - $$IDF(w_i, D_c) = \log \bigg[\frac{N}{n_i} \bigg]$$
    - where $N$ is the total number of documents and $n_i$ is the number of documents that contain the word $w_i$
    - Since $n_i \leq N$, we have $\frac{N}{n_i} \geq 1$, and therefore $IDF(w_i, D_c) \geq 0$
    
    - We compute the IDF of each word with respect to the entire data corpus (all documents or reviews).
    - Let $D_c$ be the corpus, where $D_c = \{r_1, r_2, r_3, \dots, r_N \}$
    - If $w_1$ occurs in $r_1$, $r_3$, and $r_6$, then
    - $IDF(w_1, D_c) = \log \big[\frac{N}{3} \big]$

---

Finally

$$\text{TFIDF}(w_i, r_j) = TF(w_i, r_j) \times IDF(w_i, D_c) = \frac{\text{count of} \ w_i \ \text{in} \ r_j}{\text{total words in} \ r_j} \times \log \bigg[\frac{\text{total docs}}{\text{number of docs containing} \ w_i} \bigg]$$

* A word gets more weight the rarer it is across the corpus (higher IDF)
* A word gets more weight the more frequent it is within a document (higher TF)
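
The TF, IDF, and TF-IDF formulas above translate almost line for line into Python. The sketch below reuses $r_1$ and $r_2$ from the example; `r3` is a hypothetical extra review added so that the IDF values are not all zero, and the function names are illustrative only. It assumes the word occurs in at least one document (otherwise $n_i = 0$ and the division fails).

```python
import math

# Toy corpus mirroring r_1 and r_2 above; r3 is a hypothetical extra review.
corpus = {
    "r1": ["w1", "w2", "w3", "w2", "w5"],
    "r2": ["w1", "w3", "w4", "w5", "w6", "w2"],
    "r3": ["w4", "w6", "w6"],
}

def tf(word, doc):
    """Term frequency: count of `word` in `doc` over total words in `doc`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency: log(N / n_i), natural logarithm."""
    n_i = sum(1 for doc in docs if word in doc)  # docs containing `word`
    return math.log(len(docs) / n_i)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = list(corpus.values())
print(tf("w2", corpus["r1"]))           # 2/5
print(tf("w2", corpus["r2"]))           # 1/6
print(idf("w2", docs))                  # log(3/2), since w2 is in r1 and r2
print(tfidf("w2", corpus["r1"], docs))  # (2/5) * log(3/2)
```

Library implementations often use smoothed or rescaled variants of this formula, so their numbers may differ from this plain version.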

### Word 2 Vec

* It is a neat way of converting a word into a vector. It preserves the semantic relationships between words and is considered one of the state-of-the-art techniques.

* The idea behind Word2Vec is pretty simple. We assume that the meaning of a word can be inferred from the company it keeps. This is analogous to the saying, "show me your friends, and I'll tell you who you are".
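
To make "the company it keeps" concrete, the sketch below lists the (word, neighbour) pairs inside a small context window, which is the kind of input a skip-gram style Word2Vec model learns from. The function `context_pairs` and the window size are assumptions made for illustration; no model is trained here.

```python
def context_pairs(tokens, window=2):
    """Pair every token with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

tokens = "rome is not built in a day".split()
for pair in context_pairs(tokens, window=1):
    print(pair)
```

Words that repeatedly show up with the same neighbours end up with similar vectors after training.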

> Helpful link → [here](https://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.Xr_Dp0QzY_4)

> Simulation → https://projector.tensorflow.org/

### Sentence 2 Vec

`Word 2 Vec` returns a vector representation of a single word. Unlike TF-IDF and bag of words, it does not directly give the vector form of a sentence or a document.

To obtain one, we use techniques such as:

* Avg Word 2 Vec → simply take the average of the vectors obtained from the `Word 2 Vec` method
* TF-IDF Word 2 Vec → take a weighted average of the `Word 2 Vec` vectors, using each word's TF-IDF score as its weight

These approaches often work well, but not always.
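
Both averaging schemes can be sketched in a few lines of plain Python. The 3-dimensional word vectors and the TF-IDF weights below are made-up numbers for illustration; in practice the vectors would come from a trained Word2Vec model and the weights from a TF-IDF computation. The sketch assumes at least one token of the sentence has a known vector.

```python
# Hypothetical 3-dimensional word vectors; a trained Word2Vec model would
# supply real ones with many more dimensions.
word_vectors = {
    "tasty": [0.9, 0.1, 0.0],
    "food":  [0.2, 0.8, 0.1],
    "here":  [0.0, 0.1, 0.9],
}
# Hypothetical TF-IDF weights of the same words within one sentence.
tfidf_weights = {"tasty": 0.6, "food": 0.3, "here": 0.1}

def avg_word2vec(tokens, vectors):
    """Plain average of the vectors of all tokens that have a vector."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def tfidf_word2vec(tokens, vectors, weights):
    """Average of the word vectors, weighted by each word's TF-IDF score."""
    known = [t for t in tokens if t in vectors and t in weights]
    total = sum(weights[t] for t in known)  # normalising constant
    size = len(vectors[known[0]])
    return [sum(weights[t] * vectors[t][d] for t in known) / total
            for d in range(size)]

sentence = ["tasty", "food", "here"]
print(avg_word2vec(sentence, word_vectors))
print(tfidf_word2vec(sentence, word_vectors, tfidf_weights))
```

With these weights, the weighted sentence vector is pulled towards 'tasty', the word with the highest TF-IDF score, whereas the plain average treats all three words equally.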