<h1> Text Representation </h1>

<h2>One Hot Vector</h2>

When we process text, we cannot feed words directly into a model for calculation or evaluation. So the first thing we have to do is encode the text as a numeric representation. The most popular representation for text is a "vector".

Before we go any further, let me introduce the easiest way to represent text as a vector: the "one-hot vector".

In [30]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
vocab = np.array(['animal','bird','cow','dolphin'])
onehot_encoder = OneHotEncoder(categories='auto')
one_hot = onehot_encoder.fit_transform(vocab.reshape(-1,1))
print(one_hot.toarray())
test_doc = np.array(['animal','dolphin'])
test_onehot = onehot_encoder.transform(test_doc.reshape(-1,1)).toarray()
print(test_onehot)
onehot_encoder.inverse_transform(test_onehot)

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
[[1. 0. 0. 0.]
 [0. 0. 0. 1.]]


array([['animal'],
       ['dolphin']], dtype='<U7')

These one-hot vectors can be summed to combine the words of a sentence into a single vector. That will be important for the CBOW algorithm in Word2Vec. (One-hot vectors are also used in the skip-gram algorithm in Word2Vec.) We'll talk about this further down in this notebook.
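
The summation described above can be sketched roughly as follows. This is a minimal illustration using plain NumPy rather than the encoder from the cell above; the helper names `one_hot` and `sentence_vector` are hypothetical and only serve the example.

```python
import numpy as np

# Toy vocabulary (same words as in the cell above)
vocab = ['animal', 'bird', 'cow', 'dolphin']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of a single vocabulary word."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

def sentence_vector(words):
    """Sum the one-hot vectors of all words into one count vector."""
    return np.sum([one_hot(w) for w in words], axis=0)

sentence = ['animal', 'dolphin', 'animal']
combined = sentence_vector(sentence)
print(combined)  # one entry per vocabulary word, holding its count
```

Each position of the combined vector holds how many times the corresponding vocabulary word occurs in the sentence, so word order is lost.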

<h2> TF-IDF Vectorizer </h2>
When we move on to something more sophisticated, the first thing to consider about how a word affects its context is "frequency". Word frequency therefore plays an essential role in this type of vector. Let's take a closer look at each technical term behind it.
<h3> TF : Term Frequency </h3>
The TF value can be calculated directly from the meaning of the term: it is the frequency of a specific word in a specific document. (I'll show an example in the code below.) 
For now, just note the formula below.**

$$ TF(w,d) = \frac{\text{Number of occurrences of word } w \text{ in document } d}{\text{Total number of words in document } d}$$
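
As a rough illustration of the TF formula above, here is a minimal sketch in plain Python. It assumes whitespace tokenization, and the function name `term_frequency` is hypothetical.

```python
from collections import Counter

def term_frequency(word, document):
    """TF(w, d): occurrences of `word` divided by the number of tokens in `document`."""
    tokens = document.split()          # assumption: tokens are separated by whitespace
    counts = Counter(tokens)
    return counts[word] / len(tokens)

doc = 'what is up this is the third sentence'
tf_is = term_frequency('is', doc)      # 'is' occurs 2 times among 8 tokens
print(tf_is)
```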

<h3>IDF : Inverse Document Frequency</h3>
The IDF value can also be described as specificity: the higher the IDF value, the more unique the word. This value also helps to filter out stopwords. As with TF, there are various ways to calculate the IDF value**. This is one of them.

$$ IDF(w) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing word } w}\right) $$

**Both TF and IDF can be calculated in various ways; see more at https://en.wikipedia.org/wiki/Tf%E2%80%93idf 
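
The IDF formula above can be sketched like this. It is a minimal example assuming whitespace tokenization and the natural logarithm; the function name `inverse_document_frequency` and the toy corpus are hypothetical. Note that the formula is undefined for a word that appears in no document, so the sketch does not handle that case.

```python
import math

toy_corpus = [
    'hello this is the first sentence',
    'hi there this is the second sentence',
    'what is up this is the third sentence',
    'yo finally we are in the last sentence',
]

def inverse_document_frequency(word, documents):
    """IDF(w): log of (number of documents / number of documents containing `word`)."""
    n_containing = sum(1 for doc in documents if word in doc.split())
    return math.log(len(documents) / n_containing)

idf_the = inverse_document_frequency('the', toy_corpus)      # appears in every document
idf_hello = inverse_document_frequency('hello', toy_corpus)  # appears in one document
print(idf_the, idf_hello)
```

A word that occurs in every document, such as "the", gets an IDF of 0, which is why this value pushes stopwords toward zero weight. By default, sklearn's TfidfVectorizer uses a smoothed IDF variant and L2-normalizes each document vector, so its numbers will differ from this plain formula.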

Fortunately, we don't have to do all of that ourselves. sklearn provides a TfidfVectorizer class that is easy to use. 
Here is an example.

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'hello this is the first sentence',
    'hi there this is the second sentence',
    'what is up this is the third sentence',
    'yo finally we are in the last sentence'
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print("Feature Names : ")
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print("Vector for the first document : ")
print(X[0].toarray())
print("Size of Transformed vector: ")
print(X.shape)
print("The vector that transform from a new sentence:")
print(vectorizer.transform(['hello there we are in this together']))

Feature Names : 
['are', 'finally', 'first', 'hello', 'hi', 'in', 'is', 'last', 'second', 'sentence', 'the', 'there', 'third', 'this', 'up', 'we', 'what', 'yo']
Vector for the first document : 
[[0.         0.         0.54558875 0.54558875 0.         0.
  0.34824223 0.         0.         0.28471084 0.28471084 0.
  0.         0.34824223 0.         0.         0.         0.        ]]
Size of Transformed vector: 
(4, 18)
The vector that transform from a new sentence:
  (0, 15)	0.43003651715871155
  (0, 13)	0.27448673838643983
  (0, 11)	0.43003651715871155
  (0, 5)	0.43003651715871155
  (0, 3)	0.43003651715871155
  (0, 0)	0.43003651715871155


<h4> Interesting Fact !! </h4>

 I've just realized that we can do some preprocessing (such as cleaning text) before TfidfVectorizer tokenizes the words, by passing the "preprocessor" parameter when we construct TfidfVectorizer.
 You can read more at this link: 
 
 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
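
A minimal sketch of the "preprocessor" parameter might look like the following. The cleaning function `clean_text` and the two sample documents are hypothetical. Passing a custom preprocessor replaces the built-in preprocessing step (including lowercasing), so the function lowercases the text itself.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text):
    """Hypothetical cleaning step: lowercase, then replace anything that is not a letter with a space."""
    return re.sub(r'[^a-z\s]', ' ', text.lower())

vectorizer = TfidfVectorizer(preprocessor=clean_text)
X = vectorizer.fit_transform(['Hello, World!! 123', 'HELLO world, again?'])

# vocabulary_ maps each learned token to its column index
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X.shape)
```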

<h2>Word2Vec</h2>

Now we are going to use something more popular for word embeddings: Word2Vec. Word2Vec trains a neural network on a corpus. It has 2 algorithms for training the vectors: Skip-gram and CBOW.

<h3>Skip-gram</h3>

Skip-gram is the algorithm that uses a word to predict its context. The skip-gram model learns well even from a small amount of data and handles rare words well, but it takes longer to train than CBOW.
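
To make "use the word to predict the context" concrete, here is a minimal sketch of how (center word, context word) training pairs could be generated. The helper `skipgram_pairs` is hypothetical and uses a fixed window; real implementations such as gensim add details (for example subsampling and negative sampling) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                       # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = 'programmer can code like a magician'.split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs[:4]:
    print(center, '->', context)
```

Each pair is one training example: the model receives the center word as input and is trained to give the context word a high probability.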

<h3>CBOW : Continuous Bag of Words</h3>

CBOW is the algorithm that uses the context to predict a word. This model trains faster than Skip-gram, and its accuracy is slightly better than Skip-gram's for frequent words.
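
The CBOW direction can be sketched the same way: each training example pairs a list of context words with the target word to predict. The helper `cbow_examples` is hypothetical and uses a fixed window, which simplifies what real implementations do.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples using up to `window` words on each side."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:                          # skip one-word sentences
            examples.append((context, target))
    return examples

tokens = 'researcher can invent a new thing'.split()
examples = cbow_examples(tokens, window=1)
for context, target in examples[:3]:
    print(context, '->', target)
```

In CBOW the context word vectors are combined (summed or averaged) into a single input, so one example covers the whole window at once.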

In [14]:
import multiprocessing
from gensim.models import Word2Vec
cores = multiprocessing.cpu_count()
print("CPU cores : " + str(cores))
corpus = [
    'hello everyone i am a researcher',
    'hi guys i want to be a programmer',
    'programmer can code like a magician',
    'researcher can invent a new thing',
    'both researcher and programmer help our world to grow',
    'programmer and research have to learn a lot'
]
corpus = [x.split() for x in corpus]
# This model will use the skip-gram algorithm
w2v_model = Word2Vec(min_count=1, # Ignore words whose total frequency is lower than this
                     window=5, # Maximum distance between the current word and a context word
                     vector_size=5, # Dimensionality of the word vectors (this parameter is named "size" in gensim < 4.0)
                     sample=6e-5,  # Threshold for downsampling high-frequency words
                     alpha=0.03, # The initial learning rate
                     min_alpha=0.0007, # Learning rate will linearly drop to min_alpha
                     negative=20, # If > 0, negative sampling is used (how many noise words should be drawn)
                     workers=cores-1, # Number of worker threads used to train the model
                     sg=1)  # 1 selects skip-gram, 0 selects CBOW
w2v_model.build_vocab(corpus,progress_per=10)
w2v_model.train(corpus,
                total_examples=w2v_model.corpus_count,
                epochs=10,
                report_delay=1)
w2v_model.init_sims(replace=True) # Precompute L2-normalized word vectors to save memory, since we will not train any further (deprecated in gensim 4.0)
w2v_model.wv.most_similar(positive=['programmer'])

CPU cores : 4


[('have', 0.9511440992355347),
 ('am', 0.9339777231216431),
 ('help', 0.6450687646865845),
 ('lot', 0.4974794089794159),
 ('guys', 0.4706776738166809),
 ('be', 0.42309653759002686),
 ('our', 0.36935853958129883),
 ('thing', 0.33716315031051636),
 ('hi', 0.27980124950408936),
 ('invent', 0.2632341980934143)]

In the Word2Vec model above, if you want to switch to the CBOW algorithm, just set the parameter sg=0 or omit it (CBOW is the default). 

<h2>GloVe</h2>
This vector was invented by the Stanford NLP research group. You don't actually have to implement it all yourself, but it's good to know how it works. GloVe stands for "Global Vectors". It is a count-based vector that stores word co-occurrence frequencies in a V×V matrix, where V is the size of our vocabulary.
$X_{ij}$ represents the frequency (sometimes called "point"**) of the word $w_i$ occurring with context $w_j$.
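
A minimal sketch of how such a V×V co-occurrence matrix could be built is shown below. The helper `cooccurrence_matrix` is hypothetical and counts every pair inside a fixed window equally. This is a simplification: the original GloVe implementation down-weights distant pairs by 1/distance.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count how often each word appears within `window` positions of each other word."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    X[index[word], index[sentence[j]]] += 1
    return vocab, X

sentences = [s.split() for s in ['researcher can invent', 'programmer can code']]
vocab, X = cooccurrence_matrix(sentences, window=1)
print(vocab)
print(X)
```

With a symmetric window the matrix is symmetric, and most entries are zero even for this tiny corpus.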

To reduce the effect of noisy words, we take the logarithm of every value in the matrix X, and then look for word vectors $w$ and $\bar{w}$ that satisfy this condition.

$$
                                   w_i \cdot \bar{w_j} + b_i + \bar{b_j} = \log{X_{ij}}
$$

where $b_i$ and $\bar{b_j}$ are the bias terms for $w_i$ and $\bar{w_j}$ respectively.

So we can write the cost function like this: 

$$
                                    J = \sum_{i,j} G(X_{ij})(w_i \cdot \bar{w_j} + b_i + \bar{b_j} - \log{X_{ij}})^2
$$
where $G(X_{ij})$ is a weighting function defined as

$$
                                    G(x) = 
\begin{cases}
      (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\
      1 & \text{otherwise}
\end{cases}
$$

Finally, we minimize the cost by adjusting the parameters with stochastic gradient descent (the original GloVe implementation uses AdaGrad, a variant of it).
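
The weighting function, the squared-error cost, and the gradient updates can be put together in a small NumPy sketch. Everything here is an illustrative assumption rather than the reference implementation: the function names, the tiny matrix `X`, the learning rate, and plain per-entry SGD are made up for the example, while `x_max=100` and `alpha=0.75` are the values suggested in the GloVe paper. Entries with $X_{ij} = 0$ are skipped, since $\log 0$ is undefined and $G(0) = 0$ anyway.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function G(x): grows with x and is capped at 1 from x_max on."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost(X, W, C, b, d):
    """Weighted squared error J, summed over the non-zero entries of X."""
    total = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ C[j] + b[i] + d[j] - np.log(X[i, j])
        total += weight(X[i, j]) * diff ** 2
    return total

def train_glove(X, dim=2, lr=0.05, epochs=200, seed=0):
    """Plain SGD on J: one parameter update per non-zero entry of X."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W = rng.normal(scale=0.1, size=(V, dim))   # word vectors w_i
    C = rng.normal(scale=0.1, size=(V, dim))   # context vectors (w-bar_j)
    b = np.zeros(V)                            # word biases b_i
    d = np.zeros(V)                            # context biases (b-bar_j)
    pairs = list(zip(*np.nonzero(X)))
    history = [glove_cost(X, W, C, b, d)]      # cost before training
    for _ in range(epochs):
        for k in rng.permutation(len(pairs)):  # visit entries in random order
            i, j = pairs[k]
            diff = W[i] @ C[j] + b[i] + d[j] - np.log(X[i, j])
            grad = 2.0 * weight(X[i, j]) * diff
            w_old = W[i].copy()                # keep w_i before it is updated
            W[i] -= lr * grad * C[j]
            C[j] -= lr * grad * w_old
            b[i] -= lr * grad
            d[j] -= lr * grad
        history.append(glove_cost(X, W, C, b, d))
    return W, C, b, d, history

# Hypothetical 3-word co-occurrence matrix
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
W, C, b, d, history = train_glove(X)
print(history[0], history[-1])  # cost before and after training
```

In the GloVe paper the final word embedding is taken as the sum of the word and context vectors, which here would be `W + C`.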

In [16]:
from glove import Corpus, Glove

ImportError: cannot import name 'Corpus' from 'glove' (//anaconda3/lib/python3.7/site-packages/glove/__init__.py)

In [17]:
!pip install glove_python

Collecting glove_python
  Using cached https://files.pythonhosted.org/packages/3e/79/7e7e548dd9dcb741935d031117f4bed133276c2a047aadad42f1552d1771/glove_python-0.1.0.tar.gz
Building wheels for collected packages: glove-python
  Building wheel for glove-python (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: //anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/j1/_14trtl10y73wf576sb4l3bm4584lb/T/pip-install-n265nbqf/glove-python/setup.py'"'"'; __file__='"'"'/private/var/folders/j1/_14trtl10y73wf576sb4l3bm4584lb/T/pip-install-n265nbqf/glove-python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/j1/_14trtl10y73wf576sb4l3bm4584lb/T/pip-wheel-5hwqcszm --python-tag cp37
       cwd: /private/var/folders/j1/_14trtl10y73wf576sb4l3bm4584lb/T/p