<h1> Text Representation </h1>

<h2>One Hot Vector</h2>

When we deal with text processing, we cannot feed words directly into a model for calculation or evaluation. So the first thing we have to do is encode the text into a numeric representation. The most popular representation for text is a "vector".

Before we go any further, I would like to show you the easiest way to represent text as a vector: the "one-hot vector."

In [30]:
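from sklearn.preprocessing import OneHotEncoder
import numpy as np

# The vocabulary: each word gets its own position in the vector
vocab = np.array(['animal','bird','cow','dolphin'])
onehot_encoder = OneHotEncoder(categories='auto')
# The encoder expects a 2-D array, hence reshape(-1, 1)
one_hot = onehot_encoder.fit_transform(vocab.reshape(-1,1))
print(one_hot.toarray())

# Encode a new document with the fitted encoder
test_doc = np.array(['animal','dolphin'])
test_onehot = onehot_encoder.transform(test_doc.reshape(-1,1)).toarray()
print(test_onehot)
# Map the one-hot vectors back to the original words
onehot_encoder.inverse_transform(test_onehot)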

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
[[1. 0. 0. 0.]
 [0. 0. 0. 1.]]


array([['animal'],
       ['dolphin']], dtype='<U7')

These one-hot vectors can be summed to combine the words of a sentence into a single vector. That will be important in the CBOW algorithm in Word2Vec. (One-hot vectors are also used in the skip-gram algorithm in Word2Vec.) We'll talk about this at the bottom of this notebook.
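
To make the summation concrete, here is a minimal sketch that builds the one-hot vectors by hand with NumPy and adds them up. The `one_hot` helper and the example sentence are my own illustration, not part of sklearn.

```python
import numpy as np

vocab = ['animal', 'bird', 'cow', 'dolphin']
# Position of each word in the vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Hypothetical helper: return the one-hot vector for a vocabulary word."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Summing the one-hot vectors gives one count per vocabulary word
sentence = ['animal', 'dolphin', 'animal']
sentence_vector = np.sum([one_hot(w) for w in sentence], axis=0)
print(sentence_vector)  # [2. 0. 0. 1.]
```

Note that the summed vector keeps how often each word occurs but loses the word order.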

<h2> TF-IDF Vectorizer </h2>
When we want something more expressive, the first thing to consider about how a word affects the context is its "frequency". So word frequency plays an essential role in this type of vector. Let's take a close look at each technical term behind it.
<h3> TF : Term Frequency </h3>
The TF value can be calculated directly from the meaning of the term: it is the frequency of a specific word in a specific document. (I'll show an example in the code below.)
For now, just know the formula below.**

$$ TF(w,d) = \frac{\text{Number of occurrences of word } w \text{ in document } d}{\text{Total number of words in document } d}$$
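
As a quick sanity check of the formula, here is a small sketch in plain Python. It assumes the document is tokenized by splitting on whitespace, and `term_frequency` is a name I made up for illustration.

```python
from collections import Counter

def term_frequency(word, document):
    """TF(w, d): occurrences of `word` divided by the total number of words."""
    tokens = document.split()  # assumption: whitespace tokenization
    return Counter(tokens)[word] / len(tokens)

doc = 'what is up this is the third sentence'
tf_is = term_frequency('is', doc)        # 'is' appears 2 times out of 8 words
tf_third = term_frequency('third', doc)  # 'third' appears 1 time out of 8 words
print(tf_is)     # 0.25
print(tf_third)  # 0.125
```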

<h3>IDF : Inverse Document Frequency</h3>
The IDF value can be thought of as specificity: the higher the IDF value, the more distinctive the word. It also helps suppress stopwords, because a word that appears in nearly every document gets an IDF close to zero. As with TF, there are various ways to calculate IDF**, and this is one of them.

$$ IDF(w) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing word } w}\right) $$

**Both TF and IDF can be calculated in various ways; see more at https://en.wikipedia.org/wiki/Tf%E2%80%93idf
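
Here is a small sketch of this particular IDF variant on a four-sentence toy corpus. It assumes whitespace tokenization, the natural logarithm, and that the word occurs in at least one document (otherwise we would divide by zero). `inverse_document_frequency` is a hypothetical helper name.

```python
import math

corpus = [
    'hello this is the first sentence',
    'hi there this is the second sentence',
    'what is up this is the third sentence',
    'yo finally we are in the last sentence',
]

def inverse_document_frequency(word, documents):
    """IDF(w): log of (number of documents / documents containing `word`)."""
    # Assumes `word` occurs in at least one document
    n_containing = sum(1 for d in documents if word in d.split())
    return math.log(len(documents) / n_containing)

idf_the = inverse_document_frequency('the', corpus)      # in all 4 documents
idf_this = inverse_document_frequency('this', corpus)    # in 3 of 4 documents
idf_hello = inverse_document_frequency('hello', corpus)  # in 1 of 4 documents
print(idf_the, idf_this, idf_hello)
```

A word like "the" that occurs in every document gets an IDF of exactly 0, while a rare word like "hello" gets the largest value, log(4).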

Fortunately, we don't have to do all of that ourselves. sklearn provides a class, TfidfVectorizer, that is easy to use.
Here is an example.

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'hello this is the first sentence',
    'hi there this is the second sentence',
    'what is up this is the third sentence',
    'yo finally we are in the last sentence'
]

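# Learn the vocabulary and IDF weights, then transform the corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print("Feature Names : ")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print("Vector for the first document : ")
print(X[0].toarray())
print("Size of Transformed vector: ")
print(X.shape)
print("The vector that transform from a new sentence:")
# Words that are not in the learned vocabulary (e.g. 'together') are ignored
print(vectorizer.transform(['hello there we are in this together']))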

Feature Names : 
['are', 'finally', 'first', 'hello', 'hi', 'in', 'is', 'last', 'second', 'sentence', 'the', 'there', 'third', 'this', 'up', 'we', 'what', 'yo']
Vector for the first document : 
[[0.         0.         0.54558875 0.54558875 0.         0.
  0.34824223 0.         0.         0.28471084 0.28471084 0.
  0.         0.34824223 0.         0.         0.         0.        ]]
Size of Transformed vector: 
(4, 18)
The vector that transform from a new sentence:
  (0, 15)	0.43003651715871155
  (0, 13)	0.27448673838643983
  (0, 11)	0.43003651715871155
  (0, 5)	0.43003651715871155
  (0, 3)	0.43003651715871155
  (0, 0)	0.43003651715871155
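
You may notice that these numbers do not match the plain TF and IDF formulas from earlier. As far as I understand the defaults, TfidfVectorizer uses the raw count as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each document vector. The sketch below reproduces the non-zero values of the first document by hand under those assumptions. It tokenizes with `split()`, which happens to agree with the default tokenizer on this corpus because there are no one-letter words or punctuation.

```python
import math
from collections import Counter

corpus = [
    'hello this is the first sentence',
    'hi there this is the second sentence',
    'what is up this is the third sentence',
    'yo finally we are in the last sentence',
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
df = Counter(word for tokens in tokenized for word in set(tokens))

def smoothed_idf(word):
    # Assumed sklearn default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[word])) + 1

# Raw term counts of the first document times the smoothed IDF
counts = Counter(tokenized[0])
weights = {w: c * smoothed_idf(w) for w, c in counts.items()}

# L2-normalize so the document vector has length 1
norm = math.sqrt(sum(v * v for v in weights.values()))
tfidf = {w: v / norm for w, v in weights.items()}

for word in sorted(tfidf):
    print(word, round(tfidf[word], 8))
```

The rare words "first" and "hello" end up with the largest weights, and "sentence" and "the", which occur in every document, get the smallest.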


<h4> Interesting Fact !! </h4>

 I've just realized that we can do some preprocessing (such as cleaning the text) before TfidfVectorizer tokenizes the words, by passing the "preprocessor" argument when we construct the TfidfVectorizer.
 You can read more at this link:
 
 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html