## Text Encoding

Machines do not understand characters, words, or sentences; they only process numbers. Text data must therefore be encoded as numbers before a machine can take it as input or produce it as output.

**Text encoding is the process of converting meaningful text into a numeric or vector representation that preserves the context and relationships between words and sentences, so that a machine can detect patterns in the text and work out the context of a sentence.**


There are many methods for converting text into numerical vectors, including:
- Index-based encoding
- Bag of words
- TF-IDF encoding
- Word2Vec encoding
- BERT encoding

**Document Corpus:** The whole set of text we have, i.e., the text corpus.

**Data Corpus:** The collection of unique words in our document corpus (the vocabulary).

## Index-based encoding
- As the name suggests, we assign an index to every unique word in our data corpus.

- Every sentence is padded (with 0) to the length of the longest sentence so that all inputs to our model have the same length.

In [30]:
document_corpus = ["this is a good phone phone", 
                    "this is a bad mobile mobile",
                    "she is a good good cat", 
                    "he has a bad temper temper", 
                    "this mobile phone phone is not good good"]

In [31]:
data_corpus = set()
for sentence in document_corpus:
    for word in sentence.split():
        data_corpus.add(word)

data_corpus = sorted(data_corpus)
print(data_corpus)

['a', 'bad', 'cat', 'good', 'has', 'he', 'is', 'mobile', 'not', 'phone', 'she', 'temper', 'this']


In [43]:
print(len(data_corpus))

13


In [10]:
# get the maximum sentence length (in words) for padding
res = max(len(sentence.split()) for sentence in document_corpus)
print(res)

8


In [11]:
index_based_encoding = []
for sentence in document_corpus:
    sentence_encoding = []
    split = sentence.split()
    for i in range(res):
        if i < len(split):
            # indices start at 1 so that 0 stays reserved for padding
            sentence_encoding.append(data_corpus.index(split[i]) + 1)
        else:
            # pad short sentences with 0 up to the maximum length
            sentence_encoding.append(0)
    index_based_encoding.append(sentence_encoding)

print(index_based_encoding)

[[13, 7, 1, 4, 10, 10, 0, 0], [13, 7, 1, 2, 8, 8, 0, 0], [11, 7, 1, 4, 4, 3, 0, 0], [6, 5, 1, 2, 12, 12, 0, 0], [13, 8, 10, 10, 7, 9, 4, 4]]
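
The `data_corpus.index(...)` call above scans the list once per word. As a minimal sketch (the names `word_to_index` and `encode_sentence` are illustrative and not part of the notebook), a dictionary built once gives the same encoding with constant-time lookups, still reserving 0 for padding:

```python
document_corpus = ["this is a good phone phone",
                   "this is a bad mobile mobile",
                   "she is a good good cat",
                   "he has a bad temper temper",
                   "this mobile phone phone is not good good"]

# Sorted vocabulary, then a word -> index map starting at 1 (0 is the pad value)
vocabulary = sorted({word for sentence in document_corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary, start=1)}
max_len = max(len(sentence.split()) for sentence in document_corpus)


def encode_sentence(sentence, max_len):
    """Map each word to its index and right-pad with zeros to max_len."""
    ids = [word_to_index[word] for word in sentence.split()]
    return ids + [0] * (max_len - len(ids))


encoded = [encode_sentence(sentence, max_len) for sentence in document_corpus]
print(encoded)
```

This sketch assumes every word to be encoded is already in the vocabulary; an unseen word would raise a `KeyError`.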


## Bag of Words (BoW)
- BoW is another form of encoding in which each sentence is represented by a vector over the whole data corpus, with one position per unique word.

- There are two kinds of BoW:

**Binary BoW:** Encodes 1 or 0 for each word depending on whether it appears in the sentence, without considering how often the word appears in that sentence.

**BoW:** Also takes into account the frequency of each word in the sentence.

**BoW completely discards the sequence information of our sentences.**
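
To make the loss of sequence information concrete, here is a small sketch (the helper `bow_vector` and the two example sentences are made up for illustration) in which two sentences with opposite meanings receive the same count vector:

```python
from collections import Counter


def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


vocabulary = ["bad", "good", "is", "not", "phone", "this"]
first = bow_vector("this phone is good not bad", vocabulary)
second = bow_vector("this phone is bad not good", vocabulary)

print(first)
print(second)
print(first == second)  # both sentences contain the same words, only the order differs
```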

In [12]:
# Binary BoW: 1 if the word occurs in the sentence, 0 otherwise
binary_bow_encoding = []
for sentence in document_corpus:
    sentence_encoding = []
    split = sentence.split()
    for word in data_corpus:
        if word in split:
            sentence_encoding.append(1)
        else:
            sentence_encoding.append(0)
    
    binary_bow_encoding.append(sentence_encoding)

print(binary_bow_encoding)

[[1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1], [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0], [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1]]


In [13]:
# list.count returns how many times a value occurs in the list
a = [1, 2, 3, 1, 3, 4, 2, 1]
a.count(1)

3

In [14]:
# BoW: number of times each word occurs in the sentence
bow_encoding = []
for sentence in document_corpus:
    sentence_encoding = []
    split = sentence.split()
    for word in data_corpus:
        count = split.count(word)
        sentence_encoding.append(count)
    
    bow_encoding.append(sentence_encoding)

print(bow_encoding)

[[1, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 1], [1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1], [1, 0, 1, 2, 0, 0, 1, 0, 0, 0, 1, 0, 0], [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0], [0, 0, 0, 2, 0, 0, 1, 1, 1, 2, 0, 0, 1]]


## TF-IDF Encoding
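- TF-IDF stands for term frequency-inverse document frequency.
- Every word gets a weight that reflects its relative frequency in the current sentence and across the whole document corpus.

**Term Frequency:** The number of occurrences of the current word in the current sentence divided by the total number of words in that sentence.

**Inverse Document Frequency:** The log of a corpus size divided by the number of sentences containing the current word. The standard definition uses the total number of sentences (documents) as that corpus size; the code in this section instead uses the number of unique words in the data corpus and adds 1 inside the log.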
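
Written as formulas, with $w$ a word, $s$ a sentence, $V$ the data corpus and $\mathrm{df}(w)$ the number of sentences that contain $w$ (these symbols are introduced here for illustration), the quantities computed in the cells of this section are:

$$
\mathrm{tf}(w, s) = \frac{\mathrm{count}(w, s)}{|s|}, \qquad
\mathrm{idf}(w) = \log\left(\frac{|V|}{\mathrm{df}(w)} + 1\right), \qquad
\text{tf-idf}(w, s) = \mathrm{tf}(w, s) \cdot \mathrm{idf}(w)
$$

For example, "phone" occurs twice among the six words of the first sentence and appears in two sentences, so $\mathrm{tf} = 2/6$, $\mathrm{idf} = \log(13/2 + 1) \approx 2.0149$, and the product is approximately $0.6716$.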




In [32]:
# raw word counts per sentence, keyed by sentence index
tf_dict = {}
for i, sentence in enumerate(document_corpus):
    split = sentence.split()
    sentence_dict = {}
    for word in split:
        if word not in sentence_dict:
            sentence_dict[word] = split.count(word)
    tf_dict[i] = sentence_dict

print(tf_dict)

{0: {'this': 1, 'is': 1, 'a': 1, 'good': 1, 'phone': 2}, 1: {'this': 1, 'is': 1, 'a': 1, 'bad': 1, 'mobile': 2}, 2: {'she': 1, 'is': 1, 'a': 1, 'good': 2, 'cat': 1}, 3: {'he': 1, 'has': 1, 'a': 1, 'bad': 1, 'temper': 2}, 4: {'this': 1, 'mobile': 1, 'phone': 2, 'is': 1, 'not': 1, 'good': 2}}


In [60]:
import math 
def calculate_tf(word, sentence_num):
    row_dict = tf_dict[int(sentence_num)]
    return row_dict[word] / sum(row_dict.values())

In [61]:
calculate_tf("phone", 0)

0.3333333333333333

In [62]:
def calculate_idf(word):
    # number of sentences that contain the word
    doc_num = 0
    for value in tf_dict.values():
        if word in value:
            doc_num += 1
    # this notebook uses the vocabulary size, len(data_corpus), as the
    # numerator and adds 1 inside the log
    return math.log(len(data_corpus) / doc_num + 1)

In [63]:
calculate_idf("phone")

2.0149030205422647

In [64]:
def tf_idf(word, sentence_num):
    tf = calculate_tf(word, sentence_num)
    idf = calculate_idf(word)
    return tf * idf

In [65]:
tf_idf('phone', 0)

0.6716343401807549

In [20]:
tf_idf_encoding = []
for i in range(len(document_corpus)):
    sentence = document_corpus[i]
    split = sentence.split()
    sentence_encoding = []
    for word in data_corpus:
        if word in split:
            sentence_encoding.append(tf_idf(word, i))
        else:
            sentence_encoding.append(0)
    tf_idf_encoding.append(sentence_encoding)

print(tf_idf_encoding)


[[0.24115, 0, 0, 0.279, 0, 0, 0.24115, 0, 0, 0.67163, 0, 0, 0.279], [0.24115, 0.33582, 0, 0, 0, 0, 0.24115, 0.67163, 0, 0, 0, 0, 0.279], [0.24115, 0, 0.43984, 0.55799, 0, 0, 0.24115, 0, 0, 0, 0.43984, 0, 0], [0.24115, 0.33582, 0, 0, 0.43984, 0.43984, 0, 0, 0, 0, 0, 0.87969, 0], [0, 0, 0, 0.41849, 0, 0, 0.18086, 0.25186, 0.32988, 0.50373, 0, 0, 0.20925]]


In [22]:
len(tf_idf_encoding[0])

13

### Python Library Implementation

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(document_corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['bad' 'cat' 'good' 'has' 'he' 'is' 'mobile' 'not' 'phone' 'she' 'temper'
 'this']
[[0 0 1 0 0 1 0 0 2 0 0 1]
 [1 0 0 0 0 1 2 0 0 0 0 1]
 [0 1 2 0 0 1 0 0 0 1 0 0]
 [1 0 0 1 1 0 0 0 0 0 2 0]
 [0 0 2 0 0 1 1 1 2 0 0 1]]
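
Note that `'a'` is missing from the feature names above, so the vectors have 12 entries rather than 13. By default `CountVectorizer` lowercases the text and tokenizes it with the regular expression `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. A minimal sketch with the standard `re` module shows the effect of that pattern on one sentence; the second pattern is an illustrative alternative that also keeps single-character tokens:

```python
import re

default_pattern = r"(?u)\b\w\w+\b"   # CountVectorizer's default token_pattern
keep_short_pattern = r"(?u)\b\w+\b"  # illustrative alternative, keeps 1-character tokens

sentence = "this is a good phone phone"
default_tokens = re.findall(default_pattern, sentence.lower())
all_tokens = re.findall(keep_short_pattern, sentence.lower())

print(default_tokens)  # 'a' is dropped
print(all_tokens)      # 'a' is kept
```

Passing such a pattern through the `token_pattern` parameter of `CountVectorizer` should bring `'a'` back into the vocabulary.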


In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(document_corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['bad' 'cat' 'good' 'has' 'he' 'is' 'mobile' 'not' 'phone' 'she' 'temper'
 'this']
[[0.         0.         0.34273991 0.         0.         0.28832362
  0.         0.         0.82578944 0.         0.         0.34273991]
 [0.4023674  0.         0.         0.         0.         0.28097242
  0.80473481 0.         0.         0.         0.         0.33400129]
 [0.         0.49317635 0.6605719  0.         0.         0.27784695
  0.         0.         0.         0.49317635 0.         0.        ]
 [0.31283963 0.         0.         0.38775666 0.38775666 0.
  0.         0.         0.         0.         0.77551332 0.        ]
 [0.         0.         0.51309679 0.         0.         0.2158166
  0.30906082 0.38307292 0.61812163 0.         0.         0.2565484 ]]
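
These values differ from the manual TF-IDF computed earlier in this section. With its default settings, `TfidfVectorizer` uses the raw term count as tf, a smoothed idf of `ln((1 + n) / (1 + df)) + 1` with `n` the number of documents, and then scales each row to unit L2 norm. The following plain-Python sketch (the helper name `sklearn_style_tfidf` is illustrative) reproduces the first row under those assumptions, approximating the default tokenizer by dropping single-character words:

```python
import math
from collections import Counter

document_corpus = ["this is a good phone phone",
                   "this is a bad mobile mobile",
                   "she is a good good cat",
                   "he has a bad temper temper",
                   "this mobile phone phone is not good good"]

# Keep only words of two or more characters, as the default tokenizer does
tokenized = [[w for w in s.split() if len(w) > 1] for s in document_corpus]
n_docs = len(tokenized)
# Document frequency: in how many sentences does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))


def sklearn_style_tfidf(tokens):
    """Raw count * smoothed idf, followed by L2 normalization of the row."""
    counts = Counter(tokens)
    weights = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1)
               for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}


row0 = sklearn_style_tfidf(tokenized[0])
print({w: round(v, 8) for w, v in sorted(row0.items())})
```

The printed weights for "good", "is", "phone" and "this" should match the non-zero entries of the first row above.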


### N-gram

In [66]:
document_corpus = ["this is a good phone phone", 
                    "this is a bad mobile mobile",
                    "she is a good good cat", 
                    "he has a bad temper temper", 
                    "this mobile phone phone is not good good"]

In [72]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 2))
X = cv.fit_transform(document_corpus)
print(cv.get_feature_names_out())
print(X.toarray())

['bad' 'bad mobile' 'bad temper' 'cat' 'good' 'good cat' 'good good'
 'good phone' 'has' 'has bad' 'he' 'he has' 'is' 'is bad' 'is good'
 'is not' 'mobile' 'mobile mobile' 'mobile phone' 'not' 'not good' 'phone'
 'phone is' 'phone phone' 'she' 'she is' 'temper' 'temper temper' 'this'
 'this is' 'this mobile']
[[0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 2 0 1 0 0 0 0 1 1 0]
 [1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 2 1 0 0 0 0 0 0 0 0 0 0 1 1 0]
 [0 0 0 1 2 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0]
 [1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0]
 [0 0 0 0 2 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 2 1 1 0 0 0 0 1 0 1]]
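
An n-gram is a run of n consecutive tokens. `ngram_range=(1, 2)` above asks for both single words (unigrams) and adjacent word pairs (bigrams), which keeps some of the local word order that plain BoW discards. A minimal pure-Python sketch (the `ngrams` helper is illustrative; the example uses the third sentence with `a` already removed, since the vectorizer's default tokenizer drops single-character tokens):

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "she is good good cat".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

The bigrams produced here ("she is", "is good", "good good", "good cat") are the two-word features with non-zero counts in the third row of the matrix above.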
