### One-hot encoding
- One-hot encoding represents text tokens as numerical vectors. For example, take text='i love you'. We split the sentence into tokens and give each unique token its own vector: i=[1, 0, 0], love=[0, 1, 0], you=[0, 0, 1]. The one-hot encoding of the text is then [[1, 0, 0], [0, 1, 0], [0, 0, 1]]. Each vector has 3 elements because its length must equal the size of the vocabulary (the number of unique tokens).

In [1]:
documents=['Dog bites man.', 'Man bites dog.', 'Dog eats meat.', 'Man eats food']

In [2]:
processed_doc=[doc.lower().replace('.', '') for doc in documents]
processed_doc

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [7]:
#@ Building Vocabulary:
vocab={}
count=0
for doc in processed_doc:
  for word in doc.split():
    if word not in vocab:
      count+=1
      vocab[word]=count

print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}


In [8]:
#@ One-hot encoding for any string:
def get_onehot_vector(text):
  onehot_encoded=[]
  for word in text.split():
    temp=[0]*len(vocab)
    if word in vocab:
      temp[vocab[word]-1]=1  # vocab indices start at 1, list positions at 0
    # a word that is not in vocab keeps an all-zero vector
    onehot_encoded.append(temp)
  return onehot_encoded

In [9]:
get_onehot_vector(processed_doc[0])

[[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
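One detail worth seeing explicitly is what happens to a word that is not in the vocabulary. The sketch below is a minimal, self-contained illustration: the vocabulary is copied from the output above, `onehot` is a hypothetical single-word helper (not part of the notebook), and 'cat' is just an assumed example of an unseen word.

```python
# Vocabulary copied from the notebook output above (indices start at 1).
vocab = {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}

def onehot(word, vocab):
    """Return a one-hot list for `word`; all zeros if the word is unseen."""
    vec = [0] * len(vocab)
    if word in vocab:
        vec[vocab[word] - 1] = 1
    return vec

known = onehot('man', vocab)    # 'man' has index 3, so the third position is set
unknown = onehot('cat', vocab)  # hypothetical out-of-vocabulary word

print(known)
print(unknown)
print(sum(unknown) == 0)        # an unseen word carries no information at all
```

Every out-of-vocabulary word ends up with the same all-zero vector, so the representation cannot tell unseen words apart.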

### N-gram
- It works by breaking text into chunks of n contiguous words (or tokens), which helps capture some context. Each chunk is called an n-gram. The corpus vocabulary, V, is the collection of all unique n-grams across the text corpus. Each document in the corpus is then represented by a vector of length |V| that contains the frequency counts of the n-grams present in the document and zeros for the n-grams that are not present.
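The chunking step can be sketched in plain Python to show why n-grams capture some context. This is only an illustration under simple assumptions: tokens are split on whitespace, and `ngrams` and `ngram_counts` are hypothetical helper names, not library functions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams in `tokens` as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, max_n):
    """Count every 1-gram up to max_n-gram in a whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

# With unigrams only, word order is lost: both sentences look identical.
a = ngram_counts('dog bites man', 1)
b = ngram_counts('man bites dog', 1)
print(a == b)

# Adding bigrams makes the two sentences distinguishable.
a2 = ngram_counts('dog bites man', 2)
b2 = ngram_counts('man bites dog', 2)
print(a2 == b2)
print(sorted(a2))
```

The price for this extra context is a vocabulary that grows quickly as n increases.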

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
count_vect=CountVectorizer(ngram_range=(1,3)) # unigrams, bigrams, trigrams

# Bag-of-n-grams representation of the corpus:
bow=count_vect.fit_transform(processed_doc)

# Vocabulary mapping (n-gram -> column index):
print('Our Vocabulary:', count_vect.vocabulary_)

# BoW for the first two documents:
print("BoW representation for 'dog bites man':", bow[0].toarray())
print("BoW representation for 'man bites dog':", bow[1].toarray())

# BoW for new text (n-grams not in the vocabulary are ignored):
temp=count_vect.transform(['dog and dog are friends'])

print("BoW representation for 'dog and dog are friends':", temp.toarray())

Our Vocabulary: {'dog': 3, 'bites': 0, 'man': 12, 'dog bites': 4, 'bites man': 2, 'dog bites man': 5, 'man bites': 13, 'bites dog': 1, 'man bites dog': 14, 'eats': 8, 'meat': 17, 'dog eats': 6, 'eats meat': 10, 'dog eats meat': 7, 'food': 11, 'man eats': 15, 'eats food': 9, 'man eats food': 16}
BoW representation for 'dog bites man': [[1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0]]
BoW representation for 'man bites dog': [[1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0]]
BoW representation for 'dog and dog are friends': [[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


### TF-IDF
- If a word $w$ appears many times in a document $d_i$ but does not occur much in the rest of the documents $d_j$ in the corpus, then the word $w$ must be of great importance to the document $d_i$. The importance of $w$ should increase in proportion to its frequency in $d_i$, but at the same time, its importance should decrease in proportion to the word's frequency in other documents $d_j$ in the corpus. Mathematically, this is captured using two quantities: TF and IDF. The two are then combined to arrive at the TF-IDF score.
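One common textbook formulation of these two quantities is sketched below. The symbols are assumptions introduced here for illustration: $\mathrm{count}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

$$
\mathrm{TF}(t, d) = \frac{\mathrm{count}(t, d)}{\sum_{t'} \mathrm{count}(t', d)}, \qquad
\mathrm{IDF}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
$$

Libraries often use variants of this basic form (for example, smoothing the IDF or normalizing the resulting vectors), so computed values may differ from what these formulas give.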

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
tfidf=TfidfVectorizer()
bow_tf=tfidf.fit_transform(processed_doc)
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

temp = tfidf.transform(["dog and man are friends"])
print("Tfidf representation for 'dog and man are friends':\n", temp.toarray())

[1.51082562 1.22314355 1.51082562 1.91629073 1.22314355 1.91629073]
['bites' 'dog' 'eats' 'food' 'man' 'meat']
Tfidf representation for 'dog and man are friends':
 [[0.         0.70710678 0.         0.         0.70710678 0.        ]]
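The numbers printed above can be reproduced by hand, assuming scikit-learn's documented defaults for `TfidfVectorizer`: a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The sketch below is not the library's implementation. It uses plain whitespace tokenization, which happens to agree with the vectorizer on this toy corpus but is not identical in general, and `tfidf_vector` is a hypothetical helper name.

```python
import math

docs = ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
n_docs = len(docs)

# Sorted vocabulary, matching the column order of get_feature_names_out().
terms = sorted({w for d in docs for w in d.split()})

# Document frequency: number of documents that contain each term.
df = {t: sum(t in d.split() for d in docs) for t in terms}

# Smoothed IDF (assumed default: smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}

def tfidf_vector(text):
    """Raw term counts times IDF, then L2-normalized; unseen words are ignored."""
    tokens = text.split()
    raw = [tokens.count(t) * idf[t] for t in terms]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

print(terms)
print([round(idf[t], 8) for t in terms])
print([round(x, 8) for x in tfidf_vector('dog and man are friends')])
```

Because 'dog' and 'man' have the same document frequency, they get the same IDF, and after normalization each contributes 1/sqrt(2), which is the 0.70710678 seen in the output above.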
