🔹 Encoding
Encoding is the initial transformation of words or characters into numbers. It's typically done using simple, deterministic methods and is often sparse.

Common Encoding Methods:
One-Hot Encoding

Each word is represented as a binary vector of the vocabulary size.

Only one index is marked "1" (rest are "0").

Disadvantage: Very high-dimensional and sparse.

Bag of Words (BoW)

Counts the frequency of each word in a document.

Simple but ignores word order and context.

TF-IDF (Term Frequency - Inverse Document Frequency)

Weights terms based on how important they are in a document relative to a corpus.

Helps reduce the weight of common words (e.g., "the", "is").
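
The weighting described above can be sketched in plain Python. This is a minimal illustration that assumes the textbook form tf × log(N / df) and whitespace tokenization; the function name `tf_idf` is hypothetical, and libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the glass of milk", "the glass of juice", "the cup of tea"]
scores = tf_idf(docs)
print(scores[0])
```

Words that occur in every document ("the", "of") get a weight of zero, while a word unique to one document ("milk") gets the highest weight in that document.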


Encoding:
--------

Encoding is the process of converting data from one format to another. 
It often transforms categorical or textual data into numerical format to be used in machine learning models.


Types of Encoding:
-----------------

Label Encoding:
---------------

Converts each unique category to an integer.


Red → 0, Blue → 1, Green → 2


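Usage: Target labels, or ordinal data (like size: Small, Medium, Large) when the integers follow the intended order. For unordered categories such as colors, the integers imply an order that some models may treat as meaningful.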
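
A minimal sketch of label encoding in plain Python, assuming integers are assigned in first-seen order so that the result matches the Red/Blue/Green example. The helper name `label_encode` is hypothetical; scikit-learn's `LabelEncoder` sorts the classes instead, so it would produce a different mapping for the same data.

```python
def label_encode(values):
    """Map each distinct category to an integer, in first-seen order."""
    mapping = {}
    for value in values:
        # len(mapping) is evaluated before insertion, so new keys get 0, 1, 2, ...
        mapping.setdefault(value, len(mapping))
    return [mapping[v] for v in values], mapping

colors = ["Red", "Blue", "Green", "Blue", "Red"]
codes, mapping = label_encode(colors)
print(mapping)
print(codes)
```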



One-Hot Encoding:
---------------

Creates a binary column for each category.


Red → [1,0,0], Blue → [0,1,0], Green → [0,0,1]


Usage: Categorical data without inherent order.
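
The mapping above can be reproduced with a short sketch. It assumes the category list is known up front and fixes the column order; the name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(values, categories):
    """Return one binary row per value, with a single 1 in the category's column."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows

encoded = one_hot_encode(["Red", "Blue", "Green"], categories=["Red", "Blue", "Green"])
print(encoded)
```

Each row has as many columns as there are categories, which is why this encoding grows quickly for high-cardinality variables.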



Binary Encoding:
----------------

Converts categories to binary numbers.


Category A → 001, B → 010, C → 011

Usage: High-cardinality categorical variables.
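
A rough sketch of the idea, assuming categories are numbered from 1 in first-seen order and then written in base 2. The function name `binary_encode` and its `width` parameter are hypothetical; `width=3` is passed here only to match the three-digit codes in the example above.

```python
def binary_encode(values, width=None):
    """Number the categories from 1 and return their zero-padded binary codes."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    needed = len(categories).bit_length()     # digits required for the largest number
    if width is None:
        width = needed
    elif width < needed:
        raise ValueError("width is too small for the number of categories")
    codes = {c: format(i, f"0{width}b") for i, c in enumerate(categories, start=1)}
    return [codes[v] for v in values], codes

encoded, codes = binary_encode(["A", "B", "C", "A"], width=3)
print(codes)
```

Each binary digit becomes one column, so k categories need roughly log2(k) columns instead of the k columns used by one-hot encoding.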



Ordinal Encoding:
----------------

Encodes categories with a meaningful order.


Low → 1, Medium → 2, High → 3

Usage: When the order of categories matters.
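
As a sketch, ordinal encoding differs from label encoding only in that the order is supplied explicitly rather than derived from the data. The helper name `ordinal_encode` is hypothetical, and ranks start at 1 to match the example above.

```python
def ordinal_encode(values, order):
    """Map each value to its 1-based rank in the given order."""
    rank = {level: i for i, level in enumerate(order, start=1)}
    return [rank[v] for v in values]

levels = ordinal_encode(["Medium", "Low", "High", "Low"], order=["Low", "Medium", "High"])
print(levels)
```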



Frequency Encoding:
------------------

Encodes categories based on their frequency in the dataset.


Red (10 of 100 rows) → 0.1, Blue (20 of 100 rows) → 0.2

Usage: When the frequency of occurrence is important.
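
A minimal sketch of frequency encoding, assuming the relative frequency (count divided by number of rows) is used. The 100-row dataset, including the 70 "Green" rows, is made up for illustration so that the Red and Blue values match the example above; `frequency_encode` is a hypothetical name.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its share of the rows."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

# Illustrative data: 10 Red, 20 Blue, 70 Green rows (assumed, 100 rows in total).
colors = ["Red"] * 10 + ["Blue"] * 20 + ["Green"] * 70
encoded = frequency_encode(colors)
print(encoded[0], encoded[10], encoded[30])
```

Categories that happen to share the same frequency receive the same value, so they become indistinguishable after encoding.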



Target Encoding:
---------------

Encodes categories based on the mean of the target variable.


Category A → 0.3 (mean of target for category A)

Usage: Handling categorical variables with respect to the target.
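
The per-category mean can be sketched as follows. The data are invented so that category A averages 0.3 as in the example above, and `target_encode` is a hypothetical name. In practice the means should be computed on the training split only, otherwise the target leaks into the features.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means

# Illustrative data: A has 3 positives out of 10, B has 4 out of 5.
categories = ["A"] * 10 + ["B"] * 5
targets = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0]
encoded, means = target_encode(categories, targets)
print(means)
```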



In [1]:
from tensorflow.keras.preprocessing.text import one_hot

In [2]:
### sentences
sent=['the glass of milk',
      'the glass of juice',
      'the cup of tea',
      'I am a good boy',
      'I am a good developer',
      'understand the meaning of words',
      'your videos are good']

In [3]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [4]:
### Vocabulary size   --- Total size of vocabulary
voc_size=10000

#### One Hot Representation

In [5]:
# one_hot hashes each word to an integer index in [1, voc_size); it does not build binary vectors
onehot_repr=[one_hot(words,voc_size) for words in sent]
print(onehot_repr)

[[285, 4765, 7291, 3512], [285, 4765, 7291, 2429], [285, 4192, 7291, 1107], [4044, 4257, 1137, 8110, 5014], [4044, 4257, 1137, 8110, 4011], [6304, 285, 3973, 7291, 845], [4644, 6135, 9378, 8110]]
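
Despite its name, `one_hot` returns one integer per word rather than a binary vector: each word is hashed into the range [1, voc_size). The sketch below illustrates the idea with a stable MD5 hash. It is an assumption-laden stand-in, not Keras's implementation, so the indices it produces will not match the output above; `hashed_indices` is a hypothetical name.

```python
import hashlib

def hashed_indices(sentence, vocab_size):
    """Map each word to an integer in [1, vocab_size) using a stable hash."""
    indices = []
    for word in sentence.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, which leaves it free for padding.
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices

a = hashed_indices("the glass of milk", 10000)
b = hashed_indices("the glass of juice", 10000)
print(a)
print(b)
```

The same word always maps to the same index (the first three entries of both lists agree), but two different words can collide on one index, which becomes more likely as the vocabulary size shrinks.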


### Word Embedding Representation

In [6]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences   ### pad_sequences pads the sentences so they all have the same length
from tensorflow.keras.models import Sequential

In [7]:
import numpy as np

In [8]:
sent_length=8
# padding='pre': sequences shorter than sent_length get zeros added at the beginning until they reach length 8
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0  285 4765 7291 3512]
 [   0    0    0    0  285 4765 7291 2429]
 [   0    0    0    0  285 4192 7291 1107]
 [   0    0    0 4044 4257 1137 8110 5014]
 [   0    0    0 4044 4257 1137 8110 4011]
 [   0    0    0 6304  285 3973 7291  845]
 [   0    0    0    0 4644 6135 9378 8110]]
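
What the padding step does can be sketched in plain Python. This assumes a positive `maxlen` and mirrors the pre-padding used above (longer sequences keep their last `maxlen` items); `pad_pre` is a hypothetical helper, not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to maxlen; keep the tail if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front when the sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[285, 4765, 7291, 3512], [4044, 4257, 1137, 8110, 5014]], maxlen=8)
print(rows)
```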


In [9]:
dim=10   # embedding dimension: each word index is mapped to a vector of 10 values


In [10]:
model=Sequential()
# Embedding maps each of the voc_size indices to a dense vector of length dim
model.add(Embedding(voc_size,dim,input_length=sent_length))
model.compile('adam','mse')



In [11]:
model.summary()

In [12]:
print(model.predict(embedded_docs))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 95ms/step
[[[ 0.0072916  -0.04939718  0.00980201  0.01132406 -0.01122274
    0.02594313 -0.0177862  -0.02067568 -0.04062396 -0.01236427]
  [ 0.0072916  -0.04939718  0.00980201  0.01132406 -0.01122274
    0.02594313 -0.0177862  -0.02067568 -0.04062396 -0.01236427]
  [ 0.0072916  -0.04939718  0.00980201  0.01132406 -0.01122274
    0.02594313 -0.0177862  -0.02067568 -0.04062396 -0.01236427]
  [ 0.0072916  -0.04939718  0.00980201  0.01132406 -0.01122274
    0.02594313 -0.0177862  -0.02067568 -0.04062396 -0.01236427]
  [-0.03398281 -0.03193555 -0.04279214 -0.00214678  0.02035755
    0.00392882 -0.02644522 -0.04339386 -0.01101477  0.03416219]
  [ 0.0200134   0.04149667  0.00343012 -0.02342827 -0.01753107
   -0.04941878  0.02995748  0.00408471 -0.03031365  0.02967873]
  [-0.00590355 -0.0402675   0.01093559 -0.03672932 -0.02785796
    0.00191156 -0.01378654  0.00135098  0.01399452 -0.00674499]
  [-0.02456751 -0.03695143 -0.04575268 

In [13]:
embedded_docs[0]  ## padded indices of the first sentence; the Embedding layer turns each of these 8 indices (including the padding 0) into a vector of 10 values because the dimension is 10

array([   0,    0,    0,    0,  285, 4765, 7291, 3512])

In [14]:
print(model.predict(embedded_docs)[0])

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step
[[ 0.0072916  -0.04939718  0.00980201  0.01132406 -0.01122274  0.02594313
  -0.0177862  -0.02067568 -0.04062396 -0.01236427]
 [ 0.0072916  -0.04939718  0.00980201  0.01132406 -0.01122274  0.02594313
  -0.0177862  -0.02067568 -0.04062396 -0.01236427]
 [ 0.0072916  -0.04939718  0.00980201  0.01132406 -0.01122274  0.02594313
  -0.0177862  -0.02067568 -0.04062396 -0.01236427]
 [ 0.0072916  -0.04939718  0.00980201  0.01132406 -0.01122274  0.02594313
  -0.0177862  -0.02067568 -0.04062396 -0.01236427]
 [-0.03398281 -0.03193555 -0.04279214 -0.00214678  0.02035755  0.00392882
  -0.02644522 -0.04339386 -0.01101477  0.03416219]
 [ 0.0200134   0.04149667  0.00343012 -0.02342827 -0.01753107 -0.04941878
   0.02995748  0.00408471 -0.03031365  0.02967873]
 [-0.00590355 -0.0402675   0.01093559 -0.03672932 -0.02785796  0.00191156
  -0.01378654  0.00135098  0.01399452 -0.00674499]
 [-0.02456751 -0.03695143 -0.04575268  0.00994636 -0.
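
An embedding layer is essentially a lookup table with one row per vocabulary index. The sketch below shows the lookup in plain Python, assuming small random initial values (the printed values above all lie within about ±0.05). The names `make_embedding_table` and `embed` are hypothetical, and the numbers will not match the Keras output.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(sequence, table):
    """Look up one vector per index in the sequence."""
    return [table[index] for index in sequence]

table = make_embedding_table(vocab_size=10000, dim=10)
vectors = embed([0, 0, 0, 0, 285, 4765, 7291, 3512], table)
print(len(vectors), len(vectors[0]))
```

Because the first four positions all hold the padding index 0, they receive the same vector, which is why the first four rows of the output above are identical. Since the model has not been trained, these vectors carry no meaning yet.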

In [1]:
import pandas as pd

In [2]:
data=pd.DataFrame({
    "text":["people watch cricket","cricket watch cricket","people give comment","cricket give comment"],
    "output":[1,1,0,0],
})

In [3]:
data

                    text  output
0   people watch cricket       1
1  cricket watch cricket       1
2    people give comment       0
3   cricket give comment       0


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
BOW=CountVectorizer()

In [6]:
data["text"]

0     people watch cricket
1    cricket watch cricket
2      people give comment
3     cricket give comment
Name: text, dtype: object

In [7]:
document_matrix=BOW.fit_transform(data["text"])

In [8]:
document_matrix

<4x5 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

### Vocabulary indices are assigned in alphabetical order

In [9]:
BOW.vocabulary_

{'people': 3, 'watch': 4, 'cricket': 1, 'give': 2, 'comment': 0}

In [10]:
document_matrix[0].toarray()

array([[0, 1, 0, 1, 1]], dtype=int64)

In [11]:
document_matrix[1].toarray()

array([[0, 2, 0, 0, 1]], dtype=int64)

In [12]:
document_matrix[2].toarray()

array([[1, 0, 1, 1, 0]], dtype=int64)
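
The count vectors above can be reproduced by hand, which makes the alphabetical column order visible. This sketch assumes lowercasing and whitespace tokenization, a simplification of `CountVectorizer`, whose default tokenizer also drops single-character tokens. The name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary index, count matrix) with columns in alphabetical order."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[token] for token in vocab])
    return index, matrix

docs = ["people watch cricket", "cricket watch cricket",
        "people give comment", "cricket give comment"]
vocab_index, counts = bag_of_words(docs)
print(vocab_index)
print(counts)
```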

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
bigram=CountVectorizer(ngram_range=(2,2))

In [15]:
bigramvocab=bigram.fit_transform(data["text"])

In [16]:
data["text"]

0     people watch cricket
1    cricket watch cricket
2      people give comment
3     cricket give comment
Name: text, dtype: object

In [17]:
bigram.vocabulary_

{'people watch': 4,
 'watch cricket': 5,
 'cricket watch': 1,
 'people give': 3,
 'give comment': 2,
 'cricket give': 0}

In [18]:
trigram=CountVectorizer(ngram_range=(3,3))

In [19]:
trigramvocab=trigram.fit_transform(data["text"])

In [20]:
trigram.vocabulary_

{'people watch cricket': 3,
 'cricket watch cricket': 1,
 'people give comment': 2,
 'cricket give comment': 0}

In [21]:
mix=CountVectorizer(ngram_range=(1,3))

In [22]:
mix_vocab=mix.fit_transform(data["text"])

In [23]:
mix.vocabulary_

{'people': 8,
 'watch': 13,
 'cricket': 1,
 'people watch': 11,
 'watch cricket': 14,
 'people watch cricket': 12,
 'cricket watch': 4,
 'cricket watch cricket': 5,
 'give': 6,
 'comment': 0,
 'people give': 9,
 'give comment': 7,
 'people give comment': 10,
 'cricket give': 2,
 'cricket give comment': 3}
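
How `ngram_range` builds these vocabularies can be sketched with a sliding window over the tokens. This assumes whitespace tokenization and alphabetical index assignment, which is enough to reproduce the mixed (1,3) vocabulary above for this data; the helper name `ngrams` is hypothetical.

```python
def ngrams(text, n_min, n_max):
    """Return all word n-grams of the text for n in [n_min, n_max]."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

docs = ["people watch cricket", "cricket watch cricket",
        "people give comment", "cricket give comment"]
vocab = sorted({gram for doc in docs for gram in ngrams(doc, 1, 3)})
index = {gram: i for i, gram in enumerate(vocab)}
print(ngrams("people watch cricket", 2, 2))
print(index)
```

Widening the range captures more word order at the cost of a larger vocabulary: 5 unigrams grow to 15 features here once bigrams and trigrams are included.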