
Embedding:
----------

An embedding represents high-dimensional or discrete data (like words or items) as dense vectors in a lower-dimensional continuous space.
Each discrete item (like a word) maps to a continuous vector, usually learned so that similar items end up close together.
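
A minimal sketch of the idea, assuming a toy vocabulary and hand-written vectors (real embeddings are learned from data; the names `vocab`, `embedding_table` and `embed` are made up for illustration):

```python
# Toy vocabulary: each discrete token gets an integer id (hypothetical values).
vocab = {"king": 0, "queen": 1, "apple": 2}

# Embedding table: one dense 3-dimensional vector per token id.
# The numbers are invented for illustration, not learned.
embedding_table = [
    [0.9, 0.8, 0.1],  # king
    [0.9, 0.7, 0.2],  # queen
    [0.1, 0.2, 0.9],  # apple
]

def embed(token):
    """Look up the dense vector stored for a token."""
    return embedding_table[vocab[token]]

print(embed("queen"))  # [0.9, 0.7, 0.2]
```

An embedding is therefore just a lookup from an id to a row of numbers; training adjusts the rows.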



Why Use Embeddings:
-------------------

To capture the semantic meaning of words or items.

To reduce dimensionality and make data manageable.

To find relationships or similarities between items.

To make data compatible with neural networks.
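
Similarity between items is commonly measured with cosine similarity between their vectors. A small sketch, assuming three hand-written toy vectors (the values are invented, not taken from a trained model):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for three words.
king = [0.9, 0.8, 0.1]
queen = [0.9, 0.7, 0.2]
apple = [0.1, 0.2, 0.9]

sim_king_queen = cosine_similarity(king, queen)
sim_king_apple = cosine_similarity(king, apple)

# Related words should score higher than unrelated ones.
print(sim_king_queen > sim_king_apple)  # True
```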



Types of Embeddings:
--------------------

Word Embeddings (Text):
------------------------

Represent words as vectors in a continuous vector space.

Common Techniques:
------------------

Word2Vec: Uses CBOW or Skip-Gram to learn word vectors.

GloVe (Global Vectors): Captures word co-occurrence statistics.

FastText: Considers subword information, good for rare words.

ELMo (Embeddings from Language Models): Contextual embeddings using BiLSTM.

BERT Embeddings: Contextual embeddings using Transformers.
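
Skip-Gram trains on (center word, context word) pairs taken from a sliding window. A minimal sketch of the pair generation only, assuming a window of 1; `skipgram_pairs` is a hypothetical helper, not a library function, and the training step itself is left out:

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the glass of milk".split(), window=1)
print(pairs)
# [('the', 'glass'), ('glass', 'the'), ('glass', 'of'),
#  ('of', 'glass'), ('of', 'milk'), ('milk', 'of')]
```

CBOW uses the same windows the other way round: the context words predict the center word.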




Document Embeddings (Sentences/Paragraphs):
----------------------------------------------

Encodes entire sentences or paragraphs as vectors.

Techniques:
-----------

Doc2Vec: An extension of Word2Vec that learns vectors for whole paragraphs or documents.

BERT Sentence Embeddings: Context-aware embeddings for sentences.



Image Embeddings:
------------------

Transforms images into vectors.

Techniques:
----------

CNN-based Embeddings: Extracts features from images.

Pre-trained Models: VGG, ResNet for image vectorization.



Graph Embeddings:
-----------------

Represents graph nodes or structures as vectors.

Techniques:
----------

Node2Vec: Learns node embeddings through random walks.

DeepWalk: Similar to Node2Vec but uses uniform (unbiased) random walks, while Node2Vec biases its walks.

GraphSAGE: Samples neighborhoods for embedding learning.
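
The random walks behind DeepWalk can be sketched as follows, assuming a small hand-written undirected graph and a fixed seed (`graph` and `random_walk` are illustrative names; the walks would then be fed to a Word2Vec-style model, which is omitted here):

```python
import random

# Hypothetical undirected graph as an adjacency list.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(graph, start, length, rng):
    """Uniform random walk: every neighbor is equally likely at each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        walk.append(rng.choice(neighbors))
    return walk

rng = random.Random(0)  # fixed seed so the walks are repeatable
walks = [random_walk(graph, node, 5, rng) for node in graph]
for walk in walks:
    print(" -> ".join(walk))
```

Each walk plays the role of a sentence and each node the role of a word.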



Categorical Embeddings (Tabular Data):
--------------------------------------

Converts categorical features into continuous vectors.

Techniques:
------------

One-Hot Encoding + Dense Layer: Learns embeddings during training.

Embedding Layer in Neural Networks: Common in recommendation systems.
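
Both techniques compute the same thing: multiplying a one-hot vector by a weight matrix selects one row of that matrix, which is what an embedding layer does by direct lookup. A small sketch with made-up weights (`one_hot` and `dense` here are plain helper functions written for this example, not library calls):

```python
# Hypothetical weight matrix: 3 categories, 2-dimensional embeddings.
weights = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given index."""
    return [1 if i == index else 0 for i in range(size)]

def dense(vector, weights):
    """Vector-matrix product, i.e. a dense layer without bias or activation."""
    n_out = len(weights[0])
    return [sum(vector[i] * weights[i][j] for i in range(len(vector)))
            for j in range(n_out)]

category = 1
via_dense = dense(one_hot(category, 3), weights)
via_lookup = weights[category]  # what an embedding layer does

print(via_dense == via_lookup)  # True
```

The lookup avoids the multiplication by zeros, which matters when there are many categories.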




Encoding:
--------

Encoding is the process of converting data from one format to another. 
It often transforms categorical or textual data into numerical format to be used in machine learning models.


Types of Encoding:
-----------------

Label Encoding:
---------------

Converts each unique category to an integer.


Red → 0, Blue → 1, Green → 2


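Usage: Target labels and tree-based models. The integers are arbitrary, so for ordinal data (like size: Small, Medium, Large) set the order explicitly, as in Ordinal Encoding.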
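
A minimal sketch of label encoding in plain Python, assuming ids are assigned in order of first appearance so that the result matches the example above (scikit-learn's `LabelEncoder` sorts the classes instead, so its ids would differ):

```python
colors = ["Red", "Blue", "Green", "Blue", "Red"]

# Assign the next free integer the first time a category is seen.
mapping = {}
for color in colors:
    mapping.setdefault(color, len(mapping))

encoded = [mapping[color] for color in colors]

print(mapping)  # {'Red': 0, 'Blue': 1, 'Green': 2}
print(encoded)  # [0, 1, 2, 1, 0]
```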



One-Hot Encoding:
---------------

Creates a binary column for each category.


Red → [1,0,0], Blue → [0,1,0], Green → [0,0,1]


Usage: Categorical data without inherent order.
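
The mapping above can be sketched in a few lines, assuming the category order Red, Blue, Green (`one_hot_encode` is a helper written for this example):

```python
categories = ["Red", "Blue", "Green"]

def one_hot_encode(value, categories):
    """One binary column per category; exactly one column is 1."""
    return [1 if value == category else 0 for category in categories]

encoded = {c: one_hot_encode(c, categories) for c in categories}
print(encoded)
# {'Red': [1, 0, 0], 'Blue': [0, 1, 0], 'Green': [0, 0, 1]}
```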



Binary Encoding:
----------------

Converts each category to an integer, then writes that integer in binary, one column per bit.


Category A → 001, B → 010, C → 011

Usage: High-cardinality categorical variables.
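
A small sketch that reproduces the example, assuming categories are numbered from 1 and written with 3 bits (`binary_encode` and `width` are illustrative names):

```python
categories = ["A", "B", "C"]
width = 3  # number of bits, i.e. number of output columns

def binary_encode(value, categories, width):
    """Number the category from 1, then format that number in binary."""
    index = categories.index(value) + 1
    return format(index, "0{}b".format(width))

codes = {c: binary_encode(c, categories, width) for c in categories}
print(codes)  # {'A': '001', 'B': '010', 'C': '011'}
```

With 3 bits up to 7 categories fit into 3 columns, where one-hot encoding would need 7.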



Ordinal Encoding:
----------------

Encodes categories with a meaningful order.


Low → 1, Medium → 2, High → 3

Usage: When the order of categories matters.
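
Unlike label encoding, the order is stated explicitly. A minimal sketch, assuming the three levels from the example (`order` and `rank` are illustrative names):

```python
# The order is chosen by hand, from lowest to highest.
order = ["Low", "Medium", "High"]
rank = {level: position + 1 for position, level in enumerate(order)}

values = ["Medium", "Low", "High", "Low"]
encoded = [rank[value] for value in values]

print(rank)     # {'Low': 1, 'Medium': 2, 'High': 3}
print(encoded)  # [2, 1, 3, 1]
```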



Frequency Encoding:
------------------

Encodes categories based on their frequency in the dataset.


Red (10 of 100 rows) → 0.1, Blue (20 of 100 rows) → 0.2

Usage: When the frequency of occurrence is important.
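
A sketch of the computation, assuming a column of 100 rows in which Red appears 10 times and Blue 20 times (the remaining 70 rows are filled with a made-up Green category):

```python
from collections import Counter

# Hypothetical column with 100 rows.
colors = ["Red"] * 10 + ["Blue"] * 20 + ["Green"] * 70

counts = Counter(colors)
total = len(colors)

# Each category is replaced by its share of the rows.
frequency = {color: count / total for color, count in counts.items()}
print(frequency)  # {'Red': 0.1, 'Blue': 0.2, 'Green': 0.7}
```

Categories with the same frequency receive the same code, so they can no longer be told apart.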



Target Encoding:
---------------

Encodes categories based on the mean of the target variable.


Category A → 0.3 (mean of target for category A)

Usage: High-cardinality categorical features in supervised learning. Compute the means on the training data only, otherwise the target leaks into the features.
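
A minimal sketch, assuming a tiny made-up training set in which category A has 3 positive targets out of 10 rows (`rows`, `sums` and `counts` are illustrative names; smoothing, which is often added in practice, is left out):

```python
from collections import defaultdict

# Hypothetical (category, target) training rows.
rows = [("A", 1)] * 3 + [("A", 0)] * 7 + [("B", 1), ("B", 0)]

sums = defaultdict(int)
counts = defaultdict(int)
for category, target in rows:
    sums[category] += target
    counts[category] += 1

# Each category is replaced by the mean target seen for it.
target_mean = {category: sums[category] / counts[category] for category in sums}
print(target_mean)  # {'A': 0.3, 'B': 0.5}
```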



In [25]:
from tensorflow.keras.preprocessing.text import one_hot

In [26]:
### sentences
sent = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good',
]

In [27]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [28]:
### Vocabulary size: upper bound for the word indices produced below
voc_size = 10000

#### One Hot Representation

In [29]:
# one_hot hashes each word to an integer index below voc_size; it does not build 0/1 vectors
onehot_repr = [one_hot(words, voc_size) for words in sent]
print(onehot_repr)

[[9499, 6526, 7681, 9648], [9499, 6526, 7681, 9883], [9499, 3651, 7681, 2738], [9111, 8944, 2167, 1209, 4980], [9111, 8944, 2167, 1209, 1114], [7952, 9499, 3754, 7681, 6490], [1604, 3696, 6464, 1209]]
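
The same word always gets the same index in the output above ('the' is 9499 everywhere, 'good' is 1209), because the index comes from hashing the word. A rough sketch of that idea, assuming an MD5-based hash as a stand-in for the hash Keras uses; `hashed_index` and `encode` are hypothetical names and the numbers will not match the output above:

```python
import hashlib

def hashed_index(word, vocab_size):
    """Map a word to an index in the range 1 .. vocab_size - 1 (0 is left for padding)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1

def encode(sentence, vocab_size):
    """Replace every whitespace-separated word by its hashed index."""
    return [hashed_index(word, vocab_size) for word in sentence.split()]

milk = encode("the glass of milk", 10000)
juice = encode("the glass of juice", 10000)

# The shared words 'the glass of' receive identical indices in both sentences.
print(milk[:3] == juice[:3])  # True
```

Two different words can hash to the same index; a larger vocabulary size makes such collisions less likely.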


### Word Embedding Representation

In [30]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences   ### pad_sequences pads all sentences to the same length
from tensorflow.keras.models import Sequential

In [31]:
import numpy as np

In [32]:
sent_length=8
# padding='pre' adds zeros at the beginning of every sentence shorter than sent_length (8)
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0 9499 6526 7681 9648]
 [   0    0    0    0 9499 6526 7681 9883]
 [   0    0    0    0 9499 3651 7681 2738]
 [   0    0    0 9111 8944 2167 1209 4980]
 [   0    0    0 9111 8944 2167 1209 1114]
 [   0    0    0 7952 9499 3754 7681 6490]
 [   0    0    0    0 1604 3696 6464 1209]]
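
What 'pre' padding does can be sketched in plain Python. This is a simplified stand-in written for illustration (`pad_pre` is a hypothetical helper, not the Keras implementation), assuming `maxlen` is positive and that sequences longer than `maxlen` keep their last `maxlen` items:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad every sequence with `value` until it has `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[9499, 6526, 7681, 9648],
                [9111, 8944, 2167, 1209, 4980]], maxlen=8)
print(rows)
# [[0, 0, 0, 0, 9499, 6526, 7681, 9648], [0, 0, 0, 9111, 8944, 2167, 1209, 4980]]
```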


In [33]:
dim=10


In [34]:
model=Sequential()
model.add(Embedding(voc_size,dim,input_length=sent_length))   # dim = number of values in each word vector
model.compile('adam','mse')



In [35]:
model.summary()

In [36]:
print(model.predict(embedded_docs))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 108ms/step
[[[ 1.4113676e-02 -2.0687116e-02  8.9847334e-03  3.4140501e-02
    3.1552557e-02 -8.3116405e-03 -3.4494221e-02  6.3979737e-03
   -2.1842290e-02 -2.9660333e-02]
  [ 1.4113676e-02 -2.0687116e-02  8.9847334e-03  3.4140501e-02
    3.1552557e-02 -8.3116405e-03 -3.4494221e-02  6.3979737e-03
   -2.1842290e-02 -2.9660333e-02]
  [ 1.4113676e-02 -2.0687116e-02  8.9847334e-03  3.4140501e-02
    3.1552557e-02 -8.3116405e-03 -3.4494221e-02  6.3979737e-03
   -2.1842290e-02 -2.9660333e-02]
  [ 1.4113676e-02 -2.0687116e-02  8.9847334e-03  3.4140501e-02
    3.1552557e-02 -8.3116405e-03 -3.4494221e-02  6.3979737e-03
   -2.1842290e-02 -2.9660333e-02]
  [ 4.3204203e-03 -3.7728667e-02  2.1580543e-02 -2.1478271e-02
   -1.8456735e-02 -3.6473013e-02 -2.2550656e-02  1.4367912e-02
    4.4218052e-02  3.1032078e-03]
  [-6.0575828e-03  3.6007762e-03 -4.7282945e-02 -3.8094223e-02
    4.0959742e-02  2.7068149e-02  1.0177027e-02 -3.9934706e-02
 

In [37]:
embedded_docs[0]  ## each index in this row (including the padding value 0) is mapped to a vector of 10 values, because the embedding dimension is 10

array([   0,    0,    0,    0, 9499, 6526, 7681, 9648])

In [38]:
print(model.predict(embedded_docs)[0])

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[[ 0.01411368 -0.02068712  0.00898473  0.0341405   0.03155256 -0.00831164
  -0.03449422  0.00639797 -0.02184229 -0.02966033]
 [ 0.01411368 -0.02068712  0.00898473  0.0341405   0.03155256 -0.00831164
  -0.03449422  0.00639797 -0.02184229 -0.02966033]
 [ 0.01411368 -0.02068712  0.00898473  0.0341405   0.03155256 -0.00831164
  -0.03449422  0.00639797 -0.02184229 -0.02966033]
 [ 0.01411368 -0.02068712  0.00898473  0.0341405   0.03155256 -0.00831164
  -0.03449422  0.00639797 -0.02184229 -0.02966033]
 [ 0.00432042 -0.03772867  0.02158054 -0.02147827 -0.01845673 -0.03647301
  -0.02255066  0.01436791  0.04421805  0.00310321]
 [-0.00605758  0.00360078 -0.04728295 -0.03809422  0.04095974  0.02706815
   0.01017703 -0.03993471 -0.04469674 -0.01343983]
 [ 0.04203964  0.01515545 -0.00990856  0.00821701  0.02771887  0.04490746
   0.02663592 -0.01101935  0.00472919  0.03900719]
 [ 0.03316785 -0.04668368 -0.0418429  -0.02534375 -0.

In [1]:
import pandas as pd

In [2]:
data = pd.DataFrame({
    "text": ["people watch cricket", "cricket watch cricket", "people give comment", "cricket give comment"],
    "output": [1, 1, 0, 0],
})

In [3]:
data

                    text  output
0   people watch cricket       1
1  cricket watch cricket       1
2    people give comment       0
3   cricket give comment       0


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
BOW=CountVectorizer()

In [6]:
data["text"]

0     people watch cricket
1    cricket watch cricket
2      people give comment
3     cricket give comment
Name: text, dtype: object

In [7]:
document_matrix=BOW.fit_transform(data["text"])

In [8]:
document_matrix

<4x5 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

### Vocabulary indices are assigned in alphabetical order

In [9]:
BOW.vocabulary_

{'people': 3, 'watch': 4, 'cricket': 1, 'give': 2, 'comment': 0}

In [10]:
document_matrix[0].toarray()

array([[0, 1, 0, 1, 1]], dtype=int64)

In [11]:
document_matrix[1].toarray()

array([[0, 2, 0, 0, 1]], dtype=int64)

In [12]:
document_matrix[2].toarray()

array([[1, 0, 1, 1, 0]], dtype=int64)
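
The counts above can be reproduced by hand. A plain-Python sketch of the bag-of-words idea, assuming simple whitespace tokenization (`bow_vector` is a helper written for this example; CountVectorizer's own tokenizer does more):

```python
from collections import Counter

docs = [
    "people watch cricket",
    "cricket watch cricket",
    "people give comment",
    "cricket give comment",
]

# Vocabulary: all distinct words, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bow_vector(doc) for doc in docs]

print(vocab)      # ['comment', 'cricket', 'give', 'people', 'watch']
print(matrix[0])  # [0, 1, 0, 1, 1]
print(matrix[1])  # [0, 2, 0, 0, 1]
print(matrix[2])  # [1, 0, 1, 1, 0]
```

Word order is lost: only the counts per word remain.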

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
bigram=CountVectorizer(ngram_range=(2,2))

In [15]:
bigramvocab=bigram.fit_transform(data["text"])

In [16]:
data["text"]

0     people watch cricket
1    cricket watch cricket
2      people give comment
3     cricket give comment
Name: text, dtype: object

In [17]:
bigram.vocabulary_

{'people watch': 4,
 'watch cricket': 5,
 'cricket watch': 1,
 'people give': 3,
 'give comment': 2,
 'cricket give': 0}

In [18]:
trigram=CountVectorizer(ngram_range=(3,3))

In [19]:
trigramvocab=trigram.fit_transform(data["text"])

In [20]:
trigram.vocabulary_

{'people watch cricket': 3,
 'cricket watch cricket': 1,
 'people give comment': 2,
 'cricket give comment': 0}

In [21]:
mix=CountVectorizer(ngram_range=(1,3))

In [22]:
mix_vocab=mix.fit_transform(data["text"])

In [23]:
mix.vocabulary_

{'people': 8,
 'watch': 13,
 'cricket': 1,
 'people watch': 11,
 'watch cricket': 14,
 'people watch cricket': 12,
 'cricket watch': 4,
 'cricket watch cricket': 5,
 'give': 6,
 'comment': 0,
 'people give': 9,
 'give comment': 7,
 'people give comment': 10,
 'cricket give': 2,
 'cricket give comment': 3}
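
The vocabulary for ngram_range=(1,3) can be rebuilt by hand: collect every run of 1 to 3 consecutive words, then sort. A plain-Python sketch, assuming whitespace tokenization (`ngrams` is a helper written for this example, not part of scikit-learn):

```python
def ngrams(tokens, n_min, n_max):
    """All runs of n_min..n_max consecutive tokens, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

docs = [
    "people watch cricket",
    "cricket watch cricket",
    "people give comment",
    "cricket give comment",
]

# Distinct n-grams over all documents, indexed in alphabetical order.
terms = sorted({gram for doc in docs for gram in ngrams(doc.split(), 1, 3)})
vocabulary = {gram: index for index, gram in enumerate(terms)}

print(len(vocabulary))               # 15
print(vocabulary["people watch"])    # 11
print(vocabulary["watch cricket"])   # 14
```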