In [1]:
from tensorflow.keras.preprocessing.text import one_hot

In [2]:
### sentences
sent = ['the glass of milk',
        'the glass of juice',
        'the cup of tea',
        'I am a good boy',
        'I am a good developer',
        'understand the meaning of words',
        'your videos are good']

In [3]:

voc_size=1000

one_hot_repr=[one_hot(words,voc_size) for words in sent]
one_hot_repr

[[320, 92, 27, 544],
 [320, 92, 27, 435],
 [320, 665, 27, 709],
 [146, 11, 927, 962, 651],
 [146, 11, 927, 962, 990],
 [591, 320, 454, 27, 180],
 [388, 945, 109, 962]]

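The one_hot function from Keras does not create a full one-hot vector for each word. Instead, it hashes each word to an integer in the range 1 to voc_size - 1 (0 is never assigned, which leaves it free for padding). So, each word is represented by a single number, not a full vector. Because the mapping is a hash, the index is not guaranteed to be unique: two different words can collide on the same number, and collisions become more likely as voc_size shrinks relative to the number of distinct words. The default hash is Python's built-in hash, so the exact indices can also change from one interpreter session to the next.

Why?

This is integer encoding via the "hashing trick," not true one-hot encoding.
It's more memory-efficient: instead of storing a long vector of mostly zeros for each word, you just store its index.
The actual one-hot vector (with 1 at the index and 0 elsewhere) is rarely built at all. The embedding layer uses the index to look up a row of its weight matrix, which gives the same result as multiplying the one-hot vector by that matrix.

Summary:

one_hot(words, voc_size) → returns a list of integers (hashed word indices).
True one-hot vectors are not created at this step; the indices are passed straight to an embedding layer.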
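
To make the difference between an index and a true one-hot vector concrete, here is a minimal pure-Python sketch. `hash_encode` and `to_one_hot` are hypothetical helpers written for illustration, not Keras functions. The sketch assumes an MD5-based hash so the result is stable across runs, which means its indices will not match the Keras output shown above.

```python
import hashlib

def hash_encode(text, n):
    """Map each word to an integer in [1, n - 1] with a stable hash (illustrative only)."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Modulo (n - 1) then +1 keeps index 0 unused, so it stays free for padding.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

def to_one_hot(index, n):
    """Build the explicit one-hot vector for a single index."""
    vec = [0] * n
    vec[index] = 1
    return vec

voc_size = 1000
encoded = hash_encode("the glass of milk", voc_size)
vectors = [to_one_hot(i, voc_size) for i in encoded]

print(encoded)                        # four integers, one per word
print(len(vectors), len(vectors[0]))  # 4 vectors, each of length voc_size
```

Each word costs one integer in `encoded` but 1000 numbers in `vectors`, which is why the index form is the one passed to the model.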

In [4]:
## word Embedding Representation

from tensorflow.keras.layers import Embedding
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.models import Sequential

In [5]:
sent_length=8
embedded_docs=pad_sequences(one_hot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[  0   0   0   0 320  92  27 544]
 [  0   0   0   0 320  92  27 435]
 [  0   0   0   0 320 665  27 709]
 [  0   0   0 146  11 927 962 651]
 [  0   0   0 146  11 927 962 990]
 [  0   0   0 591 320 454  27 180]
 [  0   0   0   0 388 945 109 962]]


### What is pad_sequences and why is it used?
- A batch fed to a neural network must be a rectangular tensor, so every sequence in the batch needs the same length.
- In NLP, sentences have different numbers of words, so their integer-encoded representations have different lengths.
- `pad_sequences` is used to make all sequences the same length by adding zeros (or another value) at the beginning or end.
- This allows you to batch sentences together and feed them into the model efficiently.
- Example usage:
```python
from tensorflow.keras.utils import pad_sequences
sent_length = 8
embedded_docs = pad_sequences(one_hot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)
```
- Here, all sentences are padded to length 8, so the input to the model is a consistent shape.
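
What pre-padding does can be sketched in a few lines of plain Python. `pad_pre` is a hypothetical helper for illustration, not the Keras implementation. It assumes `maxlen >= 1` and, for sequences longer than `maxlen`, keeps the last `maxlen` items, which is how `pad_sequences` behaves with its default `truncating='pre'`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (illustrative only)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front if the sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[320, 92, 27, 544], [146, 11, 927, 962, 651]], maxlen=8)
for row in rows:
    print(row)
# [0, 0, 0, 0, 320, 92, 27, 544]
# [0, 0, 0, 146, 11, 927, 962, 651]
```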

In [6]:
dim = 10

model=Sequential()
model.add(Embedding(voc_size,dim))
model.compile('adam','mse')
model.build(input_shape=(None, sent_length))

print(voc_size)
print(dim)
print(sent_length)



1000
10
8


### Explanation of Embedding Layer Parameters and Data Shape
- **voc_size**: The size of your vocabulary, passed to the embedding layer as its number of rows. Valid input indices run from 0 to voc_size - 1, and each index has its own embedding vector.
    - Example: If voc_size=1000, the embedding layer holds 1000 vectors: one for index 0 (used here for padding) and 999 for word indices.
- **dim**: The dimension of the embedding vector for each word. This is how many numbers represent each word in the embedding space.
    - Example: If dim=10, each word is represented by a 10-dimensional vector (e.g., [0.12, -0.34, ..., 0.56]).
- **sent_length**: The length to which all input sequences (sentences) are padded or truncated. This ensures all input data has the same shape for batching.
    - Example: If sent_length=8, every sentence is represented as a sequence of 8 word indices (padded with zeros if shorter).

#### How does the data look?
- Before embedding: Each sentence is a list of integers (word indices), all padded to length `sent_length`.
    - Example: `[0, 0, 0, 0, 12, 45, 23, 67]` (if the sentence had 4 words, padded to 8)
- After embedding: Each integer is replaced by a vector of length `dim`. So, each sentence becomes a matrix of shape `(sent_length, dim)`.
    - Example:
        - Input: `[0, 0, 0, 0, 12, 45, 23, 67]`
        - Output:
            ```
            [[0.01, -0.02, ..., 0.03],   # padding (index 0): same vector every time, not zeros
             [0.01, -0.02, ..., 0.03],   # padding (index 0)
             [0.01, -0.02, ..., 0.03],   # padding (index 0)
             [0.01, -0.02, ..., 0.03],   # padding (index 0)
             [0.12, -0.34, ..., 0.56],
             [0.22, 0.11, ..., -0.09],
             [0.05, 0.18, ..., 0.33],
             [-0.12, 0.44, ..., 0.21]]
            ```
- The output shape from the embedding layer is `(batch_size, sent_length, dim)`.
- This format is ready for input to RNN, LSTM, or other sequence models.
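
The shape change described above can be reproduced with NumPy alone, treating the embedding as a lookup table. This is a rough sketch rather than the Keras implementation: `table` and `batch` are hypothetical names, and the uniform range of ±0.05 is an assumption chosen to resemble the magnitudes in the notebook's output.

```python
import numpy as np

voc_size, dim, sent_length = 1000, 10, 8

# Stand-in for the embedding weights: one row of length `dim` per index.
rng = np.random.default_rng(0)
table = rng.uniform(-0.05, 0.05, size=(voc_size, dim)).astype(np.float32)

# Two padded sentences from the notebook, shape (batch_size, sent_length).
batch = np.array([[0, 0, 0, 0, 320, 92, 27, 544],
                  [0, 0, 0, 146, 11, 927, 962, 651]])

# Embedding lookup: every index is replaced by its row of the table.
embedded = table[batch]
print(embedded.shape)  # (2, 8, 10)

# The same result via explicit one-hot vectors and a matrix product.
one_hot_vectors = np.eye(voc_size, dtype=np.float32)[batch]  # (2, 8, 1000)
via_matmul = one_hot_vectors @ table
print(np.allclose(embedded, via_matmul))  # True
```

The lookup avoids building the `(2, 8, 1000)` one-hot tensor, which is the practical reason indices are passed to the layer directly.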

In [7]:
model.summary()

In [8]:
model.predict(embedded_docs)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 67ms/step


array([[[-8.77772644e-03,  3.36878560e-02, -2.94252522e-02,
         -4.00644056e-02,  2.90828608e-02,  6.52499124e-03,
         -4.72992063e-02,  4.38555814e-02,  1.96745731e-02,
         -2.31575612e-02],
        [-8.77772644e-03,  3.36878560e-02, -2.94252522e-02,
         -4.00644056e-02,  2.90828608e-02,  6.52499124e-03,
         -4.72992063e-02,  4.38555814e-02,  1.96745731e-02,
         -2.31575612e-02],
        [-8.77772644e-03,  3.36878560e-02, -2.94252522e-02,
         -4.00644056e-02,  2.90828608e-02,  6.52499124e-03,
         -4.72992063e-02,  4.38555814e-02,  1.96745731e-02,
         -2.31575612e-02],
        [-8.77772644e-03,  3.36878560e-02, -2.94252522e-02,
         -4.00644056e-02,  2.90828608e-02,  6.52499124e-03,
         -4.72992063e-02,  4.38555814e-02,  1.96745731e-02,
         -2.31575612e-02],
        [ 2.17866041e-02, -2.10267790e-02, -5.89563698e-03,
          3.59113477e-02, -1.44238099e-02,  2.82112025e-02,
          4.78067137e-02,  1.81851424e-02, -1.140854

### Interpreting the Output of `model.predict(embedded_docs)`

- The output of this cell is a **3D numpy array** with shape `(number of sentences, sent_length, dim)`.
    - In this example: `(7, 8, 10)` because there are 7 sentences, each padded to length 8, and each word is represented by a 10-dimensional embedding vector.

- **What does each number mean?**
    - For each sentence, for each word position (including padding), you get a vector of length 10.
    - The values are floating-point numbers. This model has not been trained, so they are the randomly initialized embedding values for each word index; after training they would be the learned values.

- **Example:**
    - For the first sentence, the output might look like:
        ```
        [[ 0.01, -0.02, ..., 0.03],   # padding (index 0)
         [ 0.01, -0.02, ..., 0.03],   # padding (index 0)
         ...
         [ 0.12, 0.45, ..., -0.11],   # word 1
         [ 0.22, -0.15, ..., 0.09],   # word 2
         ... ]
        ```

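- **Padding positions** (where the input was 0) get the embedding for index 0. That row is initialized randomly like any other, so it holds small random values rather than zeros. This is why the first four rows of the first sentence's output are identical and non-zero.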

- **Why is this useful?**
    - The embedding layer transforms each word index into a dense vector that can capture semantic meaning once the layer has been trained.
    - The output is ready to be fed into further layers like RNN, LSTM, or CNN for sequence modeling.

- **Summary Table:**

| Dimension         | Meaning                                 |
|-------------------|-----------------------------------------|
| 1st (axis 0)      | Sentence index (batch)                  |
| 2nd (axis 1)      | Word position in the sentence           |
| 3rd (axis 2)      | Embedding vector for each word (length=dim) |

- You can inspect the shape with:
```python
output = model.predict(embedded_docs)
print(output.shape)  # (7, 8, 10)
```


In [9]:
output = model.predict(embedded_docs)
print(output.shape)  # (7, 8, 10)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
(7, 8, 10)


In [10]:
embedded_docs[0]

array([  0,   0,   0,   0, 320,  92,  27, 544], dtype=int32)

In [11]:
output[0]  # Output for the first sentence [0, 0, 0, 0, 320, 92, 27, 544]
# why 8 lists? because the sentence is padded to 8 positions, and each position is converted to a 10-dimensional vector

array([[-8.7777264e-03,  3.3687856e-02, -2.9425252e-02, -4.0064406e-02,
         2.9082861e-02,  6.5249912e-03, -4.7299206e-02,  4.3855581e-02,
         1.9674573e-02, -2.3157561e-02],
       [-8.7777264e-03,  3.3687856e-02, -2.9425252e-02, -4.0064406e-02,
         2.9082861e-02,  6.5249912e-03, -4.7299206e-02,  4.3855581e-02,
         1.9674573e-02, -2.3157561e-02],
       [-8.7777264e-03,  3.3687856e-02, -2.9425252e-02, -4.0064406e-02,
         2.9082861e-02,  6.5249912e-03, -4.7299206e-02,  4.3855581e-02,
         1.9674573e-02, -2.3157561e-02],
       [-8.7777264e-03,  3.3687856e-02, -2.9425252e-02, -4.0064406e-02,
         2.9082861e-02,  6.5249912e-03, -4.7299206e-02,  4.3855581e-02,
         1.9674573e-02, -2.3157561e-02],
       [ 2.1786604e-02, -2.1026779e-02, -5.8956370e-03,  3.5911348e-02,
        -1.4423810e-02,  2.8211202e-02,  4.7806714e-02,  1.8185142e-02,
        -1.1408545e-02,  2.8855726e-04],
       [-4.9878430e-02, -2.0577634e-02,  3.6887061e-02,  1.7705563e-02,
   

In [12]:
output[0][0]  # Output for the first position in the first sentence (a padding position, index 0)
# why length 10? because dim is 10 in the embedding layer

array([-0.00877773,  0.03368786, -0.02942525, -0.04006441,  0.02908286,
        0.00652499, -0.04729921,  0.04385558,  0.01967457, -0.02315756],
      dtype=float32)