### Tokenization
- Tokenization is a preprocessing step that runs before the input text is fed into a Large Language Model (LLM).

- Tokenization splits the input and output texts into smaller units, called tokens, that the LLM can process.
    - Tokens can be words, characters, subwords, or symbols, depending on the type and size of the model.
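
To make these granularities concrete, here is a minimal pure-Python sketch. The helper names (`word_tokens`, `char_tokens`, `subword_tokens`) and the tiny hand-written `toy_vocab` are hypothetical and exist only for illustration. Real LLM tokenizers typically learn their subword vocabulary from data instead of using a fixed list like this one.

```python
import re

def word_tokens(text):
    """Split into words, keeping each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    """Treat every character (including spaces) as a token."""
    return list(text)

def subword_tokens(word, vocab):
    """Greedy longest-match split of one word against a toy vocabulary.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens, start = [], 0
    while start < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens

# Hypothetical vocabulary, hand-picked for this example.
toy_vocab = {"token", "ization", "un", "believ", "able"}

print(word_tokens("I love my dog!"))
print(char_tokens("dog"))
print(subword_tokens("tokenization", toy_vocab))
print(subword_tokens("unbelievable", toy_vocab))
```

Word-level tokens keep the vocabulary readable but cannot represent unseen words, character-level tokens handle any input at the cost of longer sequences, and subwords sit between the two.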

#### Submit your prompts (input)
- Tokenization
    - Splitting the input text or prompts into tokens.
    - Vectorization (converting the tokens into numerical vectors), after which the vectors are fed into the model.
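
The two steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: tokens come from a whitespace split, the helpers `build_vocab` and `one_hot` are hypothetical names, and one-hot vectors stand in for the learned embedding vectors that an actual LLM would use.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

prompt = "you love my dog and my cat"

tokens = prompt.split()                    # step 1: tokenization (whitespace split)
vocab = build_vocab(tokens)                # token -> integer id
ids = [vocab[token] for token in tokens]   # step 2a: tokens as integers
vectors = [one_hot(i, len(vocab)) for i in ids]  # step 2b: integers as vectors

print(tokens)
print(ids)
print(vectors[0])
```

Note that the repeated token `my` receives the same id both times, which is what lets a model recognize it as the same word.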

#### Install **TensorFlow** if needed

In [1]:
!pip install tensorflow






#### Importing TensorFlow and Keras
In this code, we import the TensorFlow library as `tf` and the Keras module from TensorFlow as `keras`. TensorFlow is an open-source machine learning framework, and Keras is its high-level API for building and training neural networks.

We also import the `Tokenizer` class from the `tensorflow.keras.preprocessing.text` module, which converts text into sequences of integers. The TensorFlow documentation marks this class as deprecated in favor of `tf.keras.layers.TextVectorization`, but it remains a convenient way to illustrate the idea.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
```


In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

#### Creating a Tokenizer and Fitting on Texts
In this code, we create a `Tokenizer` object with `num_words=100`. The `Tokenizer` class is a utility that converts text into sequences of integers, and `num_words` limits those conversions to the most frequent words (Keras keeps the top `num_words - 1`). The next cell then calls the `fit_on_texts` method to build the internal vocabulary from a list of sentences. The sentences are:

```python
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
```


In [3]:
sentences =[
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

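# Caution: this rebinds the name `Tokenizer` from the class to an instance,
# so re-running this cell raises a TypeError. Later cells use the instance.
Tokenizer = Tokenizer(num_words=100)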

#### Getting the Word Index from the Tokenizer
In this code, we first fit the `Tokenizer` object on the sentences and then read its `word_index` attribute, a dictionary that maps each word in the sentences to a unique integer. The sentences are the same as before:

```python
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
```

The output of the next cell shows each word mapped to an integer. Before counting, the `Tokenizer` lowercases the text and strips punctuation, so `dog` and `dog!` are treated as the same word. Indices are assigned by descending frequency rather than by position in the sentences: `my`, the most frequent word, gets index 1. This mapping is what lets us convert text into numerical sequences that can be used as input for machine learning models.

In [4]:
Tokenizer.fit_on_texts(sentences)

word_index = Tokenizer.word_index

print(word_index)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
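
The ranking in this output can be reproduced without TensorFlow. The following is a simplified sketch, not the actual Keras implementation: the helper `simple_tokens` is a hypothetical name, and its regular expression is only an approximation of the default filtering. It counts words in order of first appearance, then ranks them by descending count, with ties keeping their first-seen order.

```python
import re
from collections import Counter

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

def simple_tokens(text):
    """Lowercase the text and keep runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Count every word; Counter remembers the order in which words first appear.
counts = Counter()
for sentence in sentences:
    counts.update(simple_tokens(sentence))

# sorted() is stable, so words with equal counts keep their first-seen order.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Ranks start at 1, matching the indices in the notebook output.
word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

print(word_index)
```

For these four sentences the sketch yields the same mapping as the notebook output above, which is a quick way to confirm that the indices reflect frequency and not sentence order.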


#### Converting Texts to Sequences
The statement `sequences = Tokenizer.texts_to_sequences(sentences)` calls the `texts_to_sequences` method of the `Tokenizer` object and assigns the result to the variable `sequences`. The method converts a list of texts into a list of integer sequences, using the word-to-integer mapping that the `Tokenizer` object learned from the sentences.

The output shows the numerical representation of each sentence. The sublists follow the order of the sentences, and the integers within each sublist follow the order of the words in that sentence. These integer sequences can then be used as input for machine learning models.

In [5]:
sequences = Tokenizer.texts_to_sequences(sentences)

print(sequences)

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
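
The lookup behind this output can be sketched in plain Python. This is an illustration, not the Keras code: `to_sequence` and its `oov_id` parameter are hypothetical names. The sketch assumes, in line with the default behavior of the Keras `Tokenizer`, that words missing from the index are skipped. Keras offers an `oov_token` argument for a similar purpose, but it assigns indices differently from this sketch.

```python
import re

# Word index copied from the output printed earlier in this notebook.
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

def to_sequence(text, index, oov_id=None):
    """Map each known word to its integer.

    Unknown words are skipped unless `oov_id` (hypothetical parameter) is given,
    in which case they are replaced by that id.
    """
    sequence = []
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word in index:
            sequence.append(index[word])
        elif oov_id is not None:
            sequence.append(oov_id)
    return sequence

print(to_sequence('Do you think my dog is amazing?', word_index))
print(to_sequence('I love my hamster', word_index))             # 'hamster' is dropped
print(to_sequence('I love my hamster', word_index, oov_id=11))  # 'hamster' becomes 11
```

Silently dropping unseen words shortens the sequence and loses information, which is why reserving an explicit out-of-vocabulary id is often preferable.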


#### Counting how many times each word occurs
The statement `print(Tokenizer.word_counts)` prints the `word_counts` attribute of the `Tokenizer` object. This attribute is an ordered dictionary that stores how many times each word occurs in the sentences that the `Tokenizer` object was fitted on.

The output lists the words in the order they were first encountered, each paired with its count. For example, `my` appears in all four sentences, so its count is 4. These counts are available for further analysis or processing. The sentences are the same as before:

```python
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
```

In [6]:
print(Tokenizer.word_counts)

OrderedDict([('i', 2), ('love', 3), ('my', 4), ('dog', 3), ('cat', 1), ('you', 2), ('do', 1), ('think', 1), ('is', 1), ('amazing', 1)])
