<a href="https://colab.research.google.com/github/lcbjrrr/genai/blob/main/00_RAG_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals

## Natural Language Processing (NLP)

Natural Language Processing (NLP) is a multidisciplinary field at the intersection of computer science, artificial intelligence (AI), and linguistics. Its core goal is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. This involves developing algorithms and models that process raw text and speech data, allowing machines to perform tasks such as translation, sentiment analysis, and text summarization, and to power applications such as voice assistants. NLP is fundamental to how people interact with technology, making systems more intuitive and better able to handle the complexity and nuance of human communication.

## Language Neural Networks and Deep Learning


Deep learning, based on neural networks and specifically Recurrent Neural Networks (RNNs), revolutionized Natural Language Processing (NLP) by enabling models to capture the sequential nature of human language. RNN variants such as LSTMs and GRUs use an internal hidden state (memory) to carry context across a sequence of words, overcoming a key limitation of standard feedforward networks, which process each input independently of the ones before it. This capability lets them handle sequential tasks such as machine translation (using encoder-decoder models), language modeling for text generation, and sentiment analysis. Although the newer Transformer architecture has become dominant, RNNs established the deep learning foundation for handling the complexity, context, and dependencies inherent in text data.

![](https://pbs.twimg.com/media/G55BA6oXIAA4x_y?format=jpg&name=900x900)

### A Feedforward Neural Network (FNN)

A Feedforward Neural Network (FNN) is the simplest type of artificial neural network. In an FNN, information moves in only one direction: forward from the input layer, through any hidden layers, to the output layer. There are no loops or cycles, so the network keeps no memory of earlier inputs. This makes FNNs suitable for tasks like classification and regression, where each input is mapped directly to an output.

![](https://pbs.twimg.com/media/G55BMjMWwAAFIVf?format=png&name=360x360)
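
To make the one-directional flow concrete, here is a minimal NumPy sketch of a forward pass through a network with a single hidden layer. The layer sizes, random weights, and function names are illustrative assumptions rather than part of the notebook's model, and no training is shown.

```python
import numpy as np

def relu(z):
    # Element-wise rectified linear unit
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def feedforward(x, W1, b1, W2, b2):
    """Input -> hidden -> output, with no loops or stored state."""
    hidden = relu(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3          # illustrative sizes
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

x = rng.normal(size=(2, n_in))           # a batch of 2 inputs
probs = feedforward(x, W1, b1, W2, b2)
print(probs.shape)                       # (2, 3)
print(probs.sum(axis=1))                 # each row sums to 1 (up to rounding)
```

Each row of `probs` depends only on the matching row of `x`: nothing is carried over from one input to the next.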

### Recurrent Neural Networks (RNNs)

A Recurrent Neural Network (RNN) is a type of neural network designed to process sequential data, such as text, speech, or time series. Unlike feedforward networks, RNNs have a loop that passes information from one step of the sequence to the next, giving them an internal memory, or hidden state. This memory lets the network take the previous elements of a sequence into account when processing the current one. RNNs are therefore well suited to natural language processing (NLP), where the meaning of a word depends heavily on the words that precede it, and to sequence-dependent tasks such as language modeling and machine translation.

![](https://pbs.twimg.com/media/G55B_46WMAA12ts?format=png&name=360x360)
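
The recurrence described above can be sketched in a few lines of NumPy. This is a simplified Elman-style RNN with a `tanh` activation and random, untrained weights. The names `rnn_forward`, `W_xh`, `W_hh`, and `b_h` are assumptions made for illustration and do not reflect Keras internals.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state (the "memory")
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n_features, n_hidden, n_steps = 5, 4, 3  # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(n_features, n_hidden))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

sequence = rng.normal(size=(n_steps, n_features))
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)                      # (3, 4)

# Perturb only the first input and compare the final hidden states
perturbed = sequence.copy()
perturbed[0] += 1.0
states_perturbed = rnn_forward(perturbed, W_xh, W_hh, b_h)
print(np.abs(states[-1] - states_perturbed[-1]).max())
```

The last line prints how far the final hidden state moves when only the first input changes, which is the sense in which the hidden state carries context forward.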

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

# Training text: a single sentence, modeled one character at a time
text = "Deep learning, based on neural networks and specifically Recurrent Neural Networks (RNNs), revolutionized Natural Language Processing (NLP)"
# Vocabulary: every distinct character, sorted so the index order is stable
chars = sorted(set(text))
# Lookup tables between characters and integer indices
char_to_index = {char: i for i, char in enumerate(chars)}
index_to_char = {i: char for i, char in enumerate(chars)}
char_to_index

{' ': 0,
 '(': 1,
 ')': 2,
 ',': 3,
 'D': 4,
 'L': 5,
 'N': 6,
 'P': 7,
 'R': 8,
 'a': 9,
 'b': 10,
 'c': 11,
 'd': 12,
 'e': 13,
 'f': 14,
 'g': 15,
 'i': 16,
 'k': 17,
 'l': 18,
 'n': 19,
 'o': 20,
 'p': 21,
 'r': 22,
 's': 23,
 't': 24,
 'u': 25,
 'v': 26,
 'w': 27,
 'y': 28,
 'z': 29}

In [None]:
# First training example: three characters in, the next character out
seq = text[0:3]
label = text[3]
print(seq, [char_to_index[char] for char in seq])
print(label, char_to_index[label])

Dee [4, 13, 13]
p 21


In [None]:
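# Slide a window of seq_length characters over the text; each window is an
# input and the character immediately after it is the label to predict
seq_length = 3
sequences = []
labels = []
for i in range(len(text) - seq_length):
    seq = text[i:i + seq_length]
    label = text[i + seq_length]
    sequences.append([char_to_index[char] for char in seq])
    labels.append(char_to_index[label])

X = np.array(sequences)  # shape: (num_samples, seq_length)
y = np.array(labels)     # shape: (num_samples,)
print(y)
X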

[21  0 18 13  9 22 19 16 19 15  3  0 10  9 23 13 12  0 20 19  0 19 13 25
 22  9 18  0 19 13 24 27 20 22 17 23  0  9 19 12  0 23 21 13 11 16 14 16
 11  9 18 18 28  0  8 13 11 25 22 22 13 19 24  0  6 13 25 22  9 18  0  6
 13 24 27 20 22 17 23  0  1  8  6  6 23  2  3  0 22 13 26 20 18 25 24 16
 20 19 16 29 13 12  0  6  9 24 25 22  9 18  0  5  9 19 15 25  9 15 13  0
  7 22 20 11 13 23 23 16 19 15  0  1  6  5  7  2]


array([[ 4, 13, 13],
       [13, 13, 21],
       [13, 21,  0],
       [21,  0, 18],
       [ 0, 18, 13],
       [18, 13,  9],
       [13,  9, 22],
       [ 9, 22, 19],
       [22, 19, 16],
       [19, 16, 19],
       [16, 19, 15],
       [19, 15,  3],
       [15,  3,  0],
       [ 3,  0, 10],
       [ 0, 10,  9],
       [10,  9, 23],
       [ 9, 23, 13],
       [23, 13, 12],
       [13, 12,  0],
       [12,  0, 20],
       [ 0, 20, 19],
       [20, 19,  0],
       [19,  0, 19],
       [ 0, 19, 13],
       [19, 13, 25],
       [13, 25, 22],
       [25, 22,  9],
       [22,  9, 18],
       [ 9, 18,  0],
       [18,  0, 19],
       [ 0, 19, 13],
       [19, 13, 24],
       [13, 24, 27],
       [24, 27, 20],
       [27, 20, 22],
       [20, 22, 17],
       [22, 17, 23],
       [17, 23,  0],
       [23,  0,  9],
       [ 0,  9, 19],
       [ 9, 19, 12],
       [19, 12,  0],
       [12,  0, 23],
       [ 0, 23, 21],
       [23, 21, 13],
       [21, 13, 11],
       [13, 11, 16],
       [11, 1

In [None]:
# One-hot encode inputs (samples, seq_length, vocab) and labels (samples, vocab)
X_one_hot = tf.one_hot(X, len(chars))
y_one_hot = tf.one_hot(y, len(chars))
y_one_hot

<tf.Tensor: shape=(136, 30), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]], dtype=float32)>
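
One-hot encoding replaces each integer index with a vector that is all zeros except for a single 1 at that index. The sketch below mimics the effect of `tf.one_hot` in plain NumPy on a toy vocabulary of four symbols. The helper `one_hot` and the toy arrays are illustrative assumptions and are not used elsewhere in the notebook.

```python
import numpy as np

def one_hot(indices, depth):
    """Return an array with a new trailing axis of size `depth`, 1.0 at each index."""
    indices = np.asarray(indices)
    # Row i of the identity matrix is the one-hot vector for index i
    return np.eye(depth, dtype=np.float32)[indices]

toy_X = np.array([[0, 2, 1],
                  [2, 1, 3]])            # 2 sequences of 3 indices each
toy_y = np.array([3, 0])                 # 2 labels

X_enc = one_hot(toy_X, depth=4)
y_enc = one_hot(toy_y, depth=4)
print(X_enc.shape)                       # (2, 3, 4)
print(y_enc.shape)                       # (2, 4)
print(y_enc[0])                          # [0. 0. 0. 1.]
```

The same shape logic explains the tensor above: 136 labels over a 30-character vocabulary give a `(136, 30)` array.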

In [None]:
# One recurrent layer with 50 units reads each 3-step one-hot sequence
model = Sequential()
model.add(SimpleRNN(50, input_shape=(seq_length, len(chars)), activation='relu'))
# Softmax output: one probability per character in the vocabulary
model.add(Dense(len(chars), activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_one_hot, y_one_hot, epochs=100)

Epoch 1/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 7s 27ms/step - accuracy: 0.0246 - loss: 3.4253
Epoch 2/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step - accuracy: 0.0228 - loss: 3.3845
Epoch 3/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - accuracy: 0.0441 - loss: 3.3660
Epoch 4/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.0812 - loss: 3.3250
Epoch 5/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step - accuracy: 0.0731 - loss: 3.3103
Epoch 6/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.0783 - loss: 3.2728
Epoch 7/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step - accuracy: 0.1293 - loss: 3.2363
Epoch 8/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step - accuracy: 0.1522 - loss: 3.2118
Epoch 9/100
5/5 ━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x78e60817f2f0>

In [None]:
# Seed text; the model only sees its last seq_length characters
start_seq = "Deep learn"
generated_text = start_seq
x = np.array([[char_to_index[char] for char in generated_text[-seq_length:]]])
print(x)
x_one_hot = tf.one_hot(x, len(chars))
# Predicted probability for each character in the vocabulary
prediction = model.predict(x_one_hot)
print(prediction)
# Greedy choice: take the most probable next character
next_index = np.argmax(prediction)
print('i=',next_index)
index_to_char[next_index]

[[ 9 22 19]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 471ms/step
[[1.9723009e-02 9.9101022e-04 1.9884860e-04 1.7844429e-03 7.3856302e-07
  1.8663288e-03 5.4733478e-03 2.9078701e-03 4.0997765e-03 4.6949982e-04
  2.1737802e-04 2.3491155e-02 4.7710054e-03 2.4167136e-03 5.0558705e-05
  1.3338528e-02 7.9130316e-01 1.0195729e-04 9.0431850e-03 9.6205473e-03
  8.3617088e-05 3.6288267e-03 4.8437449e-03 5.6957636e-02 2.4613559e-03
  1.7326839e-02 1.2144120e-02 1.0215107e-03 9.6171442e-03 4.6135483e-05]]
i= 16


'i'

In [None]:
# Generate 60 more characters, feeding each prediction back in as input
for i in range(60):
    # Encode the last seq_length characters of the text generated so far
    x = np.array([[char_to_index[char] for char in generated_text[-seq_length:]]])
    x_one_hot = tf.one_hot(x, len(chars))
    prediction = model.predict(x_one_hot)
    # Greedy decoding: always append the most probable character
    next_index = np.argmax(prediction)
    next_char = index_to_char[next_index]
    generated_text += next_char

print("Generated Text:")
print(generated_text)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 121ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 124ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 129ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 139ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 133ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 121ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 139ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 126ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 110ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 166ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 118ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 150ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 121ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━

## T5 (Text-to-Text Transfer Transformer)

The T5 (Text-to-Text Transfer Transformer) model, developed by Google AI, introduced a unified framework in which every language task is treated as a text-to-text problem. Built on the Transformer encoder-decoder architecture, T5 handles diverse tasks, including translation, summarization, question answering, and classification, by prepending a task-specific prefix to the input (e.g., "translate English to German: ...") and producing the output as plain text. This consistency simplifies model design and lets a single pre-trained model be fine-tuned for many tasks. At release, T5 achieved state-of-the-art results on numerous benchmarks.

![](https://pbs.twimg.com/media/G55IpcfW8AAYrY4?format=png&name=360x360)
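
The task-prefix idea can be illustrated without loading any model. The sketch below only builds the input strings. The helper `build_t5_input` and the `TASK_PREFIXES` table are hypothetical names introduced here, and the two prefixes shown are ones associated with the original T5 checkpoints (other checkpoints may expect different prompts).

```python
# Hypothetical lookup table: task name -> text prefix expected by the model
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}

def build_t5_input(task, text):
    """Prepend the task prefix so a single model can tell the tasks apart."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PREFIXES[task] + text.strip()

print(build_t5_input("translate_en_de", "Good morning"))
# translate English to German: Good morning
print(build_t5_input("summarize", "  A long article goes here.  "))
# summarize: A long article goes here.
```

Both strings would be tokenized and passed to the same model. Only the prefix tells it which task to perform.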

Transformers are a neural network architecture built around an attention mechanism that weighs the importance of different parts of the input. Because attention relates all positions in a sequence to one another directly, Transformers can process sequences in parallel and capture long-range dependencies efficiently. They are the foundation of modern large language models such as BERT and GPT.

![](https://pbs.twimg.com/media/G55JZMnXAAAgub2?format=jpg&name=900x900)
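
As a rough illustration of the attention mechanism, here is a NumPy sketch of single-head scaled dot-product attention. It is a simplification: a real Transformer derives the queries, keys, and values from learned projections and uses multiple heads, whereas this example reuses the raw inputs for all three. The sizes and variable names are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the attention weights."""
    d_k = Q.shape[-1]
    # Similarity of every query position with every key position
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (max subtracted for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                  # illustrative sizes
X = rng.normal(size=(seq_len, d_model))  # one vector per token

# Assumption: use X directly as queries, keys, and values (no learned projections)
output, weights = scaled_dot_product_attention(X, X, X)
print(output.shape)                      # (4, 8)
print(weights.shape)                     # (4, 4)
print(weights.sum(axis=1))               # each row sums to 1 (up to rounding)
```

All rows of `weights` come from a single matrix product, so every position attends to every other position at once, with no step-by-step loop over the sequence.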

In [None]:
# !pip install transformers torch sentencepiece

In [None]:
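from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch  # PyTorch backend used by the model

# Download (on first use) and load the pre-trained t5-small tokenizer and weights
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)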


The cell above loads the T5 language model for text-to-text tasks. It imports the necessary components from the transformers library, specifies the pre-trained `t5-small` checkpoint, and initializes both the tokenizer (which converts text to numerical token IDs) and the model from that checkpoint.



In [None]:
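# The task prefix tells T5 which task to perform
input_text = "translate English to German: Good morning"
# Tokenize to PyTorch tensors
inputs = tokenizer(input_text, return_tensors="pt")
# Beam search with 5 beams, generating at most 50 tokens
translation_ids = model.generate(inputs.input_ids, max_length=50, num_beams=5, early_stopping=True)
# Convert the generated token IDs back to text, dropping special tokens
translation_text = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
print("Translation:", translation_text)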

Translation: Guten Morgen


This snippet uses the tokenizer and model loaded earlier to perform machine translation. It tokenizes the input text ('translate English to German: Good morning'), generates output token IDs with beam search (`num_beams=5`, at most 50 tokens), decodes these IDs back into human-readable text, and prints the German translation.