
## Exercise: Machine Translation
Build a sequential model for translation from English to Italian. The model takes an English sequence as input and returns an Italian sequence as output:

1. Define a model with an embedding layer, at least two recurrent layers, and an output Dense layer with as many neurons as the target (Italian) vocabulary.
NB: To improve performance at the Dense layer, wrap it in a TimeDistributed layer, like this: TimeDistributed(Dense())
> The TimeDistributed() layer in Keras is a "smart shortcut" for applying the same layer (e.g. Dense) to every step of a temporally structured sequence (text, audio, time series...).\
> IN SHORT: instead of using Flatten or GlobalAveragePooling1D to collapse the time axis of the 3D recurrent output before the Dense layer, TimeDistributed lets me apply Dense directly to the 3D tensor, one time step at a time.\
>
EXAMPLE:\
x = TimeDistributed(Dense(32))(x)\
"For each of the 5 words, apply the same Dense(32) network" (assuming sequences of 5 words)\
Use cases:
* After LSTM(return_sequences=True) > apply a classifier at every step
* Video, frame by frame > CNN on every frame
* Text, word by word > Dense on every embedding
2. Train for at least 100 epochs
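
To make the TimeDistributed idea concrete, here is a minimal NumPy sketch of what "the same Dense layer at every time step" computes. It is not the Keras implementation: the shapes (2 sequences, 5 steps, 16 features, 32 units) and the random weights are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, features, units = 2, 5, 16, 32  # illustrative sizes

# A batch of sequences: 5 time steps (words), 16 features each.
x = rng.normal(size=(batch, timesteps, features))

# One shared set of Dense weights, reused at every time step.
W = rng.normal(size=(features, units))
b = np.zeros(units)

# Apply the same affine map to each time step independently.
per_step = np.stack([x[:, t, :] @ W + b for t in range(timesteps)], axis=1)

# Equivalent single matrix product over the last axis.
all_at_once = x @ W + b

print(per_step.shape)                      # (2, 5, 32)
print(np.allclose(per_step, all_at_once))  # True
```

The time axis is preserved, so every word position gets its own 32-dimensional output while the weights stay shared.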

In [23]:
!pip install mlflow


In [1]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = '3'
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, TimeDistributed
from tensorflow.keras.layers import SimpleRNN, LSTM, GRU, Bidirectional
from tensorflow.keras.backend import clear_session

## 1. DATASET

In [3]:
df = pd.read_csv("machine_translation.csv")
df.head()

Unnamed: 0,italian,english
0,tom portò i suoi.,tom brought his.
1,a te non piace il pesce?,don't you like fish?
2,non abbiamo mai riso.,we never laughed.
3,aspetti un momento.,hang on a moment.
4,quando è finito?,when did that end?


## 2. DEFINING X, Y

In [4]:
x = df["english"].values
y = df["italian"].values

In [5]:
print(x.shape, y.shape)

(50000,) (50000,)


## 3. TRAIN_TEST_SPLIT

In [5]:
xtrain, xtest, ytrain, ytest = train_test_split(x,y, test_size=.2,random_state=1)

In [8]:
print(xtrain.shape, xtest.shape)

(40000,) (10000,)


## 4. TOKENIZER

In [6]:
tokenizer_eng = Tokenizer(num_words=50000)
tokenizer_ita = Tokenizer(num_words=50000)

## 5. BUILDING THE VOCABULARY

In [7]:
tokenizer_eng.fit_on_texts(xtrain)
tokenizer_ita.fit_on_texts(ytrain)

In [8]:
vocabolary_eng = len(tokenizer_eng.index_word)+1
vocabolary_eng

4932

In [9]:
vocabolary_ita = len(tokenizer_ita.index_word)+1
vocabolary_ita

10005
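
The `+1` in the vocabulary size is there because word indices start at 1 and index 0 is reserved for padding. The sketch below illustrates this with a toy, frequency-ranked word index in pure Python. `build_word_index` is a hypothetical helper, not the Keras `Tokenizer` code, and the alphabetical tie-breaking is my own assumption.

```python
from collections import Counter

def build_word_index(texts):
    """Toy frequency-ranked word index (hypothetical helper, not Keras)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent word gets index 1; index 0 is left free for padding.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

texts = ["tom is awake", "tom is here", "we heard tom"]
word_index = build_word_index(texts)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index["tom"])  # 1
print(vocab_size)         # 7
```

With 6 distinct words the largest index is 6, so an Embedding or output layer needs 7 slots to cover indices 0 through 6.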

## 6. BUILDING THE SEQUENCES

In [10]:
train_sequences_eng = tokenizer_eng.texts_to_sequences(xtrain)
test_sequences_eng = tokenizer_eng.texts_to_sequences(xtest)
train_sequences_ita = tokenizer_ita.texts_to_sequences(ytrain)
test_sequences_ita = tokenizer_ita.texts_to_sequences(ytest)

In [11]:
maxlen_eng = len(max(train_sequences_eng, key=len))
maxlen_ita = len(max(train_sequences_ita, key=len))

In [15]:
print(f"MAXLEN ENG: {maxlen_eng}, MAXLEN ITA: {maxlen_ita}")

MAXLEN ENG: 6, MAXLEN ITA: 10


## 7. PADDING

In [12]:
padded_train_sequences_eng = pad_sequences(train_sequences_eng, maxlen=maxlen_eng)
padded_test_sequences_eng = pad_sequences(test_sequences_eng, maxlen=maxlen_eng)
padded_train_sequences_ita = pad_sequences(train_sequences_ita, maxlen=maxlen_ita)
padded_test_sequences_ita = pad_sequences(test_sequences_ita, maxlen=maxlen_ita)

print("ENG: ", padded_train_sequences_eng, padded_test_sequences_eng)
print("ITA: ", padded_train_sequences_ita, padded_test_sequences_ita)

ENG:  [[   0    0    2    4   40 2026]
 [   0    0    1   29   22   88]
 [   0    0    0    8   43  421]
 ...
 [   0    0    0   16  379 3084]
 [   0    0    1 1586    5  844]
 [   0    0    0  107   13  521]] [[   0    0    0    2    4  577]
 [   0    0    1   53   28    6]
 [   0    0    0    2    4 1327]
 ...
 [   0   14    6   22  636  349]
 [   0    1   21    9  408   76]
 [   0    0   25    3  798   19]]
ITA:  [[   0    0    0 ...    2   31 2298]
 [   0    0    0 ...    7   88  131]
 [   0    0    0 ...    2   41 1031]
 ...
 [   0    0    0 ...    2 1028 3780]
 [   0    0    0 ... 5884    9 3131]
 [   0    0    0 ...   10  268 1055]] [[   0    0    0 ...    1    2  548]
 [   0    0    0 ...   81  135  252]
 [   0    0    0 ...    1    2  955]
 ...
 [   0    0    0 ...    6  735  120]
 [   0    0    0 ...    4   73 4045]
 [   0    0    0 ...   18  164 1167]]


In [13]:
padded_train_sequences_eng.shape

(40000, 6)
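
The zeros in the arrays above sit on the left because `pad_sequences` pads (and truncates) at the start of each sequence by default. A minimal pure-Python sketch of that behaviour, using a hypothetical helper `pad_pre` and two token lists taken from the output above:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[2, 4, 40, 2026], [8, 43, 421]], maxlen=6))
# [[0, 0, 2, 4, 40, 2026], [0, 0, 0, 8, 43, 421]]
```

This is only an illustration of the default `padding='pre'` and `truncating='pre'` settings; the real function also handles dtypes and returns a NumPy array.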

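Since I have to translate from English to Italian, and the English sentences are padded to 6 while the Italian ones are padded to 10, I also pad the English ones up to 10. With `return_sequences=True` on every recurrent layer the model emits one output per input time step, so input and target must have the same length.

In [14]:
traslate_padding_sequences_train = pad_sequences(padded_train_sequences_eng, maxlen=maxlen_ita)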

In [19]:
traslate_padding_sequences_train.shape

(40000, 10)

## 8. BUILDING THE NETWORK
The network is sized from its output, i.e. from the Italian side: the sequence length is maxlen_ita and the final layer has vocabolary_ita neurons.


## 9. EVALUATING THE NETWORK

In [15]:
# The English test set must also be padded to maxlen_ita
traslate_padding_sequences_test = pad_sequences(padded_test_sequences_eng, maxlen=maxlen_ita)

def evaluate_model(model):
    model.evaluate(traslate_padding_sequences_test, padded_test_sequences_ita)
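
One caveat when reading the number that `model.evaluate` reports: the `accuracy` metric counts every position of the target, including the leading padding zeros, so correctly predicted `<PAD>` positions raise the score. The NumPy sketch below compares plain accuracy with an accuracy that skips padding targets. `token_accuracy` is a hypothetical helper and the two arrays are made-up examples, not model output.

```python
import numpy as np

def token_accuracy(y_true, y_pred, pad_id=None):
    """Fraction of matching positions; optionally skip padding targets."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if pad_id is None:
        mask = np.ones_like(y_true, dtype=bool)  # count every position
    else:
        mask = y_true != pad_id                  # count real tokens only
    return float((y_true == y_pred)[mask].mean())

# Made-up target and prediction: 7 padding positions, 3 real tokens.
y_true = np.array([[0, 0, 0, 0, 0, 0, 0, 1, 2, 548]])
y_pred = np.array([[0, 0, 0, 0, 0, 0, 0, 1, 2, 31]])

print(token_accuracy(y_true, y_pred))               # 0.9
print(round(token_accuracy(y_true, y_pred, 0), 4))  # 0.6667
```

In this toy case one wrong word out of three drops the padding-aware score to about 0.67, while the plain score still reads 0.9.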

## 10. SHOWING THE RESULTS

In [20]:
def logits_to_text(logits, tokenizer):
    # tokenizer.word_index maps word ➝ index, e.g. {'ciao': 1, 'mondo': 2, 'bella': 3, ...}
    # I invert it to index ➝ word so that a predicted index can be turned back into a word
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    # pick the highest-scoring (most probable) word at each time step via argmax
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
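
As a quick sanity check of the decoding step, here is the same argmax logic on a hand-written score matrix. The four-word vocabulary and the scores are invented for illustration; only the argmax-then-lookup pattern mirrors `logits_to_text`.

```python
import numpy as np

# Toy index-to-word table (assumed vocabulary, index 0 is padding).
index_to_word = {0: "<PAD>", 1: "tom", 2: "è", 3: "sveglio"}

# Rows are time steps, columns are scores over the vocabulary.
logits = np.array([[0.90, 0.05, 0.03, 0.02],
                   [0.10, 0.80, 0.05, 0.05],
                   [0.10, 0.10, 0.70, 0.10],
                   [0.10, 0.10, 0.20, 0.60]])

token_ids = np.argmax(logits, axis=1)  # best index per time step
decoded = " ".join(index_to_word[i] for i in token_ids)
print(decoded)  # <PAD> tom è sveglio
```

Each row is decoded independently, which is why padding positions show up as `<PAD>` in the printed translations.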

In [17]:
def show_translate(model):
    embed_preds = model.predict(traslate_padding_sequences_test)
    # show the translation of the first 20 test sentences
    for i in range(20):
        print(xtest[i])  # source English sentence
        print(logits_to_text(embed_preds[i], tokenizer_ita))  # model's Italian translation
        print("----------------------------------------\n")

### VANILLA RNN

In [23]:
model_RNN = Sequential()
# input_dim = vocabolary_eng; input_length = maxlen_ita because everything is padded to the largest maxlen
model_RNN.add(Embedding(vocabolary_eng, 128, input_length=maxlen_ita))
# return_sequences=True passes the full sequence on, because more layers follow
model_RNN.add(SimpleRNN(64, return_sequences=True, activation="tanh"))
model_RNN.add(SimpleRNN(64, return_sequences=True, activation="tanh"))
model_RNN.add(TimeDistributed(Dense(vocabolary_ita, activation="softmax")))
model_RNN.summary()

In [24]:
model_RNN.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
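
The `sparse_categorical_crossentropy` loss is used because the targets are integer word indices rather than one-hot vectors. A minimal NumPy sketch of what it measures, on a made-up sequence of 3 time steps over a 4-token vocabulary (the function below is my own simplified version, not the Keras one):

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability assigned to the true token ids."""
    y_true = np.asarray(y_true)
    # At each time step, pick the probability of the correct token.
    picked = np.take_along_axis(probs, y_true[..., None], axis=-1)[..., 0]
    return float(-np.log(picked).mean())

# Predicted distributions: one row per time step (illustrative values).
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.10, 0.60, 0.20, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
y_true = np.array([0, 1, 3])  # integer ids, not one-hot vectors

loss = sparse_categorical_crossentropy(y_true, probs)
print(round(loss, 4))  # 0.7513
```

The loss is low where the model puts high probability on the right word and grows as that probability shrinks.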

In [25]:
model_RNN.fit(traslate_padding_sequences_train, padded_train_sequences_ita, validation_split=.2, epochs=100, batch_size=512)

Epoch 1/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 26s 201ms/step - accuracy: 0.5753 - loss: 7.6936 - val_accuracy: 0.6536 - val_loss: 4.6003
Epoch 2/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 2s 35ms/step - accuracy: 0.6506 - loss: 3.9957 - val_accuracy: 0.6536 - val_loss: 3.0107
Epoch 3/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - accuracy: 0.6523 - loss: 2.9721 - val_accuracy: 0.6536 - val_loss: 2.6413
Epoch 4/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 3s 43ms/step - accuracy: 0.6519 - loss: 2.5796 - val_accuracy: 0.6536 - val_loss: 2.4761
Epoch 5/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 3s 44ms/step - accuracy: 0.6520 - loss: 2.4462 - val_accuracy: 0.6547 - val_loss: 2.3954
Epoch 6/100
63/63 ━━━━━━━━━━━━━━━━━━━━ 3s 42ms/step - accuracy: 0.6553 - loss: 2.3877 - val_accuracy: 0.6630 - val_loss: 2.3597
Epoch 7/100
63/63

<keras.src.callbacks.history.History at 0x7e7c3ae155d0>

In [26]:
show_translate(model_RNN)

313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 18ms/step
tom is awake.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> tom è serio
----------------------------------------

i just did it.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> appena a fatto
----------------------------------------

tom is brilliant.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> tom è forte
----------------------------------------

allow me to help.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> a aiutare
----------------------------------------

we heard tom.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> delle tom
----------------------------------------

don't waste my time.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> non non mia tempo
----------------------------------------

you were sick.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> eri malato
----------------------------------------

tom plays piano.
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> tom a no
----------------------------------------

i'm resting

In [31]:
evaluate_model(model_RNN)

313/313 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.7534 - loss: 1.4675


### BIDIRECTIONAL LSTM

In [16]:
model_BidLSTM = Sequential()
# input_dim = vocabolary_eng; input_length = maxlen_ita because everything is padded to the largest maxlen
model_BidLSTM.add(Embedding(vocabolary_eng, 128, input_length=maxlen_ita))
# return_sequences=True passes the full sequence on, because more layers follow
model_BidLSTM.add(Bidirectional(LSTM(64, return_sequences=True, activation="tanh")))
model_BidLSTM.add(Bidirectional(LSTM(64, return_sequences=True, activation="tanh")))
model_BidLSTM.add(TimeDistributed(Dense(vocabolary_ita, activation="softmax")))
model_BidLSTM.summary()
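
As a cross-check on what `summary()` should list, the parameter counts of this model can be worked out by hand. The sketch below assumes the defaults (LSTM with bias, `Bidirectional` concatenating forward and backward outputs) and uses the vocabulary sizes printed earlier (4932 and 10005). `lstm_params` and `dense_params` are hypothetical helpers.

```python
def lstm_params(input_dim, units):
    """One LSTM layer: 4 gates, each with kernel, recurrent kernel and bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Dense layer: weight matrix plus bias vector."""
    return input_dim * units + units

vocab_eng, vocab_ita, embed_dim, units = 4932, 10005, 128, 64

embedding = vocab_eng * embed_dim
bilstm_1 = 2 * lstm_params(embed_dim, units)  # forward + backward copies
bilstm_2 = 2 * lstm_params(2 * units, units)  # input is the 128-wide concat
output = dense_params(2 * units, vocab_ita)   # shared across time steps

print(embedding, bilstm_1, bilstm_2, output)
# 631296 98816 98816 1290645
print(embedding + bilstm_1 + bilstm_2 + output)
# 2119573
```

Most of the weights sit in the output layer, which has to score all 10005 Italian words at every time step.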

In [17]:
import mlflow
import mlflow.keras

mlflow.set_experiment("BidLSTM_Translation_ING_ITA")

<Experiment: artifact_location='file:///content/mlruns/479345570290482522', creation_time=1743952063722, experiment_id='479345570290482522', last_update_time=1743952063722, lifecycle_stage='active', name='BidLSTM_Translation_ING_ITA', tags={}>

In [18]:
def show_translate_file(model, filename="translation_output.txt"):
    with open(filename, "w", encoding="utf-8") as f:
        for i in range(20):
            # Predict one sentence at a time to avoid memory problems
            pred = model.predict(np.expand_dims(traslate_padding_sequences_test[i], axis=0), verbose=0)

            f.write("Original:\n")
            f.write(str(xtest[i]) + "\n")

            translation = logits_to_text(pred[0], tokenizer_ita)
            f.write("Predicted:\n")
            f.write(translation + "\n")
            f.write("-" * 40 + "\n")

In [25]:
with mlflow.start_run(run_name="initial_200_epochs"):
    model_BidLSTM.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    history = model_BidLSTM.fit(
        traslate_padding_sequences_train,
        padded_train_sequences_ita,
        validation_split=0.2,
        epochs=200,
        batch_size=512
    )

    # Log the model
    mlflow.keras.log_model(model_BidLSTM, "BidLSTM_model_200")
    mlflow.log_param("epochs", 200)

    # Log the per-epoch metrics
    for epoch, acc in enumerate(history.history["accuracy"]):
        mlflow.log_metric("train_accuracy", acc, step=epoch)
    for epoch, val_acc in enumerate(history.history["val_accuracy"]):
        mlflow.log_metric("val_accuracy", val_acc, step=epoch)

    show_translate_file(model_BidLSTM, "translation_output.txt")
    mlflow.log_artifact("translation_output.txt", artifact_path="translations")


Epoch 1/200
63/63 ━━━━━━━━━━━━━━━━━━━━ 10s 104ms/step - accuracy: 0.8409 - loss: 0.7701 - val_accuracy: 0.7909 - val_loss: 1.1479
Epoch 2/200
63/63 ━━━━━━━━━━━━━━━━━━━━ 10s 103ms/step - accuracy: 0.8422 - loss: 0.7633 - val_accuracy: 0.7897 - val_loss: 1.1468
Epoch 3/200
63/63 ━━━━━━━━━━━━━━━━━━━━ 10s 102ms/step - accuracy: 0.8423 - loss: 0.7596 - val_accuracy: 0.7902 - val_loss: 1.1452
Epoch 4/200
63/63 ━━━━━━━━━━━━━━━━━━━━ 10s 93ms/step - accuracy: 0.8435 - loss: 0.7539 - val_accuracy: 0.7903 - val_loss: 1.1448
Epoch 5/200
63/63 ━━━━━━━━━━━━━━━━━━━━ 11s 102ms/step - accuracy: 0.8431 - loss: 0.7547 - val_accuracy: 0.7930 - val_loss: 1.1444
Epoch 6/200
63/63 ━━━━━━━━━━━━━━━━━━━━ 10s 93ms/step - accuracy: 0.8436 - loss: 0.7511 - val_accuracy: 0.7916 - val_loss: 1.1438
Epoch 7/200
63

