# **Practical 4**

# Training Word2Vec models

In [16]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

We load the dataset as indicated on its web page.

In [2]:
from datasets import load_dataset

dataset = load_dataset("projecte-aina/catalan_general_crawling")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


The dataset has a single train split, so we extract it and call it train_dataset. The text content is in the 'text' column of train_dataset.

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1016113
    })
})

In [4]:
train_dataset = dataset['train']

In [8]:
train_dataset

Dataset({
    features: ['text'],
    num_rows: 1016113
})

We define a function to preprocess the dataset. It cleans and normalizes the text: it lowercases everything, replaces runs of non-word characters with a space, splits the text into tokens, and joins them back into a single string.

In [36]:
import os
import re
from nltk.tokenize import word_tokenize

def preprocess(text):
    text = text.lower()
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    return ' '.join(tokens)
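
To make the effect of this normalization concrete, here is a minimal standalone sketch of the same steps. It is only an illustration: `normalize` is a hypothetical name, and plain whitespace splitting stands in for NLTK's `word_tokenize` so the snippet runs without downloading tokenizer data.

```python
import re

def normalize(text):
    # Lowercase, then replace every run of non-word characters with one space.
    # Python's \W is Unicode-aware, so accented letters are kept.
    text = re.sub(r'\W+', ' ', text.lower())
    # Assumption: whitespace splitting stands in for nltk.word_tokenize here.
    return ' '.join(text.split())

print(normalize("Hola, món! Què tal?"))  # hola món què tal
```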

Next we define a function that splits the dataset into several parts and preprocesses each one. Its input parameters are the dataset to preprocess, the directory where the split files will be saved, and a list of target sizes for each part (in bytes). Its output is a set of preprocessed text files, split according to the specified sizes and saved in the output directory.

How the function works:
- It creates the output directory if it does not exist, so the resulting files can be saved.
- It initializes variables to track the part currently being processed, its size so far, and the preprocessed text accumulated for it.
- For each row, it extracts the text and preprocesses it with the function above.
- When the target size is reached, it writes the accumulated text to a text file in the output directory and starts the next part.
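
The splitting logic can be tried out on a toy in-memory list before running it on the full corpus. This is a sketch under assumptions: `split_by_size` is a hypothetical helper that returns the parts as lists instead of writing files, and the thresholds are a few bytes rather than megabytes.

```python
def split_by_size(lines, part_sizes):
    """Group consecutive lines into parts until each byte threshold is reached."""
    parts, current, size = [], [], 0
    for line in lines:
        if len(parts) == len(part_sizes):
            break  # every requested part is complete
        current.append(line)
        size += len(line.encode('utf-8'))
        if size >= part_sizes[len(parts)]:
            parts.append(current)
            current, size = [], 0
    if current:
        # The data ran out before the last threshold was reached.
        parts.append(current)
    return parts

print(split_by_size(["aaaa", "bb", "cccc", "d"], [5, 3]))
# [['aaaa', 'bb'], ['cccc']]
```

Note that the parts are consecutive, disjoint slices of the data: the second part starts where the first one ended.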


In [37]:
def dividir_y_preprocesar_dataset(dataset, output_dir, tamano_partes):
    
    if not os.path.exists(output_dir): 
        os.makedirs(output_dir)
        
    total_bytes = 0
    contador = 1
    current_size = 0
    current_part = []
    
    for i, row in enumerate(dataset):
        text = row['text']
        preprocessed_text = preprocess(text)
        current_size += len(preprocessed_text.encode('utf-8'))
        current_part.append(preprocessed_text)
        
        if current_size >= tamano_partes[contador - 1]:
            with open(os.path.join(output_dir, f'parte_{contador}.txt'), 'w', encoding='utf-8') as f:
                for line in current_part:
                    f.write(line + '\n')
            current_part = []
            current_size = 0
            contador += 1
            
            if contador > len(tamano_partes):
                break

    if current_part:
        with open(os.path.join(output_dir, f'parte_{contador}.txt'), 'w', encoding='utf-8') as f:
            for line in current_part:
                f.write(line + '\n')


Once 'dividir_y_preprocesar_dataset' is defined, we call it with the target sizes (100MB, 500MB and 1GB) and the corresponding output directory.

In [38]:
tamano_partes = [100 * 1024 * 1024, 500 * 1024 * 1024, 1 * 1024 * 1024 * 1024] # 100MB, 500MB, 1GB
output_dir = 'divided_datasets'
dividir_y_preprocesar_dataset(train_dataset, output_dir, tamano_partes)

We now train a Word2Vec model for each part of the split, preprocessed dataset.

First, we build a list of paths to the split files so that we can read the texts and train the model on them.

We use gensim's LineSentence to read the sentences, since it turns each line into a list of words for training the Word2Vec model.

Each Word2Vec model is created with the following parameters:
- sentences: the preprocessed sentences read from the file.
- vector_size=100: the dimensionality of the word vectors.
- window=5: the size of the context window.
- min_count=10: only words that appear at least 10 times are kept.
- workers=4: the number of worker threads.
- sg=1: use the Skip-Gram model (instead of CBOW).
- epochs=25: the number of passes over the dataset.

Finally, the trained model is saved to a file.
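
To give an intuition for the sg=1 and window parameters, the following sketch lists the (center, context) training pairs that a Skip-Gram model would see for one sentence with a fixed window. It is a simplified illustration, not gensim's implementation (which differs in details, for example it can vary the effective window size); `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```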

In [39]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

dataset_parts = [f'divided_datasets/parte_{i}.txt' for i in range(1, len(tamano_partes) + 1)]

for i, part in enumerate(dataset_parts):
    sentences = LineSentence(part)
    
    model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=10, workers=4, sg=1, epochs=25)
    
    model.save(f'word2vec_model_part_{i+1}.model')

    print(f'Model for part {i+1} trained and saved.')

Model for part 1 trained and saved.
Model for part 2 trained and saved.
Model for part 3 trained and saved.


In [None]:
# TODO: this is for training on the complete dataset; it still needs to be finished (or skipped, since it has crashed for some people)

In [None]:
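# Write the whole preprocessed corpus to a single file, one document per line,
# because LineSentence expects a file path (or stream), not a string of text.
# 'parte_completa.txt' is a placeholder file name.
full_path = os.path.join(output_dir, 'parte_completa.txt')
with open(full_path, 'w', encoding='utf-8') as f:
    for row in train_dataset:
        f.write(preprocess(row['text']) + '\n')

sentences = LineSentence(full_path)
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=10, workers=4, sg=1, epochs=25)
model.save('word2vec_model_original.model')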

After training the models we want to check that they work correctly, so we add this validation step for the 100MB model.

We check the words most similar to 'informàtica' and the similarity between 'informàtica' and 'coordinador'.

In [14]:
from gensim.models import Word2Vec

model = Word2Vec.load('word2vec_model_part_1.model')

similar_words = model.wv.most_similar('informàtica', topn=10)
print("Paraules similars a 'informàtica':")
for word, similarity in similar_words:
    print(f'{word}: {similarity:.4f}')

similarity = model.wv.similarity('informàtica', 'coordinador')
print(f"Similitud entre 'informàtica' i 'coordinador': {similarity:.4f}")


Paraules similars a 'informàtica':
enginyeria: 0.7088
instrumentació: 0.6706
tecnologia: 0.6700
sig: 0.6690
telecomunicació: 0.6620
ub: 0.6593
aplicacions: 0.6580
informàtic: 0.6461
automàtica: 0.6351
tecnologies: 0.6276
Similitud entre 'informàtica' i 'coordinador': 0.4499
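
The scores above are cosine similarities between word vectors. As a reminder of what that measures, here is a minimal pure-Python sketch on toy vectors (the function name and the vectors are illustrative, not part of gensim):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score close to 1; orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```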


# Semantic Textual Similarity model

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from gensim.models import Word2Vec
from scipy.stats import pearsonr
from datasets import load_dataset
import re
from nltk.tokenize import word_tokenize
import tensorflow as tf

In [None]:
# Compute the Pearson correlation between the model predictions and the gold labels
def compute_pearson(x_, y_, model):
    y_pred = model.predict(x_)
    print(f"y_pred shape: {y_pred.shape}, y_ shape: {y_.shape}")  # Debug print of the shapes
    correlation, _ = pearsonr(y_pred.flatten(), y_.flatten())
    return correlation
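
For reference, the Pearson correlation reported by `pearsonr` is the covariance of the two series divided by the product of their standard deviations. A small pure-Python sketch (the helper name `pearson` is illustrative; it assumes neither series is constant):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# A perfectly linear increasing relation gives a value close to 1,
# a perfectly linear decreasing one a value close to -1.
print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [3, 2, 1]))
```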

In [None]:
# Load the dataset
dataset_ts = load_dataset("projecte-aina/sts-ca")
train_data = dataset_ts['train']
val_data = dataset_ts['validation']
test_data = dataset_ts['test']

# Preprocess the text (this version returns the list of tokens instead of a joined string)
def preprocess(text):
    text = text.lower()
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    return tokens

train_data = [(preprocess(s1), preprocess(s2), label) for s1, s2, label in zip(train_data['sentence1'], train_data['sentence2'], train_data['label'])]
val_data = [(preprocess(s1), preprocess(s2), label) for s1, s2, label in zip(val_data['sentence1'], val_data['sentence2'], val_data['label'])]
test_data = [(preprocess(s1), preprocess(s2), label) for s1, s2, label in zip(test_data['sentence1'], test_data['sentence2'], test_data['label'])]


## One hot

### Preprocessing and vocabulary creation

In [69]:
# Tokenize the sentences
def tokenize_sentences(data):
    return [[word for word in sentence.split()] for sentence in data]

# Collect every sentence in the dataset
all_sentences = []
for s1, s2, _ in train_data + val_data + test_data:
    all_sentences.extend([s1, s2])

# Tokenize all the sentences
tokenized_sentences = tokenize_sentences([' '.join(sent) for sent in all_sentences])

# Build the vocabulary. Index 0 is reserved for a '<pad>' entry so that padding
# and unknown words do not collide with a real word.
vocab = ['<pad>'] + sorted(set(word for sentence in tokenized_sentences for word in sentence))
word_to_index = {word: i for i, word in enumerate(vocab)}

# Convert the tokenized sentences to index sequences, zero-padded to max_length
def sentences_to_indices(sentences, word_to_index, max_length):
    indices = np.zeros((len(sentences), max_length))
    for i, sentence in enumerate(sentences):
        for j, word in enumerate(sentence.split()[:max_length]):
            indices[i, j] = word_to_index.get(word, 0)
    return indices

max_length = 50  # Maximum number of words per sentence

# Convert the dataset to index arrays
def pair_list_to_x_y_onehot(data, word_to_index, max_length):
    X1 = sentences_to_indices([' '.join(s1) for s1, _, _ in data], word_to_index, max_length)
    X2 = sentences_to_indices([' '.join(s2) for _, s2, _ in data], word_to_index, max_length)
    y = np.array([label for _, _, label in data])
    return (X1, X2), y

(x_train_1_onehot, x_train_2_onehot), y_train = pair_list_to_x_y_onehot(train_data, word_to_index, max_length)
(x_val_1_onehot, x_val_2_onehot), y_val = pair_list_to_x_y_onehot(val_data, word_to_index, max_length)
(x_test_1_onehot, x_test_2_onehot), y_test = pair_list_to_x_y_onehot(test_data, word_to_index, max_length)

# Check the shapes of the data
print(f"x_train_1_onehot shape: {x_train_1_onehot.shape}, x_train_2_onehot shape: {x_train_2_onehot.shape}, y_train shape: {y_train.shape}")


x_train_1_onehot shape: (2073, 50), x_train_2_onehot shape: (2073, 50), y_train shape: (2073,)
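
As a small standalone illustration of the index encoding, the sketch below maps a token list to a fixed-length list of indices. It assumes index 0 is reserved for padding and unknown words; the toy vocabulary and the helper name `encode_indices` are hypothetical.

```python
def encode_indices(tokens, word_to_index, max_length):
    """Map tokens to indices, truncating or zero-padding to max_length."""
    ids = [word_to_index.get(tok, 0) for tok in tokens[:max_length]]
    return ids + [0] * (max_length - len(ids))

# Toy vocabulary: index 0 is kept for padding / unknown words (assumption).
toy_index = {"<pad>": 0, "el": 1, "gat": 2, "dorm": 3}

print(encode_indices(["el", "gat", "dorm"], toy_index, 5))        # [1, 2, 3, 0, 0]
print(encode_indices(["el", "gos", "dorm", "el"], toy_index, 3))  # [1, 0, 3]
```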


### Building the model with one-hot encoding

In [70]:
# Define the similarity regression model with one-hot encoding and the specified structure
def build_and_compile_model_onehot(vocab_size, max_length, embedding_size=300, learning_rate=1e-3):
    # Input layers for the pair of index sequences
    input_1 = tf.keras.Input(shape=(max_length,))
    input_2 = tf.keras.Input(shape=(max_length,))

    # One-hot encoding via a frozen embedding layer.
    # NOTE: with the default (random) initializer this layer produces fixed random vectors,
    # not exact one-hot vectors; an identity initializer
    # (embeddings_initializer=tf.keras.initializers.Identity()) would be needed for that.
    one_hot_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=vocab_size, input_length=max_length, trainable=False)
    
    # Get the encoded representations
    encoded_1 = one_hot_layer(input_1)
    encoded_2 = one_hot_layer(input_2)

    # Flatten the inputs to match the expected structure
    reshaped_1 = tf.keras.layers.Flatten()(encoded_1)
    reshaped_2 = tf.keras.layers.Flatten()(encoded_2)

    # Projection through a shared dense layer
    first_projection = tf.keras.layers.Dense(
        embedding_size,
        kernel_initializer=tf.keras.initializers.Identity(),
        bias_initializer=tf.keras.initializers.Zeros(),
    )
    projected_1 = first_projection(reshaped_1)
    projected_2 = first_projection(reshaped_2)

    # Compute the cosine similarity with a Lambda layer, rescaled from [-1, 1] to [0, 5]
    def cosine_distance(x):
        x1, x2 = x
        x1_normalized = tf.keras.backend.l2_normalize(x1, axis=1)
        x2_normalized = tf.keras.backend.l2_normalize(x2, axis=1)
        return 2.5 * (1.0 + tf.reduce_sum(x1_normalized * x2_normalized, axis=1))

    output = tf.keras.layers.Lambda(cosine_distance)([projected_1, projected_2])
    
    # Define the model
    model = tf.keras.Model(inputs=[input_1, input_2], outputs=output)

    # Compile the model
    model.compile(loss='mean_squared_error', optimizer=tf.keras.optimizers.Adamax(learning_rate))
    return model

# Build and compile the model
vocab_size = len(vocab)
model_onehot = build_and_compile_model_onehot(vocab_size, max_length)

# Train the model
model_onehot.fit([x_train_1_onehot, x_train_2_onehot], y_train, epochs=10, batch_size=32)

# Evaluate the model
print(f"Correlación de Pearson (train): {compute_pearson([x_train_1_onehot, x_train_2_onehot], y_train, model_onehot)}")
print(f"Correlación de Pearson (validation): {compute_pearson([x_val_1_onehot, x_val_2_onehot], y_val, model_onehot)}")
print(f"Correlación de Pearson (test): {compute_pearson([x_test_1_onehot, x_test_2_onehot], y_test, model_onehot)}")





Epoch 1/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 148s 2s/step - loss: 6.4457
Epoch 2/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 135s 2s/step - loss: 6.6760
Epoch 3/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 135s 2s/step - loss: 6.6350
Epoch 4/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 134s 2s/step - loss: 6.6170
Epoch 5/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 134s 2s/step - loss: 6.5776
Epoch 6/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 132s 2s/step - loss: 6.4582
Epoch 7/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 126s 2s/step - loss: 6.4443
Epoch 8/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 120s 2s/step - loss: 6.4575
Epoch 9/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 137s 2s/step - loss: 6.2641
Epoch 10/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 126s 2s/step - loss: 5.988

## Word2vec

### Generating embeddings (example with Word2Vec):

TODO: run these experiments for each of the Word2Vec models we pretrained, and look into the mean and weighted-mean sentence embeddings (it is not yet clear to us what these are).
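
One common reading of the mean and weighted mean mentioned in the note above is as two ways of collapsing the word vectors of a sentence into a single sentence vector: a plain average, or an average where each word has a weight (for example a TF-IDF score). The sketch below illustrates both on toy data; this is our interpretation, and the helper names and weights are hypothetical.

```python
def mean_embedding(vectors):
    """Plain average of a list of equal-length word vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def weighted_mean_embedding(vectors, weights):
    """Average of word vectors where each word has its own weight."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, col)) / total
        for col in zip(*vectors)
    ]

word_vectors = [[1.0, 2.0], [3.0, 4.0]]  # toy 2-dimensional word vectors
print(mean_embedding(word_vectors))                        # [2.0, 3.0]
print(weighted_mean_embedding(word_vectors, [1.0, 3.0]))   # [2.5, 3.5]
```

With either variant each sentence becomes a single vector of size vector_size, instead of a flattened max_length x vector_size matrix.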

In [71]:
# Load the pretrained Word2Vec model
word2vec_model = Word2Vec.load('word2vec_model_part_1.model')  # Change this path for each trained part
vector_size = word2vec_model.vector_size
max_length = 50  # Adjust as needed

# Encode a token list as a (max_length, vector_size) matrix; out-of-vocabulary tokens and padding stay as zeros
def word2vec_encode(tokens, model, max_length):
    word2vec_vector = np.zeros((max_length, model.vector_size))
    for i, token in enumerate(tokens):
        if i >= max_length:
            break
        if token in model.wv:
            word2vec_vector[i] = model.wv[token]
    return word2vec_vector
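
The same padding and truncation scheme can be illustrated without gensim or NumPy. In this sketch a plain dictionary stands in for the trained word vectors; `encode_tokens` and `toy_vectors` are hypothetical names.

```python
def encode_tokens(tokens, vectors, dim, max_length):
    """Return max_length rows of size dim: one per token, zeros for OOV and padding."""
    rows = [list(vectors.get(tok, [0.0] * dim)) for tok in tokens[:max_length]]
    rows += [[0.0] * dim for _ in range(max_length - len(rows))]
    return rows

toy_vectors = {"gat": [0.1, 0.2], "dorm": [0.3, 0.4]}  # stand-in for model.wv

matrix = encode_tokens(["el", "gat", "dorm"], toy_vectors, dim=2, max_length=4)
print(matrix)
# [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
```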

### Training and evaluating the model:

In [20]:
!pip install tensorflow



In [72]:
# Convert the dataset to Word2Vec vectors
def pair_list_to_x_y(data):
    X1 = np.array([word2vec_encode(s1, word2vec_model, max_length) for s1, _, _ in data])
    X2 = np.array([word2vec_encode(s2, word2vec_model, max_length) for _, s2, _ in data])
    y = np.array([label for _, _, label in data])
    return (X1, X2), y

In [73]:
(x_train_1, x_train_2), y_train = pair_list_to_x_y(train_data)
(x_val_1, x_val_2), y_val = pair_list_to_x_y(val_data)
(x_test_1, x_test_2), y_test = pair_list_to_x_y(test_data)

In [74]:
# Check the shapes of the data
print(f"x_train_1 shape: {x_train_1.shape}, x_train_2 shape: {x_train_2.shape}, y_train shape: {y_train.shape}")
print(f"x_val_1 shape: {x_val_1.shape}, x_val_2 shape: {x_val_2.shape}, y_val shape: {y_val.shape}")
print(f"x_test_1 shape: {x_test_1.shape}, x_test_2 shape: {x_test_2.shape}, y_test shape: {y_test.shape}")


x_train_1 shape: (2073, 50, 100), x_train_2 shape: (2073, 50, 100), y_train shape: (2073,)
x_val_1 shape: (500, 50, 100), x_val_2 shape: (500, 50, 100), y_val shape: (500,)
x_test_1 shape: (500, 50, 100), x_test_2 shape: (500, 50, 100), y_test shape: (500,)


In [76]:
# Define the similarity regression model
def build_and_compile_model(embedding_size: int = 300, learning_rate: float = 1e-3) -> tf.keras.Model:
    # Input layers for the pair of vectors
    input_1 = tf.keras.Input(shape=(embedding_size,))
    input_2 = tf.keras.Input(shape=(embedding_size,))

    # Hidden layer
    first_projection = tf.keras.layers.Dense(
        embedding_size,
        kernel_initializer=tf.keras.initializers.Identity(),
        bias_initializer=tf.keras.initializers.Zeros(),
    )
    projected_1 = first_projection(input_1)
    projected_2 = first_projection(input_2)
    
    # Compute the cosine similarity, rescaled to [0, 5], using a Lambda layer
    def cosine_distance(x):
        x1, x2 = x
        x1_normalized = tf.keras.backend.l2_normalize(x1, axis=1)
        x2_normalized = tf.keras.backend.l2_normalize(x2, axis=1)
        return 2.5 * (1.0 + tf.reduce_sum(x1_normalized * x2_normalized, axis=1))

    output = tf.keras.layers.Lambda(cosine_distance)([projected_1, projected_2])
    # Define output
    model = tf.keras.Model(inputs=[input_1, input_2], outputs=output)

    # Compile the model
    model.compile(loss='mean_squared_error',
                  optimizer=tf.keras.optimizers.Adamax(learning_rate))
    return model

# Build and compile the model
model = build_and_compile_model(vector_size * max_length)

# Flatten the data to match the model's input shape
X_train_1_flattened = x_train_1.reshape((x_train_1.shape[0], -1))
X_train_2_flattened = x_train_2.reshape((x_train_2.shape[0], -1))
X_val_1_flattened = x_val_1.reshape((x_val_1.shape[0], -1))
X_val_2_flattened = x_val_2.reshape((x_val_2.shape[0], -1))
X_test_1_flattened = x_test_1.reshape((x_test_1.shape[0], -1))
X_test_2_flattened = x_test_2.reshape((x_test_2.shape[0], -1))

# Train the model
model.fit([X_train_1_flattened, X_train_2_flattened], y_train, epochs=10, batch_size=32)

Epoch 1/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 17s 247ms/step - loss: 1.1840
Epoch 2/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 235ms/step - loss: 0.5329
Epoch 3/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 236ms/step - loss: 0.3010
Epoch 4/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 238ms/step - loss: 0.2043
Epoch 5/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 236ms/step - loss: 0.1629
Epoch 6/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 234ms/step - loss: 0.1227
Epoch 7/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 237ms/step - loss: 0.1242
Epoch 8/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 15s 235ms/step - loss: 0.0970
Epoch 9/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 238ms/step - loss: 0.0779
Epoch 10/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 16s 238ms

<keras.src.callbacks.history.History at 0x271fbc3e850>

In [77]:
# Evaluate the model
print(f"Correlación de Pearson (train): {compute_pearson([X_train_1_flattened, X_train_2_flattened], y_train, model)}")
print(f"Correlación de Pearson (validation): {compute_pearson([X_val_1_flattened, X_val_2_flattened], y_val, model)}")
print(f"Correlación de Pearson (test): {compute_pearson([X_test_1_flattened, X_test_2_flattened], y_test, model)}")

65/65 ━━━━━━━━━━━━━━━━━━━━ 2s 34ms/step
y_pred shape: (2073,), y_ shape: (2073,)
Correlación de Pearson (train): 0.9606234184497318
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step
y_pred shape: (500,), y_ shape: (500,)
Correlación de Pearson (validation): 0.24717394536854015
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 36ms/step
y_pred shape: (500,), y_ shape: (500,)
Correlación de Pearson (test): 0.3355597128657856


## Spacy

In [46]:
!python -m spacy download ca_core_news_md



In [50]:
import spacy
import numpy as np
import tensorflow as tf
from scipy.stats import pearsonr


# Load the spaCy model
nlp = spacy.load("ca_core_news_md")
vector_size = nlp.vocab.vectors_length
max_length = 50  # Adjust as needed

In [51]:
# Encoding function for spaCy: one token vector per row, zero-padded to max_length
def spacy_encode(sentence, nlp, max_length):
    spacy_vector = np.zeros((max_length, nlp.vocab.vectors_length))
    doc = nlp(sentence)
    for i, token in enumerate(doc):
        if i >= max_length:
            break
        spacy_vector[i] = token.vector
    return spacy_vector

In [52]:
# Convert the dataset to spaCy vectors
def pair_list_to_x_y_spacy(data, nlp, max_length):
    X1 = np.array([spacy_encode(" ".join(s1), nlp, max_length) for s1, _, _ in data])
    X2 = np.array([spacy_encode(" ".join(s2), nlp, max_length) for _, s2, _ in data])
    y = np.array([label for _, _, label in data])
    return (X1, X2), y

# Convert the data
(x_train_1_spacy, x_train_2_spacy), y_train = pair_list_to_x_y_spacy(train_data, nlp, max_length)
(x_val_1_spacy, x_val_2_spacy), y_val = pair_list_to_x_y_spacy(val_data, nlp, max_length)
(x_test_1_spacy, x_test_2_spacy), y_test = pair_list_to_x_y_spacy(test_data, nlp, max_length)

# Check the shapes of the data
print(f"x_train_1_spacy shape: {x_train_1_spacy.shape}, x_train_2_spacy shape: {x_train_2_spacy.shape}, y_train shape: {y_train.shape}")
print(f"x_val_1_spacy shape: {x_val_1_spacy.shape}, x_val_2_spacy shape: {x_val_2_spacy.shape}, y_val shape: {y_val.shape}")
print(f"x_test_1_spacy shape: {x_test_1_spacy.shape}, x_test_2_spacy shape: {x_test_2_spacy.shape}, y_test shape: {y_test.shape}")


x_train_1_spacy shape: (2073, 50, 300), x_train_2_spacy shape: (2073, 50, 300), y_train shape: (2073,)
x_val_1_spacy shape: (500, 50, 300), x_val_2_spacy shape: (500, 50, 300), y_val shape: (500,)
x_test_1_spacy shape: (500, 50, 300), x_test_2_spacy shape: (500, 50, 300), y_test shape: (500,)


In [53]:
# Define the similarity regression model
def build_and_compile_model(input_length, vector_size, hidden_size=64):
    input_1 = tf.keras.Input(shape=(input_length, vector_size))
    input_2 = tf.keras.Input(shape=(input_length, vector_size))
    
    concatenated = tf.keras.layers.Concatenate(axis=1)([input_1, input_2])
    flatten = tf.keras.layers.Flatten()(concatenated)  # Flatten the concatenated input
    hidden = tf.keras.layers.Dense(hidden_size, activation='relu')(flatten)
    output = tf.keras.layers.Dense(1)(hidden)
    
    model = tf.keras.Model(inputs=[input_1, input_2], outputs=output)
    model.compile(loss='mean_absolute_error', optimizer='adam')
    return model

# Build and compile the model
model_spacy = build_and_compile_model(max_length, vector_size)

# Train the model
model_spacy.fit([x_train_1_spacy, x_train_2_spacy], y_train, epochs=10, batch_size=32)


Epoch 1/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 3s 24ms/step - loss: 3.8293
Epoch 2/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - loss: 0.9561
Epoch 3/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - loss: 0.7354
Epoch 4/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 0.5895
Epoch 5/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - loss: 0.5467
Epoch 6/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - loss: 0.4924
Epoch 7/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 2s 23ms/step - loss: 0.4697
Epoch 8/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - loss: 0.4675
Epoch 9/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 2s 22ms/step - loss: 0.4162
Epoch 10/10
65/65 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - loss: 0.3902

<keras.src.callbacks.history.History at 0x2735e5a3210>

In [54]:
# Evaluate the model
print(f"Correlación de Pearson (train): {compute_pearson([x_train_1_spacy, x_train_2_spacy], y_train, model_spacy)}")
print(f"Correlación de Pearson (validation): {compute_pearson([x_val_1_spacy, x_val_2_spacy], y_val, model_spacy)}")
print(f"Correlación de Pearson (test): {compute_pearson([x_test_1_spacy, x_test_2_spacy], y_test, model_spacy)}")


65/65 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step
y_pred shape: (2073, 1), y_ shape: (2073,)
Correlación de Pearson (train): 0.7598710620657781
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
y_pred shape: (500, 1), y_ shape: (500,)
Correlación de Pearson (validation): 0.12990470049673508
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step
y_pred shape: (500, 1), y_ shape: (500,)
Correlación de Pearson (test): 0.09486479030228027


## RoBERTa

In [56]:
!pip install tf-keras



In [57]:
import re
from transformers import RobertaTokenizer, TFRobertaModel
import numpy as np
import tensorflow as tf




In [58]:
# Load the RoBERTa model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = TFRobertaModel.from_pretrained('roberta-base')

max_length = 50  # Adjust as needed







Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['roberta.embeddings.position_ids', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaModel were not initialized from the PyTorch model and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infe

In [68]:
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint
tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-base-ca-v2')
model.eval()
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = f"Em dic <mask>."
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])




[' Jordi', ' Joan', ' Núria', ' Albert', ' David']


In [61]:
# Encoding function for RoBERTa: returns the token ids and the attention masks
def roberta_encode(texts, tokenizer, max_length):
    encodings = tokenizer(texts, truncation=True, padding='max_length', max_length=max_length, return_tensors='np')
    return encodings['input_ids'], encodings['attention_mask']
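
With padding='max_length' and truncation=True, every sentence comes back as a fixed-length list of token ids plus an attention mask that marks real tokens with 1 and padding with 0. The sketch below mimics that behaviour on plain lists. It is a simplification (a real tokenizer also keeps the closing special token when truncating), and both the helper name `pad_and_mask` and the default pad id of 1 are assumptions for illustration.

```python
def pad_and_mask(token_ids, max_length, pad_id=1):
    """Truncate or pad token ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

ids, mask = pad_and_mask([0, 42, 2], max_length=5)
print(ids)   # [0, 42, 2, 1, 1]
print(mask)  # [1, 1, 1, 0, 0]
```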

In [62]:
# Convert the dataset to RoBERTa inputs (token ids and attention masks)
def pair_list_to_x_y(data, tokenizer, max_length):
    input_ids_1, attention_masks_1 = roberta_encode([" ".join(s1) for s1, _, _ in data], tokenizer, max_length)
    input_ids_2, attention_masks_2 = roberta_encode([" ".join(s2) for _, s2, _ in data], tokenizer, max_length)
    y = np.array([label for _, _, label in data])
    return (input_ids_1, attention_masks_1), (input_ids_2, attention_masks_2), y

(x_train_1, mask_train_1), (x_train_2, mask_train_2), y_train = pair_list_to_x_y(train_data, tokenizer, max_length)
(x_val_1, mask_val_1), (x_val_2, mask_val_2), y_val = pair_list_to_x_y(val_data, tokenizer, max_length)
(x_test_1, mask_test_1), (x_test_2, mask_test_2), y_test = pair_list_to_x_y(test_data, tokenizer, max_length)

# Check the shapes of the data
print(f"x_train_1 shape: {x_train_1.shape}, mask_train_1 shape: {mask_train_1.shape}, x_train_2 shape: {x_train_2.shape}, mask_train_2 shape: {mask_train_2.shape}, y_train shape: {y_train.shape}")
print(f"x_val_1 shape: {x_val_1.shape}, mask_val_1 shape: {mask_val_1.shape}, x_val_2 shape: {x_val_2.shape}, mask_val_2 shape: {mask_val_2.shape}, y_val shape: {y_val.shape}")
print(f"x_test_1 shape: {x_test_1.shape}, mask_test_1 shape: {mask_test_1.shape}, x_test_2 shape: {x_test_2.shape}, mask_test_2 shape: {mask_test_2.shape}, y_test shape: {y_test.shape}")


x_train_1 shape: (2073, 50), mask_train_1 shape: (2073, 50), x_train_2 shape: (2073, 50), mask_train_2 shape: (2073, 50), y_train shape: (2073,)
x_val_1 shape: (500, 50), mask_val_1 shape: (500, 50), x_val_2 shape: (500, 50), mask_val_2 shape: (500, 50), y_val shape: (500,)
x_test_1 shape: (500, 50), mask_test_1 shape: (500, 50), x_test_2 shape: (500, 50), mask_test_2 shape: (500, 50), y_test shape: (500,)


In [67]:
# Define the similarity regression model
def build_and_compile_model(max_length, roberta_model, hidden_size=64):
    input_ids_1 = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name='input_ids_1')
    attention_mask_1 = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask_1')
    input_ids_2 = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name='input_ids_2')
    attention_mask_2 = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask_2')

    roberta_output_1 = roberta_model(input_ids_1, attention_mask=attention_mask_1)
    roberta_output_2 = roberta_model(input_ids_2, attention_mask=attention_mask_2)

    # Take the representation of the first token (<s>, RoBERTa's counterpart of [CLS])
    pooled_output_1 = roberta_output_1.last_hidden_state[:, 0, :]
    pooled_output_2 = roberta_output_2.last_hidden_state[:, 0, :]

    concatenated = tf.keras.layers.Concatenate(axis=1)([pooled_output_1, pooled_output_2])
    hidden = tf.keras.layers.Dense(hidden_size, activation='relu')(concatenated)
    output = tf.keras.layers.Dense(1)(hidden)
    
    model = tf.keras.Model(inputs=[input_ids_1, attention_mask_1, input_ids_2, attention_mask_2], outputs=output)
    model.compile(loss='mean_absolute_error', optimizer='adam')
    return model

# Build and compile the model
model_roberta = build_and_compile_model(max_length, roberta_model)

# Train the model
model_roberta.fit([x_train_1, mask_train_1, x_train_2, mask_train_2], y_train, epochs=10, batch_size=32)


ValueError: Exception encountered when calling layer 'tf_roberta_model' (type TFRobertaModel).

Data of type <class 'keras.src.backend.common.keras_tensor.KerasTensor'> is not allowed only (<class 'tensorflow.python.framework.tensor.Tensor'>, <class 'bool'>, <class 'int'>, <class 'transformers.utils.generic.ModelOutput'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, <class 'numpy.ndarray'>) is accepted for attention_mask.

Call arguments received by layer 'tf_roberta_model' (type TFRobertaModel):
  • input_ids=<KerasTensor shape=(None, 50), dtype=int32, sparse=None, name=input_ids_1>
  • attention_mask=<KerasTensor shape=(None, 50), dtype=int32, sparse=None, name=attention_mask_1>
  • token_type_ids=None
  • position_ids=None
  • head_mask=None
  • inputs_embeds=None
  • encoder_hidden_states=None
  • encoder_attention_mask=None
  • past_key_values=None
  • use_cache=None
  • output_attentions=None
  • output_hidden_states=None
  • return_dict=None
  • training=False

In [None]:
print(f"Pearson correlation (train): {compute_pearson([x_train_1, mask_train_1, x_train_2, mask_train_2], y_train, model_roberta)}")
print(f"Pearson correlation (validation): {compute_pearson([x_val_1, mask_val_1, x_val_2, mask_val_2], y_val, model_roberta)}")
print(f"Pearson correlation (test): {compute_pearson([x_test_1, mask_test_1, x_test_2, mask_test_2], y_test, model_roberta)}")
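
`compute_pearson` is not defined in this part of the notebook. To make the metric itself explicit, the following is a minimal sketch of the Pearson correlation between gold labels and predictions in plain Python. The name `pearson_r` and the toy values are assumptions for illustration, not the notebook's own helper.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Covariance and variances, without the 1/n factors (they cancel out)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Toy similarity scores, invented for the example
gold = [0.0, 1.5, 3.0, 4.5]
pred = [0.2, 1.4, 3.1, 4.4]
print(round(pearson_r(gold, pred), 3))
```

The coefficient is undefined when either sequence is constant (zero variance), so a model that predicts the same score for every pair makes this sketch divide by zero.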

## One Hot

In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Build a vocabulary (sorted so that the word-to-index mapping is reproducible)
vocab = sorted({word for sentence in dataset['train']['sentence1'] + dataset['train']['sentence2'] for word in word_tokenize(sentence)})
vocab_dict = {word: i for i, word in enumerate(vocab)}

# Preprocessing function for one-hot encoding
def one_hot_encode(sentence, vocab_dict, max_length):
    tokens = word_tokenize(sentence)
    one_hot_vector = np.zeros((max_length, len(vocab_dict)))
    for i, token in enumerate(tokens):
        if i >= max_length:
            break
        if token in vocab_dict:
            one_hot_vector[i, vocab_dict[token]] = 1
    return one_hot_vector

max_length = 50  # Maximum sentence length, in tokens

# Example usage:
sentence1 = dataset['train']['sentence1'][0]
sentence2 = dataset['train']['sentence2'][0]

one_hot_vector1 = one_hot_encode(sentence1, vocab_dict, max_length)
one_hot_vector2 = one_hot_encode(sentence2, vocab_dict, max_length)


## Word2Vec

In [None]:
from gensim.models import Word2Vec

# Load the trained Word2Vec model
word2vec_model = Word2Vec.load('word2vec_model_part_1.model')  # change this to the model part we want to evaluate

# Preprocessing function for Word2Vec
def word2vec_encode(sentence, model, max_length):
    tokens = word_tokenize(sentence)
    vector_size = model.vector_size
    word2vec_vector = np.zeros((max_length, vector_size))
    for i, token in enumerate(tokens):
        if i >= max_length:
            break
        if token in model.wv:
            word2vec_vector[i] = model.wv[token]
    return word2vec_vector

# Example usage:
word2vec_vector1 = word2vec_encode(sentence1, word2vec_model, max_length)
word2vec_vector2 = word2vec_encode(sentence2, word2vec_model, max_length)


## SpaCy

In [None]:
import spacy

# Load the spaCy model
nlp = spacy.load('ca_core_news_md')

# Preprocessing function for spaCy
def spacy_encode(sentence, nlp, max_length):
    doc = nlp(sentence)
    vector_size = len(doc.vector)
    spacy_vector = np.zeros((max_length, vector_size))
    for i, token in enumerate(doc):
        if i >= max_length:
            break
        spacy_vector[i] = token.vector
    return spacy_vector

# Example usage:
spacy_vector1 = spacy_encode(sentence1, nlp, max_length)
spacy_vector2 = spacy_encode(sentence2, nlp, max_length)


## Using the Embeddings in the Similarity Model


In [None]:
import tensorflow as tf

# Define the similarity regression model
def build_and_compile_model(input_length, vector_size, hidden_size=64):
    input_1 = tf.keras.Input(shape=(input_length, vector_size))
    input_2 = tf.keras.Input(shape=(input_length, vector_size))

    # Average over the token axis so that each sentence becomes a single vector;
    # without this step the Dense layers would emit one prediction per token position
    pooling = tf.keras.layers.GlobalAveragePooling1D()
    pooled_1, pooled_2 = pooling(input_1), pooling(input_2)

    concatenated = tf.keras.layers.Concatenate(axis=-1)([pooled_1, pooled_2])
    hidden = tf.keras.layers.Dense(hidden_size, activation='relu')(concatenated)
    output = tf.keras.layers.Dense(1)(hidden)

    model = tf.keras.Model(inputs=[input_1, input_2], outputs=output)
    model.compile(loss='mean_absolute_error', optimizer='adam')
    return model

# Build and compile the model
vector_size = word2vec_model.vector_size  # Change this to match the embedding model in use
model = build_and_compile_model(max_length, vector_size)

# Training example
sentence_pairs = [(s1, s2) for s1, s2 in zip(dataset['train']['sentence1'], dataset['train']['sentence2'])]
labels = dataset['train']['label']

# Convert the sentences to Word2Vec vectors (swap this function to use another embedding method)
X1 = np.array([word2vec_encode(s1, word2vec_model, max_length) for s1, s2 in sentence_pairs])
X2 = np.array([word2vec_encode(s2, word2vec_model, max_length) for s1, s2 in sentence_pairs])
y = np.array(labels)

# Train the model
model.fit([X1, X2], y, epochs=10, batch_size=32)


Each sentence should be represented by a single vector.
TF-IDF can be used to discard words that carry no information.
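
One possible reading of this note is a TF-IDF-weighted average of the word vectors, so that words appearing in every sentence contribute nothing. The sketch below uses plain Python; the helpers `idf_weights` and `sentence_vector`, the toy corpus and the two-dimensional vectors are all assumptions for illustration. In the notebook, `word_vectors` would presumably be something like `word2vec_model.wv`.

```python
import math
from collections import Counter

def idf_weights(tokenized_corpus):
    """Inverse document frequency: rarer words get larger weights."""
    n_docs = len(tokenized_corpus)
    doc_freq = Counter(tok for doc in tokenized_corpus for tok in set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def sentence_vector(tokens, word_vectors, idf, dim):
    """TF-IDF-weighted average of the word vectors of one sentence."""
    counts = Counter(tok for tok in tokens if tok in word_vectors)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, tf in counts.items():
        w = tf * idf.get(tok, 0.0)  # words unseen in the corpus get weight 0
        weight_sum += w
        for k in range(dim):
            total[k] += w * word_vectors[tok][k]
    if weight_sum == 0.0:
        return total  # no informative word: return the zero vector
    return [x / weight_sum for x in total]

# Toy corpus and toy 2-dimensional vectors, invented for the example
corpus = [["el", "gat", "dorm"], ["el", "gos", "corre"], ["el", "gat", "corre"]]
toy_vectors = {"el": [1.0, 1.0], "gat": [1.0, 0.0], "gos": [0.0, 1.0],
               "dorm": [0.5, 0.5], "corre": [0.0, 0.0]}
idf = idf_weights(corpus)
print(idf["el"])  # 0.0, because "el" occurs in every sentence
print(sentence_vector(corpus[0], toy_vectors, idf, dim=2))
```

With this weighting the output has a fixed size of `dim` regardless of sentence length, so the two sentence vectors can be concatenated and fed to the dense layers directly.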

In [None]:
import tensorflow as tf
def build_and_compile_model(hidden_size: int = 64) -> tf.keras.Model:
  model = tf.keras.Sequential([
      tf.keras.layers.Concatenate(axis=-1, ),
      tf.keras.layers.Dense(hidden_size, activation='relu'),
      tf.keras.layers.Dense(1)
  ])
  model.compile(loss='mean_absolute_error',
                optimizer=tf.keras.optimizers.Adam(0.001))
  return model
m = build_and_compile_model()
# E.g.
import numpy as np
y = m((np.ones((1, 100)), np.ones((1,100)), ), )

The first 10 (the default value of `input_length`) must be replaced with the maximum length of the input vector.

In [None]:
import tensorflow as tf
def build_and_compile_model(
        input_length: int = 10, hidden_size: int = 64, dictionary_size: int = 1000, embedding_size: int = 16,
) -> tf.keras.Model:
    input_1, input_2 = tf.keras.Input((input_length, ), dtype=tf.int32, ), tf.keras.Input((input_length, ), dtype=tf.int32, )
    # Define Layers
    embedding = tf.keras.layers.Embedding(
        dictionary_size, embedding_size, input_length=input_length, mask_zero=True, )
    pooling = tf.keras.layers.GlobalAveragePooling1D()
    concatenate = tf.keras.layers.Concatenate(axis=-1, )
    hidden = tf.keras.layers.Dense(hidden_size, activation='relu')
    output = tf.keras.layers.Dense(1)
    # Pass through the layers
    _input_mask_1, _input_mask_2 = tf.not_equal(input_1, 0), tf.not_equal(input_2, 0)
    _embedded_1, _embedded_2 = embedding(input_1, ), embedding(input_2, )
    _pooled_1, _pooled_2 = pooling(_embedded_1, mask=_input_mask_1), pooling(_embedded_2, mask=_input_mask_2)
    _concatenated = concatenate((_pooled_1, _pooled_2, ))
    _hidden_output = hidden(_concatenated)
    _output = output(_hidden_output)
    # Define the model
    model = tf.keras.Model(inputs=(input_1, input_2, ), outputs=_output, )
    model.compile(loss='mean_absolute_error',
                optimizer=tf.keras.optimizers.Adam(0.001))
    return model
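
The model above expects each sentence as a fixed-length sequence of integer ids, with 0 acting as padding because of `mask_zero=True`. A minimal sketch of how such inputs might be prepared follows. The helpers `build_index` and `encode`, the toy sentences and the choice of id 1 for unknown words are assumptions for this example.

```python
def build_index(tokenized_sentences):
    """Map each word to an integer id; 0 is reserved for padding, 1 for unknown words."""
    index = {}
    for sentence in tokenized_sentences:
        for tok in sentence:
            if tok not in index:
                index[tok] = len(index) + 2  # ids start at 2
    return index

def encode(tokens, index, input_length):
    """Convert tokens to ids, truncating or right-padding with 0 to input_length."""
    ids = [index.get(tok, 1) for tok in tokens[:input_length]]
    return ids + [0] * (input_length - len(ids))

# Toy training sentences, invented for the example
train = [["el", "gat", "dorm"], ["el", "gos", "corre"]]
index = build_index(train)
print(encode(["el", "gat", "vola"], index, input_length=5))  # [2, 3, 1, 0, 0]

# The embedding table needs one row per id, including padding and unknown
dictionary_size = len(index) + 2
print(dictionary_size)  # 7
```

Under these assumptions, `dictionary_size` and `input_length` would be passed to `build_and_compile_model` in place of its defaults, and the encoded sentence pairs would be stacked into two integer arrays of shape `(n_pairs, input_length)`.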