Predicting the tag of a Stack Overflow question
===

* Last modified: March 1, 2022

Importing libraries
---

In [1]:
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf

Problem definition
----

In the dataset used here, each question is labeled with exactly one of the following tags: Python, CSharp, JavaScript, or Java. The task is to predict the tag from the text of the question.

Downloading the data
---

In [2]:
import pathlib

data_url = "https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz"

dataset_dir = tf.keras.utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir="stack_overflow",
    cache_subdir="/tmp/stackoverflow",
)

dataset_dir = pathlib.Path(dataset_dir).parent

list(dataset_dir.iterdir())

[PosixPath('/tmp/stackoverflow/stack_overflow_16k.tar.gz'),
 PosixPath('/tmp/stackoverflow/train'),
 PosixPath('/tmp/stackoverflow/README.md'),
 PosixPath('/tmp/stackoverflow/test')]

In [3]:
train_dir = dataset_dir / "train"
list(train_dir.iterdir())

[PosixPath('/tmp/stackoverflow/train/java'),
 PosixPath('/tmp/stackoverflow/train/javascript'),
 PosixPath('/tmp/stackoverflow/train/csharp'),
 PosixPath('/tmp/stackoverflow/train/python')]

In [4]:
#
# Example of a question
#
sample_file = train_dir / "python/1755.txt"

with open(sample_file) as f:
    print(f.read())

why does this blank program print true x=true.def stupid():.    x=false.stupid().print x



Structure of the data directory
---

```
train/
...csharp/ 
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
```
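
The layout above is what lets a directory-based loader infer the labels: each subdirectory is one class, and class indices follow the sorted subdirectory names. The sketch below imitates that idea with the standard library only. The helper `infer_labels` and the file contents are hypothetical, and this is not how Keras implements it internally.

```python
import pathlib
import tempfile


def infer_labels(root):
    """Return (class_names, samples), one class per subdirectory, sorted by name."""
    root = pathlib.Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for index, name in enumerate(class_names):
        for path in sorted((root / name).glob("*.txt")):
            samples.append((path.read_text(), index))
    return class_names, samples


# Build a tiny stand-in for train/ with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["python", "csharp", "javascript", "java"]:
        class_dir = pathlib.Path(tmp) / name
        class_dir.mkdir()
        (class_dir / "1.txt").write_text(f"a question about {name}")
    class_names, samples = infer_labels(tmp)

print(class_names)
for text, label in samples:
    print(label, text)
```

Sorting the names is what makes the integer labels reproducible across runs, whatever order the file system lists the directories in.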

General parameters for loading the data
---

In [5]:
#
# General parameters for loading the data
#
params = {
    "directory": train_dir,
    "batch_size": 32,
    "seed": 12345,
    "validation_split": 0.2,
}

Loading the training set
---

In [6]:
#
# Load the training set
#
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    **params,
    subset="training",
)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


Example questions
---

In [7]:
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print("Question: ", text_batch.numpy()[i])
        print("Label:", label_batch.numpy()[i])
        print()

Question:  b'"stopping rotating images when popup dialog box is active i am using blank functions to rotate images in an orbital manner. i also have created a popup dialog box as well. what i am trying to do is have it so that when the dialog box is active, the images stop even if i were to mouseout from the image. ..here is my blank functions i have made:..var popupstatus = 0;.var timer = null;.var m = {.z   : 100,.xm  : 0,.xmm : .25,.ymm : 0,.ym  : 0,.mx  : 0,.nx  : 0,.ny  : 0,.nw  : 0,.nh  : 0,.xr  : 0,.ni  : 0,.scr : 0,.img : 0,...run : function () {.    m.xm += (m.xmm - m.xm) * .1;.    if (m.ym &lt; m.nw * .15) m.ym++;.    m.xr += m.xm;.    for (var i = 0; i &lt; m.ni; i++){.        var a = (i * 360 / m.ni) + m.xr;.        var x = math.cos(a * (math.pi / 180));.        var y = math.sin(a * (math.pi / 180));.        var a = m.img[i];.        a.style.width  = \'\'.concat(math.round(math.abs(y * m.ym) + y * m.z), \'px\');.        a.style.left   = \'\'.concat(math.round((m.nw * .5) + 

Labels in the training set
---

In [8]:
for i, label in enumerate(raw_train_ds.class_names):
    print("Label", i, "corresponds to", label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python


Validation set
---

In [9]:
# Create a validation set.
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    **params,
    subset="validation",
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
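
The file counts reported by the two loading calls follow directly from `validation_split=0.2` and `batch_size=32`. The arithmetic can be sketched as below. The helper names are hypothetical, and the rounding rule (truncating the validation count) is an assumption that happens to be irrelevant for these numbers.

```python
import math


def split_sizes(n_files, validation_split):
    """Return (n_train, n_val) for a fractional hold-out split."""
    n_val = int(n_files * validation_split)
    return n_files - n_val, n_val


def n_batches(n_samples, batch_size):
    """Number of batches when the last, possibly smaller, batch is kept."""
    return math.ceil(n_samples / batch_size)


n_train, n_val = split_sizes(8000, 0.2)
print(n_train, n_val)                                # 6400 1600
print(n_batches(n_train, 32), n_batches(n_val, 32))  # 200 50
```

Both calls reuse the same `seed` from `params`. As far as the Keras documentation describes it, this is what keeps the training and validation subsets from overlapping.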


Test set
---

In [10]:
test_dir = dataset_dir / "test"

raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    test_dir, batch_size=32,
)

Found 8000 files belonging to 4 classes.


Preparing the text
---

In [11]:
#
# Drop the label attached to each question, keeping only the text
#
train_text = raw_train_ds.map(lambda text, labels: text)

Model with binarized text (whether or not each word occurs in the text)
---

In [12]:
VOCAB_SIZE = 10000

binary_vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode="binary",
)

binary_vectorize_layer.adapt(train_text)

def binary_vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return binary_vectorize_layer(text), label

**Example of binarized text**


In [13]:
#
# Load the first batch of 32 elements
#
text_batch, label_batch = next(iter(raw_train_ds))

#
# Extract the first question and its label
#
first_question, first_label = text_batch[0], label_batch[0]

#
# Example
#
print("Question", first_question)
print("Label", first_label)
print(
    "'binary' vectorized question:",
    binary_vectorize_text(first_question, first_label)[0],
)

Question tf.Tensor(b"are there constants for the request types in blank i am using the blank.net.httpurlconnection object to set the request method. i am about to construct an enum to handle the possible values but it seems silly this isn't already done. am i missing something? is there an enum somewhere with all of these values?..update..the content type can be handled with a jax-rs class see this.\n", shape=(), dtype=string)
Label tf.Tensor(1, shape=(), dtype=int32)
'binary' vectorized question: tf.Tensor([[1. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)
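
The vector printed above has one position per vocabulary entry, set to 1 when that token occurs in the question. A minimal pure-Python sketch of the same idea follows. The helpers `build_vocabulary` and `multi_hot` are hypothetical, tokenization is simplified to lowercasing and whitespace splitting (the Keras layer also strips punctuation by default), and index 0 is assumed to be the out-of-vocabulary slot.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Index 0 is the out-of-vocabulary token; the rest are the most frequent tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 1)]
    return ["[UNK]"] + most_common


def multi_hot(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries occur in the text."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for token in text.lower().split():
        vector[index.get(token, 0)] = 1.0  # unknown tokens fall into slot 0
    return vector


corpus = ["how to sort a list", "how to read a file", "sort a dict"]
vocabulary = build_vocabulary(corpus, max_tokens=6)
vector = multi_hot("how to sort a tuple", vocabulary)

print(vocabulary)
print(vector)
```

Word order and repetition are discarded in this representation, so a single dense layer on top of it behaves like a bag-of-words linear classifier.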


**Preparing the datasets**

In [14]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

**Configuring for performance**

In [15]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [16]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

**Model**

In [17]:
binary_model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(4),
    ]
)

binary_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=["accuracy"],
)

history = binary_model.fit(binary_train_ds, validation_data=binary_val_ds, epochs=10,)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
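
The model above ends in a plain `Dense(4)` layer, so it outputs one raw score (logit) per class, and the loss is told so through `from_logits=True`. For a single example, the quantity being minimized can be sketched as below. The function is an illustrative re-derivation, not the Keras implementation, and the logits are made up.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Negative log-probability of the true class under a softmax over the logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


uniform_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0, 0.0, 0.0], 2)
confident_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0, 5.0, 0.0], 2)

print(uniform_loss)    # equals log(4): the model has no preference among 4 classes
print(confident_loss)  # smaller, because the true class has the largest logit
```

"Sparse" refers to the label being a class index such as `2` rather than a one-hot vector.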


Model with text as a sequence of integers
---

In [18]:
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_mode="int", output_sequence_length=MAX_SEQUENCE_LENGTH,
)

int_vectorize_layer.adapt(train_text)

def int_vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return int_vectorize_layer(text), label

**Example**

In [19]:
print("'int' vectorized question:", int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[  61   68 3725   12    2  555  545    7   16    3   36   47    2    1
    57    4   99    2  555   64    3   36  199    4 2630   31  916    4
   740    2  204  131   26   11  310 2468   13  547  346  402   36    3
   439  147    6   68   31  916 2152   23   73    9  227    1  425  116
    34   33 3438   23    5    1   28  189   13    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0

In [20]:
print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1289 --->  documentation
313 --->  go
Vocabulary size: 10000
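
In `int` mode each token is replaced by its vocabulary index, and the sequence is truncated or zero-padded to a fixed length. The sketch below reproduces that behavior in pure Python under simplifying assumptions: whitespace tokenization, index 0 for padding, index 1 for out-of-vocabulary tokens, and a small hypothetical vocabulary.

```python
def int_vectorize(text, vocabulary, sequence_length):
    """Map tokens to vocabulary indices, then truncate or zero-pad to a fixed length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.lower().split()]  # 1 = unknown token
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))  # 0 = padding


# Hypothetical vocabulary: slot 0 is padding, slot 1 is the unknown token.
vocabulary = ["", "[UNK]", "the", "i", "to", "a"]

padded = int_vectorize("I want to parse the file", vocabulary, 8)
truncated = int_vectorize("I want to parse the file", vocabulary, 4)

print(padded)     # [3, 1, 4, 1, 2, 1, 0, 0]
print(truncated)  # [3, 1, 4, 1]
```

Unlike the binarized representation, this one preserves word order, which is what a convolutional or recurrent model needs.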


**Preparing the datasets and configuring for performance**

In [21]:
int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

In [22]:
int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

**Model**

In [23]:
def create_model(vocab_size, num_labels):
    
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True,),
            tf.keras.layers.Conv1D(
                64, 5, padding="valid", activation="relu", strides=2,
            ),
            tf.keras.layers.GlobalMaxPooling1D(),
            tf.keras.layers.Dense(num_labels),
        ]
    )
    
    return model
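
To make the shapes inside `create_model` concrete: a `valid` convolution with kernel size 5 and stride 2 shortens the sequence, and global max pooling then collapses the remaining time axis to one value per filter. The sketch below works through both steps on plain lists. The helper names are hypothetical and the feature map is made up.

```python
def conv1d_output_length(input_length, kernel_size, stride):
    """Sequence length after a 'valid' (unpadded) 1D convolution."""
    return (input_length - kernel_size) // stride + 1


def global_max_pool(feature_map):
    """feature_map is a list of time steps, each a list of channel values.

    Returns the maximum over time for every channel.
    """
    return [max(channel) for channel in zip(*feature_map)]


# With the 250-token sequences used in this notebook:
steps = conv1d_output_length(250, kernel_size=5, stride=2)
print(steps)  # 123

# Three time steps, two channels:
pooled = global_max_pool([[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]])
print(pooled)  # [3.0, 5.0]
```

Because the pooling output depends only on the number of filters, the final dense layer has the same size whatever the sequence length.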

In [24]:
#
# `vocab_size` is `VOCAB_SIZE + 1` because index `0` is used for padding.
#
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4,)

int_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=["accuracy"],
)

history = int_model.fit(
    int_train_ds,
    validation_data=int_val_ds,
    epochs=5,
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Comparing the models
---

In [25]:
binary_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 4)                 40004     
                                                                 
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________


In [26]:
int_model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          640064    
                                                                 
 conv1d (Conv1D)             (None, None, 64)          20544     
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 4)                 260       
                                                                 
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
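
The parameter counts in the two summaries can be checked by hand from the layer sizes: a vocabulary of 10000 (plus one for the embedding), an embedding dimension of 64, 64 filters with kernel size 5, and 4 output classes. The helper functions below are hypothetical names for the standard counting formulas.

```python
def dense_params(inputs, units):
    """One weight per input-unit pair, plus one bias per unit."""
    return inputs * units + units


def embedding_params(vocab_size, dim):
    """One vector of length dim per vocabulary entry, no bias."""
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    """One kernel of shape (kernel_size, in_channels) per filter, plus one bias each."""
    return kernel_size * in_channels * filters + filters


binary_total = dense_params(10000, 4)
int_total = (
    embedding_params(10000 + 1, 64)
    + conv1d_params(5, 64, 64)
    + dense_params(64, 4)
)

print(binary_total)  # 40004
print(int_total)     # 660868
```

Almost all of the sequence model's parameters sit in the embedding table, which explains why it is more than sixteen times larger than the binary model.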


In [27]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

Binary model accuracy: 81.70%
Int model accuracy: 80.54%


Exporting the model
----

In [28]:
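export_model = tf.keras.Sequential(
    [
        binary_vectorize_layer,
        binary_model,
        # Softmax turns the four logits into class probabilities, which is
        # what the loss below expects when from_logits=False.
        tf.keras.layers.Activation("softmax"),
    ]
)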

export_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer="adam",
    metrics=["accuracy"],
)

#
# Evaluate the exported model directly on the raw (string) test set
#
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

Accuracy: 81.70%
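
The exported model reports the same accuracy as `binary_model` because the activation appended at the end only rescales the logits and does not reorder them. The sketch below compares an elementwise sigmoid with a softmax on made-up logits: both keep the same winning class, but only softmax yields scores that sum to 1.

```python
import math


def sigmoid(logits):
    """Squash each logit independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Normalize the logits jointly into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [-1.2, 0.3, 2.5, 0.1]  # hypothetical scores for the four classes

print(argmax(logits), argmax(sigmoid(logits)), argmax(softmax(logits)))
print(sum(sigmoid(logits)), sum(softmax(logits)))
```

For a single-label problem like this one, softmax scores are the ones that can be read as class probabilities.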


Running the model on new data
---

In [29]:
def get_string_labels(predicted_scores_batch):
    predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
    predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
    return predicted_labels

In [30]:
inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for question, label in zip(inputs, predicted_labels):
    print("Question: ", question)
    print("Predicted label: ", label.numpy())
    print()

Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'

Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'
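
The decoding step used above, an argmax over each row of scores followed by a lookup in the class names, can be sketched without TensorFlow as below. The helper `scores_to_labels` and the score values are hypothetical. The class order is the one printed earlier for the training set.

```python
CLASS_NAMES = ["csharp", "java", "javascript", "python"]


def scores_to_labels(score_batch, class_names=CLASS_NAMES):
    """Pick the highest-scoring class in each row and return its name."""
    labels = []
    for scores in score_batch:
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(class_names[best])
    return labels


# Made-up scores for two questions, one row per question.
score_batch = [
    [0.10, 0.15, 0.05, 0.70],
    [0.20, 0.60, 0.15, 0.05],
]
predicted = scores_to_labels(score_batch)
print(predicted)  # ['python', 'java']
```

The lookup relies on the class list being in the same order the model was trained with, which is why the notebook reads it from `raw_train_ds.class_names`.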

