# HW04 - NLP
## Part III

You are going to build a classifier that identifies the most likely author of a given text segment (I suggest using segments of 150 to 250 words). This is a multiclass classification task with 3 classes.
- Describe how you prepare the dataset. Create the training, validation, and testing sets. Make a summary table with the dimensions (number of samples) by class for each one of the previous data sets.
- Define three feed-forward (dense) neural network architectures in Keras that make use of the previously built embeddings.
    - Explain the dimensions of each layer of each architecture (model summary).
- Describe the results of combining the 3 architectures with the 3 types of embeddings in terms of accuracy, precision, and recall on the test set.
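
To make the "dimensions of each layer" item concrete, the sketch below computes output shapes and parameter counts by hand for one possible dense architecture: an embedding layer, an average over the sequence, hidden dense layers, and a 3-way softmax output. The function name `describe_architecture` and every number used (vocabulary size, sequence length, embedding dimension, hidden units) are hypothetical and chosen only for illustration; averaging over the sequence is just one way to feed embeddings into a dense network.

```python
def describe_architecture(vocab_size, seq_len, embed_dim, hidden_units, n_classes=3):
    """Return (layer name, output shape, parameter count) for each layer."""
    rows = [
        # One embed_dim vector per vocabulary entry
        ("embedding", (seq_len, embed_dim), vocab_size * embed_dim),
        # Averaging over the sequence axis has no trainable parameters
        ("average_pooling", (embed_dim,), 0),
    ]
    in_dim = embed_dim
    for k, units in enumerate(hidden_units, start=1):
        # Dense layer: weight matrix (in_dim x units) plus one bias per unit
        rows.append((f"dense_{k}", (units,), in_dim * units + units))
        in_dim = units
    rows.append(("softmax_output", (n_classes,), in_dim * n_classes + n_classes))
    return rows


# Hypothetical sizes, not taken from the notebook
rows = describe_architecture(
    vocab_size=10_000, seq_len=250, embed_dim=100, hidden_units=[64, 32]
)
for name, shape, params in rows:
    print(f"{name:<16} {str(shape):<12} {params:>9,}")

total_params = sum(params for _, _, params in rows)
print(f"total parameters: {total_params:,}")
```

With pre-trained embeddings that are kept frozen, the embedding row would count as non-trainable, so the trainable part of such a model is only the dense layers.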

In [14]:
import pandas as pd
import os
import re
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing

We reuse the code from Part 1 to process the texts. The create_text_samples function is added to generate text samples of variable length.

In [15]:
def serialize_text(f) -> list[str]:
    begun = False
    full_text = []
    paragraph = ""

    for base_line in f:
        line = base_line.strip()

        if len(line) == 0:
            if len(paragraph) > 0:
                full_text.append(paragraph.strip())
                paragraph = ""
            continue

        if line.startswith("*** START OF THE PROJECT GUTENBERG EBOOK"):
            begun = True
            continue

        if line.startswith("*** END OF THE PROJECT GUTENBERG EBOOK"):
            break

        if begun:
            paragraph += line + " "

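    # Flush the last paragraph if the text did not end with a blank line
    if paragraph:
        full_text.append(paragraph.strip())

    return full_text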

def create_text_samples(text, min_words=150, max_words=250) -> list[str]:
    """Split text into segments of between min_words and max_words words.
    input: text (str): full text
           min_words (int): minimum number of words per sample
           max_words (int): maximum number of words per sample
    output: list of str: list of text samples
    """
    samples = []
    words = text.split()

    i = 0
    while i < len(words):
        # Draw a random segment size between min_words and max_words (inclusive)
        sample_size = np.random.randint(min_words, max_words + 1)
        if i + sample_size <= len(words):
            sample = ' '.join(words[i:i+sample_size])
            samples.append(sample)
            i += sample_size
        else:
            # Keep the final segment only if it has at least min_words words
            if len(words) - i >= min_words:
                sample = ' '.join(words[i:])
                samples.append(sample)
            break

    return samples
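
A quick sanity check of the segmentation logic can be sketched as follows. This is a standalone variant, not the notebook's function: the name `segment_words` and the `seed` parameter are assumptions, and it uses the standard library's `random.Random` so that the same seed always yields the same segments (the cell above draws from `np.random` without a seed, so its segments change between runs).

```python
import random


def segment_words(text, min_words=150, max_words=250, seed=0):
    """Split text into consecutive segments of min_words..max_words words."""
    rng = random.Random(seed)  # local, seeded generator for reproducibility
    words = text.split()
    samples = []
    i = 0
    while i < len(words):
        size = rng.randint(min_words, max_words)  # both bounds inclusive
        if i + size <= len(words):
            samples.append(" ".join(words[i:i + size]))
            i += size
        else:
            # Keep the tail only if it is long enough to be a valid sample
            if len(words) - i >= min_words:
                samples.append(" ".join(words[i:]))
            break
    return samples


# Toy input: 1000 distinct pseudo-words
toy_text = " ".join(f"w{k}" for k in range(1000))
samples = segment_words(toy_text, seed=42)
lengths = [len(s.split()) for s in samples]
print(lengths)
```

Every segment length falls within the requested range, and a tail shorter than `min_words` is dropped rather than emitted as a short sample.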

We load the data and preprocess the texts with the functions defined above. Each sample is associated with its author and source book.

In [16]:
base_path = "./books"
books = os.listdir(base_path)

data = []
author_mapping = {
    'arthur': 'Arthur Conan Doyle',
    'lewis': 'Lewis Carroll',
    'shakespear': 'William Shakespeare'
}

for book in books:
    # Identify the author from the filename prefix
    author_key = book.split('-')[0]
    author = author_mapping[author_key]

    path = os.path.join(base_path, book)
    with open(path, encoding="utf-8") as f:
        paragraphs = serialize_text(f)
        full_text = ' '.join(paragraphs)

        # Create samples of 150-250 words
        samples = create_text_samples(full_text, min_words=150, max_words=250)

        for sample in samples:
            data.append({
                'text': sample,
                'author': author,
                'book': book.replace('.txt', '')
            })
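
The author lookup above relies on a filename convention in which the text before the first hyphen identifies the author. A minimal sketch of that lookup with an explicit error for unexpected files is shown below; the helper name `author_from_filename` is hypothetical, and the explicit check is an assumption about what might be useful if the books directory ever contains other files.

```python
AUTHOR_MAPPING = {
    "arthur": "Arthur Conan Doyle",
    "lewis": "Lewis Carroll",
    "shakespear": "William Shakespeare",
}


def author_from_filename(filename):
    """Map a book filename such as 'arthur-hound-baskerville.txt' to its author."""
    key = filename.split("-")[0]  # text before the first hyphen
    if key not in AUTHOR_MAPPING:
        raise ValueError(f"Unrecognized author prefix in filename: {filename!r}")
    return AUTHOR_MAPPING[key]


print(author_from_filename("arthur-hound-baskerville.txt"))
print(author_from_filename("lewis-glass.txt"))
print(author_from_filename("shakespear-hamlet.txt"))
```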

We build the DataFrame from the generated samples, map authors to numeric IDs, and split the dataset into training, validation, and test sets (70%/15%/15%, stratified by author). Finally, we print a summary table with the number of samples per author in each set.

In [17]:
df = pd.DataFrame(data)

# Map authors to numeric IDs
author_to_id = {author: idx for idx, author in enumerate(df['author'].unique())}
df['author_id'] = df['author'].map(author_to_id)

print(f"Total samples: {len(df)}")
print(f"\nDistribution by author:")
print(df['author'].value_counts())
print(f"\nDistribution by book:")
print(df['book'].value_counts())

# Split into train, validation, test (70%, 15%, 15%)
train_df, temp_df = train_test_split(df, test_size=0.3, stratify=df['author_id'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['author_id'], random_state=42)

print(f"\n=== DATASET SPLITS ===")
print(f"Train: {len(train_df)} samples")
print(f"Validation: {len(val_df)} samples")
print(f"Test: {len(test_df)} samples")

# Build the summary table
summary_data = []
for dataset_name, dataset in [('Train', train_df), ('Validation', val_df), ('Test', test_df)]:
    for author in df['author'].unique():
        count = len(dataset[dataset['author'] == author])
        summary_data.append({
            'Dataset': dataset_name,
            'Author': author,
            'Samples': count
        })

summary_df = pd.DataFrame(summary_data)
summary_pivot = summary_df.pivot(index='Author', columns='Dataset', values='Samples')
summary_pivot['Total'] = summary_pivot.sum(axis=1)

print("\n=== SUMMARY TABLE ===")
print(summary_pivot)
print(f"\nTotal samples: {summary_pivot['Total'].sum()}")

Total samples: 1795

Distribution by author:
author
Arthur Conan Doyle     1073
William Shakespeare     412
Lewis Carroll           310
Name: count, dtype: int64

Distribution by book:
book
arthur-return-sherlock      560
arthur-hound-baskerville    298
arthur-the-sign-of-four     215
shakespear-hamlet           158
lewis-glass                 148
shakespear-king-henry       136
lewis-alice-wonderland      134
shakespear-the-temptest     118
lewis-hunting                28
Name: count, dtype: int64

=== DATASET SPLITS ===
Train: 1256 samples
Validation: 269 samples
Test: 270 samples

=== SUMMARY TABLE ===
Dataset              Test  Train  Validation  Total
Author                                             
Arthur Conan Doyle    161    751         161   1073
Lewis Carroll          47    217          46    310
William Shakespeare    62    288          62    412

Total samples: 1795
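
The split sizes printed above can be reproduced with a little arithmetic. The sketch below assumes that the held-out count is the requested fraction rounded up and that the remainder goes to the other side, which is consistent with the numbers in the output; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) counts, rounding the held-out part up."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


# First split: 70% train, 30% temporary pool
n_train, n_temp = split_sizes(1795, 0.3)
# Second split: the temporary pool is halved into validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)  # 1256 269 270
```

Because 30% of 1795 is 538.5, the temporary pool has an odd size (539), which is why validation and test differ by one sample.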


We reuse the text preprocessing from Part 1, which was used to build the pre-trained embeddings. This step must run before the Keras tokenizer: if the tokens here are not normalized the same way as when the embeddings were trained, many words will not match the embedding vocabulary and the pre-trained vectors will be of little use.

In [18]:
def tokenize(text: str):
    processed = text.lower()  # Lowercase only
    processed = re.sub(r'[^a-z\s\']', ' ', processed)  # Keep only letters and apostrophes
    processed = re.sub(r'\s+', ' ', processed).strip()  # Normalize whitespace
    tokens = processed.split()
    tokens = [token for token in tokens if len(token) > 1]  # Drop single-character tokens
    return tokens

train_df['text_processed'] = train_df['text'].apply(tokenize)
val_df['text_processed'] = val_df['text'].apply(tokenize)
test_df['text_processed'] = test_df['text'].apply(tokenize)
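
To see what this preprocessing does to a sentence, the standalone sketch below repeats the same `tokenize` logic on a made-up example (the sentence is not from the corpus). Punctuation is replaced by spaces, apostrophes inside words survive, and single-letter tokens such as "a" and "i" are dropped.

```python
import re


def tokenize(text: str):
    """Same normalization as the cell above, repeated so this example runs on its own."""
    processed = text.lower()
    processed = re.sub(r"[^a-z\s']", " ", processed)    # keep letters, whitespace, apostrophes
    processed = re.sub(r"\s+", " ", processed).strip()  # collapse runs of whitespace
    return [token for token in processed.split() if len(token) > 1]


example = "Holmes said: \"It's a trap!\" I agreed."
tokens = tokenize(example)
print(tokens)  # ['holmes', 'said', "it's", 'trap', 'agreed']
```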