# Text classification using pre-trained embeddings

This notebook builds text classifiers that identify the author of a text from among three candidate authors, using pre-trained `GloVe` embeddings.
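
As background, pre-trained GloVe vectors are commonly distributed as plain-text files in which each line holds a word followed by its space-separated vector components. The sketch below shows one way such a file could be parsed with the standard library alone. The helper name `load_glove_vectors` and the file name in the usage comment are illustrative assumptions, not something this notebook defines.

```python
def load_glove_vectors(path: str) -> dict[str, list[float]]:
    """Parse a GloVe-style text file into a {word: vector} dictionary.

    Hypothetical helper: assumes one entry per line, with the word first
    and its vector components after it, all separated by single spaces.
    """
    vectors: dict[str, list[float]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, values = parts[0], parts[1:]
            vectors[word] = [float(value) for value in values]
    return vectors


# Usage (the file name is an assumption; point it at your own download):
# glove = load_glove_vectors("glove.6B.100d.txt")
# glove["estate"]  # list of floats, one per embedding dimension
```

With gensim 4.0 or later, `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)` should read the same headerless format and returns an object with similarity utilities, which may be more convenient than a plain dictionary.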

## 0. Importing libraries

In [6]:
import pandas as pd
import re
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from sklearn.model_selection import train_test_split

In [3]:
# Load the dataset
df = pd.read_csv('data/classifier/sentences.csv')
df.head()

| Unnamed: 0 | author | sentence |
|---|---|---|
| 0 | Jane Austen | Their estate was large, and their residence wa... |
| 1 | Jane Austen | The late owner of this estate was a single man... |
| 2 | Jane Austen | But her death, which happened ten years before... |
| 3 | Jane Austen | Henry Dashwood to his wishes, which proceeded ... |
| 4 | Jane Austen | The son, a steady respectable young man, was a... |


## 1. Preprocessing

In [5]:
# Training and test sets
x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['author'],
                                                    train_size=0.7, random_state=42)

In [7]:
# Tokenization and preprocessing
def preprocess_text(text: str) -> list[str]:
    """
    Clean and tokenize the text by:
    1. Removing punctuation and special characters.
    2. Lowercasing the text and stripping accents.
    3. Tokenizing the text into words.
    4. Removing stopwords.
    
    Args:
    text (str): Input text to preprocess.
    
    Returns:
    list[str]: The tokens (words) of the cleaned text.
    """
    # Remove punctuation: every character that is neither a word character
    # (letter, digit, underscore) nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    
    # Lowercase, strip accents (deacc=True) and tokenize
    tokens = gensim.utils.simple_preprocess(text, deacc=True)
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOPWORDS]
    
    return tokens

x_train = x_train.apply(preprocess_text)
x_train.head()

2746     [apologies, friend, good, charade, confined, s...
6631     [thirteenth, june, french, russian, emperors, ...
10127    [handsome, young, soldier, brought, wood, sett...
16398    [nonperishable, goods, bought, moses, herzog, ...
11648    [race, began, ring, yards, away, course, obsta...
Name: sentence, dtype: object
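
To make the first cleaning step concrete, the following sketch applies the same regular expression to a made-up sentence (the sample text is an assumption, not taken from the dataset). It shows that `[^\w\s]` strips punctuation but leaves digits intact, since `\w` matches letters, digits and underscores.

```python
import re

# Hypothetical sample sentence, invented for illustration only
sample = "Mr. Darcy's estate, Pemberley, was built in 1790!"

# Same pattern as in preprocess_text: drop everything that is not a
# word character (letter, digit, underscore) or whitespace
cleaned = re.sub(r'[^\w\s]', '', sample)

print(cleaned)  # Mr Darcys estate Pemberley was built in 1790
```

The surviving number should not reach the final token list in practice, because `gensim.utils.simple_preprocess` keeps only alphabetic tokens and, with its default `min_len=2` and `max_len=15`, only those between 2 and 15 characters long.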

In [None]:
# Load the pre-trained GloVe embeddings
