# Baseline representation: TF-IDF and n-grams
In this notebook we build the representations for the baseline required by the rubric: an n-gram/TF-IDF model built on the cleaned reviews. We also leave room to compare against embeddings (Word2Vec) as an alternative.

## Objectives
- Load the processed data (train/dev/test) generated in the previous notebook.
- Build a TF-IDF representation with n-grams as the required baseline.
- Analyze vocabulary coverage and check the sparsity of the matrix.
- Save the matrices and the vectorizer so they can be reused in the modeling notebook.
- Leave notes on the future comparison with embeddings (Word2Vec/Sentence-BERT).

In [1]:
from pathlib import Path
import sys
import pandas as pd
from scipy import sparse
import joblib

PROJECT_ROOT = Path("/home/lctr/SEMESTRES/SEMESTRE_5/MINERIA_DE_TEXTO/PROYECTO")
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from sistema_recomendacion.src.features.ngram_representation import build_tfidf_matrix
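
The absolute `PROJECT_ROOT` above only resolves on the original author's machine. A minimal sketch of a more portable alternative, assuming the notebook is stored somewhere beneath a directory that contains the `sistema_recomendacion` package; `find_project_root` is a hypothetical helper, not part of the project:

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "sistema_recomendacion") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")


# Hypothetical usage from a notebook located below the project root:
# PROJECT_ROOT = find_project_root(Path.cwd())
```

Searching for a marker directory keeps the notebook runnable after the repository is cloned to a different location, at the cost of assuming that the directory name is unique along the path.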

## Loading the processed data
We use the files generated in the previous notebook (`books_ratings_clean.csv`, `ratings_train/dev/test.csv`).

In [2]:
processed_dir = PROJECT_ROOT / "sistema_recomendacion" / "data" / "processed"
ratings_train = pd.read_csv(processed_dir / "ratings_train.csv")
ratings_dev = pd.read_csv(processed_dir / "ratings_dev.csv")
ratings_test = pd.read_csv(processed_dir / "ratings_test.csv")
ratings_clean = pd.read_csv(processed_dir / "books_ratings_clean.csv")

ratings_train.head()

Unnamed: 0,book_id,user_id,rating,review_time,review_summary,review_text,book_title_key,rating_normalized
0,B000G167FA,AZUNT3QP2CWTL,5.0,1969-12-31 23:59:59,Have you ever watched the fairies when the rai...,Although the new cover looks more like a Book ...,silver pennies,1.0
1,B000G167FA,AWVWX5F3YEJKZ,5.0,1969-12-31 23:59:59,Found again...,This book was given to me over 30 years ago by...,silver pennies,1.0
2,0786280670,AE3SEXFJCQLJQ,1.0,1969-12-31 23:59:59,Meanspirited woman,The writing was okayish... But these details f...,"julie and julia: 365 days, 524 recipes, 1 tiny...",0.0
3,B000G167FA,AAFZZHA2I598B,5.0,1969-12-31 23:59:59,An incomparable children's classic,This book of children's poems has been enjoyed...,silver pennies,1.0
4,B00005O4HA,A3RTKL9KB8KLID,5.0,1996-08-17 00:00:00,The best mystery novel I have ever read,"I have been a mystery reader for decades, and ...",playing for the ashes,1.0


## Building the text corpus
We concatenate each review's summary and full text into a single document to feed the TF-IDF vectorizer.

In [3]:
def build_corpus(df: pd.DataFrame) -> pd.Series:
    """Join summary and review text into one document per row; missing values become empty strings."""
    return (df["review_summary"].fillna("") + " " + df["review_text"].fillna("")).str.strip()

corpus_train = build_corpus(ratings_train)
corpus_dev = build_corpus(ratings_dev)
corpus_test = build_corpus(ratings_test)

print(f"Corpus train: {len(corpus_train):,} documents")
corpus_train.head()

Corpus train: 1,036,854 documents


0    Have you ever watched the fairies when the rai...
1    Found again... This book was given to me over ...
2    Meanspirited woman The writing was okayish... ...
3    An incomparable children's classic This book o...
4    The best mystery novel I have ever read I have...
dtype: object

## TF-IDF vectorization (baseline)
We fit a vectorizer on the training corpus only, using unigrams and bigrams.

In [4]:
tfidf_vectorizer, X_train = build_tfidf_matrix(
    corpus_train,
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=10,
    max_df=0.5,
    max_features=100_000
)
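
The source of `build_tfidf_matrix` is not shown in this notebook. As a rough sketch of what such a helper could look like, the function below wraps scikit-learn's `TfidfVectorizer` with the same keyword arguments used in the call above. The name `build_tfidf_matrix_sketch`, the `stop_words="english"` setting, and the `float32` dtype are assumptions made for illustration; the project's real helper may differ.

```python
from typing import Iterable, Tuple

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer


def build_tfidf_matrix_sketch(
    corpus: Iterable[str],
    ngram_range: Tuple[int, int] = (1, 2),
    min_df: int = 10,
    max_df: float = 0.5,
    max_features: int = 100_000,
) -> Tuple[TfidfVectorizer, sparse.csr_matrix]:
    """Fit a TF-IDF vectorizer on `corpus` and return it with the fitted matrix."""
    vectorizer = TfidfVectorizer(
        ngram_range=ngram_range,   # (1, 2) means unigrams and bigrams
        min_df=min_df,             # drop terms seen in fewer than `min_df` documents
        max_df=max_df,             # drop terms seen in more than this share of documents
        max_features=max_features, # keep only the most frequent remaining terms
        stop_words="english",      # assumption: not confirmed by the notebook
        dtype=np.float32,          # assumption: halves memory compared with float64
    )
    matrix = vectorizer.fit_transform(corpus)
    return vectorizer, matrix


# Toy usage; the thresholds are relaxed because the corpus is tiny.
toy = ["great story great characters", "boring story", "great book"]
toy_vectorizer, X_toy = build_tfidf_matrix_sketch(toy, min_df=1, max_df=1.0)
print(X_toy.shape[0], "documents,", len(toy_vectorizer.vocabulary_), "features")
```

Returning the fitted vectorizer together with the matrix is what allows the dev and test splits to be transformed later with the same vocabulary and IDF weights.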

In [8]:
# Reuse the vectorizer fitted on train: transform (not fit_transform) keeps the vocabulary and IDF weights fixed
X_dev = tfidf_vectorizer.transform(corpus_dev)
X_test = tfidf_vectorizer.transform(corpus_test)
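
One simple way to look at vocabulary coverage on held-out data is to count the documents that end up with no in-vocabulary term at all, since a model receives no signal for those rows. The sketch below illustrates this on toy data; `empty_row_fraction` is a hypothetical helper, and the toy corpora stand in for `corpus_train` and `corpus_dev`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def empty_row_fraction(matrix) -> float:
    """Share of documents whose TF-IDF row has no nonzero entry."""
    nonzeros_per_row = np.diff(matrix.tocsr().indptr)
    return float((nonzeros_per_row == 0).mean())


toy_train = ["great story", "great characters", "boring story"]
toy_dev = ["great story again", "unseen words only"]

# Fit on train only, then transform dev with the frozen vocabulary.
vectorizer = TfidfVectorizer().fit(toy_train)
X_dev_toy = vectorizer.transform(toy_dev)

# Every split shares the column space defined by the training vocabulary.
print("Same column count as vocabulary:", X_dev_toy.shape[1] == len(vectorizer.vocabulary_))
print("Fraction of dev documents without known terms:", empty_row_fraction(X_dev_toy))
```

Applied to the real matrices, the same function could be called as `empty_row_fraction(X_dev)` and `empty_row_fraction(X_test)`.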

## Quick analysis of the representation
We review the vocabulary size, the sparsity of the matrix, and the terms with the highest summed TF-IDF weight.

In [9]:
vocab_size = len(tfidf_vectorizer.vocabulary_)
density = X_train.nnz / (X_train.shape[0] * X_train.shape[1])

print(f"Vocab size: {vocab_size:,}")
print(f"Train matrix density: {density:.6f}")

feature_names = tfidf_vectorizer.get_feature_names_out()
term_weights = X_train.sum(axis=0).A1  # summed TF-IDF weight per term, computed once
top_indices = term_weights.argsort()[::-1][:20]
top_terms = pd.Series(term_weights[top_indices], index=feature_names[top_indices])
top_terms

Vocab size: 100,000
Train matrix density: 0.000942
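
To make the density figure concrete, it can be turned into an average number of active features per review. The arithmetic below uses the rounded values printed in this notebook and assumes that `X_train` has one row per document in `corpus_train`, so the results are approximations.

```python
n_docs = 1_036_854      # documents in corpus_train (printed earlier)
n_features = 100_000    # vocabulary size (printed above)
density = 0.000942      # rounded train-matrix density (printed above)

# density = nnz / (n_docs * n_features), so nnz / n_docs = density * n_features
avg_nonzeros_per_doc = density * n_features
approx_total_nonzeros = density * n_docs * n_features

print(f"Average nonzero features per document: ~{avg_nonzeros_per_doc:.0f}")
print(f"Approximate stored entries: ~{approx_total_nonzeros / 1e6:.0f} million")
```

On these numbers each review activates roughly 94 of the 100,000 features, which is why a sparse format is the practical choice for storage.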


quot          19808.796875
story         17801.042969
great         14668.979492
like          14047.493164
good          13960.112305
time          12840.038086
books         12764.720703
novel         12689.014648
just          12523.813477
reading       12229.538086
life          12130.949219
love          11243.894531
characters    10961.800781
people        10034.844727
really        10013.317383
best           9036.454102
way            8956.058594
world          8942.403320
written        8571.416016
think          8388.578125
dtype: float32
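
The top-ranked term, `quot`, is most likely a remnant of the HTML entity `&quot;` left in the raw reviews rather than a real word; this is an inference from the output, not something the notebook confirms. If that is the case, decoding entities before vectorization would remove it. A minimal sketch using only the standard library, where `unescape_review` is a hypothetical helper:

```python
import html
import re


def unescape_review(text: str) -> str:
    """Decode HTML entities and collapse any repeated whitespace."""
    decoded = html.unescape(text)
    return re.sub(r"\s+", " ", decoded).strip()


sample = "She called it &quot;a classic&quot; &amp; I agree."
print(unescape_review(sample))
```

In this notebook the helper could be applied to each corpus before fitting, for example `corpus_train.map(unescape_review)`, although fixing it in the cleaning notebook would keep all downstream artifacts consistent.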

## Persistence for the baseline model
We save the vectorizer (with `joblib`) and the matrices (in `.npz` format) for the `03_modelo_baseline` notebook.

In [10]:
features_dir = PROJECT_ROOT / "sistema_recomendacion" / "data" / "processed"
vectorizer_path = features_dir / "tfidf_vectorizer.joblib"
train_matrix_path = features_dir / "X_train_tfidf.npz"
dev_matrix_path = features_dir / "X_dev_tfidf.npz"
test_matrix_path = features_dir / "X_test_tfidf.npz"

joblib.dump(tfidf_vectorizer, vectorizer_path)
sparse.save_npz(train_matrix_path, X_train)
sparse.save_npz(dev_matrix_path, X_dev)
sparse.save_npz(test_matrix_path, X_test)

vectorizer_path, train_matrix_path, dev_matrix_path, test_matrix_path

(PosixPath('/home/lctr/SEMESTRES/SEMESTRE_5/MINERIA_DE_TEXTO/PROYECTO/sistema_recomendacion/data/processed/tfidf_vectorizer.joblib'),
 PosixPath('/home/lctr/SEMESTRES/SEMESTRE_5/MINERIA_DE_TEXTO/PROYECTO/sistema_recomendacion/data/processed/X_train_tfidf.npz'),
 PosixPath('/home/lctr/SEMESTRES/SEMESTRE_5/MINERIA_DE_TEXTO/PROYECTO/sistema_recomendacion/data/processed/X_dev_tfidf.npz'),
 PosixPath('/home/lctr/SEMESTRES/SEMESTRE_5/MINERIA_DE_TEXTO/PROYECTO/sistema_recomendacion/data/processed/X_test_tfidf.npz'))
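
For reference, the modeling notebook can read these artifacts back with `joblib.load` and `scipy.sparse.load_npz`. The sketch below runs the full round trip on toy data in a temporary directory, which stands in for the real `processed` folder, and checks that the reloaded matrix still matches the reloaded vocabulary.

```python
import tempfile
from pathlib import Path

import joblib
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["great story", "great characters", "boring story"]
vectorizer = TfidfVectorizer().fit(toy_corpus)
X_toy = vectorizer.transform(toy_corpus)

with tempfile.TemporaryDirectory() as tmp:
    features_dir = Path(tmp)  # stand-in for the real processed directory
    joblib.dump(vectorizer, features_dir / "tfidf_vectorizer.joblib")
    sparse.save_npz(features_dir / "X_train_tfidf.npz", X_toy)

    loaded_vectorizer = joblib.load(features_dir / "tfidf_vectorizer.joblib")
    X_loaded = sparse.load_npz(features_dir / "X_train_tfidf.npz")

# The column count must equal the vocabulary size, or the files are out of sync.
print(X_loaded.shape[1] == len(loaded_vectorizer.vocabulary_))
```

Running a check like this right after loading is a cheap guard against pairing a matrix with a vectorizer saved from a different run.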

## Next steps / Comparison with embeddings
- Train a Word2Vec model on the clean corpus and compare its metrics against the TF-IDF baseline.
- Explore pretrained embeddings (FastText, Sentence-BERT) as an alternative representation to satisfy the rubric.
- Evaluate whether combining TF-IDF with embeddings improves performance.
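
As a starting point for the combination mentioned in the last item, one common approach is to concatenate the sparse TF-IDF block with a dense block of document embeddings column-wise. The sketch below uses random numbers as stand-ins for both blocks; the shapes and variable names are illustrative assumptions, and real embeddings would come from whichever model is eventually chosen.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Stand-ins: 4 documents, 6 TF-IDF features, 3 embedding dimensions.
dense_tfidf = rng.random((4, 6)) * (rng.random((4, 6)) > 0.7)
X_tfidf_toy = sparse.csr_matrix(dense_tfidf.astype(np.float32))
doc_embeddings = rng.normal(size=(4, 3)).astype(np.float32)

# L2-normalise each embedding so both blocks are on a comparable scale
# (TfidfVectorizer rows are L2-normalised by default).
norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
doc_embeddings = doc_embeddings / np.clip(norms, 1e-12, None)

# Column-wise concatenation; the result stays sparse.
X_combined = sparse.hstack(
    [X_tfidf_toy, sparse.csr_matrix(doc_embeddings)], format="csr"
)
print(X_combined.shape)
```

Whether plain concatenation helps, or whether the two blocks need separate weighting, is something the dev split would have to decide.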