# 2. Text Preprocessing

In [1]:
import pandas as pd
import numpy as np
import re
import string

I load the IMDb dataset that I explored in the previous notebook, Exploracion.ipynb.

In [2]:
df = pd.read_csv('./IMDB Dataset.csv')

In [None]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
print(f"Total reviews: {len(df)}")
print(f"Columns: {df.columns.tolist()}")

Total reviews: 50000
Columns: ['review', 'sentiment']


First, I check for null values in the columns that matter, so I know whether the dataset needs cleaning.

In [5]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

I drop any rows with null values in review or sentiment. The check above found none, so this is only a safeguard.

In [None]:
df = df.dropna(subset=['review', 'sentiment'])

print(f"Reviews after dropping nulls: {len(df)}")

Reviews after dropping nulls: 50000


I create several helper functions, one for each preprocessing task.

Clean HTML and special characters: removes HTML tags, URLs, mentions, and hashtags.

In [None]:
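def clean_html_and_special_chars(text):
    # Remove HTML tags such as <br />
    text = re.sub(r'<[^>]+>', '', text)
    # Remove URLs that start with http or www
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove @mentions and #hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    return text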
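One detail of the tag regex is worth a quick check: tags are replaced with an empty string, so two words separated only by `<br />` end up glued together once punctuation is removed (the token `timethis` in the processed examples further down looks like this case). The sketch below compares the notebook's behavior with a hypothetical variant, `strip_tags_spaced`, that substitutes a space instead. The sample string is made up for illustration and is not taken from the dataset.

```python
import re

def strip_tags_joined(text):
    # Same regex as clean_html_and_special_chars: tags become an empty string
    return re.sub(r'<[^>]+>', '', text)

def strip_tags_spaced(text):
    # Hypothetical variant (not in the notebook): tags become a single space
    return re.sub(r'<[^>]+>', ' ', text)

# Illustrative sample, shaped like an IMDb review with <br /> line breaks
sample = "parents are fighting all the time.<br /><br />This movie is slower"

print(strip_tags_joined(sample))
print(strip_tags_spaced(sample))
```

With the spaced variant, the extra spaces it leaves behind would be collapsed later by the whitespace step, so the rest of the pipeline would not need to change.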

Convert to lowercase

In [None]:
def to_lowercase(text):
    return text.lower()

Remove punctuation from the text

In [None]:
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)
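Deleting punctuation outright also merges contractions and hyphenated words, which is where tokens such as `theres` and `oldtimebbc` in the processed examples come from. The sketch below shows that effect next to a hypothetical alternative, `space_table`, that maps each punctuation character to a space. The alternative is an assumption for comparison only and is not what the notebook uses.

```python
import string

# Same table as remove_punctuation: every punctuation character is deleted
delete_table = str.maketrans('', '', string.punctuation)

# Hypothetical alternative (not in the notebook): punctuation becomes a space
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

samples = ["there's", "old-time-BBC", "don't"]
deleted = [s.translate(delete_table) for s in samples]
spaced = [s.translate(space_table) for s in samples]

for original, d, s in zip(samples, deleted, spaced):
    print(f"{original!r}: deleted -> {d!r}, spaced -> {s!r}")
```

With deletion, `dont` and `theres` do not match any entry in the stopword set defined below, while the spaced form yields fragments such as `don`, `t`, and `s`, which are in that set. Which behavior is preferable depends on the downstream model.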

Remove all digits from the text.

In [None]:
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

Collapse repeated whitespace and strip leading and trailing spaces

In [None]:
def remove_extra_spaces(text):
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

Return a set of English stopwords.

In [None]:
def get_stopwords():
    stopwords = {
        'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
        'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them',
        'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
        'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been',
        'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
        'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
        'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between',
        'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to',
        'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
        'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how',
        'all', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
        'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
        's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
    }

    return stopwords

Remove stopwords from the text.

In [None]:
def remove_stopwords(text, stopwords):
    words = text.split()
    filtered_words = [word for word in words if word not in stopwords]
    
    return ' '.join(filtered_words)
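Because the stopword set includes `no`, `nor`, and `not`, negations are filtered out along with the other common words. The sketch below uses a small subset of the set to show the effect, plus a hypothetical variant, `stopwords_keep_negation`, that leaves negations in place. Whether keeping them helps is an assumption that would need to be checked against the sentiment model.

```python
def remove_stopwords(text, stopwords):
    # Same logic as the notebook's function: keep words not in the set
    words = text.split()
    return ' '.join(word for word in words if word not in stopwords)

# Small subset of the notebook's stopword set, enough for this demo
stopwords = {'this', 'was', 'a', 'the', 'no', 'nor', 'not', 'very'}

# Hypothetical variant (not in the notebook): negation words are kept
stopwords_keep_negation = stopwords - {'no', 'nor', 'not'}

sample = "this was not a very good movie"
with_default = remove_stopwords(sample, stopwords)
with_negation = remove_stopwords(sample, stopwords_keep_negation)

print(with_default)
print(with_negation)
```

Note that `no` has only two letters, so the short-word filter with `min_length=3` would still drop it even if it were kept here.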

Remove very short words; words of 1 or 2 letters are generally not useful.

In [None]:
def remove_short_words(text, min_length=3):
    words = text.split()
    filtered_words = [word for word in words if len(word) >= min_length]
    
    return ' '.join(filtered_words)

Main preprocessing function.
It combines all the previous functions into a single preprocessing pipeline.

In [None]:
def preprocess_text(text, stopwords=None, remove_stops=True, min_word_length=3):
    
    if pd.isna(text):
        return ''
    
    text = clean_html_and_special_chars(text)
    text = to_lowercase(text)
    text = remove_punctuation(text)
    text = remove_numbers(text)
    text = remove_extra_spaces(text)
    if remove_stops and stopwords is not None:
        text = remove_stopwords(text, stopwords)
    
    text = remove_short_words(text, min_word_length)
    return text
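To see what each stage contributes, the following sketch applies the same steps in the same order as `preprocess_text` and records the text after every one. The helper name `trace_pipeline`, the reduced `STOPWORDS` set, and the sample sentence are all assumptions made for this illustration.

```python
import re
import string

# Reduced stopword set for the demo (subset of the notebook's set)
STOPWORDS = {'i', 'it', 'in', 'a', 'the', 'is', 'this', 'was', 'of', 'very'}

def trace_pipeline(text, stopwords, min_length=3):
    """Apply the same steps as preprocess_text, keeping each intermediate result."""
    steps = []
    text = re.sub(r'<[^>]+>', '', text)          # HTML tags
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'@\w+|#\w+', '', text)        # mentions and hashtags
    steps.append(('clean', text))
    text = text.lower()
    steps.append(('lowercase', text))
    text = text.translate(str.maketrans('', '', string.punctuation))
    steps.append(('punctuation', text))
    text = re.sub(r'\d+', '', text)
    steps.append(('numbers', text))
    text = re.sub(r'\s+', ' ', text).strip()
    steps.append(('spaces', text))
    text = ' '.join(w for w in text.split() if w not in stopwords)
    steps.append(('stopwords', text))
    text = ' '.join(w for w in text.split() if len(w) >= min_length)
    steps.append(('short words', text))
    return steps

# Made-up sample sentence, not taken from the dataset
sample = "I saw it in 2005: <b>Great</b> film, ok plot, 10/10!"
for name, value in trace_pipeline(sample, STOPWORDS):
    print(f"{name:12} | {value}")
```

The trace makes one side effect visible: the rating `10/10` disappears entirely, because the slash is removed as punctuation and the remaining digits are removed as numbers.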

I test with three sample reviews (positions 1 to 3) to verify that the preprocessing works as expected.

In [32]:
stopwords = get_stopwords()

for i in range(1, 4):
    original = df['review'].iloc[i]
    processed = preprocess_text(original, stopwords)
    
    print(f"\nExample {i}:")
    print(f"Original ({len(original)} chars): {original[:100]}")
    print(f"Processed ({len(processed)} chars): {processed[:100]}")


Example 1:
Original (998 chars): A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-B
Processed (656 chars): wonderful little production filming technique unassuming oldtimebbc fashion gives comforting sometim

Example 2:
Original (926 chars): I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air con
Processed (578 chars): thought wonderful way spend time hot summer weekend sitting air conditioned theater watching lighthe

Example 3:
Original (748 chars): Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his par
Processed (456 chars): basically theres family little boy jake thinks theres zombie closet parents fighting timethis movie 


## Apply preprocessing to the whole dataset

Now I apply the preprocessing function to every review.

In [33]:
df['processed_review'] = df['review'].apply(lambda x: preprocess_text(x, stopwords))

I check the result of the preprocessing

In [None]:
df[['review', 'processed_review', 'sentiment']].head()

Unnamed: 0,review,processed_review,sentiment
0,One of the other reviewers has mentioned that ...,one reviewers mentioned watching episode youll...,positive
1,A wonderful little production. <br /><br />The...,wonderful little production filming technique ...,positive
2,I thought this was a wonderful way to spend ti...,thought wonderful way spend time hot summer we...,positive
3,Basically there's a family where a little boy ...,basically theres family little boy jake thinks...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",petter matteis love time money visually stunni...,positive


I check whether any reviews are empty after preprocessing

In [None]:
empty_reviews = df[df['processed_review'] == '']
print(f"Empty reviews after preprocessing: {len(empty_reviews)}")

Empty reviews after preprocessing: 0


I drop any reviews that ended up empty after preprocessing.

In [None]:
df = df[df['processed_review'] != '']
print(f"Remaining reviews: {len(df)}")

Remaining reviews: 50000


Preprocessing statistics. I compute the mean and median text length before and after.

In [34]:
df['original_length'] = df['review'].apply(len)
df['processed_length'] = df['processed_review'].apply(len)

print("Text length statistics:")
print("\nOriginal:")
print(f"Mean: {df['original_length'].mean():.2f} characters")
print(f"Median: {df['original_length'].median():.2f} characters")

print("\nPreprocessed:")
print(f"Mean: {df['processed_length'].mean():.2f} characters")
print(f"Median: {df['processed_length'].median():.2f} characters")

reduction = (1 - df['processed_length'].mean() / df['original_length'].mean()) * 100
print(f"\nReduction in mean length: {reduction:.2f}%")

Text length statistics:

Original:
Mean: 1309.43 characters
Median: 970.00 characters

Preprocessed:
Mean: 830.57 characters
Median: 610.00 characters

Reduction in mean length: 36.57%


In [None]:
df['original_words'] = df['review'].apply(lambda x: len(x.split()))
df['processed_words'] = df['processed_review'].apply(lambda x: len(x.split()))

print("Word count statistics:")
print("\nOriginal:")
print(f"Mean: {df['original_words'].mean():.2f} words")
print(f"Median: {df['original_words'].median():.2f} words")

print("\nPreprocessed:")
print(f"Mean: {df['processed_words'].mean():.2f} words")
print(f"Median: {df['processed_words'].median():.2f} words")

word_reduction = (1 - df['processed_words'].mean() / df['original_words'].mean()) * 100
print(f"\nReduction in mean word count: {word_reduction:.2f}%")

Word count statistics:

Original:
Mean: 231.16 words
Median: 173.00 words

Preprocessed:
Mean: 117.48 words
Median: 87.00 words

Reduction in mean word count: 49.18%
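As a sanity check, both reduction figures can be reproduced from the printed means with the same formula the cells use, `(1 - after / before) * 100`. The helper name `percent_reduction` is introduced here only for illustration, and the inputs are the rounded means shown in the outputs above.

```python
def percent_reduction(before, after):
    # Same formula as the notebook: relative drop of the mean, as a percentage
    return (1 - after / before) * 100

# Rounded means copied from the printed statistics
char_reduction = percent_reduction(1309.43, 830.57)
word_reduction = percent_reduction(231.16, 117.48)

print(f"Characters: {char_reduction:.2f}%")
print(f"Words: {word_reduction:.2f}%")
```

This is a ratio of means, not the mean of per-review reductions, so the typical reduction for an individual review may differ somewhat from these figures.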


I build the final dataset with the original reviews, the preprocessed reviews, and the sentiment label.

In [None]:
df_final = df[['review', 'processed_review', 'sentiment']].copy()
df_final.head()

Unnamed: 0,review,processed_review,sentiment
0,One of the other reviewers has mentioned that ...,one reviewers mentioned watching episode youll...,positive
1,A wonderful little production. <br /><br />The...,wonderful little production filming technique ...,positive
2,I thought this was a wonderful way to spend ti...,thought wonderful way spend time hot summer we...,positive
3,Basically there's a family where a little boy ...,basically theres family little boy jake thinks...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",petter matteis love time money visually stunni...,positive


Save to CSV

In [None]:
df_final.to_csv('./imdb_preprocessed.csv', index=False)
print(f"Preprocessed dataset saved to: imdb_preprocessed.csv\nTotal reviews saved: {len(df_final)}")

Preprocessed dataset saved to: imdb_preprocessed.csv
Total reviews saved: 50000


## Preprocessing summary

The preprocessing steps, in the order the pipeline applies them, are:
1. HTML and special-character cleanup
2. Normalization: converts everything to lowercase
3. Punctuation removal
4. Number removal
5. Extra-whitespace removal
6. Stopword removal: removes very common words with little semantic value
7. Short-word removal: removes words shorter than 3 characters

The result is clean, normalized text.

Advantages I found in preprocessing:
- Reduces the vocabulary and the complexity of the text (the mean word count per review drops by 49.18%)
- Removes noise from the text
- Can improve model performance (not measured in this notebook)
- Makes analysis and feature extraction easier