### **Text Preprocessing Functions**

- `clean_text`: Lowercases the text and removes punctuation so that tokens are uniform.
- `tokenize_text`: Splits the cleaned text into tokens (individual words) with NLTK's `word_tokenize`.
- `remove_stopwords`: Removes common English words (stopwords) that usually carry little information for NLP models.
- `lemmatize_words`: Reduces each word to its base form (lemma), so that different inflections of a word are treated as a single entry.
- `preprocess_text`: Chains the previous functions into a single pipeline and returns the fully preprocessed text as one string.

In [1]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download the required NLTK resources
# Note: recent NLTK releases may also require nltk.download('punkt_tab') for word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def clean_text(text):
    """Lowercase the text and remove punctuation."""
    # Convert to lowercase
    text = text.lower()
    # Remove every character that is not a word character (letter, digit, underscore) or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text

def tokenize_text(text):
    """Split the text into individual word tokens."""
    return word_tokenize(text)

def remove_stopwords(tokens):
    """Remove English stopwords from the token list."""
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    return filtered_tokens

def lemmatize_words(tokens):
    """Lemmatize the tokens to reduce them to their base form (lemma)."""
    lemmatizer = WordNetLemmatizer()
    # lemmatize() defaults to pos='n', so only noun forms are reduced (e.g. "times" -> "time")
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

def preprocess_text(text):
    """Full preprocessing pipeline that chains all the steps."""
    text = clean_text(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize_words(tokens)
    # Join the tokens back into a single string for later analysis
    preprocessed_text = ' '.join(lemmatized_tokens)
    return preprocessed_text



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/otgerpeidro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/otgerpeidro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/otgerpeidro/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
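
To make the cleaning step concrete, the following minimal sketch reproduces the logic of `clean_text` with only the standard library and runs it on a few short strings. The sample strings are illustrative assumptions, not rows from the datasets. It suggests that the pattern `[^\w\s]` keeps letters, digits and underscores, drops apostrophes and slashes, and can leave a double space where a standalone hyphen used to be.

```python
import re

def clean_text(text):
    """Same logic as the notebook's clean_text: lowercase, then drop non-word, non-space characters."""
    return re.sub(r'[^\w\s]', '', text.lower())

# Illustrative inputs (assumptions, not taken from the review data)
samples = ["can't wait!", "user_name scored 3/5", "awesome sound - cant wait"]
for sample in samples:
    # repr() makes leftover whitespace visible
    print(repr(sample), "->", repr(clean_text(sample)))
```

The leftover double space is harmless in this pipeline, because the text is tokenized afterwards and the tokens are rejoined with single spaces.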



**Download the Datasets**

We download the datasets again because this is a separate notebook.

In [8]:
import requests

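def download_file(url, filename):
    """Download url and save the response body to filename."""
    response = requests.get(url, timeout=60)
    # Fail loudly on HTTP errors instead of saving an error page to disk
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)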

download_file("https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/review_categories/Software.jsonl.gz", "Software.jsonl.gz")
download_file("https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/review_categories/Digital_Music.jsonl.gz", "Digital_Music.jsonl.gz")

**Functions to Load and Preprocess Data**

In [9]:
import gzip
import json
import pandas as pd

def load_data(file_name, n_rows=1000):
    """Load the first n_rows records of a gzip-compressed JSON Lines file into a DataFrame."""
    data = []
    with gzip.open(file_name, 'rt', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= n_rows:
                break
            data.append(json.loads(line))
    return pd.DataFrame(data)

software_data = load_data('Software.jsonl.gz', 1000)
digital_music_data = load_data('Digital_Music.jsonl.gz', 1000)
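
The loading pattern above (gzip-compressed JSON Lines, one JSON object per line, read only up to a row limit) can be tried without downloading anything. The sketch below is a standard-library variant under stated assumptions: `load_records`, the toy file name and the toy rows are hypothetical, and it returns a list of dicts rather than a DataFrame.

```python
import gzip
import json
import os
import tempfile
from itertools import islice

def load_records(file_name, n_rows=1000):
    """Read at most n_rows JSON objects from a gzip-compressed JSON Lines file."""
    with gzip.open(file_name, 'rt', encoding='utf-8') as f:
        # islice stops after n_rows lines, so the rest of the file is never parsed
        return [json.loads(line) for line in islice(f, n_rows)]

# Build a tiny toy file (hypothetical contents) to exercise the reader
toy_reviews = [
    {"text": "mcaffee IS malware"},
    {"text": "One of my favorite games"},
    {"text": "Music stinks."},
]
toy_path = os.path.join(tempfile.mkdtemp(), "toy.jsonl.gz")
with gzip.open(toy_path, 'wt', encoding='utf-8') as f:
    for review in toy_reviews:
        f.write(json.dumps(review) + "\n")

records = load_records(toy_path, n_rows=2)
print(len(records), records[0]["text"])
```

Passing the resulting list to `pd.DataFrame` would give the same kind of table that `load_data` returns.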

In [11]:
# Apply the preprocessing
software_data['preprocessed_text'] = software_data['text'].apply(preprocess_text)
digital_music_data['preprocessed_text'] = digital_music_data['text'].apply(preprocess_text)

In [12]:
# Show some results
print(software_data[['text', 'preprocessed_text']].head())
print(digital_music_data[['text', 'preprocessed_text']].head())

                                                text  \
0                                 mcaffee IS malware   
1  I love playing tapped out because it is fun to...   
2  I love this flashlight app!  It really illumin...   
3                           One of my favorite games   
4  Cute game. I am not that good at it but my kid...   

                                   preprocessed_text  
0                                    mcaffee malware  
1  love playing tapped fun watch town grow earnin...  
2  love flashlight app really illuminates dark co...  
3                                  one favorite game  
4               cute game good kid love nik wallenda  
                                                text  \
0  If i had a dollar for how many times I have pl...   
1  awesome sound - cant wait to see them in perso...   
2  This is a great cd. Good music and plays well....   
3  These are not real German singers, they have a...   
4  I first heard this playing in a Nagoya shop an... 

In [13]:
# Inspect the changes
for index, row in digital_music_data.head(5).iterrows():
    print("Original:", row['text'])
    print("Preprocessed:", row['preprocessed_text'])
    print("---")


Original: If i had a dollar for how many times I have played this cd and how many times I have asked Alexa to play it, I would be rich. Love this singer along with the Black Pumas. Finding a lot of new music that I like a lot on amazon. Try new things.
Preprocessed: dollar many time played cd many time asked alexa play would rich love singer along black puma finding lot new music like lot amazon try new thing
---
Original: awesome sound - cant wait to see them in person - always miss them when they are in town !
Preprocessed: awesome sound cant wait see person always miss town
---
Original: This is a great cd. Good music and plays well. Seller responded back very quicky and  received it within 3 days
Preprocessed: great cd good music play well seller responded back quicky received within 3 day
---
Original: These are not real German singers, they have accents. It is nothing what they advertised it. Music stinks.
Preprocessed: real german singer accent nothing advertised music stink
---

In [14]:
# Compute the vocabulary size
from collections import Counter

def vocab_size(data):
    # Tokens are split on whitespace only, so "cd" and "cd." count as different words
    all_words = ' '.join(data).split()
    vocabulary = Counter(all_words)
    return len(vocabulary)

original_vocab_size = vocab_size(digital_music_data['text'])
processed_vocab_size = vocab_size(digital_music_data['preprocessed_text'])
print("Original vocabulary:", original_vocab_size)
print("Vocabulary after preprocessing:", processed_vocab_size)


Original vocabulary: 13721
Vocabulary after preprocessing: 8240
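
Part of this reduction comes simply from lowercasing and stripping punctuation, since a whitespace split treats `Great`, `great` and `great,` as three different words. The sketch below illustrates that effect on two made-up sentences (an assumption, not dataset rows) with a set-based variant of `vocab_size`, and then computes the relative reduction for the two figures reported above.

```python
import re

def vocab_size(texts):
    """Count distinct whitespace-separated tokens across all texts."""
    return len({token for text in texts for token in text.split()})

# Two made-up reviews (illustrative only)
raw = ["Great CD. Great music!", "great cd, great sound"]
# Same cleaning rule as clean_text: lowercase, then drop punctuation
cleaned = [re.sub(r'[^\w\s]', '', text.lower()) for text in raw]

print("Toy vocabulary before cleaning:", vocab_size(raw))
print("Toy vocabulary after cleaning:", vocab_size(cleaned))

# Relative reduction for the vocabulary sizes printed by the notebook
original, processed = 13721, 8240
reduction = 1 - processed / original
print(f"Reduction on Digital_Music sample: {reduction:.1%}")
```

On the notebook's figures this works out to a vocabulary roughly 40% smaller after preprocessing; stopword removal and lemmatization account for the part not explained by cleaning alone.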
