# Dataset processing
This notebook groups the procedures used to import, concatenate, and process the originally provided datasets:
- `dataset_es_dev.json`
- `dataset_es_train.json`
- `dataset_es_test.json`

The goal is to generate datasets that keep only the relevant attributes, with the reviews lemmatized or stemmed, as appropriate:
- `dataset_amazon_reviews_lemma.json`
- `dataset_amazon_reviews_stem.json`

This saves time when running the notebook that solves the assigned task.

### Import the required libraries

In [2]:
import pandas as pd
import numpy as np

# Garbage collector, to optimize resource usage
import gc

In [3]:
# Beforehand, download the model that is loaded below:
## python -m spacy download en_core_web_md

import spacy # https://spacy.io/usage/models
nlp = spacy.load('en_core_web_md')

# Stop words from spaCy's English language data
from spacy.lang.en.stop_words import STOP_WORDS
stopwords_spacy = list(STOP_WORDS)

In [4]:
import nltk

# Stop words from NLTK (requires a prior nltk.download('stopwords'))
from nltk.corpus import stopwords
stopwords_nltk = set(stopwords.words('english'))

import re
from nltk.tokenize import RegexpTokenizer

In [5]:
# Class to customize console output (ANSI escape codes)
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'
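
The attributes of this class are ANSI escape sequences: a terminal that supports them switches style when it reads the code and returns to the default at `END`. A minimal sketch of how they wrap a value; the `colorize` helper is hypothetical and not part of the notebook:

```python
# ANSI escape codes, same values as in the `color` class
YELLOW = '\033[93m'
END = '\033[0m'

def colorize(text, code):
    """Hypothetical helper: wrap text in a style code and reset afterwards."""
    return f"{code}{text}{END}"

styled = colorize(23486, YELLOW)
print("- Rows:", styled)  # the number appears in yellow on ANSI-capable terminals
print(repr(styled))       # the raw string, with the escape codes visible
```

When the output is exported to plain text, the escape character is usually lost and only fragments such as `[93m` remain around the value.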

### Import the original dataset

In [6]:
data = pd.read_csv("data/Womens_Clothing_Reviews.csv")
print("- Number of rows in the dataset:" + color.YELLOW, data.shape[0], color.END)
print("- Number of attributes (columns) in the dataset:" + color.CYAN, data.shape[1], color.END)

- Number of rows in the dataset: 23486
- Number of attributes (columns) in the dataset: 11


### Column filtering
We keep only the columns that are relevant for predicting the overall rating: the product class, clothing ID, reviewer age, review title and body, rating, and recommendation flag. The remaining columns are discarded. The class name can help relate words to a product category.

In [7]:
# Filter the columns and rename them for ease of use
data = data[['Class Name','Clothing ID','Age','Title','Review Text','Rating','Recommended IND']]
mapper = {'Clothing ID':'clothing_id', 'Age':'usr_age', 'Title':'review_title', 'Review Text':'review_body', 'Rating':'rating',
       'Recommended IND':'recommended', 'Class Name':'class_name'}
data.rename(columns=mapper,inplace=True)
data.head()

| | class_name | clothing_id | usr_age | review_title | review_body | rating | recommended |
|---|---|---|---|---|---|---|---|
| 0 | Intimates | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 |
| 1 | Dresses | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 |
| 2 | Dresses | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 |
| 3 | Pants | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 |
| 4 | Blouses | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 |


### Data cleaning

In [33]:
# Null handling: missing titles get a placeholder, rows with any other null are dropped
df = data.copy()
df.fillna(value={'review_title': "-"}, inplace=True)
df = df.dropna()
df.reset_index(drop=True, inplace=True)

# Percentage of nulls per column
df.isna().sum()*100/df.shape[0]

class_name      0.0
clothing_id     0.0
usr_age         0.0
review_title    0.0
review_body     0.0
rating          0.0
recommended     0.0
dtype: float64
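
The null handling in this cell fills missing titles with a placeholder so those rows survive, then drops any row that still has a missing value (for example, a review with no body). A small sketch on made-up rows, assuming the same column names; the data is illustrative only:

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), mirroring three of the notebook's columns
toy = pd.DataFrame({
    'review_title': ['Great fit', np.nan, 'Too small'],
    'review_body': ['Love it', 'Nice fabric', np.nan],
    'rating': [5, 4, 2],
})

# Missing titles get a placeholder, so the row is kept
toy_clean = toy.fillna(value={'review_title': '-'})
# Rows with any remaining null (here: the missing body) are dropped
toy_clean = toy_clean.dropna().reset_index(drop=True)

print(toy_clean)
```

Resetting the index matters later in the notebook, because the cleaning loops read rows by position using the values of `df.index`.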

In [21]:
# Punctuation constant
import string
puntua = string.punctuation + '#...'
# POS tags excluded during cleaning
excluded_pos = ['SCONJ','CCONJ','NUM','PUNCT','PRON','DET','ADP','AUX','X']

In [22]:
# Data-cleaning function using the lemmatizer
def text_data_lemma(sentence):
    doc = nlp(sentence)

    clean_tokens = []
    for token in doc:
        # Keep tokens whose POS tag is not excluded, that are not stop words, and that are longer than 2 characters
        if token.pos_ not in excluded_pos and str(token) not in stopwords_spacy and len(token.text) > 2:
            temp = token.lemma_.strip()
            clean_tokens.append(temp.lower())

    return clean_tokens

In [23]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# Data-cleaning function using the stemmer
def text_data_stem(sentence):
    doc = nlp(sentence)

    clean_tokens = []
    for token in doc:
        # Same filter as in text_data_lemma; only the normalization step differs
        if token.pos_ not in excluded_pos and str(token) not in stopwords_spacy and len(token.text) > 2:
            temp = stemmer.stem(token.text).strip()
            clean_tokens.append(temp.lower())

    return clean_tokens
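
Both cleaning functions share the same filter and differ only in how a surviving token is normalized. The filter can be sketched without loading spaCy by using stand-in tokens; `Tok`, `keep`, and the tiny stop-word set are hypothetical names and values for illustration, and the POS tags are assigned by hand rather than produced by a model:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Stand-in for a spaCy token (hypothetical, for illustration only)."""
    text: str
    pos_: str

excluded_pos = ['SCONJ', 'CCONJ', 'NUM', 'PUNCT', 'PRON', 'DET', 'ADP', 'AUX', 'X']
stop_words = {'this', 'is', 'very', 'so'}  # tiny made-up subset

def keep(tok):
    # Same three conditions as in the notebook's cleaning functions
    return (tok.pos_ not in excluded_pos
            and tok.text not in stop_words
            and len(tok.text) > 2)

tokens = [Tok('This', 'PRON'), Tok('shirt', 'NOUN'), Tok('is', 'AUX'),
          Tok('very', 'ADV'), Tok('flattering', 'ADJ'), Tok('!', 'PUNCT')]

kept = [t.text.lower() for t in tokens if keep(t)]
print(kept)  # ['shirt', 'flattering']
```

The stop-word comparison is case-sensitive, so a capitalized token only matches if the stop-word collection contains that exact capitalized form.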

#### Lemmatization (approx. 15 min)

In [34]:
# Clean all the reviews with the lemmatizer
reviews_lemma = []
for i in df.index:
    rev = text_data_lemma(df.review_title.iloc[i] + ' ' + df.review_body.iloc[i])
    reviews_lemma.append(" ".join(rev))
reviews_lemma[:5]

['absolutely wonderful silky sexy comfortable',
 'love dress sooo pretty happen find store glad order online petite buy petite love length me- hit little knee definitely true midi truly petite',
 'major design flaw high hope dress want work initially order petite small usual size find outrageously small small fact zip reorder petite medium overall half comfortable fit nicely half tight layer somewhat cheap net layer imo major design flaw net layer sew directly zipper',
 'favorite buy love love love jumpsuit fun flirty fabulous time wear great compliment',
 'flattering shirt shirt flattering adjustable tie perfect length wear legging sleeveless pair cardigan love shirt']

#### Stemming (approx. 40 min)

In [None]:
# Clean all the reviews with the stemmer
reviews_stem = []
for i in df.index:
    rev = text_data_stem(df.review_title.iloc[i] + ' ' + df.review_body.iloc[i])
    reviews_stem.append(" ".join(rev))
reviews_stem[:5]

### Save the new data to a .csv file

In [None]:
# Add the new columns to the dataset
df['revs_lemma'] = reviews_lemma
df['revs_stem'] = reviews_stem
df.head()

In [None]:
# Save the cleaned dataset, including the lemmatized and stemmed columns
df.to_csv(path_or_buf='data/dataset_clothes_clean.csv')
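
Because `to_csv` is called without `index=False`, the row index is written as an unnamed first column. A sketch of reading such a file back, using an in-memory buffer and made-up rows instead of the real file:

```python
import io
import pandas as pd

# Toy rows (made up for illustration)
toy = pd.DataFrame({'rating': [4, 5], 'revs_lemma': ['wonderful silky', 'love dress']})

buffer = io.StringIO()
toy.to_csv(path_or_buf=buffer)  # the index is written as the first, unnamed column
buffer.seek(0)

# Without index_col, the saved index comes back as a column named 'Unnamed: 0'
reloaded_raw = pd.read_csv(buffer)
buffer.seek(0)
# With index_col=0, it is restored as the index instead
reloaded = pd.read_csv(buffer, index_col=0)

print(list(reloaded_raw.columns))
print(list(reloaded.columns))
```

Whichever notebook consumes the saved file can pass `index_col=0` to avoid carrying the extra column.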