## NLP Practical Assignment - Luis Martín Vegas ##

For this assignment we chose to run a sentiment analysis on the reviews in the Amazon Cell Phones and Accessories dataset.

First, we load the required libraries.

In [17]:
# !pip install num2words
import nltk
nltk.download('omw-1.4')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [10]:
import pandas as pd
import os
import unicodedata
from num2words import num2words
import nltk
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import random

We download the dataset, then decompress and read it.

In [11]:
! wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

--2022-07-03 11:08:57--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45409631 (43M) [application/x-gzip]
Saving to: ‘reviews_Cell_Phones_and_Accessories_5.json.gz.1’


2022-07-03 11:09:03 (7.69 MB/s) - ‘reviews_Cell_Phones_and_Accessories_5.json.gz.1’ saved [45409631/45409631]



In [12]:
import pandas as pd
data = pd.read_json("reviews_Cell_Phones_and_Accessories_5.json.gz", lines = True, compression = "gzip")

In [13]:
reviews = data.get("reviewText")
reviews.head()

0    They look good and stick good! I just don't li...
1    These stickers work like the review says they ...
2    These are awesome and make my phone look so st...
3    Item arrived in great time and was in perfect ...
4    awesome! stays on, and looks great. can be use...
Name: reviewText, dtype: object
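
The downloaded file is gzip-compressed JSON Lines (one JSON object per line), which is what `lines = True` and `compression = "gzip"` tell pandas. As a rough illustration of that format, the standard-library sketch below reads such a file record by record. The helper name `read_json_lines_gz` and the tiny demo file are hypothetical and not part of the notebook.

```python
import gzip
import json
import os
import tempfile


def read_json_lines_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a tiny made-up file, not the real dataset.
demo_records = [
    {"reviewText": "Great case", "overall": 5},
    {"reviewText": "Stopped working", "overall": 1},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        for record in demo_records:
            handle.write(json.dumps(record) + "\n")
    # Consume the generator while the temporary file still exists.
    loaded = list(read_json_lines_gz(demo_path))

print(loaded[0]["reviewText"])
```

Reading line by line like this keeps memory use low, whereas `pd.read_json` loads everything into a DataFrame at once, which is convenient for a file of this size.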

### Text preprocessing ###

We define a dictionary of English contractions, which will be useful for processing the review text.

In [14]:
contractions = { 
"ain't": "am not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have", "'cause": "because", "could've": "could have",
"couldn't": "could not", "couldn't've": "could not have", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hadn't've": "had not have", "hasn't": "has not",
"haven't": "have not", "he'd": "he would", "he'd've": "he would have", "he'll": "he will", "he's": "he is", "how'd": "how did", "how'll": "how will", "how's": "how is", "i'd": "i would",
"i'll": "i will", "i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'll": "it will", "it's": "it is", "let's": "let us", "ma'am": "madam",
"mayn't": "may not", "might've": "might have", "mightn't": "might not", "must've": "must have", "mustn't": "must not", "needn't": "need not", "oughtn't": "ought not",
"shan't": "shall not", "sha'n't": "shall not", "she'd": "she would", "she'll": "she will", "she's": "she is", "should've": "should have", "shouldn't": "should not",
"that'd": "that would", "that's": "that is", "there'd": "there had", "there's": "there is", "they'd": "they would", "they'll": "they will", "they're": "they are",
"they've": "they have", "wasn't": "was not", "we'd": "we would", "we'll": "we will", "we're": "we are", "we've": "we have",
"weren't": "were not", "what'll": "what will", "what're": "what are", "what's": "what is", "what've": "what have",
"where'd": "where did", "where's": "where is", "who'll": "who will", "who's": "who is", "won't": "will not",
"wouldn't": "would not", "you'd": "you would", "you'll": "you will", "you're": "you are"
}
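
A dictionary lookup of this kind only matches tokens that are exactly equal to a key, so a token with attached punctuation such as `don't,` would be missed. The sketch below shows one possible way to handle that by stripping punctuation from the edges of each token before the lookup. The helper `expand_contractions` and the three-entry `demo_contractions` dictionary are illustrative assumptions, not the notebook's code.

```python
import string

# Small hypothetical subset of a contractions dictionary, for illustration only.
demo_contractions = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def expand_contractions(text, mapping):
    """Lowercase the text and expand contractions, tolerating edge punctuation."""
    expanded = []
    for token in text.lower().split():
        # Try the raw token first, then the token without edge punctuation.
        key = token if token in mapping else token.strip(string.punctuation)
        if key and key in mapping:
            # Replace only the contraction so surrounding punctuation is kept.
            expanded.append(token.replace(key, mapping[key], 1))
        else:
            expanded.append(token)
    return " ".join(expanded)


print(expand_contractions("I don't like it, but it's cheap!", demo_contractions))
```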

We define a function to clean our text:

In [15]:
def nltk_cleaner(text):
    clean_text = []
    tokenizer = RegexpTokenizer(r'\w+')
    # A set makes the stop-word membership test fast
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Lowercase and split on whitespace
    words = text.lower().split()

    # Expand contractions (only tokens that match a dictionary key exactly;
    # a token with attached punctuation such as "don't," is left as-is)
    expanded = [contractions.get(word, word) for word in words]
    text = " ".join(expanded)

    # Strip accents and any other non-ASCII characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

    # Split into words, dropping punctuation
    for word in tokenizer.tokenize(text):

        # Skip stop words
        if word not in stop_words:

            # Lemmatize and strip stray whitespace
            clean_word = lemmatizer.lemmatize(word).strip()

            # Spell out digits as words
            if clean_word.isdigit():
                clean_word = num2words(int(clean_word), lang='en')

            clean_text.append(clean_word)

    return ' '.join(clean_text)
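
To make two of the steps above concrete, the following standard-library sketch applies the same NFKD normalization and a `\w+` regular-expression split to a made-up sentence. It uses `re.findall` as a stand-in for NLTK's `RegexpTokenizer`, on the assumption that both behave alike for this pattern. The helper names `strip_accents` and `split_words` are hypothetical.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose accented characters (NFKD) and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def split_words(text):
    """Return runs of word characters, discarding punctuation and whitespace."""
    return re.findall(r"\w+", text)


sample = "The café's décor is naïve, but I'd buy it again!"
tokens = split_words(strip_accents(sample.lower()))
print(tokens)
```

In this sketch an unexpanded contraction such as `i'd` is split into `i` and `d`, which illustrates why contractions are expanded before the punctuation is removed.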

We apply the text-cleaning function to our reviews.

In [18]:
nltk.download('stopwords')
nltk.download('wordnet')
clean_texts = []
for review in reviews:
    clean_texts.append(nltk_cleaner(review))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We check that the cleaning worked by inspecting a random sample of the dataset. We then save the dataset to Google Drive for later use, keeping only the processed reviews and discarding the originals.

In [19]:
random.sample(clean_texts,5)

['like case much price point one edge pushing screen protector little bit causing air bubble probably installer issue',
 'case fantastic durable feel really sturdy bought case wouldnt bring purse concert perfect hold license credit card two fifty bill ever need show love case phone still fit jean short pocket even fall really well protected totally worth seven paid prettiest case probably best quality definitely recommend',
 'work perfect galaxy note three solid desk much cheaper option oem samsung version job',
 'current favorite case provide protection lay table build quality feel nice two piece sell different colored back plate bottom port fit headphone tried standard apple lightning charger look like could fit brand lightning plug complaint case show scratch case phone',
 'look performs however would rather extended battery understand would longer fit holster occasionally change battery price excellent alternative buying oem battery']
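
Because `random.sample` returns a different selection on every run, a seeded generator can make this spot check repeatable. This is a minimal sketch with a made-up list standing in for `clean_texts`; the seed value 42 is arbitrary.

```python
import random

# Hypothetical stand-in for the list of cleaned reviews.
demo_texts = [
    "great case",
    "battery died fast",
    "fit perfectly",
    "screen cracked",
    "good value",
]

# A dedicated Random instance keeps the seed from affecting other code.
rng = random.Random(42)
first_pick = rng.sample(demo_texts, 2)

# Re-creating the generator with the same seed reproduces the same sample.
second_pick = random.Random(42).sample(demo_texts, 2)
print(first_pick == second_pick)
```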

In [20]:
data["cleaned_review"] = clean_texts
data.drop(["reviewText"], axis = "columns", inplace=True)
data.head()

|   | reviewerID | asin | reviewerName | helpful | overall | summary | unixReviewTime | reviewTime | cleaned_review |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | 4 | Looks Good | 1400630400 | 05 21, 2014 | look good stick good like rounded shape always... |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | 5 | Really great product. | 1389657600 | 01 14, 2014 | sticker work like review say stick great stay ... |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 | awesome make phone look stylish used one far a... |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | 4 | Cute! | 1382313600 | 10 21, 2013 | item arrived great time perfect condition howe... |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 | awesome stay look great used multiple apple pr... |


In [22]:
from google.colab import drive
drive.mount('/content/drive')

clean_data = data.dropna(subset=["cleaned_review", "overall"])[["cleaned_review", "overall"]]
clean_data.to_csv('drive/MyDrive/Datasets/cleaned_cellphones.csv', index=False) 

Mounted at /content/drive
