<h1><center>Assignment 3: Text Classification with RNN</center></h1>

<center><strong><font size="5">CC6204 - Deep Learning</font></strong></center>


---
### Teaching Staff:

- Professor: Iván Sipiran
- Teaching assistants:
    - Camila Figueroa Acevedo
    - Gustavo Santelices
    - Sofia Capibara Chávez Bastidas
    - Victor Faraggi V.
### Student:
- Maximiliano Varas González (maximilianovarasg@gmail.com)
---

In this assignment you will build a neural network that classifies messages as spam or not spam. The first step is to download the data:

In [212]:
!mkdir data
!python -m wget https://www.ivan-sipiran.com/downloads/spam.csv
!move *.csv data/

A subdirectory or file data already exists.

Saved under spam.csv
c:\Users\mvarasg\Documents\Universidad\CC6204 - Deep Learning\Tareas\T3\spam.csv
        1 file(s) moved.
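The cell above relies on the Windows-only `move` command. As a hedged alternative, the same download can be sketched with the Python standard library so that it runs on any operating system. The helper names `target_path` and `download` are hypothetical and not part of the original notebook; only the URL comes from the cell above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

DATA_URL = "https://www.ivan-sipiran.com/downloads/spam.csv"  # URL taken from the cell above


def target_path(url: str, dest_dir: str = "data") -> Path:
    """Return the local path where the file behind `url` would be stored."""
    filename = Path(urlparse(url).path).name
    return Path(dest_dir) / filename


def download(url: str, dest_dir: str = "data") -> Path:
    """Download `url` into `dest_dir`, creating the folder if it is missing."""
    dest = target_path(url, dest_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():  # skip the download when the file is already there
        urlretrieve(url, str(dest))
    return dest


# Usage (needs network access):
# csv_path = download(DATA_URL)
```

Because the folder is created with `exist_ok=True` and an existing file is not downloaded again, re-running the cell should not produce the "already exists" message seen above.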


The data comes in a CSV file with two columns, "text" and "label". The "text" column holds the message text and the "label" column holds the labels "ham" and "spam". A "ham" message is one that is not considered spam.

# Assignment
The goal of the assignment is to build a neural network that classifies the provided data. To achieve this you must:



*   Implement whatever data preprocessing you consider necessary.
*   Split the data into 70% training, 10% validation, and 20% test.
*   Use the training and validation data for your experiments, and use the test set only to report the final result.

To design the network you may use a recurrent neural network or a transformer-based network. The goal of the assignment is not to achieve the highest possible performance, but to understand which design decisions affect the solution to a problem like this one. What is required (as always) is that you discuss your results and the decisions you made.
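The 70/10/20 partition requested above can be sketched with a shuffled list of row positions. This is only a minimal illustration under stated assumptions: the function name `split_indices`, the fixed seed, and the lack of stratification by label are choices made here, not something the assignment prescribes.

```python
import random


def split_indices(n_rows, train=0.7, val=0.1, seed=0):
    """Shuffle range(n_rows) and cut it into train/validation/test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG so the split is reproducible
    n_train = round(n_rows * train)
    n_val = round(n_rows * val)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]  # whatever remains (about 20%)
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a pandas DataFrame, each list could then be passed to `DataFrame.iloc` to obtain the three subsets. Since spam datasets are usually imbalanced, a stratified split may be worth considering instead.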



---
# Data preprocessing

The first step is to load the data into a DataFrame:

In [213]:
import pandas as pd
import numpy as np
data = pd.read_csv('./data/spam.csv')
print(data.shape)
data.head(5)

(5572, 2)


Unnamed: 0,text,label
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


Let's check whether either column contains null values:

In [214]:
data.isna().sum()

text     1
label    1
dtype: int64

Indeed, there is one null entry in the `text` column and one in the `label` column.

In [215]:
data[data.isna().any(axis=1)]

Unnamed: 0,text,label
2115,Well I wasn't available as I washob nobbing wi...,
3035,,-) ok. I feel like john lennon.


The NaN values are in different rows, so both rows are dropped.

In [216]:
data = data.dropna()

Next, all text is converted to lowercase and punctuation is removed.

In [217]:
import re
from string import punctuation

print(punctuation)
data['text'] = data['text'].str.lower()
# re.escape keeps characters such as \ and ] literal inside the character class.
data['text'] = data['text'].str.replace(f'[{re.escape(punctuation)}]', '', regex=True)
data = data[data['text'] != ' ']  # Drop rows left with only a single space after cleaning.
print(f'After dropping NaN and empty rows, the shape of data is: {data.shape}.')
data.head(5)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
After dropping NaN and empty rows, the shape of data is: (5568, 2).


Unnamed: 0,text,label
0,go until jurong point crazy available only in ...,ham
1,ok lar joking wif u oni,ham
2,free entry in 2 a wkly comp to win fa cup fina...,spam
3,u dun say so early hor u c already then say,ham
4,nah i dont think he goes to usf he lives aroun...,ham
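One subtlety when building a character class from `string.punctuation`: the set includes `\`, `]`, `^` and `-`, all of which have a special meaning inside `[...]`. The sketch below shows the cleaning step on plain strings, using `re.escape` so every punctuation character is treated literally. The helper name `clean_text` and the extra whitespace collapsing are assumptions added for illustration; they are not part of the notebook.

```python
import re
from string import punctuation

# re.escape protects characters such as \ ] ^ - that are special inside [...]
PUNCT_RE = re.compile(f"[{re.escape(punctuation)}]")


def clean_text(text: str) -> str:
    """Lowercase, strip ASCII punctuation and collapse runs of whitespace."""
    text = PUNCT_RE.sub("", text.lower())
    return " ".join(text.split())  # assumption: also normalise whitespace


print(clean_text("Ok lar... Joking wif u oni..."))
```

With this variant, a message made only of punctuation becomes the empty string, so it could be filtered with a simple `!= ''` comparison.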


All the words in the `text` column are collected into a single list.

In [218]:
lista_words = data['text'].tolist()
lista_words = [word for instancia in lista_words for word in instancia.split()]
print(f'There are {len(lista_words)} words in total.')

There are 83631 words in total.


In [219]:
from collections import Counter

counts = Counter(lista_words)  # Maps each word to its frequency.
vocab = sorted(counts, key=counts.get, reverse=True)  # Words sorted from most to least frequent.
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}  # Maps each word to its frequency rank (1 = most frequent).
print('Unique words:', len(vocab_to_int))

Unique words: 9524


In [220]:
data['text_num'] = data['text'].apply(lambda x: [vocab_to_int[word] for word in x.split()])  # New column with the integer ids of the words in each message.
data.head(5)

Unnamed: 0,text,label,text_num
0,go until jurong point crazy available only in ...,ham,"[46, 446, 4346, 783, 744, 702, 64, 8, 1224, 88..."
1,ok lar joking wif u oni,ham,"[48, 306, 1381, 429, 6, 1813]"
2,free entry in 2 a wkly comp to win fa cup fina...,spam,"[47, 447, 8, 22, 4, 745, 887, 1, 177, 1814, 11..."
3,u dun say so early hor u c already then say,ham,"[6, 225, 146, 24, 351, 2873, 6, 160, 143, 58, ..."
4,nah i dont think he goes to usf he lives aroun...,ham,"[942, 2, 50, 97, 71, 448, 1, 943, 71, 1817, 20..."
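To make the word-to-integer encoding concrete, here is a small sketch on a toy corpus. It follows the same idea as the cells above (ids are frequency ranks starting at 1, which leaves 0 free for padding). The functions `build_vocab` and `encode`, and the extra id used for words missing from the vocabulary, are hypothetical additions; the notebook itself does not define an out-of-vocabulary id.

```python
from collections import Counter


def build_vocab(texts):
    """Map each word to its frequency rank, starting at 1 for the most frequent."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {word: rank for rank, word in enumerate(ordered, 1)}


def encode(text, vocab):
    """Turn a text into a list of integer ids; unseen words share one extra id."""
    unk_idx = len(vocab) + 1  # hypothetical out-of-vocabulary id
    return [vocab.get(word, unk_idx) for word in text.split()]


toy_texts = ["free entry free", "ok free"]
vocab = build_vocab(toy_texts)
print(vocab)
print(encode("free prize ok", vocab))
```

An explicit unknown-word id matters mostly when the vocabulary is built from the training split only, since validation and test messages can then contain unseen words.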


Next, the labels are encoded numerically in a new column, `label_encoded`, with spam mapped to 1 and ham mapped to 0.

In [221]:
data['label_encoded'] = data['label'].map({'spam':1, 'ham':0})
data

Unnamed: 0,text,label,text_num,label_encoded
0,go until jurong point crazy available only in ...,ham,"[46, 446, 4346, 783, 744, 702, 64, 8, 1224, 88...",0.0
1,ok lar joking wif u oni,ham,"[48, 306, 1381, 429, 6, 1813]",0.0
2,free entry in 2 a wkly comp to win fa cup fina...,spam,"[47, 447, 8, 22, 4, 745, 887, 1, 177, 1814, 11...",1.0
3,u dun say so early hor u c already then say,ham,"[6, 225, 146, 24, 351, 2873, 6, 160, 143, 58, ...",0.0
4,nah i dont think he goes to usf he lives aroun...,ham,"[942, 2, 50, 97, 71, 448, 1, 943, 71, 1817, 20...",0.0
...,...,...,...,...
5567,this is the 2nd time we have tried 2 contact u...,spam,"[40, 9, 5, 391, 63, 38, 17, 521, 22, 188, 6, 6...",1.0
5568,will ì b going to esplanade fr home,ham,"[35, 110, 194, 73, 1, 1911, 857, 80]",0.0
5569,pity was in mood for that soany other suggest...,ham,"[9521, 60, 8, 1220, 12, 19, 9522, 231, 9523]",0.0
5570,the guy did some bitching but i acted like id ...,ham,"[5, 517, 108, 109, 9524, 25, 2, 4287, 55, 389,...",0.0
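Note that `Series.map` with a dictionary yields NaN for any value that is not a key of the dictionary, and a column containing NaN is stored as float. This is one possible reason for the labels being displayed as 0.0 and 1.0 above. A minimal sketch of a stricter encoding, which fails loudly instead of silently producing NaN, could look like this (the name `encode_labels` is hypothetical):

```python
LABEL_TO_INT = {"ham": 0, "spam": 1}


def encode_labels(labels):
    """Encode labels as integers, raising an error on anything unexpected."""
    unknown = sorted(set(labels) - set(LABEL_TO_INT))
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [LABEL_TO_INT[label] for label in labels]


print(encode_labels(["ham", "spam", "ham"]))
```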


### Padding

Summary statistics of the text length are computed:

In [222]:
from statistics import median, mode
data['largo_text'] = data['text_num'].apply(len)
print("Mean sequence length:", data['largo_text'].mean())
print("Median sequence length:", median(data['largo_text']))
print("Mode of sequence length:", mode(data['largo_text']))
print('Max:', max(data['largo_text']))
print('Min:', min(data['largo_text']))


import plotly.express as px

fig = px.histogram(data, x='largo_text', nbins=200, title='Text length distribution', marginal='box')

fig.show()

Mean sequence length: 15.019935344827585
Median sequence length: 12.0
Mode of sequence length: 6
Max: 171
Min: 1


The following strategy is used:
- A fixed length is chosen for the text sequences: `largo_seq`.
- If a sequence is longer than `largo_seq`, it is truncated and only the first `largo_seq` words are kept.
- If a sequence is shorter than `largo_seq`, it is left-padded with 0s, so the words of the sequence stay at the end and the result has length `largo_seq`.
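As a worked example of these rules on plain Python lists, consider the sketch below. The helper `pad_sequence` is a hypothetical name used only for illustration.

```python
def pad_sequence(seq, largo_seq, pad_value=0):
    """Left-pad with `pad_value` or truncate so the result has `largo_seq` items."""
    if len(seq) >= largo_seq:
        return list(seq[:largo_seq])  # keep the first `largo_seq` tokens
    return [pad_value] * (largo_seq - len(seq)) + list(seq)


print(pad_sequence([48, 306, 1381], 5))     # short sequence: zeros go in front
print(pad_sequence([1, 2, 3, 4, 5, 6], 4))  # long sequence: the tail is dropped
```

Padding on the left keeps the real words next to the final time step, which is commonly preferred when a recurrent network's last hidden state is used for classification.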

In [223]:
data

Unnamed: 0,text,label,text_num,label_encoded,largo_text
0,go until jurong point crazy available only in ...,ham,"[46, 446, 4346, 783, 744, 702, 64, 8, 1224, 88...",0.0,20
1,ok lar joking wif u oni,ham,"[48, 306, 1381, 429, 6, 1813]",0.0,6
2,free entry in 2 a wkly comp to win fa cup fina...,spam,"[47, 447, 8, 22, 4, 745, 887, 1, 177, 1814, 11...",1.0,28
3,u dun say so early hor u c already then say,ham,"[6, 225, 146, 24, 351, 2873, 6, 160, 143, 58, ...",0.0,11
4,nah i dont think he goes to usf he lives aroun...,ham,"[942, 2, 50, 97, 71, 448, 1, 943, 71, 1817, 20...",0.0,13
...,...,...,...,...,...
5567,this is the 2nd time we have tried 2 contact u...,spam,"[40, 9, 5, 391, 63, 38, 17, 521, 22, 188, 6, 6...",1.0,30
5568,will ì b going to esplanade fr home,ham,"[35, 110, 194, 73, 1, 1911, 857, 80]",0.0,8
5569,pity was in mood for that soany other suggest...,ham,"[9521, 60, 8, 1220, 12, 19, 9522, 231, 9523]",0.0,9
5570,the guy did some bitching but i acted like id ...,ham,"[5, 517, 108, 109, 9524, 25, 2, 4287, 55, 389,...",0.0,26


In [229]:
def padding_seq(df, largo_seq):
    """Return a copy of df with a 'text_num_padded' column of fixed-length arrays."""
    df = df.copy()
    def padd_row(row, largo_seq):
        if row['largo_text'] > largo_seq:  # Truncate: keep the first largo_seq tokens
            return np.array(row['text_num'][:largo_seq])
        elif row['largo_text'] < largo_seq:  # Left-pad with zeros
            return np.array([0] * (largo_seq - row['largo_text']) + row['text_num'])
        else:  # Already the right length
            return np.array(row['text_num'])
    df['text_num_padded'] = df.apply(padd_row, args=(largo_seq,), axis=1)
    return df
padding_seq(data, 44)

    

Unnamed: 0,text,label,text_num,label_encoded,largo_text,text_num_padded
0,go until jurong point crazy available only in ...,ham,"[46, 446, 4346, 783, 744, 702, 64, 8, 1224, 88...",0.0,20,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,ok lar joking wif u oni,ham,"[48, 306, 1381, 429, 6, 1813]",0.0,6,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,free entry in 2 a wkly comp to win fa cup fina...,spam,"[47, 447, 8, 22, 4, 745, 887, 1, 177, 1814, 11...",1.0,28,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,u dun say so early hor u c already then say,ham,"[6, 225, 146, 24, 351, 2873, 6, 160, 143, 58, ...",0.0,11,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,nah i dont think he goes to usf he lives aroun...,ham,"[942, 2, 50, 97, 71, 448, 1, 943, 71, 1817, 20...",0.0,13,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...,...,...,...
5567,this is the 2nd time we have tried 2 contact u...,spam,"[40, 9, 5, 391, 63, 38, 17, 521, 22, 188, 6, 6...",1.0,30,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 40,..."
5568,will ì b going to esplanade fr home,ham,"[35, 110, 194, 73, 1, 1911, 857, 80]",0.0,8,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
5569,pity was in mood for that soany other suggest...,ham,"[9521, 60, 8, 1220, 12, 19, 9522, 231, 9523]",0.0,9,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
5570,the guy did some bitching but i acted like id ...,ham,"[5, 517, 108, 109, 9524, 25, 2, 4287, 55, 389,...",0.0,26,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
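The value 44 passed to `padding_seq` is not justified in the text. One hedged way to compare candidate lengths is to measure what fraction of messages fit without truncation. The function `coverage` and the toy list of lengths below are illustrative assumptions, not results from the dataset.

```python
def coverage(lengths, largo_seq):
    """Fraction of sequences with at most `largo_seq` tokens (no truncation needed)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= largo_seq) / len(lengths)


toy_lengths = [6, 8, 9, 11, 13, 20, 26, 28, 30, 171]  # illustrative sample only
for candidate in (12, 30, 44):
    print(candidate, coverage(toy_lengths, candidate))
```

On the real data, the same function could be applied to `data['largo_text'].tolist()` to see how many messages a given `largo_seq` leaves intact.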


