# **Use case: How to classify messages using neural networks**


1.   Data preprocessing (how can I represent the training data?)
2.   Evaluate the activation function used to predict yes/no (binary classification)
3.   Prepare a neural network for the task



## **Explore the DATASET**

We will use a TSV (tab-separated values) file to train the model.
The file contains a series of messages, each labeled as spam or not spam ("ham").

In [1]:
import pandas as pd

df = pd.read_csv("/content/drive/My Drive/spam/data/SMSSpamCollection", sep="\t", names=["type", "message"])
df.head()

Unnamed: 0,type,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df.shape

(5572, 2)

In [None]:
df.iloc[0]["message"]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
df["spam"] = df["type"] == "spam"

df.head()

Unnamed: 0,type,message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",False
1,ham,Ok lar... Joking wif u oni...,False
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,True
3,ham,U dun say so early hor... U c already then say...,False
4,ham,"Nah I don't think he goes to usf, he lives aro...",False


In [None]:
df.drop("type", axis=1)  # returns a new DataFrame; df itself keeps the "type" column

Unnamed: 0,message,spam
0,"Go until jurong point, crazy.. Available only ...",False
1,Ok lar... Joking wif u oni...,False
2,Free entry in 2 a wkly comp to win FA Cup fina...,True
3,U dun say so early hor... U c already then say...,False
4,"Nah I don't think he goes to usf, he lives aro...",False
...,...,...
5567,This is the 2nd time we have tried 2 contact u...,True
5568,Will ü b going to esplanade fr home?,False
5569,"Pity, * was in mood for that. So...any other s...",False
5570,The guy did some bitching but I acted like i'd...,False


In [None]:
print("SPAM message:")
print(len(df[df["spam"] == True]))
print("NORMAL message:")
print(len(df[df["spam"] == False]))

SPAM message:
747
NORMAL message:
4825
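
The two counts above show that the classes are imbalanced. As a rough sketch that uses only those printed counts (the variable names are illustrative and not part of the notebook), the share of spam and the accuracy of a hypothetical classifier that always answers "not spam" can be worked out like this:

```python
# Counts taken from the output above
n_spam = 747
n_ham = 4825
n_total = n_spam + n_ham  # matches the 5572 rows reported by df.shape

# Fraction of messages labeled as spam
spam_ratio = n_spam / n_total

# Accuracy of a hypothetical baseline that labels every message as ham
baseline_accuracy = n_ham / n_total

print(f"spam ratio: {spam_ratio:.4f}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy figure reported later is easier to judge against this baseline than against 50%.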


## **Data preprocessing**
We will use the sklearn library to count how often each word occurs.
First, let's look at an example:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
documents = [
    "Hello, hello, hello.",
    "Hello world. Today is cloudy.",
    "Hello mars. Today is sunny."
]

In [None]:
cv = CountVectorizer(max_features=10)
cv.fit(documents)

In [None]:
print(cv.get_feature_names_out())

['cloudy' 'hello' 'is' 'mars' 'sunny' 'today' 'world']


In [None]:
out = cv.fit_transform(documents)
print(type(out))
print(out)
print(out.todense())

<class 'scipy.sparse._csr.csr_matrix'>
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 11 stored elements and shape (3, 7)>
  Coords	Values
  (0, 1)	3
  (1, 1)	1
  (1, 6)	1
  (1, 5)	1
  (1, 2)	1
  (1, 0)	1
  (2, 1)	1
  (2, 5)	1
  (2, 2)	1
  (2, 3)	1
  (2, 4)	1
[[0 3 0 0 0 0 0]
 [1 1 1 0 0 1 1]
 [0 1 1 1 1 1 0]]


## Use Case SPAM:
Now, instead of using a list of sentences, let's work with the DataFrame...

In [None]:
vectorizer = CountVectorizer(max_features=5000)
messages = vectorizer.fit_transform(df['message'])  # "fit_transform" learns the vocabulary; "transform" alone does not learn new vocabulary

print(messages[0, :])
#print(vectorizer.get_feature_names_out()[1758])
#print(vectorizer.vocabulary_)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 17 stored elements and shape (1, 5000)>
  Coords	Values
  (0, 1758)	1
  (0, 4661)	1
  (0, 2531)	1
  (0, 3609)	1
  (0, 1140)	1
  (0, 619)	1
  (0, 3437)	1
  (0, 2271)	1
  (0, 843)	1
  (0, 1812)	1
  (0, 4910)	1
  (0, 2658)	1
  (0, 842)	1
  (0, 996)	1
  (0, 4456)	1
  (0, 1782)	1
  (0, 4787)	1


In [None]:
import torch
from torch import nn

In [None]:
df["spam"]

Unnamed: 0,spam
0,False
1,False
2,True
3,False
4,False
...,...
5567,True
5568,False
5569,False
5570,False


In [None]:
X = torch.tensor(messages.todense(), dtype=torch.float32)
y = torch.tensor(df["spam"], dtype=torch.float32)
print(X.shape)
print(y.shape)
y = torch.tensor(df["spam"], dtype=torch.float32).reshape((-1, 1))  # -1 infers the number of rows automatically; 1 adds an explicit single-column dimension
print(y.shape)

torch.Size([5572, 5000])
torch.Size([5572])
torch.Size([5572, 1])


Defining the model:

In [None]:
model = nn.Linear(5000, 1)
loss_fn = torch.nn.BCEWithLogitsLoss() #torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)  # 0.02 is the best value
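
`BCEWithLogitsLoss` expects raw model outputs (logits), not probabilities, which is why the linear layer above has no activation. The sketch below uses made-up logits and labels to check that it matches applying a sigmoid first and then `BCELoss`.

```python
import torch
from torch import nn

# Made-up logits and labels, for illustration only
logits = torch.tensor([[2.0], [-1.0], [0.5]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

# Single step: the sigmoid is applied inside the loss
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Two steps: explicit sigmoid, then plain binary cross-entropy
loss_two_steps = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_with_logits, loss_two_steps)
```

Both values agree up to floating-point error; the fused version is generally the more numerically stable choice.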


Training the neuron

In [None]:
for i in range(0, 10000):  # 10000 is the best setting
    optimizer.zero_grad()
    outputs = model(X)
    loss = loss_fn(outputs, y)
    loss.backward()
    optimizer.step()

    if i % 1000 == 0:
        print(loss)
        #print(model.weight)
        #print(model.bias)

tensor(0.2050, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0744, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0564, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0472, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0411, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0367, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0332, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0304, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0281, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
tensor(0.0261, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


There is clearly a problem with the raw output: if I want the probability that a message is SPAM, the values cannot be negative or fall outside the [0, 1] range. Applying a sigmoid to the model output maps it into that range.

In [None]:
model.eval()
with torch.no_grad():
    # Apply a sigmoid so the raw outputs become probabilities in (0, 1)
    y_pred = nn.functional.sigmoid(model(X))
    print(y_pred)
    print(y_pred.min())
    print(y_pred.max())

tensor([[0.0011],
        [0.0033],
        [0.9991],
        ...,
        [0.0038],
        [0.0178],
        [0.0066]])
tensor(1.0710e-23)
tensor(1.0000)
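
To see why the sigmoid fixes the range problem, it can be written out by hand. The sketch below uses made-up logits and compares the formula 1 / (1 + exp(-z)) with `torch.sigmoid`.

```python
import torch

# Made-up raw model outputs (logits); they can be negative or larger than 1
logits = torch.tensor([-6.8, -0.5, 0.0, 7.0])

# Sigmoid written out explicitly: 1 / (1 + exp(-z))
manual = 1.0 / (1.0 + torch.exp(-logits))

# Built-in version, for comparison
builtin = torch.sigmoid(logits)

print(manual)
print(builtin)
```

A logit of 0 maps to exactly 0.5, large negative logits approach 0, and large positive logits approach 1.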


In [None]:
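def evaluate_model(X, y):
    """Print accuracy, sensitivity, specificity and precision of the module-level model."""
    model.eval()
    with torch.no_grad():
        # Flag a message as spam when its predicted probability exceeds 0.25
        y_pred = nn.functional.sigmoid(model(X)) > 0.25

        # Share of all messages that were classified correctly
        print("accuracy:", (y_pred == y)\
            .type(torch.float32).mean())

        # Share of actual spam that was detected
        print("sensitivity:", (y_pred[y == 1] == y[y == 1])\
            .type(torch.float32).mean())

        # Share of actual ham that was left alone
        print("specificity:", (y_pred[y == 0] == y[y == 0])\
            .type(torch.float32).mean())

        # Share of flagged messages that really are spam
        print("precision:", (y_pred[y_pred == 1] == y[y_pred == 1])\
            .type(torch.float32).mean())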

In [None]:
print("Evaluating on the training data")
evaluate_model(X, y)

Evaluating on the training data
accuracy: tensor(0.9955)
sensitivity: tensor(0.9813)
specificity: tensor(0.9977)
precision: tensor(0.9852)
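
The four metrics can also be read as ratios of confusion-matrix counts. The sketch below uses a small set of made-up labels and predictions (not the notebook's data) to make those ratios explicit.

```python
import torch

# Made-up labels and predictions, for illustration only (True = spam)
y_true = torch.tensor([1, 1, 1, 0, 0, 0, 0, 0], dtype=torch.bool)
y_hat = torch.tensor([1, 1, 0, 0, 0, 0, 1, 0], dtype=torch.bool)

tp = (y_hat & y_true).sum().item()    # spam flagged as spam
tn = (~y_hat & ~y_true).sum().item()  # ham left as ham
fp = (y_hat & ~y_true).sum().item()   # ham wrongly flagged as spam
fn = (~y_hat & y_true).sum().item()   # spam that was missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)

print(accuracy, sensitivity, specificity, precision)
```

Lowering the decision threshold usually raises sensitivity at the cost of specificity and precision, which is the trade-off behind a cutoff such as 0.25.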


In [None]:
new_messages = [
    "We have released a new feature for your product and you have been selected to try it!",
    "We have released a new product to improve your sales, do you want to try it",
    "Winner! Great deal, call us to get this product for free",
    "Tomorrow is my birthday, do you come to the party?"
]

custom_messages = vectorizer.transform(new_messages)

X_custom = torch.tensor(custom_messages.todense(), dtype=torch.float32)

model.eval()
with torch.no_grad():
  pred = nn.functional.sigmoid(model(X_custom))
  print(pred)


tensor([[0.7091],
        [0.1929],
        [0.4226],
        [0.0014]])
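
The printed values are probabilities, so a final label still requires a cutoff. As a sketch, the snippet below copies the four probabilities from the output above and applies the 0.25 threshold used in `evaluate_model`; the `labels` list is illustrative.

```python
import torch

# Probabilities copied from the output above
pred = torch.tensor([[0.7091], [0.1929], [0.4226], [0.0014]])

threshold = 0.25  # same cutoff used in evaluate_model

# Turn each probability into a human-readable label
labels = ["spam" if p > threshold else "ham" for p in pred.flatten().tolist()]
print(labels)
```

With this cutoff the first and third messages are flagged as spam, while the second and fourth are treated as normal messages.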
