<a href="https://colab.research.google.com/github/jchou03/Natural-Language-Processing/blob/main/Jared_Chou_PA5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Programming Assignment Five: Spam detection with neural network.

In this assignment, you are asked to build a neural network that can detect spam from a given SMS message.

The provided files are:
1. `spam_train.csv`: a csv file containing the training data, where the 'text' column provides the sms messages and the 'label' column indicates whether the sms message is a 'ham' (0) or a 'spam' (1).
2. `spam_test.csv`: a csv file containing the testing data, following the same format as `spam_train.csv`.

**Step 1: Compute the SMS message vector based on the average value of the word vectors that belong to the words in it.** 

Just like the last assignment, we compute the 'representation' of each message, i.e., the vector, by averaging word vectors with Word2Vec. But this time, we are using pre-trained [Glove word embeddings](https://nlp.stanford.edu/projects/glove/) instead. Specifically, we are using word embedding `glove.6B.100d` to obtain word vectors of each message, as long as the word is in the 'glove.6B.100d' embedding space.

In other words, you need to:
1. Have a [basic idea](https://nlp.stanford.edu/pubs/glove.pdf) of how Glove provides pre-trained word embeddings (vectors).
2. Download and extract word vectors from `glove.6B.100d`, contained in `glove.6B.zip`.
3. Compute the message vectors by averaging the vectors of words in the message.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

data_path = "/content/drive/MyDrive/Sophomore(22-23)/CS505/CS505_Data/PA5/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import torch

torch.cuda.is_available()

False

In [None]:
# convert glove.6B.100d into word vectors 
import numpy as np
def get_line_dic(dir):
  with open(dir) as f: 
    lines = f.readlines()
    word_vecs = {}
    for line in lines:
      line_words = line.split()
      word_vecs[line_words[0]] = np.asarray([float(val) for val in line_words[1:]])
    return word_vecs

word_vecs = get_line_dic(data_path + 'glove.6B.100d.txt')
print("word vectors loaded")
# for vec in word_vecs:
#   print(vec)
# convert 

word vectors loaded


In [None]:
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe("parser")
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7f467d3899b0>

In [53]:
# calculate the average for each word 
def avg_word_vec (sent): 
  doc = nlp(sent)
  avg_vec = np.asarray([0.0] * 100)
  count = 0
  for sent in doc.sents:
    for word in sent:
      w = str(word).lower()
      if (w in word_vecs.keys()): 
        avg_vec += word_vecs[w]
        count += 1
  if count != 0:
    for i,element in enumerate(avg_vec):
      avg_vec[i] = avg_vec[i]/count
  return avg_vec

def load_from_csv (filepath):
  df = pd.read_csv(filepath)
  # print(df)
  x = df['text']
  y = df['label']
  return x, y

def avg_vecs (texts):
  vecs = []
  for text in texts:
    vecs.append(avg_word_vec(text))
  return vecs


In [54]:
# load training and testing data
train_x, train_y = load_from_csv(data_path + 'spam_train.csv')
test_x, test_y = load_from_csv(data_path + 'spam_test.csv')

# get average word vectors of all the training data
avg_train_x = avg_vecs(train_x) 
avg_test_x = avg_vecs(test_x)

In [64]:
print((avg_train_x[0]))

[-0.1969375   0.40137155  0.36865167 -0.20718825 -0.02065567  0.07757833
  0.03927933 -0.067465    0.441879    0.15917017  0.139775   -0.00559917
  0.288385    0.03816767 -0.19067117 -0.42057867  0.21766667  0.0576755
  0.00264867  0.21329167  0.143376    0.18265467  0.1512345  -0.14424208
  0.334619   -0.14618367 -0.19254992 -0.42458833 -0.1299887  -0.09602917
  0.01157183  0.34134167 -0.010272    0.023002    0.08147567  0.31745417
  0.07393383  0.373553    0.04445333 -0.22570883 -0.37323983 -0.25216667
  0.16185033 -0.38428667  0.04887733  0.019444    0.48043    -0.15121267
 -0.23660833 -0.343635    0.01208188 -0.16245167 -0.0275325   1.13224833
 -0.06781817 -2.42701667 -0.1387265  -0.27708217  1.45199667  0.60371333
 -0.1404895   0.68789    -0.07098717  0.24411817  0.809665    0.04799167
  0.42990333  0.05737     0.5053695  -0.02417683 -0.01987683 -0.1946005
  0.0938095  -0.18074183  0.27892     0.0567721  -0.10301317 -0.12640867
 -0.77742333  0.18287667  0.62576667 -0.12405117 -0.3

**Step 2: Build 'dataset + data loader' that can feed data to train your model with Pytorch.**

Our goal is to train a spam detection model (classification). Here's an [example](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) of how a classfier is trained. Although it is for image classification, the idea is very similar:

1. Prepare/build a dataset and load it with data loader;
2. Prepare/build a model that takes the data input and predicts; and 
3. Prepare/build the optimizer and loss functions to train the model with the dataset.

Naturally, the next thing to do is to prepare the data. We do it by building the 'Dataset' and 'Dataloader' with Pytorch.

You may refer to [this page](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to get an idea of how to make 'Dataset' and 'Dataloader'. 

Hints:
1. Make sure `__init__` , `__len__` and `__getitem__` of your defined dataset is implemented properly. In particular, the `__getitem__` function should return the specified message vector and its label.
2. Don't compute the message vector when calling the `__getitem__` function, otherwise the training process will slow down A LOT.
3. Make sure the shuffle is on for your data loader setup, as the data in the csv file is not. 



In [82]:
# prepare & build dataset
import os
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, data, labels):
      self.data = data
      self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input = self.data[idx]
        label = self.labels[idx]
        return input, label

dataset = CustomImageDataset(avg_train_x, train_y)
data_test = CustomImageDataset(avg_test_x, test_y)
# print(dataset[0])

(array([-0.1969375 ,  0.40137155,  0.36865167, -0.20718825, -0.02065567,
        0.07757833,  0.03927933, -0.067465  ,  0.441879  ,  0.15917017,
        0.139775  , -0.00559917,  0.288385  ,  0.03816767, -0.19067117,
       -0.42057867,  0.21766667,  0.0576755 ,  0.00264867,  0.21329167,
        0.143376  ,  0.18265467,  0.1512345 , -0.14424208,  0.334619  ,
       -0.14618367, -0.19254992, -0.42458833, -0.1299887 , -0.09602917,
        0.01157183,  0.34134167, -0.010272  ,  0.023002  ,  0.08147567,
        0.31745417,  0.07393383,  0.373553  ,  0.04445333, -0.22570883,
       -0.37323983, -0.25216667,  0.16185033, -0.38428667,  0.04887733,
        0.019444  ,  0.48043   , -0.15121267, -0.23660833, -0.343635  ,
        0.01208188, -0.16245167, -0.0275325 ,  1.13224833, -0.06781817,
       -2.42701667, -0.1387265 , -0.27708217,  1.45199667,  0.60371333,
       -0.1404895 ,  0.68789   , -0.07098717,  0.24411817,  0.809665  ,
        0.04799167,  0.42990333,  0.05737   ,  0.5053695 , -0.0

In [84]:
#  load dataset into data loader
from torch.utils.data import DataLoader

train_dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(data_test, batch_size=64, shuffle=True)

# print(train_dataloader)

<torch.utils.data.dataloader.DataLoader object at 0x7f467bf1ba90>


**Step 3: Build the neural net model.** 

Once the data is ready, we need to design and implement our neural network model.

You should look [here](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html) to see how a model can be defined.

The model does not need to be complicated. An example structure could be:

1. linear layer 100 x 15
2. ReLU activation layer
3. linear layer 15 x 2 (think about why here is 2 instead of 1?)
4. Softmax activation layer

But feel free to test out other possible combinations of linear layers & activation functions and whether they make significant difference to the model performance later.

In [85]:
import torch

class TinyModel(torch.nn.Module):

    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 2)
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x
      
model = TinyModel()

# print('The model:')
# print(model)

# print('\n\nJust one layer:')
# print(model.linear2)

# print('\n\nModel params:')
# for param in model.parameters():
#     print(param)

# print('\n\nLayer params:')
# for param in model.linear2.parameters():
#     print(param)

**Step 4: Train the model with optimizer and loss function.**

Lastly, we need to set up the [optimizer](https://pytorch.org/docs/stable/optim.html) and [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions) to train the model. You may refer to the links for more details. Specifically, we need Stochastic Gradient Descent (SGD) for optimizer and CrossEntropyLoss for loss function.

The last thing to do is to train the model for several epochs and evaluate its performance from time to time. For example,  train the model 5000 epochs, evaluating the model every 100 epochs. If you are not sure how the training works, you may refer to the [classification model tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) to see how it is typically done. Don't forget to print the average loss of the epoch to see if the model is being optimized properly.

The evaluation metric should be the [**accuracy**](https://en.wikipedia.org/wiki/Confusion_matrix) of predicting ham/spam on the testing data (TP+TN/(TP+TN+FP+FN)). The highest accuracy should be above at least **90%**. Try different settings of model structure, learning rate, and the number of training epochs  to achieve that level of accuracy.

In [92]:
import torch.optim as optim
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

In [93]:
# evaluation function

def eval():
  correct = 0
  total = 0
  # since we're not training, we don't need to calculate the gradients for our outputs
  with torch.no_grad():
      for data in train_dataloader:
          inputs, labels = data
          # calculate outputs by running images through the network
          outputs = model(inputs)
          # the class with the highest energy is what we choose as prediction
          _, predicted = torch.max(outputs.data, 1)
          total += labels.size(0)
          correct += (predicted == labels).sum().item()
  print("correct: " + str(correct) + " total: " + str(total))
  print(f'Accuracy this epoch: {100 * correct / total} %')

In [94]:
# training
for epoch in range(5000):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(test_dataloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # print(inputs)
        # print(type(inputs))
        # print(type(inputs[0][0]))
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        model.double()
        outputs = model(inputs)
        # print("outputs: " + str(outputs))
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
    # if epoch % 2000 == 1999:    # print every 2000 mini-batches
    #     print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
    #     running_loss = 0.0
    #     eval()
    if epoch % 100 == 0:
      print("epoch: " + str(epoch))
      print("average loss this epoch: " + str(running_loss / train_dataloader.__len__()))
      eval()

print('Finished Training')



epoch: 0
average loss this epoch: 0.34700796826482055
correct: 487 total: 1000
Accuracy this epoch: 48.7 %
epoch: 100
average loss this epoch: 0.2885680533082038
correct: 872 total: 1000
Accuracy this epoch: 87.2 %
epoch: 200
average loss this epoch: 0.24134272489325467
correct: 890 total: 1000
Accuracy this epoch: 89.0 %
epoch: 300
average loss this epoch: 0.2232059320060829
correct: 898 total: 1000
Accuracy this epoch: 89.8 %
epoch: 400
average loss this epoch: 0.21510880292810247
correct: 903 total: 1000
Accuracy this epoch: 90.3 %
epoch: 500
average loss this epoch: 0.20927601852885472
correct: 907 total: 1000
Accuracy this epoch: 90.7 %
epoch: 600
average loss this epoch: 0.20475424712423695
correct: 910 total: 1000
Accuracy this epoch: 91.0 %
epoch: 700
average loss this epoch: 0.2023045477891965
correct: 911 total: 1000
Accuracy this epoch: 91.1 %
epoch: 800
average loss this epoch: 0.19966670406509462
correct: 914 total: 1000
Accuracy this epoch: 91.4 %
epoch: 900
average loss 

In [None]:
# def train():
#   for data in train_dataloader:
#     inputs, labels = data
#     outputs = print(inputs.shape)
#     print(labels.shape)
#     break;

In [None]:
# Your training output should look similar to this (w/ # of epoch, accuracies, average loss, etc.)

  del sys.path[0]


epoch: 1
average loss this epoch: 0.6933
accuracy this epoch: 0.47
epoch: 101
average loss this epoch: 0.6864
accuracy this epoch: 0.62
epoch: 201
average loss this epoch: 0.6784
accuracy this epoch: 0.75
epoch: 301
average loss this epoch: 0.6655
accuracy this epoch: 0.82
epoch: 401
average loss this epoch: 0.6497
accuracy this epoch: 0.86
epoch: 501
average loss this epoch: 0.6286
accuracy this epoch: 0.86
epoch: 601
average loss this epoch: 0.6030
accuracy this epoch: 0.86
epoch: 701
average loss this epoch: 0.5747
accuracy this epoch: 0.86
epoch: 801
average loss this epoch: 0.5486
accuracy this epoch: 0.86
epoch: 901
average loss this epoch: 0.5273
accuracy this epoch: 0.86
epoch: 1001
average loss this epoch: 0.5078
accuracy this epoch: 0.87
epoch: 1101
average loss this epoch: 0.4917
accuracy this epoch: 0.87
epoch: 1201
average loss this epoch: 0.4804
accuracy this epoch: 0.87
epoch: 1301
average loss this epoch: 0.4702
accuracy this epoch: 0.88
epoch: 1401
average loss this ep