
# Programming Assignment Five: Spam Detection with a Neural Network

In this assignment, you are asked to build a neural network that can detect spam from a given SMS message.

The provided files are:
1. `spam_train.csv`: a CSV file containing the training data. The `text` column holds the SMS messages, and the `label` column marks each message as ham (0) or spam (1).
2. `spam_test.csv`: a csv file containing the testing data, following the same format as `spam_train.csv`.

**Step 1: Compute each SMS message vector as the average of the word vectors of its words.**

Just like the last assignment, we compute the 'representation' of each message, i.e., its vector, by averaging the vectors of its words. Instead of Word2Vec, this time we use pre-trained [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/). Specifically, we use the `glove.6B.100d` embeddings to look up the vector of each word in a message; words that are not in the `glove.6B.100d` vocabulary are skipped.

In other words, you need to:
1. Have a [basic idea](https://nlp.stanford.edu/pubs/glove.pdf) of how GloVe provides pre-trained word embeddings (vectors).
2. Download and extract word vectors from `glove.6B.100d`, contained in `glove.6B.zip`.
3. Compute the message vectors by averaging the vectors of words in the message.
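
To make the averaging step concrete, here is a minimal sketch on toy data. The 3-dimensional vectors and the names `toy_vecs` and `average_vector` are made up for illustration; the real `glove.6B.100d` vectors have 100 dimensions. Words without a vector are skipped, and a message with no known words is assumed to map to the zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for glove.6B.100d.
toy_vecs = {
    "free": np.array([1.0, 0.0, 2.0]),
    "prize": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vecs, dim):
    """Average the vectors of the tokens found in `vecs`; zeros if none match."""
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "zzzz" has no vector, so only "free" and "prize" contribute.
message_vec = average_vector(["free", "prize", "zzzz"], toy_vecs, dim=3)
print(message_vec)  # [2. 1. 1.]
```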

In [None]:
import torch

torch.cuda.is_available()

True

In [8]:
# load glove.6B.100d into a dict that maps each word to its 100-d vector
import numpy as np

def get_line_dic():
  word_vecs = {}
  # each line of the GloVe file is a word followed by its vector values
  with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
      line_words = line.split()
      word_vecs[line_words[0]] = np.asarray([float(val) for val in line_words[1:]])
  return word_vecs

word_vecs = get_line_dic()
print("word vectors loaded")

word vectors loaded


In [17]:
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe("parser")
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7f987c555190>

In [35]:
# calculate the average word vector of a message
def avg_word_vec(text):
  doc = nlp(text)
  avg_vec = np.zeros(100)
  count = 0
  for sent in doc.sents:
    for word in sent:
      w = word.text.lower()
      # only words in the GloVe vocabulary contribute to the average
      if w in word_vecs:
        avg_vec += word_vecs[w]
        count += 1
  # a message with no known words keeps the all-zero vector
  if count != 0:
    avg_vec /= count
  return avg_vec

def load_from_csv (filepath):
  df = pd.read_csv(filepath)
  # print(df)
  x = df['text']
  y = df['label']
  return x, y

def avg_vecs (texts):
  # compute the message vector for every text
  vecs = []
  for text in texts:
    vecs.append(avg_word_vec(text))
  return vecs


In [36]:
train_x, train_y = load_from_csv('spam_train.csv')
test_x, test_y = load_from_csv('spam_test.csv')



**Step 2: Build the 'dataset + data loader' that feeds data to your model during training with PyTorch.**

Our goal is to train a spam detection model (classification). Here's an [example](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) of how a classifier is trained. Although it is for image classification, the idea is very similar:

1. Prepare/build a dataset and load it with data loader;
2. Prepare/build a model that takes the data input and predicts; and 
3. Prepare/build the optimizer and loss functions to train the model with the dataset.

Naturally, the next thing to do is to prepare the data. We do it by building the 'Dataset' and 'Dataloader' with PyTorch.

You may refer to [this page](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to get an idea of how to make 'Dataset' and 'Dataloader'. 

Hints:
1. Make sure `__init__`, `__len__` and `__getitem__` of your dataset class are implemented properly. In particular, `__getitem__` should return the message vector at the given index together with its label.
2. Don't compute the message vector inside `__getitem__`; compute all vectors once beforehand, otherwise the training process will slow down A LOT.
3. Make sure shuffling is turned on in your data loader, as the data in the CSV file is not shuffled.
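
The hints above can be sketched roughly as follows. This is one possible layout, not the required one: the class name `MessageDataset`, the batch size, and the random stand-in data are assumptions made for illustration. The vectors are converted to tensors once in `__init__`, so `__getitem__` only indexes.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MessageDataset(Dataset):
    """Holds precomputed message vectors and labels (hypothetical name)."""

    def __init__(self, vectors, labels):
        # Convert once up front so __getitem__ stays cheap.
        self.vectors = torch.as_tensor(vectors, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.vectors[idx], self.labels[idx]

# Random stand-in data: 8 messages, 100-dimensional vectors, labels 0/1.
demo_vectors = torch.randn(8, 100)
demo_labels = torch.randint(0, 2, (8,))
demo_loader = DataLoader(MessageDataset(demo_vectors, demo_labels),
                         batch_size=4, shuffle=True)

# Each batch is a (vectors, labels) pair.
batch_x, batch_y = next(iter(demo_loader))
print(batch_x.shape, batch_y.shape)
```

With the real data, the stand-in tensors could be replaced by something like `np.stack([avg_word_vec(t) for t in train_x])` and `train_y.to_numpy()`, computed once before the dataset is constructed.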



**Step 3: Build the neural net model.** 

Once the data is ready, we need to design and implement our neural network model.

You should look [here](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html) to see how a model can be defined.

The model does not need to be complicated. An example structure could be:

1. linear layer 100 x 15
2. ReLU activation layer
3. linear layer 15 x 2 (think about why the output size here is 2 instead of 1)
4. Softmax activation layer

But feel free to try other combinations of linear layers and activation functions, and check later whether they make a significant difference to the model's performance.
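
A minimal sketch of the example structure as an `nn.Module` might look like this. The class name `SpamClassifier` and the constructor arguments are assumptions chosen for illustration. Note that `nn.CrossEntropyLoss` applies log-softmax internally, so the final Softmax layer is often left out when that loss is used; it is kept here only to mirror the listed structure.

```python
import torch
from torch import nn

class SpamClassifier(nn.Module):
    """Feed-forward net following the example structure (hypothetical name)."""

    def __init__(self, input_dim=100, hidden_dim=15, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),    # 100 x 15
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),  # 15 x 2
            nn.Softmax(dim=1),                   # per-message class probabilities
        )

    def forward(self, x):
        return self.net(x)

model = SpamClassifier()
probs = model(torch.randn(4, 100))  # a batch of 4 random message vectors
print(probs.shape)       # one row of 2 class scores per message
print(probs.sum(dim=1))  # each row of probabilities sums to 1
```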

**Step 4: Train the model with optimizer and loss function.**

Lastly, we need to set up the [optimizer](https://pytorch.org/docs/stable/optim.html) and [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions) to train the model. You may refer to the links for more details. Specifically, we need Stochastic Gradient Descent (SGD) as the optimizer and CrossEntropyLoss as the loss function.

The last thing to do is to train the model for a number of epochs and evaluate its performance at regular intervals. For example, train the model for 5000 epochs and evaluate it every 100 epochs. If you are not sure how the training works, you may refer to the [classification model tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) to see how it is typically done. Don't forget to print the average loss of each evaluated epoch to see if the model is being optimized properly.
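
A training loop along these lines could be sketched as below. Everything specific here is an assumption for illustration: the synthetic data, the learning rate of 0.1, the batch size, the 50 epochs, and the helper name `evaluate`. The small model outputs raw logits, because `nn.CrossEntropyLoss` applies log-softmax itself. For brevity the sketch evaluates on the same synthetic data it trains on; for the assignment, accuracy should be measured on the testing data.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: the label is 1 when the first feature is positive.
features = torch.randn(200, 100)
labels = (features[:, 0] > 0).long()
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(100, 15), nn.ReLU(), nn.Linear(15, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def evaluate(model, x, y):
    """Return the fraction of rows in x whose predicted class equals y."""
    model.eval()
    with torch.no_grad():
        preds = model(x).argmax(dim=1)
    return (preds == y).float().mean().item()

epoch_losses = []
for epoch in range(1, 51):
    model.train()
    total_loss = 0.0
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
        # Weight by batch size so the epoch average is per example.
        total_loss += loss.item() * len(batch_y)
    avg_loss = total_loss / len(loader.dataset)
    epoch_losses.append(avg_loss)
    if epoch % 10 == 1:  # report on epochs 1, 11, 21, ...
        acc = evaluate(model, features, labels)
        print(f"epoch: {epoch}")
        print(f"average loss this epoch: {avg_loss:.4f}")
        print(f"accuracy this epoch: {acc:.2f}")
```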

The evaluation metric should be the [**accuracy**](https://en.wikipedia.org/wiki/Confusion_matrix) of predicting ham/spam on the testing data, i.e., (TP+TN)/(TP+TN+FP+FN). The highest accuracy should be at least **90%**. Try different settings of model structure, learning rate, and number of training epochs to reach that level of accuracy.
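
As a small worked example of the accuracy formula, the sketch below counts the four confusion-matrix cells for a handful of made-up predictions, treating spam (1) as the positive class. The function name `accuracy` and the sample lists are illustrative only.

```python
def accuracy(predictions, labels):
    """Accuracy from confusion-matrix counts, with spam (1) as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return (tp + tn) / (tp + tn + fp + fn)

preds = [1, 0, 0, 1, 0]
gold = [1, 0, 1, 1, 0]
# TP=2, TN=2, FP=0, FN=1, so accuracy = 4/5
print(accuracy(preds, gold))  # 0.8
```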

In [None]:
# Your training output should look similar to this (w/ # of epoch, accuracies, average loss, etc.)

epoch: 1
average loss this epoch: 0.6933
accuracy this epoch: 0.47
epoch: 101
average loss this epoch: 0.6864
accuracy this epoch: 0.62
epoch: 201
average loss this epoch: 0.6784
accuracy this epoch: 0.75
epoch: 301
average loss this epoch: 0.6655
accuracy this epoch: 0.82
epoch: 401
average loss this epoch: 0.6497
accuracy this epoch: 0.86
epoch: 501
average loss this epoch: 0.6286
accuracy this epoch: 0.86
epoch: 601
average loss this epoch: 0.6030
accuracy this epoch: 0.86
epoch: 701
average loss this epoch: 0.5747
accuracy this epoch: 0.86
epoch: 801
average loss this epoch: 0.5486
accuracy this epoch: 0.86
epoch: 901
average loss this epoch: 0.5273
accuracy this epoch: 0.86
epoch: 1001
average loss this epoch: 0.5078
accuracy this epoch: 0.87
epoch: 1101
average loss this epoch: 0.4917
accuracy this epoch: 0.87
epoch: 1201
average loss this epoch: 0.4804
accuracy this epoch: 0.87
epoch: 1301
average loss this epoch: 0.4702
accuracy this epoch: 0.88
epoch: 1401
average loss this ep