# Part 1.

The deadline for Part 1 is **1:30 pm Feb 6th, 2020**.   
You should submit a `.ipynb` file with your solutions to NYU Classes.

---


In this part we will preprocess SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection. 

## Data Loading

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and loaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [138]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2020-02-12 18:08:44--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 172.217.193.138, 172.217.193.102, 172.217.193.101, ...
Connecting to docs.google.com (docs.google.com)|172.217.193.138|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/j8g6m1vsaeoo78t6jb3pbmb06tnl06vc/1581530400000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2020-02-12 18:08:44--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/j8g6m1vsaeoo78t6jb3pbmb06tnl06vc/1581530400000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 74.125.141.132, 2607:f8b0:400c:c06::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-

In [139]:
!ls

sample_data  spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [140]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Your task is to split the data to train/dev/test. Make sure that each row appears only in one of the splits.

In [0]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

"""
YOUR CODE GOES HERE
"""

train_texts, train_labels = df["v2"].iloc[:70], df["v1"].iloc[:70]
val_texts, val_labels = df["v2"].iloc[70:70 + val_size], df["v1"].iloc[70:70 + val_size]
test_texts, test_labels   = df["v2"].iloc[70 + val_size:val_size + test_size], df["v1"].iloc[70 + val_size:val_size + test_size]

## Data Processing

The task is to create bag-of-words features: tokenize the text, index each token, represent the sentence as a dictionary of tokens and their counts, limit the vocabulary to $n$ most frequent tokens. In the lab we use built-in `sklearn` function, `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

Function `preprocess_data` takes the list of texts and returns list of (lists of tokens). 
You may use [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.


In [0]:
# import nltk from nltk.corpus 
# import stopwords 
# set(stopwords.words('english'))
import spacy as sc
def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    """
    YOUR CODE GOES HERE
    """
    enModel = sc.load('en_core_web_sm')
    stopwords = sc.lang.en.stop_words.STOP_WORDS
    # loop through list of sentences and append to list
    preprocessed_data = []
    for sentence in data:
      preprocessed_data.append([token.text for token in enModel(sentence)])
    return preprocessed_data
# could not remove stop words kept giving me a type error, i think thats why my f1 score is so low
train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

In [0]:
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        self.dataset = dataset
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will return index of the token in self.vocab
        """
        YOUR CODE GOES HERE
        """
        vocab = {}
        #for each sentence in the data set, looks at the word and appends to dictionary
        #with the value count of how many times it appeared in the sentence
        for sentence in dataset:
          for word in sentence:
            if word in vocab.keys():
              vocab[word] += 1
            else:
              vocab[word] = 1
        #sorts the dictionary by most frequent words
        most_frequent = sorted(vocab, key = vocab.get, reverse = True)
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        self.vocab_list = most_frequent[:self.max_features]
        # Create a token indexer, self.token_to_index, that will return index of the token in self.vocab
        i = 0
        self.token_to_index = {}
        for item in self.vocab_list:
          self.token_to_index[item] = i
          i += 1
        pass

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        YOUR CODE GOES HERE
        """
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        i = 0
        for lists in dataset:
          for word in self.vocab_list:
            if word in lists:
              #fills data matrix with the count of the words in the vocab list that are in the dataset
              data_matrix[i][self.token_to_index[word]] = lists.count(word)
          i += 1
        return data_matrix

In [0]:
max_features = 2000 # TODO: Replace None with a number
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list

You can add more features to the feature matrix.

In [129]:
"""
YOUR CODE GOES HERE
"""
""" Not sure if this was optional or not"""

' Not sure if this was optional or not'

## Model

We train logistic regression model and save prediction for train, val and test.


In [0]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

## Performance of the model

Your task is to report train, val, test accuracies and F1 scores.
**You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.**

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [0]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    accurate = 0
    for i in range(len(y_true)):
      if y_true[i] == y_pred[i]:
        accurate += 1
    accuracy = accurate / len(y_true)
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    #f1 = 2*((precision*recall)/(precision + recall))
    #precision = tp / tp+fp
    #recall = tp / tp + fn
    """
    YOUR CODE GOES HERE
    """
    tp = 0
    fp = 0
    fn = 0
    for i in range(len(y_true)):
      if y_true[i] == 1 and y_pred[i] == 1:
        tp += 1
    for i in range(len(y_true)):
      if y_true[i] == 0 and y_pred[i] == 1:
        fp += 1
    for i in range(len(y_true)):
      if y_true[i] == 1 and y_pred[i] == 0:
        fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    f1 = 2*((precision*recall)/(precision + recall))
    return f1

In [148]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 1.000, F1 score: 1.000
Validation accuracy: 0.886, F1 score: 0.469
Test accuracy: 0.915, F1 score: 0.532


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If no, which metric is optimized in logistic regression?

**Your answer:** 
EXTRA CREDIT

**Question.**
In general, does having 0.99 accuracy on test means that the model is great? If no, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** 
having 0.99 accuracy does not mean that the model is great, because the accuracy is only the ratio of accurate predictions over all predictions. So if the data isnt balanced, and doesnt take into account false positives and false negatives, it can lead to a false representation of the truth. Like when one is testing for a rare disease.

### Exploration of predicitons

Show a few examples with true+predicted labels on the train and val sets.

In [149]:
"""
YOUR CODE GOES HERE
"""
# 1 - spam, 0 - ham

for i in range(5,10):
  print("Train Set")
  print(train_data[i])
  print("True Label", y_train[i])
  print("Pred Label", y_train_pred[i])
  print("Val Set")
  print(val_data[i])
  print("True Label", y_val[i])
  print("Pred Label", y_val_pred[i])


Train Set
['FreeMsg', 'Hey', 'there', 'darling', 'it', "'s", 'been', '3', 'week', "'s", 'now', 'and', 'no', 'word', 'back', '!', 'I', "'d", 'like', 'some', 'fun', 'you', 'up', 'for', 'it', 'still', '?', 'Tb', 'ok', '!', 'XxX', 'std', 'chgs', 'to', 'send', ',', 'å£1.50', 'to', 'rcv']
True Label 1
Pred Label 1
Val Set
['I', 'am', 'waiting', 'machan', '.', 'Call', 'me', 'once', 'you', 'free', '.']
True Label 0
Pred Label 0
Train Set
['Even', 'my', 'brother', 'is', 'not', 'like', 'to', 'speak', 'with', 'me', '.', 'They', 'treat', 'me', 'like', 'aids', 'patent', '.']
True Label 0
Pred Label 0
Val Set
['That', 's', 'cool', '.', 'i', 'am', 'a', 'gentleman', 'and', 'will', 'treat', 'you', 'with', 'dignity', 'and', 'respect', '.']
True Label 0
Pred Label 0
Train Set
['As', 'per', 'your', 'request', "'", 'Melle', 'Melle', '(', 'Oru', 'Minnaminunginte', 'Nurungu', 'Vettam', ')', "'", 'has', 'been', 'set', 'as', 'your', 'callertune', 'for', 'all', 'Callers', '.', 'Press', '*', '9', 'to', 'copy', '

**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** 
A lot of these speak about cash rewards or promoting an item, which although to us it is obvious that it is spam, the model might not have been able to distinguish

In [155]:
"""
YOUR CODE GOES HERE
"""
x = 0
#for i in range(0,10): dont know why this didn't work
for i in range(len(val_data)):
  if x != 10:
    if y_val[i] != y_val_pred[i]:
      x += 1
      print(val_data[i])

['Please', 'call', 'our', 'customer', 'service', 'representative', 'on', '0800', '169', '6031', 'between', '10am-9pm', 'as', 'you', 'have', 'WON', 'a', 'guaranteed', 'å£1000', 'cash', 'or', 'å£5000', 'prize', '!']
['GENT', '!', 'We', 'are', 'trying', 'to', 'contact', 'you', '.', 'Last', 'weekends', 'draw', 'shows', 'that', 'you', 'won', 'a', 'å£1000', 'prize', 'GUARANTEED', '.', 'Call', '09064012160', '.', 'Claim', 'Code', 'K52', '.', 'Valid', '12hrs', 'only', '.', '150ppm']
['You', 'are', 'a', 'winner', 'U', 'have', 'been', 'specially', 'selected', '2', 'receive', 'å£1000', 'or', 'a', '4', '*', 'holiday', '(', 'flights', 'inc', ')', 'speak', 'to', 'a', 'live', 'operator', '2', 'claim', '0871277810910p', '/', 'min', '(', '18', '+', ')']
['Todays', 'Voda', 'numbers', 'ending', '7548', 'are', 'selected', 'to', 'receive', 'a', '$', '350', 'award', '.', 'If', 'you', 'have', 'a', 'match', 'please', 'call', '08712300220', 'quoting', 'claim', 'code', '4041', 'standard', 'rates', 'app']
['Suns

## End of Part 1.
