<a href="https://colab.research.google.com/github/jiangzian96/DS1012/blob/master/HW1_pt1_zj444.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1.

The deadline for Part 1 is **1:30 pm Feb 6th, 2020**.   
You should submit a `.ipynb` file with your solutions to NYU Classes.

---


In this part we will preprocess SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection. 

## Data Loading

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and loaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [89]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2020-02-08 01:32:54--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 74.125.203.101, 74.125.203.102, 74.125.203.139, ...
Connecting to docs.google.com (docs.google.com)|74.125.203.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/vk0s99ate52jrai83kfhgdmqrhprmrii/1581125400000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2020-02-08 01:32:55--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/vk0s99ate52jrai83kfhgdmqrhprmrii/1581125400000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 108.177.125.132, 2404:6800:4008:c01::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-04-

In [90]:
!ls

sample_data  spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [91]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Your task is to split the data to train/dev/test. Make sure that each row appears only in one of the splits.

In [0]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

"""
YOUR CODE GOES HERE
"""
shuffle = np.arange(df.shape[0])
np.random.shuffle(shuffle)
train_size = df.shape[0] - val_size - test_size
labels = df["v1"].to_numpy()

train_texts, train_labels = df["v2"].iloc[:train_size], labels[:train_size]
val_texts, val_labels     = df["v2"].iloc[train_size:train_size+val_size], labels[train_size:train_size+val_size]
test_texts, test_labels   = df["v2"].iloc[-test_size:], labels[-test_size:]

## Data Processing

The task is to create bag-of-words features: tokenize the text, index each token, represent the sentence as a dictionary of tokens and their counts, limit the vocabulary to $n$ most frequent tokens. In the lab we use built-in `sklearn` function, `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

Function `preprocess_data` takes the list of texts and returns list of (lists of tokens). 
You may use [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.


In [0]:
import spacy
nlp = spacy.load("en_core_web_sm")
from tqdm import tqdm

In [94]:
def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    """
    YOUR CODE GOES HERE
    """
    preprocessed_data = []

    for sent in tqdm(data):
      temp = []
      doc = nlp(sent)
      for token in doc:
        temp.append(token.text.lower())

      preprocessed_data.append(temp)
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

100%|██████████| 3902/3902 [00:46<00:00, 84.78it/s]
100%|██████████| 835/835 [00:09<00:00, 85.12it/s]
100%|██████████| 835/835 [00:09<00:00, 85.04it/s]


In [0]:
from collections import Counter
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will return index of the token in self.vocab
        """
        YOUR CODE GOES HERE
        """
        all_tokens = []
        UNK_IDX = 0
        for sent in tqdm(dataset):
          for token in sent:
            all_tokens.append(token)
        token_counter = Counter(all_tokens)
        vocab, count = zip(*token_counter.most_common(self.max_features))
        self.token_to_index = dict(zip(vocab, range(1,1+len(vocab)))) 
        self.token_to_index["<unk>"] = UNK_IDX
        self.vocab_list = ["<unk>"]+ list(vocab)

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        YOUR CODE GOES HERE
        """
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        for i in tqdm(range(len(dataset))):
          sent = dataset[i]
          for token in sent:
            if token in self.token_to_index:
              j = self.token_to_index[token]
            else:
              j = self.token_to_index["<unk>"]
            data_matrix[i,j] += 1
        return data_matrix

In [96]:
max_features = 10000 
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list


100%|██████████| 3902/3902 [00:00<00:00, 251184.45it/s]
100%|██████████| 3902/3902 [00:00<00:00, 56332.52it/s]
100%|██████████| 835/835 [00:00<00:00, 46820.19it/s]
100%|██████████| 835/835 [00:00<00:00, 44991.70it/s]


You can add more features to the feature matrix.

In [97]:
"""
YOUR CODE GOES HERE
"""

'\nYOUR CODE GOES HERE\n'

## Model

We train logistic regression model and save prediction for train, val and test.


In [0]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

## Performance of the model

Your task is to report train, val, test accuracies and F1 scores.
**You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.**

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [0]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    accuracy = sum(y_true == y_pred)/y_pred.shape[0]
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    """
    YOUR CODE GOES HERE
    """

    TP = 0
    FP = 0
    TN = 0
    FN = 0

    for i in range(len(y_pred)): 
        if y_true[i]==y_pred[i]==1:
           TP += 1
        if y_pred[i]==1 and y_true[i]!=y_pred[i]:
           FP += 1
        if y_true[i]==y_pred[i]==0:
           TN += 1
        if y_pred[i]==0 and y_true[i]!=y_pred[i]:
           FN += 1

    precision = TP/(TP+FP)
    recall = TP/(TP+FN)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

In [100]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.997, F1 score: 0.990
Validation accuracy: 0.974, F1 score: 0.902
Test accuracy: 0.984, F1 score: 0.938


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If no, which metric is optimized in logistic regression?

**Your answer:** Logistic regression optimizes the cross-entropy loss, thus accuracy is being optimized, as 1-1 and 0-0, predictions that match with ground truth, have 0 loss.

**Question.**
In general, does having 0.99 accuracy on test means that the model is great? If no, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** 0.99 accuracy does not mean necessarily a great model. For example, in cancer detection, we care more about recall than accuracy, since we do not want to miss out a cancer on any patient (do not want false negatives). In this case, a high accuracy is not as important as good recall. Missing out diagnosing a cancer is more distractous than wrongly predicting someone to have cancer when he/she actually does not.

### Exploration of predicitons

Show a few examples with true+predicted labels on the train and val sets.

In [101]:
"""
YOUR CODE GOES HERE
"""
# 1 - spam, 0 - ham

print("First 10 training labels:\n","Predictions: ",y_train_pred[:10],"\n Truth: ",y_train[:10])
print("First 10 validation labels:\n","Predictions: ",y_val_pred[:10],"\n Truth: ",y_val[:10])

First 10 training labels:
 Predictions:  [0 0 1 0 0 1 0 0 1 1] 
 Truth:  [0 0 1 0 0 1 0 0 1 1]
First 10 validation labels:
 Predictions:  [0 1 1 0 0 0 0 0 0 1] 
 Truth:  [0 1 1 0 0 0 0 0 0 1]


**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** Since BoW model only cares about occurence of words instead of the actual ordering of any sentence, it is very likely that it will perform worse on sentences with spam-like words but are not actually spam. For example, "cat loves to eat mouse" and "mouse loves to eat cat" actually are regarded the same by BoW model.

In [118]:
"""
YOUR CODE GOES HERE
"""

indices = (y_val_pred != y_val).nonzero()[0].tolist()[:10]
for index in indices:
  print("Text: ",val_texts.iloc[index])
  print("Predicted: ", "spam" if y_val_pred[index] else "ham")
  print("Truth: ","spam" if y_val[index] else "ham")
  print("\n")

Text:  ringtoneking 84484
Predicted:  ham
Truth:  spam


Text:  You will be receiving this week's Triple Echo ringtone shortly. Enjoy it!
Predicted:  ham
Truth:  spam


Text:  Win a å£1000 cash prize or a prize worth å£5000
Predicted:  ham
Truth:  spam


Text:  TBS/PERSOLVO. been chasing us since Sept forå£38 definitely not paying now thanks to your information. We will ignore them. Kath. Manchester.
Predicted:  ham
Truth:  spam


Text:  Loans for any purpose even if you have Bad Credit! Tenants Welcome. Call NoWorriesLoans.com on 08717111821
Predicted:  ham
Truth:  spam


Text:  In The Simpsons Movie released in July 2007 name the band that died at the start of the film? A-Green Day, B-Blue Day, C-Red Day. (Send A, B or C)
Predicted:  ham
Truth:  spam


Text:  Sorry, it's a lot of friend-of-a-friend stuff, I'm just now about to talk to the actual guy who wants to buy
Predicted:  spam
Truth:  ham


Text:  Missed call alert. These numbers called but left no message. 07008009200
Predicte

## End of Part 1.
