The deadline is 9:30am Feb 9th (Wed).   
You should submit a `.ipynb` file with your solutions to BrightSpace.

--- 

There are 10 extra points for "adding extra features to your model". But the maximum grade you can obtain in this homework is 100%. If you complete the extra-credit task, your score will be min{10+score, 100}.

---


In this homework we will preprocess the SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection.

## Data Loading (10 points)

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and has been uploaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [1]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

zsh:1: command not found: wget


In [2]:
!ls

DSGA1012_HW1.ipynb spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

| | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [4]:
df.iloc[:,1]

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: v2, Length: 5572, dtype: object

Your task is to split the data into train/dev/test sets (don't forget to shuffle the data). Make sure that each row appears in only one of the splits.

In [5]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

# My Code Starts Here
df = df.sample(frac=1) #Shuffle The dataframe to achieve randomness
df.reset_index(drop=True, inplace=True) #Reset index

# Split into train/val/test sets, since we shuffled the df
# And we are not sampling with replacement, we use a simple indexing approach
train_texts, train_labels = df.iloc[val_size+test_size:,1], df.iloc[val_size+test_size:,0]
val_texts, val_labels     = df.iloc[:val_size,1], df.iloc[:val_size,0]
test_texts, test_labels   = df.iloc[val_size:val_size + test_size,1], df.iloc[val_size:val_size+test_size,0]
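
One way to sanity-check a split like the one above is to confirm that the three index sets are disjoint and together cover every row. The sketch below uses a small stand-in DataFrame (`toy_df`, hypothetical) and a fixed `random_state` (an assumption added for reproducibility, not part of the original cell); it is an illustration rather than part of the graded solution.

```python
import pandas as pd

# Hypothetical stand-in for the SMS dataframe: 20 rows, same two columns.
toy_df = pd.DataFrame({"v1": [0, 1] * 10,
                       "v2": [f"message {i}" for i in range(20)]})

# Shuffle once; random_state is an assumption added so the shuffle is repeatable.
shuffled = toy_df.sample(frac=1, random_state=0)

val_size = int(len(shuffled) * 0.15)
test_size = int(len(shuffled) * 0.15)

# Positional slices of the shuffled frame, mirroring the indexing approach above.
val_part = shuffled.iloc[:val_size]
test_part = shuffled.iloc[val_size:val_size + test_size]
train_part = shuffled.iloc[val_size + test_size:]

# sample() keeps the original index labels, so they identify the original rows.
val_idx, test_idx, train_idx = set(val_part.index), set(test_part.index), set(train_part.index)
all_idx = val_idx | test_idx | train_idx

# Disjoint: the union is exactly as large as the three parts added together.
disjoint = len(all_idx) == len(val_idx) + len(test_idx) + len(train_idx)
# Complete: every original row landed in some split.
covers_all = all_idx == set(toy_df.index)
print(disjoint, covers_all)
```

Because the slices are contiguous and non-overlapping over a frame shuffled without replacement, both checks should hold for any dataset size.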

## Data Processing (40 points)

The task is to create bag-of-words features: tokenize the text, index each token, represent each sentence as a dictionary of tokens and their counts, and limit the vocabulary to the $n$ most frequent tokens. In the lab we used the built-in `sklearn` class `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

Function `preprocess_data` takes a list of texts and returns a list of token lists (one list per text). 
You may use the [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in the `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.
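
To make the bag-of-words steps concrete, here is a minimal sketch on a toy corpus. The names `toy_docs`, `toy_vocab`, `toy_index`, and `toy_bows` are hypothetical, and `collections.Counter` is used only for brevity; the `Vectorizer` class itself is implemented separately.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["free", "entry", "free", "prize"],
    ["ok", "see", "you"],
    ["free", "call", "now"],
]

# Count every token across the whole corpus.
corpus_counts = Counter(token for doc in toy_docs for token in doc)

# Keep the n most frequent tokens and assign each one an index.
n = 3
toy_vocab = [token for token, _ in corpus_counts.most_common(n)]
toy_index = {token: i for i, token in enumerate(toy_vocab)}

# Represent each document as a dictionary of in-vocabulary token counts.
toy_bows = [
    {tok: cnt for tok, cnt in Counter(doc).items() if tok in toy_index}
    for doc in toy_docs
]
print(toy_vocab[0], toy_bows[0]["free"])
```

Tokens outside the top `n` are simply dropped, which is why the second toy document ends up with an empty dictionary.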


In [6]:
# Import the tokenizer used for preprocessing
from nltk.tokenize import RegexpTokenizer

def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    """
    Input:
    data (pandas Series of str) - one message per row
    
    Output:
    preprocessed_data (pandas Series of lists) - one list of tokens per message
    """
    
    # Lowercase every message
    preprocessed_data = data.str.lower()
    
    # Keep only runs of word characters, which drops commas, periods, etc.
    tokenizer = RegexpTokenizer(r"\w+")
    preprocessed_data = preprocessed_data.apply(tokenizer.tokenize)
    
    # Return our preprocessed data
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

In [7]:
train_data

1670    [when, i, was, born, god, said, oh, no, anothe...
1671    [december, only, had, your, mobile, 11mths, yo...
1672    [my, life, means, a, lot, to, me, not, because...
1673                 [what, are, your, new, years, plans]
1674            [hows, the, pain, dear, y, r, u, smiling]
                              ...                        
5567           [in, work, now, going, have, in, few, min]
5568    [claire, here, am, havin, borin, time, am, now...
5569    [i, know, you, are, serving, i, mean, what, ar...
5570    [a, cute, thought, for, friendship, its, not, ...
5571    [oi, ami, parchi, na, re, kicchu, kaaj, korte,...
Name: v2, Length: 3902, dtype: object

In [8]:
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None
        
    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will map each token in self.vocab_list
        # to its corresponding index in self.vocab_list
        """
        input: 
        dataset: preprocessed text, an iterable of token lists (one per message)
        
        modifications:
        self.vocab_list: the "max_features" most frequent tokens, most frequent first
        self.token_to_index: maps each token in self.vocab_list to its index in self.vocab_list
        
        output:
        True, to signal that fitting completed
        """
        # Count how often each token occurs across all rows
        token_counts = {}
        for row in dataset:
            for token in row:
                # Increment counter in dictionary: current count + 1
                token_counts[token] = token_counts.get(token, 0) + 1
                
        # Sort the tokens from most to least frequent
        sorted_tokens = sorted(token_counts, key=token_counts.get, reverse=True)
        
        # Keep the top self.max_features tokens (the instance attribute, not a global variable)
        self.vocab_list = sorted_tokens[:self.max_features]
        # Map each kept token to its position in self.vocab_list
        self.token_to_index = {token: i for i, token in enumerate(self.vocab_list)}
        
        return True

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        input:
        dataset: preprocessed text represented as a list of lists (tokens)
        
        output:
        data_matrix: a 2D (i,j) binary-array where 1 represents the word is present in the ith row of data
        for the jth item in self.vocab_list
        """
        ### Given Code:
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        
        ### My Code:
        # Iterate over all the tokens in each row
        for i, row in enumerate(dataset):
            for token in row:
                # Dictionary lookup is O(1); membership in self.vocab_list would scan the list
                if token in self.token_to_index:
                    data_matrix[i, self.token_to_index[token]] = 1
                        
        # Return the transformed data matrix
        return data_matrix

In [18]:
max_features = 250 # Vocabulary size: keep the 250 most frequent tokens
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list


(10 extra points) You can add more features to the feature matrix.

In [19]:
"""
YOUR CODE GOES HERE
"""
#Maybe add 2 features: a counter for positive words and a counter for negative words

'\nYOUR CODE GOES HERE\n'

## Model

We train a logistic regression model and save its predictions for the train, val, and test sets.


In [20]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

## Performance of the model (30 points)

Your task is to report train, val, test accuracies and F1 scores. **You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.** 

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [21]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    input:
    y_true: (array) of true classification values
    y_pred: (array) of predicted classification values
    
    output:
    accuracy: (float) the accuracy score is defined as the
                number of true predictions divided by the total
                number of predictions
    """
    #Get # of observations
    n = len(y_true)
    #Initialize counter variable
    counter = 0
    #Iterate over y_true and y_pred values
    for i in range(n):
        #If they match, its a correct prediction
        if y_true[i] == y_pred[i]:
            counter += 1 #Increment Counter
    #Accuracy is correct predictions / total predictions
    accuracy = counter / n
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    """
    input:
    y_true: (array) of true classification values
    y_pred: (array) of predicted classification values
    
    output:
    f1: (float) Score from 0-1 calculated as 2 * ((precision*recall)/(precision+recall))
    """
    #Initialize helper variables
    false_positive,true_positive,false_negative = 0,0,0
    
    #Iterate over predictions
    for i in range(len(y_true)):
        if y_pred[i] == 1 and y_true[i] == 1:
            true_positive += 1
        elif y_pred[i] == 1:
            false_positive += 1
        elif y_pred[i] == 0 and y_true[i] == 1:
            false_negative += 1
            
    #With no true positives, precision and recall are 0 or undefined, so report F1 as 0
    if true_positive == 0:
        return 0.0
    
    #Precision = True Positives / (True Positives + False Positives)
    precision = true_positive / (true_positive + false_positive)
    #Recall = True Positives / (True Positives + False Negatives)
    recall = true_positive / (true_positive + false_negative)
    
    #Calculate f1_score
    f1 = (2 * (precision*recall) / (precision + recall))
    return f1
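
As a quick check of the definitions above, the following sketch computes precision, recall, F1, and accuracy from confusion counts on a hand-made example. The arrays `toy_true` and `toy_pred` are hypothetical, and the vectorized NumPy counting is shown only as a cross-check of the loop-based functions, not as a replacement for them.

```python
import numpy as np

# Hypothetical labels and predictions (1 = spam, 0 = ham).
toy_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
toy_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Confusion counts for the positive (spam) class.
tp = int(np.sum((toy_pred == 1) & (toy_true == 1)))  # predicted spam, is spam
fp = int(np.sum((toy_pred == 1) & (toy_true == 0)))  # predicted spam, is ham
fn = int(np.sum((toy_pred == 0) & (toy_true == 1)))  # predicted ham, is spam

precision = tp / (tp + fp)
recall = tp / (tp + fn)
toy_f1 = 2 * precision * recall / (precision + recall)
toy_accuracy = float(np.mean(toy_true == toy_pred))
print(tp, fp, fn, toy_f1, toy_accuracy)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 0.75.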

In [22]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.987, F1 score: 0.950
Validation accuracy: 0.989, F1 score: 0.957
Test accuracy: 0.983, F1 score: 0.931


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If not, which metric is optimized in logistic regression?

**Your answer:** 
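No. Logistic regression is trained by minimizing the log loss (binary cross-entropy), which is equivalent to maximizing the likelihood of the observed labels under the model's predicted conditional probabilities $P(y \mid x)$. Accuracy is piecewise constant in the weights, so it cannot be optimized by gradient-based methods and is only used to evaluate the model after training. Note that scikit-learn's `LogisticRegression` also adds an L2 penalty on the weights by default.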
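
A minimal sketch of the log loss may help show how it differs from accuracy. The helper `toy_log_loss` and the probability lists below are hypothetical; the formula is the standard average negative log-likelihood for binary labels, without any regularization term.

```python
import math

def toy_log_loss(y_true, p_pred):
    """Average negative log-likelihood of binary labels given predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Each term is -log of the probability the model assigned to the true label.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident = [0.9, 0.1, 0.9]  # correct and confident
hesitant = [0.6, 0.4, 0.6]   # correct but barely

# Both sets of predictions give 100% accuracy at a 0.5 threshold,
# yet the log loss is lower for the confident predictions.
print(toy_log_loss(labels, confident), toy_log_loss(labels, hesitant))
```

Two models with identical accuracy can therefore have very different training losses, which is the sense in which accuracy is not what is being optimized.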

**Question.**
In general, does having 0.99 accuracy on the test set mean that the model is great? If not, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** 
Not in general. Whether 0.99 is good enough depends on the application (cancer screening, for instance, where missed positives are costly), and such a high accuracy is itself worth a second look: it can indicate poorly sampled data or a large class imbalance. Upon further investigation, roughly 85% of our observations belong to the "Not Spam" (0) class. A model that always predicts "Not Spam" would therefore already reach about 85% accuracy while catching no spam at all. This is why we use the F1 score: it is the harmonic mean of precision and recall, so it is high only when the model finds most of the positive class and rarely mislabels negatives as positives, which makes it far more informative than accuracy under class imbalance.
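
The imbalance argument can be illustrated with a majority-class baseline. The labels below are hypothetical (85 ham, 15 spam, mirroring the rough class ratio mentioned above), and reporting F1 as 0 when there are no true positives is a common convention assumed here.

```python
# Hypothetical labels: 85 ham (0) and 15 spam (1).
toy_labels = [0] * 85 + [1] * 15
# A "model" that always predicts ham.
always_ham = [0] * len(toy_labels)

n_correct = sum(t == p for t, p in zip(toy_labels, always_ham))
baseline_accuracy = n_correct / len(toy_labels)

tp = sum(t == 1 and p == 1 for t, p in zip(toy_labels, always_ham))
fn = sum(t == 1 and p == 0 for t, p in zip(toy_labels, always_ham))
baseline_recall = tp / (tp + fn)

# Precision is undefined with no positive predictions; F1 is taken to be 0 by convention.
baseline_f1 = 0.0 if tp == 0 else None
print(baseline_accuracy, baseline_recall, baseline_f1)
```

The baseline scores 0.85 accuracy but 0 recall and 0 F1, so accuracy alone would hide the fact that it never detects spam.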

### Exploration of predictions (20 points)

Show a few examples with true+predicted labels on the train and val sets.

In [63]:
#Define helper function
def correct(data,y_true,y_pred):
    output = []
    for i in range(len(y_true)):
        if y_true[i] == y_pred[i]:
            output.append([y_true[i], y_pred[i],data.iloc[i]])
    return pd.DataFrame(output, columns=['Y_true','y_pred','text'])

#Get correct predictions for val and train set, print below
correct_val_predictions= correct(val_data, y_val, y_val_pred)
correct_train_predictions= correct(train_data, y_train, y_train_pred)

In [67]:
correct_val_predictions.head(10)

| | Y_true | y_pred | text |
|---|---|---|---|
| 0 | 1 | 1 | [urgent, important, information, for, o2, user... |
| 1 | 0 | 0 | [really, good, dhanush, rocks, once, again] |
| 2 | 0 | 0 | [any, pain, on, urination, any, thing, else] |
| 3 | 0 | 0 | [yo, howz, u, girls, never, rang, after, india... |
| 4 | 0 | 0 | [yeah, but, which, is, worse, for, i] |
| 5 | 0 | 0 | [i, know, where, the, lt, gt, is, i, ll, be, t... |
| 6 | 0 | 0 | [thanks, it, was, only, from, tescos, but, qui... |
| 7 | 0 | 0 | [how, izzit, still, raining] |
| 8 | 0 | 0 | [i, ask, if, u, meeting, da, ge, tmr, nite] |
| 9 | 0 | 0 | [get, the, door, i, m, here] |


In [66]:
correct_train_predictions.head(10)

| | Y_true | y_pred | text |
|---|---|---|---|
| 0 | 0 | 0 | [when, i, was, born, god, said, oh, no, anothe... |
| 1 | 1 | 1 | [december, only, had, your, mobile, 11mths, yo... |
| 2 | 0 | 0 | [my, life, means, a, lot, to, me, not, because... |
| 3 | 0 | 0 | [what, are, your, new, years, plans] |
| 4 | 0 | 0 | [hows, the, pain, dear, y, r, u, smiling] |
| 5 | 0 | 0 | [and, stop, being, an, old, man, you, get, to,... |
| 6 | 0 | 0 | [babes, i, think, i, got, ur, brolly, i, left,... |
| 7 | 0 | 0 | [what, s, up, my, own, oga, left, my, phone, a... |
| 8 | 0 | 0 | [i, meant, middle, left, or, right] |
| 9 | 0 | 0 | [thank, you, baby, i, cant, wait, to, taste, t... |


**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** 
Most of the false negatives seem to contain tokens that are informative for spam detection but not frequent enough to make our `max_features` cutoff. For instance, many spam messages contain numbers, such as a phone number to call. These numbers are usually unique, so they rarely make it into our vocabulary list. Adding a binary feature that flags the presence of digits in a message might improve performance and reduce the occurrence of false negatives.

The false positives, in contrast, tend to be messages containing words like "call", "text", "sms", and "msg". This is most likely because most spam messages contain a similar call to action that tries to bait recipients into engaging further, so the model has learned to associate these words with spam. True negatives were typically plain text with no numbers in them, while true positives typically contained a number, a call to action, or a reference to communication.
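
The digit-presence feature suggested above could be sketched as follows. The helper `has_digit_feature` and the `toy_*` names are hypothetical, the zero matrix merely stands in for a bag-of-words matrix such as `X_val`, and whether this feature actually improves the scores has not been tested here.

```python
import numpy as np

def has_digit_feature(token_lists):
    """Return an (n_messages, 1) array: 1.0 if any token contains a digit, else 0.0."""
    flags = [
        1.0 if any(ch.isdigit() for token in tokens for ch in token) else 0.0
        for tokens in token_lists
    ]
    return np.array(flags).reshape(-1, 1)

# Two tokenized messages taken from the misclassified validation examples.
toy_tokens = [
    ["88066", "from", "88066", "lost", "3pound", "help"],
    ["waiting", "for", "your", "call"],
]

# Stand-in for a bag-of-words matrix with 4 vocabulary columns.
toy_bow = np.zeros((len(toy_tokens), 4))

# Append the new feature as one extra column on the right.
toy_augmented = np.hstack([toy_bow, has_digit_feature(toy_tokens)])
print(toy_augmented.shape, toy_augmented[:, -1])
```

The same column would need to be appended to the train, val, and test matrices alike so that all three keep the same number of features.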

In [59]:
#Define helper function
def incorrect(data,y_true,y_pred):
    output = []
    for i in range(len(y_true)):
        if y_true[i] != y_pred[i]:
            #Use positional indexing so this works regardless of the Series index labels
            output.append([y_true[i], y_pred[i],data.iloc[i]])
    return pd.DataFrame(output, columns=['Y_true','y_pred','text'])

#Get incorrectly labeled val set predictions
wrong_val_predictions = incorrect(val_data, y_val, y_val_pred)

In [62]:
wrong_val_predictions.head(10)

| | Y_true | y_pred | text |
|---|---|---|---|
| 0 | 1 | 0 | [0a, networks, allow, companies, to, bill, for... |
| 1 | 0 | 1 | [urgh, coach, hot, smells, of, chip, fat, than... |
| 2 | 0 | 1 | [waiting, for, your, call] |
| 3 | 0 | 1 | [k, u, also, dont, msg, or, reply, to, his, msg] |
| 4 | 1 | 0 | [mobile, club, choose, any, of, the, top, qual... |
| 5 | 1 | 0 | [2, 2, 146tf150p] |
| 6 | 1 | 0 | [missed, call, alert, these, numbers, called, ... |
| 7 | 1 | 0 | [england, v, macedonia, dont, miss, the, goals... |
| 8 | 1 | 0 | [88066, from, 88066, lost, 3pound, help] |
