The deadline is 9:30am Feb 9th (Wed).   
You should submit a `.ipynb` file with your solutions to BrightSpace.

--- 

There are 10 extra points for "adding extra features to your model". But the maximum grade you can obtain in this homework is 100%. If you complete the extra-credit task, your score will be min{10+score, 100}.

---


In this homework we will preprocess the SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection. 

## Data Loading (10 points)

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and has been uploaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [1]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

zsh:1: command not found: wget


In [2]:
!ls

DSGA1012_HW1.ipynb spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

| | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [4]:
df.iloc[:,1]

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: v2, Length: 5572, dtype: object

Your task is to split the data into train/dev/test sets (don't forget to shuffle the data). Make sure that each row appears in exactly one of the splits.

In [5]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

# My Code Starts Here
# Shuffle the rows (fixed seed so the split is reproducible) and reset the index
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# The rows are shuffled and sampled without replacement, so contiguous,
# non-overlapping slices give a random split in which each row appears exactly once
train_texts, train_labels = df.iloc[val_size + test_size:, 1], df.iloc[val_size + test_size:, 0]
val_texts, val_labels     = df.iloc[:val_size, 1], df.iloc[:val_size, 0]
test_texts, test_labels   = df.iloc[val_size:val_size + test_size, 1], df.iloc[val_size:val_size + test_size, 0]
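
A quick way to sanity-check a split like the one above is to confirm that the three index sets are disjoint and together cover every row. The sketch below uses a small made-up DataFrame (`toy_df`) and a fixed `random_state`; both are assumptions for illustration and not part of the assignment.

```python
import pandas as pd

# Toy stand-in for the SMS dataframe (hypothetical data, for illustration only)
toy_df = pd.DataFrame({"v1": [0, 1] * 10, "v2": [f"message {i}" for i in range(20)]})

# Shuffle reproducibly, then reset the index so positions and labels agree
toy_df = toy_df.sample(frac=1, random_state=0).reset_index(drop=True)

val_size = int(toy_df.shape[0] * 0.15)
test_size = int(toy_df.shape[0] * 0.15)

# Contiguous, non-overlapping slices of the shuffled rows
val_part = toy_df.iloc[:val_size]
test_part = toy_df.iloc[val_size:val_size + test_size]
train_part = toy_df.iloc[val_size + test_size:]

# Each row index should land in exactly one split
train_idx, val_idx, test_idx = set(train_part.index), set(val_part.index), set(test_part.index)
splits_are_disjoint = not (train_idx & val_idx or train_idx & test_idx or val_idx & test_idx)
splits_cover_all = (train_idx | val_idx | test_idx) == set(toy_df.index)
print(splits_are_disjoint, splits_cover_all)
```

The same two checks can be run on the real `train_texts`, `val_texts`, and `test_texts` by comparing their `.index` values.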

## Data Processing (40 points)

The task is to create bag-of-words features: tokenize the text, index each token, represent each sentence as a dictionary of tokens and their counts, and limit the vocabulary to the $n$ most frequent tokens. In the lab we used the built-in `sklearn` class `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

The function `preprocess_data` takes a list of texts and returns a list of lists of tokens. 
You may use the [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in the `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.
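
To make the bag-of-words steps concrete, here is a minimal sketch on two made-up, already tokenized messages. The names `toy_messages`, `toy_vocab`, and `toy_token_to_index` are hypothetical, and `collections.Counter` is used only for illustration; it is not necessarily how the `Vectorizer` class should be written.

```python
from collections import Counter

# Two toy messages, already tokenized (hypothetical data)
toy_messages = [["free", "entry", "free", "prize"], ["free", "prize", "now"]]

# Count every token across the whole corpus
corpus_counts = Counter(token for message in toy_messages for token in message)

# Keep the n most frequent tokens and give each one an index
n = 2
toy_vocab = [token for token, _ in corpus_counts.most_common(n)]
toy_token_to_index = {token: index for index, token in enumerate(toy_vocab)}

# Represent each message as a dictionary of in-vocabulary tokens and their counts
message_counts = [
    {token: count for token, count in Counter(message).items() if token in toy_token_to_index}
    for message in toy_messages
]
print(toy_vocab)       # ['free', 'prize']
print(message_counts)  # [{'free': 2, 'prize': 1}, {'free': 1, 'prize': 1}]
```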


In [6]:
#Import necessary modules for preprocessing
import nltk
from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer

def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    """
    Input:
    data (pandas Series of strings) - one raw message per entry
    
    Output:
    preprocessed_data (pandas Series of lists) - one list of tokens per message
    """
    #Make all messages lowercase
    preprocessed_data = data.str.lower()
    
    #Tokenize on runs of word characters, which drops commas, periods, etc.
    tokenizer = RegexpTokenizer(r"\w+")
    preprocessed_data = preprocessed_data.apply(tokenizer.tokenize)
    
    #Return our preprocessed data
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

In [22]:
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None
        
    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will map each token in self.vocab_list 
        # to its corresponding index in self.vocab_list
        """
        input: 
        dataset: preprocessed text represented as a list of lists (tokens)
        
        modifications:
        self.vocab_list: will contain the "max_features" most frequent tokens
        self.token_to_index: maps each token in self.vocab_list to its index in self.vocab_list
        
        output:
        True: signals that the function ran without errors
        """
        #Initialize a list, a dictionary, and a helper variable
        self.vocab_list = []
        token_counter_dict = {}
        self.token_to_index = {}
        
        #Iterate over all tokens in each row
        for row in dataset:
            for token in row:
                #Increment counter in dictionary: current count + 1
                token_counter_dict[token] = token_counter_dict.get(token, 0) + 1
                
        #Sort the tokens by frequency, most frequent first
        sorted_tokens = sorted(token_counter_dict.items(), key=lambda item: item[1], reverse=True)
        
        #Keep only the first self.max_features tokens (the instance attribute,
        #not the global max_features) and map each one to its index
        for index, (token, count) in enumerate(sorted_tokens[:self.max_features]):
            #Append token to vocab_list
            self.vocab_list.append(token)
            #Update our token_to_index dictionary to map added token to an index
            self.token_to_index[token] = index
        
        return True

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        input:
        dataset: preprocessed text represented as a list of lists (tokens)
        
        output:
        data_matrix: a 2D (i,j) binary-array where 1 represents the word is present in the ith row of data
        for the jth item in self.vocab_list
        """
        ### Given Code:
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        
        ### My Code:
        #Iterate over all the tokens in each row
        for i, row in enumerate(dataset):
            for token in row:
                # If the token is in the vocabulary, mark it as present in row i.
                # A dictionary lookup is O(1); searching self.vocab_list would be O(vocab size).
                if token in self.token_to_index:
                    data_matrix[i, self.token_to_index[token]] = 1
                        
        #Return our fitted data matrix
        return data_matrix
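
The task description mentions token counts, while `transform` above records only presence or absence. A count-based variant could be sketched as follows; `count_matrix` is a hypothetical helper written for illustration, not part of the provided template.

```python
import numpy as np

def count_matrix(dataset, token_to_index):
    """Return an (n_messages, vocab_size) matrix of token counts.

    Hypothetical helper for illustration: `dataset` is an iterable of token
    lists and `token_to_index` maps each vocabulary token to a column index.
    """
    rows = list(dataset)
    matrix = np.zeros((len(rows), len(token_to_index)))
    for i, row in enumerate(rows):
        for token in row:
            column = token_to_index.get(token)  # None for out-of-vocabulary tokens
            if column is not None:
                matrix[i, column] += 1  # accumulate instead of setting to 1
    return matrix

toy_counts = count_matrix([["free", "free", "prize"], ["hello"]], {"free": 0, "prize": 1})
print(toy_counts)
```

Whether counts or binary indicators work better for this dataset is an empirical question that can be checked on the validation set.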

In [23]:
max_features = 100 # Vocabulary size: keep the 100 most frequent tokens
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list


(10 extra points) You can add more features to the feature matrix.

In [9]:
"""
YOUR CODE GOES HERE
"""
#Maybe add two features: a counter for positive words and a counter for negative words 

'\nYOUR CODE GOES HERE\n'
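
One possible way to add features is to compute a few hand-crafted statistics per raw message and append them as extra columns. The sketch below is only an illustration: the feature choices (length, digit count, uppercase count), the helper `extra_features`, and the toy arrays are all assumptions, not requirements of the assignment.

```python
import numpy as np

def extra_features(texts):
    """Hand-crafted features per raw message (hypothetical choices):
    character length, number of digits, and number of uppercase letters."""
    return np.array([
        [len(text), sum(ch.isdigit() for ch in text), sum(ch.isupper() for ch in text)]
        for text in texts
    ], dtype=float)

toy_texts = ["WIN 1000 now", "ok see you"]
toy_bow = np.array([[1.0, 0.0], [0.0, 1.0]])  # stand-in for a bag-of-words matrix

# Append the new columns to the right of the existing feature matrix
toy_features = np.hstack([toy_bow, extra_features(toy_texts)])
print(toy_features.shape)  # (2, 5)
```

Features like the uppercase count have to be computed from the raw texts (for example `train_texts`), since `preprocess_data` lowercases everything.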

## Model

We train a logistic regression model and save its predictions for the train, val, and test sets.


In [10]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

## Performance of the model (30 points)

Your task is to report train, val, test accuracies and F1 scores. **You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.** 

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [11]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    accuracy = None
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    f1 = None
    return f1
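
For reference, both metrics can be computed directly from element-wise comparisons of the label arrays. The sketch below uses the hypothetical names `simple_accuracy` and `simple_f1` so it does not replace the stubs above, and it assumes binary labels with 1 as the positive (spam) class.

```python
import numpy as np

def simple_accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def simple_f1(y_true, y_pred):
    # F1 = 2*TP / (2*TP + FP + FN), with class 1 treated as positive
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denominator = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when there are no positives at all
    return 2 * tp / denominator if denominator > 0 else 0.0

toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]
print(simple_accuracy(toy_true, toy_pred), simple_f1(toy_true, toy_pred))
```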

In [12]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

TypeError: unsupported format string passed to NoneType.__format__

**Question.**
Is accuracy the metric that logistic regression optimizes during training? If not, which metric does logistic regression optimize?

**Your answer:** 

**Question.**
In general, does having 0.99 accuracy on the test set mean that the model is great? If not, can you give an example of a case where the accuracy is high but the model is not good? (Hint: why do we use the F1 score?)

**Your answer:** 
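
As a small illustration of the hint, consider made-up, heavily imbalanced labels and a degenerate predictor that always outputs ham. The data and variable names below are hypothetical, and F1 is defined as 0 when the model predicts no positives.

```python
import numpy as np

# Hypothetical, heavily imbalanced labels: 99 ham (0) and 1 spam (1)
y_true_toy = np.array([0] * 99 + [1])
# A degenerate "model" that always predicts ham
y_pred_toy = np.zeros(100, dtype=int)

accuracy_toy = float(np.mean(y_true_toy == y_pred_toy))

# Confusion counts with spam (1) as the positive class
tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))
denominator = 2 * tp + fp + fn
f1_toy = 2 * tp / denominator if denominator > 0 else 0.0

print(accuracy_toy, f1_toy)  # 0.99 0.0
```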

### Exploration of predictions (20 points)

Show a few examples with true+predicted labels on the train and val sets.

In [None]:
"""
YOUR CODE GOES HERE
"""
# 1 - spam, 0 - ham
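
One convenient way to view texts next to their true and predicted labels is a small DataFrame. The sketch below uses made-up stand-ins (`toy_val_texts`, `toy_y_val`, `toy_y_val_pred`) for `val_texts`, `y_val`, and `y_val_pred`; it is an illustration, not the required solution.

```python
import pandas as pd

# Hypothetical stand-ins for val_texts, y_val, and y_val_pred
toy_val_texts = pd.Series(
    ["win a free prize now", "are we still on for lunch", "call now to claim cash"],
    index=[10, 11, 12],  # deliberately not starting at 0
)
toy_y_val = [1, 0, 1]
toy_y_val_pred = [1, 0, 0]

# .to_numpy() drops the Series index so the three columns align by position
examples = pd.DataFrame({
    "text": toy_val_texts.to_numpy(),
    "true": toy_y_val,
    "predicted": toy_y_val_pred,
})
print(examples.head())
```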


**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** 
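
Misclassified examples can be located by comparing the label arrays element-wise. The following is a minimal sketch on made-up arrays (`toy_texts`, `toy_true`, `toy_pred`); with the real data, the same indexing would apply to `val_texts`, `y_val`, and `y_val_pred` by position.

```python
import numpy as np

# Hypothetical texts, true labels, and predictions (1 - spam, 0 - ham)
toy_texts = np.array([
    "win a free prize now",
    "see you at dinner",
    "urgent your account needs attention",
    "free for coffee later",
])
toy_true = np.array([1, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1])

# Positions where the prediction disagrees with the true label
wrong = np.flatnonzero(toy_true != toy_pred)

# Print at most 10 misclassified examples
for i in wrong[:10]:
    print(f"true={toy_true[i]} predicted={toy_pred[i]} text={toy_texts[i]}")
```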

In [None]:
"""
YOUR CODE GOES HERE
"""