# BOW classifier (logistic regression) for spam detection

## Data Loading

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and was uploaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [None]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

In [None]:
!ls

There are two columns: `v1`, the spam or ham indicator, and `v2`, the text of the message.

In [90]:
import pandas as pd
import numpy as np
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

| | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [91]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

validate, test, train = np.split(df.sample(frac=1), [val_size, test_size + val_size])

train_texts, train_labels = train['v2'].tolist(), train['v1'].tolist()
val_texts, val_labels = validate['v2'].tolist(), validate['v1'].tolist()
test_texts, test_labels = test['v2'].tolist(), test['v1'].tolist()
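
The shuffle above is unseeded, so each run yields a different split and slightly different scores. Below is a minimal sketch of a reproducible 15/15/70 split, written with the standard library on a hypothetical list of (text, label) pairs. The function name `split_dataset`, the toy rows, and the seed value are assumptions for illustration, not part of the notebook:

```python
import random

def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows with a fixed seed and cut them into val, test and train parts."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG: same seed, same order
    val_size = int(len(shuffled) * val_frac)
    test_size = int(len(shuffled) * test_frac)
    validate = shuffled[:val_size]
    test = shuffled[val_size:val_size + test_size]
    train = shuffled[val_size + test_size:]
    return validate, test, train

# Hypothetical toy data standing in for the (v2, v1) columns
rows = [(f"message {i}", i % 2) for i in range(20)]
validate, test, train = split_dataset(rows)
print(len(validate), len(test), len(train))  # 3 3 14
```

With pandas, passing `random_state` to `df.sample` should have a similar effect on the split used in this notebook.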

## Data Processing

Create bag-of-words features: tokenize the text, index each token, represent each message as a dictionary of tokens and their counts, and limit the vocabulary to the $n$ most frequent tokens.

The function `preprocess_data` takes a list of texts and returns a list of token lists, one per message.
You may use the [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in `preprocess_data`.

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.
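
As a rough illustration of the steps listed above, the sketch below builds bag-of-words counts for two toy messages with `collections.Counter`. Whitespace splitting stands in for the spaCy tokenizer, and the names `toy_texts`, `n`, and `bow` are made up for the example:

```python
from collections import Counter

toy_texts = ["free prize free entry free prize", "ok see you ok"]

# 1. Tokenize (whitespace split is a stand-in for a real tokenizer)
tokenized = [text.split() for text in toy_texts]

# 2. Count tokens over the whole corpus and keep the n most frequent
n = 3
corpus_counts = Counter(token for doc in tokenized for token in doc)
vocab = [token for token, _ in corpus_counts.most_common(n)]

# 3. Index each vocabulary token
token_to_index = {token: i for i, token in enumerate(vocab)}

# 4. Represent each message as a dictionary of in-vocabulary token counts
bow = [
    {t: c for t, c in Counter(doc).items() if t in token_to_index}
    for doc in tokenized
]
print(vocab)
print(bow)
```

Tokens outside the top `n` ("entry", "see", "you") are dropped, which is the effect of capping the vocabulary size.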


In [92]:
def preprocess_data(data):
    """
    Returns a list of lists of preprocessed tokens for each message
    Args:
        data - list of texts
    Returns:
        preprocessed_data - list of lists of tokens
    """
    preprocessed_data = [ ]
    for i in range(len(data)):
        doc = nlp(data[i])
        tokens = [token.text for token in doc]
        preprocessed_data.append(tokens)

    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

In [100]:
train_data

[['Gudnite', '....', 'tc', '...', 'practice', 'going', 'on'],
 ['My',
  'mobile',
  'number.pls',
  'sms',
  'ur',
  'mail',
  'id.convey',
  'regards',
  'to',
  'achan',
  ',',
  'amma',
  '.',
  'Rakhesh',
  '.',
  'Qatar'],
 ['One',
  'of',
  'best',
  'dialogue',
  'in',
  'cute',
  'reltnship',
  '..',
  '!',
  '!',
  '\\Wen',
  'i',
  'Die'],
 ['Cramps', 'stopped', '.', 'Going', 'back', 'to', 'sleep'],
 ['The',
  'table',
  "'s",
  'occupied',
  ',',
  'I',
  "'m",
  'waiting',
  'by',
  'the',
  'tree'],
 ['Mmmm',
  '...',
  'Fuck',
  '...',
  'Not',
  'fair',
  '!',
  'You',
  'know',
  'my',
  'weaknesses',
  '!',
  '*',
  'grins',
  '*',
  '*',
  'pushes',
  'you',
  'to',
  'your',
  'knee',
  "'s",
  '*',
  '*',
  'exposes',
  'my',
  'belly',
  'and',
  'pulls',
  'your',
  'head',
  'to',
  'it',
  '*',
  'Do',
  "n't",
  'forget',
  '...',
  'I',
  'know',
  'yours',
  'too',
  '*',
  'wicked',
  'smile',
  '*'],
 ['Does',
  'she',
  'usually',
  'take',
  'fifteen',
  

In [94]:
class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        """
        Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        Create a token indexer, self.token_to_index, that will map each token in self.vocab_list to its corresponding index in self.vocab_list
        Args:
            dataset - preprocessed_data(list of lists of tokens) 
        """
        tem_vocab = dict()
        for i in range(len(dataset)):
            for j in range(len(dataset[i])):
                if dataset[i][j] not in tem_vocab:
                    tem_vocab[dataset[i][j]] = 1
                else:
                    tem_vocab[dataset[i][j]] += 1
        
        # Keep the most frequent `self.max_features` tokens, ordered by descending count
        ranked = sorted(tem_vocab.items(), key=lambda x: x[1], reverse=True)
        self.vocab_list = [token for token, _ in ranked[:self.max_features]]
        self.token_to_index = {token: index for index, token in enumerate(self.vocab_list)}

    def transform(self, dataset):
        """
        Transforms text dataset into a matrix, data_matrix.
        Args:
            dataset - preprocessed_data(list of lists of tokens) 
        Returns:
            data_matrix - (len(dataset) x len(self.vocab_list))
        """
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        for i in range(len(dataset)):
            for j in range(len(self.vocab_list)):
                data_matrix[i][j] = dataset[i].count(self.vocab_list[j])
                
        return data_matrix
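
`transform` above calls `list.count` once per vocabulary entry, so each message is rescanned `len(self.vocab_list)` times. One possible alternative, sketched here with plain lists and a hypothetical helper name `transform_with_index`, walks each message once and uses a token-to-index mapping for the lookup. Wrapping the result in `np.asarray` should give a matrix of the same shape as `data_matrix`:

```python
def transform_with_index(dataset, token_to_index):
    """Count in-vocabulary tokens per message in a single pass over each message."""
    n_features = len(token_to_index)
    rows = []
    for tokens in dataset:
        row = [0] * n_features
        for token in tokens:
            index = token_to_index.get(token)  # None for out-of-vocabulary tokens
            if index is not None:
                row[index] += 1
        rows.append(row)
    return rows

# Toy vocabulary and messages (hypothetical values)
token_to_index = {"free": 0, "call": 1, "ok": 2}
dataset = [["free", "call", "free", "now"], ["ok", "ok"], []]
print(transform_with_index(dataset, token_to_index))
# [[2, 1, 0], [0, 0, 2], [0, 0, 0]]
```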

In [96]:
max_features = 10000  # vocabulary size: keep the 10,000 most frequent tokens
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list

## Model

We train a logistic regression model and save its predictions for the train, val, and test sets.


In [97]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)
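
For intuition, a fitted binary logistic regression scores a message as $\sigma(w \cdot x + b)$ and predicts spam when that probability exceeds 0.5. The sketch below reproduces this rule by hand. The weights, bias, and feature vectors are invented for illustration and are not taken from the trained model:

```python
import math

def predict_proba(x, weights, bias):
    """Sigmoid of the linear score w.x + b."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias, threshold=0.5):
    """Label 1 (spam) when the predicted probability exceeds the threshold."""
    return int(predict_proba(x, weights, bias) > threshold)

# Hypothetical weights for a 3-token vocabulary: ["free", "txt", "ok"]
weights = [1.5, 2.0, -1.0]
bias = -2.0

spam_like = [1, 1, 0]   # one "free", one "txt"
ham_like = [0, 0, 2]    # two "ok"
print(predict(spam_like, weights, bias))  # 1
print(predict(ham_like, weights, bias))   # 0
```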

## Performance of the model

Report train, val, test accuracies and F1 scores.

In [98]:
def accuracy_score(y_true, y_pred): 
    """
    Return the accuracy score = (tp+tn)/(tp+tn+fp+fn).
    Args:
        y_true - actual labels, 0/1, 1-d array
        y_pred - predicted labels, 0/1, 1-d array
    Returns:
        accuracy - accuracy score
    """
    tn, tp, fn, fp = 0,0,0,0
    for i in range(y_true.shape[0]):
        if y_true[i] == 0 and y_pred[i] == 0:
            tn += 1
        elif y_true[i] == 0 and y_pred[i] == 1:
            fp += 1
        elif y_true[i] == 1 and y_pred[i] == 0:
            fn += 1
        elif y_true[i] == 1 and y_pred[i] == 1:
            tp += 1
            
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    return accuracy

def f1_score(y_true, y_pred): 
    """
    Returns F1 score = 2*P*R/(P+R) where P(precision) = tp/(tp+fp) and R(recall) = tp/(tp+fn)
    Args:
        y_true - actual labels, 0/1, 1-d array
        y_pred - predicted labels, 0/1, 1-d array
    Returns:
        f1 - F1 score
    """
    tp, fn, fp = 0,0,0
    for i in range(y_true.shape[0]):
        if y_true[i] == 0 and y_pred[i] == 1:
            fp += 1
        elif y_true[i] == 1 and y_pred[i] == 0:
            fn += 1
        elif y_true[i] == 1 and y_pred[i] == 1:
            tp += 1
            
    P = tp/(tp+fp)
    R = tp/(tp+fn)
    f1 = 2*P*R/(P+R)
    return f1
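
A small worked example may help explain why F1 can sit well below accuracy: on imbalanced data the many true negatives inflate accuracy but do not enter precision or recall. The sketch below uses made-up labels. The helper `safe_f1` (a hypothetical name) uses the equivalent form F1 = 2·tp / (2·tp + fp + fn) and returns 0 instead of raising `ZeroDivisionError` when no positives are predicted, a case the `f1_score` above does not guard against:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_f1(y_true, y_pred):
    """F1 = 2*tp / (2*tp + fp + fn), defined as 0.0 when the denominator is 0."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Made-up labels: 2 spam messages among 10
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy)                   # 0.8
print(safe_f1(y_true, y_pred))    # 0.5
print(safe_f1(y_true, [0] * 10))  # 0.0, no ZeroDivisionError
```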

In [99]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.998, F1 score: 0.994
Validation accuracy: 0.976, F1 score: 0.897
Test accuracy: 0.977, F1 score: 0.914


### Exploration of predictions

Show a few examples with true+predicted labels on the train and val sets.

In [119]:
print('Examples of train set:')
for i in range(5,10):
    print('Train data: {} with true label {} and predicted label {}'.format(train_texts[i], y_train[i], y_train_pred[i]))

Examples of train set:
Train data: Mmmm ... Fuck ... Not fair ! You know my weaknesses ! *grins* *pushes you to your knee's* *exposes my belly and pulls your head to it* Don't forget ... I know yours too *wicked smile* with true label 0 and predicted label 0
Train data: Does she usually take fifteen fucking minutes to respond to a yes or no question with true label 0 and predicted label 0
Train data: If I die I want u to have all my stuffs. with true label 0 and predicted label 0
Train data: K... Must book a not huh? so going for yoga basic on sunday? with true label 0 and predicted label 0
Train data: For ur chance to win a å£250 wkly shopping spree TXT: SHOP to 80878. T's&C's www.txt-2-shop.com custcare 08715705022, 1x150p/wk with true label 1 and predicted label 1


In [131]:
print('Examples of dev set:')
for i in range(5,10):
    print('Dev data: {} with true label {} and predicted label {}'.format(val_texts[i], y_val[i], y_val_pred[i]))

Examples of dev set:
Dev data: Anything lor but toa payoh got place 2 walk meh... with true label 0 and predicted label 0
Dev data: Lol enjoy role playing much? with true label 0 and predicted label 0
Dev data: Never y lei... I v lazy... Got wat? Dat day Ì_ send me da url cant work one... with true label 0 and predicted label 0
Dev data: Dear 0776xxxxxxx U've been invited to XCHAT. This is our final attempt to contact u! Txt CHAT to 86688 150p/MsgrcvdHG/Suite342/2Lands/Row/W1J6HL LDN 18yrs  with true label 1 and predicted label 1
Dev data: What's your room number again? Wanna make sure I'm knocking on the right door with true label 0 and predicted label 0


Among the spam messages from the dev set that the model labels correctly, keywords such as 'txt', 'call', 'mobile', 'stop', and 'reply' appear often. These tokens are frequent enough to be included in `vocab_list`, so the model can learn to associate them with spam and classify messages containing them accordingly. The misclassified messages, by contrast, rarely contain these keywords. They are phrased in more varied ways that do not include 'txt', 'call', 'mobile', or 'stop' directly, which makes them harder to identify as spam.
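
One way to check the keyword observation directly is to rank vocabulary tokens by their learned weight. The sketch below does this for a hypothetical vocabulary and weight list. In the notebook the inputs would presumably be `vocab` and `model.coef_[0]`, and the helper name `top_tokens` is an assumption:

```python
def top_tokens(vocab, weights, k=3):
    """Return the k (token, weight) pairs with the largest weights."""
    ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Invented values for illustration only
vocab = ["txt", "ok", "call", "lor", "stop"]
weights = [2.1, -0.8, 1.4, -1.2, 0.9]

for token, weight in top_tokens(vocab, weights):
    print(f"{token}: {weight:+.1f}")
# txt: +2.1
# call: +1.4
# stop: +0.9
```

Large positive weights push a message toward the spam label, and large negative weights push it toward ham.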

In [133]:
count = 0
for i in range(len(val_texts)):
    if count > 9:
        break
    elif (y_val[i] == 0 and y_val_pred[i] == 1) or (y_val[i] == 1 and y_val_pred[i] == 0) :
        print('Dev: ' + val_texts[i])
        print('Predicted label {} vs true label {}'.format(y_val_pred[i], y_val[i]))
        count += 1

Dev: it to 80488. Your 500 free text messages are valid until 31 December 2005.
Predicted label 0 vs true label 1
Dev: 85233 FREE>Ringtone!Reply REAL
Predicted label 0 vs true label 1
Dev: Misplaced your number and was sending texts to your old number. Wondering why i've not heard from you this year. All the best in your mcat. Got this number from my atlanta friends
Predicted label 1 vs true label 0
Dev: http//tms. widelive.com/index. wml?id=820554ad0a1705572711&first=trueåÁC C RingtoneåÁ
Predicted label 0 vs true label 1
Dev: 500 free text msgs. Just text ok to 80488 and we'll credit your account
Predicted label 0 vs true label 1
Dev: LIFE has never been this much fun and great until you came in. You made it truly special for me. I won't forget you! enjoy @ one gbp/sms
Predicted label 0 vs true label 1
Dev: FreeMsg>FAV XMAS TONES!Reply REAL
Predicted label 0 vs true label 1
Dev: SMS. ac sun0819 posts HELLO:\You seem cool
Predicted label 0 vs true label 1
Dev: Free Msg: Ringtone!From: 

In [134]:
# Correctly labeled spam messages from the dev set
count = 0
for i in range(len(val_texts)):
    if count > 9:
        break
    elif y_val[i] == 1 and y_val_pred[i] == 1:
        print('Dev: ' + val_texts[i])
        print('Predicted label {} vs true label {}'.format(y_val_pred[i], y_val[i]))
        count += 1

Dev: Dear 0776xxxxxxx U've been invited to XCHAT. This is our final attempt to contact u! Txt CHAT to 86688 150p/MsgrcvdHG/Suite342/2Lands/Row/W1J6HL LDN 18yrs 
Predicted label 1 vs true label 1
Dev: Do you want a new Video handset? 750 any time any network mins? UNLIMITED TEXT? Camcorder? Reply or Call now 08000930705 for del Sat AM
Predicted label 1 vs true label 1
Dev: Dear Voucher Holder, To claim this weeks offer, at you PC please go to http://www.e-tlp.co.uk/expressoffer Ts&Cs apply. To stop texts, txt STOP to 80062
Predicted label 1 vs true label 1
Dev: Natalja (25/F) is inviting you to be her friend. Reply YES-440 or NO-440 See her: www.SMS.ac/u/nat27081980 STOP? Send STOP FRND to 62468
Predicted label 1 vs true label 1
Dev: 500 New Mobiles from 2004, MUST GO! Txt: NOKIA to No: 89545 & collect yours today!From ONLY å£1 www.4-tc.biz 2optout 087187262701.50gbp/mtmsg18
Predicted label 1 vs true label 1
Dev: Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE