# Text Analysis
## SMS Spam classification

This uses the [SMS Spam dataset](https://www.dropbox.com/s/373c841oqz3usei/sms-spam.csv?dl=1) ([documentation here](https://www.kaggle.com/uciml/sms-spam-collection-dataset)). 

The goal is to classify an unseen SMS message as spam or not spam (also called ham). We'll examine two ways of doing this: 

  1. extracting numerical features and training a binary classification model 
  2. using just the words themselves to create a language model of spam and a language model of non-spam, then classifying the new instance based on which language model it's closer to

In [1]:
# The following imports a number of libraries that we'll need, as well as
# configures a number of options that will make interacting with the notebook
# a little easier.

import pandas as pd    # For reading and manipulating tabular data.
import seaborn as sns  # For making pretty plots.
import math            # For common math operations, such as checking for NaNs.
import matplotlib.pyplot as plt # For plotting options.
from sklearn.model_selection import train_test_split, KFold # For train/test.
import re # For regular expressions.
from sklearn import metrics # For evaluation
from sklearn import preprocessing


# By default, Pandas will only show the first 20 columns of a dataframe and
# the first 50 characters of a string. These two settings remove those
# restrictions so that all columns and full strings are displayed.
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', None)  # None means no limit; -1 is rejected by newer pandas versions.

## Preliminaries

Here we'll load in the dataset. The training, development, and testing sets are created further down, once the features have been extracted.

In [2]:
sms_data = pd.read_csv('https://www.dropbox.com/s/373c841oqz3usei/sms-spam.csv?dl=1', encoding='latin-1')
sms_data = pd.DataFrame({'spam': sms_data.v1=='spam', 'sms': sms_data.v2})


In [3]:
sms_data.head()

Unnamed: 0,spam,sms
0,False,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,False,Ok lar... Joking wif u oni...
2,True,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,False,U dun say so early hor... U c already then say...
4,False,"Nah I don't think he goes to usf, he lives around here though"


## Binary classifier with numeric features
Here we'll extract a few numeric features per SMS message:

  * `length_chars` (length of message in characters)
  * `length_tokens` (length of message in tokens)
  * `mean_token_length` (mean number of characters per token)
  * `num_distinct_tokens` (number of distinct tokens after case normalization and punctuation removal)
  * `num_capital_chars` (number of capitalized characters in message)
  * `num_capital_tokens` (number of all-caps tokens of two or more letters, after punctuation removal)
  * `num_free` (number of times the token "free" appears, ignoring case)
  * `num_digits` (number of digit characters in message)
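
To make the feature definitions concrete, here is a minimal sketch that computes them for a single message, including `mean_token_length`. The helper `message_features` and the example message are hypothetical and not part of the notebook; it assumes tokens are whatever remains after replacing non-word characters with spaces, and that "all capitalized" means all-caps tokens of two or more letters.

```python
import re

# Hypothetical helper, for illustration only.
NON_WORD = re.compile(r'\W')

def message_features(sms):
    # Replace punctuation with spaces, then split on whitespace.
    tokens = NON_WORD.sub(' ', sms).split()
    return {
        'length_chars': len(sms),
        'length_tokens': len(tokens),
        # Guard against empty messages to avoid dividing by zero.
        'mean_token_length': sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        # Lowercase so "WIN" and "win" count as the same token.
        'num_distinct_tokens': len({t.lower() for t in tokens}),
        'num_capital_chars': len(re.findall(r'[A-Z]', sms)),
        'num_capital_tokens': len(re.findall(r'\b[A-Z]{2,}\b', sms)),
    }

features = message_features("WIN a FREE prize, win now!")
print(features)
```

Applied to a whole column, the same function could be used as `sms_data['sms'].map(message_features)`, which yields one dictionary per message.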
  
  
First up, we'll extract these features for each message.

In [4]:
## Add length-based feature columns to sms_data, all derived from the 'sms' column.
## \W matches any non-word character (anything but letters, digits, or underscore).
NON_ALPHA = re.compile(r'\W')
sms_data['length_chars'] = sms_data['sms'].map(len)
sms_data['length_tokens'] = sms_data['sms'].map(lambda s: len(NON_ALPHA.sub(' ', s).split()))
## Lowercase first so that "Free" and "free" count as one distinct token.
sms_data['num_distinct_tokens'] = sms_data['sms'].map(lambda s: len(set(NON_ALPHA.sub(' ', s.lower()).split())))

## Regular expressions
## \w -- word characters (alpha-numeric plus underscore) ~ [a-zA-Z0-9_]
## [A-Z]
## [A-Z]+ -- match 1 or more capital letters
## [A-Z]* -- match 0 or more capital letters
## ([A-Z]+) -- match and capture all sequences of 1 or more capital letters
## (\b[A-Z]+\b) -- match and capture all sequences of 1 or more capital letters 
##                 that stand alone (i.e., tokens)
CAPITAL_LETTER = re.compile('[A-Z]')
sms_data['num_capital_chars'] = sms_data['sms'].map(
    lambda s: len(CAPITAL_LETTER.findall(s)))

## Finds the number of all caps words that are length 2 or greater.
## (The pattern '\\b[A-Z]+\\b' would also count single letters such as "I" and "U".)
CAPITAL_WORD = re.compile('\\b[A-Z]{2,}\\b')
sms_data['num_capital_tokens'] = sms_data['sms'].map(
    lambda s: len(CAPITAL_WORD.findall(s)))

CONTAINS_FREE = re.compile('\\bfree\\b')
sms_data['num_free'] = sms_data['sms'].map(
    lambda s: len(CONTAINS_FREE.findall(s.lower())))

NUMBERS = re.compile('\\d')
sms_data['num_digits'] = sms_data['sms'].map(
    lambda s: len(NUMBERS.findall(s)))

In [5]:
sms_data.head()

Unnamed: 0,spam,sms,length_chars,length_tokens,num_distinct_tokens,num_capital_chars,num_capital_tokens,num_free,num_digits
0,False,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",111,20,20,3,0,0,0
1,False,Ok lar... Joking wif u oni...,29,6,6,2,0,0,0
2,True,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,155,33,28,10,2,1,25
3,False,U dun say so early hor... U c already then say...,49,11,9,2,0,0,0
4,False,"Nah I don't think he goes to usf, he lives around here though",61,14,13,2,0,0,0


In [6]:
sms_data['label'] = sms_data['spam'].map(lambda s: "spam" if s else "ham")

In [7]:
sms_data2 = sms_data[['length_chars','length_tokens', 'num_distinct_tokens', 'num_capital_chars', 'num_capital_tokens', 'num_free', 'num_digits', 'label']]

Next we'll train a logistic regression classifier and evaluate it over the development set.

In [8]:
## Make our splits.
train_dev, test, train_dev_labels, test_labels = \
    train_test_split(sms_data2, sms_data2.label, test_size=0.30, stratify=sms_data2.label)
train, dev, train_labels, dev_labels = \
    train_test_split(train_dev, train_dev_labels, test_size=0.30, stratify=train_dev_labels)


In [9]:
cols = [x for x in train.columns if x != 'label']
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(train[cols])
train_scaled = train.copy()
train_scaled[cols] = min_max_scaler.transform(train_scaled[cols])

## Scale dev and test with the scaler that was fit on the training set only.
dev_scaled = dev.copy()
dev_scaled[cols] = min_max_scaler.transform(dev_scaled[cols])

test_scaled = test.copy()
test_scaled[cols] = min_max_scaler.transform(test_scaled[cols])

In [10]:
train.to_csv("train.csv", index=False)
dev.to_csv("dev.csv", index=False)
test.to_csv("test.csv", index=False)

train_scaled.to_csv("train_scaled.csv", index=False)
dev_scaled.to_csv("dev_scaled.csv", index=False)
test_scaled.to_csv("test_scaled.csv", index=False)


Before we evaluate the logistic regression model, here's the baseline:

In [11]:
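## Labels are the strings "spam"/"ham", so convert them to 1/0 before scoring.
dev_is_spam = (dev_labels == 'spam').astype(int)
baseline_predicted_labels = dev_is_spam * 0 ## All "not-spam".
print(f'Accuracy: {metrics.accuracy_score(dev_is_spam, baseline_predicted_labels)}')
print(f'AuROC: {metrics.roc_auc_score(dev_is_spam, baseline_predicted_labels)}')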

Accuracy: 0.8658119658119658
AuROC: 0.5
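
The baseline accuracy is simply the share of the majority class in the development set, and a constant prediction cannot rank any spam message above any ham message, which is why its AuROC is 0.5. A minimal sketch of the accuracy part, using a made-up class balance (the helper `majority_baseline_accuracy` and the toy labels are hypothetical):

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

# Made-up class balance, for illustration only.
toy_labels = ['ham'] * 87 + ['spam'] * 13
baseline_accuracy = majority_baseline_accuracy(toy_labels)
print(baseline_accuracy)
```

Any classifier worth keeping has to beat this number, so accuracy alone says little on a dataset this imbalanced; F1 and AuROC are more informative here.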


Here's how the logistic regression model does:

In [12]:
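## The training cell is not shown above; this assumes scikit-learn's
## LogisticRegression with default settings, fit on the scaled training features.
from sklearn.linear_model import LogisticRegression
train_features = train_scaled[cols]
dev_features = pd.DataFrame(min_max_scaler.transform(dev[cols]), columns=cols, index=dev.index)
logistic = LogisticRegression().fit(train_features, train_labels)
predicted_labels = logistic.predict(dev_features)

## Measure performance ("spam" is the positive class):
print(f'Accuracy: {metrics.accuracy_score(dev_labels, predicted_labels)}')
print(f'F1: {metrics.f1_score(dev_labels, predicted_labels, pos_label="spam")}')
print(f'AuROC: {metrics.roc_auc_score(dev_labels == "spam", predicted_labels == "spam")}')


## Feature weights.
sorted(list(zip(train_features.columns, logistic.coef_[0])), key=lambda x: abs(x[1]), reverse=True)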

Accuracy: 0.8957264957264958
F1: 0.5378787878787878
AuROC: 0.7083456467200283


[('length_tokens', -0.7286595764365558),
 ('num_distinct_tokens', 0.6466058705291646),
 ('num_capital_tokens', -0.11943696628693701),
 ('num_capital_chars', 0.06349254714895461),
 ('length_chars', 0.051319354815039665)]

## Language Modeling

Language models are distributions of tokens. For a collection of documents (think: a collection of SMS messages), we extract all of the tokens and for each distinct token we calculate the following numbers:

  * how often does it occur in the collection
  * in how many different documents does it occur
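
Both statistics can be gathered in one pass over the collection. The following is a minimal sketch using `collections.Counter`; the helper `token_statistics` and the two example messages are hypothetical, and the tokenization (lowercasing, with non-word characters replaced by spaces) is an assumption.

```python
import re
from collections import Counter

def token_statistics(documents):
    """Returns (collection frequency, document frequency) for every token."""
    collection_freq = Counter()  # How often each token occurs overall.
    document_freq = Counter()    # In how many documents each token occurs.
    for text in documents:
        tokens = re.sub(r'\W', ' ', text).lower().split()
        collection_freq.update(tokens)
        # A set counts each token at most once per document.
        document_freq.update(set(tokens))
    return collection_freq, document_freq

cf, df = token_statistics(["Free prize, free entry", "Are you free tonight?"])
print(cf['free'], df['free'])
```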

Language models are useful in classification tasks such as spam filtering because we can model what spammy messages look like and model what non-spammy messages look like. Given a new message, we can calculate which model, spam or non-spam, the message seems more like.
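
Read off the `queryLikelihood` code in this notebook, the score of a message $Q$ against a model $D$ is a query likelihood with Dirichlet-prior smoothing. Written out, with $C$ denoting all models pooled together and $|\cdot|$ the number of tokens:

$$
\text{score}(Q, D) = \sum_{q \in Q} \left[ \log\left( \mathrm{tf}(q, D) + \mu \, \frac{\mathrm{tf}(q, C)}{|C|} \right) - \log\left( |D| + \mu \right) \right]
$$

Here $\mathrm{tf}(q, \cdot)$ is the count of token $q$, floored at the small constant `oov` so that unseen tokens never produce $\log 0$, and $\mu$ controls how strongly each model is pulled toward the pooled collection. The message is assigned to whichever model gives the higher score.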


In [52]:
import math
NON_ALPHA = re.compile(r'\W')  # Raw string avoids an invalid-escape warning.


def tokenize(text):
    return NON_ALPHA.sub(' ', text).lower().split()

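def buildModel(texts):
    """Pools the given texts into one 'document': token counts plus total size."""
    tfdict = {}
    total = 0  # Named 'total' to avoid shadowing the built-in sum().

    for word in tokenize(' '.join(texts)):
        if word not in tfdict:
            tfdict[word] = 0

        tfdict[word] += 1
        total += 1

    return {'tf': tfdict, 'size': total}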

def tf(token, documents, oov=0.05):
    """Total count of token across documents, floored at oov for unseen tokens."""
    total = 0
    for document in documents:
        if token in document['tf']:
            total = total + document['tf'][token]

    return max(total, oov)
#     return max(oov, sum([document['tf'][token] for document in documents if token in document['tf']]))

def queryLikelihood(query, documents, mu=None, oov=0.05):
    queryTokens = tokenize(query)
    collectionSize = sum([document['size'] for document in documents])*1.0
    scores = []
    if mu is None:
        mu = collectionSize / len(documents)
        
    for i,D in enumerate(documents):
        doc_score = 0
        for q in queryTokens:
            doc_score += math.log(
                (tf(q, [D], oov) + (mu*tf(q, documents, oov)/collectionSize))) - math.log(
                D['size'] + mu)
        scores.append(doc_score)
        
    return scores

def classifySMS(trainSMS, trainLabels, testSMS):
    """trainLabels must be a boolean Series aligned with trainSMS (True = spam)."""
    spamDocument = buildModel(trainSMS[trainLabels].tolist())
    hamDocument = buildModel(trainSMS[~trainLabels].tolist())

    def isSpam(sms):
        ## scores[0] is the spam model's log-likelihood, scores[1] the ham model's.
        scores = queryLikelihood(sms, [spamDocument, hamDocument], mu=20)
        return scores[0] > scores[1]

    return testSMS.map(isSpam)


In [53]:
doc1 = buildModel(["Hello there! How are you?"])
doc2 = buildModel(["I'm doing great, how are you?"])
#tf('how', [doc1, doc2])
#queryLikelihood("I'm doing great", [doc1, doc2])
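
To see the scoring at work on these two toy documents, here is a compact, self-contained sketch of the same idea. The names `build_model` and `score` are hypothetical stand-ins for the notebook's functions, and `mu=6` is chosen as the average document size (12 tokens over 2 documents), mirroring the default that `queryLikelihood` computes.

```python
import math
import re
from collections import Counter

def build_model(texts):
    """Token counts for the pooled texts (lowercased, punctuation removed)."""
    return Counter(re.sub(r'\W', ' ', ' '.join(texts)).lower().split())

def score(query, model, all_models, mu, oov=0.05):
    """Dirichlet-smoothed log-likelihood of query under model."""
    collection = sum(all_models, Counter())  # Pool every model's counts.
    collection_size = sum(collection.values())
    doc_size = sum(model.values())
    total = 0.0
    for q in re.sub(r'\W', ' ', query).lower().split():
        doc_tf = max(model[q], oov)        # Floor unseen tokens at oov.
        coll_tf = max(collection[q], oov)
        total += (math.log(doc_tf + mu * coll_tf / collection_size)
                  - math.log(doc_size + mu))
    return total

toy1 = build_model(["Hello there! How are you?"])
toy2 = build_model(["I'm doing great, how are you?"])
score1 = score("I'm doing great", toy1, [toy1, toy2], mu=6)
score2 = score("I'm doing great", toy2, [toy1, toy2], mu=6)
print(score1, score2)
```

The second score is the larger (less negative) one, since every query token appears in the second document and none appears in the first.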

In [54]:
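## The splits above only carry the numeric features, so recover the message
## text and the boolean spam flags for the same rows via the shared index.
train_sms = sms_data.loc[train.index]
dev_sms = sms_data.loc[dev.index]
dev_is_spam = dev_sms['spam']
predicted_labels = classifySMS(train_sms['sms'], train_sms['spam'], dev_sms['sms'])

In [55]:
print(f'Accuracy: {metrics.accuracy_score(dev_is_spam, predicted_labels)}')
print(f'F1: {metrics.f1_score(dev_is_spam, predicted_labels)}')
print(f'AuROC: {metrics.roc_auc_score(dev_is_spam, predicted_labels)}')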



Accuracy: 0.9700854700854701
F1: 0.8973607038123168
AuROC: 0.971960060613301
