# Block 2: Lexical Level. SMS Spam Filtering.

## Data
We start by downloading the provided dataset.

Source: [UCI repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)
It consists of 5574 instances of SMS messages (aggregated from different sources) belonging to either the class 'ham' or the class 'spam'.

In [1]:
!wget -nc https://gebakx.github.io/ihlt/b2/resources/smsspamcollection.zip

--2019-11-05 12:49:02--  https://gebakx.github.io/ihlt/b2/resources/smsspamcollection.zip
Resolving gebakx.github.io (gebakx.github.io)... 185.199.110.153, 185.199.109.153, 185.199.108.153, ...
Connecting to gebakx.github.io (gebakx.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/zip]
Saving to: ‘smsspamcollection.zip’


2019-11-05 12:49:03 (2,43 MB/s) - ‘smsspamcollection.zip’ saved [203415/203415]



In [2]:
!unzip smsspamcollection.zip

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [2]:
with open('SMSSpamCollection', 'r') as f:
    raw_data = f.readlines()

In [21]:
print('The dataset effectively has', len(raw_data), 'lines', 'with the following class distributions')
labeled_data = []
freqs = {}
for line in raw_data:
    label, text = line.split('\t')
    labeled_data.append((label, text))
    # The first occurrence of a label counts as 1.
    if label in freqs:
        freqs[label] += 1
    else:
        freqs[label] = 1
for key, value in freqs.items():
    percentage = 100*value/len(raw_data)
    print(key + ':', value, 'of', len(raw_data), '(' + "{0:.2f}".format(percentage) + '%)')

The dataset effectively has 5574 lines with the following class distributions
spam: 747 of 5574 (13.40%)
ham: 4827 of 5574 (86.60%)


The dataset is imbalanced, which poses a challenge when applying machine learning techniques. If we optimize accuracy alone, we may end up with low precision or recall for the minority class ('spam'). We will look at this point in more detail later on.
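
To make this concern concrete, here is a minimal sketch that scores a trivial classifier which always predicts 'ham'. The sample (87 ham, 13 spam) is hypothetical and only roughly mirrors the class proportions reported above; the helper `precision_recall` is our own name, not part of any library.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # By convention we report 0.0 when a ratio is undefined (no predictions / no positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical sample, roughly matching the class proportions of the dataset.
y_true = ['ham'] * 87 + ['spam'] * 13
y_pred = ['ham'] * 100  # baseline: always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_precision, spam_recall = precision_recall(y_true, y_pred, 'spam')
print(accuracy, spam_precision, spam_recall)
```

The baseline reaches high accuracy while never detecting a single spam message, which is why accuracy should not be the only metric we look at.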

## Preprocessing

To make the task easier for the machine learning algorithm and to reduce noise from variation that is probably irrelevant to this problem, we will:
- Remove punctuation.
- Convert characters to lowercase.

Ideally, we would take both punctuation and case into account, since texts with many capital letters and exclamation marks may be more likely to be spam. However, because the dataset is small, keeping them would enlarge the vocabulary and make learning more difficult.

In addition, we will build a second version of the preprocessed dataset with lemmas instead of the original words. Lemmatization maps each word to its canonical form, which reduces the variance introduced by morphology. Since we will use representations that ignore word order, such as bag of words, shrinking the vocabulary and treating all words with the same lemma as equal may make learning easier. We suspect that lemmas will give better results, but the experiments will have to confirm this.

In [37]:
import nltk
import string
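# Required NLTK resources (exact resource names may vary between NLTK versions):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')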
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

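def remove_punctuation(token):
    """Drop every ASCII punctuation character from a token."""
    return ''.join(c for c in token if c not in string.punctuation)

def lemmatize(p):
    """Lemmatize a (word, POS tag) pair; only nouns and verbs are reduced."""
    if p[1][0] in {'N','V'}:
        return wnl.lemmatize(p[0].lower(), pos=p[1][0].lower())
    return p[0]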

preprocessed_data = []
for label, text in labeled_data:
    tokenized = nltk.word_tokenize(text.lower())
    tokenized = [remove_punctuation(tok) for tok in tokenized]
    tokenized = list(filter(None, tokenized))
    pos_tags = pos_tag(tokenized)
    lemmas = [lemmatize(pair) for pair in pos_tags]
    preprocessed_data.append((label, tokenized, lemmas))
        

## Experiment design
The assignment suggests a "single validation (50% - 50%)" with a random shuffle. We take validation here to mean testing, so we will use a 50-50 train-test split, with a fixed random seed to make the experiments reproducible. To select the model that generalizes best, we will additionally run a 5-fold cross-validation on the training half. The test set will only be used for the final evaluation of the selected model.
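
The 5-fold split of the training half can be sketched as follows. This is a minimal, standard-library illustration; the function name `k_fold_indices` and its seed are our own assumptions, and the split is not stratified (given the class imbalance, a stratified variant such as scikit-learn's `StratifiedKFold` may be preferable).

```python
import random

def k_fold_indices(n_samples, k=5, seed=1):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# 2787 is the size of the training half after the 50-50 split.
folds = list(k_fold_indices(2787, k=5))
print([len(val_idx) for _, val_idx in folds])
```

Every message of the training half is used for validation exactly once, and the test half is never touched during model selection.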

In the cross-validation, we will optimize certain hyperparameters and experiment with different options:
- The 'C' hyperparameter in SVMs, which controls how strongly training errors (points that violate the margin) are penalized.
- Whether to use lemmas or words.
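
The search over these two options can be sketched as an exhaustive grid. The candidate values for 'C' and the function `dummy_cv_score` below are placeholders of our own: in the real experiment, the score would be the mean 5-fold validation score of an SVM trained with the given configuration.

```python
from itertools import product

def grid_search(param_grid, cv_score):
    """Return (best_params, best_score) over the Cartesian product of param_grid."""
    names = list(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative candidate values, not the ones used in the experiments.
param_grid = {'C': [0.1, 1.0, 10.0], 'use_lemmas': [False, True]}

def dummy_cv_score(C, use_lemmas):
    # Placeholder with arbitrary values: it stands in for the mean
    # cross-validation score and does not reflect any real result.
    return -abs(C - 1.0) + (0.5 if use_lemmas else 0.0)

best_params, best_score = grid_search(param_grid, dummy_cv_score)
print(best_params, best_score)
```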

In [61]:
import random

shuffled_preprocessed_data = preprocessed_data.copy()
random.seed(1)
random.shuffle(shuffled_preprocessed_data)
n = len(shuffled_preprocessed_data)
train, test = shuffled_preprocessed_data[:n//2], shuffled_preprocessed_data[n//2:]
len(train), len(test)

(2787, 2787)

## Text representation
The way text is represented is crucial when applying machine learning techniques to natural language, and each algorithm typically expects a particular input format. In our case, we want to define a custom kernel for sets in SVMs, for reasons we will see later on. We will therefore work with sets, and the text representation required by our algorithm is a binary bag of words: a boolean vector with one element per indexed word, set to true if the word is present in the text and to false otherwise.

This representation has some caveats. For instance, unlike other representations, it does not account for word frequency. However, it is the one required by the algorithm of our choice, and frequency is perhaps less important in short texts. Among the proposed representations, term frequency times inverse document frequency (tf-idf) seems to be the most robust one, since it downweights words that are common across documents.
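
For comparison, a small worked example of tf-idf weighting follows. It assumes one common variant (raw term frequency times log(N / df)); other variants exist, and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [['free', 'prize', 'call', 'now'],
        ['call', 'me', 'now'],
        ['free', 'free', 'entry', 'call']]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

The word 'call' appears in every document and therefore gets weight zero, whereas a boolean bag of words would treat it like any other present word.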

In order to obtain an honest evaluation of the method, only words in the train set will be indexed. 
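
The boolean representation and the train-only indexing can be illustrated on a toy example. The helper names `build_vocabulary` and `to_boolean_vector` are ours, used only for this sketch; the notebook itself indexes words through an NLTK `FreqDist`.

```python
def build_vocabulary(train_texts):
    """Map each word seen in the training texts to a column index."""
    vocab = {}
    for tokens in train_texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_boolean_vector(tokens, vocab):
    """Boolean bag of words; tokens missing from vocab are ignored."""
    vec = [False] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = True
    return vec

train_texts = [['win', 'a', 'prize'], ['call', 'me']]
vocab = build_vocabulary(train_texts)
print(vocab)

# 'taxi' never occurs in the training texts, so it has no column;
# the repeated 'a' is recorded only as present.
print(to_boolean_vector(['call', 'a', 'taxi', 'a'], vocab))
```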

In [62]:
from itertools import chain
# Flatten the per-message lists into a single list of training words (and lemmas).
train_tokens = list(chain.from_iterable(tokens for label, tokens, lemmas in train))
fdist_tokens = nltk.FreqDist(train_tokens)
train_lemmas = list(chain.from_iterable(lemmas for label, tokens, lemmas in train))
fdist_lemmas = nltk.FreqDist(train_lemmas)
fdist_tokens, fdist_lemmas

(FreqDist({'i': 1431, 'to': 1126, 'you': 1091, 'a': 714, 'the': 676, 'u': 559, 'and': 478, 'is': 470, 'in': 438, 'me': 405, ...}),
 FreqDist({'i': 1431, 'be': 1229, 'to': 1126, 'you': 1091, 'a': 714, 'the': 676, 'u': 559, 'and': 478, 'in': 438, 'do': 437, ...}))

In [63]:
import numpy as np

def get_bow_from_freq(fdist, tokens):
    """Boolean vector with one entry per indexed word: True if it occurs in tokens."""
    present = set(tokens)  # set membership is O(1) per lookup
    return [key in present for key in fdist]

def build_mat(fdist, texts):
    """Stack the boolean vectors of all texts into an (n_texts, n_words) array."""
    return np.array([get_bow_from_freq(fdist, text) for text in texts])

# Each row of the matrices corresponds to one message, so build_mat receives
# the per-message token lists rather than the flattened train_tokens list.
y_train = [label for label, tokens, lemmas in train]
y_test = [label for label, tokens, lemmas in test]
X_train_tokens = build_mat(fdist_tokens, [tokens for label, tokens, lemmas in train])
X_train_lemmas = build_mat(fdist_lemmas, [lemmas for label, tokens, lemmas in train])
X_test_tokens = build_mat(fdist_tokens, [tokens for label, tokens, lemmas in test])
X_test_lemmas = build_mat(fdist_lemmas, [lemmas for label, tokens, lemmas in test])
# Inspect the held-out messages.
test

[('U still painting ur wall?\n',
  ['u', 'still', 'painting', 'ur', 'wall'],
  ['u', 'still', 'paint', 'ur', 'wall']),
 ('We can go 4 e normal pilates after our intro...  \n',
  ['we', 'can', 'go', '4', 'e', 'normal', 'pilates', 'after', 'our', 'intro'],
  ['we', 'can', 'go', '4', 'e', 'normal', 'pilate', 'after', 'our', 'intro']),
 ("This pen thing is beyond a joke. Wont a Biro do? Don't do a masters as can't do this ever again! \n",
  ['this',
   'pen',
   'thing',
   'is',
   'beyond',
   'a',
   'joke',
   'wont',
   'a',
   'biro',
   'do',
   'do',
   'nt',
   'do',
   'a',
   'masters',
   'as',
   'ca',
   'nt',
   'do',
   'this',
   'ever',
   'again'],
  ['this',
   'pen',
   'thing',
   'be',
   'beyond',
   'a',
   'joke',
   'wont',
   'a',
   'biro',
   'do',
   'do',
   'nt',
   'do',
   'a',
   'master',
   'as',
   'ca',
   'nt',
   'do',
   'this',
   'ever',
   'again']),
 ('Yup i thk they r e teacher said that will make my face look longer. Darren ask me not 2 cut 

## Machine learning
As anticipated, among the methods suggested in the assignment, our machine learning algorithm of choice is the SVM. In particular, we will define a custom kernel that measures the similarity between two given sets of tokens (i.e., the respective bags of words of two given texts).
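
The text does not fix a particular set similarity, so the sketch below shows two common candidates, the intersection size and the Jaccard index, computed directly on token lists. All function names are ours. A Gram matrix built this way could, for instance, be passed to scikit-learn's `SVC(kernel='precomputed')`.

```python
def intersection_kernel(a, b):
    """Number of distinct tokens shared by two texts."""
    return len(set(a) & set(b))

def jaccard_kernel(a, b):
    """Shared tokens divided by all distinct tokens (0.0 if both texts are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gram_matrix(texts_a, texts_b, kernel):
    """Kernel value for every pair, as a len(texts_a) x len(texts_b) nested list."""
    return [[kernel(a, b) for b in texts_b] for a in texts_a]

# Toy messages, made up for illustration.
texts = [['free', 'prize', 'call'], ['call', 'me'], ['free', 'prize', 'now']]
print(gram_matrix(texts, texts, intersection_kernel))
print(gram_matrix(texts, texts, jaccard_kernel))
```

On boolean bag-of-words vectors, the intersection size equals the dot product of the two vectors, so it coincides with a linear kernel on that representation.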

In [31]:
# Toy check of the custom-kernel interface on the iris data. The kernel
# receives two sample matrices and must return their Gram matrix, of shape
# (n_samples_X, n_samples_Y). For a plain linear kernel on a data matrix A,
# the Gram matrix is B = A @ A.T.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
Y = iris.target


def my_kernel(X, Y):
    """
    We create a custom kernel:

                 (2  0)
    k(X, Y) = X  (    ) Y.T
                 (0  1)
    """
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(X, M), Y.T)

print(X.shape, Y.shape)
h = .02  # step size in the mesh

# we create an instance of SVM and fit our data.
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolors='k')
plt.title('3-Class classification using Support Vector Machine with custom'
          ' kernel')
plt.axis('tight')
plt.show()

(150, 2) (150,)