# Text classification

Your assignment is to build a classifier for SMS texts that detects which messages are spam and which are not. In the dataset, the label ham marks messages that are not spam.

To classify text, it is common to transform it into a vector representation. In a simple bag-of-words representation, every word in the collection is a feature, and for each document we simply count how often each word appears. So if our dictionary consists of [ i, am, hungry, and, thirsty ], then the text "I am hungry" becomes (1, 1, 1, 0, 0), the text "I am thirsty" becomes (1, 1, 0, 0, 1), and the text "I am hungry and I am thirsty" becomes (2, 2, 1, 1, 1). Since the texts are now numbers, we can train a classifier like before.
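The counting described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what `CountVectorizer` does internally; the helper name `bag_of_words` is made up for this example, and the texts are split on whitespace only.

```python
from collections import Counter

# Dictionary from the example above; the order fixes the vector positions.
vocabulary = ["i", "am", "hungry", "and", "thirsty"]

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return tuple(counts[word] for word in vocabulary)

texts = ["I am hungry", "I am thirsty", "I am hungry and I am thirsty"]
for text in texts:
    print(text, "->", bag_of_words(text, vocabulary))
```

Each text becomes a tuple with one count per dictionary word, which is the shape of input a classifier expects.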

# Text parsing

Several decisions affect the vectorization of text. Commonly, sentences are split on whitespace and punctuation marks to get words. Words are often lowercased and reduced to their stem (e.g. walk, walked and walking are all converted to the stem 'walk'), and a list of relatively meaningless words, the so-called 'stopwords', is removed.
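A rough sketch of these parsing steps, assuming a toy stopword list and a very crude suffix-stripping rule in place of a real stemmer. The names `STOPWORDS`, `SUFFIXES`, `naive_stem` and `preprocess` are invented for this illustration.

```python
import re

# Toy stopword list and suffix rules, chosen only for this illustration.
STOPWORDS = {"i", "am", "the", "a", "and", "to", "is"}
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split on anything that is not a letter, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOPWORDS]

print(preprocess("I walked, and I am walking to the park!"))
```

Real stemmers handle many more cases than this suffix rule, so the sketch is meant to show the order of the steps rather than their exact behaviour.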

In [1]:
import pandas as pd
import numpy as np

In [9]:
df = pd.read_csv("/data/datasets/spam.csv", encoding = 'latin-1')

#### Check the DataFrame. For some reason the import created 3 empty columns; remove them.

In [10]:
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

#### Convert the categories in column v1 to numbers. Since we want to detect spam, it makes sense to use 1 for spam.

In [11]:
df['v1'] = df.v1.apply(lambda x: x == 'spam') * 1

In [32]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=5, encoding='latin-1', stop_words='english')

In [33]:
features = vectorizer.fit_transform(df.v2).toarray()

In [34]:
features

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

#### Look at the shape: how many texts are there, and how many words are in the dictionary?

In [35]:
features.shape

(5572, 1602)

#### Use n-fold cross validation to compute the recall. Take the average recall over the folds. Depending on the number of splits, you should see a recall of around 87% for n=10. What does the recall stand for?

In [79]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score

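kf = KFold(n_splits=10)  # unshuffled folds; random_state is only valid together with shuffle=True
y = df.v1.to_numpy()
recall = []
for train, valid in kf.split(y):
    # Select the rows of this fold's training and validation parts
    train_X = features[train]
    valid_X = features[valid]
    train_y = y[train]
    valid_y = y[valid]

    model = LogisticRegression(solver='liblinear')
    model.fit(train_X, train_y)
    pred_y = model.predict(valid_X)
    # Recall on the held-out fold, with spam (1) as the positive class
    recall.append(recall_score(valid_y, pred_y))
# Average recall over the 10 folds
sum(recall)/len(recall)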

0.8724424466766699
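As a reminder of what the number above measures: recall is the share of actual spam messages that the model flags as spam, while precision is the share of flagged messages that really are spam. A minimal sketch on made-up labels follows; the helper `recall_and_precision` and the toy label lists are assumptions for illustration, and `recall_score` and `precision_score` compute the same quantities for binary labels.

```python
def recall_and_precision(true_labels, predicted_labels):
    """Compute recall and precision for the positive class 1 (hypothetical helper)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Toy labels: 4 spam messages (1) and 6 ham messages (0).
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

recall_value, precision_value = recall_and_precision(true_labels, predicted)
print(recall_value, precision_value)  # 0.75 0.6
```

Here 3 of the 4 spam messages are caught (recall 0.75), and 3 of the 5 flagged messages are spam (precision 0.6).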

#### Show the frequency of spam and ham.

In [38]:
df.v1.value_counts()

0    4825
1     747
Name: v1, dtype: int64
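The counts above also show why accuracy alone can mislead on this dataset. As a small worked example, consider a hypothetical classifier that labels every message as ham: it is often right, yet it finds no spam at all. The variable names are chosen for this illustration.

```python
# Class counts taken from the value_counts() output above.
ham, spam = 4825, 747
total = ham + spam

spam_share = spam / total        # fraction of messages that are spam
accuracy_all_ham = ham / total   # accuracy of always predicting ham
recall_all_ham = 0 / spam        # no spam message is ever caught

print(f"spam share:        {spam_share:.3f}")        # 0.134
print(f"accuracy, all ham: {accuracy_all_ham:.3f}")  # 0.866
print(f"recall, all ham:   {recall_all_ham:.3f}")    # 0.000
```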

#### Since there is a big skew in the dataset, try to balance the training set and repeat the experiment. See what happens to the recall. You should see a big improvement.

In [80]:
pos = np.where(y == 1)[0]

In [81]:
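# Oversample spam: append every spam row six more times (7 * 747 = 5229 spam vs. 4825 ham rows).
# Caveat: the copies are added before the split, so the same spam message can end up in both
# a training and a validation fold, which inflates the recall measured below.
ind = np.hstack([range(len(df)), pos, pos, pos, pos, pos, pos])
X = features[ind]
y = df.v1[ind].to_numpy()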

In [82]:
kf = KFold(n_splits=10)  # unshuffled folds; random_state is only valid together with shuffle=True
recall = []
precision = []
for train, valid in kf.split(y):
    train_X = X[train]
    valid_X = X[valid]
    train_y = y[train]
    valid_y = y[valid]
    
    model = LogisticRegression(solver='liblinear')
    model.fit(train_X, train_y)
    pred_y = model.predict(valid_X)
    recall.append(recall_score(valid_y, pred_y))
    precision.append(precision_score(valid_y, pred_y))

sum(recall)/len(recall)

0.9725281753273454

#### Also compute the precision.

In [None]:
sum(precision)/len(precision)
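The balanced experiment duplicates the spam rows before the folds are formed, so copies of a validation message can also appear in the training data. One possible alternative, sketched here on toy labels in plain Python, is to split first and oversample only inside each training fold. The helpers `kfold_indices` and `oversample_positives` are hypothetical and only mimic contiguous, unshuffled folds; this is not the approach used in the cells above, and no results for the SMS data are implied.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, valid) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

def oversample_positives(train_idx, labels, copies):
    """Append `copies` extra copies of every positive training index."""
    positives = [i for i in train_idx if labels[i] == 1]
    return train_idx + positives * copies

# Toy labels: 3 positives among 12 samples.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

folds = []
for train, valid in kfold_indices(len(labels), n_splits=3):
    balanced_train = oversample_positives(train, labels, copies=2)
    folds.append((balanced_train, valid))
    # The validation rows never appear in the training indices.
    print(len(balanced_train), "training rows,", len(valid), "validation rows")
```

In the notebook, the index lists would select rows in the same way as `features[train]` and `y[train]`, with the validation fold left at its original class balance.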