## 1.Review the dataset

In this step we review the dataset produced in the previous phase and prepare it for processing and learning. The data is stored in Dataset.csv, a comma-separated file, so we import pandas and load it into a DataFrame with the read_csv function.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('Dataset.csv', encoding='utf-8')
df.head()

Unnamed: 0.1,Unnamed: 0,Message,Tag
0,0,سلام\r\n عیدی شما آماده است. عدد 1 را به شماره...,Spam
1,1,رفاه گیلان\r\nچای تشریفات47%\r\nروغن لادن15%\r...,Spam
2,2,درست و اصولی لاغر شوید\r\n*غیرحضوری*\r\n\r\nک...,Spam
3,3,خبرهای هیجان انگیز و جنجالی برای علاقمندان به ...,Spam
4,4,هدیه ویژه نوروزی برای تمام مشترکین سرویس تلگرا...,Spam


The DataFrame has three columns, of which we only need the second and third: Message and Tag.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416 entries, 0 to 415
Data columns (total 3 columns):
Unnamed: 0    416 non-null int64
Message       416 non-null object
Tag           416 non-null object
dtypes: int64(1), object(2)
memory usage: 9.8+ KB


We have 416 samples, each tagged as either Spam or Non Spam.

In [4]:
df['Tag'].value_counts()

Spam        236
Non Spam    180
Name: Tag, dtype: int64

There are more spam messages than non-spam ones, so we can expect the model to detect spam somewhat better than non-spam.
We save the Message column as a series to use later.

In [5]:
raw_messages = df['Message']

Because we use scikit-learn, and to ensure compatibility with its models, we need to encode the class labels as integers.
We will use sklearn.preprocessing.LabelEncoder to encode all tags.

In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
le = LabelEncoder()
labels = le.fit_transform(df['Tag'])
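It helps to know which integer each tag receives, because scikit-learn metrics such as the F1 score treat label 1 as the positive class by default. LabelEncoder sorts the distinct labels before numbering them, so the mapping can be reproduced in a few lines of plain Python. The sketch below uses a hypothetical `sample_tags` list; on the fitted encoder the same ordering should be visible through `le.classes_`.

```python
# Hypothetical sample of tags, for illustration only.
sample_tags = ['Spam', 'Non Spam', 'Spam', 'Non Spam', 'Spam']

# Sorted unique labels get consecutive integers, starting at 0.
classes = sorted(set(sample_tags))
mapping = {tag: code for code, tag in enumerate(classes)}
encoded = [mapping[tag] for tag in sample_tags]

print(mapping)   # {'Non Spam': 0, 'Spam': 1}
print(encoded)   # [1, 0, 1, 0, 1]
```

So in this dataset Non Spam is encoded as 0 and Spam as 1.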

Next we prepare the stop-word list for the preprocessing step that follows.
The Persian stop-word list is stored in a file named stop-words.txt, one word per line.
Using pandas, we load it into a DataFrame.

In [8]:
stop_words = pd.read_csv('stop-words.txt', encoding='utf-8', delimiter='\n', header=None)
stop_words.head()

Unnamed: 0,0
0,اتفاقا
1,احتراما
2,احتمالا
3,اري
4,آري


## 2.Text Preprocessing
We are going to use words as features (an n-gram language model) and count their occurrences. Applied naively, this strategy produces a large number of features, many of which are not useful: the classifier would take too long to train and would likely overfit. So we apply the following preprocessing steps:

### Normalization
Many spam SMS messages contain phone numbers, URLs or even email addresses, so we use regular expressions to replace each of them with a keyword.<br>
<ul>
    <li>Replace <b>phone numbers</b> with <code>شماره_تلفن</code></li>
    <li>Replace <b>URLs</b> with <code>آدرس_لینک</code></li>
    <li>Replace <b>email addresses</b> with <code>آدرس_ایمیل</code></li>
    <li>Replace any remaining <b>numbers</b> with <code>عدد_رقم</code></li>
</ul><br>
Before applying the regular expressions we must convert all Persian digits to English digits, which is done by a function named <code>numbers_to_english()</code>.<br>

```python
text = numbers_to_english(text)
text = re.sub(email_regex, 'آدرس_ایمیل', text)
text = re.sub(phone_regex, 'شماره_تلفن', text)
text = re.sub(url_regex, 'آدرس_لینک', text)
text = re.sub(number_regex, 'عدد_رقم', text)
```

We use Hazm to normalize the text and its special characters.<br>

```python
normalizer = Normalizer()
text = normalizer.normalize(text)
```

### Stemming
For Persian stemming, Hazm provides a <code>Stemmer</code> class.<br>
The stemmer reduces each word to its stem; for example, it replaces 'کتاب‌ها' with 'کتاب'.

```python
stemmer = Stemmer()
for index, term in enumerate(tokens):
    tokens[index] = stemmer.stem(term)
```

### Stop words
Some words in Persian, while grammatically necessary, contribute little to the meaning of a phrase. These words, such as 'از' and 'احتراما', are called <b>stop words</b>. They can distort the results and should be filtered out.<br>
<div style="border:1px solid #cfcfcf;border-radius: 2px;background: #f7f7f7;line-height: 1.21429em;padding:4px;">
text = ' '.join(term for term in tokens if term not in stop_words.values)
</div>
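As a minimal sketch of the filtering step, the example below uses a small hypothetical stop-word set and token list rather than the notebook's data. A Python set gives constant-time membership checks on average; with the DataFrame loaded earlier, something like `set(stop_words[0])` would presumably build the equivalent set.

```python
# Hypothetical stop-word set and tokens, for illustration only.
stop_word_set = {'از', 'به', 'که', 'احتراما'}
tokens = ['هدیه', 'ویژه', 'از', 'فروشگاه', 'به', 'شما']

# Keep only the tokens that are not stop words, preserving their order.
kept = [term for term in tokens if term not in stop_word_set]
text = ' '.join(kept)
print(text)
```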

Putting these steps together, the final preprocessing step has the following code:

In [9]:
import re
from hazm import Normalizer, Stemmer, word_tokenize

In [10]:
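# Raw strings keep the regex backslashes intact; the patterns themselves are unchanged.
# Note: the ^...$ anchors make email_regex match only when the whole text is an address.
email_regex = r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
phone_regex = r"((\+98|0)?9\d{9})|(0\d{2}\d{8}|\d{8})"
# number_regex needs at least two digits, so single digits are left as they are.
number_regex = r"[1-9]\d+"
url_regex = r"(@^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS)|(t.me/[a-z|0-9]{4,})"\
    r"|((https?://)?(w{3}.)?[a-zA-Z0-9]+.[a-zA-Z]{2,}(/[a-zA-Z0-9]*)*)"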


def numbers_to_english(text):
    text = text.replace('۰', '0')
    text = text.replace('۱', '1')
    text = text.replace('۲', '2')
    text = text.replace('۳', '3')
    text = text.replace('۴', '4')
    text = text.replace('۵', '5')
    text = text.replace('۶', '6')
    text = text.replace('۷', '7')
    text = text.replace('۸', '8')
    text = text.replace('۹', '9')
    return text


def preprocessing_text(text):
    text = numbers_to_english(text)
    text = re.sub(email_regex, 'آدرس_ایمیل', text)
    text = re.sub(phone_regex, 'شماره_تلفن', text)
    text = re.sub(url_regex, 'آدرس_لینک', text)
    text = re.sub(number_regex, 'عدد_رقم', text)
    normalizer = Normalizer()
    text = normalizer.normalize(text)
    tokens = word_tokenize(text)    
    stemmer = Stemmer()
    for index, term in enumerate(tokens):
        tokens[index] = stemmer.stem(term)
    text = ' '.join(term for term in tokens if term not in stop_words.values)
    return text
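
The order of the substitutions matters: phone numbers and URLs have to be replaced before plain numbers, otherwise the number pattern would consume them first. The sketch below illustrates this with deliberately simplified, hypothetical patterns (`PHONE`, `URL`, `NUMBER`) and a made-up sample message; they are not the notebook's regexes.

```python
import re

# Simplified, hypothetical patterns for illustration only.
PERSIAN_DIGITS = str.maketrans('۰۱۲۳۴۵۶۷۸۹', '0123456789')
PHONE = re.compile(r'(?:\+98|0)?9\d{9}')
URL = re.compile(r'https?://\S+')
NUMBER = re.compile(r'\d+')


def normalize_demo(text):
    text = text.translate(PERSIAN_DIGITS)   # Persian digits -> ASCII digits
    text = PHONE.sub('شماره_تلفن', text)    # phones before plain numbers
    text = URL.sub('آدرس_لینک', text)       # URLs before plain numbers
    text = NUMBER.sub('عدد_رقم', text)      # whatever digits remain
    return text


sample = 'تخفیف ۵۰ درصد! تماس: ۰۹۱۲۳۴۵۶۷۸۹ یا http://example.com/a1'
print(normalize_demo(sample))
```

Running the number pattern first would turn the phone number into `عدد_رقم`, and the phone keyword would never appear.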


Now we preprocess all the messages and store them in a list named documents, ready for learning.

In [11]:
documents = []
for content in raw_messages:
    content = preprocessing_text(content)
    documents.append(content)
documents[0:5]

['سلا عید آماده اس . عدد ۱ شماره عدد_رق پیامک کنید عضو اپلیکیشن بازیانا دسترس نامحدود بازی پرتال عدد_رق شارژ عید بگیرید . "',
 'رفاه گیل چا تشریفاتعدد_رقم٪ روغن لادنعدد_رقم٪ اسپاگت ماناعدد_رقم٪ ۱ /عدد_رق لغوعدد ۱',
 'درس اصول لاغر شوید *غیرحضوری* کلیک کنید : آدرس_لینک',
 'خبر هیج انگیز جنجال برا علاقمند فیل سریال ، هنرمند خبر حاشیه . عضو اپلیکیشن مواستار قرعه کش کمپین بهار عدد_رق جوایز پژو عدد_رق ، آیفون x سامسونگ s ۹ شرک کن جایزه ببر ارسال عدد : عدد_رق دانلود اپلیکیشن : آدرس_لینک',
 'هدیه ویژه نوروز برا تما مشترکین سرویس تلگراف . مناسب نوروز عدد_رق ، کتابچه ارز عدد_رق تمام کاربران که ارسال ۱ عدد_رق عضو سرویس کتابخانه الکترونیک تلگراف تعلق خواهد_گرف .']

## 3.Features
Now that the dataset has been reduced to meaningful terms, we are ready to construct features. We start by tokenizing the terms.

### Tokenization
We tokenize individual terms to generate a <b>bag of words</b> model. This model has a weakness, however: it fails to capture the innate structure of human language and only represents the occurrence of terms.<br>
Alternatively, we can use an <b>n-gram</b> model, which preserves local word order and can capture more information than the bag of words model.
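
To make the difference concrete, here is a minimal sketch of extracting unigrams and bigrams from a tokenized message. The `ngrams` helper and the sample tokens are hypothetical and only meant for illustration.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings, in order."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokenized message, for illustration only.
tokens = ['free', 'gift', 'call', 'now']

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams + bigrams)
# ['free', 'gift', 'call', 'now', 'free gift', 'gift call', 'call now']
```

The bigrams keep pairs such as 'free gift' together, which a pure bag of words would lose.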

### Implementing the tf-idf statistic
The next step is to assign each n-gram a feature and then compute the n-gram's frequency using some statistic.<br>
<br>
One good choice is <b>tf-idf</b>. <b>Term frequency (tf)</b> counts the occurrences of each n-gram in a document to weight its importance. On its own it does not work well in some cases, because it gives the most weight to common words that appear in every document. To correct this, we downweight the term frequency with the <b>inverse document frequency (idf)</b>, which is calculated by logarithmically scaling the inverse of the fraction of training examples that contain a given term. Combining the two gives the tf-idf statistic:<br>
$$ \text{tf-idf}(t,i) = tf(t,i)\times idf(t) $$ <br>
$$ =tf(t,i) \times \log \left( \frac{M}{m_t} \right) $$

where $tf(t,i)$ is the term frequency for term $t$ in the $i$th training example, $M$ is the total number of training examples, and $m_t$ is the number of training examples that contain the term $t$.<br><br>
Scikit-learn has a class called <code>TfidfVectorizer</code> that performs n-gram tokenization and also computes the tf-idf statistic.<br>
According to its documentation, it does two things:<br>
<ol>
    <li>Computes tf-idf, avoiding division by zero by <b>smoothing</b> (Laplace)</li>
    <li>Applies L2 normalization using the <b>Euclidean</b> norm</li>
</ol><br>
Finally, we are ready to transform the corpus of text data into a matrix of numbers with one row per training sample and one column per $n$-gram.
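
As a worked example of the formula above, the sketch below computes the plain (unsmoothed) tf-idf by hand on a tiny hypothetical corpus. It follows $tf(t,i) \times \log(M / m_t)$ literally; because <code>TfidfVectorizer</code> smooths the idf term and then normalizes each row, its values will differ from these.

```python
import math

# Hypothetical, already-tokenized corpus of M = 3 documents.
corpus = [
    ['gift', 'call', 'call'],
    ['gift', 'meeting'],
    ['gift', 'call', 'prize'],
]
M = len(corpus)


def tf_idf(term, doc):
    tf = doc.count(term)                       # raw count in this document
    m_t = sum(1 for d in corpus if term in d)  # documents containing the term
    return tf * math.log(M / m_t)              # fails if m_t == 0 (unseen term)


print(tf_idf('gift', corpus[0]))   # 0.0, because 'gift' is in every document
print(tf_idf('call', corpus[0]))   # 2 * log(3 / 2)
```

A term that appears in no document would give $m_t = 0$, which is the division by zero that smoothing is meant to avoid.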

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
n_grams = vectorizer.fit_transform(documents)

Let's take a look at the dimensions of the <code>n_grams</code> matrix.

In [14]:
n_grams.shape

(416, 3834)

It looks like the tokenizer extracted 3834 unigrams and bigrams. Since each training example uses only a few of these unigrams and bigrams, the matrix consists mostly of zeros and is called a <b>sparse matrix</b>. <code>TfidfVectorizer</code> handles this by returning a SciPy sparse matrix.

## 4.Training and evaluating model
Everything so far was preparation; we have not run any learning algorithm yet. In this step we train a model using a machine learning algorithm. We use a classifier called <b>Support Vector Machine (SVM)</b>. It is a good classifier for binary classification and attempts to find the hyperplane that best separates the two classes.<br>
I selected an SVM with a <b>linear kernel</b> for the following reasons:<br>
<ul>
    <li style="margin:6px 0">Text is often linearly separable</li>
    <li style="margin:6px 0">Text has a lot of features</li>
    <i>The linear kernel works well when there are a lot of features, because mapping the data to a higher-dimensional space does not really improve performance. In text classification, both the number of instances (documents) and the number of features (words) are large. When the data is linearly separable, the decision boundary produced by an RBF kernel is almost the same as the one produced by a linear kernel, so mapping the data to a higher-dimensional space with an RBF kernel is not useful.</i>
    <li style="margin:6px 0">The linear kernel is faster</li>
    <li style="margin:6px 0">There are fewer parameters to optimize</li>

</ul>

### First Analysis
First we have to find out how well the SVM performs on the dataset, so we start with the <b>hold-out</b> method: a 70/30 training and test set split. We measure the $F_1$ score, which balances precision and recall. We train the classifier with the <b>hinge loss</b> function.<br>
<br>$$ F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} $$<br>
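
As a rough illustration of the hinge loss used to train the classifier: for a label $y \in \{-1, +1\}$ and a raw decision score $f(x)$, the loss is $\max(0, 1 - y f(x))$, so it is zero only for examples classified correctly with a margin of at least 1. The `hinge_loss` helper and the scores below are hypothetical.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y * score)


# Hypothetical decision scores, for illustration only.
print(hinge_loss(+1, 2.5))   # 0.0  (correct and outside the margin)
print(hinge_loss(+1, 0.5))   # 0.5  (correct but inside the margin)
print(hinge_loss(-1, 0.5))   # 1.5  (wrong side of the boundary)
```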


In [15]:
from sklearn.model_selection import train_test_split
from sklearn import svm, metrics

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
        n_grams,
        labels,
        test_size=0.3,
        random_state=42,
        shuffle=True,
        stratify=labels
    )
clf = svm.LinearSVC(loss='hinge', C=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
metrics.f1_score(y_test, y_pred)

0.9722222222222222

We need to run cross-validation to check whether this performance is consistent. First, let's take a look at the <b>confusion matrix</b>.

In [17]:
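# LabelEncoder sorts the tags, so 0 = Non Spam and 1 = Spam;
# confusion_matrix orders its rows and columns by label value.
pd.DataFrame(
    metrics.confusion_matrix(y_test, y_pred),
    index=[['actual', 'actual'], ['non spam', 'spam']],
    columns=[['predicted', 'predicted'], ['non spam', 'spam']]
)

Unnamed: 0_level_0,Unnamed: 1_level_0,predicted,predicted
Unnamed: 0_level_1,Unnamed: 1_level_1,non spam,spam
actual,non spam,51,3
actual,spam,1,70


The row totals (54 and 71) match the stratified share of non-spam and spam messages in the 125-message test set. The classifier makes few mistakes: three non-spam messages are flagged as spam (<b>false positives</b>) and only one spam message is missed (<b>false negative</b>).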
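
As a quick consistency check, the $F_1$ score reported earlier can be recomputed by hand from the confusion matrix. The sketch assumes the class encoded as 1 (second row and second column of the matrix) is the positive class, which is the default for `metrics.f1_score`.

```python
# Counts read from the confusion matrix above, with the class encoded
# as 1 (second row and second column) taken as the positive class.
tp, fp, fn = 70, 3, 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))   # 0.9722
```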

In [18]:
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.model_selection import StratifiedShuffleSplit

In [19]:
samples_space = np.linspace(100, len(raw_messages) * 0.8, 10, dtype='int')
train_sizes, train_scores, valid_scores = learning_curve(
    estimator=svm.LinearSVC(loss='hinge', C=1e10),
    X=n_grams,
    y=labels,
    train_sizes=samples_space,
    cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=40),
    scoring='f1',
    n_jobs=-1
)

In [20]:
def make_tidy(sample_space, train_scores, valid_scores):
    # Build a wide table (one column per score type) before melting it.
    messy_format = pd.DataFrame(
        np.stack((sample_space, train_scores.mean(axis=1),
                  valid_scores.mean(axis=1)), axis=1),
        columns=['# of training examples', 'Training set', 'Validation set']
    )
    
    return pd.melt(
        messy_format,
        id_vars='# of training examples',
        value_vars=['Training set', 'Validation set'],
        var_name='Scores',
        value_name='F1 score'
    )

In [21]:
import seaborn as sns
import matplotlib.pyplot as plt
# Recent seaborn versions call this argument 'height'; older ones used 'size'.
g = sns.FacetGrid(
    make_tidy(samples_space, train_scores, valid_scores), hue='Scores', height=5
)

g.map(plt.scatter, '# of training examples', 'F1 score')
g.map(plt.plot, '# of training examples', 'F1 score').add_legend();

### Evaluating model with Cross Validation
Cross-validation is a technique for evaluating machine learning models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. In k-fold cross-validation, you split the input data into k subsets of data (also known as folds).
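
The fold logic described above can be sketched in plain Python. The `k_fold_indices` helper is hypothetical and only illustrates the idea of k-fold splitting; the code that follows uses scikit-learn's <code>StratifiedShuffleSplit</code> instead, which draws repeated random train/test splits that preserve the class proportions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=10, k=5):
    print(test)
```

Each sample lands in exactly one test fold, so every sample is used for evaluation once.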

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
classifier = svm.LinearSVC(loss='hinge', C=1)
scores = cross_val_score(classifier,
        n_grams,
        labels,
        cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2),
        scoring='f1')

print('F1: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

We got a mean F1 score of 97% and a 93% confidence interval.
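
The summary line printed above reports the mean score plus or minus two standard deviations, which is a rough 95% interval if the per-split scores are approximately normally distributed. A minimal sketch with hypothetical scores (not the notebook's results):

```python
import statistics

# Hypothetical per-split F1 scores, for illustration only.
scores = [0.96, 0.98, 0.97, 0.95, 0.99]

mean = statistics.mean(scores)
# Population standard deviation, matching the default of numpy's std().
spread = 2 * statistics.pstdev(scores)
print('F1: %0.2f (+/- %0.2f)' % (mean, spread))   # F1: 0.97 (+/- 0.03)
```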

## 5. What terms are the top predictors of spam?

In [34]:
clf.fit(n_grams, labels)
common_spams = pd.Series(
    clf.coef_.T.ravel(),
    # Use get_feature_names() instead on scikit-learn versions before 1.0.
    index=vectorizer.get_feature_names_out()
).sort_values(ascending=False)[:20]
print(common_spams)

آدرس_لینک     2.993823
شماره_تلفن    2.633741
عدد_رق        1.879351
ارسال         1.811004
ویژه          1.198835
تخفیف         1.071500
رایگ          1.052526
شارژ          1.035939
روز           0.964954
ایرانسل       0.943415
خرید          0.936982
فر            0.865509
تماس          0.852475
ایر           0.823884
هدیه          0.817125
شماره         0.804614
کش            0.791897
ستاره         0.780390
همراه         0.771419
قرعه          0.753615
dtype: float64
