# Exercise 1 - Building a spam filtering system

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

%matplotlib inline

## Data

* In this lecture, we use the SMS Spam Collection Data Set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). It includes:
    * A collection of 425 SMS spam messages manually extracted from the Grumbletext website.
    * A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore.

In [None]:
import io
from google.colab import files
uploaded = files.upload()

df_sms = pd.read_csv(io.StringIO(uploaded['SMS_Spam.tsv'].decode('utf-8')), sep='\t')

## Exploratory Data Analysis
* First, how many messages does the data contain?

In [None]:
len(df_sms)

* Next, how many spam and ham messages are there?

In [None]:
df_sms['label'].value_counts()

* Now, let's compute the length of each message and store it in a new column.

In [None]:
df_sms['length'] = df_sms['message'].apply(len)

In [None]:
df_sms.head()

* How are the lengths of messages distributed?

In [None]:
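# histplot replaces the deprecated distplot (available in seaborn >= 0.11)
sns.histplot(df_sms['length'], kde=True, stat='density')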
plt.show()

* Do the length distributions of spam and ham messages differ?

In [None]:
df_spam = df_sms[df_sms['label']=='spam'].reset_index(drop=True)
df_ham = df_sms[df_sms['label']=='ham'].reset_index(drop=True)

In [None]:
plt.figure(figsize=(15,10))

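# histplot replaces the deprecated distplot (available in seaborn >= 0.11)
sns.histplot(df_spam['length'], color='red', kde=True, stat='density', label='spam')
sns.histplot(df_ham['length'], color='blue', kde=True, stat='density', label='ham')
plt.legend()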
plt.show()

* What are the shortest and longest messages?

In [None]:
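# Sort by length; drop=True discards the old index instead of keeping it as a column
df_sorted = df_sms.sort_values(by='length').reset_index(drop=True)

In [None]:
# The first row holds the shortest message and the last row the longest
df_sorted.iloc[0]['message'], df_sorted.iloc[-1]['message']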

## Text preprocessing
* To analyze text, we need to split each message into individual words.
* Let's remove punctuation first.
    * Python's built-in **string** module provides a quick and convenient way to do this.

In [None]:
import string

string.punctuation

* Check each character and keep it only if it is not a punctuation mark.

In [None]:
sample = "Hello! This is Finance - IT Convergence AI - DX Course."

In [None]:
sample_nopunc = []
for char in sample:
    if char not in string.punctuation:
        sample_nopunc.append(char)

In [None]:
sample_nopunc = "".join(sample_nopunc)

In [None]:
sample_nopunc
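The character loop above can also be written with `str.translate`. The following is a minimal sketch of that alternative; the names `PUNCT_TABLE` and `strip_punctuation` are illustrative and do not appear in the original notebook.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)


sample = "Hello! This is Finance - IT Convergence AI - DX Course."
cleaned = strip_punctuation(sample)
print(cleaned.split())
```

Removing the hyphens leaves double spaces behind, which is harmless here because `split()` without arguments collapses runs of whitespace.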

* The next step is to remove stopwords. The NLTK library is a widely used library for text processing in Python (https://www.nltk.org/).
* The NLTK library provides a list of stopwords.

In [None]:
import nltk
from nltk.corpus import stopwords

* After downloading the stopword corpus, we can specify the language of the stopword list.

In [None]:
nltk.download('stopwords')

In [None]:
stopwords.words('english')

* Split the message and remove stopwords according to the list.

In [None]:
sample_nopunc

In [None]:
sample_nopunc.split()

In [None]:
remove_stopwords = []
for word in sample_nopunc.split():
    if word.lower() not in stopwords.words('english'):
        remove_stopwords.append(word)

In [None]:
remove_stopwords
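The same filtering step can be sketched as a list comprehension with a set lookup, which avoids rebuilding the stopword list for every word. The small `STOP_WORDS` set below is a hypothetical stand-in so that the example runs without downloading the NLTK corpus; in the notebook you would use `set(stopwords.words('english'))` instead.

```python
# Hypothetical mini stop list for illustration only; the notebook uses
# the NLTK English stopword list instead.
STOP_WORDS = {"this", "is", "a", "the", "of", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and drop tokens whose lowercase form is a stopword."""
    return [word for word in text.split() if word.lower() not in stop_words]


filtered = remove_stop_words("Hello This is Finance IT Convergence AI DX Course")
print(filtered)
```

Membership tests on a set take constant time on average, which matters once the function is applied to thousands of messages.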

* Wrapping these steps in a function makes them easy to reuse later.

In [None]:
def preprocessing(text):
    """Return the tokens of `text` without punctuation, stopwords, or short words."""
    # Build the stopword set once per call instead of once per word
    stop_words = set(stopwords.words('english'))

    # remove punctuation
    nopunc = "".join(char for char in text if char not in string.punctuation)

    # remove stopwords (case-insensitive comparison)
    remove_stop = [word for word in nopunc.split() if word.lower() not in stop_words]

    # keep only words with at least three characters
    tokens = [word for word in remove_stop if len(word) >= 3]

    return tokens

In [None]:
sample

In [None]:
preprocessing(sample)

* You can apply the preprocessing function to the whole DataFrame.

In [None]:
df_sms.head()

In [None]:
df_sms['message'].apply(preprocessing)

## Frequency Analysis

In [None]:
clean_spam = df_spam['message'].apply(preprocessing)
clean_ham = df_ham['message'].apply(preprocessing)

* First, let's merge the token lists of each DataFrame into a single list.

In [None]:
whole_spam = []
for line in clean_spam.tolist():
    whole_spam += line
    
whole_ham = []
for line in clean_ham.tolist():
    whole_ham += line

* The **Text** class in the **NLTK** library provides some useful methods for text analysis.

In [None]:
from nltk import Text

ham_text = Text(whole_ham)
spam_text = Text(whole_spam)

* The **vocab** method of the **Text** class returns the frequency of each token.

In [None]:
freqDist_ham = ham_text.vocab()

In [None]:
freqDist_ham

In [None]:
freqDist_ham.most_common(10)
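The object returned by `vocab` is an NLTK `FreqDist`, which is a subclass of `collections.Counter`. The sketch below shows the same counting idea with the standard library only; the token list is made-up illustrative data, not taken from the dataset.

```python
from collections import Counter

# Toy token list standing in for whole_ham (illustrative data only).
tokens = ["call", "free", "call", "later", "free", "call"]

# Counter maps each token to the number of times it occurs.
freq = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs.
top_two = freq.most_common(2)
print(top_two)
```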

* How about spam messages?

* You can plot the frequencies of the most common tokens with the **plot** method.

In [None]:
plt.figure(figsize=(10,8))

ham_text.plot(30)
plt.show()

* We can also use the **wordcloud** package for visualization.
* You can install the package with `conda install -c conda-forge wordcloud` (or `pip install wordcloud`).

In [None]:
from wordcloud import WordCloud

plt.figure(figsize=(15,10))

wc_ham = WordCloud(width=1000, height=600, background_color="black", random_state=0)
plt.imshow(wc_ham.generate_from_frequencies(freqDist_ham))
plt.axis("off")
plt.show()

In [None]:
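# Token frequencies for spam, computed the same way as freqDist_ham
freqDist_spam = spam_text.vocab()

plt.figure(figsize=(15,10))

wc_spam = WordCloud(width=1000, height=600, background_color="black", random_state=0)
plt.imshow(wc_spam.generate_from_frequencies(freqDist_spam))
plt.axis("off")
plt.show()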

## Building a spam filtering system

* Each message in our data is labeled as either "spam" or "ham".
    * This means that we can build a spam classifier with a supervised machine learning model.
* First, we have to specify the feature and the target, then split our data into training and test sets.

In [None]:
df_sms.head()

### Preprocessing data

In [None]:
df_sms['message_clean'] = df_sms['message'].apply(preprocessing)

In [None]:
df_sms['message_clean'] = df_sms['message_clean'].apply(', '.join)

In [None]:
df_sms.head()

In [None]:
x = df_sms['message_clean']
y = df_sms['label']

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y)

### Vectorization

* Now, we need to convert each message into a numeric vector so that machine learning models can use it.
* Here, we will use the TF-IDF vectorizer from the **scikit-learn** package.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()
x_train_vec = vectorizer.fit_transform(x_train)
x_test_vec = vectorizer.transform(x_test)
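To make the TF-IDF weighting concrete, here is a minimal standard-library sketch. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization; the whitespace tokenization and the toy documents are simplifications for illustration, so the numbers will not match `TfidfVectorizer` exactly on real text.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Toy documents (illustrative only).
docs = ["free prize call now", "call me later", "free free prize"]
vectors = tfidf(docs)
print(vectors[0])
```

Terms that appear in fewer documents receive a larger IDF, so in the first document "now" ends up with a higher weight than "call".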

### Model selection and testing the classifier

In [None]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(x_train_vec, y_train)
pred_rfc = rfc.predict(x_test_vec)

metrics.accuracy_score(y_test, pred_rfc)
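Because ham messages outnumber spam messages in this dataset, accuracy alone can look high even when many spam messages are missed. The sketch below computes precision and recall for the spam class by hand on made-up labels; `precision_recall` is an illustrative helper, and in practice `sklearn.metrics.classification_report` reports the same quantities.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Return (precision, recall) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration: one of the two spam messages is missed.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "ham", "ham"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

In this toy case the accuracy is 5/6 while the recall for spam is only 0.5, which is the kind of gap accuracy hides.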

## Building a spam filtering system with RNN

In [None]:
df_sms = pd.read_csv('./SMS_Spam.tsv', sep='\t')

* Replace the labels "ham" and "spam" with 0 and 1.

In [None]:
# map converts each label explicitly and yields an integer column
df_sms['label'] = df_sms['label'].map({'ham': 0, 'spam': 1})

In [None]:
df_sms.head()

* Set a feature and target for classification.

In [None]:
x = df_sms['message']
y = df_sms['label']

* Here, we tokenize the messages with the Keras **Tokenizer**.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)
sequences = tokenizer.texts_to_sequences(x)

In [None]:
sequences
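The integer sequences above come from a word index in which more frequent words receive smaller integers, starting at 1. The following is a simplified standard-library sketch of that idea, not the Keras implementation: the regular expression and the helper names `build_word_index` and `texts_to_sequences` are assumptions made for illustration.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")  # simplified tokenization rule (assumption)


def tokenize(text):
    """Lowercase the text and extract alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its integer index."""
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]


# Toy messages (illustrative only).
texts = ["Free prize now", "Call me now", "now or never"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
print(word_index)
print(sequences)
```

Index 0 is left unused by the word index, which is why the vocabulary size passed to an embedding layer is usually `len(word_index) + 1`.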

* Now, we have to decide how many sequences go into the training set and how many into the test set.

In [None]:
n_train = int(len(sequences)*0.8)
n_test = int(len(sequences) - n_train)

In [None]:
x_data = sequences
max_len = max(len(i) for i in x_data)

In [None]:
data = pad_sequences(x_data, maxlen=max_len)

In [None]:
data[0]
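By default, `pad_sequences` pads with zeros at the front of each sequence and, when a sequence is longer than `maxlen`, drops items from the front. A minimal pure-Python sketch of that default behavior follows; `pad_pre` is an illustrative helper and returns plain lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # Keep only the last `maxlen` items.
            seq = seq[len(seq) - maxlen:]
        # Fill the remaining positions at the front with the padding value.
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


padded = pad_pre([[2, 3, 1], [4], [1, 6]])
print(padded)
```

Padding at the front keeps the actual words at the end of the sequence, so they are the last inputs a recurrent layer reads.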

* Split the data into training and test sets.

In [None]:
x_train = data[:n_train]
y_train = y[:n_train]
x_test = data[n_train:]
y_test = y[n_train:]
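The slices above keep the rows in file order. If the file happened to be ordered in some way (an assumption, not something the dataset description states), shuffling before splitting would give a more representative test set. A minimal sketch with the standard library, where `shuffled_split` and its parameters are illustrative names:

```python
import random


def shuffled_split(features, labels, train_fraction=0.8, seed=0):
    """Shuffle indices reproducibly, then split features and labels together."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_train = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Toy data: ten feature values whose label is their parity.
xs = list(range(10))
ys = [x % 2 for x in xs]
x_tr, x_te, y_tr, y_te = shuffled_split(xs, ys)
print(len(x_tr), len(x_te))
```

Using the same index list for features and labels keeps each message paired with its own label after shuffling.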

* Now, let's construct a simple RNN model for classification.

In [None]:
from tensorflow.keras.layers import SimpleRNN, Embedding, Dense
from tensorflow.keras.models import Sequential

In [None]:
word_size = len(tokenizer.word_index) + 1

model = Sequential()
model.add(Embedding(word_size, 32)) 
model.add(SimpleRNN(32)) 
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=4, batch_size=64, validation_split=0.2)

* How accurate is the model?

In [None]:
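# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output instead
predictions = (model.predict(x_test) > 0.5).astype('int32').ravel()
print(metrics.accuracy_score(y_test, predictions))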