# Spam detection in SMS data

***(Data Processing and Feature Engineering Notebook)***

### Load Source Data from IBM Cloud Object Store

We need to connect to the object store, read a Parquet file, and create a DataFrame from it.

In [1]:
# import required packages and libraries
import pyspark.sql.functions as fn
from pyspark.ml.feature import Tokenizer, StopWordsRemover, StringIndexer, HashingTF, IDF, VectorAssembler, Normalizer
from pyspark.ml import Pipeline

import ibmos2spark, os
# @hidden_cell
credentials = {
    'endpoint': 'https://s3.private.us.cloud-object-storage.appdomain.cloud',
    'service_id': 'iam-ServiceId-83c6b79d-12e3-46ca-8755-655edce4d15b',
    'iam_service_endpoint': 'https://iam.cloud.ibm.com/oidc/token',
    'api_key': 'Qs6034fxfk78riRMCpVQXrKDrPzAlo6NUd_mP23rBmv3'
}

configuration_name = 'os_0fdd3f4f5bb549fab8a52672d7c21f58_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.read.parquet(cos.url('SMSSpamData.parquet', 'advanceddatasciencecapstoneibm-donotdelete-pr-prnii9jvlql3tf'))
sdf.createOrReplaceTempView('sdf')
sdf.show(10, truncate=False)

+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|Label|Text                                                                                                                                                                        |Body_len|
+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|spam |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's                 |128     |
|spam |FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv                         |116     |
|spam |WINNER!! As a valued network customer you h

## 4. Data Processing 

Below, we perform several data processing steps to prepare the data for training. In particular, the text processing includes:

- Punctuation removal: strip the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
- Tokenization: split each message into words and convert all words to lower case
- Stop word removal: drop words that add little meaning to a sentence, such as the, he, she, have, ...
- Label indexing: convert the categorical labels into numeric ones, which will later be used for binary classification
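
As a rough illustration of the steps listed above, here is a plain-Python sketch applied to one shortened sample message. It is not the Spark code used in this notebook: `STOP_WORDS` is a deliberately tiny, hypothetical list (Spark's `StopWordsRemover` ships a much longer default list), and `LABEL_TO_INDEX` is an assumed mapping written out by hand.

```python
import string

# Hypothetical, deliberately tiny stop-word list (assumption for illustration only).
STOP_WORDS = {"the", "he", "she", "have", "a", "to", "in", "is"}

# Assumed label mapping, written by hand for this sketch.
LABEL_TO_INDEX = {"ham": 0.0, "spam": 1.0}


def preprocess(text):
    """Strip ASCII punctuation, lower-case, split on whitespace, drop stop words."""
    # string.punctuation is exactly !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = preprocess("Free entry in a wkly comp to win FA Cup final tkts!")
label = LABEL_TO_INDEX["spam"]
print(tokens, label)
```

Note that removing punctuation before tokenizing glues some words together (for example `T&C's` becomes `tcs`), which is also visible in the Spark output below.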

In [2]:
# Remove punctuations which are !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# A raw string keeps the backslash in \p{Punct} intact for the Java regex engine
sdf = sdf.withColumn('Text', fn.trim(fn.lower(fn.regexp_replace('Text', r'\p{Punct}', ''))))
sdf.show(10, truncate=False)

+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|Label|Text                                                                                                                                                                   |Body_len|
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|spam |free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s                  |128     |
|spam |freemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send £150 to rcv                            |116     |
|spam |winner as a valued network customer you have been selected to receiv

In [3]:
# Tokenization (Extracting words)
tokenizer = Tokenizer().setInputCol('Text').setOutputCol('Tokens')

# Remove the stopwords
stopwords = StopWordsRemover().getStopWords()
stopremover = StopWordsRemover().setStopWords(stopwords).setInputCol('Tokens').setOutputCol('Filtered_tokens')

# Target indexer 
indexer = StringIndexer().setInputCol('Label').setOutputCol('bi_label')
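
To make the label indexing step more concrete, the sketch below mimics how `StringIndexer` assigns indices with its default ordering, which is by descending label frequency: the most frequent label gets 0.0. This is plain Python, not Spark's implementation; the label list is made up, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


# Made-up labels for illustration; "ham" is the majority class here.
labels = ["ham", "spam", "ham", "ham", "spam", "ham"]
mapping = fit_label_index(labels)
indexed = [mapping[lab] for lab in labels]
print(mapping)
```

Under this ordering, `spam` receiving index 1.0 in the output further below suggests that `ham` is the more frequent class in the dataset.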

## 5. Vectorization and Feature Engineering

We need to convert the SMS texts into a form that a machine learning or deep learning model can use for training. The process of encoding text as numbers to create feature vectors is called **vectorizing**. There are several vectorization methods. The most popular ones are:

- **Count vectorizing**: a document-term matrix is generated in which each row corresponds to a document (here, an SMS message), each column to a unique word, and each cell holds the number of times that word appears in the document, also known as the term frequency.


- **N-gram vectorizing**: similar to count vectorizing, a document-term matrix is generated and each cell holds a count. The difference is that each column represents a combination of n adjacent words in the message rather than a single word.


- **Term Frequency-Inverse Document Frequency (TF-IDF) vectorizing**: as in count vectorizing, a document-term matrix is generated and each column represents a single unique word. The difference is that each cell does not hold the raw term frequency but a weight that reflects how important that word is to the document relative to the whole corpus.
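
The three methods above can be sketched in plain Python on a tiny, made-up corpus of already tokenized messages. The IDF formula used here, log((N + 1) / (df + 1)) with N documents and document frequency df, is the smoothed form documented for Spark's `IDF`; everything else (the corpus, the function names) is an assumption for illustration.

```python
import math
from collections import Counter

# Made-up corpus of tokenized messages (assumption for illustration).
docs = [
    ["free", "entry", "win", "free"],
    ["call", "me", "later"],
    ["win", "cash", "call", "now"],
]


def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(corpus):
    """Weight each term count by log((N + 1) / (df + 1))."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in corpus for term in set(doc))
    idf = {term: math.log((n_docs + 1) / (df[term] + 1)) for term in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in corpus]


counts = Counter(docs[0])              # count vectorizing
bigrams = Counter(ngrams(docs[0], 2))  # n-gram vectorizing with n = 2
weights = tf_idf(docs)                 # TF-IDF vectorizing
print(counts, bigrams, weights[0], sep="\n")
```

In the first message, `win` also occurs in another message, so its TF-IDF weight ends up lower than that of `entry`, even though both appear once.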

For our problem, we use the **TF-IDF vectorizer** from *MLlib* in *Apache Spark*, built from the **HashingTF** and **IDF** stages. For further details, see [Apache Spark Extracting, transforming and selecting features](https://spark.apache.org/docs/latest/ml-features.html).

Furthermore, we combine the features generated by the **TF-IDF vectorizer** with the feature we created earlier, i.e., **Body_len**. To do so, we use the **VectorAssembler** transformer from *Apache Spark MLlib*. For further details, see [Apache Spark Feature Transformers](https://spark.apache.org/docs/latest/ml-features.html#feature-transformers).

In [4]:
# Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer
hashingTF = HashingTF(inputCol="Filtered_tokens", outputCol="rawFeatures", numFeatures=3000)
idf = IDF(inputCol="rawFeatures", outputCol="tf_idf_features")

# Vector assembler
assembler = VectorAssembler(inputCols=["Body_len", "tf_idf_features"], outputCol="features")
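
The following plain-Python sketch illustrates the idea behind the two stages above: tokens are hashed into a fixed number of buckets, and the message length is then placed in front of the hashed features. It is a simplification under stated assumptions: `zlib.crc32` stands in for the MurmurHash3 function that Spark's `HashingTF` uses (so the real bucket indices differ), and the IDF rescaling step is skipped. The helper names are hypothetical.

```python
import zlib

NUM_FEATURES = 3000  # same bucket count as numFeatures in the cell above


def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index. crc32 is a stand-in hash (assumption)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashed_term_counts(tokens, num_features=NUM_FEATURES):
    """Count tokens per bucket; different terms may collide in one bucket."""
    counts = {}
    for tok in tokens:
        idx = bucket(tok, num_features)
        counts[idx] = counts.get(idx, 0.0) + 1.0
    return counts


def assemble(body_len, hashed, num_features=NUM_FEATURES):
    """Put body_len at index 0 and shift the hashed features by one."""
    features = {0: float(body_len)}
    features.update({idx + 1: val for idx, val in hashed.items()})
    return 1 + num_features, features


tokens = ["free", "entry", "win", "free"]
size, features = assemble(128, hashed_term_counts(tokens))
print(size, sorted(features.items()))
```

The assembled size of 1 + 3000 = 3001 matches the vector size shown in the pipeline output, with index 0 holding `Body_len`.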

Now, we create a **pipeline** so that we can apply all the above-mentioned data pre-processing and feature engineering steps to our dataset. The only exception is punctuation removal, which was already applied to the DataFrame before the pipeline.

In [5]:
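pipeline = Pipeline(stages=[tokenizer, stopremover, hashingTF, idf, assembler, indexer])
# Fit the estimators (IDF, StringIndexer) on the data, then apply all stages
sdf_pro = pipeline.fit(sdf)
sdf_pre = sdf_pro.transform(sdf)
# Keep only the assembled feature vector and the numeric label
sdf_transformed = sdf_pre.drop("Label", "Text", "Body_len", "Tokens", "Filtered_tokens", "rawFeatures", "tf_idf_features")
sdf_transformed.show()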

+--------------------+--------+
|            features|bi_label|
+--------------------+--------+
|(3001,[0,161,170,...|     1.0|
|(3001,[0,331,564,...|     1.0|
|(3001,[0,44,147,2...|     1.0|
|(3001,[0,147,217,...|     1.0|
|(3001,[0,14,98,10...|     1.0|
|(3001,[0,214,215,...|     1.0|
|(3001,[0,17,452,1...|     1.0|
|(3001,[0,373,469,...|     1.0|
|(3001,[0,100,432,...|     1.0|
|(3001,[0,26,100,1...|     1.0|
|(3001,[0,100,174,...|     1.0|
|(3001,[0,56,147,4...|     1.0|
|(3001,[0,57,147,2...|     1.0|
|(3001,[0,57,87,12...|     1.0|
|(3001,[0,666,1071...|     1.0|
|(3001,[0,129,147,...|     1.0|
|(3001,[0,170,427,...|     1.0|
|(3001,[0,129,147,...|     1.0|
|(3001,[0,44,129,4...|     1.0|
|(3001,[0,147,236,...|     1.0|
+--------------------+--------+
only showing top 20 rows
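
Each entry in the `features` column above is printed in sparse form: the vector size, the indices of the non-zero entries, and their values. The plain-Python sketch below expands such a triple into a dense list. The indices are taken from the start of the first row, but the values are truncated in the output, so the numbers used here are placeholders, not real results.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for idx, val in zip(indices, values):
        dense[idx] = val
    return dense


size = 3001
indices = [0, 161, 170]     # first indices of the first row shown above
values = [128.0, 2.5, 1.7]  # placeholder values (assumption)

dense = to_dense(size, indices, values)
body_len = dense[0]         # index 0 is Body_len, the first assembler input
nonzero = sum(1 for v in dense if v != 0.0)
print(len(dense), body_len, nonzero)
```

Because `Body_len` is listed first in the assembler's `inputCols`, it occupies index 0, which is why every row's index list starts with 0.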



### Store Feature Engineered Data in IBM Cloud Object Store

Let's store the feature-engineered data in IBM Cloud Object Store so that we can use it in the next step of the process, i.e., Model Definition and Training.

In [6]:
sdf_transformed = sdf_transformed.repartition(1)
sdf_transformed.write.parquet(cos.url('SMSSpamData_Transformed.parquet', 'advanceddatasciencecapstoneibm-donotdelete-pr-prnii9jvlql3tf'))

Now that the data has been stored in IBM Cloud Object Store, let us read it back and confirm that it matches what we wrote.

In [7]:
sdf_Transformed_stored = spark.read.parquet(cos.url('SMSSpamData_Transformed.parquet', 'advanceddatasciencecapstoneibm-donotdelete-pr-prnii9jvlql3tf'))
sdf_Transformed_stored.show()

+--------------------+--------+
|            features|bi_label|
+--------------------+--------+
|(3001,[0,161,170,...|     1.0|
|(3001,[0,331,564,...|     1.0|
|(3001,[0,44,147,2...|     1.0|
|(3001,[0,147,217,...|     1.0|
|(3001,[0,14,98,10...|     1.0|
|(3001,[0,214,215,...|     1.0|
|(3001,[0,17,452,1...|     1.0|
|(3001,[0,373,469,...|     1.0|
|(3001,[0,100,432,...|     1.0|
|(3001,[0,26,100,1...|     1.0|
|(3001,[0,100,174,...|     1.0|
|(3001,[0,56,147,4...|     1.0|
|(3001,[0,57,147,2...|     1.0|
|(3001,[0,57,87,12...|     1.0|
|(3001,[0,666,1071...|     1.0|
|(3001,[0,129,147,...|     1.0|
|(3001,[0,170,427,...|     1.0|
|(3001,[0,129,147,...|     1.0|
|(3001,[0,44,129,4...|     1.0|
|(3001,[0,147,236,...|     1.0|
+--------------------+--------+
only showing top 20 rows

