# NLP Using PySpark

## Objective:
- The objective of this project is to create a <b>spam filter using a NaiveBayes classifier</b>.
- We'll use the SMS Spam Collection dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

### Create a spark session and import the required libraries

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()
# from pyspark import  SparkContext
# sc = SparkContext()

### Read the data into a DataFrame

In [2]:
df = spark.read.text('SMSSpamCollection')
df.show()

+--------------------+
|               value|
+--------------------+
|ham	Go until juro...|
|ham	Ok lar... Jok...|
|spam	Free entry i...|
|ham	U dun say so ...|
|ham	Nah I don't t...|
|spam	FreeMsg Hey ...|
|ham	Even my broth...|
|ham	As per your r...|
|spam	WINNER!! As ...|
|spam	Had your mob...|
|ham	I'm gonna be ...|
|spam	SIX chances ...|
|spam	URGENT! You ...|
|ham	I've been sea...|
|ham	I HAVE A DATE...|
|spam	XXXMobileMov...|
|ham	Oh k...i'm wa...|
|ham	Eh u remember...|
|ham	Fine if that...|
|spam	England v Ma...|
+--------------------+
only showing top 20 rows



In [3]:
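pd_df = df.toPandas()

# Split each raw line into (label, text) at a fixed width of 3 characters.
# Note: this fits 'ham' but cuts 'spam' down to 'spa', and the tab
# separator stays in the text; the outputs below reflect this.
data = [(i[:3], i[3:]) for i in pd_df['value']]
df = spark.createDataFrame(data)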
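
Each line of the file has the layout `label<TAB>message`, so splitting on the first tab is one way to separate the two fields without assuming a fixed label width. The following is a minimal sketch in plain Python; `parse_line` and the sample records are hypothetical and only illustrate the idea.

```python
def parse_line(line):
    """Split one 'label<TAB>message' record into (label, message)."""
    # partition() splits on the first tab only, so any later tabs
    # remain part of the message.
    label, _, message = line.partition('\t')
    return label, message


# Hypothetical sample records in the same layout as the file.
sample = ['ham\tOk lar...', 'spam\tFree entry']
data = [parse_line(line) for line in sample]
print(data)
```

Tuples produced this way could be passed to `spark.createDataFrame(data, ['class', 'text'])`, which would also name the columns in the same step.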

### Print the schema

In [4]:
df.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)



In [5]:
df.show(2)

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|ham|	Go until jurong ...|
|ham|	Ok lar... Joking...|
+---+--------------------+
only showing top 2 rows



### Rename the first column to 'class' and second column to 'text'

In [6]:
df = df.withColumnRenamed('_1', 'class')
df = df.withColumnRenamed('_2', 'text') 

In [7]:
df.show(2)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|	Go until jurong ...|
|  ham|	Ok lar... Joking...|
+-----+--------------------+
only showing top 2 rows



### Show the first 10 rows from the dataframe
- Show once with truncate=True and once with truncate=False

In [8]:
df.show(10)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|	Go until jurong ...|
|  ham|	Ok lar... Joking...|
|  spa|m	Free entry in 2...|
|  ham|	U dun say so ear...|
|  ham|	Nah I don't thin...|
|  spa|m	FreeMsg Hey the...|
|  ham|	Even my brother ...|
|  ham|	As per your requ...|
|  spa|m	WINNER!! As a v...|
|  spa|m	Had your mobile...|
+-----+--------------------+
only showing top 10 rows



In [9]:
df.show(10, truncate=False)

+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|class|text                                                                                                                                                             |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ham  |	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                                 |
|ham  |	Ok lar... Joking wif u oni...                                                                                                                                   |
|spa  |m	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452

## Clean and Prepare the Data

### Create a new feature column containing the length of the text column

In [27]:
from pyspark.sql.functions import length


In [19]:
# Alternative: compute the lengths on the driver in plain Python.
# collect() pulls every row to the driver, and `lst` is not used later.
all_text = df.select('text').collect()
lst = [len(row[0].strip('\n\t')) for row in all_text]

In [29]:
df = df.withColumn('length', length('text'))

### Show the new dataframe

In [31]:
df.show()

+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|	Go until jurong ...|   112|
|  ham|	Ok lar... Joking...|    30|
|  spa|m	Free entry in 2...|   157|
|  ham|	U dun say so ear...|    50|
|  ham|	Nah I don't thin...|    62|
|  spa|m	FreeMsg Hey the...|   149|
|  ham|	Even my brother ...|    78|
|  ham|	As per your requ...|   161|
|  spa|m	WINNER!! As a v...|   159|
|  spa|m	Had your mobile...|   156|
|  ham|	I'm gonna be hom...|   110|
|  spa|m	SIX chances to ...|   138|
|  spa|m	URGENT! You hav...|   157|
|  ham|	I've been search...|   197|
|  ham|	I HAVE A DATE ON...|    36|
|  spa|m	XXXMobileMovieC...|   151|
|  ham|	Oh k...i'm watch...|    27|
|  ham|	Eh u remember ho...|    82|
|  ham|	Fine if thats t...|    57|
|  spa|m	England v Maced...|   157|
+-----+--------------------+------+
only showing top 20 rows



### Get the average text length for each class (give the average length column an alias)

In [35]:
from pyspark.sql.functions import avg

df.groupBy('class').agg(avg('length').alias('Average')).show()

+-----+-----------------+
|class|          Average|
+-----+-----------------+
|  ham|72.47192873420344|
|  spa|140.6760374832664|
+-----+-----------------+



## Feature Transformations

### In this part, you transform the raw text into a TF-IDF representation:

### Perform the following steps to obtain TF-IDF:
1. Import the required transformers/estimators for the subsequent steps.
2. Create a <b>Tokenizer</b> from the text column.
3. Create a <b>StopWordsRemover</b> to remove the <b>stop words</b> from the column obtained from the <b>Tokenizer</b>.
4. Create a <b>CountVectorizer</b> after removing the <b>stop words</b>.
5. Create the <b>TF-IDF</b> from the <b>CountVectorizer</b>.

In [53]:
from pyspark.ml.feature import CountVectorizer, Tokenizer, StopWordsRemover, IDF

In [54]:
tokenizer = Tokenizer(inputCol='text', outputCol='text_tokenized')
SWRemover = StopWordsRemover(inputCol='text_tokenized', outputCol='text_no_sw')
countvec = CountVectorizer(inputCol='text_no_sw', outputCol='text_cv')
tf_idf = IDF(inputCol='text_cv', outputCol='tf_idf')
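
To make the last two stages concrete, here is a small worked sketch in plain Python of what the term counts and the IDF weighting compute. It uses the smoothed formula documented for Spark ML's `IDF`, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. The toy documents and all variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical messages, already tokenized and with stop words removed.
docs = [
    ['free', 'entry', 'win'],
    ['ok', 'lar'],
    ['free', 'msg', 'free'],
]

# Term frequency: raw counts per document (the role of CountVectorizer).
term_counts = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency: log((m + 1) / (df + 1)).
m = len(docs)
idf = {term: math.log((m + 1) / (freq + 1)) for term, freq in doc_freq.items()}

# TF-IDF weight of each term in each document.
tf_idf_weights = [
    {term: count * idf[term] for term, count in counts.items()}
    for counts in term_counts
]

print(tf_idf_weights[2])
```

Terms that appear in many documents, such as `free` here, get a smaller IDF than terms that appear in only one.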

- Convert the <b>class column</b> to index using <b>StringIndexer</b>
- Create a feature column from the <b>TF-IDF</b> and <b>length</b> columns.

In [42]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

In [55]:
strIndexer = StringIndexer(inputCol = 'class', outputCol = 'indexed_class')

all_cols = ['tf_idf', 'length']
vecassembler = VectorAssembler(inputCols = all_cols, outputCol='features')
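
By default, `StringIndexer` orders labels by descending frequency, so the most frequent label receives index 0.0. The sketch below mimics that ordering in plain Python on hypothetical labels; it is an illustration of the idea rather than Spark's implementation, and the alphabetical tie-break is an assumption made for the example.

```python
from collections import Counter

# Hypothetical labels; 'ham' is the more frequent class, as in the dataset.
labels = ['ham', 'ham', 'spam', 'ham', 'spam']

# Order labels by descending frequency (ties broken alphabetically here),
# then number them 0.0, 1.0, ...
counts = Counter(labels)
ordered = sorted(counts, key=lambda label: (-counts[label], label))
label_to_index = {label: float(i) for i, label in enumerate(ordered)}

indexed = [label_to_index[label] for label in labels]
print(label_to_index)
print(indexed)
```

With this ordering, a prediction of 0.0 corresponds to the majority class and 1.0 to the minority class.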

## The Model
- Create a <b>NaiveBayes</b> classifier with the default parameters.

In [59]:
from pyspark.ml.classification import NaiveBayes

NBC = NaiveBayes(featuresCol='features',labelCol='indexed_class')
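
With default parameters, Spark's `NaiveBayes` uses the multinomial model with additive smoothing of 1.0, which expects nonnegative feature values. The following simplified sketch shows the underlying idea in plain Python on hypothetical count features. The function names are made up for illustration, and details such as how the class priors are estimated may differ from Spark's implementation.

```python
import math


def train_multinomial_nb(rows, labels, smoothing=1.0):
    """Fit log priors and smoothed log likelihoods from count vectors."""
    n_features = len(rows[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        # Total count of each feature within the class.
        totals = [sum(r[j] for r in class_rows) for j in range(n_features)]
        denom = sum(totals) + smoothing * n_features
        log_likelihood[c] = [math.log((t + smoothing) / denom) for t in totals]
    return log_prior, log_likelihood


def predict(row, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical count features; columns stand for ['free', 'win', 'ok'].
rows = [[2, 1, 0], [1, 2, 0], [0, 0, 2], [0, 1, 3]]
labels = [1.0, 1.0, 0.0, 0.0]

log_prior, log_likelihood = train_multinomial_nb(rows, labels)
print(predict([1, 1, 0], log_prior, log_likelihood))
print(predict([0, 0, 1], log_prior, log_likelihood))
```

The smoothing term keeps a word that never occurred in a class from forcing that class's probability to zero.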

## Pipeline
### Create a pipeline containing all the steps, from the Tokenizer to the NaiveBayes classifier.

In [60]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[tokenizer,
                            SWRemover,
                            countvec,
                            tf_idf,
                            strIndexer,
                            vecassembler,
                            NBC])

### Split your data into train and test sets with ratios 0.7 and 0.3, respectively.

In [57]:
train, test = df.randomSplit([.7,.3],seed=42)
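
`randomSplit` assigns rows at random, so the resulting sizes are only approximately 70% and 30%, and the seed makes the split repeatable. The plain-Python sketch below illustrates that behaviour; `random_split` is a hypothetical helper, not Spark's actual sampler.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row independently to the train or test list."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is a separate random draw, so split sizes are approximate.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(10_000))  # hypothetical stand-in for DataFrame rows
train_rows, test_rows = random_split(rows, 0.7, seed=42)
print(len(train_rows) / len(rows))  # close to 0.7
```

Calling the helper again with the same seed returns the same two lists.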

### Fit your Pipeline model to the training data

In [61]:
model = pipeline.fit(train)

### Perform predictions on the test DataFrame

In [62]:
predictions = model.transform(test)

### Print the schema of the prediction dataframe

In [63]:
predictions.printSchema()

root
 |-- class: string (nullable = true)
 |-- text: string (nullable = true)
 |-- length: integer (nullable = true)
 |-- text_tokenized: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- text_no_sw: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- text_cv: vector (nullable = true)
 |-- tf_idf: vector (nullable = true)
 |-- indexed_class: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



## Model Evaluation
- Use <b>MulticlassClassificationEvaluator</b> to calculate the <b>f1_score</b>.

In [73]:
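from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# The 'f1' metric is provided by MulticlassClassificationEvaluator;
# BinaryClassificationEvaluator only offers areaUnderROC and areaUnderPR.
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction',
                                              labelCol='indexed_class',
                                              metricName='f1')
evaluator.evaluate(predictions)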

0.9790754024460486
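
The `f1` metric of `MulticlassClassificationEvaluator` is a weighted average of the per-class F1 scores, with each class weighted by its share of the true labels. As a rough illustration, the sketch below computes that quantity in plain Python on hypothetical labels and predictions; `weighted_f1` is a made-up helper, not part of Spark.

```python
def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighted by each class's share of true labels."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * (tp + fn) / total
    return score


# Hypothetical labels and predictions (0.0 = ham, 1.0 = spam).
y_true = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
y_pred = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the dataset is imbalanced, the weighted score is dominated by the majority class, so per-class precision and recall are worth inspecting alongside it.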