## Classification - Naive Bayes

<strong> Classification - Naive Bayes </strong>
<ul style="list-style-type:square">
  <li>Features :  SMS</li>
  <li>Target : ham/spam </li>
  <li>Model : Naive Bayes</li>
</ul>

<strong> Attribute Information: </strong>

<strong>The collection consists of a single text file in which each line holds the correct class followed by the raw message. Some examples are shown below: </strong>

<ul style="list-style-type:square">
  <li>ham What you doing?how are you?</li>
  <li>ham Ok lar... Joking wif u oni...</li>
  <li>ham dun say so early hor... U c already then say...</li>
  <li>ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*</li>
  <li>ham Siva is in hostel aha:-.</li>
  <li>ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor</li>
  <li>spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop</li>
  <li>spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B</li>
  <li>spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU</li>
</ul>

Note: the messages are not chronologically sorted.

Source: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

## SMS Spam Message Classification

In [23]:
# Spark Session
from pyspark.sql import SparkSession
spSession = SparkSession.builder.master("local").appName("NaiveBayes").getOrCreate()

In [13]:
# Libraries 
from pyspark.ml.classification import NaiveBayes, NaiveBayesModel
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer # natural language processing
# The inverse document frequency (IDF) tells us how important a term is to a collection of documents.
from pyspark.ml.feature import IDF
from pyspark.ml.evaluation import MulticlassClassificationEvaluator # To evaluate the model 

In [16]:
# Load the CSV file as an RDD of text lines and cache it
sc = spSession.sparkContext
spamRDD = sc.textFile("SMSSpamCollection.CSV")
spamRDD.cache()

SMSSpamCollection.CSV MapPartitionsRDD[5] at textFile at <unknown>:0

In [18]:
# Fetch only the first 15 lines instead of collecting the whole RDD
spamRDD.take(15)

['ham,Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...,,,,,,,,,',
 'ham,Ok lar... Joking wif u oni...,,,,,,,,,,',
 'ham,U dun say so early hor... U c already then say...,,,,,,,,,,',
 "ham,Nah I don't think he goes to usf, he lives around here though,,,,,,,,,",
 'ham,Even my brother is not like to speak with me. They treat me like aids patent.,,,,,,,,,,',
 "ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,,,,,,,,,,",
 "ham,I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.,,,,,,,,,",
 "ham,I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.,,,,,,,,,,",
 'ham,I HAVE A DATE ON SUNDAY WITH WILL!!,,,,,,,,,,',
 "ham

## Data Pre-Processing

In [21]:
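# Convert one raw CSV line into [numeric label, message text]
def TransformVector(inputString):
    # Split the line on commas: field 0 is the class, field 1 the message.
    # Note: splitting on every comma keeps only the text before the
    # message's first comma.
    attList = inputString.split(",")
    # Set numeric label: ham -> 0.0, spam -> 1.0
    smsType = 0.0 if attList[0] == "ham" else 1.0
    return [smsType, attList[1]]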
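
Because `split(",")` breaks the line at every comma, `attList[1]` holds only the text before the first comma inside the message. A minimal alternative sketch, assuming the file layout shown in the output above (label, message, then a run of empty trailing fields), splits on the first comma only and strips the trailing commas. The name `parse_sms_line` is hypothetical and is not part of the original notebook.

```python
def parse_sms_line(line):
    """Hypothetical helper: split a raw CSV line into [label, message]."""
    # Split on the first comma only, so commas inside the message survive
    label, _, message = line.partition(",")
    # Drop the run of empty trailing CSV fields (",,,,,,,,,")
    message = message.rstrip(",")
    # Same numeric encoding as TransformVector: ham -> 0.0, anything else -> 1.0
    sms_type = 0.0 if label == "ham" else 1.0
    return [sms_type, message]


sample = "ham,Nah I don't think he goes to usf, he lives around here though,,,,,,,,,"
print(parse_sms_line(sample))
```

This variant would also strip commas that genuinely end a message, and swapping it in would change the features the model sees, so the figures reported in this notebook would no longer apply.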

In [25]:
# Apply the TransformVector function
# Transform the RDD into a DataFrame
spamRDD2 = spamRDD.map(TransformVector)
spamDF = spSession.createDataFrame(spamRDD2,["label", "message"])
spamDF.cache()
spamDF.select("label", "message").show()

+-----+--------------------+
|label|             message|
+-----+--------------------+
|  0.0|Go until jurong p...|
|  0.0|Ok lar... Joking ...|
|  0.0|U dun say so earl...|
|  0.0|Nah I don't think...|
|  0.0|Even my brother i...|
|  0.0|As per your reque...|
|  0.0|I'm gonna be home...|
|  0.0|I've been searchi...|
|  0.0|I HAVE A DATE ON ...|
|  0.0|Oh k...i'm watchi...|
|  0.0|Eh u remember how...|
|  0.0|Fine if thats th...|
|  0.0|Is that seriously...|
|  0.0|I‘m going to try ...|
|  0.0|So ü pay first la...|
|  0.0|Aft i finish my l...|
|  0.0|Ffffffffff. Alrig...|
|  0.0|Just forced mysel...|
|  0.0|Lol your always s...|
|  0.0|Did you catch the...|
+-----+--------------------+
only showing top 20 rows



## Machine Learning 

In [27]:
# Split data into training (70%) and test (30%) sets
[training_set, test_set] = spamDF.randomSplit([0.7, 0.3])

In [29]:
training_set.count()

709

In [30]:
test_set.count()

291

In [39]:
# Split words and apply TF-IDF
# Tokenize each SMS into words
tokenizer = Tokenizer(inputCol="message", outputCol="words")
# Generate the term frequency vector
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="tempfeatures")
# Rescale term frequencies by inverse document frequency,
# which down-weights words that appear in many messages
idf = IDF(inputCol=hashingTF.getOutputCol(), outputCol="features")
# Algorithm
nbClassifier = NaiveBayes()
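
To make the two feature stages concrete, here is a plain-Python sketch of the same idea on three toy messages. It counts terms directly rather than hashing them into buckets as `HashingTF` does, and it uses the smoothed formula idf(t) = log((N + 1) / (df(t) + 1)) documented for Spark's `IDF`. The toy messages and variable names are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the dataset)
docs = [
    "free prize call now",
    "call me now",
    "see you at home",
]

# Tokenize: lowercase and split on whitespace, roughly as Tokenizer does
tokenized = [d.lower().split() for d in docs]

# Term frequency: raw count of each term per document
tf = [Counter(words) for words in tokenized]

# Document frequency: number of documents containing each term
df = Counter(term for words in tokenized for term in set(words))

n_docs = len(docs)
# Smoothed inverse document frequency
idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}

# TF-IDF weight of each term in each document
tfidf = [{term: count * idf[term] for term, count in doc.items()} for doc in tf]

for doc, weights in zip(docs, tfidf):
    print(doc, {t: round(w, 3) for t, w in weights.items()})
```

Terms that occur in several messages, such as "call" and "now", end up with a lower weight than terms that occur in only one, such as "free".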

In [40]:
# Create Pipeline
pipeline = Pipeline(stages = [tokenizer,hashingTF,idf,nbClassifier])

In [41]:
# Fit the pipeline on the training set to create the model
model = pipeline.fit(training_set)
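
As a rough illustration of what fitting a multinomial Naive Bayes classifier involves, the sketch below trains one by hand on four toy messages with additive smoothing of 1.0. It works on raw word counts, whereas the pipeline feeds TF-IDF weights to the classifier, so it is a simplified analogue and not Spark's implementation. The training messages and the names `fit`, `predict`, and `toy_model` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training data (made up for illustration): (label, message)
train = [
    (1.0, "win free prize now"),
    (1.0, "free call now"),
    (0.0, "are you home now"),
    (0.0, "call me when you are home"),
]


def fit(samples, smoothing=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in samples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total = sum(class_counts.values())
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Additive smoothing keeps unseen class/word pairs from getting zero probability
        denom = sum(word_counts[c].values()) + smoothing * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + smoothing) / denom) for w in vocab
        }
    return log_prior, log_likelihood


def predict(model, text):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.lower().split():
            # Words never seen in training are skipped
            if w in log_likelihood[c]:
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


toy_model = fit(train)
print(predict(toy_model, "free prize"))
print(predict(toy_model, "are you home"))
```

On this toy data, "free prize" scores higher under the spam class (1.0) and "are you home" under the ham class (0.0).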

In [42]:
# Testing the model
predictions = model.transform(test_set)
predictions.select("prediction", "label").show()

+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       0.0|  0.0|
|       1.0|  0.0|
|       1.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       1.0|  0.0|
|       0.0|  0.0|
|       1.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
+----------+-----+
only showing top 20 rows



In [43]:
# Accuracy of the model
accuracy = MulticlassClassificationEvaluator( predictionCol= "prediction", labelCol= "label", metricName="accuracy")
accuracy.evaluate(predictions)

0.9243986254295533

In [46]:
# Confusion Matrix
predictions.groupBy("prediction","label").count().show()

+----------+-----+-----+
|prediction|label|count|
+----------+-----+-----+
|       1.0|  1.0|  141|
|       0.0|  1.0|    7|
|       1.0|  0.0|   15|
|       0.0|  0.0|  128|
+----------+-----+-----+
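
The counts in this table are enough to recompute the accuracy by hand and to derive precision and recall for the spam class. The short sketch below does the arithmetic in plain Python, treating label 1.0 (spam) as the positive class; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above (label 1.0 = spam, 0.0 = ham)
tp = 141  # predicted spam, actually spam
fn = 7    # predicted ham, actually spam
fp = 15   # predicted spam, actually ham
tn = 128  # predicted ham, actually ham

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of spam predictions that were correct
recall = tp / (tp + fn)     # share of actual spam that was caught

print(total, accuracy, precision, recall)
```

The four counts sum to the 291 rows of the test set, and (141 + 128) / 291 matches the accuracy of about 0.924 reported by the evaluator. Precision comes out at about 0.904 and recall at about 0.953, so the model misses little spam but flags 15 ham messages as spam.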

