# Classification Algorithms

Classification algorithms are useful when we want to assign the records in a dataset to one of two or more categories. For example, we might have several pieces of data that fall
into Category A or Category B, and sometimes it's not obvious where a given item belongs. Classification algorithms
help us identify the boundaries between categories, which makes it easier to decide how to assign a new entity to a particular group.
In this notebook, we'll look at a few different classification algorithms: Naive Bayes, multilayer perceptrons, and decision trees.

#### Download the **iris dataset** from this [link](https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv)

### Import Packages

In [1]:
!pip install pyspark
import pyspark
from pyspark.sql.functions import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

Successfully installed py4j-0.10.9 pyspark-3.0.1


### Create Spark instance

In [2]:
spark = SparkSession.builder.appName('Class').getOrCreate()

### Load iris dataset into a Spark data frame

In [3]:
iris_df = spark.read.csv('iris.csv',header=True,inferSchema=True)

In [4]:
iris_df.take(4)

[Row(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='setosa'),
 Row(sepal_length=4.9, sepal_width=3.0, petal_length=1.4, petal_width=0.2, species='setosa'),
 Row(sepal_length=4.7, sepal_width=3.2, petal_length=1.3, petal_width=0.2, species='setosa'),
 Row(sepal_length=4.6, sepal_width=3.1, petal_length=1.5, petal_width=0.2, species='setosa')]

In [5]:
iris_df.show()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|        3.0|         1.4|        0.1| setosa|
|         4.3|        3.0|         1.1| 

### Create a vector assembler to transform our data

**VectorAssembler** is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models. 

In [8]:
vec_assembler = VectorAssembler(inputCols=['sepal_length','sepal_width','petal_length','petal_width'],
                                outputCol='features')

In [9]:
vecIris_df = vec_assembler.transform(iris_df)

In [10]:
vecIris_df.printSchema()

root
 |-- sepal_length: double (nullable = true)
 |-- sepal_width: double (nullable = true)
 |-- petal_length: double (nullable = true)
 |-- petal_width: double (nullable = true)
 |-- species: string (nullable = true)
 |-- features: vector (nullable = true)
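
To make the transformation concrete, here is a minimal plain-Python sketch of the idea behind a vector assembler. It is not Spark's implementation; the helper name `assemble` and the dictionary-based row are assumptions made only for illustration.

```python
# Plain-Python illustration of the idea behind VectorAssembler:
# pick the listed input columns from a row and pack them, in order,
# into one feature vector stored under a new output column.
# `assemble` is a hypothetical helper, not part of PySpark.
FEATURE_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def assemble(row, input_cols=FEATURE_COLUMNS, output_col="features"):
    """Return a copy of `row` with the input columns combined into one list."""
    assembled = dict(row)  # leave the original row untouched
    assembled[output_col] = [float(row[col]) for col in input_cols]
    return assembled


first_row = {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4,
             "petal_width": 0.2, "species": "setosa"}
print(assemble(first_row)["features"])  # [5.1, 3.5, 1.4, 0.2]
```

The original columns stay in place and the vector is added as an extra column, which matches the schema printed above.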



In [11]:
vecIris_df.show()

+------------+-----------+------------+-----------+-------+-----------------+
|sepal_length|sepal_width|petal_length|petal_width|species|         features|
+------------+-----------+------------+-----------+-------+-----------------+
|         5.1|        3.5|         1.4|        0.2| setosa|[5.1,3.5,1.4,0.2]|
|         4.9|        3.0|         1.4|        0.2| setosa|[4.9,3.0,1.4,0.2]|
|         4.7|        3.2|         1.3|        0.2| setosa|[4.7,3.2,1.3,0.2]|
|         4.6|        3.1|         1.5|        0.2| setosa|[4.6,3.1,1.5,0.2]|
|         5.0|        3.6|         1.4|        0.2| setosa|[5.0,3.6,1.4,0.2]|
|         5.4|        3.9|         1.7|        0.4| setosa|[5.4,3.9,1.7,0.4]|
|         4.6|        3.4|         1.4|        0.3| setosa|[4.6,3.4,1.4,0.3]|
|         5.0|        3.4|         1.5|        0.2| setosa|[5.0,3.4,1.5,0.2]|
|         4.4|        2.9|         1.4|        0.2| setosa|[4.4,2.9,1.4,0.2]|
|         4.9|        3.1|         1.5|        0.1| setosa|[4.9,

### Convert the species label names into numeric values

**StringIndexer** encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. 

In [12]:
indexer = StringIndexer(inputCol='species',outputCol='label')
indexerModel=indexer.fit(vecIris_df)
indexVecIris_df=indexerModel.transform(vecIris_df)
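
The frequency-based ordering can be sketched in plain Python. This is only an illustration under stated assumptions: `fit_string_indexer` is a hypothetical helper, and ties between equally frequent labels are broken alphabetically here, which is how Spark's documentation describes its default ordering.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each distinct label to an index, most frequent label first.

    Hypothetical helper for illustration; ties are broken alphabetically.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# The iris dataset has 50 rows per species, so every label is tied on frequency.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print(fit_string_indexer(species))
# {'setosa': 0.0, 'versicolor': 1.0, 'virginica': 2.0}

# With unequal frequencies, the most common label gets index 0.
print(fit_string_indexer(["b", "a", "b"]))  # {'b': 0.0, 'a': 1.0}
```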

In [13]:
indexVecIris_df.show()

+------------+-----------+------------+-----------+-------+-----------------+-----+
|sepal_length|sepal_width|petal_length|petal_width|species|         features|label|
+------------+-----------+------------+-----------+-------+-----------------+-----+
|         5.1|        3.5|         1.4|        0.2| setosa|[5.1,3.5,1.4,0.2]|  0.0|
|         4.9|        3.0|         1.4|        0.2| setosa|[4.9,3.0,1.4,0.2]|  0.0|
|         4.7|        3.2|         1.3|        0.2| setosa|[4.7,3.2,1.3,0.2]|  0.0|
|         4.6|        3.1|         1.5|        0.2| setosa|[4.6,3.1,1.5,0.2]|  0.0|
|         5.0|        3.6|         1.4|        0.2| setosa|[5.0,3.6,1.4,0.2]|  0.0|
|         5.4|        3.9|         1.7|        0.4| setosa|[5.4,3.9,1.7,0.4]|  0.0|
|         4.6|        3.4|         1.4|        0.3| setosa|[4.6,3.4,1.4,0.3]|  0.0|
|         5.0|        3.4|         1.5|        0.2| setosa|[5.0,3.4,1.5,0.2]|  0.0|
|         4.4|        2.9|         1.4|        0.2| setosa|[4.4,2.9,1.4,0.2]

## Naive Bayes classification model

Let's import the required packages for this step.

In [14]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

### Create training and test datasets

In [15]:
train_df, test_df = indexVecIris_df.randomSplit([0.65, 0.35], seed=1)

In [16]:
train_df.count()

104

In [17]:
test_df.count()

46

In [18]:
indexVecIris_df.count()

150
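
Note that the split is 104/46 rather than an exact 65/35 division of 150 rows. A weighted random split assigns each row independently, so the sizes are only approximately proportional to the weights. The plain-Python sketch below illustrates that idea; `random_split` is a hypothetical helper, and it will not reproduce the exact rows that Spark selects for the same seed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split at random, in proportion to `weights`.

    Hypothetical helper for illustration only, not Spark's algorithm.
    """
    total = sum(weights)
    cumulative, running = [], 0.0
    for weight in weights:
        running += weight / total  # normalise so the bounds end near 1.0
        cumulative.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(cumulative):
            # The last split catches any draw left over by rounding.
            if draw < bound or index == len(cumulative) - 1:
                splits[index].append(row)
                break
    return splits


train, test = random_split(list(range(150)), [0.65, 0.35], seed=1)
print(len(train), len(test))  # sizes vary with the seed but always sum to 150
```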

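`NaiveBayes()` takes an argument called `modelType`. Here we set it to `multinomial`, which is also the default. The name refers to how the model treats the features, not to the number of classes: multinomial Naive Bayes models each feature as a non-negative count or frequency, and it works with any number of classes. The iris measurements are all non-negative, so the model accepts them as they are.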

In [20]:
nb_classifier = NaiveBayes(modelType='multinomial')
nb_model=nb_classifier.fit(train_df)
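
For intuition, here is a minimal plain-Python sketch of multinomial Naive Bayes on four iris-like rows. It is not Spark's code: the function names are hypothetical, the additive smoothing value of 1.0 is an assumption, and only two classes are used to keep the example short.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(rows, labels, smoothing=1.0):
    """Return (log_priors, log_thetas), each keyed by class label."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for x, y in zip(rows, labels):
        class_counts[y] += 1
        for i, value in enumerate(x):
            feature_sums[y][i] += value  # features act as non-negative "counts"

    n_rows, n_classes = len(rows), len(class_counts)
    log_priors, log_thetas = {}, {}
    for y, count in class_counts.items():
        log_priors[y] = math.log((count + smoothing) / (n_rows + smoothing * n_classes))
        total = sum(feature_sums[y]) + smoothing * n_features
        log_thetas[y] = [math.log((s + smoothing) / total) for s in feature_sums[y]]
    return log_priors, log_thetas


def predict(x, log_priors, log_thetas):
    """Pick the class with the highest log prior plus weighted log likelihood."""
    scores = {y: log_priors[y] + sum(v * lt for v, lt in zip(x, log_thetas[y]))
              for y in log_priors}
    return max(scores, key=scores.get)


rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2],   # setosa-like rows
        [6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9]]   # virginica-like rows
labels = [0.0, 0.0, 2.0, 2.0]
priors, thetas = fit_multinomial_nb(rows, labels)
print(predict([5.0, 3.4, 1.5, 0.2], priors, thetas))  # 0.0
print(predict([6.5, 3.0, 5.8, 2.2], priors, thetas))  # 2.0
```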

We have built and fit the model using the training data set, and in the next step, we are going to make predictions using the model on the test data.

In [21]:
pred_df = nb_model.transform(test_df)

In [22]:
pred_df.printSchema()

root
 |-- sepal_length: double (nullable = true)
 |-- sepal_width: double (nullable = true)
 |-- petal_length: double (nullable = true)
 |-- petal_width: double (nullable = true)
 |-- species: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



### Model Evaluation

Now we want to see how well the model works overall, so let's evaluate it on the test predictions.

In [23]:
pred_df.select(['species','features','label','prediction']).show(150)

+----------+-----------------+-----+----------+
|   species|         features|label|prediction|
+----------+-----------------+-----+----------+
|    setosa|[4.5,2.3,1.3,0.3]|  0.0|       0.0|
|    setosa|[4.6,3.1,1.5,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.1,1.6,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.4,1.6,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.4,1.9,0.2]|  0.0|       0.0|
|versicolor|[4.9,2.4,3.3,1.0]|  1.0|       1.0|
| virginica|[4.9,2.5,4.5,1.7]|  2.0|       2.0|
|    setosa|[5.0,3.5,1.3,0.3]|  0.0|       0.0|
|versicolor|[5.1,2.5,3.0,1.1]|  1.0|       1.0|
|    setosa|[5.1,3.3,1.7,0.5]|  0.0|       0.0|
|    setosa|[5.1,3.5,1.4,0.2]|  0.0|       0.0|
|    setosa|[5.1,3.8,1.6,0.2]|  0.0|       0.0|
|versicolor|[5.2,2.7,3.9,1.4]|  1.0|       1.0|
|    setosa|[5.2,3.4,1.4,0.2]|  0.0|       0.0|
|    setosa|[5.2,3.5,1.5,0.2]|  0.0|       0.0|
|    setosa|[5.2,4.1,1.5,0.1]|  0.0|       0.0|
|versicolor|[5.4,3.0,4.5,1.5]|  1.0|       1.0|
|    setosa|[5.4,3.4,1.5,0.4]|  0.0|    

In [24]:
model_eval = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction',\
                                               metricName='accuracy')

In [25]:
nb_accuracy = model_eval.evaluate(pred_df)

In [26]:
print("Naive Bayes Accuracy is {:.2f}.".format(nb_accuracy))

Naive Bayes Accuracy is 0.98.
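
The accuracy metric is simply the fraction of test rows whose prediction equals the label. The sketch below shows that calculation in plain Python on made-up labels and predictions; the `accuracy` function and the sample values are illustrative assumptions, not the notebook's data.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction matches the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# Made-up example: one of five predictions is wrong.
labels = [0.0, 0.0, 1.0, 2.0, 2.0]
predictions = [0.0, 0.0, 1.0, 2.0, 1.0]
print("{:.2f}".format(accuracy(labels, predictions)))  # 0.80
```

As a point of arithmetic, with 46 test rows a single misclassification would give 45/46, which rounds to 0.98.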


## Multi Layer Perceptron (MLP)

In [27]:
from pyspark.ml.classification import MultilayerPerceptronClassifier

A multilayer perceptron (MLP) classifier is a neural network made up of several layers of neurons.

The first layer must have as many nodes as there are inputs. We have four measurements, so our first layer has four nodes.
The last layer must have as many neurons as there are output classes. We have three iris species, so our last layer has three neurons. The layers in between, called hidden layers, help the MLP learn how to classify correctly.
We insert two hidden layers of six neurons each in the middle, which gives us a four-layer MLP.

In [28]:
layers = [4, 6, 6, 3]
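
To get a feel for the size of this network, the sketch below counts its trainable parameters. It assumes fully connected layers with one bias per neuron; `count_parameters` is a hypothetical helper, not a PySpark function.

```python
def count_parameters(layers):
    """Count weights and biases in a fully connected feed-forward network.

    Each pair of consecutive layers contributes n_in * n_out weights
    plus n_out biases, i.e. (n_in + 1) * n_out parameters.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


layers = [4, 6, 6, 3]
print(count_parameters(layers))  # (4+1)*6 + (6+1)*6 + (6+1)*3 = 93
```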

In [29]:
mlp_classifier = MultilayerPerceptronClassifier(layers=layers, seed =1)

In [30]:
mlp_model = mlp_classifier.fit(train_df)

In [31]:
mlp_pred = mlp_model.transform(test_df)

In [32]:
mlp_pred.select(['species','features','label','prediction']).show(150)

+----------+-----------------+-----+----------+
|   species|         features|label|prediction|
+----------+-----------------+-----+----------+
|    setosa|[4.5,2.3,1.3,0.3]|  0.0|       0.0|
|    setosa|[4.6,3.1,1.5,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.1,1.6,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.4,1.6,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.4,1.9,0.2]|  0.0|       0.0|
|versicolor|[4.9,2.4,3.3,1.0]|  1.0|       1.0|
| virginica|[4.9,2.5,4.5,1.7]|  2.0|       2.0|
|    setosa|[5.0,3.5,1.3,0.3]|  0.0|       0.0|
|versicolor|[5.1,2.5,3.0,1.1]|  1.0|       1.0|
|    setosa|[5.1,3.3,1.7,0.5]|  0.0|       0.0|
|    setosa|[5.1,3.5,1.4,0.2]|  0.0|       0.0|
|    setosa|[5.1,3.8,1.6,0.2]|  0.0|       0.0|
|versicolor|[5.2,2.7,3.9,1.4]|  1.0|       1.0|
|    setosa|[5.2,3.4,1.4,0.2]|  0.0|       0.0|
|    setosa|[5.2,3.5,1.5,0.2]|  0.0|       0.0|
|    setosa|[5.2,4.1,1.5,0.1]|  0.0|       0.0|
|versicolor|[5.4,3.0,4.5,1.5]|  1.0|       1.0|
|    setosa|[5.4,3.4,1.5,0.4]|  0.0|    

In [33]:
mlp_eval = MulticlassClassificationEvaluator(metricName='accuracy')
mlp_accuracy = mlp_eval.evaluate(mlp_pred)

In [34]:
print('MLP accuracy is {:.2f}'.format(mlp_accuracy))

MLP accuracy is 1.00


## Decision Tree

In [35]:
from pyspark.ml.classification import DecisionTreeClassifier

In [36]:
decTree_classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')
decTree_model = decTree_classifier.fit(train_df)
decTree_pred= decTree_model.transform(test_df)

In [37]:
decTree_pred.select(['species','features','label','prediction']).show(150)

+----------+-----------------+-----+----------+
|   species|         features|label|prediction|
+----------+-----------------+-----+----------+
|    setosa|[4.5,2.3,1.3,0.3]|  0.0|       0.0|
|    setosa|[4.6,3.1,1.5,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.1,1.6,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.4,1.6,0.2]|  0.0|       0.0|
|    setosa|[4.8,3.4,1.9,0.2]|  0.0|       0.0|
|versicolor|[4.9,2.4,3.3,1.0]|  1.0|       1.0|
| virginica|[4.9,2.5,4.5,1.7]|  2.0|       1.0|
|    setosa|[5.0,3.5,1.3,0.3]|  0.0|       0.0|
|versicolor|[5.1,2.5,3.0,1.1]|  1.0|       1.0|
|    setosa|[5.1,3.3,1.7,0.5]|  0.0|       0.0|
|    setosa|[5.1,3.5,1.4,0.2]|  0.0|       0.0|
|    setosa|[5.1,3.8,1.6,0.2]|  0.0|       0.0|
|versicolor|[5.2,2.7,3.9,1.4]|  1.0|       1.0|
|    setosa|[5.2,3.4,1.4,0.2]|  0.0|       0.0|
|    setosa|[5.2,3.5,1.5,0.2]|  0.0|       0.0|
|    setosa|[5.2,4.1,1.5,0.1]|  0.0|       0.0|
|versicolor|[5.4,3.0,4.5,1.5]|  1.0|       1.0|
|    setosa|[5.4,3.4,1.5,0.4]|  0.0|    

In [38]:
decTree_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
decTree_accuracy = decTree_evaluator.evaluate(decTree_pred)

In [39]:
print('Decision Tree accuracy is {:.2f}.'.format(decTree_accuracy))

Decision Tree accuracy is 0.93.


#### Keep it up