# **A study of using an SVM model to predict the iris species of a flower**
**This notebook is an exercise to practice my data science skills.**

The objective of this notebook is to use PySpark to predict the iris species of a flower from its measurements. The dataset contains 150 samples, 50 for each of three iris species; because the SVM is used here as a binary classifier, only two of the species (100 samples) are kept for modelling. The species is the output variable, and the sepal and petal lengths and widths are the input variables.

## Workflow stages
The solution workflow goes through four stages.
1.   Load the Data.
2.   Explore the dataset.
3.   Data pre-processing.
4.   Define the SVM Model.

In [0]:
#Import the class that creates the Spark session
from pyspark.sql import SparkSession

In [0]:
#Start the Spark session
spark = SparkSession.builder.appName("SVM_MLP").getOrCreate()

In [0]:
%fs ls /FileStore/tables

path,name,size,modificationTime
dbfs:/FileStore/tables/iris_bezdekIris.csv,iris_bezdekIris.csv,4551,1663856964000
dbfs:/FileStore/tables/movies-1.csv,movies-1.csv,494431,1662647401000
dbfs:/FileStore/tables/movies-2.csv,movies-2.csv,494431,1662647447000
dbfs:/FileStore/tables/movies.csv,movies.csv,494431,1662647363000
dbfs:/FileStore/tables/ratings.csv,ratings.csv,2483723,1662647649000
dbfs:/FileStore/tables/u.data,u.data,1979173,1662474869000


In [0]:
#Path of the file to use
dir="/FileStore/tables/iris_bezdekIris.csv"

#1) Load the Data

In [0]:
#Read the stored CSV file with the generic reader (no header row, column types inferred)
df_iris = spark.read.format('csv').options(inferSchema=True,header='false',delimiter=',').load(dir)

In [0]:
df_iris.printSchema()

root
 |-- _c0: double (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: string (nullable = true)



#2) Explore the dataset

In [0]:
#Rename the columns to descriptive names
df_iris = df_iris.selectExpr("_c0 as sep_len", "_c1 as sep_wid", "_c2 as pet_len", "_c3 as pet_wid", "_c4 as label")

In [0]:
df_iris.show(5)

+-------+-------+-------+-------+-----------+
|sep_len|sep_wid|pet_len|pet_wid|      label|
+-------+-------+-------+-------+-----------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|
|    4.7|    3.2|    1.3|    0.2|Iris-setosa|
|    4.6|    3.1|    1.5|    0.2|Iris-setosa|
|    5.0|    3.6|    1.4|    0.2|Iris-setosa|
+-------+-------+-------+-------+-----------+
only showing top 5 rows



In [0]:
#Analyze statistics
df_iris.describe(['sep_len','sep_wid','pet_len','pet_wid']).show()

+-------+------------------+-------------------+------------------+------------------+
|summary|           sep_len|            sep_wid|           pet_len|           pet_wid|
+-------+------------------+-------------------+------------------+------------------+
|  count|               150|                150|               150|               150|
|   mean| 5.843333333333335|  3.057333333333334|3.7580000000000027| 1.199333333333334|
| stddev|0.8280661279778637|0.43586628493669793|1.7652982332594662|0.7622376689603467|
|    min|               4.3|                2.0|               1.0|               0.1|
|    max|               7.9|                4.4|               6.9|               2.5|
+-------+------------------+-------------------+------------------+------------------+
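To make the summary rows easier to interpret, here is a minimal sketch in plain Python that computes the same five statistics for the five `sep_len` values shown by `df_iris.show(5)`. It assumes, as far as I know, that Spark's `stddev` is the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes. The `summary` dictionary is just an illustrative name.

```python
import statistics

# The five sep_len values printed by df_iris.show(5) above.
sep_len = [5.1, 4.9, 4.7, 4.6, 5.0]

# Same five rows that describe() reports, for this small sample only.
summary = {
    "count": len(sep_len),
    "mean": statistics.mean(sep_len),
    "stddev": statistics.stdev(sep_len),  # sample standard deviation (n - 1)
    "min": min(sep_len),
    "max": max(sep_len),
}

for name, value in summary.items():
    print(name, value)
```

These numbers describe only the first five rows, so they differ from the full-dataset values in the table above.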



In [0]:
#Register the DataFrame as a temporary view so it can be queried with SQL
df_iris.createOrReplaceTempView("irisTable")

In [0]:
display(spark.sql('select * from irisTable '))

sep_len,sep_wid,pet_len,pet_wid,label
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa


#3) Data pre-processing

In [0]:
#Classes used to build the feature vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [0]:
#Create the feature vector by combining the four measurement columns
vector_assembler = VectorAssembler(inputCols=["sep_len", "sep_wid", "pet_len", "pet_wid"],outputCol="features")
df_temp = vector_assembler.transform(df_iris)
df_temp.show(5)

+-------+-------+-------+-------+-----------+-----------------+
|sep_len|sep_wid|pet_len|pet_wid|      label|         features|
+-------+-------+-------+-------+-----------+-----------------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
|    4.7|    3.2|    1.3|    0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
|    4.6|    3.1|    1.5|    0.2|Iris-setosa|[4.6,3.1,1.5,0.2]|
|    5.0|    3.6|    1.4|    0.2|Iris-setosa|[5.0,3.6,1.4,0.2]|
+-------+-------+-------+-------+-----------+-----------------+
only showing top 5 rows



In [0]:
df_temp.printSchema()

root
 |-- sep_len: double (nullable = true)
 |-- sep_wid: double (nullable = true)
 |-- pet_len: double (nullable = true)
 |-- pet_wid: double (nullable = true)
 |-- label: string (nullable = true)
 |-- features: vector (nullable = true)



In [0]:
#Remove the columns that are no longer needed
df_formatted = df_temp.drop('sep_len', 'sep_wid', 'pet_len', 'pet_wid')
df_formatted.show(5)

+-----------+-----------------+
|      label|         features|
+-----------+-----------------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|
|Iris-setosa|[4.9,3.0,1.4,0.2]|
|Iris-setosa|[4.7,3.2,1.3,0.2]|
|Iris-setosa|[4.6,3.1,1.5,0.2]|
|Iris-setosa|[5.0,3.6,1.4,0.2]|
+-----------+-----------------+
only showing top 5 rows



In [0]:
#Encode the label column as numeric indices
from pyspark.ml.feature import StringIndexer  #Maps each distinct class in the label column to a numeric index

l_indexer = StringIndexer(inputCol="label", outputCol="labelIndex")  #Create the indexer object
df_final = l_indexer.fit(df_formatted).transform(df_formatted)  #Fit the indexer and apply the transformation

In [0]:
df_final.show(5)

+-----------+-----------------+----------+
|      label|         features|labelIndex|
+-----------+-----------------+----------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|       0.0|
|Iris-setosa|[4.9,3.0,1.4,0.2]|       0.0|
|Iris-setosa|[4.7,3.2,1.3,0.2]|       0.0|
|Iris-setosa|[4.6,3.1,1.5,0.2]|       0.0|
|Iris-setosa|[5.0,3.6,1.4,0.2]|       0.0|
+-----------+-----------------+----------+
only showing top 5 rows
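The mapping from species name to `labelIndex` can be illustrated with a small plain-Python sketch. It assumes the default `StringIndexer` ordering (most frequent label first) and, as I understand Spark 3.x behaves, alphabetical order when frequencies are tied. The function name `fit_string_index` and the reconstructed label list (50 samples for each of the three standard iris species) are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Assumed label column: 50 samples for each of the three iris species.
labels = (["Iris-setosa"] * 50
          + ["Iris-versicolor"] * 50
          + ["Iris-virginica"] * 50)

mapping = fit_string_index(labels)
print(mapping)
```

Under these assumptions `Iris-setosa` receives index 0.0, which agrees with the rows shown above.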



In [0]:
df_final.printSchema()

root
 |-- label: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- labelIndex: double (nullable = false)



Normalize the data

In [0]:
#Normalize the data
from pyspark.ml.feature import MinMaxScaler  #Rescales each feature to the range [0, 1]
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")  #Create the scaler object
scalerModel = scaler.fit(df_final)

In [0]:
df_final = scalerModel.transform(df_final).drop('features').withColumnRenamed('scaledFeatures', 'features')
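As a worked example of what the scaler does, the sketch below applies the min-max formula `(x - min) / (max - min)` to the first row of the dataset, using the per-column `min` and `max` values reported by `describe()` earlier. The helper `min_max_scale` is a hypothetical name, and the handling of a constant column (returning 0.5) is my understanding of `MinMaxScaler`'s documented convention for the default [0, 1] range.

```python
def min_max_scale(value, col_min, col_max):
    """Rescale a value to [0, 1] using its column's minimum and maximum."""
    if col_max == col_min:
        # Constant column: assumed convention is the midpoint of the range.
        return 0.5
    return (value - col_min) / (col_max - col_min)

# First row of the dataset and the column statistics from describe().
row = [5.1, 3.5, 1.4, 0.2]        # sep_len, sep_wid, pet_len, pet_wid
col_mins = [4.3, 2.0, 1.0, 0.1]
col_maxs = [7.9, 4.4, 6.9, 2.5]

scaled = [min_max_scale(v, lo, hi) for v, lo, hi in zip(row, col_mins, col_maxs)]
print(scaled)
```

The first component works out to (5.1 - 4.3) / (7.9 - 4.3), roughly 0.2222, which is the leading value of the first scaled feature vector in the notebook output.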

In [0]:
df_final.show()

+-----------+----------+--------------------+
|      label|labelIndex|            features|
+-----------+----------+--------------------+
|Iris-setosa|       0.0|[0.22222222222222...|
|Iris-setosa|       0.0|[0.16666666666666...|
|Iris-setosa|       0.0|[0.11111111111111...|
|Iris-setosa|       0.0|[0.08333333333333...|
|Iris-setosa|       0.0|[0.19444444444444...|
|Iris-setosa|       0.0|[0.30555555555555...|
|Iris-setosa|       0.0|[0.08333333333333...|
|Iris-setosa|       0.0|[0.19444444444444...|
|Iris-setosa|       0.0|[0.02777777777777...|
|Iris-setosa|       0.0|[0.16666666666666...|
|Iris-setosa|       0.0|[0.30555555555555...|
|Iris-setosa|       0.0|[0.13888888888888...|
|Iris-setosa|       0.0|[0.13888888888888...|
|Iris-setosa|       0.0|[0.0,0.4166666666...|
|Iris-setosa|       0.0|[0.41666666666666...|
|Iris-setosa|       0.0|[0.38888888888888...|
|Iris-setosa|       0.0|[0.30555555555555...|
|Iris-setosa|       0.0|[0.22222222222222...|
|Iris-setosa|       0.0|[0.3888888

Modify the Dataset for Binary Classification

In [0]:
import pyspark.sql.functions as F
df_SVM=df_final.where((F.col("labelIndex") == 0) | (F.col("labelIndex") == 1))  #Keep only the classes indexed 0 and 1, turning the task into a binary classification problem

In [0]:
df_SVM.show(150)

+---------------+----------+--------------------+
|          label|labelIndex|            features|
+---------------+----------+--------------------+
|    Iris-setosa|       0.0|[0.22222222222222...|
|    Iris-setosa|       0.0|[0.16666666666666...|
|    Iris-setosa|       0.0|[0.11111111111111...|
|    Iris-setosa|       0.0|[0.08333333333333...|
|    Iris-setosa|       0.0|[0.19444444444444...|
|    Iris-setosa|       0.0|[0.30555555555555...|
|    Iris-setosa|       0.0|[0.08333333333333...|
|    Iris-setosa|       0.0|[0.19444444444444...|
|    Iris-setosa|       0.0|[0.02777777777777...|
|    Iris-setosa|       0.0|[0.16666666666666...|
|    Iris-setosa|       0.0|[0.30555555555555...|
|    Iris-setosa|       0.0|[0.13888888888888...|
|    Iris-setosa|       0.0|[0.13888888888888...|
|    Iris-setosa|       0.0|[0.0,0.4166666666...|
|    Iris-setosa|       0.0|[0.41666666666666...|
|    Iris-setosa|       0.0|[0.38888888888888...|
|    Iris-setosa|       0.0|[0.30555555555555...|


In [0]:
#Remove unused columns
df_SVM = df_SVM.drop('label')
df_SVM.show(130)

+----------+--------------------+
|labelIndex|            features|
+----------+--------------------+
|       0.0|[0.22222222222222...|
|       0.0|[0.16666666666666...|
|       0.0|[0.11111111111111...|
|       0.0|[0.08333333333333...|
|       0.0|[0.19444444444444...|
|       0.0|[0.30555555555555...|
|       0.0|[0.08333333333333...|
|       0.0|[0.19444444444444...|
|       0.0|[0.02777777777777...|
|       0.0|[0.16666666666666...|
|       0.0|[0.30555555555555...|
|       0.0|[0.13888888888888...|
|       0.0|[0.13888888888888...|
|       0.0|[0.0,0.4166666666...|
|       0.0|[0.41666666666666...|
|       0.0|[0.38888888888888...|
|       0.0|[0.30555555555555...|
|       0.0|[0.22222222222222...|
|       0.0|[0.38888888888888...|
|       0.0|[0.22222222222222...|
|       0.0|[0.30555555555555...|
|       0.0|[0.22222222222222...|
|       0.0|[0.08333333333333...|
|       0.0|[0.22222222222222...|
|       0.0|[0.13888888888888...|
|       0.0|[0.19444444444444...|
|       0.0|[0

In [0]:
df_SVM=df_SVM.selectExpr('features',"labelIndex as label")

In [0]:
df_SVM.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.22222222222222...|  0.0|
|[0.16666666666666...|  0.0|
|[0.11111111111111...|  0.0|
|[0.08333333333333...|  0.0|
|[0.19444444444444...|  0.0|
+--------------------+-----+
only showing top 5 rows



In [0]:
#Split the data into training and test sets (the weights are approximate proportions)
(train, test) = df_SVM.randomSplit([0.7, 0.3])

In [0]:
train.show(100)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,0.4166666666...|  0.0|
|[0.02777777777777...|  0.0|
|[0.05555555555555...|  0.0|
|[0.08333333333333...|  0.0|
|[0.08333333333333...|  0.0|
|[0.08333333333333...|  0.0|
|[0.11111111111111...|  0.0|
|[0.13888888888888...|  0.0|
|[0.13888888888888...|  0.0|
|[0.13888888888888...|  0.0|
|[0.13888888888888...|  0.0|
|[0.16666666666666...|  0.0|
|[0.16666666666666...|  0.0|
|[0.19444444444444...|  1.0|
|[0.19444444444444...|  0.0|
|[0.19444444444444...|  0.0|
|[0.19444444444444...|  0.0|
|[0.19444444444444...|  0.0|
|[0.19444444444444...|  0.0|
|[0.22222222222222...|  1.0|
|[0.22222222222222...|  0.0|
|[0.22222222222222...|  0.0|
|[0.22222222222222...|  0.0|
|[0.22222222222222...|  0.0|
|[0.22222222222222...|  0.0|
|[0.25000000000000...|  0.0|
|[0.25000000000000...|  0.0|
|[0.27777777777777...|  0.0|
|[0.30555555555555...|  1.0|
|[0.30555555555555...|  0.0|
|[0.30555555555555...|  0.0|
|[0.3055555555

In [0]:
print("Train data: ", train.count())
print("Test data: ", test.count())

Train data:  56
Test data:  44
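The counts above (56 and 44) are not an exact 70/30 split. This is expected: `randomSplit` treats the weights as per-row probabilities, so the resulting sizes vary from run to run, especially on a small dataset. The plain-Python sketch below illustrates the idea under that assumption; it is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(100))  # stand-ins for the 100 rows of the binary dataset
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

In the notebook itself, passing a `seed` argument to `randomSplit` should likewise make the split, and therefore the reported counts and error, reproducible.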


#4) Define the SVM Model

In [0]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel  #RDD-based linear SVM trained with SGD
from pyspark.ml.evaluation import MulticlassClassificationEvaluator  #Can be used to compute performance metrics
from pyspark.mllib.linalg import Vectors  #MLlib (RDD-based) vector type

In [0]:
from pyspark.mllib.util import MLUtils
df_train = MLUtils.convertVectorColumnsFromML(train, "features")
df_test = MLUtils.convertVectorColumnsFromML(test, "features")



In [0]:
df_train.show(5,False)

+----------------------------------------------------------------------------------+-----+
|features                                                                          |label|
+----------------------------------------------------------------------------------+-----+
|[0.0,0.41666666666666663,0.016949152542372895,0.0]                                |0.0  |
|[0.027777777777777922,0.37499999999999994,0.06779661016949151,0.04166666666666667]|0.0  |
|[0.055555555555555594,0.12499999999999992,0.05084745762711865,0.08333333333333333]|0.0  |
|[0.08333333333333327,0.4583333333333333,0.0847457627118644,0.04166666666666667]   |0.0  |
|[0.08333333333333327,0.5,0.06779661016949151,0.04166666666666667]                 |0.0  |
+----------------------------------------------------------------------------------+-----+
only showing top 5 rows



In [0]:
from pyspark.mllib.regression import LabeledPoint  #Pairs a label with its feature vector, the input format expected by SVMWithSGD

trainingData = df_train.rdd.map(lambda row:LabeledPoint(row.label,row.features))  #Convert the training rows to LabeledPoint
testingData = df_test.rdd.map(lambda row:LabeledPoint(row.label,row.features))  #Convert the test rows to LabeledPoint

In [0]:
for xs in trainingData.take(10):
    print(xs)

(0.0,[0.0,0.41666666666666663,0.016949152542372895,0.0])
(0.0,[0.027777777777777922,0.37499999999999994,0.06779661016949151,0.04166666666666667])
(0.0,[0.055555555555555594,0.12499999999999992,0.05084745762711865,0.08333333333333333])
(0.0,[0.08333333333333327,0.4583333333333333,0.0847457627118644,0.04166666666666667])
(0.0,[0.08333333333333327,0.5,0.06779661016949151,0.04166666666666667])
(0.0,[0.08333333333333327,0.6666666666666666,0.0,0.04166666666666667])
(0.0,[0.11111111111111119,0.5,0.1016949152542373,0.04166666666666667])
(0.0,[0.13888888888888887,0.41666666666666663,0.06779661016949151,0.0])
(0.0,[0.13888888888888887,0.41666666666666663,0.06779661016949151,0.08333333333333333])
(0.0,[0.13888888888888887,0.5833333333333333,0.1016949152542373,0.04166666666666667])


In [0]:
#Train the model
modelSVM = SVMWithSGD.train(trainingData, iterations=100)
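To give an intuition for what training a linear SVM with SGD involves, here is a minimal plain-Python sketch of the hinge loss, one subgradient step, and the thresholded prediction. It is an illustration under stated assumptions, not MLlib's implementation: the 0/1 labels are mapped to -1/+1, the decision threshold is 0, and the function names, the step size, the `reg` parameter, and the explicit intercept are all choices made for this example.

```python
def margin(weights, intercept, features):
    """Raw decision value w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def predict(weights, intercept, features, threshold=0.0):
    """Return class 1.0 if the margin exceeds the threshold, else 0.0."""
    return 1.0 if margin(weights, intercept, features) > threshold else 0.0

def hinge_loss(weights, intercept, features, label):
    """Hinge loss max(0, 1 - y * margin), with the 0/1 label mapped to -1/+1."""
    signed = 2.0 * label - 1.0
    return max(0.0, 1.0 - signed * margin(weights, intercept, features))

def sgd_step(weights, intercept, features, label, step=0.5, reg=0.0):
    """One subgradient step on the L2-regularized hinge loss for one sample."""
    signed = 2.0 * label - 1.0
    # The hinge term contributes a gradient only when the margin is below 1.
    active = signed * margin(weights, intercept, features) < 1.0
    new_weights = []
    for w, x in zip(weights, features):
        grad = reg * w - (signed * x if active else 0.0)
        new_weights.append(w - step * grad)
    new_intercept = intercept + (step * signed if active else 0.0)
    return new_weights, new_intercept

# Toy usage: start from zero weights and take one step on a positive sample.
w0, b0 = [0.0, 0.0], 0.0
sample, label = [1.0, 2.0], 1.0
loss_before = hinge_loss(w0, b0, sample, label)
w1, b1 = sgd_step(w0, b0, sample, label)
loss_after = hinge_loss(w1, b1, sample, label)
print(loss_before, loss_after, predict(w1, b1, sample))
```

A real trainer repeats such steps over many samples and iterations; the `iterations=100` argument in the cell above controls how many passes the library makes.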

In [0]:
#Predict on the test set and compute the fraction of misclassified samples
labelsAndPreds = testingData.map(lambda p: (p.label, modelSVM.predict(p.features)))
testErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(testingData.count())
print("Error in prediction: ",testErr)

Error in prediction:  0.0
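An error of 0.0 on 44 test samples is plausible for two well-separated iris species, but a single number hides which class any mistakes would fall in. The sketch below shows, in plain Python, how an error rate, an accuracy and confusion counts could be derived from `(label, prediction)` pairs. The `evaluate` helper and the example pairs are hypothetical and are not results from this notebook.

```python
def evaluate(pairs):
    """Compute error rate, accuracy and confusion counts from (label, prediction) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    errors = sum(1 for label, pred in pairs if label != pred)
    confusion = {}
    for label, pred in pairs:
        key = (label, pred)
        confusion[key] = confusion.get(key, 0) + 1
    return {
        "error": errors / total,
        "accuracy": 1.0 - errors / total,
        "confusion": confusion,
    }

# Hypothetical pairs for illustration only: one class-1 sample is misclassified.
example_pairs = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 0.0)]
metrics = evaluate(example_pairs)
print(metrics)
```

In the notebook, the same pairs could be obtained by collecting `labelsAndPreds`, which is small enough here to bring back to the driver.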
