# **A study on using a Decision Tree to predict the species of an iris flower**
**This notebook is an exercise to practice my data science knowledge**

The decision tree is one of the oldest and most widely used methods in machine learning. It classifies an element by analyzing the relationships between its variables. The algorithm progressively subdivides the data into smaller, more homogeneous subsets according to their attributes, until each subset is simple enough to be assigned a label. To do so, the model must first be trained on previously labeled data before it can be applied to new data.
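
To make the idea of "subdividing the data" concrete, here is a minimal pure-Python sketch of how a single split could be chosen using Gini impurity. The helper names (`gini`, `best_threshold`) and the toy sample are hypothetical, and the candidate thresholds are simple midpoints between observed values, which is a simplification and not necessarily how Spark searches for splits.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        # Weight each side's impurity by the share of rows it receives
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy sample: petal lengths for two classes
petal_len = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, impurity = best_threshold(petal_len, species)
print(threshold, impurity)
```

A full tree would repeat this search on each resulting subset, across all features, until the subsets are pure or a stopping rule is reached.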

The objective of this notebook is to use PySpark to predict the species of an iris flower from its measured characteristics. The dataset covers 3 iris species, with 50 samples collected for each. The species is the output variable, and the petal and sepal measurements are the inputs.

## Workflow stages
The solution workflow goes through five stages.
1.   Load the Data.
2.   Explore some information about the dataset.
3.   Data pre-processing.
4.   Define the Decision Tree Model.
5.   Define the Random Forest Model.

In [0]:
#Import the class that creates the Spark session
from pyspark.sql import SparkSession

In [0]:
#Start the Spark session
spark = SparkSession.builder.appName("DecisionTree").getOrCreate()

In [0]:
%fs ls /FileStore/tables

path,name,size,modificationTime
dbfs:/FileStore/tables/iris_bezdekIris.csv,iris_bezdekIris.csv,4551,1663856964000
dbfs:/FileStore/tables/movies-1.csv,movies-1.csv,494431,1662647401000
dbfs:/FileStore/tables/movies-2.csv,movies-2.csv,494431,1662647447000
dbfs:/FileStore/tables/movies.csv,movies.csv,494431,1662647363000
dbfs:/FileStore/tables/ratings.csv,ratings.csv,2483723,1662647649000
dbfs:/FileStore/tables/u.data,u.data,1979173,1662474869000


In [0]:
#Path to the file to use (named iris_path to avoid shadowing the built-in dir)
iris_path = "/FileStore/tables/iris_bezdekIris.csv"

#1) Load the Data

In [0]:
#Read the stored CSV file through the generic reader (the file has no header row)
df_iris = spark.read.format('csv').options(inferSchema=True, header=False, delimiter=',').load(iris_path)

In [0]:
df_iris.printSchema()

root
 |-- _c0: double (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: string (nullable = true)



In [0]:
df_iris.show(5)

+---+---+---+---+-----------+
|_c0|_c1|_c2|_c3|        _c4|
+---+---+---+---+-----------+
|5.1|3.5|1.4|0.2|Iris-setosa|
|4.9|3.0|1.4|0.2|Iris-setosa|
|4.7|3.2|1.3|0.2|Iris-setosa|
|4.6|3.1|1.5|0.2|Iris-setosa|
|5.0|3.6|1.4|0.2|Iris-setosa|
+---+---+---+---+-----------+
only showing top 5 rows



#2) Explore some information about the dataset

In [0]:
#Rename the default columns with descriptive names
df_iris = df_iris.selectExpr("_c0 as sep_len", "_c1 as sep_wid", "_c2 as pet_len", "_c3 as pet_wid", "_c4 as label")

In [0]:
df_iris.show(5)

+-------+-------+-------+-------+-----------+
|sep_len|sep_wid|pet_len|pet_wid|      label|
+-------+-------+-------+-------+-----------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|
|    4.7|    3.2|    1.3|    0.2|Iris-setosa|
|    4.6|    3.1|    1.5|    0.2|Iris-setosa|
|    5.0|    3.6|    1.4|    0.2|Iris-setosa|
+-------+-------+-------+-------+-----------+
only showing top 5 rows



In [0]:
#Compute summary statistics for the numeric columns
df_iris.describe(['sep_len','sep_wid','pet_len','pet_wid']).show()

+-------+------------------+-------------------+------------------+------------------+
|summary|           sep_len|            sep_wid|           pet_len|           pet_wid|
+-------+------------------+-------------------+------------------+------------------+
|  count|               150|                150|               150|               150|
|   mean| 5.843333333333335|  3.057333333333334|3.7580000000000027| 1.199333333333334|
| stddev|0.8280661279778637|0.43586628493669793|1.7652982332594662|0.7622376689603467|
|    min|               4.3|                2.0|               1.0|               0.1|
|    max|               7.9|                4.4|               6.9|               2.5|
+-------+------------------+-------------------+------------------+------------------+
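
As a rough check on what these summary rows mean, the sketch below recomputes the same five statistics in plain Python for the first five sepal lengths shown in the preview earlier (not the full 150 rows, so the numbers differ from the table). It assumes the `stddev` row is the sample standard deviation (dividing by n - 1), which I believe is what Spark reports here.

```python
import statistics

# First five sepal lengths from the preview (not the full dataset)
sep_len = [5.1, 4.9, 4.7, 4.6, 5.0]

summary = {
    "count": len(sep_len),
    "mean": statistics.mean(sep_len),
    "stddev": statistics.stdev(sep_len),  # sample standard deviation (n - 1)
    "min": min(sep_len),
    "max": max(sep_len),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```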



In [0]:
#Register the DataFrame as a temporary view so it can be queried with SQL
df_iris.createOrReplaceTempView("irisTable")

In [0]:
display(spark.sql('select * from irisTable '))

sep_len,sep_wid,pet_len,pet_wid,label
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa


#3) Data pre-processing

In [0]:
#Import the transformer that combines several columns into a single feature vector
from pyspark.ml.feature import VectorAssembler

In [0]:
#Assemble the four measurement columns into a single feature vector
vector_assembler = VectorAssembler(inputCols=["sep_len", "sep_wid", "pet_len", "pet_wid"],outputCol="features")
df_temp = vector_assembler.transform(df_iris)
df_temp.show(5)

+-------+-------+-------+-------+-----------+-----------------+
|sep_len|sep_wid|pet_len|pet_wid|      label|         features|
+-------+-------+-------+-------+-----------+-----------------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
|    4.7|    3.2|    1.3|    0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
|    4.6|    3.1|    1.5|    0.2|Iris-setosa|[4.6,3.1,1.5,0.2]|
|    5.0|    3.6|    1.4|    0.2|Iris-setosa|[5.0,3.6,1.4,0.2]|
+-------+-------+-------+-------+-----------+-----------------+
only showing top 5 rows
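
Conceptually, the assembling step just gathers the chosen columns of each row, in order, into one list-like value. The pure-Python sketch below mimics that on two rows copied from the output above. The `assemble` helper is hypothetical and only illustrates the idea, not Spark's implementation.

```python
# Two rows copied from the output above, as plain dictionaries
rows = [
    {"sep_len": 5.1, "sep_wid": 3.5, "pet_len": 1.4, "pet_wid": 0.2, "label": "Iris-setosa"},
    {"sep_len": 4.9, "sep_wid": 3.0, "pet_len": 1.4, "pet_wid": 0.2, "label": "Iris-setosa"},
]
input_cols = ["sep_len", "sep_wid", "pet_len", "pet_wid"]

def assemble(row, cols):
    """Collect the selected columns, in the given order, into one vector."""
    return [row[c] for c in cols]

# Add a "features" entry to each row, leaving the original columns in place
assembled = [{**row, "features": assemble(row, input_cols)} for row in rows]

for row in assembled:
    print(row["label"], row["features"])
```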



In [0]:
#Drop the individual measurement columns, which are now contained in the feature vector
df_formatted = df_temp.drop('sep_len', 'sep_wid', 'pet_len', 'pet_wid')
df_formatted.show(5)

+-----------+-----------------+
|      label|         features|
+-----------+-----------------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|
|Iris-setosa|[4.9,3.0,1.4,0.2]|
|Iris-setosa|[4.7,3.2,1.3,0.2]|
|Iris-setosa|[4.6,3.1,1.5,0.2]|
|Iris-setosa|[5.0,3.6,1.4,0.2]|
+-----------+-----------------+
only showing top 5 rows



In [0]:
#Apply transformations to the label column
from pyspark.ml.feature import StringIndexer  #Maps each distinct string in the label column to a numeric index

l_indexer = StringIndexer(inputCol="label", outputCol="labelIndex")  #Create the indexer object
df_final = l_indexer.fit(df_formatted).transform(df_formatted)  #Fit the mapping and apply the transformation

In [0]:
df_final.show(5)

+-----------+-----------------+----------+
|      label|         features|labelIndex|
+-----------+-----------------+----------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|       0.0|
|Iris-setosa|[4.9,3.0,1.4,0.2]|       0.0|
|Iris-setosa|[4.7,3.2,1.3,0.2]|       0.0|
|Iris-setosa|[4.6,3.1,1.5,0.2]|       0.0|
|Iris-setosa|[5.0,3.6,1.4,0.2]|       0.0|
+-----------+-----------------+----------+
only showing top 5 rows
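
The label encoding can be pictured as building a lookup table from each species name to a number. The sketch below is a pure-Python approximation: it orders labels by descending frequency and breaks ties alphabetically, which is my assumption about the default ordering and matches the `0.0` assigned to `Iris-setosa` above. The function name `fit_label_index` is hypothetical, and the other two species names come from the standard iris dataset.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Ties are broken alphabetically (an assumption made for this sketch)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# 50 samples per species, as described in the introduction
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50

mapping = fit_label_index(labels)
label_index = [mapping[lab] for lab in labels]
print(mapping)
```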



In [0]:
#Split the data into training (70%) and test (30%) sets; no seed is set, so the split changes on every run
(train, test) = df_final.randomSplit([0.7, 0.3])

In [0]:
test.show(5)

+-----------+-----------------+----------+
|      label|         features|labelIndex|
+-----------+-----------------+----------+
|Iris-setosa|[4.4,3.2,1.3,0.2]|       0.0|
|Iris-setosa|[4.5,2.3,1.3,0.3]|       0.0|
|Iris-setosa|[4.6,3.6,1.0,0.2]|       0.0|
|Iris-setosa|[4.7,3.2,1.6,0.2]|       0.0|
|Iris-setosa|[4.9,3.1,1.5,0.1]|       0.0|
+-----------+-----------------+----------+
only showing top 5 rows
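
Because the split is random, the rows in each set (and therefore the accuracy figures further down) can change between runs. PySpark's `randomSplit` accepts an optional `seed` argument for reproducibility. The pure-Python sketch below illustrates the general idea of a seeded, weight-based split in which every row is assigned independently, so the resulting sizes are only approximately 70/30. The `random_split` helper is hypothetical and is not Spark's algorithm.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part by an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.7, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches draws missed by float rounding
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts

rows = list(range(150))  # stand-ins for the 150 iris rows
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Running the function twice with the same seed returns the same two lists, which is the property that makes results comparable across runs.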



#4) Define the Decision Tree Model

In [0]:
from pyspark.ml.classification import DecisionTreeClassifier  #Decision tree classifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator  #Used to compute performance metrics

In [0]:
modelTree = DecisionTreeClassifier(labelCol="labelIndex", featuresCol="features")  #Define the model
model = modelTree.fit(train)  #Train the model on the training set

In [0]:
#Run the predictions on the test set
predictions = model.transform(test)
predictions.select("prediction", "labelIndex").show(5)

+----------+----------+
|prediction|labelIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 5 rows



In [0]:
#Define the evaluator that computes the model's accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction",metricName="accuracy")

In [0]:
acc = evaluator.evaluate(predictions)
print("Model Accuracy =  ",(acc))

Model Accuracy =   0.9347826086956522
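
Accuracy is simply the share of test rows whose predicted index equals the true index. The sketch below shows the calculation on a small hypothetical list of predictions. As a side note, the value printed above is numerically consistent with 43 correct predictions out of 46 test rows, although that is my inference and the notebook does not print the test set size.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical predictions and true label indices (not taken from the run above)
predicted = [0.0, 0.0, 1.0, 2.0, 1.0]
actual = [0.0, 0.0, 1.0, 2.0, 2.0]

print(accuracy(predicted, actual))
```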


#5) Define the Random Forest Model

In [0]:
from pyspark.ml.classification import RandomForestClassifier  #Classifier for Random Forest


In [0]:
modelRF = RandomForestClassifier(labelCol="labelIndex",featuresCol="features", numTrees=10)  #Define the model
modelRF = modelRF.fit(train)  #Train the model; the name modelRF now refers to the fitted model

In [0]:
#Run the predictions on the test set
predictions = modelRF.transform(test)
predictions.select("prediction", "labelIndex").show(5)

+----------+----------+
|prediction|labelIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 5 rows



In [0]:
#Define the evaluator that computes the model's accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction",metricName="accuracy")

In [0]:
acc = evaluator.evaluate(predictions)
print("Model Accuracy =  ",(acc))

Model Accuracy =   0.9565217391304348


In [0]:
print(modelRF)

RandomForestClassificationModel: uid=RandomForestClassifier_b2ae63031cc2, numTrees=10, numClasses=3, numFeatures=4
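
A random forest combines the outputs of its individual trees (10 in this case) into one prediction. The sketch below shows the simplest form of that idea, a hard majority vote over hypothetical per-tree predictions for one flower. This is a simplified illustration: as far as I know, Spark aggregates the class probabilities of the trees instead of counting hard votes, so treat the `majority_vote` helper as a teaching aid only.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    top = max(counts.values())
    # On a tie, pick the smallest class index (an assumption for this sketch)
    return min(cls for cls, n in counts.items() if n == top)

# Hypothetical predictions from 10 trees for a single flower
votes = [1.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0]

forest_prediction = majority_vote(votes)
print(forest_prediction)
```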
