# **LAB 03 - Installing Spark and implementing machine learning algorithms**


## **Installing Spark**

Install the Java 8 environment

In [None]:
# !apt-get install openjdk-8-jdk-headless -qq > /dev/null

Download the Apache Spark package (version 3.2.1)

In [None]:
# !wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

Extract the Apache Spark package and install findspark

In [None]:
# !tar xzf spark-3.2.1-bin-hadoop3.2.tgz
# !pip install findspark

Set the paths for the Java environment and the Spark installation

In [None]:
# import os
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"

Install PySpark, the Python API for Spark

In [None]:
# !pip install pyspark

Import and initialize the Spark session

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Read the Absenteeism at work dataset from the GitHub link provided in the lab instructions

In [None]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/Ruthvicp/CS5590_BigDataProgramming/master/Lab/Lab4/Source/Absenteeism_at_work.csv')

In [None]:
data

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary_failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism_time_in_hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,...,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
735,11,14,7,3,1,289,36,13,33,264.604,...,0,1,2,1,0,1,90,172,30,8
736,1,11,7,3,1,235,11,14,37,264.604,...,0,3,1,0,0,1,88,172,29,4
737,4,0,0,3,1,118,14,13,40,271.219,...,0,1,1,1,0,8,98,170,34,0
738,8,0,0,4,2,231,35,14,39,271.219,...,0,1,2,1,0,2,100,170,35,0


Create a Spark DataFrame from the pandas DataFrame created above

In [None]:
data = spark.createDataFrame(data)

**Note:** The Spark installation steps only need to run once after connecting to Google Colab, to install and initialize Spark there. On later runs, comment out (#) the installation cells to avoid downloading the Spark packages again.
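Instead of commenting the cells out by hand, the installation could be guarded by a check on the extracted directory. The sketch below is one possible approach and not part of the lab: `needs_install` is a hypothetical helper, and the path reuses the `SPARK_HOME` value set earlier.

```python
import os

# Same location as the SPARK_HOME value used earlier in this notebook.
SPARK_HOME = "/content/spark-3.2.1-bin-hadoop3.2"


def needs_install(spark_home: str) -> bool:
    """Return True when the extracted Spark directory is not present yet.

    Hypothetical helper, for illustration only.
    """
    return not os.path.isdir(spark_home)


if needs_install(SPARK_HOME):
    # In Colab, the download and extract cells would run at this point.
    print("Spark not found: run the installation cells.")
else:
    print("Spark already extracted: the installation cells can be skipped.")
```

The check only looks at the directory, so it assumes a previous extraction completed successfully.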

## **Running the machine learning algorithms**
Use the dataset available at the GitHub link provided above

### Decision Tree

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
import scipy


data1 = data.withColumn("MOA", data["Month of absence"] - 0).withColumn("label", data['Height'] - 0). \
    withColumn("ROA", data["Reason for absence"] - 0). \
    withColumn("distance", data["Distance from Residence to Work"] - 0). \
    withColumn("BMI", data["Body mass index"] - 0)

# Note: "label" (Height) is also one of the two input features here.
assem = VectorAssembler(inputCols=["label", "distance"], outputCol='features')
data1 = assem.transform(data1)

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data1)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data1)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data1.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

accuracy = evaluator.evaluate(predictions)

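# Note: y_true and y_pred are read from the BMI and ROA columns of data1,
# not from the model's predictions, so the scikit-learn metrics below
# compare those two columns.
y_true = data1.select("BMI").rdd.flatMap(lambda x: x).collect()
y_pred = data1.select("ROA").rdd.flatMap(lambda x: x).collect()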

confusionmatrix = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred, average='micro')

recall = recall_score(y_true, y_pred, average='micro')

treeModel = model.stages[2]
# summary only
print(treeModel)
print("Decision Tree - Test Accuracy = %g" % (accuracy))
print("Decision Tree - Test Error = %g" % (1.0 - accuracy))

print("The Confusion Matrix for Decision Tree Model is :\n" + str(confusionmatrix))

print("The precision score for Decision Tree Model is: " + str(precision))

print("The recall score for Decision Tree Model is: " + str(recall))

+----------+------------+------------+
|prediction|indexedLabel|    features|
+----------+------------+------------+
|       1.0|         1.0|[172.0,11.0]|
|       1.0|         1.0|[172.0,11.0]|
|       1.0|         1.0|[172.0,11.0]|
|       1.0|         1.0|[172.0,11.0]|
|       1.0|         1.0|[172.0,11.0]|
+----------+------------+------------+
only showing top 5 rows

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_46f71cf64313, depth=5, numNodes=23, numClasses=14, numFeatures=2
Decision Tree - Test Accuracy = 0.966527
Decision Tree - Test Error = 0.0334728
The Confusion Matrix for Decision Tree Model is :
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [2 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [5 0 0 ... 0 0 0]]
The precision score for Decision Tree Model is: 0.02972972972972973
The recall score for Decision Tree Model is: 0.02972972972972973
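The precision and recall printed above come from comparing the `BMI` column with the `ROA` column, which is why they sit near 0.03 while the evaluator reports about 0.97 accuracy. A minimal plain-Python sketch of what a confusion matrix over (true label, predicted label) pairs counts is shown below. The pairs are made-up illustration data, not values from the dataset, and `confusion_counts` and `accuracy` are hypothetical helpers.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of rows whose prediction equals the true label."""
    counts = confusion_counts(y_true, y_pred)
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / len(y_true)


# Made-up labels, for illustration only.
y_true = [0.0, 0.0, 1.0, 1.0, 2.0]
y_pred = [0.0, 0.0, 1.0, 2.0, 2.0]

print(confusion_counts(y_true, y_pred))
print(accuracy(y_true, y_pred))  # 4 of 5 rows match -> 0.8
```

In the notebook, the same pairs could be taken from the `indexedLabel` and `prediction` columns of `predictions`, as the Iris cells later do.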


### Naive Bayes

In [None]:
from pyspark.ml.feature import VectorAssembler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
import scipy

from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.ml.linalg import SparseVector
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

data2 = data.withColumn("MOA", data["Month of absence"] - 0).withColumn("label", data['Seasons'] - 0). \
    withColumn("ROA", data["Reason for absence"] - 0). \
    withColumn("distance", data["Distance from Residence to Work"] - 0). \
    withColumn("BMI", data["Body mass index"] - 0)

# Note: "label" (Seasons) is also one of the two input features here.
assem = VectorAssembler(inputCols=["label", "MOA"], outputCol='features')

data2 = assem.transform(data2)
# Split the data into train and test
splits = data2.randomSplit([0.7, 0.3], 1000)
train = splits[0]
test = splits[1]

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

# make predictions on the test set
predictions = model.transform(test)

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")

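# Note: y_true and y_pred are read from the BMI and ROA columns of data2,
# not from the model's predictions, so the scikit-learn metrics below
# compare those two columns.
y_true = data2.select("BMI").rdd.flatMap(lambda x: x).collect()
y_pred = data2.select("ROA").rdd.flatMap(lambda x: x).collect()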


accuracy = evaluator.evaluate(predictions)

confusionmatrix = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred, average='micro')

recall = recall_score(y_true, y_pred, average='micro')


print("Naive Bayes - Test set accuracy = " + str(accuracy))

print("The Confusion Matrix for Naive Bayes Model is :\n" + str(confusionmatrix))

print("The precision score for Naive Bayes Model is: " + str(precision))

print("The recall score for Naive Bayes Model is: " + str(recall))

Naive Bayes - Test set accuracy = 0.06756756756756757
The Confusion Matrix for Naive Bayes Model is :
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [2 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [5 0 0 ... 0 0 0]]
The precision score for Naive Bayes Model is: 0.02972972972972973
The recall score for Naive Bayes Model is: 0.02972972972972973


### Random Forest

In [None]:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score



data3 = data.withColumn("MOA", data["Month of absence"] - 0).withColumn("label", data['Height'] - 0). \
    withColumn("ROA", data["Reason for absence"] - 0). \
    withColumn("distance", data["Distance from Residence to Work"] - 0). \
    withColumn("BMI", data["Body mass index"] - 0)

# Note: "label" (Height) is also one of the two input features here.
assem = VectorAssembler(inputCols=["label", "distance"], outputCol='features')

data3 = assem.transform(data3)

labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data3)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data3)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data3.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

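# Note: y_true and y_pred are read from the BMI and ROA columns of data3,
# not from the model's predictions, so the scikit-learn metrics below
# compare those two columns.
y_true = data3.select("BMI").rdd.flatMap(lambda x: x).collect()
y_pred = data3.select("ROA").rdd.flatMap(lambda x: x).collect()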

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

accuracy = evaluator.evaluate(predictions)

confusionmatrix = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred, average='micro')

recall = recall_score(y_true, y_pred, average='micro')

rfModel = model.stages[2]
print(rfModel)  # summary only
print("Random Forest - Test Accuracy = %g" % (accuracy))
print("Random Forest - Test Error = %g" % (1.0 - accuracy))

print("The Confusion Matrix for Random Forest Model is :\n" + str(confusionmatrix))

print("The precision score for Random Forest Model is: " + str(precision))

print("The recall score for Random Forest Model is: " + str(recall))

+--------------+-----+------------+
|predictedLabel|label|    features|
+--------------+-----+------------+
|           172|  172|[172.0,11.0]|
|           172|  172|[172.0,11.0]|
|           172|  172|[172.0,11.0]|
|           172|  172|[172.0,11.0]|
|           172|  172|[172.0,11.0]|
+--------------+-----+------------+
only showing top 5 rows

RandomForestClassificationModel: uid=RandomForestClassifier_54fba69b98d7, numTrees=10, numClasses=14, numFeatures=2
Random Forest - Test Accuracy = 0.967442
Random Forest - Test Error = 0.0325581
The Confusion Matrix for Random Forest Model is :
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [2 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [5 0 0 ... 0 0 0]]
The precision score for Random Forest Model is: 0.02972972972972973
The recall score for Random Forest Model is: 0.02972972972972973
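The Decision Tree and Random Forest cells above call `randomSplit([0.7, 0.3])` without a seed, whereas the Naive Bayes cell passes `1000` as the seed, so the tree-based accuracies can vary from run to run. The plain-Python analogy below does not use Spark, and `split_rows` is a hypothetical helper; it only sketches why fixing a seed makes a random split repeatable.

```python
import random


def split_rows(rows, train_fraction, seed):
    """Assign each row to train or test using a seeded random generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(20))

# Same seed -> identical split on every call.
first = split_rows(rows, 0.7, seed=1000)
second = split_rows(rows, 0.7, seed=1000)
print(first == second)  # True
```

In PySpark, the corresponding step is to pass a seed as the second argument of `randomSplit`, as the Naive Bayes cell does.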


## **Running the algorithms on another dataset**
Use the **Iris** flower dataset to re-run the models implemented above

Read the dataset from GitHub and create a Spark DataFrame

In [None]:
testdata = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv')
testdata = spark.createDataFrame(testdata)

In [None]:
testdata.show()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|        3.0|         1.4|        0.1| setosa|
|         4.3|        3.0|         1.1| 

### Decision Tree

In [None]:
testdata1 = testdata.withColumn("SL", testdata["sepal_length"] - 0). \
    withColumn("SW", testdata["sepal_width"] - 0). \
    withColumn("PL", testdata["petal_length"] - 0). \
    withColumn("PW", testdata["petal_width"] - 0)

assem = VectorAssembler(inputCols=["SL", "SW", "PL", "PW"], outputCol='features')
testdata1 = assem.transform(testdata1)

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="species", outputCol="indexedLabel").fit(testdata1)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(testdata1)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = testdata1.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

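# Make predictions.
# Note: this scores testdata1 (the full dataset, training rows included),
# not only the held-out testData.
predictions = model.transform(testdata1)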

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

accuracy = evaluator.evaluate(predictions)

y_true = predictions.select("indexedLabel").rdd.flatMap(lambda x: x).collect()
y_pred = predictions.select("prediction").rdd.flatMap(lambda x: x).collect()

confusionmatrix = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred, average='micro')

recall = recall_score(y_true, y_pred, average='micro')

treeModel = model.stages[2]
print(treeModel)
print("Decision Tree - Test Accuracy = %g" % (accuracy))
print("Decision Tree - Test Error = %g" % (1.0 - accuracy))

print("The Confusion Matrix for Decision Tree Model is :\n" + str(confusionmatrix))

print("The precision score for Decision Tree Model is: " + str(precision))

print("The recall score for Decision Tree Model is: " + str(recall))

+----------+------------+-----------------+
|prediction|indexedLabel|         features|
+----------+------------+-----------------+
|       0.0|         0.0|[5.1,3.5,1.4,0.2]|
|       0.0|         0.0|[4.9,3.0,1.4,0.2]|
|       0.0|         0.0|[4.7,3.2,1.3,0.2]|
|       0.0|         0.0|[4.6,3.1,1.5,0.2]|
|       0.0|         0.0|[5.0,3.6,1.4,0.2]|
+----------+------------+-----------------+
only showing top 5 rows

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_b2a64f065afe, depth=4, numNodes=11, numClasses=3, numFeatures=4
Decision Tree - Test Accuracy = 0.98
Decision Tree - Test Error = 0.02
The Confusion Matrix for Decision Tree Model is :
[[50  0  0]
 [ 0 48  2]
 [ 0  1 49]]
The precision score for Decision Tree Model is: 0.98
The recall score for Decision Tree Model is: 0.98
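Accuracy, micro-averaged precision and micro-averaged recall are identical in the output above. For single-label multiclass predictions this is expected: micro-averaging pools all classes, so both scores reduce to correct predictions divided by total predictions. The sketch below recomputes that value from the confusion matrix printed above; `micro_score` is a hypothetical helper name.

```python
def micro_score(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrix copied from the Decision Tree output above.
iris_dt = [
    [50, 0, 0],
    [0, 48, 2],
    [0, 1, 49],
]

print(micro_score(iris_dt))  # 147 / 150 = 0.98
```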


### Naive Bayes

In [None]:
testdata2 = testdata.withColumn("SL", testdata["sepal_length"] - 0). \
    withColumn("SW", testdata["sepal_width"] - 0). \
    withColumn("PL", testdata["petal_length"] - 0). \
    withColumn("PW", testdata["petal_width"] - 0)

assem = VectorAssembler(inputCols=["SL", "SW", "PL", "PW"], outputCol='features')
testdata2 = assem.transform(testdata2)

labelIndexer = StringIndexer(inputCol="species", outputCol="label").fit(testdata2)
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=3).fit(testdata2)

# Split the data into train and test
splits = testdata2.randomSplit([0.7, 0.3], 1000)
train = splits[0]
test = splits[1]

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

pipeline = Pipeline(stages=[labelIndexer, featureIndexer, nb])

model = pipeline.fit(train)

# make predictions on the test set
predictions = model.transform(test)

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")

y_true = predictions.select("label").rdd.flatMap(lambda x: x).collect()
y_pred = predictions.select("prediction").rdd.flatMap(lambda x: x).collect()


accuracy = evaluator.evaluate(predictions)

confusionmatrix = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred, average='micro')

recall = recall_score(y_true, y_pred, average='micro')


print("Naive Bayes - Test set accuracy = " + str(accuracy))

print("The Confusion Matrix for Naive Bayes Model is :\n" + str(confusionmatrix))

print("The precision score for Naive Bayes Model is: " + str(precision))

print("The recall score for Naive Bayes Model is: " + str(recall))

Naive Bayes - Test set accuracy = 0.8409090909090909
The Confusion Matrix for Naive Bayes Model is :
[[15  0  0]
 [ 0 13  0]
 [ 0  7  9]]
The precision score for Naive Bayes Model is: 0.8409090909090909
The recall score for Naive Bayes Model is: 0.8409090909090909
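Micro-averaged scores hide per-class differences. As a sketch, per-class precision and recall can be read off the Naive Bayes confusion matrix printed above, where rows are true classes and columns are predicted classes (the scikit-learn convention). The helper name `per_class` is an assumption for illustration.

```python
def per_class(matrix):
    """Return (precision, recall) per class from a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    scores = []
    for k in range(len(matrix)):
        true_positive = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum
        actual_k = sum(matrix[k])                    # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        scores.append((precision, recall))
    return scores


# Confusion matrix copied from the Naive Bayes output above.
iris_nb = [
    [15, 0, 0],
    [0, 13, 0],
    [0, 7, 9],
]

for k, (precision, recall) in enumerate(per_class(iris_nb)):
    print(f"class {k}: precision={precision:.4f} recall={recall:.4f}")
```

Here the third class has recall 9/16 = 0.5625 because seven of its rows were predicted as the second class, which in turn lowers the second class's precision to 13/20 = 0.65.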


### Random Forest

In [None]:
testdata3 = testdata.withColumn("SL", testdata["sepal_length"] - 0). \
    withColumn("SW", testdata["sepal_width"] - 0). \
    withColumn("PL", testdata["petal_length"] - 0). \
    withColumn("PW", testdata["petal_width"] - 0)

assem = VectorAssembler(inputCols=["SL", "SW", "PL", "PW"], outputCol='features')
testdata3 = assem.transform(testdata3)

labelIndexer = StringIndexer(inputCol="species", outputCol="indexedLabel").fit(testdata3)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 3 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=3).fit(testdata3)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = testdata3.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

accuracy = evaluator.evaluate(predictions)

y_true = testData.select("species").rdd.flatMap(lambda x: x).collect()
y_pred = predictions.select("predictedLabel").rdd.flatMap(lambda x: x).collect()

confusionmatrix = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred, average='micro')

recall = recall_score(y_true, y_pred, average='micro')

rfModel = model.stages[2]
print(rfModel)  # summary only
print("Random Forest - Test Accuracy = %g" % (accuracy))
print("Random Forest - Test Error = %g" % (1.0 - accuracy))

print("The Confusion Matrix for Random Forest Model is :\n" + str(confusionmatrix))

print("The precision score for Random Forest Model is: " + str(precision))

print("The recall score for Random Forest Model is: " + str(recall))

RandomForestClassificationModel: uid=RandomForestClassifier_9bfc3860d025, numTrees=10, numClasses=3, numFeatures=4
Random Forest - Test Accuracy = 0.947368
Random Forest - Test Error = 0.0526316
The Confusion Matrix for Random Forest Model is :
[[12  0  0]
 [ 0 11  1]
 [ 0  1 13]]
The precision score for Random Forest Model is: 0.9473684210526315
The recall score for Random Forest Model is: 0.9473684210526315


In [None]:
predictions.show()

+------------+-----------+------------+-----------+----------+---+---+---+---+-----------------+------------+-----------------+--------------+--------------------+----------+--------------+
|sepal_length|sepal_width|petal_length|petal_width|   species| SL| SW| PL| PW|         features|indexedLabel|  indexedFeatures| rawPrediction|         probability|prediction|predictedLabel|
+------------+-----------+------------+-----------+----------+---+---+---+---+-----------------+------------+-----------------+--------------+--------------------+----------+--------------+
|         4.4|        3.2|         1.3|        0.2|    setosa|4.4|3.2|1.3|0.2|[4.4,3.2,1.3,0.2]|         0.0|[4.4,3.2,1.3,0.2]|[10.0,0.0,0.0]|       [1.0,0.0,0.0]|       0.0|        setosa|
|         4.6|        3.2|         1.4|        0.2|    setosa|4.6|3.2|1.4|0.2|[4.6,3.2,1.4,0.2]|         0.0|[4.6,3.2,1.4,0.2]|[10.0,0.0,0.0]|       [1.0,0.0,0.0]|       0.0|        setosa|
|         4.8|        3.0|         1.4|        0.1