#### Formatting Models According to Your Use Case

* In the case of most classification and regression algorithms, you want to get your data into a column of type Double to represent the label and a column of type Vector (either dense or sparse) to represent the features.
* In the case of recommendation, you want to get your data into a column of users, a column of items (say movies or books), and a column of ratings.
* In the case of unsupervised learning, a column of type Vector (either dense or sparse) is needed to represent the features.
* In the case of graph analytics, you will want a DataFrame of vertices and a DataFrame of edges.

### Preprocessing and Feature Engineering

#### 1. Transformers vs. Estimators
* Transformer == transform(): a single pass over the data
* Estimator == fit() followed by transform() (fit_transform() in scikit-learn): two passes

**fit(), transform() and fit_transform()**

In [0]:
import numpy as np
from sklearn.impute import SimpleImputer

In [0]:
X = [[1, 3], 
     [np.nan, 2], 
     [8, 5.5]]

In [0]:
imp = SimpleImputer()  # defaults to strategy='mean'; strategy='most_frequent' is also possible
# calculating the means
imp.fit(X)

SimpleImputer()


The imputer has now learned a mean of (1 + 8) / 2 = 4.5 for the first column and a mean of (3 + 2 + 5.5) / 3 = 3.5 for the second column. These means are used whenever it is applied to two-column data:

In [0]:
Y = [[np.nan, 11], 
     [4,      np.nan], 
     [8,      2],
     [np.nan, 1]]
imp.transform(Y) # or imp.fit(X).transform(Y)

Out[156]: array([[ 4.5, 11. ],
       [ 4. ,  3.5],
       [ 8. ,  2. ],
       [ 4.5,  1. ]])

So fit() computes the column means from one dataset, and transform() applies them to another dataset (it simply replaces the missing values with those means). If the two datasets are the same (that is, the data used to compute the means is also the data the means are applied to), you can use fit_transform(), which is essentially a fit() followed by a transform().

In [0]:
imp.fit_transform(X)  # two passes: one for fit, one for transform


Out[157]: array([[1. , 3. ],
       [4.5, 2. ],
       [8. , 5.5]])
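The split between learning and applying can be sketched without any library. This is a minimal illustration, not scikit-learn's or Spark's implementation: `MeanImputer` and `MeanImputerModel` are hypothetical names, chosen to mirror the Spark convention where an estimator's fit() returns a fitted model that is itself a transformer.

```python
from math import isnan


class MeanImputer:
    """Estimator: fit() scans the data once and returns a fitted model."""

    def fit(self, rows):
        means = []
        for j in range(len(rows[0])):
            seen = [r[j] for r in rows if not isnan(r[j])]
            means.append(sum(seen) / len(seen))
        return MeanImputerModel(means)


class MeanImputerModel:
    """Transformer: transform() only applies what fit() learned."""

    def __init__(self, means):
        self.means = means

    def transform(self, rows):
        return [[self.means[j] if isnan(v) else v for j, v in enumerate(r)]
                for r in rows]


nan = float("nan")
X = [[1, 3], [nan, 2], [8, 5.5]]
Y = [[nan, 11], [4, nan], [8, 2], [nan, 1]]

model = MeanImputer().fit(X)   # pass 1: learn the column means from X
print(model.means)             # [4.5, 3.5]
print(model.transform(Y))      # pass 2: fill the gaps in Y with those means
```

The model keeps no reference to X, so it can be applied to any data with the same two columns.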

In [0]:
fakeIntDF = spark.read.parquet("/databricks-datasets/definitive-guide/data/simple-ml-integers")
simpleDF = spark.read.json("/databricks-datasets/definitive-guide/data/simple-ml")
scaleDF = spark.read.parquet("/databricks-datasets/definitive-guide/data/simple-ml-scaling")

In [0]:
scaleDF.show(5)
simpleDF.show(5)
fakeIntDF.show(5)

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|green|good|    15| 38.97187133755819|
|green|good|    12|14.386294994851129|
+-----+----+------+------------------+
only showing top 5 rows



#### 2. Working with Continuous Features
There are two common kinds of transformers for continuous features:
* bucketing: converts continuous features into categorical features
* scaling and normalization

In [0]:
# contDF = spark.range(20).selectExpr("cast(id as double)")  # casting to double is optional; integer ids work too

contDF = spark.range(20) 
contDF.show()

+----+
|  id|
+----+
| 0.0|
| 1.0|
| 2.0|
| 3.0|
| 4.0|
| 5.0|
| 6.0|
| 7.0|
| 8.0|
| 9.0|
|10.0|
|11.0|
|12.0|
|13.0|
|14.0|
|15.0|
|16.0|
|17.0|
|18.0|
|19.0|
+----+



##### 2.1. Bucketing: a good example for distinguishing a transformer (Bucketizer) from an estimator (QuantileDiscretizer)

The most straightforward approach to bucketing (or binning) is the Bucketizer. <br/>
The values passed into splits must satisfy three requirements:
* The minimum value in your splits array must be no greater than the minimum value in your DataFrame.
* The maximum value in your splits array must be no less than the maximum value in your DataFrame.
* You need to specify at least **three values** in the splits array, which creates two buckets.
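The bucket lookup that Bucketizer performs can be sketched in plain Python. This is an illustration, not Spark's implementation: the `bucketize` helper is hypothetical, and it assumes the documented behaviour that each bucket is a half-open interval [lower, upper), except the last one, which also includes its upper bound.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket that contains value."""
    if len(splits) < 3:
        raise ValueError("need at least three split points (two buckets)")
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} lies outside the splits")
    if value == splits[-1]:
        # the last bucket also includes its upper bound
        return float(len(splits) - 2)
    return float(bisect_right(splits, value) - 1)


splits = [-1, 5, 10, 250, 600]
print([bucketize(v, splits) for v in (0, 4, 5, 9, 10, 19)])
# [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
```

Nothing is learned from the data here: the splits are supplied by hand, which is why a single pass (transform only) is enough.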

In [0]:
from pyspark.ml.feature import Bucketizer

In [0]:
# bucketBorders = [-1.0, 5.0, 10.0, 250.0, 600.0]  # floats are not required; integers work too
bucketBorders = [-1, 5, 10, 250, 600]

bucketer = Bucketizer().setSplits(bucketBorders).setInputCol("id").setOutputCol("class")
bucketer.transform(contDF).show()

# Question to ask: how many passes over the data? Estimator or transformer?

+----+-------------------------------+
|  id|Bucketizer_bf495df7fd7f__output|
+----+-------------------------------+
| 0.0|                            0.0|
| 1.0|                            0.0|
| 2.0|                            0.0|
| 3.0|                            0.0|
| 4.0|                            0.0|
| 5.0|                            1.0|
| 6.0|                            1.0|
| 7.0|                            1.0|
| 8.0|                            1.0|
| 9.0|                            1.0|
|10.0|                            2.0|
|11.0|                            2.0|
|12.0|                            2.0|
|13.0|                            2.0|
|14.0|                            2.0|
|15.0|                            2.0|
|16.0|                            2.0|
|17.0|                            2.0|
|18.0|                            2.0|
|19.0|                            2.0|
+----+-------------------------------+



In [0]:
from pyspark.ml.feature import QuantileDiscretizer
bucketer = QuantileDiscretizer().setNumBuckets(5).setInputCol("id").setOutputCol("class")

# Question to ask: how many passes over the data? Estimator or transformer?
bucketer.fit(contDF).transform(contDF).show()

+----+----------------------------------------+
|  id|QuantileDiscretizer_d65943e4b48f__output|
+----+----------------------------------------+
| 0.0|                                     0.0|
| 1.0|                                     0.0|
| 2.0|                                     0.0|
| 3.0|                                     1.0|
| 4.0|                                     1.0|
| 5.0|                                     1.0|
| 6.0|                                     1.0|
| 7.0|                                     2.0|
| 8.0|                                     2.0|
| 9.0|                                     2.0|
|10.0|                                     2.0|
|11.0|                                     3.0|
|12.0|                                     3.0|
|13.0|                                     3.0|
|14.0|                                     3.0|
|15.0|                                     4.0|
|16.0|                                     4.0|
|17.0|                                  

More advanced techniques, such as locality-sensitive hashing (LSH), are also available in MLlib.
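Coming back to the QuantileDiscretizer cell: it calls fit() before transform() because the split points are not supplied by hand and have to be learned from the data. Here is a minimal sketch of that idea in plain Python. It assumes exact quantiles from the standard library, whereas Spark uses an approximate quantile algorithm, so the boundaries computed here need not match Spark's. `QuantileBucketer` and `QuantileBucketerModel` are hypothetical names.

```python
from bisect import bisect_right
from statistics import quantiles


class QuantileBucketer:
    """Estimator: learns the cut points from the data."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets

    def fit(self, values):
        # pass 1: n buckets need n - 1 interior cut points
        return QuantileBucketerModel(quantiles(values, n=self.num_buckets))


class QuantileBucketerModel:
    """Transformer: assigns each value to a bucket using the learned cuts."""

    def __init__(self, cuts):
        self.cuts = cuts

    def transform(self, values):
        # pass 2: look up the bucket of every value
        return [float(bisect_right(self.cuts, v)) for v in values]


ids = list(range(20))
model = QuantileBucketer(5).fit(ids)
buckets = model.transform(ids)
print(model.cuts)
print(buckets)
```

With 20 evenly spaced ids and 5 buckets, each bucket receives four values.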

##### 2.2 StandardScaler

In [0]:
# Using scaleDF
scaleDF.show()

from pyspark.ml.feature import StandardScaler
scale = StandardScaler().setInputCol('features').setOutputCol('features_scaled')
scale.fit(scaleDF).transform(scaleDF).show(5, False)

+---+--------------+------------------------------------------------------------+
|id |features      |StandardScaler_3efd6830aee9__output                         |
+---+--------------+------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
|0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
|1  |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902]  |
+---+--------------+------------------------------------------------------------+
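The numbers in this output can be reproduced by hand. A minimal sketch, assuming StandardScaler's default settings (divide by the sample standard deviation, no mean centering); the five rows are copied from the scaleDF output:

```python
from statistics import stdev

rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]

# fit: one pass to learn the sample standard deviation of each column
stds = [stdev(col) for col in zip(*rows)]

# transform: divide every value by its column's standard deviation
# (assumption: no centering, i.e. the mean is not subtracted)
scaled = [[v / s for v, s in zip(row, stds)] for row in rows]

for row in scaled:
    print(row)
```

The first printed row should agree with the first row of the table (about 1.1952, 0.0234, -0.5976) up to floating-point rounding.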



##### MinMaxScaler (Optional)

In [0]:
from pyspark.ml.feature import MinMaxScaler
minMax = MinMaxScaler().setMin(5).setMax(10).setInputCol('features')
fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).display()
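What MinMaxScaler computes with setMin(5) and setMax(10) can be sketched as a per-column linear rescaling. This is a plain-Python illustration on the five scaleDF rows, not Spark's code, and it assumes no column is constant (a constant column would divide by zero here).

```python
rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]
lo, hi = 5.0, 10.0  # target range, as in setMin(5).setMax(10)

# fit: learn the minimum and maximum of each column
cols = list(zip(*rows))
mins = [min(c) for c in cols]
maxs = [max(c) for c in cols]

# transform: map [col_min, col_max] linearly onto [lo, hi]
rescaled = [
    [(v - mn) / (mx - mn) * (hi - lo) + lo for v, mn, mx in zip(row, mins, maxs)]
    for row in rows
]

for row in rescaled:
    print(row)
```

Each column's smallest value lands on 5.0 and its largest on 10.0.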



##### MaxAbsScaler (Optional)

In [0]:
from pyspark.ml.feature import MaxAbsScaler



In [0]:
maScaler = MaxAbsScaler().setInputCol('features')
fittedmaScaler= maScaler.fit(scaleDF)
fittedmaScaler.transform(scaleDF).display()



##### ElementwiseProduct (Optional)

In [0]:
from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors
scaleUpVec = Vectors.dense(10.0, 15.0, 20.0)



In [0]:
scalingUp = ElementwiseProduct().setScalingVec(scaleUpVec).setInputCol('features')
scalingUp.transform(scaleDF).show()




##### Normalizer (Optional)
For any 1 <= p < float('inf'), normalizes each sample using sum(abs(vector)^p)^(1/p) as the norm. <br/>
Normalization is done in L^p^ space, with p = 2 by default.

In [0]:
from pyspark.ml.feature import Normalizer
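The p-norm normalization can be sketched on a single vector as follows. `p_normalize` is a hypothetical helper, not part of pyspark; the p = infinity branch (maximum absolute value) and the handling of an all-zero vector are assumptions added for completeness.

```python
def p_normalize(vector, p=2.0):
    """Divide a vector by its p-norm so that the result has p-norm 1."""
    if p == float("inf"):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)  # assumption: leave an all-zero vector untouched
    return [x / norm for x in vector]


print(p_normalize([3.0, 4.0]))         # p = 2: the norm is 5
print(p_normalize([3.0, 4.0], p=1.0))  # p = 1: the norm is 7
```

Unlike the scalers, this works row by row and needs no statistics from the whole dataset, so no fit() step is involved.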



#### 3. Working with Categorical Features

##### 3.1. StringIndexer

In [0]:
simpleDF.show()

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|green|good|    15| 38.97187133755819|
|green|good|    12|14.386294994851129|
|green| bad|    16|14.386294994851129|
|  red|good|    35|14.386294994851129|
|  red| bad|     1| 38.97187133755819|
|  red| bad|     2|14.386294994851129|
|  red| bad|    16|14.386294994851129|
|  red|good|    45| 38.97187133755819|
|green|good|     1|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|green|good|    15| 38.97187133755819|
|green|good|    12|14.386294994851129|
|green| bad|    16|14.386294994851129|
|  red|good|    35|14.386294994851129|
|  red| bad|     1| 38.97187133755819|
|  red| bad|     2|14.386294994851129|
+-----+----+------+------------------+
only showing top 20 rows



In [0]:
# Using simpleDF
simpleDF.show(5)

from pyspark.ml.feature import StringIndexer  # maps the string label column lab to numeric indices

strIndexer = StringIndexer().setInputCol('lab').setOutputCol('lab_Ind')

strIndexer.fit(simpleDF).transform(simpleDF).show()

+-----+----+------+------------------+-------+
|color| lab|value1|            value2|lab_Ind|
+-----+----+------+------------------+-------+
|green|good|     1|14.386294994851129|    1.0|
| blue| bad|     8|14.386294994851129|    0.0|
| blue| bad|    12|14.386294994851129|    0.0|
|green|good|    15| 38.97187133755819|    1.0|
|green|good|    12|14.386294994851129|    1.0|
|green| bad|    16|14.386294994851129|    0.0|
|  red|good|    35|14.386294994851129|    1.0|
|  red| bad|     1| 38.97187133755819|    0.0|
|  red| bad|     2|14.386294994851129|    0.0|
|  red| bad|    16|14.386294994851129|    0.0|
|  red|good|    45| 38.97187133755819|    1.0|
|green|good|     1|14.386294994851129|    1.0|
| blue| bad|     8|14.386294994851129|    0.0|
| blue| bad|    12|14.386294994851129|    0.0|
|green|good|    15| 38.97187133755819|    1.0|
|green|good|    12|14.386294994851129|    1.0|
|green| bad|    16|14.386294994851129|    0.0|
|  red|good|    35|14.386294994851129|    1.0|
|  red| bad| 
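Why does "bad" get index 0.0 and "good" index 1.0? A minimal sketch, assuming StringIndexer's default ordering by descending frequency (the most frequent label gets index 0). `fit_string_indexer` is a hypothetical helper, and breaking ties alphabetically is an assumption of this sketch. The labels are the first eleven lab values from the simpleDF output.

```python
from collections import Counter


def fit_string_indexer(values):
    """Learn a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


labs = ["good", "bad", "bad", "good", "good", "bad",
        "good", "bad", "bad", "bad", "good"]

mapping = fit_string_indexer(labs)   # fit: one pass to count the labels
indexed = [mapping[v] for v in labs]  # transform: look up each label
print(mapping)  # {'bad': 0.0, 'good': 1.0}
```

Counting the labels requires a pass over the data, which is why StringIndexer is an estimator.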

##### Indexing in Vectors (Optional)

VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:

- Take an input column of type Vector and a parameter maxCategories.
- Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories are declared categorical.
- Compute 0-based category indices for each categorical feature.
- Index categorical features and transform original feature values to indices.

Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.

In [0]:
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors

idxIn = spark.createDataFrame([
(Vectors.dense(1, 2, 3),1),
(Vectors.dense(2, 5, 6),2),
(Vectors.dense(1, 8, 9),3)
]).toDF("features", "label")


idxIn.show()

indxr = VectorIndexer().setInputCol('features').setOutputCol('idxed').setMaxCategories(3)  # features with at most 3 distinct values are treated as categorical; here all three features qualify
indxr.fit(idxIn).transform(idxIn).show()

+-------------+-----+-------------+
|     features|label|        idxed|
+-------------+-----+-------------+
|[1.0,2.0,3.0]|    1|[0.0,0.0,0.0]|
|[2.0,5.0,6.0]|    2|[1.0,1.0,1.0]|
|[1.0,8.0,9.0]|    3|[0.0,2.0,2.0]|
+-------------+-----+-------------+



##### 3.2. One-Hot Encoding
**For string type input data, it is common to encode categorical features using StringIndexer first**.

In [0]:
# Important: OneHotEncoder expects numeric category indices, so string columns must be indexed first.

# That is why applying it directly to the string column does not work:
# ohe = OneHotEncoder().setInputCol('color').setOutputCol('ohe_color')
# ohe.fit(simpleDF).transform(simpleDF).show()


from pyspark.ml.feature import StringIndexer
str_idx = StringIndexer().setInputCol("color").setOutputCol("color_idx")
simpleDF_idx = str_idx.fit(simpleDF).transform(simpleDF)

from pyspark.ml.feature import OneHotEncoder
ohe = OneHotEncoder().setInputCol('color_idx').setOutputCol('ohe_color')
ohe.fit(simpleDF_idx).transform(simpleDF_idx).show()


+-----+----+------+------------------+--------+-------------+
|color| lab|value1|            value2|colorInd|    color_ohe|
+-----+----+------+------------------+--------+-------------+
|green|good|     1|14.386294994851129|     1.0|(2,[1],[1.0])|
| blue| bad|     8|14.386294994851129|     2.0|    (2,[],[])|
| blue| bad|    12|14.386294994851129|     2.0|    (2,[],[])|
|green|good|    15| 38.97187133755819|     1.0|(2,[1],[1.0])|
|green|good|    12|14.386294994851129|     1.0|(2,[1],[1.0])|
|green| bad|    16|14.386294994851129|     1.0|(2,[1],[1.0])|
|  red|good|    35|14.386294994851129|     0.0|(2,[0],[1.0])|
|  red| bad|     1| 38.97187133755819|     0.0|(2,[0],[1.0])|
|  red| bad|     2|14.386294994851129|     0.0|(2,[0],[1.0])|
|  red| bad|    16|14.386294994851129|     0.0|(2,[0],[1.0])|
|  red|good|    45| 38.97187133755819|     0.0|(2,[0],[1.0])|
|green|good|     1|14.386294994851129|     1.0|(2,[1],[1.0])|
| blue| bad|     8|14.386294994851129|     2.0|    (2,[],[])|
| blue| 

"color_ohe" is represented in sparse format, in which the zeros of a vector are not printed.
- The first value (2) is the length of the vector.
- The second value is an array listing the indices where non-zero entries are found (it may be empty).
- The third value is an array holding the values found at those indices.
- So (2,[0],[1.0]) means a vector of length 2 with 1.0 at position 0 and 0 elsewhere (10).
- (2,[1],[1.0]) has 1.0 at position 1 (01).
- (2,[],[]) means '00'.

The category indices therefore map to:
- 0  -> 10
- 1  -> 01
- 2  -> 00
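The sparse notation and the mapping can be checked with a small sketch. Both helpers are hypothetical and written in plain Python; the encoding assumes OneHotEncoder's default of dropping the last category, which is why three colors need only two positions.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def one_hot_drop_last(index, num_categories):
    """One-hot encode a category index, dropping the last category."""
    size = num_categories - 1
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    return (size, [], [])  # the last category is the all-zero vector


for idx in (0.0, 1.0, 2.0):
    sparse = one_hot_drop_last(idx, num_categories=3)
    print(idx, sparse, to_dense(*sparse))
```

Dropping one category loses no information, since the all-zero vector identifies it unambiguously.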

#### 4. First Pipeline

In [0]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
    StringIndexer(inputCol='lab', outputCol='lab_ind'),
    StringIndexer(inputCol='color', outputCol='color_idx'),
    OneHotEncoder(inputCol='color_idx', outputCol='ohe_color'),
#     IDF(inputCol='a_tf', outputCol='a_idf'),
])

pipeline.fit(simpleDF).transform(simpleDF).show(5)

If we now want to concatenate all numerical features into one big vector to pass to an ML algorithm, which columns should we choose?

#### 5. VectorAssembler
* **concatenates all your features into one big vector** that you can then pass into an estimator
* takes as input any number of columns of type Boolean, Double, or Vector
* typically the last feature-preparation step of a machine learning pipeline
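Conceptually, the assembly is a flattening concatenation applied to each row. A minimal plain-Python sketch; the `assemble` helper is hypothetical, and the example row uses illustrative values for a "green"/"good" record with already indexed and one-hot encoded columns:

```python
def assemble(*columns):
    """Concatenate scalar and vector column values into one flat vector."""
    features = []
    for value in columns:
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)  # vector column
        else:
            features.append(float(value))  # numeric or boolean column
    return features


row = {
    "value1": 1,
    "value2": 14.386294994851129,
    "lab_ind": 1.0,
    "ohe_color": [0.0, 1.0],
}
input_cols = ["value1", "value2", "lab_ind", "ohe_color"]
print(assemble(*(row[c] for c in input_cols)))
# [1.0, 14.386294994851129, 1.0, 0.0, 1.0]
```

The order of the input columns determines the order of the entries in the resulting vector.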

In [0]:
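# VectorAssembler: applied to the output of the pipeline fitted above

from pyspark.ml.feature import VectorAssembler
pipeline_df = pipeline.fit(simpleDF).transform(simpleDF)  # adds lab_ind, color_idx and ohe_color
va = VectorAssembler().setInputCols(["value1", "value2", "lab_ind", "ohe_color"]).setOutputCol("features")
va.transform(pipeline_df).show(5, False)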


In [0]:
# we can also update our pipeline by adding the VectorAssembler()
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
    StringIndexer(inputCol='lab', outputCol='lab_ind'),
    StringIndexer(inputCol='color', outputCol='color_idx'),
    OneHotEncoder(inputCol='color_idx', outputCol='ohe_color'),
    VectorAssembler().setInputCols(["value1", "value2", "lab_ind", "ohe_color"]).setOutputCol("features")
])

pipeline.fit(simpleDF).transform(simpleDF).show(5, False)

#### 6. Feature Manipulation

##### 6.1. PCA

In [0]:
scaleDF.show()

+---+--------------+
| id|      features|
+---+--------------+
|  0|[1.0,0.1,-1.0]|
|  1| [2.0,1.1,1.0]|
|  0|[1.0,0.1,-1.0]|
|  1| [2.0,1.1,1.0]|
|  1|[3.0,10.1,3.0]|
+---+--------------+



In [0]:
from pyspark.ml.feature import PCA
pca = PCA().setInputCol("features").setK(2)
pca.fit(scaleDF).transform(scaleDF).show(5, False)

+---+--------------+------------------------------------------+
|id |features      |PCA_c1b7509baacd__output                  |
+---+--------------+------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|1  |[3.0,10.1,3.0]|[-10.872398139848944,0.030962697060155975]|
+---+--------------+------------------------------------------+



##### 6.2 Interaction & Polynomial Expansion
[Question to ask: transformer or estimator?]

Polynomial(X1, X2, X3), degree 2 =>
$X_1^2 + X_2^2 + X_3^2 + X_1 X_2 + X_2 X_3 + X_1 X_3 + X_1 + X_2 + X_3$

In [0]:
scaleDF.show()

+---+--------------+
| id|      features|
+---+--------------+
|  0|[1.0,0.1,-1.0]|
|  1| [2.0,1.1,1.0]|
|  0|[1.0,0.1,-1.0]|
|  1| [2.0,1.1,1.0]|
|  1|[3.0,10.1,3.0]|
+---+--------------+



In [0]:
from pyspark.ml.feature import PolynomialExpansion
pe = PolynomialExpansion().setInputCol('features').setOutputCol('poly').setDegree(2)

# Question to ask: transformer or estimator?
pe.transform(scaleDF).show(5, False)

+---+--------------+-----------------------------------------------------------------------------------+
|id |features      |PolynomialExpansion_00e44401fba3__output                                           |
+---+--------------+-----------------------------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[1.0,1.0,0.1,0.1,0.010000000000000002,-1.0,-1.0,-0.1,1.0]                          |
|1  |[2.0,1.1,1.0] |[2.0,4.0,1.1,2.2,1.2100000000000002,1.0,2.0,1.1,1.0]                               |
|0  |[1.0,0.1,-1.0]|[1.0,1.0,0.1,0.1,0.010000000000000002,-1.0,-1.0,-0.1,1.0]                          |
|1  |[2.0,1.1,1.0] |[2.0,4.0,1.1,2.2,1.2100000000000002,1.0,2.0,1.1,1.0]                               |
|1  |[3.0,10.1,3.0]|[3.0,9.0,10.1,30.299999999999997,102.00999999999999,3.0,9.0,30.299999999999997,9.0]|
+---+--------------+-----------------------------------------------------------------------------------+
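The nine output entries are the degree-1 and degree-2 terms of the three inputs. A minimal sketch of a degree-2 expansion; `expand_degree2` is a hypothetical helper, and the term order was chosen to match the output shown above rather than taken from Spark's source:

```python
def expand_degree2(x):
    """Return all degree-1 and degree-2 monomials of the entries of x."""
    out = []
    for i, xi in enumerate(x):
        out.append(xi)  # linear term x_i
        # products of x_i with x_0 .. x_i (the last one is the square)
        out.extend(x[j] * xi for j in range(i + 1))
    return out


print(expand_degree2([2.0, 1.1, 1.0]))
# order: x1, x1^2, x2, x1*x2, x2^2, x3, x1*x3, x2*x3, x3^2
```

Each output row depends only on its own input row, so nothing has to be learned from the dataset and transform() alone is enough.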




#### 7. High-Level Transformers (Optional)

##### 7.1. RFormula
Spark borrows this transformer from the R language to make it simple to declaratively specify a set of transformations for your data

In [0]:
from pyspark.ml.feature import RFormula
r_formula = RFormula(formula='lab ~.+value1:value2')  # label: lab; features: all other columns (.) plus the value1:value2 interaction
r_formula.fit(simpleDF).transform(simpleDF).show(2, False)

+-----+----+------+------------------+---------------------------------------------------+-----+
|color|lab |value1|value2            |features                                           |label|
+-----+----+------+------------------+---------------------------------------------------+-----+
|green|good|1     |14.386294994851129|[0.0,1.0,1.0,14.386294994851129,14.386294994851129]|1.0  |
|blue |bad |8     |14.386294994851129|[0.0,0.0,8.0,14.386294994851129,115.09035995880903]|0.0  |
+-----+----+------+------------------+---------------------------------------------------+-----+
only showing top 2 rows
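One reading of the features column that is consistent with the two rows shown: the first two entries one-hot encode color (with one category dropped), followed by value1, value2, and the interaction term value1 * value2. The sketch below rebuilds both rows under that assumption. `rformula_features` is a hypothetical helper, and the one-hot vectors are copied from the output rather than derived.

```python
def rformula_features(color_ohe, value1, value2):
    """Lay out features as: one-hot(color), value1, value2, value1 * value2."""
    return list(color_ohe) + [float(value1), float(value2), value1 * value2]


green = rformula_features([0.0, 1.0], 1, 14.386294994851129)
blue = rformula_features([0.0, 0.0], 8, 14.386294994851129)

print(green)
print(blue)
```

The last entry of the blue row, 8 * 14.386..., is the 115.09... seen in the output, which confirms that value1:value2 denotes a product.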



##### 7.2. SQL Transformers
* SQLTransformer lets you express a transformation as a SQL statement. The only thing you need to change is that, instead of the table name, you use the keyword "__THIS__".
* SQLTransformer() is a plain transformer, whereas RFormula() is an estimator.

In [0]:
# from pyspark.ml.feature import SQLTransformer
# basicTransformation = (SQLTransformer()
#   .setStatement("""
#     select sum(Quantity), count(*), CustomerID
#     from __THIS__
#     group by CustomerID
                       
#   """)
# )
# basicTransformation.transform(sales).show(5)

+-------------+--------+----------+
|sum(Quantity)|count(1)|CustomerID|
+-------------+--------+----------+
|          119|      62|   14452.0|
|          440|     143|   16916.0|
|          630|      72|   17633.0|
|           34|       6|   14768.0|
|         1542|      30|   13094.0|
+-------------+--------+----------+
only showing top 5 rows



#### 8. Persisting Transformers

In [0]:
pipeline_fitted = pipeline.fit(simpleDF)
pipeline_fitted.write().overwrite().save("/tmp/fitted_Pipeline")

In [0]:
from pyspark.ml.pipeline import PipelineModel  # note: a fitted pipeline is loaded with PipelineModel, not Pipeline
pipeline_loaded = PipelineModel.load("/tmp/fitted_Pipeline")

pipeline_loaded.transform(simpleDF).show()

In [0]:
import os
os.chdir('/tmp/')
os.listdir()
# os.getcwd()

Out[61]: ['hsperfdata_root',
 'driver-daemon-params',
 'RtmpemfDFX',
 'chauffeur-env.sh',
 '.XIM-unix',
 '.X11-unix',
 'chauffeur-daemon.pid',
 'systemd-private-a625af562f844c789d618b7595a2bd9f-apache2.service-GdrD3f',
 'systemd-private-a625af562f844c789d618b7595a2bd9f-ntp.service-mafYzi',
 '.font-unix',
 'Rserv',
 'systemd-private-a625af562f844c789d618b7595a2bd9f-systemd-resolved.service-EU5seg',
 'systemd-private-a625af562f844c789d618b7595a2bd9f-systemd-logind.service-v7Wo8g',
 '.Test-unix',
 'custom-spark.conf',
 'driver-daemon.pid',
 'tmp.FabbD9rQNA',
 'driver-env.sh',
 '.ICE-unix',
 'chauffeur-daemon-params']

In [0]:
from pyspark.ml.feature import PCA
pca = PCA().setInputCol("features").setK(2)
fittedPCA = pca.fit(scaleDF)  # fittedPCA is reused in the last cell of this section
fittedPCA.write().overwrite().save("/tmp/fittedPCA")

In [0]:
from pyspark.ml.feature import PCAModel
# os.chdir('/tmp/')
loadedPCA = PCAModel.load('/tmp/fittedPCA')
loadedPCA.transform(scaleDF).show(5, False)

+---+--------------+------------------------------------------+
|id |features      |PCAModel_8abce6ab4de1__output             |
+---+--------------+------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|1  |[3.0,10.1,3.0]|[-10.872398139848944,0.030962697060155975]|
+---+--------------+------------------------------------------+



In [0]:
scaleDF.show()
fittedPCA.transform(scaleDF).show(5, False)

+---+--------------+
| id|      features|
+---+--------------+
|  0|[1.0,0.1,-1.0]|
|  1| [2.0,1.1,1.0]|
|  0|[1.0,0.1,-1.0]|
|  1| [2.0,1.1,1.0]|
|  1|[3.0,10.1,3.0]|
+---+--------------+

+---+--------------+------------------------------------------+
|id |features      |PCA_ced553f8e30d__output                  |
+---+--------------+------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|1  |[3.0,10.1,3.0]|[-10.872398139848944,0.030962697060155975]|
+---+--------------+------------------------------------------+

