# ETL and Machine Learning

In this lab I’ll create an Apache Spark Machine learning application as end to end use case from data acquisition, transformation, model training and deployment.

Objectives
After completing reading, you will see and hopefelly be able also to:

- Pull-in data from the HMP dataset <a href="https://github.com/wchill/HMP_Dataset">here</a>
- Create a Spark data frame from the raw data
- Store this to parquet (in Cloud Object Store)
- Read it again (from Cloud Object Store)
- Deploy this model to Train a ML-Model on that data set
- Watson Machine Learning

## 1. Pull-in data from the HMP dataset 

Now it’s time to explore data <a href="https://github.com/wchill/HMP_Dataset">here</a>. You're invited to get familiarize a little bit with it. It's important to understand data so that you can grasp thefollowing step code easily.
Let's pull the data in raw format from the source (github).

In [1]:
# !rm -Rf HMP_Dataset    
# !sudo apt install git  #to uncomment if needed
# !git clone https://github.com/wchill/HMP_Dataset

In [3]:
!pip install pyspark
!pip install findspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
[K     |████████████████████████████████| 281.3 MB 14 kB/s  eta 0:00:01    |███                             | 26.2 MB 4.6 MB/s eta 0:00:56     |███████▎                        | 64.2 MB 6.0 MB/s eta 0:00:37     |█████████████████▏              | 151.0 MB 7.3 MB/s eta 0:00:18     |█████████████████████▍          | 188.2 MB 5.7 MB/s eta 0:00:17     |██████████████████████████▋     | 233.8 MB 7.1 MB/s eta 0:00:07     |███████████████████████████████▎| 274.5 MB 6.1 MB/s eta 0:00:02
[?25hCollecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
[K     |████████████████████████████████| 198 kB 5.5 MB/s eta 0:00:01
[?25hBuilding wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25ldone
[?25h  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=ff6aa6dfb729a8007bfc4d584b0bd125dccee78b79820147a37fdfc3262d0441
  Stored in dir

In [1]:
import findspark
findspark.init()

## 2. Create a Spark data frame from the raw data

Lets create a local spark context (sc) and session (spark)

In [2]:
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("SparkApp_ETL_ML").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

spark = SparkSession.builder.getOrCreate()

In [5]:
spark

let's create a schema first (convert x,y,z columns to integers)

In [3]:
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import IntegerType
schema = StructType([StructField("x", IntegerType(), True),
                     StructField("y", IntegerType(), True),
                     StructField("z", IntegerType(), True)])

This step takes a while, it parses through all files and folders and creates a temporary dataframe for each file which gets appended to an overall data-frame "df". In addition, a column called "class" is added to allow for straightforward usage in Spark afterwards in a supervised machine learning scenario for example.

In [4]:
import os
import fnmatch
import random
from pyspark.sql.functions import lit

sample = os.environ.get('sample', '1.0')
sample = float(sample)

d = 'HMP_Dataset/'

# filter list all folders containing data (folders that don't start with .)
file_list_filtered = [s for s in os.listdir(d)
                      if os.path.isdir(os.path.join(d, s)) & ~fnmatch.fnmatch(s, '.*')]

# create pandas data frame for all the data
df = None

for category in file_list_filtered:
    data_files = os.listdir('HMP_Dataset/' + category)
    for data_file in data_files:
        print(category, ":", data_file)
        temp_df = spark.read. \
            option("header", "false"). \
            option("delimiter", " "). \
            csv('HMP_Dataset/' + category + '/' + data_file, schema=schema)
        
        # create a column called "source" storing the current CSV file
        temp_df = temp_df.withColumn("source", lit(data_file))

        # create a column called "class" storing the current data folder
        temp_df = temp_df.withColumn("class", lit(category))

        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)

Liedown_bed : Accelerometer-2011-04-11-11-52-20-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-05-30-20-52-31-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-06-02-16-58-20-liedown_bed-f4.txt
Liedown_bed : Accelerometer-2011-06-02-17-21-57-liedown_bed-m1.txt
Liedown_bed : Accelerometer-2011-05-30-21-13-15-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-06-02-17-00-38-liedown_bed-f4.txt
Liedown_bed : Accelerometer-2011-05-31-14-56-04-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-04-05-18-27-12-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-05-30-09-32-42-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-05-30-09-41-49-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-04-11-13-33-26-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-05-30-08-37-27-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-05-30-08-25-59-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-05-30-20-59-04-liedown_bed-f1.txt
Liedown_bed : Accelerometer-2011-03-29-09-23-22-liedown_bed-f1

Drink_glass : Accelerometer-2011-06-01-14-28-50-drink_glass-f1.txt
Drink_glass : Accelerometer-2011-06-02-17-30-51-drink_glass-m1.txt
Drink_glass : Accelerometer-2012-05-29-17-20-28-drink_glass-m3.txt
Drink_glass : Accelerometer-2012-05-30-21-48-13-drink_glass-m2.txt
Drink_glass : Accelerometer-2011-06-02-17-23-53-drink_glass-m1.txt
Drink_glass : Accelerometer-2011-04-11-13-17-55-drink_glass-f1.txt
Drink_glass : Accelerometer-2012-05-30-19-51-49-drink_glass-f2.txt
Drink_glass : Accelerometer-2011-05-30-21-06-15-drink_glass-f1.txt
Drink_glass : Accelerometer-2011-03-24-13-31-22-drink_glass-f1.txt
Drink_glass : Accelerometer-2011-06-01-16-44-44-drink_glass-f1.txt
Drink_glass : Accelerometer-2011-06-01-16-45-39-drink_glass-f1.txt
Drink_glass : Accelerometer-2011-03-24-13-09-29-drink_glass-f1.txt
Drink_glass : Accelerometer-2011-05-31-15-12-36-drink_glass-f1.txt
Drink_glass : Accelerometer-2012-03-23-03-47-02-drink_glass-m9.txt
Drink_glass : Accelerometer-2011-04-12-21-44-55-drink_glass-m8

Standup_chair : Accelerometer-2012-03-23-03-48-20-standup_chair-m9.txt
Standup_chair : Accelerometer-2011-04-08-17-32-45-standup_chair-f3.txt
Standup_chair : Accelerometer-2011-05-31-15-13-49-standup_chair-f1.txt
Standup_chair : Accelerometer-2011-04-08-18-09-06-standup_chair-m4.txt
Standup_chair : Accelerometer-2012-05-25-18-33-08-standup_chair-f4.txt
Standup_chair : Accelerometer-2011-04-08-17-34-10-standup_chair-f3.txt
Standup_chair : Accelerometer-2012-03-26-05-04-33-standup_chair-m3.txt
Standup_chair : Accelerometer-2011-06-01-14-50-10-standup_chair-f1.txt
Standup_chair : Accelerometer-2011-04-11-13-24-10-standup_chair-f1.txt
Standup_chair : Accelerometer-2011-05-30-10-31-41-standup_chair-m1.txt
Standup_chair : Accelerometer-2011-05-30-08-24-19-standup_chair-f1.txt
Standup_chair : Accelerometer-2011-12-11-08-23-15-standup_chair-m2.txt
Standup_chair : Accelerometer-2011-12-05-09-45-47-standup_chair-f1.txt
Standup_chair : Accelerometer-2011-03-24-16-09-19-standup_chair-f2.txt
Standu

Walk : Accelerometer-2012-06-06-14-13-56-walk-m7.txt
Walk : Accelerometer-2011-05-30-10-29-28-walk-m1.txt
Walk : Accelerometer-2012-06-11-11-39-29-walk-m1.txt
Walk : Accelerometer-2012-06-07-10-50-09-walk-f1.txt
Walk : Accelerometer-2012-06-06-09-42-05-walk-m6.txt
Walk : Accelerometer-2012-05-29-16-51-03-walk-f2.txt
Walk : Accelerometer-2011-03-29-16-16-01-walk-f1.txt
Walk : Accelerometer-2012-05-29-17-16-24-walk-m3.txt
Walk : Accelerometer-2012-05-29-16-46-02-walk-f2.txt
Walk : Accelerometer-2012-06-06-14-15-57-walk-m7.txt
Walk : Accelerometer-2012-06-06-09-47-07-walk-m6.txt
Walk : Accelerometer-2012-06-06-09-40-33-walk-m6.txt
Walk : Accelerometer-2012-06-06-14-10-33-walk-m7.txt
Walk : Accelerometer-2012-06-06-09-36-06-walk-m6.txt
Walk : Accelerometer-2012-05-30-22-04-20-walk-m2.txt
Walk : Accelerometer-2012-05-30-18-29-02-walk-f3.txt
Walk : Accelerometer-2012-06-11-11-33-42-walk-m1.txt
Walk : Accelerometer-2011-05-31-15-01-05-walk-f1.txt
Walk : Accelerometer-2012-06-11-11-37-01-walk-

Climb_stairs : Accelerometer-2012-06-06-09-38-17-climb_stairs-m6.txt
Climb_stairs : Accelerometer-2012-06-06-14-09-12-climb_stairs-m7.txt
Climb_stairs : Accelerometer-2012-05-30-19-13-50-climb_stairs-m4.txt
Climb_stairs : Accelerometer-2012-06-06-09-38-43-climb_stairs-m6.txt
Climb_stairs : Accelerometer-2012-06-06-14-05-05-climb_stairs-m7.txt
Climb_stairs : Accelerometer-2012-06-06-08-54-29-climb_stairs-m5.txt
Climb_stairs : Accelerometer-2012-05-30-22-07-44-climb_stairs-m2.txt
Climb_stairs : Accelerometer-2012-06-06-09-33-15-climb_stairs-m6.txt
Climb_stairs : Accelerometer-2012-05-29-17-17-51-climb_stairs-m3.txt
Climb_stairs : Accelerometer-2012-05-29-16-53-12-climb_stairs-f2.txt
Climb_stairs : Accelerometer-2012-06-06-09-42-55-climb_stairs-m6.txt
Climb_stairs : Accelerometer-2012-06-06-08-59-53-climb_stairs-m5.txt
Climb_stairs : Accelerometer-2011-05-30-09-43-37-climb_stairs-f1.txt
Climb_stairs : Accelerometer-2012-05-28-17-56-03-climb_stairs-m1.txt
Climb_stairs : Accelerometer-2011-

Pour_water : Accelerometer-2011-06-02-17-45-45-pour_water-m1.txt
Pour_water : Accelerometer-2012-06-09-22-55-42-pour_water-m11.txt
Pour_water : Accelerometer-2012-05-25-18-31-58-pour_water-f4.txt
Pour_water : Accelerometer-2012-06-09-22-54-40-pour_water-m11.txt
Pour_water : Accelerometer-2011-05-30-21-05-39-pour_water-f1.txt
Pour_water : Accelerometer-2011-06-01-16-46-50-pour_water-f1.txt
Pour_water : Accelerometer-2011-03-24-13-30-01-pour_water-f1.txt
Pour_water : Accelerometer-2011-06-01-14-13-29-pour_water-f1.txt
Pour_water : Accelerometer-2011-06-02-17-28-27-pour_water-m1.txt
Pour_water : Accelerometer-2012-06-07-21-34-47-pour_water-f4.txt
Pour_water : Accelerometer-2012-05-30-19-57-40-pour_water-m3.txt
Pour_water : Accelerometer-2011-06-06-09-45-45-pour_water-f1.txt
Pour_water : Accelerometer-2012-03-26-04-55-40-pour_water-f2.txt
Pour_water : Accelerometer-2012-06-07-21-35-50-pour_water-f4.txt
Pour_water : Accelerometer-2012-06-07-21-31-05-pour_water-f4.txt
Pour_water : Accelerome

Sitdown_chair : Accelerometer-2011-12-11-08-23-42-sitdown_chair-m2.txt
Sitdown_chair : Accelerometer-2011-05-30-09-23-12-sitdown_chair-f1.txt
Sitdown_chair : Accelerometer-2011-04-08-17-34-35-sitdown_chair-f3.txt
Sitdown_chair : Accelerometer-2012-03-26-05-04-48-sitdown_chair-m3.txt
Sitdown_chair : Accelerometer-2011-06-01-14-37-35-sitdown_chair-f1.txt
Sitdown_chair : Accelerometer-2012-03-23-03-45-26-sitdown_chair-m9.txt
Sitdown_chair : Accelerometer-2011-12-05-09-50-54-sitdown_chair-f1.txt
Sitdown_chair : Accelerometer-2011-06-02-16-43-56-sitdown_chair-f4.txt
Sitdown_chair : Accelerometer-2011-06-02-16-46-07-sitdown_chair-f4.txt
Sitdown_chair : Accelerometer-2012-03-26-04-54-57-sitdown_chair-f2.txt
Sitdown_chair : Accelerometer-2011-04-08-18-09-40-sitdown_chair-m4.txt
Sitdown_chair : Accelerometer-2011-05-30-10-21-50-sitdown_chair-m1.txt
Sitdown_chair : Accelerometer-2011-12-05-09-53-32-sitdown_chair-f1.txt
Sitdown_chair : Accelerometer-2011-12-11-08-12-14-sitdown_chair-f4.txt
Sitdow

In [5]:
df.show(5)

+---+---+---+--------------------+-----------+
|  x|  y|  z|              source|      class|
+---+---+---+--------------------+-----------+
| 11| 34| 34|Accelerometer-201...|Liedown_bed|
| 11| 34| 34|Accelerometer-201...|Liedown_bed|
| 12| 35| 34|Accelerometer-201...|Liedown_bed|
| 12| 36| 33|Accelerometer-201...|Liedown_bed|
| 11| 35| 34|Accelerometer-201...|Liedown_bed|
+---+---+---+--------------------+-----------+
only showing top 5 rows



In [6]:
df.printSchema()

root
 |-- x: integer (nullable = true)
 |-- y: integer (nullable = true)
 |-- z: integer (nullable = true)
 |-- source: string (nullable = false)
 |-- class: string (nullable = false)



## 3. Store the dataframe to parquet (in Cloud Object Store)

Let's first be sure to have one parquet file. We'll repartition the data frame to one and extract/rename the file inside the folder.

In [7]:
df = df.repartition(1)

In [8]:
df.write.parquet("data_condensed.parquet")

In [10]:
!mv data_condensed.parquet/`ls data_condensed.parquet |grep .parquet` condensed_data.parquet

mv: cannot overwrite non-directory 'condensed_data.parquet' with directory 'data_condensed.parquet/'


In [11]:
!rm -Rf  data_condensed.parquet

## 4. Upload to IBM Cloud Object Storage COS

Uncomment the following command if S3Fs `'a Pythonic file interface to S3'` is not yet installed

In [None]:
# !pip install s3fs==2021.7.0

Let's import `s3fs` and read my credentials to connect to IBM COS (it's a free account), the same thing applyes for AWS S3

In [12]:
import s3fs
import json
with open('COS_credential.json') as f:
    keys=json.load(f)

In [18]:
access_key_id=keys['access_key_id']
secret_access_key=keys['secret_access_key']
endpoint=keys['endpoint']
bucket_name=keys['bucket_name']
source_file='condensed_data.parquet'
destination_file='HMP_data.parquet'

In [14]:
s3 = s3fs.S3FileSystem(
    anon=False,
    key=access_key_id,
    secret=secret_access_key,
    client_kwargs={'endpoint_url': endpoint}
)

In [19]:
s3.put(source_file, bucket_name + '/' + destination_file)

[None]

The command above has succissfuly putten the file in the cloud as shown from the result of the cell bellow and also the picture.
![](cos-objects.png)

In [20]:
s3.ls(bucket_name)

['cloud-object-storage-yy-cos-standard-js4/HMP_data.parquet',
 'cloud-object-storage-yy-cos-standard-js4/data.parquet',
 'cloud-object-storage-yy-cos-standard-js4/jfk_data.parquet']

## 5. Read the data again (from Cloud Object Store)

In [23]:
import pandas as pd
df=pd.read_parquet(endpoint + '/' + bucket_name + '/HMP_data.parquet')

You can uncomment the following cell to read the same parquet file from my ibm storage acount

In [7]:
!pip3 install pandas

Collecting pandas
  Downloading pandas-1.3.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
[K     |████████████████████████████████| 11.5 MB 8.0 MB/s eta 0:00:01
Collecting numpy>=1.17.3; platform_machine != "aarch64" and platform_machine != "arm64" and python_version < "3.10"
  Downloading numpy-1.21.4-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
[K     |████████████████████████████████| 15.7 MB 306 kB/s eta 0:00:01
Installing collected packages: numpy, pandas
Successfully installed numpy-1.21.4 pandas-1.3.4


In [9]:
!pip3 install pyarrow

Collecting pyarrow
  Using cached pyarrow-6.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.6 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-6.0.1


In [3]:
import pandas as pd
df=pd.read_parquet("https://s3.eu-de.cloud-object-storage.appdomain.cloud/cloud-object-storage-yy-cos-standard-js4/data.parquet")

In [4]:
df.head()

Unnamed: 0,x,y,z,source,class
0,28,45,47,Accelerometer-2011-03-29-09-52-41-use_telephon...,Use_telephone
1,28,45,47,Accelerometer-2011-03-29-09-52-41-use_telephon...,Use_telephone
2,28,46,48,Accelerometer-2011-03-29-09-52-41-use_telephon...,Use_telephone
3,28,46,48,Accelerometer-2011-03-29-09-52-41-use_telephon...,Use_telephone
4,27,45,47,Accelerometer-2011-03-29-09-52-41-use_telephon...,Use_telephone


In [4]:
sdf = spark.createDataFrame(df)

In [5]:
from pyspark.sql.types import DoubleType
sdf = sdf.withColumn("x", sdf.x.cast(DoubleType()))
sdf = sdf.withColumn("y", sdf.y.cast(DoubleType()))
sdf = sdf.withColumn("z", sdf.z.cast(DoubleType()))

In [6]:
sdf.printSchema()

root
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
 |-- source: string (nullable = true)
 |-- class: string (nullable = true)



## 6. Building and Training a classification Model

In [7]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [8]:
input_columns = ["x", "y", "z"]  # input columns to consider

train, test = sdf.randomSplit([0.8, 0.2], seed=1)

indexer = StringIndexer(inputCol="class", outputCol="label")

vectorAssembler = VectorAssembler(inputCols=input_columns, outputCol="features")

normalizer = MinMaxScaler(inputCol="features", outputCol="features_norm")

pipeline = Pipeline(stages=[indexer, vectorAssembler, normalizer])

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction"). \
    setLabelCol("label")

In [9]:
df_train = pipeline.fit(train).transform(train)
df_test = pipeline.fit(test).transform(test)

                                                                                

In [13]:
df_train.show(5)

+----+----+----+--------------------+-------------+-----+----------------+--------------------+
|   x|   y|   z|              source|        class|label|        features|       features_norm|
+----+----+----+--------------------+-------------+-----+----------------+--------------------+
|20.0|45.0|45.0|Accelerometer-201...|Use_telephone|  0.0|[20.0,45.0,45.0]|[0.31746031746031...|
|21.0|46.0|52.0|Accelerometer-201...|Use_telephone|  0.0|[21.0,46.0,52.0]|[0.33333333333333...|
|21.0|47.0|45.0|Accelerometer-201...|Use_telephone|  0.0|[21.0,47.0,45.0]|[0.33333333333333...|
|22.0|40.0|49.0|Accelerometer-201...|Use_telephone|  0.0|[22.0,40.0,49.0]|[0.34920634920634...|
|23.0|47.0|52.0|Accelerometer-201...|Use_telephone|  0.0|[23.0,47.0,52.0]|[0.36507936507936...|
+----+----+----+--------------------+-------------+-----+----------------+--------------------+
only showing top 5 rows



In [14]:
print("Training Dataset Count: " + str(df_train.count()))
print("Test Dataset Count: " + str(df_test.count()))

Training Dataset Count: 6653
Test Dataset Count: 1718


In [30]:
print("Training Dataset Count: " + str(df_train.count()))
print("Test Dataset Count: " + str(df_test.count()))

Training Dataset Count: 357420
Test Dataset Count: 89109


### Logistic regression

we'll will train a logistic regression model and perform hyperparameter tuning as follows:

*   Train it with different hyperparameters listed below and report the best performing hyperparameter combinations.

    Hyper parameters:
   ``` 
       - max iteration : {10, 100, 1000}
       - regularization : {0.01, 0.5, 2.0}
       - elastic net parameter : {0.0, 0.5, 1.0}
   ```
*   Use the accuracy metric when evaluating the model with different hyperparameters

`elasticNetParam ; allows to have both L1 and L2 regularization as special cases. set to 1, it is equivalent to a Lasso model and if set to 0, the trained model reduces to a ridge regression model.`

In [31]:
from pyspark.ml.classification import LogisticRegression

In [32]:
data = [];
for i in [10, 100, 1000]:
    for j in [0.01, 0.5, 2.0]:
        for k in [0.0, 0.5, 1.]:
            lr = LogisticRegression(featuresCol='features_norm', labelCol='label', maxIter=i,regParam=j,
                        elasticNetParam=k)
            model = lr.fit(df_train)
            data.append([i,j,k,binEval.evaluate(model.transform(df_train))*100,
                            binEval.evaluate(model.transform(df_test))*100])

In [33]:
df = pd.DataFrame(data, columns=['maxIter', 'regParam', 'elasticNetParam', 'accuracy_tr', 'accuracy_test'])
df.sort_values(by=['accuracy_tr'], ascending=False).head(5)

Unnamed: 0,maxIter,regParam,elasticNetParam,accuracy_tr,accuracy_test
20,1000,0.01,1.0,35.405405,35.26692
11,100,0.01,1.0,35.405405,35.26692
2,10,0.01,1.0,35.32231,35.167043
1,10,0.01,0.5,35.113312,35.109809
19,1000,0.01,0.5,34.967825,34.965043


The logistic regression model is not performing but anayway the best parameters are {`max iter: 100`, `regParam:0.01` and `elasticNet:1`}

### RandomForest classification

In this task, you will train a RandomForest classification model and perform hyperparameter tuning as follows:

*   Train a Random Forest model with different hyperparameters listed below and report the best performing hyperparameter combinations.

    Hyper parameters:
   ``` 
       - number of trees : {10, 20}
       - maximum depth : {5, 7} 
   ```
*   Use the accuracy metric when evaluating the model with different hyperparameters

In [10]:
from pyspark.ml.classification import RandomForestClassifier

In [35]:
data = [];
for i in [10, 20]:
    for j in [5, 7]:
        rf = RandomForestClassifier(featuresCol = 'features_norm', labelCol = 'label', maxDepth=i,
                                    numTrees=j, seed=1)
        rfModel = rf.fit(df_train)
        data.append([i,j,binEval.evaluate(rfModel.transform(df_train))*100,
                        binEval.evaluate(rfModel.transform(df_test))*100])

In [36]:
df = pd.DataFrame(data, columns=['nbrTrees', 'maxDepth', 'accuracy_tr', 'accuracy_test'])
df.sort_values(by=['accuracy_tr'], ascending=False)

Unnamed: 0,nbrTrees,maxDepth,accuracy_tr,accuracy_test
3,20,7,52.318281,48.693174
2,20,5,52.179229,48.583196
1,10,7,49.102456,48.392418
0,10,5,48.961166,48.251018


Then the best parameters for the random forest model are {`number of trees`: 20 and `max depth`:7}

## 7. Model building and deployement

Let's first retrain the random forest model with the best parameters and then save as .xml (it's a PMML format, standing for “Predictive Model Markup Language” and is an interchange format for machine learning models). We’ll use this file to deploy to the `IBM Watson Machine Learning Service`

In [11]:
rf = RandomForestClassifier(featuresCol='features_norm', labelCol='label', maxDepth=20, numTrees=7, seed=1)
rfModel = rf.fit(df_train)

21/12/01 17:36:19 WARN DAGScheduler: Broadcasting large task binary with size 1086.9 KiB
21/12/01 17:36:20 WARN DAGScheduler: Broadcasting large task binary with size 1441.7 KiB
21/12/01 17:36:21 WARN DAGScheduler: Broadcasting large task binary with size 1800.6 KiB
21/12/01 17:36:22 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
21/12/01 17:36:22 WARN DAGScheduler: Broadcasting large task binary with size 2.3 MiB
21/12/01 17:36:23 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
21/12/01 17:36:23 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
21/12/01 17:36:24 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
21/12/01 17:36:24 WARN DAGScheduler: Broadcasting large task binary with size 2.3 MiB
21/12/01 17:36:25 WARN DAGScheduler: Broadcasting large task binary with size 2.3 MiB


if you get this error `An error occurred while trying to connect to the Java server (127.0.0.1:35321)` just restart the kernel, it's likely because you're in standalone mode and there is no enough memory. see <a href="https://stackoverflow.com/questions/51359802/pyspark-errorpy4j-java-gatewayan-error-occurred-while-trying-to-connect-to-the">here</a>

In [17]:
binEval.evaluate(rfModel.transform(df_train))*100

21/12/01 17:05:38 WARN DAGScheduler: Broadcasting large task binary with size 1385.3 KiB


89.94438599128213

In [None]:
# !export PYSPARK_DRIVER_PYTHON=jupyter
# !export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
# !pyspark --jars jpmml-sparkml-executable-1.7.2.jar
# !mv jpmml-sparkml-executable-1.7.2.jar /opt/spark/jars/

In [14]:
!pip3 install pyspark2pmml==0.5.1



In [12]:
from pyspark2pmml import PMMLBuilder
model_target = "HMP_frModel.xml"       # model output file name

In [18]:
spark

In [23]:
!pip3 install wget

Processing /home/mbg/.cache/pip/wheels/bd/a8/c3/3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0/wget-3.2-py3-none-any.whl
Installing collected packages: wget
Successfully installed wget-3.2


In [34]:
# import shutil
# import site
# import wget
# url = ('https://github.com/jpmml/jpmml-sparkml/releases/download/1.7.2/'
#            'jpmml-sparkml-executable-1.7.2.jar')
# wget.download(url)
# # shutil.copy('jpmml-sparkml-executable-1.7.2.jar', site.getsitepackages()[0] + '/pyspark/jars/')
# shutil.copy('jpmml-sparkml-executable-1.7.2.jar', '~/.local/lib/python3.8/site-packages/pyspark/jars')


In [13]:
pmmlBuilder = PMMLBuilder(sc, df_train, rfModel)

Exception in thread "Thread-4" java.lang.ExceptionInInitializerError
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at py4j.reflection.CurrentThreadClassLoadingStrategy.classForName(CurrentThreadClassLoadingStrategy.java:40)
	at py4j.reflection.ReflectionUtil.classForName(ReflectionUtil.java:51)
	at py4j.reflection.TypeUtil.forName(TypeUtil.java:243)
	at py4j.commands.ReflectionCommand.getUnknownMember(ReflectionCommand.java:175)
	at py4j.commands.ReflectionCommand.execute(ReflectionCommand.java:87)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Expected Apache Spark ML version 3.1, got version 3.2 (3.2.0)
	at org.jpmml.sparkml.ConverterFactory.checkVersion(ConverterFactory.java:114)
	at org.jpmml.sparkml.PMMLBuilder.init(PMML

Py4JError: org.jpmml.sparkml.PMMLBuilder does not exist in the JVM

In [None]:
pmmlBuilder.buildFile(model_target)