This is the second assignment for the Coursera course "Advanced Machine Learning and Signal Processing".


Execute the cells one after the other. In the last cell, enter the email address you use for Coursera and a submission token, which you can obtain from the programming assignment page on Coursera.

Please fill in the sections labelled with "###YOUR_CODE_GOES_HERE###"

This notebook is designed to run in an IBM Watson Studio default runtime, NOT the Watson Studio Apache Spark runtime, because the default runtime with 1 vCPU is free of charge. We therefore install Apache Spark in local mode, for test purposes only. Please don't use this setup in production.

In case you are facing issues, please read the following two documents first:

https://github.com/IBM/skillsnetwork/wiki/Environment-Setup

https://github.com/IBM/skillsnetwork/wiki/FAQ

Then, please feel free to ask:

https://coursera.org/learn/machine-learning-big-data-apache-spark/discussions/all

Please make sure to follow the guidelines before asking a question:

https://github.com/IBM/skillsnetwork/wiki/FAQ#im-feeling-lost-and-confused-please-help-me


This notebook should also work outside Watson Studio. If you are already running in an Apache Spark context outside Watson Studio, please remove the Apache Spark setup in the first notebook cells.

In [None]:
from IPython.display import Markdown, display


def printmd(string):
    display(Markdown('# <span style="color:red">' + string + "</span>"))


if "sc" in locals() or "sc" in globals():
    printmd(
        "<<<<<!!!!! It seems that you are running in an IBM Watson Studio Apache Spark Notebook. Please run it in an IBM Watson Studio Default Runtime (without Apache Spark) !!!!!>>>>>"
    )

In [None]:
!pip install pyspark==2.4.5

In [None]:
try:
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
except ImportError:
    printmd(
        "<<<<<!!!!! Please restart your kernel after installing Apache Spark !!!!!>>>>>"
    )

In [None]:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession.builder.getOrCreate()

In [None]:
!wget https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet

Now it’s time to have a look at the recorded sensor data. You should see data similar to the sample shown below.


In [None]:
df = spark.read.load("a2.parquet")

df.createOrReplaceTempView("df")
spark.sql("SELECT * from df").show()

In [None]:
spark.sql("SELECT count(*) from df").show()

In [None]:
spark.sql("SELECT CLASS, count(*) from df group by CLASS").show()
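The per-class counts are worth a second look, because they tell you what accuracy a trivial model would reach. The sketch below computes the accuracy of always predicting the most frequent class. The counts are hypothetical placeholders, not the real numbers from a2.parquet; substitute the values printed by the query above.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("class_counts must contain at least one row")
    return max(class_counts.values()) / total


# Hypothetical counts for illustration only (NOT taken from a2.parquet).
hypothetical_counts = {0: 600, 1: 400}
baseline = majority_baseline(hypothetical_counts)
print(baseline)
```

A trained classifier is only interesting if its accuracy is clearly above this baseline.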

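Please create a VectorAssembler which consumes columns X, Y and Z and produces a column “features”. The cell below also defines a MinMaxScaler that rescales “features” into a new column “features_norm”.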


In [None]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler


vectorAssembler = VectorAssembler(inputCols=["X", "Y", "Z"], outputCol="features")
normalizer = MinMaxScaler(inputCol="features", outputCol="features_norm")
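To make the normalization step less of a black box, here is a minimal pure-Python sketch of what a min-max scaler computes with its default target range of [0, 1]: each column is rescaled independently using that column's own minimum and maximum. The function name and the sample rows are made up for illustration, and the handling of a constant column (mapped to 0.5) follows the behaviour described in the Spark documentation as far as I know; treat it as an assumption.

```python
def min_max_scale(rows):
    """Rescale each column of `rows` to the range [0, 1] independently."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so it is mapped to the midpoint.
            (value - low) / (high - low) if high > low else 0.5
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical (X, Y, Z) readings for illustration, not taken from a2.parquet.
sample = [[0.0, -10.0, 5.0], [5.0, 0.0, 5.0], [10.0, 10.0, 5.0]]
scaled_sample = min_max_scale(sample)
print(scaled_sample)
```

Tree-based models such as gradient-boosted trees are largely insensitive to this kind of monotonic rescaling, so the step matters more if you swap in a different classifier.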

Please instantiate a classifier from the SparkML package and assign it to a variable (the cell below names it gbt). Make sure to either
1. Rename the “CLASS” column to “label” or
2. Specify the label column correctly to be “CLASS”.


In [None]:
from pyspark.ml.classification import GBTClassifier


gbt = GBTClassifier(featuresCol="features_norm", maxIter=10, labelCol="CLASS")

Let’s train and evaluate…


In [None]:
from pyspark.ml import Pipeline


pipeline = Pipeline(stages=[vectorAssembler, normalizer, gbt])

In [None]:
model = pipeline.fit(df)

In [None]:
prediction = model.transform(df)

In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


paramGrid = (
    ParamGridBuilder()
    .addGrid(gbt.maxBins, [2, 4, 8])
    .addGrid(gbt.maxDepth, [2, 4, 8])
    .build()
)
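The grid above is simply the Cartesian product of the two value lists. The sketch below reproduces that in plain Python to show how quickly the search grows: with the 5 folds used later in this notebook, every combination is fitted once per fold. The dictionary keys are plain strings chosen for illustration, not Spark Param objects.

```python
from itertools import product

max_bins_values = [2, 4, 8]
max_depth_values = [2, 4, 8]

# One dictionary per parameter combination, like one entry of the param grid.
grid = [
    {"maxBins": bins, "maxDepth": depth}
    for bins, depth in product(max_bins_values, max_depth_values)
]

num_folds = 5  # same value as numFolds in the CrossValidator cell
fits_during_search = len(grid) * num_folds
print(len(grid), fits_during_search)
```

That is 9 combinations and 45 model fits during the search, plus one final fit of the best combination on the full data set, so expect the cross-validation cell to take a while.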

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


binEval = (
    MulticlassClassificationEvaluator()
    .setMetricName("accuracy")
    .setPredictionCol("prediction")
    .setLabelCol("CLASS")
)

In [None]:
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=binEval,
    numFolds=5,
)
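If k-fold cross-validation is new to you, the following pure-Python sketch shows the idea: the rows are split into k folds, and each fold serves once as the validation set while the remaining folds are used for training. This is only an illustration under simplifying assumptions; Spark assigns rows to folds randomly on its own, so the actual splits differ in detail. The function name and the row count are hypothetical.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Every k-th index goes into the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_rows=10, k=5))
for training, validation in splits:
    print(sorted(validation), len(training))
```

Every row lands in exactly one validation fold, so each parameter combination is scored on data it was not trained on.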

In [None]:
cvModel = crossval.fit(df)

In [None]:
prediction = cvModel.transform(df)

In [None]:
binEval.evaluate(prediction)
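The "accuracy" metric used by the evaluator is the fraction of rows whose prediction equals the label. A minimal sketch with made-up labels and predictions (not results from this assignment):

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one row is required")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# Made-up values for illustration only.
labels = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 0, 1]
score = accuracy(labels, predictions)
print(score)
```

Keep in mind that the cell above scores the same rows that were used for fitting, so the reported value is probably somewhat optimistic compared with unseen data.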

If you are happy with the result (I’m happy with > 0.55), please submit your solution to the grader by executing the following cells. Don’t forget to obtain an assignment submission token (secret) from the Coursera grader’s web page and paste it into the “token” variable below, together with the email address you use for Coursera. (An accuracy of 0.55 means that you are performing better than random guessing.)


In [None]:
!rm -Rf a2_m2.json

In [None]:
prediction = prediction.repartition(1)
prediction.write.json("a2_m2.json")

In [None]:
!rm -f rklib.py
!wget https://raw.githubusercontent.com/IBM/coursera/master/rklib.py

In [None]:
import os
import zipfile


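def zipdir(path, ziph):
    # Add every file below `path` to the already opened archive `ziph`.
    for root, _dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))


# The context manager closes the archive even if zipdir raises an error.
with zipfile.ZipFile("a2_m2.json.zip", "w", zipfile.ZIP_DEFLATED) as zipf:
    zipdir("a2_m2.json", zipf)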

In [None]:
!base64 a2_m2.json.zip > a2_m2.json.zip.base64
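If the base64 command line tool is not available in your environment, the same encoding can be done with the Python standard library. The sketch below is an assumption about an equivalent approach, not part of the original assignment; the helper name and the temporary demo file are hypothetical. `base64.encodebytes` inserts a line break after every 76 output characters, which resembles the default wrapping of the GNU tool.

```python
import base64
import os
import tempfile


def encode_file_base64(src_path, dst_path):
    """Write a line-wrapped base64 copy of src_path to dst_path."""
    with open(src_path, "rb") as src:
        encoded = base64.encodebytes(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(encoded)
    return encoded


# Demonstration on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_src = os.path.join(tmp, "demo.bin")
    demo_dst = demo_src + ".base64"
    with open(demo_src, "wb") as handle:
        handle.write(b"hello sensor data")
    encode_file_base64(demo_src, demo_dst)
    with open(demo_dst, "rb") as handle:
        round_trip = base64.decodebytes(handle.read())

print(round_trip)
```

For the assignment itself you would call the helper with "a2_m2.json.zip" and "a2_m2.json.zip.base64" instead of the demo paths.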

In [None]:
from rklib import submit


key = "J3sDL2J8EeiaXhILFWw2-g"
part = "G4P6f"
email = "###YOUR_CODE_GOES_HERE###"  # the email address you use for Coursera
token = "###YOUR_CODE_GOES_HERE###"  # submission token from the assignment page

with open("a2_m2.json.zip.base64", "r") as myfile:
    data = myfile.read()
submit(email, token, key, part, [part], data)