<a href="https://colab.research.google.com/github/j143/notebooks/blob/master/systemds_dev_standalone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Developer notebook for Apache SystemDS

#### Install Java
This installs the headless build of OpenJDK 8.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

#### Set Environment Variables
Set `JAVA_HOME` and make Java 8 the default `java`. `SPARK_HOME` is set later, once Spark has been downloaded.

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version
# os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
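
Before building, it can help to confirm that `JAVA_HOME` points at a usable installation. A minimal sketch: the helper name `java_home_looks_valid` is hypothetical, and it only checks that a `bin/java` file exists under the given directory.

```python
import os

def java_home_looks_valid(java_home):
    """Return True if java_home contains a bin/java file (hypothetical helper)."""
    return os.path.isfile(os.path.join(java_home, "bin", "java"))

# .get() avoids a KeyError when JAVA_HOME has not been set yet.
print(java_home_looks_valid(os.environ.get("JAVA_HOME", "")))
```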


## Apache SystemDS

#### Setup

In [3]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')
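
The `!{cmd}` syntax only works inside IPython/Colab. For use in a plain Python script, a rough equivalent can be sketched with `subprocess`. The name `run_plain` is hypothetical, and the sketch assumes a POSIX shell is available.

```python
import subprocess

def run_plain(cmd):
    """Print a shell command, run it, echo its output, and return the exit code."""
    print('>> {}'.format(cmd))
    result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    print(result.stdout, end='')
    print('')
    return result.returncode

run_plain('echo hello')
```

Unlike the notebook helper, this variant returns the exit code, so a failed download or build can be detected by the caller.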


#### Install Apache Maven

In [4]:
import os

# Download the Maven binary distribution.
maven_version = 'apache-maven-3.6.3'
maven_path = f"/opt/{maven_version}"
if not os.path.exists(maven_path):
  run(f"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip")
  run('unzip -q -d /opt apache-maven.zip')
  run('rm -f apache-maven.zip')

# Call Maven by its absolute path instead of relying on the $PATH environment variable.
def maven(args):
  run(f"{maven_path}/bin/mvn {args}")

maven('-v')


>> wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.zip

>> unzip -q -d /opt apache-maven.zip

>> rm -f apache-maven.zip

>> /opt/apache-maven-3.6.3/bin/mvn -v
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /opt/apache-maven-3.6.3
Java version: 1.8.0_252, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.19.104+", arch: "amd64", family: "unix"



#### Download Apache Spark

NOTE: If Spark fails to download, check that the version requested is still officially supported at
https://spark.apache.org/downloads.html

In [13]:
# Spark and Hadoop version
spark_version = 'spark-2.4.6'
hadoop_version = 'hadoop2.7'
spark_path = f"/opt/{spark_version}-bin-{hadoop_version}"
if not os.path.exists(spark_path):
  run(f"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz")
  run('tar zxf apache-spark.tgz -C /opt')
  run('rm -f apache-spark.tgz')

os.environ["SPARK_HOME"] = spark_path
# Python does not expand $SPARK_HOME inside a string, so append the resolved path.
os.environ["PATH"] += f":{spark_path}/bin"


>> wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz

>> tar zxf apache-spark.tgz -C /opt

>> rm -f apache-spark.tgz
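
`downloads.apache.org` only carries current releases, so a pinned version such as `spark-2.4.6` may stop resolving there once it is superseded. Older releases are kept under `archive.apache.org/dist`. A sketch that builds both candidate URLs follows. The helper name `spark_urls` is hypothetical, and nothing is downloaded here.

```python
def spark_urls(spark_version, hadoop_version):
    """Return candidate download URLs: current mirror first, archive second."""
    tarball = f"{spark_version}-bin-{hadoop_version}.tgz"
    return [
        f"https://downloads.apache.org/spark/{spark_version}/{tarball}",
        f"https://archive.apache.org/dist/spark/{spark_version}/{tarball}",
    ]

for url in spark_urls('spark-2.4.6', 'hadoop2.7'):
    print(url)
```

Each URL can be tried in turn with `wget` until one succeeds.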



#### Get Apache SystemDS

In [6]:
!git clone https://github.com/apache/systemds systemds
%cd systemds

Cloning into 'systemds'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 152626 (delta 0), reused 8 (delta 0), pack-reused 152603
Receiving objects: 100% (152626/152626), 225.02 MiB | 13.23 MiB/s, done.
Resolving deltas: 100% (97811/97811), done.
/content/systemds


In [7]:
# Logging flags: -q prints errors only; -X enables debug output; -e shows full stack traces for errors
maven('clean package -q')

>> /opt/apache-maven-3.6.3/bin/mvn clean package -q



In [8]:
# Example Classification task
# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml

#### Playground for DML

Use the following cell to write DML code. `%%writefile` saves it to `/content/test.dml`.

In [9]:
%%writefile /content/test.dml

# This code acts as a playground for DML code
X = rand (rows = 20, cols = 10)
y = X %*% rand(rows = ncol(X), cols = 1)
lm(X = X, y = y)

Writing /content/test.dml


Run the DML script with the Spark backend

In [14]:
!$SPARK_HOME/bin/spark-submit \
    ./target/SystemDS.jar -f /content/test.dml 


20/07/18 05:06:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.sysds.api.DMLScript).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y, 2.306291350486548
STDEV_TOT_Y, 0.41527871477713496
AVG_RES_Y, 5.730468777276343E-9
STDEV_RES_Y, 5.849014866902826E-8
DISPERSION, 3.144664287007204E-15
R2, 0.999999999999
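
An `R2` this close to 1 is expected: `y` was generated as an exact linear function of `X`, so a least-squares fit leaves almost no residual. The sketch below illustrates the usual definition, R² = 1 − SS_res / SS_tot, in plain Python with made-up numbers. That definition is an assumption about what the statistic means, not a copy of the SystemDS implementation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))                     # perfect predictions
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```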

### Working with SystemDS **Standalone**

(NOTE: Pay attention to *directories* and *relative paths*. :))

##### 1. Set SystemDS environment variables

These are useful for the `./bin/systemds` script.

In [15]:
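# `!export` only changes a throwaway subshell, so set the variables on the notebook process itself.
import os
os.environ["SYSTEMDS_ROOT"] = os.getcwd()
os.environ["PATH"] = os.environ["SYSTEMDS_ROOT"] + "/bin:" + os.environ["PATH"]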

In [16]:
!echo 'export SYSTEMDS_ROOT='$(pwd) >> ~/.bashrc
!echo 'export PATH=$SYSTEMDS_ROOT/bin:$PATH' >> ~/.bashrc
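
Each `!` command in a notebook runs in its own shell process, so a variable exported there is gone as soon as the command returns. The sketch below reproduces that behaviour with `subprocess` and contrasts it with `os.environ`. `DEMO_ROOT` is a hypothetical variable used only for illustration.

```python
import os
import subprocess

# An export inside a child shell does not reach the parent Python process.
subprocess.run("export DEMO_ROOT=/tmp/demo", shell=True)
print("DEMO_ROOT" in os.environ)

# Setting os.environ changes this process, and child shells inherit it.
os.environ["DEMO_ROOT"] = "/tmp/demo"
out = subprocess.run("echo $DEMO_ROOT", shell=True, stdout=subprocess.PIPE,
                     universal_newlines=True).stdout.strip()
print(out)
```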

##### 2. Download Haberman data

In [23]:
!mkdir -p ../data

In [28]:
!wget -P ../data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data

--2020-07-18 05:19:39--  http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3103 (3.0K) [application/x-httpd-php]
Saving to: ‘../data/haberman.data’


2020-07-18 05:19:39 (266 MB/s) - ‘../data/haberman.data’ saved [3103/3103]



##### 2.1 Set `metadata` for the data

In [29]:
# generate metadata file for the dataset
!echo '{"rows": 306, "cols": 4, "format": "csv"}' > ../data/haberman.data.mtd

# generate type description for the data (1 = scale, 2 = nominal; the last column is the categorical class label)
!echo '1,1,1,2' > ../data/types.csv
!echo '{"rows": 1, "cols": 4, "format": "csv"}' > ../data/types.csv.mtd
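
The row and column counts above are typed by hand. As a sketch, the same metadata could be derived from the file itself. The helper name `write_mtd` is hypothetical, and it assumes a header-less, comma-separated file in which every row has the same number of fields.

```python
import csv
import json

def write_mtd(csv_path):
    """Count rows and columns of a header-less CSV and write <csv_path>.mtd."""
    with open(csv_path, newline='') as f:
        rows = [row for row in csv.reader(f) if row]  # ignore empty lines
    meta = {"rows": len(rows), "cols": len(rows[0]) if rows else 0, "format": "csv"}
    with open(csv_path + ".mtd", "w") as f:
        json.dump(meta, f)
    return meta

# Example call (requires the file to exist):
# write_mtd('../data/haberman.data')
```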

##### 3. Find the algorithm to run with `systemds`

In [36]:
# Inspect the directory structure
!ls

bin   CONTRIBUTING.md  docker  LICENSE	pom.xml    scratch_space  src
conf  dev	       docs    NOTICE	README.md  scripts	  target


In [34]:
# List all the scripts
!ls scripts/algorithms

ALS-CG.dml		   GLM.dml	       naive-bayes.dml
ALS-DS.dml		   GLM-predict.dml     naive-bayes-predict.dml
ALS_predict.dml		   KM.dml	       obsolete
ALS_topk_predict.dml	   Kmeans.dml	       PCA.dml
apply-transform.dml	   Kmeans-predict.dml  random-forest.dml
bivar-stats.dml		   l2-svm.dml	       random-forest-predict.dml
Cox.dml			   l2-svm-predict.dml  StepGLM.dml
Cox-predict.dml		   LinearRegCG.dml     StepLinearRegDS.dml
CsplineCG.dml		   LinearRegDS.dml     stratstats.dml
CsplineDS.dml		   m-svm.dml	       transform.dml
decision-tree.dml	   m-svm-predict.dml   Univar-Stats.dml
decision-tree-predict.dml  MultiLogReg.dml


In [33]:
!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE

###############################################################################
#  SYSTEMDS_ROOT= .
#  SYSTEMDS_JAR_FILE= target/SystemDS.jar
#  CONFIG_FILE= 
#  LOG4JPROP= -Dlog4j.configuration=file:conf/log4j-silent.properties
#  CLASSPATH= target/SystemDS.jar:./lib/*:./target/lib/*
#  HADOOP_HOME= /content/systemds/target/hadoop-test/org/apache/hadoop
#
#  Running script ./scripts/algorithms/Univar-Stats.dml locally with opts: -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE
###############################################################################
Executing command:     java       -Xmx4g      -Xms4g      -Xmn400m   -cp target/SystemDS.jar:./lib/*:./target/lib/*   -Dlog4j.configuration=file:conf/log4j-silent.properties   org.apache.sysds.api.DMLScript   -f ./scripts/algorithms/Univar-Stats.dml   -exec singlenode      -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE

20/
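
When the same script is run with different inputs, the `-nvargs` list can be assembled from a dictionary. A minimal sketch: the helper name `systemds_command` is hypothetical, and values are not shell-quoted, so paths containing spaces would need extra handling.

```python
def systemds_command(script, nvargs):
    """Build a ./bin/systemds command line with -nvargs key=value pairs."""
    pairs = " ".join(f"{key}={value}" for key, value in nvargs.items())
    return f"./bin/systemds {script} -nvargs {pairs}"

cmd = systemds_command(
    "./scripts/algorithms/Univar-Stats.dml",
    {"X": "../data/haberman.data", "TYPES": "../data/types.csv",
     "STATS": "../data/univarOut.mtx", "CONSOLE_OUTPUT": "TRUE"},
)
print(cmd)
```

In the notebook, the resulting string can be passed to the `run` helper defined earlier.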

##### 3.1 Let us inspect the output data

In [35]:
!cat ../data/univarOut.mtx

1 1 30.0
1 2 58.0
2 1 83.0
2 2 69.0
2 3 52.0
3 1 53.0
3 2 11.0
3 3 52.0
4 1 52.45751633986928
4 2 62.85294117647059
4 3 4.026143790849673
5 1 116.71458266366658
5 2 10.558630665380907
5 3 51.691117539912135
6 1 10.803452349303281
6 2 3.2494046632238507
6 3 7.189653506248555
7 1 0.6175922641866753
7 2 0.18575610076612029
7 3 0.41100513466216837
8 1 0.20594669940735139
8 2 0.051698529971741194
8 3 1.7857418611299172
9 1 0.1450718616532357
9 2 0.07798443581479181
9 3 2.954633471088322
10 1 -0.6150152487211726
10 2 -1.1324380182967442
10 3 11.425776549251449
11 1 0.13934809593495995
11 2 0.13934809593495995
11 3 0.13934809593495995
12 1 0.277810485320835
12 2 0.277810485320835
12 3 0.277810485320835
13 1 52.0
13 2 63.0
13 3 1.0
14 1 52.16013071895425
14 2 62.80392156862745
14 3 1.2483660130718954
15 4 2.0
16 4 1.0
17 4 1.0
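
Each output line is a `row column value` triple, where the column is the feature index. A small parser makes the values easier to inspect. This is a sketch under two assumptions read off the numbers above rather than confirmed from the script: rows 1, 2 and 3 hold minimum, maximum and range (83 − 30 = 53 for feature 1), and cells missing from the file are zero.

```python
def parse_ijv(text):
    """Parse 'row col value' lines into a {(row, col): value} dictionary."""
    cells = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        cells[(int(parts[0]), int(parts[1]))] = float(parts[2])
    return cells

sample = """1 1 30.0
2 1 83.0
3 1 53.0
"""
stats = parse_ijv(sample)
# Missing cells default to 0.0 (assumption: zeros are omitted from the file).
print(stats.get((2, 1), 0.0) - stats.get((1, 1), 0.0) == stats.get((3, 1), 0.0))
```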
