## Developer notebook for Apache SystemDS

#### Install Java
This installs the headless build of OpenJDK 8.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

#### Set Environment Variables
Set `JAVA_HOME` and select Java 8 as the default `java`. `SPARK_HOME` is set later, once Spark has been downloaded.

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version
# os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
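
The version string printed above can also be checked programmatically. The following is a minimal sketch, assuming the usual `version "..."` format of `java -version`; the helper name `parse_java_major` is hypothetical and not part of this notebook.

```python
import re

def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_252" means Java 8. Newer scheme: "11.0.7" means Java 11.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first

# Sample line copied from the output above.
sample = 'openjdk version "1.8.0_252"'
print(parse_java_major(sample))
```

In a notebook, the text could be captured with `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` writes to standard error.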


## Apache SystemDS

#### Setup

In [0]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')
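
The `!{cmd}` syntax only works inside IPython or Jupyter. Outside a notebook, a similar helper could be sketched with the standard `subprocess` module; `run_plain` is a hypothetical name, and returning the exit code is an addition the original `run` does not make.

```python
import subprocess

def run_plain(cmd):
    """Print a shell command, run it through the shell, and return its exit code."""
    print('>> {}'.format(cmd))
    completed = subprocess.run(cmd, shell=True)
    print('')
    return completed.returncode

run_plain('echo hello')
```

Returning the exit code makes it possible to stop early when a download or build step fails.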


#### Install Apache Maven

In [5]:
import os

# Download the Maven binary distribution.
maven_version = 'apache-maven-3.6.3'
maven_path = f"/opt/{maven_version}"
if not os.path.exists(maven_path):
  run(f"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip")
  run('unzip -q -d /opt apache-maven.zip')
  run('rm -f apache-maven.zip')

# Call Maven by its absolute path rather than relying on the $PATH environment variable.
def maven(args):
  run(f"{maven_path}/bin/mvn {args}")

maven('-v')


>> wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.zip

>> unzip -q -d /opt apache-maven.zip

>> rm -f apache-maven.zip

>> /opt/apache-maven-3.6.3/bin/mvn -v
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /opt/apache-maven-3.6.3
Java version: 1.8.0_252, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.19.104+", arch: "amd64", family: "unix"
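
The download URL follows a regular pattern, so it can be derived from the version number alone. The sketch below is illustrative only: `maven_urls` is a hypothetical helper, and the second URL assumes that older releases are kept under `archive.apache.org/dist` with the same directory layout, which should be verified before relying on it.

```python
def maven_urls(version):
    """Return candidate download URLs for a Maven 3.x binary zip."""
    rel = f"maven/maven-3/{version}/binaries/apache-maven-{version}-bin.zip"
    return [
        # Same URL pattern as the wget call above.
        f"https://downloads.apache.org/{rel}",
        # Assumption: the archive server mirrors the same layout under /dist.
        f"https://archive.apache.org/dist/{rel}",
    ]

for url in maven_urls("3.6.3"):
    print(url)
```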



#### Download Apache Spark

In [6]:
# Spark and Hadoop version
spark_version = 'spark-2.4.5'
hadoop_version = 'hadoop2.7'
spark_path = f"/opt/{spark_version}-bin-{hadoop_version}"
if not os.path.exists(spark_path):
  run(f"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz")
  run('tar zxf apache-spark.tgz -C /opt')
  run('rm -f apache-spark.tgz')

os.environ["SPARK_HOME"] = spark_path
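# Append the resolved directory; a literal "$SPARK_HOME" would not be expanded inside PATH.
os.environ["PATH"] += os.pathsep + os.path.join(spark_path, "bin")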


>> wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

>> tar zxf apache-spark.tgz -C /opt

>> rm -f apache-spark.tgz
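
Values assigned to `os.environ` from Python are stored verbatim: a `$NAME` reference inside the string is not expanded the way a shell would expand it on the command line. A minimal sketch of the difference, using a hypothetical `DEMO_HOME` variable that is not part of this notebook:

```python
import os

# Hypothetical variable, used only for illustration.
os.environ["DEMO_HOME"] = "/opt/demo"

literal = "$DEMO_HOME/bin"               # stored exactly as written
expanded = os.path.expandvars(literal)   # substitutes the current value of DEMO_HOME

print(literal)
print(expanded)
```

A `PATH` entry should therefore be built from the resolved directory, or passed through `os.path.expandvars` first.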



#### Get Apache SystemDS

In [1]:
!git clone https://github.com/apache/systemml systemds
%cd systemds

Cloning into 'systemds'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 148415 (delta 1), reused 8 (delta 1), pack-reused 148399
Receiving objects: 100% (148415/148415), 214.74 MiB | 25.34 MiB/s, done.
Resolving deltas: 10

In [10]:
# Logging flags: -q shows only errors; -X enables debug output; -e prints error stack traces
maven('clean package -q')

>> /opt/apache-maven-3.6.3/bin/mvn clean package -q
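
Because `-q` suppresses normal output, a successful build prints nothing. One way to confirm that the build produced the jar used in the later cells is sketched below; `built_jar` is a hypothetical helper, and the jar name is taken from the `spark-submit` commands in this notebook.

```python
import os

def built_jar(target_dir, jar_name="SystemDS.jar"):
    """Return the jar path if it exists and is non-empty, otherwise None."""
    path = os.path.join(target_dir, jar_name)
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return path
    return None

# Prints the jar path after a successful build, or None if the jar is missing.
print(built_jar("./target"))
```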



In [0]:
# Example Classification task
# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml

#### Playground for DML

The following cell writes DML code to `/content/test.dml` so that it can be run with SystemDS.

In [9]:
%%writefile /content/test.dml

# This code acts as a playground for DML code.
X = rand (rows = 20, cols = 10)
y = X %*% rand(rows = ncol(X), cols = 1)
lm(X = X, y = y)

Writing /content/test.dml


Run the DML script with the Spark backend.

In [15]:
!$SPARK_HOME/bin/spark-submit \
    ./target/SystemDS.jar -f /content/test.dml 


20/06/06 17:15:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.sysds.api.DMLScript).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y, 2.2726367629635753
STDEV_TOT_Y, 0.6765237971000172
AVG_RES_Y, 4.539799897118613E-9
STDEV_RES_Y, 2.0354823544354087E-8
DISPERSION, 4.1410652358136863E-16
R2, 0.9999999999
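
The statistics are printed as `NAME, value` lines mixed in with log messages. If they are needed in Python, they could be collected into a dictionary. This is a minimal sketch that assumes the line format shown above; `parse_stats` is a hypothetical helper, not a SystemDS API.

```python
import re

# A statistic line looks like "AVG_TOT_Y, 2.2726367629635753".
STAT_LINE = re.compile(r'^([A-Z][A-Z0-9_]*), (\S+)$')

def parse_stats(output):
    """Collect 'NAME, value' lines into a dict of floats; ignore all other lines."""
    stats = {}
    for line in output.splitlines():
        match = STAT_LINE.match(line.strip())
        if match is None:
            continue
        try:
            stats[match.group(1)] = float(match.group(2))
        except ValueError:
            continue  # not a numeric value, skip the line
    return stats

# Lines copied from the output above.
sample = """Computing the statistics...
AVG_TOT_Y, 2.2726367629635753
DISPERSION, 4.1410652358136863E-16
"""
print(parse_stats(sample))
```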