<a href="https://colab.research.google.com/github/jalorenzo/SparkNotebookColab/blob/master/BDF_10_Spark_MLib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Configuration of Apache Spark on Colaboratory


### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.5.0, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
import os

os.environ["SPARK_VERSION"] = "spark-3.5.0"
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
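# Download from the Apache archive, which keeps every release (mirrors only host current ones)
!wget -q https://archive.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop3.tgz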
!tar xf $SPARK_VERSION-bin-hadoop3.tgz
!echo $SPARK_VERSION-bin-hadoop3.tgz
!rm $SPARK_VERSION-bin-hadoop3.tgz
!pip install -q findspark

### Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Enseignement/2023-2024/ING3/HPDA/BigDataFrameworks/data/"

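# -f avoids an error when the symbolic link does not exist yet
!rm -f /content/spark
!ln -s /content/$SPARK_VERSION-bin-hadoop3 /content/spark
# Each "!" command runs in its own subshell, so "!export" would not persist; update PATH from Python instead
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["SPARK_HOME"], "bin") + os.pathsep + os.path.join(os.environ["SPARK_HOME"], "sbin")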
!echo $SPARK_HOME
!env |grep  "DRIVE_DATA"

### Start a SparkSession
This will start a local Spark session.

In [None]:
!python -V

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise a list and show its first 2 elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

In [None]:
from pyspark.sql import SparkSession
# Create a SparkSession object (or retrieve it if it already exists)
spark = SparkSession \
    .builder \
    .appName("My application") \
    .config("spark.some.config.option", "some-value") \
    .master("local[4]") \
    .getOrCreate()
# Get the SparkContext
sc = spark.sparkContext

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')


---


# 10 - Spark MLlib

Library of parallel machine learning (ML) algorithms for massive data

-   Classic machine learning algorithms: classification, regression, clustering, collaborative filtering
-   Feature algorithms: feature extraction, transformation, dimensionality reduction, and selection
-   Tools to build, evaluate and tune ML pipelines
-   Other tools: linear algebra, statistics, data processing, etc.


Two packages:

-   **spark.mllib:** Original RDD-based API
-   **spark.ml:** High-level API, based on DataFrames

Documentation and APIs:

- ML
    - Guide: http://spark.apache.org/docs/latest/ml-guide.html
    - API Python: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html
    - API Scala: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.package
- MLlib
    - Guide: http://spark.apache.org/docs/latest/mllib-guide.html
    - API Python: https://spark.apache.org/docs/latest/api/python/reference/pyspark.mllib.html
    - API Scala: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.package



## Example

Use the [KMeans](http://spark.apache.org/docs/latest/mllib-clustering.html#k-means) clustering algorithm to group a small set of vectors into two clusters.
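
To make the idea concrete before turning to Spark, here is a minimal plain-Python sketch of the classic Lloyd iteration behind k-means: assign each point to its nearest centre, then move each centre to the mean of its points. It is not how Spark implements `KMeans` (Spark distributes the work and initialises centres with `k-means||`). The function names and the choice of initial centres are assumptions made for this illustration only.

```python
# Illustrative Lloyd's-algorithm sketch; names and initial centres are hypothetical.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def lloyd_kmeans(points, initial_centres, max_iter=20):
    """Return (labels, centres) after iterating until assignments stop changing."""
    centres = [list(c) for c in initial_centres]  # copy, so the input is not mutated
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for every point
        new_labels = [
            min(range(len(centres)), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centre to the mean of its members
        for j in range(len(centres)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centres[j] = mean(members)
    return labels, centres


# Dense form of four 3-element vectors forming two obvious groups
points = [
    [0.0, 1.2, 0.0],
    [0.0, 1.1, 0.0],
    [0.9, 0.0, 1.0],
    [1.0, 0.0, 1.1],
]
# Assumption: start from the first and third points
labels, centres = lloyd_kmeans(points, [points[0], points[2]])
print(labels)
print(centres)
```

With these starting centres, the first two points end up in one cluster and the last two in the other.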


In [None]:
from pyspark.ml.clustering import KMeans, KMeansModel
from pyspark.ml.linalg import Vectors

#  Define an array of 4 sparse vectors, 3 elements each
sparseData = [
     Vectors.sparse(3, {1: 1.2}),
     Vectors.sparse(3, {1: 1.1}),
     Vectors.sparse(3, {0: 0.9, 2: 1.0}),
     Vectors.sparse(3, {0: 1.0, 2: 1.1})
 ]

for i in range(4):
    print(sparseData[i].toArray())
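
`Vectors.sparse(size, {index: value})` stores only the non-zero entries, and `toArray()` expands them to a dense array. The plain-Python sketch below mimics that expansion to show what the representation means. The helper `sparse_to_dense` is hypothetical and is not part of PySpark.

```python
def sparse_to_dense(size, entries):
    """Expand a {index: value} mapping into a dense list of length `size`."""
    dense = [0.0] * size            # every position defaults to zero
    for index, value in entries.items():
        dense[index] = value        # fill in only the stored entries
    return dense


print(sparse_to_dense(3, {1: 1.2}))          # [0.0, 1.2, 0.0]
print(sparse_to_dense(3, {0: 0.9, 2: 1.0}))  # [0.9, 0.0, 1.0]
```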

In [None]:
# Turn the array into a DataFrame
dfSD = sc.parallelize([
  (1, sparseData[0]),
  (2, sparseData[1]),
  (3, sparseData[2]),
  (4, sparseData[3])
]).toDF(["row", "features"])

dfSD.show()

In [None]:
# Create a KMeans estimator (not yet trained) configured for 2 clusters
# For more information, see https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#module-pyspark.ml.clustering
kmeans = KMeans()\
    .setInitMode("k-means||")\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")\
    .setK(2)\
    .setSeed(1)

In [None]:
# Fit the model to the previous DataFrame and show the cluster centres
kmModel = kmeans.fit(dfSD)
print("Cluster centres: {0}".format(
    kmModel.clusterCenters()))

In [None]:
# Check how the model clusters the data it was trained on
kmModel.transform(dfSD).show()
# Training cost: the sum of the squared distances between the input points
# and the centres of their assigned clusters
print("Cost = {0}".format(
    kmModel.summary.trainingCost))

In [None]:
# Test the model with other points
dfTest = sc.parallelize([
  (1, Vectors.sparse(3, {0: 0.9, 1:1.0, 2: 1.0})),
  (2, Vectors.sparse(3, {1: 1.5, 2: 0.3}))
]).toDF(["row", "features"])

kmModel.transform(dfTest).show(truncate=False)

# Note: summary.trainingCost is computed on the training data (dfSD) during fit,
# so this value is the same as before and does not depend on dfTest
print("Training cost = {0}".format(
    kmModel.summary.trainingCost))
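
`summary.trainingCost` reports the cost on the data the model was fitted to. A cost for any other set of points can be computed from the cluster centres alone, as the sum of each point's squared distance to its nearest centre. The sketch below does this in plain Python. The helper names are hypothetical, and the centres are assumed values (the means of the two obvious groups in the training data), not output copied from a Spark run; in practice they would come from `kmModel.clusterCenters()`.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clustering_cost(points, centres):
    """Sum over all points of the squared distance to the nearest centre."""
    return sum(min(squared_distance(p, c) for c in centres) for p in points)


# Assumed centres: the means of the two groups in the training data
centres = [[0.0, 1.15, 0.0], [0.95, 0.0, 1.05]]

# Dense form of the training vectors and of the two test vectors
train_points = [
    [0.0, 1.2, 0.0],
    [0.0, 1.1, 0.0],
    [0.9, 0.0, 1.0],
    [1.0, 0.0, 1.1],
]
test_points = [
    [0.9, 1.0, 1.0],
    [0.0, 1.5, 0.3],
]

train_cost = clustering_cost(train_points, centres)
test_cost = clustering_cost(test_points, centres)
print("Training cost = {0}".format(train_cost))
print("Test cost = {0}".format(test_cost))
```

Under these assumed centres the test points lie much further from their nearest centre than the training points do, so their cost is noticeably higher.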

In [None]:
# Save the model in a directory
# overwrite() lets this cell be re-run even if the directory already exists
kmModel.write().overwrite().save("/tmp/kmModel")

In [None]:
# Reload the model
sameModel = KMeansModel.load("/tmp/kmModel")

sameModel.transform(dfTest).show(truncate=False)
# A model loaded from disk carries no training summary, so the training cost
# is not available here and the following line is left commented out
# print("Cost = {0}".format(sameModel.summary.trainingCost))

In [None]:
!rm -rf /tmp/kmModel