# Spark Preparation
First, check whether the notebook is running in Google Colab. If it is, install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.2.1 built for Hadoop 3.2, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out from within the Colab notebook itself.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

credit: Natawut Nupairoj

In [None]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

In [None]:
if IN_COLAB:
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
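    # Note: dlcdn.apache.org only hosts current releases. If this download fails,
    # older releases such as 3.2.1 are kept under https://archive.apache.org/dist/spark/
    !wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz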
    !tar xf spark-3.2.1-bin-hadoop3.2.tgz
    !mv spark-3.2.1-bin-hadoop3.2 spark
    !pip install -q findspark

In [None]:
if IN_COLAB:
  import os
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
  os.environ["SPARK_HOME"] = "/content/spark"
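
Before initializing findspark, it can help to confirm that the two environment variables set above point to real directories. The following is a minimal sketch using only the standard library; the helper name `missing_spark_paths` is hypothetical and not part of findspark or Spark.

```python
import os

def missing_spark_paths(env):
    """Return the required variables that are unset or do not point to a directory.

    `missing_spark_paths` is an illustrative helper, not a Spark or findspark API.
    """
    required = ("JAVA_HOME", "SPARK_HOME")
    return [name for name in required
            if not env.get(name) or not os.path.isdir(env[name])]

# Check the environment of the current process.
problems = missing_spark_paths(os.environ)
if problems:
    print("Not set or not a directory:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories")
```

If either name is reported, the install cell probably did not finish, or the paths differ from the ones assumed in this notebook.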

In [None]:
import findspark
findspark.init()

# Pyspark_Clustering_Pipeline_Cdr

In [None]:
#1 - Import modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler, MaxAbsScaler

In [None]:
#2 - Create SparkContext (SparkContext is already imported in step 1)
sc = SparkContext.getOrCreate()

In [None]:
sc

In [None]:
sc.getConf().getAll()

In [None]:
print(sc.getConf().toDebugString())

In [None]:
#3 - Set up SparkSession (Spark SQL)
spark = (SparkSession
         .builder
         .appName("Pyspark_Clustering_Pipeline_Cdr")
         .getOrCreate())
print(spark)

In [None]:
!wget https://github.com/kaopanboonyuen/GISTDA2022/raw/main/dataset/cdr_extractFeatures.csv

In [None]:
#4 - Read the file into a Spark DataFrame

data = (spark
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("cdr_extractFeatures.csv"))
# cache() is lazy: the data is only materialized in memory on the first action
data.cache()
print("data marked for caching")

In [None]:
data.describe().toPandas()

In [None]:
data.printSchema()

In [None]:
data.toPandas()

In [None]:
#5 - List the feature columns used for clustering
column_name = ["no_CallIn_Unique", "no_CallOut_Unique", "no_CallIn", "no_CallOut",
               "avg_CallIn_Length", "avg_CallOut_Length", "avg_Call_Length"]

In [None]:
#6 - Assemble the feature columns into a single vector column
assem = VectorAssembler(inputCols=column_name, outputCol="temp_features")

print(assem)

In [None]:
#7 - Normalize: scale each feature to [-1, 1] by dividing by its maximum absolute value
scaler = MaxAbsScaler(inputCol="temp_features", outputCol="features")

print(scaler)
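
To make the normalization step concrete, here is a small pure-Python sketch of max-abs scaling: each column is divided by the largest absolute value found in that column. It illustrates the idea only and is not Spark's implementation; the function name `max_abs_scale` and the toy rows are made up for this example.

```python
def max_abs_scale(rows):
    """Divide every column by its largest absolute value (illustrative helper)."""
    n_cols = len(rows[0])
    max_abs = [max(abs(row[j]) for row in rows) for j in range(n_cols)]
    # In this sketch a column of all zeros stays at zero to avoid dividing by zero.
    scaled = [[row[j] / max_abs[j] if max_abs[j] else 0.0 for j in range(n_cols)]
              for row in rows]
    return scaled, max_abs

# Hypothetical toy rows with two features (not taken from the CDR dataset).
rows = [[10.0, 120.0], [40.0, 30.0], [20.0, 60.0]]
scaled, max_abs = max_abs_scale(rows)
print(max_abs)  # [40.0, 120.0]
print(scaled)   # [[0.25, 1.0], [1.0, 0.25], [0.5, 0.5]]
```

After scaling, both columns share the same range, so a feature with large raw values does not dominate the distance computation in clustering.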

In [None]:
#8 - Create the k-means model (k = 3 clusters, fixed seed for reproducibility)
kmeans = KMeans().setK(3).setSeed(50)
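
For readers new to k-means, the sketch below shows the basic assign-and-update loop (Lloyd's algorithm) in plain Python. It is an illustration under simplifying assumptions, not what Spark runs: Spark's `KMeans` is distributed and by default initializes centers with k-means||, whereas this sketch picks random starting points. The function names and the toy points are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iterations=20):
    """Plain Lloyd's algorithm with randomly chosen initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers

# Hypothetical, already-scaled toy points with two features each.
points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.5, 0.45], [0.5, 0.55]]
centers = kmeans(points, k=3, seed=50)
for center in centers:
    print(center)
```

Fixing the seed, as `setSeed(50)` does in the notebook, makes the random initialization and therefore the result repeatable between runs.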

In [None]:
#9 - Set up the ML pipeline
all_process_list = [assem, scaler, kmeans]
for process in all_process_list:
    print(process)

pipeline = Pipeline(stages=all_process_list)
print(pipeline)

In [None]:
#10 - Train model
model = pipeline.fit(data)

In [None]:
#11 - Make predictions
predictions = model.transform(data).select("features","prediction")
predictions.cache()

In [None]:
# Print a random sample (about 30% of rows, without replacement) of the results
predictions.sample(withReplacement=False, fraction=0.3, seed=1234).toPandas()

In [None]:
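#12 - Show the cluster centers in the original feature units
kmeans_model = model.stages[-1]
scaler_model = model.stages[-2]

# Centers are in scaled space; multiply by maxAbs to undo the MaxAbsScaler.
# (The name max_abs avoids shadowing Python's built-in max.)
max_abs = scaler_model.maxAbs.toArray()
print("Cluster Centers: ")
for center in kmeans_model.clusterCenters():
    print(center * max_abs)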
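
The multiplication in the last cell works because max-abs scaling is a simple per-feature division, so multiplying a scaled center by the per-feature maximum returns it to the original units. A minimal pure-Python sketch with hypothetical numbers (not taken from the CDR dataset):

```python
def to_original_units(scaled_center, max_abs):
    """Undo max-abs scaling for one center (illustrative helper)."""
    return [c * m for c, m in zip(scaled_center, max_abs)]

# Hypothetical per-feature maxima and scaled centers with two features each.
max_abs = [40.0, 120.0]
scaled_centers = [[0.25, 1.0], [0.5, 0.5]]
for center in scaled_centers:
    print(to_original_units(center, max_abs))
# [10.0, 120.0]
# [20.0, 60.0]
```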