### Preparing HDFS
Using shell magic commands:

Create the input folder on HDFS if it does not exist.

Copy the data from the local file system.

In [1]:
import org.apache.spark.sql.functions._

Intitializing Scala interpreter ...

Spark Web UI available at http://4550f27d0b98:4040
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1590143462072)
SparkSession available as 'spark'


2020-05-22 10:31:00,289 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


import org.apache.spark.sql.functions._


In [2]:
!pwd
! hadoop fs -mkdir -p  /tmp/rs_input
! hadoop fs -put   -p  ./../data-raw/Melbourne_housing_FULL.csv             /tmp/rs_input/raw.csv
! hadoop fs -ls        /tmp/rs_input/

/home/big-data-realestate-master/scripts

put: `/tmp/rs_input/raw.csv': File exists


Found 1 items


-rwxrwxrwx   1 1000 staff    5018236 2020-05-22 06:09 /tmp/rs_input/raw.csv




In [3]:
//load the raw CSV into a DataFrame (no schema inference, so every column is read as a string)
val df_raw = spark
    .read
    .format("csv")
    .option("header", "true")
    .load("hdfs://localhost:9000/tmp/rs_input/raw.csv")

df_raw: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]


In [4]:
//only select columns we need now
var df_working= df_raw.select("Price",
                          "Method",
                          "Type",
                          "Distance",
                          "Rooms",
                          "Bathroom",
                          "Car",
                          "Landsize",
                          "Propertycount",
                          "Suburb",
                          "Address",
                          "Date")


//give the columns more meaningful names
df_working = df_working.withColumnRenamed("Method","MethodOfSale")
    .withColumnRenamed("Distance","DistanceFromCBD")
    .withColumnRenamed("Type","PropertyType")
    .withColumnRenamed("Propertycount","PropertyCountInSuburb")

df_working: org.apache.spark.sql.DataFrame = [Price: string, MethodOfSale: string ... 10 more fields]
df_working: org.apache.spark.sql.DataFrame = [Price: string, MethodOfSale: string ... 10 more fields]


#### Remove "#N/A" records

In [5]:
//profiling shows "#N/A" values in some columns; rows containing them need to be removed
//both tests must hold (&&): with || a row is kept as long as either column is clean
df_working = df_working.filter(($"DistanceFromCBD" =!= "#N/A") && ($"PropertyCountInSuburb" =!= "#N/A"))

df_working: org.apache.spark.sql.DataFrame = [Price: string, MethodOfSale: string ... 10 more fields]
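
The comment in the cell above says that records holding "#N/A" are to be removed. A row is dropped for a marker in either column only when the two inequality tests are joined with `&&`. The plain-Scala sketch below needs no Spark and contrasts the two ways of combining the tests. The `Listing` case class, the `NaFilterSketch` object and the sample values are hypothetical and exist only for illustration.

```scala
// Hypothetical stand-in for two of the DataFrame columns (still strings at this stage).
final case class Listing(distanceFromCBD: String, propertyCountInSuburb: String)

object NaFilterSketch {
  val Marker = "#N/A"

  // Keeps a row only when neither column holds the marker.
  def keepWithAnd(r: Listing): Boolean =
    r.distanceFromCBD != Marker && r.propertyCountInSuburb != Marker

  // Keeps a row when at least one column is clean, so a half-bad row survives.
  def keepWithOr(r: Listing): Boolean =
    r.distanceFromCBD != Marker || r.propertyCountInSuburb != Marker

  // Made-up sample rows: one clean, one half-bad, one fully bad.
  val rows: Seq[Listing] = Seq(
    Listing("2.5", "4019"),
    Listing("#N/A", "4019"),
    Listing("#N/A", "#N/A")
  )

  val keptWithAnd: Seq[Listing] = rows.filter(keepWithAnd) // only the clean row
  val keptWithOr: Seq[Listing]  = rows.filter(keepWithOr)  // clean row and half-bad row
}
```

In this notebook the later cast and null-filtering steps would also catch a surviving "#N/A" value, so the final output should not depend on this filter alone.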


In [6]:
//cast the string columns to numeric types (chained withColumn calls replace the earlier loop over columns)
df_working = df_working.withColumn("Price",col("Price").cast("Double"))
    .withColumn("Rooms",col("Rooms").cast("Int"))
    .withColumn("DistanceFromCBD",col("DistanceFromCBD").cast("Double"))
    .withColumn("Bathroom",col("Bathroom").cast("Double"))
    .withColumn("Car",col("Car").cast("Int"))
    .withColumn("Landsize",col("Landsize").cast("Double"))
    .withColumn("PropertyCountInSuburb",col("PropertyCountInSuburb").cast("Int"))
    

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 10 more fields]
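
When a string cannot be parsed as a number, the cast leaves a null in that cell rather than failing, which is why the later null filter also removes rows with unparseable values. The sketch below is a plain-Scala analogy of that behaviour for the `Double` columns only, with `None` standing in for Spark's null. It is not Spark's implementation, and the `CastSketch` object and `toDoubleOpt` helper are hypothetical names.

```scala
import scala.util.Try

object CastSketch {
  // Analogy of cast("Double") on a string column:
  // parseable text becomes Some(value), anything else becomes None
  // (Spark would store null in the column instead of None).
  def toDoubleOpt(s: String): Option[Double] = Try(s.toDouble).toOption

  // Made-up sample values, not taken from the data set.
  val samples: Seq[String] = Seq("1480000", "2.5", "#N/A", "")

  val parsed: Seq[Option[Double]] = samples.map(toDoubleOpt)
}
```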


#### Remove bad landsize data

In [7]:
//remove rows where landsize is zero
//TODO: enhance to remove rows where landsize is less than X, e.g. df_working.filter($"Landsize" >= x)
df_working = df_working.filter($"Landsize" =!= 0)

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 10 more fields]


#### Split Address into Street Name and Suffix

In [8]:
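//split the address into street name and suffix
//assumes "<number> <name> <suffix>": token 0 is the house number, so multi-word street names are split incorrectly
df_working = df_working.withColumn("StreetName", split(col("Address"), " ").getItem(1))
    .withColumn("StreetSuffix", split(col("Address"), " ").getItem(2))
    .drop("Address")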

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]
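
The positional split can be tried outside Spark. The sketch below takes tokens 1 and 2 of a space-separated address, as the cell above does, and shows where that assumption breaks. The `AddressSplitSketch` object, the `streetParts` helper and the sample addresses are hypothetical. `None` stands in for the null that Spark is assumed to return for a missing array element.

```scala
object AddressSplitSketch {
  // Returns (street name, street suffix) from the same positions as the notebook cell:
  // token 0 is taken to be the house number, token 1 the name, token 2 the suffix.
  def streetParts(address: String): (Option[String], Option[String]) = {
    val tokens = address.split(" ")
    // Safe positional lookup: None when the address has too few tokens.
    def at(i: Int): Option[String] = if (i < tokens.length) Some(tokens(i)) else None
    (at(1), at(2))
  }

  // Made-up addresses for illustration only.
  val simple    = streetParts("85 Turner St")   // name "Turner", suffix "St"
  val multiWord = streetParts("12 St Kilda Rd") // name "St", suffix "Kilda": the split is wrong here
  val tooShort  = streetParts("Lot")            // neither part is available
}
```

Addresses with multi-word street names or unit prefixes would need a different rule, for example treating the last token as the suffix.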


In [9]:
//capitalise the first letter of each word in the suburb name
df_working = df_working.withColumn("Suburb", initcap(col("Suburb")))

//make the property type code upper case
df_working = df_working.withColumn("PropertyType", upper(col("PropertyType")))

df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]
df_working: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]


#### Filtering null values

In [10]:
//drop every row that still has a null in any column
val df_not_null = df_working.na.drop()
//debug: df_not_null.count()
// expected 17701 when zero landsizes are kept
// expected 15759 when zero landsizes are removed

//debug: df_not_null.show()

df_not_null: org.apache.spark.sql.DataFrame = [Price: double, MethodOfSale: string ... 11 more fields]


#### Write the clean data to HDFS

In [11]:
! hadoop fs -mkdir -p /tmp/output

In [12]:
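//coalesce(1) leaves one partition, so a single part file is written
//save() returns Unit, so df_output holds no data
val df_output = df_not_null.coalesce(1)
   .write
   .format("csv")
   .option("header","true")
   .option("sep",",")
   .mode("overwrite")
   .save("hdfs://localhost:9000/tmp/output")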

df_output: Unit = ()


Copy the clean data from HDFS to the local disk

In [13]:
! hadoop fs -mkdir -p /tmp/output
! hadoop fs -copyToLocal /tmp/output/\*.csv ./../data-clean/cleanMelbourneData.csv

copyToLocal: `./../data-clean/cleanMelbourneData.csv': File exists





