## Initialise Spark Session:

In [1]:
val spark = org.apache.spark.sql.SparkSession.builder
        .master("local") 
        .appName("Spark CSV Reader")
        .getOrCreate;

Intitializing Scala interpreter ...

Spark Web UI available at http://c8ed6e6c5b9e:4040
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1589854407215)
SparkSession available as 'spark'




spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2d9f5d49


### Preparing HDFS

In [2]:
!pwd


/home/sandpit/big-data-realestate/scripts



In [3]:
!cat ./../data-raw/Melbourne_housing_FULL.csv| wc -l

34858



In [5]:
! hadoop fs -mkdir -p  /tmp/rs_in
! hadoop fs -put   -p  ./../data-raw/Melbourne_housing_FULL.csv             /tmp/rs_in/mh.csv
! hadoop fs -ls        /tmp/rs_in/

put: `/tmp/rs_in/mh.csv': File exists


Found 3 items


-rw-r--r--   1 root root    5018236 2020-05-15 05:20 /tmp/rs_in/mh.csv


-rw-r--r--   1 root root     123637 2020-05-16 02:33 /tmp/rs_in/sales.csv


-rw-r--r--   1 root root       5860 2020-05-19 00:12 /tmp/rs_in/test.csv




In [6]:
!hadoop fs -cat /tmp/rs_in/mh.csv | wc -l

34858



### Get config info about HDFS:

In [7]:
!hdfs getconf -confKey fs.defaultFS

hdfs://localhost:9000



In [8]:
val df = spark.read.format("csv").option("header", "true").load("hdfs://localhost:9000/tmp/rs_in/mh.csv")

df: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]


### Print schema:

In [9]:
df.printSchema()

root
 |-- Suburb: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Rooms: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: string (nullable = true)
 |-- Bathroom: string (nullable = true)
 |-- Car: string (nullable = true)
 |-- Landsize: string (nullable = true)
 |-- BuildingArea: string (nullable = true)
 |-- YearBuilt: string (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: string (nullable = true)
 |-- Longtitude: string (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: string (nullable = true)



In [10]:
df.columns

res1: Array[String] = Array(Suburb, Address, Rooms, Type, Price, Method, SellerG, Date, Distance, Postcode, Bedroom2, Bathroom, Car, Landsize, BuildingArea, YearBuilt, CouncilArea, Lattitude, Longtitude, Regionname, Propertycount)


### Show column types:

In [11]:
df.dtypes

res2: Array[(String, String)] = Array((Suburb,StringType), (Address,StringType), (Rooms,StringType), (Type,StringType), (Price,StringType), (Method,StringType), (SellerG,StringType), (Date,StringType), (Distance,StringType), (Postcode,StringType), (Bedroom2,StringType), (Bathroom,StringType), (Car,StringType), (Landsize,StringType), (BuildingArea,StringType), (YearBuilt,StringType), (CouncilArea,StringType), (Lattitude,StringType), (Longtitude,StringType), (Regionname,StringType), (Propertycount,StringType))


In [12]:
df.dtypes.filter(colTup => colTup._1 == "Suburb")

res3: Array[(String, String)] = Array((Suburb,StringType))
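
Every column comes back as `StringType` because the CSV was read without a schema. The following is a minimal sketch of one way to get typed columns. It assumes a small hypothetical sample in place of `mh.csv`, and it assumes the dates are day-first (`d/MM/yyyy`), which is a guess from values such as `3/09/2016`. The names `sample` and `typed` are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

// getOrCreate reuses the notebook's session if one is already running
val spark = SparkSession.builder.master("local[*]").appName("cast sketch").getOrCreate()

// Hypothetical sample mirroring a few columns of mh.csv (all strings, as loaded above)
val sample = spark.createDataFrame(Seq[(String, String, String, String)](
  ("Abbotsford", "2", "1480000", "3/12/2016"),
  ("Abbotsford", "3", null, "4/02/2016"),
  ("Abbotsford", "3", "1465000", "4/03/2017")
)).toDF("Suburb", "Rooms", "Price", "Date")

// Cast the numeric columns and parse the date strings (assumed day-first format)
val typed = sample
  .withColumn("Rooms", col("Rooms").cast("int"))
  .withColumn("Price", col("Price").cast("double"))
  .withColumn("Date", to_date(col("Date"), "d/MM/yyyy"))

typed.printSchema()
```

Passing `.option("inferSchema", "true")` to the reader is an alternative, at the cost of an extra pass over the file.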


### Display the first 12 columns (Bedroom2 omitted):

In [13]:
df.select("Suburb","Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car").show()

+----------+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+
|    Suburb|            Address|Rooms|Type|  Price|Method|SellerG|     Date|Distance|Postcode|Bathroom| Car|
+----------+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+
|Abbotsford|      68 Studley St|    2|   h|   null|    SS| Jellis|3/09/2016|     2.5|    3067|       1|   1|
|Abbotsford|       85 Turner St|    2|   h|1480000|     S| Biggin|3/12/2016|     2.5|    3067|       1|   1|
|Abbotsford|    25 Bloomburg St|    2|   h|1035000|     S| Biggin|4/02/2016|     2.5|    3067|       1|   0|
|Abbotsford| 18/659 Victoria St|    3|   u|   null|    VB| Rounds|4/02/2016|     2.5|    3067|       2|   1|
|Abbotsford|       5 Charles St|    3|   h|1465000|    SP| Biggin|4/03/2017|     2.5|    3067|       2|   0|
|Abbotsford|   40 Federation La|    3|   h| 850000|    PI| Biggin|4/03/2017|     2.5|    3067|       2|   1|
|Abbotsford|       

### Display last 8 columns:

In [14]:
df.select("Landsize","BuildingArea","YearBuilt","CouncilArea","Lattitude","Longtitude","Regionname","Propertycount").show(10)

+--------+------------+---------+------------------+---------+----------+--------------------+-------------+
|Landsize|BuildingArea|YearBuilt|       CouncilArea|Lattitude|Longtitude|          Regionname|Propertycount|
+--------+------------+---------+------------------+---------+----------+--------------------+-------------+
|     126|        null|     null|Yarra City Council| -37.8014|  144.9958|Northern Metropol...|         4019|
|     202|        null|     null|Yarra City Council| -37.7996|  144.9984|Northern Metropol...|         4019|
|     156|          79|     1900|Yarra City Council| -37.8079|  144.9934|Northern Metropol...|         4019|
|       0|        null|     null|Yarra City Council| -37.8114|  145.0116|Northern Metropol...|         4019|
|     134|         150|     1900|Yarra City Council| -37.8093|  144.9944|Northern Metropol...|         4019|
|      94|        null|     null|Yarra City Council| -37.7969|  144.9969|Northern Metropol...|         4019|
|     120|         

 <span style="font-size:16pt;color: red">TO DO: write down null/not null; range of values; distinct values for categorical attributes; range and statistics for numerical attributes</span> .

### Attributes
#### Categorical:

 |-- Suburb:
 
 |-- Address: 
 
 |-- Type: 
 
 |-- Method:
 
 |-- SellerG: 
 
 |-- Date: 
 
 |-- CouncilArea:
 
 |-- Regionname:
 


#### Numerical:

|-- Rooms: 

|-- Price: 

|-- Distance: 

|-- Postcode: 

|-- Bedroom2:  <span style="font-size:16pt;color: red">TO DO: Should justify why we are dropping it</span> .

|-- Bathroom: 

|-- Car: 

|-- Landsize: 

|-- BuildingArea: 

|-- YearBuilt: 

|-- Lattitude: 

|-- Longtitude: 

|-- Propertycount: <span style="font-size:16pt;color: red">TO DO: what is property count?</span> .

 <span style="font-size:16pt;color: red">TO DO: What are out attributes?</span> .
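
The profile described in the TO DO notes (distinct values for categorical attributes, range and statistics for numerical ones) could be gathered along these lines. This is only a sketch on a hypothetical three-row sample. `Type` and `Price` stand in for any categorical and numerical attribute, and `priceStats` is an illustrative name.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, countDistinct, max, min}

// getOrCreate reuses the notebook's session if one is already running
val spark = SparkSession.builder.master("local[*]").appName("profile sketch").getOrCreate()

// Hypothetical sample; all values are strings, as in the loaded DataFrame
val sample = spark.createDataFrame(Seq[(String, String, String)](
  ("Abbotsford", "h", "1480000"),
  ("Abbotsford", "u", null),
  ("Glen Waverley", "h", "1250000")
)).toDF("Suburb", "Type", "Price")

// Categorical attribute: number of distinct values and the frequency of each
sample.agg(countDistinct("Type").as("distinct_types")).show()
sample.groupBy("Type").count().show()

// Numerical attribute: range and mean, computed after casting the string column
val priceStats = sample
  .select(col("Price").cast("double").as("Price"))
  .agg(min("Price").as("min"), max("Price").as("max"), avg("Price").as("mean"))
priceStats.show()
```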


### Filtering

In [15]:
df.filter($"Suburb"==="Glen Waverley").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car").show()

+-----------------+-----+----+-------+------+---------+----------+--------+--------+--------+----+
|          Address|Rooms|Type|  Price|Method|  SellerG|      Date|Distance|Postcode|Bathroom| Car|
+-----------------+-----+----+-------+------+---------+----------+--------+--------+--------+----+
|     7 Marbray Dr|    4|   h|   null|    SN|Harcourts| 1/07/2017|    16.7|    3150|       1|   2|
|      24 Owens Av|    4|   h|1250000|     S|      Ray| 1/07/2017|    16.7|    3150|    null|null|
|515 Springvale Rd|    3|   h|   null|    PI|      Ray| 1/07/2017|    16.7|    3150|       1|   2|
| 22 Stableford Av|    3|   h|   null|     S|      Ray| 1/07/2017|    16.7|    3150|    null|null|
|  28 Brentwood Dr|    5|   h|   null|    PI|      Ray| 3/06/2017|    16.7|    3150|       5|   2|
|2/70 Leicester Av|    3|   t|   null|    SP|      LLC| 3/06/2017|    16.7|    3150|    null|null|
|    38 Margate Cr|    3|   h|   null|    SN| Woodards| 3/06/2017|    16.7|    3150|    null|null|
|    91 Or

In [16]:
df.where("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show()

+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+-------------+
|            Address|Rooms|Type|  Price|Method|SellerG|     Date|Distance|Postcode|Bathroom| Car|Propertycount|
+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+-------------+
|      68 Studley St|    2|   h|   null|    SS| Jellis|3/09/2016|     2.5|    3067|       1|   1|         4019|
|       85 Turner St|    2|   h|1480000|     S| Biggin|3/12/2016|     2.5|    3067|       1|   1|         4019|
|    25 Bloomburg St|    2|   h|1035000|     S| Biggin|4/02/2016|     2.5|    3067|       1|   0|         4019|
| 18/659 Victoria St|    3|   u|   null|    VB| Rounds|4/02/2016|     2.5|    3067|       2|   1|         4019|
|       5 Charles St|    3|   h|1465000|    SP| Biggin|4/03/2017|     2.5|    3067|       2|   0|         4019|
|   40 Federation La|    3|   h| 850000|    PI| Biggin|4/03/2017|     2.5|    3067|       2|   1|       

In [17]:
df.where("Price >1000000").filter("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").count()

res8: Long = 53


In [18]:
df.where("Price >1000000").filter("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show()

+-------------------+-----+----+-------+------+--------+----------+--------+--------+--------+----+-------------+
|            Address|Rooms|Type|  Price|Method| SellerG|      Date|Distance|Postcode|Bathroom| Car|Propertycount|
+-------------------+-----+----+-------+------+--------+----------+--------+--------+--------+----+-------------+
|       85 Turner St|    2|   h|1480000|     S|  Biggin| 3/12/2016|     2.5|    3067|       1|   1|         4019|
|    25 Bloomburg St|    2|   h|1035000|     S|  Biggin| 4/02/2016|     2.5|    3067|       1|   0|         4019|
|       5 Charles St|    3|   h|1465000|    SP|  Biggin| 4/03/2017|     2.5|    3067|       2|   0|         4019|
|        55a Park St|    4|   h|1600000|    VB|  Nelson| 4/06/2016|     2.5|    3067|       1|   2|         4019|
|       124 Yarra St|    3|   h|1876000|     S|  Nelson| 7/05/2016|     2.5|    3067|       2|   0|         4019|
|      98 Charles St|    2|   h|1636000|     S|  Nelson| 8/10/2016|     2.5|    3067|   

In [19]:
df.where("Price >1000000").filter("Suburb = 'Abbotsford'").collect()

res10: Array[org.apache.spark.sql.Row] = Array([Abbotsford,85 Turner St,2,h,1480000,S,Biggin,3/12/2016,2.5,3067,2,1,1,202,null,null,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019], [Abbotsford,25 Bloomburg St,2,h,1035000,S,Biggin,4/02/2016,2.5,3067,2,1,0,156,79,1900,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019], [Abbotsford,5 Charles St,3,h,1465000,SP,Biggin,4/03/2017,2.5,3067,3,2,0,134,150,1900,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019], [Abbotsford,55a Park St,4,h,1600000,VB,Nelson,4/06/2016,2.5,3067,3,1,2,120,142,2014,Yarra City Council,-37.8072,144.9941,Northern Metropolitan,4019], [Abbotsford,124 Yarra St,3,h,1876000,S,Nelson,7/05/2016,2.5,3067,4,2,0,245,210,1910,Yarra City Council,-37.8024,144.9993,Northern Metropolitan,401...

#### Filtering null values

In [23]:
df.filter(df("Price").isNull || df("Price") === "" || df("Price").isNaN).count()

res14: Long = 7610


In [39]:
df.filter("Price IS NULL").count()

res29: Long = 7610


In [33]:
df_filtered.where("Price >1000000").filter("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show()

+----------------+-----+----+-------+------+--------+----------+--------+--------+--------+---+-------------+
|         Address|Rooms|Type|  Price|Method| SellerG|      Date|Distance|Postcode|Bathroom|Car|Propertycount|
+----------------+-----+----+-------+------+--------+----------+--------+--------+--------+---+-------------+
| 25 Bloomburg St|    2|   h|1035000|     S|  Biggin| 4/02/2016|     2.5|    3067|       1|  0|         4019|
|    5 Charles St|    3|   h|1465000|    SP|  Biggin| 4/03/2017|     2.5|    3067|       2|  0|         4019|
|     55a Park St|    4|   h|1600000|    VB|  Nelson| 4/06/2016|     2.5|    3067|       1|  2|         4019|
|    124 Yarra St|    3|   h|1876000|     S|  Nelson| 7/05/2016|     2.5|    3067|       2|  0|         4019|
|   98 Charles St|    2|   h|1636000|     S|  Nelson| 8/10/2016|     2.5|    3067|       1|  2|         4019|
|   10 Valiant St|    2|   h|1097000|     S|  Biggin| 8/10/2016|     2.5|    3067|       1|  2|         4019|
| 40 Nicho

In [31]:
df_filtered.filter(df("Price").isNull || df("Price") === "" || df("Price").isNaN).count()



res21: Long = 0


In [36]:
df.filter("Price IS NULL").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show(10)

+-------------------+-----+----+-----+------+-------+----------+--------+--------+--------+----+-------------+
|            Address|Rooms|Type|Price|Method|SellerG|      Date|Distance|Postcode|Bathroom| Car|Propertycount|
+-------------------+-----+----+-----+------+-------+----------+--------+--------+--------+----+-------------+
|      68 Studley St|    2|   h| null|    SS| Jellis| 3/09/2016|     2.5|    3067|       1|   1|         4019|
| 18/659 Victoria St|    3|   u| null|    VB| Rounds| 4/02/2016|     2.5|    3067|       2|   1|         4019|
|       16 Maugie St|    4|   h| null|    SN| Nelson| 6/08/2016|     2.5|    3067|       2|   2|         4019|
|       53 Turner St|    2|   h| null|     S| Biggin| 6/08/2016|     2.5|    3067|       1|   2|         4019|
|       99 Turner St|    2|   h| null|     S|Collins| 6/08/2016|     2.5|    3067|       2|   1|         4019|
|121/56 Nicholson St|    2|   u| null|    PI| Biggin| 7/11/2016|     2.5|    3067|       2|   1|         4019|
|

In [None]:
df.filter("Distance IS NULL").count()

In [34]:
df.filter(df("Bathroom").isNull || df("Bathroom") === "" || df("Bathroom").isNaN).count()

res24: Long = 8226


In [40]:
df.filter("Bathroom IS NULL").count()

res30: Long = 8226


In [38]:
df.filter("Rooms IS NULL").count()

res28: Long = 0


In [41]:
df.filter("Car IS NULL").count()

res31: Long = 8728


In [None]:
df.filter("Landsize IS NULL").count()

In [None]:
df.filter("BuildingArea IS NULL").count()

In [None]:
df.filter("YearBuilt IS NULL").count()

In [27]:
val df_not_null = df.na.drop
df_not_null.count()

df_not_null: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]
res17: Long = 8887


In [30]:
val df_filtered = df.filter(row => !row.anyNull);
df_filtered.printSchema()


root
 |-- Suburb: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Rooms: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: string (nullable = true)
 |-- Bathroom: string (nullable = true)
 |-- Car: string (nullable = true)
 |-- Landsize: string (nullable = true)
 |-- BuildingArea: string (nullable = true)
 |-- YearBuilt: string (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: string (nullable = true)
 |-- Longtitude: string (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: string (nullable = true)



df_filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Suburb: string, Address: string ... 19 more fields]


In [32]:
df_filtered.count()

res22: Long = 8887
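
Dropping every row that has a null in any column leaves 8887 rows from a file of 34858 lines, so it may be worth seeing where the gaps are before deciding what to drop. The sketch below counts the nulls of every column in a single aggregation, then drops rows only where one chosen column is missing. The sample data and the names `nullCounts` and `priced` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

// getOrCreate reuses the notebook's session if one is already running
val spark = SparkSession.builder.master("local[*]").appName("null count sketch").getOrCreate()

// Hypothetical sample with the same kind of gaps as mh.csv
val sample = spark.createDataFrame(Seq[(String, String, String)](
  ("85 Turner St", "1480000", "1"),
  ("68 Studley St", null, "1"),
  ("24 Owens Av", "1250000", null)
)).toDF("Address", "Price", "Car")

// when(...) yields 1 for a null cell and null otherwise; count skips the nulls it yields
val nullCounts = sample.select(
  sample.columns.map(c => count(when(col(c).isNull, 1)).as(c)): _*
)
nullCounts.show()

// Drop rows only where Price is missing, rather than rows with a null in any column
val priced = sample.na.drop(Seq("Price"))
priced.show()
```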


#### Imputing null values

In [None]:
// All columns were loaded as strings, so this replaces every null with the literal "N/A"
val df_imp = df.na.fill("N/A")
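
A single placeholder string suits categorical columns but not numerical ones. As a sketch only, the block below fills a numerical column with its mean and a categorical column with a label. Mean imputation and the `"Unknown"` label are assumptions for illustration, not choices made in this notebook, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// getOrCreate reuses the notebook's session if one is already running
val spark = SparkSession.builder.master("local[*]").appName("impute sketch").getOrCreate()

// Hypothetical sample with one row missing both Car and CouncilArea
val sample = spark.createDataFrame(Seq[(String, String, String)](
  ("85 Turner St", "1", "Yarra City Council"),
  ("24 Owens Av", null, null),
  ("55a Park St", "2", "Yarra City Council")
)).toDF("Address", "Car", "CouncilArea")

// Cast first so that the mean is computed on numbers, not strings
val typed = sample.withColumn("Car", col("Car").cast("double"))
val meanCar = typed.agg(avg("Car")).first().getDouble(0)

// A per-column map lets each attribute get its own replacement value
val imputed = typed.na.fill(Map("Car" -> meanCar, "CouncilArea" -> "Unknown"))
imputed.show()
```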

### Renaming Columns
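
The schema contains the misspelled column names `Lattitude` and `Longtitude`. One possible way to rename them is sketched below on a hypothetical one-row sample. The target names and the `renames` map are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's session if one is already running
val spark = SparkSession.builder.master("local[*]").appName("rename sketch").getOrCreate()

// Hypothetical sample carrying the two misspelled column names from the schema
val sample = spark.createDataFrame(Seq(
  ("Abbotsford", "-37.8014", "144.9958")
)).toDF("Suburb", "Lattitude", "Longtitude")

// Old name -> new name; withColumnRenamed is applied once per entry
val renames = Map("Lattitude" -> "Latitude", "Longtitude" -> "Longitude")
val renamed = renames.foldLeft(sample) { case (acc, (from, to)) =>
  acc.withColumnRenamed(from, to)
}

renamed.printSchema()
```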

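### Group By and Aggregation

In [None]:
import org.apache.spark.sql.functions._
// Price was loaded as a string, so cast it before aggregating; otherwise max/min compare text
df.groupBy("Suburb").agg(max($"Price".cast("double"))).show()

In [None]:
df.groupBy("Suburb").agg(min($"Price".cast("double"))).show()

In [None]:
// functions.median is not available in Spark 2.4.5; percentile_approx at 0.5 gives an approximate median
df.groupBy("Distance").agg(expr("percentile_approx(cast(Price as double), 0.5)")).show()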
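
Because `Price` is a string column here, aggregating it directly compares text rather than numbers. The sketch below shows the difference on a hypothetical sample. The names `asText` and `asNumber` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

// getOrCreate reuses the notebook's session if one is already running
val spark = SparkSession.builder.master("local[*]").appName("groupBy sketch").getOrCreate()

// Hypothetical sample; Price is a string, as in the loaded DataFrame
val sample = spark.createDataFrame(Seq(
  ("Abbotsford", "850000"),
  ("Abbotsford", "1480000"),
  ("Glen Waverley", "1250000")
)).toDF("Suburb", "Price")

// Lexicographic maximum: "850000" sorts after "1480000" because '8' > '1'
val asText = sample.groupBy("Suburb").agg(max("Price").as("max_price"))

// Numeric maximum after an explicit cast
val asNumber = sample.groupBy("Suburb")
  .agg(max(col("Price").cast("double")).as("max_price"))

asText.show()
asNumber.show()
```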

### Correlation
 <span style="font-size:16pt;color: red">TO DO: FIND OUT CORRELATIONS between these attributes </span> .

|-- Rooms:

|-- Price:

|-- Distance:

|-- Bathroom:

|-- Car:

|-- Landsize:

|-- BuildingArea:

|-- YearBuilt:

In [None]:
import org.apache.spark.sql.functions.corr
df.select(corr("Distance","Price")).show()

In [None]:
df.select(corr("Rooms","Price")).show()

In [None]:
df.select(corr("Landsize","Price")).show()

In [None]:
df.select(corr("Bathroom","Price")).show()

In [None]:
df.select(corr("Car","Price")).show()

In [None]:
df.select(corr("YearBuilt","Price")).show()

In [None]:
df.select(corr("BuildingArea","Price")).show()
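
The one-cell-per-attribute pattern above can be collapsed into a single `select`. The following is a rough sketch on a hypothetical sample with explicit casts to double. `features` and `correlations` are illustrative names, and the numbers are made up so that `Rooms` and `Price` are perfectly linear.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, corr}

// getOrCreate reuses the notebook's session if one is already running
val spark = SparkSession.builder.master("local[*]").appName("corr sketch").getOrCreate()

// Hypothetical sample; all values are strings, as in the loaded DataFrame
val sample = spark.createDataFrame(Seq(
  ("2", "1", "800000"),
  ("3", "2", "1200000"),
  ("4", "2", "1600000")
)).toDF("Rooms", "Bathroom", "Price")

// Pearson correlation of each listed attribute against Price, one output column per attribute
val features = Seq("Rooms", "Bathroom")
val correlations = sample.select(
  features.map(c => corr(col(c).cast("double"), col("Price").cast("double")).as(c)): _*
)
correlations.show()
```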

In [None]:
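import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer

// Summarizer expects a Vector column: cast Price to double, drop nulls, wrap it in a one-element vector
val priceVec = new VectorAssembler().setInputCols(Array("PriceNum")).setOutputCol("features")
  .transform(df.withColumn("PriceNum", $"Price".cast("double")).na.drop(Seq("PriceNum")))
val (meanVal, varianceVal) = priceVec
  .select(Summarizer.metrics("mean", "variance").summary($"features").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)].first()

println(s"mean = ${meanVal}, variance = ${varianceVal}")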


## References

Apache Spark (n.d.). _Spark Scala API (Scaladoc). Overview._ https://spark.apache.org/docs/latest/api/java/overview-summary.html

Apache Spark (n.d.). _Basic Statistics._ https://spark.apache.org/docs/latest/ml-statistics.html

Bahadoor, N. (2020). _Spark Tutorials._ https://allaboutscala.com/big-data/spark/#dataframe-statistics-correlation

Databricks. (2020). _Introduction to DataFrames - Scala._ https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html

Grimaldi, E. (2018). _Pandas vs. Spark: how to handle dataframes (Part II)._ https://towardsdatascience.com/python-pandas-vs-scala-how-to-handle-dataframes-part-ii-d3e5efe8287d

