## Initialise Spark Session:

In [1]:
val spark = org.apache.spark.sql.SparkSession.builder
        .master("local") 
        .appName("Spark CSV Reader")
        .getOrCreate;

Intitializing Scala interpreter ...

Spark Web UI available at http://c8ed6e6c5b9e:4040
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1589860725539)
SparkSession available as 'spark'




spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@e452704


### Preparing HDFS

In [2]:
!pwd


/home/sandpit/big-data-realestate/scripts



In [3]:
!cat ./../data-raw/Melbourne_housing_FULL.csv| wc -l

34858



In [4]:
! hadoop fs -mkdir -p  /tmp/rs_in
! hadoop fs -put   -p  ./../data-raw/Melbourne_housing_FULL.csv             /tmp/rs_in/mh.csv
! hadoop fs -ls        /tmp/rs_in/

put: `/tmp/rs_in/mh.csv': File exists


Found 3 items
-rw-r--r--   1 root root    5018236 2020-05-15 05:20 /tmp/rs_in/mh.csv
-rw-r--r--   1 root root     123637 2020-05-16 02:33 /tmp/rs_in/sales.csv
-rw-r--r--   1 root root       5860 2020-05-19 00:12 /tmp/rs_in/test.csv




In [5]:
!hadoop fs -cat /tmp/rs_in/mh.csv | wc -l

34858



### Get config info about hdfs:

In [6]:
!hdfs getconf -confKey fs.defaultFS

hdfs://localhost:9000



In [7]:
val df = spark.read.format("csv").option("header", "true").load("hdfs://localhost:9000/tmp/rs_in/mh.csv")

df: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]


### Print schema:

In [8]:
df.printSchema()

root
 |-- Suburb: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Rooms: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: string (nullable = true)
 |-- Bathroom: string (nullable = true)
 |-- Car: string (nullable = true)
 |-- Landsize: string (nullable = true)
 |-- BuildingArea: string (nullable = true)
 |-- YearBuilt: string (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: string (nullable = true)
 |-- Longtitude: string (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: string (nullable = true)
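
Every column in the schema above is a `string` because the CSV was loaded without a schema and without the reader's `inferSchema` option. One way to get numeric types is to cast the relevant columns after loading. The following is a minimal sketch, not the notebook's actual pipeline: it builds a small in-memory stand-in from two rows shown in the outputs of this notebook, and the choice of which columns to cast is an assumption.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val spark = SparkSession.builder.master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the CSV: two Abbotsford rows, every column still a string.
val raw = Seq[(String, String, String, String)](
  ("Abbotsford", "2", null, "2.5"),
  ("Abbotsford", "2", "1480000", "2.5")
).toDF("Suburb", "Rooms", "Price", "Distance")

// Cast each numeric-looking column (assumed list); unparseable values become null.
val numericCols = Seq("Rooms", "Price", "Distance")
val typed = numericCols.foldLeft(raw) { (acc, name) =>
  acc.withColumn(name, col(name).cast(DoubleType))
}

typed.printSchema()
```

Casting explicitly keeps control over which columns change type; `option("inferSchema", "true")` on the reader is the alternative when an extra pass over the file is acceptable.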



In [9]:
df.columns

res1: Array[String] = Array(Suburb, Address, Rooms, Type, Price, Method, SellerG, Date, Distance, Postcode, Bedroom2, Bathroom, Car, Landsize, BuildingArea, YearBuilt, CouncilArea, Lattitude, Longtitude, Regionname, Propertycount)


### Show column types:

In [10]:
df.dtypes

res2: Array[(String, String)] = Array((Suburb,StringType), (Address,StringType), (Rooms,StringType), (Type,StringType), (Price,StringType), (Method,StringType), (SellerG,StringType), (Date,StringType), (Distance,StringType), (Postcode,StringType), (Bedroom2,StringType), (Bathroom,StringType), (Car,StringType), (Landsize,StringType), (BuildingArea,StringType), (YearBuilt,StringType), (CouncilArea,StringType), (Lattitude,StringType), (Longtitude,StringType), (Regionname,StringType), (Propertycount,StringType))


In [11]:
df.dtypes.filter(colTup => colTup._1 == "Suburb")

res3: Array[(String, String)] = Array((Suburb,StringType))


### Display first 12 columns:

In [12]:
df.select("Suburb","Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car").show()

+----------+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+
|    Suburb|            Address|Rooms|Type|  Price|Method|SellerG|     Date|Distance|Postcode|Bathroom| Car|
+----------+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+
|Abbotsford|      68 Studley St|    2|   h|   null|    SS| Jellis|3/09/2016|     2.5|    3067|       1|   1|
|Abbotsford|       85 Turner St|    2|   h|1480000|     S| Biggin|3/12/2016|     2.5|    3067|       1|   1|
|Abbotsford|    25 Bloomburg St|    2|   h|1035000|     S| Biggin|4/02/2016|     2.5|    3067|       1|   0|
|Abbotsford| 18/659 Victoria St|    3|   u|   null|    VB| Rounds|4/02/2016|     2.5|    3067|       2|   1|
|Abbotsford|       5 Charles St|    3|   h|1465000|    SP| Biggin|4/03/2017|     2.5|    3067|       2|   0|
|Abbotsford|   40 Federation La|    3|   h| 850000|    PI| Biggin|4/03/2017|     2.5|    3067|       2|   1|
|Abbotsford|       

### Display last 8 columns:

In [13]:
df.select("Landsize","BuildingArea","YearBuilt","CouncilArea","Lattitude","Longtitude","Regionname","Propertycount").show(10)

+--------+------------+---------+------------------+---------+----------+--------------------+-------------+
|Landsize|BuildingArea|YearBuilt|       CouncilArea|Lattitude|Longtitude|          Regionname|Propertycount|
+--------+------------+---------+------------------+---------+----------+--------------------+-------------+
|     126|        null|     null|Yarra City Council| -37.8014|  144.9958|Northern Metropol...|         4019|
|     202|        null|     null|Yarra City Council| -37.7996|  144.9984|Northern Metropol...|         4019|
|     156|          79|     1900|Yarra City Council| -37.8079|  144.9934|Northern Metropol...|         4019|
|       0|        null|     null|Yarra City Council| -37.8114|  145.0116|Northern Metropol...|         4019|
|     134|         150|     1900|Yarra City Council| -37.8093|  144.9944|Northern Metropol...|         4019|
|      94|        null|     null|Yarra City Council| -37.7969|  144.9969|Northern Metropol...|         4019|
|     120|         

 <span style="font-size:16pt;color: red">TO DO: for each attribute, write down null/not-null counts and the range of values; list distinct values for categorical attributes; give range and summary statistics for numerical attributes</span> .

### Attributes
#### Categorical:

 |-- Suburb:
 
 |-- Address: 
 
 |-- Type: 
 
 |-- Method:
 
 |-- SellerG: 
 
 |-- Date: 
 
 |-- CouncilArea:
 
 |-- Regionname:
 


#### Numerical:

|-- Rooms: 

|-- Price: 

|-- Distance: 

|-- Postcode: 

|-- Bedroom2:  <span style="font-size:16pt;color: red">TO DO: Should justify why we are dropping it</span> .

|-- Bathroom: 

|-- Car: 

|-- Landsize: 

|-- BuildingArea: 

|-- YearBuilt: 

|-- Lattitude: 

|-- Longtitude: 

|-- Propertycount: <span style="font-size:16pt;color: red">TO DO: what is property count?</span> .

 <span style="font-size:16pt;color: red">TO DO: What are out attributes?</span> .


In [99]:
df.describe().select("summary","price","Suburb").show()

+-------+-----------------+----------+
|summary|            price|    Suburb|
+-------+-----------------+----------+
|  count|            27247|     34857|
|   mean|1050173.344955408|      null|
| stddev|641467.1301045999|      null|
|    min|          1000000|Abbotsford|
|    max|           999999|  viewbank|
+-------+-----------------+----------+
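
The `min` of 1000000 and `max` of 999999 reported for `price` are a symptom of the column being a string: the values are compared character by character, not numerically. The plain-Scala sketch below reproduces the effect with a few prices taken from the outputs in this notebook; it only illustrates the ordering and does not touch the DataFrame.

```scala
// Prices as they are stored in the all-string DataFrame.
val prices = Seq("1480000", "1035000", "850000", "1000000", "999999")

// String ordering compares character by character, so "1000000" sorts before "999999".
val textMin = prices.min
val textMax = prices.max

// Numeric ordering after converting each value.
val nums = prices.map(_.toDouble)
val numMin = nums.min
val numMax = nums.max

println(s"as text:    min=$textMin max=$textMax")
println(s"as numbers: min=$numMin max=$numMax")
```

The mean and standard deviation are unaffected because Spark casts the strings to numbers for those statistics.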



### Filtering

In [14]:
df.filter($"Suburb"==="Glen Waverley").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car").show()

+-----------------+-----+----+-------+------+---------+----------+--------+--------+--------+----+
|          Address|Rooms|Type|  Price|Method|  SellerG|      Date|Distance|Postcode|Bathroom| Car|
+-----------------+-----+----+-------+------+---------+----------+--------+--------+--------+----+
|     7 Marbray Dr|    4|   h|   null|    SN|Harcourts| 1/07/2017|    16.7|    3150|       1|   2|
|      24 Owens Av|    4|   h|1250000|     S|      Ray| 1/07/2017|    16.7|    3150|    null|null|
|515 Springvale Rd|    3|   h|   null|    PI|      Ray| 1/07/2017|    16.7|    3150|       1|   2|
| 22 Stableford Av|    3|   h|   null|     S|      Ray| 1/07/2017|    16.7|    3150|    null|null|
|  28 Brentwood Dr|    5|   h|   null|    PI|      Ray| 3/06/2017|    16.7|    3150|       5|   2|
|2/70 Leicester Av|    3|   t|   null|    SP|      LLC| 3/06/2017|    16.7|    3150|    null|null|
|    38 Margate Cr|    3|   h|   null|    SN| Woodards| 3/06/2017|    16.7|    3150|    null|null|
|    91 Or

In [15]:
df.where("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show()

+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+-------------+
|            Address|Rooms|Type|  Price|Method|SellerG|     Date|Distance|Postcode|Bathroom| Car|Propertycount|
+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+-------------+
|      68 Studley St|    2|   h|   null|    SS| Jellis|3/09/2016|     2.5|    3067|       1|   1|         4019|
|       85 Turner St|    2|   h|1480000|     S| Biggin|3/12/2016|     2.5|    3067|       1|   1|         4019|
|    25 Bloomburg St|    2|   h|1035000|     S| Biggin|4/02/2016|     2.5|    3067|       1|   0|         4019|
| 18/659 Victoria St|    3|   u|   null|    VB| Rounds|4/02/2016|     2.5|    3067|       2|   1|         4019|
|       5 Charles St|    3|   h|1465000|    SP| Biggin|4/03/2017|     2.5|    3067|       2|   0|         4019|
|   40 Federation La|    3|   h| 850000|    PI| Biggin|4/03/2017|     2.5|    3067|       2|   1|       

In [16]:
df.where("Price >1000000").filter("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").count()

res8: Long = 53
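
The chained `where(...)` and `filter(...)` calls can also be written as a single typed column expression, which makes the numeric comparison on `Price` explicit. A minimal sketch under assumptions: the four rows are a small stand-in built from values shown in this notebook, and the explicit cast to `double` is a choice of this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").appName("filter-sketch").getOrCreate()
import spark.implicits._

// Stand-in rows: Suburb and Price are strings, as in the loaded CSV.
val sample = Seq[(String, String)](
  ("Abbotsford", "1480000"),
  ("Abbotsford", "850000"),
  ("Abbotsford", null),
  ("Glen Waverley", "1250000")
).toDF("Suburb", "Price")

// Both conditions in one predicate; rows with a null Price never satisfy it.
val expensiveAbbotsford = sample.filter(
  col("Suburb") === "Abbotsford" && col("Price").cast("double") > 1000000
)
val n = expensiveAbbotsford.count()
println(n)
```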


In [17]:
df.where("Price >1000000").filter("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show()

+-------------------+-----+----+-------+------+--------+----------+--------+--------+--------+----+-------------+
|            Address|Rooms|Type|  Price|Method| SellerG|      Date|Distance|Postcode|Bathroom| Car|Propertycount|
+-------------------+-----+----+-------+------+--------+----------+--------+--------+--------+----+-------------+
|       85 Turner St|    2|   h|1480000|     S|  Biggin| 3/12/2016|     2.5|    3067|       1|   1|         4019|
|    25 Bloomburg St|    2|   h|1035000|     S|  Biggin| 4/02/2016|     2.5|    3067|       1|   0|         4019|
|       5 Charles St|    3|   h|1465000|    SP|  Biggin| 4/03/2017|     2.5|    3067|       2|   0|         4019|
|        55a Park St|    4|   h|1600000|    VB|  Nelson| 4/06/2016|     2.5|    3067|       1|   2|         4019|
|       124 Yarra St|    3|   h|1876000|     S|  Nelson| 7/05/2016|     2.5|    3067|       2|   0|         4019|
|      98 Charles St|    2|   h|1636000|     S|  Nelson| 8/10/2016|     2.5|    3067|   

In [18]:
df.where("Price >1000000").filter("Suburb = 'Abbotsford'").collect()

res10: Array[org.apache.spark.sql.Row] = Array([Abbotsford,85 Turner St,2,h,1480000,S,Biggin,3/12/2016,2.5,3067,2,1,1,202,null,null,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019], [Abbotsford,25 Bloomburg St,2,h,1035000,S,Biggin,4/02/2016,2.5,3067,2,1,0,156,79,1900,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019], [Abbotsford,5 Charles St,3,h,1465000,SP,Biggin,4/03/2017,2.5,3067,3,2,0,134,150,1900,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019], [Abbotsford,55a Park St,4,h,1600000,VB,Nelson,4/06/2016,2.5,3067,3,1,2,120,142,2014,Yarra City Council,-37.8072,144.9941,Northern Metropolitan,4019], [Abbotsford,124 Yarra St,3,h,1876000,S,Nelson,7/05/2016,2.5,3067,4,2,0,245,210,1910,Yarra City Council,-37.8024,144.9993,Northern Metropolitan,401...

In [19]:
df.select("Address","Type","Method","SellerG","Postcode","CouncilArea","Regionname").count()

res11: Long = 34857


### Categorical Attributes

#### 1. Type

#### Distinct values 

In [20]:
import org.apache.spark.sql.functions.countDistinct
df.select("Type").distinct.show()

+----+
|Type|
+----+
|   h|
|   u|
|   t|
+----+



import org.apache.spark.sql.functions.countDistinct


In [104]:
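import org.apache.spark.sql.functions.{col, initcap}
// capitalise the Type codes (h, u, t become H, U, T)
var df_result = df.withColumn("Type", initcap(col("Type")))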
df_result.select("Type").distinct.show()

+----+
|Type|
+----+
|   T|
|   U|
|   H|
+----+



df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]


#### Null values  

In [21]:
df.filter("Type IS NULL").count()

res13: Long = 0
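
Running one `filter(... IS NULL).count()` per attribute launches a separate job each time. The null counts for all columns can instead be gathered in a single pass. The sketch below is one possible way to do it, shown on a small stand-in built from rows that appear in this notebook's outputs; the column subset is an assumption made for brevity.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

val spark = SparkSession.builder.master("local[*]").appName("null-count-sketch").getOrCreate()
import spark.implicits._

// Stand-in rows with the same kind of gaps seen in Price and Car.
val sample = Seq[(String, String, String)](
  ("h", "1480000", "1"),
  ("u", null, "1"),
  ("h", "1250000", null),
  ("t", null, null)
).toDF("Type", "Price", "Car")

// One aggregate per column: add 1 for every null, 0 otherwise.
val nullCounts = sample.select(
  sample.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*
)
nullCounts.show()
```

Applied to the full DataFrame, `sample` would simply be replaced by `df`.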


#### 2. Method

In [22]:
df.select("Method").distinct.show()

+------+
|Method|
+------+
|    PI|
|    SA|
|    SP|
|    VB|
|    PN|
|     W|
|     S|
|    SN|
|    SS|
+------+



#### Null values  

In [23]:
df.filter("Method IS NULL").count()

res15: Long = 0


<span style="font-size:16pt;color: red">TO DO</span> 
#### 3. Suburb

In [24]:
val suburbs = df.select("Suburb").distinct

suburbs: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Suburb: string]


In [25]:
suburbs.show()

+----------------+
|          Suburb|
+----------------+
|  Brunswick West|
| South Melbourne|
|    Ivanhoe East|
|    Princes Hill|
|      Cranbourne|
|         Ashwood|
|       Brunswick|
|South Kingsville|
|        Brighton|
|        Oak Park|
|         Doveton|
|       Albanvale|
|      Brookfield|
|        Lynbrook|
|     Ferny Creek|
|     Pascoe Vale|
| Blackburn North|
|         croydon|
|     Sandringham|
|   Botanic Ridge|
+----------------+
only showing top 20 rows



In [26]:
suburbs.count()

res17: Long = 351


In [27]:

df.filter("Suburb IS NULL").count()

res18: Long = 0


In [105]:
// make first letter of suburb upper case
import org.apache.spark.sql.functions._
df_result = df_result.withColumn("Suburb", initcap(col("Suburb")))
df_result.select("Suburb").distinct.show()


+----------------+
|          Suburb|
+----------------+
|  Brunswick West|
| South Melbourne|
|    Ivanhoe East|
|    Princes Hill|
|      Cranbourne|
|         Ashwood|
|       Brunswick|
|South Kingsville|
|        Brighton|
|        Oak Park|
|         Doveton|
|       Albanvale|
|      Brookfield|
|        Lynbrook|
|     Ferny Creek|
|     Pascoe Vale|
| Blackburn North|
|     Sandringham|
|   Botanic Ridge|
|          Carrum|
+----------------+
only showing top 20 rows



import org.apache.spark.sql.functions._
df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]
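
The raw data contains lower-case variants such as `croydon` and `viewbank`, so the 351 distinct suburbs counted earlier may include duplicates that differ only in case. Comparing the distinct count before and after `initcap` shows how many variants were merged. A minimal sketch: the capitalised counterparts `Croydon` and `Viewbank` are assumed to exist in the data for the sake of the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, initcap}

val spark = SparkSession.builder.master("local[*]").appName("initcap-sketch").getOrCreate()
import spark.implicits._

// Stand-in suburb names; the capitalised duplicates are an assumption.
val suburbNames = Seq("Croydon", "croydon", "Viewbank", "viewbank", "Glen Waverley").toDF("Suburb")

val before = suburbNames.distinct.count()
val after = suburbNames.withColumn("Suburb", initcap(col("Suburb"))).distinct.count()

println(s"distinct suburbs before: $before, after: $after")
```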


<span style="font-size:16pt;color: red">TO DO</span> 
#### 4. SellerG

In [106]:
val sellers = df_result.select("SellerG").distinct

sellers: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [SellerG: string]


In [110]:
sellers.show(388)

+--------------------+
|             SellerG|
+--------------------+
|              LITTLE|
|                 S&L|
|              Ristic|
|            Langwell|
|             Ruralco|
|             Xynergy|
|               Ryder|
|               iSell|
|               Scott|
|              Wilson|
|          McNaughton|
|           Blackbird|
|hockingstuart/Biggin|
|               Lucas|
|                 One|
|         Buxton/Find|
|                Real|
|            Sterling|
|             Compton|
|           Tiernan's|
|                 MSM|
|            Westside|
|            Keatings|
|               Allan|
|               iTRAK|
|                W.B.|
|           Clairmont|
|                  Le|
|             Meallin|
|               Raine|
|             Bayside|
|                  RE|
|             Sweeney|
|              Hooper|
|                AIME|
|                Home|
|       hockingstuart|
|            Grantham|
|               Jason|
|           Landfield|
|          

In [108]:
sellers.count()

res88: Long = 388


In [30]:

df.filter("SellerG IS NULL").count()

res21: Long = 0


<span style="font-size:16pt;color: red">TO DO</span> 
#### 6. CouncilArea

In [112]:
val sareas = df.select("CouncilArea").distinct()

sareas: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [CouncilArea: string]


In [115]:
sareas.show(40)

+--------------------+
|         CouncilArea|
+--------------------+
|Bayside City Council|
|Greater Dandenong...|
|   Hume City Council|
|Glen Eira City Co...|
|Kingston City Cou...|
|Nillumbik Shire C...|
| Monash City Council|
|Macedon Ranges Sh...|
|   Knox City Council|
|Wyndham City Council|
|Mitchell Shire Co...|
|Maribyrnong City ...|
|Whittlesea City C...|
|Whitehorse City C...|
|Frankston City Co...|
|Manningham City C...|
|Darebin City Council|
|Moreland City Cou...|
|Cardinia Shire Co...|
|Moonee Valley Cit...|
|Boroondara City C...|
|Yarra Ranges Shir...|
|  Casey City Council|
|Port Phillip City...|
|Brimbank City Cou...|
|                #N/A|
|Hobsons Bay City ...|
|Banyule City Council|
|Stonnington City ...|
| Melton City Council|
|  Yarra City Council|
|Melbourne City Co...|
|Moorabool Shire C...|
|Maroondah City Co...|
+--------------------+



In [114]:
sareas.count()

res93: Long = 34


In [32]:

df.filter("CouncilArea IS NULL").count()

res23: Long = 0
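
The null count of 0 is misleading here: the distinct values above include the placeholder `#N/A`, which Spark reads as an ordinary string. One option is to turn the placeholder into a real null so that the null checks see it. The sketch below assumes `#N/A` is the only placeholder used and works on a three-value stand-in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder.master("local[*]").appName("placeholder-sketch").getOrCreate()
import spark.implicits._

// Stand-in column containing the "#N/A" placeholder seen in the output above.
val sample = Seq("Yarra City Council", "#N/A", "Monash City Council").toDF("CouncilArea")

// Replace the placeholder with a typed null; all other values pass through.
val cleaned = sample.withColumn(
  "CouncilArea",
  when(col("CouncilArea") === "#N/A", lit(null).cast("string")).otherwise(col("CouncilArea"))
)

val missing = cleaned.filter(col("CouncilArea").isNull).count()
println(missing)
```

The CSV reader's `nullValue` option (`.option("nullValue", "#N/A")`) is an alternative that applies the same rule to every column at load time.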


<span style="font-size:16pt;color: red">TO DO</span> 
#### 7. Postcode

In [117]:
val postcodes = df.select("Postcode").distinct()

postcodes: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Postcode: string]


In [118]:
postcodes.show()

+--------+
|Postcode|
+--------+
|    3015|
|    3200|
|    3121|
|    3167|
|    3057|
|    3179|
|    3786|
|    3089|
|    3145|
|    3808|
|    3140|
|    3187|
|    3019|
|    3107|
|    3802|
|    3074|
|    3078|
|    3207|
|    3777|
|    3090|
+--------+
only showing top 20 rows



In [119]:
postcodes.count()

res97: Long = 212


In [120]:

df.filter("Postcode IS NULL").count()

res98: Long = 0


<span style="font-size:16pt;color: red">TO DO</span> 
#### 8. Regionname

In [121]:
df.select("Regionname").distinct.show()

+--------------------+
|          Regionname|
+--------------------+
|South-Eastern Met...|
|Western Metropolitan|
|Eastern Metropolitan|
|    Eastern Victoria|
|                #N/A|
|   Northern Victoria|
|Northern Metropol...|
|Southern Metropol...|
|    Western Victoria|
+--------------------+



In [122]:

df.filter("Regionname IS NULL").count()

res100: Long = 0


<span style="font-size:16pt;color: red">TO DO</span> 
#### Numerical attributes

|-- Price:
 
|-- Rooms:

|-- Distance: 

|-- Bathroom: 

|-- Car: 

|-- Landsize: 

|-- BuildingArea: 

|-- YearBuilt:


<span style="font-size:16pt;color: red">TO DO</span> 
#### 1. Price

In [123]:
df_result.filter(df("Price").isNull || df("Price") === "" || df("Price").isNaN).count()

res101: Long = 7610


In [124]:
df_result.filter("Price IS NULL").count()

res102: Long = 7610


In [125]:
df_result.filter("Price IS NULL").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show(10)

+-------------------+-----+----+-----+------+-------+----------+--------+--------+--------+----+-------------+
|            Address|Rooms|Type|Price|Method|SellerG|      Date|Distance|Postcode|Bathroom| Car|Propertycount|
+-------------------+-----+----+-----+------+-------+----------+--------+--------+--------+----+-------------+
|      68 Studley St|    2|   H| null|    SS| Jellis| 3/09/2016|     2.5|    3067|       1|   1|         4019|
| 18/659 Victoria St|    3|   U| null|    VB| Rounds| 4/02/2016|     2.5|    3067|       2|   1|         4019|
|       16 Maugie St|    4|   H| null|    SN| Nelson| 6/08/2016|     2.5|    3067|       2|   2|         4019|
|       53 Turner St|    2|   H| null|     S| Biggin| 6/08/2016|     2.5|    3067|       1|   2|         4019|
|       99 Turner St|    2|   H| null|     S|Collins| 6/08/2016|     2.5|    3067|       2|   1|         4019|
|121/56 Nicholson St|    2|   U| null|    PI| Biggin| 7/11/2016|     2.5|    3067|       2|   1|         4019|
|

In [126]:
df_result.select(min(df("Price")), max(df("Price")),mean(df("Price"))).show()


+----------+----------+-----------------+
|min(Price)|max(Price)|       avg(Price)|
+----------+----------+-----------------+
|   1000000|    999999|1050173.344955408|
+----------+----------+-----------------+



In [128]:
df_result.describe().select("summary","price","Suburb").show()

+-------+-----------------+----------+
|summary|            price|    Suburb|
+-------+-----------------+----------+
|  count|            27247|     34857|
|   mean|1050173.344955408|      null|
| stddev|641467.1301045999|      null|
|    min|          1000000|Abbotsford|
|    max|           999999|Yarraville|
+-------+-----------------+----------+



In [127]:
df_result.groupBy($"Suburb").agg(sort_array(collect_list($"Price"))).show()

+----------------+-------------------------------------+
|          Suburb|sort_array(collect_list(Price), true)|
+----------------+-------------------------------------+
|  Brunswick West|                 [1000000, 1000000...|
|    Ivanhoe East|                 [1050000, 1260000...|
| South Melbourne|                 [1000000, 1002000...|
|      Cranbourne|                 [490000, 521000, ...|
|    Princes Hill|                 [1100000, 1150000...|
|         Ashwood|                 [1008000, 1020000...|
|       Brunswick|                 [1000000, 1000000...|
|South Kingsville|                 [1015000, 1030000...|
|       Albanvale|                 [415000, 502000, ...|
|        Brighton|                 [1002500, 1005000...|
|      Brookfield|                 [445000, 456000, ...|
|         Doveton|                 [396000, 430000, ...|
|        Oak Park|                 [1000000, 1005000...|
|        Lynbrook|                     [597500, 630000]|
|     Ferny Creek|             

In [70]:
df_result.groupBy($"Suburb").agg(round(mean($"Price"))).show()

+----------------+--------------------+
|          Suburb|round(avg(Price), 0)|
+----------------+--------------------+
|  Brunswick West|            817657.0|
| South Melbourne|           1349624.0|
|    Ivanhoe East|           1769048.0|
|    Princes Hill|           1633265.0|
|      Cranbourne|            639455.0|
|         Ashwood|           1173157.0|
|       Brunswick|            977989.0|
|South Kingsville|            784642.0|
|        Brighton|           1984227.0|
|        Oak Park|            804799.0|
|         Doveton|            522974.0|
|       Albanvale|            536056.0|
|      Brookfield|            498500.0|
|        Lynbrook|            613750.0|
|     Ferny Creek|            690000.0|
|     Pascoe Vale|            760727.0|
| Blackburn North|           1136711.0|
|         croydon|            730000.0|
|     Sandringham|           1559589.0|
|   Botanic Ridge|            750000.0|
+----------------+--------------------+
only showing top 20 rows



<span style="font-size:16pt;color: red">TO DO</span> 
#### 2. Rooms

In [83]:
df_result.filter("Rooms IS NULL").count()

res67: Long = 0


<span style="font-size:16pt;color: red">TO DO</span> 
#### 3. Distance

In [84]:
df_result.filter("Distance IS NULL").count()

res68: Long = 0


<span style="font-size:16pt;color: red">TO DO</span> 
#### 4. Bathroom

In [85]:
df_result.filter(df("Bathroom").isNull || df("Bathroom") === "" || df("Bathroom").isNaN).count()

res69: Long = 8226


In [86]:
df_result.filter("Bathroom IS NULL").count()

res70: Long = 8226


<span style="font-size:16pt;color: red">TO DO</span> 
#### 5. Car

In [87]:
df.filter("Car IS NULL").count()

res71: Long = 8728


<span style="font-size:16pt;color: red">TO DO</span> 
#### 6. Landsize

In [88]:
df_result.filter("Landsize IS NULL").count()

res72: Long = 11810


<span style="font-size:16pt;color: red">TO DO</span> 
#### 7. BuildingArea

In [89]:
df_result.filter("BuildingArea IS NULL").count()

res73: Long = 21115


<span style="font-size:16pt;color: red">TO DO</span> 
#### 8. YearBuilt

In [90]:
df_result.filter("YearBuilt IS NULL").count()

res74: Long = 19306


#### Filtering null values

In [129]:
val df_not_null = df_result.na.drop
df_not_null.count()

df_not_null: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]
res107: Long = 8887
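
Calling `na.drop` with no arguments removes every row that has a null in any column, which is why only 8887 of the 34857 rows remain. If only some attributes are mandatory, the check can be limited to those columns. A minimal sketch on rows taken from this notebook's outputs; treating `Price` as the only required column is an assumption.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("na-drop-sketch").getOrCreate()
import spark.implicits._

// Stand-in rows: one complete, one without BuildingArea, one without Price.
val sample = Seq[(String, String, String)](
  ("85 Turner St", "1480000", null),
  ("68 Studley St", null, null),
  ("25 Bloomburg St", "1035000", "79")
).toDF("Address", "Price", "BuildingArea")

// Drops a row if ANY column is null.
val dropAnyNull = sample.na.drop()
// Drops a row only if Price is null.
val dropMissingPrice = sample.na.drop(Seq("Price"))

println(s"any null: ${dropAnyNull.count()}, missing price only: ${dropMissingPrice.count()}")
```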


In [130]:
val df_filtered = df_result.filter(row => !row.anyNull);
df_filtered.printSchema()


root
 |-- Suburb: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Rooms: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: string (nullable = true)
 |-- Bathroom: string (nullable = true)
 |-- Car: string (nullable = true)
 |-- Landsize: string (nullable = true)
 |-- BuildingArea: string (nullable = true)
 |-- YearBuilt: string (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: string (nullable = true)
 |-- Longtitude: string (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: string (nullable = true)



df_filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Suburb: string, Address: string ... 19 more fields]


In [131]:
df_filtered.filter(df("Price").isNull || df("Price") === "" || df("Price").isNaN).count()

res109: Long = 0


In [132]:
df_filtered.where("Price >1000000").filter("Suburb = 'Abbotsford'").select(
    "Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show()

+----------------+-----+----+-------+------+--------+----------+--------+--------+--------+---+-------------+
|         Address|Rooms|Type|  Price|Method| SellerG|      Date|Distance|Postcode|Bathroom|Car|Propertycount|
+----------------+-----+----+-------+------+--------+----------+--------+--------+--------+---+-------------+
| 25 Bloomburg St|    2|   H|1035000|     S|  Biggin| 4/02/2016|     2.5|    3067|       1|  0|         4019|
|    5 Charles St|    3|   H|1465000|    SP|  Biggin| 4/03/2017|     2.5|    3067|       2|  0|         4019|
|     55a Park St|    4|   H|1600000|    VB|  Nelson| 4/06/2016|     2.5|    3067|       1|  2|         4019|
|    124 Yarra St|    3|   H|1876000|     S|  Nelson| 7/05/2016|     2.5|    3067|       2|  0|         4019|
|   98 Charles St|    2|   H|1636000|     S|  Nelson| 8/10/2016|     2.5|    3067|       1|  2|         4019|
|   10 Valiant St|    2|   H|1097000|     S|  Biggin| 8/10/2016|     2.5|    3067|       1|  2|         4019|
| 40 Nicho

In [133]:
df_filtered.count()

res111: Long = 8887


#### Imputing null values

In [None]:
var df_imp = df.na.fill("N/A")
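
Filling every column with the string `"N/A"` also puts that text into the numerical attributes, which then can no longer be parsed as numbers. `na.fill` accepts a map so that each column gets its own replacement. The sketch below is illustrative only: the replacement values (`"Unknown"` and `0.0`) are assumptions, not a recommended imputation policy, and `Car` is assumed to have been cast to `double` already.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").appName("na-fill-sketch").getOrCreate()
import spark.implicits._

// Stand-in rows: a string column and a numeric column, each with one gap.
val sample = Seq[(String, Option[Double])](
  ("Yarra City Council", Some(1.0)),
  (null, None)
).toDF("CouncilArea", "Car")

// Per-column replacement values (assumed for illustration).
val imputed = sample.na.fill(Map("CouncilArea" -> "Unknown", "Car" -> 0.0))
imputed.show()
```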

### Renaming Columns
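
Two column names in the schema are misspelled (`Lattitude`, `Longtitude`). A possible way to rename them is `withColumnRenamed`, folded over a map of old to new names. This is a sketch under assumptions: the target names `Latitude` and `Longitude` are a suggestion, and the single row is taken from the output shown earlier.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("rename-sketch").getOrCreate()
import spark.implicits._

// Stand-in with the two misspelled column names from the schema.
val sample = Seq(("-37.8014", "144.9958")).toDF("Lattitude", "Longtitude")

// Old name -> new name (the new names are an assumption).
val renames = Map("Lattitude" -> "Latitude", "Longtitude" -> "Longitude")
val renamed = renames.foldLeft(sample) { case (acc, (from, to)) =>
  acc.withColumnRenamed(from, to)
}

renamed.printSchema()
```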

### Group By and  Aggregation

In [None]:
import org.apache.spark.sql.functions._
// Price is a string column, so cast it to a number before taking the maximum
df.groupBy("Suburb").agg(max(col("Price").cast("double"))).show()

In [None]:
// cast Price to a number so the minimum is numeric rather than lexicographic
df.groupBy("Suburb").agg(min(col("Price").cast("double"))).show()

In [None]:
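// Spark 2.4 has no median() function; percentile_approx at 0.5 gives an approximate median
df.groupBy("Distance").agg(expr("percentile_approx(cast(Price as double), 0.5)")).show()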

### Correlation
 <span style="font-size:16pt;color: red">TO DO: find out correlations between these attributes</span> .

Rooms:
|-- Price:
|-- Distance:
|-- Rooms:
|-- Bathroom:
|-- Car:
|-- Landsize:
|-- BuildingArea:
|-- YearBuilt:

In [134]:
import org.apache.spark.sql.functions.corr
df_result.select(corr("Distance","Price")).show()

+---------------------+
|corr(Distance, Price)|
+---------------------+
| -0.21138434279157942|
+---------------------+



import org.apache.spark.sql.functions.corr


In [135]:
df_result.select(corr("Rooms","Price")).show()

+-------------------+
| corr(Rooms, Price)|
+-------------------+
|0.46523834510759615|
+-------------------+



In [136]:
df_result.select(corr("Landsize","Price")).show()

+---------------------+
|corr(Landsize, Price)|
+---------------------+
| 0.032748365249470925|
+---------------------+



In [137]:
df_result.select(corr("Bathroom","Price")).show()

+---------------------+
|corr(Bathroom, Price)|
+---------------------+
|   0.4298780777015672|
+---------------------+



In [138]:
df_result.select(corr("Car","Price")).show()

+-------------------+
|   corr(Car, Price)|
+-------------------+
|0.20180256061576263|
+-------------------+



In [139]:
df_result.select(corr("YearBuilt","Price")).show()

+----------------------+
|corr(YearBuilt, Price)|
+----------------------+
|   -0.3333055641267079|
+----------------------+



In [147]:
df_result.select(corr("BuildingArea","Price")).show()

+-------------------------+
|corr(BuildingArea, Price)|
+-------------------------+
|       0.1007536394731018|
+-------------------------+
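
The per-attribute `corr` calls above can be collapsed into one `select`, which computes every correlation with `Price` in a single job. The sketch below uses synthetic, perfectly linear values chosen only so the result is easy to check; they are not taken from the housing data, and the explicit casts to `double` are an assumption of this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, corr}

val spark = SparkSession.builder.master("local[*]").appName("corr-sketch").getOrCreate()
import spark.implicits._

// Synthetic string columns: Rooms rises with Price, Distance falls with it.
val sample = Seq(("1", "3", "100"), ("2", "2", "200"), ("3", "1", "300"))
  .toDF("Rooms", "Distance", "Price")

val features = Seq("Rooms", "Distance")
val price = col("Price").cast("double")

// One corr aggregate per feature, all evaluated in the same pass.
val correlations = sample.select(
  features.map(f => corr(col(f).cast("double"), price).alias(f)): _*
).first()

features.zipWithIndex.foreach { case (f, i) =>
  println(s"$f vs Price: ${correlations.getDouble(i)}")
}
```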



#### Write out the clean data:

In [148]:
df_result.write.format("csv").option("header","true").mode("overwrite").option("sep",",").save("hdfs://localhost:9000/tmp/rs_out/mh.csv") 
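
Spark writes the `save` target as a directory with one `part-*` file per partition, so `/tmp/rs_out/mh.csv` ends up as a directory rather than a single CSV file. If a single file is wanted, the data can be coalesced to one partition before writing. A minimal sketch that writes to a local temporary directory (an assumption made so the example does not depend on HDFS), using two rows from this notebook's outputs.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("write-sketch").getOrCreate()
import spark.implicits._

val sample = Seq(("Abbotsford", "1480000"), ("Abbotsford", "1035000")).toDF("Suburb", "Price")

// Hypothetical local target; the notebook itself writes to HDFS.
val outDir = Files.createTempDirectory("rs_out").resolve("mh.csv")

// coalesce(1) forces a single partition, hence a single part file.
sample.coalesce(1).write.option("header", "true").mode("overwrite").csv(outDir.toUri.toString)

val partFiles = outDir.toFile.listFiles.map(_.getName)
  .filter(n => n.startsWith("part-") && n.endsWith(".csv"))
partFiles.foreach(println)
```

Coalescing to one partition funnels all rows through a single task, so it is only practical for data that fits comfortably on one executor.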

Copy the clean data from HDFS to the local disk:

In [None]:
! hadoop fs -mkdir -p  /tmp/rs_out
! hadoop fs -get /tmp/rs_out/mh.csv      ./../data-clean

## References

Apache Spark (n.d.). _Spark Scala API (Scaladoc). Overview._ https://spark.apache.org/docs/latest/api/java/overview-summary.html

Apache Spark (n.d.). _Basic Statistics._ https://spark.apache.org/docs/latest/ml-statistics.html

Bahadoor N. (2020). _Spark Tutorials._ https://allaboutscala.com/big-data/spark/#dataframe-statistics-correlation

Databricks. (2020). _Introduction to DataFrames - Scala._  https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html 

Grimaldi E. (2018). _Pandas vs. Spark: how to handle dataframes (Part II.)_  https://towardsdatascience.com/python-pandas-vs-scala-how-to-handle-dataframes-part-ii-d3e5efe8287d 

