## Initialise Spark Session:

In [1]:
val spark = org.apache.spark.sql.SparkSession.builder
        .master("local") 
        .appName("Spark CSV Reader")
        .getOrCreate;

Intitializing Scala interpreter ...

Spark Web UI available at http://356e2a7fd1b6:4040
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1590037845952)
SparkSession available as 'spark'




spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@274abeea


### Preparing HDFS

In [2]:
!pwd


/home/sandpit/big-data-realestate/scripts



In [3]:
!cat ./../data-raw/Melbourne_housing_FULL.csv| wc -l

34858



In [4]:
! hadoop fs -mkdir -p  /tmp/rs_in
! hadoop fs -put   -p  ./../data-raw/Melbourne_housing_FULL.csv             /tmp/rs_in/mh.csv
! hadoop fs -ls        /tmp/rs_in/

put: `/tmp/rs_in/mh.csv': File exists


Found 1 items


-rw-r--r--   1 root root    5018236 2020-05-15 05:20 /tmp/rs_in/mh.csv




In [5]:
!hadoop fs -cat /tmp/rs_in/mh.csv | wc -l

34858



### Get the HDFS default filesystem URI:

In [6]:
!hdfs getconf -confKey fs.defaultFS

hdfs://localhost:9000



In [7]:
val df = spark.read.format("csv").option("header", "true").load("hdfs://localhost:9000/tmp/rs_in/mh.csv")

df: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]
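
Every column loads as a string because the reader is not asked to infer types. A minimal sketch, assuming Spark 2.2 or later, of the CSV reader's `nullValue` and `inferSchema` options applied to a small in-memory sample. The sample rows and the name `session` are made up for illustration; the real file would still be loaded from the HDFS path above.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's session if one already exists.
val session = SparkSession.builder.master("local[*]").appName("csv-options-sketch").getOrCreate()
import session.implicits._

// Hypothetical sample shaped like a few columns of mh.csv.
val sampleCsv = Seq(
  "Suburb,Rooms,Price,Distance",
  "Abbotsford,2,1480000,2.5",
  "Abbotsford,3,1465000,2.5",
  "Exampleville,3,616000,#N/A"
).toDS()

val typed = session.read
  .option("header", "true")
  .option("nullValue", "#N/A")   // treat the placeholder as null while parsing
  .option("inferSchema", "true") // scan the data to choose numeric types
  .csv(sampleCsv)

typed.printSchema()
typed.show()
```

With these options the placeholder never reaches the DataFrame as a string, at the cost of an extra pass over the data for type inference.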


### Print schema:

In [8]:
df.printSchema()

root
 |-- Suburb: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Rooms: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: string (nullable = true)
 |-- Bathroom: string (nullable = true)
 |-- Car: string (nullable = true)
 |-- Landsize: string (nullable = true)
 |-- BuildingArea: string (nullable = true)
 |-- YearBuilt: string (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: string (nullable = true)
 |-- Longtitude: string (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: string (nullable = true)



In [9]:
df.columns

res1: Array[String] = Array(Suburb, Address, Rooms, Type, Price, Method, SellerG, Date, Distance, Postcode, Bedroom2, Bathroom, Car, Landsize, BuildingArea, YearBuilt, CouncilArea, Lattitude, Longtitude, Regionname, Propertycount)


### Show column types:

In [10]:
df.dtypes

res2: Array[(String, String)] = Array((Suburb,StringType), (Address,StringType), (Rooms,StringType), (Type,StringType), (Price,StringType), (Method,StringType), (SellerG,StringType), (Date,StringType), (Distance,StringType), (Postcode,StringType), (Bedroom2,StringType), (Bathroom,StringType), (Car,StringType), (Landsize,StringType), (BuildingArea,StringType), (YearBuilt,StringType), (CouncilArea,StringType), (Lattitude,StringType), (Longtitude,StringType), (Regionname,StringType), (Propertycount,StringType))


In [11]:
df.dtypes.filter(colTup => colTup._1 == "Suburb")

res3: Array[(String, String)] = Array((Suburb,StringType))


### Display 12 of the first 13 columns (Bedroom2 omitted):

In [12]:
df.select("Suburb","Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car").show()

+----------+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+
|    Suburb|            Address|Rooms|Type|  Price|Method|SellerG|     Date|Distance|Postcode|Bathroom| Car|
+----------+-------------------+-----+----+-------+------+-------+---------+--------+--------+--------+----+
|Abbotsford|      68 Studley St|    2|   h|   null|    SS| Jellis|3/09/2016|     2.5|    3067|       1|   1|
|Abbotsford|       85 Turner St|    2|   h|1480000|     S| Biggin|3/12/2016|     2.5|    3067|       1|   1|
|Abbotsford|    25 Bloomburg St|    2|   h|1035000|     S| Biggin|4/02/2016|     2.5|    3067|       1|   0|
|Abbotsford| 18/659 Victoria St|    3|   u|   null|    VB| Rounds|4/02/2016|     2.5|    3067|       2|   1|
|Abbotsford|       5 Charles St|    3|   h|1465000|    SP| Biggin|4/03/2017|     2.5|    3067|       2|   0|
|Abbotsford|   40 Federation La|    3|   h| 850000|    PI| Biggin|4/03/2017|     2.5|    3067|       2|   1|
|Abbotsford|       

### Display last 8 columns:

In [13]:
df.select("Landsize","BuildingArea","YearBuilt","CouncilArea","Lattitude","Longtitude","Regionname","Propertycount").show(10)

+--------+------------+---------+------------------+---------+----------+--------------------+-------------+
|Landsize|BuildingArea|YearBuilt|       CouncilArea|Lattitude|Longtitude|          Regionname|Propertycount|
+--------+------------+---------+------------------+---------+----------+--------------------+-------------+
|     126|        null|     null|Yarra City Council| -37.8014|  144.9958|Northern Metropol...|         4019|
|     202|        null|     null|Yarra City Council| -37.7996|  144.9984|Northern Metropol...|         4019|
|     156|          79|     1900|Yarra City Council| -37.8079|  144.9934|Northern Metropol...|         4019|
|       0|        null|     null|Yarra City Council| -37.8114|  145.0116|Northern Metropol...|         4019|
|     134|         150|     1900|Yarra City Council| -37.8093|  144.9944|Northern Metropol...|         4019|
|      94|        null|     null|Yarra City Council| -37.7969|  144.9969|Northern Metropol...|         4019|
|     120|         

In [14]:
df.describe().select("summary","Price", "Rooms","Distance","Bathroom","Car").show()

+-------+-----------------+------------------+------------------+------------------+------------------+
|summary|            Price|             Rooms|          Distance|          Bathroom|               Car|
+-------+-----------------+------------------+------------------+------------------+------------------+
|  count|            27247|             34857|             34857|             26631|             26129|
|   mean|1050173.344955408|3.0310124221820582|11.184929423916007| 1.624798167549097|1.7288453442535114|
| stddev|641467.1301045999|0.9699329348975204| 6.788892455935938|0.7242120114699068|1.0107707853554244|
|    min|          1000000|                 1|              #N/A|                 0|                 0|
|    max|           999999|                 9|               9.9|                 9|                 9|
+-------+-----------------+------------------+------------------+------------------+------------------+



In [15]:
df.describe().select("summary","Price","Landsize", "BuildingArea").show()

+-------+-----------------+------------------+------------------+
|summary|            Price|          Landsize|      BuildingArea|
+-------+-----------------+------------------+------------------+
|  count|            27247|             23047|             13742|
|   mean|1050173.344955408|  593.598993361392| 160.2564003565711|
| stddev|641467.1301045999|3398.8419464599056|401.26706008485496|
|    min|          1000000|                 0|                 0|
|    max|           999999|               999|               999|
+-------+-----------------+------------------+------------------+
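
In both summaries the `min` of Price (1000000) is larger than the `max` (999999). That is consistent with the columns still being strings at this point, so `min` and `max` follow character-by-character ordering. A small plain-Scala illustration of the same ordering, using three figures that appear in this notebook:

```scala
// String ordering compares character by character, so "1000000" sorts before "85000".
val asStrings = Seq("1000000", "999999", "85000")
val stringMin = asStrings.min
val stringMax = asStrings.max

// After conversion to Double the ordering is numeric.
val asDoubles = asStrings.map(_.toDouble)
val numericMin = asDoubles.min
val numericMax = asDoubles.max

println(s"string min/max:  $stringMin / $stringMax")
println(s"numeric min/max: $numericMin / $numericMax")
```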



#### Change "#N/A" to null

In [16]:
var df_result =df
for (colName<-df.columns){ 
  df_result = df.withColumn(colName, when(trim(df(colName))==="#N/A",null).otherwise(df(colName)))
  }

df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]


In [17]:
df_result.describe().select("summary","Price", "Rooms","Distance","Bathroom","Car").show()

+-------+-----------------+------------------+------------------+------------------+------------------+
|summary|            Price|             Rooms|          Distance|          Bathroom|               Car|
+-------+-----------------+------------------+------------------+------------------+------------------+
|  count|            27247|             34857|             34857|             26631|             26129|
|   mean|1050173.344955408|3.0310124221820582|11.184929423916007| 1.624798167549097|1.7288453442535114|
| stddev|641467.1301045999|0.9699329348975204| 6.788892455935938|0.7242120114699068|1.0107707853554244|
|    min|          1000000|                 1|              #N/A|                 0|                 0|
|    max|           999999|                 9|               9.9|                 9|                 9|
+-------+-----------------+------------------+------------------+------------------+------------------+
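
The summary above still lists `#N/A` as the minimum of Distance, so the replacement loop apparently left most columns unchanged: each pass calls `df.withColumn(...)` on the original `df`, so only the last column's replacement is kept in `df_result`. A sketch of one way to accumulate the replacement across all columns with `foldLeft`, shown on a made-up two-row frame. `replacePlaceholder` and `session` are hypothetical names, not part of the notebook.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, trim, when}

// getOrCreate reuses the notebook's session if one already exists.
val session = SparkSession.builder.master("local[*]").appName("placeholder-sketch").getOrCreate()
import session.implicits._

// Hypothetical helper: turn every cell equal to `placeholder` into null.
def replacePlaceholder(input: DataFrame, placeholder: String): DataFrame =
  input.columns.foldLeft(input) { (acc, name) =>
    // Each step builds on `acc`, so earlier replacements are preserved.
    acc.withColumn(name, when(trim(col(name)) === placeholder, lit(null)).otherwise(col(name)))
  }

// Made-up rows for illustration.
val raw = Seq(
  ("Abbotsford", "2.5", "Yarra City Council"),
  ("Exampleville", "#N/A", "#N/A")
).toDF("Suburb", "Distance", "CouncilArea")

val cleaned = replacePlaceholder(raw, "#N/A")
cleaned.show()
```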



#### Convert numeric data represented as string into double 

In [18]:
import org.apache.spark.sql.functions.col

// Columns that hold numbers but were read as strings
val doubleColNames = Array("Price", "Rooms", "Bedroom2", "Distance", "Bathroom", "Car", "Landsize", "BuildingArea", "Propertycount",
               "YearBuilt", "Lattitude", "Longtitude")
// Values that cannot be parsed as numbers (such as "#N/A") become null after the cast
for (colName <- doubleColNames) {
    df_result = df_result.withColumn(colName, col(colName).cast("double"))
}

doubleColNames: Array[String] = Array(Price, Rooms, Bedroom2, Distance, Bathroom, Car, Landsize, BuildingArea, Propertycount, YearBuilt, Lattitude, Longtitude)


In [19]:
df_result.printSchema()

root
 |-- Suburb: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Rooms: double (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: double (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: double (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: double (nullable = true)
 |-- Bathroom: double (nullable = true)
 |-- Car: double (nullable = true)
 |-- Landsize: double (nullable = true)
 |-- BuildingArea: double (nullable = true)
 |-- YearBuilt: double (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: double (nullable = true)
 |-- Longtitude: double (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: double (nullable = true)



### Filtering

In [20]:
df_result.filter($"Suburb"==="Glen Waverley").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car").show()

+-----------------+-----+----+---------+------+---------+----------+--------+--------+--------+----+
|          Address|Rooms|Type|    Price|Method|  SellerG|      Date|Distance|Postcode|Bathroom| Car|
+-----------------+-----+----+---------+------+---------+----------+--------+--------+--------+----+
|     7 Marbray Dr|  4.0|   h|     null|    SN|Harcourts| 1/07/2017|    16.7|    3150|     1.0| 2.0|
|      24 Owens Av|  4.0|   h|1250000.0|     S|      Ray| 1/07/2017|    16.7|    3150|    null|null|
|515 Springvale Rd|  3.0|   h|     null|    PI|      Ray| 1/07/2017|    16.7|    3150|     1.0| 2.0|
| 22 Stableford Av|  3.0|   h|     null|     S|      Ray| 1/07/2017|    16.7|    3150|    null|null|
|  28 Brentwood Dr|  5.0|   h|     null|    PI|      Ray| 3/06/2017|    16.7|    3150|     5.0| 2.0|
|2/70 Leicester Av|  3.0|   t|     null|    SP|      LLC| 3/06/2017|    16.7|    3150|    null|null|
|    38 Margate Cr|  3.0|   h|     null|    SN| Woodards| 3/06/2017|    16.7|    3150|    n

In [21]:
df_result.where("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show()

+-------------------+-----+----+---------+------+-------+---------+--------+--------+--------+----+-------------+
|            Address|Rooms|Type|    Price|Method|SellerG|     Date|Distance|Postcode|Bathroom| Car|Propertycount|
+-------------------+-----+----+---------+------+-------+---------+--------+--------+--------+----+-------------+
|      68 Studley St|  2.0|   h|     null|    SS| Jellis|3/09/2016|     2.5|    3067|     1.0| 1.0|       4019.0|
|       85 Turner St|  2.0|   h|1480000.0|     S| Biggin|3/12/2016|     2.5|    3067|     1.0| 1.0|       4019.0|
|    25 Bloomburg St|  2.0|   h|1035000.0|     S| Biggin|4/02/2016|     2.5|    3067|     1.0| 0.0|       4019.0|
| 18/659 Victoria St|  3.0|   u|     null|    VB| Rounds|4/02/2016|     2.5|    3067|     2.0| 1.0|       4019.0|
|       5 Charles St|  3.0|   h|1465000.0|    SP| Biggin|4/03/2017|     2.5|    3067|     2.0| 0.0|       4019.0|
|   40 Federation La|  3.0|   h| 850000.0|    PI| Biggin|4/03/2017|     2.5|    3067|   

In [22]:
df_result.where("Price >1000000").filter("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").count()

res14: Long = 53


In [23]:
df_result.where("Price >1000000").filter("Suburb = 'Abbotsford'").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").show()

+-------------------+-----+----+---------+------+--------+----------+--------+--------+--------+----+-------------+
|            Address|Rooms|Type|    Price|Method| SellerG|      Date|Distance|Postcode|Bathroom| Car|Propertycount|
+-------------------+-----+----+---------+------+--------+----------+--------+--------+--------+----+-------------+
|       85 Turner St|  2.0|   h|1480000.0|     S|  Biggin| 3/12/2016|     2.5|    3067|     1.0| 1.0|       4019.0|
|    25 Bloomburg St|  2.0|   h|1035000.0|     S|  Biggin| 4/02/2016|     2.5|    3067|     1.0| 0.0|       4019.0|
|       5 Charles St|  3.0|   h|1465000.0|    SP|  Biggin| 4/03/2017|     2.5|    3067|     2.0| 0.0|       4019.0|
|        55a Park St|  4.0|   h|1600000.0|    VB|  Nelson| 4/06/2016|     2.5|    3067|     1.0| 2.0|       4019.0|
|       124 Yarra St|  3.0|   h|1876000.0|     S|  Nelson| 7/05/2016|     2.5|    3067|     2.0| 0.0|       4019.0|
|      98 Charles St|  2.0|   h|1636000.0|     S|  Nelson| 8/10/2016|   

In [24]:
df_result.where("Price >1000000").filter("Suburb = 'Abbotsford'").collect()

res16: Array[org.apache.spark.sql.Row] = Array([Abbotsford,85 Turner St,2.0,h,1480000.0,S,Biggin,3/12/2016,2.5,3067,2.0,1.0,1.0,202.0,null,null,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0], [Abbotsford,25 Bloomburg St,2.0,h,1035000.0,S,Biggin,4/02/2016,2.5,3067,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0], [Abbotsford,5 Charles St,3.0,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0], [Abbotsford,55a Park St,4.0,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067,3.0,1.0,2.0,120.0,142.0,2014.0,Yarra City Council,-37.8072,144.9941,Northern Metropolitan,4019.0], [Abbotsford,124 Yarra St,3.0,h,1876000.0,S,Nelson,7/05/2016,2.5,3067,4.0,2.0,0...

In [25]:
df_result.select("Address","Type","Method","SellerG","Postcode","CouncilArea","Regionname").count()

res17: Long = 34857


### Categorical Attributes

#### 1. Address

In [26]:
// Inspect the distinct addresses; a later cell keeps only the street name and suffix
import org.apache.spark.sql.functions.countDistinct
df_result.select("Address").distinct.show()

+-------------------+
|            Address|
+-------------------+
|      557 Orrong Rd|
|      19 Poulter St|
|    43 Riverside Av|
|       11 South Tce|
|  41 Marlborough St|
|          4 Park Cr|
|        3/3 Dega Av|
|        93 Tudor St|
|         10 Kent Rd|
|       18 Thomas St|
|   1/1 Glen Iris Rd|
|      7 Allambee Av|
|    83 Truganini Rd|
|       130 Keele St|
|       8 Winters Wy|
|     36a Mitford St|
|   7/223 Station St|
|1/146 Ascot Vale Rd|
|    5/60 Farnham St|
|      22 Renwick St|
+-------------------+
only showing top 20 rows



import org.apache.spark.sql.functions.countDistinct


In [27]:
df_result.filter("Address IS NULL").count()

res19: Long = 0


In [28]:
df_result.select("Address").distinct.count()

res20: Long = 34009


In [29]:
df_result.printSchema()

root
 |-- Suburb: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Rooms: double (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: double (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: double (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: double (nullable = true)
 |-- Bathroom: double (nullable = true)
 |-- Car: double (nullable = true)
 |-- Landsize: double (nullable = true)
 |-- BuildingArea: double (nullable = true)
 |-- YearBuilt: double (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: double (nullable = true)
 |-- Longtitude: double (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: double (nullable = true)



#### Split Address on Street and Suffix

In [30]:
// Split Address on spaces: token 0 is the house number, token 1 becomes Street, token 2 becomes Suffix
df_result = df_result.withColumn("Street",split(col("Address")," ").getItem(1)).
                   withColumn("Suffix",split(col("Address")," ").getItem(2)).drop("Address")
df_result.show()

+----------+-----+----+---------+------+-------+---------+--------+--------+--------+--------+----+--------+------------+---------+------------------+---------+----------+--------------------+-------------+----------+------+
|    Suburb|Rooms|Type|    Price|Method|SellerG|     Date|Distance|Postcode|Bedroom2|Bathroom| Car|Landsize|BuildingArea|YearBuilt|       CouncilArea|Lattitude|Longtitude|          Regionname|Propertycount|    Street|Suffix|
+----------+-----+----+---------+------+-------+---------+--------+--------+--------+--------+----+--------+------------+---------+------------------+---------+----------+--------------------+-------------+----------+------+
|Abbotsford|  2.0|   h|     null|    SS| Jellis|3/09/2016|     2.5|    3067|     2.0|     1.0| 1.0|   126.0|        null|     null|Yarra City Council| -37.8014|  144.9958|Northern Metropol...|       4019.0|   Studley|    St|
|Abbotsford|  2.0|   h|1480000.0|     S| Biggin|3/12/2016|     2.5|    3067|     2.0|     1.0| 1.0| 

df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Rooms: double ... 20 more fields]
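
Taking tokens 1 and 2 works for addresses like `68 Studley St`, but for a three-word street such as `1/1 Glen Iris Rd` (present in the distinct address list) token 1 is `Glen` and token 2 is `Iris`. A sketch of an alternative using `regexp_extract`, assuming every address has the form number, street name, suffix. The pattern and the sample frame are illustrations, not the notebook's method.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract}

// getOrCreate reuses the notebook's session if one already exists.
val session = SparkSession.builder.master("local[*]").appName("address-sketch").getOrCreate()
import session.implicits._

val addresses = Seq("68 Studley St", "1/1 Glen Iris Rd", "18/659 Victoria St").toDF("Address")

// Group 1: house or unit number, group 2: street name (may contain spaces), group 3: last word.
val pattern = "^(\\S+)\\s+(.+)\\s+(\\S+)$"

val parts = addresses
  .withColumn("Street", regexp_extract(col("Address"), pattern, 2))
  .withColumn("Suffix", regexp_extract(col("Address"), pattern, 3))

parts.show(false)
```

`regexp_extract` returns an empty string when the pattern does not match, so addresses with fewer than three tokens would need separate handling.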


In [31]:
df_result.printSchema()

root
 |-- Suburb: string (nullable = true)
 |-- Rooms: double (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: double (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: double (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: double (nullable = true)
 |-- Bathroom: double (nullable = true)
 |-- Car: double (nullable = true)
 |-- Landsize: double (nullable = true)
 |-- BuildingArea: double (nullable = true)
 |-- YearBuilt: double (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: double (nullable = true)
 |-- Longtitude: double (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: double (nullable = true)
 |-- Street: string (nullable = true)
 |-- Suffix: string (nullable = true)



In [32]:
df_result.filter("Street IS NULL").count()

res24: Long = 0


#### 2. Postcode

In [33]:
var postcodes = df_result.select("Postcode").distinct()

postcodes: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Postcode: string]


In [34]:
df_result.filter("Postcode IS NULL").count()

res25: Long = 0


In [35]:
postcodes.count()

res26: Long = 212


#### 3. Suburb

In [36]:
val suburbs = df_result.select("Suburb").distinct

suburbs: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Suburb: string]


In [37]:
suburbs.count()

res27: Long = 351


In [38]:

df.filter("Suburb IS NULL").count()

res28: Long = 0


In [39]:
// Capitalise the first letter of each word in Suburb
import org.apache.spark.sql.functions._
df_result = df_result.withColumn("Suburb", initcap(col("Suburb")))
df_result.select("Suburb").distinct.show()


+----------------+
|          Suburb|
+----------------+
|  Brunswick West|
| South Melbourne|
|    Ivanhoe East|
|    Princes Hill|
|      Cranbourne|
|         Ashwood|
|       Brunswick|
|South Kingsville|
|        Brighton|
|        Oak Park|
|         Doveton|
|       Albanvale|
|      Brookfield|
|        Lynbrook|
|     Ferny Creek|
|     Pascoe Vale|
| Blackburn North|
|     Sandringham|
|   Botanic Ridge|
|          Carrum|
+----------------+
only showing top 20 rows



import org.apache.spark.sql.functions._
df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Rooms: double ... 20 more fields]


#### 4. Type 
#### Distinct values 

In [40]:
import org.apache.spark.sql.functions.countDistinct
df_result.select("Type").distinct.show()

+----+
|Type|
+----+
|   h|
|   u|
|   t|
+----+



import org.apache.spark.sql.functions.countDistinct


In [41]:
df_result = df_result.withColumn("Type", initcap(col("Type")))
df_result.select("Type").distinct.show()

+----+
|Type|
+----+
|   T|
|   U|
|   H|
+----+



df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Rooms: double ... 20 more fields]


#### Null values  

In [42]:
df_result.filter("Type IS NULL").count()

res32: Long = 0


#### 5. Method

In [43]:
df_result.select("Method").distinct.show()

+------+
|Method|
+------+
|    PI|
|    SA|
|    SP|
|    VB|
|    PN|
|     W|
|     S|
|    SN|
|    SS|
+------+



#### Null values  

In [44]:
df_result.filter("Method IS NULL").count()

res34: Long = 0


#### 6. SellerG

In [45]:
val sellers = df_result.select("SellerG").distinct

sellers: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [SellerG: string]


In [46]:
sellers.show()

+--------------------+
|             SellerG|
+--------------------+
|              LITTLE|
|                 S&L|
|              Ristic|
|            Langwell|
|             Ruralco|
|             Xynergy|
|               Ryder|
|               iSell|
|               Scott|
|              Wilson|
|          McNaughton|
|           Blackbird|
|hockingstuart/Biggin|
|               Lucas|
|                 One|
|         Buxton/Find|
|                Real|
|            Sterling|
|             Compton|
|           Tiernan's|
+--------------------+
only showing top 20 rows



In [47]:
sellers.count()

res36: Long = 388


In [48]:

df_result.filter("SellerG IS NULL").count()

res37: Long = 0


#### 7. Date

In [49]:
val dates = df_result.select("Date").distinct()

dates: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Date: string]


In [50]:
dates.count()

res38: Long = 78


In [51]:
df_result.filter("Date IS NULL").count()

res39: Long = 0


In [52]:
dates.show()

+----------+
|      Date|
+----------+
|16/04/2016|
|29/04/2017|
|10/12/2016|
|19/08/2017|
| 7/05/2016|
| 8/07/2017|
| 4/03/2017|
|29/07/2017|
|27/05/2017|
|28/10/2017|
| 9/09/2017|
|26/07/2016|
|12/11/2016|
|25/02/2017|
| 6/05/2017|
|18/11/2017|
| 3/09/2016|
| 3/12/2016|
|25/11/2017|
| 3/06/2017|
+----------+
only showing top 20 rows
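
Date remains a string column in day/month/year form with an unpadded day (for example `7/05/2016`), so sorting it would not follow calendar order. A sketch, assuming Spark 2.2 or later, of parsing it with `to_date` and a `d/M/yyyy` pattern. `SaleDate` and `session` are hypothetical names; the three sample values are taken from the list above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

// getOrCreate reuses the notebook's session if one already exists.
val session = SparkSession.builder.master("local[*]").appName("date-sketch").getOrCreate()
import session.implicits._

val dateSample = Seq("3/12/2016", "16/04/2016", "7/05/2016").toDF("Date")

// Single-letter d and M accept both one-digit and two-digit values.
val parsed = dateSample.withColumn("SaleDate", to_date(col("Date"), "d/M/yyyy"))

parsed.printSchema()
parsed.orderBy("SaleDate").show()
```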



#### 8. CouncilArea

In [53]:
val sareas = df_result.select("CouncilArea").distinct()

sareas: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [CouncilArea: string]


In [54]:
sareas.count()

res41: Long = 34


In [55]:
sareas.show(34)

+--------------------+
|         CouncilArea|
+--------------------+
|Bayside City Council|
|Greater Dandenong...|
|   Hume City Council|
|Glen Eira City Co...|
|Kingston City Cou...|
|Nillumbik Shire C...|
| Monash City Council|
|Macedon Ranges Sh...|
|   Knox City Council|
|Wyndham City Council|
|Mitchell Shire Co...|
|Maribyrnong City ...|
|Whittlesea City C...|
|Whitehorse City C...|
|Frankston City Co...|
|Manningham City C...|
|Darebin City Council|
|Moreland City Cou...|
|Cardinia Shire Co...|
|Moonee Valley Cit...|
|Boroondara City C...|
|Yarra Ranges Shir...|
|  Casey City Council|
|Port Phillip City...|
|Brimbank City Cou...|
|                #N/A|
|Hobsons Bay City ...|
|Banyule City Council|
|Stonnington City ...|
| Melton City Council|
|  Yarra City Council|
|Melbourne City Co...|
|Moorabool Shire C...|
|Maroondah City Co...|
+--------------------+



In [56]:

df_result.filter("CouncilArea IS NULL").count()

res43: Long = 0


#### 9. Regionname

In [57]:
df_result.select("Regionname").distinct.show()

+--------------------+
|          Regionname|
+--------------------+
|South-Eastern Met...|
|Western Metropolitan|
|Eastern Metropolitan|
|    Eastern Victoria|
|                #N/A|
|   Northern Victoria|
|Northern Metropol...|
|Southern Metropol...|
|    Western Victoria|
+--------------------+



In [58]:

df_result.filter("Regionname IS NULL").count()

res45: Long = 0
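
The `IS NULL` counts for CouncilArea and Regionname are 0 even though both distinct lists contain `#N/A`: at this point the placeholder is an ordinary string, not a null. A sketch of counting placeholder cells for every column in one pass, on a made-up frame (`session` and the sample rows are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

// getOrCreate reuses the notebook's session if one already exists.
val session = SparkSession.builder.master("local[*]").appName("placeholder-count-sketch").getOrCreate()
import session.implicits._

// Made-up rows for illustration.
val sample = Seq(
  ("Yarra City Council", "Northern Metropolitan"),
  ("#N/A", "#N/A"),
  ("Monash City Council", "#N/A")
).toDF("CouncilArea", "Regionname")

// One aggregate per column: add 1 for each cell equal to the placeholder.
val placeholderCounts = sample.select(
  sample.columns.map(c => sum(when(col(c) === "#N/A", 1).otherwise(0)).alias(c)): _*
)

placeholderCounts.show()
```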


#### 10. YearBuilt

In [59]:
df_result.filter("YearBuilt IS NULL").count()

res46: Long = 19306


In [60]:
df_result.select("YearBuilt").distinct.show()

+---------+
|YearBuilt|
+---------+
|   1988.0|
|   1976.0|
|   1951.0|
|   1940.0|
|   1928.0|
|   1900.0|
|   1979.0|
|   1856.0|
|   1953.0|
|   1830.0|
|   1913.0|
|   1987.0|
|   1909.0|
|   1959.0|
|   1934.0|
|   1904.0|
|   1978.0|
|   1968.0|
|   2010.0|
|   1881.0|
+---------+
only showing top 20 rows



### Numerical Attributes

#### 1. Price

In [61]:
df_result.filter("Price IS NULL").count()

res48: Long = 7610


In [62]:
// Drop rows where Price is null
df_result = df_result.filter(df_result("Price").isNotNull)

df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Rooms: double ... 20 more fields]


In [63]:
df_result.select("Price").distinct().show()

+---------+
|    Price|
+---------+
| 300000.0|
| 495000.0|
|1185000.0|
| 330000.0|
| 532000.0|
| 940500.0|
| 452500.0|
| 640500.0|
| 431500.0|
|1155500.0|
| 474000.0|
| 546000.0|
|1483000.0|
| 671000.0|
|1055000.0|
| 677776.0|
|2053000.0|
|1259000.0|
| 975500.0|
|4690000.0|
+---------+
only showing top 20 rows



In [None]:
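// Histogram of Price: histogram(n) splits the value range into n equal-width buckets
// and returns (bucket boundaries, counts); 10 buckets is an arbitrary choice
val (startValues,counts) = df_result.select($"Price")
    .rdd.map(r => r.getDouble(0))
    .histogram(10)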

In [None]:
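// startValues holds one more boundary than counts; zip pairs each bucket's lower bound with its count
val zippedValues = startValues.zip(counts)
case class HistRow(startPoint:Double,count:Long)
val rows = zippedValues.map( value => HistRow(value._1,value._2)).toSeq
// createDataFrame is a method of the session instance, not of the SparkSession companion object
val histDf = spark.createDataFrame(rows)
histDf.createOrReplaceTempView("histogramTable")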

In [None]:
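// Bucket boundaries for Price; these particular values are illustrative assumptions
val thresholds: Array[Double] = Array(0.0, 500000.0, 1000000.0, 2000000.0, 5000000.0, 12000000.0)

val _tmpHist = df_result
    .select($"Price" cast "double")
    .rdd.map(r => r.getDouble(0))
    .histogram(thresholds)

// Result DataFrame contains `from`, `to` range and the `value`.
val histogram = sc.parallelize((thresholds, thresholds.tail, _tmpHist).zipped.toList).toDF("from", "to", "value")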

In [64]:
df_result.describe().select("summary","Price","Suburb").show()

+-------+-----------------+----------+
|summary|            Price|    Suburb|
+-------+-----------------+----------+
|  count|            27247|     27247|
|   mean|1050173.344955408|      null|
| stddev|641467.1301045999|      null|
|    min|          85000.0|Abbotsford|
|    max|           1.12E7|Yarraville|
+-------+-----------------+----------+



In [65]:
df.where("Price >10000000").select("Address","Rooms","Type","Price","Method","SellerG","Date","Distance","Postcode","Bathroom","Car", "Propertycount").count()

res51: Long = 1


#### 2. Rooms

In [66]:
df_result.filter("Rooms IS NULL").count()

res52: Long = 0


In [67]:
df_result.select("Rooms").distinct().show()

+-----+
|Rooms|
+-----+
|  8.0|
|  7.0|
|  1.0|
|  4.0|
|  3.0|
|  2.0|
| 10.0|
|  6.0|
|  5.0|
|  9.0|
| 16.0|
| 12.0|
+-----+



<span style="font-size:16pt;color: red">TODO: are histograms possible here?</span>



In [None]:
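// A frequency table per distinct Rooms value serves as a histogram for this discrete column
df_result.groupBy("Rooms").count().orderBy("Rooms").show()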

In [None]:
histogram(df_result, df_result("Rooms"), nbins = 10)

In [None]:
df_result.select("Rooms").createOrReplaceTempView("histogramTable") // returns Unit, so query the view separately
spark.table("histogramTable").show()
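
On the histogram question above: Spark can compute histogram counts without a plotting library. A sketch of two options on a made-up Rooms column: `groupBy(...).count()` for a discrete variable, and `histogram` on an `RDD[Double]` for equal-width buckets. The sample values, the name `session`, and the bucket count of 3 are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's session if one already exists.
val session = SparkSession.builder.master("local[*]").appName("histogram-sketch").getOrCreate()
import session.implicits._

val rooms = Seq(1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0).toDF("Rooms")

// Option 1: frequency table, one row per distinct value.
val frequencies = rooms.groupBy("Rooms").count().orderBy("Rooms")
frequencies.show()

// Option 2: equal-width buckets; returns the bucket boundaries and the count per bucket.
val (boundaries, bucketCounts) = rooms.select("Rooms").rdd.map(_.getDouble(0)).histogram(3)
println(boundaries.mkString(", "))
println(bucketCounts.mkString(", "))
```

Either result is small enough to collect to the driver and hand to whatever plotting tool the notebook environment provides.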

#### 3. Distance

In [68]:
df_result.filter("Distance IS NULL").count()

res54: Long = 1


In [69]:
df_result.select("Distance").distinct.count()

res55: Long = 214


In [70]:
df_result.select("Distance")

res56: org.apache.spark.sql.DataFrame = [Distance: double]


In [71]:
df_result.select("Distance").distinct.show(213)

+--------+
|Distance|
+--------+
|    14.9|
|    13.4|
|    15.5|
|    15.4|
|     2.4|
|     8.0|
|    10.2|
|    24.7|
|    32.3|
|     0.0|
|    17.9|
|    43.3|
|    11.4|
|     5.4|
|    23.8|
|    16.6|
|     7.0|
|    11.5|
|     3.5|
|    31.7|
|     6.1|
|     9.5|
|     7.7|
|    17.3|
|    25.2|
|    31.6|
|     6.6|
|    34.7|
|    20.5|
|     8.7|
|    13.3|
|    28.8|
|    12.5|
|     3.7|
|    10.3|
|     4.5|
|    19.9|
|     5.7|
|     1.4|
|    26.1|
|     6.7|
|     0.7|
|     7.4|
|     2.3|
|    13.8|
|     6.5|
|    45.2|
|    null|
|    19.6|
|     3.4|
|    21.1|
|    16.1|
|    18.0|
|    35.4|
|    20.8|
|    16.7|
|    16.5|
|    23.6|
|    12.8|
|     8.4|
|     2.5|
|    23.2|
|    39.0|
|     9.8|
|    17.5|
|    18.4|
|    12.3|
|    34.6|
|    44.2|
|    22.9|
|    12.1|
|     3.1|
|    21.3|
|    20.1|
|    13.9|
|    12.9|
|     2.7|
|    25.0|
|    14.2|
|    22.7|
|     4.1|
|    39.8|
|    12.4|
|     2.8|
|    34.9|
|    41.0|
|     9.3|
|     8.8|

#### 4. Bathroom

In [72]:
df_result.filter("Bathroom IS NULL").count()

res58: Long = 6447


In [73]:
df_result= df_result.filter(!df_result("Bathroom").isNull)
df_result.select("Bathroom").distinct.show()

+--------+
|Bathroom|
+--------+
|     8.0|
|     0.0|
|     7.0|
|     1.0|
|     4.0|
|     3.0|
|     2.0|
|     6.0|
|     5.0|
|     9.0|
+--------+



df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Rooms: double ... 20 more fields]


#### 5. Car

In [74]:
// Note: this counts nulls in the raw df (34857 rows), not in the filtered df_result
df.filter("Car IS NULL").count()

res60: Long = 8728


In [75]:
df_result.select("Car").distinct.show()

+----+
| Car|
+----+
| 8.0|
| 0.0|
| 7.0|
|null|
|18.0|
| 1.0|
| 4.0|
|11.0|
| 3.0|
| 2.0|
|10.0|
| 6.0|
| 5.0|
| 9.0|
+----+



In [76]:
df_result= df_result.filter(!df_result("Car").isNull)
df_result.select("Car").distinct.show()

+----+
| Car|
+----+
| 8.0|
| 0.0|
| 7.0|
|18.0|
| 1.0|
| 4.0|
|11.0|
| 3.0|
| 2.0|
|10.0|
| 6.0|
| 5.0|
| 9.0|
+----+



df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Rooms: double ... 20 more fields]


#### 6. Landsize

In [77]:
df_result.filter("Landsize IS NULL").count()

res63: Long = 2722


In [78]:
df_result.select("Landsize").distinct.count()

res64: Long = 1546


In [79]:
df_result= df_result.filter(!df_result("Landsize").isNull)
df_result.select("Landsize").distinct.count()

df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Rooms: double ... 20 more fields]
res65: Long = 1545


#### 7. BuildingArea

In [80]:
df_result.filter("BuildingArea IS NULL").count()

res66: Long = 8457


In [81]:
df_result.select("BuildingArea").distinct.count()

res67: Long = 640


In [82]:
df_result= df_result.filter(!df_result("BuildingArea").isNull)
df_result.select("BuildingArea").distinct.show(640)

+------------+
|BuildingArea|
+------------+
|       305.0|
|       558.0|
|       934.0|
|       496.0|
|       147.0|
|       170.0|
|       720.0|
|       184.0|
|    140.7481|
|       169.0|
|       160.0|
|        72.3|
|        53.3|
|        70.0|
|       311.0|
|        67.0|
|       168.0|
|        69.0|
|       206.0|
|         0.0|
|       650.0|
|      298.21|
|       389.0|
|       390.0|
|       594.0|
|      129.92|
|       365.0|
|       249.0|
|         7.0|
|       112.9|
|      115.96|
|       677.0|
|       401.0|
|       142.0|
|       191.0|
|       329.0|
|       112.0|
|       154.0|
|       35.64|
|      200.71|
|       113.6|
|       232.0|
|       521.0|
|       124.0|
|       438.0|
|       348.0|
|       303.0|
|       410.0|
|       625.0|
|       317.0|
|       253.0|
|       331.0|
|       153.1|
|       128.0|
|       201.0|
|       353.0|
|       235.0|
|       93.84|
|      409.54|
|      118.54|
|      121.84|
|       180.0|
|       108.0|
|       27

df_result: org.apache.spark.sql.DataFrame = [Suburb: string, Rooms: double ... 20 more fields]


In [83]:
df_result.count

res69: Long = 9244
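
The row count falls to 9244 because rows with a null in Price, Bathroom, Car, Landsize or BuildingArea were filtered out one column at a time. A sketch of the same kind of filter in a single call with `na.drop`, on a made-up frame (`session` and the sample rows are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's session if one already exists.
val session = SparkSession.builder.master("local[*]").appName("na-drop-sketch").getOrCreate()
import session.implicits._

// None becomes a null cell in the resulting DataFrame.
val sample = Seq[(Option[Double], Option[Double], Option[Double])](
  (Some(1480000.0), Some(1.0), Some(202.0)),
  (None, Some(1.0), Some(126.0)),
  (Some(1035000.0), None, Some(156.0))
).toDF("Price", "Bathroom", "Landsize")

// Keep only rows where none of the listed columns is null.
val complete = sample.na.drop(Seq("Price", "Bathroom", "Landsize"))
complete.show()
```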


#### 8. Propertycount

In [84]:
df_result.filter("Propertycount IS NULL").count()

res70: Long = 0


In [85]:
df_result.select("BuildingArea").distinct.count()

res71: Long = 639


In [86]:
df_result.select("Propertycount").distinct.show(639)

+-------------+
|Propertycount|
+-------------+
|       1369.0|
|       4385.0|
|        608.0|
|       3640.0|
|      11925.0|
|       4718.0|
|        973.0|
|       1554.0|
|        389.0|
|       2954.0|
|        249.0|
|        984.0|
|        902.0|
|       1130.0|
|       3755.0|
|        802.0|
|       2417.0|
|       5051.0|
|        438.0|
|       6065.0|
|      11308.0|
|       3284.0|
|        588.0|
|       8524.0|
|       4497.0|
|      10160.0|
|       3600.0|
|       4654.0|
|       6938.0|
|       4973.0|
|       2949.0|
|       1123.0|
|       2291.0|
|       1345.0|
|       1475.0|
|       6380.0|
|       3224.0|
|       8743.0|
|       6464.0|
|       1607.0|
|       6795.0|
|       9028.0|
|      11364.0|
|      17384.0|
|       7082.0|
|       4707.0|
|       6388.0|
|       7822.0|
|       6990.0|
|      10788.0|
|       1624.0|
|       5556.0|
|       4704.0|
|       2555.0|
|       8989.0|
|       5837.0|
|       5811.0|
|        642.0|
|       5262.0|
|       

In [87]:
df_result.describe().select("summary","Price", "Rooms","Distance","Bathroom","Car").show()

+-------+------------------+------------------+-----------------+------------------+------------------+
|summary|             Price|             Rooms|         Distance|          Bathroom|               Car|
+-------+------------------+------------------+-----------------+------------------+------------------+
|  count|              9244|              9244|             9244|              9244|              9244|
|   mean|1092328.5888143661|3.0981176979662486|11.24115101687565|1.6524231934227607|1.6953699697100821|
| stddev| 679621.2070858623| 0.964029421835851|6.882570344620598|0.7249913332194025| 0.975528875592598|
|    min|          131000.0|               1.0|              0.0|               1.0|               0.0|
|    max|         9000000.0|              12.0|             48.1|               9.0|              10.0|
+-------+------------------+------------------+-----------------+------------------+------------------+



In [88]:
df_result.describe().select("summary","Landsize","Propertycount").show()

+-------+------------------+-----------------+
|summary|          Landsize|    Propertycount|
+-------+------------------+-----------------+
|  count|              9244|             9244|
|   mean| 528.8338381652964|7463.866724361748|
| stddev|1212.9650900679794|4369.422309704771|
|    min|               0.0|            249.0|
|    max|           44500.0|          21650.0|
+-------+------------------+-----------------+



In [89]:
df_result.count()

res75: Long = 9244


#### Filtering null values

In [None]:
val df_not_null = df_result.na.drop
df_not_null.count()

In [None]:
df_not_null.printSchema()
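
`na.drop` with no arguments removes a row if any column is null, which can discard more data than intended. The sketch below contrasts it with two narrower variants; the toy DataFrame and its values are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("na-drop-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for df_result; the values are made up
val toy = Seq[(Option[String], Option[Double], Option[Double])](
  (Some("Abbotsford"), Some(1035000.0), Some(156.0)),
  (Some("Brighton"),   None,            Some(300.0)),
  (Some("Carrum"),     Some(495000.0),  None),
  (None,               None,            None)
).toDF("Suburb", "Price", "Landsize")

val dropAny     = toy.na.drop()              // drop rows with a null in any column
val dropNoPrice = toy.na.drop(Seq("Price"))  // drop rows only when Price is null
val dropEmpty   = toy.na.drop("all")         // drop rows only when every column is null

println(s"any: ${dropAny.count()}, Price only: ${dropNoPrice.count()}, all-null: ${dropEmpty.count()}")
```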


#### Imputing null values
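
This notebook removes rows with missing values rather than imputing them. As one possible alternative, a missing numeric value can be replaced by the column mean with `na.fill`. The sketch below is an assumption about how that could look, using a made-up toy DataFrame; Spark ML also provides an `Imputer` transformer for the same purpose.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.master("local[*]").appName("impute-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for df_result; the values are made up
val toy = Seq[(String, Option[Double])](
  ("Abbotsford", Some(100.0)),
  ("Brighton",   None),
  ("Carrum",     Some(300.0))
).toDF("Suburb", "Landsize")

// avg ignores nulls, so the mean is taken over the known values only
val meanLandsize = toy.select(avg("Landsize")).first().getDouble(0)

// Replace nulls in Landsize with that mean; other columns are left untouched
val imputed = toy.na.fill(Map("Landsize" -> meanLandsize))
imputed.show()
```

Mean imputation keeps the row count but flattens the variance of the column, so whether it is preferable to dropping rows depends on how the data is used afterwards.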

### Group By and Aggregation

In [91]:
df_result.groupBy("Suburb").agg(max("Price")).show()

+----------------+----------+
|          Suburb|max(Price)|
+----------------+----------+
|  Brunswick West| 2350000.0|
| South Melbourne| 3050000.0|
|    Ivanhoe East| 3850000.0|
|    Princes Hill| 1908000.0|
|      Cranbourne|  910500.0|
|         Ashwood| 1850000.0|
|       Brunswick| 2545000.0|
|South Kingsville| 1030000.0|
|        Brighton| 5150000.0|
|        Oak Park| 1390000.0|
|         Doveton|  707000.0|
|       Albanvale|  560000.0|
|      Brookfield|  538000.0|
|     Pascoe Vale| 1570000.0|
| Blackburn North| 1570000.0|
|     Sandringham| 2700000.0|
|   Botanic Ridge|  750000.0|
|          Carrum|  980000.0|
|      Ascot Vale| 2425000.0|
|  Sunshine North| 1190000.0|
+----------------+----------+
only showing top 20 rows



In [92]:
df_result.groupBy("Suburb").agg(min("Price")).show()

+----------------+----------+
|          Suburb|min(Price)|
+----------------+----------+
|  Brunswick West|  252000.0|
| South Melbourne|  407500.0|
|    Ivanhoe East|  470000.0|
|    Princes Hill| 1600000.0|
|      Cranbourne|  720008.0|
|         Ashwood|  600000.0|
|       Brunswick|  272500.0|
|South Kingsville|  380000.0|
|        Brighton|  290000.0|
|        Oak Park|  470000.0|
|         Doveton|  450000.0|
|       Albanvale|  506000.0|
|      Brookfield|  456000.0|
|     Pascoe Vale|  248500.0|
| Blackburn North|  706000.0|
|     Sandringham|  643500.0|
|   Botanic Ridge|  750000.0|
|          Carrum|  495000.0|
|      Ascot Vale|  390000.0|
|  Sunshine North|  435000.0|
+----------------+----------+
only showing top 20 rows



In [93]:
df_result.groupBy("Distance").agg(round(mean("Price"),0)).show()

+--------+--------------------+
|Distance|round(avg(Price), 0)|
+--------+--------------------+
|    14.9|            739543.0|
|    15.5|           1038966.0|
|    13.4|           1244368.0|
|    15.4|           1018147.0|
|     2.4|           1192184.0|
|     8.0|           1159445.0|
|    10.2|           1958140.0|
|    24.7|            698971.0|
|    32.3|            623000.0|
|     0.0|            603000.0|
|    17.9|            841709.0|
|    43.3|            563000.0|
|    11.4|           1040141.0|
|     5.4|           2048658.0|
|    16.6|            658000.0|
|    23.8|            625500.0|
|     7.0|           1062146.0|
|    11.5|            746754.0|
|     3.5|           1397590.0|
|    31.7|            494725.0|
+--------+--------------------+
only showing top 20 rows
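
The separate maximum, minimum and mean cells above can be combined into a single `agg` call so that each group is computed once. A minimal sketch on a toy DataFrame (the rows and the alias names such as `max_price` are chosen for illustration only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, desc, lit, max, min, round}

val spark = SparkSession.builder.master("local[*]").appName("groupby-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for df_result
val toy = Seq(
  ("Brighton", 5150000.0),
  ("Brighton", 290000.0),
  ("Carrum",   980000.0),
  ("Carrum",   495000.0)
).toDF("Suburb", "Price")

// Several aggregates per suburb in one pass, sorted by the highest sale price
val summary = toy.groupBy("Suburb").agg(
  max("Price").alias("max_price"),
  min("Price").alias("min_price"),
  round(avg("Price"), 0).alias("avg_price"),
  count(lit(1)).alias("sales")
).orderBy(desc("max_price"))

summary.show()
```

Adding `orderBy` also makes the output deterministic; without it, `show()` prints the groups in whatever order the shuffle produces.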



### Correlation
 <span style="font-size:16pt;color: red">TO DO: FIND OUT CORRELATIONS between these attributes.</span>

- Price
- Distance
- Rooms
- Bathroom
- Car
- Landsize
- BuildingArea
- YearBuilt

In [94]:
import org.apache.spark.sql.functions.corr
df_result.select(corr("Distance","Price")).show()

+---------------------+
|corr(Distance, Price)|
+---------------------+
| -0.23109535854491728|
+---------------------+



import org.apache.spark.sql.functions.corr
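
Spark's `corr` returns the Pearson correlation coefficient, that is, the covariance of the two columns divided by the product of their standard deviations. To make the number easier to interpret, here is a plain Scala sketch of the same formula; the function name `pearson` and the sample values are made up for illustration:

```scala
// Pearson correlation of two equally sized samples, computed from first principles
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty, "samples must be non-empty and the same length")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  // Sum of products of deviations (unnormalised covariance)
  val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  // Sums of squared deviations (unnormalised variances)
  val varX = xs.map(x => math.pow(x - meanX, 2)).sum
  val varY = ys.map(y => math.pow(y - meanY, 2)).sum
  cov / math.sqrt(varX * varY)
}

// Made-up values in which price falls as distance grows
val distance = Seq(2.5, 5.0, 10.0, 20.0)
val price    = Seq(1500000.0, 1200000.0, 900000.0, 600000.0)
println(pearson(distance, price))
```

The result always lies between -1 and 1. A value near 0, such as the one for `Landsize` further down, indicates little linear relationship, although a non-linear one may still exist.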


In [95]:
df_result.select(corr("Rooms","Price")).show()

+------------------+
|corr(Rooms, Price)|
+------------------+
|0.4725354566255438|
+------------------+



In [96]:
df_result.select(corr("Bathroom","Price")).show()

+---------------------+
|corr(Bathroom, Price)|
+---------------------+
|  0.46131865546350354|
+---------------------+



In [97]:
df_result.select(corr("Car","Price")).show()

+------------------+
|  corr(Car, Price)|
+------------------+
|0.2084128379555887|
+------------------+



In [98]:
df_result.select(corr("BuildingArea","Price")).show()

+-------------------------+
|corr(BuildingArea, Price)|
+-------------------------+
|      0.09644609706275929|
+-------------------------+



In [99]:
df_result.select(corr("Landsize","Price")).show()

+---------------------+
|corr(Landsize, Price)|
+---------------------+
|  0.05230373296484904|
+---------------------+



In [100]:
df_result.select(corr("Date","Price")).show()

+-----------------+
|corr(Date, Price)|
+-----------------+
|             null|
+-----------------+
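
The result is `null` most likely because `Date` is still a string column, which `corr` cannot treat as a number. One possible workaround is to convert the date to a numeric value first. The sketch below assumes a `d/MM/yyyy` date pattern and uses made-up rows; both are assumptions, not taken from the data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, corr, unix_timestamp}

val spark = SparkSession.builder.master("local[*]").appName("date-corr-sketch").getOrCreate()
import spark.implicits._

// Made-up rows; the date pattern used below is an assumption about the Date column
val toy = Seq(
  ("13/09/2016", 850000.0),
  ("24/12/2016", 910000.0),
  ("18/03/2017", 1020000.0),
  ("27/05/2017", 990000.0)
).toDF("Date", "Price")

// Seconds since the epoch give corr() a numeric column to work with
val withDateNum = toy.withColumn(
  "DateNum",
  unix_timestamp(col("Date"), "d/MM/yyyy").cast("double")
)

val dateCorr = withDateNum.select(corr("DateNum", "Price")).first()
println(dateCorr.getDouble(0))
```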



In [101]:
df_result.select(corr("YearBuilt","Price")).show()

+----------------------+
|corr(YearBuilt, Price)|
+----------------------+
|  -0.31382021680954214|
+----------------------+



In [102]:
df_result.select(corr("Propertycount","Price")).show()

+--------------------------+
|corr(Propertycount, Price)|
+--------------------------+
|      -0.05930593500382...|
+--------------------------+
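
The individual correlation cells above can be condensed into one aggregation over a list of feature names. A minimal sketch on a toy DataFrame (the rows are made up; with `df_result` the list would contain the numeric columns used above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.corr

val spark = SparkSession.builder.master("local[*]").appName("corr-loop-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for df_result: Price rises with Rooms and falls with Distance
val toy = Seq(
  (1.0, 10.0, 100000.0),
  (2.0,  8.0, 200000.0),
  (3.0,  6.0, 300000.0)
).toDF("Rooms", "Distance", "Price")

val features = Seq("Rooms", "Distance")

// One job computing corr(feature, Price) for every feature
val corrRow = toy.select(features.map(f => corr(f, "Price").alias(f)): _*).first()

features.zipWithIndex.foreach { case (f, i) =>
  println(f"$f%-10s ${corrRow.getDouble(i)}%.4f")
}
```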



### Select relevant features:

In [103]:

df_result = df_result.select("Price","Method","Type","Distance","Rooms","Bathroom","Car","Landsize","Propertycount", "Suburb","Street","Date")

df_result: org.apache.spark.sql.DataFrame = [Price: double, Method: string ... 10 more fields]


#### Write the clean data to HDFS:

In [104]:
! hadoop fs -mkdir -p /tmp/output

In [109]:
! hadoop fs -ls -R /tmp

drwxrwx---   - root supergroup          0 2020-05-20 13:17 /tmp/hadoop-yarn
drwxrwx---   - root supergroup          0 2020-05-20 13:17 /tmp/hadoop-yarn/staging
drwxrwx---   - root supergroup          0 2020-05-20 13:17 /tmp/hadoop-yarn/staging/history
drwxrwx---   - root supergroup          0 2020-05-20 13:17 /tmp/hadoop-yarn/staging/history/done
drwxrwxrwt   - root supergroup          0 2020-05-20 13:17 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x   - root supergroup          0 2020-05-21 04:16 /tmp/output
drwxr-xr-x   - root supergroup          0 2020-05-20 13:21 /tmp/rs_in
-rw-r--r--   1 root root          5018236 2020-05-15 05:20 /tmp/rs_in/mh.csv




In [None]:
df_result.coalesce(1).write.format("csv").option("header","true").mode("overwrite").option("sep",",").save("hdfs://localhost:9000/tmp/output")


In [None]:
df_result.coalesce(1).write.option("header","true").mode("overwrite").csv("hdfs://localhost:9000/tmp/output")

In [108]:
! hadoop fs -ls /tmp/output

In [None]:
!rm ./output.csv 

Copy the clean data from HDFS to the local disk:

In [None]:
! hadoop fs -copyToLocal /tmp/output/\*.csv ./../data-clean/mh.csv
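
Spark writes a directory of `part-*` files rather than a single named file, which is why the copy above relies on a wildcard. As an alternative to the shell command, the part file can be located and renamed through the Hadoop `FileSystem` API. The sketch below writes to a temporary local directory instead of HDFS so that it runs on its own; the paths and the toy rows are assumptions for illustration:

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("single-csv-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for df_result
val toy = Seq(("Abbotsford", 1035000.0), ("Brighton", 5150000.0)).toDF("Suburb", "Price")

// Hypothetical scratch locations on the local filesystem (the notebook writes to HDFS instead)
val workDir = Files.createTempDirectory("mh_sketch").toUri.toString
val outDir  = new Path(workDir, "output")
val target  = new Path(workDir, "mh.csv")

// coalesce(1) yields a single part-*.csv file inside the output directory
toy.coalesce(1).write.option("header", "true").mode("overwrite").csv(outDir.toString)

// Find that part file and give it a stable name
val fs = outDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(outDir, "part-*.csv")).head.getPath
fs.rename(partFile, target)
println(s"Wrote $target")
```

On the command line, `hadoop fs -getmerge` is another option for collecting the contents of an output directory into one local file.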


