## Preparing HDFS

Using magic

Create input folder on HDFS if not exists

Copy from data from local

In [6]:
val spark = org.apache.spark.sql.SparkSession.builder
        .master("local") 
        .appName("Spark CSV Reader")
        .getOrCreate;



spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6e3170a4


In [7]:
! pwd
! hadoop fs -mkdir -p  /tmp/rs_input
! hadoop fs -put   -p  ./../data-raw/Melbourne_housing_FULL.csv             /tmp/rs_input/raw.csv
! hadoop fs -ls        /tmp/rs_input/
!hadoop fs -cat /tmp/rs_input/raw.csv | wc -l

/home/sandpit/big-data-realestate/big-data-realestate/scripts

put: `/tmp/rs_input/raw.csv': File exists


Found 1 items


-rw-r--r--   1 root root    5018236 2020-05-15 05:20 /tmp/rs_input/raw.csv


34858



In [8]:
//load raw into df
val df_raw = spark
    .read
    .format("csv")
    .option("header", "true")
    .load("hdfs://localhost:9000/tmp/rs_input/raw.csv")

df_raw: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]


### Data Exploration

In [9]:
val df_raw = spark.read.format("csv").option("header", "true").load("hdfs://localhost:9000/tmp/rs_input/raw.csv")

df_raw: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 19 more fields]


## Analysis

### Print schema:

In [10]:
df_raw.printSchema()

root
 |-- Suburb: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Rooms: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Method: string (nullable = true)
 |-- SellerG: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- Postcode: string (nullable = true)
 |-- Bedroom2: string (nullable = true)
 |-- Bathroom: string (nullable = true)
 |-- Car: string (nullable = true)
 |-- Landsize: string (nullable = true)
 |-- BuildingArea: string (nullable = true)
 |-- YearBuilt: string (nullable = true)
 |-- CouncilArea: string (nullable = true)
 |-- Lattitude: string (nullable = true)
 |-- Longtitude: string (nullable = true)
 |-- Regionname: string (nullable = true)
 |-- Propertycount: string (nullable = true)



In [11]:
df_raw.columns

res2: Array[String] = Array(Suburb, Address, Rooms, Type, Price, Method, SellerG, Date, Distance, Postcode, Bedroom2, Bathroom, Car, Landsize, BuildingArea, YearBuilt, CouncilArea, Lattitude, Longtitude, Regionname, Propertycount)


### Show column types:

In [12]:
df_raw.dtypes

res3: Array[(String, String)] = Array((Suburb,StringType), (Address,StringType), (Rooms,StringType), (Type,StringType), (Price,StringType), (Method,StringType), (SellerG,StringType), (Date,StringType), (Distance,StringType), (Postcode,StringType), (Bedroom2,StringType), (Bathroom,StringType), (Car,StringType), (Landsize,StringType), (BuildingArea,StringType), (YearBuilt,StringType), (CouncilArea,StringType), (Lattitude,StringType), (Longtitude,StringType), (Regionname,StringType), (Propertycount,StringType))


All features seem to be stored as text therefore numerical values will need to be converted to Integer and Float accordingly.
In addition, the categorical variables of Type and Method will need to be converted to integers as factors.

## Display the rows:

In [13]:
df_raw.select("Suburb","Address","POSTCODE","SUBURB","TYPE","METHOD","SELLERG","DATE","COUNCILAREA","REGIONNAME","YEARBUILT").show()

+----------+-------------------+--------+----------+----+------+-------+---------+------------------+--------------------+---------+
|    Suburb|            Address|POSTCODE|    SUBURB|TYPE|METHOD|SELLERG|     DATE|       COUNCILAREA|          REGIONNAME|YEARBUILT|
+----------+-------------------+--------+----------+----+------+-------+---------+------------------+--------------------+---------+
|Abbotsford|      68 Studley St|    3067|Abbotsford|   h|    SS| Jellis|3/09/2016|Yarra City Council|Northern Metropol...|     null|
|Abbotsford|       85 Turner St|    3067|Abbotsford|   h|     S| Biggin|3/12/2016|Yarra City Council|Northern Metropol...|     null|
|Abbotsford|    25 Bloomburg St|    3067|Abbotsford|   h|     S| Biggin|4/02/2016|Yarra City Council|Northern Metropol...|     1900|
|Abbotsford| 18/659 Victoria St|    3067|Abbotsford|   u|    VB| Rounds|4/02/2016|Yarra City Council|Northern Metropol...|     null|
|Abbotsford|       5 Charles St|    3067|Abbotsford|   h|    SP| Bigg

In [14]:
df_raw.select("Landsize","BuildingArea","YearBuilt","CouncilArea","Lattitude","Longtitude","Regionname","Propertycount").show(10)

+--------+------------+---------+------------------+---------+----------+--------------------+-------------+
|Landsize|BuildingArea|YearBuilt|       CouncilArea|Lattitude|Longtitude|          Regionname|Propertycount|
+--------+------------+---------+------------------+---------+----------+--------------------+-------------+
|     126|        null|     null|Yarra City Council| -37.8014|  144.9958|Northern Metropol...|         4019|
|     202|        null|     null|Yarra City Council| -37.7996|  144.9984|Northern Metropol...|         4019|
|     156|          79|     1900|Yarra City Council| -37.8079|  144.9934|Northern Metropol...|         4019|
|       0|        null|     null|Yarra City Council| -37.8114|  145.0116|Northern Metropol...|         4019|
|     134|         150|     1900|Yarra City Council| -37.8093|  144.9944|Northern Metropol...|         4019|
|      94|        null|     null|Yarra City Council| -37.7969|  144.9969|Northern Metropol...|         4019|
|     120|         

The data seems to be relatively clean however further exploration is required.

## Descriptive Statistics:

### Summary

In [67]:
df_raw.describe().select("summary",
                        "Suburb",
                        "Address",
                        "Rooms",
                        "Type",
                        "Price",
                        "Method",
                        "SellerG",
                        "Date").show()

+-------+----------+----------------+------------------+-----+-----------------+------+-----------+---------+
|summary|    Suburb|         Address|             Rooms| Type|            Price|Method|    SellerG|     Date|
+-------+----------+----------------+------------------+-----+-----------------+------+-----------+---------+
|  count|     34857|           34857|             34857|34857|            27247| 34857|      34857|    34857|
|   mean|      null|            null|3.0310124221820582| null|1050173.344955408|  null|       null|     null|
| stddev|      null|            null|0.9699329348975204| null|641467.1301045999|  null|       null|     null|
|    min|Abbotsford|1 Abercrombie St|                 1|    h|          1000000|    PI|    @Realty|1/07/2017|
|    max|  viewbank|   9b Stewart St|                 9|    u|           999999|     W|voglwalpole|9/12/2017|
+-------+----------+----------------+------------------+-----+-----------------+------+-----------+---------+



In [68]:
df_raw.describe().select("summary",
                        "Distance",
                        "Postcode",
                        "Bedroom2",
                        "Bathroom",
                        "Car",
                        "Landsize").show()

+-------+------------------+------------------+------------------+------------------+------------------+------------------+
|summary|          Distance|          Postcode|          Bedroom2|          Bathroom|               Car|          Landsize|
+-------+------------------+------------------+------------------+------------------+------------------+------------------+
|  count|             34857|             34857|             26640|             26631|             26129|             23047|
|   mean|11.184929423916007| 3116.062858618315|3.0846471471471473| 1.624798167549097|1.7288453442535114|  593.598993361392|
| stddev| 6.788892455935938|109.02390274290613|0.9806897285461588|0.7242120114699068|1.0107707853554244|3398.8419464599056|
|    min|              #N/A|              #N/A|                 0|                 0|                 0|                 0|
|    max|               9.9|              3978|                 9|                 9|                 9|               999|
+-------

In [69]:
df_raw.describe().select("summary",
                        "BuildingArea",
                        "YearBuilt",
                        "CouncilArea",
                        "Lattitude",
                        "Longtitude",
                        "Regionname").show()

+-------+------------------+------------------+--------------------+-------------------+-------------------+----------------+
|summary|      BuildingArea|         YearBuilt|         CouncilArea|          Lattitude|         Longtitude|      Regionname|
+-------+------------------+------------------+--------------------+-------------------+-------------------+----------------+
|  count|             13742|             15551|               34857|              26881|              26881|           34857|
|   mean| 160.2564003565711| 1965.289884894862|                null|-37.810634295599094| 145.00185113165438|            null|
| stddev|401.26706008485496|37.328178023136616|                null| 0.0902789045092229|0.12016876915353476|            null|
|    min|                 0|              1196|                #N/A|           -37.3902|          144.42379|            #N/A|
|    max|               999|              2106|Yarra Ranges Shir...|          -38.19043|          145.52635|Western Vi

In [70]:
df_raw.describe().select("summary",
                         "Propertycount").show()

+-------+------------------+
|summary|     Propertycount|
+-------+------------------+
|  count|             34857|
|   mean|7572.8883055029555|
| stddev|4428.0903132746425|
|    min|              #N/A|
|    max|               984|
+-------+------------------+



Based on the above summary statistics we can see the following:
* The count of a column less the rows in the dataset, display the null values.
* The Address will need to be stripped down to its Street Name and Street Type. This is to  

### Correlation:

Assess the correlation between the Price and the other features to better understand their relationship and importance.

In [71]:
// Import Correlation Library
import org.apache.spark.sql.functions.corr

import org.apache.spark.sql.functions.corr


In [20]:
df_raw.select(corr("Rooms","Price")).show()
df_raw.select(corr("Distance","Price")).show()
df_raw.select(corr("Postcode","Price")).show()
df_raw.select(corr("Bedroom2","Price")).show()
df_raw.select(corr("Bathroom","Price")).show()
df_raw.select(corr("Car","Price")).show()
df_raw.select(corr("Landsize","Price")).show()
df_raw.select(corr("BuildingArea","Price")).show()
df_raw.select(corr("YearBuilt","Price")).show()
df_raw.select(corr("Lattitude","Price")).show()
df_raw.select(corr("Longtitude","Price")).show()
df_raw.select(corr("Propertycount","Price")).show()

+-------------------+
| corr(Rooms, Price)|
+-------------------+
|0.46523834510759615|
+-------------------+

+---------------------+
|corr(Distance, Price)|
+---------------------+
| -0.21138434279157942|
+---------------------+

+---------------------+
|corr(Postcode, Price)|
+---------------------+
|  0.04494983007693704|
+---------------------+

+---------------------+
|corr(Bedroom2, Price)|
+---------------------+
|   0.4302753383233543|
+---------------------+

+---------------------+
|corr(Bathroom, Price)|
+---------------------+
|   0.4298780777015672|
+---------------------+

+-------------------+
|   corr(Car, Price)|
+-------------------+
|0.20180256061576263|
+-------------------+

+---------------------+
|corr(Landsize, Price)|
+---------------------+
| 0.032748365249470925|
+---------------------+

+-------------------------+
|corr(BuildingArea, Price)|
+-------------------------+
|       0.1007536394731018|
+-------------------------+

+----------------------+
|corr(Y

Landsize having very minimal correlation with price does not align with expectations therefore further analysis is required.

We will compare the corration of Price with Landsize based on the Property type.

- h  - house,cottage,villa, semi,terrace;
- u  - unit, duplex;
- t  - townhouse;


In [25]:
// Corralation of Landsize and Price for Houses
df_raw.where($"Type" === "h").select(corr("Landsize","Price")).show()

+---------------------+
|corr(Landsize, Price)|
+---------------------+
| 0.025980927743436796|
+---------------------+



In [26]:
// Corralation of Landsize and Price for Units
df_raw.where($"Type" === "u").select(corr("Landsize","Price")).show()

+---------------------+
|corr(Landsize, Price)|
+---------------------+
|  0.05064203615229057|
+---------------------+



In [27]:
// Corralation of Landsize and Price for Townhouses
df_raw.where($"Type" === "t").select(corr("Landsize","Price")).show()

+---------------------+
|corr(Landsize, Price)|
+---------------------+
|  0.09629920710291465|
+---------------------+



In [28]:
// Create a new DataFrame with Price per SQM
val df_landprice = df_raw.withColumn("PriceperSQM", col("Price") / col("Landsize"))

// Assess Correlation of Price with Price per SQM
df_landprice.select(corr("Price","PriceperSQM")).show()

+------------------------+
|corr(Price, PriceperSQM)|
+------------------------+
|     0.11049815760669206|
+------------------------+



df_landprice: org.apache.spark.sql.DataFrame = [Suburb: string, Address: string ... 20 more fields]


As expected, the Price vs PriceperSQM correlation is higher at ~11%, however this is due to price being included and compared against itself.

Therefore, it seems that Landsize on its own doesn't seem to have a significant correlation. However, correlation changes depending on the type of property this is. With a townhouse having the highest positive correlation of 0.096 or 9.6%.

As a result we conclude on keeping the LandSize.

In [29]:
df_raw.select("Lattitude").distinct.count()

res19: Long = 13403


In [30]:
df_raw.select("Longtitude").distinct.count()

res20: Long = 14525


#### Based on the preliminary analysis above, the features we have identified as important for the future model are:

* Address
* Suburb
* Date
* Price
* Method
* Type
* Distance
* Rooms
* Bathroom
* Car
* Landsize
* Lattitude
* Longtitude

The excluded features are:

* SellerG
* Postcode
* Bedroom2
* BuildingArea
* YearBuilt
* CouncilArea
* Regionname
* Propertycount


Therefore we will continue our analysis on the following.

### Categorical Attributes

#### Address

In [31]:
df_raw.filter("Address IS NULL").count()

res21: Long = 0


In [32]:
df_raw.select("Address").distinct.show()

+-------------------+
|            Address|
+-------------------+
|      557 Orrong Rd|
|      19 Poulter St|
|    43 Riverside Av|
|       11 South Tce|
|  41 Marlborough St|
|          4 Park Cr|
|        3/3 Dega Av|
|        93 Tudor St|
|         10 Kent Rd|
|       18 Thomas St|
|   1/1 Glen Iris Rd|
|      7 Allambee Av|
|    83 Truganini Rd|
|       130 Keele St|
|       8 Winters Wy|
|     36a Mitford St|
|   7/223 Station St|
|1/146 Ascot Vale Rd|
|    5/60 Farnham St|
|      22 Renwick St|
+-------------------+
only showing top 20 rows



As seen above based on the complexity of Addresses, it would be useful to reduce them to a Street Name and Type in order to create categories within suburbs based on a street.

#### Suburb

In [33]:
df_raw.select("Suburb").distinct.show()

+----------------+
|          Suburb|
+----------------+
|  Brunswick West|
| South Melbourne|
|    Ivanhoe East|
|    Princes Hill|
|      Cranbourne|
|         Ashwood|
|       Brunswick|
|South Kingsville|
|        Brighton|
|        Oak Park|
|         Doveton|
|       Albanvale|
|      Brookfield|
|        Lynbrook|
|     Ferny Creek|
|     Pascoe Vale|
| Blackburn North|
|         croydon|
|     Sandringham|
|   Botanic Ridge|
+----------------+
only showing top 20 rows



In [34]:
df_raw.select("Suburb").distinct.count()

res24: Long = 351


The Suburbs generally seem to correct what will need to done is:
* Capitalise the first letter of the suburb names
* Also the North/West/South/East suffixes to suburbs will be left, as they provide a more accurate location within a suburb.

#### Date

In [35]:
val dates = df_raw.select("Date").distinct()

dates: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Date: string]


In [36]:
dates.count()

res25: Long = 78


In [37]:
df_raw.filter("Date IS NULL").count()

res26: Long = 0


In [38]:
dates.show(80)

+----------+
|      Date|
+----------+
|16/04/2016|
|29/04/2017|
|10/12/2016|
|19/08/2017|
| 7/05/2016|
| 8/07/2017|
| 4/03/2017|
|29/07/2017|
|27/05/2017|
|28/10/2017|
| 9/09/2017|
|26/07/2016|
|12/11/2016|
|25/02/2017|
| 6/05/2017|
|18/11/2017|
| 3/09/2016|
| 3/12/2016|
|25/11/2017|
| 3/06/2017|
|23/04/2016|
|30/09/2017|
|21/10/2017|
| 7/11/2016|
|17/03/2018|
|18/03/2017|
| 4/06/2016|
|28/08/2016|
|24/06/2017|
|13/08/2016|
| 6/01/2018|
|12/08/2017|
| 3/02/2018|
| 8/04/2017|
|22/04/2017|
|20/05/2017|
|17/09/2016|
|12/06/2016|
|14/05/2016|
| 4/11/2017|
|24/02/2018|
|14/10/2017|
| 8/10/2016|
|10/09/2016|
|20/01/2018|
|16/07/2016|
|11/03/2017|
| 9/12/2017|
| 7/10/2017|
|13/05/2017|
|23/09/2017|
|17/06/2017|
|15/10/2016|
|10/02/2018|
|27/06/2016|
|27/11/2016|
|30/07/2016|
|28/01/2016|
| 3/03/2018|
|16/09/2017|
|26/08/2017|
|22/05/2016|
|28/05/2016|
|22/07/2017|
| 3/09/2017|
|15/07/2017|
|24/09/2016|
|17/02/2018|
| 6/08/2016|
|22/08/2016|
| 1/07/2017|
|18/06/2016|
|19/11/2016|
|11/02/2017|

#### Method

In [39]:
df_raw.select("Method").distinct.show()

+------+
|Method|
+------+
|    PI|
|    SA|
|    SP|
|    VB|
|    PN|
|     W|
|     S|
|    SN|
|    SS|
+------+



#### Null values  

In [40]:
df_raw.filter("Method IS NULL").count()

res29: Long = 0


#### Type 
#### Distinct values 

In [41]:
df_raw.select("Type").distinct.show()

+----+
|Type|
+----+
|   h|
|   u|
|   t|
+----+



#### 9. Regionname

In [46]:
df_raw.select("Regionname").distinct.show()

+--------------------+
|          Regionname|
+--------------------+
|South-Eastern Met...|
|Western Metropolitan|
|Eastern Metropolitan|
|    Eastern Victoria|
|                #N/A|
|   Northern Victoria|
|Northern Metropol...|
|Southern Metropol...|
|    Western Victoria|
+--------------------+



In [47]:
df_raw.filter("Regionname IS NULL").count()

res36: Long = 0


# Wrangling

### The cleansing process based on the above findings has been completed in a separate notebook

# Secondary Analysis

Perform a secondary analysis on the clean dataset to compare it with the original.

In [52]:
! hadoop fs -mkdir -p  /tmp/output
! hadoop fs -put   -p  ./../data-clean/*.csv             /tmp/output

put: `/tmp/output/cleanMelbourneData.csv': File exists




In [53]:
// Load Clean Dataset into a DataFrame from HDFS after wrangling is completed
val df_clean = spark
    .read
    .format("csv")
    .option("header", "true")
    .load("hdfs://localhost:9000/tmp/output/*.csv")

df_clean: org.apache.spark.sql.DataFrame = [Price: string, MethodOfSale: string ... 11 more fields]


In [54]:
// Count the rows within the imported file
df_clean.count()

res40: Long = 15728


In [55]:
df_clean.printSchema()

root
 |-- Price: string (nullable = true)
 |-- MethodOfSale: string (nullable = true)
 |-- PropertyType: string (nullable = true)
 |-- DistanceFromCBD: string (nullable = true)
 |-- Rooms: string (nullable = true)
 |-- Bathroom: string (nullable = true)
 |-- Car: string (nullable = true)
 |-- Landsize: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longtitude: string (nullable = true)
 |-- Suburb: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- StreetName: string (nullable = true)



In [56]:
df_clean.select("Price", 
                "MethodOfSale", 
                "PropertyType", 
                "DistanceFromCBD", 
                "Rooms", 
                "Bathroom", 
                "Car", 
                "Landsize", 
                "Latitude", 
                "Longtitude", 
                "Suburb", 
                "Date", 
                "StreetName").show()

+---------+------------+------------+---------------+-----+--------+---+--------+--------+----------+----------+----------+-------------+
|    Price|MethodOfSale|PropertyType|DistanceFromCBD|Rooms|Bathroom|Car|Landsize|Latitude|Longtitude|    Suburb|      Date|   StreetName|
+---------+------------+------------+---------------+-----+--------+---+--------+--------+----------+----------+----------+-------------+
|1480000.0|           1|           1|            2.5|    2|       1|  1|   202.0|-37.7996|  144.9984|Abbotsford| 3/12/2016|    Turner St|
|1035000.0|           1|           1|            2.5|    2|       1|  0|   156.0|-37.8079|  144.9934|Abbotsford| 4/02/2016| Bloomburg St|
|1465000.0|           2|           1|            2.5|    3|       2|  0|   134.0|-37.8093|  144.9944|Abbotsford| 4/03/2017|   Charles St|
| 850000.0|           3|           1|            2.5|    3|       2|  1|    94.0|-37.7969|  144.9969|Abbotsford| 4/03/2017|Federation La|
|1600000.0|           6|          

### Descriptive Statistics:

In [57]:
df_clean.describe().select("Summary", 
                           "Price", 
                           "MethodOfSale", 
                           "PropertyType", 
                           "DistancefromCBD", 
                           "Rooms", 
                           "Bathroom").show()

+-------+-----------------+------------------+------------------+------------------+------------------+------------------+
|Summary|            Price|      MethodOfSale|      PropertyType|   DistancefromCBD|             Rooms|          Bathroom|
+-------+-----------------+------------------+------------------+------------------+------------------+------------------+
|  count|            15728|             15728|             15728|             15728|             15728|             15728|
|   mean|1150668.961851475|1.8519201424211598|1.2367115971515767|11.743203204475906| 3.187817904374364|1.6276068158697863|
| stddev|663018.9561677927|1.5641294862942188|0.5723389425747235| 6.700598921434572|0.8816674199865697|0.7148157821249159|
|    min|           1.12E7|                 1|                 1|               0.0|                 1|                 0|
|    max|         999999.0|                 8|                 3|               9.9|                 8|                 9|
+-------+-------

In [58]:
df_clean.describe().select("summary", 
                           "Car",
                           "Landsize", 
                           "Latitude", 
                           "Longtitude", 
                           "Suburb", 
                           "Date", 
                           "StreetName").show()

+-------+------------------+------------------+-------------------+-------------------+----------+---------+-----------+
|summary|               Car|          Landsize|           Latitude|         Longtitude|    Suburb|     Date| StreetName|
+-------+------------------+------------------+-------------------+-------------------+----------+---------+-----------+
|  count|             15728|             15728|              15728|              15728|     15728|    15728|      15728|
|   mean|1.7699008138351984| 668.6406408952187| -37.80422331192775|  144.9974195784588|      null|     null|       null|
| stddev|1.0185620996735723|4008.2209550592024|0.09275044706405701|0.12243663700518545|      null|     null|       null|
|    min|                 0|             100.0|          -37.39946|          144.42379|Abbotsford|1/07/2017|Aanensen Ct|
|    max|                 9|             999.0|          -38.19043|          145.52635|Yarraville|9/12/2017|Zurzolo Tce|
+-------+------------------+----

### Crosstabulation of categorical attributes

**PropertyType:**

house => "H" => 1, 
unit => "U" => 2, 
townhouse  => "T" => 3

**MethodOfSale:**

property sold => "S" => 1

property sold prior  => "SP" => 2

property passed in  => "PI" => 3

sold prior not disclosed  => "PN" => 4

sold not disclosed  => "SN" => 5

vendor bid  => "VB" => 6

withdrawn prior to auction  => "W" => 7

sold after auction  => "SA" => 8

sold after auction price not disclosed  => "SS" => 9  

In [62]:
df_clean.stat.crosstab("MethodOfSale","PropertyType").show()

+-------------------------+----+---+---+
|MethodOfSale_PropertyType|   1|  2|  3|
+-------------------------+----+---+---+
|                        8|  97| 10|  9|
|                        6|1142|130|126|
|                        1|8804|916|727|
|                        2|1589|208|140|
|                        3|1528|149|153|
+-------------------------+----+---+---+



In [81]:
// show suburbs with most sold houses
df_clean.stat.crosstab("Suburb","PropertyType").orderBy(desc("1")).show()

+-------------------+---+---+---+
|Suburb_PropertyType|  1|  2|  3|
+-------------------+---+---+---+
|          Reservoir|317| 68| 24|
|            Preston|247| 21| 10|
|     Bentleigh East|203| 29| 56|
|       Balwyn North|192| 10| 14|
|             Coburg|190|  7| 22|
|          Brunswick|189| 15| 16|
|          Northcote|186|  9|  7|
|           Richmond|167| 35| 16|
|            Glenroy|164| 15| 26|
|         Yarraville|163|  3|  7|
|           Essendon|158| 33| 16|
|          Glen Iris|157| 17| 16|
|            Newport|143|  3| 10|
|                Kew|142| 14| 12|
|        Pascoe Vale|142| 23| 30|
|      Brighton East|140| 15| 21|
|           Brighton|136| 27| 25|
|       Moonee Ponds|136|  7| 11|
|        Keilor East|134|  5| 18|
|         Camberwell|126| 23| 13|
+-------------------+---+---+---+
only showing top 20 rows



In [82]:
// show suburbs with most sold units
df_clean.stat.crosstab("Suburb","PropertyType").orderBy(desc("2")).show()

+-------------------+---+---+---+
|Suburb_PropertyType|  1|  2|  3|
+-------------------+---+---+---+
|          Reservoir|317| 68| 24|
|           St Kilda| 44| 42|  2|
|           Hawthorn| 95| 38|  5|
|           Richmond|167| 35| 16|
|           Essendon|158| 33| 16|
|        South Yarra| 58| 31|  7|
|             Elwood| 36| 31|  5|
|     Bentleigh East|203| 29| 56|
|           Carnegie| 63| 29| 15|
|           Brighton|136| 27| 25|
|       Surrey Hills|109| 25|  7|
|        Pascoe Vale|142| 23| 30|
|         Camberwell|126| 23| 13|
|             Altona| 51| 21| 10|
|            Preston|247| 21| 10|
|      Hawthorn East| 79| 20|  7|
|        Murrumbeena| 44| 19| 12|
|     Port Melbourne| 93| 17|  3|
|            Hampton|108| 17| 15|
|          Glen Iris|157| 17| 16|
+-------------------+---+---+---+
only showing top 20 rows



In [83]:
// show suburbs with most sold townhouses
df_clean.stat.crosstab("Suburb","PropertyType").orderBy(desc("2")).show()

+-------------------+---+---+---+
|Suburb_PropertyType|  1|  2|  3|
+-------------------+---+---+---+
|          Reservoir|317| 68| 24|
|           St Kilda| 44| 42|  2|
|           Hawthorn| 95| 38|  5|
|           Richmond|167| 35| 16|
|           Essendon|158| 33| 16|
|             Elwood| 36| 31|  5|
|        South Yarra| 58| 31|  7|
|     Bentleigh East|203| 29| 56|
|           Carnegie| 63| 29| 15|
|           Brighton|136| 27| 25|
|       Surrey Hills|109| 25|  7|
|         Camberwell|126| 23| 13|
|        Pascoe Vale|142| 23| 30|
|            Preston|247| 21| 10|
|             Altona| 51| 21| 10|
|      Hawthorn East| 79| 20|  7|
|        Murrumbeena| 44| 19| 12|
|          Glen Iris|157| 17| 16|
|     Port Melbourne| 93| 17|  3|
|            Hampton|108| 17| 15|
+-------------------+---+---+---+
only showing top 20 rows



In [64]:
df_clean.stat.crosstab("Rooms","Bathroom").show()

+--------------+---+----+----+---+---+---+---+---+---+---+
|Rooms_Bathroom|  0|   1|   2|  3|  4|  5|  6|  7|  8|  9|
+--------------+---+----+----+---+---+---+---+---+---+---+
|            12|  0|   0|   0|  0|  0|  1|  0|  0|  0|  0|
|             8|  0|   0|   2|  2|  3|  0|  1|  1|  1|  0|
|             4|  3| 606|2734|658| 55|  5|  1|  0|  1|  0|
|             5|  0|  21| 380|362| 71| 36|  2|  0|  0|  0|
|            10|  0|   0|   0|  1|  0|  0|  0|  0|  0|  1|
|             6|  0|   3|  21| 51| 16|  0|  2|  0|  0|  0|
|             1|  1| 232|   6|  0|  0|  0|  0|  0|  0|  0|
|             2|  4|2439| 333|  7|  0|  0|  0|  0|  0|  0|
|             7|  0|   0|   4|  7|  3|  0|  0|  0|  0|  0|
|             3|  8|4291|3159|184|  5|  1|  3|  0|  0|  0|
+--------------+---+----+----+---+---+---+---+---+---+---+



In [65]:
df_clean.stat.crosstab("Rooms","Car").show()

+---------+---+----+---+---+---+----+---+---+---+---+---+---+---+
|Rooms_Car|  0|   1| 10| 11| 18|   2|  3|  4|  5|  6|  7|  8|  9|
+---------+---+----+---+---+---+----+---+---+---+---+---+---+---+
|       12|  0|   0|  0|  0|  0|   0|  1|  0|  0|  0|  0|  0|  0|
|        8|  0|   1|  0|  0|  0|   0|  2|  7|  0|  0|  0|  0|  0|
|        4|112| 625|  2|  1|  0|2599|358|286| 43| 22|  7|  7|  1|
|        5| 14|  62|  1|  0|  0| 568|105| 92| 15| 14|  1|  0|  0|
|       10|  0|   0|  0|  0|  0|   2|  0|  0|  0|  0|  0|  0|  0|
|        6|  1|   5|  0|  0|  0|  55| 12| 13|  3|  3|  1|  0|  0|
|        1| 48| 183|  0|  0|  0|   5|  0|  3|  0|  0|  0|  0|  0|
|        2|403|1661|  0|  0|  1| 600| 75| 35|  3|  4|  1|  0|  0|
|        7|  1|   1|  0|  0|  0|   5|  2|  5|  0|  0|  0|  0|  0|
|        3|464|2462|  1|  0|  0|3810|489|331| 35| 48|  5|  5|  1|
+---------+---+----+---+---+---+----+---+---+---+---+---+---+---+



In [66]:
df_clean.stat.crosstab("PropertyType","Car").show()

+----------------+---+----+---+---+---+----+----+---+---+---+---+---+---+
|PropertyType_Car|  0|   1| 10| 11| 18|   2|   3|  4|  5|  6|  7|  8|  9|
+----------------+---+----+---+---+---+----+----+---+---+---+---+---+---+
|               2| 49|1043|  0|  0|  0| 299|  16|  4|  0|  1|  1|  0|  0|
|               1|980|3488|  4|  1|  1|6702|1004|763| 99| 90| 14| 12|  2|
|               3| 14| 469|  0|  0|  0| 643|  24|  5|  0|  0|  0|  0|  0|
+----------------+---+----+---+---+---+----+----+---+---+---+---+---+---+



### Correlation with Price

In [72]:
// Correlation now will include the categorical variables which where converted to factor.
df_clean.select(corr("MethodOfSale","Price")).show()
df_clean.select(corr("PropertyType","Price")).show()
df_clean.select(corr("DistancefromCBD","Price")).show()
df_clean.select(corr("Rooms","Price")).show()
df_clean.select(corr("Bathroom","Price")).show()
df_clean.select(corr("Car","Price")).show()
df_clean.select(corr("Landsize","Price")).show()
df_clean.select(corr("Latitude","Price")).show()
df_clean.select(corr("Longtitude","Price")).show()

+-------------------------+
|corr(MethodOfSale, Price)|
+-------------------------+
|      0.09999735389602706|
+-------------------------+

+-------------------------+
|corr(PropertyType, Price)|
+-------------------------+
|     -0.19834317580327496|
+-------------------------+

+----------------------------+
|corr(DistancefromCBD, Price)|
+----------------------------+
|         -0.3067923420146305|
+----------------------------+

+-------------------+
|corr(Suburb, Price)|
+-------------------+
|               null|
+-------------------+

+------------------+
|corr(Rooms, Price)|
+------------------+
|0.3891959982715667|
+------------------+

+---------------------+
|corr(Bathroom, Price)|
+---------------------+
|   0.4070422568142487|
+---------------------+

+------------------+
|  corr(Car, Price)|
+------------------+
|0.1508691880664582|
+------------------+

+---------------------+
|corr(Landsize, Price)|
+---------------------+
| 0.020547866303955897|
+---------------------

## References

Apache Spark (n.d.). _Spark Scala API (Scaladoc). Overview._ https://spark.apache.org/docs/latest/api/java/overview-summary.html

Apache Spark (n.d.). _Basic Statistic._ https://spark.apache.org/docs/latest/ml-statistics.html

Bahadoor N. (2020). _Spark Tutorials_ https://allaboutscala.com/big-data/spark/#dataframe-statistics-correlation

Databricks. (2020). _Introduction to DataFrames - Scala._  https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html 

Grimaldi E. (2018). _Pandas vs. Spark: how to handle dataframes (Part II.)_  https://towardsdatascience.com/python-pandas-vs-scala-how-to-handle-dataframes-part-ii-d3e5efe8287d 

