# Load Dataset from Registry

In [2]:
from azureml.core import Workspace
ws = Workspace.from_config()

We will start by loading version 2 of our dataset, where we did not convert everything into numerical values yet.

In [3]:
from azureml.core import Dataset
dataset_name = 'Melbourne Housing Dataset'

# Get a dataset by name
melb_ds = Dataset.get_by_name(workspace=ws, name=dataset_name, version=2)

# Load a TabularDataset into pandas DataFrame
df = melb_ds.to_pandas_dataframe()
df

Unnamed: 0,Suburb,Rooms,Type,Price,Method,Date,Distance,Bedrooms,Bathrooms,Parking,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Region,SuburbPropCount,Price_log
0,Abbotsford,2,house,,SS,2016-09-03,2.5,2.0,1.0,1.0,126.0,160.357259,1970.0,Yarra City Council,-37.80140,144.99580,Northern Metropolitan,4019.0,
1,Abbotsford,2,house,1480000.0,Property Sold,2016-12-03,2.5,2.0,1.0,1.0,202.0,160.357259,1970.0,Yarra City Council,-37.79960,144.99840,Northern Metropolitan,4019.0,14.207553
2,Abbotsford,2,house,1035000.0,Property Sold,2016-02-04,2.5,2.0,1.0,0.0,156.0,79.000000,1900.0,Yarra City Council,-37.80790,144.99340,Northern Metropolitan,4019.0,13.849912
3,Abbotsford,3,unit,,Vendor Bid,2016-02-04,2.5,3.0,2.0,1.0,0.0,160.357259,1970.0,Yarra City Council,-37.81140,145.01160,Northern Metropolitan,4019.0,
4,Abbotsford,3,house,1465000.0,Property Sold Prior,2017-03-04,2.5,3.0,2.0,0.0,134.0,150.000000,1900.0,Yarra City Council,-37.80930,144.99440,Northern Metropolitan,4019.0,14.197366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34850,Yarraville,4,house,1480000.0,Property Passed In,2018-02-24,6.3,4.0,1.0,3.0,593.0,160.357259,1970.0,Maribyrnong City Council,-37.81053,144.88467,Western Metropolitan,6543.0,14.207553
34851,Yarraville,2,house,888000.0,Property Sold Prior,2018-02-24,6.3,2.0,2.0,1.0,98.0,104.000000,2018.0,Maribyrnong City Council,-37.81551,144.88826,Western Metropolitan,6543.0,13.696727
34852,Yarraville,2,townhouse,705000.0,Property Sold,2018-02-24,6.3,2.0,1.0,2.0,220.0,120.000000,2000.0,Maribyrnong City Council,-37.82286,144.87856,Western Metropolitan,6543.0,13.465953
34853,Yarraville,3,house,1140000.0,Property Sold Prior,2018-02-24,6.3,,,2.0,,160.357259,1970.0,Maribyrnong City Council,,,Western Metropolitan,6543.0,13.946539


# Results so far

First, let us remember our preliminary results from Chapter 5 (Feature Importance for target Price and Price_log):

![alt text](feature_chap5.png "Feature Importance for Price and log(Price)")



Let us discuss these results a bit further than we did in Chapter 5. To do so, let us group these features by what they actually convey.

**Housing Properties**
- Type: seems to be a very strong indicator of price. It also helps that this feature has only 3 possible values.
- Parking: Having 5 or 7 parking spaces probably makes little difference. We could make this feature more discrete by dividing it into, for example, three groups: 0, 1 and 2+ parking spaces.
- YearBuilt: We would expect the age of a house to have an impact on the price, yet the impact seems very small. We could transform the data into discrete age groups (0-10, 10-20, 20-30, etc.).
- BuildingArea: One would expect this to have a much higher influence. Therefore, let us divide it into groups as well.
- Landsize: Same argument as BuildingArea.
- Bathrooms, Bedrooms, Rooms: We would have to drill deeper into these. As seen before, there seem to be some questionable combinations. Still, they have some impact on the price (as we would expect).

**Housing Location**
- Suburb: As predicted, suburb has too many possible values (around 500) to be of much use. Therefore, we will ignore this as well for now.
- SuburbPropCount: Once again, this value is too detailed and therefore seems to have nearly no predictive value. We could think of building a discrete feature by breaking it into 3-5 groups.
- CouncilArea: has some impact, but we see it dwindle when looking at the logarithmic price.
- Distance (from city center): seems to have a high impact and could be improved by discretization.
- Region: draws a clearer picture for the price compared to CouncilArea or Suburb.
- Lattitude & Longtitude (spelled as in the dataset columns): seem to have a high impact. They seem to convey the location better than CouncilArea does.

**Others**
- Method: How the house or apartment was sold seems to have little importance, as we would presume in most cases. Therefore, we can safely remove it.
- Date: (not included in the Feature Importance graph) The sale date, spread over a couple of years. If we only look at the year, we might find a small increase due to constantly increasing housing prices.



# Visualisations

Let us look at histograms of all the features we might want to transform.

## Parking

In [4]:
import plotly.express as px
fig = px.histogram(df, x="Parking")
fig.show()


Remember that we replaced the missing values with the median (2), so the bar at 2 is inflated and most properties probably have 1 parking spot. As discussed, we could bin these into the groups 0, 1 and 2+.
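The grouping into 0, 1 and 2+ could be sketched with `pd.cut`. This is only an illustration on a small made-up series standing in for `df["Parking"]`; the name `parking_group` is our own choice, not something defined earlier.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df["Parking"] (assumption: values are non-negative counts)
parking = pd.Series([0.0, 1.0, 2.0, 5.0, 7.0, 1.0], name="Parking")

# Right-closed intervals: (-inf, 0] -> "0", (0, 1] -> "1", (1, inf] -> "2+"
parking_group = pd.cut(
    parking,
    bins=[-np.inf, 0, 1, np.inf],
    labels=["0", "1", "2+"],
)
print(parking_group.tolist())
```

On the real data you would assign the result to a new column, for example `df["ParkingGroup"]`, and leave the original column untouched for comparison.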

## YearBuilt

In [5]:
import plotly.express as px
fig = px.histogram(df, x="YearBuilt")
fig.show()

We can see that our missing-value replacement creates a spike in the middle of the distribution. Equal-width bins of 10 years might be a good starting point. Let us have a look at that:

In [6]:
import plotly.express as px
import numpy as np

counts, bins = np.histogram(df["YearBuilt"], bins=range(1850, 2030, 10))
bins = 0.5 * (bins[:-1] + bins[1:])

fig = px.bar(x=bins, y=counts, labels={'x':'YearBuilt', 'y':'count'})  # x axis shows the bin centers
fig.show()

## BuildingArea

In [7]:
import plotly.express as px
fig = px.histogram(df, x="BuildingArea")
fig.show()

Here, too, we can see our missing-value replacement for the building area. Using equal-width bins might be of interest here as well. Let us look at that:

In [8]:
import plotly.express as px
import numpy as np

counts, bins = np.histogram(df["BuildingArea"], bins=range(0, 350, 25))
bins = 0.5 * (bins[:-1] + bins[1:])

fig = px.bar(x=bins, y=counts, labels={'x':'BuildingArea', 'y':'count'})  # x axis shows the bin centers
fig.show()

## Distance

In [9]:
import plotly.express as px
fig = px.histogram(df, x="Distance")
fig.show()

Equal-width binning in steps of 5 km could be an option here. Let's have a look at that.

In [10]:
import plotly.express as px
import numpy as np

counts, bins = np.histogram(df["Distance"], bins=range(0, 60, 5))
bins = 0.5 * (bins[:-1] + bins[1:])

fig = px.bar(x=bins, y=counts, labels={'x':'Distance', 'y':'count'})
fig.show()
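The histogram only shows the bin counts. To use the bins as a feature, each row needs its own bin label. A minimal sketch, assuming the same edges as in the histogram and using a small made-up series in place of `df["Distance"]`:

```python
import pandas as pd

# Toy stand-in for df["Distance"]
distance = pd.Series([2.5, 6.3, 12.0, 0.0, 47.9], name="Distance")

step = 5
edges = list(range(0, 60, step))  # same edges as the histogram: 0, 5, ..., 55

# Left-closed intervals [0, 5), [5, 10), ... so that a distance of 0 lands in the first bin
distance_bin = pd.cut(distance, bins=edges, right=False)

# Integer codes (0 for the first bin) are often easier to feed into a model;
# values outside the edges become NaN in distance_bin and -1 in the codes
distance_bin_code = distance_bin.cat.codes
print(distance_bin_code.tolist())
```

The same pattern should carry over to the other numeric columns discussed here, with edges chosen per column.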

## CouncilArea

In [11]:
import plotly.express as px
fig = px.histogram(df, x="CouncilArea")
fig.show()

An option here could be to fill in the missing CouncilAreas. We have street addresses for all of them, so either pulling in external data or checking the suburb-to-CouncilArea mapping in the other samples should give us the missing data.
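The suburb-based variant could look roughly like this. It is a sketch on a made-up frame and assumes that each suburb belongs mostly to one council area, which you should verify on the real data first.

```python
import pandas as pd

# Toy stand-in with two missing CouncilArea entries
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford", "Yarraville", "Yarraville", "Yarraville"],
    "CouncilArea": [
        "Yarra City Council", None,
        "Maribyrnong City Council", None, "Maribyrnong City Council",
    ],
})

# Most frequent known CouncilArea per suburb
known = toy.dropna(subset=["CouncilArea"])
suburb_to_council = known.groupby("Suburb")["CouncilArea"].agg(lambda s: s.mode().iloc[0])

# Fill only the missing entries; suburbs without any known CouncilArea stay missing
toy["CouncilArea"] = toy["CouncilArea"].fillna(toy["Suburb"].map(suburb_to_council))
print(toy)
```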

## Region

In [12]:
import plotly.express as px
fig = px.histogram(df, x="Region")
fig.show()

Looking at this, we could combine the houses outside of the metropolitan area into one group (Victoria) or we could even create only two groups (Metropolitan, Victoria).
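Both grouping variants can be sketched with a simple string match. The region names ending in "Victoria" below are assumed for illustration; check `df["Region"].unique()` for the actual values.

```python
import pandas as pd

# Toy stand-in for df["Region"]; the "... Victoria" names are assumptions
region = pd.Series(
    ["Northern Metropolitan", "Western Metropolitan", "Eastern Victoria", "Northern Victoria"],
    name="Region",
)

is_metro = region.str.contains("Metropolitan", na=False)

# Variant A: keep the metropolitan regions, merge everything else into "Victoria"
region_merged = region.where(is_metro, "Victoria")

# Variant B: only two groups
region_binary = is_metro.map({True: "Metropolitan", False: "Victoria"})

print(region_merged.tolist())
print(region_binary.tolist())
```

Note that with `na=False`, rows with a missing Region would also end up in the "Victoria" group, so handle missing values before applying this.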

## Date

In [13]:
import plotly.express as px
df["Year Offered/Sold"] = df['Date'].dt.year.astype(int)

fig = px.histogram(df, x="Year Offered/Sold")
fig.show()


As we presumed, the data covers roughly two years: it starts in 2016 and, as the rows dated 2018-02-24 in the table above show, extends into early 2018.

## SuburbPropCount

In [14]:
import plotly.express as px
fig = px.histogram(df, x="SuburbPropCount")
fig.show()

We can see a lot of different values here. Once again, binning might be helpful, for example with a bin size of 5000:

In [15]:
import plotly.express as px
import numpy as np

counts, bins = np.histogram(df["SuburbPropCount"], bins=range(0, 25000, 5000))
bins = 0.5 * (bins[:-1] + bins[1:])

fig = px.bar(x=bins, y=counts, labels={'x':'SuburbPropCount', 'y':'count'})
fig.show()

This is a much cleaner result, and with only 4 bins the feature might actually have some predictive value.

# What to do next?

This first look at the data should give you some ideas to think about. The next step is to create new, transformed features from the original ones and run the feature importance again. Since the feature importance is computed with a random forest, you are already using a useful baseline model. You could also test your dataset on a random forest directly: create at least a training and test split of the data and use a cost function to measure your success.

We leave you with our suggestions for what you might want to transform and test again.
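A minimal sketch of such a baseline, assuming scikit-learn is available in your environment. The data below is synthetic and the two feature columns are picked only for illustration; on the real dataset you would first drop the rows without a Price and encode the categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption, not the real dataset)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n),
    "Distance": rng.uniform(0, 50, size=n),
})
toy["Price_log"] = 13 + 0.2 * toy["Rooms"] - 0.01 * toy["Distance"] + rng.normal(0, 0.1, size=n)

X = toy[["Rooms", "Distance"]]
y = toy["Price_log"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Mean absolute error as one possible cost function
mae = mean_absolute_error(y_test, model.predict(X_test))
importances = pd.Series(model.feature_importances_, index=X.columns)

print(f"MAE on log(price): {mae:.3f}")
print(importances.sort_values(ascending=False))
```

Rerunning this after each feature transformation gives you one number to compare, in addition to the feature importance ranking.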

*For this, it is probably wise to load version 1 of the dataset, as we will have another look at the missing values in step 3.*

**1. Discretization**

Create new transformed features for the following:
- SuburbPropCount
- Region
- Distance
- Landsize

**2. Rooms, Bathrooms, Bedrooms**

You might have noticed some discrepancies between Rooms and Bathrooms/Bedrooms. This data was extracted from an Australian apartment/house selling platform, so the seller probably provides this information. Therefore, the two obvious options are:
- Rooms = Bathrooms + Bedrooms
- Rooms = Bedrooms

Looking at the head of the dataset above, one of those rules might be true. Write a function that groups the dataset into the entries that follow rule 1, the entries that follow rule 2, and everything that follows neither. Then decide which rule to enforce across the dataset. Rule 1 is probably the most useful.
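One possible shape for such a function is sketched below on a made-up frame. The function name `classify_room_rule`, the label strings and the extra "missing" group are our own choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three room columns
toy = pd.DataFrame({
    "Rooms":     [2, 3, 3, 4, 3],
    "Bedrooms":  [2.0, 3.0, 2.0, 2.0, np.nan],
    "Bathrooms": [1.0, 2.0, 1.0, 1.0, np.nan],
})

def classify_room_rule(frame):
    """Label each row by the rule it follows (hypothetical helper)."""
    rule1 = (frame["Rooms"] == frame["Bedrooms"] + frame["Bathrooms"]).to_numpy()
    rule2 = (frame["Rooms"] == frame["Bedrooms"]).to_numpy()
    missing = frame[["Bedrooms", "Bathrooms"]].isna().any(axis=1).to_numpy()
    # The first matching condition wins, so rows with missing values are
    # separated first and rule 1 takes precedence over rule 2
    labels = np.select(
        [missing, rule1, rule2],
        ["missing", "rule_1", "rule_2"],
        default="neither",
    )
    return pd.Series(labels, index=frame.index, name="RoomRule")

toy["RoomRule"] = classify_room_rule(toy)
print(toy["RoomRule"].value_counts())
```

Looking at the group sizes on the real data should make the decision between the two rules much easier.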

**3. Missing Values**

Having done the above, we now have a better chance to group our samples. This in turn helps us to replace our missing values not with the mean or median of the entire dataset, but with the one of a group of similar samples. As an example, we could group by (Type, Distance, Region, Rooms) and calculate the mean BuildingArea for each group. This requires a bunch of code, but gives a more realistic statistical property for the samples with missing values.
In addition, we can fill in the missing CouncilAreas by checking which CouncilArea is recorded for the same suburb in other samples.
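The group-wise replacement can be sketched with `groupby(...).transform`. The example uses a made-up frame and only two grouping columns; it assumes the missing values are still NaN, as in version 1 of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in with missing BuildingArea values
toy = pd.DataFrame({
    "Type":  ["house", "house", "house", "unit", "unit", "unit"],
    "Rooms": [3, 3, 3, 2, 2, 2],
    "BuildingArea": [150.0, np.nan, 170.0, 60.0, 80.0, np.nan],
})

group_cols = ["Type", "Rooms"]

# Mean BuildingArea of each group, aligned row by row with the original frame
group_mean = toy.groupby(group_cols)["BuildingArea"].transform("mean")

# Fill with the group mean first; fall back to the overall mean for groups
# that have no known value at all
toy["BuildingArea_filled"] = (
    toy["BuildingArea"].fillna(group_mean).fillna(toy["BuildingArea"].mean())
)
print(toy)
```

Adding binned columns such as a Distance bin to `group_cols` makes the groups more specific, at the cost of more groups with very few samples.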

**4. Discretization Part 2**

After that, we can start binning the leftover values:
- BuildingArea
- YearBuilt
- Parking

**5. Latitude/Longitude**

It is interesting to see that Lat/Long already has a reasonable predictive value, even though it consists of a lot of different numerical values. There are a lot of things that can be done with geospatial coordinates. Just to give you an idea: maybe you have a property in an expensive suburb, but it sits next to something that influences the price (industrial plant, loud school, church, ...). Therefore, bringing in more information about the surroundings of the property might be of interest. You can find some external datasets for Melbourne here: https://data.melbourne.vic.gov.au/
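As one idea for such a feature, the distance from each property to a point of interest can be computed with the haversine formula. The sketch below uses a made-up frame and a hypothetical point of interest (coordinates roughly in central Melbourne); the column names follow the spelling in the dataset.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# Toy stand-in using two coordinate pairs from the table above
toy = pd.DataFrame({
    "Lattitude": [-37.8014, -37.81053],
    "Longtitude": [144.9958, 144.88467],
})

# Hypothetical point of interest (assumption for illustration only)
poi_lat, poi_lon = -37.8136, 144.9631

toy["DistToPOI_km"] = haversine_km(toy["Lattitude"], toy["Longtitude"], poi_lat, poi_lon)
print(toy)
```

With coordinates taken from an external dataset, the same function could yield features such as the distance to the nearest train station or industrial site.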


*Enjoy the data crunching*