# __Melbourne Housing Prediction - Machine Learning Project__
---


## About Our Group 

- BS20DSY031 - Tuan Nguyen
- BS20DSY029 - Thanh Nguyen
- BS20DSY033 - Minh Le
- BS20DSY041 - Hai Nguyen

# 1. Introduction

*   The aim of this notebook is to build models that predict housing prices in Melbourne (Victoria, Australia) from the scraped features available in the Melbourne Housing Dataset. It also provides practical insights into the Melbourne housing market for house buyers.
 

*   The dataset is very interesting to explore since Melbourne is one of the world's most liveable cities. It is also a good opportunity to explore the capabilities of the Plotly library and create interactive choropleth maps, similar to the notebook I wrote about Australian Geographic Data Plots.



In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.offline as py
from plotly.subplots import make_subplots
import seaborn as sns
from scipy import stats
from sklearn import preprocessing
from sklearn.base import BaseEstimator, TransformerMixin

py.init_notebook_mode(connected=True)

%matplotlib inline
cmap = sns.diverging_palette(220, 10, as_cmap=True)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 50)

line_colors = ["#7CEA9C", '#50B2C0', "rgb(114, 78, 145)", "hsv(348, 66%, 90%)", "hsl(45, 93%, 58%)"]

# 2. Melbourne Housing Dataset

## 2.1. Data Preparation

Let's review some of the features that are available in the **Melbourne Housing Dataset**.

- Suburb: Suburb

- Address: Address

- Rooms: Number of rooms

- Price: Price in Australian dollars

- Method:  
S - property sold;  
SP - property sold prior;  
PI - property passed in;  
PN - sold prior not disclosed;  
SN - sold not disclosed;  
NB - no bid;  
VB - vendor bid;  
W - withdrawn prior to auction;  
SA - sold after auction;  
SS - sold after auction price not disclosed.  
N/A - price or highest bid not available.  

- Type:  
    br - bedroom(s);  
    h - house, cottage, villa, semi, terrace;  
    u - unit, duplex;  
    t - townhouse;  
    dev site - development site;  
    o res - other residential.  

- SellerG: Real Estate Agent

- Date: Date sold

- Distance: Distance from CBD in Kilometres

- Regionname: General Region (West, North West, North, North east …etc)

- Propertycount: Number of properties that exist in the suburb.

- Bedroom2 : Scraped # of Bedrooms (from different source)

- Bathroom: Number of Bathrooms

- Car: Number of carspots

- Landsize: Land size in square metres

- BuildingArea: Building size in square metres

- YearBuilt: Year the house was built

- CouncilArea: Governing council for the area

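- Lattitude: Self-explanatory (latitude; the column name is misspelled in the dataset)

- Longtitude: Self-explanatory (longitude; the column name is misspelled in the dataset)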



In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/monilshah98/Melbourne-House-Price-Prediction/master/DataSet/Melbourne_Housing_Dataset.csv')

In [None]:
df.head()

In [None]:
df.info()

> We have a total of 34854 housing transaction records. However, there is a significant number of null values, so we will do some data cleaning before conducting Exploratory Data Analysis.

There are many rows where the target variable (**Price**) is missing.  

Since imputing the target could introduce bias into the data, we remove all records whose **Price** is NaN.

In [None]:
df.dropna(subset=['Price'], how='all', inplace=True)
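
As a small illustration of what the cell above does, the sketch below applies the same `dropna(subset=...)` call to a toy frame. The values are made up and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [850000.0, np.nan, 1200000.0, np.nan],
})

# Keep only rows where the target is known; other columns may still contain NaN
toy_clean = toy.dropna(subset=["Price"])

print(len(toy), "->", len(toy_clean))
```

Only the target column is checked here, so rows with gaps in other features survive and can still be imputed later.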

In [None]:
# Some Data Cleaning 
df.drop_duplicates(subset=['Address'],inplace=True) # Some addresses actually have multiple entries
df.rename ({'Bedroom2': 'Bedrooms'}, axis = 1, inplace = True)
# df.index = df['Address'] # set dataframe index, since it's not really a useful feature 
# del df['Address'] # let's also delete the column

> Columns with more than 55% values missing should be removed
from the original dataset since it is difficult to impute these
missing values with an acceptable level of accuracy. 

In [None]:
missing_percentages = df.isnull().sum(axis = 0) / len(df) 
print(missing_percentages)
df = df.loc[:, (missing_percentages < 0.55)]
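
The threshold logic can be checked on a toy frame. This is a minimal sketch with invented columns and values; `isnull().mean()` is used as a shorthand for `isnull().sum() / len(df)`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Price": [1.0, 2.0, 3.0, 4.0],
    "BuildingArea": [np.nan, np.nan, np.nan, 120.0],  # 75% missing
    "Car": [1.0, np.nan, 2.0, 1.0],                   # 25% missing
})

missing_frac = toy.isnull().mean()        # fraction of NaN per column
kept = toy.loc[:, missing_frac < 0.55]    # drop columns at or above the threshold

print(missing_frac.round(2).to_dict())
print(list(kept.columns))
```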

Next, we will create a helper class which will be used to visualize missing data and feature importances.

In [None]:
# !pip install shap
# !pip install catboost
from sklearn.base import BaseEstimator, TransformerMixin
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import shap
from catboost import CatBoostClassifier,CatBoostRegressor
from sklearn.feature_selection import SelectKBest,f_regression
from xgboost import plot_importance,XGBClassifier,XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing


#Notebook Helper Class
class transformer(BaseEstimator, TransformerMixin):
  def __init__(self, drop_nan=False, show_nan=False, select_dtype=False, title='Title', figsize=(None,None), feature_importance = False, target = 'Price'):
    self.drop_nan = drop_nan
    self.show_nan = show_nan
    self.select_dtype = select_dtype
    self.title = title
    self.feature_importance = feature_importance
    self.figsize = figsize
    self.target = target


  # Apply Some Transformation to the Feature Matrix
  def transform(self, X):

    '''show NaN % in DataFrame'''
    if(self.show_nan):
        
        fig, ax = plt.subplots(figsize = self.figsize)
        nan_val = (X.isnull().sum()/len(X)*100).sort_values(ascending = False)
        cmap = sns.color_palette("plasma")
        for i in ['top', 'right', 'bottom', 'left']:
            ax.spines[i].set_color('black')
        ax.spines['top'].set_visible(True);ax.spines['right'].set_visible(False)
        ax.spines['bottom'].set_visible(False);ax.spines['left'].set_visible(False)
        sns.barplot(x=nan_val,y=nan_val.index, edgecolor='k',palette = 'rainbow')
        plt.title(self.title);ax.grid(ls='--',alpha = 0.9);plt.show()
        return


    ''' Drop All NaN values in DataFrame'''
    if(self.drop_nan):
        X = X.dropna();
        return X

    if self.select_dtype:
      num = X.loc[:,X.dtypes != 'object']
      cat = X.loc[:,X.dtypes == 'object']
      return num, cat

    if self.feature_importance:

        # Plot Correlation to Target Variable only
      def corrMat2(df,target=self.target,figsize=(9,0.5),ret_id=False):

          corr_mat = df.corr().round(2);shape = corr_mat.shape[0]
          corr_mat = corr_mat.transpose()
          corr = corr_mat.loc[:, df.columns == self.target].transpose().copy()

          if(ret_id is False):
              f, ax = plt.subplots(figsize=figsize)
              sns.heatmap(corr,vmin=-0.3,vmax=0.3,center=0, 
                          cmap=cmap,square=False,lw=2,annot=True,cbar=False)
              plt.title(f'Feature Correlation to {self.target}')

          if(ret_id):
              return corr

      def feature_importance(df, feature=self.target, n_est=500):
        num_df0,_ = transformer(select_dtype=True).transform(X=df)
        num_df = transformer(drop_nan=True).transform(X=num_df0)
      
        #  Input dataframe contains numeric features and target feature
        X = num_df.copy()
        y = num_df[feature].copy()
        del X[feature]

        #  CORRELATION
        imp = corrMat2(num_df,feature,figsize=(15,0.5),ret_id=True)
        del imp[feature]
        s1 = imp.squeeze(axis=0);s1 = abs(s1)
        s1.name = 'Correlation'

        #   SHAP
        model = CatBoostRegressor(silent=True,n_estimators=n_est).fit(X,y)
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X)
        shap_sum = np.abs(shap_values).mean(axis=0)
        s2 = pd.Series(shap_sum,index=X.columns,name='Cat_SHAP').T

        #   RANDOMFOREST
        model = RandomForestRegressor(n_est,random_state=0, n_jobs=-1)
        fit = model.fit(X,y)
        rf_fi = pd.DataFrame(model.feature_importances_,index=X.columns,
                            columns=['RandForest']).sort_values('RandForest',ascending=False)
        s3 = rf_fi.T.squeeze(axis=0)

        #   XGB 
        model=XGBRegressor(n_estimators=n_est,learning_rate=0.5,verbosity = 0)
        model.fit(X,y)
        data = model.feature_importances_
        s4 = pd.Series(data,index=X.columns,name='XGB').T

        #   KBEST
        model = SelectKBest(k=X.shape[1], score_func=f_regression)
        fit = model.fit(X,y)
        data = fit.scores_
        s5 = pd.Series(data,index=X.columns,name='K_best')

        # Combine Scores
        df0 = pd.concat([s1,s2,s3,s4,s5],axis=1)
        df0.rename(columns={'target':'lin corr'})

        x = df0.values 
        min_max_scaler = preprocessing.MinMaxScaler()
        x_scaled = min_max_scaler.fit_transform(x)
        df = pd.DataFrame(x_scaled,index=df0.index,columns=df0.columns)
        df = df.rename_axis('Feature Importance via', axis=1)
        df = df.rename_axis('Feature', axis=0)
        df['total'] = df.sum(axis=1)
        df = df.sort_values(by='total',ascending=True)
        del df['total']
        fig = px.bar(df,orientation='h',barmode='stack',color_discrete_sequence=line_colors)
        fig.update_layout(template='plotly_white',height=self.figsize[1],width=self.figsize[0],margin={"r":0,"t":60,"l":0,"b":0});
        for data in fig.data:
            data["width"] = 0.6 #Change this value for bar widths
        fig.show(renderer="colab")
        
      feature_importance(X)

## 2.2. Categorical & Ordinal Features
- We have a few categorical features which are handy for EDA and can also serve as model features once encoded (e.g. One-Hot Encoding via `pd.get_dummies`).

- **Sold_Month** & **Sold_Year** can be extracted from **Date**.

- **Suburb** probably doesn't tell us any more than **Postcode** does, but it is useful for EDA.


In [None]:
#Divide features into categorical and ordinal
df.Postcode = df.Postcode.astype('object') #Convert Postcode in to object data type
df_num, df_cat = transformer(select_dtype=True).transform(X=df.copy())


#Create Sold_Month & Sold_Year (Date is formatted as day/month/year)
df_num[['Sold_Month', 'Sold_Year']] = df_cat['Date'].str.split('/', n=2, expand=True).loc[:,1:2].astype('float64')
df_cat.drop(['Date'], inplace=True, axis=1)

df_EDA = pd.concat([df_cat, df_num], axis=1)
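
An alternative to splitting the string is to parse the dates and read the month and year from the datetime accessor. The sketch below assumes day/month/year strings, as the split above implies; the sample dates are invented.

```python
import pandas as pd

dates = pd.Series(["3/12/2016", "4/02/2017", "28/05/2016"])  # day/month/year strings

# Parse with an explicit format so day and month are never swapped
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
sold = pd.DataFrame({
    "Sold_Month": parsed.dt.month,
    "Sold_Year": parsed.dt.year,
})
print(sold)
```

Parsing has the side benefit of failing loudly on malformed dates instead of producing silent NaN values.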

# 3. UNIVARIATE DATA ANALYSIS

## 3.1. Data Distributions Histograms


In [None]:
import plotly.offline as pyo
pyo.init_notebook_mode(connected = True)

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# Plot Histogram, Boxplot using Plotly
def px_stats(df, n_cols=4, to_plot=None, height=800, w=None):
  df_num, df_cat = transformer(select_dtype=True).transform(df)
  numeric_cols = df_num.columns
  n_rows = -(-len(numeric_cols) // n_cols)
  row_pos, col_pos = 1, 0
  fig = make_subplots(rows=n_rows, cols=n_cols,subplot_titles=numeric_cols.to_list())

  for col in numeric_cols:
      if (to_plot == 'histogram'):
          trace = go.Histogram(x=df_num[col],showlegend=False,autobinx=True,
                                marker = dict(color = 'rgb(27, 79, 114)',
                                line=dict(color='white',width=0)))
      else:
          trace = getattr(px, to_plot)(df_num[col],x=df_num[col])["data"][0]
          
      if col_pos == n_cols: 
          row_pos += 1
      col_pos = col_pos + 1 if (col_pos < n_cols) else 1
      fig.add_trace(trace, row=row_pos, col=col_pos)

  fig.update_traces(marker = dict(color = 'rgb(27, 79, 114)',
                    line=dict(color='white',width=0)))
  fig.update_layout(template='plotly_white');fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0})
  fig.update_layout(height=height,width=w)
  fig.show()
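
The row/column bookkeeping inside `px_stats` can be checked in isolation. The helper below is a hypothetical stand-in (not part of the notebook) that reproduces the ceiling division for the row count and derives each subplot position with `divmod`.

```python
def grid_positions(n_plots, n_cols):
    """Return the number of rows and the 1-based (row, col) of each subplot."""
    n_rows = -(-n_plots // n_cols)  # ceiling division without importing math
    positions = []
    for i in range(n_plots):
        row, col = divmod(i, n_cols)
        positions.append((row + 1, col + 1))
    return n_rows, positions

n_rows, positions = grid_positions(n_plots=10, n_cols=4)
print(n_rows)
print(positions[:5])
```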

- The most common property price range is 500k-1000k AUD, which makes up about 50% of all properties.
- 1-bedroom properties are quite uncommon in Melbourne; the most common is a 3-bedroom property, typically with 1 or 2 bathrooms and a garage with 1 or 2 car spaces.
- The later months of the year (September, October, November) have the highest sales activity, which then decreases until it hits its lowest point the following January.
- Some properties have a very large number of car spaces; it would make sense to remove them, but we keep them for now. Quite a large number of features have *skewed distributions*, so some cleaning is needed later.

In [None]:
px_stats(df_EDA, to_plot='histogram') # interactive

## 3.2. Data Distributions Boxplots
- Complementary to histograms, boxplots indicate outliers a little more clearly and show useful statistics such as the min, max, median and q1/q3 values.

- We will lean towards using tree-based methods. It is often stated that ensemble approaches such as *Random Forest* are not sensitive to outliers; however, there are counterarguments that state the complete opposite, as shown on stackexchange.

- That said, our data contains quite a lot of outliers, which is to be expected given that there is no consistent standard for selling properties: certain properties can be priced above or below similar properties depending on specific circumstances.

- It would be interesting to create models for specific subsets of the data (e.g. similar suburbs, low-cost suburbs, presold properties and so on) in an attempt to get around these outliers. One model for the entire dataset seems like a stretch and will most likely have accuracy limits.
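
To make the boxplot notion of an outlier concrete, here is a minimal sketch of the usual 1.5 × IQR rule on invented prices. Plotly's box traces may compute the quartiles slightly differently, so the exact fences in the figures can differ.

```python
import pandas as pd

prices = pd.Series([450_000, 620_000, 700_000, 810_000, 950_000, 1_100_000, 5_000_000])

# Quartiles and interquartile range
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```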

In [None]:
px_stats(df_EDA, to_plot='box',height=550)


## 3.3. Target Model Feature Importance Evaluation

- We can use multiple approaches, even early on, to quickly evaluate which features carry the most weight in a model. This gives a better understanding of not only their importance but also how different models use these features.

- It is easy to understand that **Distance** (to CBD), **Number of rooms**, **Number of Bathrooms** and **Number of Bedrooms** are the most important features.

- Although a number of features have little to no impact, the only irrelevant features seem to be **Sold_Month**, **Sold_Year** and **Propertycount**, which we drop below.




In [None]:
transformer(feature_importance=True,figsize=(800,400),target='Price').transform(X=df_EDA)
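
The helper combines scores from several methods by min-max scaling each one and stacking them. The sketch below shows that idea on hypothetical scores (the numbers are invented, and plain pandas arithmetic replaces `MinMaxScaler`).

```python
import pandas as pd

# Hypothetical raw scores from three methods on very different scales
scores = pd.DataFrame(
    {
        "Correlation": [0.50, 0.10, 0.30],
        "RandForest": [0.60, 0.05, 0.35],
        "K_best": [900.0, 20.0, 400.0],
    },
    index=["Rooms", "Car", "Distance"],
)

# Min-max scale each column to [0, 1] so no single method dominates the sum
scaled = (scores - scores.min()) / (scores.max() - scores.min())
ranking = scaled.sum(axis=1).sort_values(ascending=False)
print(ranking.round(2))
```

Without the scaling step, the `K_best` F-scores would swamp the other two columns in the total.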

In [None]:
df_EDA.drop(columns = ['Propertycount', 'Sold_Month', 'Sold_Year'], inplace = True)

> In addition, we will create a new feature, "Distance_to_station", to see whether it has any impact on house prices. That code is implemented in a separate .ipynb file.
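
The separate notebook is not shown here, so the following is only a hedged sketch of how such a feature could be computed: the haversine distance from a property to the nearest station. The function names and the station coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_to_nearest(lat, lon, stations):
    """Smallest haversine distance from (lat, lon) to any (lat, lon) pair in `stations`."""
    return min(haversine_km(lat, lon, s_lat, s_lon) for s_lat, s_lon in stations)

# Hypothetical coordinates, for illustration only
stations = [(-37.80, 144.95), (-37.90, 145.00)]
print(round(distance_to_nearest(-37.81, 144.96, stations), 2))
```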

# 4. Missing Data & Cleaning
We have 6 features with missing data: Landsize, Car, Bathroom, Bedrooms, Longtitude, Lattitude.

In [None]:
transformer(show_nan=True,figsize=(9,5),title='Feature (NaN) %').transform(X=df_EDA)

### **Landsize** Imputation

First, we remove records whose land size is less than 10 square metres. They are outliers which may stem from factors such as human error, relationships with probability models, or even structured situations.

In [None]:
df_EDA.shape

In [None]:
df_EDA['Landsize'].sort_values()

In [None]:
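# Drop land sizes under 10 square metres; keep NaN rows so they can be imputed below
# (NaN >= 10 evaluates to False, so a plain comparison would silently drop them)
df_EDA = df_EDA[(df_EDA['Landsize'] >= 10) | df_EDA['Landsize'].isna()]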

In [None]:
df_EDA.Landsize

Because a few very large land sizes would distort the mean, we impute missing **Landsize** values with the median, grouped by house type and suburb.

In [None]:
df_EDA['Landsize'] = df_EDA['Landsize'].fillna(df_EDA.groupby(['Type','Suburb'])['Landsize'].transform('median'))
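
The filter-then-impute steps can be sketched on a toy frame. The data is invented; the point is that missing values have to survive the size filter for the group median to fill them.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Type": ["h", "h", "h", "u", "u", "u"],
    "Suburb": ["Kew", "Kew", "Kew", "Kew", "Kew", "Kew"],
    "Landsize": [600.0, np.nan, 800.0, 5.0, 120.0, np.nan],
})

# NaN >= 10 evaluates to False, so keep missing values explicitly for later imputation
mask = (toy["Landsize"] >= 10) | toy["Landsize"].isna()
toy = toy[mask].copy()

# Fill each missing value with the median of its (Type, Suburb) group
group_median = toy.groupby(["Type", "Suburb"])["Landsize"].transform("median")
toy["Landsize"] = toy["Landsize"].fillna(group_median)
print(toy)
```

A group that contains only NaN has no median, so its rows stay missing and need a coarser fallback such as the median per type.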

In [None]:
df_EDA['Landsize'].isna().sum()

In [None]:
df_EDA.isna().sum()

### **Bedrooms**, **Bathroom**, **Car** Imputation

We will impute the **Bedrooms**, **Bathroom** and **Car** values using the median of each house type.

In [None]:
df_EDA['Bedrooms'] = df_EDA['Bedrooms'].fillna(df_EDA.groupby('Type')['Bedrooms'].transform('median'))
df_EDA['Bathroom'] = df_EDA['Bathroom'].fillna(df_EDA.groupby('Type')['Bathroom'].transform('median'))
df_EDA['Car'] = df_EDA['Car'].fillna(df_EDA.groupby('Type')['Car'].transform('median'))

In [None]:
df_EDA.isna().sum()

### **Lattitude**, **Longtitude** Imputation
We will try to fill null values in these features from house addresses using GeoPy. 


In [None]:
!pip install geopy

In [None]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
 
# Build a DataFrame with the addresses whose coordinates we want to retrieve
condition = df_EDA['Longtitude'].isna()
df_address = pd.DataFrame(df_EDA.loc[condition, 'Address'])
df_address['Full_address'] = df_EDA.loc[condition, 'Address'] + ' ' + df_EDA.loc[condition, 'Suburb']

# Create an instance of the Nominatim geocoder
geolocator = Nominatim(user_agent="my_request")

# Wrap the geocoder in a rate limiter (at least one second between requests)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Geocode every address; unresolved addresses yield None
df_address['location'] = df_address['Full_address'].apply(geocode)
df_address['Lat'] = df_address['location'].apply(lambda x: x.latitude if x else None)
df_address['Lon'] = df_address['location'].apply(lambda x: x.longitude if x else None)

In [None]:
df_EDA['Lattitude'] = df_EDA['Lattitude'].combine_first(df_EDA['Address'].map(df_address.set_index('Address')['Lat']))
df_EDA['Longtitude'] = df_EDA['Longtitude'].combine_first(df_EDA['Address'].map(df_address.set_index('Address')['Lon']))
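
The fill step relies on `map` plus `combine_first`. The offline sketch below uses a hand-written lookup in place of real geocoding results (the addresses and coordinates are invented) to show that existing values are kept and unresolved addresses stay missing.

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({
    "Address": ["1 A St", "2 B Rd", "3 C Ave"],
    "Lattitude": [-37.80, np.nan, np.nan],
})
# Hypothetical geocoding results; the last address could not be resolved
lookup = pd.Series({"2 B Rd": -37.85, "3 C Ave": np.nan}, name="Lat")

# combine_first keeps existing values and only fills gaps from the mapped lookup
houses["Lattitude"] = houses["Lattitude"].combine_first(houses["Address"].map(lookup))
print(houses)
```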

# 5. EXPLORATORY DATA ANALYSIS

In [None]:
main_df = pd.read_csv ('../input/ditmetuanminh/ditconmeTuanMinh.csv')

main_df = main_df.drop (columns = ['Unnamed: 0'])

In [None]:
main_df

### **5.1. Categorical EDA**

In [None]:
#Type, Method, Region name, Historic

In [None]:
main_df

In [None]:
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()

*Among the 3 types of accommodation (t, u, h), type h appears to be slightly more expensive than the others.*

In [None]:
fig = px.box(main_df, y="Price", x = 'Type')
fig.show()

In [None]:
fig = px.box(main_df, y = "Price", x = "Method")
fig.show()

In [None]:
fig = px.box(main_df, y = "Price", x = "Regionname")
fig.show()

### *Top 10 suburbs with the highest median house price*

In [None]:
# Median price per suburb, highest first, top 10
top_price = main_df.groupby('Suburb')['Price'].median().sort_values(ascending=False).head(10).reset_index()
box_top_price_data = main_df.loc[main_df.Suburb.isin(top_price.Suburb.to_list())]
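
The same selection can be written with `nlargest`. This is a minimal sketch on invented data, taking the top 2 suburbs instead of 10.

```python
import pandas as pd

toy = pd.DataFrame({
    "Suburb": ["Kew", "Kew", "Reservoir", "Reservoir", "Brighton", "Brighton"],
    "Price": [2_000_000, 2_400_000, 700_000, 900_000, 2_500_000, 3_100_000],
})

# Median price per suburb, then the two highest medians
top = toy.groupby("Suburb")["Price"].median().nlargest(2)

# Keep every sale that belongs to one of the top suburbs
subset = toy[toy["Suburb"].isin(top.index)]
print(top)
```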

In [None]:
box_top_price_data

In [None]:
fig = px.box(box_top_price_data, y = "Price", x = "Suburb")
fig.show()

In [None]:
fig = px.box(main_df, y = "Price", x = "Suburb")

fig.show()

### **5.2. Aggregation**

In [None]:
count_type_suburb = main_df.groupby(['Suburb', 'Type'])['Address'].count().reset_index()
count_type_suburb.rename ({'Address':'Sold_Count'}, axis = 1, inplace = True)
count_type_suburb_sum = count_type_suburb.groupby (['Suburb'])['Sold_Count'].sum().reset_index()
count_type_suburb_sum = count_type_suburb_sum.sort_values (['Sold_Count'], ascending = False).reset_index()
count_type_suburb_sum.drop(columns = ['index', 'Sold_Count'], inplace = True)
count_type_suburb_sum['Order'] = [i for i in range (len(count_type_suburb_sum))]
count_type_suburb_sum

In [None]:
sorted_count_type_suburb = count_type_suburb_sum.merge(count_type_suburb, how = 'right', on = ['Suburb'])
sorted_count_type_suburb = sorted_count_type_suburb.sort_values (['Order'])
sorted_count_type_suburb = sorted_count_type_suburb.reset_index().drop(columns = ['index'])
sorted_count_type_suburb
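
The ordering built with the merge above can also be obtained in one step. The sketch below is an alternative, not the notebook's method: it uses `groupby(...).transform('sum')` on invented counts to attach each suburb's total to its rows before sorting.

```python
import pandas as pd

counts = pd.DataFrame({
    "Suburb": ["Kew", "Kew", "Reservoir", "Reservoir", "Brighton"],
    "Type": ["h", "u", "h", "u", "h"],
    "Sold_Count": [5, 2, 20, 10, 3],
})

# Attach each suburb's total to its rows, then sort by that total
counts["Suburb_Total"] = counts.groupby("Suburb")["Sold_Count"].transform("sum")
ordered = counts.sort_values("Suburb_Total", ascending=False)
print(ordered)
```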


In [None]:
#stacked bar?? should be in a specific period of time.

fig = px.bar(sorted_count_type_suburb, x="Suburb", y="Sold_Count", color="Type", title="Stacked-Histogram: Count of sold by Suburb & Type")
fig.show()

### **5.3. Numerical EDA**

#### *Scatter plots*

In [None]:
main_df_numer = main_df.loc[:, ['Address','Price','Bathroom', 'Distance', 'Bedrooms', 'Car', 'Rooms', 'Distance_to_Station', 'Landsize', 'main_lon', 'main_lat']]

In [None]:
main_df_numer.describe().round (2)

In [None]:
import plotly.graph_objs as go

from plotly.subplots import make_subplots
titles = ['Bathroom', 'Distance', 'Bedrooms', 'Car space', 'Rooms', 'Distance to Station']
fig = make_subplots(rows=3, cols=2,shared_yaxes=True,subplot_titles=titles,horizontal_spacing = 0.01, vertical_spacing = 0.06)

fig.add_trace(go.Scattergl(y=main_df_numer['Price'].values,x=main_df_numer['Bathroom'].values,mode='markers',name='Bathroom',text=main_df.index,opacity=0.5),row=1, col=1)
fig.add_trace(go.Scattergl(y=main_df_numer['Price'].values,x=main_df_numer['Distance'].values,mode='markers',name='Distance',text=main_df.index,opacity=0.1),row=1, col=2)
fig.add_trace(go.Scattergl(y=main_df_numer['Price'].values,x=main_df_numer['Bedrooms'].values,mode='markers',name='Bedrooms',text=main_df.index,opacity=0.1),row=2, col=1)
fig.add_trace(go.Scattergl(y=main_df_numer['Price'].values,x=main_df_numer['Car'].values,mode='markers',name='Car',text=main_df.index,opacity=0.1),row=2, col=2)
fig.add_trace(go.Scattergl(y=main_df_numer['Price'].values,x=main_df_numer['Rooms'].values,mode='markers',name='Rooms',text=main_df.index,opacity=0.1),row=3, col=1)
fig.add_trace(go.Scattergl(y=main_df_numer['Price'].values,x=main_df_numer['Distance_to_Station'].values,mode='markers',name='Distance to station',text=main_df.index,opacity=0.1),row=3, col=2)


fig.update_traces(marker=dict(size=4,line=dict(width=1.2,color='black')))
fig.update_layout(template='plotly_white',title={'text':'Important features relative to Price', 'xanchor': 'center', 'x':0.5},height=1000,showlegend= False)
fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0})
fig.show()

#### Comment:
- Most houses have fewer than 5 bedrooms, 5 bathrooms and 5 rooms in total.

- Most houses are within 30 km of the CBD and within 10 km of a station in the suburbs.

- Among the 6 features above, 4 have a positive correlation with Price, while Distance to CBD and Distance to Station have a negative one.
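
The sign of each relationship can be checked numerically rather than read off the scatter plots. The sketch below computes the Pearson correlation with Price on a toy frame; the values are invented and only illustrate the call.

```python
import pandas as pd

toy = pd.DataFrame({
    "Price":    [1_500_000, 1_200_000, 900_000, 700_000, 500_000],
    "Rooms":    [5, 4, 3, 3, 2],
    "Distance": [3.0, 6.5, 11.0, 18.0, 27.0],
})

# Pearson correlation of every feature with Price
corr_with_price = toy.corr()["Price"].drop("Price")
print(corr_with_price.round(2))
```

On the real data the equivalent call would be `main_df_numer.drop(columns=['Address']).corr()['Price']`.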

#### *Geospatial plot*

In [None]:
#sort data by name of suburb
geo_df = main_df.sort_values (['Suburb']).sample(2000).reset_index().drop(columns = ['index'])
geo_df

In [None]:
import folium

m = folium.Map(location=[-37.840935, 144.946457], zoom_start=15)
tooltip = "Click Here For More Info"

# Plot the 2000 sampled properties from geo_df, skipping rows without coordinates
for _, row in geo_df.dropna(subset=['main_lat', 'main_lon']).iterrows():
    folium.Marker(
        location=[row['main_lat'], row['main_lon']],
        popup='<strong>Nothing to display</strong>',
        tooltip=f"{row['Address']} - {row['Price']:,.0f} AUD",
    ).add_to(m)

m

*Sold houses are located in clusters, mostly close to railway lines and stations.*

# 6. Data Preparation For Modeling

### *6.1. Removing outliers*

Looking at the scatter plots above, we can easily see that there are many outliers. They need to be removed before we use the data for training!

In [None]:
#using Z_score
from scipy import stats
z_value = np.abs (stats.zscore (main_df_numer.iloc[:, 1:]))
z_value

In [None]:
#Keep only the rows where every feature satisfies |z| < 3
main_df_numer_after_z = main_df_numer[(z_value < 3).all(axis = 1)]
main_df_numer_after_z
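
The z-score filter can be reproduced with plain pandas on synthetic data. This sketch assumes the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`, and injects one artificial outlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Price": rng.normal(1_000_000, 200_000, size=200),
    "Landsize": rng.normal(600, 100, size=200),
})
toy.loc[0, "Price"] = 9_000_000  # inject one obvious outlier

# Population z-score (ddof=0), matching the scipy.stats.zscore default
z = ((toy - toy.mean()) / toy.std(ddof=0)).abs()

# A row is kept only if every column is within 3 standard deviations
filtered = toy[(z < 3).all(axis=1)]
print(len(toy), "->", len(filtered))
```

Because the test is applied to all columns at once, a single extreme value in any feature removes the whole row. NaN values also fail the comparison, so the input should be fully imputed first.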

In [None]:
from plotly.subplots import make_subplots
titles = ['Bathroom', 'Distance', 'Bedrooms', 'Car space', 'Rooms', 'Distance to Station']
fig = make_subplots(rows=3, cols=2,shared_yaxes=True,subplot_titles=titles,horizontal_spacing = 0.01, vertical_spacing = 0.06)

fig.add_trace(go.Scattergl(y=main_df_numer_after_z['Price'].values,x=main_df_numer_after_z['Bathroom'].values,mode='markers',name='Bathroom',text=main_df_numer_after_z.index,opacity=0.5),row=1, col=1)
fig.add_trace(go.Scattergl(y=main_df_numer_after_z['Price'].values,x=main_df_numer_after_z['Distance'].values,mode='markers',name='Distance',text=main_df_numer_after_z.index,opacity=0.1),row=1, col=2)
fig.add_trace(go.Scattergl(y=main_df_numer_after_z['Price'].values,x=main_df_numer_after_z['Bedrooms'].values,mode='markers',name='Bedrooms',text=main_df_numer_after_z.index,opacity=0.1),row=2, col=1)
fig.add_trace(go.Scattergl(y=main_df_numer_after_z['Price'].values,x=main_df_numer_after_z['Car'].values,mode='markers',name='Car',text=main_df_numer_after_z.index,opacity=0.1),row=2, col=2)
fig.add_trace(go.Scattergl(y=main_df_numer_after_z['Price'].values,x=main_df_numer_after_z['Rooms'].values,mode='markers',name='Rooms',text=main_df_numer_after_z.index,opacity=0.1),row=3, col=1)
fig.add_trace(go.Scattergl(y=main_df_numer_after_z['Price'].values,x=main_df_numer_after_z['Distance_to_Station'].values,mode='markers',name='Distance to station',text=main_df_numer_after_z.index,opacity=0.1),row=3, col=2)


fig.update_traces(marker=dict(size=4,line=dict(width=1.2,color='black')))
fig.update_layout(template='plotly_white',title={'text':'Important features relative to Price', 'xanchor': 'center', 'x':0.5},height=1000,showlegend= False)
fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0})
fig.show()

*The data appears to have fewer outliers on both sides.*

### *6.2. Clean unrelated data*

- *Postcode*, *Suburb*, *CouncilArea* and *Regionname* are 4 features that significantly affect Price to a similar degree. However, they are categorical data; we can use Longtitude and Lattitude (numerical data) as alternative features and drop those 4 features.
- *Method* and *SellerG* should also be dropped because they seem to have no effect on Price.

In [None]:
output_df = main_df.merge (main_df_numer_after_z, on = ['Address', 'Price', 'Bathroom', 'Distance', 'Bedrooms', 'Car', 'Rooms', 'Distance_to_Station', 'Landsize'], how = 'right')

output_main = main_df.merge (main_df_numer_after_z, on = ['Address', 'Price', 'Bathroom', 'Distance', 'Bedrooms', 'Car', 'Rooms', 'Distance_to_Station', 'Landsize', 'main_lon', 'main_lat'], how = 'right')

use_data = output_main.drop(columns = ['Suburb', 'Address', 'Method', 'SellerG', 'CouncilArea', 'Regionname', 'station', 'Postcode'])
use_data.head()

### *6.3. Feature Importance*

In [None]:
transformer(feature_importance=True,figsize=(800,400),target='Price').transform(X=use_data)

### *6.4. Normalization & Categorical Encoder*
*Using standardization (zero mean, unit variance)*

In [None]:
from sklearn.preprocessing import StandardScaler

#create dummies from "type"
dummy_type = pd.get_dummies(use_data[['Type']])
use_data = dummy_type.merge (use_data, left_index = True, right_index = True).drop(columns = ['Type'])
use_data.head(5)

features = use_data.drop(columns = ['Price'])
price_ = use_data['Price']

features_ = StandardScaler().fit_transform (features)
features_
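
One design choice worth noting: the cell above fits `StandardScaler` on all rows, so statistics from the future test rows influence the scaling. A common alternative, sketched below on synthetic data (not the notebook's method), is to put the scaler in a pipeline so it is fitted on the training split only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * [1.0, 50.0, 1000.0]   # features on very different scales
y = X @ np.array([3.0, 0.1, 0.002]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler's mean and variance are learned from X_tr only, then reused on X_te
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 3))
```

A pipeline can also be passed to `cross_val_score`, in which case the scaler is refitted inside every fold.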

# 7. BUILDING MODELS

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_, price_, test_size=0.2, random_state= 42)

In [None]:
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
from sklearn.ensemble import (RandomForestRegressor,GradientBoostingRegressor,
                              ExtraTreesRegressor,AdaBoostRegressor)
from sklearn.linear_model import LinearRegression,Lasso,ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split # Data Split
from sklearn.model_selection import cross_val_score # Cross Validation
from sklearn.model_selection import KFold # K-Fold Cross Validation

''' Evaluate some promising models '''
# let's look at how models perform overall, using default settings

models = [] 
models.append(('LR',  LinearRegression()))
models.append(('LASSO',Lasso()))
models.append(('EN',ElasticNet()))  
models.append(('KNN',KNeighborsRegressor()))        
models.append(('CART',DecisionTreeRegressor()))     
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
models.append(('RFR', RandomForestRegressor()))
models.append(('XGB', XGBRegressor(verbosity=0)))

# Evaluate each model in turn using cross-validation
# (scores are negated MSE, so values closer to 0 are better)
# Plain K-Fold is used because stratification only applies to class labels, not a continuous target
kfold = KFold(n_splits=10, random_state=3, shuffle=True)
results, names = [], []
for name, model in models:
	model_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='neg_mean_squared_error')
	results.append(model_results)
	names.append(name)
	print(name + ": ", model_results.mean(), model_results.std())

#Compare algorithms using boxplots
plt.figure(figsize=(15,8))
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()
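
The negated MSE values are hard to read on a price scale. The sketch below, on synthetic data with a plain `LinearRegression`, shows one way to turn cross-validated `neg_mean_squared_error` scores into RMSE in the target's units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

kfold = KFold(n_splits=10, shuffle=True, random_state=3)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="neg_mean_squared_error")

# scikit-learn maximises scores, so MSE is reported negated; flip the sign before the square root
rmse = np.sqrt(-neg_mse)
print(rmse.mean().round(3), rmse.std().round(3))
```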

In [None]:
from sklearn import metrics
# Fit the five tree-based models on the training set and report test-set MSE
for name, model in models[-5:]:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name + ' MSE: ' + str(metrics.mean_squared_error(y_test, predictions)))
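
MSE on prices is in squared dollars, which makes it awkward to interpret. As a hedged illustration with invented predictions, the sketch below computes RMSE, MAE and R² by hand; `sklearn.metrics` offers `mean_absolute_error` and `r2_score` for the same purpose.

```python
import numpy as np

y_true = np.array([500_000.0, 750_000.0, 1_000_000.0, 1_250_000.0])
y_pred = np.array([550_000.0, 700_000.0, 1_100_000.0, 1_150_000.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)                  # back in AUD
mae = np.mean(np.abs(errors))        # also in AUD
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"RMSE={rmse:,.0f}  MAE={mae:,.0f}  R2={r2:.3f}")
```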