# Train house price predictor

This notebook uses the MLpipeline class defined in 'MLpipeline.py' to train a model that predicts house prices in Dublin. It uses data scraped from the Irish property ads website Daft.ie, enriched with information from OpenStreetMap.

The scraping and enriching have already been done, and the data is stored in "data/df_ads_mapdata.csv".

In [9]:
import pandas as pd
import plotly.express as px

from MLpipeline import MLpipeline

Loading the data and making a list of variables to be considered as features for the model.

In [4]:
df_ads = pd.read_csv('data/df_ads_mapdata.csv')

xlist = ['surface','area','property_type','ber_classification','seller_type',
         'selling_type','price_type','month','bathrooms','beds',
         'dist_to_centre','caferestaurants', 'churches', 'health', 
         'parks', 'platforms', 'pubs','schools', 'shops', 'sports', 
         'stations', 'latitude', 'longitude','parking']


Instantiating the machine learning pipeline class. The target (y) variable is 'price', and the variables in the list above are used as features. The constructor prints out the info of the dataframe.

In [5]:
mlp = MLpipeline(df_ads, xlist)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 15741 entries, 0 to 15937
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   price               15741 non-null  float64
 1   surface             13180 non-null  float64
 2   area                15741 non-null  object 
 3   property_type       15741 non-null  object 
 4   ber_classification  12091 non-null  object 
 5   seller_type         15741 non-null  object 
 6   selling_type        15679 non-null  object 
 7   price_type          15741 non-null  object 
 8   month               15741 non-null  object 
 9   bathrooms           15741 non-null  float64
 10  beds                15741 non-null  float64
 11  dist_to_centre      15741 non-null  float64
 12  caferestaurants     15741 non-null  int64  
 13  churches            15741 non-null  int64  
 14  health              15741 non-null  int64  
 15  parks               15741 non-null  int64  
 16  plat
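
The source of 'MLpipeline.py' is not reproduced in this notebook, so the following is only a hypothetical sketch of what a constructor with this behaviour might look like. The class name `PipelineSketch`, the toy data, and the use of the natural log for the target are all assumptions made for illustration; the log target is only suggested by the "log price" axis label used in the histogram cell.

```python
import numpy as np
import pandas as pd


class PipelineSketch:
    """Hypothetical stand-in for MLpipeline; not the author's implementation."""

    def __init__(self, df, xlist, target="price"):
        # Keep only the target and the requested feature columns.
        self.df = df[[target] + xlist].copy()
        # Assumption: the model target is the natural log of the price.
        self.y = np.log(self.df[target])
        self.X = self.df[xlist]
        # Print column dtypes and non-null counts, similar to the output above.
        self.df.info()


# Toy data for illustration only; these rows are not from the Daft.ie dataset.
toy = pd.DataFrame({
    "price": [250000.0, 400000.0, 10000000.0],
    "surface": [72.0, None, 125.0],
    "beds": [2.0, 3.0, 5.0],
})
sketch = PipelineSketch(toy, ["surface", "beds"])
```

Keeping `X` and `y` as attributes mirrors how the notebook later accesses `mlp.X` and `mlp.y`.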

The plot below shows that the log price is roughly normally distributed, but that the tails are long, especially on the right. This makes sense, as the most expensive house in the database is priced at 10 million euro.

In [34]:
fig = px.histogram(x=mlp.y, nbins=200, marginal="box", labels={"x": "log price"})
fig.show()
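
The visual impression of long tails can also be checked numerically. The sketch below uses synthetic, illustrative prices (not the Daft.ie data) and assumes the target is the natural log of the price; with the real data, the same pandas methods could be called on `mlp.y`, assuming it is a pandas Series.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative prices only; these are not values from the dataset.
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.9, scale=0.5, size=5000)))

log_price = np.log(prices)

# Skewness near 0 indicates symmetry; a positive value means a longer right tail.
raw_skew = prices.skew()
log_skew = log_price.skew()
# pandas reports excess kurtosis: 0 for a normal distribution, > 0 for heavier tails.
log_kurt = log_price.kurt()

print(f"skew of raw price: {raw_skew:.2f}")
print(f"skew of log price: {log_skew:.2f}")
print(f"excess kurtosis of log price: {log_kurt:.2f}")
```

A clearly positive skew or excess kurtosis on the real log prices would confirm what the histogram and box plot suggest.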

Some of the numerical features summarised below contain unrealistic values, e.g. the maximum surface area. Some houses are also very unusual, with almost 30 bedrooms or bathrooms. In some cases the distance to the centre is far too high, which suggests that the coordinates in the ads are wrong (the minimum latitude and the maximum longitude lie far outside Ireland).

Because such extreme cases can also occur in any new data that comes in, they are dealt with in the preprocessing pipeline.

In [35]:
mlp.X.describe()

statistic,surface,bathrooms,beds,dist_to_centre,caferestaurants,churches,health,parks,platforms,pubs,schools,shops,sports,stations,latitude,longitude
count,13180.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0,15741.0
mean,305.501698,1.886729,2.843148,11.653049,6.713741,1.186265,0.22578,2.744298,11.757639,3.274887,1.385808,18.61235,5.221968,0.761578,53.33685,-6.255491
std,11710.127293,1.112966,1.235995,201.764393,21.820179,1.882202,0.61821,3.341547,10.602349,9.160366,1.416099,38.084286,5.25419,1.757796,1.044752,1.916434
min,0.5,0.0,0.0,0.351459,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-43.533968,-77.94056
25%,72.0,1.0,2.0,3.959,0.0,0.0,0.0,1.0,6.0,0.0,0.0,1.0,1.0,0.0,53.297899,-6.323787
50%,93.0,2.0,3.0,7.23719,1.0,1.0,0.0,2.0,10.0,1.0,1.0,6.0,4.0,0.0,53.341907,-6.26455
75%,125.0,2.0,3.0,11.180612,4.0,1.0,0.0,4.0,15.0,3.0,2.0,18.0,7.0,0.0,53.384001,-6.207958
max,937498.80567,29.0,29.0,18910.994078,319.0,16.0,5.0,38.0,144.0,124.0,7.0,464.0,38.0,15.0,54.609707,172.679026
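
How the preprocessing pipeline treats these extremes is not shown in this notebook. One common approach, sketched here purely as an assumption, is to learn clipping bounds from the training data and apply the same bounds to any new data. The function name `clip_outliers`, the quantile levels, and the toy values are hypothetical.

```python
import pandas as pd

# Toy training data (illustrative only); the last row mimics the extreme
# maxima reported by describe().
train = pd.DataFrame({
    "surface": [72.0, 93.0, 125.0, 80.0, 110.0, 937498.8],
    "beds": [2.0, 3.0, 3.0, 2.0, 4.0, 29.0],
})

# Learn the clipping bounds from the training data only.
lower = train.quantile(0.01)
upper = train.quantile(0.80)  # deliberately low because the toy sample is tiny


def clip_outliers(df, lower, upper):
    """Cap each column at previously learned bounds, so new data is treated alike."""
    return df.clip(lower=lower, upper=upper, axis=1)


# New ads with implausible values are capped rather than dropped.
new_ads = pd.DataFrame({"surface": [95.0, 50000.0], "beds": [3.0, 40.0]})
clipped = clip_outliers(new_ads, lower, upper)
print(clipped)
```

Clipping keeps every row, which matters at prediction time when an ad with a bad value still needs a price estimate; dropping rows would only be an option during training.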


Splitting the data into a training and a testing dataset. The row counts printed below correspond to a split of roughly 80% training and 20% testing data.

In [36]:
mlp.split_data()

Rows in training data: 12592
Rows in testing data: 3149
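
The internals of `split_data` live in 'MLpipeline.py' and are not shown here. As a sketch, assuming scikit-learn's `train_test_split` with `test_size=0.2` (an assumption, as are the placeholder arrays and the `random_state`), the same row counts result for 15741 rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 15741  # number of rows reported in the dataframe info

# Placeholder features and target; only the row count matters here.
X = np.arange(n_rows).reshape(-1, 1)
y = np.zeros(n_rows)

# test_size=0.2 and random_state=0 are assumptions, not taken from MLpipeline.py.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("Rows in training data:", len(X_train))
print("Rows in testing data:", len(X_test))
```

Fixing `random_state` makes the split reproducible, so later model comparisons are evaluated on the same held-out rows.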
