# BSc Thesis First Draft

## Forecasting Real Estate Prices using Big Data: Methods and Alternative Data Sources

*Marcell Nemeth*


*13025651*
marcell.nemeth@student.uva.nl

### TO-DO

- [ ] add statistical tests for sample
- [x] Create flowchart for pipeline
- [ ] MORE ARTICLES
- [ ] Simplify 

### Table of Contents

- [1. Introduction](#1-introduction)
- [2. Objectives](#2-objectives)
- [3. Data](#3-data)
  - [3.1 Geospatial Data](#31-geo-spatial-alternative-data-sources)
  - [3.2 Green Space Estimation](#32-green-space-estimation)
  - [3.3 Noise Pollution](#33-noise-pollution-score)
  - [3.4 Distance to Centre Score](#34-distance-to-centre-score)
  - [3.5 Google Trends](#35-google-trends-score)
  - [3.6 Neighborhood Safety Score](#36-neighborhood-safety-score)
- [4. Methods](#4-methods)
- [5. Evaluation Metrics](#5-evaluation-metrics)
- [6. Issues](#6-issues)
- [7. Pilot Study](#7-pilot)
  - [7.1 Data](#71-data)
  - [7.2 Distribution](#72-distribution-of-target)
  - [7.3 Results](#73-results-of-the-pilot)
  - [7.4 Correlation of Features](#74-correlation-of-features-and-target)
- [8. Appendix](#8-appendix)
  - [8.1 Parameter Grid](#81-parameter-grid-for-standard-models)
- [9. References](#9-references)

### 1. Introduction

Most traditional real-estate valuation relies on simple regression models and easily quantifiable data sources [Source]. With the advent of Big Data and more complex machine learning estimators, it has become easier to address over- and under-valuation of real estate properties [Source]. We propose an estimator framework that leverages alternative data sources and ensemble learning to provide more accurate valuations in the real estate market.

### Literature Review (Preliminary)



|Article   |Sample (size,loc)  |Input|Output|Models/Methods   |Results   |Limitations   |
|---|---|---|---|---|---|---|
| [Valier, A. (2020)](https://www.emerald.com/insight/content/doi/10.1108/JPIF-12-2019-0157/full/html) | 165 articles | Property features | Final prices | | | |
| [Pérez-Rave, J. et al. (2019)](https://www.tandfonline.com/doi/full/10.1080/09599916.2019.1587489) | Colombia (61,826 observations), American Housing Survey (58,888 observations) | | | MINREM | ML > Hedonic | |
| [Ho, W. K. et al. (2021)](https://www.tandfonline.com/doi/full/10.1080/09599916.2020.1832558) | Hong Kong (40,000) | Orientation (E/S/W/N), distance from centre, other property characteristics | Transaction price | SVM, RF, GBM | $R^2$ ~0.9 | |
| [Lorenz, F. et al. (2022)](https://onlinelibrary.wiley.com/doi/full/10.1111/1540-6229.12397) | Frankfurt (52,966 observations) | Socioeconomic + property data + spatial data (CBD distance, amenities) | Rent | XGB | $R^2$ ~0.92 | |

#### Hypothesis

The hypotheses of this study are the following:

$H_0$: There is no significant difference in accuracy between traditional real-estate valuation based on simple hedonic regression models and easily quantifiable data sources, and the proposed estimator framework that leverages alternative data sources and ensemble learning.

$H_1$: The proposed estimator framework that leverages alternative data sources and ensemble learning provides more accurate valuation in the real estate market compared to traditional real-estate valuation based on hedonic regression models.

### 2. Objectives

The objectives of this thesis are the following:

1. Establish a baseline model for real estate valuation based on available property data and OLS regression.
2. Improve predictions by utilizing ensemble learning.
3. Identify alternative data sources that can contribute to prediction accuracy.
4. Compare model performance with and without the identified data sources.

### 4. Methods:

#### Flowchart of the project pipeline

![Data Flow](imgs/figures/data_flow.png)

### 3. Data

Many alternative data sources are two-dimensional maps of a feature, so spatial data can be condensed either into a few components (for example via PCA) or into a single metric.

### Data Availability

| Data type | Data source | Transformation methods | Condensed Data | Level of data |
|:---|:---|:---|:---|:---|
|Property Sales Data|Funda.nl (via funda scraper)| Label encoding, standardizing| 25 unique data-points| Individual points | 
| Green Space Nearby | Google Earth Engine (Sentinel-2 RGB) | Masking, green-pixel/total pixels ratio| Green Space Prevalence Score | ~Individual points |
| Noise Pollution Prevalence Score | Gemeente Amsterdam Noise Pollution Map | Masking, pixel prevalence ratio | Noise Pollution Prevalence | ~Block level |
| Neighborhood Safety Score | Gemeente Amsterdam Neighborhood Safety Score | Standardizing | Range (0-1)<br> 1: Safest<br> 0: Least safe | District level |
| RE related search terms | Google Trends API | Avg. scoring of prevalence during period | Range (0-1)<br> 1: Highest interest<br> 0: Lowest interest | City level |
| Listing description | Funda.nl (via funda scraper) | Sentiment analysis with continuous output | Score (0-1)<br> 1: Highest Price Listing<br> 0: Lowest Price Listing | Individual points |


#### 3.1 Geo Spatial Alternative Data Sources:

![Image](imgs/figures/feature_pipeline.png)


Landsat imagery and traffic maps for the sample are collected through the Google Earth Python API.

Noise pollution maps are published by local authorities.

Crime statistics are largely available as district-level data.

Green space can be calculated by applying a green filter to Landsat imagery:

- Take the geolocation (X, Y coordinates) from the original data
- Cut a circular mask with radius $r$ around it
- Score the prevalence of green within the circle (0-1)

#### 3.2 Green Space Estimation

Green space is estimated by obtaining Landsat images from Google Earth's API for each geolocation. A circular clipping mask of radius $r$ is then applied. All pixels within the circle are counted, and the score is the ratio of pixels falling in the green range to the total number of pixels inside the circle.

![Green Mask Sample](green_mask_sample.png)
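
The masking and ratio steps can be sketched as follows. This is a minimal illustration on a synthetic tile, not the pipeline used in the thesis: the function names and the rule for what counts as "green" (green channel exceeding red and blue by a `margin`) are assumptions made for the example.

```python
import numpy as np

def circular_mask(height, width, center, radius):
    """Boolean mask that is True for pixels within `radius` of `center` (row, col)."""
    rows, cols = np.ogrid[:height, :width]
    return (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2

def green_space_score(image, center, radius, margin=20):
    """Share of pixels inside the circle that are classified as green (0 to 1).

    `image` is an (H, W, 3) uint8 RGB array. A pixel counts as green when its
    green channel exceeds both red and blue by `margin`. This threshold is an
    illustrative assumption, not a calibrated rule.
    """
    rgb = image.astype(np.int32)  # avoid uint8 wrap-around when subtracting
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    is_green = (g - r > margin) & (g - b > margin)
    mask = circular_mask(image.shape[0], image.shape[1], center, radius)
    return float(is_green[mask].sum() / mask.sum())

# Synthetic 100x100 tile: left half green, right half grey
tile = np.full((100, 100, 3), 120, dtype=np.uint8)
tile[:, :50] = (40, 160, 40)
score = green_space_score(tile, center=(50, 50), radius=30)
print(f"Green space score: {score:.3f}")
```

Because the circle is centred on the boundary between the two halves, the score comes out just under 0.5. The same function could be reused for the noise map by swapping the colour rule.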

#### 3.3  Noise Pollution Score

Noise pollution maps are readily available at both city and regional level for the Netherlands. A scoring system similar to the green space estimation can be used to quantify an apartment's exposure to noise.

##### TODO: Transform noise map csv file to Folium map object, use the same scoring mechanism (or PCA?)

#### 3.4 Distance to Centre Score

The distance to centre score is calculated by taking a cluster (city) of data points distributed by x,y coordinates, locating the centroid of the cluster, and calculating the Euclidean distance from each point to the centroid. This method relies on an important assumption: the sample is representative of the population, so that functional city centres can be located by apartment density.

Each data point's coordinates were estimated by geolocation through the Google Maps API. The map below shows the distribution of the sample with the centre of gravity marked:

![CBD](imgs/figures/map_with_gc.png)
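
A minimal sketch of the centroid and distance calculation is given below. It assumes the coordinates are already in a planar (projected) system, so that Euclidean distance is meaningful; with raw latitude/longitude the result would only be approximate. The function name and the sample points are hypothetical.

```python
import numpy as np

def distance_to_centre(coords):
    """Return the centroid of `coords` and each point's Euclidean distance to it.

    `coords` is an (n, 2) array of x,y positions in a planar coordinate
    system (an assumption of this sketch).
    """
    coords = np.asarray(coords, dtype=float)
    centroid = coords.mean(axis=0)                           # centre of gravity
    distances = np.linalg.norm(coords - centroid, axis=1)    # one distance per point
    return centroid, distances

# Four hypothetical listings on the corners of a 4 x 3 rectangle
points = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
centroid, dist = distance_to_centre(points)
print(centroid, dist)
```

For this rectangle the centroid lies at (2, 1.5) and every corner is 2.5 units away.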

#### 3.5 Google Trends Score

Google Trends gives a very high-level summary of search term usage. In the Netherlands, search analytics can be accessed at the regional level.
**This score can only be implemented if the final sample is on a regional level, not on a city level.**

#### 3.6 Neighborhood Safety Score

Amsterdam provides a district level breakdown of safety scores:

![Crime Map](imgs/figures/crime_map.png)

### 4. Model Selection

The following standard models will be considered for evaluation:

1. Lasso/Ridge Regressors
2. Random Forest Regressor
3. XGBoost Regressor
4. SVM Regressor
   
"Black Box" models considered for evaluation:

1. Convolutional Neural Networks
2. BERT for NLP analysis



### 5. Evaluation Metrics

##### Metrics for model evaluation:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- $R^2$ 
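
For reference, the four metrics can be computed directly from their definitions. The sketch below uses plain NumPy and made-up values; the function name is hypothetical, and in practice the equivalent functions in `sklearn.metrics` could be used instead.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                 # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(residuals))),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])
report = regression_report(y_true, y_pred)
print(report)
```

RMSE and MAE are expressed in the units of the target, which makes them easier to interpret than MSE when the target is a price (or a log price).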

##### Metrics for feature evaluation: 
- Feature Importance
- Feature Effects
- Feature Interactions

### 6. Issues

#### 6.1 Scraping Data
I am unsure whether permission is needed from agencies such as Funda or Pararius to use their data for model testing.

### 7. Pilot 

A small pilot on the sample data showed promising results.

- Outliers were removed
- Categorical variables were encoded
- The target variable needed to be log transformed to fit a normal distribution
- A Randomized Search Cross Validation was conducted on all regressors
- Multiple strong correlations between the target and features were found (see Correlation matrix)
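
The effect of the log transform on a right-skewed target can be illustrated with a small example. The prices below are hypothetical and are not taken from the pilot sample; the `skewness` helper is written out only to keep the sketch free of extra dependencies.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return float((centred ** 3).mean() / (centred ** 2).mean() ** 1.5)

# Hypothetical asking prices in EUR with one expensive listing in the right tail
prices = np.array([250_000, 320_000, 410_000, 575_000, 1_250_000], dtype=float)

log_prices = np.log(prices)      # target used for model fitting
skew_raw = skewness(prices)
skew_log = skewness(log_prices)

# Predictions made on the log scale are mapped back to EUR with exp
recovered = np.exp(log_prices)

print(f"skewness raw: {skew_raw:.2f}, log: {skew_log:.2f}")
```

When the model is fitted on the log target, error metrics reported in EUR should be computed after back-transforming the predictions.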


#### 7.1 Data

Fortunately, there is a preexisting Python package for scraping real estate listings on Funda, released under a GNU License (see References).

For the pilot study, a dataset of 605 properties was collected from Funda.nl.

Addresses from the dataset were transformed to coordinates via geolocation (Google Earth Engine).

#### RE listing features

All properties were located in Amsterdam and had the following features:

|    Feature           | Type      | Description|
|:--------------|:--------|:----|
| house_id      | int64   | ID of property |
| city          | object  | City |
| house_type    | object  | Apartment/House |
| building_type | object  | New Property/Resale Property| 
| price         | int64   | Price (EUR) |
| price_m2      | float64 | Price/$m^2$ |
| room          | int64   | Number of rooms |
| bedroom       | int64   | Number of bedrooms |
| bathroom      | int64   | Number of bathrooms |
| living_area   | int64   | Size of living area |
| energy_label  | object  | >A+, A, B, C, D, E, F, G, NaN |
| has_balcony   | int64   | 1,0 |
| has_garden    | int64   |1,0 |
| zip           | int64   | Zip-code|
| address       | object  | Address -> **used for geomapping** |
| year_built    | int64   | Year Built |
| house_age     | int64   | Current year - Year Built |
| date_list     | object  | Date Listed |
| ym_list       | object  | Date Listed? |
| year_list     | int64   | Year Listed? |
| descrip       | object  | Description -> **maybe sentiment analysis with BERT?** |
| ym_sold       | object  | Date Sold? |
| year_sold     | int64   |Year Sold? |
| term_days     | int64   |Term Days? |
| date_sold     | object  | Date  Sold |

#### 7.2 Distribution of target

<img src="imgs/figures/target_distrib.png" />

With outliers removed, the target distribution was closer to a normal distribution:
<img src="imgs/figures/target_distrib_outliers.png" />

The training data was scaled with a `StandardScaler`, and each regressor was fitted through a randomized search cross-validation process (see 8.1 Parameter Grid for Standard Models).
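
A minimal sketch of this scaling and search step with scikit-learn is shown below. It uses synthetic data in place of the Funda sample and only the Ridge regressor; the sample size, feature count, `n_iter`, number of folds and scoring metric are assumptions made for the example, not the settings of the pilot.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pilot data (assumption: 605 rows, 10 numeric features)
X, y = make_regression(n_samples=605, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling sits inside the pipeline so it is re-fitted on each training fold
pipeline = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# Pipeline parameters are addressed as <step name>__<parameter>
param_distributions = {"model__alpha": np.logspace(-3, 3, 7)}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,                                # sample 5 of the 7 candidate alphas
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["model__alpha"]
test_neg_rmse = search.score(X_test, y_test)  # negative RMSE on the held-out set
print(best_alpha, test_neg_rmse)
```

Placing the scaler inside the pipeline means the held-out fold never influences the scaling statistics, which keeps the cross-validated scores honest. The other regressors can be searched the same way by swapping the `model` step and its parameter grid.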

#### 7.4 Correlation of features and target

![Correlation Matrix](imgs/figures/feature_corr.png)

#### 7.3 Results of the pilot

![Results](imgs/figures/cv_improv.png)

### 8. Appendix

#### 8.1 Parameter Grid for Standard Models:

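```python
import numpy as np

# Random Forest: ensemble size, tree depth and minimum split/leaf sizes
param_grid_rf = {
    'n_estimators': np.arange(50, 200, 10),
    'max_depth': np.arange(5, 15, 1),
    'min_samples_split': np.arange(2, 11, 1),
    'min_samples_leaf': np.arange(1, 6, 1)
}

# Support Vector Regressor: regularisation strength and kernel
# ('degree' only affects the 'poly' kernel)
param_grid_svr = {
    'C': np.logspace(-3, 3, 7),
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': np.arange(1, 6, 1)
}

# Ridge: L2 penalty strength and solver
param_grid_ridge = {
    'alpha': np.logspace(-3, 3, 7),
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}

# Lasso: L1 penalty strength and iteration budget
param_grid_lasso = {
    'alpha': np.logspace(-3, 3, 7),
    'max_iter': np.arange(1000, 10000, 1000)
}

# XGBoost: learning rate, tree shape, sampling ratios and regularisation
param_grid_xgb = {
    'learning_rate': np.logspace(-3, 0, 4),
    'n_estimators': np.arange(50, 200, 10),
    'max_depth': np.arange(3, 10, 1),
    'min_child_weight': np.arange(1, 6, 1),
    'gamma': np.arange(0, 1, 0.1),
    # linspace gives the same 0.1 ... 1.0 grid as arange(0.1, 1.1, 0.1) but
    # cannot overshoot 1.0 through floating-point rounding
    'subsample': np.linspace(0.1, 1.0, 10),
    'colsample_bytree': np.linspace(0.1, 1.0, 10),
    'reg_alpha': np.logspace(-3, 3, 7),
    'reg_lambda': np.logspace(-3, 3, 7)
}
```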

### 9. References

Ho, W. K., Tang, B., & Wong, S. K. (2021). Predicting property prices with machine learning algorithms. Journal of Property Research, 38(1), 48–70. https://doi.org/10.1080/09599916.2020.1832558

Lorenz, F., Willwersch, J., Cajias, M., & Fuerst, F. (2022). Interpretable machine learning for real estate market analysis. Real Estate Economics. https://doi.org/10.1111/1540-6229.12397

Pérez-Rave, J., Correa, J. C., & Echavarría, F. G. (2019). A machine learning approach to big data regression analysis of real estate prices for inferential and predictive purposes. Journal of Property Research, 36(1), 59–96. https://doi.org/10.1080/09599916.2019.1587489

Valier, A. (2020). Who performs better? AVMs vs hedonic models. Journal of Property Investment & Finance, 38(3), 213–225. https://doi.org/10.1108/jpif-12-2019-0157


whchien. (n.d.). funda-scraper: FundaScaper scrapes data from Funda, the Dutch housing website. You can find listings from house-buyer or rental market, and historical data. GitHub. https://github.com/whchien/funda-scraper

