# Regression: House Price Prediction
* Author: Johannes Maucher
* Paper: [E.H. Ahmed, M.N. Moustafa: House Price Estimation from Visual and Textual Features](https://arxiv.org/pdf/1609.08399.pdf)
* Data: [https://github.com/emanhamed/Houses-dataset](https://github.com/emanhamed/Houses-dataset)

## Goal

House prices can be predicted based on different types of information. In this notebook we apply

* number of bedrooms
* number of bathrooms
* total living area in square feet
* location of the house in terms of zipcode

as input features. A simple feedforward neural network shall be trained to estimate the house price from these inputs. In a follow-up notebook, house prices are estimated from images of the houses. Finally, both types of input, the images and the four parameters listed above, will be combined for house price prediction.

# Task 1: Data Access and Understanding
1. Download data from [https://github.com/emanhamed/Houses-dataset](https://github.com/emanhamed/Houses-dataset). 
2. The downloaded directory contains images (which will be used in the next exercise) and a csv-file `HousesInfo.txt`. In this csv-file, columns are separated by a single space (not by a comma)! Read this file into a pandas dataframe. For each house, the file contains the price and the following features:
    * Number of bedrooms
    * Number of bathrooms
    * Area (i.e., square footage)
    * Zip code

3. Note that the file `HousesInfo.txt` does not contain column names. Assign the column-names `bedrooms`, `bathrooms`, `area`, `zipcode` and `price` to the pandas dataframe.
4. Calculate descriptive statistics on the dataframe by applying the `describe()`-method.
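
The steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the three rows in `sample` are made-up placeholder values (not taken from the dataset) so that the snippet runs without the download. In the notebook, the `io.StringIO` object would be replaced by the path to `HousesInfo.txt`.

```python
import io

import pandas as pd

# Placeholder rows for illustration only (NOT real data from the dataset).
# In the notebook, pass the path to HousesInfo.txt instead of `sample`.
sample = io.StringIO(
    "3 2 1500 11111 250000\n"
    "4 3 2400 22222 480000\n"
    "2 1 900 11111 150000\n"
)

cols = ["bedrooms", "bathrooms", "area", "zipcode", "price"]

# The file is space-separated and has no header row.
df = pd.read_csv(sample, sep=" ", header=None, names=cols)

# Descriptive statistics (count, mean, std, min, quartiles, max) per column.
stats = df.describe()
print(stats)
```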

# Task 2: Data Cleaning
1. In order to create models that generalise well (robust models), the training data shall reflect the true statistics of the data as well as possible. In the case of categorical attributes this means that each value shall occur sufficiently often in the training dataset. Therefore, all houses whose zip code appears only rarely (fewer than 20 times) in the given dataset shall be dropped.
2. Calculate the descriptive statistics on this cleaned dataframe.
3. How many different zip-codes remain in the cleaned dataframe?
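
One possible way to implement this filter is sketched below. The helper name `drop_rare_zipcodes` and the toy dataframe are hypothetical and only serve to make the snippet self-contained; in the notebook the function would be applied to the dataframe from Task 1.

```python
import pandas as pd


def drop_rare_zipcodes(df, min_count=20):
    """Keep only rows whose zip code occurs at least `min_count` times."""
    counts = df["zipcode"].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df["zipcode"].isin(frequent)].copy()


# Toy data for illustration: zip code 22222 occurs only 3 times.
toy = pd.DataFrame({
    "zipcode": [11111] * 25 + [22222] * 3 + [33333] * 20,
    "price": range(48),
})

cleaned = drop_rare_zipcodes(toy, min_count=20)
n_zipcodes = cleaned["zipcode"].nunique()
print(cleaned.describe())
print("remaining zip codes:", n_zipcodes)
```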

# Task 3: Preprocessing
1. Split the cleaned dataset into a training and a test partition by applying the [train_test_split()-method from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). The test partition shall contain 25% and the training partition 75% of the data.
2. Gradient descent training of neural networks works best if all input variables have a similar value range. Transform all numerical variables into the range between 0 and 1 by applying scikit-learn's *MinMax-Scaling*. Take care that the *MinMax*-model is fitted only on the training data and then applied to transform both training and test data!
3. Categorical data must be One-Hot encoded before passing it to the input of a neural network. Implement one-hot encoding for the categorical feature `zipcode`.
4. Training of regression models converges better if the output variable is normalized to values between 0 and 1. Therefore, normalize all house prices by dividing each price by the maximum house price of the training data.
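
The four preprocessing steps can be sketched as below. The dataframe is randomly generated stand-in data (an assumption, so that the snippet runs on its own), and the choice of scikit-learn's `OneHotEncoder` for the zip code is one option among several (`pandas.get_dummies` would be another).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Random stand-in data for illustration; use the cleaned dataframe instead.
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
    "area": rng.integers(800, 4000, n),
    "zipcode": rng.choice([11111, 22222, 33333], n),
    "price": rng.integers(100_000, 900_000, n),
})

# 1. 75% / 25% split into training and test partition.
train, test = train_test_split(df, test_size=0.25, random_state=42)

# 2. Min-max scaling: fit on training data only, transform both partitions.
num_cols = ["bedrooms", "bathrooms", "area"]
scaler = MinMaxScaler().fit(train[num_cols])
X_train_num = scaler.transform(train[num_cols])
X_test_num = scaler.transform(test[num_cols])

# 3. One-hot encoding of the zip code, also fitted on training data only.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["zipcode"]])
X_train_zip = encoder.transform(train[["zipcode"]]).toarray()
X_test_zip = encoder.transform(test[["zipcode"]]).toarray()

X_train = np.hstack([X_train_num, X_train_zip])
X_test = np.hstack([X_test_num, X_test_zip])

# 4. Normalize prices by the maximum price of the training data.
max_price = train["price"].max()
y_train = train["price"].to_numpy() / max_price
y_test = test["price"].to_numpy() / max_price
```

Note that scaled test values can fall slightly outside the range 0 to 1, because the scaler and `max_price` are derived from the training partition only.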

# Task 4: Define MLP Architecture, Train and Evaluate the MLP

1. A simple Multilayer-Perceptron with 2 hidden layers shall be configured with `tensorflow.keras`. The number of neurons in the hidden layers is 8 and 4, respectively. Both hidden layers apply a relu-activation function. As usual for regression neural networks, the output layer consists of only a single neuron with a linear activation (identity function).
2. *Compile* this keras model by configuring an [Adam optimizer](https://keras.io/api/optimizers/) for training. Suitable values for the learning rate and the learning-rate decay are *0.001* and *0.001/NumberOfEpochs*, respectively. The loss function shall be `mean_absolute_percentage_error`.
3. Train the network for 200 epochs and a batch-size of 8.
4. Visualize the decrease of the loss value over the training epochs for both training and test data.
5. Calculate the model's predictions on the test data and rescale the predicted prices and the true prices.
6. In a scatter plot, visualize predicted prices versus true prices (in dollars).
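
A minimal sketch of these steps is given below, under several assumptions that are not part of the task description: the input arrays are random stand-ins for the preprocessed data from Task 3, `max_price` is a placeholder value, and the number of epochs is reduced to 20 so that the snippet runs quickly (the task asks for 200). Because recent Keras versions no longer accept a `decay` argument in the optimizer, the learning-rate decay is expressed here with an `InverseTimeDecay` schedule, which computes `lr / (1 + decay_rate * step)` when `decay_steps=1`.

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

EPOCHS = 20            # the task asks for 200; reduced for a quick demo
BATCH_SIZE = 8
LEARNING_RATE = 1e-3

# Random stand-ins for the preprocessed arrays of Task 3 (illustration only).
rng = np.random.default_rng(0)
n_features = 6
X_train = rng.random((60, n_features)).astype("float32")
X_test = rng.random((20, n_features)).astype("float32")
y_train = rng.uniform(0.1, 1.0, 60).astype("float32")
y_test = rng.uniform(0.1, 1.0, 20).astype("float32")
max_price = 900_000.0  # placeholder for the maximum training price

# MLP with hidden layers of 8 and 4 relu units and one linear output neuron.
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(1, activation="linear"),
])

# Learning rate 0.001 with decay 0.001 / EPOCHS, applied per training step.
schedule = keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=LEARNING_RATE,
    decay_steps=1,
    decay_rate=LEARNING_RATE / EPOCHS,
)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=schedule),
    loss="mean_absolute_percentage_error",
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=0,
)

# Rescale predictions and targets back to dollars.
pred_prices = model.predict(X_test, verbose=0).flatten() * max_price
true_prices = y_test * max_price

fig, (ax_loss, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))
ax_loss.plot(history.history["loss"], label="training")
ax_loss.plot(history.history["val_loss"], label="test")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("MAPE loss")
ax_loss.legend()
ax_scatter.scatter(true_prices, pred_prices)
ax_scatter.set_xlabel("true price ($)")
ax_scatter.set_ylabel("predicted price ($)")
```

In a notebook with the inline backend the figure is rendered automatically; in a plain script, `plt.show()` would have to be called at the end.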