# DS2 Assignment 1 - Nicolas Fernandez
## Predicting Property Prices in New Taipei City, Taiwan
The task is to predict property prices in New Taipei City, Taiwan using data taken from [UC Irvine](https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set).

The exact dataset used is a cleaned version uploaded to Janos Divenyi's [GitHub repository](https://github.com/divenyijanos/ceu-ml/tree/2023/data/real_estate).

The task is to draw a 20% subsample from the data and then split that subsample 70/30 into training and test sets.

## Reading Data and Creating Training and Test Splits

In [19]:
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Setting the pseudo-random seed to the due date of the assignment
prng = np.random.RandomState(20240322)

# Reading the data
#real_estate_data = pd.read_csv("https://raw.githubusercontent.com/divenyijanos/ceu-ml/2023/data/real_estate/real_estate.csv")
real_estate_data = pd.read_csv('real_estate.csv')
display(real_estate_data)

# Creating a randomly selected 20% sample of the data, creating target variable 'outcome'
# Passing the seeded generator so the subsample is reproducible across runs
real_estate_sample = real_estate_data.sample(frac=0.2, random_state=prng)
outcome = real_estate_sample["house_price_of_unit_area"]

# Selecting numeric features as X variables, excluding target variable
features = real_estate_sample.drop(columns=["house_price_of_unit_area"]).select_dtypes(include=np.number)

# Creating training and testing splits from features and target, using a 30% split per specifications
X_train, X_test, y_train, y_test = train_test_split(features, outcome, test_size=0.3, random_state=prng)

print(f"Size of the training set: {X_train.shape}, size of the test set: {X_test.shape}")

| | id | transaction_date | house_age | distance_to_the_nearest_MRT_station | number_of_convenience_stores | latitude | longitude | house_price_of_unit_area |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.917 | 19.5 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583 | 13.3 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.500 | 13.3 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833 | 5.0 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 409 | 410 | 2013.000 | 13.7 | 4082.01500 | 0 | 24.94155 | 121.50381 | 15.4 |
| 410 | 411 | 2012.667 | 5.6 | 90.45606 | 9 | 24.97433 | 121.54310 | 50.0 |
| 411 | 412 | 2013.250 | 18.8 | 390.96960 | 7 | 24.97923 | 121.53986 | 40.6 |
| 412 | 413 | 2013.000 | 8.1 | 104.81010 | 5 | 24.96674 | 121.54067 | 52.5 |


Size of the training set: (58, 7), size of the test set: (25, 7)


## Evaluation Function

In [20]:
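# Defining the loss function, using Root Mean Squared Logarithmic Error (RMSLE)
def calculateRMSLE(prediction, y_obs):
    # Replacing negative predictions with 0 so the logarithm is defined
    clipped = np.where(prediction < 0, 0, prediction)
    log_error = np.log(clipped + 1) - np.log(y_obs + 1)
    return round(np.sqrt(np.mean(log_error**2)), 4)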

RMSLE is an appropriate loss function because it measures errors on a logarithmic scale, which makes it less sensitive to outliers than RMSE and therefore well suited to property price prediction. From a business perspective, the risk of a wrong prediction is under- or overpricing homes in certain areas, because observations that the loss function down-weights as outliers may matter more than they appear.
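
The behaviour described above can be illustrated with a small sketch. The prices below are made-up values, and the `rmse` and `rmsle` helpers are hypothetical stand-ins written for this example (they are not the notebook's `calculateRMSLE`, which also rounds its result). The sketch compares how the two metrics react to the same absolute error at different price levels, and to under- versus over-prediction.

```python
import numpy as np

def rmse(prediction, y_obs):
    """Root mean squared error (hypothetical helper for comparison)."""
    prediction = np.asarray(prediction, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    return float(np.sqrt(np.mean((prediction - y_obs) ** 2)))

def rmsle(prediction, y_obs):
    """Root mean squared logarithmic error, with negative predictions set to 0."""
    prediction = np.clip(np.asarray(prediction, dtype=float), 0, None)
    y_obs = np.asarray(y_obs, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(prediction) - np.log1p(y_obs)) ** 2)))

# Made-up example 1: the same absolute miss (20) on a cheap and an expensive home
cheap_rmse, cheap_rmsle = rmse([20.0], [40.0]), rmsle([20.0], [40.0])
pricey_rmse, pricey_rmsle = rmse([100.0], [120.0]), rmsle([100.0], [120.0])

# Made-up example 2: under- vs over-predicting a price of 40 by the same amount
under_rmsle = rmsle([20.0], [40.0])
over_rmsle = rmsle([60.0], [40.0])

print(f"RMSE  cheap vs pricey: {cheap_rmse:.3f} vs {pricey_rmse:.3f}")
print(f"RMSLE cheap vs pricey: {cheap_rmsle:.3f} vs {pricey_rmsle:.3f}")
print(f"RMSLE under vs over:   {under_rmsle:.3f} vs {over_rmsle:.3f}")
```

RMSE treats both misses in the first example identically, while RMSLE shrinks the error on the expensive home, which is why large prices pull the metric less. The second example shows that RMSLE penalises under-prediction more heavily than over-prediction of the same size, which is relevant to the underpricing risk mentioned above.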

## Models
All models are built below, in order of increasing complexity and/or with different regression methods (OLS vs. machine learning).

### Benchmark Model

In [24]:
# Creating a simple benchmark model for comparison, mean value of target variable for training set
benchmark = np.mean(y_train)

# Calculating performance of benchmark on test set using loss function
benchmark_pred = ['Benchmark', calculateRMSLE(benchmark, y_train), calculateRMSLE(benchmark, y_test)]

# Storing and displaying results in a dataframe
results = pd.DataFrame([benchmark_pred], columns=['Model', 'Train', 'Test'])
results

| | Model | Train | Test |
|---|---|---|---|
| 0 | Benchmark | 0.3836 | 0.4544 |


### OLS Models

#### Single Feature - `distance_to_the_nearest_MRT_station`

In [41]:
# Importing required library
from sklearn.linear_model import LinearRegression

# Creating OLS model trained on only `distance_to_the_nearest_MRT_station`
ols_single = LinearRegression().fit(X_train[['distance_to_the_nearest_MRT_station']], y_train)

# Creating predictions for model and calculating RMSLE
train_error = calculateRMSLE(ols_single.predict(X_train[['distance_to_the_nearest_MRT_station']]), y_train)
test_error = calculateRMSLE(ols_single.predict(X_test[['distance_to_the_nearest_MRT_station']]), y_test)
ols_single_pred = ['OLS Single Feature', train_error, test_error]

# Adding to results
results.loc[len(results)] = ols_single_pred
results

| | Model | Train | Test |
|---|---|---|---|
| 0 | Benchmark | 0.3836 | 0.4544 |
| 1 | OLS Single Feature | 0.346 | 0.3168 |


This model improves on the benchmark of simply predicting the training-set mean of the target: train RMSLE falls from 0.3836 to 0.346 and test RMSLE from 0.4544 to 0.3168. The improvement is not large, however, and the model uses only a single feature from the dataset. It likely misses much of the complexity in the data, so it can be improved significantly.
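
To put the change in relative terms, the sketch below recomputes it from the RMSLE values in the results table. The dictionary names and the `relative_reduction` helper are hypothetical, introduced only for this illustration.

```python
# RMSLE values copied from the results table above
benchmark_rmsle = {"Train": 0.3836, "Test": 0.4544}
ols_single_rmsle = {"Train": 0.346, "Test": 0.3168}

def relative_reduction(baseline, model):
    """Share of the baseline error removed by the model (hypothetical helper)."""
    return (baseline - model) / baseline

reductions = {
    split: relative_reduction(benchmark_rmsle[split], ols_single_rmsle[split])
    for split in benchmark_rmsle
}

for split, value in reductions.items():
    print(f"{split}: RMSLE reduced by {value:.1%} relative to the benchmark")
```

Because the test set contains only 25 observations, the test-set figure is noisy and should be read with caution.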

#### Multivariate OLS
This OLS model uses all the available features instead of a single variable.