### Dataset Visualizations

Dataset used: [Real Estate dataset](http://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set)

##### Features:
* transaction date: the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
* house age: the house age (unit: year)
* MRT distance: the distance to the nearest MRT station (unit: meter)
* number of stores: the number of convenience stores in the living circle on foot (integer)
* latitude: the geographic coordinate, latitude. (unit: degree)
* longitude: the geographic coordinate, longitude. (unit: degree)
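
The fractional encoding of the transaction date can be decoded into a year and a month. The sketch below assumes the fraction equals month / 12, which matches the two documented examples (0.250 is March, 0.500 is June). The helper name `decode_transaction_date` is hypothetical, and treating a fraction of zero as December of the previous year is an assumption that follows from the same rule, not something the dataset description states.

```python
def decode_transaction_date(value):
    """Split a fractional-year value such as 2013.250 into (year, month)."""
    year = int(value)
    # Assumption: the fraction is month / 12, so 0.250 -> 3 (March).
    month = round((value - year) * 12)
    if month == 0:
        # Assumption: a zero fraction denotes December of the previous year.
        return year - 1, 12
    return year, month


print(decode_transaction_date(2013.250))  # (2013, 3)
print(decode_transaction_date(2013.500))  # (2013, 6)
```
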

The output is as follows:
* price: house price per unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square meters)
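
To read the target in more familiar units, the price can be converted from 10000 NTD per Ping to NTD per square meter using the 3.3 factor given above. This is a minimal sketch; the function name `price_per_square_meter` is hypothetical.

```python
PING_IN_SQUARE_METERS = 3.3  # conversion factor from the dataset description


def price_per_square_meter(price):
    """Convert a price given in 10000 NTD/Ping to NTD per square meter."""
    return price * 10_000 / PING_IN_SQUARE_METERS


# 37.9 is the price of the first row shown by real_estate.head()
print(round(price_per_square_meter(37.9)))  # 114848
```
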

In [19]:
# All imports needed
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures 
from sklearn import neighbors
from sklearn.tree import DecisionTreeRegressor
import sklearn.metrics as metrics

In [20]:
# Read data from file; usecols skips the first column of the CSV (the row number)
column_names = ['transaction date', 'house age', 'MRT distance', 'number of stores', 'latitude', 'longitude', 'price']
real_estate = pd.read_csv("../data/estate/Real estate valuation data set.csv", header=0, names=column_names, usecols=range(1, 8))

real_estate.head()

|   | transaction date | house age | MRT distance | number of stores | latitude | longitude | price |
|---|---|---|---|---|---|---|---|
| 0 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |


#### Modeling:

In [26]:
def linear_regression(X_train, X_test, y_train):
    
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Testing
    y_pred = model.predict(X_test)
    
    # predictions on testing set
    return y_pred

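def polynomial_regression(X_train, X_test, y_train, degree = 2):
    # Expand the features into polynomial terms up to the given degree
    poly = PolynomialFeatures(degree)
    x_poly = poly.fit_transform(X_train)
    
    model = LinearRegression()
    model.fit(x_poly, y_train)
    
    # Reuse the transformer fitted on the training set; do not refit it on test data
    y_pred = model.predict(poly.transform(X_test))
    return y_pred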
    
def knn_regression(X_train, X_test, y_train, n = 5):
    model = neighbors.KNeighborsRegressor(n_neighbors=n)
    model.fit(X_train, y_train)

    # Testing
    y_pred = model.predict(X_test)
    
    # predictions on testing set
    return y_pred
    
def regression_tree(X_train, X_test, y_train):
    model = DecisionTreeRegressor(random_state=0)
    model.fit(X_train, y_train)

    # Testing
    y_pred = model.predict(X_test)
    
    # predictions on testing set
    return y_pred

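def plot_scatter(y_test, y_pred):
    # Plot actual prices against predicted prices
    plt.scatter(y_test, y_pred)
    plt.xlabel('actual price')
    plt.ylabel('predicted price')

    # Hide the tick marks on both axes
    plt.xticks(())
    plt.yticks(())

    plt.show()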
    
def print_performance(y_test, y_pred):
    # Mean Squared Error
    mse = metrics.mean_squared_error(y_test, y_pred)
    print("MSE: ", mse)
    
    # Root Mean Squared Error, taken as the square root of the MSE because the
    # squared=False argument is deprecated in newer scikit-learn releases
    print("RMSE: ", np.sqrt(mse))
    
    # R2 is reported here as a percentage
    # 100 indicates that the model explains all the variability of the response data around its mean.
    # 0 indicates that the model explains none of the variability of the response data around its mean.
    # On a test set R2 can also be negative: the model does worse than always predicting the mean.
    print("R2: ", metrics.r2_score(y_test, y_pred) * 100)
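
The R2 score reported above is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below recomputes it by hand on made-up numbers (the helper name `r2_manual` and the sample values are illustrative assumptions, not part of the dataset) to show the three regimes: a perfect fit, a fit no better than predicting the mean, and a fit that is worse than the mean.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]  # illustrative values only

print(r2_manual(y_true, [1.0, 2.0, 3.0]))  # perfect predictions: 1.0
print(r2_manual(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean: 0.0
print(r2_manual(y_true, [3.0, 2.0, 1.0]))  # worse than the mean: -3.0
```

Multiplying these values by 100 gives the percentage form printed by `print_performance`.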

In [28]:
# 80/20 train/test split; no random_state is set, so the scores below change from run to run
X_train, X_test, y_train, y_test = train_test_split(real_estate.drop('price', axis=1), real_estate['price'], test_size=0.2)

# Linear regression
print('Linear Regression')
y_pred = linear_regression(X_train, X_test, y_train)
print_performance(y_test,y_pred)

# Polynomial regression
print('\nPolynomial Regression')
y_pred = polynomial_regression(X_train, X_test, y_train)
print_performance(y_test,y_pred)

# knn regression
print('\nKNN Regression')
y_pred = knn_regression(X_train, X_test, y_train)
print_performance(y_test,y_pred)

# regression tree
print('\nRegression Tree')
y_pred = regression_tree(X_train, X_test, y_train)
print_performance(y_test,y_pred)

Linear Regression
MSE:  44.123122373753944
RMSE:  6.642523795497758
R2:  67.96485519960132

Polynomial Regression
MSE:  34.21241608856267
RMSE:  5.849138063728935
R2:  75.16042282582124

KNN Regression
MSE:  44.434544578313265
RMSE:  6.665924135355372
R2:  67.73875027138227

Regression Tree
MSE:  54.71713855421687
RMSE:  7.397103389450283
R2:  60.27317736492539
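
These scores come from a single random 80/20 split, so the ranking of the models may shift with a different split. One way to get a steadier comparison is k-fold cross-validation. The sketch below is a possible setup, not the method used in this notebook: it runs on synthetic data from `make_regression` as a stand-in (an assumption, so the block runs without the CSV file), and the 5 folds and fixed seeds are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); with the real data, use
# real_estate.drop('price', axis=1) as X and real_estate['price'] as y.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "KNN Regression": KNeighborsRegressor(n_neighbors=5),
    "Regression Tree": DecisionTreeRegressor(random_state=0),
}

# Shuffled 5-fold split with a fixed seed so the folds are reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_r2 = {}
for name, model in models.items():
    # One R2 score per fold; each fold is used as the test set exactly once
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    mean_r2[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

The per-fold standard deviation gives a rough sense of how much of the gap between two models is due to the split rather than the model.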
