In [6]:
import sqlite3
from sqlite3 import Error

import pandas as pd
import numpy as np

In [4]:
def read_data_from_DB():
    path_to_DB = r"../Data/house.db"
    df = None  # stays None if the database cannot be read
    try:
        conn = sqlite3.connect(path_to_DB)
        try:
            df = pd.read_sql_query("SELECT * FROM house_features", conn)
        finally:
            conn.close()  # always release the database file
    except Error as e:
        print(e)
    return df
df = read_data_from_DB()

In [5]:
df.head(5)


Unnamed: 0,ID,HOUSE_TYPE,HOUSE_ROOMS,ADDRESS,AREA,CITY,PRICE,SIZE,YEAR,LATITUDE,LONGITUDE
0,1,Kerrostalo,2 h + k,Mechelininkatu 17 A,Töölö,Helsinki,575000,81,1928,60.172709,24.920025
1,2,Kerrostalo,1h + kk + kph + lasitettu parveke,Porvoonkatu 5-7 A,Alppiharju,Helsinki,209000,28,1963,60.189479,24.950852
2,3,Omakotitalo,"3-4h, k, rh, 2wc, kuisti n. 115 m2 + saunaos....",Lainlukijantie 42,Torpparinmäki,Helsinki,398000,115,1954,60.263987,24.954653
3,4,Kerrostalo,1 h + kk + kph,Kauppalantie 13,Etelä-Haaga,Helsinki,199000,26,1963,60.21179,24.898284
4,5,Kerrostalo,2H + KK + S,Leikosaarentie 13,Vuosaari,Helsinki,16446,46,1996,60.202981,25.142185


Helsinki city center is quite large, but if someone had to point out *the* central point, they would say either "Narinkkatori" or "Kolmen sepän patsas".
I want to measure the distance between Narinkkatori and the listed house/apartment. My hypothesis is that the further away the location is from this central point, the less it will be valued.

The distance could be computed in various ways, but as we are possibly dealing with small distances ( $d<1$ km), we need to be careful when selecting the correct distance computation method.

Let $\lambda_1, \phi_1$ and $\lambda_2, \phi_2$ be the geographical longitude and latitude (in radians) of two points 1 and 2, and let $\Delta\lambda, \Delta\phi$ be their absolute differences. $R$ is the mean radius of the Earth (6371 km).

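We could use the spherical law of cosines:
- $d = R \cdot \arccos(\sin\phi_1 \sin\phi_2 + \cos\phi_1 \cos\phi_2 \cos\Delta\lambda)$, but this is susceptible to floating-point rounding errors when the two points are close together, because the argument of $\arccos$ is then very close to 1.

A better alternative would be the Haversine formula:
- $d = R \cdot \operatorname{archav}\left(\operatorname{hav}(\Delta\phi) + (1 - \operatorname{hav}(\Delta\phi) - \operatorname{hav}(\phi_1+\phi_2)) \cdot \operatorname{hav}(\Delta\lambda)\right)$,
- $\operatorname{hav}(x) = \sin^2(x/2)$, $\operatorname{archav}(x) = 2\arcsin(\sqrt{x})$

The simplest method would be to project the spherical coordinates onto a plane (the equirectangular approximation):
- $d = R\sqrt{(\Delta\phi)^2 + \left(\cos\left(\frac{\phi_1+\phi_2}{2}\right)\Delta\lambda\right)^2}$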
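
To make the three formulas above concrete, here is a minimal sketch that evaluates each of them for two of the listings from the table (Töölö and Torpparinmäki). It uses only the standard `math` module so it runs without the notebook's data. The function names and the constant `R_EARTH_KM` are illustrative and not part of the notebook.

```python
import math

R_EARTH_KM = 6371.0  # mean Earth radius used in the text


def law_of_cosines(lat1, lon1, lat2, lon2):
    """Spherical law of cosines; inputs in decimal degrees, result in km."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    cos_angle = (math.sin(p1) * math.sin(p2)
                 + math.cos(p1) * math.cos(p2) * math.cos(l2 - l1))
    # Rounding can push the value marginally outside acos's domain, so clamp it.
    return R_EARTH_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def haversine(lat1, lon1, lat2, lon2):
    """Haversine formula; inputs in decimal degrees, result in km."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    return R_EARTH_KM * 2 * math.asin(math.sqrt(a))


def equirectangular(lat1, lon1, lat2, lon2):
    """Planar (equirectangular) approximation; inputs in degrees, result in km."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    x = math.cos((p1 + p2) / 2) * (l2 - l1)  # longitude difference, shrunk by cos(latitude)
    y = p2 - p1
    return R_EARTH_KM * math.hypot(x, y)


toolo = (60.172709, 24.920025)          # first listing in the table
torpparinmaki = (60.263987, 24.954653)  # third listing in the table

for name, formula in [("law of cosines", law_of_cosines),
                      ("haversine", haversine),
                      ("equirectangular", equirectangular)]:
    print(f"{name:>16}: {formula(*toolo, *torpparinmaki):.3f} km")
```

At this scale (about 10 km) the three results should differ only marginally. The differences become relevant for very short distances (law of cosines) or for long ones (planar projection).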



In [20]:
def compute_dist(lat1, lon1, lat2, lon2, dist="Spherical_to_plane"):  # lat = phi, lon = lambda
    """Distance in km between points given in decimal degrees."""
    R = 6371  # mean Earth radius in km
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    if dist == "Spherical_to_plane":
        # cos(mean latitude) scales dlon, so the whole product is squared
        return R * np.sqrt(dlat**2 + (np.cos((lat1 + lat2) / 2) * dlon)**2)
    elif dist == "Haversine":
        a = (np.sin(dlat / 2)**2
             + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2)
        return R * 2 * np.arcsin(np.sqrt(a))
    raise ValueError(f"Unknown distance method: {dist}")

In [21]:
p1 = 60.172709, 24.920025
p2 = 60.263987, 24.954653
print(f"With a projection of spherical coordinates to a plane: {compute_dist(*p1, *p2):.3f} km")
print(f"With the Haversine formula: {compute_dist(*p1, *p2, dist='Haversine'):.3f} km")

With a projection of spherical coordinates to a plane: 10.328 km
With the Haversine formula: 10.328 km


For this pair of points, roughly 10 km apart, both methods agree to three decimal places, so within the city either one would do. The planar projection becomes less accurate as distances grow, however. In the future we might want to add listings from cities far away from the capital area, and the Haversine formula would still give decent approximations there, so I will use it from here on.

In [30]:
coords_narinkkatori = 60.169673, 24.934854
# compute_dist(62.617156,29.716200,*coords_narinkkatori,dist="Haversine")
df["DIST_to_center"] = compute_dist(df["LATITUDE"],df["LONGITUDE"],*coords_narinkkatori,dist="Haversine")
df.head()

Unnamed: 0,ID,HOUSE_TYPE,HOUSE_ROOMS,ADDRESS,AREA,CITY,PRICE,SIZE,YEAR,LATITUDE,LONGITUDE,DIST_to_center
0,1,Kerrostalo,2 h + k,Mechelininkatu 17 A,Töölö,Helsinki,575000,81,1928,60.172709,24.920025,0.886916
1,2,Kerrostalo,1h + kk + kph + lasitettu parveke,Porvoonkatu 5-7 A,Alppiharju,Helsinki,209000,28,1963,60.189479,24.950852,2.373312
2,3,Omakotitalo,"3-4h, k, rh, 2wc, kuisti n. 115 m2 + saunaos....",Lainlukijantie 42,Torpparinmäki,Helsinki,398000,115,1954,60.263987,24.954653,10.544124
3,4,Kerrostalo,1 h + kk + kph,Kauppalantie 13,Etelä-Haaga,Helsinki,199000,26,1963,60.21179,24.898284,5.100788
4,5,Kerrostalo,2H + KK + S,Leikosaarentie 13,Vuosaari,Helsinki,16446,46,1996,60.202981,25.142185,12.045611
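
As a first, very rough look at the hypothesis that value drops with distance from the center, the sketch below computes a price per square metre for the five rows shown above and correlates it with `DIST_to_center`. It assumes `SIZE` is in m² and uses only the standard library. The helper `pearson` and the variable names are illustrative. Five rows say nothing statistically, and the Vuosaari price looks suspiciously low, so it may need attention during data cleaning.

```python
import math

# (AREA, PRICE, SIZE, DIST_to_center) copied from the five rows shown above.
rows = [
    ("Töölö", 575000, 81, 0.886916),
    ("Alppiharju", 209000, 28, 2.373312),
    ("Torpparinmäki", 398000, 115, 10.544124),
    ("Etelä-Haaga", 199000, 26, 5.100788),
    ("Vuosaari", 16446, 46, 12.045611),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


dists = [dist for _, _, _, dist in rows]
prices_per_m2 = [price / size for _, price, size, _ in rows]  # assumes SIZE is in m2

# List the rows from closest to farthest, then report the correlation.
for (area, *_), d, p in sorted(zip(rows, dists, prices_per_m2), key=lambda t: t[1]):
    print(f"{area:<14} {d:6.2f} km  {p:8.0f} per m2")
r = pearson(dists, prices_per_m2)
print(f"Pearson r = {r:.2f}")
```

On the full table the same idea would be a new column such as `df["PRICE"] / df["SIZE"]` followed by a correlation or scatter plot against `DIST_to_center`.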


## Exploratory Data Analysis

### Categorical features

### Numerical features

## TODO:
- EDA on features
- Cleaning data + Regression on prices
    - Naive + XGBoost + ...