# Final Model Evaluation and Validation

In [1]:
# General imports
import joblib
import numpy as np
import pandas as pd
import requests

from geopy.distance import geodesic
from shapely.geometry import Point, Polygon
from sklearn.metrics import r2_score

## 1. Introduction

We now test the final model on unseen data covering the period 2024.09.12 to 2024.09.30.

### 1.1 Loading Dataset (unseen data)

In [2]:
# Load unseen data (replace this with your actual data loading method)
df = pd.read_csv(r"C:\Users\gustm\Desktop\Unseen_Data_240930.csv", sep=';', encoding='ISO-8859-1')

In [3]:
df.head()

Unnamed: 0,Address,Price,Living Area,Side Area,Year of Building,Monthly Fee,Floor,Elevator,Balcony,Patio,Fireplace
0,Adventsvägen 1,7800000,85,,2017,7129,5,1,1,0,0
1,Bäckvägen 42,3200000,59,,1942,5043,1,1,0,0,0
2,Bäckvägen 50,3675000,55,,1943,3957,3,1,1,0,0
3,Cedergrensvägen 18,2700000,35,,1940,3206,1,0,1,0,0
4,Cedergrensvägen 36,2400000,345,,1939,3186,1,0,0,0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Address           47 non-null     object 
 1   Price             47 non-null     int64  
 2   Living Area       47 non-null     object 
 3   Side Area         1 non-null      float64
 4   Year of Building  47 non-null     int64  
 5   Monthly Fee       47 non-null     int64  
 6   Floor             47 non-null     int64  
 7   Elevator          47 non-null     int64  
 8   Balcony           47 non-null     int64  
 9   Patio             47 non-null     int64  
 10  Fireplace         47 non-null     int64  
dtypes: float64(1), int64(8), object(2)
memory usage: 4.2+ KB


### 1.2 Highlights of Tasks in this Notebook

- Data preprocessing and scaling of the dataset (unseen data)
- Evaluating the prediction of unseen data
- Future research and improvements

## 2. Data Preprocessing (Unseen Data)

### 2.1 Coordinates

In [5]:
# Define your API key
API_KEY = '***********************' # Enter your API_KEY here

# Prepare columns for latitude and longitude
df['Latitude'] = None
df['Longitude'] = None

# Geocoding function
def get_lat_lon(address):
    # Append ", Stockholm" to the partial address
    full_address = f"{address}, Stockholm"
    
    base_url = 'https://maps.googleapis.com/maps/api/geocode/json'
    params = {'address': full_address, 'key': API_KEY}
    
    response = requests.get(base_url, params=params)
    
    if response.status_code == 200:
        data = response.json()
        
        if data['status'] == 'OK':
            results = data['results']
            if results:
                # Choose the first result as the best match
                location = results[0]['geometry']['location']
                return location['lat'], location['lng']
            else:
                print("No results found.")
        else:
            print(f"Geocoding API Error: {data['status']}")
    else:
        print(f"HTTP Request Error: {response.status_code}")
    
    return None, None


# Process each address, skipping rows where 'Address' is NaN
for i, row in df.iterrows():
    address = row['Address']
    
    if pd.notna(address):  # Only proceed if the address is not NaN
        lat, lon = get_lat_lon(address)
        df.at[i, 'Latitude'] = lat
        df.at[i, 'Longitude'] = lon
    else:
        print(f"Skipping row {i} due to missing address.")
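
The response handling inside `get_lat_lon` can be separated from the network call, which makes it possible to test without an API key. Below is a minimal sketch, assuming the same response shape that the function above reads (`status`, then `results[0]['geometry']['location']`). The helper name `parse_geocode_response` and the sample payloads are hypothetical and not part of the notebook.

```python
def parse_geocode_response(data):
    """Return (lat, lng) from a geocoding-style JSON dict, or (None, None)."""
    if data.get('status') != 'OK':
        return None, None
    results = data.get('results') or []
    if not results:
        return None, None
    # Choose the first result as the best match, as in get_lat_lon
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']


# Hypothetical payloads mimicking the structure read in get_lat_lon
sample = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 59.301007, 'lng': 18.007276}}}],
}
coords = parse_geocode_response(sample)
missing = parse_geocode_response({'status': 'ZERO_RESULTS', 'results': []})
print(coords, missing)
```

With the parsing isolated like this, `get_lat_lon` would only need to perform the request and pass `response.json()` on.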

In [6]:
df.head()

Unnamed: 0,Address,Price,Living Area,Side Area,Year of Building,Monthly Fee,Floor,Elevator,Balcony,Patio,Fireplace,Latitude,Longitude
0,Adventsvägen 1,7800000,85,,2017,7129,5,1,1,0,0,59.301007,18.007276
1,Bäckvägen 42,3200000,59,,1942,5043,1,1,0,0,0,59.303482,18.000106
2,Bäckvägen 50,3675000,55,,1943,3957,3,1,1,0,0,59.303267,17.998685
3,Cedergrensvägen 18,2700000,35,,1940,3206,1,0,1,0,0,59.302473,18.001264
4,Cedergrensvägen 36,2400000,345,,1939,3186,1,0,0,0,0,59.301316,17.998889


In [7]:
# Check if any values in the column are not floats
non_float_values = df['Latitude'].apply(lambda x: not isinstance(x, float))

# Display the rows with non-float values
print(df[non_float_values])

Empty DataFrame
Columns: [Address, Price, Living Area, Side Area, Year of Building, Monthly Fee, Floor, Elevator, Balcony, Patio, Fireplace, Latitude, Longitude]
Index: []


**Comment**
- The geocoding went through: no row has a non-float `Latitude`.
- Let's make sure that `Latitude` and `Longitude` are numeric with dtype `float`.

In [8]:
# Convert 'Latitude' and 'Longitude' to numeric (float), coercing invalid values to NaN
df['Latitude'] = pd.to_numeric(df['Latitude'], errors='coerce')
df['Longitude'] = pd.to_numeric(df['Longitude'], errors='coerce')

# Verify the changes
print(df[['Latitude', 'Longitude']].dtypes)

Latitude     float64
Longitude    float64
dtype: object


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Address           47 non-null     object 
 1   Price             47 non-null     int64  
 2   Living Area       47 non-null     object 
 3   Side Area         1 non-null      float64
 4   Year of Building  47 non-null     int64  
 5   Monthly Fee       47 non-null     int64  
 6   Floor             47 non-null     int64  
 7   Elevator          47 non-null     int64  
 8   Balcony           47 non-null     int64  
 9   Patio             47 non-null     int64  
 10  Fireplace         47 non-null     int64  
 11  Latitude          47 non-null     float64
 12  Longitude         47 non-null     float64
dtypes: float64(3), int64(8), object(2)
memory usage: 4.9+ KB


### 2.2 Districts

Assigning apartments to set `Districts`.

In [10]:
# District polygons (vertices given as (latitude, longitude)) and a function to determine which district a coordinate falls into
district_polygons = {
    'LM-Staden': Polygon([(59.30090784434475, 18.004156569121477), (59.303536039023335, 18.006542466249208), (59.30354906550476, 18.003308108384427), (59.302513444665585, 18.001572910476696), (59.30275444100907, 18.000360823702913), (59.30229524379241, 17.997987685382537), (59.30163790886046, 17.995702455812182), (59.30129540998905, 17.995002124276585), (59.30020109563427, 17.995270059244486), (59.29830876068701, 17.996424731383566), (59.29864424254005, 17.998115273462787), (59.30011641507348, 18.00113911099409), (59.30060495382102, 18.00307844983214), (59.30097949547381, 18.00324431434554)]),
    'Hökmossen': Polygon([(59.29416724083719, 17.99957193725194), (59.291986739767346, 17.996728795778786), (59.29028826274625, 17.9917828024887), (59.28985541149427, 17.986107248298183), (59.29033757457149, 17.985324043074623), (59.291937430321624, 17.987426894710957), (59.29708600252817, 17.98890202548726), (59.29726677605906, 17.99180953973953), (59.295941081186214, 17.99581139575144), (59.29549186707748, 17.99611180312806)]),
    'Telefonplan': Polygon([(59.29708600252817, 17.98890202548726), (59.29839519151933, 17.98749651998394), (59.2997646113716, 17.987754012149693), (59.30061906464913, 17.99156273435273), (59.30129540998905, 17.995002124276585), (59.30020109563427, 17.995270059244486), (59.29830876068701, 17.996424731383566), (59.29864424254005, 17.998115273462787), (59.29869523342455, 18.002811115370527), (59.29851783894495, 18.002667485926423), (59.29770552181718, 18.005206301469677), (59.2941365592012, 17.999947494068802), (59.29416724083719, 17.99957193725194), (59.29549186707748, 17.99611180312806), (59.295941081186214, 17.99581139575144), (59.29726677605906, 17.99180953973953)]),
    'Midsommarkransen': Polygon([(59.30279356694878, 17.988837635873434), (59.304113525613566, 17.993429539256592), (59.304700577610674, 18.001714335838713), (59.305335877510586, 18.009213791922114), (59.305116809915, 18.016402111974603), (59.30358329727191, 18.019341813172957), (59.30037365065614, 18.014481650757833), (59.299551390719465, 18.01130624585736), (59.29770552181718, 18.005206301469677), (59.29851783894495, 18.002667485926423), (59.29869523342455, 18.002811115370527), (59.29864424254005, 17.998115273462787), (59.30011641507348, 18.00113911099409), (59.30060495382102, 18.00307844983214), (59.30097949547381, 18.00324431434554), (59.30090784434475, 18.004156569121477), (59.303536039023335, 18.006542466249208), (59.30354906550476, 18.003308108384427), (59.302513444665585, 18.001572910476696), (59.30275444100907, 18.000360823702913), (59.30229524379241, 17.997987685382537), (59.30163790886046, 17.995702455812182), (59.30129540998905, 17.995002124276585), (59.30061906464913, 17.99156273435273)])
}

def assign_district(row):
    point = Point(row['Latitude'], row['Longitude'])
    for district_id, polygon in district_polygons.items():
        if polygon.contains(point):
            return district_id
    return None  # If not in any district

# Apply function to each row in the DataFrame
df['District'] = df.apply(assign_district, axis=1)
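
To make the containment test above more concrete, here is a small pure-Python ray-casting sketch. It is not shapely's implementation, only an illustration of the idea behind a point-in-polygon check; the function name and the toy rectangle are hypothetical. Coordinates are passed as (latitude, longitude), the same order used for the polygons above.

```python
def point_in_polygon(point, vertices):
    """Ray-casting test; points exactly on the boundary are not handled specially."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "district" in (latitude, longitude) order
toy_district = [(59.30, 18.00), (59.30, 18.01), (59.31, 18.01), (59.31, 18.00)]

is_inside = point_in_polygon((59.305, 18.005), toy_district)
is_outside = point_in_polygon((59.320, 18.005), toy_district)
print(is_inside, is_outside)
```

Since `assign_district` returns the first polygon that contains the point, the order of `district_polygons` would decide the result if two polygons ever overlapped.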

In [11]:
df.head()

Unnamed: 0,Address,Price,Living Area,Side Area,Year of Building,Monthly Fee,Floor,Elevator,Balcony,Patio,Fireplace,Latitude,Longitude,District
0,Adventsvägen 1,7800000,85,,2017,7129,5,1,1,0,0,59.301007,18.007276,Midsommarkransen
1,Bäckvägen 42,3200000,59,,1942,5043,1,1,0,0,0,59.303482,18.000106,Midsommarkransen
2,Bäckvägen 50,3675000,55,,1943,3957,3,1,1,0,0,59.303267,17.998685,Midsommarkransen
3,Cedergrensvägen 18,2700000,35,,1940,3206,1,0,1,0,0,59.302473,18.001264,LM-Staden
4,Cedergrensvägen 36,2400000,345,,1939,3186,1,0,0,0,0,59.301316,17.998889,LM-Staden


Looks good.
<br> Let's create the final `District`, `Gamla Midsommarkransen`, for all apartments in `Midsommarkransen` built in 1920 or earlier.

In [12]:
df.loc[(df['District'] == 'Midsommarkransen') & (df['Year of Building'] <= 1920), 'District'] = 'Gamla Midsommarkransen'

In [13]:
df['District'].value_counts().sort_index(ascending=True)

District
Gamla Midsommarkransen     1
Hökmossen                  1
LM-Staden                  8
Midsommarkransen          19
Telefonplan               18
Name: count, dtype: int64

### 2.3 Living and Side Area

In [14]:
# Fill NaN values with 0 in the 'Side Area' column
df['Side Area'] = df['Side Area'].fillna(0)

In [15]:
columns = ['Living Area', 'Side Area']

for col in columns:
    # Step 1: Convert the column to string (if it's not already)
    df[col] = df[col].astype(str)

    # Step 2: Replace commas with decimal points
    df[col] = df[col].str.replace(',', '.', regex=False)

    # Convert non-numeric strings to NaN
    df[col] = df[col].replace('nan', np.nan)

    # Step 3: Convert columns to numeric (float), coercing errors
    df[col] = pd.to_numeric(df[col], errors='coerce')
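
An alternative to the string replacement above is to let pandas parse decimal commas while reading the file. The sketch below uses a small in-memory sample (hypothetical rows that reuse the notebook's column names) and assumes the raw file stores values such as `34,5`.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample with decimal commas
raw = "Address;Living Area;Side Area\nA 1;34,5;\nB 2;85;10\n"

# decimal=',' tells the parser to treat the comma as the decimal separator
sample = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')
sample['Side Area'] = sample['Side Area'].fillna(0)

print(sample.dtypes)
print(sample)
```

With this approach `Living Area` arrives as `float64` directly, so the string round trip is not needed.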

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Address           47 non-null     object 
 1   Price             47 non-null     int64  
 2   Living Area       47 non-null     float64
 3   Side Area         47 non-null     float64
 4   Year of Building  47 non-null     int64  
 5   Monthly Fee       47 non-null     int64  
 6   Floor             47 non-null     int64  
 7   Elevator          47 non-null     int64  
 8   Balcony           47 non-null     int64  
 9   Patio             47 non-null     int64  
 10  Fireplace         47 non-null     int64  
 11  Latitude          47 non-null     float64
 12  Longitude         47 non-null     float64
 13  District          47 non-null     object 
dtypes: float64(4), int64(8), object(2)
memory usage: 5.3+ KB


Looks good.

We now create a new feature, `Area`, from `Living Area` and `Side Area`.

In [17]:
# Create the new 'Area' feature
df['Area'] = df['Living Area'] + 0.2 * df['Side Area']

# Drop the original 'Living Area' and 'Side Area' columns
df = df.drop(['Living Area', 'Side Area'], axis=1)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Address           47 non-null     object 
 1   Price             47 non-null     int64  
 2   Year of Building  47 non-null     int64  
 3   Monthly Fee       47 non-null     int64  
 4   Floor             47 non-null     int64  
 5   Elevator          47 non-null     int64  
 6   Balcony           47 non-null     int64  
 7   Patio             47 non-null     int64  
 8   Fireplace         47 non-null     int64  
 9   Latitude          47 non-null     float64
 10  Longitude         47 non-null     float64
 11  District          47 non-null     object 
 12  Area              47 non-null     float64
dtypes: float64(3), int64(8), object(2)
memory usage: 4.9+ KB


### 2.4 Expenses per sqm

In [19]:
# Step 1: Calculate Expenses_per_sqm
df['Expenses_per_sqm'] = df['Monthly Fee'] / df['Area']

# Step 2: Drop the original Monthly Fee column
df.drop(columns=['Monthly Fee'], inplace=True)
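
As a quick sanity check of the two derived features, the values for the first and fifth rows of the unseen data can be recomputed by hand. The inputs are taken from the outputs shown in this notebook (the living area of 34.5 for the fifth row comes from the `Area` column shown further down, and both rows have no side area).

```python
# (Living Area, Side Area, Monthly Fee) for rows 0 and 4 of the unseen data
rows = [(85.0, 0.0, 7129), (34.5, 0.0, 3186)]

results = []
for living, side, fee in rows:
    area = living + 0.2 * side       # side area counts for 20%
    expenses_per_sqm = fee / area    # monthly fee per square metre
    results.append((area, round(expenses_per_sqm, 6)))

print(results)  # [(85.0, 83.870588), (34.5, 92.347826)]
```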

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Address           47 non-null     object 
 1   Price             47 non-null     int64  
 2   Year of Building  47 non-null     int64  
 3   Floor             47 non-null     int64  
 4   Elevator          47 non-null     int64  
 5   Balcony           47 non-null     int64  
 6   Patio             47 non-null     int64  
 7   Fireplace         47 non-null     int64  
 8   Latitude          47 non-null     float64
 9   Longitude         47 non-null     float64
 10  District          47 non-null     object 
 11  Area              47 non-null     float64
 12  Expenses_per_sqm  47 non-null     float64
dtypes: float64(4), int64(7), object(2)
memory usage: 4.9+ KB


### 2.5 Distance to Amenities

We compute the shortest distance, in meters, to several amenities: `Subway Station`, `Grocery Store`, `Gym` and `Bakery`.

In [21]:
# The coordinates of a number of amenities are listed below. Each apartment's coordinates are then compared to each list to determine the shortest distance to that type of amenity.


# Coordinates for amenities
subway_stations = [
    (59.30185067542301, 18.01202406292642), 
    (59.29821711774902, 17.996994798039932), 
    (59.29506118543444, 17.9780168890666), 
    (59.30643713088927, 18.001160022538965), 
    (59.305375257742895, 17.9879852282915)
]

grocery_stores = [
    (59.3016985001167, 18.012611952492634), 
    (59.30113086529778, 18.0031789931638), 
    (59.29659675577328, 18.001528242193288), 
    (59.29361809315374, 18.00445489637013), 
    (59.29699319325665, 17.982844393866806), 
    (59.29886742804572, 17.996848114544783), 
    (59.306187496037204, 18.001237911819175), 
    (59.30535716975179, 17.99327541408561)
]

gyms = [
    (59.30511769646003, 17.989173373641005), 
    (59.29891680440922, 18.003360598064162), 
    (59.297087201555804, 18.003875582138363), 
    (59.29384891582548, 18.00404487764149), 
    (59.294572073002776, 17.986256468635066), 
    (59.29975424959265, 17.992071497321394)
]

bakeries = [
    (59.30135286511991, 18.013197054392563), 
    (59.2986250537181, 18.013282885071593), 
    (59.29858123166955, 18.002875915238732), 
    (59.30027086815059, 17.996533457748022), 
    (59.29880388852719, 17.99183917398691), 
    (59.29958066900443, 18.00411769477393), 
    (59.30614906205406, 18.000901047612736)
]

# Function to calculate the shortest distance to a list of amenities
def shortest_distance(apartment_coords, amenities_coords):
    distances = [geodesic(apartment_coords, amenity).meters for amenity in amenities_coords]
    return min(distances)

# Adding new columns for distances
df['Subway Station'] = df.apply(
    lambda row: int(round(shortest_distance((row['Latitude'], row['Longitude']), subway_stations))), axis=1)
df['Grocery Store'] = df.apply(
    lambda row: int(round(shortest_distance((row['Latitude'], row['Longitude']), grocery_stores))), axis=1)
df['Gym'] = df.apply(
    lambda row: int(round(shortest_distance((row['Latitude'], row['Longitude']), gyms))), axis=1)
df['Bakery'] = df.apply(
    lambda row: int(round(shortest_distance((row['Latitude'], row['Longitude']), bakeries))), axis=1)
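
For readers without `geopy`, the distance step can be approximated with the haversine formula. This is a swapped-in technique: it treats the Earth as a sphere, so it is not the same computation as the `geodesic` call used above, and the function name `haversine_m` is hypothetical. Over a few hundred meters the two should agree to within a meter or so.

```python
import math


def haversine_m(coord_a, coord_b, radius_m=6371008.8):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, coord_a)
    lat2, lon2 = map(math.radians, coord_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


# Subway station coordinates copied from the cell above
subway_stations = [
    (59.30185067542301, 18.01202406292642),
    (59.29821711774902, 17.996994798039932),
    (59.29506118543444, 17.9780168890666),
    (59.30643713088927, 18.001160022538965),
    (59.305375257742895, 17.9879852282915),
]

apartment = (59.301007, 18.007276)  # first row of the unseen data
nearest = min(haversine_m(apartment, station) for station in subway_stations)
print(round(nearest))
```

The result should land close to the `Subway Station` value the notebook reports for the first row.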

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Address           47 non-null     object 
 1   Price             47 non-null     int64  
 2   Year of Building  47 non-null     int64  
 3   Floor             47 non-null     int64  
 4   Elevator          47 non-null     int64  
 5   Balcony           47 non-null     int64  
 6   Patio             47 non-null     int64  
 7   Fireplace         47 non-null     int64  
 8   Latitude          47 non-null     float64
 9   Longitude         47 non-null     float64
 10  District          47 non-null     object 
 11  Area              47 non-null     float64
 12  Expenses_per_sqm  47 non-null     float64
 13  Subway Station    47 non-null     int64  
 14  Grocery Store     47 non-null     int64  
 15  Gym               47 non-null     int64  
 16  Bakery            47 non-null     int64  
dtyp

In [23]:
df.head()

Unnamed: 0,Address,Price,Year of Building,Floor,Elevator,Balcony,Patio,Fireplace,Latitude,Longitude,District,Area,Expenses_per_sqm,Subway Station,Grocery Store,Gym,Bakery
0,Adventsvägen 1,7800000,2017,5,1,1,0,0,59.301007,18.007276,Midsommarkransen,85.0,83.870588,286,234,322,240
1,Bäckvägen 42,3200000,1942,1,1,0,0,0,59.303482,18.000106,Midsommarkransen,59.0,85.474576,335,308,541,301
2,Bäckvägen 50,3675000,1943,3,1,1,0,0,59.303267,17.998685,Midsommarkransen,55.0,71.945455,380,350,543,345
3,Cedergrensvägen 18,2700000,1940,1,0,1,0,0,59.302473,18.001264,LM-Staden,35.0,91.6,442,185,414,361
4,Cedergrensvägen 36,2400000,1939,1,0,0,0,0,59.301316,17.998889,LM-Staden,34.5,92.347826,362,245,369,178


Let's check that no extreme values have appeared so far.

In [24]:
df.describe()

Unnamed: 0,Price,Year of Building,Floor,Elevator,Balcony,Patio,Fireplace,Latitude,Longitude,Area,Expenses_per_sqm,Subway Station,Grocery Store,Gym,Bakery
count,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0
mean,4185824.0,1983.382979,2.893617,0.702128,0.680851,0.06383,0.0,59.299722,18.00093,56.482979,73.915979,353.574468,227.06383,259.914894,183.702128
std,1564210.0,38.367532,1.820583,0.462267,0.471186,0.247092,0.0,0.002385,0.006176,20.135641,14.50276,121.625538,126.853117,195.626066,138.054481
min,1998750.0,1913.0,1.0,0.0,0.0,0.0,0.0,59.290658,17.988149,22.0,50.394366,162.0,69.0,2.0,25.0
25%,2875000.0,1942.5,1.5,0.0,0.0,0.0,0.0,59.298833,17.99821,39.5,64.503951,264.5,152.0,107.5,114.0
50%,3830000.0,2009.0,3.0,1.0,1.0,0.0,0.0,59.299328,18.001364,54.5,71.945455,361.0,220.0,194.0,154.0
75%,5265000.0,2017.5,4.0,1.0,1.0,0.0,0.0,59.30119,18.003747,69.5,84.672582,443.0,261.0,373.0,234.0
max,7800000.0,2023.0,10.0,1.0,1.0,1.0,0.0,59.304039,18.015493,102.0,101.4,859.0,827.0,790.0,911.0


### 2.6 Floor

We will now bin the `Floor` levels and combine them with `District`.

In [25]:
# Step 1: Round up to the nearest integer and ensure values less than 1 are set to 1,
# so that a value such as floor 1.5 is interpreted as 2, and values of -1 or 0 are interpreted as 1.
df['Floor'] = np.ceil(df['Floor']).clip(lower=1)

# Step 2: Bin floor levels into categories
def bin_floors(floor):
    if 1 <= floor <= 3:
        return 'Lo'  # Low-rise
    elif 4 <= floor <= 5:
        return 'Mid'  # Mid-rise
    else:
        return 'Hi'  # High-rise

df['Floor_Category'] = df['Floor'].apply(bin_floors)

# Step 3: Combine District and Floor_Category
df['District_Floor'] = df['District'] + '_' + df['Floor_Category']

# Step 4: Drop 'Floor_Category', 'Floor' and 'District'
df.drop(columns=['Floor_Category', 'Floor', 'District'], inplace=True)

We have now created a categorical feature that combines `District` with the binned `Floor` level.
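
The floor binning can also be written with `pd.cut`, which makes the bin edges explicit. This is a sketch on hypothetical floor values, not the notebook's own code; the bins mirror `bin_floors` (1 to 3 is `Lo`, 4 to 5 is `Mid`, anything higher is `Hi`).

```python
import numpy as np
import pandas as pd

floors = pd.Series([1, 3, 4, 5, 6, 10])  # hypothetical floor levels

# Right-inclusive bins: (0, 3] -> Lo, (3, 5] -> Mid, (5, inf] -> Hi
floor_category = pd.cut(floors, bins=[0, 3, 5, np.inf], labels=['Lo', 'Mid', 'Hi'])
print(floor_category.tolist())
```

`pd.cut` returns a categorical column, so it would need `.astype(str)` before being concatenated with `District`.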

In [26]:
df.head()

Unnamed: 0,Address,Price,Year of Building,Elevator,Balcony,Patio,Fireplace,Latitude,Longitude,Area,Expenses_per_sqm,Subway Station,Grocery Store,Gym,Bakery,District_Floor
0,Adventsvägen 1,7800000,2017,1,1,0,0,59.301007,18.007276,85.0,83.870588,286,234,322,240,Midsommarkransen_Mid
1,Bäckvägen 42,3200000,1942,1,0,0,0,59.303482,18.000106,59.0,85.474576,335,308,541,301,Midsommarkransen_Lo
2,Bäckvägen 50,3675000,1943,1,1,0,0,59.303267,17.998685,55.0,71.945455,380,350,543,345,Midsommarkransen_Lo
3,Cedergrensvägen 18,2700000,1940,0,1,0,0,59.302473,18.001264,35.0,91.6,442,185,414,361,LM-Staden_Lo
4,Cedergrensvägen 36,2400000,1939,0,0,0,0,59.301316,17.998889,34.5,92.347826,362,245,369,178,LM-Staden_Lo


### 2.7 Cleaning

In [27]:
# Drop 'Address', 'Year of Building', 'Latitude' and 'Longitude'
df.drop(columns=['Address', 'Year of Building', 'Latitude','Longitude'], inplace=True)

# Reorder columns
new_column_order = [
    'Price', 'Area', 'Expenses_per_sqm', 
    'Elevator', 'Balcony', 'Patio', 'Fireplace', 'District_Floor',
    'Subway Station', 'Grocery Store', 'Gym', 'Bakery'
]

df = df[new_column_order]

In [28]:
df.head()

Unnamed: 0,Price,Area,Expenses_per_sqm,Elevator,Balcony,Patio,Fireplace,District_Floor,Subway Station,Grocery Store,Gym,Bakery
0,7800000,85.0,83.870588,1,1,0,0,Midsommarkransen_Mid,286,234,322,240
1,3200000,59.0,85.474576,1,0,0,0,Midsommarkransen_Lo,335,308,541,301
2,3675000,55.0,71.945455,1,1,0,0,Midsommarkransen_Lo,380,350,543,345
3,2700000,35.0,91.6,0,1,0,0,LM-Staden_Lo,442,185,414,361
4,2400000,34.5,92.347826,0,0,0,0,LM-Staden_Lo,362,245,369,178


### 2.8 One-Hot Encoding

In [29]:
df_encoded = pd.get_dummies(df, columns=['District_Floor'], dtype=int, drop_first=True)
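
Because `get_dummies` only creates columns for the categories present in the data, a small unseen set may produce a different set of dummy columns than the training data did. One way to guard against this is to reindex against the training columns. The sketch below assumes a list of training column names is available; `training_columns` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical dummy columns that the model was trained on
training_columns = [
    'Price',
    'District_Floor_LM-Staden_Lo',
    'District_Floor_Telefonplan_Lo',
    'District_Floor_Telefonplan_Hi',
]

unseen = pd.DataFrame({
    'Price': [2700000, 3900000],
    'District_Floor': ['LM-Staden_Lo', 'Telefonplan_Lo'],
})

encoded = pd.get_dummies(unseen, columns=['District_Floor'], dtype=int)

# Add missing dummy columns as zeros and enforce the training column order
aligned = encoded.reindex(columns=training_columns, fill_value=0)
print(aligned)
```

Encoding without `drop_first` and then selecting the training columns also avoids depending on which category happens to sort first in the unseen data.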

In [30]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 19 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Price                                47 non-null     int64  
 1   Area                                 47 non-null     float64
 2   Expenses_per_sqm                     47 non-null     float64
 3   Elevator                             47 non-null     int64  
 4   Balcony                              47 non-null     int64  
 5   Patio                                47 non-null     int64  
 6   Fireplace                            47 non-null     int64  
 7   Subway Station                       47 non-null     int64  
 8   Grocery Store                        47 non-null     int64  
 9   Gym                                  47 non-null     int64  
 10  Bakery                               47 non-null     int64  
 11  District_Floor_Hökmossen_Lo       

### 2.9 Transformations

In [31]:
# Lists of columns to be transformed
log_columns = ['Area', 'Grocery Store', 'Bakery']     # Columns to transform with log
sqrt_columns = ['Expenses_per_sqm', 'Subway Station']    # Columns to transform with sqrt

# Automatically determine the original columns (i.e., not transformed)
original_columns = df_encoded.columns.difference(log_columns + sqrt_columns)

# Initialize empty DataFrames for transformed data
df_transformed = pd.DataFrame()

# Apply log transformation
for col in log_columns:
    df_transformed[col + '_log'] = np.log(df_encoded[col])

# Apply square root transformation
for col in sqrt_columns:
    df_transformed[col + '_sqrt'] = np.sqrt(df_encoded[col])

# Keep original columns
df_transformed[original_columns] = df_encoded[original_columns]
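
Since the distances are rounded to whole meters, a value of 0 is possible in principle, and `np.log(0)` is not finite. The unseen data here has no such values in the log-transformed columns, but a guard like the following could catch the case. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical distances, including one zero
distances = pd.DataFrame({'Grocery Store': [69, 220, 0], 'Bakery': [25, 154, 911]})

# Flag columns containing non-positive values, for which np.log is not finite
non_positive = (distances <= 0).any()
flagged = non_positive[non_positive].index.tolist()
print(flagged)

# np.log1p(x) = log(1 + x) stays finite at zero, but it is a different
# transformation from the np.log that the saved model was trained with
log1p_values = np.log1p(distances['Grocery Store'])
print(log1p_values.round(3).tolist())
```

Switching to `log1p` would require refitting the scaler and the models, so for an already trained model the check is the relevant part.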

### 2.10 Arrange Columns

In [32]:
df_transformed.columns

Index(['Area_log', 'Grocery Store_log', 'Bakery_log', 'Expenses_per_sqm_sqrt',
       'Subway Station_sqrt', 'Balcony', 'District_Floor_Hökmossen_Lo',
       'District_Floor_LM-Staden_Lo', 'District_Floor_Midsommarkransen_Hi',
       'District_Floor_Midsommarkransen_Lo',
       'District_Floor_Midsommarkransen_Mid', 'District_Floor_Telefonplan_Hi',
       'District_Floor_Telefonplan_Lo', 'District_Floor_Telefonplan_Mid',
       'Elevator', 'Fireplace', 'Gym', 'Patio', 'Price'],
      dtype='object')

In [33]:
# Rename columns
# Create a dictionary to map old column names to new column names
rename_dict = {
    'District_Floor_Hökmossen_Lo': 'District_Hökmossen_Floor_Lo',
    'District_Floor_LM-Staden_Lo': 'District_LM-Staden_Floor_Lo',
    'District_Floor_Midsommarkransen_Hi': 'District_Midsommarkransen_Floor_Hi',
    'District_Floor_Midsommarkransen_Lo': 'District_Midsommarkransen_Floor_Lo',
    'District_Floor_Midsommarkransen_Mid': 'District_Midsommarkransen_Floor_Mid',
    'District_Floor_Telefonplan_Hi': 'District_Telefonplan_Floor_Hi',
    'District_Floor_Telefonplan_Lo': 'District_Telefonplan_Floor_Lo',
    'District_Floor_Telefonplan_Mid': 'District_Telefonplan_Floor_Mid'
}

# Rename columns in the DataFrame
df_transformed.rename(columns=rename_dict, inplace=True)

# Reorder columns
new_column_order = [
    'Price', 'Area_log', 'Expenses_per_sqm_sqrt', 
    'Subway Station_sqrt', 'Grocery Store_log', 'Gym', 'Bakery_log',
    'Elevator', 'Balcony', 'Patio', 'Fireplace',
    'District_Hökmossen_Floor_Lo', 'District_LM-Staden_Floor_Lo', 
    'District_Midsommarkransen_Floor_Lo', 'District_Midsommarkransen_Floor_Mid', 'District_Midsommarkransen_Floor_Hi',
    'District_Telefonplan_Floor_Lo', 'District_Telefonplan_Floor_Mid', 'District_Telefonplan_Floor_Hi'
]

df_transformed = df_transformed[new_column_order]

# Dropping District_Midsommarkransen_Floor_Lo
df_transformed = df_transformed.drop(columns=['District_Midsommarkransen_Floor_Lo'])



In [34]:
df_transformed.head()

Unnamed: 0,Price,Area_log,Expenses_per_sqm_sqrt,Subway Station_sqrt,Grocery Store_log,Gym,Bakery_log,Elevator,Balcony,Patio,Fireplace,District_Hökmossen_Floor_Lo,District_LM-Staden_Floor_Lo,District_Midsommarkransen_Floor_Mid,District_Midsommarkransen_Floor_Hi,District_Telefonplan_Floor_Lo,District_Telefonplan_Floor_Mid,District_Telefonplan_Floor_Hi
0,7800000,4.442651,9.158089,16.911535,5.455321,322,5.480639,1,1,0,0,0,0,1,0,0,0,0
1,3200000,4.077537,9.245246,18.303005,5.7301,541,5.70711,1,0,0,0,0,0,0,0,0,0,0
2,3675000,4.007333,8.482067,19.493589,5.857933,543,5.843544,1,1,0,0,0,0,0,0,0,0,0
3,2700000,3.555348,9.570789,21.023796,5.220356,414,5.888878,0,1,0,0,0,1,0,0,0,0,0
4,2400000,3.540959,9.609778,19.026298,5.501258,369,5.181784,0,0,0,0,0,1,0,0,0,0,0


## 3. Scaling

In [35]:
# Dividing dataframe into X and y
X = df_transformed.drop(columns=['Price']) # Features
y = df_transformed['Price'] # Target variable

In [36]:
print(X.shape)
print(y.shape)

(47, 17)
(47,)


In [38]:
# Loading saved scaler
feature_scaler = joblib.load('feature_scaler.pkl')
target_scaler = joblib.load('target_scaler.pkl')

# Scaling the features of the unseen data
X_scaled = feature_scaler.transform(X)
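
Note that only `transform` is called on the unseen data: the scaler's statistics come from the training set. The sketch below illustrates this with plain NumPy, assuming the saved scaler standardizes each feature (the notebook does not say which scaler was saved). All numbers are hypothetical.

```python
import numpy as np

# Hypothetical training feature matrix (rows = apartments)
X_train = np.array([[50.0, 300.0], [70.0, 500.0], [90.0, 400.0]])
X_unseen = np.array([[70.0, 400.0]])

# "fit": the statistics come from the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0)

# "transform": unseen data is scaled with the training statistics
X_unseen_scaled = (X_unseen - mean) / std
print(X_unseen_scaled)
```

Refitting a scaler on the 47 unseen rows would shift the features relative to what the models learned, which is why the saved scaler is loaded instead.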

## 4. Prediction 

We create an array of predicted values from the features of the unseen data.
<br> We do this by applying our final model.

In [57]:
# Load the ensemble model
ensemble_model = joblib.load('final_weighted_ensemble_model.pkl')

# Extract individual models and their weights
best_weights = ensemble_model['weights']
sgd_model = ensemble_model['sgd_model']
elastic_net_model = ensemble_model['elastic_net_model']
svr_model = ensemble_model['svr_model']
gbm_model = ensemble_model['gbm_model']

# Get predictions from each model
predictions_sgd = sgd_model.predict(X_scaled)
predictions_elastic_net = elastic_net_model.predict(X_scaled)
predictions_svr = svr_model.predict(X_scaled)
predictions_gbm = gbm_model.predict(X_scaled)

# Combine predictions using weights (you can adjust the way to combine based on your ensemble logic)
y_hat_combined = (best_weights[0] * predictions_sgd +
                  best_weights[1] * predictions_elastic_net +
                  best_weights[2] * predictions_svr +
                  best_weights[3] * predictions_gbm)

# Inverse scale the predictions to get back to the original scale
predictions_original_scale = target_scaler.inverse_transform(y_hat_combined.reshape(-1, 1)).flatten()

# Apply the inverse log transformation to get the original price predictions
predictions_original = np.exp(predictions_original_scale)

# Round the predictions to the nearest thousand
predictions_rounded = np.round(predictions_original / 1000) * 1000
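
The chain above (weighted sum, inverse scaling, inverse log, rounding) can be traced on a toy example. This sketch assumes the target was log-transformed and then standardized before training, which is what the two inverse steps suggest. The weights, predictions and scaler statistics are hypothetical.

```python
import numpy as np

# Hypothetical per-model predictions (scaled log-price) for two apartments
preds = np.array([
    [0.10, -0.50],  # sgd
    [0.20, -0.40],  # elastic net
    [0.00, -0.60],  # svr
    [0.10, -0.50],  # gbm
])
weights = np.array([0.25, 0.25, 0.25, 0.25])  # hypothetical ensemble weights

y_hat_scaled = weights @ preds  # weighted sum across the four models

# Hypothetical target-scaler statistics (mean and std of log-price)
log_mean, log_std = 15.2, 0.4
y_hat_log = y_hat_scaled * log_std + log_mean        # undo the standardization
y_hat_price = np.exp(y_hat_log)                      # undo the log transform
y_hat_rounded = np.round(y_hat_price / 1000) * 1000  # nearest thousand SEK
print(y_hat_rounded)
```

The order matters: the inverse scaling has to come before `np.exp`, mirroring the order in which the target was transformed.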

In [58]:
predictions_rounded.shape

(47,)

In [59]:
predictions_rounded

array([5561000., 3976000., 3833000., 2607000., 2650000., 3212000.,
       3745000., 2673000., 3057000., 2615000., 2648000., 3782000.,
       5329000., 6810000., 5319000., 2346000., 6954000., 4268000.,
       4089000., 4089000., 3112000., 6450000., 5262000., 2667000.,
       3204000., 3854000., 3221000., 3189000., 3745000., 3489000.,
       1799000., 3900000., 4390000., 6693000., 5085000., 2724000.,
       6204000., 5077000., 5279000., 5391000., 2686000., 2675000.,
       4179000., 6350000., 3781000., 2564000., 3026000.])

## 5. Presentation of Predictions

In [84]:
# Load unseen data
df = pd.read_csv(r"C:\Users\gustm\Desktop\Unseen_Data_240930.csv", sep=';', encoding='ISO-8859-1')

# Create DataFrame with numeric 'Prediction' and formatted string version
results_df = pd.DataFrame({
    'Address': df['Address'],  # Assuming Address is in your unseen data
    'Living Area': df['Living Area'],  # Include any other relevant columns
    'Prediction Numeric': predictions_rounded,  # Keep numeric for calculations
    'Prediction': [f"{int(pred):,}".replace(',', ' ') for pred in predictions_rounded],
    'Actual': [f"{int(targ):,}".replace(',', ' ') for targ in y],
})

# Calculate the Difference (SEK)
results_df['Difference (SEK)'] = results_df['Prediction Numeric'] - y.values  # Ensure y is a numeric array

# Calculate the Difference (%) 
results_df['Difference (%)'] = ((results_df['Difference (SEK)'] / y.values) * 100).astype(int)  # Ensure y is a numeric array

# Format the Difference (SEK) with a space as a thousand separator for display
results_df['Difference (SEK)'] = [f"{int(diff):,}".replace(',', ' ') for diff in results_df['Difference (SEK)']]

# Calculate 'Difference Numeric' for metrics calculations without adding it to the DataFrame
difference_numeric = results_df['Difference (SEK)'].str.replace(' ', '').astype(float)  # Convert to numeric for calculations

# Sort by 'Difference (%)' in descending order
results_df_sorted = results_df.sort_values(by='Difference (%)', ascending=False)

# Optionally, drop the 'Prediction Numeric' column if you don't need it anymore
# results_df_sorted.drop(columns=['Prediction Numeric'], inplace=True)

In [85]:
results_df_sorted

Unnamed: 0,Address,Living Area,Prediction Numeric,Prediction,Actual,Difference (SEK),Difference (%)
1,Bäckvägen 42,59,3976000.0,3 976 000,3 200 000,776 000,24
26,Responsgatan 12,46,3221000.0,3 221 000,2 800 000,421 000,15
28,Responsgatan 12,54,3745000.0,3 745 000,3 300 000,445 000,13
32,Snickerigatan 11,65,4390000.0,4 390 000,3 900 000,490 000,12
44,Valborgsmässovägen 20A,51,3781000.0,3 781 000,3 400 000,381 000,11
27,Responsgatan 12,46,3189000.0,3 189 000,2 850 000,339 000,11
4,Cedergrensvägen 36,345,2650000.0,2 650 000,2 400 000,250 000,10
39,Tellusborgsvägen 31,69,5391000.0,5 391 000,4 925 000,466 000,9
23,Pingstvägen 19,35,2667000.0,2 667 000,2 550 000,117 000,4
2,Bäckvägen 50,55,3833000.0,3 833 000,3 675 000,158 000,4


In [93]:
# Calculate Mean Absolute Error
mae = np.mean(np.abs(difference_numeric))

# Calculate Mean Squared Error and Root Mean Squared Error
mse = np.mean(difference_numeric ** 2)
rmse = np.sqrt(mse)

# Calculate R-squared
# Convert 'Actual' back to numeric for calculation
actual_numeric = results_df_sorted['Actual'].str.replace(' ', '').astype(float)
r2 = r2_score(actual_numeric, results_df_sorted['Prediction Numeric'])

# Display the results
print(f'Mean Absolute Error: {round(mae)}')
print(f'Root Mean Squared Error: {round(rmse)}')
print(f'R-squared: {round(r2, 2)}')


Mean Absolute Error: 331250
Root Mean Squared Error: 505194
R-squared: 0.89
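
The three metrics can also be computed directly from numeric arrays, without going through the formatted strings. The sketch below uses four rows from the table above (Bäckvägen 42, the first Responsgatan 12 row, Snickerigatan 11 and Tellusborgsvägen 31). These are among the largest over-predictions, so the resulting numbers are worse than the full-set metrics and only illustrate the formulas.

```python
import numpy as np

# Actual and predicted prices (SEK) for four rows of the results table
actual = np.array([3200000.0, 2800000.0, 3900000.0, 4925000.0])
predicted = np.array([3976000.0, 3221000.0, 4390000.0, 5391000.0])

errors = predicted - actual

mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error

# R-squared: 1 - residual sum of squares / total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(mae), round(rmse), round(r2, 2))
```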


Let's calculate the mean final price of the unseen data to put the errors in context.

In [6]:
int(df['Price'].mean())

4185824

**Comment**
- The average difference between the predicted and the actual final sales price is around 330 000 SEK per apartment, compared to a mean final price of around 4 200 000 SEK.
- The Root Mean Squared Error penalizes large errors more heavily. It gives a higher error value of around 505 000 SEK, which is the figure to use if we want to err on the safe side.
- The model has strong predictive power (R-squared of 0.89) and explains most of the variability in prices. The 11% not captured by the model could indicate that factors influencing prices are missing from the feature set, e.g. the condition of the apartment (hard to measure).
- The average errors are not that high, but we would like to reduce them by improving the model and maybe the handling of the dataset.
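
To relate the errors to the price level, they can be expressed as a share of the mean final price. The numbers are taken from the outputs above.

```python
mae, rmse, mean_price = 331250, 505194, 4185824

mae_share = mae / mean_price    # MAE relative to the mean price
rmse_share = rmse / mean_price  # RMSE relative to the mean price

print(f"{mae_share:.1%}, {rmse_share:.1%}")  # 7.9%, 12.1%
```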

However, the spread of the individual errors (the difference between actual and predicted price) is problematic, with some predictions off by more than 20%.
<br> It would be nice to see better reliability (less variance).

## 6. Future Improvements and Research

Running the project has been very insightful, both in terms of ML coding and techniques and in terms of understanding the underlying features that determine the final price of apartments in Midsommarkransen, Stockholm, Sweden in 2024.

Identified areas of improvements include:

*Errors*
- The occasional high errors of more than 20% are problematic. We would like to decrease the variance of the errors.
<br> Maybe this can be done by enlarging the dataset, or maybe there is an interesting correlation between extreme errors and a certain feature.
<br> This would be interesting to investigate further.

*Outlier analysis*
- It would be interesting to analyze the outliers in the dataset systematically in order to understand what causes them.

*Residual Analysis and Heteroscedasticity*
- Running a simple model, or one assumed to work well, to examine the dataset more closely using residual analysis.
<br> The aim is to determine whether *heteroscedasticity* exists and whether it stems from the model or the dataset.
<br> We can then address it to improve the scores.

*Transformation*
- A systematic way of examining different transformations of the features. The ones we used were mild (log and sqrt) and had little effect on skew and kurtosis, but still resulted in good scores for the model.
<br> A systematic approach would be more reliable and would let us measure the impact of each transformation on the model.

*Dataset*
- With a larger dataset we may get more reliable results, since the models would have more data to learn from.
<br> Our dataset is limited because we focus on a specific area of Stockholm (to decrease variability between different areas) and on a limited timespan (the first nine months of 2024) to eliminate fluctuations due to changes in the overall economy.

*Analyzing features*
- Systematically analyzing the contribution of individual features.
<br> This could lead us to drop certain features, which would make an already simple model (few features) even simpler.
<br> At the same time, I believe that this is already taken care of by the `Elastic Net` model (practically `Lasso`) in the ensemble model, which is our final model.

*Presentation*
- We could increase the presentation value by setting up a separate presentation.

*Production*
- We could move into production by creating an application available on the internet for people to use, with a continuous batch update of input data that updates the model.