# Feature Engineering

In [1]:
# General imports
import pandas as pd
from geopy.distance import geodesic

## 1. Introduction

In this notebook we engineer new features and select the final set of features to use in this project.
<br> We first load the cleaned training dataset and then perform exactly the same operations on the test set.

### 1.1 Loading Datasets

We load both datasets to feature engineer them in the same way.

In [2]:
# Paths to the raw CSV files from GitHub
train_path = 'https://raw.githubusercontent.com/gustmic/Project-LinReg-Apartments/main/results/cleaned_training_set_apartments.csv'
test_path = 'https://raw.githubusercontent.com/gustmic/Project-LinReg-Apartments/main/results/cleaned_test_set_apartments.csv'

# Load the CSV files into DataFrames
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

In [3]:
df_train.head()

Unnamed: 0,Address,District,Price,Living Area,Side Area,Rooms,Monthly Fee,Floor,Year of Building,Elevator,Balcony,Patio,Fireplace,Date of Sale,Days for Sale,Latitude,Longitude
0,Tellusgången 34,Telefonplan,4770000,73.0,0.0,3.0,5026,5.0,2015.0,1,1,0,0,2024-05-22,13,59.298699,17.989062
1,Tångvägen 31,Hökmossen,2600000,57.0,0.0,2.0,4297,,1944.0,0,0,0,0,2024-02-24,15,59.29211,17.994688
2,Cedergrensvägen 43,LM-Staden,3745000,51.0,0.0,3.0,4285,1.0,1940.0,0,0,0,0,2024-04-30,9,59.300796,17.998543
3,Responsgatan 12,Telefonplan,2760000,46.0,0.0,2.0,3135,3.0,2019.0,1,1,0,0,2024-03-18,16,59.296459,18.002712
4,Mikrofonvägen 21,Telefonplan,4400000,61.0,0.0,2.0,3676,4.0,2010.0,1,1,0,0,2024-08-18,10,59.296911,17.999793


### 1.2 Highlights of Tasks in this Notebook

- Selecting features
- Engineering features
- Discovering interesting points for further examination

## 2. Feature Selection

Looking at the features, the Data Exploration and Visualization step showed that **area** is likely to be the most important feature for determining price.

### 2.1 Area
Since an apartment sometimes has an additional `Side Area`, I am considering a new feature called `Area` that represents `Living Area` and `Side Area` in some combination.
<br> This could be `Area` = `Living Area` + `Side Area`, or `Area` = `Living Area` + 0.5 * `Side Area`, or similar (`Side Area` is not as important as `Living Area`, since these areas have a slanted ceiling).
<br> The problem with keeping `Side Area` as a separate feature is that most apartments have a `Side Area` equal to 0, so the few non-zero values act as outliers that may distort the model.
<br> So, I will weight `Side Area` by 0.2 and create a new feature, `Area`, measured in sqm.

### 2.2 Monthly Fee -> Expenses per sqm
We definitely want to keep `Monthly Fee`, as it is an important feature that people consider when buying an apartment: with a lower `Monthly Fee`, you can afford a higher `Price`.
<br> However, `Monthly Fee` has a strong correlation with `Living Area`, which is problematic.
<br> This is to be expected, though, since the `Monthly Fee` is part of the budget that the housing cooperative has set for its yearly amortizations, interest payments, maintenance, and services.
<br> These expenses are shared by all the apartments/members and distributed according to each apartment's size, `Living Area`, relative to the total area of all member apartments belonging to the housing cooperative.
<br> This means that the `Monthly Fee` for an apartment equals `Living Area` * budgeted expense per m<sup>2</sup>, that is, the apartment's `Living Area` * (total budgeted expense / total `Living Area` in the housing cooperative).
<br> Thus, if we divide each `Monthly Fee` by its `Living Area`, we should get that particular housing cooperative's budgeted expense per m<sup>2</sup> for the year.
<br> It is a measurement of the need for expenses for the year, from the housing cooperative's perspective.
<br> Commonly, housing cooperatives with older houses have a smaller `Monthly Fee`, since they have already paid off loans over the years.
<br> Thus, we will create a new feature called `Expenses_per_sqm`, representing the need to finance these expenses.
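
The fee allocation described above can be illustrated with a small worked example. The cooperative's budget and the apartment sizes below are invented numbers for illustration only, not values from the dataset; the point is simply that dividing each fee by its area recovers the same per-sqm figure for every apartment in one cooperative.

```python
# Hypothetical cooperative: all numbers are invented for illustration.
monthly_budget = 90_000.0                 # total monthly expenses to cover (SEK)
living_areas = [40.0, 60.0, 80.0, 120.0]  # living area of each member apartment (sqm)

total_area = sum(living_areas)            # 300 sqm in total
budget_per_sqm = monthly_budget / total_area

# Each apartment pays in proportion to its share of the total area.
monthly_fees = [area * budget_per_sqm for area in living_areas]

# Dividing the fee by the area gives back the cooperative-level figure.
expenses_per_sqm = [fee / area for fee, area in zip(monthly_fees, living_areas)]

print(budget_per_sqm)    # 300.0
print(monthly_fees)      # [12000.0, 18000.0, 24000.0, 36000.0]
print(expenses_per_sqm)  # [300.0, 300.0, 300.0, 300.0]
```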

### 2.3 Year of Building
This feature will be converted from `Year of Building` to `Age of Building`, i.e. how old the building is, which changes it from being almost like a categorical feature to a discrete numerical value.

### 2.4 Date of Sale
I am not really looking to use this one. 
<br> In general, I want the model to be able to predict apartment prices in 2024, and I am not really looking to do any time-series analysis. 
<br> So, I will simply get rid of this feature.
<br> The sales have all taken place during 2024 anyway.

### 2.5 Days for Sale
Hmm...will the price be influenced by how long the object has been on the market? 
<br> One would assume that `Days for Sale` reflects whether the initial price was set too high (long `Days for Sale`) or about right (short `Days for Sale`).
<br> But we haven't included any information on the initial price, and we are not looking to predict the final price from an initial price or from `Days for Sale`.
<br> Thus, I think we can remove this feature as well.

### 2.6 Latitude and Longitude
These are not important in themselves and will not be included directly in the training algorithm.
<br> But I want to use them to create new features (see 2.7 below).

### 2.7 Distance to Subway Station / Grocery Store / Gym / Café / Bakery
I will engineer new features such as the closest distance to `Subway Station`, `Grocery Store`, `Gym`, and `Bakery`. 
<br> Midsommarkransen is an area that has experienced a wave of gentrification during the past 20-30 years, and people living there are quite well-off and expect amenities such as cafés, gyms, and restaurants.
<br> People in Midsommarkransen can be characterized as young people with good jobs in creative fields, media, tech, and/or consultancies.
<br> Reliance on public transport (mainly the subway) is high, even though some own cars or use car pools.
<br> Closeness to a well-stocked `Grocery Store` will likely be viewed as important, as is a `Gym`.
<br> However, people may choose to go to their `Gym` downtown, on their way to work or after work (or during weekends).
<br> So having a `Gym` nearby is not expected to be that important. And not just any `Gym` will do, due to lock-in effects.
<br> People who have a subscription with SATS (a dominant `Gym` chain in Stockholm) won't necessarily think that it's great to live near a competing `Gym`. But let's see if it has any effect.
<br> Furthermore, people living in Midsommarkransen are frequent visitors to cafés in Midsommarkransen. 
<br> But I believe that people will choose their favourite café depending on what they are looking for (ambiance, good coffee, etc.), and thus, they won't mind walking some. 
<br> Also, there are quite a number of cafés, so I won't bother to include this feature.
<br> Being close to a `Bakery` is probably important to some, but I think this is a 'nice-to-have'. I'll include it to see if there is any effect to the price.

### 2.8 Rooms

The feature `Rooms` is a numerical variable, which is good as it will work well in a model.
<br> But as we saw in the Data Exploration and Visualization step, we have multicollinearity where `Rooms` plays a role.
<br> We will keep `Area`, as `Price` depends directly on it.
<br> And since `Rooms` is already correlated with `Area`, I'm not sure I can decouple it from `Area` in a meaningful way.
<br> If we were to divide `Rooms` by `Area`, similar to the creation of `Expenses_per_sqm`, we would get a measurement of how many `Rooms` fit per sqm of `Living Area`.
<br> This measurement could show if there is an effect on `Price` when an apartment is planned to maximize the number of `Rooms` relative to its `Living Area`.
<br> Some people may want more, smaller rooms (families), while others will likely prefer fewer, larger `Rooms`.
<br> Thus, I do not foresee this new feature having a strong correlation with `Price`, but let's try it out.
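
A minimal sketch of the ratio described above. The column name `Rooms_per_sqm` is an assumption for illustration, as is computing the ratio against the combined `Area`; the three sample rows reuse the `Rooms` and area values of the first three apartments shown in the `df_train.head()` output.

```python
import pandas as pd

# Rooms and area of the first three apartments from the head() output.
sample = pd.DataFrame({
    'Rooms': [3.0, 2.0, 3.0],
    'Area': [73.0, 57.0, 51.0],
})

# Hypothetical feature name: number of rooms per square metre of area.
sample['Rooms_per_sqm'] = sample['Rooms'] / sample['Area']

print(sample.round(4))
```

A higher value means the floor plan squeezes more rooms into the same area, which is the effect this feature would try to capture.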

### 2.9 Floor

The feature `Floor` is a numerical variable, but can also be handled as a categorical value.
<br> The question is whether certain levels are more popular and thus have a stronger correlation with `Price`.
<br> This is handled below under "3.6 Creation of New Categorical Features".

## 3. Feature Engineering

### 3.1 Area

In [4]:
# Create the new 'Area' feature
df_train['Area'] = df_train['Living Area'] + 0.2 * df_train['Side Area']
df_test['Area'] = df_test['Living Area'] + 0.2 * df_test['Side Area']

# Drop the original 'Living Area' and 'Side Area' columns
df_train = df_train.drop(['Living Area', 'Side Area'], axis=1)
df_test = df_test.drop(['Living Area', 'Side Area'], axis=1)

In [5]:
df_train.head()

Unnamed: 0,Address,District,Price,Rooms,Monthly Fee,Floor,Year of Building,Elevator,Balcony,Patio,Fireplace,Date of Sale,Days for Sale,Latitude,Longitude,Area
0,Tellusgången 34,Telefonplan,4770000,3.0,5026,5.0,2015.0,1,1,0,0,2024-05-22,13,59.298699,17.989062,73.0
1,Tångvägen 31,Hökmossen,2600000,2.0,4297,,1944.0,0,0,0,0,2024-02-24,15,59.29211,17.994688,57.0
2,Cedergrensvägen 43,LM-Staden,3745000,3.0,4285,1.0,1940.0,0,0,0,0,2024-04-30,9,59.300796,17.998543,51.0
3,Responsgatan 12,Telefonplan,2760000,2.0,3135,3.0,2019.0,1,1,0,0,2024-03-18,16,59.296459,18.002712,46.0
4,Mikrofonvägen 21,Telefonplan,4400000,2.0,3676,4.0,2010.0,1,1,0,0,2024-08-18,10,59.296911,17.999793,61.0


### 3.2 Expenses_per_sqm

In [6]:
# Step 1: Calculate Expenses_per_sqm (divides by the combined 'Area', since 'Living Area' was dropped in 3.1)
df_train['Expenses_per_sqm'] = df_train['Monthly Fee'] / df_train['Area']
df_test['Expenses_per_sqm'] = df_test['Monthly Fee'] / df_test['Area']

# Step 2: Calculate the mean Expenses_per_sqm for each Address
mean_expenses_per_sqm_train = df_train.groupby('Address')['Expenses_per_sqm'].transform('mean')
mean_expenses_per_sqm_test = df_test.groupby('Address')['Expenses_per_sqm'].transform('mean')

# Step 3: Assign the mean value to all rows with the same Address
df_train['Expenses_per_sqm'] = mean_expenses_per_sqm_train
df_test['Expenses_per_sqm'] = mean_expenses_per_sqm_test

# Step 4: Drop the original Monthly Fee column
df_train.drop(columns=['Monthly Fee'], inplace=True)
df_test.drop(columns=['Monthly Fee'], inplace=True)

In [7]:
df_train.head()

Unnamed: 0,Address,District,Price,Rooms,Floor,Year of Building,Elevator,Balcony,Patio,Fireplace,Date of Sale,Days for Sale,Latitude,Longitude,Area,Expenses_per_sqm
0,Tellusgången 34,Telefonplan,4770000,3.0,5.0,2015.0,1,1,0,0,2024-05-22,13,59.298699,17.989062,73.0,69.905008
1,Tångvägen 31,Hökmossen,2600000,2.0,,1944.0,0,0,0,0,2024-02-24,15,59.29211,17.994688,57.0,75.385965
2,Cedergrensvägen 43,LM-Staden,3745000,3.0,1.0,1940.0,0,0,0,0,2024-04-30,9,59.300796,17.998543,51.0,84.019608
3,Responsgatan 12,Telefonplan,2760000,2.0,3.0,2019.0,1,1,0,0,2024-03-18,16,59.296459,18.002712,46.0,68.340343
4,Mikrofonvägen 21,Telefonplan,4400000,2.0,4.0,2010.0,1,1,0,0,2024-08-18,10,59.296911,17.999793,61.0,60.262295


### 3.3 Age of Building

Conversion from `Year of Building` to `Age of Building`.

In [8]:
# Rename the column
df_train.rename(columns={'Year of Building': 'Age of Building'}, inplace=True)
df_test.rename(columns={'Year of Building': 'Age of Building'}, inplace=True)

# Update the values
df_train['Age of Building'] = 2024 - df_train['Age of Building']
df_test['Age of Building'] = 2024 - df_test['Age of Building']

We know that we have some NaN-values in `Age of Building`. Let's examine how many. 

In [9]:
df_train.head()

Unnamed: 0,Address,District,Price,Rooms,Floor,Age of Building,Elevator,Balcony,Patio,Fireplace,Date of Sale,Days for Sale,Latitude,Longitude,Area,Expenses_per_sqm
0,Tellusgången 34,Telefonplan,4770000,3.0,5.0,9.0,1,1,0,0,2024-05-22,13,59.298699,17.989062,73.0,69.905008
1,Tångvägen 31,Hökmossen,2600000,2.0,,80.0,0,0,0,0,2024-02-24,15,59.29211,17.994688,57.0,75.385965
2,Cedergrensvägen 43,LM-Staden,3745000,3.0,1.0,84.0,0,0,0,0,2024-04-30,9,59.300796,17.998543,51.0,84.019608
3,Responsgatan 12,Telefonplan,2760000,2.0,3.0,5.0,1,1,0,0,2024-03-18,16,59.296459,18.002712,46.0,68.340343
4,Mikrofonvägen 21,Telefonplan,4400000,2.0,4.0,14.0,1,1,0,0,2024-08-18,10,59.296911,17.999793,61.0,60.262295


In [10]:
# Count the number of NaN values in the 'Age of Building' column
nan_count = df_train['Age of Building'].isna().sum()

# Print the result
print(f"Number of NaN values in 'Age of Building': {nan_count}")

Number of NaN values in 'Age of Building': 20


**Comment**
<br> We will keep the `Age of Building` as it is.

### 3.4 Removal of 'Date of Sale' and 'Days for Sale'

In [11]:
df_train.drop(columns=['Date of Sale', 'Days for Sale'], inplace=True)
df_test.drop(columns=['Date of Sale', 'Days for Sale'], inplace=True)

In [12]:
df_train.head()

Unnamed: 0,Address,District,Price,Rooms,Floor,Age of Building,Elevator,Balcony,Patio,Fireplace,Latitude,Longitude,Area,Expenses_per_sqm
0,Tellusgången 34,Telefonplan,4770000,3.0,5.0,9.0,1,1,0,0,59.298699,17.989062,73.0,69.905008
1,Tångvägen 31,Hökmossen,2600000,2.0,,80.0,0,0,0,0,59.29211,17.994688,57.0,75.385965
2,Cedergrensvägen 43,LM-Staden,3745000,3.0,1.0,84.0,0,0,0,0,59.300796,17.998543,51.0,84.019608
3,Responsgatan 12,Telefonplan,2760000,2.0,3.0,5.0,1,1,0,0,59.296459,18.002712,46.0,68.340343
4,Mikrofonvägen 21,Telefonplan,4400000,2.0,4.0,14.0,1,1,0,0,59.296911,17.999793,61.0,60.262295


### 3.5 Distance to closest 'Subway Station', 'Grocery Store', 'Gym' and 'Bakery'

In [13]:
# The coordinates of a number of amenities are listed below. Each apartment's coordinates are then compared
# to each list of amenities to determine the shortest distance to a certain amenity.

# Coordinates for amenities
subway_stations = [
    (59.30185067542301, 18.01202406292642), 
    (59.29821711774902, 17.996994798039932), 
    (59.29506118543444, 17.9780168890666), 
    (59.30643713088927, 18.001160022538965), 
    (59.305375257742895, 17.9879852282915)
]

grocery_stores = [
    (59.3016985001167, 18.012611952492634), 
    (59.30113086529778, 18.0031789931638), 
    (59.29659675577328, 18.001528242193288), 
    (59.29361809315374, 18.00445489637013), 
    (59.29699319325665, 17.982844393866806), 
    (59.29886742804572, 17.996848114544783), 
    (59.306187496037204, 18.001237911819175), 
    (59.30535716975179, 17.99327541408561)
]

gyms = [
    (59.30511769646003, 17.989173373641005), 
    (59.29891680440922, 18.003360598064162), 
    (59.297087201555804, 18.003875582138363), 
    (59.29384891582548, 18.00404487764149), 
    (59.294572073002776, 17.986256468635066), 
    (59.29975424959265, 17.992071497321394)
]

bakeries = [
    (59.30135286511991, 18.013197054392563), 
    (59.2986250537181, 18.013282885071593), 
    (59.29858123166955, 18.002875915238732), 
    (59.30027086815059, 17.996533457748022), 
    (59.29880388852719, 17.99183917398691), 
    (59.29958066900443, 18.00411769477393), 
    (59.30614906205406, 18.000901047612736)
]

# Function to calculate the shortest distance to a list of amenities
def shortest_distance(apartment_coords, amenities_coords):
    distances = [geodesic(apartment_coords, amenity).meters for amenity in amenities_coords]
    return min(distances)

# Helper function to compute and add distance columns
def add_distance_columns(df, locations, columns):
    for column, location in zip(columns, locations):
        df[column] = df.apply(lambda row: int(round(shortest_distance((row['Latitude'], row['Longitude']), location))), axis=1)

# List of locations and corresponding columns
locations = [subway_stations, grocery_stores, gyms, bakeries]
columns = ['Subway Station', 'Grocery Store', 'Gym', 'Bakery']

# Adding distance columns for train and test data
add_distance_columns(df_train, locations, columns)
add_distance_columns(df_test, locations, columns)
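
As a cross-check on the approach above, the nearest-amenity distance can be sketched without `geopy` using the haversine formula. This is a swapped-in approximation, not what the notebook uses: `geodesic` works on an ellipsoid, whereas haversine assumes a spherical Earth (the radius of 6,371,000 m is an assumption), so the results can differ by a fraction of a percent. The function names `haversine_m` and `nearest_m` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def nearest_m(apartment, amenities):
    """Shortest distance in metres from one apartment to any amenity in the list."""
    return min(haversine_m(apartment, amenity) for amenity in amenities)

# First apartment in the training set and two of the subway stations listed above.
apartment = (59.298699, 17.989062)
stations = [
    (59.29821711774902, 17.996994798039932),
    (59.305375257742895, 17.9879852282915),
]

print(round(nearest_m(apartment, stations)))
```

The printed value should land within a few metres of the `Subway Station` distance that `geodesic` produces for the same apartment.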


In [14]:
df_train.head()

Unnamed: 0,Address,District,Price,Rooms,Floor,Age of Building,Elevator,Balcony,Patio,Fireplace,Latitude,Longitude,Area,Expenses_per_sqm,Subway Station,Grocery Store,Gym,Bakery
0,Tellusgången 34,Telefonplan,4770000,3.0,5.0,9.0,1,1,0,0,59.298699,17.989062,73.0,69.905008,455,402,208,159
1,Tångvägen 31,Hökmossen,2600000,2.0,,80.0,0,0,0,0,59.29211,17.994688,57.0,75.385965,693,581,553,763
2,Cedergrensvägen 43,LM-Staden,3745000,3.0,1.0,84.0,0,0,0,0,59.300796,17.998543,51.0,84.019608,301,236,345,129
3,Responsgatan 12,Telefonplan,2760000,2.0,3.0,5.0,1,1,0,0,59.296459,18.002712,46.0,68.340343,380,69,96,237
4,Mikrofonvägen 21,Telefonplan,4400000,2.0,4.0,14.0,1,1,0,0,59.296911,17.999793,61.0,60.262295,216,105,233,256


### 3.6 Creation of New Categorical Features

In [15]:
print(df_train['Floor'].value_counts())
print(df_test['Floor'].value_counts())

Floor
 3.0     99
 2.0     96
 1.0     94
 4.0     36
 5.0     13
 6.0     10
 1.5      1
 10.0     1
 8.0      1
 0.5      1
-1.0      1
Name: count, dtype: int64
Floor
3.0     26
1.0     22
2.0     19
5.0      9
4.0      7
1.5      1
6.0      1
0.5      1
14.0     1
Name: count, dtype: int64


**Comment**
<br> We could see in the Data Exploration and Visualization step that there was a strong interaction between `District` and `Floor` due to mid-rise buildings mainly existing in the `District` 'Telefonplan'.
<br> Since we have too few observations of the values 8 and 10, I will *bin* `Floor` into the categories 'Lo' (levels 1-3), 'Mid' (4-5) and 'Hi' (6+).
<br> Also, I will clean up odd values of `Floor`: replace -1 with 1, 0.5 with 1, and 1.5 with 2.
<br> Also, due to the strong interaction between `District` and `Floor` we will try and capture this possible non-linearity by combining these features into new, categorized features.
<br> Before doing this, we need to see if we have any missing values of `Floor` and handle them.

In [16]:
# Count the number of NaN values in the 'Floor' column
nan_count = df_train['Floor'].isna().sum()

# Print the result
print(f"Number of NaN values in 'Floor': {nan_count}")

Number of NaN values in 'Floor': 27


**Comment**
<br> We can't remove this many rows of data due to our limited dataset.
<br> Instead we will fill the missing values using *mode*, i.e. the most common value, for every `District`.  

In [17]:
# Step 1: Calculate the mode of 'Floor' for each 'District' using the training set to avoid data leakage in the test set
mode_per_district = df_train.groupby('District')['Floor'].agg(lambda x: x.mode()[0])

# Step 2: Define a function to replace NaN with the mode of the district
def fill_missing_with_mode(row, mode_dict):
    if pd.isna(row['Floor']):
        return mode_dict.get(row['District'], row['Floor'])  # Fallback to NaN if district is missing in mode_dict
    else:
        return row['Floor']

# Step 3: Apply the function to fill missing values in the training set
df_train['Floor'] = df_train.apply(fill_missing_with_mode, axis=1, mode_dict=mode_per_district)

# Step 4: Apply the same function to the test set using the mode from the training set
df_test['Floor'] = df_test.apply(fill_missing_with_mode, axis=1, mode_dict=mode_per_district)
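
The row-wise `apply` above can also be written as a vectorized `map` plus `fillna`. The sketch below uses a small invented frame (the rows are illustrative, not from the dataset) and the hypothetical names `toy_train`, `toy_test`, and `toy_modes`; as in the cell above, the modes are learned from the training rows only.

```python
import pandas as pd

# Invented rows for illustration only.
toy_train = pd.DataFrame({
    'District': ['Telefonplan', 'Telefonplan', 'Telefonplan', 'Hökmossen', 'Hökmossen'],
    'Floor': [4.0, 4.0, None, 1.0, None],
})
toy_test = pd.DataFrame({
    'District': ['Hökmossen', 'Telefonplan'],
    'Floor': [None, 2.0],
})

# Most common floor per district, learned from the training rows only.
toy_modes = toy_train.groupby('District')['Floor'].agg(lambda s: s.mode().iloc[0])

# Map each row's district to its mode and use it only where Floor is missing.
toy_train['Floor'] = toy_train['Floor'].fillna(toy_train['District'].map(toy_modes))
toy_test['Floor'] = toy_test['Floor'].fillna(toy_test['District'].map(toy_modes))

print(toy_train)
print(toy_test)
```

A district that appears only in the test set would map to NaN and stay missing, which matches the fallback behaviour of `fill_missing_with_mode`.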


In [18]:
# Count the number of NaN values in the 'Floor' column
nan_count_train = df_train['Floor'].isna().sum()
nan_count_test = df_test['Floor'].isna().sum()

# Print the result
print(f"Number of NaN values in 'Floor' in the training set: {nan_count_train}")
print(f"Number of NaN values in 'Floor' in the test set: {nan_count_test}")

Number of NaN values in 'Floor' in the training set: 0
Number of NaN values in 'Floor' in the test set: 0


In [19]:
# Step 1: Change specific floor values
df_train['Floor'] = df_train['Floor'].replace({-1: 1, 1.5: 2, 0.5: 1})
df_test['Floor'] = df_test['Floor'].replace({-1: 1, 1.5: 2, 0.5: 1})

# Step 2: Bin floor levels into categories
def bin_floors(floor):
    if 1 <= floor <= 3:
        return 'Lo'  # Low-rise
    elif 4 <= floor <= 5:
        return 'Mid'  # Mid-rise
    else:
        return 'Hi'  # High-rise

df_train['Floor_Category'] = df_train['Floor'].apply(bin_floors)
df_test['Floor_Category'] = df_test['Floor'].apply(bin_floors)

# Step 3: Combine District and Floor_Category
df_train['District_Floor'] = df_train['District'] + '-' + df_train['Floor_Category']
df_test['District_Floor'] = df_test['District'] + '-' + df_test['Floor_Category']
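
The same binning can be sketched with `pd.cut`, which avoids the hand-written `if`/`elif` chain. The bin edges below are an assumption chosen to reproduce the 'Lo' (1-3), 'Mid' (4-5), 'Hi' (6+) split for whole-number floors; unlike `bin_floors`, values at or below 0 would become NaN rather than 'Hi'.

```python
import pandas as pd

# Floor values as they appear after the replacements above.
floors = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 14.0])

# Right-closed intervals: (0, 3] -> Lo, (3, 5] -> Mid, (5, inf] -> Hi.
floor_category = pd.cut(
    floors,
    bins=[0, 3, 5, float('inf')],
    labels=['Lo', 'Mid', 'Hi'],
)

print(floor_category.tolist())
# ['Lo', 'Lo', 'Lo', 'Mid', 'Mid', 'Hi', 'Hi', 'Hi', 'Hi']
```

The result is a categorical column, so it would likely need `.astype(str)` before being concatenated with `District`.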

In [20]:
df_train[['District', 'Floor', 'Floor_Category', 'District_Floor']].head()

Unnamed: 0,District,Floor,Floor_Category,District_Floor
0,Telefonplan,5.0,Mid,Telefonplan-Mid
1,Hökmossen,3.0,Lo,Hökmossen-Lo
2,LM-Staden,1.0,Lo,LM-Staden-Lo
3,Telefonplan,3.0,Lo,Telefonplan-Lo
4,Telefonplan,4.0,Mid,Telefonplan-Mid


In [21]:
df_train['District'].value_counts()

District
Midsommarkransen          157
Telefonplan               111
LM-Staden                  58
Hökmossen                  36
Gamla Midsommarkransen     18
Name: count, dtype: int64

Let's examine if we have any NaN values in our categories.

In [22]:
# Function to print value counts and NaN counts for specified columns
def summarize_column(df, column_name):
    print(f"Value counts for {column_name}:")
    print(df[column_name].value_counts(dropna=False))
    print(f"Number of NaN values in {column_name}: {df[column_name].isna().sum()}")
    print()

# Summarize the columns
for column in ['Floor', 'Floor_Category', 'District_Floor']:
    summarize_column(df_train, column)

Value counts for Floor:
Floor
1.0     111
3.0     110
2.0      98
4.0      36
5.0      13
6.0      10
10.0      1
8.0       1
Name: count, dtype: int64
Number of NaN values in Floor: 0

Value counts for Floor_Category:
Floor_Category
Lo     319
Mid     49
Hi      12
Name: count, dtype: int64
Number of NaN values in Floor_Category: 0

Value counts for District_Floor:
District_Floor
Midsommarkransen-Lo          142
Telefonplan-Lo                65
LM-Staden-Lo                  58
Hökmossen-Lo                  36
Telefonplan-Mid               35
Gamla Midsommarkransen-Lo     18
Midsommarkransen-Mid          14
Telefonplan-Hi                11
Midsommarkransen-Hi            1
Name: count, dtype: int64
Number of NaN values in District_Floor: 0



**Comment**
<br> Let's check the test set as well.

In [23]:
# Function to print value counts and NaN counts for specified columns
def summarize_column(df, column_name):
    print(f"Value counts for {column_name}:")
    print(df[column_name].value_counts(dropna=False))
    print(f"Number of NaN values in {column_name}: {df[column_name].isna().sum()}")
    print()

# Summarize the columns
for column in ['Floor', 'Floor_Category', 'District_Floor']:
    summarize_column(df_test, column)

Value counts for Floor:
Floor
3.0     29
1.0     27
2.0     21
5.0      9
4.0      7
6.0      1
14.0     1
Name: count, dtype: int64
Number of NaN values in Floor: 0

Value counts for Floor_Category:
Floor_Category
Lo     77
Mid    16
Hi      2
Name: count, dtype: int64
Number of NaN values in Floor_Category: 0

Value counts for District_Floor:
District_Floor
Midsommarkransen-Lo          34
LM-Staden-Lo                 15
Telefonplan-Lo               14
Telefonplan-Mid              11
Hökmossen-Lo                  9
Gamla Midsommarkransen-Lo     5
Midsommarkransen-Mid          5
Telefonplan-Hi                2
Name: count, dtype: int64
Number of NaN values in District_Floor: 0



Looks good.

In [24]:
df_train.head()

Unnamed: 0,Address,District,Price,Rooms,Floor,Age of Building,Elevator,Balcony,Patio,Fireplace,Latitude,Longitude,Area,Expenses_per_sqm,Subway Station,Grocery Store,Gym,Bakery,Floor_Category,District_Floor
0,Tellusgången 34,Telefonplan,4770000,3.0,5.0,9.0,1,1,0,0,59.298699,17.989062,73.0,69.905008,455,402,208,159,Mid,Telefonplan-Mid
1,Tångvägen 31,Hökmossen,2600000,2.0,3.0,80.0,0,0,0,0,59.29211,17.994688,57.0,75.385965,693,581,553,763,Lo,Hökmossen-Lo
2,Cedergrensvägen 43,LM-Staden,3745000,3.0,1.0,84.0,0,0,0,0,59.300796,17.998543,51.0,84.019608,301,236,345,129,Lo,LM-Staden-Lo
3,Responsgatan 12,Telefonplan,2760000,2.0,3.0,5.0,1,1,0,0,59.296459,18.002712,46.0,68.340343,380,69,96,237,Lo,Telefonplan-Lo
4,Mikrofonvägen 21,Telefonplan,4400000,2.0,4.0,14.0,1,1,0,0,59.296911,17.999793,61.0,60.262295,216,105,233,256,Mid,Telefonplan-Mid


**Comment** 
<br> It looks good. No NaN-values and the new features look good.
<br> I'll keep `Floor` for now and maybe examine it in comparison to the new `Floor_Category`.

### 3.7 Reorder Columns

In [25]:
# Reorder columns
new_column_order = [
    'Address', 'Price', 'Area', 'Rooms', 'Expenses_per_sqm', 
    'Age of Building', 'Floor', 'Floor_Category', 'District', 'District_Floor', 'Elevator', 'Balcony', 'Patio', 
    'Fireplace', 'Subway Station', 'Grocery Store', 'Gym', 'Bakery', 'Latitude',
    'Longitude'
]

df_train = df_train[new_column_order]
df_test = df_test[new_column_order]

In [26]:
df_train.head()

Unnamed: 0,Address,Price,Area,Rooms,Expenses_per_sqm,Age of Building,Floor,Floor_Category,District,District_Floor,Elevator,Balcony,Patio,Fireplace,Subway Station,Grocery Store,Gym,Bakery,Latitude,Longitude
0,Tellusgången 34,4770000,73.0,3.0,69.905008,9.0,5.0,Mid,Telefonplan,Telefonplan-Mid,1,1,0,0,455,402,208,159,59.298699,17.989062
1,Tångvägen 31,2600000,57.0,2.0,75.385965,80.0,3.0,Lo,Hökmossen,Hökmossen-Lo,0,0,0,0,693,581,553,763,59.29211,17.994688
2,Cedergrensvägen 43,3745000,51.0,3.0,84.019608,84.0,1.0,Lo,LM-Staden,LM-Staden-Lo,0,0,0,0,301,236,345,129,59.300796,17.998543
3,Responsgatan 12,2760000,46.0,2.0,68.340343,5.0,3.0,Lo,Telefonplan,Telefonplan-Lo,1,1,0,0,380,69,96,237,59.296459,18.002712
4,Mikrofonvägen 21,4400000,61.0,2.0,60.262295,14.0,4.0,Mid,Telefonplan,Telefonplan-Mid,1,1,0,0,216,105,233,256,59.296911,17.999793


## 4. Discussion of Further Analysis

There are a number of features that we may want to eliminate, due to multicollinearity or some features being better at predicting `Price`.
<br> At this moment, I am thinking about the following: 

### 4.1 Age of Building

There may be multicollinearity between `Age of Building`, `Fireplace` and `District` (mainly 'Gamla Midsommarkransen').
<br> We will have to examine this further using VIF analysis, and maybe by alternating between features when evaluating models, to determine which features capture the correlation best.
<br> Although we have some NaN-values in this feature, I am not able to fill them easily, and I do not want to drop these rows, as I do not want to lose information for analysis with other features.
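
As a rough sketch of the VIF check mentioned above: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the remaining features. The helper `vif` below is hypothetical and uses plain NumPy least squares; the synthetic columns are invented to show one collinear pair and are not the apartment data.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residuals = y - others @ coef
        r_squared = 1 - residuals.var() / y.var()
        factors.append(1 / (1 - r_squared))
    return np.array(factors)

# Synthetic data: column 1 is nearly a copy of column 0, column 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)

print(vif(np.column_stack([a, b, c])).round(1))
```

The two nearly identical columns should show a large VIF while the independent one stays close to 1. Rows with NaN in `Age of Building` would have to be excluded or imputed before running this kind of check.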

### 4.2 District

I am wondering a bit about the future possible exclusion of `District`.
<br> What makes a certain `District` have a certain correlation with `Price`?
<br> Is it simply the closeness to amenities such as `Grocery Store`, `Subway Station`, `Gym`, and `Bakery`?
<br> Or are there other local factors that aren't covered by the distance features?
<br> These could be a general neighborhood reputation, the attraction of a certain kind of people living there, levels of noise, pollution, nightlife, and crime.
<br> So, let's keep it and compare outcomes by examining effect on models.

### 4.3 Floor

With the creation of the new categorical feature, the binned `Floor_Category`, I am hoping to eliminate the noise from having few observations of the `Floor` values 8 and 10 (one each in the whole dataset).
<br> Binning them together with `Floor` value 6 into the `Floor_Category` 'Hi' should eliminate the outliers and ensuing problems.
<br> So, maybe `Floor` as a feature is redundant. At the same time, I may lose the granularity in `Floor`.

<br> The additional feature that combines `District` and `Floor_Category`, which tries to capture non-linearities that may exist there, is promising.
<br> But which combination works best will need to be evaluated further.

## 5. Saving the Featured Dataset

All looks good now. I am ready to move on to the next step: exploring the data. Let's first save the DataFrames to CSV files.

In [34]:
df_train.to_csv(r'...localpath...\featured_training_set_apartments.csv', index=False)
df_test.to_csv(r'...localpath...\featured_test_set_apartments.csv', index=False)

In [35]:
df1 = pd.read_csv(r'C:\Users\gustm\Desktop\featured_training_set_apartments.csv')
df1.head(5)

Unnamed: 0,Address,Price,Area,Rooms,Expenses_per_sqm,Age of Building,Floor,Floor_Category,District,District_Floor,Elevator,Balcony,Patio,Fireplace,Subway Station,Grocery Store,Gym,Bakery,Latitude,Longitude
0,Tellusgången 34,4770000,73.0,3.0,69.905008,9.0,5.0,Mid,Telefonplan,Telefonplan-Mid,1,1,0,0,455,402,208,159,59.298699,17.989062
1,Tångvägen 31,2600000,57.0,2.0,75.385965,80.0,3.0,Lo,Hökmossen,Hökmossen-Lo,0,0,0,0,693,581,553,763,59.29211,17.994688
2,Cedergrensvägen 43,3745000,51.0,3.0,84.019608,84.0,1.0,Lo,LM-Staden,LM-Staden-Lo,0,0,0,0,301,236,345,129,59.300796,17.998543
3,Responsgatan 12,2760000,46.0,2.0,68.340343,5.0,3.0,Lo,Telefonplan,Telefonplan-Lo,1,1,0,0,380,69,96,237,59.296459,18.002712
4,Mikrofonvägen 21,4400000,61.0,2.0,60.262295,14.0,4.0,Mid,Telefonplan,Telefonplan-Mid,1,1,0,0,216,105,233,256,59.296911,17.999793


Check looks good.