In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Prepare Dataset

### 1. Define Analytical needs

1. Frame the problem statement mathematically.

    It is a **supervised, offline, regression** task.

2. Select performance measure.

    As this is a classic regression task, where the cost of a house has to be predicted, a suitable performance measure is **RMSE** (root mean squared error).

3. How would we solve the problem manually?

    It is theoretically possible to write an equation with zip code, area (in square meters), number of rooms, longitude, and latitude as independent variables and price as the dependent variable. The coefficients (slopes and intercept) could be estimated by minimizing the RMSE.

4. List the assumptions arising from the research questions so far.

    - the larger the area of a house, the higher the cost of a house
    - the larger the number of rooms, the higher the cost of a house
    - the closer the longitude and latitude are to the city center, the higher the cost of a house. There may also be district clusters.
    

5. Verify assumptions (if possible).

    All assumptions will be verified during EDA.

6. Fetch the data

In [2]:
data = pd.read_csv(
    "data/HousingPrices-Amsterdam-August-2021.csv",
    usecols= [1,2,3,4,5,6,7]
)
print(data.shape)
data.head()

(924, 7)


,Address,Zip,Price,Area,Room,Lon,Lat
0,"Blasiusstraat 8 2, Amsterdam",1091 CR,685000.0,64,3,4.907736,52.356157
1,"Kromme Leimuidenstraat 13 H, Amsterdam",1059 EL,475000.0,60,3,4.850476,52.348586
2,"Zaaiersweg 11 A, Amsterdam",1097 SM,850000.0,109,4,4.944774,52.343782
3,"Tenerifestraat 40, Amsterdam",1060 TH,580000.0,128,6,4.789928,52.343712
4,"Winterjanpad 21, Amsterdam",1036 KN,720000.0,138,5,4.902503,52.410538
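
Steps 2 and 3 above can be sketched together: fit a linear equation by ordinary least squares, then score it with RMSE. This is only an illustration. It assumes the five rows printed above as data and uses just `Area` and `Room` as predictors; the names `X`, `y`, `coef`, and `rmse` are illustrative and do not come from the notebook. Least squares minimizes the sum of squared errors, which gives the same coefficients as minimizing RMSE.

```python
import numpy as np

# The five rows printed above (Area in square meters, Room count, Price).
area = np.array([64, 60, 109, 128, 138], dtype=float)
room = np.array([3, 3, 4, 6, 5], dtype=float)
y = np.array([685000, 475000, 850000, 580000, 720000], dtype=float)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(area), area, room])

# Ordinary least squares: coefficients that minimize the squared error.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Error of the fitted equation versus always predicting the mean price.
rmse_fit = rmse(y, X @ coef)
rmse_mean = rmse(y, np.full_like(y, y.mean()))

print(f"intercept={coef[0]:.0f}, area={coef[1]:.0f}, room={coef[2]:.0f}")
print(f"RMSE fit: {rmse_fit:.0f}, RMSE mean baseline: {rmse_mean:.0f}")
```

With only five rows the coefficients carry no real meaning; the point is the mechanics of fitting and scoring.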


### 2. Data Understanding

1. Check how much space the data will take, and make sure your workspace has enough storage if you are dealing with big datasets.

In [3]:
size_b = data.memory_usage(deep=True).sum()  # total size in bytes, including object columns
size_mb = size_b / (1024 * 1024)  # convert bytes to megabytes
print(f"Size data: {size_mb:.2f} Mb")

Size data: 0.17 Mb


2. Check the type of data (time series, sample, geographical, etc.) and make sure each column has the type you expect.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 924 entries, 0 to 923
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Address  924 non-null    object 
 1   Zip      924 non-null    object 
 2   Price    920 non-null    float64
 3   Area     924 non-null    int64  
 4   Room     924 non-null    int64  
 5   Lon      924 non-null    float64
 6   Lat      924 non-null    float64
dtypes: float64(3), int64(2), object(2)
memory usage: 50.7+ KB
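
The `info()` output above reports 920 non-null values for `Price` against 924 rows, so four prices are missing. Such gaps could be counted explicitly along the following lines. The toy frame is a stand-in loosely based on the rows shown earlier, with one `Price` blanked for illustration; `toy`, `missing_per_column`, and `rows_missing_price` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; one Price is deliberately missing.
toy = pd.DataFrame(
    {
        "Price": [685000.0, np.nan, 850000.0],
        "Area": [64, 60, 109],
        "Room": [3, 3, 4],
    }
)

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows whose target is missing cannot be used for supervised training or scoring.
rows_missing_price = toy[toy["Price"].isna()]
print(rows_missing_price)
```

On the real frame the same count would come from `data.isna().sum()`.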


### 3. Data Preparation

1. Convert the data to a format that is easy to manipulate (without changing the data itself; e.g. .csv, .json).

    In this case the dataset is already in an easy-to-manipulate format, i.e., `.csv`.

2. Set aside a test set before training any machine learning models.

In [5]:
# Split into train and test sets; a fixed seed keeps the split reproducible
data_train, data_test = train_test_split(data, test_size=0.2, random_state=42)

# Check the sizes of both sets
print(data_train.shape, data_test.shape)

(739, 7) (185, 7)


3. Store the train and test sets locally

In [6]:
data_train.to_csv(
    path_or_buf="data/data_train.csv",
    header=True,  # Write out the column names
    index=False,  # discard index as it is not informative
)

data_test.to_csv(
    path_or_buf="data/data_test.csv",
    header=True,  # Write out the column names
    index=False,  # discard index as it is not informative
)
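
As a sanity check, a stored file could be read back and compared with the frame it came from. The sketch below writes a toy frame to a temporary directory instead of `data/`, so it does not touch the real files; `toy_train`, `reloaded`, `same_shape`, and `same_columns` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for data_train.
toy_train = pd.DataFrame(
    {
        "Zip": ["1091 CR", "1059 EL"],
        "Price": [685000.0, 475000.0],
        "Area": [64, 60],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "data_train.csv"
    # Same options as in the cell above: keep the header, drop the index.
    toy_train.to_csv(path, header=True, index=False)
    reloaded = pd.read_csv(path)

# The round trip should preserve the shape and the column order.
same_shape = reloaded.shape == toy_train.shape
same_columns = list(reloaded.columns) == list(toy_train.columns)
print(same_shape, same_columns)
```

Because `index=False` is used when writing, no extra index column appears when the file is read back.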