# Predicting Apartment Prices in Russia

In this project, I'll build a machine-learning model to predict the prices of apartments in Russia. The dataset comes from [Daniilak on Kaggle](https://www.kaggle.com/datasets/mrdaniilak/russia-real-estate-2021), and its documentation is available on the same Kaggle page.

I'll follow the common machine learning workflow of:
- Prepare data, which in turn has the following steps:
  - Import data
  - Explore data
  - Split data
- Build model
- Communicate results

In [2]:
# Import libraries
import pandas as pd

## Prepare data

### Import

In [3]:
df = pd.read_csv('datasets/russia-real-estate-dataset.csv')

In [4]:
# df.info() prints its summary directly and returns None,
# so wrapping it in print() would add a stray "None" to the output
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 1 columns):
 #   Column                                                                                                                              Dtype 
---  ------                                                                                                                              ----- 
 0   date;price;level;levels;rooms;area;kitchen_area;geo_lat;geo_lon;building_type;object_type;postal_code;street_id;id_region;house_id  object
dtypes: object(1)
memory usage: 86.7+ MB


|   | date;price;level;levels;rooms;area;kitchen_area;geo_lat;geo_lon;building_type;object_type;postal_code;street_id;id_region;house_id |
|---|---|
| 0 | 2021-01-01;2451300;15;31;1;30.3;0;56.7801124;6... |
| 1 | 2021-01-01;1450000;5;5;1;33;6;44.6081542;40.13... |
| 2 | 2021-01-01;10700000;4;13;3;85;12;55.5400601;37... |
| 3 | 2021-01-01;3100000;3;5;3;82;9;44.6081542;40.13... |
| 4 | 2021-01-01;2500000;2;3;1;30;9;44.7386846;37.71... |


The output above shows that the data is messy: the data frame has 11,358,150 rows but only a single column. A closer look at the column name and the row entries explains why. What should have been separate features (columns) are lumped into one string, separated by semicolons.

How do you separate these features into their own columns?

In [8]:
columns = df.columns.str.split(';')[0]
print(columns)

['date', 'price', 'level', 'levels', 'rooms', 'area', 'kitchen_area', 'geo_lat', 'geo_lon', 'building_type', 'object_type', 'postal_code', 'street_id', 'id_region', 'house_id']


In [None]:
df[df.columns.to_list()[0]].str.split(';').head()
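
The split above returns a Series in which each entry is a list of fields, not a table. One possible way to turn those lists into real columns is to pass `expand=True` to `str.split` and reuse the split header as the column names. The sketch below runs on a small toy frame, not the real dataset: the names `raw_name`, `toy`, and `parts` are hypothetical, the values are made up for illustration, and only four of the fifteen fields are included.

```python
import pandas as pd

# Toy stand-in for the real data: a single column whose name and values
# are semicolon-delimited. The values are made up for illustration.
raw_name = "date;price;rooms;area"
toy = pd.DataFrame(
    {raw_name: ["2021-01-01;1000000;1;30.5", "2021-01-02;2500000;2;54"]}
)

# expand=True returns a DataFrame with one column per field
# instead of a Series of lists.
parts = toy[raw_name].str.split(";", expand=True)

# Reuse the split header as the new column names.
parts.columns = raw_name.split(";")

# Every field is still a string at this point, so convert the numeric ones.
for col in ["price", "rooms", "area"]:
    parts[col] = pd.to_numeric(parts[col])

print(parts.dtypes)
```

Note that splitting strings this way leaves every field as text, so the numeric columns need an explicit conversion before any modelling.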

In [9]:
df.columns

Index(['date;price;level;levels;rooms;area;kitchen_area;geo_lat;geo_lon;building_type;object_type;postal_code;street_id;id_region;house_id'], dtype='object')
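
Since the column index still holds one long semicolon-delimited name, another option worth considering is to handle the delimiter when the file is loaded. `pd.read_csv` accepts a `sep` argument, and passing `sep=';'` should let pandas split the fields and infer their types in one step. The sketch below demonstrates this on a tiny in-memory sample, not on the real file: `sample` and `toy_df` are hypothetical names and the values are made up.

```python
import io

import pandas as pd

# Toy in-memory "file" in the same semicolon-delimited layout.
# The values are made up for illustration.
sample = io.StringIO(
    "date;price;rooms;area\n"
    "2021-01-01;1000000;1;30.5\n"
    "2021-01-02;2500000;2;54\n"
)

# sep=";" tells pandas that fields are separated by semicolons, not commas.
toy_df = pd.read_csv(sample, sep=";")

print(toy_df.shape)
print(toy_df.dtypes)
```

Under that assumption, the same argument applied to the original `pd.read_csv` call would avoid the manual splitting, and the numeric columns would be parsed as numbers instead of strings.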