# Predicting Apartment Prices in Russia

In this project, I'll build a machine-learning model to predict the prices of apartments in Russia. The dataset comes from [Daniilak on Kaggle](https://www.kaggle.com/datasets/mrdaniilak/russia-real-estate-2021), and its documentation is available on the same Kaggle page.

I'll follow a common machine-learning workflow:
- Prepare data, which breaks down into:
  - Import data
  - Explore data
  - Split data
- Build model
- Communicate results

In [1]:
# Import libraries
import pandas as pd

## Prepare data

### Import

In [2]:
df = pd.read_csv('datasets/russia-real-estate-dataset.csv')

In [3]:
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 1 columns):
date;price;level;levels;rooms;area;kitchen_area;geo_lat;geo_lon;building_type;object_type;postal_code;street_id;id_region;house_id    object
dtypes: object(1)
memory usage: 86.7+ MB
None


Unnamed: 0,date;price;level;levels;rooms;area;kitchen_area;geo_lat;geo_lon;building_type;object_type;postal_code;street_id;id_region;house_id
0,2021-01-01;2451300;15;31;1;30.3;0;56.7801124;6...
1,2021-01-01;1450000;5;5;1;33;6;44.6081542;40.13...
2,2021-01-01;10700000;4;13;3;85;12;55.5400601;37...
3,2021-01-01;3100000;3;5;3;82;9;44.6081542;40.13...
4,2021-01-01;2500000;2;3;1;30;9;44.7386846;37.71...


The output above shows that the data is messy: the data frame has 11,358,150 rows but only one column. A closer look at the column title and the row entries reveals why. What should have been 15 separate features (or columns) are lumped into a single string, separated by semicolons.

How do you split these features into their own columns?

In [4]:
# Split the lone column name on ';' to recover the individual feature names
columns = df.columns.str.split(';')[0]
print(columns)

['date', 'price', 'level', 'levels', 'rooms', 'area', 'kitchen_area', 'geo_lat', 'geo_lon', 'building_type', 'object_type', 'postal_code', 'street_id', 'id_region', 'house_id']


In [5]:
# Split every row of the single column on ';' and expand the pieces into separate columns
df = df.iloc[:, 0].str.split(';', expand=True)

In [6]:
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 15 columns):
0     object
1     object
2     object
3     object
4     object
5     object
6     object
7     object
8     object
9     object
10    object
11    object
12    object
13    object
14    object
dtypes: object(15)
memory usage: 1.3+ GB
None


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,2021-01-01,2451300,15,31,1,30.3,0,56.7801124,60.6993548,0,2,620000,,66,1632918.0
1,2021-01-01,1450000,5,5,1,33.0,6,44.6081542,40.1383814,0,0,385000,,1,
2,2021-01-01,10700000,4,13,3,85.0,12,55.5400601,37.7251124,3,0,142701,242543.0,50,681306.0
3,2021-01-01,3100000,3,5,3,82.0,9,44.6081542,40.1383814,0,0,385000,,1,
4,2021-01-01,2500000,2,3,1,30.0,9,44.7386846,37.7136681,3,2,353960,439378.0,23,1730985.0
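
The split is easier to follow on a small example. The sketch below rebuilds the same situation with a toy frame (the name `raw` and the three-field layout are illustrative assumptions, with values borrowed from the first two rows above) and uses plain Python `str.split` on the header in place of the `.str.split(';')[0]` call.

```python
import pandas as pd

# Toy stand-in for the raw file: one column whose name and values are
# semicolon-delimited strings (hypothetical three-field layout).
raw = pd.DataFrame(
    {"date;price;rooms": ["2021-01-01;2451300;1", "2021-01-01;1450000;1"]}
)

# Split the single header string into a list of field names.
names = raw.columns[0].split(";")

# Split each row on ';' and spread the pieces across new columns.
parts = raw.iloc[:, 0].str.split(";", expand=True)

# The expanded frame has integer column labels until names are assigned.
parts.columns = names

print(parts)
```

Every value in `parts` is still a string at this point, which matches the `object` dtypes reported by `df.info()`.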


The new data frame now has 15 features, but the column headings are just integer positions (0 to 14) rather than descriptive names. Let me fix that.

In [8]:
df.columns = columns
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 15 columns):
date             object
price            object
level            object
levels           object
rooms            object
area             object
kitchen_area     object
geo_lat          object
geo_lon          object
building_type    object
object_type      object
postal_code      object
street_id        object
id_region        object
house_id         object
dtypes: object(15)
memory usage: 1.3+ GB
None


Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,geo_lat,geo_lon,building_type,object_type,postal_code,street_id,id_region,house_id
0,2021-01-01,2451300,15,31,1,30.3,0,56.7801124,60.6993548,0,2,620000,,66,1632918.0
1,2021-01-01,1450000,5,5,1,33.0,6,44.6081542,40.1383814,0,0,385000,,1,
2,2021-01-01,10700000,4,13,3,85.0,12,55.5400601,37.7251124,3,0,142701,242543.0,50,681306.0
3,2021-01-01,3100000,3,5,3,82.0,9,44.6081542,40.1383814,0,0,385000,,1,
4,2021-01-01,2500000,2,3,1,30.0,9,44.7386846,37.7136681,3,2,353960,439378.0,23,1730985.0
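
Note that every column is still of `object` dtype, because splitting strings produces strings. One possible way to handle this during later wrangling, sketched here on a small sample and not part of the original notebook, is to convert columns with `pd.to_numeric` and `pd.to_datetime`. The frame `sample` and the choice of columns are assumptions for illustration, and the sketch assumes missing entries are empty strings.

```python
import pandas as pd

# Small sample mirroring the string-typed frame above; the empty string
# stands in for a missing street_id (assumed representation).
sample = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-01"],
        "price": ["10700000", "3100000"],
        "area": ["85.0", "82.0"],
        "street_id": ["242543.0", ""],
    }
)

# Parse the date strings into proper datetime values.
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

for col in ["price", "area", "street_id"]:
    # errors="coerce" turns entries that cannot be parsed into NaN.
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
```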


Let me save this new data frame as a CSV file, so I don't have to repeat the slow process of splitting the column each time I return to this notebook. Then I'll import the new CSV file and continue with data wrangling.

In [9]:
df.to_csv('datasets/russia-real-estate-dataset-clean.csv', index=False)
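
An alternative that may avoid the manual split altogether is to tell `pd.read_csv` about the delimiter through its `sep` parameter. The sketch below is not what this notebook does. It assumes the file uses a plain semicolon delimiter throughout, and it reads a hypothetical two-row, six-field sample from memory instead of the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as the raw file
# (only the first six fields, values taken from the rows shown above).
sample_file = io.StringIO(
    "date;price;level;levels;rooms;area\n"
    "2021-01-01;2451300;15;31;1;30.3\n"
    "2021-01-01;1450000;5;5;1;33\n"
)

# sep=';' makes the parser split fields on semicolons instead of commas.
df_sample = pd.read_csv(sample_file, sep=";")

print(df_sample.dtypes)
```

Read this way, the columns arrive already named, and pandas infers numeric dtypes for the numeric fields instead of leaving everything as strings.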