# Predicting Apartment Prices in Russia

In this project, I'll build a machine-learning model to predict the prices of apartments in Russia. The dataset for this project is obtained from [Daniilak on Kaggle](https://www.kaggle.com/datasets/mrdaniilak/russia-real-estate-2021). The documentation (or description) for the dataset is also available on the same Kaggle page.

I'll follow the common machine learning workflow of:
- Prepare data, which in turn has the following steps:
  - Import data
  - Explore data
  - Split data
- Build model
- Communicate results

In [1]:
# Import libraries
import pandas as pd

## Prepare data

### Import

In [2]:
df = pd.read_csv('datasets/russia-real-estate-dataset.csv')

In [3]:
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 1 columns):
date;price;level;levels;rooms;area;kitchen_area;geo_lat;geo_lon;building_type;object_type;postal_code;street_id;id_region;house_id    object
dtypes: object(1)
memory usage: 86.7+ MB
None


Unnamed: 0,date;price;level;levels;rooms;area;kitchen_area;geo_lat;geo_lon;building_type;object_type;postal_code;street_id;id_region;house_id
0,2021-01-01;2451300;15;31;1;30.3;0;56.7801124;6...
1,2021-01-01;1450000;5;5;1;33;6;44.6081542;40.13...
2,2021-01-01;10700000;4;13;3;85;12;55.5400601;37...
3,2021-01-01;3100000;3;5;3;82;9;44.6081542;40.13...
4,2021-01-01;2500000;2;3;1;30;9;44.7386846;37.71...


The output above shows that the data is messy: the data frame has 11,358,150 rows but only one column. A closer look at the column title and the row entries shows why: what should have been 15 separate features (columns) are lumped into one, separated by semicolons.

How can these features be separated into their own columns?

In [4]:
columns = df.columns.str.split(';')[0]
print(columns)

['date', 'price', 'level', 'levels', 'rooms', 'area', 'kitchen_area', 'geo_lat', 'geo_lon', 'building_type', 'object_type', 'postal_code', 'street_id', 'id_region', 'house_id']


In [5]:
df = df[df.columns.to_list()[0]].str.split(';', expand=True)

In [6]:
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 15 columns):
0     object
1     object
2     object
3     object
4     object
5     object
6     object
7     object
8     object
9     object
10    object
11    object
12    object
13    object
14    object
dtypes: object(15)
memory usage: 1.3+ GB
None


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,2021-01-01,2451300,15,31,1,30.3,0,56.7801124,60.6993548,0,2,620000,,66,1632918.0
1,2021-01-01,1450000,5,5,1,33.0,6,44.6081542,40.1383814,0,0,385000,,1,
2,2021-01-01,10700000,4,13,3,85.0,12,55.5400601,37.7251124,3,0,142701,242543.0,50,681306.0
3,2021-01-01,3100000,3,5,3,82.0,9,44.6081542,40.1383814,0,0,385000,,1,
4,2021-01-01,2500000,2,3,1,30.0,9,44.7386846,37.7136681,3,2,353960,439378.0,23,1730985.0


The new data frame now has 15 features, but the column headings are not descriptive enough. Let me fix that.

In [8]:
df.columns = columns
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 15 columns):
date             object
price            object
level            object
levels           object
rooms            object
area             object
kitchen_area     object
geo_lat          object
geo_lon          object
building_type    object
object_type      object
postal_code      object
street_id        object
id_region        object
house_id         object
dtypes: object(15)
memory usage: 1.3+ GB
None


Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,geo_lat,geo_lon,building_type,object_type,postal_code,street_id,id_region,house_id
0,2021-01-01,2451300,15,31,1,30.3,0,56.7801124,60.6993548,0,2,620000,,66,1632918.0
1,2021-01-01,1450000,5,5,1,33.0,6,44.6081542,40.1383814,0,0,385000,,1,
2,2021-01-01,10700000,4,13,3,85.0,12,55.5400601,37.7251124,3,0,142701,242543.0,50,681306.0
3,2021-01-01,3100000,3,5,3,82.0,9,44.6081542,40.1383814,0,0,385000,,1,
4,2021-01-01,2500000,2,3,1,30.0,9,44.7386846,37.7136681,3,2,353960,439378.0,23,1730985.0
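As an aside, the manual split-and-rename steps above could likely be avoided by telling pandas about the delimiter when the file is first read. The sketch below is a minimal illustration, not what this notebook actually ran: it uses an in-memory sample (`raw`, a hypothetical name) built from the first two rows shown above instead of the full CSV file, and passes `sep=';'` to `pd.read_csv`.

```python
import io

import pandas as pd

# Two rows copied from the output above, in the raw semicolon-delimited layout.
# `raw` stands in for the real file path and is only an illustration.
raw = io.StringIO(
    "date;price;level;levels;rooms;area;kitchen_area;geo_lat;geo_lon;"
    "building_type;object_type;postal_code;street_id;id_region;house_id\n"
    "2021-01-01;2451300;15;31;1;30.3;0;56.7801124;60.6993548;0;2;620000;;66;1632918.0\n"
    "2021-01-01;1450000;5;5;1;33;6;44.6081542;40.1383814;0;0;385000;;1;\n"
)

# sep=';' makes pandas split on semicolons while parsing,
# so the header names and numeric dtypes are inferred in one step.
sample = pd.read_csv(raw, sep=';')

print(sample.shape)
print(sample.dtypes)
```

With this approach the columns arrive already named and numerically typed, so the intermediate all-`object` data frame is never created.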


Let me save this new data frame as a CSV file, so I don't have to repeat the slow process of splitting the column each time I return to this notebook. Then I'll import the new CSV file and continue with data wrangling.

In [9]:
df.to_csv('datasets/russia-real-estate-dataset-clean.csv', index=False)

In [2]:
df = pd.read_csv('datasets/russia-real-estate-dataset-clean.csv')
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11358150 entries, 0 to 11358149
Data columns (total 15 columns):
date             object
price            int64
level            int64
levels           int64
rooms            int64
area             float64
kitchen_area     float64
geo_lat          float64
geo_lon          float64
building_type    int64
object_type      int64
postal_code      float64
street_id        float64
id_region        int64
house_id         float64
dtypes: float64(7), int64(7), object(1)
memory usage: 1.3+ GB
None


Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,geo_lat,geo_lon,building_type,object_type,postal_code,street_id,id_region,house_id
0,2021-01-01,2451300,15,31,1,30.3,0.0,56.780112,60.699355,0,2,620000.0,,66,1632918.0
1,2021-01-01,1450000,5,5,1,33.0,6.0,44.608154,40.138381,0,0,385000.0,,1,
2,2021-01-01,10700000,4,13,3,85.0,12.0,55.54006,37.725112,3,0,142701.0,242543.0,50,681306.0
3,2021-01-01,3100000,3,5,3,82.0,9.0,44.608154,40.138381,0,0,385000.0,,1,
4,2021-01-01,2500000,2,3,1,30.0,9.0,44.738685,37.713668,3,2,353960.0,439378.0,23,1730985.0
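Note that `date` is still read as `object` (plain strings) in the output above. If date arithmetic or time-based filtering is needed later, one option is to ask pandas to parse it on load. The sketch below assumes a small in-memory sample (`clean`, a hypothetical name with only three of the columns) in place of the real file.

```python
import io

import pandas as pd

# A tiny stand-in for the cleaned CSV; only three columns are kept for brevity.
clean = io.StringIO(
    "date,price,area\n"
    "2021-01-01,2451300,30.3\n"
    "2021-01-01,1450000,33.0\n"
)

# parse_dates converts the named column to a datetime dtype while reading.
sample = pd.read_csv(clean, parse_dates=['date'])

print(sample.dtypes)
print(sample['date'].dt.year.unique())
```

The same `parse_dates=['date']` argument could be added to the `pd.read_csv` call on the full file.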


I'll use a function to complete the rest of the data wrangling process. Almost all the features of this data frame are numerical, and for some of them the numbers are codes that are not descriptive on their own (e.g. what does a `building_type` of `0` mean?). So, the first thing the wrangle function will do is replace some of these numerical codes with descriptive text.

In [18]:
sorted(df['id_region'].unique())

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 83,
 86,
 87,
 89,
 91,
 92,
 200]

In [15]:
df[df['id_region'] == 24]

Unnamed: 0,date,price,level,levels,rooms,area,kitchen_area,geo_lat,geo_lon,building_type,object_type,postal_code,street_id,id_region,house_id
76,2021-01-01,1850000,9,9,1,39.00,0.0,56.125549,93.333483,2,0,662501.0,,24,
182,2021-01-01,18000000,10,11,4,153.30,0.0,56.011662,92.890097,4,0,660049.0,385730.0,24,2048192.0
209,2021-01-01,10000000,3,10,4,142.40,0.0,56.040261,92.904633,4,0,660077.0,275162.0,24,684486.0
223,2021-01-01,5350000,7,9,2,65.00,12.0,56.055911,92.896262,1,2,660125.0,373170.0,24,1171299.0
228,2021-01-01,1650000,1,5,3,61.00,6.0,55.533727,89.186785,2,0,662312.0,,24,2077517.0
229,2021-01-01,2750000,4,5,1,32.00,6.0,56.011707,92.840010,0,0,660021.0,,24,
269,2021-01-01,3550000,8,10,1,48.00,-100.0,56.064452,92.923387,4,0,660125.0,,24,
479,2021-01-01,2720000,2,5,2,44.00,0.0,56.034209,92.780327,2,0,660113.0,452592.0,24,926274.0
512,2021-01-01,6000000,1,17,2,54.10,0.0,55.978166,92.811528,0,0,660006.0,,24,
618,2021-01-01,2200000,3,24,1,27.40,-100.0,55.987951,92.866190,0,0,660012.0,,24,


In [None]:
region_id_dict = {
    1: 'Adygea',
    2: 'Bashkortostan',
    3: 'Buryatia',
    4: 'Altai Republic',
    5: 'Dagestan',
    6: 'Ingushetia',
    7: 'Kabardino-Balkar',
    8: 'Kalmykia',
    9: 'Karachay-Cherkess',
    10: 'Karelia',
    11: 'Komi',
    12: 'Mari El',
    13: 'Mordovia',
    14: 'Sakha',
    15: 'North Ossetia–Alania',
    16: 'Tatarstan',
    17: 'Tuva',
    18: 'Udmurt',
    19: 'Khakassia',
    20: 'Chechnya', # Code 20 is used because df has no entries for 95 (Chechen Republic and Chechnya are the same region)
    21: 'Chuvash',
    22: 'Altai Krai',
    23: 'Krasnodar',
    24: 'Krasnoyarsk',
    25: 'Primorsky',
    26: 'Stavropol',
    27: 'Khabarovsk',
    28: 'Amur',
    29: 'Arkhangelsk',
    30: 'Astrakhan',
    31: 'Belgorod',
    32: 'Bryansk',
    33: 'Vladimir',
    34: 'Volgograd',
    35: 'Vologda',
    36: 'Voronezh',
    37: 'Ivanovo',
    38: 'Irkutsk',
    39: 'Kaliningrad',
    40: 'Kaluga',
    41: 'Kamchatka',
    42: 'Kemerovo',
    43: 'Kirov',
    44: 'Kostroma',
    45: 'Kurgan',
    46: 'Kursk',
    47: 'Leningrad',
    48: 'Lipetsk',
    49: 'Magadan',
    50: 'Moscow Oblast',
    51: 'Murmansk',
    52: 'Nizhny Novgorod',
    53: 'Novgorod',
    54: 'Novosibirsk',
    55: 'Omsk',
    56: 'Orenburg',
    57: 'Oryol',
    58: 'Penza',
    59: 'Perm',
    60: 'Pskov',
    61: 'Rostov',
    62: 'Ryazan',
    63: 'Samara',
    64: 'Saratov',
    65: 'Sakhalin',
    66: 'Sverdlovsk',
    67: 'Smolensk',
    68: 'Tambov',
    69: 'Tver',
    70: 'Tomsk',
    71: 'Tula',
    72: 'Tyumen',
    73: 'Ulyanovsk',
    74: 'Chelyabinsk',
    75: 'Zabaykalsky',
    76: 'Yaroslavl',
    77: 'Moscow',
    78: 'St. Petersburg',
    79: 'Jewish Autonomous Oblast',
    83: 'Nenets',
    86: 'Khanty-Mansi',
    87: 'Chukotka',
    88: 'Evenk',
    89: 'Yamalo-Nenets',
    91: 'Kaliningrad', # NOTE: same name as code 39; verify this code against the dataset documentation
    92: 'Sevastopol'
}
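One way to apply a mapping like this is `Series.map`, which also makes it easy to spot codes that have no entry in the dictionary (for example, `200` appears in the `id_region` values listed earlier but not in the mapping). The sketch below is a minimal illustration under stated assumptions: `region_excerpt` is a three-entry excerpt of the dictionary, and `sample` is a hypothetical four-row data frame, not the real data.

```python
import pandas as pd

# A three-entry excerpt of the region mapping, kept short for the example.
region_excerpt = {24: 'Krasnoyarsk', 50: 'Moscow Oblast', 66: 'Sverdlovsk'}

# Hypothetical sample: 200 occurs in the data but has no entry in the mapping.
sample = pd.DataFrame({'id_region': [66, 50, 24, 200]})

# map() looks each code up in the dict; codes without an entry become NaN.
sample['region'] = sample['id_region'].map(region_excerpt)

# List the codes that were left unmapped so they can be reviewed.
unmapped = sorted(sample.loc[sample['region'].isna(), 'id_region'].unique())

print(sample)
print(unmapped)
```

Running the same check with the full dictionary on the full `id_region` column would show which codes still need a name before the column is replaced.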