# A Multi-Method Analysis of the Russian Housing Market
### 3804ICT Assignment Part I | Data Investigation Notebook | Trimester 2, 2019

Joshua Russell (s5057545) | joshua.russell2@griffithuni.edu.au


Joshua Mitchell (s5055278) | joshua.mitchell4@griffithuni.edu.au


Hayden Flatley (s5088623) | hayden.flatley@griffithuni.edu.au

//(Intro)

The sections of the data investigation are as follows:

**1. Data Exploration**

    1.1. Number of Data Samples and Attributes
    
**2. Data Pre-Processing**

- Heading
    - Sub-heading

**3. Data Visualisation**

- Heading
    - Sub-heading

![title](SS.png)

In [69]:
# Imports
import math
import folium
import random
import statistics
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

from IPython.display import display
from geopy.geocoders import Nominatim

In [2]:
# Load the Russian Housing Market datasets into pandas DataFrames
df_test = pd.read_csv("Data/test.csv")
df_train = pd.read_csv("Data/train.csv")

## 1. Data Exploration

### 1.1. Number of Data Samples and Attributes

The primary property dataset was provided online in two separate files, one for training and one for testing. Below, we display the number of data samples and attributes in each of the datasets:

In [3]:
# Testing set
print("Testing set\n- No. data samples: {}\n- No. of attributes: {}".format(df_test.shape[0], df_test.shape[1]))

Testing set
- No. data samples: 7662
- No. of attributes: 291


In [4]:
# Training set
print("Training set\n- No. data samples: {}\n- No. of attributes: {}".format(df_train.shape[0], df_train.shape[1]))

Training set
- No. data samples: 30471
- No. of attributes: 292


The testing set has one fewer attribute than the training set. The following code finds the name of this attribute:

In [5]:
list(set(df_train.columns).difference(set(df_test.columns)))[0]

'price_doc'

The data dictionary of the Russian Housing Market dataset describes this attribute as the sale price of the property, and moreover as the target variable for house price prediction. These two datasets will be useful for the methods of regression and forecasting. However, for other data mining methods we plan to investigate, such as frequent pattern mining, there is no need to split the data into training and testing sets. As a result, we will concatenate the datasets together for pre-processing. 

In [6]:
# Add a 'price_doc' column to the test DataFrame filled with `None` values
df_test_price_doc = df_test.assign(price_doc=pd.Series([None for i in range(df_test.shape[0])]).values)

In [7]:
# Concatenate test and train DataFrames
df = pd.concat([df_train, df_test_price_doc], sort=False, ignore_index=True)

In [8]:
# Shape of the concatenated dataset (both training and testing data samples)
print("Complete dataset\n- No. data samples: {}\n- No. of attributes: {}".format(df.shape[0], df.shape[1]))

Complete dataset
- No. data samples: 38133
- No. of attributes: 292
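
Because the test rows carry a missing `price_doc`, the original split can be recovered from the concatenated DataFrame whenever a method needs it. The sketch below illustrates the idea on two small made-up DataFrames; the names `toy_train` and `toy_test` and all values are assumptions for illustration, not the real data. It also fills the missing prices with `np.nan` rather than `None`, which keeps the column numeric.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test sets (values are made up)
toy_train = pd.DataFrame({"id": [1, 2, 3],
                          "full_sq": [43.0, 34.0, 89.0],
                          "price_doc": [5850000, 6000000, 13100000]})
toy_test = pd.DataFrame({"id": [4, 5], "full_sq": [39.0, 79.2]})

# Concatenate, leaving price_doc missing (NaN) for the test rows
toy_df = pd.concat([toy_train, toy_test.assign(price_doc=np.nan)],
                   sort=False, ignore_index=True)

# Rows with a known price are training samples; the rest are test samples
has_price = toy_df["price_doc"].notna()
recovered_train = toy_df[has_price]
recovered_test = toy_df[~has_price].drop(columns="price_doc")

print(recovered_train.shape, recovered_test.shape)
```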


### 1.2. Types of Attributes

Here we investigate the types of attributes within the dataset. For an initial look, we display each attribute with its corresponding datatype and an example value from the dataset:

In [9]:
# Attributes with corresponding Datatypes and Examples from the Russian Housing Market dataset
print("{:<40} {:<15} {}".format("Attribute", "Datatype", "Example Value"))
print("{:<40} {:<15} {}".format("---------", "--------", "-------------"))

example_values = []
for col in list(df.columns):
    values = [x for x in list(df[col].values) if str(x) != "nan"]
    example_values.append(values[0])
    
for col in np.c_[list(df.columns), list(df.dtypes), list(example_values)]:
    print("{:<40} {:<15} {}".format(col[0], str(col[1]), col[2]))

Attribute                                Datatype        Example Value
---------                                --------        -------------
id                                       int64           1
timestamp                                object          2011-08-20
full_sq                                  float64         43.0
life_sq                                  float64         27.0
floor                                    float64         4.0
max_floor                                float64         17.0
material                                 float64         1.0
build_year                               float64         1907.0
num_room                                 float64         2.0
kitch_sq                                 float64         11.0
state                                    float64         3.0
product_type                             object          Investment
sub_area                                 object          Bibirevo
area_m                                   

There are four distinct types of attributes, those being *nominal attributes*, *ordinal attributes*, *interval-scaled attributes* and *ratio-scaled attributes*. Furthermore, there are also discrete and continuous classifications for distinguishing types of attributes. Since there are a total of 292 attributes in the primary dataset we will not describe the specific attribute type for each attribute. Instead, we will provide examples of attributes from the dataset that fulfill the criteria of the different attribute types.

**Nominal attributes**

Nominal attributes are those which are qualitative and do not have any inherent order or ranking. An example of this type of attribute within the dataset is the `sub_area` attribute. This attribute represents the name of the district that the data sample (property) belongs to. It takes on values such as "Juzhnoe Butovo" and "Perovo".

**Ordinal attributes**

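Ordinal attributes are again qualitative. However, the values that these attributes take on have a meaningful order. We did not identify any attributes in the primary housing market dataset that are explicitly presented as ordinal, although `state`, which encodes the condition of the property as a small set of numeric codes, could arguably be treated as one.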

**Interval-scaled attributes**

Interval-scaled attributes are ordered and, as the name implies, are measured on a scale of equal-sized units. The distinguishing factor between interval-scaled and ratio-scaled attributes is that interval-scaled attributes do not have a true zero point (i.e. zero does not mean the absence of the quantity, so values can be positive, zero, or negative). After examining the name and meaning of each attribute in the primary dataset, we did not find any that appeared interval-scaled. A negative value would indicate an attribute without a true zero point, so we further check this observation below by searching the numeric columns for negative values:

In [47]:
# Check for negative values in the numeric columns of the DataFrame
for col_name in df.select_dtypes(include=["float64", "int64"]).columns:
    # NaN < 0 evaluates to False, so missing values are never counted
    num_negative = int((df[col_name] < 0).sum())

    if num_negative:
        print("Column: {}".format(col_name))
        print("Negative values: {}".format(num_negative))

**Ratio-scaled attributes**

Ratio-scaled attributes, like those which are interval-scaled, are ordered measurements which have a particular scale. However, what differentiates ratio-scaled attributes is that they have a true zero point (i.e. they can be positive or zero). An example of a ratio-scaled attribute within the housing market dataset is `full_sq`. This attribute represents the total area of the property in square meters. Since area has an inherent zero-point, as you cannot have a house with negative area, this is a clear example of a ratio-scaled attribute.  

**Discrete attributes**

Discrete attributes are attributes which take on a finite or countably infinite set of possible values. The attribute `product_type` is an example of a discrete attribute within the primary dataset. This attribute states whether the property was bought as an investment property, or for owner-occupancy. Since there are two possible values this attribute can take on (i.e. "OwnerOccupier" or "Investment"), it has a finite set of values and is therefore a discrete attribute. 

**Continuous attributes**

Continuous attributes, in contrast to discrete attributes, take on real-valued numbers over a continuous range. An example of this type of attribute in the Russian housing market dataset is `metro_min_walk`, which gives the time (in minutes) it would take to walk from the property to the metro. This attribute has floating-point values over a continuous range and can therefore be classified as a continuous attribute.
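
One rough way to tell discrete from continuous attributes is to compare each column's datatype with its number of distinct values. The sketch below does this on a small made-up sample that reuses the two column names discussed above. The values are assumptions for illustration, and the unique-value count is only a heuristic, not a definition.

```python
import pandas as pd

# Toy sample (made-up values) using two column names from the dataset
toy = pd.DataFrame({
    "product_type": ["Investment", "OwnerOccupier", "Investment", "Investment"],
    "metro_min_walk": [13.58, 7.62, 17.35, 1.50],
})

# A handful of repeated labels suggests a discrete attribute;
# floating-point values that rarely repeat suggest a continuous one
for col in toy.columns:
    n_unique = toy[col].nunique()
    print("{:<20} dtype={:<10} unique values={}".format(col, str(toy[col].dtype), n_unique))
```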

### 1.3. Feature Selection

The primary Russian housing market dataset contains a significant number of attributes, 292 in total. Consequently, for this investigation we will perform feature selection to select a few attributes which seem interesting and/or significant for predicting property sale price to study in data exploration and visualisation. Firstly, we will perform manual feature selection by examining the features and using domain knowledge to determine which attributes should not be investigated. We will then use metrics and data analysis techniques to determine which attributes seem most promising to examine. 

#### Manual Selection

The following attributes were selected for data exploration and visualisation based upon our domain knowledge of what attributes would be significant in predicting house price and what attributes would be interesting to study. 

`price_doc` - this attribute represents the sale price of the property. This attribute will be interesting to explore as it allows us to find trends in the price and determine what other attributes are correlated to house price. Furthermore, the sale price will be the target variable for our regression and forecasting methods. 

`full_sq` - this attribute represents the total area of the property in square meters (including loggias, balconies and other non-residential living areas). We believe that the total area of the property would be a strong indicator of its price, as larger homes are generally more expensive than smaller ones.

`sub_area` - this attribute represents the name of the district that the property is within. Characteristics of the district, such as its reputation, its population, or whether it has a large central business district will all affect the price of the property and other property/neighbourhood attribute values. Because of this, we thought that it would be interesting to explore. 

`num_room` - this attribute represents the number of living rooms.

`build_year` - this attribute represents the year that the property was built. Since newer apartments and homes are generally more expensive, we assume that this attribute will be useful for predicting house price. Furthermore, it may also reveal interesting relationships between house age and the district, or house age and the demographic information of the neighbourhood.

`state` - this attribute represents the condition that the property is in at time of purchase.

#### Metric-Based Selection - Correlation Analysis
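
A minimal sketch of how correlation-based selection could work is given below: it ranks numeric attributes by the absolute value of their Pearson correlation with the target. The data is synthetic, and the columns `metro_km` and `noise_col` are hypothetical names introduced only for this illustration. Pearson correlation captures linear relationships only, so a low score does not rule out a non-linear dependence.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): price depends strongly on area,
# weakly on distance to the metro, and not at all on noise_col
rng = np.random.default_rng(0)
n = 500
full_sq = rng.uniform(20, 120, n)
metro_km = rng.uniform(0, 10, n)
noise_col = rng.normal(0, 1, n)
price_doc = 100000 * full_sq - 200000 * metro_km + rng.normal(0, 500000, n)

toy = pd.DataFrame({"full_sq": full_sq, "metro_km": metro_km,
                    "noise_col": noise_col, "price_doc": price_doc})

# Pearson correlation of every attribute with the target, ranked by magnitude
corr_with_price = toy.corr()["price_doc"].drop("price_doc")
ranking = corr_with_price.abs().sort_values(ascending=False)

print(ranking)
```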

#### Metric-Based Selection - XGB Classifier*

#### Selecting Attributes

Here we create a DataFrame with the selected attributes for data exploration and visualisation.

In [52]:
df_exp = df[["timestamp", "price_doc", "full_sq", "sub_area", "num_room", "build_year", "state", 
             "public_transport_station_km", "industrial_km", "metro_min_avto", 
             "market_shop_km", "material", "park_km", "green_zone_km", "school_km"]]

### 1.4. Statistical Information

#### Measures of Central Tendency

Mean Median Midrange Mode | Median Approximation
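
As a small illustration of these measures, the sketch below computes the mean, median, mode and midrange of a made-up sample of room counts (the values are assumptions, not drawn from the dataset). The single large value shows how the mean and the midrange are pulled towards an outlier while the median and mode are not.

```python
import statistics

# Small made-up sample of room counts with one extreme value
sample = [1, 2, 2, 3, 2, 1, 4, 2, 3, 19]

mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)

# Midrange: the average of the smallest and largest values
midrange = (min(sample) + max(sample)) / 2

print("Mean: {:.2f}".format(mean))
print("Median: {:.2f}".format(median))
print("Mode: {}".format(mode))
print("Midrange: {:.2f}".format(midrange))
```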

In [95]:
df_exp.describe()

| | full_sq | num_room | build_year | state | public_transport_station_km | industrial_km | metro_min_avto | market_shop_km | material | park_km | green_zone_km | school_km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 38133.0 | 28561.0 | 23479.0 | 23880.0 | 38133.0 | 38133.0 | 38133.0 | 38133.0 | 28561.0 | 38133.0 | 38133.0 | 38133.0 |
| mean | 54.111172 | 1.900844 | 2716.785 | 2.07165 | 0.424452 | 0.767431 | 4.902702 | 3.937603 | 1.83439 | 3.064809 | 0.30141 | 1.347122 |
| std | 35.171162 | 0.84762 | 130852.1 | 0.864795 | 1.352674 | 0.742974 | 6.473965 | 3.477093 | 1.490923 | 3.915439 | 0.293917 | 3.159086 |
| min | 0.0 | 0.0 | 0.0 | 1.0 | 0.002804 | 0.0 | 0.0 | 0.003847 | 1.0 | 0.003737 | 0.0 | 0.0 |
| 25% | 38.9 | 1.0 | 1966.0 | 1.0 | 0.102853 | 0.286587 | 1.719313 | 1.540511 | 1.0 | 0.964876 | 0.103223 | 0.273073 |
| 50% | 50.0 | 2.0 | 1980.0 | 2.0 | 0.162834 | 0.571575 | 2.769542 | 2.887683 | 1.0 | 1.787349 | 0.216846 | 0.480208 |
| 75% | 63.0 | 2.0 | 2006.0 | 3.0 | 0.279753 | 1.040324 | 4.788853 | 5.439734 | 2.0 | 3.30426 | 0.420417 | 0.899203 |
| max | 5326.0 | 19.0 | 20052010.0 | 33.0 | 17.413002 | 14.048162 | 65.101125 | 41.103651 | 6.0 | 47.351538 | 2.036755 | 47.394706 |


In [12]:
def median_approximation(values, num_intervals=5):
    values_min = min(values)
    values_max = max(values)
    values_range = values_max - values_min
    
    # Determine the range of each interval (NumPy scalar types are accepted as well)
    if isinstance(values[0], (int, np.integer)):
        interval_range = values_range // (num_intervals)
    elif isinstance(values[0], (float, np.floating)):
        interval_range = values_range / (num_intervals)
    else:
        raise TypeError("Elements of the passed array must be of type 'int' or 'float'")
        
    # Dictionary mapping interval number to interval range
    interval_ranges = {}  
    
    # Dictionary mapping interval number to values falling within the interval range
    interval_values = {}  
    
    # Determine the range for each interval
    for i in range(num_intervals):
        
        # First interval
        if i == 0:
            interval_ranges[i] = (values_min, values_min + interval_range)
        
        # Last interval
        elif i == num_intervals - 1:
            interval_ranges[i] = (values_min + (i * interval_range), values_max + 1)
            
        # Middle interval
        else:
            interval_ranges[i] = (values_min + (i * interval_range), values_min + ((i + 1) * interval_range))
            
        # Set initial frequency count of interval
        interval_values[i] = 0
            
    # Determine the frequency of values that fall within each interval
    for v in values:
        for i, i_range in interval_ranges.items():
            if i_range[0] <= v < i_range[1]:
                interval_values[i] += 1
                break
    
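    # Locate the median interval: the first interval whose cumulative
    # frequency reaches N/2 (it is not necessarily the middle interval)
    N = len(values)
    sum_lower_freq = 0
    for median_interval in range(num_intervals):
        if sum_lower_freq + interval_values[median_interval] >= N / 2:
            break
        sum_lower_freq += interval_values[median_interval]

    # Calculate the remaining median approximation formula components
    freq_median = interval_values[median_interval]
    width = interval_ranges[median_interval][1] - interval_ranges[median_interval][0]
    L1 = interval_ranges[median_interval][0]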
    
    # Calculate the median approximation
    median_approx = L1 + ((N/2 - sum_lower_freq) / freq_median) * width
    
    return median_approx

In [38]:
nums = [random.randint(0, 100) for x in range(10000)]
print("Statistics library median function: {:.2f}".format(statistics.median(nums)))
print("Median approximation: {:.2f}".format(median_approximation(nums)))

Statistics library median function: 51.00
Median approximation: 51.14
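
The approximation computed by `median_approximation` follows the standard formula for the median of grouped data, written here with symbols that mirror the variable names in the code:

$$
\text{median} \approx L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width}
$$

Here $L_1$ is the lower boundary of the median interval (the interval that contains the median), $N$ is the total number of values, $\left(\sum \text{freq}\right)_l$ is the combined frequency of all intervals below the median interval (`sum_lower_freq`), $\text{freq}_{\text{median}}$ is the frequency of the median interval, and $\text{width}$ is the width of that interval. As a worked example with hypothetical counts, taking $N = 10000$, $L_1 = 40$, $\text{width} = 20$, $\left(\sum \text{freq}\right)_l = 3950$ and $\text{freq}_{\text{median}} = 2000$ gives

$$
40 + \left( \frac{5000 - 3950}{2000} \right) \times 20 = 40 + 0.525 \times 20 = 50.5
$$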


#### Five-Number Summary 

Range, min, quartiles, median, max, variance, standard deviation
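
The sketch below shows one way these quantities could be computed with NumPy on a small made-up sample of property areas (the values are assumptions for illustration). Quartiles use NumPy's default linear interpolation, and `ddof=1` is chosen so that the variance and standard deviation are the sample versions.

```python
import numpy as np

# Small made-up sample of property areas in square metres
areas = np.array([30.0, 38.0, 45.0, 50.0, 52.0, 63.0, 70.0, 95.0])

# Five-number summary: minimum, first quartile, median, third quartile, maximum
minimum, q1, median, q3, maximum = np.percentile(areas, [0, 25, 50, 75, 100])

value_range = maximum - minimum
iqr = q3 - q1

# ddof=1 gives the sample (rather than population) variance and standard deviation
variance = areas.var(ddof=1)
std_dev = areas.std(ddof=1)

print("Five-number summary:", minimum, q1, median, q3, maximum)
print("Range: {:.2f}, IQR: {:.2f}".format(value_range, iqr))
print("Variance: {:.2f}, Standard deviation: {:.2f}".format(variance, std_dev))
```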

## 2. Data Pre-Processing

## 3. Data Visualisation

### Bubble Map

In [79]:
# Find the house purchase counts of each district
values = df['sub_area'].value_counts().keys().tolist()
counts = df['sub_area'].value_counts().tolist()

# Normalise counts into market shares (fractions of total sales, in the range 0-1)
s = sum(counts)
market_shares = [c/s for c in counts]

# Capitalise the titles of each district for visualisation
districts = [w.title() for w in values]

# Find the corresponding latitudes and longitudes of the districts
located_districts = []
location_latitudes = []
location_longitudes = []
located_location_sales_count = []
located_location_market_shares = []

geolocator = Nominatim(user_agent='joshua_russell', timeout=10)

for i, location in enumerate(districts):
    loc = geolocator.geocode(location + ", Moscow, Russia")
    
    # Only record data for districts whose latitude and longitude could be found
    if loc is not None:
        located_districts.append(location)
        location_latitudes.append(loc.latitude)
        location_longitudes.append(loc.longitude)
        located_location_sales_count.append(counts[i])
        located_location_market_shares.append(market_shares[i])

# Create a DataFrame to plot the circles on the map
data = pd.DataFrame({
    "lat": location_latitudes,
    "lon": location_longitudes,
    "district": located_districts,
    "market_share": located_location_market_shares,
    "sales_count": located_location_sales_count
})
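
The market shares above are computed by dividing each district's count by the total. As an aside, pandas can produce the same fractions directly with `value_counts(normalize=True)`. The sketch below checks that the two approaches agree on a made-up list of district labels (the sample values are assumptions for illustration).

```python
import pandas as pd

# Made-up district labels standing in for the real sub_area column
sub_area = pd.Series(["Perovo", "Perovo", "Bibirevo", "Perovo", "Juzhnoe Butovo"])

# Manual normalisation: divide each count by the total number of sales
counts = sub_area.value_counts()
manual_shares = counts / counts.sum()

# Built-in normalisation
builtin_shares = sub_area.value_counts(normalize=True)

print(builtin_shares)
```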

In [94]:
# Create an empty map centred on Moscow
m = folium.Map(location=[55.7558, 37.6173], zoom_start=10)

# Add circles at each location scaled to the market share of the particular location
for row in data.itertuples(index=False):
    folium.Circle(
        location=[float(row.lat), float(row.lon)],
        popup="<strong>{}</strong> ({} Home Sales)".format(row.district, row.sales_count),
        radius=float(row.market_share * 100000),
        color='crimson',
        fill=True,
        fill_color='crimson'
    ).add_to(m)

# Display the map
display(m)
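
In the map above, the circle radius is proportional to market share, so the circle area grows with the square of the share and large districts look disproportionately dominant. If the area should instead be proportional to the share, one option (a suggestion, not what the notebook does) is to scale the radius by the square root of the share. The helper `radius_for_share` and its `max_radius_m` parameter are hypothetical names introduced for this sketch.

```python
import math

def radius_for_share(share, max_share, max_radius_m=5000.0):
    """Return a radius in metres such that circle area is proportional to share."""
    return max_radius_m * math.sqrt(share / max_share)

# Two made-up market shares: the first is four times the second
shares = [0.04, 0.01]
radii = [radius_for_share(s, max(shares)) for s in shares]
areas = [math.pi * r ** 2 for r in radii]

print("Radii (m):", radii)
print("Area ratio:", areas[0] / areas[1])
```

With this scaling, a district with four times the market share is drawn with four times the area, rather than sixteen times.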

In [68]:
# Save the map to a HTML file
m.save("Russian-House-Market-Share-Map.html")

## Notes

Noise - check when life_sq is greater than full_sq
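
A minimal sketch of how this check could be done is shown below on a made-up sample (the values are assumptions, not rows from the dataset). Comparisons involving `NaN` evaluate to `False`, so rows with a missing `life_sq` are not flagged.

```python
import numpy as np
import pandas as pd

# Toy sample: the second row has a living area larger than its total area
toy = pd.DataFrame({
    "full_sq": [43.0, 34.0, 50.0, 60.0],
    "life_sq": [27.0, 40.0, np.nan, 60.0],
})

# Flag rows where the living area exceeds the total area
suspect = toy[toy["life_sq"] > toy["full_sq"]]

print("Rows where life_sq > full_sq: {}".format(len(suspect)))
```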

Noise - 9 with 0 stories

Reasoning for 1 story not being most common property.

We expected 1 to be the most common value of the `floor` attribute, but it was not; instead, 19 floors was. Our reasoning for why this could be is that the dataset comes from a bank that deals with transactions, and people would buy and sell apartments more often than rural 1 story houses. Moreover, more apartments would be bought and sold nowadays because they are being constructed more often.