# A Multi-Method Analysis of the Russian Housing Market
### 3804ICT Assignment Part I | Data Investigation Notebook | Trimester 2, 2019

Joshua Russell (s5057545) | joshua.russell2@griffithuni.edu.au


Joshua Mitchell (s5055278) | joshua.mitchell4@griffithuni.edu.au


Hayden Flatley (s5088623) | hayden.flatley@griffithuni.edu.au

//(Intro)

The sections of the data investigation are as follows:

**1) Data Exploration**

- Number of Data Samples and Attributes
    
**2) Data Visualisation**

- Heading
    - Sub-heading

**3) Data Pre-Processing**

- Heading
    - Sub-heading

![title](SS.png)

In [26]:
# Imports
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the Russian Housing Market datasets into pandas DataFrames
df_test = pd.read_csv("Data/test.csv")
df_train = pd.read_csv("Data/train.csv")

## 1) Data Exploration

### Number of Data Samples and Attributes

The primary property dataset was provided online in two separate files, one for training and one for testing. Below, we display the number of data samples and attributes in each of the datasets:

In [3]:
# Testing set
print("Testing set\n- No. data samples: {}\n- No. of attributes: {}".format(df_test.shape[0], df_test.shape[1]))

Testing set
- No. data samples: 7662
- No. of attributes: 291


In [4]:
# Training set
print("Training set\n- No. data samples: {}\n- No. of attributes: {}".format(df_train.shape[0], df_train.shape[1]))

Training set
- No. data samples: 30471
- No. of attributes: 292


The testing set has one fewer attribute than the training set. The following code finds the name of this attribute:

In [5]:
list(set(df_train.columns).difference(set(df_test.columns)))[0]

'price_doc'

The data dictionary of the Russian Housing Market dataset describes this attribute as the sale price of the property, and moreover as the target variable for house price prediction. These two datasets will be useful for the methods of regression and forecasting. However, for other data mining methods we plan to investigate, such as frequent pattern mining, there is no need to split the data into training and testing sets. As a result, we will concatenate the datasets together for pre-processing. 

In [6]:
# Add a 'price_doc' column to the test DataFrame filled with missing values (NaN),
# which keeps the column numeric after concatenation
df_test_price_doc = df_test.assign(price_doc=np.nan)

In [7]:
# Concatenate test and train DataFrames
df = pd.concat([df_train, df_test_price_doc], sort=False, ignore_index=True)

In [8]:
# Size of the complete dataset (both training and testing data samples)
print("Complete dataset\n- No. data samples: {}\n- No. of attributes: {}".format(df.shape[0], df.shape[1]))

Complete dataset
- No. data samples: 38133
- No. of attributes: 292


### Types of Attributes

Here we investigate the types of attributes within the dataset. For an initial look, we display each attribute with its corresponding datatype and an example value from the dataset.

In [9]:
# Attributes with corresponding Datatypes and Examples from the Russian Housing Market dataset
print("{:<40} {:<15} {}".format("Attribute", "Datatype", "Example Value"))
print("{:<40} {:<15} {}".format("---------", "--------", "-------------"))

for col in df.columns:
    # Use the first non-missing value in the column as the example
    example_value = df[col].dropna().iloc[0]
    print("{:<40} {:<15} {}".format(col, str(df[col].dtype), example_value))

Attribute                                Datatype        Example Value
---------                                --------        -------------
id                                       int64           1
timestamp                                object          2011-08-20
full_sq                                  float64         43.0
life_sq                                  float64         27.0
floor                                    float64         4.0
max_floor                                float64         17.0
material                                 float64         1.0
build_year                               float64         1907.0
num_room                                 float64         2.0
kitch_sq                                 float64         11.0
state                                    float64         3.0
product_type                             object          Investment
sub_area                                 object          Bibirevo
area_m                                   

Attributes fall into four distinct types: *nominal attributes*, *ordinal attributes*, *interval-scaled attributes* and *ratio-scaled attributes*. Attributes can additionally be classified as discrete or continuous. Since the primary dataset has 292 attributes in total, we will not describe the type of each one. Instead, we provide examples of attributes from the dataset that fulfil the criteria of each attribute type.

**Nominal attributes**

Nominal attributes are those which are qualitative and do not have any inherent order or ranking. An example of this type of attribute within the dataset is the `sub_area` attribute. This attribute represents the name of the district that the data sample (property) belongs to. It takes on values such as "Juzhnoe Butovo" and "Perovo".

**Ordinal attributes**

Ordinal attributes are again qualitative. However, the values that these attributes take on have a meaningful order. In the primary housing market dataset there are no attributes that are ordinal. 
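The difference between nominal and ordinal attributes can be made explicit with pandas categorical types. The following is a minimal sketch on made-up values rather than the real dataset; the `condition` column and its levels are hypothetical and only illustrate what an ordinal attribute would look like.

```python
import pandas as pd

# Toy values for illustration only; `condition` is a hypothetical ordinal attribute
toy = pd.DataFrame({
    "sub_area": ["Perovo", "Juzhnoe Butovo", "Perovo"],
    "condition": ["poor", "good", "average"],
})

# Nominal: the categories carry no order
nominal = pd.Categorical(toy["sub_area"], ordered=False)

# Ordinal: the categories carry an explicit, meaningful order
ordinal = pd.Categorical(
    toy["condition"],
    categories=["poor", "average", "good"],
    ordered=True,
)

print(nominal.ordered)               # False
print(ordinal.ordered)               # True
print(ordinal.min(), ordinal.max())  # poor good
```

Order-based operations such as `min()` and `max()` are only defined for the ordered categorical, which mirrors the definition of an ordinal attribute given above.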

**Interval-scaled attributes**

Interval-scaled attributes are ordered and, as the name implies, are measured on a scale of equal-sized intervals. The distinguishing factor between interval-scaled and ratio-scaled attributes is that interval-scaled attributes do not have a true zero point (i.e. they can be positive, zero, or negative). After examining the name and meaning of each attribute in the primary dataset, we did not find any that appeared interval-scaled. We further check this observation below by searching the numeric attributes for negative values:

In [36]:
# Check for negative values in the numeric columns of the DataFrame
for col_name in df.select_dtypes(include=["float64", "int64"]).columns:
    # NaN < 0 evaluates to False, so missing values are ignored
    neg_count = (df[col_name] < 0).sum()

    if neg_count > 0:
        print("Column: {}".format(col_name))
        print("Negative values: {}".format(neg_count))

**Ratio-scaled attributes**

Ratio-scaled attributes, like those which are interval-scaled, are ordered measurements which have a particular scale. However, what differentiates ratio-scaled attributes is that they have a true zero point (i.e. they can be positive or zero). An example of a ratio-scaled attribute within the housing market dataset is `full_sq`. This attribute represents the total area of the property in square meters. Since area has an inherent zero-point, as you cannot have a house with negative area, this is a clear example of a ratio-scaled attribute.  

**Discrete attributes**

Discrete attributes are attributes which take on a finite or countably infinite set of possible values. The attribute `product_type` is an example of a discrete attribute within the primary dataset. This attribute states whether the property was bought as an investment property, or for owner-occupancy. Since there are two possible values this attribute can take on (i.e. "OwnerOccupier" or "Investment"), it has a finite set of values and is therefore a discrete attribute. 

**Continuous attributes**

Continuous attributes, in contrast to discrete attributes, take on real-valued numbers over a continuous range. An example of this type of attribute in the Russian housing market dataset is `metro_min_walk`, which gives the time (in minutes) it would take to walk from the property to the metro. This attribute has floating-point values over a continuous range and can therefore be classified as a continuous attribute.
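One rough way to tell discrete from continuous attributes in practice is to compare each column's datatype with its number of distinct values. The sketch below uses a small made-up DataFrame that borrows the column names `product_type` and `metro_min_walk`; the values and the helper `describe_cardinality` are assumptions for illustration, not part of the dataset or of any library.

```python
import pandas as pd

# Toy values for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "product_type": ["Investment", "OwnerOccupier", "Investment", "Investment"],
    "metro_min_walk": [13.6, 7.0, 21.9, 4.3],
})

def describe_cardinality(frame):
    """Return the dtype and number of distinct non-missing values per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
    })

summary = describe_cardinality(toy)
print(summary)
```

This is only a heuristic: a floating-point column with very few distinct values may still be conceptually discrete, so the result should be read alongside the data dictionary.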

### Feature Selection

The primary Russian housing market dataset contains a significant number of attributes, 292 in total. Consequently, for this investigation we will perform feature selection to choose a few attributes that seem interesting and/or significant for predicting property sale price, and study these in data exploration and visualisation. Firstly, we will perform manual feature selection by examining the features and using domain knowledge to determine which attributes would not be interesting.
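As one possible illustration of selecting attributes that are significant for predicting sale price, the sketch below ranks numeric columns by their absolute Pearson correlation with the target. It is not necessarily the method used in this investigation: the helper `rank_by_correlation` is hypothetical, the data is synthetic, and the approach assumes the target column is numeric (as `price_doc` is in the training set).

```python
import numpy as np
import pandas as pd

def rank_by_correlation(frame, target="price_doc", top_n=5):
    """Rank numeric columns by absolute Pearson correlation with the target."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(top_n)

# Synthetic data for illustration only; not the real dataset
rng = np.random.default_rng(0)
full_sq = rng.uniform(30, 100, size=200)
toy = pd.DataFrame({
    "full_sq": full_sq,
    "noise": rng.normal(size=200),
    "price_doc": 100000 * full_sq + rng.normal(scale=1e5, size=200),
})

ranking = rank_by_correlation(toy)
print(ranking)
```

Pearson correlation only captures linear relationships, so a low score does not by itself show that an attribute is uninformative.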

#### Manual Selection

#### Analysis/Metric-based Selection

### Statistical Information

## 2) Data Visualisation

## 3) Data Pre-Processing