**My first attempt at machine learning used the Housing dataset under Advanced Regression Techniques to predict the sale price of a house. However, that dataset was a fairly easy one, with only 1460 rows and a limited set of attributes to play with.**

Sberbank has now provided us with a dataset to develop algorithms which use a broad spectrum of features to predict realty prices. Competitors will rely on a rich dataset that includes housing data and macroeconomic patterns. An accurate forecasting model will allow Sberbank to provide more certainty to their customers in an uncertain economy.

What I like about this dataset is, first of all, that it has 292 variables covering the internal features of a house, its external features, and attributes of the neighbourhood. In practice, we tend to like a house on the basis of a number of parameters that go beyond its internal features.

In addition, the competition includes a macroeconomic indicator dataset for Russia, which gives you the opportunity to join the two datasets and use the macro attributes for predicting house prices. That makes sense, since as a country's economy grows, property prices tend to go up as well. In Amsterdam, for instance, housing prices went up by 11% in the last two years due to the upward trend of the economy.
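
As a minimal sketch of such a join: the snippet below uses two tiny made-up frames in place of `train.csv` and `macro.csv`. It assumes both files share a `timestamp` column to join on, and `usdrub` stands in for any macro indicator; the values are purely illustrative, not taken from the data.

```python
import pandas as pd

# Toy stand-ins for the housing and macro files; all values are made up.
train_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "timestamp": ["2014-01-10", "2014-01-10", "2014-01-11"],  # assumed join key
    "full_sq": [43, 34, 60],
})
macro_toy = pd.DataFrame({
    "timestamp": ["2014-01-10", "2014-01-11", "2014-01-12"],
    "usdrub": [33.1, 33.2, 33.3],  # hypothetical macro indicator values
})

# A left join keeps every housing row and attaches that day's macro values.
# validate="many_to_one" raises if the macro side has duplicate timestamps.
joined = train_toy.merge(macro_toy, on="timestamp", how="left",
                         validate="many_to_one")
print(joined)
```

A left join is used here so that no housing rows are dropped when a date has no macro record; those rows would simply get missing macro values.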

These are the activities I will carry out as part of this exercise:
1) Explore the attributes in the datasets.
2) Join the housing data and macroeconomic indicators.
3) Analyse and fix missing values.
4) Fix the data quality issues.
5) Evaluate the correlation of internal features.
6) Evaluate the correlation of demographic features.
7) Evaluate the correlation of external features.
8) Evaluate the correlation of macroeconomic features.
9) Select a model.
10) Evaluate model performance.
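
The correlation steps (5 to 8) could each follow the same pattern: pick a group of columns and correlate it with the target. The sketch below is only one possible approach, run on made-up numbers; `internal_features` is a hypothetical grouping, and the real groups would come from exploring the attributes first.

```python
import pandas as pd

# Made-up sample; the column names mirror the housing data, the values do not.
toy = pd.DataFrame({
    "full_sq": [30, 45, 60, 75, 90],
    "life_sq": [18, 30, 35, 50, 55],
    "price_doc": [3.0e6, 4.6e6, 5.9e6, 7.7e6, 9.1e6],
})

# Hypothetical feature group; define one such list per category.
internal_features = ["full_sq", "life_sq"]

# Pearson correlation of each feature in the group with the sale price.
corr_with_price = (
    toy[internal_features]
    .corrwith(toy["price_doc"])
    .sort_values(ascending=False)
)
print(corr_with_price)
```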

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
import sys

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import mode

# Set pandas' max row display
pd.set_option('display.max_rows', 10000)
# Print all the values in an array (recent NumPy versions reject np.nan as a threshold)
np.set_printoptions(threshold=sys.maxsize)
# Show up to 500 columns
pd.set_option('display.max_columns', 500)

In [None]:
# Import the datasets for EDA
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
macro = pd.read_csv('../input/macro.csv')

train['price_doc'].head()

In [None]:
# Identify the columns with missing values
total = train.isnull().sum().sort_values(ascending=False)
# Fraction of rows missing per column (a ratio between 0 and 1, not a percentage)
percent = train.isnull().mean().sort_values(ascending=False)
# concat aligns on the column-name index, so the two sort orders need not match
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.loc[missing_data['Total'] != 0]

In [None]:
train['price_doc'].describe()

In [None]:
np.log(train['price_doc']).skew()

In [None]:
train['price_doc'].kurtosis()

In [None]:
sns.distplot(train['price_doc'], color = 'g', bins = 100)

In [None]:
sns.distplot(np.log(train['price_doc']), color = 'g', bins = 100, kde = True)

In [None]:
train.plot.scatter(x = 'full_sq', y = 'price_doc')

In [None]:
train.plot.scatter(x = 'life_sq', y = 'price_doc')