# Sberbank Russian Housing Market - Exploratory Data Analysis

In this kernel, we will take a look at the train data to see what we are dealing with in this competition.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

### Amount of data

After importing a few useful libraries, let's load the data and see how much of it we have:

In [None]:
# Load train data
df = pd.read_csv('../input/train.csv')

In [None]:
print(df.columns.tolist())
print('\nNumber of columns on train data:',len(df.columns))
print('\nNumber of data points:',len(df))
print('\nNumber of unique timestamp data points:',len(df['timestamp'].unique()))
print('\nNumber of unique id data points:',len(df['id'].unique()))
print('\nData types:',df.dtypes.unique())

If we discard 'id' and 'timestamp' as features, and price_doc because that is the target variable, we have 289 features and only 30471 data points, so overfitting could be a problem here.

It seems we have both categorical and numerical data. Let's take a look at the categorical data:

In [None]:
df.select_dtypes(include=['O']).columns.tolist()

In [None]:
df.select_dtypes(include=['O']).head()
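
Beyond looking at the first rows, it can help to count how many distinct levels each categorical column has. The following is only a minimal sketch on a hypothetical toy frame (the column names `cat_a`, `cat_b` and `num` are made up for illustration and are not columns of train.csv); it selects the non-numeric columns with `exclude=['number']`, which is an assumption equivalent in spirit to the `include=['O']` filter used above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the train data
toy = pd.DataFrame({
    'cat_a': ['x', 'y', 'x', 'x'],
    'cat_b': ['p', 'q', 'r', 'p'],
    'num': [1.0, 2.0, 3.0, 4.0],
})

# Keep only the non-numeric (categorical) columns
cat = toy.select_dtypes(exclude=['number'])

# Number of distinct levels per categorical column, largest first
cardinality = cat.nunique().sort_values(ascending=False)
print(cardinality)

# Frequency of each level for one of the columns
counts = toy['cat_a'].value_counts()
print(counts)
```

On the real data, the same two calls applied to `df` would show which categorical features are simple binary flags and which have many levels.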

### Missing data

Let's check the quality of the data: next we are going to look at the structure of the NaNs in the train set.

In [None]:
print('\nNumber of columns which have any NaN:',df.isnull().any().sum(),'/',len(df.columns))
print('\nNumber of rows which have any NaN:',df.isnull().any(axis=1).sum(),'/',len(df))

Now we are going to plot a bar chart showing the % of missing data for each of the 51 features that have missing observations:

In [None]:
# Get the number of NaNs for each column, discarding those with zero NaNs
ranking = df.loc[:,df.isnull().any()].isnull().sum().sort_values()
# Turn into % (multiply the fraction by 100 so it matches the axis label)
x = 100*ranking.values/len(df)

# Plot bar chart
index = np.arange(len(ranking))
plt.bar(index, x)
plt.xlabel('Features')
plt.ylabel('% NaN observations')
plt.title('% of null data points for each feature')
plt.show()

print('Features:',ranking.index.tolist())

As we can see in the graph, some features have almost 45% of their data missing.
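
The per-feature percentages can also be obtained in one step, since the mean of a boolean null mask is the fraction of missing values. This is a small sketch on a hypothetical toy frame (the columns `a`, `b`, `c` are invented for illustration), not the computation used for the plot above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'a' has 1 NaN, 'b' has 2 NaNs, 'c' has none
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 3.0, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# Mean of the null mask = fraction missing; times 100 gives a percentage
pct_missing = 100 * toy.isnull().mean()

# Keep only the columns that actually have missing values, sorted ascending
pct_missing = pct_missing[pct_missing > 0].sort_values()
print(pct_missing)
```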

Since the rows with missing data may overlap between features, it is important to see how much data we actually lose when we use more than one feature. Note that there are 51! (1.55e+66) different orders in which to sequentially accumulate the features which have missing observations, each giving a different curve of data lost as a function of the features selected. Here we will use a very simple heuristic, which may be suboptimal but is, I guess, near-optimal: add features to the list sorted in ascending order by number of missing values.

In [None]:
rank_features = ranking.index
accum_nulls = list()
accum_features = list()
for feature in rank_features:
    # On each step, we add a new feature to the list
    accum_features.append(feature)
    # We calculate the number of rows with NaNs for that list of features
    accum_nulls.append(len(df)-len(df[accum_features].dropna()))
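
The loop above calls `dropna` on a growing list of columns. As a hedged alternative, the same accumulated counts can be sketched with a running logical OR over the null mask: a row counts as lost at step k if any of the first k features is missing. The toy frame below is hypothetical (columns `a`, `b`, `c` are made up), and this is one possible vectorized formulation rather than the method used in this kernel.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'a' has 1 NaN, 'c' has 2 NaNs, 'b' has 3 NaNs
toy = pd.DataFrame({
    'a': [np.nan, 1, 1, 1, 1, 1],
    'b': [1, 1, np.nan, np.nan, np.nan, 1],
    'c': [np.nan, np.nan, 1, 1, 1, 1],
})

# Same heuristic as above: order the features by ascending number of NaNs
order = toy.isnull().sum().sort_values().index

# Boolean null mask with columns in that order
mask = toy[order].isnull().to_numpy()

# Running OR along the columns: True once any feature so far is missing,
# then count the lost rows at each step
accum = np.logical_or.accumulate(mask, axis=1).sum(axis=0)
print(list(order), accum.tolist())  # ['a', 'c', 'b'] [1, 2, 5]
```

Here the NaNs of `a` are a subset of those of `c`, so adding `c` costs only one extra row, while `b` is missing on different rows and costs three more.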

In [None]:
# Calculate the % of missing data (multiply the fraction by 100)
x = 100*np.array(accum_nulls)/len(df)

# Plot
index = np.arange(len(x))
plt.bar(index, x)
plt.xlabel('Features')
plt.ylabel('% NaN accumulated observations')
plt.title('% of null data points accumulated until each feature')
plt.show()

print('Features:',accum_features)

In this graph we see that, as we add features which contain missing data, we gradually lose more and more data points, until we lose 80% of the data if we use all the features. There seem to be groups of features whose missing rows fully overlap. This may be due to a changing data collection procedure, in which new features were added after the database was started. This is a very typical problem in this type of data. Let's check if I'm right:

In [None]:
y = df.groupby('timestamp').apply(lambda f: f.isnull().sum().sum()).values

plt.plot(y)
plt.xlabel('timestamp')
plt.ylabel('Number of NaNs')
plt.title('Missing data structure over timestamp')
plt.show()

Unexpected result. If my initial guess were right (that new features were added to the data over time), we should see the amount of missing data decrease over time. Maybe there are differences in the number of data points for each timestamp, so let's normalize the series by dividing by the number of observations on each timestamp:

In [None]:
y = df.groupby('timestamp').apply(lambda x: x.isnull().sum().sum()/len(x)).values

plt.plot(y)
plt.xlabel('timestamp')
plt.ylabel('Number of NaNs per observation')
plt.title('Missing data structure over timestamp normalized')
plt.show()

This plot looks better, but it still doesn't show the pattern I was expecting. There is no clear time-dependent structure in the missing data.
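
Independently of time, the idea that some features are missing on exactly the same rows can be checked directly by grouping columns whose null masks are identical. The helper below, `group_by_null_pattern`, is a hypothetical name introduced only for this sketch, and the toy columns are invented; it is one possible way to look for such groups, not a step of the original analysis.

```python
import numpy as np
import pandas as pd

def group_by_null_pattern(frame):
    """Return lists of columns that share exactly the same NaN rows."""
    mask = frame.isnull()
    # Only columns with at least one NaN are of interest
    mask = mask.loc[:, mask.any()]
    groups = {}
    for col in mask.columns:
        # The raw bytes of the boolean mask serve as a hashable pattern key
        key = mask[col].to_numpy().tobytes()
        groups.setdefault(key, []).append(col)
    # Keep only patterns shared by more than one column
    return [cols for cols in groups.values() if len(cols) > 1]

# Hypothetical toy frame: 'a' and 'b' are missing on the same rows
toy = pd.DataFrame({
    'a': [np.nan, 1, np.nan, 1],
    'b': [np.nan, 2, np.nan, 2],
    'c': [1, np.nan, 1, 1],
    'd': [1, 1, 1, 1],
})

shared = group_by_null_pattern(toy)
print(shared)  # [['a', 'b']]
```

Calling the helper on `df` would list the groups of train features that always go missing together, if any exist.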


### Correlations between features and target

Next we are going to calculate the correlations between all the features and the target variable in order to see how difficult the problem is:

In [None]:
# Get the list of features
features = df.iloc[:,2:-1].select_dtypes(exclude=['O']).columns.tolist()
# Get the target name
target = df.iloc[:,-1].name

In [None]:
correlations = dict()
for feat in features:
    df_temp = df[[feat,target]]
    df_temp = df_temp.dropna()
    x1 = df_temp[feat].values
    x2 = df_temp[target].values
    key = feat + ' vs ' + target
    correlations[key] = pearsonr(x1,x2)[0]

In [None]:
df_corrs = pd.DataFrame(correlations, index=['R']).T
df_corrs.loc[df_corrs['R'].abs().sort_values(ascending=False).index].iloc[:20]
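
As a cross-check on the loop above, pandas can compute the same Pearson correlations in one call with `DataFrame.corrwith`, which drops missing values pairwise for each feature when given a Series. The sketch below runs on hypothetical synthetic data (`feat_a`, `feat_b` and `target` are made-up names, and the numbers are randomly generated, not taken from the competition).

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: feat_a drives the target, feat_b is noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'feat_a': rng.normal(size=50),
    'feat_b': rng.normal(size=50),
})
toy['target'] = 2 * toy['feat_a'] + rng.normal(size=50)

# Introduce some missing values in one feature
toy.loc[::7, 'feat_a'] = np.nan

# Pearson correlation of every feature with the target, NaNs dropped pairwise
corrs = toy[['feat_a', 'feat_b']].corrwith(toy['target'])

# Sort by absolute correlation, strongest first
top = corrs.reindex(corrs.abs().sort_values(ascending=False).index)
print(top)
```

With the real data this would read `df[features].corrwith(df[target])`, assuming `features` and `target` are defined as in the cells above.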

It seems that number of rooms and square area are the best regressors (for the moment). Let's plot them jointly with the target variable to see how good that correlation is:

In [None]:
y = df.loc[:,['num_room','full_sq',target]].dropna().sort_values(target,ascending=True).values
x = np.arange(y.shape[0])

In [None]:
plt.subplot(3, 1, 1)
plt.plot(x,y[:,0])
plt.title('num_room & full_sq vs price')
plt.ylabel('num_room')

plt.subplot(3, 1, 2)
plt.plot(x,y[:,1])
plt.ylabel('full_sq')

plt.subplot(3, 1, 3)
plt.plot(x,y[:,2],'r')
plt.ylabel('price')
    
plt.show()

The dataset has some price outliers. It might be interesting to remove them and see if the correlations improve. Let's plot the distribution of the price:

In [None]:
x = df[target].values.astype('int')
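# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(x, kde=True)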
plt.show()
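
To test whether removing price outliers improves the correlations, one option is to trim the rows above a high price quantile and recompute. The sketch below uses hypothetical synthetic data (`area` and `price` are made-up columns with injected extreme prices), and the 95th percentile cutoff is an arbitrary assumption, not a recommended threshold for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: price grows linearly with area plus noise
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({'area': rng.uniform(30, 100, size=n)})
toy['price'] = 1000 * toy['area'] + rng.normal(0, 5000, size=n)

# Inject five extreme prices unrelated to area (.loc slicing is inclusive)
toy.loc[:4, 'price'] = 1e6

# Arbitrary cutoff: drop everything above the 95th percentile of price
cutoff = toy['price'].quantile(0.95)
trimmed = toy[toy['price'] <= cutoff]

# Pearson correlation before and after trimming
r_all = toy['area'].corr(toy['price'])
r_trimmed = trimmed['area'].corr(trimmed['price'])
print(len(toy), len(trimmed), r_all, r_trimmed)
```

Note that trimming on the target also removes legitimate expensive houses, so on the real data the cutoff would need to be chosen by looking at the distribution above.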

### Conclusions

Very interesting problem. None of the house features can fully explain the price, so feature engineering will play an important role in this competition.

I will post a new analysis on macro data soon.

Thanks for reading. This is my first kernel, and I hope you enjoyed it.