## Exploratory Data Analysis - Basic

First, let's import the libraries that we will need for our data analysis. 

 - *Numpy* - Advanced mathematical functions and linear algebra
 - *Pandas* - Data analytics and easy CSV input / output
 - *Matplotlib* - Basic plotting functionality
 - *Seaborn* - "Snazzier" plots - automatically updates matplotlib plots

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'
pd.set_option('display.max_columns', 500)

## Read in the train, test and macro data

The pandas package that we imported above comes with a function (pandas.read_csv) which reads in a CSV file and turns it into a pandas DataFrame.  pandas also comes with functions for reading in many other types of data, such as JSON, Excel and SAS.  See http://pandas.pydata.org/pandas-docs/version/0.20/io.html for further details. 
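As a minimal sketch of what `parse_dates` does, the snippet below reads a tiny in-memory CSV instead of a file on disk. The CSV text and the name `example_df` are made up for illustration and are not part of the competition data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "timestamp,price\n2011-08-20,5850000\n2011-08-23,6000000\n"

# parse_dates converts the named column from text to a datetime type
example_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

print(example_df.dtypes)
```

Without `parse_dates`, the timestamp column would be read as plain text (dtype 'object').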

In [None]:
train_df = pd.read_csv('../input/train.csv', parse_dates=['timestamp'])
test_df = pd.read_csv('../input/test.csv', parse_dates=['timestamp'])
macro_df = pd.read_csv('../input/macro.csv', parse_dates=['timestamp'])

### Look at the head and tail of the data

Let's start by looking at the first and last few records of each data set to get a feel for what they look like.  If you are used to R, using 'head' and 'tail' to inspect the first or last few records of a data frame will be familiar to you.  In R, they are functions and you call them as such, e.g. head(train_df).  In pandas, they are methods and you call them on the object you wish to apply them to.  For example, if your object is 'df_', you would type df_.head().
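For illustration, here is a small sketch of both methods on a toy data frame. The data frame `df_` and its columns are invented for this example.

```python
import pandas as pd

# Toy data frame with ten rows (illustrative only)
df_ = pd.DataFrame({"id": range(1, 11), "value": [x * 1.5 for x in range(1, 11)]})

first_rows = df_.head()    # first 5 rows by default
last_three = df_.tail(3)   # pass a number to override the default

print(first_rows)
print(last_three)
```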

In [None]:
# Fill in the line of code below so that it applies the head() method to train_df
# The default is to show the first 5 records.  
# Type a number in the brackets (e.g. head(10)) to override the default
# What number does the row numbering start from?
# Are there any missing values (NaN is the symbol for Not A Number - and represents missing values)?

train_df. 

In [None]:
train_df.tail() # this will show us the last five records in the training set

Already we are getting a feel for the data - note that NaN is short for Not a Number and represents missing data.

Let's continue with a look at the test set.

In [None]:
test_df.head()

And finally let's have a quick look at the macro data.

In [None]:
macro_df.head()

### It is also useful to have an idea of the number of rows and columns of each dataset

The attribute 'shape' will give you the number of rows and columns of a pandas dataframe.  Unlike head() and tail(), it is an attribute rather than a method, so it is written without brackets.  Apply shape to test_df, train_df and macro_df.
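A quick sketch on a toy data frame (the names `df_`, `n_rows` and `n_cols` are illustrative) shows that shape is a (rows, columns) tuple:

```python
import pandas as pd

# Toy data frame with 3 rows and 2 columns (illustrative only)
df_ = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = df_.shape

print(df_.shape)  # (3, 2)
```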

In [None]:
# To read the shape attribute of a data frame called df_, type df_.shape (no brackets). 
# Fill in the missing shape attribute below after the period.
# Enclosing this in print() ensures that the result is printed.
# Do the train and test data have the same number of columns?  If not, why not?
print(train_df.)
print(test_df.)
print(macro_df.)

### Next we take a look at the datatypes contained in the training set
The attribute 'dtypes' will give you the datatype of each column in a data frame

In [None]:
# In the line below, fill in the 'dtypes' attribute after the period
# What type are most of the data fields?
data_types = train_df.
print(data_types)

Run the code below to see how many of each type of data there are.  (This code uses a very popular approach called split-apply-combine.  In R this is achieved with packages such as dplyr.  In Python this is part of the pandas package.  We don't have time to discuss this further.  See http://pandas.pydata.org/pandas-docs/stable/groupby.html for further details.)
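To make the idea concrete, here is a minimal split-apply-combine sketch on a toy table. The table `toy` and the column name `n_features` are invented for illustration, and `size()` is just one of several ways to count rows per group.

```python
import pandas as pd

# Toy table: one row per feature, with its data type (illustrative only)
toy = pd.DataFrame({
    "feature": ["a", "b", "c", "d"],
    "dtype": ["int64", "float64", "int64", "object"],
})

# Split by dtype, apply a row count to each group, combine into one table
counts = toy.groupby("dtype").size().reset_index(name="n_features")

print(counts)
```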

In [None]:
# Find the type of each column and store the results in a data frame
df_dataTypes = train_df.dtypes.reset_index()

# Rename the columns for convenience (note the columns attribute being used)
df_dataTypes.columns = ["count","dtype"]

# So far we have one line per feature
df_dataTypes.head(10)

# In the next code chunk we use split-apply-combine to summarise it..

In [None]:
# Now use split-apply-combine to get the result
# First we choose the data we are going to summarise, in this case df_dataTypes[["count","dtype"]]
# Then we split it into groups of dtypes with .groupby(by = "dtype")
# Finally we apply the 'length' function (len) and combine with the aggregate function

df_dataTypes[["count","dtype"]].groupby(by = "dtype").aggregate(len).reset_index()

# Are most of the columns numeric, strings (i.e. 'object') or dates?

The code below uses the seaborn package to create a bar chart of the number of each datatype.  For those used to ggplot in R, you will find the seaborn output sometimes looks surprisingly familiar.  We don't have time to review the code below in detail, but there are many introductory tutorials online (for example https://elitedatascience.com/python-seaborn-tutorial).  

seaborn is typically imported as 'sns'.  We have done the same, which is why you will see it referred to as 'sns' below.

In [None]:
# Run the code below to produce a bar chart showing the count of each data type.

# Count of different datatypes
plt.figure(figsize=(10,8))
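# Pass the data with the x= keyword (recent seaborn releases no longer accept it positionally)
sns.countplot(x=df_dataTypes['dtype'].astype(str))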
plt.xlabel('dtype', fontsize=24)
plt.ylabel('count', fontsize=24)
plt.xticks(fontsize=14)
plt.yticks(fontsize=18)
plt.show()

### What is the distribution of house prices?

For now let's focus on the training set - and in particular the variable we are asked to predict - the house price.

This is represented by the **price_doc** variable.

We start with a dot plot to check for outliers.

In [None]:
# Run the code below to create a dot plot of all the prices, in increasing order

plt.figure(figsize=(8,6))
plt.scatter(range(train_df.shape[0]), np.sort(train_df.price_doc.values), s = 2)
plt.xlabel('index', fontsize=24)
plt.ylabel('price', fontsize=24)
plt.xticks(fontsize=14)
plt.yticks(fontsize=18)
plt.show()

The prices look skewed, with most values being less than 20m but the rest reaching up to 100m.  We can look at the distribution using the seaborn package...

In [None]:
# Run the code below to create a histogram of the distribution of prices.

plt.figure(figsize=(10,6))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(train_df['price_doc'], bins=50)
plt.xlabel('price (RUB)', fontsize = 24)
plt.xticks(fontsize = 16)
plt.show()

We can see the data is positively skewed and the range is large.

It looks as if a log-transform may suit this data. 
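As a rough sketch of why a log-transform helps, the snippet below compares the skewness of a synthetic right-skewed series before and after taking logs. The data is simulated (log-normal) and is not the competition data, and the parameter values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed "prices" (illustrative only, not the real data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=15.5, sigma=0.6, size=10_000))

raw_skew = prices.skew()          # strongly positive for right-skewed data
log_skew = np.log(prices).skew()  # close to zero after the transform

print("skew of raw prices:", round(raw_skew, 2))
print("skew of log prices:", round(log_skew, 2))
```

A skewness near zero suggests a roughly symmetric distribution, which many modelling approaches handle more comfortably.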

**Let's plot the log of the target variable.**

In [None]:
# Run the code below to view the distribution of the (natural) log of the prices.
# How would the skew of the price data influence how you would normally go about modelling this data?
# How should the skew influence how you create predictions from the point of view 
# of this competition?

plt.figure(figsize=(10,6))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(np.log(train_df['price_doc']), bins=50)
plt.xlabel('log of price', fontsize = 24)
plt.show()

### How have house prices changed over time?
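One possible way to summarise prices by month is sketched below on a toy table. The table `sales` and its columns `sold_on` and `price` are hypothetical; the real data uses `timestamp` and `price_doc`.

```python
import pandas as pd

# Toy sales table (illustrative only)
sales = pd.DataFrame({
    "sold_on": pd.to_datetime(
        ["2014-01-05", "2014-01-20", "2014-02-11", "2014-02-15", "2014-02-28"]
    ),
    "price": [5.0, 7.0, 6.0, 10.0, 8.0],
})

# The .dt accessor exposes date parts and formatting for datetime columns
sales["yearmonth"] = sales["sold_on"].dt.strftime("%Y%m")

# Split by month, apply the median, combine into one table
monthly_median = sales.groupby("yearmonth")["price"].median().reset_index()

print(monthly_median)
```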

In [None]:
# For convenience create new features for the year, month and year-month of the sale
train_df['year'] = train_df['timestamp'].dt.year
train_df['month'] = train_df['timestamp'].dt.month
train_df['yearmonth'] = train_df['timestamp'].dt.strftime("%Y%m")

In [None]:
# Now use split-apply-combine to find the median price of sale in each month

# Remember how we used split-apply-combine above?
# df_dataTypes[["count","dtype"]].groupby(by = "dtype").aggregate(len).reset_index()

# Let us do the same here.  We will store the result in df_grouped, so we start with
# df_grouped = 
# We will summarise: train_df[['yearmonth', 'price_doc']]
# We will groupby 'yearmonth'
# We will apply the function np.median
# Fill in the code below
df_grouped = (train_df[['yearmonth', 'price_doc']]
              .groupby(by = )
              .aggregate()
              .reset_index())

df_grouped.head()

In [None]:
# Now run the following code chunk to plot the trend in median price
fig, ax = plt.subplots(figsize = [12,8])
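# Pass the data with x= and y= keywords (recent seaborn releases no longer accept them positionally)
sns.pointplot(x=df_grouped['yearmonth'].values, y=df_grouped['price_doc'].values)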


x_ticks_labels = df_grouped['yearmonth'].values
x_ticks = 4 * np.arange(0,12)
plt.xticks(x_ticks, x_ticks_labels[x_ticks], rotation = 45)

plt.xlabel('year - month', fontsize=24)
plt.ylabel('median price', fontsize=24)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.show()

### Correlations of Internal Characteristics in the house

From the data dictionary we can see that there are seven features which describe each property:

'full_sq', 'life_sq', 'floor', 'max_floor', 'material', 'num_room', 'kitch_sq'

The other features describe the reason for the purchase, the neighbourhood and the "raion" (district).

We take a quick look at how the basic features of the property are correlated with each other and with the price.
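As a small sketch of what `.corr()` returns, the snippet below builds a toy table with related columns and computes the pairwise (Pearson) correlation matrix. The data is simulated and the column names `area`, `rooms` and `price` are hypothetical.

```python
import numpy as np
import pandas as pd

# Simulated properties: rooms and price both depend on area (illustrative only)
rng = np.random.default_rng(1)
area = rng.uniform(30, 120, size=500)
rooms = np.round(area / 25 + rng.normal(0, 0.5, size=500))
price = 100_000 * area + rng.normal(0, 2_000_000, size=500)

toy = pd.DataFrame({"area": area, "rooms": rooms, "price": price})

# corr() returns a square, symmetric table with 1.0 on the diagonal
corr = toy.corr()

print(corr.round(2))
```

Each cell is the correlation between the row feature and the column feature, which is what the heat map colours represent.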

In [None]:
# Run the code below to create a heat map of the correlations.

plt.figure(figsize=(10,8))
internal_characteristics=['full_sq', 'life_sq', 'floor', 'max_floor', 'material',
                          'num_room', 'kitch_sq','price_doc']
heatmap_data=train_df[internal_characteristics].corr()
sns.heatmap(heatmap_data, annot=True)
plt.show()

We note a very high correlation (0.7) between full_sq (the area of the home) and num_room (the number of rooms in the property).

There are also moderate correlations of 0.34 and 0.48 between the house price (price_doc) and full_sq and num_room respectively.

Both are likely to be important features for our model.

### Let's examine the number of houses built each year

**Grouping the data by build year will quickly identify any outliers**

In [None]:
# Find the frequency of property by build year.
grouped_data_count = train_df.groupby('build_year')['id'].aggregate('count').reset_index()
grouped_data_count.columns=['build_year','count']
print(grouped_data_count.head())
print(grouped_data_count.tail())

In [None]:
# Print the tail of grouped_data_count, can you see any outliers?

print()

# Now print the head - are there any other outliers?

print()

### This highlights some data quality issues 

We have a build year of 20052009, clearly a mistake in the data.

We also have a build year of 4965, which was probably meant to be 1965.

Let's fix both of these by simply removing these outliers. 

We can set an upper limit on build year of 2019 (some houses may be bought off plan).
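A minimal sketch of such a filter is shown below on a toy table. The names `toy`, `MAX_BUILD_YEAR`, `keep` and `cleaned` are illustrative, and keeping rows whose build year is missing is an assumption made here, not something the text above prescribes.

```python
import numpy as np
import pandas as pd

# Toy table containing the two suspicious build years (illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "build_year": [1965.0, 2014.0, 4965.0, 20052009.0, np.nan],
})

MAX_BUILD_YEAR = 2019  # upper limit, allowing for off-plan purchases

# Assumption: rows with a missing build year are kept rather than dropped
keep = toy["build_year"].isna() | (toy["build_year"] <= MAX_BUILD_YEAR)
cleaned = toy[keep]

print(cleaned)
```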

There are various kernels which deal with data quality issues (for example https://www.kaggle.com/keremt/very-extensive-cleaning-by-sberbank-discussions)

### Finally we can plot the missing data and percentages to get an indication of what values are missing

In [None]:
# Percentage (0-100) of missing values in each column
train_missing = train_df.isnull().sum() / len(train_df) * 100
train_missing = train_missing.drop(train_missing[train_missing == 0].index).sort_values(ascending = False).reset_index()
train_missing.columns = ['column name','missing percentage']

plt.figure(figsize = (12,8))
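# Pass the data with x= and y= keywords (recent seaborn releases no longer accept them positionally)
sns.barplot(x=train_missing['column name'], y=train_missing['missing percentage'], palette = 'coolwarm')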
plt.xticks(rotation = 'vertical')
plt.show()

Of 292 columns, 51 have missing values. The percentage of values missing ranges from 0.1% in metro_min_walk to 47.4% in hospital_beds_raion.
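The same calculation can be sketched on a toy table. The table `toy` is invented for illustration, and `isnull().mean()` is used here as a shorthand for dividing the count of missing values by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy table with missing values in two of three columns (illustrative only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, np.nan, 3.0, 4.0],
})

# The mean of a True/False mask is the fraction of True values
missing_pct = toy.isnull().mean() * 100

# Keep only columns with missing values, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

print(missing_pct)
```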

## OK, now that we have a better grasp of the data, we can start to think about how we might model it. 